Cross-Compiling and Running LLMs with MLC-LLM for Jetson AGX Orin 32GB
This post is a follow-up to three earlier articles:
- Building MLC-LLM from Source with uv on Ubuntu 24.04
- Compiling and Running LLMs with MLC-LLM on a Desktop PC
- Building MLC-LLM from Source with uv on NVIDIA Jetson AGX Orin 32GB
In the Ubuntu desktop post, I built MLC-LLM on my x86 development machine. Then, I showed the simple same-machine flow: download a model, convert it, compile it, and run it on the same PC. In the Jetson post, I documented a native build on the device itself.
This article covers the next step: cross-compiling a model library for Jetson AGX Orin 32GB on a desktop host, then copying the artifacts to the device and running them there.
The useful split is this:
- model download, weight conversion, and config generation are target-independent
- the final compiled model library is target-specific
That means I can do the heavier preparation work on the desktop host, then only run the final Jetson-targeted artifact on the device.
Machines used
Host machine
I used the same desktop PC from my earlier Ubuntu 24.04 posts:
- CPU: AMD Ryzen 9 7950X 16-Core Processor
- GPU: NVIDIA RTX 4090
- Memory: 64GB
- Storage: 4TB SSD
- OS: Ubuntu 24.04
Target machine
- NVIDIA Jetson AGX Orin 32GB
Models I tried
In this workflow, I used the following model and conversation-template combinations:
| Model | Conversation template |
|---|---|
| Qwen3-4B | qwen2 |
| llm-jp-3.1-1.8b-instruct4 | llm-jp |
Prerequisites
This post assumes two things are already true:
- the Jetson already has a working MLC-LLM Python environment, as described in my earlier Jetson build post
- the host machine already has Docker and the NVIDIA container runtime working
If you have not already set up MLC-LLM on the Jetson itself, read the Jetson build post first.
Overview of the workflow
At a high level, the process was:
- check the Jetson Linux version on the device
- build a Jetson-oriented MLC-LLM Docker image on the desktop host
- download the model from Hugging Face on the host
- convert the weights and generate the config on the host
- cross-compile the model library for Jetson inside the container
- copy the compiled artifacts to the device
- run the model on the Jetson with `mlc_llm chat`
1. Check the Jetson Linux version
On the device, I first checked the Jetson Linux release:
cat /etc/nv_tegra_release
In my notes, the expected result was one of these release branches: R36 or R35.
I wanted to verify this before treating the device as the target for the cross-compiled build.
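The release branch can also be pulled out of that line in shell. This is a small sketch; the sample string below is illustrative, and on a real device you would feed in the contents of /etc/nv_tegra_release instead:

```shell
# Parse the release branch (e.g. R36) out of an nv_tegra_release line.
# The sample string is illustrative; on the device you would use:
#   release_line="$(cat /etc/nv_tegra_release)"
release_line='# R36 (release), REVISION: 4.3'
branch="$(printf '%s' "$release_line" | grep -oE 'R[0-9]+' | head -n 1)"
echo "Jetson Linux branch: $branch"
```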
2. Build the MLC-LLM Docker image on the host
On the host side, I built a Docker image that contains:
- a CUDA development environment matching the Jetson CUDA version
- LLVM 20
- Rust
- uv and a Python 3.13 virtual environment
- a built copy of TVM
- a built copy of MLC-LLM
- the Bootlin AArch64 cross toolchain
The point of this image was to keep the cross-compilation environment reproducible and self-contained.
I built the image with:
./build.sh
where build.sh was:
docker build -f docker/Dockerfile.mlc-jetson \
--build-arg MLC_LLM_REF=main \
-t mlc-llm-jetson:cu126 .
A detail that matters here is that the container itself is built on the desktop host, so some host-side CUDA settings in the image reflect the desktop GPU. The actual Jetson-targeted model library is produced later by the explicit cross-compilation command using --host aarch64-unknown-linux-gnu and sm_87.
Docker image structure
The image I used was based on this idea:
- base image: `nvidia/cuda:12.6.3-devel-ubuntu22.04`
- install LLVM 20 and related build tools
- install Rust
- install uv and create a Python 3.13 virtual environment
- clone MLC-LLM and its submodules
- build TVM from source
- build MLC-LLM C++ components
- install the MLC-LLM Python package
- install the Bootlin AArch64 toolchain
- add a small entrypoint script that maps the host UID/GID into the container
For reference, the exact files I used are included at the end of this post.
3. Download the model from Hugging Face
Next, on the host, I downloaded the original model files.
mkdir -p dist/models
export TARGET_MODEL_NAME="Qwen3-4B"
uvx hf download Qwen/${TARGET_MODEL_NAME} --local-dir dist/models/${TARGET_MODEL_NAME}
For the llm-jp model, I used the same pattern with a different model name and repository.
I kept the original Hugging Face files under dist/models/ so that the download step stayed separate from the MLC-converted artifacts.
4. Convert the weights and generate the config
This part is target-independent, so I could do it on the host using either a local MLC-LLM install or the containerized environment.
That separation is one of the main reasons this workflow is useful: I only need to cross-compile the final library, not the weight conversion step.
For Qwen3-4B, I used:
export TARGET_MODEL_NAME="Qwen3-4B"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="qwen2"
mlc_llm convert_weight ./dist/models/${TARGET_MODEL_NAME}/ \
--quantization ${QUANTIZATION_TYPE} \
-o dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC
mlc_llm gen_config ./dist/models/${TARGET_MODEL_NAME}/ \
--quantization ${QUANTIZATION_TYPE} \
--conv-template ${CONV_TEMPLATE} \
-o dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/
For llm-jp-3.1-1.8b-instruct4, I kept the same structure and changed the conversation template:
export TARGET_MODEL_NAME="llm-jp-3.1-1.8b-instruct4"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="llm-jp"
After these steps, I had a target-independent directory like:
dist/Qwen3-4B-q4f16_1-MLC/
or:
dist/llm-jp-3.1-1.8b-instruct4-q4f16_1-MLC/
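Before moving on to the cross-compile, it is worth checking that conversion actually produced the expected metadata. The two file names below are the usual outputs of convert_weight and gen_config (ndarray-cache.json and mlc-chat-config.json); the helper is a minimal sketch, not part of MLC-LLM itself:

```shell
# Succeeds only if a directory listing contains both metadata files
# that convert_weight and gen_config are expected to produce.
has_mlc_metadata() {
  printf '%s\n' "$1" | grep -qx 'mlc-chat-config.json' \
    && printf '%s\n' "$1" | grep -qx 'ndarray-cache.json'
}

# Typical usage on the host:
#   has_mlc_metadata "$(ls dist/Qwen3-4B-q4f16_1-MLC/)" && echo "conversion looks complete"
has_mlc_metadata "$(printf 'mlc-chat-config.json\nndarray-cache.json\nparams_shard_0.bin')" \
  && echo "conversion looks complete"
```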
5. Launch the cross-compilation container
To do the Jetson-targeted compile, I launched the Docker image with:
./bash.sh
where bash.sh was:
docker run --rm -it \
--gpus all --runtime=nvidia \
-e LOCAL_UID="$(id -u)" \
-e LOCAL_GID="$(id -g)" \
--mount type=bind,src="$PWD",dst=/workspace \
--mount type=volume,src=mlc-cache,dst=/cache \
-w /workspace \
mlc-llm-jetson:cu126 \
bash
I mounted the current working directory into /workspace, which made the converted model directory and output library available both inside and outside the container.
6. Cross-compile the model library for Jetson AGX Orin
Inside the container, I first loaded the cross-toolchain environment:
source /usr/local/bin/jetson-aarch64-env.sh
Then I compiled the model library:
mkdir -p ./dist/libs
export TARGET_MODEL_NAME="Qwen3-4B" # or "llm-jp-3.1-1.8b-instruct4"
export MLC_MULTI_ARCH=87
export QUANTIZATION_TYPE="q4f16_1"
mlc_llm compile ./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/mlc-chat-config.json \
--device '{"kind":"cuda","tag":"","keys":["cuda","gpu"],"max_num_threads":1024,"thread_warp_size":32,"arch":"sm_87", "max_threads_per_block":1024,"max_shared_memory_per_block":49152}' \
--host aarch64-unknown-linux-gnu \
--opt "flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE" \
-o ./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda-jetson_orin_sm87.so
This was the core cross-compilation step.
A few parts of the command are especially important:
- `--host aarch64-unknown-linux-gnu` targets Linux on AArch64
- `--device ... "arch":"sm_87"` targets the Jetson-side CUDA architecture I used
- the output file name explicitly records that this is a Jetson Orin sm87 build
- the `--opt` string captures the exact kernel/runtime options that worked for me in this setup
I treated that option string as a known-good configuration rather than a general rule for all Jetson deployments.
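Before copying anything over, a quick sanity check that the library really is an AArch64 binary avoids a confusing failure on the device. A minimal sketch, assuming the Qwen3-4B output path from above; note that the exact wording of `file` output can vary between versions:

```shell
# Succeeds if a `file` description looks like an AArch64 binary.
is_aarch64_lib() {
  printf '%s' "$1" | grep -q 'aarch64'
}

# Typical usage on the host:
#   is_aarch64_lib "$(file ./dist/libs/Qwen3-4B-q4f16_1-cuda-jetson_orin_sm87.so)" \
#     && echo "looks like a Jetson-targeted build"
is_aarch64_lib "ELF 64-bit LSB shared object, ARM aarch64" \
  && echo "looks like a Jetson-targeted build"
```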
7. Copy the compiled artifacts to the device
Once the compile finished, I copied both:
- the converted model directory
- the compiled Jetson-specific shared library
to the Jetson.
export TARGET_MODEL_NAME="Qwen3-4B" # or "llm-jp-3.1-1.8b-instruct4"
export MLC_MULTI_ARCH=87
export QUANTIZATION_TYPE="q4f16_1"
scp -r ./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC agx-orin-1:/home/ubuntu/work/dist/
scp ./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda-jetson_orin_sm87.so agx-orin-1:/home/ubuntu/work/dist/libs/
This split is worth remembering:
- the converted MLC model directory is the model artifact
- the `.so` file is the device-specific compiled runtime library
8. Run the model on the Jetson
Finally, on the Jetson itself, I activated the existing Python environment and launched the model:
cd /home/ubuntu/work
source .venv/bin/activate
export TARGET_MODEL_NAME="Qwen3-4B"
export QUANTIZATION_TYPE="q4f16_1"
mlc_llm chat --device "cuda:0" --model-lib "./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda-jetson_orin_sm87.so" "./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC"
I tried prompts such as:
What is the meaning of life?
自然言語処理とは何か ("What is natural language processing?")
The same structure applies to llm-jp-3.1-1.8b-instruct4; the main changes are the model name, the conversation template during config generation, and the matching output paths.
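Since only the model name changes between the two runs, the invocation can be parameterized. This is a small helper sketch that mirrors the directory layout used throughout this post:

```shell
# Build the mlc_llm chat command line for a given model name and
# quantization, following the dist/ layout used in this post.
make_chat_cmd() {
  model="$1"
  quant="$2"
  echo "mlc_llm chat --device cuda:0" \
    "--model-lib ./dist/libs/${model}-${quant}-cuda-jetson_orin_sm87.so" \
    "./dist/${model}-${quant}-MLC"
}

make_chat_cmd "llm-jp-3.1-1.8b-instruct4" "q4f16_1"
```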
What changed compared with the desktop workflow
In my earlier desktop post, I compiled and ran the model on the same x86 machine. The flow was simple because the host and runtime target were identical.
For Jetson cross-compilation, the main changes were:
- the final `mlc_llm compile` step became target-specific
- I used a Docker image containing a cross toolchain instead of compiling directly on the device
- I copied the finished artifacts to the Jetson after the build
- I ran the model on the Jetson using the Jetson-side MLC-LLM environment from the previous post
The important point is that only the final library build is Jetson-specific.
Final thoughts
This post ties together the earlier parts of the series:
- build MLC-LLM on the desktop host
- build MLC-LLM on the Jetson device
- compile and run models locally on desktop
- then cross-compile the Jetson-specific model library on the host and run it on the device
Appendix: files I used
build.sh
docker build -f docker/Dockerfile.mlc-jetson \
--build-arg MLC_LLM_REF=main \
-t mlc-llm-jetson:cu126 .
bash.sh
docker run --rm -it \
--gpus all --runtime=nvidia \
-e LOCAL_UID="$(id -u)" \
-e LOCAL_GID="$(id -g)" \
--mount type=bind,src="$PWD",dst=/workspace \
--mount type=volume,src=mlc-cache,dst=/cache \
-w /workspace \
mlc-llm-jetson:cu126 \
bash
docker-entrypoint.sh
#!/usr/bin/env bash
set -euo pipefail
LOCAL_UID="${LOCAL_UID:-1000}"
LOCAL_GID="${LOCAL_GID:-1000}"
LOCAL_USER="${LOCAL_USER:-builder}"
LOCAL_GROUP="${LOCAL_GROUP:-builder}"
# use existing group, otherwise create it
if getent group "${LOCAL_GID}" >/dev/null 2>&1; then
GROUP_NAME="$(getent group "${LOCAL_GID}" | cut -d: -f1)"
else
groupadd -g "${LOCAL_GID}" "${LOCAL_GROUP}"
GROUP_NAME="${LOCAL_GROUP}"
fi
# use existing user, otherwise create it
if getent passwd "${LOCAL_UID}" >/dev/null 2>&1; then
USER_NAME="$(getent passwd "${LOCAL_UID}" | cut -d: -f1)"
HOME_DIR="$(getent passwd "${LOCAL_UID}" | cut -d: -f6)"
else
useradd -m -u "${LOCAL_UID}" -g "${LOCAL_GID}" -s /bin/bash "${LOCAL_USER}"
USER_NAME="${LOCAL_USER}"
HOME_DIR="/home/${LOCAL_USER}"
fi
mkdir -p "${HOME_DIR}" /cache /workspace
chown -R "${LOCAL_UID}:${LOCAL_GID}" "${HOME_DIR}" /cache
export HOME="${HOME_DIR}"
export XDG_CACHE_HOME="${XDG_CACHE_HOME:-/cache/xdg}"
export UV_CACHE_DIR="${UV_CACHE_DIR:-/cache/uv}"
# the bind-mounted side (/workspace) is intentionally not chowned
exec gosu "${LOCAL_UID}:${LOCAL_GID}" "$@"
docker/Dockerfile.mlc-jetson
ARG CUDA_IMAGE=nvidia/cuda:12.6.3-devel-ubuntu22.04
FROM ${CUDA_IMAGE}
ARG DEBIAN_FRONTEND=noninteractive
ARG TOOLCHAIN_DIRNAME=aarch64--glibc--stable-2022.08-1
ARG PYTHON_VER=3.13
ARG MLC_LLM_REPO=https://github.com/mlc-ai/mlc-llm.git
ARG MLC_LLM_REF=main
ARG TVM_CUDA_ARCHS="87;89"
ARG HOST_CUDA_ARCH=89
SHELL ["/bin/bash", "-o", "pipefail", "-c"]
# ----------------------------------------------------------------------
# Base deps
# ----------------------------------------------------------------------
RUN apt update && apt install -y --no-install-recommends \
git git-lfs gnupg ca-certificates curl wget gosu \
build-essential binutils pkg-config \
zlib1g-dev libzstd-dev libxml2-dev libedit-dev libtinfo-dev \
&& rm -rf /var/lib/apt/lists/*
# ----------------------------------------------------------------------
# LLVM 20
# ----------------------------------------------------------------------
RUN wget -q -O - https://apt.llvm.org/llvm-snapshot.gpg.key \
| gpg --dearmor -o /usr/share/keyrings/llvm-archive.gpg \
&& echo "deb [signed-by=/usr/share/keyrings/llvm-archive.gpg] http://apt.llvm.org/jammy/ llvm-toolchain-jammy-20 main" \
> /etc/apt/sources.list.d/llvm.list \
&& apt update \
&& apt install -y --no-install-recommends \
llvm-20-dev clang-20 lld-20 libpolly-20-dev \
&& rm -rf /var/lib/apt/lists/*
RUN ln -sf /usr/bin/llvm-config-20 /usr/local/bin/llvm-config \
&& ln -sf /usr/bin/clang-20 /usr/local/bin/clang \
&& ln -sf /usr/bin/clang++-20 /usr/local/bin/clang++ \
&& ln -sf /usr/bin/lld-20 /usr/local/bin/lld \
&& if [ -x /usr/bin/ld.lld-20 ]; then \
ln -sf /usr/bin/ld.lld-20 /usr/local/bin/ld.lld ; \
else \
ln -sf /usr/bin/lld-20 /usr/local/bin/ld.lld ; \
fi
# ----------------------------------------------------------------------
# Rust
# ----------------------------------------------------------------------
RUN curl https://sh.rustup.rs -sSf | sh -s -- -y --default-toolchain stable
ENV PATH="/root/.cargo/bin:${PATH}"
# ----------------------------------------------------------------------
# uv + uv-managed Python (global path) + baked venv
# ----------------------------------------------------------------------
ENV UV_INSTALL_DIR=/opt/uv/bin \
UV_PYTHON_INSTALL_DIR=/opt/uv/python \
UV_PYTHON_BIN_DIR=/opt/uv/bin
RUN mkdir -p /opt/uv/bin /opt/uv/python \
&& curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/opt/venv/bin:/opt/uv/bin:/root/.cargo/bin:${PATH}"
RUN uv python install ${PYTHON_VER} \
&& uv venv /opt/venv --python ${PYTHON_VER}
ENV VIRTUAL_ENV=/opt/venv
# ----------------------------------------------------------------------
# Clone MLC-LLM source
# ----------------------------------------------------------------------
WORKDIR /opt/src
RUN git clone --recursive ${MLC_LLM_REPO} mlc-llm \
&& cd mlc-llm \
&& git checkout ${MLC_LLM_REF} \
&& git submodule update --init --recursive \
&& git lfs install --system
ENV MLC_LLM_HOME=/opt/src/mlc-llm
# ----------------------------------------------------------------------
# Build TVM from source (non-editable install into /opt/venv)
# ----------------------------------------------------------------------
WORKDIR /opt/src/mlc-llm/3rdparty/tvm
RUN uv pip install --upgrade setuptools wheel cmake ninja \
&& uv pip install ./3rdparty/tvm-ffi --verbose \
&& uv pip install . --verbose \
'--config-setting=cmake.define.CMAKE_BUILD_TYPE=RelWithDebInfo' \
'--config-setting=cmake.define.USE_LLVM=llvm-config --ignore-libllvm --link-static' \
'--config-setting=cmake.define.HIDE_PRIVATE_SYMBOLS=ON' \
'--config-setting=cmake.define.USE_CUDA=ON' \
"--config-setting=cmake.define.CMAKE_CUDA_ARCHITECTURES=${TVM_CUDA_ARCHS}" \
'--config-setting=cmake.define.USE_CUBLAS=ON' \
'--config-setting=cmake.define.USE_CUTLASS=ON' \
'--config-setting=cmake.define.USE_THRUST=ON' \
'--config-setting=cmake.define.USE_NVTX=ON'
# ----------------------------------------------------------------------
# Build MLC-LLM C++ part
# ----------------------------------------------------------------------
WORKDIR /opt/src/mlc-llm
RUN rm -rf build \
&& mkdir -p build \
&& cat > build/config.cmake <<EOF
set(TVM_SOURCE_DIR /opt/src/mlc-llm/3rdparty/tvm)
set(CMAKE_BUILD_TYPE RelWithDebInfo)
set(USE_CUDA ON)
set(USE_CUTLASS ON)
set(USE_CUBLAS ON)
set(USE_ROCM OFF)
set(USE_VULKAN OFF)
set(USE_METAL OFF)
set(USE_OPENCL OFF)
set(USE_THRUST ON)
set(CMAKE_CUDA_ARCHITECTURES ${HOST_CUDA_ARCH})
set(FLASHINFER_CUDA_ARCHITECTURES ${HOST_CUDA_ARCH})
EOF
RUN cmake -S /opt/src/mlc-llm -B /opt/src/mlc-llm/build -G Ninja \
&& cmake --build /opt/src/mlc-llm/build --parallel "$(nproc)"
# ----------------------------------------------------------------------
# Install MLC-LLM python package (non-editable)
# ----------------------------------------------------------------------
RUN uv pip install ./python --verbose
RUN mkdir -p /workspace /cache \
&& chmod -R a+rX /opt/uv /opt/venv /opt/src/mlc-llm
# ----------------------------------------------------------------------
# Bootlin toolchain
# ----------------------------------------------------------------------
RUN mkdir -p /opt/l4t-gcc \
&& curl -L -o /tmp/${TOOLCHAIN_DIRNAME}.tar.bz2 \
https://toolchains.bootlin.com/downloads/releases/toolchains/aarch64/tarballs/${TOOLCHAIN_DIRNAME}.tar.bz2 \
&& tar -xjf /tmp/${TOOLCHAIN_DIRNAME}.tar.bz2 -C /opt/l4t-gcc \
&& rm /tmp/${TOOLCHAIN_DIRNAME}.tar.bz2
ENV TOOLCHAIN_DIRNAME=${TOOLCHAIN_DIRNAME}
ENV TOOLCHAIN_ROOT=/opt/l4t-gcc/${TOOLCHAIN_DIRNAME}
ENV CROSS_COMPILE=${TOOLCHAIN_ROOT}/bin/aarch64-buildroot-linux-gnu-
ENV PATH="${TOOLCHAIN_ROOT}/bin:${PATH}"
RUN cat > /usr/local/bin/jetson-aarch64-env.sh <<'EOF' \
&& chmod +x /usr/local/bin/jetson-aarch64-env.sh
#!/usr/bin/env bash
export TOOLCHAIN_ROOT="${TOOLCHAIN_ROOT:-/opt/l4t-gcc/${TOOLCHAIN_DIRNAME}}"
export CROSS_COMPILE="${TOOLCHAIN_ROOT}/bin/aarch64-buildroot-linux-gnu-"
export CC="${CROSS_COMPILE}gcc"
export CXX="${CROSS_COMPILE}g++"
export AR="${CROSS_COMPILE}ar"
export RANLIB="${CROSS_COMPILE}ranlib"
export LD="${CROSS_COMPILE}ld"
export STRIP="${CROSS_COMPILE}strip"
echo "Using cross toolchain:"
echo " CC=${CC}"
echo " CXX=${CXX}"
EOF
# ----------------------------------------------------------------------
# entrypoint
# ----------------------------------------------------------------------
COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
RUN chmod +x /usr/local/bin/docker-entrypoint.sh
ENV PYTHONUNBUFFERED=1
ENV MLC_LLM_HOME=/workspace/.mlc_llm
WORKDIR /workspace
ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
CMD ["bash"]