
Cross-Compiling and Running LLMs with MLC-LLM for Jetson AGX Orin 32GB

This post is a follow-up to three earlier articles.

In the Ubuntu desktop post, I built MLC-LLM on my x86 development machine. Then, I showed the simple same-machine flow: download a model, convert it, compile it, and run it on the same PC. In the Jetson post, I documented a native build on the device itself.

This article covers the next step: cross-compiling a model library for Jetson AGX Orin 32GB on a desktop host, then copying the artifacts to the device and running them there.

The useful split is this:

  • model download, weight conversion, and config generation are target-independent
  • the final compiled model library is target-specific

That means I can do the heavier preparation work on the desktop host, then only run the final Jetson-targeted artifact on the device.

Machines used

Host machine

I used the same desktop PC from my earlier Ubuntu 24.04 posts:

  • CPU: AMD Ryzen 9 7950X 16-Core Processor
  • GPU: NVIDIA RTX 4090
  • Memory: 64GB
  • Storage: 4TB SSD
  • OS: Ubuntu 24.04

Target machine

  • NVIDIA Jetson AGX Orin 32GB

Models I tried

In this workflow, I used the following model and conversation-template combinations:

Model                       Conversation template
Qwen3-4B                    qwen2
llm-jp-3.1-1.8b-instruct4   llm-jp

Prerequisites

This post assumes two things are already true:

  1. the Jetson already has a working MLC-LLM Python environment, as described in my earlier Jetson build post
  2. the host machine already has Docker and the NVIDIA container runtime working

If you have not already set up MLC-LLM on the Jetson itself, read the Jetson build post first.

Overview of the workflow

At a high level, the process was:

  1. check the Jetson Linux version on the device
  2. build a Jetson-oriented MLC-LLM Docker image on the desktop host
  3. download the model from Hugging Face on the host
  4. convert the weights and generate the config on the host
  5. cross-compile the model library for Jetson inside the container
  6. copy the compiled artifacts to the device
  7. run the model on Jetson with mlc_llm chat

1. Check the Jetson Linux version

On the device, I first checked the Jetson Linux release:

cat /etc/nv_tegra_release

In my notes, the expected result was one of these branches:

R36 or R35

I wanted to verify this before treating the device as the target for the cross-compiled build.
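When doing this check from a script, the branch can be pulled out of the first line of the file. The helper below is my own sketch (the function name and the sample line are mine), assuming the usual "# R36 (release), REVISION: ..." layout of that file:

```shell
# Helper (name is mine) to print the Jetson Linux branch, assuming the
# first line of /etc/nv_tegra_release follows the usual layout:
#   # R36 (release), REVISION: 4.0, ...
l4t_branch() {
  awk 'NR==1 {print $2}' "$1"
}
```

On the device, `l4t_branch /etc/nv_tegra_release` should then print R36 or R35.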

2. Build the MLC-LLM Docker image on the host

On the host side, I built a Docker image that contains:

  • a CUDA development environment matching the Jetson's CUDA version
  • LLVM 20
  • Rust
  • uv and a Python 3.13 virtual environment
  • a built copy of TVM
  • a built copy of MLC-LLM
  • the Bootlin AArch64 cross toolchain

The point of this image was to keep the cross-compilation environment reproducible and self-contained.

I built the image with:

./build.sh

where build.sh was:

docker build -f docker/Dockerfile.mlc-jetson \
  --build-arg MLC_LLM_REF=main \
  -t mlc-llm-jetson:cu126 .

A detail that matters here is that the container itself is built on the desktop host, so some host-side CUDA settings in the image reflect the desktop GPU. The actual Jetson-targeted model library is produced later by the explicit cross-compilation command using --host aarch64-unknown-linux-gnu and sm_87.

Docker image structure

The image I used was structured as follows:

  • base image: nvidia/cuda:12.6.3-devel-ubuntu22.04
  • install LLVM 20 and related build tools
  • install Rust
  • install uv and create a Python 3.13 virtual environment
  • clone MLC-LLM and its submodules
  • build TVM from source
  • build MLC-LLM C++ components
  • install the MLC-LLM Python package
  • install the Bootlin AArch64 toolchain
  • add a small entrypoint script that maps the host UID/GID into the container

For reference, the exact files I used are included at the end of this post.

3. Download the model from Hugging Face

Next, on the host, I downloaded the original model files.

mkdir -p dist/models
export TARGET_MODEL_NAME="Qwen3-4B"
uvx hf download Qwen/${TARGET_MODEL_NAME} --local-dir dist/models/${TARGET_MODEL_NAME}

For the llm-jp model, I used the same pattern with a different model name and repository.

I kept the original Hugging Face files under dist/models/ so that the download step stayed separate from the MLC-converted artifacts.

4. Convert the weights and generate the config

This part is target-independent, so I could do it on the host using either a local MLC-LLM install or the containerized environment.

That separation is one of the main reasons this workflow is useful: I only need to cross-compile the final library, not the weight conversion step.

For Qwen3-4B, I used:

export TARGET_MODEL_NAME="Qwen3-4B"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="qwen2"

mlc_llm convert_weight ./dist/models/${TARGET_MODEL_NAME}/ \
  --quantization ${QUANTIZATION_TYPE} \
  -o dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC

mlc_llm gen_config ./dist/models/${TARGET_MODEL_NAME}/ \
  --quantization ${QUANTIZATION_TYPE} \
  --conv-template ${CONV_TEMPLATE} \
  -o dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/

For llm-jp-3.1-1.8b-instruct4, I kept the same structure and changed the conversation template:

export TARGET_MODEL_NAME="llm-jp-3.1-1.8b-instruct4"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="llm-jp"

After these steps, I had a target-independent directory like:

dist/Qwen3-4B-q4f16_1-MLC/

or:

dist/llm-jp-3.1-1.8b-instruct4-q4f16_1-MLC/
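A quick sanity check is possible at this point, since the later compile step reads mlc-chat-config.json from this directory. The helper below is my own sketch, not part of MLC-LLM:

```shell
# Verify that convert_weight/gen_config produced a usable directory:
# the Jetson-targeted compile step needs mlc-chat-config.json to exist.
check_mlc_dir() {
  if [ -f "$1/mlc-chat-config.json" ]; then
    echo "ok: $1"
  else
    echo "missing mlc-chat-config.json in $1" >&2
    return 1
  fi
}
```

For example, `check_mlc_dir dist/Qwen3-4B-q4f16_1-MLC` before moving on to the container.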

5. Launch the cross-compilation container

To do the Jetson-targeted compile, I launched the Docker image with:

./bash.sh

where bash.sh was:

docker run --rm -it \
  --gpus all --runtime=nvidia \
  -e LOCAL_UID="$(id -u)" \
  -e LOCAL_GID="$(id -g)" \
  --mount type=bind,src="$PWD",dst=/workspace \
  --mount type=volume,src=mlc-cache,dst=/cache \
  -w /workspace \
  mlc-llm-jetson:cu126 \
  bash

I mounted the current working directory into /workspace, which made the converted model directory and output library available both inside and outside the container.

6. Cross-compile the model library for Jetson AGX Orin

Inside the container, I first loaded the cross-toolchain environment:

source /usr/local/bin/jetson-aarch64-env.sh

Then I compiled the model library:

mkdir -p ./dist/libs

export TARGET_MODEL_NAME="Qwen3-4B" # or "llm-jp-3.1-1.8b-instruct4"
export MLC_MULTI_ARCH=87
export QUANTIZATION_TYPE="q4f16_1"

mlc_llm compile ./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/mlc-chat-config.json \
  --device '{"kind":"cuda","tag":"","keys":["cuda","gpu"],"max_num_threads":1024,"thread_warp_size":32,"arch":"sm_87", "max_threads_per_block":1024,"max_shared_memory_per_block":49152}' \
  --host aarch64-unknown-linux-gnu \
  --opt "flashinfer=0;cublas_gemm=0;faster_transformer=0;cudagraph=1;cutlass=1;ipc_allreduce_strategy=NONE" \
  -o ./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda-jetson_orin_sm87.so

This was the core cross-compilation step.

A few parts of the command are especially important:

  • --host aarch64-unknown-linux-gnu targets Linux on AArch64
  • --device ... "arch":"sm_87" targets the Jetson-side CUDA architecture I used
  • the output file name explicitly records that this is a Jetson Orin sm87 build
  • the --opt string captures the exact kernel/runtime options that worked for me in this setup

I treated that option string as a known-good configuration rather than a general rule for all Jetson deployments.
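Before copying anything, it is easy to confirm the library really is an aarch64 build and not an x86-64 one. The helper name below is mine; readelf comes from binutils, which the image already installs:

```shell
# Helper (name is mine) to print which CPU an ELF file targets,
# using the Machine field of the ELF header.
elf_machine() {
  readelf -h "$1" | awk -F': *' '/Machine/ {print $2}'
}
```

For the cross-compiled library, `elf_machine ./dist/libs/Qwen3-4B-q4f16_1-cuda-jetson_orin_sm87.so` should report AArch64; an x86-64 result would suggest the cross-toolchain environment was not loaded before compiling.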

7. Copy the compiled artifacts to the device

Once the compile finished, I copied both:

  • the converted model directory
  • the compiled Jetson-specific shared library

to the Jetson.

export TARGET_MODEL_NAME="Qwen3-4B" # or "llm-jp-3.1-1.8b-instruct4"
export MLC_MULTI_ARCH=87
export QUANTIZATION_TYPE="q4f16_1"

scp -r ./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC agx-orin-1:/home/ubuntu/work/dist/
scp ./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda-jetson_orin_sm87.so agx-orin-1:/home/ubuntu/work/dist/libs/
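An optional integrity check after the copy: compare the library's checksum on both ends, since a mismatch means a truncated or corrupted transfer. The helper name is mine; the hostname and paths match the post:

```shell
# Print just the SHA-256 digest of a file (helper name is mine).
sum_of() { sha256sum "$1" | awk '{print $1}'; }
```

Run `sum_of ./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda-jetson_orin_sm87.so` on the host, `sha256sum` on the corresponding file under /home/ubuntu/work/dist/libs/ on the Jetson, and check that the digests agree.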

This split is worth remembering:

  • the converted MLC model directory is the model artifact
  • the .so file is the device-specific compiled runtime library

8. Run the model on the Jetson

Finally, on the Jetson itself, I activated the existing Python environment and launched the model:

cd /home/ubuntu/work
source .venv/bin/activate

export TARGET_MODEL_NAME="Qwen3-4B"
export QUANTIZATION_TYPE="q4f16_1"

mlc_llm chat --device "cuda:0" --model-lib "./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda-jetson_orin_sm87.so" "./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC"

I tried prompts such as:

What is the meaning of life?
自然言語処理とは何か

The second prompt is Japanese for "What is natural language processing?"

The same structure applies to llm-jp-3.1-1.8b-instruct4; the main changes are the model name, the conversation template during config generation, and the matching output paths.

What changed compared with the desktop workflow

In my earlier desktop post, I compiled and ran the model on the same x86 machine. The flow was simple because the host and runtime target were identical.

For Jetson cross-compilation, the main changes were:

  • the final mlc_llm compile step became target-specific
  • I used a Docker image containing a cross toolchain instead of compiling directly on the device
  • I copied the finished artifacts to the Jetson after the build
  • I ran the model on the Jetson using the Jetson-side MLC-LLM environment from the previous post

The important point is that only the final library build is Jetson-specific.

Final thoughts

This post ties together the earlier parts of the series:

  • build MLC-LLM on the desktop host
  • build MLC-LLM on the Jetson device
  • compile and run models locally on desktop
  • then cross-compile the Jetson-specific model library on the host and run it on the device

Appendix: files I used

build.sh

docker build -f docker/Dockerfile.mlc-jetson \
  --build-arg MLC_LLM_REF=main \
  -t mlc-llm-jetson:cu126 .

bash.sh

docker run --rm -it \
  --gpus all --runtime=nvidia \
  -e LOCAL_UID="$(id -u)" \
  -e LOCAL_GID="$(id -g)" \
  --mount type=bind,src="$PWD",dst=/workspace \
  --mount type=volume,src=mlc-cache,dst=/cache \
  -w /workspace \
  mlc-llm-jetson:cu126 \
  bash

docker-entrypoint.sh

#!/usr/bin/env bash
set -euo pipefail

LOCAL_UID="${LOCAL_UID:-1000}"
LOCAL_GID="${LOCAL_GID:-1000}"
LOCAL_USER="${LOCAL_USER:-builder}"
LOCAL_GROUP="${LOCAL_GROUP:-builder}"

# use existing group, otherwise create it
if getent group "${LOCAL_GID}" >/dev/null 2>&1; then
  GROUP_NAME="$(getent group "${LOCAL_GID}" | cut -d: -f1)"
else
  groupadd -g "${LOCAL_GID}" "${LOCAL_GROUP}"
  GROUP_NAME="${LOCAL_GROUP}"
fi

# use existing user, otherwise create it
if getent passwd "${LOCAL_UID}" >/dev/null 2>&1; then
  USER_NAME="$(getent passwd "${LOCAL_UID}" | cut -d: -f1)"
  HOME_DIR="$(getent passwd "${LOCAL_UID}" | cut -d: -f6)"
else
  useradd -m -u "${LOCAL_UID}" -g "${LOCAL_GID}" -s /bin/bash "${LOCAL_USER}"
  USER_NAME="${LOCAL_USER}"
  HOME_DIR="/home/${LOCAL_USER}"
fi

mkdir -p "${HOME_DIR}" /cache /workspace
chown -R "${LOCAL_UID}:${LOCAL_GID}" "${HOME_DIR}" /cache

export HOME="${HOME_DIR}"
export XDG_CACHE_HOME="${XDG_CACHE_HOME:-/cache/xdg}"
export UV_CACHE_DIR="${UV_CACHE_DIR:-/cache/uv}"

# do not chown the bind-mounted side (/workspace)
exec gosu "${LOCAL_UID}:${LOCAL_GID}" "$@"

docker/Dockerfile.mlc-jetson

ARG CUDA_IMAGE=nvidia/cuda:12.6.3-devel-ubuntu22.04
FROM ${CUDA_IMAGE}

ARG DEBIAN_FRONTEND=noninteractive

ARG TOOLCHAIN_DIRNAME=aarch64--glibc--stable-2022.08-1
ARG PYTHON_VER=3.13
ARG MLC_LLM_REPO=https://github.com/mlc-ai/mlc-llm.git
ARG MLC_LLM_REF=main
ARG TVM_CUDA_ARCHS="87;89"
ARG HOST_CUDA_ARCH=89

SHELL ["/bin/bash", "-o", "pipefail", "-c"]

# ----------------------------------------------------------------------
# Base deps
# ----------------------------------------------------------------------
RUN apt update && apt install -y --no-install-recommends \
    git git-lfs gnupg ca-certificates curl wget gosu \
    build-essential binutils pkg-config \
    zlib1g-dev libzstd-dev libxml2-dev libedit-dev libtinfo-dev \
    && rm -rf /var/lib/apt/lists/*

# ----------------------------------------------------------------------
# LLVM 20
# ----------------------------------------------------------------------
RUN wget -q -O - https://apt.llvm.org/llvm-snapshot.gpg.key \
    | gpg --dearmor -o /usr/share/keyrings/llvm-archive.gpg \
 && echo "deb [signed-by=/usr/share/keyrings/llvm-archive.gpg] http://apt.llvm.org/jammy/ llvm-toolchain-jammy-20 main" \
    > /etc/apt/sources.list.d/llvm.list \
 && apt update \
 && apt install -y --no-install-recommends \
    llvm-20-dev clang-20 lld-20 libpolly-20-dev \
 && rm -rf /var/lib/apt/lists/*

RUN ln -sf /usr/bin/llvm-config-20 /usr/local/bin/llvm-config \
 && ln -sf /usr/bin/clang-20 /usr/local/bin/clang \
 && ln -sf /usr/bin/clang++-20 /usr/local/bin/clang++ \
 && ln -sf /usr/bin/lld-20 /usr/local/bin/lld \
 && if [ -x /usr/bin/ld.lld-20 ]; then \
      ln -sf /usr/bin/ld.lld-20 /usr/local/bin/ld.lld ; \
    else \
      ln -sf /usr/bin/lld-20 /usr/local/bin/ld.lld ; \
    fi

# ----------------------------------------------------------------------
# Rust
# ----------------------------------------------------------------------
RUN curl https://sh.rustup.rs -sSf | sh -s -- -y --default-toolchain stable
ENV PATH="/root/.cargo/bin:${PATH}"

# ----------------------------------------------------------------------
# uv + uv-managed Python (global path) + baked venv
# ----------------------------------------------------------------------
ENV UV_INSTALL_DIR=/opt/uv/bin \
    UV_PYTHON_INSTALL_DIR=/opt/uv/python \
    UV_PYTHON_BIN_DIR=/opt/uv/bin

RUN mkdir -p /opt/uv/bin /opt/uv/python \
 && curl -LsSf https://astral.sh/uv/install.sh | sh

ENV PATH="/opt/venv/bin:/opt/uv/bin:/root/.cargo/bin:${PATH}"

RUN uv python install ${PYTHON_VER} \
 && uv venv /opt/venv --python ${PYTHON_VER}

ENV VIRTUAL_ENV=/opt/venv

# ----------------------------------------------------------------------
# Clone MLC-LLM source
# ----------------------------------------------------------------------
WORKDIR /opt/src
RUN git clone --recursive ${MLC_LLM_REPO} mlc-llm \
 && cd mlc-llm \
 && git checkout ${MLC_LLM_REF} \
 && git submodule update --init --recursive \
 && git lfs install --system

ENV MLC_LLM_HOME=/opt/src/mlc-llm

# ----------------------------------------------------------------------
# Build TVM from source (non-editable install into /opt/venv)
# ----------------------------------------------------------------------
WORKDIR /opt/src/mlc-llm/3rdparty/tvm

RUN uv pip install --upgrade setuptools wheel cmake ninja \
 && uv pip install ./3rdparty/tvm-ffi --verbose \
 && uv pip install . --verbose \
      '--config-setting=cmake.define.CMAKE_BUILD_TYPE=RelWithDebInfo' \
      '--config-setting=cmake.define.USE_LLVM=llvm-config --ignore-libllvm --link-static' \
      '--config-setting=cmake.define.HIDE_PRIVATE_SYMBOLS=ON' \
      '--config-setting=cmake.define.USE_CUDA=ON' \
      "--config-setting=cmake.define.CMAKE_CUDA_ARCHITECTURES=${TVM_CUDA_ARCHS}" \
      '--config-setting=cmake.define.USE_CUBLAS=ON' \
      '--config-setting=cmake.define.USE_CUTLASS=ON' \
      '--config-setting=cmake.define.USE_THRUST=ON' \
      '--config-setting=cmake.define.USE_NVTX=ON'

# ----------------------------------------------------------------------
# Build MLC-LLM C++ part
# ----------------------------------------------------------------------
WORKDIR /opt/src/mlc-llm

RUN rm -rf build \
 && mkdir -p build \
 && cat > build/config.cmake <<EOF
set(TVM_SOURCE_DIR /opt/src/mlc-llm/3rdparty/tvm)
set(CMAKE_BUILD_TYPE RelWithDebInfo)
set(USE_CUDA ON)
set(USE_CUTLASS ON)
set(USE_CUBLAS ON)
set(USE_ROCM OFF)
set(USE_VULKAN OFF)
set(USE_METAL OFF)
set(USE_OPENCL OFF)
set(USE_THRUST ON)
set(CMAKE_CUDA_ARCHITECTURES ${HOST_CUDA_ARCH})
set(FLASHINFER_CUDA_ARCHITECTURES ${HOST_CUDA_ARCH})
EOF

RUN cmake -S /opt/src/mlc-llm -B /opt/src/mlc-llm/build -G Ninja \
 && cmake --build /opt/src/mlc-llm/build --parallel "$(nproc)"

# ----------------------------------------------------------------------
# Install MLC-LLM python package (non-editable)
# ----------------------------------------------------------------------
RUN uv pip install ./python --verbose

RUN mkdir -p /workspace /cache \
 && chmod -R a+rX /opt/uv /opt/venv /opt/src/mlc-llm

# ----------------------------------------------------------------------
# Bootlin toolchain
# ----------------------------------------------------------------------
RUN mkdir -p /opt/l4t-gcc \
 && curl -L -o /tmp/${TOOLCHAIN_DIRNAME}.tar.bz2 \
      https://toolchains.bootlin.com/downloads/releases/toolchains/aarch64/tarballs/${TOOLCHAIN_DIRNAME}.tar.bz2 \
 && tar -xjf /tmp/${TOOLCHAIN_DIRNAME}.tar.bz2 -C /opt/l4t-gcc \
 && rm /tmp/${TOOLCHAIN_DIRNAME}.tar.bz2

ENV TOOLCHAIN_DIRNAME=${TOOLCHAIN_DIRNAME}
ENV TOOLCHAIN_ROOT=/opt/l4t-gcc/${TOOLCHAIN_DIRNAME}
ENV CROSS_COMPILE=${TOOLCHAIN_ROOT}/bin/aarch64-buildroot-linux-gnu-
ENV PATH="${TOOLCHAIN_ROOT}/bin:${PATH}"

RUN cat > /usr/local/bin/jetson-aarch64-env.sh <<'EOF' \
 && chmod +x /usr/local/bin/jetson-aarch64-env.sh
#!/usr/bin/env bash
export TOOLCHAIN_ROOT="${TOOLCHAIN_ROOT:-/opt/l4t-gcc/${TOOLCHAIN_DIRNAME}}"
export CROSS_COMPILE="${TOOLCHAIN_ROOT}/bin/aarch64-buildroot-linux-gnu-"
export CC="${CROSS_COMPILE}gcc"
export CXX="${CROSS_COMPILE}g++"
export AR="${CROSS_COMPILE}ar"
export RANLIB="${CROSS_COMPILE}ranlib"
export LD="${CROSS_COMPILE}ld"
export STRIP="${CROSS_COMPILE}strip"
echo "Using cross toolchain:"
echo "  CC=${CC}"
echo "  CXX=${CXX}"
EOF

# ----------------------------------------------------------------------
# entrypoint
# ----------------------------------------------------------------------
COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
RUN chmod +x /usr/local/bin/docker-entrypoint.sh

ENV PYTHONUNBUFFERED=1
ENV MLC_LLM_HOME=/workspace/.mlc_llm
WORKDIR /workspace
ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
CMD ["bash"]