Compiling and Running Open-Source LLMs with MLC-LLM on an RTX 4090 Desktop

In my previous post, I wrote about building MLC-LLM from source with uv on Ubuntu 24.04 on my desktop machine. Once that environment was working, the next step was to actually compile models and run them locally.

This post is a memo-style walkthrough of the workflow I used to compile and run several open-source LLMs with MLC-LLM on the same desktop PC.

Test environment

I ran these steps on the following machine:

  • CPU: AMD Ryzen 9 7950X 16-Core Processor
  • GPU: NVIDIA RTX 4090
  • Memory: 64GB
  • Storage: 4TB SSD

I compiled the model libraries for the CUDA backend and ran them on the RTX 4090.

Models I tried

I tested the following models:

  • Qwen3-0.6B
  • llm-jp-3-440m-instruct3
  • llm-jp-3.1-1.8b-instruct4

For these models, I used the following conversation templates:

  • Qwen3-0.6B: qwen2
  • llm-jp-3-440m-instruct3: llm-jp
  • llm-jp-3.1-1.8b-instruct4: llm-jp

The conversation template should match the model family.

Prerequisite

This article assumes that mlc_llm is already built and installed. If not, see my earlier post first.

Overall workflow

The workflow I used was:

  1. download the original model from Hugging Face
  2. convert the model weights into MLC format
  3. generate the model config
  4. compile the model library for CUDA
  5. run the model with mlc_llm chat

I also kept everything under a local dist/ directory so that the downloaded model, converted artifacts, and compiled libraries were easy to track.

Directory layout

I used the following layout under ~/work:

  • dist/models/<model-name>/ for the original model files downloaded from Hugging Face
  • dist/<model-name>-<quantization>-MLC/ for converted weights and generated config
  • dist/libs/<model-name>-<quantization>-cuda.so for the compiled model library
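Filled in for one concrete model (Qwen3-0.6B with q4f16_1, as an illustration), the layout works out to:

```shell
# Illustrative layout for a single model (Qwen3-0.6B, q4f16_1)
mkdir -p dist/models/Qwen3-0.6B       # original Hugging Face files
mkdir -p dist/Qwen3-0.6B-q4f16_1-MLC  # converted weights + generated config
mkdir -p dist/libs                    # compiled model libraries

# The compile step then writes the shared library to:
echo "dist/libs/Qwen3-0.6B-q4f16_1-cuda.so"
```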

Generic command sequence

The basic pattern looked like this:

cd ~/work
mkdir -p dist/models
mkdir -p dist/libs

export MODEL_REPO="<publisher>/<model-repo>"
export TARGET_MODEL_NAME="<model-name>"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="<conversation-template>"

# Download model from Hugging Face
uvx hf download "${MODEL_REPO}" --local-dir "dist/models/${TARGET_MODEL_NAME}"

# Convert model weights
mlc_llm convert_weight "./dist/models/${TARGET_MODEL_NAME}/" \
  --quantization "${QUANTIZATION_TYPE}" \
  -o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"

# Generate config
mlc_llm gen_config "./dist/models/${TARGET_MODEL_NAME}/" \
  --quantization "${QUANTIZATION_TYPE}" \
  --conv-template "${CONV_TEMPLATE}" \
  -o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"

# Compile model library
mlc_llm compile "./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC" \
  --device cuda \
  -o "dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so"

# Run
mlc_llm chat \
  --device "cuda:0" \
  --model-lib "./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so" \
  "./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC"

In my case, I used q4f16_1 for quantization across the models I tested.
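Since the sequence above is the same for every model, it can be folded into a small helper function. This is a hypothetical sketch, not part of my original workflow: the `build_and_run` name and the `DRY_RUN` switch (which prints the commands instead of executing them) are my own additions for illustration.

```shell
# Hypothetical wrapper around the five steps for one model.
# Set DRY_RUN=1 to print the commands instead of executing them.
build_and_run() {
  repo="$1"; name="$2"; quant="${3:-q4f16_1}"; tmpl="$4"
  src="dist/models/${name}"
  out="dist/${name}-${quant}-MLC"
  lib="dist/libs/${name}-${quant}-cuda.so"
  run() { if [ "${DRY_RUN:-0}" = "1" ]; then echo "$@"; else "$@"; fi; }

  run uvx hf download "${repo}" --local-dir "${src}"
  run mlc_llm convert_weight "${src}/" --quantization "${quant}" -o "${out}/"
  run mlc_llm gen_config "${src}/" --quantization "${quant}" \
      --conv-template "${tmpl}" -o "${out}/"
  run mlc_llm compile "${out}" --device cuda -o "${lib}"
  run mlc_llm chat --device cuda:0 --model-lib "${lib}" "${out}"
}
```

With this in place, one model is one call, e.g. `build_and_run Qwen/Qwen3-0.6B Qwen3-0.6B q4f16_1 qwen2`.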

What each step does

1. Download the original model

I used:

uvx hf download "${MODEL_REPO}" --local-dir "dist/models/${TARGET_MODEL_NAME}"

This downloads the source model files from Hugging Face into a local directory under dist/models/.

2. Convert the model weights

Next, I converted the model into MLC format:

mlc_llm convert_weight "./dist/models/${TARGET_MODEL_NAME}/" \
  --quantization "${QUANTIZATION_TYPE}" \
  -o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"

This is the point where the original model weights are transformed into an MLC-compatible representation.

3. Generate the model config

After converting the weights, I generated the runtime config:

mlc_llm gen_config "./dist/models/${TARGET_MODEL_NAME}/" \
  --quantization "${QUANTIZATION_TYPE}" \
  --conv-template "${CONV_TEMPLATE}" \
  -o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"

The key setting here is --conv-template; it must match the model family.

For the models I tested:

  • use qwen2 for Qwen3-0.6B
  • use llm-jp for the llm-jp models
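For just these three models, the mapping is simple enough to express as a lookup. The `pick_conv_template` helper below is a sketch of my own (it only covers the models tested in this post, not a general rule):

```shell
# Hypothetical helper: pick the conversation template from the model name.
# Only covers the model families tested in this post.
pick_conv_template() {
  case "$1" in
    Qwen3-*)  echo "qwen2" ;;
    llm-jp-*) echo "llm-jp" ;;
    *)        echo "unknown model family: $1" >&2; return 1 ;;
  esac
}
```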

4. Compile the model library

Then I compiled the model library for CUDA:

mlc_llm compile "./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC" \
  --device cuda \
  -o "dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so"

This step produces the shared library that MLC-LLM will load at runtime.

5. Run the model

Finally, I launched the chat interface:

mlc_llm chat \
  --device "cuda:0" \
  --model-lib "./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so" \
  "./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC"

There are two paths involved here:

  • --model-lib points to the compiled shared library
  • the final positional argument points to the converted model directory
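Both paths are derived from the same variables used throughout; spelled out for one example model (values are illustrative):

```shell
# How the two paths passed to `mlc_llm chat` are built (example values)
TARGET_MODEL_NAME="Qwen3-0.6B"
QUANTIZATION_TYPE="q4f16_1"

MODEL_LIB="./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so"  # --model-lib
MODEL_DIR="./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC"           # positional arg

echo "${MODEL_LIB}"
echo "${MODEL_DIR}"
```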

Concrete example: Qwen3-0.6B

For Qwen3-0.6B, I used the following settings:

cd ~/work
mkdir -p dist/models
mkdir -p dist/libs

export MODEL_REPO="Qwen/Qwen3-0.6B"
export TARGET_MODEL_NAME="Qwen3-0.6B"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="qwen2"

uvx hf download "${MODEL_REPO}" --local-dir "dist/models/${TARGET_MODEL_NAME}"

mlc_llm convert_weight "./dist/models/${TARGET_MODEL_NAME}/" \
  --quantization "${QUANTIZATION_TYPE}" \
  -o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"

mlc_llm gen_config "./dist/models/${TARGET_MODEL_NAME}/" \
  --quantization "${QUANTIZATION_TYPE}" \
  --conv-template "${CONV_TEMPLATE}" \
  -o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"

mlc_llm compile "./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC" \
  --device cuda \
  -o "dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so"

mlc_llm chat \
  --device "cuda:0" \
  --model-lib "./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so" \
  "./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC"

For this model, I tried the following prompt:

What is the meaning of life?

Running the llm-jp models

For the two llm-jp models, the overall process was exactly the same. The main things that changed were:

  • the model repository
  • the target model name
  • the conversation template

I used:

  • llm-jp-3-440m-instruct3 with CONV_TEMPLATE="llm-jp"
  • llm-jp-3.1-1.8b-instruct4 with CONV_TEMPLATE="llm-jp"

In practice, that meant replacing only a few environment variables before re-running the same sequence of commands.

A simplified pattern for the llm-jp models looked like this:

export MODEL_REPO="<llm-jp model repo>"
export TARGET_MODEL_NAME="llm-jp-3-440m-instruct3"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="llm-jp"

or:

export MODEL_REPO="<llm-jp model repo>"
export TARGET_MODEL_NAME="llm-jp-3.1-1.8b-instruct4"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="llm-jp"

Then I ran the same uvx hf download, mlc_llm convert_weight, mlc_llm gen_config, mlc_llm compile, and mlc_llm chat commands.
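Since only the model name changes between the two llm-jp runs, the per-model settings can be sketched as a loop (the repository names stay as placeholders, as above; the loop itself is my own illustration):

```shell
# The only per-model settings that change between the two llm-jp runs.
# MODEL_REPO is left out here; it stays a placeholder in the text above.
QUANTIZATION_TYPE="q4f16_1"
CONV_TEMPLATE="llm-jp"
for TARGET_MODEL_NAME in llm-jp-3-440m-instruct3 llm-jp-3.1-1.8b-instruct4; do
  echo "model=${TARGET_MODEL_NAME} quant=${QUANTIZATION_TYPE} template=${CONV_TEMPLATE}"
done
```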

For the llm-jp models, I tried this prompt:

自然言語処理とは何か ("What is natural language processing?")

That gave me a quick sanity check for Japanese instruction-following behavior.

A few points that mattered in practice

A few details were easy to miss but important:

Use the correct conversation template

This was the main model-specific setting in my workflow.

For my tests:

  • Qwen3-0.6B used qwen2
  • both llm-jp models used llm-jp

Compile for the device you plan to run on

On this desktop machine, I compiled with:

--device cuda

and then ran with:

--device "cuda:0"

That matched the target hardware, which in this case was the RTX 4090.

Final thoughts

This post was a continuation of my earlier article. Once that build environment was ready, compiling and running actual models turned out to be fairly structured:

  1. download the model
  2. convert the weights
  3. generate the config
  4. compile the library
  5. run the model

On my RTX 4090 desktop, this workflow worked for:

  • Qwen3-0.6B
  • llm-jp-3-440m-instruct3
  • llm-jp-3.1-1.8b-instruct4

The core process stayed the same across models. That consistency is one of the things I like about using MLC-LLM for local model execution: once the toolchain is in place, the path from model download to local chat is relatively straightforward.