Compiling and Running Open-Source LLMs with MLC-LLM on an RTX 4090 Desktop
In my previous post, I wrote about building MLC-LLM from source with uv on Ubuntu 24.04 on my desktop machine. Once that environment was working, the next step was to actually compile models and run them locally.
This post is a memo-style walkthrough of the workflow I used to compile and run several open-source LLMs with MLC-LLM on the same desktop PC.
Test environment
I ran these steps on the following machine:
- CPU: AMD Ryzen 9 7950X 16-Core Processor
- GPU: NVIDIA RTX 4090
- Memory: 64GB
- Storage: 4TB SSD
I compiled the model libraries for the CUDA backend and ran them on the RTX 4090.
Models I tried
I tested the following models:
- Qwen3-0.6B
- llm-jp-3-440m-instruct3
- llm-jp-3.1-1.8b-instruct4
For these models, I used the following conversation templates:
- Qwen3-0.6B → qwen2
- llm-jp-3-440m-instruct3 → llm-jp
- llm-jp-3.1-1.8b-instruct4 → llm-jp
The conversation template should match the model family.
Prerequisite
This article assumes that mlc_llm is already built and installed. If not, see my earlier post first.
Overall workflow
The workflow I used was:
- download the original model from Hugging Face
- convert the model weights into MLC format
- generate the model config
- compile the model library for CUDA
- run the model with mlc_llm chat
I also kept everything under a local dist/ directory so that the downloaded model, converted artifacts, and compiled libraries were easy to track.
Directory layout
I used the following layout under ~/work:
- dist/models/<model-name>/ for the original model files downloaded from Hugging Face
- dist/<model-name>-<quantization>-MLC/ for converted weights and generated config
- dist/libs/<model-name>-<quantization>-cuda.so for the compiled model library
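Given a model name and a quantization code, the three locations follow mechanically from that layout. A minimal sketch (using the Qwen3-0.6B values from later in this post):

```shell
# Derive the three artifact locations used throughout this post
# from a model name and a quantization code.
TARGET_MODEL_NAME="Qwen3-0.6B"
QUANTIZATION_TYPE="q4f16_1"

SRC_DIR="dist/models/${TARGET_MODEL_NAME}"                               # original HF files
MLC_DIR="dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC"             # converted weights + config
LIB_PATH="dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so"   # compiled library

echo "$SRC_DIR"
echo "$MLC_DIR"
echo "$LIB_PATH"
```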
Generic command sequence
The basic pattern looked like this:
cd ~/work
mkdir -p dist/models
mkdir -p dist/libs
export MODEL_REPO="<publisher>/<model-repo>"
export TARGET_MODEL_NAME="<model-name>"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="<conversation-template>"
# Download model from Hugging Face
uvx hf download "${MODEL_REPO}" --local-dir "dist/models/${TARGET_MODEL_NAME}"
# Convert model weights
mlc_llm convert_weight "./dist/models/${TARGET_MODEL_NAME}/" \
--quantization "${QUANTIZATION_TYPE}" \
-o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"
# Generate config
mlc_llm gen_config "./dist/models/${TARGET_MODEL_NAME}/" \
--quantization "${QUANTIZATION_TYPE}" \
--conv-template "${CONV_TEMPLATE}" \
-o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"
# Compile model library
mlc_llm compile "./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC" \
--device cuda \
-o "dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so"
# Run
mlc_llm chat \
--device "cuda:0" \
--model-lib "./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so" \
"./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC"
In my case, I used q4f16_1 for quantization across the models I tested.
What each step does
1. Download the original model
I used:
uvx hf download "${MODEL_REPO}" --local-dir "dist/models/${TARGET_MODEL_NAME}"
This downloads the source model files from Hugging Face into a local directory under dist/models/.
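Before moving on to conversion, it can save time to confirm the download actually landed. A small sketch with a hypothetical helper, check_download, which only looks for config.json (the exact file list varies by model, so adjust as needed):

```shell
# Hypothetical helper: fail fast if a downloaded model directory is
# missing config.json, which the converter needs to read the architecture.
check_download() {
  dir="$1"
  if [ ! -f "$dir/config.json" ]; then
    echo "missing config.json in $dir" >&2
    return 1
  fi
  echo "ok: $dir"
}

# Usage after downloading:
# check_download "dist/models/${TARGET_MODEL_NAME}"
```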
2. Convert the model weights
Next, I converted the model into MLC format:
mlc_llm convert_weight "./dist/models/${TARGET_MODEL_NAME}/" \
--quantization "${QUANTIZATION_TYPE}" \
-o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"
This is the point where the original model weights are transformed into an MLC-compatible representation.
3. Generate the model config
After converting the weights, I generated the runtime config:
mlc_llm gen_config "./dist/models/${TARGET_MODEL_NAME}/" \
--quantization "${QUANTIZATION_TYPE}" \
--conv-template "${CONV_TEMPLATE}" \
-o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"
The key setting here is --conv-template, which must match the model family.
For the models I tested:
- use qwen2 for Qwen3-0.6B
- use llm-jp for the llm-jp models
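To confirm what gen_config actually recorded, you can inspect the mlc-chat-config.json it writes into the -MLC output directory. Depending on the MLC-LLM version the conv_template field is a plain string or a nested object, so a loose grep is the simplest check; the snippet below illustrates it on a sample file rather than a real output directory:

```shell
# Sample of the relevant part of a generated mlc-chat-config.json
# (in practice, grep the file inside your -MLC output directory).
cat > mlc-chat-config.json <<'EOF'
{"model_type": "qwen3", "quantization": "q4f16_1", "conv_template": "qwen2"}
EOF

# Show which conversation template was recorded.
grep -o '"conv_template"[^,}]*' mlc-chat-config.json
```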
4. Compile the model library
Then I compiled the model library for CUDA:
mlc_llm compile "./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC" \
--device cuda \
-o "dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so"
This step produces the shared library that MLC-LLM will load at runtime.
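A quick existence check on the output catches silent compile failures before you get a confusing error at chat time. A sketch with a hypothetical helper, check_model_lib:

```shell
# Hypothetical helper: fail fast if the compiled model library is
# missing or empty under dist/libs/.
check_model_lib() {
  lib="$1"
  if [ ! -s "$lib" ]; then
    echo "missing or empty model library: $lib" >&2
    return 1
  fi
  echo "ok: $lib"
}

# Usage after compiling:
# check_model_lib "dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so"
```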
5. Run the model
Finally, I launched the chat interface:
mlc_llm chat \
--device "cuda:0" \
--model-lib "./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so" \
"./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC"
There are two paths involved here:
- --model-lib points to the compiled shared library
- the final positional argument points to the converted model directory
Concrete example: Qwen3-0.6B
For Qwen3-0.6B, I used the following settings:
cd ~/work
mkdir -p dist/models
mkdir -p dist/libs
export MODEL_REPO="Qwen/Qwen3-0.6B"
export TARGET_MODEL_NAME="Qwen3-0.6B"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="qwen2"
uvx hf download "${MODEL_REPO}" --local-dir "dist/models/${TARGET_MODEL_NAME}"
mlc_llm convert_weight "./dist/models/${TARGET_MODEL_NAME}/" \
--quantization "${QUANTIZATION_TYPE}" \
-o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"
mlc_llm gen_config "./dist/models/${TARGET_MODEL_NAME}/" \
--quantization "${QUANTIZATION_TYPE}" \
--conv-template "${CONV_TEMPLATE}" \
-o "dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC/"
mlc_llm compile "./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC" \
--device cuda \
-o "dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so"
mlc_llm chat \
--device "cuda:0" \
--model-lib "./dist/libs/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-cuda.so" \
"./dist/${TARGET_MODEL_NAME}-${QUANTIZATION_TYPE}-MLC"
For this model, I tried the following prompt:
What is the meaning of life?
Running the llm-jp models
For the two llm-jp models, the overall process was exactly the same. The main things that changed were:
- the model repository
- the target model name
- the conversation template
I used:
- llm-jp-3-440m-instruct3 with CONV_TEMPLATE="llm-jp"
- llm-jp-3.1-1.8b-instruct4 with CONV_TEMPLATE="llm-jp"
In practice, that meant replacing only a few environment variables before re-running the same sequence of commands.
A simplified pattern for the llm-jp models looked like this:
export MODEL_REPO="<llm-jp model repo>"
export TARGET_MODEL_NAME="llm-jp-3-440m-instruct3"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="llm-jp"
or:
export MODEL_REPO="<llm-jp model repo>"
export TARGET_MODEL_NAME="llm-jp-3.1-1.8b-instruct4"
export QUANTIZATION_TYPE="q4f16_1"
export CONV_TEMPLATE="llm-jp"
Then I ran the same uvx hf download, mlc_llm convert_weight, mlc_llm gen_config, mlc_llm compile, and mlc_llm chat commands.
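Since only four values change between models, the whole sequence can be wrapped in a small function. This is a hypothetical sketch shown in dry-run form (each command is echoed, not executed) so the variable substitutions can be eyeballed first; drop the leading echo on each line to actually run it:

```shell
# Dry-run sketch of the per-model pipeline: download, convert, gen_config,
# compile, chat. Echoes the commands instead of executing them.
build_model() {
  repo="$1"; name="$2"; quant="$3"; template="$4"
  mlc_dir="dist/${name}-${quant}-MLC"
  lib="dist/libs/${name}-${quant}-cuda.so"
  echo uvx hf download "$repo" --local-dir "dist/models/${name}"
  echo mlc_llm convert_weight "./dist/models/${name}/" --quantization "$quant" -o "$mlc_dir/"
  echo mlc_llm gen_config "./dist/models/${name}/" --quantization "$quant" --conv-template "$template" -o "$mlc_dir/"
  echo mlc_llm compile "./$mlc_dir" --device cuda -o "$lib"
  echo mlc_llm chat --device cuda:0 --model-lib "./$lib" "./$mlc_dir"
}

# Example with the Qwen3-0.6B settings from earlier:
build_model "Qwen/Qwen3-0.6B" "Qwen3-0.6B" "q4f16_1" "qwen2"
```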
For the llm-jp models, I tried this prompt:
自然言語処理とは何か ("What is natural language processing?")
That gave me a quick sanity check for Japanese instruction-following behavior.
A few points that mattered in practice
A few details were easy to miss but important:
Use the correct conversation template
This was the main model-specific setting in my workflow.
For my tests:
- Qwen3-0.6B used qwen2
- both llm-jp models used llm-jp
Compile for the device you plan to run on
On this desktop machine, I compiled with:
--device cuda
and then ran with:
--device "cuda:0"
That matched the target hardware, which in this case was the RTX 4090.
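Before compiling for CUDA, it can also help to confirm a CUDA-capable GPU is actually visible. A small sketch, assuming the NVIDIA driver's nvidia-smi tool is installed (the guard lets it degrade gracefully on machines without one):

```shell
# Check whether a CUDA device is visible before compiling for it.
if command -v nvidia-smi >/dev/null 2>&1; then
  GPU_INFO="$(nvidia-smi --query-gpu=name,memory.total --format=csv,noheader)"
else
  GPU_INFO="nvidia-smi not found; skipping GPU check"
fi
echo "$GPU_INFO"
```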
Final thoughts
This post was a continuation of my earlier article. Once that build environment was ready, compiling and running actual models turned out to be fairly structured:
- download the model
- convert the weights
- generate the config
- compile the library
- run the model
On my RTX 4090 desktop, this workflow worked for:
- Qwen3-0.6B
- llm-jp-3-440m-instruct3
- llm-jp-3.1-1.8b-instruct4
The core process stayed the same across models. That consistency is one of the things I like about using MLC-LLM for local model execution: once the toolchain is in place, the path from model download to local chat is relatively straightforward.