Supported accelerators
CUDA: compile with the cuda feature: --features cuda
FlashAttention support: additionally compile with the flash-attn feature
cuDNN support: additionally compile with the cudnn feature: --features cudnn
Metal: compile with the metal feature: --features metal
CPU:
Intel MKL: compile with the mkl feature: --features mkl
Apple Accelerate: compile with the accelerate feature: --features accelerate
ARM NEON and AVX are used automatically
Enabling features is done by passing --features ... to the build system. When using cargo run or maturin develop, pass the --features flag before the -- separating build flags from runtime flags.
To enable a single feature like metal: cargo build --release --features metal.
To enable multiple features, specify them in quotes: cargo build --release --features "cuda flash-attn cudnn".
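For example, a sketch of launching the server via cargo run with the features passed before the -- separator (the package selection and model ID below are illustrative assumptions; adjust them to your setup):
# assumes the server package is named mistralrs-server and that you want a Mistral model
cargo run --release -p mistralrs-server --features "cuda flash-attn" -- --port 1234 plain -m mistralai/Mistral-7B-Instruct-v0.1 -a mistral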
Installation and Build
Install required packages: OpenSSL (on Ubuntu: sudo apt install libssl-dev) and pkg-config (on Ubuntu: sudo apt install pkg-config).
Install Rust (https://rustup.rs/). Example on Ubuntu:
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source $HOME/.cargo/env
Optional: Set the HF token correctly (skip if it is already set, if your model is not gated, or if you want to use the token_source parameters in Python or on the command line).
Note: you can install huggingface-cli as documented here.
huggingface-cli login
Download the code:
git clone https://github.com/EricLBuehler/mistral.rs.git
cd mistral.rs
Build or install. Pass the same values to --features as you would for cargo build, whether building the base command or adding CUDA, Flash Attention V2, Metal, Accelerate, or MKL support.
Install with cargo install for easy command line usage:
cargo install --path mistralrs-server --features cuda
Then use our APIs and integrations (see the APIs and Integrations section).
Alternatively, the build process (cargo build --release with the desired --features) will output a binary mistralrs-server at ./target/release/mistralrs-server, which may be copied into the working directory.
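For example, a copy along these lines works, assuming the default cargo target directory (the destination name is up to you):
# copy the release binary into the current working directory
cp ./target/release/mistralrs-server ./mistralrs-server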
Getting models
There are two ways to get models with mistral.rs: from the Hugging Face Hub or from local files.
Getting models from Hugging Face Hub
Mistral.rs can automatically download models from the HF Hub. To access gated models, you should provide a token source, which may be one of:
literal:<value>: Load from a specified literal
env:<value>: Load from a specified environment variable
path:<value>: Load from a specified file
cache (the default): Load from the HF token at ~/.cache/huggingface/token or equivalent
none: Use no HF token
This is passed via the token_source parameters in Python or on the command line. If the token cannot be loaded, no token will be used (i.e. effectively using none).
Loading models from local files:
You can also instruct mistral.rs to load models fully locally by modifying the *_model_id arguments or options:
./mistralrs-server --port 1234 plain -m . -a mistral
Throughout mistral.rs, any model ID argument or option may be a local path; the path should contain the following files, depending on the model ID option (a sketch of a typical local directory follows the list):
--model-id (server) or model_id (python/rust) or --tok-model-id (server) or tok_model_id (python/rust):
config.json
tokenizer_config.json
tokenizer.json (if not specified separately)
.safetensors/.bin/.pth/.pt files (defaults to .safetensors)
preprocessor_config.json (required for vision models).
processor_config.json (optional for vision models).
--quantized-model-id (server) or quantized_model_id (python/rust):
Specified .gguf or .ggml file.
--x-lora-model-id (server) or xlora_model_id (python/rust):
xlora_classifier.safetensors
xlora_config.json
Adapters .safetensors and adapter_config.json files in their respective directories
--adapters-model-id (server) or adapters_model_id (python/rust):
Adapters .safetensors and adapter_config.json files in their respective directories
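As a sketch, a local plain-model directory passed via --model-id might look like this (the shard file names are invented for illustration; the actual set of files depends on the model):
ls ./my-local-model/
# config.json  tokenizer_config.json  tokenizer.json
# model-00001-of-00002.safetensors  model-00002-of-00002.safetensors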
Running GGUF models
To run GGUF models, the only mandatory arguments are the quantized model ID and the quantized filename. The quantized model ID can be a HF model ID.
GGUF models contain a tokenizer. However, mistral.rs allows you to run the model with a tokenizer from a specified model, typically the official one. This means there are two options (see the example commands after the list of supported tokenizer types below):
1. With a specified tokenizer: running with a tokenizer model ID lets you specify the model ID to source the tokenizer from. If the specified tokenizer model ID contains a tokenizer.json, it will be used over the GGUF tokenizer.
2. With the builtin tokenizer from the GGUF file itself (the GGUF model ID may also point to a local file).
There are a few more ways to configure:
Chat template: the chat template can be automatically detected and loaded from the GGUF file if no other chat template source is specified, including the tokenizer model ID.
If that does not work, you can either provide a tokenizer (recommended) or specify a custom chat template.
Tokenizer: the following tokenizer model types are currently supported. If you would like one to be added, please raise an issue. Otherwise, please consider using the method demonstrated in the examples below, where the tokenizer is sourced from Hugging Face.
Supported GGUF tokenizer types
llama (sentencepiece)
gpt2 (BPE)
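For example, reusing the Zephyr GGUF model referenced later in this guide (the quantized filename is specific to that repository):
# option 2: rely on the tokenizer built into the GGUF file
./mistralrs-server gguf -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf
# option 1: source the tokenizer from a specified model ID
./mistralrs-server gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf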
Run with the CLI
Mistral.rs uses subcommands to control the model type. They are generally of format <XLORA/LORA>-<QUANTIZATION>. Please run ./mistralrs-server --help to see the subcommands.
Architecture for plain models
Note: for plain models, you can specify the data type to load and run in. This must be one of f32, f16, bf16, or auto to choose based on the device. It is specified in the --dtype/-d parameter after the model architecture (plain); see the example after the list below. For quantized models (gguf/ggml), you may specify a data type of f32 or bf16 (f16 is not recommended due to its lower precision in quantized inference).
If you do not specify the architecture, an attempt will be made to use the model’s config. If this fails, please raise an issue.
mistral
gemma
mixtral
llama
phi2
phi3
phi3.5moe
qwen2
gemma2
starcoder2
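For example, a sketch of loading a plain model in bf16 (the model ID is reused from the interactive-mode example below):
# -d/--dtype selects the data type; use -d auto to choose based on the device
./mistralrs-server --port 1234 plain -m microsoft/Phi-3-mini-128k-instruct -a phi3 -d bf16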
Architecture for vision models
Note: for vision models, you can specify the data type to load and run in. This must be one of f32, f16, bf16, or auto to choose based on the device. It is specified in the --dtype/-d parameter after the model architecture (vision-plain).
phi3v
idefics2
llava_next
llava
vllama
qwen2vl
idefics3
Supported GGUF architectures
Plain:
llama
phi2
phi3
starcoder2
qwen2
With adapters:
llama
phi3
Interactive mode
You can launch interactive mode, a simple chat application running in the terminal, by passing -i:
./mistralrs-server -i plain -m microsoft/Phi-3-mini-128k-instruct -a phi3
Vision models work too:
./mistralrs-server -i vision-plain -m lamm-mit/Cephalo-Llama-3.2-11B-Vision-Instruct-128k -a vllama
And even diffusion models:
./mistralrs-server -i diffusion-plain -m black-forest-labs/FLUX.1-schnell -a flux
On Apple Silicon (Metal), run with a throughput log, paged attention settings (a maximum of 4 GB used for the KV cache), and bf16 as the dtype for the KV cache and attention:
./mistralrs-server --port 1234 plain -m microsoft/Phi-3.5-MoE-instruct -a phi3.5moe
OpenAI HTTP server
You can launch an HTTP server; the non-interactive commands shown above with --port 1234 do exactly this, exposing an OpenAI-compatible API. A minimal request sketch follows the next paragraph.
Structured selection with a .toml file
We provide a method to select models with a .toml file. The keys are the same as the command line, with no_kv_cache and tokenizer_json being "global" keys.
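As a sketch, assuming the server was started with --port 1234 and exposes the usual OpenAI-style /v1/chat/completions route, a chat request could look like this (the model name "default" is a placeholder):
# assumption: server listening on localhost:1234 with an OpenAI-compatible chat endpoint
curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello!"}]}'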
Using derivative models
To use a derivative model, select the model architecture using the correct subcommand. To see what can be passed for the architecture, pass --help after the subcommand. For example, when using a different model than the default, specify the following, depending on the model type:
Plain: Model id
Quantized: Quantized model id, quantized filename, and tokenizer id
X-LoRA: Model id, X-LoRA ordering
X-LoRA quantized: Quantized model id, quantized filename, tokenizer id, and X-LoRA ordering
LoRA: Model id, LoRA ordering
LoRA quantized: Quantized model id, quantized filename, tokenizer id, and LoRA ordering
Vision Plain: Model id
See this section to determine whether it is necessary to prepare an X-LoRA/LoRA ordering file; it is always necessary if the target modules or architecture changed, or if the adapter order changed.
It is also important to check the chat template style of the model. If the HF hub repo has a tokenizer_config.json file, it is not necessary to specify one. Otherwise, templates can be found in chat_templates and should be passed before the subcommand. If the model is not instruction tuned, no chat template will be found and the APIs will only accept a prompt, not messages.
For example, when using a Zephyr model:
./mistralrs-server --port 1234 --log output.txt gguf -t HuggingFaceH4/zephyr-7b-beta -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf
Adapter model support: X-LoRA and LoRA
An adapter model is a model with X-LoRA or LoRA. X-LoRA support is provided by selecting the x-lora-* architecture, and LoRA support by selecting the lora-* architecture. Please find docs for adapter models here. Examples may be found here.
Chat Templates and Tokenizer
Mistral.rs will attempt to automatically load a chat template and tokenizer. This enables high flexibility across models and ensures accurate and flexible chat templating. However, this behavior can be customized. Please find detailed documentation here.
Contributing
Thank you for contributing! If you have any problems or want to contribute something, please raise an issue or pull request.
If you want to add a new model, please contact us via an issue and we can coordinate how to do this.
FAQ
Debugging with the environment variable MISTRALRS_DEBUG=1 causes the following (an example command follows this list):
If loading a GGUF or GGML model, this will output a file containing the names, shapes, and types of each tensor.
mistralrs_gguf_tensors.txt or mistralrs_ggml_tensors.txt
More logging.
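For example, combining the debug flag with the GGUF command used elsewhere in this guide:
# with a GGUF model this also writes mistralrs_gguf_tensors.txt alongside the extra logging
MISTRALRS_DEBUG=1 ./mistralrs-server -i gguf -m TheBloke/zephyr-7B-beta-GGUF -f zephyr-7b-beta.Q5_0.gguf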
Setting the CUDA compiler path:
Set the NVCC_CCBIN environment variable during build.
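For instance (the compiler path is an assumption; point NVCC_CCBIN at a host compiler supported by your CUDA toolkit):
# assumption: gcc-12 is an nvcc-compatible host compiler on this system
NVCC_CCBIN=/usr/bin/gcc-12 cargo build --release --features cuda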
Error: recompile with -fPIE:
Some Linux distributions require compiling with -fPIE.
Set the CUDA_NVCC_FLAGS environment variable to -fPIE during build: CUDA_NVCC_FLAGS=-fPIE
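For example:
# pass -fPIE through to nvcc while building with CUDA enabled
CUDA_NVCC_FLAGS=-fPIE cargo build --release --features cuda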
Error CUDA_ERROR_NOT_FOUND or symbol not found when using a normal or vision model:
For non-quantized models, you can specify the data type to load and run in. This must be one of f32, f16, bf16 or auto to choose based on the device.
What is the minimum supported CUDA compute cap?
The minimum CUDA compute cap is 5.3.
Credits
This project would not be possible without the excellent work at candle. Additionally, thank you to all contributors! Contributing can range from raising an issue or suggesting a feature to adding some new functionality.
mistral.rs
Blazingly fast LLM inference.
| Rust Documentation | Python Documentation | Discord | Matrix |
Please submit requests for new models here.
Get started fast 🚀
Install
Get models
Deploy with our easy-to-use APIs
Quick examples (after following the installation instructions above):
Check out UQFF for prequantized models produced with various quantization methods!
🦙📷 Run the Llama 3.2 Vision Model: documentation and guide here
🌟📷 Run the Qwen2-VL Model: documentation and guide here
🤗📷 Run the Smol VLM Model: documentation and guide here
φ³ Run the new Phi 3.5/3.1/3 model with 128K context window
🧮 Enhance ISQ by collecting an imatrix from calibration data: documentation
φ³ 📷 Run the Phi 3 vision model: documentation and guide here
🌲📷 Run the FLUX.1 diffusion model: documentation and guide here
Other models: see a support matrix and how to run them
Mistral.rs supports several model categories.
Description
Easy: run .safetensors models directly from 🤗 Hugging Face by quantizing in-place
Fast
Quantization
Powerful
Advanced features
Documentation for mistral.rs can be found here.
This is a demo of interactive mode with streaming, running Phi 3 128k mini quantized via ISQ to Q4K.
https://github.com/EricLBuehler/mistral.rs/assets/65165915/09d9a30f-1e22-4b9a-9006-4ec6ebc6473c
Support matrix: see the Supported models tables below.
APIs and Integrations
Rust Crate
Rust multithreaded/async API for easy integration into any application.
mistralrs = { git = "https://github.com/EricLBuehler/mistral.rs.git" }
Python API
Python API for mistral.rs.
HTTP Server
OpenAI-compatible API server
Llama Index integration (Python)
Benchmarks
Please submit more benchmarks by raising an issue!
Supported models
Quantization support
|Model|GGUF|GGML|ISQ|
|--|--|--|--|
|Mistral|✅| |✅|
|Gemma| | |✅|
|Llama|✅|✅|✅|
|Mixtral|✅| |✅|
|Phi 2|✅| |✅|
|Phi 3|✅| |✅|
|Phi 3.5 MoE| | |✅|
|Qwen 2.5| | |✅|
|Phi 3 Vision| | |✅|
|Idefics 2| | |✅|
|Gemma 2| | |✅|
|Starcoder 2| |✅|✅|
|LLaVa Next| | |✅|
|LLaVa| | |✅|
|Llama 3.2 Vision| | |✅|
|Qwen2-VL| | |✅|
|Idefics 3| | |✅|

Device mapping support
|Model category|Supported|
|--|--|
|Plain|✅|
|GGUF|✅|
|GGML| |
|Vision Plain|✅|

X-LoRA and LoRA support
|Model|X-LoRA|X-LoRA+GGUF|X-LoRA+GGML|
|--|--|--|--|
|Mistral|✅|✅| |
|Gemma|✅| | |
|Llama|✅|✅|✅|
|Mixtral|✅|✅| |
|Phi 2|✅| | |
|Phi 3|✅|✅| |
|Phi 3.5 MoE| | | |
|Qwen 2.5| | | |
|Phi 3 Vision| | | |
|Idefics 2| | | |
|Gemma 2|✅| | |
|Starcoder 2|✅| | |
|LLaVa Next| | | |
|LLaVa| | | |
|Qwen2-VL| | | |
|Idefics 3| | | |

AnyMoE support
|Model|AnyMoE|
|--|--|
|Mistral 7B|✅|
|Gemma|✅|
|Llama|✅|
|Mixtral| |
|Phi 2|✅|
|Phi 3|✅|
|Phi 3.5 MoE| |
|Qwen 2.5|✅|
|Phi 3 Vision| |
|Idefics 2| |
|Gemma 2|✅|
|Starcoder 2|✅|
|LLaVa Next|✅|
|LLaVa|✅|
|Llama 3.2 Vision| |
|Qwen2-VL| |
|Idefics 3|✅|