Convert HuggingFace Models to GGUF Format

Milind Deore
3 min read · Sep 23, 2023

Llama.cpp provides an efficient way to run LLMs on both CPUs and GPUs. However, llama.cpp and HuggingFace use different file formats for packing weights, so keep in mind the extra step of converting between them.

Specifically, llama.cpp employs the GGUF format for efficient CPU execution, while the GPTQ format is designed for GPU inference. In this blog post, our focus is on converting models from the HuggingFace format to GGUF. That said, nothing prevents you from running a GGUF model on a GPU; in fact, it runs even faster there than on a CPU. But if GPU efficiency is your primary concern, GPTQ is the better choice.

At a higher level, the process involves the following steps:

  1. Install the requirements for the process below.
  2. Fetch a HuggingFace model.
  3. Run the llama.cpp convert.py script on the HuggingFace model.
  4. Optionally, upload the GGUF model to a HuggingFace Models repo.

Install the requirements for the process below.

The huggingface_hub library enables you to engage with the Hugging Face Hub. Here, you can explore pre-trained models and datasets for your projects, experiment with numerous machine learning applications hosted on the Hub, and even upload your own models and datasets.

pip install huggingface_hub
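To confirm the library was installed correctly, a quick optional check:

$ python -c "import huggingface_hub; print(huggingface_hub.__version__)"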

Fetch a HuggingFace Model

The Python script provided below will retrieve and save the necessary model to your local environment.

from huggingface_hub import snapshot_download

model_id="meta-llama/Llama-2-7b-chat-hf"
snapshot_download(repo_id=model_id, local_dir="myllama-hf",
local_dir_use_symlinks=False, revision="main")
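Note that meta-llama/Llama-2-7b-chat-hf is a gated repository, so the download only succeeds once your access request has been approved and you are authenticated. A minimal sketch, assuming you want to pass the token directly (the placeholder is yours to fill in; you can instead rely on huggingface-cli login, covered below):

snapshot_download(
    repo_id=model_id,
    local_dir="myllama-hf",
    local_dir_use_symlinks=False,
    revision="main",
    token="<your token here>",  # HuggingFace access token for the gated repo
)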

Convert HF to GGUF

The llama.cpp repository contains a conversion script, convert.py, that we will use to transform the HF model into the GGUF format.

For this to happen, we need to clone llama.cpp and install the required libraries:

$ git clone https://github.com/ggerganov/llama.cpp.git

$ pip install -r llama.cpp/requirements.txt

Finally, the magical command that converts the model:

$ python llama.cpp/convert.py myllama-hf \
--outfile myllama-7b-v0.1.gguf \
--outtype q8_0

In our case, we are additionally applying 8-bit integer quantization to the model, specified with the flag --outtype q8_0. Quantization can enhance inference speed, although it may affect the model’s overall quality. Alternatively, you can keep the original precision by using --outtype f16 (16-bit) or --outtype f32 (32-bit).
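At the time of writing, convert.py’s --outtype choices cover only f32, f16, and q8_0. If you want an even smaller model, llama.cpp also ships a separate quantize tool that re-quantizes an existing GGUF file. A hedged sketch, assuming you first produced an f16 GGUF (myllama-7b-v0.1-f16.gguf is a hypothetical file name) and that your build produces the quantize binary:

$ cd llama.cpp && make
$ ./quantize ../myllama-7b-v0.1-f16.gguf ../myllama-7b-v0.1-q4_K_M.gguf q4_K_M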

Now that we have our GGUF model, it’s time to push it to the Hub for later access:

Upload GGUF to Hub (optional)

To upload, you need a token with write access to the repository. If you don’t have a HuggingFace account, create one first, then obtain your token from the following link: https://huggingface.co/settings/tokens

Either run the command below in your shell or set an environment variable:

$ huggingface-cli login

(the CLI prints the Hugging Face ASCII-art banner here)

A token is already saved on your machine. Run `huggingface-cli whoami` to get more information or `huggingface-cli logout` if you want to log out.
Setting a new token will erase the existing one.
To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: <your token here>
Add token as git credential? (Y/n) n

OR

$ export HUGGING_FACE_HUB_TOKEN=<your token here>
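If you prefer to stay in Python, the same huggingface_hub library exposes a login helper that does the equivalent of the CLI command (the placeholder token is yours to fill in):

from huggingface_hub import login

# Programmatic equivalent of `huggingface-cli login`
login(token="<your token here>")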

You can also choose not to upload the model to the Hub and simply use it locally. But just in case you want to push it:

from huggingface_hub import HfApi

api = HfApi()

model_id = "tomdeore/myllama-7b-v0.1.gguf"

# Create the repo if it doesn't exist yet, then upload the GGUF file
api.create_repo(model_id, exist_ok=True, repo_type="model")
api.upload_file(
    repo_id=model_id,
    path_or_fileobj="myllama-7b-v0.1.gguf",
    path_in_repo="myllama-7b-v0.1.gguf",
)

Done! This should upload the freshly baked GGUF model to your repository.
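As a quick sanity check that the upload worked, you can pull the file straight back down with hf_hub_download; a small sketch reusing the repo and file names from above:

from huggingface_hub import hf_hub_download

# Fetch the GGUF file from the repo we just pushed to and print its local path
local_path = hf_hub_download(
    repo_id="tomdeore/myllama-7b-v0.1.gguf",
    filename="myllama-7b-v0.1.gguf",
)
print(local_path)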
