Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp

Large Language Models (LLMs) have demonstrated remarkable capabilities but their significant computational and memory demands hinder widespread deployment, especially on resource-constrained devices. Quantization, the process of reducing the numerical precision of model parameters, has emerged as a critical technique for compressing LLMs and accelerating inference. This paper provides an overview of LLM quantization, with a particular focus on the Post-Training Quantization (PTQ) methods implemented within the popular llama. cpp framework and its GGUF file format. We begin by covering quantization fundamentals, including the distinction between PTQ and Quantization-Aware Training (QAT). We then describe the specific PTQ schemes employed by llama. cpp, including legacy methods, advanced K-quants, and recent IQ-quants, along with their underlying mathematical principles. The paper also discusses the impact of these techniques on model fidelity, hardware requirements, inference speed, and traces the adoption of GGUF as a de facto standard in the open-source community. This work serves as a practical guide and comprehensive reference for researchers aiming to deploy LLMs on resource-constrained hardware. By systematically documenting and comparing the PTQ methods within llama. cpp, we provide the necessary insights to navigate the trade-offs between model fidelity, inference speed, and memory footprint. This enables informed decision-making for real-world applications, from local CPU-based inference to efficient edge deployment.

  • Published in:
    2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA)
  • Type:
    Inproceedings
  • Authors:
    Sparrenberg, Lorenz; Deußer, Tobias; Berger, Armin; Sifa, Rafet
  • Year:
    2025
  • Source:
    https://ieeexplore.ieee.org/document/11247985

Citation information

Sparrenberg, Lorenz; Deußer, Tobias; Berger, Armin; Sifa, Rafet: Small and Fast LLMs on Commodity Hardware: Post-Training Quantization in llama.cpp, 2025 IEEE 12th International Conference on Data Science and Advanced Analytics (DSAA), 2025, 1--10, https://ieeexplore.ieee.org/document/11247985, Sparrenberg.etal.2025a,

Associated Lamarr Researchers

Prof. Dr. Rafet Sifa

Prof. Dr. Rafet Sifa

Principal Investigator Hybrid ML to the profile