{"id":32212,"date":"2026-01-21T17:01:29","date_gmt":"2026-01-21T17:01:29","guid":{"rendered":"https:\/\/lamarr-institute.org\/publication\/revisiting-pruning-vs-quantization-for-small-language-models\/"},"modified":"2026-01-21T17:19:36","modified_gmt":"2026-01-21T17:19:36","slug":"revisiting-pruning-vs-quantization-for-small-language-models","status":"publish","type":"publication","link":"https:\/\/lamarr-institute.org\/de\/publication\/revisiting-pruning-vs-quantization-for-small-language-models\/","title":{"rendered":"Revisiting Pruning vs Quantization for Small Language Models"},"content":{"rendered":"<p>Deploying language models on resource-constrained devices, such as mobile phones, wearables, and on-device {AI} assistants, demands compact, efficient models without sacrificing performance. Compressing Small Language Models ({SLMs}) is particularly suited for these scenarios, yet their compression dynamics remain underexplored compared to Large Language Models ({LLMs}). We systematically evaluate leading post-training pruning ({SparseGPT}, Wanda) and quantization ({GPTQ}, {AWQ}) methods across six {SLMs} from 0.5 to 3.8B, seven languages, and seven downstream tasks. Our results show that quantization consistently outperforms pruning in preserving model fidelity, multilingual perplexity, and reasoning accuracy. However, quantization&#8217;s advantages diminish on complex knowledge and reasoning tasks like {OpenBookQA}, highlighting a disconnect between compression fidelity and downstream task performance. Notably, trends observed in {LLMs} (e.g., Wanda&#8217;s competitive performance to {SparseGPT}) do not generalize to {SLMs}. For practitioners, we recommend prioritizing quantization (particularly {AWQ}) for {SLM} compression and caution against relying on a single metric.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Deploying language models on resource-constrained devices, such as mobile phones, wearables, and on-device {AI} assistants, demands compact, efficient models without sacrificing performance. Compressing Small Language Models ({SLMs}) is particularly suited for these scenarios, yet their compression dynamics remain underexplored compared to Large Language Models ({LLMs}). 
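The practical takeaway above is to prefer quantization, and AWQ in particular, when compressing SLMs. As a rough illustration of that workflow (not the paper's own experimental setup), the sketch below applies post-training 4-bit AWQ quantization to a small instruction-tuned checkpoint using the AutoAWQ library; the model name, output path, and quantization settings are illustrative assumptions rather than values taken from the paper.

```python
# Sketch: post-training 4-bit AWQ quantization of a small language model.
# Assumes the AutoAWQ package (pip install autoawq); the checkpoint, paths,
# and config values below are illustrative, not the paper's exact choices.
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer

model_path = "Qwen/Qwen2.5-0.5B-Instruct"  # example 0.5B-scale SLM (assumed, for illustration)
quant_path = "qwen2.5-0.5b-instruct-awq"   # where the quantized weights will be written

# Typical AWQ settings: 4-bit weights, group size 128, zero-point quantization.
quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "GEMM"}

# Load the full-precision model and its tokenizer.
model = AutoAWQForCausalLM.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)

# Run activation-aware calibration and quantize the weights.
model.quantize(tokenizer, quant_config=quant_config)

# Persist the quantized model and tokenizer for on-device deployment.
model.save_quantized(quant_path)
tokenizer.save_pretrained(quant_path)
```

The resulting checkpoint can then be reloaded with the same library (or a compatible runtime) and evaluated on both perplexity and downstream tasks, which is in line with the paper's caution against relying on a single metric.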