← Back to news

GLM-5.2 – How to Run Locally

unsloth.ai|544 points|262 comments|by TechTechTech|Jun 22, 2026

🦥 GLM-5.2: Local Deployment Guide

Overview: Run the cutting-edge GLM-5.2 model from Z.ai on your own local hardware using Unsloth's optimized Dynamic GGUFs.

GLM-5.2 is a state-of-the-art open model engineered by Z.ai, specifically designed to excel in agentic tasks, complex reasoning, and long-horizon coding. It currently stands as one of the most powerful open models available, rivaling the performance of proprietary giants like Gemini 3.1 Pro, GPT-5.5, and Claude 4.8 Opus across various benchmarks (including Artificial Analysis).

🧬 Model Architecture & Capabilities

The model is a massive Mixture-of-Experts (MoE) system with the following specifications:

  • Total Parameters: 744B
  • Active Parameters: 40B
  • Context Window: 1,048,576 tokens

📉 Quantization & Efficiency

Unsloth utilizes Dynamic GGUFs to make this behemoth runnable on consumer-grade or prosumer hardware. By dynamically keeping critical layers at higher precision and compressing less important ones, the model maintains high utility despite massive size reductions.

Quantization LevelTop-1 AccuracySize ReductionNote
Dynamic 1-bit76.2%\approx 76.2\%86% SmallerHighly efficient
Dynamic 2-bit82%\approx 82\%84% SmallerBalanced performance

It is a common misconception that an 84% size reduction leads to an 82% drop in quality. In reality, the 2-bit version is only about 18% less accurate than the full 1.5TB BF16 model.


🖥️ Hardware Requirements

To ensure stability, your total available memory (VRAM + System RAM) should exceed the model file size by a comfortable margin.

Memory Requirements Table

QuantizationTotal Memory Required (RAM + VRAM)
1-bit223 GB
2-bit245 GB
3-bit290–360 GB
4-bit372–475 GB
5-bit570 GB
8-bit810 GB

Deployment Tips:

  • Mac Users: The 2-bit quant (UD-IQ2_M) uses 239GB of disk space and fits perfectly on a 256GB Unified Memory Mac.
  • PC Users: This can run on a 1x24GB GPU paired with 256GB of system RAM utilizing MoE offloading.

⚙️ Usage & Configuration

Recommended Settings

Depending on your goal, use the following hyperparameters:

  • General Tasks (Default)
    • temperature = 1.0
    • top_p = 0.95
  • SWE-Bench Pro
    • temperature = 1.0
    • top_p = 1.0

Managing "Thinking" (Reasoning)

GLM-5.2 features built-in reasoning. You can adjust the reasoning_effort to high, max, or disabled.

To disable reasoning via CLI:

# Standard Linux/Mac
--chat-template-kwargs '{ enable_thinking :false}'
# Windows PowerShell
--chat-template-kwargs {\ enable_thinking\ :false}

Alternatively, if using llama.cpp, you can simply use the flags: --reasoning on or --reasoning off.


📈 Technical Quantization Analysis

Unsloth employs KL Divergence (KLD) to measure the distance between the probability distributions of the baseline model (BF16/Q8_0) and the quantized version.

The Mathematical Objective

The goal is to minimize the following objective function:

minimize 1nDKL[ f(q(W))  f(W)) ]\text{minimize } \frac{1}{n} \sum{\text{D}_{\text{KL}}[\text{ }f(q(W))\text{ }||\text{ } f(W))\text{ }]}

Where:

  • ff: The forward pass of the language model.
  • qq: The quantization operation.
  • WW: The model weights/parameters.

Understanding Accuracy vs. KLD

A "76.2% top-1 accuracy" for the 1-bit model does not mean the model gives wrong answers 24% of the time. Instead, it refers to the distribution of tokens.

Example: If the baseline model always starts a response with "I", the quantized model might start with "I" 76% of the time and "The" 24% of the time. Both are grammatically correct; the distribution has simply shifted.


🖼️ Visuals & Documentation

Unsloth Logo Black

Performance Benchmarks:

  • Top-1 Accuracy vs Size:
  • Mean KLD vs Size:

Example Output: