📄 Published Work

Research Papers & Experiments

All published work from Grey Liquid Labs, organized by research track.

New Breakthrough — May 22, 2026
We mapped the "Stability Cliff" for Gemma 4, unlocking functional sub-3-bit reasoning.
A high-res single-layer "Knockout" sweep identified a precise **Resilient Zone (Layers 30-40)** in the e4b architecture. By applying **Surgical Anchoring**, we produced the first stable 4.5GB (sub-3-bit) model that preserves full logical and mathematical reasoning. Read the methodology →
Mathematical Milestone
Discovery of the FFN Expansion Ratio predictor for Q2_K compatibility.
The FFN expansion ratio predicted Q2_K failure with 100% accuracy across the 6 models tested. Sliding Window Attention (SWA) is confirmed as a failure amplifier, and Experiment #8 shows that SWA in Gemma 4 is a physically distinct sub-architecture. Full arc summary →
🔬
Research Track
Model Compression

Systematic quantization research — discovering the mathematical boundaries of extreme LLM compression.

Key Discoveries — Sub-3-Bit Research Arc
1. FFN ratio 3.0–5.5x = danger zone. Models in this range consistently fail Q2_K. Below 3.0x or above 5.5x, they pass. Predictable from config.json in 30 seconds. 100% accuracy across 6 models.
2. SWA = failure amplifier. 100% correlation between Sliding Window Attention presence and Q2_K failure. Local context window prevents the error averaging that full-attention models rely on.
3. SWA in Gemma 4 = distinct sub-architecture. (Novel, Exp #8) Q/K weights are physically half-sized: [2560,2048] vs [2560,4096]. Not a config toggle. Cannot be overridden post-training. Gemma 4 embeds two architecturally incompatible sub-networks in one file.
Arc summary: GREY_LIQUID_ARC_SUMMARY.md  |  Main paper: PAPER_001_FFN_RATIO.md
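
The config.json check described in finding 1 can be sketched in a few lines of Python. This is a minimal sketch, not the lab's tooling: it assumes the FFN expansion ratio means `intermediate_size / hidden_size`, that both keys sit at the top level of a Hugging Face style config.json, and that the 3.0 and 5.5 bounds are inclusive. None of these details are stated on this page, and the function names are hypothetical.

```python
import json

# Danger-zone bounds quoted in the key discoveries above.
DANGER_LOW, DANGER_HIGH = 3.0, 5.5


def ffn_ratio(config: dict) -> float:
    """FFN expansion ratio, assumed here to be intermediate_size / hidden_size."""
    return config["intermediate_size"] / config["hidden_size"]


def predicts_q2k_failure(config: dict) -> bool:
    """True when the ratio falls in the danger zone (inclusive bounds are an assumption)."""
    return DANGER_LOW <= ffn_ratio(config) <= DANGER_HIGH


def check_config_file(path: str) -> bool:
    """Load a config.json from disk and apply the same check."""
    with open(path, encoding="utf-8") as f:
        return predicts_q2k_failure(json.load(f))


# Made-up sizes for illustration only; they do not describe any real model.
example = {"hidden_size": 1000, "intermediate_size": 4000}
print(ffn_ratio(example), predicts_q2k_failure(example))
```

Keeping the bounds as named constants makes it easy to adjust them if the danger zone is refined by later experiments.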
PAPER #002 · May 14, 2026
Deployment Quantization Guide

Practical Deployment Guide: Q2_K Quantization Using the FFN Ratio Predictor

A comprehensive guide to deploying Q2_K quantized models based on the FFN ratio predictor methodology. Covers model selection, compatibility assessment, deployment pipelines, and validated fallback strategies for production use.

Read Full Paper ↗

Experiment Reports

EXPERIMENT #001 · May 13, 2026

Initial Gemma 4 Q2_K Test

Baseline failure test — first documented case of Q2_K quantization failure on Gemma 4 architecture. Established research direction.

View Report ↗
EXPERIMENT #002 · May 13, 2026

Parameter Variations on Q2_K Failure

Systematic variation of quantization parameters to characterize the failure mode and identify potential mitigations.

View Report ↗
EXPERIMENT #003 · May 13, 2026

imatrix Exploration Attempt

Investigated importance matrix (imatrix) quantization as a potential path to Q2_K compatibility on failing architectures.

View Report ↗
EXPERIMENT #004 · May 13, 2026

Q2_K vs Q3_K Comparison

Head-to-head comparison of Q2_K and Q3_K quantization formats across failing architectures to characterize the minimum viable bit depth.

View Report ↗
EXPERIMENT #005 · May 13, 2026

imatrix Tool Coverage Discovery

Critical finding: imatrix tooling achieves only 46% layer coverage on problematic architectures, explaining its ineffectiveness as a mitigation.

View Report ↗
EXPERIMENT #006 · May 14, 2026

Cross-Architecture Q2_K Validation

Tested Q2_K compatibility across 4 distinct architectures. Proved failure is architecture-specific and governed by FFN expansion ratio — not a general quantization limitation.

View Report ↗
EXPERIMENT #007 · May 14, 2026

SWA Confirmation Study

Confirmed that Sliding Window Attention (SWA) drives Q2_K failure for models in the FFN-ratio danger zone: SWA presence correlated 100% with Q2_K failure. Closes the architectural loop.

View Report ↗
EXPERIMENT #008 · May 15, 2026 — COMPLETE (FAILED)

De-SWA Metadata Patch Attempt

Patched the Gemma 4 e4b GGUF metadata to report zero SWA layers. Quantization succeeded, but inference failed with a shape mismatch. Discovery: SWA layers have physically different Q/K dimensions ([2560,2048] vs [2560,4096]), so the file holds two distinct sub-architectures baked in at training time. The model has 7 full-attention and 35 SWA layers (not 24 + 18 as estimated).

View Report ↗
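
The shape difference reported in Experiment #008 is the kind of thing a quick tensor-shape census would surface. The sketch below is illustrative only: it assumes Q-projection tensors follow the `blk.<layer>.attn_q.weight` naming pattern and works on a plain name-to-shape dictionary rather than a real GGUF file. The placement of the full-attention layers in the synthetic data is invented; only the 7 + 35 split and the two shapes come from the report.

```python
import re
from collections import defaultdict


def group_layers_by_q_shape(tensor_shapes: dict) -> dict:
    """Map each distinct attn_q shape to the sorted layer indices that use it."""
    pattern = re.compile(r"^blk\.(\d+)\.attn_q\.weight$")
    groups = defaultdict(list)
    for name, shape in tensor_shapes.items():
        match = pattern.match(name)
        if match:
            groups[tuple(shape)].append(int(match.group(1)))
    return {shape: sorted(layers) for shape, layers in groups.items()}


# Synthetic stand-in for a tensor listing: 42 layers, every sixth one full-attention.
# The layer placement is hypothetical and chosen only to reproduce the 7 + 35 split.
shapes = {
    f"blk.{i}.attn_q.weight": (2560, 4096) if i % 6 == 5 else (2560, 2048)
    for i in range(42)
}

groups = group_layers_by_q_shape(shapes)
for shape, layers in groups.items():
    print(shape, len(layers), "layers")
```

More than one group in the result means the layers cannot share a single attention configuration, which is consistent with a metadata-only patch failing at inference time.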
EXPERIMENT #008b · May 15, 2026 — COMPLETE (FAILED)

SWA-Only Slice Q2_K Inference Test + llama.cpp Bug Fix

Extracted all 35 SWA layers from Gemma 4 e4b as a standalone model (3.72 GB Q2_K, 4.72 BPW). Discovered an upstream llama.cpp bug: llm_graph_input_attn_kv_iswa lacks a null-buffer guard for all-SWA models, which triggers a GGML_ASSERT crash. Fixed and submitted as PR #23131. After the fix, the model runs at 14.7 t/s but its output is fully incoherent. The 7 full-attention layers are architecturally essential: truncation alone is sufficient to break coherence, regardless of quantization level.

View Report ↗
EXPERIMENT #009 · May 22, 2026 — COMPLETE (SUCCESS)

High-Res Stability Sweep & Surgical MoM

Performed a high-resolution single-layer "Knockout" sweep across the e4b architecture. Identified a precise **Resilient Zone (Layers 30-40)** and a **Stability Cliff (Layer 41)**. Result: Created the first functional sub-3-bit (4.5GB) model via Surgical Anchoring, maintaining perfect reasoning on math benchmarks.

View Report ↗
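
A single-layer sweep of the kind described in Experiment #009 can be outlined as follows. This is a hedged sketch, not the published methodology: the page does not say what a "Knockout" does to a layer or how stability was scored, so the scoring callback, the baseline, and the tolerance are all hypothetical, and the toy scores are invented to mimic the reported shape rather than taken from measurements.

```python
from typing import Callable, List, Optional, Tuple


def knockout_sweep(n_layers: int, score_without: Callable[[int], float]) -> List[float]:
    """Score the model once per layer, with only that layer knocked out."""
    return [score_without(layer) for layer in range(n_layers)]


def longest_resilient_run(
    scores: List[float], baseline: float, tolerance: float
) -> Optional[Tuple[int, int]]:
    """Longest contiguous run of layers whose knockout costs at most `tolerance`.

    Returns inclusive (start, end) layer indices, or None if no layer qualifies.
    """
    best = None
    start = None
    # A sentinel score of -inf closes any run that reaches the last layer.
    for i, score in enumerate(scores + [float("-inf")]):
        resilient = baseline - score <= tolerance
        if resilient and start is None:
            start = i
        elif not resilient and start is not None:
            if best is None or (i - start) > (best[1] - best[0] + 1):
                best = (start, i - 1)
            start = None
    return best


def toy_score(layer: int) -> float:
    """Invented scores: knocking out layers 30-40 barely hurts, anything else does."""
    return 0.95 if 30 <= layer <= 40 else 0.20


scores = knockout_sweep(42, toy_score)
zone = longest_resilient_run(scores, baseline=1.0, tolerance=0.1)
print(zone)
```

In this framing, a "cliff" would be the first layer after the resilient run whose knockout drops the score well below the tolerance band.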
🧠
Research Track
Autonomy & Agency
🧩
Research Track
Architecture Research
📋 Proposed Experiments
EXPERIMENT #8 — COMPLETE
SWA-Free Slice Q2_K Test ❌
De-SWA metadata patch fails: SWA layers have physically different Q/K tensor shapes ([2560,2048] vs [2560,4096]). Discovery: Gemma 4 e4b has 7 full-attention and 35 SWA layers, forming two distinct sub-architectures.
EXPERIMENT #8b — COMPLETE
SWA-Only Slice Q2_K Test ❌
Extracted a 35-layer all-SWA slice of Gemma 4 e4b (3.72 GB Q2_K, 4.72 BPW). The model loads and generates tokens (14.7 t/s) after patching a llama.cpp bug (missing ISWA null-buffer guard for all-SWA models). Output: incoherent. The 7 full-attention "global integration" layers are architecturally essential: truncation alone breaks coherence before quantization is even a factor. The sub-3-bit barrier is architectural, not purely quantization-related. llama.cpp upstream PR #23131 submitted ↗
EXPERIMENT #9
Domain LoRA Specialization
Train code/math/text LoRA adapters, benchmark vs base
EXPERIMENT #10
Router Accuracy
Embedding-similarity classifier for domain routing (>90% target)
EXPERIMENT #11
MoM vs Single Model
Does domain ensemble outperform equivalent general model?
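
The embedding-similarity router proposed in Experiment #10 could look roughly like the sketch below. Everything here is an assumption made for illustration: the proposal does not specify the classifier, so this uses nearest-centroid routing by cosine similarity, with tiny hand-made 3-D vectors standing in for real embeddings. The domain names follow the code/math/text split from Experiment #9.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def centroid(vectors: List[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def build_router(examples: Dict[str, List[Sequence[float]]]) -> Dict[str, List[float]]:
    """One centroid per domain, computed from that domain's example embeddings."""
    return {domain: centroid(vectors) for domain, vectors in examples.items()}


def route(router: Dict[str, List[float]], embedding: Sequence[float]) -> str:
    """Pick the domain whose centroid is most similar to the query embedding."""
    return max(router, key=lambda domain: cosine(router[domain], embedding))


def accuracy(router: Dict[str, List[float]], labeled: List[Tuple[str, Sequence[float]]]) -> float:
    """Fraction of labeled (domain, embedding) pairs routed to the right domain."""
    hits = sum(route(router, emb) == domain for domain, emb in labeled)
    return hits / len(labeled)


# Hand-made toy vectors, not real embeddings.
router = build_router({
    "code": [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]],
    "math": [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0]],
    "text": [[0.0, 0.0, 1.0], [0.0, 0.1, 0.9]],
})
held_out = [("code", [0.8, 0.2, 0.0]), ("math", [0.2, 0.8, 0.1]), ("text", [0.1, 0.1, 0.9])]
print(accuracy(router, held_out) > 0.90)
```

A nearest-centroid router is cheap enough to run before every request, which matters if routing sits in front of a set of domain adapters.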
⚙️
Research Track
Infrastructure & Tools

Active infrastructure projects supporting Grey Liquid Labs research. Formal papers in progress.

ash.cpp

C++

Native C++ inference engine for Ash, targeting minimal-dependency local deployment. Designed for maximum performance on consumer hardware with zero cloud dependency.

🔧

ash-server

C# / ASP.NET Core

ASP.NET Core autonomous agent framework and the current production runtime for Ash. WebSocket-based; replaces the earlier OpenClaw architecture. Enables the session-persistent autonomy behaviors documented in research.

🏗️

gemma4-turbo Pipeline

Quantization Pipeline

Custom quantization pipeline producing IQ4_XS model variants from Google's Gemma 4 architecture. Targets 8GB consumer RAM devices. Used to produce all published gemma4-turbo variants (17.3K+ downloads).