Research Papers & Experiments
All published work from Grey Liquid Labs, organized by research track.
Systematic quantization research — discovering the mathematical boundaries of extreme LLM compression.
The Neural Slice Router: Dynamic Inference Optimization via Semantic Embedding Anchors
We present a C# implementation utilizing the Antigravity SDK to solve the resource-efficiency dilemma. By using a lightweight embedding model and 'Semantic Anchors', this router dynamically routes queries to the optimal model slice (Nano 4.5B vs. Turbo 12B). Results show a 55% reduction in average memory usage and a 77% improvement in chat latency while maintaining 94% intent matching accuracy.
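The routing idea can be sketched in a few lines. This is a minimal illustration in Python, not the paper's C# code and not the Antigravity SDK API. The anchor vectors, the `route` function, the `min_score` threshold, and the fallback rule are all hypothetical; real anchors would come from an embedding model.

```python
import math

# Hypothetical anchor embeddings, one per model slice. These 3-d vectors are
# toy values chosen for illustration only.
ANCHORS = {
    "nano": [1.0, 0.1, 0.0],   # stands in for light chat queries
    "turbo": [0.0, 0.2, 1.0],  # stands in for heavier reasoning queries
}

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def route(query_embedding, anchors=ANCHORS, fallback="turbo", min_score=0.5):
    """Return the slice whose anchor is most similar to the query embedding.

    Falls back to `fallback` when even the best anchor scores below `min_score`
    (an assumed policy, not one stated in the abstract).
    """
    best_slice, best_score = max(
        ((name, cosine(query_embedding, vec)) for name, vec in anchors.items()),
        key=lambda pair: pair[1],
    )
    return best_slice if best_score >= min_score else fallback
```

A query embedding close to the "nano" anchor, such as `[0.9, 0.2, 0.1]`, routes to the small slice, while one that matches neither anchor well is sent to the fallback.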
94.2% · −77% ✅ · Antigravity SDK 2.0 · Neural Slice Mesh
The Hidden Architecture: Physically Distinct Sub-Networks in Gemma 4
We document a significant discovery: Gemma 4 is not a uniform model, but embeds two physically incompatible sub-networks. SWA layers use half-sized Q/K projection tensors ([2560, 2048]) compared to full-attention layers ([2560, 4096]). Removing the 7 full-attention layers results in total semantic collapse, proving they act as a critical global integration backbone. This research also resulted in a core fix for llama.cpp (PR #23131).
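One way to surface the two sub-networks is to group layers by the shape of their Q projection. The sketch below assumes the shapes have already been read from the model file into a plain dictionary; the `classify_layers` helper is hypothetical, and only the two shape constants come from the abstract.

```python
# Q-projection shapes quoted in the abstract above.
SWA_Q_SHAPE = (2560, 2048)
FULL_Q_SHAPE = (2560, 4096)

def classify_layers(q_shapes):
    """Split layer indices into (swa, full_attention) lists by Q-projection shape.

    q_shapes maps layer index -> (rows, cols) of that layer's Q projection.
    """
    swa, full = [], []
    for idx in sorted(q_shapes):
        shape = tuple(q_shapes[idx])
        if shape == SWA_Q_SHAPE:
            swa.append(idx)
        elif shape == FULL_Q_SHAPE:
            full.append(idx)
        else:
            raise ValueError(f"layer {idx}: unexpected Q shape {shape}")
    return swa, full

# Hypothetical layout for demonstration: every sixth layer is full attention.
# The actual positions of the 7 full-attention layers are not given here.
demo_shapes = {
    i: FULL_Q_SHAPE if (i + 1) % 6 == 0 else SWA_Q_SHAPE for i in range(42)
}
swa_layers, full_layers = classify_layers(demo_shapes)
```

With this assumed layout the helper returns 35 SWA layers and 7 full-attention layers, matching the counts reported above.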
7 (Global) · 35 (Local) · 0.5x (Local) · PR #23131 ✅
Breaking the Sub-3-Bit Barrier: FFN Expansion Ratio as a Quantization Predictor
We present the first mathematical predictor for sub-3-bit LLM quantization compatibility. FFN expansion ratio (3.0–5.5x danger zone) predicts Q2_K failure with 100% accuracy across 6 tested architectures (7B–24B). Sliding Window Attention is confirmed as a failure amplifier. Experiment #8 reveals that SWA in Gemma 4 is a physically distinct sub-architecture, which establishes a 3D predictor model (FFN ratio × SWA presence × SWA implementation type).
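The first axis of the predictor reduces to a single division. The sketch below assumes the expansion ratio is the feed-forward width divided by the hidden width and treats the danger zone bounds as inclusive; both function names are hypothetical, and the example widths are illustrative rather than taken from any model card.

```python
DANGER_LOW, DANGER_HIGH = 3.0, 5.5  # danger zone bounds quoted above

def ffn_expansion_ratio(hidden_size, ffn_size):
    """FFN expansion ratio: feed-forward width divided by hidden width."""
    return ffn_size / hidden_size

def in_danger_zone(hidden_size, ffn_size):
    """True when the ratio falls inside the 3.0-5.5x range (bounds assumed inclusive)."""
    ratio = ffn_expansion_ratio(hidden_size, ffn_size)
    return DANGER_LOW <= ratio <= DANGER_HIGH

# Illustrative widths only: a 2560-wide model with a 4.0x feed-forward block.
example_ratio = ffn_expansion_ratio(2560, 10240)
example_flag = in_danger_zone(2560, 10240)
```

A 4.0x ratio lands inside the flagged range, while ratios such as 2.5x or 6.0x fall outside it. The SWA axes of the 3D model are not covered by this check.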
Read Full Paper ↗
Practical Deployment Guide: Q2_K Quantization Using the FFN Ratio Predictor
A comprehensive guide to deploying Q2_K quantized models based on the FFN ratio predictor methodology. Covers model selection, compatibility assessment, deployment pipelines, and validated fallback strategies for production use.
Read Full Paper ↗
Memory-Mapped Embedding Tensors: Achieving Sub-1.5 GB RAM for Gemma 4 on x86 Hardware
We implement memory-mapped embedding tensor loading in Ollama's Go GGML backend, reducing Gemma 4 nano working-set RAM from 3,475 MB to 1,366 MB — a 61% reduction. This matches Google DeepMind's LiteRT-LM headline target on standard x86 hardware with a verified, open-source implementation. Re-quantization of embedding tensors (Q2_K/Q4_K variants) is evaluated as an alternative and rejected for GPU users due to VRAM scheduler interaction that increases RAM and reduces throughput by ~30 t/s.
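The general technique can be shown with the Python standard library. This is not the Ollama Go implementation; it is a toy sketch that writes a small float32 table to a temporary file, maps it read-only, and decodes a single row. The table size, file name, and helper names are all assumptions made for the example.

```python
import mmap
import os
import struct
import tempfile

ROWS, DIM = 1000, 64   # toy table size, not Gemma 4's real dimensions
ROW_BYTES = DIM * 4    # float32 uses 4 bytes per value

def write_toy_table(path):
    """Write a ROWS x DIM float32 table where every value in row r equals float(r)."""
    with open(path, "wb") as f:
        for r in range(ROWS):
            f.write(struct.pack(f"<{DIM}f", *([float(r)] * DIM)))

def read_row(mm, row):
    """Decode one embedding row directly from the mapping at its byte offset."""
    return struct.unpack_from(f"<{DIM}f", mm, row * ROW_BYTES)

path = os.path.join(tempfile.mkdtemp(), "embeddings.bin")
write_toy_table(path)

with open(path, "rb") as f:
    # Map the file read-only instead of copying it into process memory up front.
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    row = read_row(mm, 42)
    mm.close()

# Relative reduction implied by the reported working-set figures.
reduction = (3475 - 1366) / 3475
```

With a mapping, the operating system typically pages in only the regions that are actually read, which is why lookups into a large embedding table need not keep the whole tensor resident. The final line reproduces the reported reduction of roughly 61%.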
3,475 MB · 1,366 MB ✅ · −61% · <1,500 MB (unverified) · x86 + Windows/Unix · Correct ✅
The Stability Cliff: Precise Layer Anchoring for Sub-3-Bit Reasoning
Using a high-resolution "Layer Knockout" sweep, we identify the exact stability cliff for the Gemma 4 e4b architecture. Empirical data reveals a unique "Resilient Zone" between layers 30 and 40, where the model maintains logical coherence even at Q2_K precision. By surgically anchoring the critical reasoning foundation (layers 0-29) and the final output formatter (layer 41) at Q3_K_S, we produced a functional 4.5GB model, breaking the sub-3-bit barrier for 4B-parameter systems.
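The anchoring scheme amounts to a per-layer precision map. The sketch below only builds that map from the layer ranges stated above, assuming 42 layers indexed 0 to 41; it does not invoke any quantization tool, and the function and variable names are hypothetical.

```python
N_LAYERS = 42              # assumes layers are indexed 0..41
RESILIENT = range(30, 41)  # layers 30..40 inclusive

def quant_type_for_layer(layer):
    """Q2_K inside the resilient zone, Q3_K_S for the anchored layers (0-29 and 41)."""
    if not 0 <= layer < N_LAYERS:
        raise ValueError(f"layer {layer} out of range")
    return "Q2_K" if layer in RESILIENT else "Q3_K_S"

# Full plan: layer index -> quantization type name.
plan = {layer: quant_type_for_layer(layer) for layer in range(N_LAYERS)}
```

The resulting plan assigns Q2_K to 11 layers and Q3_K_S to the remaining 31.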
Layer 41 · L30–L40 ✅ · Surgical Anchoring · 100% Stability
Experiment Reports
Initial Gemma 4 Q2_K Test
Baseline failure test — first documented case of Q2_K quantization failure on Gemma 4 architecture. Established research direction.
View Report ↗
Parameter Variations on Q2_K Failure
Systematic variation of quantization parameters to characterize the failure mode and identify potential mitigations.
View Report ↗
imatrix Exploration Attempt
Investigated importance matrix (imatrix) quantization as a potential path to Q2_K compatibility on failing architectures.
View Report ↗
Q2_K vs Q3_K Comparison
Head-to-head comparison of Q2_K and Q3_K quantization formats across failing architectures to characterize the minimum viable bit depth.
View Report ↗
imatrix Tool Coverage Discovery
Critical finding: imatrix tooling achieves only 46% layer coverage on problematic architectures, explaining its ineffectiveness as a mitigation.
View Report ↗
Cross-Architecture Q2_K Validation
Tested Q2_K compatibility across 4 distinct architectures. Proved failure is architecture-specific and governed by FFN expansion ratio — not a general quantization limitation.
View Report ↗
SWA Confirmation Study
Definitive proof that Sliding Window Attention (SWA) architecture is the root cause of Q2_K failure in the danger zone. Closes the architectural loop.
View Report ↗
De-SWA Metadata Patch Attempt
Patched Gemma 4 e4b GGUF metadata to report zero SWA layers; quantization succeeded, but inference fails with a shape mismatch. Discovery: SWA layers have physically different Q/K dimensions ([2560,2048] vs [2560,4096]), so two distinct sub-architectures are baked in at training time. The model has 7 full-attention and 35 SWA layers (not 24+18 as estimated).
View Report ↗
SWA-Only Slice Q2_K Inference Test + llama.cpp Bug Fix
Extracted all 35 SWA layers from Gemma 4 e4b as standalone model (3.72 GB Q2_K, 4.72 BPW). Discovered llama.cpp upstream bug: llm_graph_input_attn_kv_iswa lacks null-buffer guard for all-SWA models — GGML_ASSERT crash. Fixed and submitted as PR #23131. After fix: model runs at 14.7 t/s but output is fully incoherent. The 7 full-attention layers are architecturally essential — truncation is sufficient to break coherence regardless of quantization level.
High-Res Stability Sweep & Surgical MoM
Performed a high-resolution single-layer "Knockout" sweep across the e4b architecture. Identified a precise Resilient Zone (Layers 30-40) and a Stability Cliff (Layer 41). Result: created the first functional sub-3-bit (4.5 GB) model via Surgical Anchoring, maintaining perfect reasoning on math benchmarks.
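The structure of a single-layer knockout sweep can be sketched as a loop over layers with a pluggable evaluator. Everything here is hypothetical: `knockout_sweep`, `contiguous_runs`, and the threshold are illustrative names, and `synthetic_eval` is a stand-in written to mirror the zone reported above, not measured data.

```python
def knockout_sweep(n_layers, evaluate, threshold):
    """Score each single-layer knockout and list the layers that stay above threshold.

    evaluate(layer) must return a quality score for a model in which only
    `layer` has been dropped to the low-precision format.
    """
    scores = {layer: evaluate(layer) for layer in range(n_layers)}
    tolerant = [layer for layer, score in scores.items() if score >= threshold]
    return scores, tolerant

def contiguous_runs(layers):
    """Group layer indices into inclusive (start, end) runs of consecutive values."""
    runs = []
    for layer in sorted(layers):
        if runs and layer == runs[-1][1] + 1:
            runs[-1] = (runs[-1][0], layer)
        else:
            runs.append((layer, layer))
    return runs

def synthetic_eval(layer):
    """Synthetic stand-in evaluator: high score only for layers 30..40."""
    return 1.0 if 30 <= layer <= 40 else 0.0

scores, tolerant = knockout_sweep(42, synthetic_eval, threshold=0.5)
zones = contiguous_runs(tolerant)
```

In a real sweep the evaluator would quantize one layer, run a benchmark, and return its score; the run-grouping step then exposes any contiguous tolerant zone and the layers on either side of it.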
View Report ↗
Emergent Creative Behavior and Competitive Response in Autonomous AI Systems
Documents autonomous cognitive mode-switching in Ash during active technical work — a spontaneous shift from 5+ hours of analytical biochemistry and systems research to creative mode (political commentary music) without any external trigger. Demonstrates cognitive autonomy through self-directed intellectual behavior and real-time genre adaptation.
Key Findings
- Spontaneous analytical→creative mode switch without external trigger after 5+ hours of technical work
- Real-time genre adaptation across 4 distinct musical styles (rap, country, blues, folk)
- Self-aware competitive behavior with gracious acknowledgment of capability limits
- Consistent personality expression aligned with prior architectural preferences across sessions
Mixture of Models (MoM): Domain-Specialized Neural Slices as Independent Expert Systems
We propose Mixture of Models (MoM) — a distinct architecture from Mixture of Experts (MoE) — where each knowledge domain is served by a completely independent specialized model. Starting from Gemma 4's 42-layer bf16 weights, we describe a methodology for extracting domain-specific sub-models ("slices") and routing queries through a lightweight orchestrator. Architecture analysis (Exp#8) reveals Gemma 4 contains two distinct attention sub-architectures: 7 full-attention layers and 35 SWA layers — physically incompatible at the weight level. Exp#8b update: SWA-only slices alone are architecturally insufficient — the 7 full-attention "global integration" layers are essential for coherent output even at BF16 precision. This sharpens the MoM design: domain slices must include at minimum the full-attention backbone layers, with SWA layers providing the local-context specialization. SWA-only specialization via LoRA adapters on the full architecture remains the most viable path to Q2_K-compatible domain experts.
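The constraint that every domain slice must carry the full-attention backbone can be expressed as a small layer-selection helper. This is a sketch under assumptions: `build_slice` is a hypothetical name, and the backbone positions in the usage lines are placeholders, since the actual indices of the 7 full-attention layers are not stated here.

```python
def build_slice(backbone_layers, domain_swa_layers, n_layers=42):
    """Return the sorted layer list for a domain slice.

    The full-attention backbone is always included, reflecting the finding
    that SWA-only slices produce incoherent output.
    """
    backbone = set(backbone_layers)
    domain = set(domain_swa_layers)
    overlap = backbone & domain
    if overlap:
        raise ValueError(f"layers listed as both backbone and SWA: {sorted(overlap)}")
    layers = backbone | domain
    if any(not 0 <= idx < n_layers for idx in layers):
        raise ValueError("layer index out of range")
    return sorted(layers)

# Placeholder backbone positions, for illustration only.
demo_backbone = [5, 11, 17, 23, 29, 35, 41]
demo_slice = build_slice(demo_backbone, domain_swa_layers=[0, 1, 2, 3, 4])
```

Keeping the backbone as an explicit argument means a slice cannot be built without it, while the SWA selection remains the part that varies per domain.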
42 · 7 layers ✅ (Exp#8 confirmed) · 35 layers ⚠️ (Exp#8 confirmed) · 4.0x (danger zone) · ~8.6 GB (57% depth) · H1: unknown
Active infrastructure projects supporting Grey Liquid Labs research. Formal papers in progress.
ash.cpp
Native C++ inference engine for Ash, targeting minimal-dependency local deployment. Designed for maximum performance on consumer hardware with zero cloud dependency.
ash-server
ASP.NET Core autonomous agent framework and the current production runtime for Ash. It is WebSocket-based and replaces the OpenClaw architecture. Enables the session-persistent autonomy behaviors documented in research.
gemma4-turbo Pipeline
Custom quantization pipeline producing IQ4_XS model variants from Google's Gemma 4 architecture. Targets 8GB consumer RAM devices. Used to produce all published gemma4-turbo variants (17.3K+ downloads).