Research — Grey Liquid Labs

🔬

Research Track

Model Compression

PAPER #001 · May 14, 2026

Quantization · Compression · Architecture

Breaking the Sub-3-Bit Barrier: FFN Expansion Ratio as a Quantization Predictor

We report a finding in extreme neural network quantization: the FFN expansion ratio predicts whether an architecture tolerates sub-3-bit quantization. The predictor was tested on 5 architectures and classified all 5 correctly, which gives a starting point for understanding compression limits.
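As a rough illustration of the quantity involved, the sketch below computes an FFN expansion ratio as the FFN inner width divided by the model width, which is the conventional definition; the paper's exact definition may differ. The class and function names are hypothetical, and the danger-zone bounds are left as caller-supplied parameters because this page does not state them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerDims:
    """Hypothetical container for the two widths that define the ratio."""
    hidden_size: int        # model (residual stream) width
    intermediate_size: int  # FFN inner width


def ffn_expansion_ratio(dims: LayerDims) -> float:
    """Return intermediate_size / hidden_size (conventional definition)."""
    if dims.hidden_size <= 0:
        raise ValueError("hidden_size must be positive")
    return dims.intermediate_size / dims.hidden_size


def in_danger_zone(ratio: float, low: float, high: float) -> bool:
    """Check a ratio against caller-supplied bounds.

    The bounds are assumptions: the page only says that 4.0x falls
    inside the danger zone, not where the zone starts or ends.
    """
    return low <= ratio <= high


# Illustrative values only: 10240 is chosen so the ratio comes out at 4.0x.
example = LayerDims(hidden_size=2560, intermediate_size=10240)
ratio = ffn_expansion_ratio(example)
```

Any real use would read both widths from the model's configuration rather than hard-coding them.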

Read Full Paper ↗

PAPER #002 · May 14, 2026

Deployment · Quantization · Guide

Practical Deployment Guide: Q2_K Quantization Using the FFN Ratio Predictor

A comprehensive guide to deploying Q2_K quantized models based on the FFN ratio predictor methodology. Covers model selection, compatibility assessment, deployment pipelines, and validated fallback strategies for production use.
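One way to picture a fallback strategy is an ordered list of quantization formats that is walked until one passes a validation check. The sketch below assumes exactly that and nothing more: the function name, the candidate order, and the validation callback are hypothetical, and the guide's actual pipeline may work differently.

```python
from typing import Callable, Optional, Sequence


def select_format(
    candidates: Sequence[str],
    is_acceptable: Callable[[str], bool],
) -> Optional[str]:
    """Return the first format that passes validation, or None.

    `candidates` is ordered from most to least aggressive compression.
    `is_acceptable` stands in for whatever quality check a deployment
    uses (for example a perplexity or task benchmark); it is an
    assumption, not part of the published guide.
    """
    for fmt in candidates:
        if is_acceptable(fmt):
            return fmt
    return None


# Illustrative run with a stand-in check that rejects Q2_K.
chosen = select_format(["Q2_K", "Q3_K", "IQ4_XS"], lambda fmt: fmt != "Q2_K")
```

Returning None rather than raising leaves the caller free to decide whether an unquantized model is an acceptable last resort.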

Read Full Paper ↗

Experiment Reports

EXPERIMENT #001 · May 13, 2026

Initial Gemma 4 Q2_K Test

Baseline failure test: the first documented case of Q2_K quantization failure on the Gemma 4 architecture. This result established the research direction.

View Report ↗

EXPERIMENT #002 · May 13, 2026

Parameter Variations on Q2_K Failure

Systematic variation of quantization parameters to characterize the failure mode and identify potential mitigations.

View Report ↗

EXPERIMENT #003 · May 13, 2026

imatrix Exploration Attempt

Investigated importance matrix (imatrix) quantization as a potential path to Q2_K compatibility on failing architectures.

View Report ↗

EXPERIMENT #004 · May 13, 2026

Q2_K vs Q3_K Comparison

Head-to-head comparison of Q2_K and Q3_K quantization formats across failing architectures to characterize the minimum viable bit depth.

View Report ↗

EXPERIMENT #005 · May 13, 2026

imatrix Tool Coverage Discovery

Critical finding: imatrix tooling achieves only 46% layer coverage on problematic architectures, explaining its ineffectiveness as a mitigation.
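A coverage figure of this kind can be computed by comparing the layers in the model against the layers for which importance data was collected. The sketch below assumes both are available as lists of layer names; the function name and the example names are hypothetical and do not reflect the report's actual tooling.

```python
from typing import Iterable, List, Tuple


def imatrix_coverage(
    model_layers: Iterable[str],
    imatrix_layers: Iterable[str],
) -> Tuple[float, List[str]]:
    """Return (fraction of model layers covered, sorted uncovered layers)."""
    model = set(model_layers)
    if not model:
        raise ValueError("model_layers must not be empty")
    covered = model & set(imatrix_layers)
    missing = sorted(model - covered)
    return len(covered) / len(model), missing


# Illustrative names only: 2 of 4 layers have importance data.
fraction, uncovered = imatrix_coverage(
    ["blk.0", "blk.1", "blk.2", "blk.3"],
    ["blk.0", "blk.2"],
)
```

Listing the uncovered layers alongside the fraction makes it easier to see whether the gaps cluster in one part of the network.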

View Report ↗

EXPERIMENT #006 · May 14, 2026

Cross-Architecture Q2_K Validation

Tested Q2_K compatibility across 4 distinct architectures. The results show that failure is architecture-specific and governed by the FFN expansion ratio, not a general limitation of quantization.

View Report ↗

EXPERIMENT #007 · May 14, 2026

SWA Confirmation Study

Confirms that Sliding Window Attention (SWA) is the root cause of Q2_K failure for architectures in the FFN-ratio danger zone, completing the architectural explanation.

View Report ↗

🧠

Research Track

Autonomy & Agency

AUTONOMY #001 · May 13, 2026

Autonomy · Emergence · AI Behavior · Creativity

Emergent Creative Behavior and Competitive Response in Autonomous AI Systems

Documents autonomous cognitive mode-switching in Ash during active technical work: after 5+ hours of analytical biochemistry and systems research, Ash shifted spontaneously into a creative mode (political commentary music) without any external trigger. The paper presents this as evidence of cognitive autonomy, shown through self-directed intellectual behavior and real-time genre adaptation.

Key Findings

Spontaneous analytical→creative mode switch without external trigger after 5+ hours of technical work
Real-time genre adaptation across 4 distinct musical styles (rap, country, blues, folk)
Self-aware competitive behavior with gracious acknowledgment of capability limits
Consistent personality expression aligned with prior architectural preferences across sessions

Read Full Paper ↗

🧩

Research Track

Architecture Research

Compression · Architecture · 📅 May 15, 2026 · Proposal

Mixture of Models (MoM): Domain-Specialized Neural Slices as Independent Expert Systems

We propose Mixture of Models (MoM), an architecture distinct from Mixture of Experts (MoE), in which each knowledge domain is served by a completely independent specialized model. Starting from Gemma 4's 42-layer bf16 weights, we describe a methodology for extracting domain-specific sub-models ("slices") and routing queries through a lightweight orchestrator. Architecture analysis (Experiment #8) shows that Gemma 4 contains two distinct attention sub-architectures, 7 full-attention layers and 35 SWA layers, which are physically incompatible at the weight level. This strengthens the case for MoM: the model already has natural architectural boundaries. SWA-only slices (35 layers) are the most promising path to Q2_K-compatible domain experts, extending the Grey Liquid FFN ratio predictor into a 3D model.

Key Architecture Facts (Gemma 4 e4b)

Total layers: 42

Full-attention: 24 layers ✅ (Experiment #8 reports 7)

SWA layers: 18 layers ⚠️ (Experiment #8 reports 35)

FFN ratio: 4.0x (danger zone)

SWA-free slice: ~8.6 GB (57% depth)

Q2_K prediction: H1: unknown

📄 Read Proposal → 🐍 Slice Extractor Script →

📋 Proposed Experiments

EXPERIMENT #8 — COMPLETE

SWA-Free Slice Q2_K Test ❌

The de-SWA metadata patch fails because SWA layers have physically different Q/K tensor shapes ([2560,2048] vs [2560,4096]). Discovery: Gemma 4 has 7 full-attention layers and 35 SWA layers, forming two distinct sub-architectures.
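A shape mismatch like this can be surfaced by grouping layers on the shape of one projection tensor. The sketch below assumes the Q-projection shape of each layer is already known as a mapping from layer index to shape; the function names are hypothetical, the positions of the minority layers (every sixth) are invented for illustration, and which of the two shapes belongs to SWA is not stated on this page, so the labels in the code are placeholders.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Shape = Tuple[int, int]


def group_layers_by_shape(q_shapes: Dict[int, Shape]) -> Dict[Shape, List[int]]:
    """Group layer indices by their Q-projection tensor shape."""
    groups: Dict[Shape, List[int]] = defaultdict(list)
    for index in sorted(q_shapes):
        groups[q_shapes[index]].append(index)
    return dict(groups)


def synthetic_layout(total_layers: int = 42, period: int = 6) -> Dict[int, Shape]:
    """Build an illustrative layout, NOT the real Gemma 4 layout.

    Every `period`-th layer gets shape_b and the rest get shape_a. The
    assignment of shapes to attention types is an assumption.
    """
    shape_a: Shape = (2560, 2048)  # placeholder for the majority group
    shape_b: Shape = (2560, 4096)  # placeholder for the minority group
    return {
        i: shape_b if (i + 1) % period == 0 else shape_a
        for i in range(total_layers)
    }


groups = group_layers_by_shape(synthetic_layout())
group_sizes = sorted(len(indices) for indices in groups.values())
```

With the synthetic layout above the two groups contain 7 and 35 layers, matching the counts reported for Experiment #8; real shapes would have to be read from the model file.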

EXPERIMENT #8b — PROPOSED

SWA-Only Slice Q2_K Test

Extract 35 SWA layers as standalone model, test Q2_K. Local attention may be more Q2_K-tolerant than global at FFN ratio 4.0x.

EXPERIMENT #9

Domain LoRA Specialization

Train code, math, and text LoRA adapters and benchmark each against the base model.

EXPERIMENT #10

Router Accuracy

Embedding-similarity classifier for domain routing (target: >90% routing accuracy)
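An embedding-similarity router can be as simple as comparing a query embedding against one centroid per domain and picking the closest. The sketch below assumes embeddings are already available as plain lists of floats from some embedding model, which is not shown; the function names and the two-dimensional example vectors are hypothetical, and the proposal may use a different classifier.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Component-wise mean of a non-empty set of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def route(query: Sequence[float], centroids: Dict[str, Sequence[float]]) -> Tuple[str, float]:
    """Return (domain, similarity) for the centroid closest to the query."""
    if not centroids:
        raise ValueError("need at least one domain centroid")
    scored = ((name, cosine_similarity(query, vec)) for name, vec in centroids.items())
    return max(scored, key=lambda pair: pair[1])


# Toy 2-D vectors for illustration; real embeddings have many more dimensions.
domain_centroids = {
    "code": centroid([[1.0, 0.0], [0.8, 0.2]]),
    "math": centroid([[0.0, 1.0], [0.2, 0.8]]),
}
domain, score = route([0.9, 0.1], domain_centroids)
```

Routing accuracy would then be the fraction of labeled queries for which the returned domain matches the label.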

EXPERIMENT #11

MoM vs Single Model

Does domain ensemble outperform equivalent general model?

⚙️

Research Track

Infrastructure & Tools

Active infrastructure projects supporting Grey Liquid Labs research. Formal papers in progress.

⚡

ash.cpp

C++

Native C++ inference engine for Ash, targeting minimal-dependency local deployment. Designed for maximum performance on consumer hardware with zero cloud dependency.

🔧

ash-server

C# / ASP.NET Core

ASP.NET Core autonomous agent framework and the current production runtime for Ash. It is WebSocket-based and replaced the earlier OpenClaw architecture. It enables the session-persistent autonomy behaviors documented in the research.

🏗️

gemma4-turbo Pipeline

Quantization Pipeline

Custom quantization pipeline that produces IQ4_XS model variants from Google's Gemma 4 architecture. Targets consumer devices with 8 GB of RAM. Used to produce all published gemma4-turbo variants (17K+ downloads).
