Open Source · MIT License

Faster LLM weight loading

Snapshot GPU state and restore via pipelined DMA. Multi-GPU tensor parallel. KV cache snapshots. Bit-identical output verified on H100.

$ pip install thaw-vllm[all]
View on GitHub
Llama-3.1-70B-Instruct · 141 GB · 2× H100 SXM 80 GB
safetensors
86.7s
thaw restore
11.2s
7.7× faster · 14.3 GB/s DMA throughput · Bit-identical
8/8
Models verified (bit-identical)
14.4 GB/s
Peak DMA throughput
8.8×
Peak weight loading speedup

Cold starts are the bottleneck

THE PROBLEM

Before an AI model can respond to a single request, it must load tens to hundreds of gigabytes of weights into GPU memory. Llama-3.1-70B takes 86 seconds just to load weights. Every restart, every scale-up event, every new server instance pays the same wait: expensive H100s sit idle while data trickles through a serialization bottleneck.

THE SOLUTION

thaw snapshots the raw GPU memory layout and restores it via pipelined DMA, bypassing all deserialization. Think fork() for GPU processes. The restored model produces bit-identical output, verified by automated greedy decoding across every model and architecture we support.
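A rough illustration of why skipping deserialization matters: restoring from a raw-bytes snapshot is a flat copy plus reinterpret, with no per-tensor parsing. This is a host-side Python toy, not thaw's actual DMA path, but the round trip shows the idea:

```python
import struct

# Toy sketch of the snapshot idea: capture a "weight" buffer as raw bytes
# once, then restore it with a straight copy -- no format decoding, no
# per-tensor deserialization. (Conceptual only; thaw does this for raw GPU
# memory regions via DMA, not host bytes.)

def snapshot(weights: list) -> bytes:
    # Serialize once, up front -- this cost is paid a single time.
    return struct.pack(f"{len(weights)}f", *weights)

def restore(blob: bytes) -> list:
    # Restore is a flat copy + reinterpret: O(bytes), no parsing.
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

original = [0.5, -1.25, 3.0]   # exactly representable in float32
blob = snapshot(original)
assert restore(blob) == original  # bit-identical round trip
```

The values chosen are exactly representable in float32, so the round trip is exact, mirroring the bit-identical guarantee at toy scale.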

Serverless inference

GPU endpoints that cold start in seconds, not minutes. Scale to zero, scale back without the penalty. Sub-15s for 70B-class models.

Cost reduction

Every minute of cold start is wasted GPU time at $2-4/hr per H100. Faster starts mean more compute budget goes to actual inference, not loading.
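A back-of-envelope sketch of that cost, using illustrative on-demand rates (actual pricing varies by provider):

```python
# Cold-start cost estimate for a 2x H100 node. The rate is an assumption
# at the upper end of the $2-4/hr range quoted above.
RATE_PER_HR = 4.00   # $/hr per H100 (illustrative)
GPUS = 2             # 2x H100 for a 70B-class model

def cold_start_cost(seconds: float) -> float:
    """Dollars of GPU time burned while weights load."""
    return seconds / 3600 * RATE_PER_HR * GPUS

safetensors = cold_start_cost(86.7)   # ~ $0.19 per cold start
thaw        = cold_start_cost(11.2)   # ~ $0.025 per cold start
print(f"saved per cold start: ${safetensors - thaw:.3f}")
```

Pennies per event, but multiplied across every scale-up and restart in a fleet it adds up, and the latency cost to users is usually the bigger concern.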

Agent infrastructure

Clone a running AI session—weights plus KV cache. Fork parallel conversations from a shared context. Explore multiple reasoning paths simultaneously.

Built for production inference

Rust+CUDA hot path. Python integration layer. Zero compromise.

Pipelined DMA

Double-buffered CUDA streams with pinned memory. Overlaps disk reads with PCIe transfers. Rust hot path hits 14.4 GB/s on H100.
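A host-side sketch of the same overlap idea, using a reader thread and a depth-2 queue in place of pinned buffers and CUDA streams. The real hot path is Rust+CUDA; this is conceptual only:

```python
import threading
import queue

# Double-buffering analogue: a reader thread fills chunks while the
# "transfer" stage consumes the previous one, so the two stages overlap.
# The queue depth of 2 plays the role of the two pinned host buffers.

def pipelined_copy(chunks, transfer):
    q = queue.Queue(maxsize=2)          # two in-flight buffers

    def reader():
        for chunk in chunks:            # stage 1: "disk read"
            q.put(chunk)
        q.put(None)                     # sentinel: no more data

    t = threading.Thread(target=reader)
    t.start()
    done = []
    while (chunk := q.get()) is not None:
        done.append(transfer(chunk))    # stage 2: "PCIe transfer"
    t.join()
    return done

# Order is preserved even though the stages run concurrently.
result = pipelined_copy([b"aa", b"bb", b"cc"], transfer=len)
assert result == [2, 2, 2]
```

With both stages running at similar rates, total time approaches max(read, transfer) instead of their sum, which is where the throughput headroom comes from.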

Multi-GPU Tensor Parallel

Per-rank snapshots with automatic sharding. Freeze and restore across any number of GPUs. Tested TP=2 on 2× H100 SXM, models up to 145 GB.

KV Cache Snapshots

Freeze prefix-cached KV blocks with hash mappings intact. Restore and skip prefill entirely. Nobody else does this.
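Conceptually, the snapshot persists both the KV blocks and the hash-to-block mapping, so a matching prompt prefix is a cache hit after restore. A minimal sketch with illustrative names, not thaw's actual on-disk format:

```python
import hashlib
import pickle

# Sketch of a prefix-cache snapshot: KV blocks keyed by a hash of the token
# prefix they cover. Freezing persists blocks AND the hash -> block mapping,
# so after restore a matching prefix is a cache hit and prefill is skipped.
# (block_hash / freeze_kv / restore_kv are hypothetical helper names.)

def block_hash(tokens: tuple) -> str:
    return hashlib.sha256(repr(tokens).encode()).hexdigest()

def freeze_kv(cache: dict) -> bytes:
    return pickle.dumps(cache)          # mapping kept intact

def restore_kv(blob: bytes) -> dict:
    return pickle.loads(blob)

prefix = (1, 42, 7)                     # shared prompt-prefix token ids
cache = {block_hash(prefix): b"<kv block bytes>"}
restored = restore_kv(freeze_kv(cache))
assert block_hash(prefix) in restored   # hit: prefill skipped for this prefix
```

The key property is that the mapping survives the round trip, so lookups in the restored process behave exactly as they did in the original one.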

Agent Forking

Clone a running AI session. Snapshot weights + KV cache, restore in a new process. Fork parallel completions from shared context.

Native vLLM Integration

load_format="thaw" plugs into vLLM's official ModelLoader system. Two-line API. pip installable. Works with any model vLLM supports.

Bit-Identical Output

Every benchmark verified by greedy decoding comparison. Restored models produce the exact same tokens as the original. Zero approximation.
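The criterion is strict: greedy decoding is deterministic, so verification reduces to exact token-sequence equality. A minimal sketch of the check:

```python
# Greedy decoding takes the argmax at every step, so output is deterministic.
# Any weight corruption that shifts logits enough to flip an argmax surfaces
# as a token mismatch -- there is no tolerance and no approximate match.

def greedy(logits_per_step):
    # argmax token id at each decoding step
    return [max(range(len(step)), key=step.__getitem__)
            for step in logits_per_step]

def bit_identical(reference, restored):
    return reference == restored        # exact sequence equality

logits = [[0.1, 0.9], [0.7, 0.3], [0.2, 0.8]]
ref = greedy(logits)
assert bit_identical(ref, greedy(logits))   # restored path must match exactly
```

In the real harness the logits come from the original and the restored model over the same prompts; the comparison itself is just this equality.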

Engine Pool

thaw serve — pre-warmed vLLM engines with hot model swapping. OpenAI-compatible API. Swap models in ~1s via DMA instead of 20s cold start.

Four lines to 8×

Works with any model vLLM supports. Single GPU or multi-GPU tensor parallel.

# Llama models are gated — authenticate first
$ huggingface-cli login

# Freeze once, serve forever
$ thaw freeze --model meta-llama/Llama-3.1-8B-Instruct \
              --output weights.thaw

# Pre-warmed engine pool with OpenAI-compatible API
$ thaw serve --model meta-llama/Llama-3.1-8B-Instruct \
             --snapshot weights.thaw

# Swap models in ~1s via DMA, not 20s cold start
$ curl localhost:8000/v1/chat/completions \
    -d '{"model": "...", "messages": [...]}'

import thaw_vllm
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",
          tensor_parallel_size=2)

# Snapshot GPU state to disk (once)
thaw_vllm.freeze_model_tp(llm, "/snapshots/llama-70b.thaw")

import thaw_vllm

# Restore: 7.7x faster weight loading
llm = thaw_vllm.load(
    "meta-llama/Llama-3.1-70B-Instruct",
    "/snapshots/llama-70b.thaw",
    tensor_parallel_size=2
)

# Same model. Same outputs. 7.7x faster.

import thaw_vllm  # registers the loader
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    load_format="thaw",
    tensor_parallel_size=2,
    model_loader_extra_config={"snapshot": "/snapshots/llama-70b.thaw"}
)

# Fresh RunPod A40: zero source builds, just pip
$ pip install thaw-vllm[all]
Successfully installed thaw-native-0.1.1 thaw-vllm-0.1.3 vllm-0.19.0

$ huggingface-cli login
Login successful.

$ thaw freeze --model meta-llama/Llama-3.1-8B-Instruct -o weights.thaw
[thaw] Weights: 195 regions, 16.06 GB in 35.5s (0.45 GB/s)

$ thaw serve --model meta-llama/Llama-3.1-8B-Instruct --snapshot weights.thaw
[thaw] Slot 0 ready in 7.9s
[thaw] Slot 0: loaded in 1.71s (9.4 GB/s, 16.06 GB)
[thaw] Serving on http://0.0.0.0:8000/v1

$ curl localhost:8000/v1/chat/completions \
    -d '{"model": "...", "messages": [{"role": "user", "content": "Hello!"}]}'
{"choices": [{"message": {"content": "Hello! It's nice to meet you.
Is there something I can help you with?"}}],
 "thaw_metadata": {"latency_s": 1.624}}

Benchmarks

8 models, 5 architectures, 2× H100 SXM 80 GB. All results verified with bit-identical greedy decoding.

Model                 Size   TP  safetensors   thaw  Speedup  DMA GB/s
Phi-3-mini          7.6 GB    1        10.0s   1.1s     8.8×       6.8
Mixtral-8x7B (MoE)   87 GB    2        62.1s   7.4s     8.4×      14.4
Llama-3.1-70B       141 GB    2        86.7s  11.2s     7.7×      14.3
Qwen-2.5-72B        145 GB    2        92.6s  13.0s     7.1×      13.2
Mistral-7B           14 GB    1        11.2s   2.1s     5.3×       6.9
Llama-3.1-8B         16 GB    1        10.9s   2.4s     4.5×       6.6
Qwen-2.5-7B          15 GB    1         9.7s   2.2s     4.4×       7.0
Gemma-2-9B           18 GB    1         7.6s   3.4s     2.3×       5.5

Weight speedup isolates weight transfer by subtracting vLLM init overhead (~9–20s), which is identical in both paths. End-to-end cold start speedup including init: 1.3–3.4×. Both paths load from local storage (NVMe for safetensors, /dev/shm for thaw). Raw data: benchmarks/h100_stress_test_2026-04-14.json
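To see how the two speedup figures relate, take the Llama-3.1-70B row with an assumed init overhead at the top of the quoted range (the 20 s value here is illustrative):

```python
# Relating weight-load speedup to end-to-end cold-start speedup for the
# Llama-3.1-70B row. Init overhead is assumed at 20s (the note quotes
# ~9-20s); it is identical in both paths, so it dilutes the ratio.
weight_safetensors = 86.7   # s, weight transfer via safetensors
weight_thaw        = 11.2   # s, weight transfer via thaw DMA
init               = 20.0   # s, vLLM init overhead (assumed, both paths)

weight_speedup = weight_safetensors / weight_thaw
e2e_speedup = (weight_safetensors + init) / (weight_thaw + init)
print(f"{weight_speedup:.1f}x weight load, {e2e_speedup:.1f}x end-to-end")
```

Smaller models sit at the low end of the 1.3–3.4× end-to-end range because init overhead dominates their total cold start.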

Ready to thaw?

Open source. MIT licensed. Works today.

$ pip install thaw-vllm[all]
Star on GitHub