Open Source · MIT License

Faster LLM weight loading

Snapshot GPU state and restore via pipelined DMA. Multi-GPU tensor parallel. KV cache snapshots. Bit-identical output verified on H100.

$ pip install thaw-vllm[all]
View on GitHub
Llama-3.1-70B-Instruct · 141 GB · 2× H100 SXM 80 GB
safetensors
86.7s
thaw restore
11.2s
7.7× faster · 14.3 GB/s DMA throughput · Bit-identical
8/8
Models verified (bit-identical)
14.4 GB/s
Peak DMA throughput
8.8×
Peak weight loading speedup

Cold starts are the bottleneck

THE PROBLEM

Before an AI model can respond to a single request, it must load tens to hundreds of gigabytes of weights into GPU memory. Llama-3.1-70B takes 86 seconds just to load weights. Every restart, every scale-up event, every new server instance pays the same wait: expensive H100s sit idle while data trickles through a serialization bottleneck.

THE SOLUTION

thaw snapshots the raw GPU memory layout and restores it via pipelined DMA, bypassing all deserialization. Think fork() for GPU processes. The restored model produces bit-identical output, verified by automated greedy decoding across every model and architecture we support.
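A rough illustration of why skipping deserialization matters: restoring from a raw-bytes snapshot is a flat copy plus reinterpret, with no per-tensor parsing. This is a host-side Python toy, not thaw's actual DMA path, but the round trip shows the idea:

```python
import struct

# Toy sketch of the snapshot idea: capture a "weight" buffer as raw bytes
# once, then restore it with a straight copy -- no format decoding, no
# per-tensor deserialization. (Conceptual only; thaw does this for raw GPU
# memory regions via DMA, not host bytes.)

def snapshot(weights: list) -> bytes:
    # Serialize once, up front -- this cost is paid a single time.
    return struct.pack(f"{len(weights)}f", *weights)

def restore(blob: bytes) -> list:
    # Restore is a flat copy + reinterpret: O(bytes), no parsing.
    return list(struct.unpack(f"{len(blob) // 4}f", blob))

original = [0.5, -1.25, 3.0]   # exactly representable in float32
blob = snapshot(original)
assert restore(blob) == original  # bit-identical round trip
```

The values chosen are exactly representable in float32, so the round trip is exact, mirroring the bit-identical guarantee at toy scale.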

Serverless inference

GPU endpoints that cold start in seconds, not minutes. Scale to zero, scale back without the penalty. Sub-15s for 70B-class models.

Cost reduction

Every minute of cold start is wasted GPU time at $2-4/hr per H100. Faster starts mean more compute budget goes to actual inference, not loading.
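A back-of-envelope sketch of that cost, using illustrative on-demand rates (actual pricing varies by provider):

```python
# Cold-start cost estimate for a 2x H100 node. The rate is an assumption
# at the upper end of the $2-4/hr range quoted above.
RATE_PER_HR = 4.00   # $/hr per H100 (illustrative)
GPUS = 2             # 2x H100 for a 70B-class model

def cold_start_cost(seconds: float) -> float:
    """Dollars of GPU time burned while weights load."""
    return seconds / 3600 * RATE_PER_HR * GPUS

safetensors = cold_start_cost(86.7)   # ~ $0.19 per cold start
thaw        = cold_start_cost(11.2)   # ~ $0.025 per cold start
print(f"saved per cold start: ${safetensors - thaw:.3f}")
```

Pennies per event, but multiplied across every scale-up and restart in a fleet it adds up, and the latency cost to users is usually the bigger concern.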

Agent infrastructure

Clone a running AI session—weights plus KV cache. Fork parallel conversations from a shared context. Explore multiple reasoning paths simultaneously.

Built for production inference

Rust+CUDA hot path. Python integration layer. Zero compromise.

Pipelined DMA

Double-buffered CUDA streams with pinned memory. Overlaps disk reads with PCIe transfers. Rust hot path hits 14.4 GB/s on H100.
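A host-side sketch of the same overlap idea, using a reader thread and a depth-2 queue in place of pinned buffers and CUDA streams. The real hot path is Rust+CUDA; this is conceptual only:

```python
import threading
import queue

# Double-buffering analogue: a reader thread fills chunks while the
# "transfer" stage consumes the previous one, so the two stages overlap.
# The queue depth of 2 plays the role of the two pinned host buffers.

def pipelined_copy(chunks, transfer):
    q = queue.Queue(maxsize=2)          # two in-flight buffers

    def reader():
        for chunk in chunks:            # stage 1: "disk read"
            q.put(chunk)
        q.put(None)                     # sentinel: no more data

    t = threading.Thread(target=reader)
    t.start()
    done = []
    while (chunk := q.get()) is not None:
        done.append(transfer(chunk))    # stage 2: "PCIe transfer"
    t.join()
    return done

# Order is preserved even though the stages run concurrently.
result = pipelined_copy([b"aa", b"bb", b"cc"], transfer=len)
assert result == [2, 2, 2]
```

With both stages running at similar rates, total time approaches max(read, transfer) instead of their sum, which is where the throughput headroom comes from.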

Multi-GPU Tensor Parallel

Per-rank snapshots with automatic sharding. Freeze and restore across any number of GPUs. Tested TP=2 on 2× H100 SXM, models up to 145 GB.

KV Cache Snapshots

Freeze prefix-cached KV blocks with hash mappings intact. Restore and skip prefill entirely. Nobody else does this.
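Conceptually, the snapshot persists both the KV blocks and the hash-to-block mapping, so a matching prompt prefix is a cache hit after restore. A minimal sketch with illustrative names, not thaw's actual on-disk format:

```python
import hashlib
import pickle

# Sketch of a prefix-cache snapshot: KV blocks keyed by a hash of the token
# prefix they cover. Freezing persists blocks AND the hash -> block mapping,
# so after restore a matching prefix is a cache hit and prefill is skipped.
# (block_hash / freeze_kv / restore_kv are hypothetical helper names.)

def block_hash(tokens: tuple) -> str:
    return hashlib.sha256(repr(tokens).encode()).hexdigest()

def freeze_kv(cache: dict) -> bytes:
    return pickle.dumps(cache)          # mapping kept intact

def restore_kv(blob: bytes) -> dict:
    return pickle.loads(blob)

prefix = (1, 42, 7)                     # shared prompt-prefix token ids
cache = {block_hash(prefix): b"<kv block bytes>"}
restored = restore_kv(freeze_kv(cache))
assert block_hash(prefix) in restored   # hit: prefill skipped for this prefix
```

The key property is that the mapping survives the round trip, so lookups in the restored process behave exactly as they did in the original one.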

Agent Forking

Clone a running AI session. Snapshot weights + KV cache, restore in a new process. Fork parallel completions from shared context.

Native vLLM Integration

load_format="thaw" plugs into vLLM's official ModelLoader system. Two-line API. pip installable. Works with any model vLLM supports.

Bit-Identical Output

Every benchmark verified by greedy decoding comparison. Restored models produce the exact same tokens as the original. Zero approximation.
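The criterion is strict: greedy decoding is deterministic, so verification reduces to exact token-sequence equality. A minimal sketch of the check:

```python
# Greedy decoding takes the argmax at every step, so output is deterministic.
# Any weight corruption that shifts logits enough to flip an argmax surfaces
# as a token mismatch -- there is no tolerance and no approximate match.

def greedy(logits_per_step):
    # argmax token id at each decoding step
    return [max(range(len(step)), key=step.__getitem__)
            for step in logits_per_step]

def bit_identical(reference, restored):
    return reference == restored        # exact sequence equality

logits = [[0.1, 0.9], [0.7, 0.3], [0.2, 0.8]]
ref = greedy(logits)
assert bit_identical(ref, greedy(logits))   # restored path must match exactly
```

In the real harness the logits come from the original and the restored model over the same prompts; the comparison itself is just this equality.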

Engine Pool

thaw serve — pre-warmed vLLM engines with hot model swapping. OpenAI-compatible API. Swap models in ~1s via DMA instead of 20s cold start.

Four lines to 8×

Works with any model vLLM supports. Single GPU or multi-GPU tensor parallel.

# Llama models are gated — authenticate first
$ huggingface-cli login

# Freeze once, serve forever
$ thaw freeze --model meta-llama/Llama-3.1-8B-Instruct \
              --output weights.thaw

# Pre-warmed engine pool with OpenAI-compatible API
$ thaw serve --model meta-llama/Llama-3.1-8B-Instruct \
             --snapshot weights.thaw

# Swap models in ~1s via DMA, not 20s cold start
$ curl localhost:8000/v1/chat/completions \
    -d '{"model": "...", "messages": [...]}'

import thaw_vllm
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct",
          tensor_parallel_size=2)

# Snapshot GPU state to disk (once)
thaw_vllm.freeze_model_tp(llm, "/snapshots/llama-70b.thaw")

import thaw_vllm

# Restore: 7.7x faster weight loading
llm = thaw_vllm.load(
    "meta-llama/Llama-3.1-70B-Instruct",
    "/snapshots/llama-70b.thaw",
    tensor_parallel_size=2
)

# Same model. Same outputs. 7.7x faster.

import thaw_vllm  # registers the loader
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    load_format="thaw",
    tensor_parallel_size=2,
    model_loader_extra_config={"snapshot": "/snapshots/llama-70b.thaw"}
)

# Fresh RunPod A40: zero source builds, just pip
$ pip install thaw-vllm[all]
Successfully installed thaw-native-0.1.1 thaw-vllm-0.1.3 vllm-0.19.0

$ huggingface-cli login
Login successful.

$ thaw freeze --model meta-llama/Llama-3.1-8B-Instruct -o weights.thaw
[thaw] Weights: 195 regions, 16.06 GB in 35.5s (0.45 GB/s)

$ thaw serve --model meta-llama/Llama-3.1-8B-Instruct --snapshot weights.thaw
[thaw] Slot 0 ready in 7.9s
[thaw] Slot 0: loaded in 1.71s (9.4 GB/s, 16.06 GB)
[thaw] Serving on http://0.0.0.0:8000/v1

$ curl localhost:8000/v1/chat/completions \
    -d '{"model": "...", "messages": [{"role": "user", "content": "Hello!"}]}'
{"choices": [{"message": {"content": "Hello! It's nice to meet you.
Is there something I can help you with?"}}],
 "thaw_metadata": {"latency_s": 1.624}}

Benchmarks

8 models, 5 architectures, 2× H100 SXM 80 GB. All results verified with bit-identical greedy decoding.

Model                 Size   TP  safetensors   thaw  Speedup  DMA GB/s
Phi-3-mini          7.6 GB    1        10.0s   1.1s     8.8×       6.8
Mixtral-8x7B (MoE)   87 GB    2        62.1s   7.4s     8.4×      14.4
Llama-3.1-70B       141 GB    2        86.7s  11.2s     7.7×      14.3
Qwen-2.5-72B        145 GB    2        92.6s  13.0s     7.1×      13.2
Mistral-7B           14 GB    1        11.2s   2.1s     5.3×       6.9
Llama-3.1-8B         16 GB    1        10.9s   2.4s     4.5×       6.6
Qwen-2.5-7B          15 GB    1         9.7s   2.2s     4.4×       7.0
Gemma-2-9B           18 GB    1         7.6s   3.4s     2.3×       5.5

Weight speedup isolates weight transfer by subtracting vLLM init overhead (~9–20s), which is identical in both paths. End-to-end cold start speedup including init: 1.3–3.4×. Both paths load from local storage (NVMe for safetensors, /dev/shm for thaw). Raw data: benchmarks/h100_stress_test_2026-04-14.json
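To see how the two speedup figures relate, take the Llama-3.1-70B row with an assumed init overhead at the top of the quoted range (the 20 s value here is illustrative):

```python
# Relating weight-load speedup to end-to-end cold-start speedup for the
# Llama-3.1-70B row. Init overhead is assumed at 20s (the note quotes
# ~9-20s); it is identical in both paths, so it dilutes the ratio.
weight_safetensors = 86.7   # s, weight transfer via safetensors
weight_thaw        = 11.2   # s, weight transfer via thaw DMA
init               = 20.0   # s, vLLM init overhead (assumed, both paths)

weight_speedup = weight_safetensors / weight_thaw
e2e_speedup = (weight_safetensors + init) / (weight_thaw + init)
print(f"{weight_speedup:.1f}x weight load, {e2e_speedup:.1f}x end-to-end")
```

Smaller models sit at the low end of the 1.3–3.4× end-to-end range because init overhead dominates their total cold start.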

Ready to thaw?

Open source. MIT licensed. Works today.

$ pip install thaw-vllm[all]
Star on GitHub