Snapshot GPU state and restore via pipelined DMA. Multi-GPU tensor parallel. KV cache snapshots. Bit-identical output verified on H100.
Before an AI model can respond to a single request, it loads tens to hundreds of gigabytes into GPU memory. Llama-3.1-70B takes 86 seconds just to load weights. Every restart, every scale-up event, every new server instance—the same wait. Expensive H100s sit idle while data trickles through a serialization bottleneck.
thaw snapshots the raw GPU memory layout and restores it via pipelined DMA, bypassing all deserialization. Think fork() for GPU processes. The restored model produces bit-identical output—verified by automated greedy decoding across every model and architecture we support.
GPU endpoints that cold start in seconds, not minutes. Scale to zero and back up without the penalty. Sub-15s restores for 70B-class models.
Every minute of cold start is wasted GPU time at $2-4/hr per H100. Faster starts mean more compute budget goes to actual inference, not loading.
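A back-of-envelope sketch of what that means per restart, using the benchmark numbers below (the $3/hr midpoint rate is an assumption; load times and TP=2 come from the table):

```python
# Back-of-envelope cost of one Llama-3.1-70B cold start.
# Assumed $3/hr midpoint of the $2-4 range; times from the benchmark table.
h100_per_hour = 3.00        # $/hr per GPU (assumed midpoint)
gpus = 2                    # 70B runs tensor parallel across 2 GPUs
safetensors_load_s = 86.7   # baseline weight load (benchmark table)
thaw_load_s = 11.2          # thaw DMA restore (benchmark table)

def cost(seconds: float) -> float:
    return h100_per_hour / 3600 * seconds * gpus

print(f"baseline: ${cost(safetensors_load_s):.3f} per restart")  # ~$0.145
print(f"thaw:     ${cost(thaw_load_s):.3f} per restart")         # ~$0.019
```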
Clone a running AI session—weights plus KV cache. Fork parallel conversations from a shared context. Explore multiple reasoning paths simultaneously.
Rust+CUDA hot path. Python integration layer. Zero compromise.
Double-buffered CUDA streams with pinned memory. Overlaps disk reads with PCIe transfers. Rust hot path hits 14.4 GB/s on H100.
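A minimal PyTorch sketch of the same double-buffering idea (the shipped hot path is Rust+CUDA; `restore_raw`, the chunk size, and the flat file layout here are illustrative assumptions, not thaw's internals):

```python
import torch

CHUNK = 256 << 20  # 256 MB staging chunks (assumed size)

def restore_raw(path: str, dst: torch.Tensor) -> None:
    """Stream a raw snapshot file into a contiguous CUDA tensor,
    overlapping disk reads with host-to-device DMA."""
    bufs = [torch.empty(CHUNK, dtype=torch.uint8, pin_memory=True) for _ in range(2)]
    streams = [torch.cuda.Stream() for _ in range(2)]
    flat = dst.view(torch.uint8).reshape(-1)  # byte view of the destination
    offset, i = 0, 0
    with open(path, "rb") as f:
        while offset < flat.numel():
            buf, stream = bufs[i % 2], streams[i % 2]
            stream.synchronize()            # wait until this buffer is reusable
            n = f.readinto(buf.numpy())     # disk -> pinned host memory
            if n == 0:
                break
            with torch.cuda.stream(stream):  # enqueue async H2D copy (DMA)
                flat[offset:offset + n].copy_(buf[:n], non_blocking=True)
            offset += n                     # next disk read overlaps this copy
            i += 1
    torch.cuda.synchronize()
```

While one stream drains a pinned buffer over PCIe, the CPU fills the other buffer from disk, so neither the storage read nor the PCIe transfer waits on the other.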
Per-rank snapshots with automatic sharding. Freeze and restore across any number of GPUs. Tested TP=2 on 2× H100 SXM, models up to 145 GB.
Freeze prefix-cached KV blocks with hash mappings intact. Restore and skip prefill entirely. Nobody else does this.
Clone a running AI session. Snapshot weights + KV cache, restore in a new process. Fork parallel completions from shared context.
load_format="thaw" plugs into vLLM's official ModelLoader system. Two-line API. pip installable. Works with any model vLLM supports.
Every benchmark verified by greedy decoding comparison. Restored models produce the exact same tokens as the original. Zero approximation.
thaw serve — pre-warmed vLLM engines with hot model swapping. OpenAI-compatible API. Swap models in ~1s via DMA instead of 20s cold start.
Works with any model vLLM supports. Single GPU or multi-GPU tensor parallel.
```bash
# Llama models are gated — authenticate first
$ huggingface-cli login

# Freeze once, serve forever
$ thaw freeze --model meta-llama/Llama-3.1-8B-Instruct \
    --output weights.thaw

# Pre-warmed engine pool with OpenAI-compatible API
$ thaw serve --model meta-llama/Llama-3.1-8B-Instruct \
    --snapshot weights.thaw

# Swap models in ~1s via DMA, not 20s cold start
$ curl localhost:8000/v1/chat/completions \
    -d '{"model": "...", "messages": [...]}'
```
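Because the endpoint is OpenAI-compatible, the standard OpenAI Python SDK works against it too. A usage sketch (the port and model match the commands above; the api_key is a placeholder, which OpenAI-compatible servers typically ignore unless auth is configured):

```python
from openai import OpenAI

# Point the official SDK at the local thaw serve endpoint
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(resp.choices[0].message.content)
```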
```python
import thaw_vllm
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=2)

# Snapshot GPU state to disk (once)
thaw_vllm.freeze_model_tp(llm, "/snapshots/llama-70b.thaw")
```
```python
import thaw_vllm

# Restore: 7.7x faster weight loading
llm = thaw_vllm.load(
    "meta-llama/Llama-3.1-70B-Instruct",
    "/snapshots/llama-70b.thaw",
    tensor_parallel_size=2,
)

# Same model. Same outputs. 7.7x faster.
```
```python
import thaw_vllm  # registers the loader
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    load_format="thaw",
    tensor_parallel_size=2,
    model_loader_extra_config={"snapshot": "/snapshots/llama-70b.thaw"},
)
```
```console
# Fresh RunPod A40 — zero source builds, just pip
$ pip install thaw-vllm[all]
Successfully installed thaw-native-0.1.1 thaw-vllm-0.1.3 vllm-0.19.0

$ huggingface-cli login
Login successful.

$ thaw freeze --model meta-llama/Llama-3.1-8B-Instruct -o weights.thaw
[thaw] Weights: 195 regions, 16.06 GB in 35.5s (0.45 GB/s)

$ thaw serve --model meta-llama/Llama-3.1-8B-Instruct --snapshot weights.thaw
[thaw] Slot 0 ready in 7.9s
[thaw] Slot 0: loaded in 1.71s (9.4 GB/s, 16.06 GB)
[thaw] Serving on http://0.0.0.0:8000/v1

$ curl localhost:8000/v1/chat/completions \
    -d '{"model": "...", "messages": [{"role": "user", "content": "Hello!"}]}'
{"choices": [{"message": {"content": "Hello! It's nice to meet you. Is there something I can help you with?"}}], "thaw_metadata": {"latency_s": 1.624}}
```
8 models, 5 architectures, 2× H100 SXM 80 GB. All results verified with bit-identical greedy decoding.
| Model | Size | TP | safetensors Load | thaw DMA | Speedup | thaw GB/s |
|---|---|---|---|---|---|---|
| Phi-3-mini | 7.6 GB | 1 | 10.0s | 1.1s | 8.8× | 6.8 |
| Mixtral-8x7B (MoE) | 87 GB | 2 | 62.1s | 7.4s | 8.4× | 14.4 |
| Llama-3.1-70B | 141 GB | 2 | 86.7s | 11.2s | 7.7× | 14.3 |
| Qwen-2.5-72B | 145 GB | 2 | 92.6s | 13.0s | 7.1× | 13.2 |
| Mistral-7B | 14 GB | 1 | 11.2s | 2.1s | 5.3× | 6.9 |
| Llama-3.1-8B | 16 GB | 1 | 10.9s | 2.4s | 4.5× | 6.6 |
| Qwen-2.5-7B | 15 GB | 1 | 9.7s | 2.2s | 4.4× | 7.0 |
| Gemma-2-9B | 18 GB | 1 | 7.6s | 3.4s | 2.3× | 5.5 |
Weight speedup isolates weight transfer by subtracting vLLM init overhead (~9–20s), which is identical in both paths. End-to-end cold start speedup including init: 1.3–3.4×. Both paths load from local storage (NVMe for safetensors, /dev/shm for thaw). Raw data: benchmarks/h100_stress_test_2026-04-14.json
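The equivalence check behind these numbers amounts to comparing greedy token IDs between a baseline load and a snapshot restore. A minimal sketch (the actual harness isn't shown here; in practice the two engines run in separate processes so both never occupy GPU memory at once):

```python
import thaw_vllm
from vllm import LLM, SamplingParams

prompts = ["The capital of France is"]
greedy = SamplingParams(temperature=0.0, max_tokens=64)  # deterministic decoding

# Baseline: standard vLLM load from safetensors
baseline = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
ref = [tuple(o.outputs[0].token_ids) for o in baseline.generate(prompts, greedy)]

# Restored: same model from a thaw snapshot (fresh process in practice)
restored = thaw_vllm.load("meta-llama/Llama-3.1-8B-Instruct", "weights.thaw")
out = [tuple(o.outputs[0].token_ids) for o in restored.generate(prompts, greedy)]

assert ref == out  # bit-identical tokens, not approximately equal
```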