Tinker became publicly available in November, so here are some thoughts on what makes it interesting.
If you’re unfamiliar with Tinker, here is the description from the [docs](https://tinker-docs.thinkingmachines.ai/):
> _Tinker: a training API for researchers and developers. It gives you full control over the training loop and all the algorithmic details. It's not a magic black box that makes fine-tuning "easy." It's a clean abstraction that shields you from the complexity of distributed training while preserving your control._
First thoughts:
In most fine-tuning setups, you either manage a complex local GPU cluster (can be a headache) or you send a full dataset to an API and wait for a model to come back (zero control). Tinker is a kind of middle way. You write a standard Python loop that runs on your local, modest CPU. In that loop, you define your data loading, your custom loss functions, and your evaluation metrics.
When you call a function like `forward_backward()`, the heavy math is instantly teleported to a massive GPU cluster, but the **control flow** stays with you. This means you can inspect the loss, swap out data, or even change the training objective **in real-time** while the training is done on a distributed cluster somewhere else. It turns distributed training into something that feels as interactive and immediate as writing a local script.
### How it actually works
The core API is super minimal. You get a `TrainingClient` that represents your fine-tuned model, and you interact with it through a handful of functions:
- `forward_backward()` — compute gradients for a batch
- `optim_step()` — apply those gradients with Adam
- `sample()` — generate text from the current weights
- `save_state()` / `load_state()` — checkpoint and resume
That's it. That's the whole API! The rest is just your Python code deciding what data to feed in and what to do with the outputs (the backend training cycle itself runs on a roughly 10-second heartbeat, but that's handled entirely on their end). I saw a talk with John Schulman where he said "Tinker is the infrastructure that I wanted when I was at OpenAI" and I can see why. It really does remove a lot of the complexity and boils it down to these four core abstractions, which is great.
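To make the shape of the loop concrete, here is a toy sketch. The `MockTrainingClient`, its scalar "weights", and the quadratic loss are all invented for illustration; the real client runs these calls against a remote cluster, and the real `forward_backward()` also takes a loss function:

```python
# MockTrainingClient is invented for illustration: a scalar "model"
# trained with the same four-call shape as Tinker's TrainingClient.
class MockTrainingClient:
    def __init__(self):
        self.weight = 0.0  # stand-in for the model weights
        self.grad = 0.0    # stand-in for accumulated gradients

    def forward_backward(self, batch):
        # Toy loss: squared distance from the batch mean.
        target = sum(batch) / len(batch)
        self.grad = 2 * (self.weight - target)
        return (self.weight - target) ** 2

    def optim_step(self, lr):
        self.weight -= lr * self.grad  # plain SGD standing in for Adam

    def sample(self):
        return f"weight={self.weight:.3f}"

    def save_state(self):
        return {"weight": self.weight}

    def load_state(self, state):
        self.weight = state["weight"]

client = MockTrainingClient()
for _ in range(50):
    loss = client.forward_backward([1.0, 2.0, 3.0])  # compute gradients
    client.optim_step(lr=0.1)                        # apply them
print(client.sample())  # converges toward the batch mean, 2.0
```

The point is how little surface area there is: everything else (data selection, evaluation, early stopping) is ordinary Python around these calls.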
Behind the scenes, Tinker runs on discrete clock cycles of about 10 seconds each. Each training job gets assigned to a pool of machines that run fine-tuning updates across multiple users and LoRA adapters at once. This is how they keep utilization high even when individual users are training with small batches or have slow data pipelines. It is like a huge stamping press that stamps everyone's gradients every 10 seconds. The tradeoff, obviously, is latency: even a tiny batch placed under the press takes the same step time as a large one.
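A toy model of that cycle semantics (the 10-second figure is from above; the rest is a deliberate simplification, real scheduling is surely more involved):

```python
import math

CYCLE = 10.0  # seconds per clock tick, per the docs

def completion_time(arrival: float) -> float:
    # A request submitted at `arrival` is stamped at the next cycle
    # boundary, no matter how small the batch is.
    return math.ceil(arrival / CYCLE) * CYCLE

# A tiny batch at t=0.5s and a huge one at t=9.9s both finish at t=10s.
print(completion_time(0.5), completion_time(9.9))  # 10.0 10.0
```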
A practical consequence of this approach is that you will always want to submit your next request _before_ the current one finishes. If you await `forward_backward()` and then submit `optim_step()`, you might miss the next cycle. You have to submit both, then await both.
In code (and from the docs), that looks like this:
```python
# Submitting returns a future; awaiting the future returns the result.
fwd_bwd_future = await client.forward_backward_async(batch, loss_fn)
optim_future = await client.optim_step_async(adam_params)  # both now in flight
fwd_bwd_result = await fwd_bwd_future
optim_result = await optim_future
```
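To see why the pipelined pattern wins, here's a toy timing model. Everything here is invented (the `ToyClockedBackend`, the 0.05 s tick); it just mimics the "requests resolve at the next tick" semantics:

```python
import asyncio
import time

TICK = 0.05  # toy stand-in for Tinker's ~10 s clock cycle

class ToyClockedBackend:
    """Submitted requests resolve at the next clock tick."""

    def __init__(self):
        self.start = time.monotonic()

    def submit(self, name: str) -> asyncio.Task:
        elapsed = time.monotonic() - self.start
        delay = TICK - (elapsed % TICK)  # time left until the next tick

        async def resolve():
            await asyncio.sleep(delay)
            return name

        return asyncio.create_task(resolve())

async def sequential(backend):
    # Await each result before submitting the next: two full ticks.
    await backend.submit("forward_backward")
    await backend.submit("optim_step")

async def pipelined(backend):
    # Submit both before awaiting either: they share the same tick.
    fwd = backend.submit("forward_backward")
    opt = backend.submit("optim_step")
    await fwd
    await opt

async def main():
    t0 = time.monotonic()
    await sequential(ToyClockedBackend())
    seq = time.monotonic() - t0

    t0 = time.monotonic()
    await pipelined(ToyClockedBackend())
    pipe = time.monotonic() - t0
    return seq, pipe

seq, pipe = asyncio.run(main())
print(f"sequential ~{seq / TICK:.1f} ticks, pipelined ~{pipe / TICK:.1f} ticks")
```

The sequential version pays roughly two ticks per step; the pipelined version pays roughly one.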
### Tinker is LoRA only (for now and probably for the foreseeable future)
Tinker currently only supports LoRA fine-tuning, not full fine-tuning. This ties back to a separate post I have on this ([[Notes on LoRA without Regret]]), which shows that LoRA matches full fine-tuning for RL and for SL on small-to-medium datasets. It only underperforms on large SL datasets where you're trying to cram a lot of information into the weights.
One thing that tripped me up: LoRA needs much higher learning rates than full fine-tuning, around 20-100x higher depending on model size. The [cookbook](https://tinker-docs.thinkingmachines.ai/overview-building) provides a utility function to calculate the right multiplier. Just something to keep in mind if you're porting a full fine-tuning setup: don't keep the same LR! It will feel like LoRA doesn't work, which is far from the case.
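Concretely (the base LR below is made up for illustration):

```python
full_ft_lr = 1e-5  # a made-up full fine-tuning learning rate
# Applying the 20-100x rule of thumb; the cookbook's utility computes
# the exact multiplier per model.
lora_lr_low, lora_lr_high = 20 * full_ft_lr, 100 * full_ft_lr
print(lora_lr_low, lora_lr_high)  # roughly 2e-4 to 1e-3
```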
### What can you train
The model lineup includes Llama 3.x (8B, 70B) and Qwen 3 variants, including big MoE models like Qwen3-235B-A22B. Vision-language models are supported too; I haven't tried it, but Qwen3-VL should work for image understanding tasks.
For loss functions, the usual ones are built in: cross-entropy for SL, and policy gradient variants for RL (PPO, CISPO, DRO). If you need something custom, there's `forward_backward_custom()` which lets you define arbitrary differentiable loss functions.
### The rendering layer
One detail I appreciated: they've built a rendering system that converts chat messages to tokens in a way that's consistent across training and inference. This sounds trivial until you've debugged a model that was trained with one chat template and served with another.
The renderers handle things like which tokens get loss weights (you only want to train on the assistant's responses, not the user's messages), stop sequences, and special tokens for vision inputs. They're designed to match HuggingFace's `apply_chat_template` exactly, so if you train with the default renderer, your model will work correctly with their OpenAI-compatible endpoint.
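A toy illustration of why this matters: two invented templates (not the real renderers) rendering the same conversation to different strings, and therefore different tokens:

```python
# Two invented chat templates -- not the real renderers.
def template_a(messages):
    return "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)

def template_b(messages):
    return "".join(f"[{m['role'].upper()}]\n{m['content']}\n" for m in messages)

msgs = [{"role": "user", "content": "hi"},
        {"role": "assistant", "content": "hello"}]

# Train with one, serve with the other, and the model sees different
# text around the same content -- a silent train/serve mismatch.
print(template_a(msgs) == template_b(msgs))  # False
```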
### Who is Tinker for?
I used Tinker because I needed a way to fine-tune many models for a single task related to Anthology. It was useful for me because I knew exactly what I needed and wanted to quickly run real experiments without dealing with infrastructure or debugging NCCL errors across a cluster of rented GPUs. Services like this are super useful when you realistically don't need that much training (i.e. you're in the LoRA regime) and you can get in and out with simple API calls.
It is clear, though, that it's explicitly _not_ trying to be an "upload your data and get a fine-tuned model" service. The expectation is that you understand what you're doing, though the docs help a lot and are very well-written. There is guidance on learning rate scaling, batch size selection, and when to use which loss function, but they assume you already know why these things matter.
The pricing model is also nice: you pay for compute used, not for idle time in the shared pool. That's particularly good for research workflows where you're constantly tweaking things. I think I spent $180 overall for my purposes, which is far, far less than doing this myself (especially once you include setup time).
### Tips
Some other assorted learnings:
**Use `renderer.build_supervised_example()` to get your loss weights right**
It returns `(model_input, weights)` where the final assistant turn gets `weight=1` and everything else (system prompt, user messages, prior turns) gets `weight=0`. You only want to train on the target voice, not on the prompts.
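A toy character-level re-implementation of that weighting rule (the real `build_supervised_example` works on tokens and handles special tokens; this just shows the masking logic):

```python
# Toy version of the weighting rule: only the final assistant turn
# gets weight 1. Characters stand in for tokens here.
def toy_supervised_example(messages):
    tokens, weights = [], []
    last_assistant = max(i for i, m in enumerate(messages)
                         if m["role"] == "assistant")
    for i, m in enumerate(messages):
        toks = list(m["content"])  # pretend each character is a token
        tokens += toks
        weights += [1.0 if i == last_assistant else 0.0] * len(toks)
    return tokens, weights

msgs = [{"role": "system", "content": "be brief"},
        {"role": "user", "content": "hi"},
        {"role": "assistant", "content": "yo"}]
tokens, weights = toy_supervised_example(msgs)
print(sum(weights))  # 2.0 -- only the assistant's "tokens" carry loss
```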
**Smaller batch sizes often give better results**
Start with a batch size around 128 and aim for at least 100 training steps (the docs suggest 1000+ for best results). LoRA is sensitive to batch size, so don't increase it if you can avoid it.
**Use `save_weights_for_sampler()` to test, `save_state()` to resume**
The former is fast and small (just weights for inference). The latter includes optimizer state so you can continue training exactly where you left off. Don't mix them up.
**Check your tokenized examples before training**
Print out a few processed examples to verify the prompt/completion split is where you expect. A misplaced special token can mean you're training on the wrong content.
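An invented little helper for this kind of eyeballing: mark every token that carries loss weight with a `*` so you can see the prompt/completion split at a glance:

```python
# Invented helper: mark every token that carries loss weight with '*'.
def show_example(tokens, weights):
    return " ".join(f"{t}{'*' if w else ''}" for t, w in zip(tokens, weights))

tokens  = ["<sys>", "be", "brief", "<user>", "hi", "<asst>", "yo"]
weights = [0, 0, 0, 0, 0, 1, 1]  # only the assistant turn is trained on
print(show_example(tokens, weights))
# <sys> be brief <user> hi <asst>* yo*
```

If a `*` lands on system or user content, your template or weights are off.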
For learning rate, use `get_lr(model_name)` as your starting point, then sweep ±1 order of magnitude.
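For the sweep itself, a log-spaced grid is the natural choice (the base LR here is made up; in practice it comes from `get_lr`):

```python
base_lr = 1e-4  # made-up; in practice, start from get_lr(model_name)
# Five log-spaced points covering one order of magnitude on each side.
sweep = [base_lr * 10 ** (e / 2) for e in range(-2, 3)]
print(sweep)  # ~[1e-05, 3.2e-05, 1e-04, 3.2e-04, 1e-03]
```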
As always, log your metrics to a file and plot train/test loss to spot overfitting early.