This post captures my essential technical takeaways from the Thinking Machines research paper *LoRA Without Regret* by John Schulman.
The main takeaway from this paper is that adapting a low-rank projection of a large model (LoRA) can match the performance of fine-tuning the full model (full fine-tuning, or FullFT) if you follow a specific set of engineering rules. The reason is that there is a **"low-regret" regime** where the relationship between the **trainable parameter capacity** and the **dataset's information content** allows LoRA to absorb learning signals with the same sample efficiency as full fine-tuning.
Specifically, as long as the adapter is not **capacity-constrained**—meaning the number of trainable parameters in the low-rank matrices exceeds the amount of information to be learned from the data—the training dynamics of LoRA align almost perfectly with those of a full model update.
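The capacity condition can be made concrete with a back-of-the-envelope parameter count. The sketch below is plain Python; the layer shapes are a hypothetical 7B-class transformer (hidden size 4096, MLP size 11008, 32 blocks), not figures from the paper. For a weight matrix of shape `(d_out, d_in)`, LoRA trains two small factors `A` (`rank × d_in`) and `B` (`d_out × rank`), so the adapter adds `rank * (d_in + d_out)` parameters per matrix.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one weight matrix:
    A is (rank, d_in) and B is (d_out, rank)."""
    return rank * (d_in + d_out)

def total_lora_params(layers, rank):
    """Sum adapter parameters over a list of (d_in, d_out) shapes."""
    return sum(lora_params(d_in, d_out, rank) for d_in, d_out in layers)

# Hypothetical 7B-class transformer (illustrative shapes, not from the paper).
hidden, mlp, n_blocks = 4096, 11008, 32
per_block = (
    [(hidden, hidden)] * 4                    # Q, K, V, O attention projections
    + [(hidden, mlp)] * 2 + [(mlp, hidden)]   # gate, up, down MLP projections
)
layers = per_block * n_blocks

for r in (1, 16, 256):
    print(f"rank {r:>3}: {total_lora_params(layers, r) / 1e6:.1f}M trainable params")
```

Even rank 16 over all layers gives tens of millions of trainable parameters; whether that exceeds your dataset's information content is exactly the "capacity-constrained" question above.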
## The "Low-Regret" Rules
* **Target All Layers:** Do not just hit Attention ($Q, K, V$). You must apply LoRA to the **MLP (and MoE) layers**. This is where the model's knowledge lives. Applying adapters only to attention projection layers misses the majority of the gradient signal.
* **10x Your Learning Rate:** The optimal learning rate for LoRA is consistently **10x higher** than for FullFT. If the model isn't learning, your LR is probably too low. For runs <100 steps, you may even need 15x LR (not 10x) because the "B" matrix starts at zero and needs more "pressure" to move.
* **Don't worry about ranks if your data is small:** Early in training, $r=16$ and $r=256$ have identical learning curves. Don't waste VRAM on high ranks for small-to-medium datasets; the capacity bottleneck tends to appear only much later in the run.
* **The RL "Free Lunch":** Reinforcement Learning requires almost zero capacity. **Rank 1 LoRA** can match FullFT for reasoning tasks because RL only provides roughly **1 bit of information per episode** (correct vs. incorrect), which even the smallest adapter can absorb.
* **Be sensitive to batch size:** LoRA is less tolerant of large batch sizes than FullFT. You must keep batches small to moderate (typically <128) to avoid a loss penalty that rank increases cannot fix.
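To make the rules above concrete, here is a minimal sketch of a LoRA-wrapped linear layer in plain NumPy (my own names, not the paper's implementation; the `alpha / rank` scaling follows the convention from the original LoRA paper). It also shows why the learning rate needs extra "pressure": `B` is initialized to zero, so at step 0 the adapter contributes nothing and the layer reproduces the frozen base exactly.

```python
import numpy as np

class LoRALinear:
    """Sketch of one LoRA-adapted layer: y = x @ W.T + (alpha/r) * x @ A.T @ B.T.
    W is frozen; only A and B train. B starts at zero, so the adapted
    layer initially computes exactly the base layer's output."""
    def __init__(self, weight: np.ndarray, rank: int = 16, alpha: int = 32):
        d_out, d_in = weight.shape
        self.weight = weight                             # frozen base weight
        rng = np.random.default_rng(0)
        self.A = rng.normal(0, 1 / d_in, (rank, d_in))   # small random init
        self.B = np.zeros((d_out, rank))                 # zero init: no delta yet
        self.scale = alpha / rank

    def __call__(self, x: np.ndarray) -> np.ndarray:
        return x @ self.weight.T + self.scale * (x @ self.A.T) @ self.B.T

# At initialization the adapter contributes nothing:
W = np.random.default_rng(1).normal(size=(8, 8))
layer = LoRALinear(W, rank=2)
x = np.ones((1, 8))
assert np.allclose(layer(x), x @ W.T)
```

"Target all layers" then just means wrapping every attention *and* MLP weight matrix this way, not only $Q, K, V$.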
## An Analogy: LoRA as Updating a Library
Why these rules? An analogy I like is to imagine a base model like Qwen-235B or Kimi-K2 as a massive library with millions of books. The books in this library are carefully chosen because they give you the information needed to do a set of tasks. When a task arrives, you use the knowledge you've gained from the books to do it. If the task is new, you quickly grab the books you need, read some of the chapters, and then use the information to do the task.
Now what if you know that a task that is coming in will require adjustments to the knowledge that is contained in the books? You have two options:
**Option 1: Change the Library Entirely (Full Fine-Tuning)**: You hire an army of librarians to go through every single book and rewrite specific sentences to better suit your needs. In principle, this is the ideal solution. But it’s impossibly slow, expensive, and you end up with a completely different (and massive) library. When the task changes, you will have to do it all over again.
**Option 2: Adapt The Relevant Parts of the Library in a Temporary Way (LoRA)**. This is like giving the library a small box of **Sticky Notes**. Instead of rewriting the books, the librarians just put sticky notes on specific pages that are relevant to the task. The notes are tiny, can be removed easily, and simply tell the reader: _"When you see this sentence, think of it this way instead."_
### The Rules, Recast as Sticky Notes
Once you see LoRA as a collection of "Sticky Notes", the rules above fall out naturally:
- **Target All Layers (MLP/MoE):** If you only put the notes on the "Table of Contents" of each book (The Attention layers), you may find the right chapter, but the actual information in the pages (MLP layers) remains the same. To change what the library _knows_, **you have to put notes on the pages where the facts are kept.**
- **The 10x Learning Rate (LR) Rule:** Because the sticky notes are so small, you have to write on them with a **thick, bold marker** (High LR) for the reader to actually notice them over the billions of original words. If you use a thin pencil (Low LR), the "notes" will get ignored.
- **Rank Independence (at first):** Whether your sticky note is 1 inch wide ($r=16$) or 4 inches wide ($r=256$), it doesn't matter if you only have three things to say. They both work the same (!) until you have so much new information that you run out of physical space on the small note.
- **The RL "Free Lunch":** In Reinforcement Learning, you aren't trying to teach the library new books; you're just teaching the reader a **new habit** (e.g., "Check your math twice"). You only need a tiny "reminder" note on the front door to change a habit, which is why **Rank 1** works perfectly.
- **Batch Size Sensitivity:** If you get 1,000 librarians to put sticky notes on the same shelf all at the same time (a large batch update), they will bump into each other and make a mess. LoRA needs a calm room: it goes better with just a few librarians working carefully on one section at a time.
## Why This Matters
### Weights become baked knowledge per user
The beauty of LoRA is that it is fundamentally more flexible at runtime. Because each adapter is just a pair of small weight matrices, you can keep the massive 140GB "Base" model (like Kimi-K2) frozen in memory and instantly swap ~50MB boxes of notes in and out.
This allows a new class of AI experiences. For a product that personalizes the experience for each user, a LoRA system means you can serve thousands of different users from a single GPU server, each with their own personal adapter, just by swapping a tiny weight file in milliseconds. This effectively gives each user a private brain for the cost of a generic one.
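A sketch of what that swap looks like (illustrative NumPy; the adapter store, shapes, and user names are all hypothetical): serving a different user becomes a dictionary lookup over small `(B, A)` pairs, not a reload of the base weights.

```python
import numpy as np

rng = np.random.default_rng(0)
D, R = 64, 4
base_W = rng.normal(size=(D, D))   # frozen base weight, shared by all users

# Hypothetical per-user adapter store: tiny (B, A) factor pairs, not full copies.
adapters = {
    user: (rng.normal(size=(D, R)) * 0.01, rng.normal(size=(R, D)) * 0.01)
    for user in ("alice", "bob")
}

def forward(x: np.ndarray, user: str) -> np.ndarray:
    """Apply the shared base plus this user's low-rank delta."""
    B, A = adapters[user]
    return x @ base_W.T + x @ A.T @ B.T   # swapping users = a dict lookup

x = rng.normal(size=(1, D))
out_a, out_b = forward(x, "alice"), forward(x, "bob")
assert not np.allclose(out_a, out_b)  # same base, different behavior per user

# Storage cost of one adapter relative to one full weight matrix:
print(f"adapter is {2 * D * R / (D * D):.1%} the size of the base matrix")
```

Production LoRA servers batch requests across users with different adapters on one GPU, but the core idea is the same: one frozen base, many cheap deltas.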
### The full product is Model + Adapter
I expect LoRA fine-tuning will become much more popular, given how well Tinker handles distributed training. There will be a shift in value from the base model to the **Adapter**.
In this world, the "Model" is just the raw engine; the "Adapter" is the specific intelligence, the domain expertise, and the user's history. Products will move toward a world of sharded, specific intelligence that is hyper-personalized.
## Predictions
If the **LoRA Without Regret** thesis is correct, product-focused AI companies will change. Here are my 3 predictions:
1. **RAG Continues to Become Less Important in 2026**: As training small, high-rank adapters becomes cheaper and faster than searching a 10M-document vector database, the "model as database" pattern will become standard.
2. **The Rise of the "Personal Adapter"**: "GPT wrapper" style products will build moats by hosting their own infrastructure and creating per-user adapters on it. These will be boxes of "sticky notes" in the form of a 50MB weight file that is substituted in on the fly for each user interaction.
3. **Multi-tenant Sharding Baked Behind Each Product**: The standard inference server of 2026 for a product-focused AI company will not run one model; it will run one **Base Model** with **Dynamic Adapters** on top. Serving a personalized, fine-tuned experience to a million users will cost nearly the same as serving a generic one, due to the efficiency of sharded LoRA weights.