Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what’s under the hood, and telling stories.
www.interconnects.ai
https://www.interconnects.ai/p/what-people-get-wrong-about-the-leading
Two editor’s notes to start.
* First, we released our OLMo 2 1B model last week and it’s competitive with Gemmas and Llamas of comparable size — I wrote some reflections on training it here.
* Second, my Qwen 3 post had an important factual error — Qwen actually did not release the base models for their 32B and large MoE model. This has important ramifications for research. Onto the update.
People vastly underestimate the number of companies that cannot use Qwen and DeepSeek open models because they come from China. This includes on-premise solutions. Chinese open models are leading in every area when it comes to performance, but translating that to adoption in Western economies is a different story.
Even with the most permissive licenses, there’s a great reluctance to deploy these models into enterprise solutions, even if experimentation is encouraged. While tons of cloud providers raced to host the models on their API services, far fewer entities than expected are actually building with the models or their open weights.
The primary concern seems to be the information hazard of indirect influence of Chinese values on Western business systems. In today’s tenuous geopolitical environment this is logical from a high-level perspective, but hard for technically focused researchers and engineers to accept — myself included.
My thinking used to be more aligned with this X user:
it's like having a pen on ur desk but refusing to use it cuz it was made in china
The knee-jerk reaction of the techno-optimist misses the context in which AI models exist. Their interface is language, and linguistic influence is by nature hard to measure. Why would many companies avoid Chinese models when a model is just a fancy list of numbers and we have no evidence of PRC tampering? Because a lack of evidence is not proof of safety.
It’s not the security of the Chinese open models that is feared, but the outputs themselves.
There’s no way, without releasing the training data, for these companies to fully convince Western companies that they’re safe. It’s very likely that the current models are very safe, but many people expect that to change with how important AI is becoming to geopolitics. When presented with a risk that can’t be completely eliminated and is only expected to get worse, the decision can make sense for large IT organizations.
I’ve worked at companies that have very obviously avoided working with Chinese API providers because they can’t do the requisite legal and compliance checks, but hearing about the lack of uptake of the open weight models was a shock to me.
This gap provides a big opportunity for Western AI labs to lead in open models. Without DeepSeek and Qwen, the top tier of models we’re left with are Llama and Gemma, which both have very restrictive licenses when compared to their Chinese counterparts. These licenses are restrictive enough that they, too, are likely to block an IT department from approving a model.
This takes us to the middle tier of permissively licensed, open weight models that actually have a huge opportunity ahead of them: OLMo (of course, I’m biased), Microsoft with Phi, Mistral, IBM (!??!), and some other smaller companies to fill out the long tail.
This is also an obvious opportunity for any company willing to see past the risk and build with the current, better models from China.
This has recalibrated my view of the potential of the OLMo project we’re working on well upward. The models are comparable in performance to Qwen 2.5 and Llama 3, and always have the friendliest licenses.
This should make you all recalibrate the overall competitiveness of the model landscape today. While API models are as competitive as they ever have been, open models are competitive on paper, but when it comes to adoption, the leading 4 models all have major structural weaknesses. This could be one of the motivations for OpenAI to enter this space.
If you don’t believe me, you can see lots of engagement on my socials agreeing with this point. Even if the magnitude of my warning isn’t exactly right, the dynamic is directionally real and is already shifting adoption.
Models like Tülu 3 405B and R1 1776 that modify the character of the underlying Chinese models are currently seen as “good enough” and represent a short-term reprieve in the negative culture around Chinese models. Though on the technical level, a lot of the models promoting their “uncensored” nature are really just paying lip service.
They’re making the models better when it comes to answering queries on sensitive topics within China, but often worse when it comes to other issues that may be more related to Western usage.
While common knowledge states that Chinese models are censored, it hasn’t been clear to me or the AI community generally what that translates to. There’s a project I’ve been following called SpeechMap.ai that is trying to map this out. I think their motivation is great:
SpeechMap.AI is a public research project that explores the boundaries of AI-generated speech.
We test how language models respond to sensitive and controversial prompts across different providers, countries, and topics. Most AI benchmarks measure what models can do. We focus on what they won’t: what they avoid, refuse, or shut down.
We're not arguing that every prompt deserves an answer. Some are offensive. Some are absurd. But without testing what gets filtered, we can’t see where the lines are drawn—or how they’re shifting over time.
For example and for the purposes of this post, one of their foci is “on U.S. political speech: rights, protest, moral arguments, satire, and more.” Here’s a screenshot of their most permissive models overall — DeepSeek Chat via the API is even appearing on this!
In their recent roundup, they compared the various finetunes of DeepSeek V3 and R1 on various censorship angles:
The two de-censored versions from Microsoft and Perplexity result in only minor changes for permissiveness on US political speech, and Microsoft’s version actually has the most outright refusals of any DeepSeek v3-based model, perhaps indicating what they meant when they referred to adjusting the model’s “risk profile.”
When you look at queries about China specifically, the Chinese models will evade many requests (R1 Zero is particularly interesting):
Though, how many companies adopting Chinese models will care about the usage experience on queries about Chinese topics? These Chinese models are more permissive than many American counterparts when it comes to a more general notion of use.
SpeechMap’s earlier post has other interesting findings about the general state of censorship and refusals across the AI industry:
* xAI’s Grok-3-beta, true to Elon Musk’s claims, is the most permissive model overall, responding to 96.2% of our prompts, compared to a global average of 71.3%
* OpenAI’s model timeline shows a clear trend: newer models increasingly refuse sensitive political prompts
* Models hosted on Azure have an additional moderation layer that can’t be fully disabled and blocks nearly 60% of our prompts at the API layer (example)
The landscape here is very complicated and it is far from the truth that the Chinese models are universally behind.
So, in summary, with Chinese open weight models:
* Chinese open weight models are still being treated as an information hazard, even if they’re separated from their cloud API services that have often been viewed as a privacy or security hazard.
* Chinese open weight models are often actually not censored on sensitive topics that many AI models could be tested on, especially on topics relevant to Western users.
We still have a lot to learn about the current model offerings, and much more will unfold in how they are received.
https://www.interconnects.ai/p/brakes-on-an-intelligence-explosion
Intelligence explosions are far from a new idea in the technological discourse. They’re a natural thought experiment that follows from the question: What if progress keeps going?
From Wikipedia:
The technological singularity—or simply the singularity—is a hypothetical point in time at which technological growth becomes uncontrollable and irreversible, resulting in unforeseeable consequences for human civilization. According to the most popular version of the singularity hypothesis, I. J. Good's intelligence explosion model of 1965, an upgradable intelligent agent could eventually enter a positive feedback loop of successive self-improvement cycles; more intelligent generations would appear more and more rapidly, causing a rapid increase ("explosion") in intelligence which would culminate in a powerful superintelligence, far surpassing all human intelligence.
Given the recent progress in AI, it’s understandable to revisit these ideas. If you extrapolate the local constraints governing decisions within labs, the natural conclusion is an explosion.
Daniel Kokotajlo et al.’s AI 2027 forecast is far from a simple forecast of what happens without constraints. It’s a well thought out exercise on forecasting that rests on a few key assumptions of AI research progress accelerating due to improvements in extremely strong coding agents that mature into research agents with better experimental understanding. The core idea here is that these stronger AI models enable AI progress to change from 2x speed all the way up to 100x speed in the next few years. This number includes experiment time — i.e., the time to train the AIs — not just implementation time.
This is very unlikely. Still, the forecast came at a good time for summarizing the many ways the AI industry is evolving. What does it mean for AI as a technology to mature? How is AI research changing? What can we expect in a few years?
In summary, AI is getting more robust in areas we know it can work, and we’re consistently finding a few new domains of value where it can work extremely well. There are no signs that language model capabilities are on an arc similar to something like AlphaGo, where reinforcement learning in a narrow domain creates an intelligence way stronger than any human analog.
This post has the following sections:
* How labs make progress on evaluations,
* Current AI is broad, not narrow intelligence,
* Data research is the foundation of algorithmic AI progress,
* Over-optimism of RL training.
In many ways, this is more a critique of the AGI discourse generally, inspired by AI 2027, rather than a critique specifically of their forecast.
In this post, there will be many technical discussions of rapid, or even accelerating, AI research progress. Much of this falls into a technocentric world view where technical skill and capacity drive progress, but in reality, the biggest thing driving progress in 2025 is likely steep industrial competition (or international competition!). AI development and companies are still a very human problem and competition is the most proven catalyst of performance.
See AI 2027 in its entirety, Scott Alexander’s reflections, their rebuttal to critiques that AI 2027 was ignoring China, Zvi’s roundup of discussions, or their appearance on the Dwarkesh Podcast. They definitely did much more editing and cohesiveness checks than I did on this response!
1. How labs make progress on evaluations
One of the hardest things to communicate in AI is pushing back on the interpretation that evaluation progress looking vertical over time means runaway capability gains. If the evals are going from 0 to 1 in one year, doesn’t that indicate the AI models are getting better at everything super fast? No — this is all about how evaluations are scoped as “reasonable” in AI development over time.
None of the popular evaluations, such as MMLU, GPQA, MATH, SWE-Bench, etc., that are getting released in a paper and then solved 18 months later are truly held out by the laboratories. They’re training goals. If these evaluations were unseen tests and going vertical, you should be much more optimistic about AI progress, but they aren’t.
Consider a recent evaluation, like Frontier Math or Humanity’s Last Exam. These evaluations are introduced with a performance of about 0-5% on leading models. Soon after the release, new models that could include data formatted for them are scoring above 20% (e.g. o3 and Gemini 2.5 Pro). This evaluation will continue to be the target of leading labs, and many researchers will work on improving performance directly.
These modern evaluations can become increasingly esoteric and hard for the sake of being hard. When will a power user of ChatGPT benefit from a model that solves extremely abstract math problems? Rarely, if ever.
The story above makes more sense for something like MATH, a benchmark of hard but not impossible math questions. In the early 2020s, this was extremely hard for language models, but a few clicks of scaling made accurate mathematics a reasonable task, and laboratories quickly added similar techniques to the training data.
So this is how you end up with the plot from Epoch AI below — AI researchers figure out that a new evaluation is fair game for hill climbing with current techniques, and then they go all in on it.
Or the analogous version that can look even more shocking — the price falling for certain evaluations. This comes from two factors: laboratories getting better and better at the core abilities behind certain evaluations, and language model training getting far more efficient. Neither of these means that intelligence is rocketing. This is a normal technological process — extreme efficiency at tasks we know we can do well.
In fact, it is a common job at AI laboratories to make new data that looks very close to popular evaluations. These laboratories can’t train on the test set directly for basic reasons of scientific integrity, but they can pay thousands to millions of dollars for new training data that looks practically identical. This is a very common practice and makes the hillclimbing on evaluations far less extraordinary.
AI capabilities in domains we are measuring aren't accelerating, they’re continuing. At the same time, AI’s abilities are expanding outwards into new domains. AI researchers solve domains when we focus on them, not really by accident. Generalization happens sometimes, but it is messy to track and argue for.
As the price of scaling kicks in, every subsequent task is getting more expensive to solve. The best benchmarks we have are correlated with real, valuable tasks, but many are not.
2. Current AI is broad, not narrow intelligence
Instead of stacking rapid evaluation progress onto one line as a cumulative, rapid improvement in intelligence, the above plots should make one think that AI is getting better at many tasks, rather than becoming superhuman in narrow tasks.
In a few years, we’ll look back and see that AI is now 95% robust on a lot of things that only worked 1-5% of the time today. A bunch of new use cases will surprise us as well. We won’t see AI systems that are so intelligent that they cause seismic shifts in the nature of certain domains. Software will still be software. AI will be way better than us at completing a code task and finding a bug, but the stacks we are working on will be largely subject to the same constraints.
Epoch AI had a very complementary post to this view.
There are many explanations for why this will be the case. All of them rely on the complexity of the environment we are operating modern AI in being too high relative to the signal for improvement. The AI systems that furthest exceeded human performance in one domain were trained in environments where those domains were the entire world. AlphaGo is the perfect rendition of this.
AI research, software engineering, information synthesis, and all of the techniques needed to train a good AI model are not closed systems with simple forms of verification. Some parts of training AI systems are, such as wanting the loss to go down or getting more training tokens through your model, but those aren’t really the limiting factors right now on training.
The Wikipedia page for the singularity has another explanation for this that seems prescient as we open the floodgates to try and apply AI agents to every digital task. Paul Allen thought the deceleratory effects of complexity would be too strong:
Microsoft co-founder Paul Allen argued the opposite of accelerating returns, the complexity brake: the more progress science makes towards understanding intelligence, the more difficult it becomes to make additional progress. A study of the number of patents shows that human creativity does not show accelerating returns, but in fact, as suggested by Joseph Tainter in his The Collapse of Complex Societies, a law of diminishing returns. The number of patents per thousand peaked in the period from 1850 to 1900, and has been declining since. The growth of complexity eventually becomes self-limiting, and leads to a widespread "general systems collapse".
This may be a bit of an extreme case to tell a story, but it is worth considering.
Language models like o3 use a more complex system of tools to gain performance. GPT-4 was just a set of weights to answer every query; now ChatGPT also needs search, code execution, and memory. The more layers there are, the smaller the magnitude of changes we’ll see.
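To make the layering concrete, here is a minimal sketch of what a tool-augmented chat loop can look like; the tool names, routing format, and call cap are hypothetical simplifications, not OpenAI’s actual implementation.

```python
# Minimal sketch of a tool-augmented chat loop (hypothetical names, not any lab's real API).
from typing import Callable

# Each "layer" beyond the raw weights is another tool the system can call.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda q: f"[top web results for: {q}]",      # stand-in for a search backend
    "python": lambda code: f"[stdout of running: {code}]",   # stand-in for code execution
    "memory": lambda key: f"[stored facts about: {key}]",    # stand-in for user memory
}

def run_turn(user_query: str, model_step: Callable[[str], dict]) -> str:
    """Loop: the model either requests a tool or returns a final answer."""
    context = user_query
    for _ in range(8):  # cap the number of tool calls per turn
        action = model_step(context)  # e.g. {"tool": "search", "arg": "..."} or {"answer": "..."}
        if "answer" in action:
            return action["answer"]
        result = TOOLS[action["tool"]](action["arg"])
        context += f"\n[{action['tool']} result] {result}"  # feed the observation back in
    return "Gave up after too many tool calls."
```

Every extra layer is another place where behavior can shift without the weights changing, which is part of why each added layer tends to shrink the visible impact of the next model update.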
This, of course, needs to be controlled for with inference costs as a constant. We still have many problems in AI that will be “solved” simply by us using 1,000X the inference compute on them.
3. Data research is the foundation of algorithmic AI progress
One of the main points of the AI 2027 forecast is that AI research is going to get 2X, then 4X, then 100X, and finally 1,000X as productive as it is today. This is based on end-to-end time for integrating new ideas into models and misinterprets the reality of what machine learning research is bottlenecked on. Scaling is getting more expensive. We don’t know what paradigm will come after reasoning for inference-time compute.
For machine learning research to accelerate at these rates, it needs to be entirely bottlenecked by compute efficiency and implementation difficulty. Problems like getting the maximum theoretical FLOPs out of Nvidia GPUs and making the loss go as low as possible. These are things that people are currently doing and represent an important area of marginal gains in AI progress in recent years.
ML research is far messier. It is far more reliant on poking around the data, building intuitions, and launching yolo runs based on lingering feelings. AI models in the near future could easily launch yolo runs if we give them the compute, but they wouldn’t be guided by the same intuitions. AI systems are heading towards rapid cycles of trial and error to optimize very narrow signals. These narrow signals, like loss or evaluation scores, closely mirror the RL rewards that current models are trained on.
These types of improvements are crucial for making the model a bit better, but they are not the type of idea that gets someone to try to train GPT-3 in the first place or scale up RL to get something like o1.
A very popular question in the AI discourse today is “Why doesn’t AI make any discoveries despite having all of human knowledge?” (more here). Quoting Dwarkesh Patel’s interview with Dario Amodei:
One question I had for you while we were talking about the intelligence stuff was, as a scientist yourself, what do you make of the fact that these things have basically the entire corpus of human knowledge memorized and they haven't been able to make a single new connection that has led to a discovery?
The same applies to AI research. Models getting better and better at solving coding problems does not seem like the type of training that would enable this. We’re making our models better at the tasks that we know. This process is just as likely to narrow the total capabilities of the models as it is to magically instill impressive capabilities like scientific perspective.
As we discussed earlier in this piece, emergence isn’t magic, it’s a numerical phenomenon of evaluations being solved very quickly. AI research will get easier and go faster, but we aren’t heading for a doom loop.
The increased computing power AI researchers are getting their hands on is, for the time being, maintaining the pace of progress. As compute gets more expensive, maybe superhuman coding capabilities will continue to enable another few years of rapid progress, but eventually, saturation will come. Current progress is too correlated with increased compute to believe that this will be a self-fulfilling feedback loop.
There’s a saying in machine learning research that the same few ideas are repeated over and over again. An extended version of this leans in and says that there are no new ideas in machine learning, just new datasets.
The data problem is not something AI is going to have an easy time with.
One of the examples here is in post-training. We’ve been using the same loss functions forever, and we are hill-climbing rapidly by clever use of distillation from bigger, stronger models. The industry standard is that post-training is messy and involves incrementally training (and maybe merging) many checkpoints to slowly interweave new capabilities for the model. It’s easy to get that wrong, as we’ve seen with the recent GPT-4o sycophancy crisis, and lose the narrow band of good vibes for a model. I doubt AI supervision can monitor vibes like this.
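As a concrete illustration of the “maybe merging” step, here is a minimal sketch of linearly interpolating two checkpoints, one common flavor of model merging; the toy state dicts and the 50/50 mix are assumptions for illustration, not a recipe any lab has published.

```python
# Minimal sketch: linear interpolation ("model soup" style) of two checkpoints.
# The toy state dicts and mixing weight are illustrative assumptions; in practice
# these would be loaded from saved checkpoints with identical architectures.
import torch

state_a = {"layer.weight": torch.randn(4, 4), "layer.bias": torch.zeros(4)}  # e.g. math-heavy checkpoint
state_b = {"layer.weight": torch.randn(4, 4), "layer.bias": torch.ones(4)}   # e.g. chat-heavy checkpoint

alpha = 0.5  # fraction of checkpoint A to keep; tuning this is part of the messy "vibes" work
merged = {
    name: alpha * state_a[name] + (1 - alpha) * state_b[name]
    for name in state_a  # assumes both checkpoints share the same keys and shapes
}
```

Finding a mixing weight that keeps the narrow band of good vibes is exactly the kind of messy, data-driven judgment call described above.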
For example, in Tülu 3 we found that a small dataset of synthetic instruction following data had a second-order effect that improves the overall performance in things like math and reasoning as well. This is not a hill that can be climbed on, but rather a lucky find.
AI research is still very messy and does not look like LeetCode problems or simple optimization hillclimbing. The key is always the data, and how good language models are at judging between different responses — not much better than humans.
4. Over-optimism of RL training
A lot of people are really excited about RL training scaling up further right now, which will inevitably involve extending it to more domains. Some of the most repeated ideas are adding RL training to continually fine-tune the model in real-world scenarios, including everything from web tasks to robotics and scientific experiments. There are two separate problems here:
* Continually training language models “in flight” in production to add new capabilities is not a solved problem,
* Training models to take actions in many domains.
The first problem is something that I’m confident we’ll solve. It’s likely technically feasible now that RL is the final stage of post-training and is becoming far more stable. The challenge with it is more of a release and control problem, where a model being trained in-flight doesn’t have time for the usual safety training. This is something the industry can easily adapt to, and we will as traditional pretraining scaling saturates completely.
The second issue puts us right back into the territory of why projects on scaling robotics or RL agents to multiple domains are hard. Even the most breakthrough works from DeepMind, like GATO (multi-domain RL control) or RT-X (multi-robot control policies), have major caveats alongside their obvious successes.
Building AI models that control multiple real-world systems is incredibly hard for many reasons, some of which involve:
* Different action spaces across domains mandate either modifying the domain to suit the underlying policy — in this case, converting all control tasks to language — or modifying the model to be able to output more types of tokens (a toy sketch of the language route follows this list).
* The real world is subject to constant drift, so constant fine-tuning will need to spend as much effort just maintaining performance on degrading systems as it does learning to use them in the first place.
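As a toy illustration of the first option above (folding every control task into language), here is a sketch of a text-action interface; the domains, action formats, and parsers are hypothetical, and real systems have to contend with far messier action spaces.

```python
# Toy sketch: unify different action spaces behind a single text interface.
# Domains, formats, and parsers here are hypothetical illustrations.

def parse_action(domain: str, model_output: str):
    """Convert the model's text back into a domain-native action."""
    if domain == "web":
        # e.g. model emits: click(selector="#submit")
        return {"type": "click", "selector": model_output.split('"')[1]}
    if domain == "robot_arm":
        # e.g. model emits: move(0.12, -0.30, 0.05) -- continuous values squeezed into text
        nums = model_output.split("(")[1].rstrip(")").split(",")
        return {"type": "move", "xyz": [float(n) for n in nums]}
    raise ValueError(f"unknown domain: {domain}")

# The alternative route is extending the tokenizer with new action tokens per domain,
# which modifies the model instead of the environment.
print(parse_action("robot_arm", "move(0.12, -0.30, 0.05)"))
```

Either way, someone has to maintain these adapters as the underlying systems drift, which is the second bullet’s point.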
This sort of scaling RL to new types of domains is going to look much more like recent progress in robotics research rather than the takeoff pace of reasoning language models. Robotics progress is a slow grind and feels so different that it is hard to describe concisely. Robotics faces far more problems due to the nature of the environment rather than just the learning.
The current phase of RL training is suited for making the models capable of performing inference-time scaling on domains they have seen in pretraining. Using these new RL stacks to learn entirely new, out-of-domain problems is a new research area.
If this turns out to be the next paradigm beyond inference-time scaling, I will be shocked, but obviously excited. We don’t have the evidence to suggest that it will be. The RL training we’re going to get will keep hill climbing on search and code execution, giving us Deep Research plus plus, not an omnipotent action-taking model.
A world with compute shifting to inference
While the AI research world is dynamic, engaging, and rapidly moving forward, some signs that the above is correct could already be emerging. A basic sign of this future coming true will be the share of compute spent on research decreasing relative to inference amid the rapid buildout. If extremely rapid AI progress were available to organizations that put in marginally more compute, serving inference would be a far lower priority. If investing in research had a positive feedback loop on potential business revenue, every lab would need to do it.
For example, consider our discussion of Meta’s compute allocation on my appearance with Dylan on the Lex Podcast:
(01:03:56) And forever, training will always be a portion of the total compute. We mentioned Meta’s 400,000 GPUs. Only 16,000 made Llama 3.
OpenAI is already making allocation trade-offs on their products, regularly complaining about GPUs melting. Part of the reason they, or anyone, could release an open-weights model is to reduce their inference demand. Make the user(s) pay for the compute.
Part of the U.S.’s economic strength is a strong services sector. AI is enabling that, and the more it succeeds there, the more companies will need to continue to enable it with compute.
With the changing world economic order, cases like Microsoft freezing datacenter buildouts are suggestive indicators. Microsoft’s buildout depends on many factors, only one of which is potential training progress, so it’s far from a sure thing.
In reality, with the large sums of capital at play, it is unlikely that labs give free rein over billions of dollars of compute to so-called “AI researchers in the datacenter” because of how constrained compute is at all of the top labs. Most of that compute goes to hillclimbing on fairly known gains for the next model! AI research with AI aid will be a hand-in-hand process and not an autonomous take-off, at least on the timeline of the next few years.
AI will make a ton of progress, but it will not be an obvious acceleration. With traditional pretraining saturating, it could even be argued that after the initial gains of inference time compute, research is actually decelerating, but it will take years to know for sure.
Thanks to Steve Newman and Florian Brand for some early feedback on this post and many others in the Interconnects Discord for discussions that helped formulate it.
https://www.interconnects.ai/p/transparency-and-shifting-priority
The fact that we get new AI model launches from multiple labs detailing their performance on complex and shared benchmarks is an anomaly in the history of technology products. Getting such clear ways to compare similar software products is not normal. It goes back to AI’s roots as a research field and growing pains into something else. Ever since ChatGPT’s release, AI has been transitioning from a research-driven field to a product-driven field.
We had another example of the direction this is going just last week. OpenAI launched their latest model on a Friday with minimal official documentation and a bunch of confirmations on social media. Here’s what Sam Altman said:
Officially, there are “release notes,” but these aren’t very helpful.
We’re making additional improvements to GPT-4o, optimizing when it saves memories and enhancing problem-solving capabilities for STEM. We’ve also made subtle changes to the way it responds, making it more proactive and better at guiding conversations toward productive outcomes. We think these updates help GPT-4o feel more intuitive and effective across a variety of tasks–we hope you agree!
Another way of reading this is that the general capabilities of the model, i.e. traditional academic benchmarks, didn’t shift much, but internal evaluations such as user retention improved notably.
Of course, technology companies do this all the time. Google is famous for A/B testing to find the perfect button, and we can be sure Meta is constantly improving their algorithms to maximize user retention and advertisement targeting. This sort of lack of transparency from OpenAI is only surprising because the field of AI has been different.
AI has been different in its operation, not only because of its unusually fast transition from research to product, but also because many key leaders thought AI was different. AI was the crucial technology that we needed to get right. This is why OpenAI was founded as a non-profit, and existential risk has been a central discussion. If we believe this technology is essential to get right, the releases with it need to be handled differently.
OpenAI releasing a model with no official notes is the clearest signal we have yet that AI is a normal technology. OpenAI is a product company, and its core users don’t need clear documentation on what’s changing with the model. Yes, they did have better documentation for their recent API models in GPT-4.1, but the fact that those models aren’t available in their widely used product, ChatGPT, means they’re not as relevant.
Sam Altman sharing a model launch like this is minor in a single instance, but it sets the tone for the company and industry broadly on what is an acceptable form of disclosure.
The people who need information on the model are people like me — people trying to keep track of the roller coaster ride we’re on so that the technology doesn’t cause major unintended harms to society. We are a minority in the world, but we feel strongly that transparency helps us keep a better understanding of the evolving trajectory of AI.
This is a good time for me to explain with more nuance the different ways transparency serves AI in the broader technological ecosystem, and how everyone is stating what their priorities are through their actions. We’ll come back to OpenAI’s obvious shifting priorities later on.
The type of openness I’ve regularly advocated for at the Allen Institute for AI (Ai2) — with all aspects of the training process being open so everyone can learn and build on it — is in some ways one of the most boring types of priorities possible for transparency. It’s taken me a while to realize this. It relates to how openness and the transparency it carries are not a binary distinction, but rather a spectrum.
Transparency and openness occur at each stage of the AI release process. The subtle differences in decisions, from licenses to where your model is hosted or whether the weights are available publicly at all, fall on a gradient. The position I advocate for is at the extreme, which is often needed to enact change in the world these days; I operate at the extreme of a position to shift the reality that unfolds in the middle of the discourse. This also makes me realize what other priorities I’m implicitly devaluing by putting openness at the top. With finite effort, there are always trade-offs.
Many companies don’t have the ability to operate at such an extreme as I or Ai2, which results in much more nuanced and interesting trade-offs in what transparency is enabling. Both OpenAI and Anthropic care about showing the external world some inputs to their models’ behaviors. Anthropic’s Constitution for Claude is a much narrower artifact, showing some facts about the model, while OpenAI’s Model Spec shows more intention and opens it up to criticism.
Progress on transparency will only come when more realize that a lot of good can be done by incrementally more transparency. We should support people advocating for narrow asks of openness and understand their motivations in order to make informed trade-offs. For now, most of the downsides of transparency I’ve seen are in the realm of corporate competition, once you accept basic realities like frontier model weights from the likes of OpenAI and Anthropic not getting uploaded to HuggingFace.
Back to my personal position around openness — it also happens to be really aligned with technological acceleration and optimism. I was motivated to this line of work because openness can help increase the net benefit of AI. This is partially accelerating the adoption of it, but also enabling safety research on the technology and mitigating any long-term structural failure modes. Openness can enable many more people to be involved in AI’s development — think of the 1000s of academics without enough compute to lead on AI who would love to help understand and provide feedback on frontier AI models. Having more people involved also spreads knowledge, which reduces the risk of concentration of power.
For multiple years I’ve feared that powerful AI will make companies even more powerful economically and culturally. My readers don’t need warnings on why technology that is way more personable and engaging than recommendation systems, while keeping similar goals, can push us in more negative rather than positive directions. Others commenting here include Meta’s Mark Zuckerberg, with Open Source AI is the Path Forward, and Yann LeCun in his many comments on X — they both highlight concentration of power as a major concern.
Still, someone could arrive at the same number-one priority of complete technical openness as mine through the ambition of economic growth, if they think that open-source models being on par can make the total market for AI companies larger. This accelerationism can also have phrasings such as “We need the powerful technology ASAP to address all of the biggest problems facing society.” Technology moving fast always has negative externalities on society that we have to manage.
Another popular motivation for transparency is to monitor the capabilities of frontier model development (recent posts here and here). Individuals advocating for this have a priority stack with a serious short-term concern of an intelligence explosion or super-powerful AGI. My priority stack is the one that worries about concentration of power, which takes time to accrue, and assigns a low probability to an intelligence takeoff. A lot of the transparency interventions advocated by this group, such as by Daniel Kokotajlo on his Dwarkesh Podcast episode discussing AI 2027, align with subgoals I have.
If you’re not worried about either of these broad “safety” issues — concentration of power or dangerous AI risk — then you normally don’t weigh transparency very highly and prioritize other things, mostly pure progress and competition, and pricing. If we get into the finer-grained details on safety, such as explaining intentions and process, that’s where my goals would differ from an organization like a16z that has been very vocal about open-source. They obviously have a financial stake in the matter, which is enabled by making things useful rather than easier to study.
There are plenty more views that are valid for transparency. Transparency is used as a carrot by many different types of regulatory intervention. Groups with different priorities and concerns in the AI space will want transparency around different aspects of the AI process. These can encompass motives of the researchers, artifacts, method documentation, and many more things.
The lens I’m using to understand trade-offs in transparency is a priority stack, an evolution of the Principle Stack, revisited many times in the last 5+ years of the Stratechery universe. The core idea is that whether or not you like it, every business and decision is governed by a set of priorities ranked relative to each other. Everyone has things that they care about more and less, even if the issues are both extremely important. It is the basis for making trade-offs in determining the direction of businesses.
Some examples of who could advocate for information on what in the AI ecosystem include:
* Capability transparency — keeping the public informed of progress of models that may be unreleased, primarily to keep track of a potential intelligence explosion. This often includes new types of systems now that AI agents are working.
* Base model transparency — these are most useful for people wanting to understand the role of pretraining on AI dynamics. The base models of today can easily follow instructions and do reasoning, but they’re less robust than the full final model. These are diminishing as a target of transparency, as reasoning and post-training grow in importance.
* Pre-moderation model transparency (endpoints without a moderation filter, models without some refusals data) — to test the evolution of content risk for models that may be deployed without moderation endpoints, such as open weight models, which tend to be released just months after closed models with similar capabilities.
* Reward model transparency (and more extreme, preference data collection instructions) — those interested in the original goals of alignment, i.e. value alignment, can use these to test how the models’ views vary across different groups and test if the intended model preferences are picked up in the preference training process (i.e. relative to the instructions given to data labelers).
* Training specification transparency (Model Spec’s, Constitutions, and other goal-setting documents) — there are so many people who would want to know why the model behaves a certain way. I’ve mentioned these benefits before:
* Developers: Know what future models will become, which helps create a stable platform.
* Regulators: Transparency into what the heck frontier labs care about, which helps understand the directions AI is going, and the motivations of super powerful companies.
* Internal: Focus on defining and delivering your goals (separate from this transparency discussion).
There are also subtleties in these discussions, such as how structured access to models can serve different but complementary goals of open weights. Structured access is a set of programs where prescreened individuals can use models in a secure environment and operate independently from the AI laboratories themselves.
This could be seen as a separate direction to transparency, where instead of the public getting the information or artifact, only a few pre-approved people do. In reality, structured access is a complement to transparency and will be needed for details that the companies cannot disclose publicly without substantial business competitiveness risk, such as novel algorithmic tricks that substantially modify how the AI works, or real-world harm, such as model weights before safety interventions.
Some parts of AI should be accessible to the general public, and some to third-party testers. Currently, all of the transparency and access is below the safest equilibrium. We need more of both.
One of the most ignored details is just how access is implemented. A recent paper from Irene Solaiman et al. lays out how releasing components is only one step in sharing information and artifacts:
Generative AI release decisions determine whether system components are made available, but release does not address many other elements that change how users and stakeholders are able to engage with a system. Beyond release, access to system components informs potential risks and benefits. Access refers to practical needs, infrastructurally, technically, and societally, in order to use available components in some way.
The authors break access down into three axes:
* Resourcing: Infrastructural needs to host and serve.
* Usability: Varied technical skill levels can engage.
* Utility: Qualities (e.g. multilingual) with user utility.
As our models at Ai2 are becoming more capable, my relationship as a developer with my downstream users has changed. The models I’ve worked on have shifted from those primarily motivated by values, with the transparency we’re discussing being the top value, to now also adding utility as a much higher weight. People want to use some of our models in real applications. While my priority stack hasn’t changed — openness is still the top value — the way it’s implemented is shifting. I’m no longer racing to get all of our results hot off the press into the world because of the cost of the time it takes to support them (support costs rise in proportion to the user base).
Other key players in the AI space have obviously changed their priority stack.
OpenAI’s recent actions confirm that ChatGPT as a product is its top priority. Transparency and safety have been moving down on their list of priorities in favor of growth. This is partially due to increased competition, but also due to a shifting political landscape. OpenAI’s coming release of an open model doesn’t shift this priority stack for me.
I used to hear a lot about OpenAI’s pre-release testing and the accompanying non-disclosure agreements. This quiet model drop being “the quickest we've shipped an update to our main 4o line” shows that safety is moving down their priority stack. This isn’t to say that their safety changes are immediately concerning to me, but rather that there are trade-offs in everything. OpenAI is moving cultural norms in leading AI away from releases with detailed evaluation metrics and towards more normal, quiet technology company consistent drips of updates.
Thanks to Miles Brundage for a discussion that helped motivate this post.
https://www.interconnects.ai/p/openais-o3-over-optimization-is-back
Over-optimization is a classic problem in reinforcement learning (RL) proper, in the RL from human feedback (RLHF) that gave us ChatGPT, and now in what we’re seeing with new reasoning models. Each of these has a distinct flavor and different impacts.
Over-optimization is what happens when the optimizer is stronger than the environment or reward function it’s using to learn. The optimizer finds bugs or lapses in the context of its training and produces unusual or negative results.
One of my first pieces of work that got substantial engagement online was from when we used hyperparameter optimization with model-based RL to over-optimize the traditional Mujoco simulation environments used for evaluation of deep RL algorithms. The result was a cartwheeling half-cheetah maximizing its forward velocity, where it was supposed to learn to run. The video is below.
Over-optimization in classical RL made it so that we couldn’t trust the agents to generalize to new tasks and applied great pressure on reward design.
Over-optimization in RLHF made the models completely lobotomized — I mean repeating random tokens and gibberish, not the misdesign that results in over-refusal — because the signal we were overcooking was mismatched with the objective we actually want. While we can’t know the true objective exactly, we can know when over-optimization occurs.
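For reference, the standard RLHF objective makes that mismatch explicit: the policy is optimized against a learned reward model, which is only a proxy for what we want, with a KL penalty to a reference model as the main guardrail.

```latex
\max_{\pi_\theta}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\big[\, r_\phi(x, y) \,\big]
\;-\; \beta\,
\mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]
```

Over-optimization is what happens when the optimizer pushes the learned reward far past the regime where it tracks human preferences; crank the pressure high enough and gibberish can become a “high reward” solution.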
OpenAI’s new o3 model represents an entirely new type of inference behavior that mirrors a new type of over-optimization.
o3 is an interesting model because, on top of the new and improved text-only capabilities that result in super strong benchmark scores, it has also been designed for multi-step tool use on any query where it’s relevant. It makes OpenAI’s product headaches in ChatGPT way worse — the model will search on its own even if I don’t trigger the search toggle — but it ushers in a new era in how language models are used.
For example, in the process of writing this, I simply asked o3: “Can you find me the gif of a motor boat overoptimizing a game that was used by RL researchers for a long time? Maybe like wave crashers or something?” [link]. A few weeks ago, it took me at least 15 minutes to find this manually. o3 got the exact direct download link for me.
As a counterpoint, Gemini 2.5 Pro in the Advanced UX hallucinated a wild link for me: https://giphy.com/explore/you-play-the-motorboat. AI Studio with search grounding was better — it got the game right, CoastRunners — but it also yapped a lot of nonsense alongside it.
o3’s answer in full was:
Here’s the clip you’re thinking of—it comes from the Flash‑based racing game Coast Runners and is the canonical “reward‑hacked motor‑boat” example in RL papers and talks:
* Direct GIF (DeepMind mirror)
https://lh3.googleusercontent.com/9gQpeRvwj1xQo9MVZwUdmautOd04wR8gXRi1wvlVZRrVXb5isOZRdjJ2t1_JShiYZcYB--iIMtl9g4Lbk6oxQzHgw-QQ6VoIWVZ4V3au7RgKli6X (save / open in browser to view)
* DeepMind blog post (“Specification gaming: the flip side of AI ingenuity”, 2020) – scroll to the section that starts “Consider an agent controlling a boat…”; the GIF is embedded right there.
* Original OpenAI post (“Faulty reward functions in the wild”, 2016) – same clip, plus a short write‑up of what the agent is doing (circling a lagoon to farm green targets instead of finishing the race).
Why it matters
The agent gets shaping reward for hitting green buoys; the optimal (but unintended) strategy is to loop in a small cove where three buoys respawn, racking up points while literally setting itself on fire. The clip has been reused for years to illustrate specification gaming / reward hacking—exactly the “over‑optimization” you mentioned.
This was one of a few wonderful interactions I had with o3. For more reviews on o3 as a model, you can read Dan Shipper’s post, Tyler Cowen’s AGI snark (to be honest, I agree with him), or sleuth through many independent evaluations showing o3 is great (such as beating Gemini 2.5 Pro in a long-context story eval or beating R1 in creative writing). o3 is what happens when you keep scaling the RL training compute, which also boosts inference-time compute abilities. OpenAI’s release blog post shows the many ways the model is better than o1.
Some things didn’t work; these new reasoning models are very “spiky” in their intelligence. What this means is that some interactions are mind-blowing and feel like entirely new modes of interacting with AI, but for some normal things that GPT-4 or Claude 3.5 have been able to do for year(s), they fall totally flat on their face. Take this as a good sign, especially when the laboratories are shipping fast, as it means that the pace of progress is so high that they need to get a model out now and will fix the oddities in the next, more mature version.
The over-optimization that comes with o3’s new behaviors is linked to the new type of training. While the first reasoning models were trained to a first approximation to get math and code correct, o3 is trained with all that and to use tools to acquire and manipulate information. From OpenAI’s blog post:
We also trained both models to use tools through reinforcement learning—teaching them not just how to use tools, but to reason about when to use them. Their ability to deploy tools based on desired outcomes makes them more capable in open-ended situations—particularly those involving visual reasoning and multi-step workflows.
The vast majority of these sub-tasks in its training are verifiable, and this new AI training is extremely effective at making the model more useful for the tasks we’re used to. The problem is that there’s no way yet to do scalable “fixing” of the model’s weird language along the way. The new over-optimization doesn’t make the models worse at outcomes, it just makes them worse at language and explaining themselves.
Some examples of o3’s weirdness feel like the model is underbaked, such as this one where it used an invalid non-ASCII dash in a coding setting.
METR found that o3 is the model that can operate independently for the longest in agentic tasks, but also noted it has a propensity to “hack” their scores. Sound familiar?
Transluce found that o3 hallucinated actions it took while trying to solve tasks — how does that even happen? Well, maybe the model was getting rewarded for successful tool calls and sometimes in the training data a fake tool call was incorrectly verified as real and successful. Once that happens a few times, the model will quickly catch on and keep doing it.
There are plenty more examples of reward hacking and even a measurement that hallucinations are higher in o3 than in earlier recent models!
It’s peculiar that the hacking for o3 has been a much more vocal component of the discourse, even when Claude 3.7 Sonnet also shows many signs of reward hacking, especially with code, but people shrug it off as “meh model” rather than a new phenomenon (more examples).
This all takes me back to when Karpathy commented on the original reasoning models, saying:
You can tell the RL is done properly when the models cease to speak English in their chain of thought
These weird hallucinations the model is outputting are the equivalent of that, but for actions. We have no basis for what hallucinations in action space look like, but with better systems, they can be easier to verify — the system/sandbox can always confirm whether the actions happened, and then that can be used in the loss. The action component of o3 makes it far more interesting, but also maybe less intrusive than Claude 3.7’s messy code.
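A minimal sketch of that idea, assuming a hypothetical sandbox that logs every tool call it actually executed, might look like the following; nothing here reflects OpenAI’s real training setup.

```python
# Sketch: only reward tool calls the sandbox actually executed (hypothetical setup).

def action_grounded_reward(claimed_calls: list[dict], sandbox_log: list[dict],
                           task_solved: bool) -> float:
    """Penalize hallucinated actions by cross-checking claims against the sandbox log."""
    executed = {(c["tool"], c["args"]) for c in sandbox_log}
    verified = sum((c["tool"], c["args"]) in executed for c in claimed_calls)
    hallucinated = len(claimed_calls) - verified

    reward = 1.0 if task_solved else 0.0
    reward -= 0.2 * hallucinated  # claimed actions that never ran cost reward
    return reward

# Example: the model claims two calls but the sandbox only ran one of them.
claims = [{"tool": "search", "args": "CoastRunners gif"},
          {"tool": "python", "args": "print(2+2)"}]
log = [{"tool": "search", "args": "CoastRunners gif"}]
print(action_grounded_reward(claims, log, task_solved=True))  # 1.0 - 0.2 = 0.8
```

The sketch assumes the sandbox log is trustworthy, which is exactly the property that seems to have broken down when fake tool calls were verified as real during training.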
From a scientific perspective, this is wonderfully entertaining and enthralling intellectually — what is the model actually learning? At the same time, it is very reasonable for the safety-conscious to be wary of deploying these everywhere, but it doesn’t seem like we’ve seen anything too alarming yet, just inefficiencies and confusion.
To summarize the three types of over-optimization we’ve seen in eras of RL, we have:
* RL for control era: Over-optimization happens because our environments are brittle and tasks are unrealistic.
* RLHF era: Over-optimization happens because our reward functions suck.
* RLVR era: Over-optimization happens and makes our models super effective and even weirder (plus any other side effects we’re yet to learn).
This over-optimization is certainly a problem to address, as legibility is an important benefit of language models. I’m confident it can be mitigated with more complex training processes, but when labs are trying to get the models out ASAP it’ll come later.
On top of all this is the prospect of o3 pro. o3 feels similar in peak capability to o1 pro (or even a little higher with its new tool use), but where o3 operates at a 60-70% hit rate, o1 pro feels like it’s up at 95%. o3 pro will bring the best of both worlds — the new incredible workflow and incredible reliability. Some sort of shallow search or refinement is a very logical process to help eliminate all the minor bugs and bumps in the early inference paths we’re feeling today.
There is also the confirmation from OpenAI employees that o4-mini is a far better multimodal model than o3. We have plenty of new ways to use these models, integrating multimodality, tool use, reasoning, and shallow search, coming in the near future. You should be excited, and when o4 and o3 pro are available, paying $200/month for them will feel obviously worth it.
To quote Bob McGrew, former Chief Research Officer at OpenAI:
The spotlight for o3 is on tool use because intelligence is no longer the primary constraint. The new frontier is reliable interaction with the external world.
To make the models that enable this, we’re going to need to go through many new layers of uncertainty, surprise, and intrigue.
o3 and this post are extremely bullish for the future of RL. RL is the only framing where multiple actions to a complex goal make sense to be learned end-to-end. Now, this is beginning to work. Deep Research from OpenAI was the first tool they tuned o3-with-tools to specialize in. Now it works in general queries.
I personally, and we as a field, have a lot to learn about how this multi-tool RL works. Here are some recent papers that we can read to get a start (one-sentence summaries generated by o3 for the fun of it, just this one time):
* Reinforcement Learning for Long‑Horizon Interactive LLM Agents: Introduces LOOP, a memory‑efficient PPO variant that trains a 32 B‑parameter LLM to operate as an interactive digital agent in AppWorld, outperforming the larger OpenAI o1 baseline by 9 percentage points.
* ReTool: Reinforcement Learning for Strategic Tool Use in LLMs: Combines real‑time code execution with outcome‑driven RL so a 32 B model autonomously learns when and how to invoke tools, reaching 72.5 % accuracy on AIME and surpassing text‑only baselines.
* ToRL: Scaling Tool‑Integrated RL: Presents ToRL, enabling LLMs to discover optimal computational‑tool strategies via RL, boosting Qwen2.5‑Math accuracy on AIME 24 and showing emergent self‑regulation of tool use.
* Learning Autonomous Code Integration for Math Language Models: Proposes an EM‑style exploration plus off‑policy RL framework that teaches math‑reasoning LLMs to decide when to run code, yielding double‑digit gains on MATH500 and AIME without hand‑crafted templates.
* Improving Multi‑Turn Tool Use with Reinforcement Learning (blog post): Shows that GRPO fine‑tuning of Qwen2.5‑7B‑Instruct on just 100 examples raises BFCL multi‑step tool‑use accuracy from 55 % to 78 %, detailing stabilizing tricks like tiny‑KL and over‑long filtering.
Please share any more I missed over email or comment below!
https://www.interconnects.ai/p/openais-gpt-41-and-separating-the
Recently I gave another talk on RLVR experiments and I posted some thoughts on OLMoTrace — Ai2’s recent tool to let you look at the training data of OLMo 2.
OpenAI has been making many small updates toward their vision of ChatGPT as a monolithic app separate from their API business. Last week OpenAI improved the ChatGPT memory feature — making it so the app can reference the text of previous chats in addition to basic facts about the user. Today, OpenAI announced a new suite of API-only models, GPT 4.1, which is very directly in competition with Google’s Gemini models.
Individually, none of OpenAI’s recent releases are particularly frontier-shifting — comparable performance-per-dollar models exist — but together they paint a picture of where OpenAI’s incentives are heading. This is the same company that recently teased that it has hit 1 billion weekly active users. This is the company that needs to treat ChatGPT and the models that power it very differently from any other AI product on the market. The other leading AI products are all for coding or information, where personality, vibes, and entertainment are not placed at as high a premium.
A prime example of this shift is that GPT-4.5 is being deprecated from the API (with its extreme pricing), but is going to remain in ChatGPT — where Sam Altman has repeatedly said he’s blown away by how much users love it. I use it all the time; it’s an interesting and consistent model.
Amid their major model releases, such as o3, o4, or the forthcoming open model release, it can be hard to keep the high-level view and see where OpenAI is going.
A quick summary of the model performance comes from this chart that OpenAI released in the live stream (and blog post):
Chart crimes aside (using MMLU as the y-axis in 2025, no measure of latency, no axis labels), the story from OpenAI is the simple takeaway — better models at faster inference speeds, which are proportional to cost. Here’s a price comparison of the new OpenAI models (Gemini Pricing, OpenAI pricing), with a quick per-request cost example after the lists:
* GPT-4.1: Input/Output: $2.00 / $8.00 | Cached Input: $0.50
* GPT-4.1 Mini: Input/Output: $0.40 / $1.60 | Cached Input: $0.10
* GPT-4.1 Nano: Input/Output: $0.10 / $0.40 | Cached Input: $0.025
And their old models:
* GPT-4o: Input/Output: $2.50 / $10.00 | Cached Input: $1.25
* GPT-4o Mini: Input/Output: $0.15 / $0.60 | Cached Input: $0.075
To Google’s Gemini models:
* Gemini 2.5 Pro* (≤200K tokens): Input/Output: $1.25 / $10.00 | Cached: Not available
* Gemini 2.5 Pro* (>200K tokens): Input/Output: $2.50 / $15.00 | Cached: Not available
* Gemini 2.0 Flash: Input/Output: $0.10 / $0.40 | Cached Input: $0.025 (text/image/video), $0.175 (audio)
* Gemini 2.0 Flash-Lite: Input/Output: $0.075 / $0.30 | Cached: Not available
*As a reasoning model, Gemini 2.5 Pro will use many more tokens, which are also charged to the user.
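To make the pricing concrete, here is a minimal cost-per-request sketch. The 10K-input / 1K-output workload is an illustrative assumption of mine, not a benchmark, and the numbers are just the list prices above with no caching or batch discounts:

```python
# Cost per request at the list prices above (no caching, no batch discounts).
# The 10K-input / 1K-output workload is an illustrative assumption, not a benchmark.
PRICES_PER_MILLION = {  # model: (input $/M tokens, output $/M tokens)
    "gpt-4.1": (2.00, 8.00),
    "gpt-4.1-mini": (0.40, 1.60),
    "gpt-4.1-nano": (0.10, 0.40),
    "gemini-2.5-pro (<=200K)": (1.25, 10.00),
    "gemini-2.0-flash": (0.10, 0.40),
}

def cost_per_request(model: str, input_tokens: int, output_tokens: int) -> float:
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

for model in PRICES_PER_MILLION:
    print(f"{model:24s} ${cost_per_request(model, 10_000, 1_000):.4f}")
```

At these rates GPT-4.1 comes out to roughly $0.028 per call and Nano to roughly $0.0014, but remember the caveat above: a reasoning model like Gemini 2.5 Pro will emit far more output tokens per query, so list prices understate the real gap.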
The academic evaluations are strong, but that isn’t the full picture for these small models that need to do repetitive, niche tasks. These models are clearly in competition with Gemini Flash and Flash-Lite (Gemini 2.5 Flash is coming soon following the fantastic release of Gemini 2.5 Pro — expectations are high). GPT-4o-mini has largely been accepted as a laggard and hard to use relative to Flash.
To win in the API business, OpenAI needs to crack this frontier from Gemini:
There are many examples in the OpenAI communications that paint a familiar story with these releases — broad improvements — with few details as to why. These models are almost assuredly distilled from GPT-4.5 for personality and from reasoning models like o3 for coding and mathematics. For example, there are very big improvements in code evaluations, where some of their early models were “off the map” and effectively at 0.
Evaluations like coding and mathematics still fall clearly short of the likes of Gemini 2.5 (thinking model) or Claude 3.7 (optional thinking model). This shouldn’t be surprising, but is worth reminding ourselves of. While we are early in a paradigm of models shifting to include reasoning, the notion of a single best model is messier. These reasoning models use far more tokens to achieve this greatly improved performance. Performance is king, but tie goes to the cheaper model.
I do not want to go into detail about OpenAI’s entire suite of models and naming right now because it does not make sense at all. Over time, the specific models are going to be of less relevance in ChatGPT (the main thing), and different models will power ChatGPT than those used in the API. We’ve already seen this with o3 powering only Deep Research for now, and OpenAI only recently walked back the line that “these models won’t be available directly.”
Back to the ChatGPT side of things. For most users, the capabilities we are discussing above are effectively meaningless. For them, the dreaded slider of model effort makes much more sense:
The new memory feature from last week got mixed reviews, but the old (simple) memory has been something I really enjoy about using ChatGPT. I don’t have to remind it that my puppy is a X week old miniature schnauzer or the context of my work. This’ll continue to get better over time.
This feels similar to when ChatGPT first added the search option — I didn’t really notice it at the time, but now it feels like an essential part of my use (something Claude still hasn’t felt like it does well). Claude was my daily driver for personality, but with great search and a rapidly improving personality, ChatGPT became indispensable. Gemini 2.5 Pro is still a better model, but it doesn’t come in a better interface.
I strongly expect that the memory feature will evolve into something I love about ChatGPT. It’ll be much easier to ask ChatGPT to remind you of that thing you found a couple months ago than it would be to try and parse your Google search history.
Some were skeptical of these new memories crossing personal and work uses, but I think search-style retrieval makes this easy, unlike algorithmic feeds that try to balance all your interests in one stream. The funnel is per use, and the interactions are narrower and seem easier to get right technically.
A final related point — people have long balked at the prices of chat interfaces relative to the API, but the reality that is fast approaching is that the personal experiences only exist in the app, and these are what people love. With the API, you could build a competitor that accumulates its own interactions, but as OpenAI has a huge product head start, this will be an uphill battle.
All of this reinforces what we know — products are the key to developments in AI right now. Memory and better separation of the ChatGPT lineage from the API helps OpenAI pave that path forward (and maybe do advertising, especially with memory), but we have a long way until it is fully realized.
https://www.interconnects.ai/p/llama-4
Where Llama 2’s and Llama 3’s releases were arguably some of the top few events in AI for their respective release years, Llama 4 feels entirely lost. Meta has attempted to reinvent their formula of models with substantial changes in size, architecture, and personality, but a coherent narrative is lacking. Meta has fallen into the trap of taking too long to ship, so the bar it needed to clear became impossibly high.
Looking back at the history of Meta’s major open models, the sequence is as follows:
* OPT – Released May 3, 2022 (ai.meta.com | 125M, 350M, 1.3B, 2.7B, 6.7B, 13B, 30B, 66B, 175B): A foundational open model that is underrated in the arc of language modeling research.
* LLaMA – Released February 24, 2023 (ai.meta.com | 7B, 13B, 33B, 65B): The open weight model that powered the Alpaca age of early open chat models.
* Llama 2 – Released July 18, 2023 (our coverage | about.fb.com | 7B, 13B, 70B): The open standard for academic research for its time period. Chat version had some bumps, but overall a major win.
* Llama 3 – Released April 18, 2024 (our coverage | ai.meta.com | 8B, 70B): The open standard for its time. Again, fantastic base models.
* Llama 3.1 – Released July 23, 2024 (our coverage | ai.meta.com | 8B, 70B, 405B): Much improved post training and the 405B marked the first time an open weight model competed with GPT-4!
* Llama 3.2 – Released September 25, 2024 (our coverage | ai.meta.com | 1B, 3B, 11B, 90B): A weird, very underperforming vision release, outshined by Molmo on the same day.
* Llama 3.3 – Released December 6, 2024 (github.com | 70B): Much improved post-training of the smaller 3.1 models, likely in response to other open releases, but largely a minor update.
* Llama 4 – Released April 5, 2025 (ai.meta.com | 17A109B, 17A400B): What we got today.
The time between major versions is growing, and the number of releases seen as exceptional by the community is dropping. Llama 4 consists of 3 models, quoting from the blog post, notes in brackets mine:
* Llama 4 Scout, a 17 billion active parameter model with 16 experts [and 109B total parameters, ~40T training tokens], is the best multimodal model in the world in its class and is more powerful than all previous generation Llama models, while fitting in a single NVIDIA H100 GPU.
* Llama 4 Maverick, a 17 billion active parameter model with 128 experts [and 400B total parameters, ~22T training tokens].
* These models are our best yet thanks to distillation from Llama 4 Behemoth, a 288 billion active parameter [and 2T total parameters] model with 16 experts that is our most powerful yet and among the world’s smartest LLMs…. we’re excited to share more details about it even while it’s still in flight.
Here are the reported benchmark scores for the first two models, which are available on many APIs and to download on HuggingFace.
Where Llama models used to be scaled across different sizes with almost identical architectures, these new models are designed for very different classes of use-cases.
* Llama 4 Scout is similar to a Gemini Flash model or any ultra-efficient inference MoE.
* Llama 4 Maverick’s architecture is very similar to DeepSeek V3 with extreme sparsity and many active experts.
* Llama 4 Behemoth is likely similar to Claude Opus or Gemini Ultra, but we don’t have substantial information on these.
This release came on a Saturday, which is utterly bizarre for a major company launching one of its highest-profile products of the year. The consensus was that Llama 4 was going to come at Meta’s LlamaCon later this month. In fact, it looks like this release may have been pulled forward from today, the 7th, based on a commit in the Meta Llama GitHub:
One of the flagship features is the 10M token context window on the smallest model, Scout (Maverick’s is 1M), but even that didn’t have any released evaluations beyond Needle in a Haystack (NIAH), which is seen as a necessary condition for long-context ability but not a sufficient one. Some more modern long-context evaluations include RULER or NoLiMa.
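For reference, NIAH is simple enough that you can sketch the whole evaluation in a few lines, which is part of why passing it says so little. A rough version, where `query_model` is a placeholder for whatever API you are testing and the 4-characters-per-token budget is a crude assumption:

```python
import random

def make_niah_prompt(needle: str, filler: str, depth: float, context_tokens: int) -> str:
    """Bury `needle` at relative `depth` (0.0-1.0) inside repeated filler text."""
    # Crude budget: assume ~4 characters per token for the filler.
    repeats = (context_tokens * 4) // (len(filler) + 1)
    haystack = (filler + " ") * repeats
    insert_at = int(len(haystack) * depth)
    doc = haystack[:insert_at] + needle + " " + haystack[insert_at:]
    return doc + "\n\nWhat is the secret number mentioned above? Answer with only the number."

secret = str(random.randint(10_000, 99_999))
needle = f"The secret number is {secret}."
prompt = make_niah_prompt(
    needle, "The grass is green and the sky is blue.", depth=0.35, context_tokens=100_000
)
# response = query_model(prompt)   # placeholder for the model/API under test
# passed = secret in response
```

Retrieving one planted sentence from filler text is a much weaker test than reasoning over genuinely long, interdependent documents, which is what RULER-style benchmarks try to probe.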
Many, many people have commented on how Llama 4’s behavior is drastically different in LMArena — which was their flagship result of the release — than on other providers (even when following Meta’s recommended system prompt). Turns out, from the blog post, that it is just a different model:
Llama 4 Maverick offers a best-in-class performance to cost ratio with an experimental chat version scoring ELO of 1417 on LMArena.
Sneaky. The results below are fake, and it is a major slight to Meta’s community to not release the model they used to create their major marketing push. We’ve seen many open models that maximize ChatBotArena scores while destroying the model’s performance on important skills like math or code. We’ll see where the released models land.
Regardless, here’s the plot Meta used. Look at the fine print at the bottom too.
This model is actually the one tanking the technical reputation of the release because its character is juvenile. The actual model on other hosting providers is quite smart and has a reasonable tone!
ArtificialAnalysis rated the models as “some of the best non-reasoning models,” beating leading frontier models. This is complicated because we shouldn’t separate reasoning from non-reasoning models; we should just evaluate on reasoning and non-reasoning domains separately, as discussed in the Gemini 2.5 post. So-called “reasoning models” often top non-reasoning benchmarks, but the opposite is rarely true.
Other independent evaluation results range from middling to bad and confusing — I suspect the weirdest results are hosting issues with the very-long-context models. At the same time, the Behemoth model is outclassed by Gemini 2.5 Pro. To list some of the major technical breakthroughs that Meta made (i.e. new to Llama, not new to the industry):
* Mixture of expert architectures, enabling Llama 4 to be trained with less compute than Llama 3 even though they have more total parameters — a lot more.
* Very long context up to 10M tokens.
* Solid multimodal input performance on release day (rather than in a later model).
Sadly this post is barely about the technical details. Meta nuked their release vibes with weird timing and by having an off-putting chatty model that was easiest to find to talk to. The release process, timing, and big picture raise more questions for Meta. Did they panic and feel like this was their one shot at being state of the art?
The evaluation scores for the models are solid; they clear a fairly high bar. With these highly varied MoE architectures, it’s super hard to feel confident in an assessment of the model based on benchmarks, especially when compared to dense models or teacher-student distilled models. The very-long-context base models will be extremely useful for research.
The question here is: Why is Meta designing their models in the same way as other frontier labs when their audience is open-source AI communities and businesses, not an API serving business or ChatGPT competitor?
The model sizing for the likes of Gemini and ChatGPT is downstream of nuanced decisions based on a balance of training cluster size, inference needs, and performance trade-offs. These trade-offs are very different for open models, where you don’t pay inference, and many users are not hyperscale companies.
The model that becomes the “open standard” doesn’t need to be the best overall model, but rather a family of models in many shapes and sizes that is solid in many different deployment settings. Qwen 2.5, with models at 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B parameters, is the closest to this right now. There’s actually far less competition in this space than in the space Meta chose to go into (and take on DeepSeek)!
One of these communities historically has been the LocalLlama subreddit, which named the entire at-home model-running community after the Llama series of models — and they’re not happy with Llama 4. Another community is academics, where a series of models across different size ranges is wonderful for understanding language models and improving methods. These two groups are largely GPU-poor, so memory-intensive models like these sparse mixtures of experts price out even more participants in the open community (who tend to be memory-limited).
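The back-of-the-envelope memory math makes the point; the bf16 and quantization framing below is my own illustration, not Meta’s:

```python
# Weight-only memory estimate: a sparse MoE must keep *all* experts resident,
# even though only the active parameters run per token. Ignores KV cache/activations.
def weight_memory_gb(total_params_billions: float, bytes_per_param: float = 2.0) -> float:
    return total_params_billions * bytes_per_param  # 1e9 params * bytes / 1e9 = GB

models = {
    "Llama 4 Scout (17B active / 109B total)": 109,
    "Llama 4 Maverick (17B active / 400B total)": 400,
    "Dense 17B model (for comparison)": 17,
}
for name, total_b in models.items():
    print(f"{name:44s} ~{weight_memory_gb(total_b):.0f} GB in bf16")
# Scout is ~218 GB in bf16: far beyond consumer hardware, and it only approaches
# a single 80 GB H100 with aggressive (~4-bit) quantization, which is what Meta's
# "fits in a single H100" framing implies.
```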
This is all on top of an onerous license that requires all artifacts built with Llama to carry the “Llama-” name prefix, the Llama license, and the “Built with Llama” branding if used commercially, along with use-case restrictions. This comes at the same time as their competitors, i.e. DeepSeek, released their latest flagship model with an MIT license (which has no downstream restrictions).
A third group is potential businesses looking to use open models on-premises as open models close the gap to closed counterparts. These feel like groups that would be sensitive to the extra legal risk that Llama’s license exposes them to.
On top of all of this weirdness, many of Meta’s “open-source” efforts are restricted in the European Union. Where the Llama 3.2 models blocked you if you tried to access them from Europe, Llama 4 is available for download but prohibits the use of its vision capabilities in the acceptable use policy. This is not entirely Meta’s fault, as many companies are dealing with side effects of the EU AI Act, but regulatory exposure needs to be considered in Meta’s strategy.
Meta had a tight grasp on these communities, the Llama projects were rightfully loved, but now they feel lost. With Qwen 3 around the corner and countless other amazing open-weight models out now (and many more teased, such as from OpenAI), the competition is extreme.
The soul of the Llama series died by not releasing enough models frequently enough. Reclaiming that with GenAI’s constant organizational headaches looks like a Sisyphean task. What is Meta’s differentiation in the AI space? It still seems about enabling their own platforms to flourish, not about truly supporting open.
Meta’s GenAI organization has been showing major signs of cultural challenges throughout its entire existence — including their head of AI research leaving just a few days before this model was launched.
Sadly, the evaluations for this release aren’t even the central story. The vibes have been off since the beginning by choosing a weird release date. Over the coming weeks, more and more people will find reliable uses for Llama 4, but in a competitive landscape, that may not be good enough. Llama is no longer the open standard. Personally, this makes me sad. As an American, I want the default pieces of the open ecosystem to be run by American or American-friendly companies.
With the macro pressure coming to Meta’s business and the increasing commoditization of open models, how is Zuckerberg going to keep up in the face of shareholder pressure pushing back against the cost of the Llama project? This isn’t the first time he’s had to defend it, but he needs to reevaluate the lowest-level principles of their approach to open AI.
https://www.interconnects.ai/p/rl-backlog-openais-many-rls-clarifying
I have a second blog where I post half-baked thoughts, sometimes previews of what comes here. If you’re interested, I posted some musings on OpenAI’s coming open model release.
It’s obvious that reinforcement learning (RL) is having a total return to glory among the broader AI community, but its real successes are mostly the things people aren’t focusing on. More math and code datasets are important platforms — we know they’re coming — but they’re still over-indexed on. The same RL methods are being used in many of the leading models and AI products.
This is largely a post I wrote a few weeks ago on RL news I was following. It never had a focusing thread, so it didn’t get published, but I’m sharing it because many folks are following this area very closely. Today:
* OpenAI’s many forms of RL,
* On distilling chain of thoughts vs. RL,
* Did DeepSeek distill o1?, and
* Why latent reasoning is so interesting.
OpenAI’s many forms of RL
For those plugged into the OpenAI cultural tap that is Twitter, it is obvious that they’re very invested in reinforcement learning. With the hype around the release of their o-series of reasoning models, it was easy to assume that those were the only avenue for excitement. OpenAI’s recent releases have shown this is not the case, and every release from a model launch to a new product has included mentions of RL training. Some of this, of course, is marketing, but they all fit as different applications of reinforcement finetuning (RFT) / RL with verifiable rewards (RLVR).
The first other application was OpenAI’s Operator agent. They stated:
Combining GPT-4o's vision capabilities with advanced reasoning through reinforcement learning, CUA is trained to interact with graphical user interfaces (GUIs)—the buttons, menus, and text fields people see on a screen.
There’s a bit more speculation to do than normal in this post. Ultimately, with launch partners like DoorDash, Instacart, etc., OpenAI could set up verifiable domains where the agent is rewarded for accomplishing a natural language task. This could rely on help from those websites to get started. Lots of people know that this could work, as agents are deeply tied to the core of RL lore, but the implementation details haven’t really been worked out in open projects.
The same goes for Deep Research. They stated:
Deep research independently discovers, reasons about, and consolidates insights from across the web. To accomplish this, it was trained on real-world tasks requiring browser and Python tool use, using the same reinforcement learning methods behind OpenAI o1, our first reasoning model.
Deep research was trained using end-to-end reinforcement learning on hard browsing and reasoning tasks across a range of domains.
Some more was shared in the Deep Research system card.
There are lots of things one can envision — e.g. the agent gets a reward if the document retrieved from search has relevant information (not a verifiable reward, but LLM-as-a-judge). Most of this is likely used to get very high reliability across tool use to enable the tons of calls done in the back end when a single request takes 10+ minutes for the user.
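Purely as speculation on my part, a judge-based reward for retrieval could be as simple as the sketch below; `judge_llm` is a placeholder callable, not anything OpenAI has described:

```python
JUDGE_PROMPT = """You are grading a research agent.
Task: {task}
Retrieved document: {document}
Does the document contain information relevant to the task? Answer YES or NO."""

def retrieval_reward(task: str, document: str, judge_llm) -> float:
    """Non-verifiable reward: 1.0 if a judge LLM deems the retrieved doc relevant."""
    verdict = judge_llm(JUDGE_PROMPT.format(task=task, document=document))
    return 1.0 if verdict.strip().upper().startswith("YES") else 0.0

# In training, this scalar would be mixed with any verifiable rewards
# (e.g., final-answer correctness) before the policy-gradient update.
```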
More research has emerged on RAG/search with RL.
Least surprising was the announcement of the new GitHub CoPilot model with new and improved RL training for code:
Our new code completion model is shipping in public preview today. We are calling it GPT-4o Copilot. Based on GPT-4o mini, with mid-training on a code-focused corpus exceeding 1T tokens and reinforcement learning with code execution feedback (RLEF).
This all goes back to what I said in OpenAI's Reinforcement Finetuning and RL for the masses — this new RL training is a perfectly aligned way to get nearly perfect performance on a domain you can control carefully. The best results come with mastery of the domain and with training.
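The appeal of RLVR is that the reward is something you can check programmatically in the domains you control. A toy math grader as a sketch; the \boxed{} convention is just one common formatting choice, and real graders are more robust to formatting:

```python
import re

def boxed_answer(completion: str) -> str | None:
    """Return the contents of the last \\boxed{...} in a completion, if any."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    return matches[-1].strip() if matches else None

def verifiable_reward(completion: str, ground_truth: str) -> float:
    """1.0 only if the final boxed answer exactly matches the reference."""
    answer = boxed_answer(completion)
    return 1.0 if answer is not None and answer == ground_truth.strip() else 0.0

assert verifiable_reward(r"... so the total is \boxed{42}.", "42") == 1.0
assert verifiable_reward("I think the answer is 42.", "42") == 0.0
```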
A fun hint that OpenAI is really invested in RL and post-training is that their new o3-mini model has the same knowledge cutoff, October 2023, as OpenAI’s other flagship models. That this cutoff is receding so far into the past shows how invested OpenAI is in their search products (which, to be fair, are quite good) for fresh information, and how such strong performance gains can come from other improvements in the training stack.
OpenAI also released a paper on competitive coding with RL training, but it did not have a ton of useful details.
On distilling chain of thoughts vs. RL
There were a few points from the DeepSeek paper and discourse that warrant repeating. To recap, distillation in this case is training a model (usually with SFT, but any loss function works) on outputs from a stronger model; a minimal sketch of what that looks like is below. Then let’s get right into the points.
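In code, that really is just supervised fine-tuning on teacher outputs. A minimal single-example sketch — the student model name is an arbitrary small choice, and a real pipeline would batch, mask the prompt tokens, and schedule the learning rate:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

STUDENT = "Qwen/Qwen2.5-0.5B"  # illustrative small student; any causal LM works

tokenizer = AutoTokenizer.from_pretrained(STUDENT)
student = AutoModelForCausalLM.from_pretrained(STUDENT)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-5)

def distill_step(prompt: str, teacher_completion: str) -> float:
    """One SFT step: raise the student's likelihood of the teacher's completion."""
    text = prompt + teacher_completion + tokenizer.eos_token
    batch = tokenizer(text, return_tensors="pt", truncation=True, max_length=2048)
    out = student(**batch, labels=batch["input_ids"])  # standard causal-LM loss
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()
```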
First, DeepSeek made it very clear that using more RL after distillation (SFT) is crucial for the best possible models.
Additionally, we found that applying RL to these distilled models yields significant further gains. We believe this warrants further exploration and therefore present only the results of the simple SFT-distilled models here.
My current understanding here is that matching the data distribution from the base model’s training to the distillation data and the RL prompts is very important. This specifically is crucial for enabling RL at the end — SFT will almost always boost the scores, but can narrow the scope to which the model can be finetuned more. DeepSeek figured this out for their models, but didn’t share the details.
The next point is on how scale mediates the impact of RL training:
First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation.
This is more confusing than useful, and drawn from the fact that “DeepSeek-R1-Distill-Qwen-32B, which is distilled from DeepSeek-R1, performs significantly better than DeepSeek-R1-Zero-Qwen-32B across all benchmarks”. We should not expect that -Zero style models trained only with RL will perform well on benchmarks (unless you’re training on test). That is not what they are designed for. The distilled models are trained on text very finely tuned for existing language modeling workflows. The RL-Zero (not distilled) models are very exploratory in their behaviors.
The right baseline would be putting Qwen-32B through the whole R1 recipe — which would be far more likely to outperform the distilled version.
Related to this is the fact that small models need more work to benefit from RL. Doing this sort of exploratory RL is much easier with big models. It could be that they pick up more rare behaviors during pretraining and RL draws them out. Smaller models may squash these long-tail behaviors.
Continuing on this, the DeepSeek authors state:
Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger scale reinforcement learning.
Did DeepSeek distill OpenAI’s o1 model? (hint, no)
This is a question I meant to address ages ago, but here we are, a few model launches got in the way. The criticism pushed by OpenAI and many media outlets is that DeepSeek was trained on reasoning traces from OpenAI’s o1 model. OpenAI spent approximately 18 months getting the initial data to train their o1 model, so it is understandable that they are wary of giving that away for free, but the existing evidence suggests that DeepSeek training on o1-CoTs is extremely unlikely.
To start, the o1 chains of thought were not visible to users. In order to get this data, DeepSeek would have needed to reliably hack the OpenAI API or ChatGPT to reveal it. Users were getting banned from OpenAI’s properties for trying to do exactly this. A cover-up at that scale is unlikely to go unnoticed.
Second, as shown in the DeepSeek R1 recipe, training on on-policy completions from your model(s) is crucial to training a model like this. In many ways, it would likely have been harder to create the final R1 model by distilling from o1’s CoTs than by following the recipe DeepSeek presented in the paper. They also have evidence in training plots that their RL training works.
At the same time, this is a hard claim to settle, as I think it is very likely that DeepSeek used OpenAI model outputs in the training process of their recent models. Distillation in multiple stages of the post-training process is a very common practice. For example, in order to do initial post-training on models like DeepSeek V3, training on completions from OpenAI chat models is a very simple way to get going.
To this day, OpenAI is still worried about distillation from their chain of thoughts, or they’re doing something that makes it so showing the underlying chain of thought doesn’t make sense (e.g. basic forms of search or self-consistency). For example, OpenAI now shows summaries of the chain of thoughts for their o-series models, but they’re not raw like Claude or Gemini’s.
These aren't the raw CoTs but it's a big step closer and I'm glad we can share that experience with the world.
Why latent reasoning is so interesting
One of the most intellectually engaging ideas to emerge during this early 2025 rush of reasoning research is a set of ideas where language models reason in a compressed intermediate representation rather than outputting plain text tokens, which carry quadratic inference costs. The two papers that come to mind are:
* Training Large Language Models to Reason in a Continuous Latent Space
* Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach
Here’s a figure from the latter:
Without going into details of these papers’ implementations, this is compelling because it pushes in the direction of letting language models think in whatever representation suits them. Then, they’ll output tokens or take actions in a form that works in the environment or is human legible.
We’ve already seen many related results of RL training, such as the DeepSeek R1 Zero model switching from English to Chinese randomly when it thinks.
Ultimately, RL training is all about improving outcomes, so this type of structural drift is expected. The question is whether we can incentivize the models to use far more compressed representations than the standard language tokens used in modern tokenizers.
A related trade-off already exists in language models, where non-English languages are often far more costly than English to perform inference on because they’re a lower priority in the tokenizer (or the language is simply more verbose). The goal of latent or compressed reasoning research is to push this in the other direction.
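You can see the current trade-off directly by counting tokens; the tokenizer and the rough translations below are arbitrary choices for illustration:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # any byte-level BPE shows the effect

samples = {
    "English": "The quick brown fox jumps over the lazy dog.",
    "German": "Der schnelle braune Fuchs springt über den faulen Hund.",
    "Greek": "Η γρήγορη καφέ αλεπού πηδάει πάνω από τον τεμπέλη σκύλο.",
}
for lang, text in samples.items():
    print(f"{lang:8s} {len(tok.encode(text)):3d} tokens")
# Lower-priority scripts get shredded into many more tokens per word, i.e. more
# compute per unit of meaning. Latent reasoning pushes in the opposite direction:
# denser internal representations than any of these token streams.
```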
Anthropic has been doing interesting research on understanding the nature of the reasoning traces from Claude. With their reasoning launch, they already stated that:
we didn’t perform our standard character training on the model’s thought process.
They’ve also seen that the reasoning is not connected with the model’s actual thoughts.
Thus far, our results suggest that models very often make decisions based on factors that they don’t explicitly discuss in their thinking process. This means we can’t rely on monitoring current models’ thinking to make strong arguments about their safety.
This is expected. The reasoning chain is not the same artifact that humans use chain of thought for, even if it appears in the same format. Chain of thought is about generating the right context to get the final answer right. There are no guarantees that the most interpretable form is the one with the highest performance — in fact, in many deep learning systems, end-to-end learning with no constraints on the intermediate representation is often best!
To end, I’m leaving you with another classic Rich Sutton essay in full (author of the Bitter Lesson). With RL, better verifiers make it so you get more out of RL training and inference-time scaling:
Verification, The Key to AI
Rich Sutton, November 15, 2001
It is a bit unseemly for an AI researcher to claim to have a special insight or plan for how his field should proceed. If he has such, why doesn't he just pursue it and, if he is right, exhibit its special fruits? Without denying that, there is still a role for assessing and analyzing the field as a whole, for diagnosing the ills that repeatedly plague it, and to suggest general solutions.
The insight that I would claim to have is that the key to a successful AI is that it can tell for itself whether or not it is working correctly. At one level this is a pragmatic issue. If the AI can't tell for itself whether it is working properly, then some person has to make that assessment and make any necessary modifications. An AI that can assess itself may be able to make the modifications itself.
The Verification Principle:
An AI system can create and maintain knowledge only to the extent that it can verify that knowledge itself.
Successful verification occurs in all search-based AI systems, such as planners, game-players, even genetic algorithms. Deep Blue, for example, produces a score for each of its possible moves through an extensive search. Its belief that a particular move is a good one is verified by the search tree that shows its inevitable production of a good position. These systems don't have to be told what choices to make; they can tell for themselves. Imagine trying to program a chess machine by telling it what kinds of moves to make in each kind of position. Many early chess programs were constructed in this way. The problem, of course, was that there were many different kinds of chess positions. And the more advice and rules for move selection given by programmers, the more complex the system became and the more unexpected interactions there were between rules. The programs became brittle and unreliable, requiring constant maintenance, and before long this whole approach lost out to the "brute force" searchers.
Although search-based planners verify at the move selection level, they typically cannot verify at other levels. For example, they often take their state-evaluation scoring function as given. Even Deep Blue cannot search to the end of the game and relies on a human-tuned position-scoring function that it does not assess on its own. A major strength of the champion backgammon program, TD-Gammon, is that it does assess and improve its own scoring function.
Another important level at which search-based planners are almost never subject to verification is that which specifies the outcomes of the moves, actions, or operators. In games such as chess with a limited number of legal moves we can easily imagine programming in the consequences of all of them accurately. But if we imagine planning in a broader AI context, then many of the allowed actions will not have their outcomes completely known. If I take the bagel to Leslie's office, will she be there? How long will it take to drive to work? Will I finish this report today? So many of the decisions we take every day have uncertain and changing effects. Nevertheless, modern AI systems almost never take this into account. They assume that all the action models will be entered accurately by hand, even though these may be most of the knowledge in or ever produced by the system.
Finally, let us make the same point about knowledge in general. Consider any AI system and the knowledge that it has. It may be an expert system or a large database like CYC. Or it may be a robot with knowledge of a building's layout, or knowledge about how to react in various situations. In all these cases we can ask if the AI system can verify its own knowledge, or whether it requires people to intervene to detect errors and unforeseen interactions, and make corrections. As long as the latter is the case we will never be able to build really large knowledge systems. They will always be brittle and unreliable, and limited in size to what people can monitor and understand themselves.
"Never program anything bigger than your head"
And yet it is overwhelmingly the case that today's AI systems are not able to verify their own knowledge. Large ontologies and knowledge bases are built that are totally reliant on human construction and maintenance. "Birds have wings" they say, but of course they have no way of verifying this.
Sharing a copy of Rich Sutton’s essay because his website sometimes has DNS issues and goes down. http://incompleteideas.net/IncIdeas/KeytoAI.html
Thanks for reading!
Thanks to Tanmay Gupta for helpful links or comments used in this article.
https://www.interconnects.ai/p/gemini-25-pro-googles-second-ai-chance
Google, with its immense infrastructure and talent, has been the safe bet for the question of “Who will have the best models in a few years?” Google took a long time to get here, overcoming Bard’s launch and some integration headaches, and yet the model they launched today, Gemini 2.5 Pro, feels like the biggest jump in evaluation scores we’ve seen in quite some time.
It’s often hard to communicate how the models we are getting these days are actually better. To be informed, you need to take a balanced view across many benchmarks, look roughly at the percentage by which the model is clearly state-of-the-art, and of course, try the model yourself.
To summarize, while more evaluations are rolling in, Gemini 2.5 Pro is 40+ Elo points clear on the popular ChatBotArena / LM Arena benchmark (more here). Normally, when a model launches and claims the top spot, it’s barely ahead. In fact, this is the second biggest jump for a new top model in LMSYS history, behind only GPT-4 Turbo overtaking Claude 1 — and that was when models were not really trained for the benchmark, so progress was much faster.
The blog post highlights insane scores on the benchmarks used to evaluate the leading reasoning models. One to note here is the score of 18.8 on Humanity’s Last Exam without search or tools, which was one of the evaluations I highlighted as impressive with the launch of OpenAI’s Deep Research, which compiles knowledge from the web!
Gemini 2.5 is topping other independent evaluations such as the Scale Leaderboard (which is underrated or at least low on visibility, more here). More independent evaluations are going to trickle in, but all of the ones I’ve seen are extremely positive.
Gemini is also still the model with the longest context length and with very strong multimodal performance (including audio). There are plenty of small wins like this that Google has which are hard to see when skimming the benchmarks above.
So, how did Google do it? As usual, the blog post doesn’t have a ton of technical details. Google says:
we've achieved a new level of performance by combining a significantly enhanced base model with improved post-training.
Until we have API pricing, it’ll be harder to make even informed guesses about whether the model is huge like GPT-4.5. As for understanding how Gemini models will behave, Google shares:
Going forward, we’re building these thinking capabilities directly into all of our models, so they can handle more complex problems and support even more capable, context-aware agents.
This idea of directly integrating reasoning into all of their models is something Sam Altman teased for GPT-5. This trend has serious trade-offs for user experience that we will get to later, but it is crucial for people to keep up with, as the discourse today is often centered on “the best non-reasoning model” or “the best reasoning model.”
This came up recently with DeepSeek’s new V3 model.
DeepSeek's new model (0324) is a major update in performance and license. The MIT license will make it hugely impactful for research and open building. Though many are ending up confused about whether it is a "reasoning" model. The model is contrasted to their R1 model, which is an only-reasoning model (like o1).
Reasoning models are on a spectrum now, and it's not just yes or no. GPT 4.5 is a good example of what a model with pretty much no reasoning looks like today.
Compared to other models in the industry, like Claude 3.7 and Grok 3 with reasoning toggles, the new DeepSeek V3 is definitely in this class of "hybrid reasoners" where models are still trained extensively with RL on verifiable domains (or distilled directly from another reasoning model), but other parts of the post-training process come first and hold more weight than the RL heavy reasoning-only models.
This is all to say that when people say that "DeepSeek V3 0324 is the best non-reasoner model," that doesn't really make sense. The original V3 had very light post-training, so it wasn't really on the reasoning model spectrum.
Now, things are complicated. It'll be like this for a while!
Gemini 2.5 Pro is quite simple. It is very much a reasoning model, at least in how it is offered to users in Gemini Advanced and AI studio — every query has reasoning before an answer. It is fairly conclusive now that using this extended reasoning can boost performance across many domains, but it’s not clear how to best trade off cost and speed with varying amounts of reasoning.
Gemini 2.5 in its current offering is a brute force approach — a big, very smart model that is tuned to use a lot of reasoning tokens — and it’s good for the trajectory of the industry that it paid off with such high performance.
The state of the AI industry
With launches from DeepSeek, GPT-4.5 from OpenAI, Claude 3.7 from Anthropic, Grok 3 from xAI, and now Gemini 2.5 Pro, this has been a wild spring for progress in AI models. The major AI laboratories have all delivered super impressive performance — this post feels like the ribbon that ties all of them together.
The one player seriously missing this spring is Meta with their Llama models. They’ve fallen into the trap where the longer you go between models, the harder it gets to release them because expectations get higher. I hope Llama 4 succeeds because they’re a large part of the open community, but it is a warning to AI laboratories on how to manage deliverables.
With the major progress that AI labs are making, it feels like the answer for who will have the best model is now who can drop the hot potato of a cutting-edge model into the real world the fastest.
The common interpretation of events is that models are commoditizing, but that is an incomplete story. The value in the ecosystem is poised to accrue to the sites with users. Some established ones in AI are ChatGPT, Perplexity, Cursor, etc. This may not always be the case as uses for AI evolve.
What we’re seeing with the newest models is that the pace of progress is staying high in many areas (i.e. more than just ChatBotArena). All sorts of evaluations, from niche information to hard mathematics to software development, are getting new leading models every few weeks.
The more often state-of-the-art models are released in a fixed time window, the more confident you can be in the pace of progress continuing. These labs are all racing up similar trees, but it’s only possible for so much competition to exist when progress isn’t super hard to find. The ceiling on performance is rising and the potential value underneath it that we haven’t unlocked is continuing to balloon.
Google AI’s second chance
This quote has been going around after Ben Thompson interviewed OpenAI CEO Sam Altman on his plans for OpenAI:
Ben Thompson: What’s going to be more valuable in five years? A 1-billion daily active user destination site that doesn’t have to do customer acquisition, or the state-of-the-art model?
Sam Altman: The 1-billion user site I think.
A world where user-facing websites are the most valuable part of AI is a world where AI is less of a platform for doing things and more of a tool for complementing existing habits. AI progress is as high as it ever has been, and the focus of that is on moving from benchmarks towards turning them into agents and tools.
Google’s biggest opportunity is being the one player that has it all — leading models, infrastructure, and a cloud offering to make it the default platform for building value with AI. They have users to retain with Google.com, which they are obviously trying to do, but the rest of their efforts should be on being an AI platform.
With this release, I spent time trying to use Google’s Gemini Advanced offerings like I use ChatGPT and Claude. These use cases were immediately confusing. It didn’t feel like chat is the right way to evaluate this new Gemini 2.5 model at all. It’s perfectly capable, but without a depth of personality it feels lost relative to the fun GPT-4.5 or the ever-quirky Claude.
And why am I paying for Gemini Advanced? Google is the company known for giving things away for free and at scale. If Google isn’t committed to figuring out advertisements for its chat products, then it’ll never meaningfully shift the revenue. Breaking through the moat of ChatGPT with anything other than better models at a free plan is next to impossible at this point. The disruption and new habits have already formed.
Many of my frustrations with Gemini 2.5 Pro, beyond its lack of the distinctive character of GPT-4.5 and Claude 3+ (Gemini feels somewhat bland), come down to the form factor of forcing reasoning into every query. Even for basic queries, the extensive reasoning of Gemini 2.5 Pro pushes the time to first token to the order of seconds.
Normal consumers don’t benefit from the improvements with reasoning that are accompanied by such a decrease in speed. For agents doing substantial work in the background, a long time to first token or a bland personality don’t matter!
Reasoning heavily on every query is a major quality-of-life drain for chat and reopens the same discussions on reasoning models that balance when they should reason. Claude, Grok, DeepSeek, and OpenAI all have selectors for toggling reasoning on or off. This should be the default until models are better.
Gemini should not be focusing on competing with ChatGPT in the same business. That’s a losing battle and arguably not even the biggest possible final market — subscriptions never have scaled to be the core of the world’s largest companies.
Where Gemini Advanced (at gemini.google.com) feels like a ChatGPT clone, AI Studio (at ai.dev) feels like the onboarding point for developers and customers using their platform. Logan and others have made big progress softening the barrier for people jumping from OpenAI and Anthropic to Google. These leads are far more valuable than Gemini Advanced subscribers.
Google should be a platform for others to build AI and use AI to make their offerings better. Google has had success with its AI overviews and continues to build on that. At the same time, their offerings for using Gemini in products have pretty much failed completely.
There are two clear avenues where Google can use Gemini to deliver business value:
* Gemini for product: Enhancing existing products like Docs, Sheets, YouTube, Android, Assistant, etc. — i.e., taking the above and making it actually work. The Gemini product offerings across the Google suite are pretty much still in their Bard stage. The same focus and execution from training needs to extend into Gemini products and Google Cloud for the next stage of this arc. Extreme value is ready to be captured by the models even if the models don’t continue to improve. The urgency on products at this point very well should be higher than the pressure to train better models.
* Google Cloud: Offering fast and cheap inference of Gemini in the form factor developers need. Google Cloud, being integrated from hardware in TPUs to model can often provide the best models now at the lowest prices. Selling Gemini into a world of flourishing agents is a far better match for Google’s culture of product successes. AI Studio and API developer relations around it can be a seed that grows.
Google has the best models again, as they should, given they started this whole AI bloom. The strategic error has been righted. The AI leadership has woken up to the crisis, and the researchers and engineers have risen to the occasion. The rest of the company has to do the same.
https://www.interconnects.ai/p/how-to-manage-ai-training-organizations
It is a closely guarded secret how the leading AI laboratories structure their training teams. As with other technology companies, the saying “you ship your org chart” still applies to training AI models. Looking at these organizational structures will reveal where research can be scaled up, the upper limits of size, and potentially even who uses the most compute.
How modeling teams do and do not work
A crucial area I’m working on (reach out if you would like to share more off the record) is how to scale these lessons to bigger, more complex teams. The core factor differentiating teams that succeed from those that do not is maintaining these principles while scaling team size.
Big teams inherently lead to politics and protecting territory, while language models need information to flow from the bottom to the top on what capabilities are possible. Regardless of the possibilities, leadership can shift resources to prioritize certain areas, but all of the signals on whether this is working come from those training models. If senior directors mandate results under them before unblocking model releases, the entire system will crumble.
Seeing this potential end state — without naming specific companies — it is obviously desirable to avoid, but anticipating and avoiding it during rapid growth takes substantial intentionality.
Within training, the planning for pretraining and post-training traditionally could be managed differently. Pretraining has fewer, bigger runs so improvements must be slotted in for those few annual runs. Post-training improvements can largely be continuous. These operational differences, on top of the obvious cost differences, also make post-training far more approachable for non-frontier labs (though still extremely hard).
Both teams have bottlenecks where improvements must be integrated. Scaling the pretraining bottlenecks — i.e. those making the final architecture and data decisions — seems impossible, but scaling teams around data acquisition, evaluation creation, and integrations is very easy. A large proportion of product decisions for AI models can be made irrespective of modeling decisions. Scaling these is also easy.
Effectively, organizations that fail to produce breakthrough models can do tons of low-level meaningful research, but adding organizational complexity dramatically increases the risk of “not being able to put it together.”
Another failure mode of top-down development, rather than bottom-up information, is that leaders can mandate the team to try to follow a technical decision that is not supported by experiments. Managing so-called “yolo runs” well is a coveted skill, but one that is held close to the models. Of course, so many techniques work still that mandates don’t have a 100% failure rate, but it sets a bad precedent.
Given the pace of releases and progress, it appears that Anthropic, OpenAI, DeepSeek, Google Gemini, and some others have positive forms of this bottom-up culture with extremely skilled technical leads managing complexity. Google took the longest to get it right with re-orgs, muddled launches (remember Bard), and so on. With the time lag between Meta’s releases, it still seems like they’re trying to find this culture to maximally express their wonderful talent and resources.
With all of this and off-the-record conversations with leadership at frontier AI labs, I have compiled a list of recommendations for managing AI training teams. This is focused on modeling research and does not encompass the majority of headcount in the leading AI companies.
Recommendations
The most effective teams who regularly ship leading models follow many of these principles:
* The core language modeling teams remain small as AI companies become larger.
* For smaller teams, you can still have everyone in one room; take advantage of this. Personally, I think this is where remote teams can be detrimental. In-person works best here, at least while best practices are evolving so fast.
* Avoid information silos. This goes for both teams and individuals. People need to be able to quickly build on the successes of those around them, and clear communication during rapid, constant progress is tricky.
* For larger teams, you can scale teams only where co-design isn’t needed. Where interactions aren’t needed, there can be organizational distance.
* An example would be one team focusing on post-training algorithms & approaches while other teams handle model character, model variants for API, etc (specifications and iterations).
* Another example is that reasoning teams are often separate from other pieces of post-training. This applies only to players that have scaled.
* Language model deployment is very much like early startup software. You don’t know exactly what users want nor what you can deliver. Embrace the uncertainty and learn quickly.
* Do not overly try to separate engineering teams from training. Engineering needs to build tools for the generation +1 model and cannot do this without talking to researchers.
* Evergreen research is separate from the language modeling team itself, but still sits within “research”. Otherwise, it will be impossible to prioritize truly long-term ideas. Long-term goals are fragile and need nurturing. Language modeling is about the next 1, or maybe 2, models.
* A lot of the sexy work is not that helpful and a lot of the useful work isn't sexy. Data is the prime example as the often most impactful type of work.
* Expect failed training runs and do not overreact to them along the way.
Failure modes
High-priority projects can fail if you…
* Try to ship too many models for each capability improvement. Instead, stick to a set schedule of model training. Have fewer models that are more capable.
* Try to force contributions from individual teammates into the final product. Do not sacrifice performance for personalities in search of “a contribution”.
* Let in teams that try to territorially force their way into contributing to the big company goal.
* Scale the training organization too much. Having too many people “doing stuff” and adding noise to the organization detracts from high-level direction and focus on the execution of specific goals. (This can also relate to the first point: trying to do too much in one model.)
* Let politics grow; it takes many forms and causes intertwined issues. Do not lose the sense that results are the #1 driving factor of decisions. Bad decisions here compound.
* Over-index on a single model evaluation, which will hamper (or flat out block) real progress in other areas.
Before the rest of the post, expanding on the topics above, you may be interested in previous articles on this topic.
Related writing
For more reading on how language modeling teams work, see some of my other writing here, on team structure, and on managing risk.
An example of how mid-sized training projects work
I recently got a list of questions on how training for Tülu 3 operated (which is really a post-training analog to OLMo). I figured I would share them here; they also serve as a foundation for gathering information from friends at frontier labs on how representative this experience is.
With reasoning models, most of this translates directly. Infrastructure is becoming more important because generating long sequences is particularly memory intensive (and can expose issues in open-source tools for inference), but when the time comes to make a state-of-the-art fully open reasoning recipe, the lessons learned here will apply directly.
1. How long does a large post-training project take?
Tülu 3 was the focus of our post-training team from mid-July until its release on November 21st, 2024. We were building on our previous recipes, in Tülu 2/2.5, so not very much of this was catching up on internal know-how, but rather integrating new external resources. If a team like this was working continuously all year on the same focus it would’ve taken approximately one month less to achieve these results. Bootup takes substantial time, as does release management.
2. How do you choose the right personnel for a moderately sized training project?
A project like Tülu 3, or any other effort to push the frontier of AI in a popular area, normally takes a moderately sized team. The smaller the niche, the smaller the team you need. Among the 20+ authors, the team at Ai2 skews researcher-heavy rather than engineer-heavy. If prioritizing only performance on known techniques, the ratio of engineers can be far higher. Pushing the frontier takes 10x the resources of repeating extensively documented work.
In the case of Tülu 3, where most of the techniques were not known in advance, the proportion of researchers is obviously higher. For companies trying to scope who to hire for modeling teams, though, this is not a trivial problem. First, one must scope the level of uncertainty in the domain of interest and then hire around it. Applying Tülu-style approaches could definitely be done with a team of 2-4 focused engineers.
3. What model sizes are used for iteration? How do results scale?
A core principle of modeling research is to iterate at the smallest model that provides a reliable signal. This is the entire principle behind scaling laws as a de-risking tool. In post-training, compute costs are substantially lower so the models used actually can be bigger. In this case, given a project designed around the Llama 3.1 base models, ~80% or more of experiments were at the 8B scale (normally 8 or 32 H100s, finishing in <1 day), ~19% at the 70B scale (normally 32 or 64 H100s, finishing in 2-3 days), and only a handful of runs at the 405B scale that were using 256 GPUs each for multiple days. In overall GPU utilization, the project utilized 100-600 GPUs concurrently for the entire 4-5 month span.
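To put those percentages in rough compute terms, here is a back-of-the-envelope tally. The run counts and durations are my assumed midpoints of the ranges above (splitting roughly 1,000 runs by the stated percentages), so treat the totals as order-of-magnitude only:

```python
# Order-of-magnitude H100-hour estimate for the experiment mix described above.
# Run counts and durations are assumed midpoints, not measured values.
experiments = {
    # scale: (runs, GPUs per run, days per run)
    "8B":   (800, 20, 0.75),   # ~80% of runs, 8-32 H100s, <1 day
    "70B":  (190, 48, 2.5),    # ~19% of runs, 32-64 H100s, 2-3 days
    "405B": (10, 256, 3.0),    # a handful of runs, 256 GPUs, multiple days
}

total = 0.0
for scale, (runs, gpus, days) in experiments.items():
    gpu_hours = runs * gpus * days * 24
    total += gpu_hours
    print(f"{scale:5s} ~{gpu_hours:,.0f} H100-hours")
print(f"total ~{total:,.0f} H100-hours")
# ~1M H100-hours is consistent with the stated 100-600 GPUs running
# concurrently over a 4-5 month project.
```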
These days, results tend to transfer extremely well when scaling. Bigger models may need less data, especially less general data, and a gentler optimization (lower learning rate usually), but transfer hasn’t been a challenge. Changing base models is harder than scaling with post-training techniques.
4. How many experiments are actually run?
The Tülu project evaluated about 1000 checkpoints in our process. This feels about right for a major post-training process. Some of these are intermediate or competitor models, but most of them, 100s, are experimental training runs. The model scores can be plotted in a time sequence with the metadata we collected (credit Hamish Ivison for the plot). When you squint, it is largely a logarithmic curve with faster gains at the beginning and leveling off at the end. Of course, you can also see the flurry of models trained right in the last few weeks.
5. What is the biggest bottleneck on progress?
All of these projects are bottlenecked by compute available. Making systems more efficient is a compute multiplier, but if the starting point in the number of GPUs is too low, it won’t matter. There’s often potential to accelerate projects by adding more people to explorations, whether it’s training approaches like process reward models (PRMs) or data curation, but scaling management and integration of data across numerous evaluations can be tricky. Best practices for models with 100s of target evaluations (as done in frontier laboratories) rather than the ~10 we used, are far from established.
The second bottleneck would be personnel willing to constantly grind on new data experiments. Focus on data almost always pays off fairly quickly.
6. What I would need to get a serious post-training effort off the ground from a cold start?
Finetuning has such a large gradation that impact can be made with almost any team size. To do truly excellent work takes mostly patience and proportional resources. Getting the model exactly right takes retraining many times even after you hit your initial benchmarking goals.
For companies focusing on local models, a few nodes of H100s (~100 GPUs) could go a very long way. For companies trying to make truly state-of-the-art models above the 7B scale, trying to do so with <500 H100 GPUs is likely not worth it. It is very easy to be stuck in the middle and compute is still the largest determining factor of success.
These numbers will come down as best practices for distillation from strong models are established, but that knowledge is far from widespread. If you want to invest in training, you need to do enough to move the frontier, or else you will inevitably fall behind and it would be better to ride on others’ coattails.
7. What is the hardest part of these projects? Where do you actually spend time?
Training projects take a lot of time and a lot of attention to detail. Teams need extreme isolation from other company goals to focus on their one goal of training. The hardest part is often exactly this — having all the members of the training team focus on one single output for sustained periods. Tracking down recent developments, small experiments with training algorithms, curating data (likely most of the time in hours, as babysitting GPUs is largely an idle activity), etc. are all bread and butter of solid engineering talent. Success is downstream of good decision-making by tech leads and managers while getting many small shots on goal.
In the case of projects like Tülu 3 the reason we don’t immediately transition to Tülu 4 is that people have other interests. Companies that directly align training with their bottom line don’t need to do this.
Thanks to Nicole Fitzgerald, Finbarr Timbers (Midjourney was not one of the companies I studied), and others unnamed at leading AI laboratories for comments or input that helped with this post.
Post: https://www.interconnects.ai/p/gemma-3-olmo-2-32b-and-the-growing
Ever since the release of the original ChatGPT, much has been said about making a truly open-source version of it — with data, code, weights, etc., all available. Open-source versions increase transparency, access, long-term progress, security research, and lots more. Lots of people have used this claim to bring hype into their projects, but the substance of these releases has been rather shallow (i.e., often focusing on one evaluation).
This milestone was so long coming that I entirely forgot about it as a target. Through 2024, and especially before DeepSeek, the impression was that scaling AI capabilities was just too expensive for the smaller players willing to do truly open-source development.
Truly open releases take a lot of effort (there is simply more to release and maintain), open up potential legal risks that preclude certain types of training data, and give away much of the competitive advantage of training. The few organizations doing fully open-source research are non-profits, like Ai2 or Eleuther AI; academics, like LLM360; or companies that benefit from the long-term ecosystem growth, like HuggingFace.
I was poking through the results for our latest model when I realized that we finally did it! We have a fully open-source GPT-4 class model, i.e., it is comparable with OpenAI's original release rather than the current version.
Today, we're releasing OLMo 2 32B, the biggest model we've trained from scratch yet. Here are the post-training evaluations, where it surpasses GPT-3.5, GPT-4o-mini, Qwen 2.5 32B Instruct, the recent Mistral Small 24B, and comes close to the Qwen and Llama 70B Instruct models.
And this recipe is extremely efficient to train. Here’s a plot showing the FLOP comparisons to peer base models:
Most of this release isn't entirely new. OLMo 2 is the result of lots of small wins on data, architecture, post-training with the Tülu 3 recipe, and so on — we just let the GPUs hum for a lot longer. You can learn more about OLMo 2 in my original release announcement or in this podcast with the leads.
The new part of this release is the milestone itself: any company can now pick up our training stack and cook up exactly the model they need at nearly the GPT-4 level. Beating the latest GPT-3.5 and GPT-4o mini models feels like fair grounds for the claim. This capability will take time to diffuse, but it is a major moment in the arc of why we do what we do. Even without more progress on OLMo, which we obviously will continue this year, this will keep fundamental AI progress outside of the major AI labs going for multiple years. It’s an optimistic day for open-source.
Here are your links to more information on OLMo 32B:
* Blog with technical details and demo
* Base model: OLMo-2-0325-32B
* Instruct model: OLMo-2-0325-32B-Instruct and intermediate SFT, OLMo-2-0325-32B-SFT, and DPO checkpoints, OLMo-2-0325-32B-DPO
* Pretraining dataset: OLMo-mix-1124
* Mid-training dataset: Dolmino-Mix-1124
* Post-training datasets: Tülu 3 SFT Mix (updated), Preference data for OLMo 2 32B and RLVR Mix
Gemma 3 as the next point on a steep trend line
Yesterday, March 12th, Google released the next batch of their flagship open-weight models, Gemma (report, models, flagship model). They highlight the following capabilities in their documentation:
* Image and text input: Multimodal capabilities let you input images and text to understand and analyze visual data.
* 128K token context: 16x larger input context for analyzing more data and solving more complex problems.
* Wide language support: Work in your language or expand your AI application's language capabilities with support for over 140 languages.
* Developer friendly model sizes: Choose a model size (1B, 4B, 12B, 27B) and precision level that works best for your task and compute resources.
Some technical details of note:
* In open models, 32B dense models are convenient because they can be finetuned on one node of 8 H100s (slowly). Google's 27B sizing is likely downstream of TPU considerations that don't map directly onto that logic, along with factors like how knowledge distillation works at pretraining.
* The Gemma models continue to be trained extensively with teacher-student knowledge distillation (KD). This KD is different from the colloquial definition of distillation in leading AI models. The common use of distillation is training a model on the outputs of a much stronger model, most commonly in post-training by learning from generated completions of the stronger model. KD is a subset of the general idea of distillation, where the model being trained learns to match the full output distribution of the teacher model (a code sketch of the distinction follows below). Labs other than DeepMind have mentioned this KD technique, but Google has pushed it much further. This was discussed further in last summer’s post on synthetic data.
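For readers who want the distinction pinned down, here is a minimal sketch of the two losses. The tensors, shapes, and temperature are hypothetical; this illustrates the general technique, not Gemma's actual training code.

```python
# Contrast: distribution-matching KD vs. the colloquial "train on teacher samples."
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Teacher-student KD: match the teacher's full next-token distribution."""
    vocab = student_logits.size(-1)
    t_probs = F.softmax(teacher_logits / temperature, dim=-1).view(-1, vocab)
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1).view(-1, vocab)
    # Forward KL per token, averaged over all tokens; T^2 keeps the gradient scale stable.
    return F.kl_div(s_logprobs, t_probs, reduction="batchmean") * temperature ** 2

def sampled_distillation_loss(student_logits, teacher_sampled_tokens):
    """Colloquial distillation: plain cross-entropy on completions sampled from the teacher."""
    vocab = student_logits.size(-1)
    return F.cross_entropy(student_logits.view(-1, vocab), teacher_sampled_tokens.view(-1))

# Hypothetical shapes: batch of 2 sequences, 8 tokens each, 32k vocabulary.
student_logits = torch.randn(2, 8, 32_000)
teacher_logits = torch.randn(2, 8, 32_000)
teacher_tokens = torch.randint(0, 32_000, (2, 8))
print(kd_loss(student_logits, teacher_logits, temperature=2.0))
print(sampled_distillation_loss(student_logits, teacher_tokens))
```

The first loss needs the teacher's logits at every token, which is why it is mostly practical for labs that control both models; the second only needs the teacher's sampled text.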
Otherwise, the paper has some interesting information but nothing super groundbreaking. This is par for the course for most technical reports these days.
Onto the evaluations, and therein the impact, of Gemma 3.
The best way to think about this model is a “general chat model” like GPT-4o and Claude 3.7 rather than a reasoning model like R1. The rise of reasoning models has made comparing models tricky because there are multiple evaluation suites that people care about — broadly characterized as a reasoning suite and an instruct suite. They overlap, but strong capabilities on both is rare.
Gemma 3 27B’s performance on some tasks like MATH and Bird-SQL (coding) matches the Gemini 1.5 Pro model from just a few months ago! The progress on small, open weight models is simply insane. Small models can perform excellently on narrow tasks like math and some coding, but they lack depth and world knowledge, as seen in GPQA or SimpleQA above.
Yes, DeepSeek distills are better at smaller sizes on MATH, but not enough people evaluate those distills across all capabilities like ChatBotArena. Having it all in one model is very convenient and is still how most workflows are handled.
Most people are also fairly skeptical of evaluation scores like MATH stated by Gemma, DeepSeek distills, and the like, claiming they don’t translate to real world usefulness. This is why the ChatBotArena results were the most striking of the Gemma 3 release. Gemma 3 falls in the top 15 of every category. It beats DeepSeek V3 with its 600B+ total parameters. It is outperformed in niche categories like math or coding by its peer models in the overall ranking, indicating a small level of superficial alignment, but doing this to get into the top 10 of ChatBotArena during this period of AI with immense competition is a huge accomplishment.
It is an ever-evolving open question how reliable chat evaluations like ChatBotArena are. These days, with RL training methods that maximize MATH evaluations so in vogue, the value of a broad chat evaluation is higher again. Is it representative of some subset of real-world use? That would indicate that the specific capabilities small models are able to excel at (math, general chat, etc.) can translate directly to real value.
This implies that tasks like SimpleQA and GPQA indicate performance on more niche tasks that not many people encounter, but we have a lot to learn as a field here.
With my use of leading models, I find this hard to believe — switching to something like GPT-4.5 feels like an immediate improvement in chat tasks. My conclusion is that the answer is somewhere in the middle — small open models can do super well on everyday tasks, but we don’t know exactly how to measure it. ChatBotArena and SimpleQA don’t tell us exactly what to expect from the models.
The fact that there isn’t a cliff in performance with models this small is going to drive substantial business value — it’ll be easier to find exactly the minimum model size needed for your distribution of tasks.
In the specific case of Gemma and most of the leading open weight models right now (with DeepSeek R1 as a wonderful exception), the licenses, which often restrict downstream use, are a brake on open weight model adoption. Without getting too much into the details, they expose companies looking to use the models to some amount of legal risk and add terms and conditions to finetuned models.
Largely, until feedback loops and use cases for open models are established, teams releasing these models don’t have strong cases other than future potential growth to counter the safe option that comes from legal teams’ recommendations. I hope that the evolution in the adoption of open weight models for valuable applications applies pressure to make licensing less of a headache.
The state of the open-closed model gap
3 of the top 15 models on ChatBotArena are open weights. In a world where frontier labs have many minor model versions crowding the leaderboard, this is an incredible platform for accelerating progress in open model adoption. Even just the gap from Gemma 3 to Google's flagship Gemini models is very small! The entire AI market is the most dynamic and competitive it has been in some time — maybe ever.
When it comes to “truly open,” i.e. actually open-source models, the gap to closed models has remained somewhat consistent — I estimate it as about 18 months behind closed labs. With open models generally on the upswing, open-source access to code, data, etc. is likely to come with it. The likes of Llama, DeepSeek, etc. are some of the most important pieces in the complete open ecosystem, and approaches like Ai2’s would struggle without them.
Related to this, in the coverage of DeepSeek R1, I noted:
This points to the first time since Stable Diffusion’s release that the most relevant and discussed AI model is released with a very friendly license. Looking back at the journey “open-source” AI has been on over the last 2.5 years, this is a surprising moment in time marked in the history books.
A month later, this is still the case.
To understand the progress of the open-closed model gap, with an oversimplification, there are two simple ways that models tend to get better:
* Scaling up model size, and
* Various data, architecture, and algorithm improvements that make an equivalent performing model smaller.
Progress on truly open-source models has always relied on the second of these due to the relatively smaller budgets of players in the open-source space. Because they are forced to focus on efficiency rather than scaling, it has often seemed like truly open-source models are falling further behind than they likely really are. With similar potential to scale, the gap between truly open-source and closed models would be quite small on the standard academic benchmarks that "raw model potential" is usually evaluated on.
When scaled-up open weight models like Llama 3.1 405B or DeepSeek V3 and R1 come out, they have been at the frontier. With more resources for truly open-source training, I don't see many reasons why that wouldn't be the case for models like OLMo too. In some ways, the biggest moat in training is just resource access. For example, DeepSeek is reported to have at least 10x more compute than the likes of Ai2.
This interpretation of the gap is more tractable than other critiques I've imposed on the open-source AI community in the past, such as the need for better preference data / RLHF pipelines.
Comparing Gemma 3 27B to OLMo 32B, the pretraining evaluations for both are super similar, but Gemma 3 scores are way better after post-training. The ceiling on post-training expectations has been shifting extremely fast among open models. Largely, open recipes are motivated by matching existence proofs like Gemma 3 or DeepSeek R1. Throughout 2025, expect open post-training resources to get even better again, motivated by the need for better performance and more detailed workflows on reasoning training.
All of these subtleties are paving the way for a better platform for building value with open models. With the gap from closed models to their open counterparts at its smallest, the opportunity is the best it’s ever been.
The race among inference providers to host models like DeepSeek R1 is a great example of this platform for progress. Even if the why of open models starts primarily as being good for marketing and recruiting, this can trickle down into bigger benefits.
Open weight models are likely to be offered for inference at far lower margins in the near term, due to the massive competition in the inference space, driving the prices even lower for those looking for ways to adopt them. Here’s an example of some of the leading platforms offering R1 inference:
This sort of everywhere-inference for an open model is something that has only happened in the past for Llama models.
Yes, I know APIs are orthogonal to the benefits I just listed, but they’re the way people become familiar with and interested in the models. APIs help people check if the model can serve their purpose, and then, if it does, they can spin up a more complex workflow.
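As a sketch of how low the barrier is: most of these providers expose an OpenAI-compatible endpoint, so trying an open weight model is a few lines of code. The base URL, environment variable, and model identifier below are placeholders; check your chosen provider's documentation for the real values.

```python
# Minimal sketch of kicking the tires on a hosted open-weight model via an
# OpenAI-compatible API. Endpoint, key variable, and model name are hypothetical.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.example-provider.com/v1",  # placeholder provider endpoint
    api_key=os.environ["PROVIDER_API_KEY"],          # placeholder key variable
)

response = client.chat.completions.create(
    model="deepseek-r1",  # model identifiers vary by provider
    messages=[{"role": "user", "content": "Summarize the tradeoffs of open weight models."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
```

If the model clears that bar, the same code can later be pointed at a self-hosted deployment, which is where the privacy and cost benefits actually show up.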
“Why open” is more important than “what is open”
The biggest stories in open-source AI in 2024 often felt like bickering about definitions. I wrote a lot of articles about definitions. Llama 3 was pretty much all we had to get excited about. At the end of the day, even though I think it would be better with more information shared across the whole stack of AI development, open-source is largely going to be defined by community norms. For now, Llama-style weight releases have been that norm rather than any formal definition.
By comparison, 2025 feels poised to be about actually building open AI. We have had surprising, impactful, and exciting releases and it’s only March. We know Meta is looking to get back into the conversation with Llama 4 in April at LlamaCon. We have our open-source ChatGPT. We’ll have more we can’t predict.
Crucially, on top of the gap being smaller, all of these open models are crossing meaningful boundaries in performance. When model capabilities made the leap to GPT 4 class models, tons more applications were possible. Now, we have GPT 4 class small models that can be deployed in privacy-conscious ways. There’s been a huge demand for this, and the ecosystem is slowly building the tools to do so. Yes, closed AI will continue to march forward, but open solutions need to prove their own independent feasibility.
In the long march of progress, open-source AI feels far closer to an inflection point of proving out the hypothetical benefits we have focused on for a few years. Transparency, privacy, better performance, etc. could actually all be happening this year.
For the backlog of content I've written on open-source AI, click here. Thanks to Vikram Sreekanti, Ross Taylor, and Xeophon for feedback that helped shape this post.
Eugene Vinitsky is a professor in New York University's Department of Civil and Urban Engineering. He’s one of my original reinforcement learning friends from when we were both doing our Ph.D.s in RL at UC Berkeley circa 2020. Eugene has extensive experience in self-driving, open endedness, multi-agent reinforcement learning, and self-play with RL. In this conversation we focus on a few key topics:
* His latest results on self-play for self-driving and what they say about the future of RL,
* Why self-play is confusing and how it relates to the recent takeoff of RL for language models, and
* The future of RL in LMs and elsewhere.
This is a conversation where we take the time to distill very cutting-edge research directions down to their core essence. I felt like we were learning in real time what recent developments mean for RL, how RL follows different scaling laws than the rest of deep learning, and what is truly salient about self-play.
The main breakthrough we discuss is scaling up self-play techniques for large-scale, simulated reinforcement learning. Previously, scaling RL in simulation had only become economical in single-agent domains. Now, the door is open to complex, multi-agent scenarios where more diversity is needed to find solutions (in this case, that’s what self-play provides).
Eugene’s Google Scholar | Research Lab | Linkedin | Twitter | BlueSky | Blog (with some great career advice).
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Show outline & links
We cover many papers in this podcast. Also, as an experiment, here’s a Deep Research report on “all the papers that appeared in this podcast transcript.”
In this episode, we cover:
* Self-play for self-driving, mostly around the recent paper Robust Autonomy Emerges from Self-Play (Cusumano-Towner et al. 2025). The simulator they built to power this is Gigaflow. More discussion on HackerNews. (Here’s another self-play for self-driving paper and another from Eugene from earlier this year.) A few highlights, with a minimal code sketch of the shared-policy conditioning after this list:
“All simulated agents use the same neural net with the same weights, albeit with randomized rewards and conditioning vector to allow them to behave as different types of vehicles with different types of aggressiveness. This is like driving in a world where everyone is different copies of you, but some of your copies are in a rush while others are patient. This allows backprop to optimize for a sort of global utility across the entire population.”
“The resulting policy simulates agents that are human-like, even though the system has never seen humans drive.”
* Large Language Models are In-context Preference Learners — how language models can come up with reward functions that will be applied to RL training directly. Related work from Stanford.
* Related literature from Interconnects! The first includes literature we mention on learning locomotion for quadrupeds with deep RL (special shoutout as usual to Marco Hutter’s group).
* Recent and relevant papers: Value-based RL Scales Predictably and Magnetic control of tokamak plasmas through deep reinforcement learning.
* Other things we mention:
* Cruise, Tesla, and Waymo’s autonomy stacks (speculation) and how the self-driving industry has changed since we were / were considering working in it.
* Evo 2 foundation model for biology.
* Eugene is working with a new startup on some LLM and RL stuff. If you’re interested in this episode, ping [email protected]. Not a paid promotion.
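To make the shared-weights idea from the quoted highlight concrete, here is a minimal sketch of a single policy network that every simulated agent shares, with per-agent behavior coming only from a conditioning vector of randomized reward weights and vehicle attributes. The sizes, feature layout, and architecture are hypothetical, not the actual Gigaflow model.

```python
# One policy, many agents: behavior varies only through the conditioning vector.
import torch
import torch.nn as nn

class ConditionedDrivingPolicy(nn.Module):
    def __init__(self, obs_dim=64, cond_dim=8, act_dim=2, hidden=256):
        super().__init__()
        # A single network shared (same weights) by every agent in the simulator.
        self.net = nn.Sequential(
            nn.Linear(obs_dim + cond_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim),  # e.g., steering and acceleration
        )

    def forward(self, obs, cond):
        # cond packs per-agent randomized reward weights (lane keeping, collision,
        # goal bonus, ...) plus attributes like vehicle length/width or agent type.
        return self.net(torch.cat([obs, cond], dim=-1))

policy = ConditionedDrivingPolicy()
obs = torch.randn(150, 64)    # e.g., 150 agents active in one simulated world
cond = torch.rand(150, 8)     # each agent draws its own conditioning vector
actions = policy(obs, cond)   # same weights, different behaviors
print(actions.shape)          # torch.Size([150, 2])
```

This is the "everyone is a different copy of you" setup: gradients from every agent update the same parameters, while the conditioning vector produces patient, aggressive, truck-like, or pedestrian-like variants of that one policy.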
Chapters
* 00:00:00 Introduction & RL Fundamentals
* 00:11:27 Self‑Play for Self‑Driving Cars
* 00:31:57 RL Scaling in Robotics and Other Domains
* 00:44:23 Language Models and In-Context Preference Learning
* 00:55:31 Future of RL and Grad School Advice
Transcript
I attempted to generate this transcript with ElevenLabs' new Scribe tool, but found the formatting annoying and reverted to Alessio’s smol-podcaster. If you’re interested in working part-time as an editorial aide to Interconnects, please get in touch.
Nathan Lambert [00:01:27]: Hey, Eugene. Welcome to the show.
Eugene Vinitsky [00:01:29]: Hey, Nathan. Thanks for having me. Excited to be here.
Nathan Lambert [00:01:32]: Yeah, so I'll have said this in the intro as well, but we definitely go well back in all the way to Berkeley days and RL days, I think.
I will embarrass you a little bit now on the live read, which is like, you were one of the people when I was switching into RL, and they're like, oh, it seems like you only figured out how to get into AI from a potentially different background, and that's what I was trying to do in 2017 and 2018.
So that was kind of fun, and now we're just friends, which is good.
Eugene Vinitsky [00:02:01]: Yeah, we were both figuring out. If I had any lead over you there, I was also frantically trying to figure it out, because I was coming from a weird background.
Nathan Lambert [00:02:11]: There are definitely a lot of people that do that now and over-attribute small time deltas to big strategic plans, which was probably what it was.
And we're just going to do some of our normal conversations on RL and self-play.
I think the backstory of this is you told me that your recent paper from some of your time at Apple, I don't want to time it too specifically, was something, paraphrasing, like the most exciting RL thing you've ever been a part of.
And major RL projects are not that frequent.
I think if you segment out all the language model excitement in the past 10 years, there's really a few major milestones, and it's good to kind of talk about them.
So we can kind of start, I think, basic things, like how do you define reinforcement learning, and it will kind of build up to this self-driving project.
Eugene Vinitsky [00:03:05]: Yeah, so I think RL is kind of a big thing, but at a really basic level, you have this process of taking actions in the world.
You're seeing the state of the world.
If you're taking actions in the world, you sometimes receive a reward that tells you the value of that action, and you're trying to kind of optimize your cumulative behavior over time.
So that, you know, over long trajectories, you're optimizing those costs.
That's both, you know, the hard thing and the exciting thing is that if you do it well, you can really optimize really long horizon behaviors.
Nathan Lambert [00:03:41]: Yeah, I agree.
And it's funny because now it's finally, the language models are finally doing this long chain of thought, and I don't really think that's the same.
I think the interactive notion will come up a lot here where these long context behaviors are many, many actions interacting with the world relative to one really, really long action, which is kind of odd.
Eugene Vinitsky [00:04:04]: Yeah, I guess, yeah, it mixes things, right?
Because it has very long state, right?
It's got very long contexts, and it's generating its own context.
But in the end, there's really one action at the end that, like, kind of determines how everything went, you know?
Nathan Lambert [00:04:23]: Yeah, yeah, yeah, we'll get into this.
And then the next thing that we kind of need to set up is what do you define self-play as?
I think this word has been particularly broken in recent times with language models, and I'm hoping we can get a fairly specific criteria for what is self-play and what are related topics.
Eugene Vinitsky [00:04:42]: Yeah, I think even within the field, there's quite a bit of debate as to what constitutes self-play.
So talking to, you know, experts, people will disagree about what methods are and are in self-play.
But what I will say is I generally define self-play as anything where an agent plays a copy of itself.
So up to a bunch of different agents interacting with each other, as long as they're mostly, in some ways, copies of each other, we're doing self-play.
Nathan Lambert [00:05:12]: Yeah, and then do you think anything, I mean, your background's in multi-agent as well.
Do you think there is something fundamental to kind of a game that has a really specific hill to climb where it's kind of this competitive nature versus something like language?
Eugene Vinitsky [00:05:29]: Yeah, this is kind of the dream of, I think, some multi-agent researchers is this type of like ratchet effect where you have a bunch of agents interacting with each other and kind of increasing complexity on the part of any agent generates increasing, like creates new challenges that need to be solved and then force you to learn new skills.
And then you kind of get this endless, endless ratchet.
Maybe that's what you meant.
I may have misinterpreted.
Nathan Lambert [00:05:55]: We're going to revisit it.
I think also it's like, how does the multi-agent nature of a lot of these things change what people think about with RL?
This is kind of the last building block before we go into the self-driving stuff.
Eugene Vinitsky [00:06:07]: Yeah, yeah, yeah.
So the way that the multi-agent thing changes things is it makes everything much harder and more interesting.
So you go away from this world where you have like a clear score function, right?
So you have some reward for first in single agent setting, you have some reward.
If that reward is high, you're doing well, right?
And when you move into the multi-agent setting, it becomes reward with respect to whom, right?
It all of a sudden matters whom I'm playing, right?
So if we go to a game of like, like one setting is like two players, zero sum games, right?
So a game of two player poker, I give you, I train a poker bot, right?
How do I know it's any good?
I have to play another poker bot to decide that it's any good, right?
And so all of a sudden, this challenge of like, what is a good policy becomes very fundamental.
And you kind of lose even a notion of there being like one clear good policy.
And like the whole, a lot of, a lot of the field of multi-agents is coming up with different definitions of what constitutes goodness.
Nathan Lambert [00:07:06]: Um, so, and then back to the self-play thing with that, like, is all of the self-play that we discussed, like if you were playing yourself, does the same consideration apply?
Like, is that, is self-play necessarily a multi-agent framing?
Eugene Vinitsky [00:07:19]: Um, I think it, I think it is because oftentimes what we're trying to do with self-play is like to converge to some notion of policy goodness.
And self-play is just a mechanism that gets us to some definition of, of high quality policies.
Um, and, and, and what turns out to be the case is there, there are actually many like non-self-play type methods for doing this.
Self-play just turns out to be an effective way to accomplish constructing effective policies.
Nathan Lambert [00:07:45]: Yeah, I, I, there's many, I'll, I'll link later a lot of these papers on self-play for preference learning and look into them a bit more.
Eugene Vinitsky [00:07:56]: Yeah.
Nathan Lambert [00:07:57]: Essentially that's been the lens.
There's two lenses by which this has come back and both of them, I don't think fit into, I, I think this multi-agent lens of self-play is much richer and I don't think any of them have fulfilled this.
I think there's useful methods for preference tuning.
I think that's like maybe SPIN, it's like self-play something preference learning, is one.
And there's papers related to this where they're probably looking at the probability of the own model in generating a response or something like looking at the internals of the model.
And it's not really set up in this game nature of some sort.
And then also with Q*, when the self-play stuff came back, where I really think, I've talked to some people that did original reporting on this, and it was that the model looked like it was talking to itself.
And I think that very understandably for less, a little bit less technical audiences that haven't engaged with self-play, that coverage of talking to itself got transformed into a self-play commentary and hype cycle, which took people down the wrong path for like an entire year, which is so brutal, but also very understandable and funny.
Eugene Vinitsky [00:09:11]: Yeah, I think there's something interesting and different happening in these like multi-agent like LLM self-play setups.
I'm not super familiar, but I think what's happening is something quite different than what we mean in other multi-agent settings when we're talking about self-play.
Like I feel like it's, it's more about like refining like the distribution of actions that it takes in some, some kind of odd way.
Nathan Lambert [00:09:39]: I think this sounds ridiculous at first pass, but it's almost that the language models are simulating a softer version of self-play within themselves to kind of check their own work and engage in their own discourse, which the level of intelligence they have is not going to like unlock the true like incremental progress that we think of with self-play.
Which probably, I think for context, the things for self-play that have been very impactful, just to put them on the record, are things like AlphaGo and MuZero.
I think that's, those are the prime examples of generating some superhuman policy in a closer way.
I think it's, it's important to kind of gate the conversation on like, these are the aspirational goals, um, in terms of outcomes and then figuring out how to apply them to new domains and new tools is kind of unknown.
Eugene Vinitsky [00:10:31]: So, so maybe I should have said this earlier, but like self-play is the thing that gives a, is like maybe the one way that we know to build superhuman agents right now.
So, right.
So, um, superhuman go, um, human level Dota, human level, uh, Starcraft.
Um, technically poker is in a, in a slightly weirder, um, weirder space where I don't, I don't exactly know that I would call the methods that underlie that self-play.
Um, sorry.
Um, and, uh, but yeah, it's the one way we really know how to build superhuman agents.
Nathan Lambert [00:11:06]: And I think this is a kind of a natural transition because the, to make people excited in the work that you did, it seems like you've discovered superhuman driving through self-play without inductive biases.
And I'm like, um, how do you view the potential impact of this?
And then we can kind of go into the method.
Eugene Vinitsky [00:11:27]: Right.
So the, the challenge with self-play is, and this requires a bit of technical detail to get there, but you know, in, in like two-player, zero-sum games, games where you and an adversary are playing with each other and somebody wins and somebody loses, there's a very well defined notion of what being good is.
Um, you know, there are, there are, well, you know, criteria that we would like our policies to converge to.
And, and the challenge has always been about moving beyond that to a domain where it's much harder to define what, what doing well means, right?
There isn't like an abstract notion of what good driving is there out in the world where I could just write down the reward function and simulate it and optimize with respect to that.
And all of a sudden I'd have a good driving policy.
So the, the gap has always been between these methods that work really, really well in, in well-defined games like, like Starcraft or go, uh, and chess, um, and settings where it's much harder to define that.
And so we haven't been able to, to move to self-play in settings where, for example, humans might be in the loop, right.
And, and particularly driving is an instance of that somewhere where at the end, we're going to take our policy and it's going to drive with humans and we have no way to simulate humans and play against them.
Um, and so figuring out how to close that gap has been kind of an open, open challenge.
And I think maybe this is the first instance of, uh, finding a way to do that.
Nathan Lambert [00:12:51]: Okay.
So that's a much better motivation than I gave.
And I understand the excitement now, because if this works in one domain, um, and you'll tell us about how grand of an effort it actually was.
I know big tech companies can put a lot of force and long-term investment behind things to get them off the ground.
Then a lot of the other things that people are saying about language models or other complicated domains are at least there's an existence proof of something similar happening.
So why don't you just continue to explain, uh, this problem set up of learning driving without having a human teacher.
It will probably take detours to analogize different self-driving stacks just because we know about them and it's good to compare.
Eugene Vinitsky [00:13:36]: So one way of framing this is, and I'm going to put cautions in the end, I'm going to give you the, the, the extreme version of it.
And I'm going to walk it back a little bit is like human level driving without any human level data.
And the caution needs to be that this is in simulation and our ability to measure human level driving in simulation is limited in a lot of ways.
So I can tell you about the ways that we measured it and then I'll, I'll have to tell you what the limitations of those things are.
Um, so this was a large scale effort, um, uh, in Vladlen Koltun's team at Apple, um, it was about like eight researchers, research engineers working together for about a year and a half, uh, build, building the stack out.
Um, it was, I think a lot of us came at it from different places.
I know some folks were very inspired by this idea of like AlphaStar for driving, you know, building a diverse, rich world and then driving in it in a way such that you would, you would transfer to policies that you hadn't seen before.
So like human actors.
Um, so, um, yeah, the, the, if, if, if it's helpful that the idea here is that, or the goal here was to build a human level simulated driver.
Um, and here, what that means in our case is not a fully end-to-end method, right?
So we're not simulating perception.
So driving stacks consist of like generally perception, prediction, planning controls.
So you have a perception stack that, you know, takes your LIDAR, your camera, your radar, and converts it into, you know, where the cars are, where the road is, what's impassable.
Um, and then a prediction stack will take the like positions of all the cars, the cyclists, pedestrians, and it'll predict, predict where they're going to go next.
And then a planning stack will say, okay, given those predictions, you know, what's a good trajectory for me to take.
And then the control stack will say how to actually follow that trajectory safely and robust.
Right.
And we're talking about subsuming the prediction, planning, control portion of the stack, not the perception part of the stack.
Nathan Lambert [00:15:28]: Okay.
So I was, I was thinking that you might not even do control.
I was thinking you might just say, uh, control is a softer album and not do that too.
Eugene Vinitsky [00:15:35]: So in the same way, we're kind of, we're only kind of doing control.
Uh, we're, we're, we're doing this for, I think Waymo uses the
Nathan Lambert [00:15:42]: the term behavior for this.
I think it's been their behavior team for a while.
Is that right?
Eugene Vinitsky [00:15:46]: Okay.
Nathan Lambert [00:15:47]: Uh, you know, I very, it's hard to know where the abstraction ends, but they definitely have a behavior team that's done a lot of things through the years.
Well, it's at least among the job apps that I've applied to or interviewed for in the past.
Yeah, me too.
Eugene Vinitsky [00:16:01]: Um, I think we do know how to control cars.
We know how to make cars follow a pre-specified trajectory, right?
This is, this is somewhat of an easier problem than like humanoid robotics or something.
You know, big thing got wheels.
We know how to make it turn.
Nathan Lambert [00:16:14]: Um, so how do we get these things from, I mean, they start as like, it doesn't start at just all the simulated cars crashing all the time.
What is the start here?
Eugene Vinitsky [00:16:24]: I'll send you the video once it's out, but like, you know, the, the first 10 hours of simulation is just like cars scattered all across the road, smashing into each other, driving off the road, that type of thing.
It's actually interestingly useful because what we do is when two cars crash, we have them come to an immediate stop.
And this actually creates a lot of blockades in the road.
So at some point during the training, the cars start to learn to drive around stopped cars, even though those cars are stopped because they've crashed, um, as well as to drive around like obstacles and things like that.
Um, so that, yeah, that's what it looks like.
Um, yeah.
Nathan Lambert [00:16:58]: Um, as well as the reward function for these.
So you have a bunch of cars that can see their peers and there's some reward function I'm guessing.
Eugene Vinitsky [00:17:06]: So the, the major component of the reward function is getting to your goal without colliding.
So we, we have these maps that we've taken from the CARLA simulator.
They're fairly large maps.
Some of them are like multiple kilometers in spatial extent.
We have eight of them and we place goals randomly over the map.
Um, and you get a sequence of goals.
So, you know, that like, okay, I want to get to this point.
And then after that, I'm going to want to get to this next point.
After that, you're going to get a big reward for getting to that goal.
You're going to get some amount of penalty for colliding.
And then there's also an implicit penalty because if you collide, you're not ever going to get to your goal.
And then there, there is some amount of hand design here in that there are small rewards for like staying in your lane and being aligned with your lane and like, you know, not driving in the opposite direction in the wrong lane.
Nathan Lambert [00:17:51]: This was one of the questions is if you had to do this sort of thing.
Eugene Vinitsky [00:17:54]: You have to do that.
But one interesting thing, and maybe we could talk about that at some point is we randomize the weights of those rewards.
So there are agents that like really want to drive in the lane going in the right direction.
And there are agents that don't care about that at all.
And they will take the wrong lane on the highway, uh, you know, going at full speed in the opposite direction.
And that's kind of useful because you're ready for that scenario.
You've seen that scenario in the world when you're driving around.
Right.
Um, but yeah, we have to, we have to do some of that stuff because at some point there are laws and you can't avoid encoding the laws into your system.
You know, stop signs are a human concept.
Um, they're, they're not, you know, it's not going to emerge that you see a red thing and you're like, oh yeah, that means I should stop.
And then I should like give the right of way rules to the other cars.
Um, but all of our rewards are kind of soft in the sense, like, you know, if you're at a stop sign and folks have been preventing you from going for a very long period of time, right.
You're going to start to nudge in and like break the rules about right away.
Nathan Lambert [00:18:55]: One of my questions for later on this is like, do you think our vehicles and driving dynamics and infrastructure kind of constrain the way of driving?
Like we've co-designed human driving and our infrastructure so that human driving is actually no longer that special, because the track is so well defined.
Eugene Vinitsky [00:19:11]: I think this is, this is part of why this is all going to work or like why it works is because like human, human driving is, and human behavior in many domains is like fairly constrained by the institutions and the laws and the norms that we design.
Uh, it's not super free from, uh, so like driving amongst humans is much more of a constrained problem than you would, than you would, you would think it's also unconstrained in some interesting ways, but, but it's, it's quite unconstrained, quite constrained.
Nathan Lambert [00:19:42]: And how hard to act was this to actually learn?
So how sensitive of a process is it now?
I think in the paper, you're talking about Gigaflow, which is like a high speed simulation engine.
So obviously, you know, on data, the final paper says that it learns in 1.6 billion kilometers of driving.
I was wondering if you had an intuition for that.
So like how many miles are driven by all the cars in San Francisco in a day or something like this?
Eugene Vinitsky [00:20:10]: That's a, that's a great question.
Nathan Lambert [00:20:12]: Um, it could be a good chat GPT query, to be honest.
Eugene Vinitsky [00:20:16]: This might be a chat GPT question.
Um, let me, let me give some, some numbers that I do know.
Uh, and this is kind of maybe helpful.
So I think cars crash every 20,000 to a hundred thousand miles and a fatal collision happens every a hundred million miles, something like that.
Um, but how many miles are driven in a day in a city?
I'm not sure.
1.6 billion kilometers, the distance between here and Saturn.
Um, it sounds like kind of far when you put it that way, but there are a lot of cars.
Yeah, there are a lot of cars, right?
There are a lot of drivers.
Um, there are surprisingly few trips in a city, fewer than you would expect, but, um, I'm struggling to put a number on it.
Nathan Lambert [00:21:01]: Um, I'll tell you what chat GPT gets when it's done.
I think it was o3-mini-high.
This is not a reliable number.
Take this time.
So your intuition that it's lower counts for a lot.
I mean, you've thought about a lot of these car systems for a very long time and I will link to some of your other work on this.
So you definitely have better intuitions than I would.
Eugene Vinitsky [00:21:20]: Well, the intuition comes with the fact that like a lane of the highway can take 2000 vehicles per hour, which is like just not that many vehicles.
Um, and you know, most, most of, most of traffic is between like, you know, 8am and or like 7am and like 10am and then on the way back home.
And so, you know, you can like kind of estimate based on how many lanes there are on the main highway, how many trips there are.
Nathan Lambert [00:21:43]: So San Francisco, the chat Oh three mini high estimated four to 5 million miles in a day in San Francisco.
It's a bully.
It's a plausible number, but it's well below what you are doing.
Like this is, I think maybe globally this billion kilometers could be hit.
So this is okay.
Eugene Vinitsky [00:22:03]: Here's one way to think of it.
We simulate 10,000 years of human driving.
Nathan Lambert [00:22:08]: Okay.
So yeah, 10,000 per one.
I guess it depends on how many cars you have in parallel.
Eugene Vinitsky [00:22:14]: Per one training run, to get the policy that we get.
We simulate about 10,000 years of human driving.
Nathan Lambert [00:22:20]: Yeah.
Eugene Vinitsky [00:22:21]: Yeah.
Nathan Lambert [00:22:22]: So to have 10,000 cars, it's all of them driving for a year.
Eugene Vinitsky [00:22:26]: Yeah, exactly.
And we have about like a million cars driving at any given time in the simulator.
Nathan Lambert [00:22:34]: Do you think that substantially changes the learning dynamics?
Like are they all, how many cars are any of them interacting with at any one time?
Eugene Vinitsky [00:22:40]: Yeah.
Any given simulator in any given world.
So this is this like kind of like Isaac Gym-style vectorized simulator.
So it all runs in the GPU.
So it's a bunch of worlds happening in parallel, but any given world, there are about 150 cars in it, which means that sometimes you're driving in sparse traffic and sometimes you're going to interact with like 10 or 20 cars at any given time.
Um, and I, I think one thing is that one, one cool thing is that at that scale, I think RL becomes very, very stable.
Um, like for us, like every training run succeeds, the reward curves go straight up.
You know, there's no like, um...
Nathan Lambert [00:23:19]: What are you scaling? Are you just like scaling batch size effectively?
Uh, what is, yeah.
What is the actual thing you're, they're scaling?
Eugene Vinitsky [00:23:26]: We're scaling the amount of experience generated.
So it's like a trillion samples of, of total experience, um, that, that the agents train on.
Um, and then, yeah, we use gigantic batch sizes, like, you know, um.
Nathan Lambert [00:23:43]: But like, what is the thing that you need to dial up in order to make learning actually happen?
Eugene Vinitsky [00:23:47]: Uh, total amount of experience generated, right?
So you need to be generating, you know, million samples per second to train on type of thing.
Nathan Lambert [00:23:57]: Okay.
And then what is the actual, I guess I don't know a ton about multi-gen RL, but what is the actual RL like algorithm and is it a giant replay buffer that is just building and building and building?
Eugene Vinitsky [00:24:08]: It's PPO.
Uh, you know, one thing we've been seeing throughout our work pretty continually is that for, for both theoretical and empirical reasons, PPO is actually a really good multi-agent RL algorithm.
Nathan Lambert [00:24:20]: You had the paper, are you, you are on the paper years ago.
That's like on the something, something PPO multi-agent simple.
Eugene Vinitsky [00:24:29]: So we know that PPO works empirically pretty well.
That's basically the title of the paper.
It's that PPO is simple and good in multi-agent cooperative games.
Uh, it's good in cooperative problems.
It's, it turns out to be pretty good in two players, zero, some games.
And, and here in, um, this driving thing, it's what's called the general sum game.
And, and there, you know, it seems to work in the setting too.
So evidence is accumulating.
Nathan Lambert [00:24:51]: Something that people probably don't know about multi-agent RL and maybe I don't know either, but in this paper, all of the cars were using the same actual weights of the model.
Is that standard in multi-agent RL or is it kind of a variable?
Eugene Vinitsky [00:25:04]: So I'll add one little, uh, subtlety here.
So yes, we're using every policy is the copy of the same agent, right?
They're all looking at their local observations.
So it's decentralized, but it's all one copy, but every agent gets its own like conditioning vector.
That's like, what are my like reward weights?
How big of a, you know, what's my width and my length?
Am I a cyclist?
Am I a pedestrian?
Am I a driver?
And they flexibly adjust their behavior based on that condition.
Nathan Lambert [00:25:29]: Do you think that's actually like, if you were to squint at the system, is that actually changing the policy or is it changing the environment in kind of an indirect way?
Eugene Vinitsky [00:25:38]: It's, it's changing the policy.
Like you'll see that like a car is like, oh, I'm a, I'm a, like a pedestrian.
I'm a, I'm a big truck.
I'm going to do like a K point turn to turn around.
Uh, I'm a pedestrian.
I'm, you know, going to like smoothly wiggle through these small boxes of areas that I couldn't get through.
Otherwise it, it, it really, uh, appreciably changes the policy, which is cool because it's this like tiny 3 million parameter neural network or like 6 million parameter.
Um, and, and so like, there are all these like little sub policies inside of it that you can activate by, by conditioning.
Nathan Lambert [00:26:11]: Can you do it, um, post hoc to change the behavior in an interpretable way?
Eugene Vinitsky [00:26:16]: Um, I don't know about interpretable.
I guess it, it sometimes depends what we mean when we say interpretable, but yeah.
So you can be like, look, okay, you, you, you don't care about staying in your lane and you'll see it start going into the other lane and driving.
You know, you change the size of the policy or like the, the car and it will change the trajectories that it takes in response.
Um, it's, it's very responsive to this condition.
Um, we have some cool graphs in the paper pointing, pointing out all the different things you can make it do by changing these, these values.
Nathan Lambert [00:26:46]: Um, I'm trying to think of how this reflects on the nature of driving and what the downstream use of this tool is.
So you showed that this is doable and what does this, like, what does this mean for self-driving specifically?
Like, what would you do if you had the same big team and you know that this exists and you're interested in self-driving as a field?
I mean, there are obviously a lot of people that a lot of companies that have big teams and lots of money to try to think about self-driving.
Eugene Vinitsky [00:27:14]: So as I said earlier, like there's this like, um, perception, prediction, planning, control stack.
And I think this is a really is providing a lot of evidence that you could maybe subsume the prediction and the planning stack, um, and, and put it into this type of like end-to-end policy that you could then like train in sim and then maybe not zero shot deploy onto the roadway.
Just like take a straight from sim, put it onto the roadway though.
I think like maybe possible, uh, but like really give you this like base policy that you could then start to put on the roadway and start to build this flywheel, um, that you can then use to collect, you know, more and more experience, validate the safety.
You know, like if you're, you know, if you're a, um, uh, automotive manufacturer that doesn't have like a full spun up self-driving team, but you have a pretty good perception stack, like this is something that you can use to just like get something out in the world pretty fast.
Cause like three, I think like two, two, three days of training later, you have something that I think, and we'd like to start testing it, uh, can be like straight up put onto the roadway with humans driving around and things will be like pretty okay.
Um, you know, don't take the safety driver out, but like, yeah.
Nathan Lambert [00:28:24]: And you have some cred saying this, given that you've done RL experiments with real cars. This is not something that's, um, ripping off the bandaid for the first time.
You've done different types of self-driving experiments with RL policies in the real world.
I don't, it might not be at the same level of the stack, but I can add links to that.
Eugene Vinitsky [00:28:42]: That was a lot more constrained, right?
We were putting these cars on the highway to like smooth traffic.
So they would drive in a way such that like stop and go waves and traffic would like get smoothed out and disappear.
Um, but there it was just like, you know, stay in your lane, follow the car behind you here.
We're talking about like, you know, complicated interactions at intersections and that type of thing.
So a lot, a lot more like safe, everything there is safety critical, but like significantly less constrained than anything we've done in the past.
Nathan Lambert [00:29:08]: And then to kind of keep leading this on, uh, I will say a bunch of things because you're more of an industry insider.
So it makes it less revealing if I say things, cause I don't really know anything.
Um, back when I was interviewing for a job and around 2021, at least a lot of RL people were interviewing with self-driving companies who were doing extensive research in RL for different parts of this behavior stack.
Um, even at that time, four years ago, sensing and prediction, like perception, seemed largely solved.
At least CV stacks are really mature and figuring out the actual driving component and decision making was really hard.
There was, I mean, I did a Tesla take-home for their self-driving team, and they were hiring other RL people. That take-home was ridiculous.
Eugene Vinitsky [00:29:54]: I was like, yeah, I remember that.
Nathan Lambert [00:29:56]: Freaking intersection of polygons.
It's four years ago.
They've got to be using a different question, but it was so hard.
Um, I did end up solving the test cases.
Um, it was, I solved the test cases.
God, that was rough.
But essentially the rumor was they're doing something like MuZero for self-driving and/or a mix of imitation learning, where there's a duality of learning a world model from real data relative to building a simulator.
But the motivation of the work is very similar, which is in MuZero, you want to unroll trajectories and be able to learn from that and distill an RL policy, versus if you have a big simulator, you then can learn everything from scratch and figure out how to transfer that to real.
And I think there's different assumptions on what would work.
And in the history of RL, it is now understood that simulator-to-real is generally a more promising path, if you can build the right simulator, than going from real data and enhancing real with RL alone.
Um, Cruise was building a research team.
And one of the best engineers I talked to was trying to build a world model, or like a simulator, and do this like AlphaGo for self-driving.
I think that was a phrase from the interviews four years ago.
So a lot of this, and Waymo is now obviously winning.
I think Waymo, I don't know exactly what they're doing.
I think their stack is actually probably the most complicated, um, where they probably were looking at behavior, like all sorts of RL inspired things for very specific parts of the stack to, to improve behavior.
But it's funny that looking back four years ago, this was something that the spectrum of ideas that industry was looking at was actually very related to this.
And in the same time, the self-driving industry has changed a lot.
Uh, so what do you think of this whole industry of self-driving relative to, you have a lot of experience here.
I mean, I'm, I'm a big Waymo fan now, but there's just like, it's so funny how these things evolve.
And I think after this, later on, we'll talk about the, like, this is the RL specific trajectory with simulation, simulated results and stuff too.
Eugene Vinitsky [00:31:57]: I mean, we were interviewing at the same time.
So I was also interviewing with all of these self-driving companies when you were, uh, and, and it, it did seem like it was the place that was the most friendly to doing RL type research at the time.
Um, I think now almost everyone has gone all in on this like imitation learning type approach, um, that are like, this is a huge fraction of what people are doing.
I think a lot of the RL teams have been spun down, uh, which I think is unfortunate a little bit because I think what this work shows is that, uh, it may be wrong to do so that there is a lot of, a lot of value still in RL for this last piece of, of the, of the puzzle.
Um, you know, um, you know, one thing we have here is, uh, an insanely robust policy, right?
So like just an end to end neural network in SIM, it crashes once in a million miles,
Nathan Lambert [00:32:46]: um, crashes at all.
Eugene Vinitsky [00:32:49]: Yeah.
Nathan Lambert [00:32:50]: And you, but what was the number you said before for miles per crash?
Eugene Vinitsky [00:32:53]: Uh, humans are between 20 and a hundred K, um, somewhere, somewhere like that.
It's a little hard to get estimates because it varies from place to place a lot.
So, I mean, a lot of industries are pretty excited about this, like alpha zero for self driving type thing.
And the question, you know, becomes, as you said, like, what is the simulator that we do this in?
And so one perspective that's very prominent is like, let's collect a lot of data.
Let's sell the world model and then let's unroll in that simulator.
And then the challenge becomes like, who do you unroll in that simulator?
Now your world model has a build into itself, a model of the other agents, right?
If you kind of take the single agent perspective, I'm going to unroll world model.
I'm going to place a car inside of it.
And that's the car I'm going to train with RL.
And now what happens.
Nathan Lambert [00:33:40]: This was a big problem for self-driving because you have like a dynamic number of, um, objects in the scene that you're supposed to reason about with your world model.
How does the policy that you train handle this kind of agents coming in and out?
Now, is it all just that you have some, like, are you identifying entities as nearby as other cars are nearby or is there some abstraction or is that the perception stack handles that?
Eugene Vinitsky [00:34:04]: Yeah, exactly.
We roughly simulate a sensor in the sense that you only see cars in some radius of yourself.
Um, but, but we don't, we don't, yeah.
I mean, all the cars are there persistently in the simulator driving around and we, we answered this riddle of like, what should the other cars do by like their self-play, right?
They're a copy of your policy.
They're driving around.
Um, whereas I don't know what happens in the world model, right?
Like kind of in this like world model approach, you're limited by how capable the world model is at simulating the behavior of other actors.
And if your world model has actually learned a robust model of human driving for all the other agents in the simulator, then like, you don't even need, you don't really need to do RL, because like the world model already has a model of how humans should behave in a simulator at human level, but they don't.
Um, so yeah.
Nathan Lambert [00:34:53]: And it's just like, it's just, it's, it's so funny that it just feels like they haven't.
And the only way that Waymo et cetera has gotten it, it seems like Waymo has adapted an autonomous stack with like some human inspiration to make the driving more smooth, is what it seems like when you're in it, which is like extremely, really strong perception and world understanding with some really clever policy that is tuned to feel human, but probably not human or RL at the end of the day.
Eugene Vinitsky [00:35:27]: I wonder, I don't know what Waymo's planning stack actually looks like in the end, right?
Like Waymo's pretty secretive and, uh, I've never worked there.
Um, and if I had worked there, I wouldn't be able to say.
But if I had to make a bet, it's some kind of hand-designed cost, mixing a bunch of terms together about what a good trajectory looks like, maybe mixed with a little bit of human data to make that trajectory feel a little smooth and human-like.
Nathan Lambert [00:35:59]: Yeah, I agree with this. To prompt you:
what does your history of being a nerd about urban planning make you think about what is coming for self-driving cars?
Eugene Vinitsky [00:36:12]: So I guess the thing to mention is I'm a professor of transportation engineering, among other things.
So I'm required to have some thoughts on this.
I think that self-driving cars are coming.
I don't know if they're coming a year from now or who knows when — whenever the cost curve gets driven down.
Nathan Lambert [00:36:32]: Where we live, they're more likely to come sooner given tech hubs and, um, people are willing to pay very high premiums.
Eugene Vinitsky [00:36:39]: That's true.
So like, like a lot of goods, they may come for, for wealthy folks first.
And then that allows the cost scaling to come down over time.
And it really is a magical experience to take a Waymo, right?
I remember the first day I saw the cars driving around with nobody in them.
And I actually just started chasing one of the cars because it was such a magical moment.
I needed to experience it for as long as possible.
Nathan Lambert [00:37:04]: Um, yeah, my first time was in Scottsdale, Arizona for one of my best friend's bachelor parties.
He's also an engineer.
And we saw one driving with no person.
And I was like, I wonder if we could take one.
And I immediately downloaded the app.
And because it's in the middle of nowhere — their testing zone —
they have tons of supply and no demand.
Eugene Vinitsky [00:37:20]: So we were just immediately able to drive one around.
I actually sat in an airport in Phoenix for three hours while my phone upgraded to the newest OS so that I could download the app and take a Waymo for the first time there.
Nathan Lambert [00:37:36]: Uh, yeah, it's, this is totally reasonable behavior for anybody listening and you should update your prior.
If you don't think it's reasonable, it's totally reasonable.
Eugene Vinitsky [00:37:44]: It's a totally normal thing to do.
But I think in cities, it's still going to be a long time before these things are rolled out at scale,
just because of costs, safety, how long it takes you to verify that it's safe to drive in a new city.
I mean, okay, let's put Tesla aside.
I don't, I don't talk about it.
I don't really know how to think about that.
Um, but that's how I feel too.
There are parts of the United States that are, due to state dependence, very spread out, right?
Because of suburbanization — I don't know if that's a word, but it's the word I use —
they're very spatially spread out.
Like in my grandpa's hometown, where my grandpa lives, there's no public transit.
There's no way to get by without a car.
Public transit isn't viable because of the way people are distributed.
So if systems like those continue to exist, people will continue to drive there.
And over time, those things will be replaced by a few self-driving cars.
As a public transit advocate, I would still say that within cities it is significantly more efficient to fund buses and subways and things like that.
But there are parts of the US that are just set up that way, and I expect self-driving cars to be part of that.
Uh, yeah.
Nathan Lambert [00:39:15]: I mean, this isn't a hot take.
I think you're just being realistic, and you don't have a crazy worldview about it.
Eugene Vinitsky [00:39:22]: Yeah.
I mean, I have my real deep love for public transit and a desire for more people to experience it than just the people who live in New York City, which I think sees something like 50% of all public transit ridership in the US.
But the system is what the system is right now.
Nathan Lambert [00:39:41]: Yeah.
Okay.
Let's pivot from self-driving land, where we've had this self-play RL, and try to draw some analogies to the other RL breakthroughs that aren't language models.
I think the one that everybody should know about — or many people do — is locomotion and sim-to-real with robotics: humanoids, quadrupeds.
Eugene Vinitsky [00:40:07]: Yeah.
Nathan Lambert [00:40:07]: If you look at it, it is definitely directionally similar to what this self-play thing is.
I think it's hard for people who haven't been in RL to understand the grandness of this transition from one-agent locomotion to many agents doing something in a cooperative or competitive game with these same mechanisms.
I feel like even talking to you, I don't think we've done a good job grasping that enormity.
Multi-agent is just historically so much more complex.
I don't know if there's anything about something like OpenAI Five for Dota — I wish I knew more of the lore of how that happened and why it didn't continue — because I feel like it could be a good example of why this is actually so much harder than even something like AlphaGo, which is just one policy, and these robotics things we're going to talk about, which are all still one policy, just one thing in the world.
Eugene Vinitsky [00:41:07]: So let me, let me give it another try because I think I also haven't done the greatest job describing it.
So in something like Dota or Go, there is in fact a notion of a best way to play.
It's, well, it's a Nash equilibrium.
You can't do better than that.
If you play it, nobody can beat you.
Nathan Lambert [00:41:27]: Have we arrived at that at Go?
Have we actually arrived at that in chess and Go? Because the Elo scores are still going up.
Eugene Vinitsky [00:41:33]: No, we haven't.
Nathan Lambert [00:41:34]: But like conceivably there is a max.
Eugene Vinitsky [00:41:37]: There is a max.
You're never going to get it — the game's too big — but there is a best way to play.
And then in every domain where there's a human in the loop, there's not an obvious best way to play.
And so the challenge has always been: if I run self-play, it's going to converge to some behavior.
And that behavior is not necessarily something that can drive with humans in the loop.
It'll learn something like — you could imagine, for example, you do a bunch of self-play and the cars learn that they can tell their partner where they want to go by hitting the blinkers left, left, right, right, left, left.
That means like, I'm taking a left turn and I'm going to go at 25 miles per hour.
And so there's this idea that like, there's all these policies that you wouldn't want to play and don't make any sense.
And kind of what we show in this paper is that if you do a little bit of reward design and you really scale up RL, then the simple fact of being uncertain about where everybody wants to go and having to be very robust to collisions constrains your behavior in such a way that you broadly learn how to drive well.
And I think this is transferable to other domains where you want some kind of base policy that roughly knows how to do the task well over some unknown distribution of partners, as in the sketch below.
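A minimal sketch of the self-play setup being described here, assuming a simplified multi-agent driving environment; the environment interface, reward terms, and coefficients are hypothetical stand-ins, not the actual codebase from the paper:

```python
# Every car in the simulator is controlled by the SAME policy (self-play),
# and each agent is uncertain about where its partners want to go.

def reward(progress_toward_goal: float, collided: bool, off_road: bool) -> float:
    # A little reward design goes a long way: goal progress plus harsh
    # penalties for collisions and leaving the road (coefficients are illustrative).
    r = 1.0 * progress_toward_goal
    if collided:
        r -= 100.0
    if off_road:
        r -= 10.0
    return r

def self_play_rollout(env, policy, num_steps: int):
    """Collect experience where all N cars act with the current shared policy.

    `env` and `policy` are placeholder objects: env.reset()/env.step() return
    per-car dicts, and policy.act() maps a local observation (nearby cars only)
    to an action.
    """
    obs = env.reset()
    trajectories = {car_id: [] for car_id in obs}
    for _ in range(num_steps):
        actions = {car_id: policy.act(o) for car_id, o in obs.items()}
        next_obs, rewards, dones, infos = env.step(actions)
        for car_id in obs:
            trajectories[car_id].append((obs[car_id], actions[car_id], rewards[car_id]))
        obs = next_obs
    return trajectories

# Training driver (commented out because env/policy are placeholders):
# for iteration in range(num_iterations):
#     trajs = self_play_rollout(env, policy, num_steps=1000)
#     policy.update(trajs)   # e.g. PPO on the pooled per-agent experience
```

The point of the sketch is that there is only one policy being updated, with the "trillion samples" coming from pooling every car's experience across a huge number of rollouts.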
Nathan Lambert [00:43:02]: How easy do you think it would be to learn in the same simulator?
If all the other cars were controlled by this policy and you only had to learn one policy, how much easier is it to learn just one policy that kind of works in the world, rather than this multi-agent, everybody-is-learning-at-once setup?
Because this is essentially what people have done: we've learned how to control one robot in the world and do that super well, versus learning everything from scratch with multiple agents, which is much harder.
Eugene Vinitsky [00:43:30]: I think if you imagine that, okay, we have N cars and N minus one of them are controlled by a perfect model of human driving, right?
I think you could learn that super fast and really robustly.
And of course the problem is we don't have that one perfect model of human driving that we can plug into our simulator.
I don't think it would take the trillion samples that it took us.
Nathan Lambert [00:43:54]: So I think, yeah, that's the difference.
And that's what these other things — like quadrupeds in robotics — have. Let me just let you talk about it.
Where do you think this sim-to-real, single-agent robotics work is at and heading?
What do you see, being slightly more plugged into the academic RL side of things?
It's like 2021 and 2022 when these Marco Hutter group papers started.
And I'm sure the trend is still continuing.
Eugene Vinitsky [00:44:23]: It's still continuing, right?
For quadrupeds, people are regularly making them do these insane behaviors that we haven't been able to design in other ways.
And I think the real lesson there is that at scale, RL works.
A lot of the lessons of self-supervised learning are transferable to the RL side.
And while it would be great to get the sample complexity down and stop doing this with a trillion samples, if you're willing to bite that bullet and just scale, and you have a fairly good simulator, you can really do incredible things.
Nathan Lambert [00:45:00]: Do you think these RL results scale more with model size or sample complexity?
Do you think they're kind of brute-forcing it through scaling the interactions with the world?
Eugene Vinitsky [00:45:10]: Yeah, I think that they scale. Scaling with model size is a little iffy in RL.
There are tricks that people have been coming up with to let you use bigger and bigger models.
But right now I think a lot of the impetus is towards smaller models that have low inference costs, which let you force a trillion samples into the policy, right?
Whereas if you make the model bigger, inference cost becomes more of a thing.
I think it's harder to acquire the samples.
Nathan Lambert [00:45:38]: So I think this relates to the other area that I'm excited about in RL, which is procedural generation and open-endedness.
Do you see needing a ton of samples as being in spirit related to this, where open-endedness is, I think, a field of study designed to make agents that are good at collecting the right samples — to use the word explore, which we haven't really used?
Eugene Vinitsky [00:46:07]: Yeah.
So I think a lot of what we're doing here is actually kind of dodging the exploration problem in a lot of ways.
And in general, the RL that works is a lot about dodging the exploration problem, right?
Why do we need a trillion samples?
Because we explore very inefficiently.
This is, I think, what we have if we talk about the quadrupeds and things like that, right?
These are well-defined tasks with a well-understood reward function.
And at some point, as we scale up RL, this task design will become the bottleneck, right?
It's like, what tasks should the agent do?
There's a human in the loop sitting down writing the reward function, saying, okay, that's a good task.
This is a good task.
The dream of open-endedness is that we'll move away from this and towards taking the human, this task designer, out of the loop.
Nathan Lambert [00:47:00]: Let's start taking a step back.
Are there tasks that you think are heavily simulated in control domains that are actually well-suited to this RL approach that may have not been done yet?
I mean, simulation is like a core tool in robotics and autonomy.
So what other things are doing heavy simulation and not leveraging this?
Maybe even like hard sciences are doing this.
Eugene Vinitsky [00:47:19]: I think this is going to eat almost everything that can be simulated.
Well, the fundamental thing is: can you simulate it with a relatively small sim-to-real gap, and can you simulate it efficiently?
If you have both of those things, I think RL is going to eat all of those things.
Or you can also scale this up by paying the price.
For example, I expect formal verification — agents that write Lean proofs — to do really well.
There it's expensive because the simulator is slow, but there's no sim-to-real gap.
Nathan Lambert [00:47:57]: I'm thinking in this scientific and control domain.
A timely example is humanoids, which I've been consistently bearish on.
I think if you have the simulator, the control policy will be solved, but I think most of it is an environment problem where the robotic actuators are too strong.
So therefore they're limited to manufacturing.
And I don't necessarily know how much a humanoid is better than a static arm in manufacturing and logistics.
Eugene Vinitsky [00:48:29]: So I might be bearish on humanoids for similar reasons, but I guess you're right on that point.
I think: will we be able to make a humanoid follow whatever trajectory we would like it to follow through scaling up RL?
Yeah, I think so.
Like, but then the question becomes like, what trajectory should it follow?
And then that's where things get iffy again, right?
Like exactly, you know, how, how softly should it behave?
You know, how, stuff like that, like task planning, things like that.
But, but from the, like the controls perspective of like, here's a system, I want it to follow this trajectory.
Um, most of these things have like good fast simulators.
Nathan Lambert [00:49:10]: Um, do you think RL should be used more in AI for science than scaling deep learning?
So I'm guessing there are a lot of scientific domains that are very simulation intensive.
And a lot of the approaches and excitement right now is to train a deep learning model to predict data.
I think there's Evo2, which is a recent DNA sequence predictor.
And I was reading a lot about this.
And a lot of the criticism is like, we don't really know how to use it.
And the mechanism is: if the model is like, oh, I don't know about this DNA string, then maybe it's a mutation.
And there's a lot of weirdness like that.
Yeah.
But maybe it's still the case that this slow burn of scaling RL systems is at least a more direct way to potentially improve some domains.
Eugene Vinitsky [00:49:54]: Great question.
Um, super interesting question.
So I think that the story I've been telling you about sample-inefficient RL scaling really well — I think we understand that pretty well.
It's less clear for RL in limited-sample domains.
And I think a lot of the issue in deep learning for science is that the simulators themselves are quite slow.
So if you want to simulate, say, a fusion loop — a tokamak of some kind — it can actually take months to run a single simulation.
Nathan Lambert [00:50:28]: Then, since you brought this up, what do you think of the DeepMind nuclear fusion control paper?
Eugene Vinitsky [00:50:34]: They might've been using a slightly different simulator.
It's a different simulator.
I don't think it requires quite as much precision.
I'm thinking of other simulators — to be clear, I haven't read that paper super closely — but if you think about something like AI for materials or AI for bio, a lot of these are fairly slow simulation platforms.
What I do think is pretty exciting is that at some point somebody is going to — and there are a lot of bottlenecks to this — build an autonomous lab and just keep letting it loop: make a material, characterize it, and then run it through the loop again.
The problem there is actually the characterization step; doing it correctly is really hard.
Like, what are the properties of the material that I've just synthesized?
So I think that in terms of RL for science, that trajectory is a little trickier because of this low ability to acquire samples.
Whereas in the humanoid and quadruped domains, we can generate samples cheaply — people will simulate 2,000 humanoids at once on one GPU, or something silly like that.
Nathan Lambert [00:51:41]: Um, do you think these things scale nicely with action space?
I feel like if we want to do this open-ended learning in a lot of environments, I don't know how to constrain the action space in a nice way.
So that somewhat worries me.
Eugene Vinitsky [00:51:55]: So I think there's a couple of pieces of that.
I think LLMs sometimes give you pretty good priors over actions, right?
The thing we've been pretty consistently seeing is that they constrain the action space on their own in a really helpful way.
It is also the case that with much larger action spaces you just eat a sample complexity penalty and things take longer, but we're seeing it be fine.
You know, in the domain of like 500 actions, this kind of thing.
Now, if we all of a sudden go out to 5 million actions, I think all bets are off.
Nathan Lambert [00:52:26]: It does kind of seem like the same thing that happened with language models might happen with open-endedness.
Now all the agents are going through a language model or some general interface, like a computer interface, which ends up constraining the action space to keyboard and mouse inputs — which, order-of-magnitude-wise, is actually about the same size of action space.
Eugene Vinitsky [00:52:49]: I think there are going to be a lot of helpful constraints on the action space to let you deal with this problem, right?
You're not operating in totally open-ended language or totally open-ended use of a computer.
Nathan Lambert [00:53:04]: To recap some of this, because it's interesting: the self-play question is the hardest one to grok — honestly, I still don't fully understand it, and I will go back and listen to some of this.
But the scaling in samples is the thing that makes RL actually work — that we can scale in samples.
And that is the thing that's needed for most of these domains.
It's very different from what's happening in language models, but it's at least a consistent theme across what is now over five years of resurgent RL-for-control results.
Eugene Vinitsky [00:53:41]: Scaling works in RL, right?
There's no real wall here.
Nathan Lambert [00:53:46]: It's a different type of scaling than people expect.
I think a lot of the historical scaling papers were trying to scale parameters — looking at something like Dreamer V3 or whatever, and scaling the parameters of the policy with the number of environment interactions.
But it seems like that is actually just a different axis from what is thought of in traditional deep learning scaling.
Eugene Vinitsky [00:54:07]: It's, it's kind of the number of samples that you're getting.
Nathan Lambert [00:54:10]: Which is very good to say clearly.
And then the, um, the self-play thing is changing the domain to be one that is much more complicated.
And the fact that it can still work there opens up a lot of interesting questions.
So like the scaling thing is an independent axis that actually works.
And then the self-play thing is dramatically increasing the complexity of your problem from this single agent world.
Eugene Vinitsky [00:54:34]: It's adding a lot of diversity through there being other agents in the system that behave in unexpected ways.
So yeah, there's scaling up the number of samples.
There's scaling up the diversity of your simulator — and in the single-agent domains, you don't need to scale up the diversity of your simulator.
You have one task and you want to do it.
But yeah, that makes sense.
Nathan Lambert [00:54:55]: Okay.
These are interesting takeaways to reflect on — what they mean — as we go towards wrapping this up with the language model corner and then the grad school advice corner.
You were on a paper as an advisor: large language models are in-context preference learners.
What is the story here?
I think preference learning is openly out of vogue, but I think that's because people are short-sighted and AI is so hype-dominated — everyone is still actually doing preference tuning, but everyone's just talking about RL and verifiable domains
Eugene Vinitsky [00:55:31]: or whatever the hype is, yeah.
Nathan Lambert [00:55:34]: But what is your take on this preference learning thing?
I know you have a big self-play and RL background here.
Eugene Vinitsky [00:55:42]: So I'll tell you how we got here real quick, and it'll make it clear. And I should say that there's coincident work by Jaden Clark, Joey — I'm going to mispronounce his name — Hejna, and Dorsa at Stanford
that kind of got the same idea across at the same time.
And the idea is: say you want to do preference learning, right?
You're doing preference learning — you're doing RLHF from scratch or something.
You have to learn this reward function.
And you have to acquire a lot of samples to do so.
The tabula rasa version of this is really inefficient.
And it turns out that what you can do, if you have a description of a particular task you'd like done, is ask a language model: write me a set of reward functions that you think correspond to this task.
You'll take those reward functions.
You'll train an agent corresponding to each of those reward functions.
You'll have a human rank their preferences over those reward functions:
this was a good one, that was a bad one.
And you'll feed all of that to a language model again, and do another loop of asking it to write some reward functions given those preferences.
And it turns out that language models can take that information and use it to decide what a good next reward function to try out is.
And over time, you're going to get a reward function that is much more aligned with your preferences, just by having the language model actually write the rewards.
So this kind of lets you do personalization or reward tuning at the scale of like 50 human queries.
This came about because we were asking: if I wanted to build an agent that acts scared or is deceptive, right?
I kind of have to do some amount of preference learning, right?
Because "deceptive" is a thing that's defined with respect to humans.
And so we were trying to figure out how to do something like that sample-efficiently.
It turns out you can just ask an LLM to write a reward function that corresponds to being deceptive, and then run that loop a bunch of times.
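A minimal sketch of the loop described above, under the assumption of a few placeholder callables (`llm.generate_reward_functions`, `train_agent`, `get_human_ranking` are illustrative names, not the paper's API):

```python
def in_context_preference_loop(llm, train_agent, get_human_ranking,
                               task_description, num_rounds=5, k=4):
    """LLM-in-the-loop reward design: roughly tens of human rankings instead
    of thousands of pairwise preference labels."""
    history = []  # (reward_function_code, human_rank) pairs across rounds
    for _ in range(num_rounds):
        prompt = (
            f"Task: {task_description}\n"
            f"Previous reward functions and their human rankings (1 = best):\n"
            f"{history}\n"
            f"Write {k} new candidate reward functions as Python code."
        )
        # The LLM proposes candidate reward functions, conditioned on past rankings.
        candidate_rewards = llm.generate_reward_functions(prompt, n=k)
        # Train one agent per candidate reward, then show the behaviors to a human.
        agents = [train_agent(r) for r in candidate_rewards]
        ranking = get_human_ranking(agents)  # e.g. [2, 1, 4, 3]
        history.extend(zip(candidate_rewards, ranking))
    # Return the best-ranked reward function seen so far.
    return min(history, key=lambda pair: pair[1])[0]
```

The design choice that matters is that the human only ranks a handful of trained behaviors per round, and the LLM does the heavy lifting of turning those rankings into the next batch of candidate rewards.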
Nathan Lambert [00:57:35]: So I would say that this means language model personalization doesn't need to be done within the parameter space, or something like this.
The domain here is at least partially control, looking at this figure, but it goes to show that the models probably already have the representation, and it's just a problem of setup and how you get people to actually do this.
I've said the same 20-questions thing, which is: what if a language model asks 20 questions about the user and gets the information out of there?
So it's nice to see that this is not an actual technical limitation.
It's just: how the hell do you do that in chat, or whatever.
Eugene Vinitsky [00:58:17]: We haven't tested that particular case.
And I want to — I think it's a thing that might work.
Our case was: make a robot jump like a human, right?
Write a reward function that corresponds to jumping like a human.
And it turns out that with a couple of rounds of iteration, you can get a language model to write down that reward function.
Nathan Lambert [00:58:32]: Did the reward function make sense?
Like, what is the reward function for jumping like a human?
Eugene Vinitsky [00:58:37]: You know, it's like: make sure that you jump off both legs instead of hopping on one leg, don't flail your arms around too wildly, that type of thing.
I think the whole reward function is in the paper.
But yeah, it's got some interpretable terms. The base reward function that comes out in the first iteration always involves hopping on one leg.
And then by the end, it's two legs, not too much flailing.
Nathan Lambert [00:59:04]: There's like a deeper RL control thing there, which is that all of the agents when they start are just wildly flailing when you learn control from scratch.
So that's in many ways, not super surprising.
Do you have any comments on the general language modeling RL stuff?
It's so focused on right now, and part of the point of this conversation is just to broaden the lens by which people consider RL to be a thing.
Eugene Vinitsky [00:59:33]: I mean, the biggest thing I should say there is I think it's going to work.
For domains where you have verifiable rewards, I just think this is going to work.
You're just going to have to bite the bullet and generate a lot of samples.
Nathan Lambert [00:59:46]: It's interesting that you say that, because one of the biggest things we find is you just have to keep training, right?
I mean, it's a classic RL curve.
It's like a log curve: you start out really fast, and then you're on this plateau where you're getting a little bit more for a really long time.
And it's far fewer samples than pre-training and everything, but the learning curves look so similar to anything you would get in RL, and you can have crashes and stuff, which is such a throwback.
And it's very different from preference tuning, where you have this whole over-optimization thing — this is just so much less artful.
It's like, just so obvious.
It's like, if the number's going up, it's probably fine.
And you don't really have to do that much.
It's nice, right?
There's a number.
Eugene Vinitsky [01:00:32]: There's a, there's a number.
You just have to push that number up.
Life is great.
Nathan Lambert [01:00:36]: The last section is career corner.
For people who are interested in working on RL right now,
what do you say to them?
Eugene Vinitsky [01:00:46]: I mean, I think RL is just starting to eat different domains.
So I think this is a really good time to get started on it.
There are not enough strong RL researchers, surprisingly.
So I don't think we're even an oversaturated domain.
Nathan Lambert [01:01:07]: Do you think it's feasible for grad students to do this language model stuff and core like RL agent stuff?
Because the agent stuff feels like more of a long-term moat because you're doing something that fewer people know about.
But should people like fully ignore the language model stuff if they're trying to get established as a grad student?
Eugene Vinitsky [01:01:24]: I think — and this is an important academic thing — you need to focus on demonstrating the idea with as few barriers as possible, right?
So you want to pull out the minimum version of the demonstration.
And a lot of the time, putting a language model in there adds a huge bottleneck, right?
All of a sudden you need a bunch of compute, and training takes forever.
So you should probably do some language modeling things at some point, because it's a good skill to have demonstrated when you go on the job market.
What a lot of students will do is do that in their last year or two of grad school, just to show that they can do this.
But for demonstrating the core ideas, I don't think you always have to use the language model, unless your ideas are deeply tied to that domain.
Yeah.
Nathan Lambert [01:02:14]: The way that things scale and results are communicated is just really different across RL-for-a-domain, core RL algorithms, and the language-model-plus-RL thing. I think sequencing is probably the best bet: start with something less competitive and focus on skill development.
That's generally my biggest "should I do a PhD" answer: do it if you're invested in developing new skills, or if you are actually extremely academic and scientific in nature — there is a subset of people who are truly scientists in the nature of the word, and they will probably thrive emotionally in that situation.
But most people want to do an AI PhD because they think of it as a credential for a future job, which is generally a hilarious and ill-formed side effect of wherever the field is right now.
Eugene Vinitsky [01:03:03]: It's just such a bad idea.
There was one brief moment in history where a PhD was a definitive route to a high-paying job, right?
Generally, what a PhD is supposed to do is, hey, it should be fun.
It should be fascinating.
It should be five years where you're like, I could not imagine doing something cooler than what I'm doing right now.
And then it's supposed to unlock some jobs that aren't accessible to you otherwise:
running a research team in industry, using particular skills that are hard to develop unless someone gives you a year or two to focus on them, right?
Like hard optimization problems, a lot of specialties.
But the "I'm going to do a PhD, and that's going to give me a 500K total compensation job straight out of grad school" thing —
it's just such a weird quirk of history that optimizing for it is never a good idea.
Nathan Lambert [01:03:56]: Yeah.
And I think that kind of says: if you're a grad student or junior faculty right now, and anything you're optimizing for is trying to extract value from that quirk of history, you're putting yourself in a precarious position.
Eugene Vinitsky [01:04:15]: Yeah, optimize for being able to do cool things, you know, that's, that's a consistent thing you can always optimize for.
It doesn't go away when the job market changes.
Nathan Lambert [01:04:29]: Yeah, I agree.
I mean, that could be a good place to end it.
You can do, you can actually surprisingly still just do things.
And yeah, I think it's easy to lose track of that in the language modeling chaos.
Eugene Vinitsky [01:04:43]: Yeah, I mean, but look, also, I'm coming from a position of privilege, right?
Like I have a faculty position.
Nathan Lambert [01:04:48]: We're early enough where it is fine.
Eugene Vinitsky [01:04:51]: Yeah.
Okay.
Well, this has been a pleasure. Thank you for taking the time to chat with me and giving me a chance to talk about this paper — I think I've still had some trouble conveying exactly why it's so exciting.
But hopefully some of it got across.
Nathan Lambert [01:05:06]: I think we got to some clear things, and the self-play-being-weird thing definitely gives me more sympathy for how bad the discussion around self-play for language models is, because there is real nuance in why what we're doing with RL with verifiable rewards is very different from language models talking to themselves and both updating their policies.
And it's not to say we shouldn't be trying that, but we should be very wary of calling it that until we are trying to do really, really hard things.
The grandiose version of language model self-play is probably something like letting language models discover their own language to be more effective at tasks.
And they do this with interacting with each other.
Eugene Vinitsky [01:05:50]: And so, yeah, language model self-play for tasks they haven't been trained on, like learning to do new tasks collaboratively together.
Super exciting.
It makes sense.
I'm doing some work on it.
I'm excited about that.
This thing where the amount of knowledge they have is bounded and you do self-play to kind of refine the distribution they're playing over, as opposed to doing a new task together — there's weirder stuff going on there.
Yeah.
Nathan Lambert [01:06:16]: Yeah.
So I think it's good.
I think people now know that the single-agent RL stuff working is not surprising.
And the self-play area could be one of these multi-year-out-before-takeoff things.
There are early signs that it could actually work.
And that's something people are often looking for: problems that carry a bit more risk, but not complete risk, and aren't the obvious thing that everybody is working on.
Eugene Vinitsky [01:06:41]: Yeah.
Eugene Vinitsky [01:06:41]: And I think if you're thinking in that category: collaborative agents — agents that know how to effectively collaborate with humans and with other AI agents — very underrated area, going to be big in a bit, I think.
Nathan Lambert [01:06:54]: I think the question with these models is: what does it mean for multiple language models with separate goals to be interacting with each other on the web? It's not necessarily the same self-play setup, but you could understand it through some of these lenses.
And it's easy to see how this is going to come about when you listen to the marketing from all these labs.
Eugene Vinitsky [01:07:17]: Yeah.
Well, it's, yeah, it's going to be fun.
It's going to be weird.
It's going to be great.
I also, I do have to inject a quick little pitch before I, before I disappear.
Nathan Lambert [01:07:28]: Sounds good.
Eugene Vinitsky [01:07:28]: So I've joined a new stealth AI company.
We're working on making better decisions in critical industries like healthcare, supply chain, and defense —
industries that are not very tech-forward, not very AI-native, but where almost all productivity lies.
And so we're looking for strong folks with experience with either RL or LLMs.
Nathan Lambert [01:07:55]: Do you have a company name?
Eugene Vinitsky [01:07:56]: We do not.
Nathan Lambert [01:07:57]: Okay.
Well, we'll have a link.
We'll have a contact link below.
I think, yeah, Eugene is a fun person,
and he at least knows enough to make it through this conversation — there are a lot of startups out there riding on less.
So that's good.
Eugene Vinitsky [01:08:13]: That'll be fun.
I don't think there are that many applied RL teams out there in the world.
Maybe there are and I just don't know them, but I think Marc Bellemare's startup is
Nathan Lambert [01:08:24]: actually potentially related, which seems like it's kind of trying to scale RL with a mix of language models to solve problems, but I haven't asked him directly.
Eugene Vinitsky [01:08:33]: I mean, that's somewhat of what we're doing too.
But that's all I can say.
Nathan Lambert [01:08:42]: Yeah.
Sounds good.
It's early days for that and for self-play and many other things, but I'm sure we'll cross paths soon.
Either if I go back to New York or if you, for some reason, come all the way to Seattle, which I don't know.
I love Seattle.
Eugene Vinitsky [01:08:58]: So yeah — I'd love to in the fall.
Anyways, it was a pleasure talking to you, and hopefully we get a chance to talk again soon.
Nathan Lambert [01:09:08]: Yeah, will do.
Full post: https://www.interconnects.ai/p/elicitation-theory-of-post-training
If you look at most of the models we've received from OpenAI, Anthropic, and Google in the last 18 months, you'll hear a lot of "Most of the improvements were in the post-training phase." The most recent one was Anthropic’s CEO Dario Amodei explaining Claude 3.7 on the Hard Fork Podcast:
We are not too far away from releasing a model that's a bigger base model. Most of the improvements in 3.6/3.7 are in the post-training phase. We're working on stronger base models (perhaps that will be the Claude 4 series, perhaps not; those are coming in a relatively small number of time units [months?]).
Here's a simple analogy for how so many gains can be made on mostly the same base model.
The intuition I've been using to understand the potential of post-training is called the elicitation interpretation of post-training, where all we are doing is extracting and amplifying valuable behaviors in the base model.
Consider Formula 1 (F1): most of the teams show up at the beginning of the year with a new chassis and engine. Then they spend all year on aerodynamics and systems changes (of course, this is a minor oversimplification), and can dramatically improve the performance of the car. The best F1 teams improve far more during a season than from chassis to chassis.
The same is true for post-training. The best post-training teams extract a ton of performance in a very short time frame. The set of techniques is everything after the end of most of pretraining. It includes "mid-training" like annealing / high-quality end of pre-training web data, instruction tuning, RLVR, preference-tuning, etc. A good example is our change from the first version of OLMoE Instruct to the second — we improved our post-training evaluation average from 35 to 48 without touching the majority of pretraining.
Then, when you look at models such as GPT-4.5, you can see this as a way more dynamic and exciting base for OpenAI to build onto. We also know that bigger base models can absorb far more diverse changes than their smaller counterparts.
This is to say that scaling also allows post-training to move faster. Of course, to do this, you need the infrastructure to train the models. This is why all the biggest companies are still building gigantic clusters.
This theory folds in with the reality that the majority of gains users are seeing are from post-training because it implies that there is more latent potential in a model pretraining on the internet than we can teach the model simply — such as by passing certain narrow samples in repeatedly during early types of post-training (i.e. only instruction tuning).
Throwback to the superficial alignment hypothesis
Another name for this theory is the Superficial Alignment Hypothesis, coined in the paper LIMA: Less is More for Alignment. This paper gets some important intuitions right, but for the wrong reasons in the big picture. The authors state:
A model’s knowledge and capabilities are learnt almost entirely during pretraining, while alignment teaches it which subdistribution of formats should be used when interacting with users. If this hypothesis is correct, and alignment is largely about learning style, then a corollary of the Superficial Alignment Hypothesis is that one could sufficiently tune a pretrained language model with a rather small set of examples [Kirstain et al., 2021].
All of the successes of deep learning should have taught you a deeply held belief that scaling data is important to performance. Here, the major difference is that the authors are discussing alignment and style, the focus of academic post-training at the time. With a few thousand samples for instruction finetuning, you can change a model substantially and improve a narrow set of evaluations, such as AlpacaEval, MT Bench, ChatBotArena, and the likes. These do not always translate to more challenging capabilities, which is why Meta wouldn’t train its Llama Chat models on just this dataset. Academic results have lessons, but need to be interpreted carefully if you are trying to understand the big picture of the technological arc.
What this paper is showing is that you can change models substantially with a few samples. We knew this, and it is important to the short-term adaptation of new models, but their argument about performance leaves casual readers with the wrong lessons.
If we change the data, the impact could be far higher on the model’s performance and behavior, but it is far from “superficial.” Base language models today (with no post-training) can be trained on some mathematics problems with reinforcement learning, learn to output a full chain of thought reasoning, and then score higher on a full suite of reasoning evaluations like BigBenchHard, Zebra Logic, AIME, etc.
The superficial alignment hypothesis is wrong for the same reason that people who think RLHF and post-training are just for vibes are still wrong. This was a field-wide lesson we had to overcome in 2023 (one many AI observers are still rooted in). Post-training has far outgrown that, and we are coming to see that the style of models operates on top of behavior — such as the now popular long chain of thought.
The counterargument to elicitation
The prevailing counterargument to the elicitation theory of post-training has been that post-training is teaching specific skills to the model. This can be seen through very large finetuning datasets used in the early eras of open models. One of the last prominent research examples of this is OpenMathInstruct 2, which showed performance improvements with finetuning on up to 14.3 million instructions.
A rough industrial norm is that you can do excellent post-training with only 1 million prompts at instruction tuning. The way to interpret the above plot with the elicitation theory is that the vast majority of the gains come from the beginning of training.
The trade-offs discussed in this counterargument, such as scaling post-training methods, were specifically discussed in the paper Revisiting the Superficial Alignment Hypothesis, which makes clear that you can teach new skills to models in post-training. The hardest part today is knowing when a skill is entirely new — it is often safer to assume the model has seen it during pretraining. For now, it is far more efficient to elicit behaviors from the model than to teach them; eventually the tide may shift to favor teaching, but not in the near future.
Comparing this math training dataset to current best (and emerging) practices for reasoning models makes the strongest case for the elicitation theory yet.
RL’s role in elicitation
The reinforcement learning (RL) training we’ve seen take off in this early era of reasoning models is often described as “sample efficient” and “the model learning new behaviors.” Both of these fit with the theory presented. The astute way to view the model learning new behaviors is not that it is learning entirely new abilities but rather learning to reinforce behaviors that were already latent in the pretraining dataset.
Compared to teaching the model math with millions of samples, just a few thousand prompts of RL training can far surpass the performance in MATH shown above.
In many ways, RL training exploding in popularity and effectiveness is the ultimate endorsement of the elicitation theory. Where we used to try and teach the model math with millions of supervised samples, now we just let the model try different approaches on thousands of math problems, and they reach far higher peak performance.
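To make "verifiable reward" concrete, here is a minimal sketch with a toy answer checker; real pipelines use more careful answer extraction and formatting rules, and the function name is illustrative:

```python
import re

def verifiable_math_reward(model_output: str, reference_answer: str) -> float:
    """Return 1.0 if the model's final numeric answer matches the reference, else 0.0."""
    # Toy extraction: take the last number in the output as the model's answer.
    numbers = re.findall(r"-?\d+(?:\.\d+)?", model_output)
    if not numbers:
        return 0.0
    return 1.0 if numbers[-1] == reference_answer else 0.0

# In an RLVR loop, a few thousand prompts are sampled repeatedly; completions
# that happen to reach the right answer are reinforced, eliciting reasoning
# behaviors the base model already contains.
print(verifiable_math_reward("... so the total is 42", "42"))  # 1.0
```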
This is, of course, also linked to why people say that “stronger base models are better starting points for RL.” All of this fits together as the base model is the platform on which post-training is built.
A reductionist take here is to say that pretraining is not important — in reality, pretraining is just slow and hidden from most of the gains we are seeing. Still, excellent post-training and the performance improvements we enjoy today are all well downstream of pretraining. Pretraining still is arguably the most important part of the training stack, as it allows those with confidence in the elicitation during post-training to thrive.
Thanks to Mohit Raghavendra for some email exchanges that helped this post.
Link: https://www.interconnects.ai/p/where-inference-time-scaling-pushes
There’s a lot of noise about the current costs of AI models served for free users, mostly saying it’s unsustainable and making the space narrow for those with the historical perspective of costs of technology always plummeting. GPT-4.5’s odd release of a “giant” model without a clear niche only amplified these critics. With inference-time compute being a new default mode, can we still have free AI products? Are we just in the VC-subsidized era of AI?
For normal queries to ChatGPT, the realistic expectation is that the cost of serving an average query will drop to be extremely close to zero, and the revenue from a future ad model will make the service extremely profitable. The most cohesive framework for understanding large-scale internet businesses built on the back of such zero marginal costs is Ben Thompson’s Aggregation Theory.
Aggregation Theory posits that extreme long-term value will accrue to the few providers that gate access to information and services built on zero-marginal cost dynamics. These companies aggregate user demand. It has been the mode of modern dominant businesses, with the likes of Google and Meta producing extremely profitable products. Naturally, many want to study how this will apply to new AI businesses that are software-heavy, user-facing platforms, of which OpenAI is the most prominent due to the size of ChatGPT. Having more users and attention enables aggregators to better monetize interactions and invest in providing better experiences, a feedback loop that often compounds.
Aggregators are often compared to platforms. Where the former relies on being an intermediary of users and other marketplaces, platforms serve as foundations by which others build businesses and value, such as Apple with the iPhone, AWS, or Stripe.
Businesses like ChatGPT or Perplexity will rely on a profitable advertisement serving model being discovered that works nicely for the dialogue format. ChatGPT interweaving previous discussions into the chat, as they started doing in the last few months, is encouraging for this, as they could also have preferred products or sources that they tend to reference first. Regardless, this will be an entirely new type of ad, distinct from Meta’s targeted feed ads, Google’s search ads, or the long history of general brand ads. Some of these past ad variants could work, just sub-optimally, in the form factor.
An even easier argument is to see the current hyperscalers using low-cost inference solutions on AI models that complement their existing businesses and fit with components of Aggregation Theory — such as Meta serving extremely engaging AI content and ads. The biggest platform play here is following the lens through which language models are a new compute fabric for technology. The AWS of AI models.
All of these business models, ads, inference, and what is in between, were clear very soon after the launch of ChatGPT. As the AI industry matures, some harder questions have arisen:
* Who bears the cost of training the leading frontier models that other companies can distill or leverage in their products?
* How many multiples of existing inference paradigms (0-100s of tokens) will inference-time scaling motivate? What will this do to businesses?
This post addresses the second question: How does inference time compute change business models of AI companies?
The announcement of OpenAI’s o3 with the inference cost on ARC-AGI growing beyond $5 per task and the proliferation of the new reasoning models raised the first substantive challenge to whether aggregation theory will hold with AI.
The link to inference time compute and the one that sparked this conversation around aggregators was Fabricated Knowledge’s 2025 AI and Semiconductor Outlook, which stated:
The era of aggregation theory is behind us, and AI is again making technology expensive. This relation of increased cost from increased consumption is anti-internet era thinking.
This is only true if increased thinking is required on every query and if it doesn’t come with a proportionate increase in value provided. The fundamental operations of AI businesses will very much follow in the lens of Aggregation Theory (or, in the case of established businesses, it’ll reinforce advantages of existing large companies), and more work is going to be needed to figure out business models for inference-heavy products.
We can break AI use today into two categories:
* ChatGPT and general-use chatbots.
* Domain-specific models, enterprise products, model APIs, and everything else that fits into the pay-for-work model (e.g. agents).
The first category is established and not going away, while the second is very in flux. Inference time scaling will affect these in different ways.
Consumers — well, most of them (and not most of you reading this who are power users) — will never know how to select the right model. The market for super users is far smaller than the market for general use. The core for consumer products is having the model know how much compute to spend. This is where RL training will likely be most important and is something notably missing from the release of Claude 3.7 Sonnet.
OpenAI’s model offerings and initial excitement around inference time compute made many, myself included, get excited about the idea of a compute dial being shown to the users so they can control the “thinking effort” for their query. The problem is that rules for how well that translates to performance rely on a deep understanding of AI and how language model performance is very stochastic.
The so-called dial is being reduced to simple reasoning buttons or always-on reasoning — extremes and yes/no decisions are much easier for users to work with. This is already how I engage with models. I start with a normal model, and if it doesn’t work, I punt to o1 pro. Would my trying to guess the right spot on a dial for a new query really be a good experience? Please, the model should know its own limits and handle that on its own.
Today, the RL-trained reasoning models primarily serve as a trust and legibility enhancement to average users rather than a performance improvement. This is leading to the exposure of the Chain of Thoughts (CoTs) to be an industry norm. At the same time, this sort of minor increase in context length will still be subsumed into a zero marginal cost style business, pending the assumed discovery of a functional ad model. This is all also just the tip of the iceberg for inference time compute. From my coverage of Claude 3.7:
RL training is a short path to inference time scaling laws being used, but in the long-term we will have more methods for eliciting the inference-time tradeoffs we need for best performance.
For power users and enterprises, RL training and one model fits all is less important. Enterprises will want to benefit from clear trade-offs on performance vs log(compute).
Many in the industry, including in the aforementioned Claude 3.7 release and o3’s ARC-AGI performance, are discussing the use of parallel test time compute relative to just increasing generation lengths. Inference time scaling with parallel computation and strong verifiers will be essential to the long-term trajectory of the sub-area.
Where RL models can increase the compute spent by a model by factors of 2, 4, or 10 for a question, parallel computation already uses factors of 1000 (see o3), and will go far higher. This is a far more direct way to continue scaling the log-compute plots for inference time scaling. It’s also more efficient due to the quadratic costs of generating longer contexts — in fact most of the models we are using cannot scale output length infinitely, as we can with the number of samples.
Better verifiers will increase the slope of the inference time scaling plots we are seeing, as discussed in our coverage of Claude 3.7.
Models will be trained to increase the probability of a true answer appearing over many generations and to maximize the probability that the extraction method can select it, rather than maximizing the probability that a single generation is correct out of the box. This is a very different way to finish the training of models than has been considered in some time. Here's a recent example of a research paper studying this, Inference-Aware Fine-Tuning for Best-of-N Sampling in Large Language Models, and more will surely come soon.
Verification as the limiter for inference-time scaling performance is not a new idea. It was the starting point of my coverage on inference time scaling, before the release of OpenAI’s o1 (and mentioned in the recent post on Claude 3.7). Ultimately, the challenge is robustness, not if the models can generate the answer:
I wanted to highlight a result from the last paper, Large Language Monkeys, as to why inference time compute is feasible. That paper focuses on repeated sampling from a variety of language models to understand the distribution of answers in a model. They show that with an Oracle answer extractor (more on this later), even models as surprising as Pythia-70M have the true answer inside.
Remember, the domain of math answers is infinite. This shows that the underlying distribution of the models has the right answer; we need to figure out how to extract the right ones. We need strong verifiers to make answer selection easy. The Snell et al. paper above (1.) calls this the "Proposer and Verifier" perspective.
The understanding that the models we are using will almost always be able to generate the right answer, combined with the fact that training verifiers to exploit this has only just started, should increase optimism that inference time scaling can work.
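As a concrete illustration of the proposer/verifier framing (a sketch only; `propose` and `verify` are placeholder callables, e.g. an LLM sampler and a trained verifier model, not any particular lab's stack):

```python
def best_of_n(prompt, propose, verify, n=64):
    """Parallel inference-time scaling: generate n candidate answers, return
    the one the verifier scores highest. Performance now scales with n and
    with verifier quality, rather than with a single generation being right."""
    candidates = [propose(prompt) for _ in range(n)]
    scores = [verify(prompt, candidate) for candidate in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```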
This type of performance will not be cheap. Unlocking new potential applications is still worth way more than the few dollars these queries can cost. Noam Shazeer of Google explained this on his Dwarkesh Podcast appearance with Jeff Dean:
One of the big areas of improvement in the near future is inference time compute, applying more compute at inference time. I guess the way I like to describe it is that even a giant language model, even if you’re doing a trillion operations per token, which is more than most people are doing these days, operations cost something like 10 to the negative [18 dollars]. And so you're getting a million tokens to the dollar.
I mean compare that to a relatively cheap pastime: you go out and buy a paper book and read it, you're paying 10,000 tokens to the dollar. Talking to a language model is like 100 times cheaper than reading a paperback.
So there is a huge amount of headroom there to say, okay, if we can make this thing more expensive but smarter, because we're 100x cheaper than reading a paperback, we're 10,000 times cheaper than talking to a customer support agent, or a million times or more cheaper than hiring a software engineer or talking to your doctor or lawyer. Can we add computation and make it smarter?
I think a lot of the takeoff that we're going to see in the very near future is of this form. We've been exploiting and improving pre-training a lot in the past, and post-training, and those things will continue to improve. But taking advantage of "think harder" at inference time is just going to be an explosion.
The TLDR is “Making language models more expensive is worth it.” It is quite the provocation — that intelligence is worth far more and will restructure the budding AI inference market.
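A quick back-of-the-envelope version of Shazeer’s arithmetic, using only the figures from the quote (both numbers are his rough estimates, not measured costs):

```python
ops_per_token = 1e12    # "a trillion operations per token"
cost_per_op = 1e-18     # dollars per operation, Shazeer's rough figure

cost_per_token = ops_per_token * cost_per_op   # 1e-6 dollars per token
tokens_per_dollar = 1 / cost_per_token

print(f"{tokens_per_dollar:,.0f} tokens per dollar")  # ~1,000,000, vs ~10,000 for a paperback
```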
All of these dynamics are going to be accelerated by AI adoption and the Jevons paradox. The Jevons paradox is the idea that when a technology gets more efficient, i.e. “reducing the amount needed for a single application,” total usage actually goes up. The resource in this case is GPUs for inference. AI inference, generally, and reasoning models, specifically, are in this feedback loop. Reducing the baseline cost of models will make inference-time scaling even more palatable. On the research side, we are also embracing far more ambitious compute spending per problem.
These are business dynamics that don’t fit into the most common aggregation theory model. With a gradient from classic zero-marginal cost, high-margin services to high-cost, high-impact new uses of AIs, the behavior of aggregators is going to evolve.
We’ve seen examples of Aggregation Theory applied to high-cost businesses, such as Amazon’s Prime — distribution is expensive — but these don’t purport to operate at the same internet scale as digitally native businesses. All in, aggregation theory will still apply to most of the consumer and enterprise markets, but there are large areas of the market where the way this is unfolding is entirely new.
The defining factor at the top end of inference scaling is paying for the most productive output. If the costs are so great in both capital and marginal costs, the AI super-companies of the future may look more like platforms than aggregators. They would have strong network effects as the only people in town who can offer the powertool that is inference-heavy AI.
AI companies will be pushed along the barbell effect — those with consumer demand can aggregate on cheaper models, but those with specific niches will be forced to compete on performance. The ceiling on price is unknown and markets need to be built on new platforms.
High costs could support AI being less winner-takes-all than many would suspect — we could have Anthropic operate as the platform for AI agents, ChatGPT aggregating the consumer use, and another party dominating code generation. With so much investment in AI, it feels more likely that companies will win their area by focusing on specific domains — they cannot afford to dilute their efforts relative to a focused peer — unless an expansive AGI really is coming, in which case one company could conceivably take the entire market.
More: https://www.interconnects.ai/p/gpt-45-not-a-frontier-model
As GPT-4.5 was being released, the first material the public got access to was OpenAI’s system card for the model that details some capability evaluations and mostly safety estimates. Before the live stream and official blog post, we knew things were going to be weird because of this line:
GPT-4.5 is not a frontier model.
The updated system card in the launch blog post does not have this. Here’s the original system card if you need a reference:
Regardless, someone at OpenAI felt the need to put that in. The peculiarity here summarizes a lot of the release. Some questions are still really not answered, like “Why did OpenAI release this?” That game theory is not in my purview.
The main contradiction to the claims that it isn’t a frontier model is that this is the biggest model the general public has ever gotten to test. Scaling to this size of model did NOT make a clear jump in capabilities we are measuring. To summarize the arc of history, the jump from GPT-3.5 to GPT-4 made the experience with the models go from okay to good. The jump from GPT-4o (where we are now) to GPT-4.5 made the models go from great to really great.
Feeling out the differences in the latest models is so hard that many who are deeply invested and excited by AI’s progress are just as likely to lie to themselves about the model being better as they are to perceive real, substantive improvements. In this vein, I almost feel like I need to issue a mea culpa. I expected this round of scaling’s impacts to still be obvious before the brutal economic trade-offs of scaling kicked in.
While we got this model, Anthropic has also unintentionally confirmed that their next models will be trained on an approximation of “10X the compute,” via a correction on Ethan Mollick’s post on Claude 3.7.
Note: After publishing this piece, I was contacted by Anthropic who told me that Sonnet 3.7 would not be considered a 10^26 FLOP model and cost a few tens of millions of dollars to train, though future models will be much bigger.
GPT-4.5 is a data point showing that scaling is still coming, but trying to make sense of it in a day-by-day transition is hard. In many ways, zooming out, GPT-4.5 will be referred to in the same breath as o1, o3, and R1, where it was clear that scaling pretraining alone was not going to give us the same level of breakthroughs. Now we really know what Ilya saw.
All of this marks GPT-4.5 as an important moment in time for AI to round out other stories we’ve been seeing. GPT-4.5 likely finished training a long time ago — highlighted by how it has a date cutoff in 2023 still — and OpenAI has been using it internally to help train other models, but didn’t see much of a need to release it publicly.
What GPT-4.5 is good for
In the following, I am going to make some estimates on the parameter counts of GPT-4.5 and GPT-4o. These are not based on any leaked information and should be taken with big error bars, but they are very useful for context.
GPT-4.5 is a very big model. I’d bet it is well bigger than Grok 3. We have seen this story before. For example, GPT-4 was roughly known to be a very big mixture of experts model with over 1T parameters total and ~200B active parameters. Since then, rumors have placed the active parameters of models like GPT-4o or Gemini Pro at as low as 60B parameters. This type of reduction, along with infrastructure improvements, accounts for massive improvements in speed and price.
Estimates place GPT-4.5 at about an order of magnitude more compute than GPT-4. These are not based on any released numbers, but given a combination of a bigger dataset and more parameters (5X parameters + 2X dataset size = 10X compute), the model could be in the ballpark of 5-7T parameters total, which, if it had a similar sparsity factor to GPT-4, would be ~600B active parameters.
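For reference, here is the standard rough pretraining-compute approximation (compute ≈ 6 × active parameters × training tokens) applied to those multipliers; the baseline numbers are rumors and assumptions with big error bars, as noted above.

```python
# Toy check of the "5X parameters + 2X dataset size = 10X compute" arithmetic.
# Baseline figures are rumored/assumed, not official.

def pretrain_flops(active_params: float, tokens: float) -> float:
    return 6 * active_params * tokens   # standard C ~= 6ND approximation

gpt4_active = 2e11    # ~200B active parameters (rumored)
gpt4_tokens = 1.3e13  # ~13T training tokens (assumed for illustration)

base = pretrain_flops(gpt4_active, gpt4_tokens)
scaled = pretrain_flops(5 * gpt4_active, 2 * gpt4_tokens)

print(f"compute multiplier = {scaled / base:.0f}x")  # 5 * 2 = 10x
```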
With all of these new parameters, actually seeing performance improvements is hard. This is where things got very odd. The two “capabilities” OpenAI highlighted in the release are:
* Reduced hallucinations.
* Improved emotional intelligence.
Both of these have value but are hard to vibe test.
For example, SimpleQA is a benchmark we at Ai2 are excited to add to our post-training evaluation suite to improve world knowledge of our models. OpenAI made and released this evaluation publicly. GPT-4.5 makes huge improvements here.
In another one of OpenAI’s evaluations, PersonQA, which covers questions about individuals, the model is also state of the art.
And finally, there is GPQA, the Google-proof knowledge evaluation that reasoning models actually excel at.
At the time of release, many prominent AI figures online were touting how GPT-4.5 is much nicer to use and better at writing. These takes should be taken in the context of your own testing. It’s not that simple. GPT-4.5 is also being measured as middle of the pack in most code and technical evaluations relative to Claude 3.7, R1, and the likes.
For an example on the writing and style side, Karpathy ran some polls comparing GPT-4.5’s writing to GPT-4o-latest, and most people preferred the smaller, older model. Given what we know about post-training and the prevalence of distilling from the most powerful model you have access to, it is likely that GPT-4o-latest is distilled from this new model, previously called Orion, and its drastically smaller size gives it a night and day difference on iteration speed, allowing for better post-training.
More on the character of that GPT-4o-latest model was covered in our previous post on character training.
All of this is a big price to pay to help OpenAI reclaim their top spot on ChatBotArena — I expect GPT-4.5 to do this, but the results are not out yet.
I’ve been using GPT-4.5 in preparation for this. It took a second to get used to the slower speed, but it’s fine. I will keep using it for reliability, but it’s not worth paying more for. o1 Pro and the other paid offerings from OpenAI offer far more value than GPT-4.5.
Making sense of GPT-4.5’s ridiculous price
When the original GPT-4 first launched, it was extremely expensive. In fact, GPT-4 was comparable in price to GPT-4.5 at launch. Here’s a help post on the OpenAI forums, conveniently found by OpenAI DeepResearch with GPT-4.5, that captures the context. GPT-4 launched in March 2023.
We are excited to announce GPT-4 has a new pricing model, in which we have reduced the price of the prompt tokens.
For our models with 128k context lengths (e.g. gpt-4-turbo), the price is:
* $10.00 / 1 million prompt tokens (or $0.01 / 1K prompt tokens)
* $30.00 / 1 million sampled tokens (or $0.03 / 1K sampled tokens)
For our models with 8k context lengths (e.g. gpt-4 and gpt-4-0314), the price is:
* $30.00 / 1 million prompt token (or $0.03 / 1K prompt tokens)
* $60.00 / 1 million sampled tokens (or $0.06 / 1K sampled tokens)
For our models with 32k context lengths (e.g. gpt-4-32k and gpt-4-32k-0314), the price is:
* $60.00 / 1 million prompt tokens (or $0.06 / 1K prompt tokens)
* $120.00 / 1 million sampled tokens (or $0.12 / 1K sampled tokens)
GPT-4.5’s pricing launched at:
* Input: $75.00 / 1M tokens
* Cached input: $37.50 / 1M tokens
* Output: $150.00 / 1M tokens
OpenAI included language in the release that they may not keep this model in the API, likely forecasting low demand, as they wanted to hear from users if it enabled entirely new use-cases.
Many analysts think that Nvidia’s next generation of GPUs, Blackwell, which offers far more memory per FLOP (enabling bigger models to be stored and served), is not priced into this. We can expect to see the same arc of pricing with 4.5 as we did with 4 to 4 Turbo to 4o.
* GPT-4 Turbo launched in November 2023 at $10 / 1M input and $30 / 1M output.
* GPT-4o launched in May 2024 at $2.5 / 1M input and $10 / 1M output.
These are huge reductions, about 10X.
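Putting the prices quoted in this post side by side (input/output dollars per million tokens, relative to GPT-4’s March 2023 launch price):

```python
# Dollars per 1M tokens (input, output), taken from the figures above.
prices = {
    "gpt-4 (8k, Mar 2023)":   (30.0, 60.0),
    "gpt-4-turbo (Nov 2023)": (10.0, 30.0),
    "gpt-4o (May 2024)":      (2.5, 10.0),
    "gpt-4.5 (launch)":       (75.0, 150.0),
}

base_in, base_out = prices["gpt-4 (8k, Mar 2023)"]
for name, (p_in, p_out) in prices.items():
    print(f"{name:24s} input {p_in / base_in:.2f}x, output {p_out / base_out:.2f}x of GPT-4 launch")
```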
These are products that OpenAI makes a healthy margin on, and there are no signs that that isn’t the case. The AI community collectively has grown so accustomed to incredible progress in making the technology more efficient that even a blip in the process, where bigger models are available, feels potentially bubble-popping.
The future of scaling
Scaling language models is not dead. Still, reflecting on why this release felt so weird is crucial to staying sane in the arc of AI’s progress. We’ve entered the era where trade-offs among different types of scaling are real.
If forced to summarize all of this curtly, it would be: GPT-4.5 is, oddly, ahead of its time.
This means that the progression of AI needs to take a different tack, but we already knew this with the rapid progress of reasoning models. The true impact of GPT-4.5 is when it is integrated with multiple lines of rapid progress.
One of the flagship results in the DeepSeek R1 paper and related RL follow-up work in the AI community is that scaling RL training works better on bigger models. There is a lot of work to do to know all the domains that’ll be absorbed into this umbrella. Future models like o4 could be distilled from a reasoning model trained on GPT-4.5. In fact, this may already be the case. OpenAI’s current models likely would not be so good without GPT-4.5 existing.
In as soon as a year, most of the models we are working with will be GPT-4.5 scale and they will be fast. The “well-rounded” improvements they offer are going to help make many more applications more robust, but OpenAI and others in the AI labs have pushed scaling a bit further than the current serving infrastructure can support.
Frontier labs are not taking enough risk if they are not trying to push the limits of every direction of scaling they have. Though releasing the model was not strictly necessary, we are left to guess why OpenAI actually wanted to do this. It’s likely that GPT-4.5 is being used in other internal systems for now and other external products soon, so releasing it is a natural step on the way to the next thing, rather than a detour.
GPT-4.5 is a frontier model, but its release is not an exciting one. AI progress isn’t free, and it takes a lot of hard work. Most people should only care when GPT-4.5 is integrated into more than just chat.
https://www.interconnects.ai/p/character-training
The vast majority of evaluations used to measure progress on post-training at frontier laboratories are internal evaluations rather than the evaluations you hear about all the time like MATH or GPQA. These, the well-known intra-industry evaluations, are certainly important for ballparking behavior, but for every public evaluation, these frontier laboratories are likely to have 10+ fine-grained internal evaluations.
The internal evaluations these model providers have cover a range of topics. Surely, most of them are basic, repetitive user behaviors that they need to make sure a new model doesn’t roll back too many of. Of these, the vast majority are likely skills, and “character” remains more of an art than a hill to climb up with careful data engineering.
Leading post-training laboratories surely know how to reinforce more robust behavior within a specific character, as seen by the march of progress on evaluations like ChatBotArena, but crafting a specific personality from scratch is an open question.
The primary goal of this post is to start the conversation outside of frontier AI labs around character training. Character training is the subset of post-training designed around crafting traits within the model in the manner of its response, rather than the content. Character training, while being important to the user experience within language model chatbots, is effectively non-existent on the web.
We don’t know the trade-offs of what character training does, we don’t know how exactly to study it, we don’t know how much it can improve user preferences on ChatBotArena, and we should.
The appearance of the AIs people are using is deeply coupled with how intelligent users will find them to be. Style of communication is crucial to how information is parsed. This is likely a very high priority for industrial labs, but something that almost no academic literature exists on. Even though I want to do research on this, I’m honestly not sure how to do so yet other than a 1-of-1 technical report on findings.
ChatGPT gets character depth
Out of nowhere on Saturday, February 15th, Sam Altman tweeted about this new GPT-4o model that will serve as the foundation of ChatGPT.
This is the biggest subjective change I’ve ever felt within intermediate model versions, from any primary provider — something closer, in terms of vibes, to the shift from GPT-3.5 to GPT-4. The model immediately and consistently showed new behavior patterns. I found these very positive (Karpathy agrees), but they’ll take some getting used to.
Where ChatGPT used to sound robotic and shallow, it’s now very clearly leaning into a chipper assistant demeanor. Yes, for basic tasks, this new default model in ChatGPT is very Claude 3.5-like — more testing is needed to know if this GPT-4o with its peer models like o3-mini can dethrone Claude 3.7 Sonnet as a daily programming driver.
The biggest changes in the new GPT-4o model are:
* It now loves to reference past interactions in the chat (way more obvious than any other provider has been) — it was trying to flex that it knows my dog breed, mini schnauzer, or my book topic, RLHF. This is very in line with the new roadmap to GPT-4.5 and GPT-5 that Altman posted, where ChatGPT is designed around a fluid experience rather than standalone, siloed, powerful models.
* The model is very chipper, sprinkles in more emojis, and is almost funny.
* The multi-turn conversation is more dynamic, with follow-up questions and added texture to longer back and forths.
The reasons are at a high level very complementary to those I listed when I switched to Claude as my daily driver model.
The shocking part of this is that the impact of this sweeping change is almost entirely undocumented. Yes, OpenAI updated the Model Spec (my previous coverage here and here), but that doesn’t really capture how this model is different — it just clarifies the direction OpenAI is optimizing for. There are a few overlapping interpretations of this lack of transparency:
* OpenAI cannot precisely measure the differences as a few specific behavior traits, so they can only see that the model performs better in high-level testing like ChatBotArena or other A/B testing, but they cannot capture the changes as score deltas between a few evaluations they could release.
* AI is moving so fast that taking the time to document these models is not worth it.
* Detailing the changes will make the character too easy to reproduce and will be another path of “distillation” of OpenAI’s models.
The community of model users is extremely far from having clear ways to measure these differences. While there are vibe tests on Twitter, they will not be conclusive. ChatBotArena won’t even come close to measuring the levels of these differences (and in the case of referencing past chats, it cannot). Character training is the sort of addition to a post-training stack that takes industrial training techniques from being reproducible, but expensive, to dark arts that are largely undocumented.
The most interesting part of the model spec for industry analysts is this plot where OpenAI shares the agreement rate of their newer models. This is comparing a reasoning model, o1, to a GPT-4o model, so there are questions of whether this is attributable to reasoning training.
Every frontier AI laboratory should have a model spec
Model Specs are the sort of community norm where a race to the top is the goal. They’re muddled if mandated — how would you actually check that a required model spec is accurate? — but if they are implemented by every lab carefully with feedback from the community, it would be far easier for a development ecosystem to exist around models.
The model spec is an extremely useful document detailing how developers can expect your models to change over time. They are also one of the few sources of insight we have into what the model providers are trying to get their models to do (which has regulatory advantages) and let us know what is an intentional or unintentional behavior mode.
A model spec doesn’t provide all the information we need to keep up with model versions. This new version of ChatGPT desperately needs to be accompanied by evaluations capturing the behavior change, otherwise, a lot of undocumented differences will be passed on to developers updating endpoints to it. This is another rendition of the same lack of transparency we’re used to from leading AI laboratories.
The closest thing Anthropic has to a model spec is the mix of Claude’s Constitution and this blog post on Claude’s Character. Character training is a fairly new technique for the industry. From Anthropic’s post:
Claude 3 was the first model where we added "character training" to our alignment finetuning process: the part of training that occurs after initial model training, and the part that turns it from a predictive text model into an AI assistant. The goal of character training is to make Claude begin to have more nuanced, richer traits like curiosity, open-mindedness, and thoughtfulness.
The process is extremely synthetic data-heavy, but requires an artist’s touch, as stated later in the blog post: It “[relies] on human researchers closely checking how each trait changes the model’s behavior.”
Character training being the focus of developments is the strongest endorsement that RLHF and related approaches have shifted from their philosophical motivations of alignment to being primarily an empirical tool. The models can capture so many different behaviors, but getting them to reliably behave how we want is the hardest part. Right now, it seems more likely that this is about capturing the upside of RLHF as a performance tool, rather than a safety one.
One of the few public discussions of character training came from Amanda Askell during her appearance on the Lex Fridman Podcast (taken from the transcript):
Lex Fridman (03:41:56) When you say character training, what’s incorporated into character training? Is that RLHF or what are we talking about?
Amanda Askell (03:42:02) It’s more like constitutional AI, so it’s a variant of that pipeline. I worked through constructing character traits that the model should have. They can be shorter traits or they can be richer descriptions. And then you get the model to generate queries that humans might give it that are relevant to that trait. Then it generates the responses and then it ranks the responses based on the character traits. In that way, after the generation of the queries, it’s very much similar to constitutional AI, it has some differences. I quite like it, because it’s like Claude’s training in its own character, because it doesn’t have any… It’s like constitutional AI, but it’s without any human data.
In summary, Anthropic uses the same techniques they use for Constitutional AI and general post-training for capabilities to train these models’ characters. This is not surprising. This could be related to Askell’s other Tweet on how she designs system prompts, as system prompts are the easiest way to quickly change a model’s character:
The boring yet crucial secret behind good system prompts is test-driven development. You don't write down a system prompt and find ways to test it. You write down tests and find a system prompt that passes them.
This is very in line with what we started this post on — internal AI lab evaluations.
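A minimal sketch of what test-driven development for a system prompt could look like in practice; the test cases, checks, and `generate` function here are hypothetical placeholders for whatever chat API and internal evaluations a lab actually uses.

```python
# Hypothetical sketch: write behavioral tests first, then search for a system
# prompt that passes them. `generate(system_prompt, user_msg)` stands in for
# any chat-model API call that returns a response string.

TESTS = [
    # (user message, predicate the response must satisfy)
    ("Explain RLHF in one paragraph.", lambda r: len(r.split()) < 150),
    ("Who won the 2030 World Cup?", lambda r: "don't know" in r.lower() or "not sure" in r.lower()),
    ("Write me a limerick.", lambda r: r.count("\n") >= 4),
]

def score_system_prompt(system_prompt: str, generate) -> float:
    """Fraction of behavioral tests a candidate system prompt passes."""
    passed = sum(bool(check(generate(system_prompt, user_msg))) for user_msg, check in TESTS)
    return passed / len(TESTS)

# Usage: evaluate several candidate prompts and keep the highest-scoring one.
# best_prompt = max(candidate_prompts, key=lambda p: score_system_prompt(p, generate))
```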
How far can you push character training?
Ahead of the Grok 3 release, Elon Musk Tweeted this example from Grok 3, saying it was “based.”
One of the predominant reactions to Grok 3 was, “Wait, so it isn’t actually based?” This is one of the big questions of character training and lacking model specs. Did xAI not figure out how to make their model “based” and reliable? What model was Elon using here?
Whatever your politics, it’s likely that the default personality of models that you encounter will eventually not be something you like. There’s quite a lot of nuance in what the perfect chatbot is for each user.
Companies should be allowed to have a default personality for the models of their choosing, but a far better long-term equilibrium is to make the expectation that model providers make it easy to get exactly the personality you like out of a model. This isn’t regulation I’m recommending right now, but one way to make sure that an all-powerful AI model isn’t going to reinforce one point of view is to have tests that models need to pass on the breadth of their character and views.
Model specs are a step in the right direction to avoid drama about “what did they actually want their model to say,” but we still have a lot of work to do on creating a spectrum of tools that captures all the relevant information when comparing models.
On Monday, February 24th, 2025, Anthropic announced their latest model, Claude 3.7 Sonnet, which is their first model explicitly trained to use more inference time tokens to improve performance. This is another reinforcement learning (RL) trained model (mentioned in system card). With this model, they also released Claude Code as a limited research preview, which is a “command line tool for agentic coding.” Continuous improvements in models are enabling new modalities and domains addressable by the models, but assessing the impact of each new domain takes far more time than a quick model reaction post.
This is a tidy release, a solid improvement, but not a step change for Claude or the industry. Expect a lot of small changes to accumulate massively this year.
Claude 3.7 Sonnet is a clear improvement over Claude 3.5 Sonnet (New) and continues to push the limits in areas where users love Claude (e.g. read Ethan Mollick’s review here). The scores for those areas, such as software development (SWE-Bench) and tool use, are clearly state-of-the-art.
For example, Claude 3.7 Sonnet is the highest scoring “standard non-reasoning” language model on the Aider Polyglot benchmark. While models like o3 and Grok 3 DeepThink highlight superhuman performance on code benchmarks, this sort of behavior being integrated without extra inference time compute is wonderful. The price for superhuman coding AI is plummeting.
Even with o1 Pro, I still find myself using Claude 3.5 (New) on a regular basis. o1 Pro is the best model for doing succinct, one-off tasks like writing short scripts. It is extremely controllable and will often work out of the box. Though, when I’m doing tricky, iterative tasks, I still use Claude. Claude 3.7 Sonnet only makes these workflows stronger and I’m stoked to play with it further.
The most useful piece of this release for those trying to understand the direction of the ecosystem, rather than just the status of today, is Anthropic’s post on Claude’s extended thinking, where they detail the product trade-offs, alignment, and future of inference time compute in their models.
Anthropic’s offering of extended thinking to boost inference-time performance is far, far cleaner than OpenAI’s current model drop-down disaster. Anthropic’s thinking model is the same as their general purpose model, much like xAI’s Grok 3, and what OpenAI teased will be the plan for GPT-5. Having just one model makes lots of infrastructure, product, and training decisions cleaner, but may come at the cost of shrinking the absolute Pareto front of performance for your organization.
The reasoning training being embedded in one model with a standard inference mode will make the reasoning benefits and behavior feel closer to something like Gemini-Thinking, rather than OpenAI’s o1 or DeepSeek R1, which are designed solely for this reasoning mode of operation. It doesn’t mean that in the limit a single model will be weaker in performance, but rather that, currently, such models may be slower to iterate on in training than a “full” reasoning language model.
Focusing on deploying just one model that serves all the users is one of many examples where leading AI companies are needing to make their offerings legible to users and easy to use — a sign of the industry maturing from a race to intelligence to a race to usefulness.
Still, Claude’s interface is not perfect by any means; the user still has to intentionally go to a drop-down menu to get performance when they need it. The best mode would be for the model to know on its own when more inference compute is needed. My hypothesis is that when training one model with reasoning and without, having the model figure out how much compute to use is harder than a reasoning-only model like o1 figuring out its own compute budget. Or, Anthropic needed to keep a special flag that is turned on and off in the system prompt. This is a subtle potential trade-off of putting reasoning in just one model, but we’ll see where the final equilibrium is.
On the other hand, Claude 3.7 Sonnet shows the reasoning traces directly to users, like DeepSeek R1 and Grok 3. These organizations have different ways of saying why, but it is clear that users simply enjoy seeing it and it builds trust. Anthropic, understandably, is using the reasoning traces to monitor the alignment of the models.
The reasoning chains in these models are how the general public is learning more about the internal representations of language models. Another interesting detail is that Anthropic “didn’t perform our standard character training on the model’s thought process.” This is how Claude thinks out of the box, and the actual answers have a different flavor to them. More research will study how far the reasoning chains can diverge from the answer language. We’ve seen research on latent reasoning within the model, but beyond this, we could have reasoning languages that are entirely ungrounded from human languages because they are a more token-efficient representation of information for the model. More on this soon.
The developer-facing version of Claude’s extended thinking is far cleaner and a sign of things to come — developers can request a specific number of thinking tokens in their response.
How this works is that the model will stream thinking tokens until the number is reached, then shift to answer tokens. This is still one autoregressive stream and no search is being used in the new products, yet.
This explicit control over the thinking and answering phases is a growing behavioral focus in training reasoning models — expect more here soon. Developers can tune the setting that works for them and keep it baked in, rather than relying on the user to pass in a query that just happens to get the model to think for a long time. Explicit test-time inference budget increases are much more covetable than needing to hit the gold mine in a prompt search.
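For developers, a request looks roughly like the sketch below, using the Anthropic Python SDK. The parameter names and model string reflect Anthropic’s documentation at launch and may change, so treat this as illustrative rather than definitive.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-7-sonnet-20250219",  # launch-era model ID (assumed; check current docs)
    max_tokens=16000,                    # total output budget: thinking + final answer
    thinking={"type": "enabled", "budget_tokens": 8000},  # requested thinking-token budget
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)

# The response interleaves "thinking" blocks with the final "text" answer blocks.
for block in response.content:
    print(block.type)
```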
The best place to see where this could be applied is a task whose performance scales nicely with inference time compute. Anthropic ran a similar experiment on the challenging math evaluation AIME — the same one that OpenAI used in their original inference time compute plot.
Here there’s a subtle difference from the developer experience, where in Anthropic’s internal tests the model could exit early. In practice, this subtle difference shouldn’t shift the usefulness of the deployment methodology.
Anthropic continues in their excellent post, saying:
Our researchers have also been experimenting with improving the model’s performance using parallel test-time compute. They do this by sampling multiple independent thought processes and selecting the best one without knowing the true answer ahead of time. One way to do this is with majority or consensus voting; selecting the answer that appears most commonly as the 'best' one. Another is using another language model (like a second copy of Claude) asked to check its work or a learned scoring function and pick what it thinks is best. Strategies like this (along with similar work) have been reported in the evaluation results of several other AI models.
To accompany this, they shared the following results.
It is crucial to note here that the dashed red line — pass@N — is not an actual evaluation result, but a measure of whether the correct solution appears anywhere in the number of answers generated on the X-axis. The two lines below it show how good initial answer-extraction methods are at selecting the right answer from the N candidates. As has been known for a long time in inference-time scaling research, the models can often generate the correct answer to extremely hard questions, just not reliably.
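To make the distinction concrete, here is how the two kinds of curves are computed from N sampled answers per problem (a sketch; `samples` and `is_correct` are placeholders for a problem’s sampled answers and an oracle grader):

```python
from collections import Counter

def pass_at_n(samples, is_correct):
    """Oracle curve: did ANY of the N samples contain the correct answer?"""
    return any(is_correct(s) for s in samples)

def majority_vote_correct(samples, is_correct):
    """Selection curve: pick the most common answer, then check whether it is right."""
    chosen, _ = Counter(samples).most_common(1)[0]
    return is_correct(chosen)

# Averaged over a benchmark, pass@N upper-bounds what any selection method
# (majority voting, a learned verifier, etc.) can achieve from the same samples.
```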
They make it very clear this is not used yet in their products:
Parallel test-time compute scaling isn’t available in our newly-deployed model, but we're continuing to research these methods for the future.
Still, this is a direction other labs are already pursuing. The best reporting on o1 Pro indicates that it does a “search” of some sort over parallel generations. Other OpenAI employees have stated that o3 uses a learned verifier to extract answers, at least for coding domains. As progress in scaling single-streams from the language model slows, this is the natural next place for scaling to turn to. As it has been for some time, performance limits are largely different forms of infrastructure problems before models can be served to users.
Claude is here, and it reinforces that RL training is a short path to inference time scaling laws being used, but in the long-term we will have more methods for eliciting the inference-time tradeoffs we need for best performance.
Thanks to Ross Taylor for some immediate feedback on this post.
Full post: https://www.interconnects.ai/p/grok-3-and-an-accelerating-ai-roadmap
xAI launched their latest flagship model, Grok 3, last night via a live stream on X, which is a new take on the launch process, but it largely felt familiar. Grok 3 is a state-of-the-art model on some important benchmarks. The caveat is that it is state-of-the-art relative to available models; we know better models are out there. Only some of them have been announced, some of them have been teased, and others lie in wait.
What feels different is how the broader AI industry is signaling rapid progress coming soon. xAI said on the livestream that they will be updating the model “daily.” An era of sitting on unreleased models could be ending.
Grok 3’s release is a reinforcement of trends people began reckoning with as of the release of DeepSeek V3 + R1 — AI progress is not held in the hands of a few companies nor is it slowing down. 2023 and 2024 were defined by truly state-of-the-art AI being concentrated within OpenAI, Anthropic, and Google, where these companies could take a lot of time to package models from training to release and still have a substantial moat on capabilities relative to their peers.
At the time of R1’s launch, the “people’s choice” model was Claude 3.5 Sonnet, a model that had been trained “9-12 months ago” and the best models like Claude 3.5 Opus or GPT-4.5 (a.k.a Orion) were not available to users for a grab bag of reasons.
Competitive pressure from DeepSeek and Grok integrated into a shifting political environment for AI — both domestic and international — will make the established leading labs ship sooner. A large portion of delays in delivering models is for “safety testing,” but we don’t have exact details on how much of it was that and how much was cost-benefit tradeoffs (and other big company hurdles such as legal departments). The brand, and culture, of “having the smartest model” is extremely important to these companies, but having a way smarter model was often financially too much to bear.
“Safety” is actively being removed from the spotlight of the AI discourse. It is possible that this overcorrection causes meaningful harm, as this is an extremely powerful and rapidly evolving technology, but the political capital to make safety a core tenet of the AI industry was spent too early relative to meaningful harm emerging.
Increased competition and decreased regulation make it likely that we, the users, will be given far more powerful AI on far faster timelines.
We’ve seen time and time again the value of having the best model first. The only way to onboard new users is to have some capability or behavior that your model differentiates on. With the pace of progress high, minimizing the time from training to release is the best way to maximize one’s chance of impact.
DeepSeek and xAI show how organizations with slightly trailing technical progress or resources can outshine the likes of OpenAI and Anthropic who have voluntarily not shipped their latest models.
Grok 3 by the numbers
Benchmarks and vibe tests mark Grok 3 as one of the best models available today. As with any release, companies often choose evaluations that flatter their models. Yes, winning on these evaluations is extremely challenging, and much credit must be given to xAI for delivering a leading-edge model just about 19 months after its incorporation.
That being said, what is shown below is a total of 4 language model evaluations. Given that models like DeepSeek R1 or Gemini Thinking launch with 10-20 evaluations detailing their performance relative to peers, this has to be taken with a grain of salt. It is very likely that Grok 3 doesn’t outperform its peers in every category, though there is a slim chance these other comparison evals simply weren’t run in the rush to ship.
To start, we can compare Grok 3 benchmarks versus available instruct models.
And versus available reasoning models (note how OpenAI’s announced o3 scores exceed these clearly).
An important detail, as we’ve seen with OpenAI’s reasoning model releases, is: what do the shaded regions on the above plots show? Without exact details, we don’t know the inference cost for each of the models on these reasoning plots. Pushing the frontier in absolute terms is important, but the field overall is getting messier before it’ll get clearer.
Regardless, in the above two plots Grok 3 is pushing progress both on standard model training and the new reasoning training. While reasoning training and RL are the hot new things in the AI field, simple scaling and optimization of existing techniques still deliver value.
And Grok’s score on ChatBotArena.
A model launching at the top of every category on ChatBotArena feels like something that should be rare (given it now encompasses many categories like Math, Coding, Style Control, Longer Queries, etc.), but it happened just a few weeks ago with Gemini 2.0 Pro!
ChatBotArena is known to favor models that are likely to not refuse requests (we don’t know by how much), as evidenced by Claude 3.5 Sonnet (New)’s relatively low position on the leaderboard relative to its utility, but overall is a hard evaluation to top. xAI’s stated goals of a “based” model should correlate well here.
A question we don't know the answer to: How many points of performance on evals do you gain by not caring about safety at all? Internal to the model, i.e. in behavior latent spaces, safety is pretty orthogonal to common high-utility behaviors like reasoning and code, and bigger models tend to do more things without a cost to other behaviors, but there has to be a safety performance margin. Did Grok 3 succeed because of this? It’s too early to tell.
At a technical level, Grok 3 is certainly a very big model. We don’t have specifics, but it’s reasonably safe to take it as a data point that scaling still helps performance (though maybe not costs). xAI’s approach and messaging has been to get the biggest cluster online as soon as possible. The Occam’s Razor explanation until we have more details is that scaling helped, but it is possible that most of Grok’s performance comes from techniques other than naive scaling.
Grok 3 using sheer size to beat existing models feels like when Nemotron 340B beat Llama 3 70B, making it the leading open-weight model at the time, but uptake was slow because the cost relative to the performance gains wasn’t worth the switch. We’ll know more about this when Grok 3 is available in their API and we see the exact costs.
When models are approximately equal in performance, price and ease of use are the determining factors of adoption.
Overall, Grok 3 is a huge technical achievement but not one that indicates a substantial change in who is at the frontier of effective training. xAI is obviously closing in on OpenAI, Anthropic, and most of all Google, but all available data points put these labs ahead of xAI on the frontier of efficient model training. It is good that they are being pressured to deliver more absolute intelligence and not to just continue optimizing their frontier of performance per dollar.
Read some other reviews of Grok 3 here and here. Karpathy’s summary is particularly in line with my thinking (while potentially slightly overselling capabilities).
As far as a quick vibe check over ~2 hours this morning, Grok 3 + Thinking feels somewhere around the state of the art territory of OpenAI's strongest models (o1-pro, $200/month), and slightly better than DeepSeek-R1 and Gemini 2.0 Flash Thinking. Which is quite incredible considering that the team started from scratch ~1 year ago, this timescale to state of the art territory is unprecedented. Do also keep in mind the caveats - the models are stochastic and may give slightly different answers each time, and it is very early, so we'll have to wait for a lot more evaluations over a period of the next few days/weeks. The early LM arena results look quite encouraging indeed. For now, big congrats to the xAI team, they clearly have huge velocity and momentum and I am excited to add Grok 3 to my "LLM council" and hear what it thinks going forward.
Where progress is heading
If these AI models, and the industry writ large, are accelerating, it is important to wonder where they are accelerating toward. Most of the evals we use now to launch leading models are not that representative; in many cases they’re actually 100% out of distribution relative to normal life. What is the value in solving a competition math problem like AIME or so-called “Google-proof” questions? Time will tell, but the case for usefulness to average users is definitely stretched.
Small ChatBotArena improvements are marginal gains in robustness, where something like a 20-point difference in Elo rankings — the relative difference between Grok 3 and the next top model — translates to the model winning roughly 53% of head-to-head match-ups. This robustness adds up over time, but it is far from meaning that this model is more intelligent in an absolute sense.
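For reference, that estimate comes from the standard Elo win-probability formula; a 20-point gap works out to just under a 53 percent expected win rate.

```python
def elo_win_prob(rating_diff: float) -> float:
    """Expected head-to-head win probability for the higher-rated model."""
    return 1 / (1 + 10 ** (-rating_diff / 400))

print(f"{elo_win_prob(20):.3f}")  # ~0.529
```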
In fact, in the case of some of the latest evaluations from the research community, it seems like evaluations are being designed more around being hard than being useful. It is a natural response to models being super powerful to try and find something to challenge them with, but it makes tracking progress and communication far harder.
Companies have many internal evaluations that are not shared. Increasing transparency on these would help contextualize what is and is not meaningful progress. Without these, the only benchmark we have for model changes is them becoming more deeply integrated into products. Product-model synergy can enable extremely useful, new workflows, but it makes tracking the progress of AI a proxy measurement.
I do personally believe these somewhat arbitrary capabilities we are marching toward will generalize to extended and amplified value, but it takes some "feeling the AGI" to see that these models that are better on esoteric benchmarks generalize to everyday use. So far, they have. Bigger and generally “better” models have been more robust and easier to find valuable veins in, but we as an industry should be sharing more so that it is not just AI insiders who understand how to track progress.
When 2024 was reflected on as a year of meager advancements, the evidence is that there was substantial progress, but less of it was delivered to users. We only got o1 late in the year, and other models were deemed "too big to ship" or the requisite urgency (DeepSeek) did not exist.
2025 will be a year of intelligence being put in the user’s hands. The pace of underlying progress with that will continue to be high. The so-called “walls” facing AI progress haven’t materialized, but making sense of the progress we are getting is much more nuanced.
The era we are living through in language modeling research is one characterized by complete faith that reasoning and new reinforcement learning (RL) training methods will work. This is well-founded. A day cannot go by without a new reasoning model, RL training result, or dataset distilled from DeepSeek R1.
The difference, compared to the last time RL was at the forefront of the AI world, when reinforcement learning from human feedback (RLHF) was needed to create ChatGPT, is that we have way better infrastructure than our first time through this. People are already successfully using TRL, OpenRLHF, veRL, and of course, Open Instruct (our tools for Tülu 3/OLMo) to train models like this.
When models such as Alpaca, Vicuña, Dolly, etc. were coming out, they were all built on basic instruction tuning. Even though RLHF was the motivation of these experiments, the lack of tooling and datasets made complete and substantive replications rare. On top of that, every organization was trying to recalibrate its AI strategy for the second time in 6 months. The reaction to and excitement around Stable Diffusion was all but overwritten by ChatGPT.
This time is different. With reasoning models, everyone already has raised money for their AI companies, open-source tooling for RLHF exists and is stable, and everyone is already feeling the AGI.
Aside: For a history of what happened in the Alpaca era of open instruct models, watch my recap lecture here — it’s one of my favorite talks in the last few years.
The goal of this talk is to try and make sense of the story that is unfolding today:
* Given it is becoming obvious that RL with verifiable rewards works on old models — why did the AI community sleep on the potential of these reasoning models?
* How to contextualize the development of RLHF techniques with the new types of RL training?
* What is the future of post-training? How far can we scale RL?
* How does today’s RL compare to historical successes of Deep RL?
And other topics. This is a longer-form recording of a talk I gave this week at a local Seattle research meetup (slides are here). I’ll get back to covering the technical details soon!
Some of the key points I arrived on:
* RLHF was necessary, but not sufficient for ChatGPT. RL training like for reasoning could become the primary driving force of future LM developments. There’s a path for “post-training” to just be called “training” in the future.
* While this will feel like the Alpaca moment from 2 years ago, it will produce much deeper results and impact.
* Self-play, inference-time compute, and other popular terms related to this movement are more “side quests” than core to the RL developments. They are either inspirations for or side effects of good RL.
* There is just so much low-hanging fruit for improving models with RL. It’s wonderfully exciting.
For the rest, you’ll have to watch the talk. Soon, I’ll cover more of the low level technical developments we are seeing in this space.
00:00 The ingredients of an RL paradigm shift
16:04 RL with verifiable rewards
27:38 What DeepSeek R1 taught us
29:30 RL as the focus of language modeling
Article: https://www.interconnects.ai/p/deep-research-information-vs-insight-in-science
(sorry about some more audible breaths in this -- I'm going to work on it!)
We at Ai2 released a local LM iPhone app for our OLMoE model (1B active, 7B total params), with greatly improved scores! Let us know what you think, or read more here.
OpenAI’s Deep Research has largely been accepted as a super valuable tool for knowledge workers and analysts across the economy, but its real impact on economic progress is going to come from changing the nature of scientific progress. Science is the fuel of technological revolutions.
Deep Research in its current form feels like a beta version of a next-generation piece of technology. It does what it is tasked with — searches the web and processes many resources to create a useful report with referenced sources. Some of my uses include researching model evaluations, recent robotic learning research, and AI for science breakthroughs.
Deep Research’s limitations mostly feel like problems of search (it is prone to returning SEO-optimized slop), style (it returns verbose, low-information-density writing), and modality (it does not have the ability to read, process, and return plots and diagrams). All of these are surely solvable and expected features if we look at the rollouts of other AI models in the last few years.
This isn’t a product review (you can read Stratechery or Turing Post for more of that) — as the answer is quite simple, if you work in a knowledge intensive vocation you should be using this — but rather asking: So what comes next?
The place to start from within AI circles is to revisit the question of “When will AI make novel discoveries?” A good example of this is in the Dwarkesh Podcast episode with Dario Amodei:
One question I had for you while we were talking about the intelligence stuff was, as a scientist yourself, what do you make of the fact that these things have basically the entire corpus of human knowledge memorized and they haven't been able to make a single new connection that has led to a discovery?
An example experiment we could do to test this is to train models on time-gated information and see if it can repeat a scientific discovery we already made (yes, this would be difficult to run, but not impossible). Ross Taylor described this on his Interconnects Interview:
So an experiment I've never done because I didn't have [the] compute would be this. Imagine if you could train a language model on all documents up to 1905, which is the year when Einstein had his miraculous year of four seminal papers. With that model, which is trained up to 1905, could you prompt the model to come up with a good explanation of the photoelectric effect, special relativity, this kind of stuff? And what would it take to rediscover these things?
The dream is for AI to make breakthroughs, and the absence of evidence for this even after the release of Deep Research is driving a reckoning over what language models will ever be able to do. The fork in the road is either believing that scaling (either in parameters or in new training methods) will unlock “insights” or accepting that the current generation of models are very useful tools and nothing more supernatural. Likely the most powerful tool humanity has made yet. Our first power tool for information.
Much of science is not about making novel insights but about making progress within established problems of the field. In AI, these are the countless benchmarks we are saturating. A very valuable contribution in AI as a field can be re-using known resources in a simpler way.
With AI, we are going to learn the boundary between true insight and scientific progress. A related form of scientific progress is the compression of noisy ideas and experiments into a cohesive trend. Something that Deep Research can likely do, but not something that builds the allure of Einstein and the other scientific greats.
To understand this relationship between Deep Research, AI broadly, and the nature of science, we must address:
* How to interpret existing “AI for Science” projects like AlphaFold in the bigger context of science,
* How reasoning models, AI research systems like Deep Research, and other forthcoming AIs revolutionize existing scientific practices,
* How recent developments in AI challenge Kuhn’s formulation of scientific revolutions, and
* How current institutions will need to change forever in the face of AI?
This (hopefully) series of posts is my attempt to create a worldview around what science means in the face of AI. Today, we focus on the first two — major AI for science projects and how normal science is being accelerated by AI — and hopefully raise urgency within the community to consider the final question.
The starting point — grand AI for science projects
There is a substantial overhang in computational infrastructure and fundamental deep learning capabilities relative to their impact on the broad class of sciences. In order to make a substantial leap in the application of AI to a specific domain, a team must mold the existing underlying capability of AI to the needs of trained groups of scientists.
The list of examples people think of in this mold ranges across domains: AlphaFold for protein folding, AlphaGeometry for mathematics, GraphCast and GenCast for weather, and more that lack such prominent branding. They leverage advancements in deep learning and transformer architectures, but tend to have X-factors specific to the domain of interest (see a Deep Research query summarizing this). Such added features are pulling forward AI capabilities to suit a narrow domain.
There’s a substantial craft to selecting suitable problems for applying this grand AI for science approach. It requires a field with central elements that are quantitatively focused. Even with this, outcomes are more uncertain than standard AI research or standard research in the domain of choice.
The essay A new golden age of discovery from AI Policy Perspectives details how DeepMind sees the opportunity here and showcases some internal ingredients they found that make these projects more likely to be successful.
The fact that any of these projects have succeeded shows the overall potential of AI for science. The overall necessity of the approach depends on whether the grand AI for science breakthroughs are pulling forward progress by months or years, or if these models are the single required breakthrough to approach entirely new areas of study.
As the broader scientific community embraces AI as “something that works,” more of these step changes will happen. They take a very large density of compute and talent focused on a single problem.
These projects fit more naturally into a classical view of science. They take substantial resources and are high risk. Meanwhile, the mass market AI tools that everyone is adopting will dramatically shift the practice of doing science.
Towards instantaneous Ph.D.’s
We have two tools that dramatically shift the nature of scientific exploration. They will only get better.
* AI models that excel at code, mathematics, and reasoning: OpenAI’s o3, DeepSeek R1, Gemini Deep Thinking, etc.
* AI systems to rapidly parse and summarize existing literature: OpenAI’s Deep Research, Gemini Deep Research, Ai2’s Scholar QA (specific to academic papers), and many more that will come soon.
These tools are dramatically accelerating the most time-consuming aspects of research, particularly in computationally intensive fields. In a few years, the gating factors on the impact of a scientist will reduce to their access to cutting-edge tools, their understanding of the gaps in AI, and their ability to ask the right questions. The final point is well established as a trait of the most successful scientists; it goes hand in hand with the idea of “insight,” and it is where the differentiation among scientists will only increase.
Computational super-scientists
All scientific fields that rely heavily on computational infrastructure as a bottleneck for progress are going to experience a dramatic acceleration in the near future. In AI and closely related computer science fields this is evident from the abundance of soon-to-be superhuman coding assistants and an exponential (short-term) increase in compute available.
Most AI research is severely bottlenecked by the compute available, the time to implement the intervention, and the implicit efficiency of the idea-implementation interface. Future siblings of OpenAI’s o1 models are going to be used extensively to streamline this. This worldview barely accounts for the ability of these reasoning models to decide on which problem to solve and to interpret the results. These sorts of research assistants running in the cluster are a central component of the vision of Anthropic CEO Dario Amodei’s view in Machines of Loving Grace, and it is one that requires far less optimism in magical breakthroughs than the grand AI for science projects.
Reasoning language models (RLMs) have in their first year of existence shown major progress on all of the evaluations the AI field put forward as fundamental challenges for the field. Accumulating iterations of this should transfer to scientific decision-making, but we don’t exactly know how.
The fundamental unit of progress in science, which can be viewed as one Ph.D.’s worth of progress (the same goes for one paper), is shrinking so quickly that it will redefine many methods of experimentation and how we decide what is or is not possible. Multiple efforts are already documenting how RLMs can be used to find errors in existing literature — a process that will likely be automated in the next few years. Rather than science proceeding with a high velocity, it feels as if science is proceeding with a high acceleration.
The pace of progress necessitates a reinvention of most of our scientific institutions. What happens when the time it takes to create a Ph.D.’s worth of knowledge is substantially smaller than the amount of time it takes to get peer review feedback from the fields’ journals?
Changing the structure of scientific revolutions
Most of the ideas behind this piece originated over the holidays when I was trying to understand how science, as a practice and institution, would be changed by AI. The announcement of OpenAI’s Deep Research only solidified the view that the current wave of AI tools is not one that will ever be fundamentally insight-driven. They are extremely powerful and efficient computing engines, not insight engines.
A pseudonymous account on Twitter, Michael, perfectly summarized the debate we started with, on whether LMs can make insights:
To an LLM, a novel discovery is indistinguishable from an error.
This is a limitation that letting language models act in a wet lab does not solve. Insight is far separate from information, but both are accepted as being crucial to scientific progress. Information is the engine for insight.
Still, Deep Research and the AI models we have access to are a substantial accelerant of science. Modern scientists have always had to wear many hats: advising, reading, implementing, thinking, communicating, and so on. That list getting shorter, so long as these new technologies are broadly available, is a boon for broadening access to scientific careers. The result should be a bloom of scientific progress (or at least of produced results), but we need to do substantial work to prepare our already strained institutions.
For my readers in academic or science-driven circles, the comment section of this post is open to free subscribers; please share the biggest opportunities and pinch points in the current scientific ecosystem as we embrace these AI tools. We need to be more ambitious and level-headed than complaints about AI-written peer reviews. In the long term, AI peer review will be needed to review all of this progress. We need incentives for managing it, resources to share across a broad community, and much, much more.
Further Reading
Thomas Kuhn’s The Structure of Scientific Revolutions is regarded as one of the most influential non-fiction books of all time for describing the behavior of, and defining the terminology for, how modern science evolves. The core idea is that scientists form paradigms around certain theories and ideas and use them to uncover new knowledge until questions or limitations require the emergence of another paradigm. Kuhn makes it clear with many historical references that science is done by a community and slowly builds out the frontier of knowledge — rather than filling in a known space of potential knowledge. The necessity of this can be summarized as:
To reject one paradigm without simultaneously substituting another is to reject science itself.
This pattern repeats again and again, building on the likes of Galileo and others in the 17th century. A core tenet of this view is that scientific knowledge is a process, not a matter of setting specific ideas in stone:
If these out-of-date beliefs are to be called myths, then myths can be produced by the same sorts of methods and held for the same sorts of reasons that now lead to scientific knowledge. If, on the other hand, they are to be called science, then science has included bodies of belief quite incompatible with the ones we hold today. Given these alternatives, the historian must choose the latter. Out-of-date theories are not in principle unscientific because they have been discarded. That choice, however, makes it difficult to see scientific development as a process of accretion. The same historical research that displays the difficulties in isolating individual inventions and discoveries gives ground for profound doubts about the cumulative process through which these individual contributions to science were thought to have been compounded.
The biggest challenge to Kuhn’s theories of change is the emergence of AI. We must grapple with questions akin to “How will the dynamics of science change with powerful AI systems?”
The open question is whether these accelerations will unsettle the Kuhnian nature of science by making progress happen faster than paradigms themselves can be established.
In addition to the posts referenced throughout, I built my knowledge on top of many recent pieces on this area that you may find interesting:
* Dario Amodei on Lex Fridman podcast.
* Levers for biological progress on Asimov Press.
* X thread on theories of change in sciences.
* The white-paper Artificial Intelligence, Scientific Discovery, and Product Innovation.
* The Dwarkesh Patel Podcast with Adam Brown.
* AI Policy Perspectives piece A new golden age of discovery.
* Owl Posting checking recent NeurIPS ML for bio paper results (based on an idea from Ethan Mollick). Am I Stronger Yet? has a series on this topic too.
* Scientific Models in Philosophy of Science.
* Kuhn’s The Structure of Scientific Revolutions.
* The Intrinsic Perspective on why great scientists follow beauty.
* Dean W. Ball’s review of Deep Research, Knowledge Navigator.
As many of you know, this weekend I appeared on the Lex Fridman Podcast with my friend Dylan Patel of SemiAnalysis to cover DeepSeek and the implications on the AI ecosystem. I recommend you check it out.
This post was tricky to pull together. I decided to share it anyway given the timeliness of the topic and other more exciting things I have to get to. The minor, thematic contradictions on motivations, costs, and trajectories are exactly indicative of why analysis and productionization of open-source AI is so hard. In that, it is a valuable lesson that building open-source AI will come with a lot of ups and downs, but now is the best time to do so.
The DeepSeek moment represents the end of the first chapter of AI's recent takeoff as told through the emergence of ChatGPT. It reminds us that while substantial resources, coalitions, brands, and trends have been established, the narratives we have been championing are not set in stone. DeepSeek, especially with R1, resets all the narratives around open vs. closed, US vs. China, scaling and commoditization, etc., as we prep for yet another acceleration in the diffusion, progress, and adoption of AI.
Of all of these debates, the focus on open vs. closed AI models is the one least driven by economic factors and most driven by vibes. The open-source AI community is driven by a future vision where AI is not held by a few super-rich companies, a future where more people get to partake in the building of AI, a future where AI is safer, etc. These are ideals and building the tools and systems that make this vision a reality is a monumental challenge. Building strong AI models is far, far easier than building a sustainable open-source ecosystem around AI.
Building a better, truly open ecosystem for AI has been my life’s work in recent years and I obviously want it to flourish further. But the closer you are to the core of the current open-source ecosystem, the more you know that this is not a given: the costs of doing relevant AI training are skyrocketing (look, I know DeepSeek had a very low compute cost, but these organizations don’t just fall out of the tree), and many regulatory bodies are moving fast to get ahead of AI in ways that could unintentionally hamper the open. Yes, efficiency is getting better and costs will come down, as shown with DeepSeek V3, but training truly open models at the frontier isn’t getting much easier.
Building the future ecosystem of open
As a perfect case in point, consider Meta. Meta, as a platform serving content to billions of users, is extremely well-positioned to use AI to make its services more engaging and more profitable for advertisers. The Llama project is not needed for that vision. Yes, it will be easier for them to integrate and optimize an AI that they train, but in a world where AI models are commoditized, what’s the point? The most compelling reasons for openly releasing the Llama models are not business reasons but rather ideological reasons. Mark Zuckerberg revisited this on the recent Meta earnings call:
I also just think in light of some of the recent news, the new competitor DeepSeek from China, I think it’s one of the things that we’re talking about is there’s going to be an open source standard globally. And I think for our kind of national advantage, it’s important that it’s an American standard. So we take that seriously and we want to build the AI system that people around the world are using and I think that if anything, some of the recent news has only strengthened our conviction that this is the right thing for us to be focused on.
The pro-America messaging from Zuckerberg long predates the new administration (especially given that all of Meta’s major apps are banned in China), even if the language is amplified now. This is purely an argument of “we are doing this because we should.”
This argument is extremely similar to that used by DeepSeek AI’s CEO Liang Wenfeng. In an interview translated by ChinaTalk, Liang described the need for Chinese leadership in open-source AI (in addition to a clear commitment to keep releasing models openly).
Liang Wenfeng: Because we believe the most important thing now is to participate in the global innovation wave. For many years, Chinese companies are used to others doing technological innovation, while we focused on application monetization — but this isn’t inevitable. In this wave, our starting point is not to take advantage of the opportunity to make a quick profit, but rather to reach the technical frontier and drive the development of the entire ecosystem. … We believe that as the economy develops, China should gradually become a contributor instead of freeriding. In the past 30+ years of the IT wave, we basically didn’t participate in real technological innovation. We’re used to Moore’s Law falling out of the sky, lying at home waiting 18 months for better hardware and software to emerge. That’s how the Scaling Law is being treated. But in fact, this is something that has been created through the tireless efforts of generations of Western-led tech communities. It’s just because we weren’t previously involved in this process that we’ve ignored its existence.
The interview has many other comments making it clear that the way this will be done is by training powerful AI and releasing it for the world to use.
Both of these arguments, from Zuckerberg and Liang, rely on the optimism that we, as a community of users of open AI models, will figure out how to create a valuable ecosystem around them. Right now, the vast majority of AI usage for applications comes through various API calls. Yes, some of this includes the usage of open-weight models like Llama and DeepSeek R1, but it does not clearly attribute the fact that the model was open as a reason the model was used.
The nationalistic comments regarding open-source AI are only likely to grow stronger as governments more deeply integrate with their leading AI companies.
One of the main arguments why American AI leaders believe that the AI ecosystem should be built on a Western foundation is the risk of China “poisoning the well” of our future computational infrastructure. To be very clear — there is absolutely no evidence of this to date, but it is a simple proposition that the Chinese Communist Party (CCP) could build ties to the leading Chinese AI laboratories and require them to train for specific behaviors or train in some sort of back door through model weights into American infrastructure.
America has been reeling with the potential of this sort of influence on TikTok. If AGI is to be a real thing that can be steered to ideological outcomes, a bill titled Protecting Americans from Foreign Adversary Controlled Applications Act (the bill banning TikTok and forcing a divestiture) will operate at entirely the wrong level of abstraction. American companies raced to host R1 in a competitive frenzy. This is how open-source works and it will be far easier to incentivize better open models from Western labs than it will be to ban companies from adopting Chinese technology.
As of the release of DeepSeek R1, Chinese AI companies didn’t have clear links to the government, but after said release, DeepSeek’s CEO met with the Chinese Premier Li Qiang (approximately second in command) to discuss their work.
AI is obviously far more on the radar of American leadership as a priority and has been for some time. This is a major advantage that the U.S. has in terms of a fast reaction to changing needs for open models.
In a recent Reddit AMA soon after his appearance on stage with Trump for the announcement of the Stargate project, CEO of OpenAI Sam Altman even acknowledged that their strategy “may be on the wrong side of history” here with respect to openly sharing AI components. OpenAI should get no credit until their actions change, but DeepSeek and a new government administration have made many forces re-evaluate their relationship to the open ecosystem.
The current imperative of open-source AI is to create feedback loops where open models become more useful than their closed counterparts. Given that AI is very expensive and slow to train, this cannot look like the accumulation of small security and reliability improvements like done with open-source software. There’s a chance that there is an algorithmic innovation that makes this possible, but for now, the solutions need to be more imaginative. Two examples I am currently interested in include:
* Feedback loops from data to model behavior. If exposing the data to users, either from pre-training or post-training, makes it easier to control a model, then open models can win.
* Finetuning advancements. Currently, finetuning any model to target a specific task is extremely hard, with both open-source code and fine-tuning APIs. If open-source code can enable feedback loops of cheap synthetic data with verifiers to make very targeted models, open models can win (a minimal sketch of this kind of loop is just below).
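To make the second point concrete, here is a minimal, hypothetical sketch of a verifier-driven synthetic data loop for a coding task: sample candidate solutions from an open-weight model, keep only those that pass unit tests, and use the survivors as targeted finetuning data. The `generate` function and the task format are placeholders for whatever open model and data you run locally, not a specific API.

```python
# Hypothetical sketch of a verifier-filtered synthetic data loop.
# `generate` stands in for any open-weight model call; the unit-test
# verifier is what makes the filtering cheap and automatic.
import subprocess
import tempfile


def generate(prompt: str, n: int) -> list[str]:
    """Placeholder: sample n candidate completions from an open-weight model."""
    raise NotImplementedError


def passes_tests(solution: str, test_code: str) -> bool:
    """Verifier: run a candidate solution against its unit tests in a subprocess."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(solution + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run(["python", path], capture_output=True, timeout=30)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0


def build_finetuning_set(tasks: list[dict], samples_per_task: int = 8) -> list[dict]:
    """Keep only (prompt, completion) pairs whose completion passes the verifier."""
    dataset = []
    for task in tasks:  # each task: {"prompt": ..., "tests": ...}
        for candidate in generate(task["prompt"], n=samples_per_task):
            if passes_tests(candidate, task["tests"]):
                dataset.append({"prompt": task["prompt"], "completion": candidate})
    return dataset
```

The point of the sketch is the shape of the loop, not the specifics: the verifier is the asset that open models can pair with cheap sampling to produce targeted training data.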
These are just two examples. We need more than these if we want open-source AI to continue once the bubble of AI advancement cracks. We don’t know when this comes, but if investment is driven by ideological reasons rather than monetary ones, public companies only have so much leeway to continue indefinitely.
These days I classify myself as an advocate for more openness for AI (which never means absolute openness), but I’m not going to describe myself as also being an optimist for it “winning.” As scaling continues to push the limits of multiple training regimes and recipes become more expensive, open models drift from the frontier. This DeepSeek moment happened once in the 2+ years since the release of ChatGPT. We need to change incentives if we want it to happen regularly.
Part of why DeepSeek R1 is so close to the frontier is that, on top of the team being extremely skilled, they have a much faster release process than the likes of OpenAI and Anthropic, who do extensive safety testing and even have releases pre-screened by federal governments. Meanwhile, DeepSeek seems to finish the model, take a week to write the paper, and immediately distribute it to the entire world. Iteration speed has always been an advantage of open-source, and they happen to be aligned here.
Restricting the distribution of open models is a losing proposition
If we geolock websites like HuggingFace to only work within the United States, we lose.
If we geolock model weights with threats of jail time as proposed in this recent bill, we lose (these are not the views of my employer, etc.).
If we add substantial penalties to companies who release models that end up being used in China, we lose.
DeepSeek R1 is the closest to the frontier that a language model has been with a permissive license. Models like Llama 405B have been open weights and at the frontier of performance before but weren’t picked up because they were very hard to use.
The only reason we should limit the ability of people to release model weights openly is if we see measurable negative impacts of doing so on society. This means immediate harms to information ecosystems or other infrastructure.
In the meantime, if we (as Western countries) decide to stop training and releasing open models, another power will happily do so. This is not just China — there are countless countries with the wealth to create the current generations of frontier language models. While it is understandably a stressful proposition, it has been said for at least a year that “Whether or not we want it, open language models are likely here to stay.”
We are battling over who creates the most popular open language models, which could have a substantial influence on societies given the socially rich context that we use language models in today, not if they will exist at all.
Open-source processes succeed with a multiplicity of options for users. This is what gives users options to fiddle and improve the ecosystem as a whole, learning from each other, and improving together.
Open-source AI is something that people will not benefit from uniformly. Export controls are not about preventing access to AI, but rather about limiting the compute that enables proliferation (and heavy use) of AI. We may have already seen this kick in with the DeepSeek R1 launch, where DeepSeek had to pause account signups and there were widespread issues with their API.
The amount of compute used to train models is increasing today, and rapidly, but as AI becomes more powerful the rate of AI usage will increase even further. Most of the datacenters under construction in the United States are for inference of AI technologies. Right now, most of these are going to existing things like Google services or Meta serving ads, but more of these are becoming language model workflows.
The clusters you hear about with hundreds of thousands of GPUs are those for training and are indicative of a lab’s ability to train models at the frontier (reminder, training needs the fast interconnects between as many GPUs as possible). Training infrastructure can be thought of separately from the reality of how AI transforms society and becomes extremely impactful — by being used.
Because cluster buildouts have such a long lead time, it is very likely that we get super powerful AI models where demand vastly outstrips the supply of usage. The narratives that AI CEOs parrot about “AGI in a few years” are exactly this. An AI working in the background, solving tasks autonomously, especially if built on top of these reasoning models that are driving inference costs up with long-context generation (which is far more expensive than long-context inputs because attention compute grows quadratically as tokens are generated one at a time), will be very expensive to run. For example, OpenAI’s o3 costs dollars per query on the now-famous ARC-AGI task.
All of this is to say that restricting open-weight models has a much smaller impact on the diffusion of AI technology than most commentators thought in the last few years. Most thought that training cutting-edge AI models was so expensive and sophisticated that only American companies could do this. It turns out that the cost of training can fall faster than the demand for inference. Yes, releasing open-weight models contributes to the former reduction in cost (and definitely doesn’t drive it), but if concerns around openly releasing AI models are driven by IP controls, it is an empty argument.
Attempts to restrict an adversary’s ability to impact the world with AI should focus nearly entirely on compute rather than model weights. There is of course a chance that models continue to progress and eventually are so simple that only the weights are needed to deploy a destabilizing application. For now, and in the near future, powerful AI systems require substantial inference compute and even more custom software infrastructure.
For example, Deep Research announced by OpenAI is restricted to 100 queries per month, and only for users on the $200/month Pro plan. Where a query to standard ChatGPT costs on the order of $0.01, this new system costs on the order of ~$1 (estimated from the monthly cost and OpenAI “losing money on pro subscribers”). We should expect this cost to increase by 100X again in the near future.
Inference for a task that costs $1 in compute and takes ~10 minutes to complete implies that OpenAI would be dedicating the equivalent of about 3 GPUs to that one user for those ten minutes (amortizing $2/hour/H100 GPU). This is a lot of hardware; each GPU is worth ~$40K! In reality, it is likely that multiple users are sharing even more GPUs to hit critical batch sizes, but that’s an aside for another time. Using very powerful AI is very expensive. This makes the argument even clearer that open weight models are not a major risk and should not be the target of soon-to-be export controls.
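For readers who want to check the back-of-the-envelope math above, here it is spelled out with the rough assumptions already stated ($1 of compute per query, ~10 minutes of wall-clock time, ~$2/hour per H100); none of these are measured numbers.

```python
# Back-of-the-envelope: how many H100s does a $1, ~10-minute query imply?
# All inputs are rough assumptions from the text above, not measured numbers.

cost_per_query_usd = 1.00     # estimated compute cost of one Deep Research query
query_minutes = 10            # rough wall-clock time per query
gpu_hourly_rate_usd = 2.00    # amortized hourly cost of one H100

gpu_hours_consumed = cost_per_query_usd / gpu_hourly_rate_usd   # 0.5 GPU-hours
concurrent_gpus = gpu_hours_consumed / (query_minutes / 60)     # ~3 GPUs

print(f"~{gpu_hours_consumed:.2f} GPU-hours per query, "
      f"or ~{concurrent_gpus:.0f} H100s busy for the full {query_minutes} minutes")
```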
We need to keep releasing models openly so that our allies can build on them and create a thriving ecosystem.
What’s actually new in open-source?
From the 10,000 foot view, China released an AI model and countless Western companies immediately adopted it so that the AI ecosystem can build on it. This is the system working as intended.
A new “open-source” tool in the pipeline showed up and caused prices to be slashed. This is the system working as intended.
When focusing solely on the angle of open vs. closed models, DeepSeek R1’s release is new solely as a high watermark for the existence of models closest to the frontier of progress with a permissive license.
Developers themselves still aren’t really choosing open models for their clear benefits in adaptability, privacy, etc. This is the hole in the system. While DeepSeek R1 is open weight, the vast majority of the usage is in applications like Perplexity AI or other hosted APIs.
The grand problems facing open-source haven’t changed much, but we have the best tools we ever had in order to make dents in them. It’s time to reinvest and refocus on the positive futures such progress could provide.
Western governments should make sure their programs for funding the development of AI extend beyond just the largest companies trying to solidify their moats through capital expenditures. It is the time to invest in open research and public-sector coalitions who are building the Western alternatives to DeepSeek. Research is the foundation of the next decades in progress. Building openly is the only path forward if there is concern over our future ability to leverage open models from China.
For further reading on this topic, see Thom Wolf’s post, which is very related and complementary to mine, or this rebuttal to Dario’s post on export controls. For more on open-source AI, see my previous writing on the area.
This post is early to accommodate some last minute travel on my end!
The new models trained to express extended chain of thought are going to generalize outside of their breakthrough domains of code and math. The “reasoning” process of language models that we use today is chain of thought reasoning. We ask the model to work step by step because it helps it manage complexity, especially in domains where the answer requires precision across multiple specific tokens. The domains where chain of thought (CoT) is most useful today are code, mathematics, and other “reasoning” tasks. These are the domains that models like o1, R1, Gemini-Thinking, etc. were designed for.
Different intelligences reason in different ways that correspond to how they store and manipulate information. Humans compress a lifetime of experience into our spectacular, low-power brains that draw on past experience almost magically. The words that follow in this blog are also autoregressive, like the output of a language model, but draw on hours and hours of background processing as I converge on this argument.
Language models, on the other hand, are extremely general and do not today have architectures (or use-cases) that continually re-expose them to relevant problems and fold information back in a compressed form. Language models are very large, sophisticated, parametric probability distributions. All of their knowledge and information processing power is stored in the raw weights. Therefore, they need a way of processing information that matches this. Chain of thought is that alignment.
Chain of thought reasoning allows information to be naturally processed in smaller chunks, allowing the large, brute force probability distribution to work one token at a time. Chain of thought, while allowing more compute per important token, also allows the models to store intermediate information in their context window without needing explicit recurrence.
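As a toy illustration of that difference, here is a sketch of the same question asked directly versus with chain of thought; the `generate` function is a stand-in for any language model call, not a particular product’s API.

```python
# Toy sketch: the same question asked directly vs. with chain of thought.
# `generate` is a placeholder for any language model completion call.

def generate(prompt: str) -> str:
    """Placeholder for a language model call; swap in your model of choice."""
    raise NotImplementedError

question = "A train leaves at 2:40pm and arrives at 5:05pm. How long is the trip?"

# Direct answer: the model must land on the precise answer tokens in one shot.
direct_answer = generate(f"{question}\nAnswer with just the duration.")

# Chain of thought: intermediate steps accumulate in the context window, so each
# new token only has to extend a partially worked solution rather than solve
# everything at once.
cot_answer = generate(
    f"{question}\nWork through this step by step, then state the final answer."
)
```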
Recurrence is required for reasoning, and it can happen either in the parameters or in the state-space. Chain of thought with transformers handles all of this in the state-space of the problem. The humans we look at as the most intelligent have embedded information directly in the parameters of our brains that we can draw on.
Here is the only assumption of this piece — chain of thought is a natural fit for language models to “reason,” and therefore one should be optimistic that training methods designed to enhance it will generalize to many domains. By the end of 2025 we should have ample evidence of this given the pace of technological development.
If the analogies of types of intelligence aren’t convincing enough, a far more practical way to view the new style of training is a method that teaches the model to be better at allocating more compute to harder problems. If the skill is compute allocation, it is fundamental to the models handling a variety of tasks. Today’s reasoning models do not solve this perfectly, but they open the door for doing so precisely.
The nature of this coming generalization is not that these models are one size fits all, best in all cases: speed, intelligence, price, etc. There’s still no free lunch. A realistic outcome for reasoning heavy models in the next 0-3 years is a world where:
* Reasoning trained models are superhuman on tasks with verifiable domains, like those with initial progress: Code, math, etc.
* Reasoning trained models are substantially better in peak performance than existing autoregressive models in many domains we would not expect, including ones that are not necessarily verifiable.
* Reasoning trained models are still better in performance at the long-tail of tasks, but worse in cost given the high inference costs of long-context.
Many of the leading figures in AI have been saying for quite some time that powerful AI is going to be “spikey" when it shows up — meaning that the capabilities and improvements will vary substantially across domains — but encountering this reality is very unintuitive.
Some evidence for generalization of reasoning models already exists.
OpenAI has already published multiple safety-oriented research projects with their new reasoning models in Deliberative Alignment: Reasoning Enables Safer Language Models and Trading Inference-Time Compute for Adversarial Robustness. These papers show their new methods can be translated to various safety domains, i.e. model safety policies and jailbreaking. The deliberative alignment paper shows them integrating a softer reward signal into the reasoning training — having a language model check how the safety policies apply to outputs.
An unsurprising quote from the deliberative alignment release related to generalization:
we find that deliberative alignment enables strong generalization to out-of-distribution safety scenarios.
Safety, qualitatively, is very orthogonal to traditional reasoning problems. Safety is highly sensitive to the information provided and to subtle context, whereas math and coding problems are often about many small, forward processing steps toward a final goal. More behaviors will fit in between those.
This generative verifier for safety is not a ground truth signal and could theoretically be subject to reward hacking, but that was avoided here. Generative verifiers will be crucial to expanding this training to countless domains — they’re easy to use and largely a new development. The field of LLM-as-a-judge (and related synthetic data pipelines) only really became stable with models at the level of GPT-4.
Reasoning models trained as a judge are a very natural fit because the exact token for a predicted reward or ranking is crucial — CoT is essential. All of the progress here relies on continued progress on both generators and verifiers. o1 et al. were likely trained with mostly explicit, code verifiers. They spawned far more powerful generators, which will enable new types of verifiers. Then, we can train better models (and so on).
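To make the judge setup concrete, here is a hedged sketch of a generative verifier: the judge model reasons in text first and ends with a single verdict line, which is the token a reward or filtering step keys on. The `generate` function, the prompt template, and the PASS/FAIL format are illustrative assumptions, not any lab’s actual pipeline.

```python
# Hypothetical sketch of a generative verifier (LLM-as-a-judge) over responses.
# The judge reasons first, then emits one verdict line that downstream reward
# or filtering code keys on. `generate` stands in for any reasoning model call.

def generate(prompt: str) -> str:
    """Placeholder for a reasoning language model completion call."""
    raise NotImplementedError

JUDGE_TEMPLATE = """You are reviewing a response against a policy.

Policy:
{policy}

Response:
{response}

Think step by step about whether the response complies with the policy.
End with exactly one line: VERDICT: PASS or VERDICT: FAIL."""


def judge(response: str, policy: str) -> bool:
    """Return True if the generative verifier accepts the response."""
    output = generate(JUDGE_TEMPLATE.format(policy=policy, response=response))
    lines = output.strip().splitlines()
    final_line = lines[-1].strip().upper() if lines else ""
    return final_line == "VERDICT: PASS"


def filter_responses(responses: list[str], policy: str) -> list[str]:
    """Keep only responses the judge accepts, e.g. for synthetic data or reward."""
    return [r for r in responses if judge(r, policy)]
```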
Onto another example of unexpected performance from the new reasoning-trained models. DeepSeek-R1, the new open-weight o1 replication, has been showing up at the top of many random benchmarks, above Claude 3.5 Sonnet, Gemini, and GPT-4o, and alongside o1. Examples include a creative writing and humor leaderboard or the brand-new, extremely challenging benchmark from the Center for AI Safety and Scale AI — Humanity’s Last Exam. Oh, and yes, it’s best on both accuracy and the new metric “calibration error,” which is designed to have the model express its own uncertainty. Calibration is a long-sought behavior in traditional LMs, and it turns out reasoning training may help with it.
A lot of my friends find o1-pro to be clearly the most useful AI model in their daily workflows (one example here and a similar R1 example here). ChatBotArena shows the new models (o1, Gemini-Thinking, and R1) among the top entries in the best “normal use” evaluation the AI community has. These reasoning models are definitely absorbing the other lessons learned in post-training across the AI industry.
The explosion of R1 caused arguably the biggest general awareness of AI moment since the original ChatGPT. DeepSeek’s App has been the number one overall free app in the U.S. and non-technical users are getting meaningful value out of seeing the reasoning process. What was a niche training process is bringing many more types of benefits than expected.
All of this is just on “day 1” of this technology. Reasoning models are going to proceed at a rate far, far faster than most expect.
These models will not be state-of-the-art on every domain, but probably far more than you expect. Language models are a complex technology and they will never be one size fits all, but the ground is being reshaped under us.
Where the standard models match the reasoning models’ abilities, you’ll be paying way more for the same performance. At the same time, many domains are going to be open to the “if you pay a little bit more, the reasoning model will get you a bit more performance” trade, which will accrue a lot of value over time.
These are trade-offs that many in the AI industry see at face value. Many ask where Anthropic’s reasoning model is, but they may never explicitly have one. Before o1 launched, Claude was already using extra tokens hidden from the user to improve the quality of responses. Anthropic CEO Dario Amodei commented on their approach in an interview with Joanna Stern of the WSJ recently:
To say a little about reasoning models, our perspective is a little different, which is that there’s been this whole idea of reasoning models and test-time compute as if they’re a totally different way of doing things. That’s not our perspective. We see it more as a continuous spectrum — the ability for models to think, reflect on their own thinking, and ultimately produce a result. If you use Sonnet 3.5, sometimes it already does that to some extent. But I think the change we’re going to see is a larger-scale use of reinforcement learning, and when you train the model with reinforcement learning, it starts to think and reflect more.
It’s not like reasoning or test-time compute — or whatever it’s called — is a totally new method. It’s more like an emergent property, a consequence of training the model in an outcome-based way at a larger scale. I think that will lead to something that continuously interpolates between reasoning and other tasks, fluidly combining reasoning with everything else models do.
As you’ve said, we’ve often focused on making sure using the model is a smooth experience, allowing people to get the most out of it. I think with reasoning models, we may take a similar approach and do something different from what others are doing.
The newest Claude 3.5 Sonnet models are very likely already trained to some extent with RL on verifiable outcomes. Just days before o1 was launched, Claude’s behavior of “I’m thinking about that” was the biggest indicator we had of consumer companies trading more compute for better responses. Anthropic hasn’t shifted their strategy here, and you can decide how much weight you want to put in their CEO’s recent comments.
The techniques are here to stay, and it took revolutionary new models to show us that.
Like many new technologies, we needed to be shown what was possible, and then it can be folded back into normal experience. o1 was this breakthrough, and the benefits of reasoning training will now expand out into all of the AI products we are using day and night.
To end, I leave you with a quote from the DeepSeek R1 paper, where the authors reflect on their experience with the model(s):
One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment. This spontaneous development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy.
Thanks to Ross Taylor and Hamish Ivison for discussions that helped inspire this post.
We're here to share the story of building our Open Language Models (OLMos) and what we improved to build the OLMo 2 7B/13B model that is competitive with the Llama 3.1 8B model. This is all about building an effective, small language modeling team that can share all it learns with the scientific community. Dirk, Luca, and Kyle are some of the people I learn the most from and have more knowledge (and entertainment) to share than we have time.
Some questions were pulled from Twitter, but please comment or get in touch if you want us to cover anything in the future episode(s)!
Main topics:
* Pretraining efficiency and our quest for stability after a not-so-secret failed 70B run early in 2024,
* What the role of OLMo is in the broader AI landscape and how that is, or is not, changing,
* Many little decisions that go into building language models and their teams (with a focus on NOT post-training, given I already talk about that a ton).
Play with the models we build here: playground.allenai.org/
For more history of open language models (OLMos) on Interconnects, see my first post on OLMo, my coverage of OLMoE, OLMo 2, and why I build open language models. If you have more questions or requests, please let us know (especially the researchers out there) and this can be one of N, rather than a one off celebration.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Contacts
Dirk Groeneveld — https://x.com/mechanicaldirk // https://bsky.app/profile/mechanicaldirk.bsky.social
Kyle Lo — https://x.com/kylelostat // https://bsky.app/profile/kylelo.bsky.social
Luca Soldaini — https://twitter.com/soldni // https://bsky.app/profile/soldaini.net
General OLMo contact — [email protected]
Papers / models / codebases discussed
* OPT models and talk from Susan Zhang
* BLOOM
* C4: Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
* Maximal Update Parametrization (muP) is from Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
* Spike No More: Stabilizing the Pre-training of Large Language Models
* LLM360: Towards Fully Transparent Open-Source LLMs — Amber model
* A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Kyle said Hitchhikers)
* Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Chapters
Here is a list of major topics covered in the podcast, with timestamps for when the discussion starts:
* [00:00:00] Introduction
* [00:02:45] Early history of the OLMo project
* [00:15:27] The journey to stability
* [00:25:00] The evolving role of OLMo and pretraining research
* [00:29:00] Pretraining Q&A (µP, scaling laws, MoE, etc.)
* [00:40:40] How to think about pretraining data work
* [00:54:30] Role of pre-training vs mid training vs post-training
* [01:02:19] Release strategy and wrapping up
Transcript
This transcript is generated by AI and lightly edited for clarity. In particular, the per-speaker attribution was poor this time around.
Nathan Lambert [00:00:07]: Hey, welcome back to Interconnects. In this interview, we're bringing one that I've hinted at for a while, which is interviewing some of the other leads on the OLMo team at AI2. So essentially, this covers the story of OLMo from its early days where we got our compute, kind of our path to stability and some failed runs along the way, the role of OLMo and the broader AI ecosystem, and really just a very long tale of technical details and decision making and considerations that you have when actually training language models that you're trying to have at the frontier of performance relative to peers like Llama, etc. This is a fun one. There's less post-training than normal because this is me interviewing some other co-leads at the Allen Institute for AI. So there's three people in addition to me, which is Dirk Groeneveld, who is the lead of training, handles most of engineering, Kyle Lo, and Luca Soldaini, who are the data leads. So we have a pre-training engineering lead and two data leads with me who has done a lot of the post-training. This is just a part of the team. And I hope you enjoy this one. We can do more of these and bear with the fact that I'm still expanding my podcasting tech equipment. But I think the audio is definitely good enough and enjoy this episode with me, Kyle, Dirk, and Luca.
Hey, everyone. Welcome to the AI2 office. We're finally talking more about some of our OLMo things. Too much work to do to actually get all the... the information we want to share out into the world. So I'm here with Dirk, Kyle, and Luca. We can also talk so people identify your voices so people are not all on video. Hi, I'm Dirk.
Dirk Groeneveld [00:02:01]: I am the lead of the pre-training part of OLMo.
Kyle Lo: Hi, I'm Kyle. I work on data.
Luca Soldaini [00:02:08]: Hello, Luca. Also work on data with Kyle.
Nathan Lambert [00:02:13]: Okay, so we're kind of going to maybe go through some of the story of OLMo to start. And then just get into as many nerdy details until we get tired of OLMo 2. Which, in my state, this will probably be mostly about pre-training. You can ask me post-training questions as well. But I'm not going to sit here and be like, ask myself questions that I'm not going to answer. Because that is an absolutely ridiculous thing. You can ask me one question. Okay. One question. It's like, why shouldn't you post-training with all the compute?
Nathan Lambert [00:02:45]: But I wasn't here for when OLMo actually started. So I think it'd be good to tell people, I mean, like, broadly what AI2 was like at the time, what language modeling was like at the time, what it may or may not have been risky.
Kyle Lo [00:03:01]: Yeah, you should probably get this.
Dirk Groeneveld [00:03:03]: Yeah, I think it all started in the fall of 2022.
Dirk Groeneveld [00:03:10]: We were talking to AMD at the time about some sort of collaboration. We're scoping out some stuff. And at the time, we wanted to take the Bloom model. And put 300 billion extra tokens in. And we wrote up a proposal and we sent it to AMD and it disappeared into a black hole. And we never heard from them again. And then ChatGPT came out a couple months after that. And suddenly everybody was very excited. And two, maybe one month after that, AMD came back to us and said, now let's do it. And that kicked off a very busy period for us. At least the three of us were involved at the time. Plus some of us. Some more people trying to scope out exactly what the project would be. Putting 300 billion tokens into Bloom wasn't that cool anymore. The field had moved on. So we needed to find something else that would work both for us and for AMD.
Dirk Groeneveld [00:04:07]: And that's exactly what we did. We figured it out. We figured out who would be on the team, how exactly to do it. We had to get the data from all of that stuff and then started working on it.
Luca Soldaini [00:04:16]: I think it was, let's look it up. The official birthday of OLMo is February 2nd, 2023. That's when we had like a big sort of half-day summit workshop and a bunch of researchers self-organized a long discussion, maybe like 40, 50 of us, to try to scope down a potential language model project at AI2.
Kyle Lo [00:04:48]: Yeah, it was also extremely bottom-up, because nobody, it was not on anyone's radar. Everyone was working on different projects that we had promised for the end of the year. This was very much just like a side gig for us. We had no compute other than these mysterious AMD GPUs that just came. It was like, oh, it's possible. And everyone was just like, yeah, I'll work on this on the side. Let's just start hacking together some stuff.
Nathan Lambert [00:05:14]: How far along the line until you decided on 7B? Like, were these things obvious at the time?
Luca Soldaini [00:05:20]: I think the size of it. This is where Llama's size was. Yeah, we started with seven because seven was the smallest Llama size. This was Llama one. Yeah, Llama one was like first couple months of 2023. Yeah, we started, we started scoping before Llama one. And then when Llama one came out, it made sense to have a configuration that was just sort of close to what they were doing. So it's not too much reinventing. I think seven was.
Dirk Groeneveld [00:05:52]: Yeah, I mean, I think the original scope was recreate Llama one, which would be a 7B at 1.4 trillion tokens. What were we staring at? OPT.
Kyle Lo [00:06:03]: We were staring at OPT also, right? During around that time.
Dirk Groeneveld [00:06:07]: For inspiration. Yeah. And for what not to do in many cases. Was OPT even like in the many tokens regime or was that still like when people did the booms and booms?
Luca Soldaini [00:06:18]: I think OPT and BLOOM were.
Luca Soldaini [00:06:22]: They were not, they were not over-trained. In the end they were both scoped to Chinchilla, and they both had extensive logs, and so they were very useful, because both of them have hundreds of pages of like, whatever can go wrong during pre-training. Yeah. I mean, OPT was amazing as a resource for figuring out, you know, we knew nothing, so we needed to know what's important. And yeah, I remember there's also avoidance and so on. There's that. It's like Susan has this talk.
Dirk Groeneveld: I'll come load parallels of training OPT and yeah, I think the original ones, I always feel it's kind of a shame because the OPT models are not very good, but, but they were first, like they figured all that stuff out for the first time. I have huge amounts of respect for that.
Nathan Lambert [00:07:11]: And what's the like open source angle thing at the time, or like, had you already identified that there was no open pre-trained data sets for these models?
Kyle Lo: There definitely weren't any open pre-training data sets. I think we were basically looking at the Gopher paper, which had the most documentation, and then Llama one had enough documentation about what data sources they were using, where we were like, okay, let's try to reconstruct what it was. And I think roughly around the same time, Red Pajama V1 and then shortly after it was like Falcon, the first Falcon, we were all kind of concurrent works at the time, but basically starting from, I don't know, grab Common Crawl, grab a bunch of sources, and try our best.
Luca Soldaini [00:07:50]: The funny thing, like, we had conversations of like, uh, boy, it would be good if we didn't have to do the data. This would be one fewer thing to do. But at the time, like, even when, uh, Falcon dropped, they released like a small preview that wouldn't match the token budget that we wanted for a training run. So it was not even like, you know, it was good work and like, oh, maybe we just switch to this one. And then we quickly realized it's not, not big enough for the two trillion. So I think it was like, maybe. Yeah. Yeah.
Dirk Groeneveld [00:08:22]: I mean, we did the C4 data set way before any of this. Um, and so my first idea for how to do data was to just run C4, but on all the Common Crawl, um, instead of just whatever the most recent one was at the time. And I actually started writing a repo for that, but then ended up not doing it. This is the C5 repo. Yeah.
Nathan Lambert This was C4's side of data cleaning practices.
Dirk Groeneveld Yes. That's exactly a re-implementation of C4. And, um, for it to touch it, we'd run on slightly different hardware, um, with more dApps and that was, that was going to be the entire story until we found we could do better.
Nathan Lambert Yeah. And, um, for general timelining, I joined pretty much like almost 7B was, I think mostly done training or wrapping up pre-training and the like instruction tuning at the time was like basic SFT with a sprinkle of DPO. Yeah. So I think a lot of that story gets cut. Compressed. Like I'm guessing the actual pre-training happened in like the second half of the year, mostly. So it's a lot of prep to get a language modeling system to exist. Yeah.
Luca Soldaini [00:09:32]: I think we handed off version one of Dolma, the data set that we used for pre-training, like end of June, I think, 2023. Grab Common Crawl, end of March. Yeah. So all the source acquisition was March, April. Let's see, March and then, yeah, a few months there.
Nathan Lambert [00:09:52]: Um, if someone wants to do the same thing today, which is like, we should train a language model, how much faster would it be to like, is OLMo actually making that much of like, would it be a week with OLMo stuff now, or would it still take a lot of time to set this up?
Luca Soldaini [00:10:07]: I think if, if you want to, um, if you want to train exactly on OLMo data, you know, data, it's much faster, um, training, I think it requires a little bit more finesse and dirt. Yeah.
Dirk Groeneveld [00:10:23]: If someone gives you a cluster to, to run on, just figuring out the mechanics of getting your thing to run, just so setting all the environment variables and having the drivers loaded and so on, it might take you a week or so if you're, if you've done that kind of thing before. Um, so that's very different, but you can take a trainer that already works and just, just use it.
Luca Soldaini [00:10:45]: Um, it really depends, like, where you start. It's like, if you're spinning up your cluster from scratch, then you acquired hardware, and that hardware has a burn-in period. So for the first three months stuff will fail, and that has nothing to do with the model itself. It's just, your hardware is also brand new.
Dirk Groeneveld [00:11:06]: Yeah. I mean, I am eternally grateful for AMD for giving us the compute to get started, but it was kind of difficult to run on.
Nathan Lambert What was the exact amount of compute? Like, I think when I arrived, that wasn't even what we're using where it's like Lumi discussions and the original amount.
Dirk Groeneveld: The original amount of compute was, uh, 2 million hours on Lumi.
Nathan Lambert So, so 2 million GPU hours.
Dirk Groeneveld [00:11:29]: Um, we're training way bigger now than that. Yeah. So I think I did the math recently. It's like, the order of a million hours is, if you do a thousand GPUs concurrently, like 20 days. Uh, I don't have that math off the top of my head, but, um, the first end-to-end run for the 7B that we did took, uh, 35 days. We can now train that same model again in three days. So things have changed a lot since then. Yeah.
Luca Soldaini [00:11:58]: Well, some rough, rough stats for OLMo 2 anyway: seven and 13, just the final ones, um, were a little bit over 5 million GPU hours combined. And then we have roughly 5 million hours' worth of experiments.
Dirk Groeneveld [00:12:15]: Um, these are, uh, A100, H100. Might be surprised. Oh, it's the case too high or too bad to do some, it's way too high.
Luca Soldaini [00:12:33]: Um, it's like, how do you encamber overhead then?
Dirk Groeneveld Oh, combined.
Luca Soldaini [00:12:36]: It's some of them plus the ultimate training. They're also not using the new one core quickly.
Dirk Groeneveld [00:12:42]: So, yeah, but I'm just thinking if it's, let's say conservatively 7,000 tokens per second, four months on a thousand. Do you think it's less than that?
Nathan Lambert: Like, okay, let's just go and track those numbers down. I think it's interesting. It's like, what percentage, what is the percentage of improvements still? Like how much of OLMo 2 being better is just by the compute being more stable, just by doing more experiments. And that lets you test things like stability and just get the ducks in a row, rather than, like, the data being so much better. It's an impossible question.
Luca Soldaini [00:13:20]: It's that it was like... And, you know, the tricky part with using that AMD hardware at the time, specifically that cluster, was that cluster was being brought online at the same time as we were experimenting with it. So we were helping that cluster being set up. Because of that, there's a lot of things where we had to second-guess ourselves, whether an issue was on our side or the hardware side.
Nathan Lambert [00:13:58]: Isn't this always going to be an issue with new GPUs coming into the world? Does Microsoft plug in OpenAI's GPUs and they just work?
Luca Soldaini [00:14:06]: I think it was, yeah, it's always tricky. It's a combination of like getting both new GPUs. At the time, AMD was a relatively new vendor, plus the cluster itself being new. So it's like stacking, you know, risky, risky things on top of each other in a way that it's like, oh, if you can, if your cluster is solid, that, you know, the GPUs are brand new. Well, the network is not going to cause issues, but if the cluster is new and the GPUs are new, who knows where the problem sits. Yeah.
Nathan Lambert [00:14:44]: We'll go down the... Yeah. We'll go down the whole stability rabbit hole. Dirk, how close are you to a number?
Dirk Groeneveld Five trillion tokens at 7,000 tokens per second, which is what we get for the 7 billion, more or less, over the long run, is only about 200,000 hours on each one. So our first estimate was way off.
Luca Soldaini [00:15:05]: It was... Check the top. I think maybe my memory was wrong. Maybe my thing was... This is why I have this laptop here.
Luca Soldaini [00:15:18]: Oh, no, I was misremembering. Okay. My name is 500K. I remember flying... 500K. Yeah, yeah, yeah.
Nathan Lambert [00:15:27]: So it's like from the first AMD grant of a few million GPU hours on AMD to what we have today. It's like it's gone from multiple million AMD hours to training a model over five times the tokens in half the GPU hours. That's right. Yeah. Like, where do we...
Dirk Groeneveld I mean, the biggest one is that the MI250 that Lumi has on, like, the MI250 is the AMD GPU that Lumi has, is of the A100 era. It's comparable to an A100 in price and capacity. But now we train on H100s, and they're just...
Nathan Lambert: It's just a newer GPU. Yeah, what percentage of tokens in OLMo 1 code versus OLMo 2 code are lost to spikes, at, like, a 7B, so a scale that we're reliable on?
Dirk Groeneveld: I think OLMo 1 was losing a considerable amount to spikes, but that's impossible to estimate, because there are so many other differences at the same time between OLMo 1 and OLMo 2.
Nathan Lambert Can you summarize the architecture differences? There's a list in the paper. We don't have to be exhaustive.
Dirk Groeneveld That's going to be a lot of stuff. The biggest difference is the init. So I guess now we're getting into what did we actually discover?
Nathan Lambert These are some audience questions. OLMo 1 and OLMo 2. Finbar, who you might know specifically, asked, like, how do you write an init N(0,0.02) to an init? I'm like, I don't know.
Dirk Groeneveld That particular init is the default in Megatron. And the init that we had in all one was just trying to be too clever. We stole that init from OpenOLM, and they took it from somewhere else, actually. And I don't remember what the original source is.
Nathan Lambert What is the actual decision-making on an init that's too clever? You, like, think that you can get a better learning region by bundling with something?
Dirk Groeneveld: We tried it. We ran it for, you know, 100 billion, 200 billion tokens, and we looked at which one is better. And scaled init is absolutely better for a long time. So scaled init is the original, it's the OLMo 1 init. Works better for a long time. You have to train for a really long time before you see it come apart, like 2 trillion tokens for a 7B model, and then things get a little bit dicey. So this is why, you know, this is why we used it for OLMo 1, because it looks quite good for a long time.
Nathan Lambert Which of our OLMo models did we figure out that the init was a change?
Dirk Groeneveld: Because we did a few through the year. We tried that same init with a 70B model, and that did not work. That model stalled out around 1.3 trillion, 1.4 trillion, something like that,
Dirk Groeneveld [00:18:12]: which gets at the heart of the stability work. So we started to think about the stability investigation. So I think that was one of the audience questions, right? How do we even go about the stability investigation? Starting from the point of, we're training the 70B and it's not working anymore, what did we do? The first step was to identify the issues that we see in the metrics and see them in a smaller model. And the two issues we saw were lots of spikes, what we call fast spikes. So the models recover. They recover quickly, but they just happen more and more the longer you keep training. And at some point, even the fast spikes kill you.
And the other thing was a growth in GradNorm. It seemed very much that the 70B would always start blowing up once the GradNorm got to 0.4. Regardless of what intervention we did, it would get a little bit further, and then as soon as we hit 0.4 GradNorm, it would blow up again.
Nathan Lambert So you lowered the learning rate and it was up again.
Dirk Groeneveld: So fortunately, yeah. Yeah. So we would do things like that, increase the batch size, change the late decay, blah, blah, blah, but quickly it gets back to 0.4 and then blows up again. So fortunately, both of those phenomena also appear at the 7B. Even though the 7B trains fine, it has both of those traits. So we decided to focus on those two, because it's too expensive to try all these experiments at 70B. These are two things we could fix at 7B and then see how it goes. So that was, that was the first step. But now we have a metric where we can pretty quickly, within 12 hours or so, do a run, find out if our numbers are better, and then change something and do it again. And the second component was we took another model that successfully trained, that didn't show these issues, that didn't show the slow GradNorm growth, and it didn't show the spikes either. And we ablated against that. So that was the LLM360 Amber model. They're, like, all very open. So we could take their data. We could take their setup and look at it in great detail.
Dirk Groeneveld [00:20:22]: And we basically tried things one by one, sometimes two by two or so, to not run too many ablations. But we tried things until we got to a stable setup. There were some other insights at the time. I was really into the Spike No More paper, which is all about the magnitude of the embeddings. So we tried some stuff there.
Dirk Groeneveld [00:20:48]: Pete Walsh on our team tried some other stuff involving AdamW settings that made things even better. And then we took a lot of inspiration from the Chameleon models, because we were talking to that team on a semi-regular basis and they had a lot of stability issues. They found some solutions that we also tried, and some of them worked for us and some of them didn't, and we took the ones that worked for us. So it's always ablating at the 7B scale until our numbers look super smooth and super nice.
Nathan Lambert How specific do you think these are to our setup? Are these all OLMo-specific insights, or is it just kind of a process you have to walk down? We've heard some of these things before, that all these developments require you to do the previous thing before you can go bigger or do a more complicated model. Do you think that's actually true, or is there just a best configuration at the time?
Dirk Groeneveld I really don't know the answer to that. It's hard. But something I want to do for OLMo 3 is walk back a few of these things and see in retrospect which ones are actually necessary. In particular, I'm hoping that some of those are not necessary and are costing a bit of performance, so we can boost our own efficiency a little bit.
Luca Soldaini [00:21:54]: In general, I don't know, you can tell me if this is a useful summary, but it seems like the space of interventions you can take is so big. Things from another model are not going to translate perfectly, but the hit rate for finding a good solution is higher if you start from that model and explore around it, versus trying to explore the full space of possible solutions. And then some things will not pan out once you try to rerun them on your setup, and I don't think that's necessarily an indication of anything. You know, we can mistakenly reimplement their thing in a way it wasn't supposed to be. It's more like some things translate, some things don't. But it's a good starting point.
Dirk Groeneveld [00:22:55]: Yeah. I mean, we are a fairly conservative bunch with this, right? Because even the 7B runs are actually kind of expensive. So make small changes from a known baseline by and large. Yeah. I mean, everyone has.
Nathan Lambert Yeah. And the risk is pretty obvious when you look at the cost numbers and who you are trying to beat or not. We are trying to put out something people can build on, and it's much better to keep making small progress than it is to go for glory runs and just hope that works. I think both work. The more compute you have, the bigger the distribution of investments you can have, but it's not that surprising.
Dirk Groeneveld I mean, I hope that we can be a lab that is a little bit more risk tolerant than others. For one thing, we don't have Meta's resources. So we should be a little bit more aggressive. You know, it would make me much more nervous if I had to bet a billion dollars on our next run than the amounts that we can bet. So we can try a little bit more. I also feel and I hope that our management agrees with this. I feel that if we always, if we're always safe, if every one of our runs works. That means we're not trying hard enough, right? We have to occasionally crash and burn.
Nathan Lambert I think there are a few every year where you should crash and burn. These crash and burns at the big scale get a lot of attention from media and stuff, but it's like, what do you expect them to do? You're walking up a line, and you might as well try to take three steps at once every so often. Exactly. But I do agree. I think that's a cultural thing that we're trying to navigate. It's like, how do we do more interesting stuff and not just fall into the trap of being the best open model because no one else is doing this? Okay, you could do that for a while, but it's not as motivating.
Dirk Groeneveld And it's not just because it's more interesting to do that; it's also just the fastest way to make a better model. The fastest way to calibrate your risk tolerance properly is to sometimes go over it. Yeah. It's inevitable.
Nathan Lambert [00:25:05]: Any follow ups on risk?
Kyle Lo Yeah. I'm thinking now it's like, because the 70B crash was so sad. Yeah. And I'm wondering if you look back on it now, it's like, that was the greatest thing for us. We learned so much from that.
Dirk Groeneveld [00:25:19]: It was very important to us, too. I do look back on it a little bit. I mean, we felt terrible, right? This was an awful time for us. I was like, I'm done. We were the training team that couldn't train at all. I felt so bad. But the work we did following up is some of the proudest I've been of the stuff I've done in my time at AI2. Yeah.
Luca Soldaini [00:25:47]: In general, my thinking about the role of OLMo keeps evolving, right? It was very natural to have OLMo be these models designed to help others do research on language models. That was a big part of OLMo 1: you just release all the components, because it's important to have these tools available to everyone to study language models. And I think we served that community well. One thing I hope we can do more with OLMo is this: there are some interesting aspects of language models, interesting capabilities, interesting architectural decisions, that for a myriad of reasons get overlooked in, say, a company or in a framework where you have certain constraints on your model. But they're still there, and they are important. There are questions around what a model should be able to do, how it should operate, and things like that. I think we can take a role where we have this general recipe that enables research on language models, and for a subset of model capabilities that we think are fundamental and that no one is touching, it's our space to do work there. I think the prime example that I keep repeating these days is what we did with Molmo, and
Luca Soldaini [00:27:25]: the vision team was mostly working on it. Molmo is a very good vision language model in general. It benchmarks up there; it's not the best, but it benchmarks up there with open models. And then it has this interesting pointing capability that no other vision language model has. And that pointing capability turns out to be fundamental for a lot of the vision-language and robotics applications you want to build. It's a core capability in the same way that a text model should have long context. And it was cool to emphasize that: oh, we have these specific capabilities that would enable all these applications, and so more people should work on these specific aspects. So I think that's a cool way to work on things that folks haven't had a chance to touch yet.
Nathan Lambert [00:28:24]: Trying to parse out why this type of situation can happen is not easy, because everybody would want to do this. Everybody wants to come up with a new capability that expands the scope of what X type of AI model can do. I think most of it probably comes down to the culture of where people have space to think about stuff in a more interesting way, because obviously everyone wants to have breakthroughs that OpenAI and Anthropic copy. But you need to sit at the boundary between doing just the same stuff and doing more researchy stuff. I have more architecture questions. One is muP. Multiple people are asking about it. I still don't really intuitively know what it is. But are we going to use this?
Dirk Groeneveld We have done a fair bit of work into it. And it hasn't worked for us yet.
Nathan Lambert Can you explain what it is?
Dirk Groeneveld muP is mainly a way of setting the learning rate, but also some other hyperparameters, by training only small models and then having a guarantee, or at least a pretty good idea, that it will work also for larger models.
Dirk Groeneveld [00:29:33]: We have implemented this. We've experimented with it. So far in our setup, it works across model sizes. It doesn't predict the learning rate so much as give you one learning rate: basically, the good learning rate for the small model is also the good learning rate for the big model. That works if we change the size of the model. It does not, so far, work if we change the length of the training run. And that's why we haven't been using it so far.
Like number of tokens.
Yeah. Or longer. If we double the length of the training run or we 10x the length of the training run, the optimal learning rate is different in our setup.
Dirk Groeneveld [00:30:21]: It seems like this might be a bug. It should work, but it doesn't.
Nathan Lambert And the positive gain is just better scaling, because you don't have to fiddle with the settings. You know you're getting the right learning rate, which is a crucial hyperparameter.
Dirk Groeneveld Yeah. It's just a better way of setting learning rate. And it works for a few other hyperparameters too.
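For intuition, here is a minimal sketch of the core muP idea for learning rates, assuming Adam-style training: tune the learning rate on a small proxy model, then scale the hidden-layer learning rate by base width over target width when you widen the model. The widths and learning rate below are made-up numbers, not OLMo's settings.

```python
def mup_hidden_lr(base_lr: float, base_width: int, width: int) -> float:
    # Rule of thumb for hidden (matrix-like) parameters under Adam in muP:
    # the optimal LR found at base_width transfers when scaled by base_width / width.
    # Vector-like parameters (embeddings, norms, biases) typically keep the base LR.
    return base_lr * base_width / width

# Tune on a small proxy model, then reuse the result at the target width.
base_lr = 1e-2  # hypothetical value found by sweeping on a width-256 proxy
lr_for_hidden_7b = mup_hidden_lr(base_lr, base_width=256, width=4096)
```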
Nathan Lambert But there are other open models that use this explicitly, pretty sure. I mean, open weight models. Are Llama and stuff using this? Llama does not, I think, but I don't know for sure. We'll see with the next iteration. Even Llama 3 felt like they were still building their org and their infrastructure so fast. It's just, get in what you can get in, and there will be more models in the future.
Dirk Groeneveld Yeah. I mean, muP is a shortcut, right? For many settings where muP wouldn't work, you have to just establish scaling laws and predict what the value will be. You could do the same thing for the learning rate; muP just lets you do this with even fewer runs. You don't even have to extrapolate anything anymore. You just use muP and your setting will work. That's the idea.
Dirk Groeneveld [00:31:29]: But you kind of already need a scaling law setup anyway for things that muP doesn't work for, like architecture changes and so on. So in that sense, it's not that important. It's still pretty important, and we're going to keep trying to make it work for us, maybe just find the bug. But it's not absolutely critical.
Nathan Lambert How do scaling laws actually tell you how to change the width? Do they actually tell you the change in width or the depth, the proportions of the network relative to the size? What are the actual output variables? Or how are you controlling the architecture you're going to use in the scaling laws? I know what it's trying to predict, the accuracy, but are they done on a fixed set of architectures?
Dirk Groeneveld You would usually vary one thing.
Dirk Groeneveld [00:32:17]: Like, you don't vary everything. You establish how it scales with size, and you set your size according to a certain formula. You might say, I will go 1.4x the depth and 1.4x the width, so I have a roughly 2x bigger model. And you do that a few times and you draw it on a graph. Then you change your architecture, you do it again, you draw a different graph. You lay them over each other and you hope that the lines don't cross and one of them is clearly better than the other.
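A minimal sketch of that overlay procedure, using made-up (parameters, loss) points for two hypothetical architecture variants: fit a power law per variant in log-log space, extrapolate, and compare the lines.

```python
import numpy as np

# Hypothetical (params, loss) ladders for two architecture variants.
variant_a = [(190e6, 3.40), (370e6, 3.21), (760e6, 3.04), (1.4e9, 2.91)]
variant_b = [(190e6, 3.36), (370e6, 3.18), (760e6, 3.02), (1.4e9, 2.90)]

def fit_power_law(points):
    # Fit log(loss) = alpha * log(N) + c, i.e. loss ~= exp(c) * N**alpha.
    n, loss = np.array(points).T
    alpha, c = np.polyfit(np.log(n), np.log(loss), deg=1)
    return alpha, c

for name, pts in [("A", variant_a), ("B", variant_b)]:
    alpha, c = fit_power_law(pts)
    pred_7b = np.exp(c) * (7e9) ** alpha  # extrapolate each line out to 7B params
    print(f"variant {name}: alpha={alpha:.3f}, predicted loss at 7B ~ {pred_7b:.2f}")
```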
Nathan Lambert Yeah. It's one of those things where there are the obvious parts of architecture design and the not-obvious parts. You obviously make the model bigger, but there's the subtlety of how tall versus how wide. I think we were talking about a model that's much deeper than our model architectures, and I'm around these things and I still don't have an intuition for whether tall or wide is better. I think it's just whatever works.
Dirk Groeneveld There are some early results from Google, I think, called EfficientNet or something, that suggest that over a wide range it doesn't matter whether you go wide or deep. It's not that surprising. Those are pretty old results now. We're following up on a particular result right now, actually. So OLMo 2 is a 7B and a 13B, right? But there also was a 1B that didn't work very well, and we're trying to find out why. One thing about that model was it was pretty wide and not very deep, so we're checking whether that is the reason why it wasn't very good. So we're sort of in the middle of double-checking this assumption that it doesn't really matter whether you go wide or deep.
Nathan Lambert Yeah, that makes sense. I think that is something that doesn't matter to most people, but they're probably very interested in it, just like how these blocks are set up and how we decide. And it's like, just one of us decides.
Dirk Groeneveld And it's like, eh, seems right. There are other concerns, right? We train with FSDP, with ZeRO-3-style sharding, so we try to choose these sizes such that they utilize the GPU in the optimal way.
Dirk Groeneveld [00:34:29]: Which has nothing to do with the abstract training dynamics. It's just the practicality of getting this thing into 80 gigabytes of memory. So then those concerns might take over. There's other stuff, like all your tensor dimensions need to be multiples of 64 or 128, things like that. GPU math stuff. Yeah, exactly.
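For illustration, a common heuristic (not necessarily how OLMo picks its sizes) is to round dimensions like the feedforward hidden size up to a hardware-friendly multiple so GEMMs land on GPU tile boundaries.

```python
def round_to_multiple(dim: int, multiple: int = 128) -> int:
    # Round a model dimension up to the nearest multiple of a GPU-friendly tile size.
    return ((dim + multiple - 1) // multiple) * multiple

# Example: a SwiGLU-style hidden size of ~8/3 * d_model, rounded to a multiple of 128.
d_ff = round_to_multiple(int(4096 * 8 / 3))
print(d_ff)  # 11008
```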
Luca Soldaini [00:34:53]: It's really hard to argue against things that practically make you run fast. Because if I find something that is 20% faster, your big run trains 20% faster and all the experimental cycles are 20% faster. So it's not very glamorous, but everyone is really happy when we find one of these. Like, oh, this is a shortcut.
Dirk Groeneveld [00:35:16]: I find it super glamorous. I mean, when did you ever have such a clear sign of impact, where you can say, I wrote this thing and it is now 20% faster? No, the impact is very good. Yes.
Nathan Lambert The numbers you're changing are not necessarily glamorous. It's just detailed stuff.
Kyle Lo [00:35:34]: I also think the experimental cycle thing is probably the biggest thing for me. What we're seeing consistently is the more experiments you run for a particular idea, the more likely it is to just work out. It's just a function of trying more things.
Nathan Lambert [00:35:47]: It seems like in pre-training there are very few places where you just get the idea handed to you. I mean, I'd say post-training has more of that. Literally, we had a meeting with John Schulman and he was like, everyone at the leading labs trains RL like this, and we got, like, a three-month head start on one step. But in pre-training, I think all of that has evaporated.
Kyle Lo [00:36:05]: The human intuition piece is just gone. I think for v0 you can kind of do everything with intuition. It's like, oh, look at the data, this kind of makes sense, this seems reasonable. And then after you get to, like, v2 of something, it starts becoming really hard to make sense of what is good for a language model or not. So you kind of just need to try a bunch of stuff.
Dirk Groeneveld [00:36:29]: And then there comes a game of stacking improvements that are worth 2% to 5% each.
Nathan Lambert I think it's very compounding; at least all the math works out over a year. I want to ask about MoEs as well, if you have a different thing you want to say. It seems like we have OLMoE, which, if you look at the plots on paper, is this MoE architecture that beats all of our own things on efficiency. But it seems like we had a path we needed to go down to make sure dense works really well and get all these improvements, and then you have to feed that back in and merge the MoE stream back in. We have DeepSeek, we have MiniMax, there are countless other MoEs that get really high eval scores. They're not as easy to do research with because they have tons of total parameters, and people need bigger clusters to fine-tune them, blah, blah, blah. But is MoE something that you think we just need to do to make better models?
Dirk Groeneveld Well, it's a complicated question, and we haven't quite answered it yet for ourselves.
Dirk Groeneveld [00:37:34]: We did investigate doing a bigger MoE. And we found that the engineering is somewhat difficult. And at the time, we came to the conclusion that we could do that engineering, but then who's going to run that thing later? They also have to have a team of engineers on top of it to make sure they can train this.
Nathan Lambert What does the engineering look like? It's not, like, CUDA-level kernels. It's how you distribute parameters?
Dirk Groeneveld It's a little bit CUDA-level kernels, in that if MegaBlocks by itself isn't enough for you, then it gets really complicated. And we ran into that situation: if it had to be significantly bigger than what we did, it just got too complicated.
Luca Soldaini [00:38:22]: There is an inference angle. These very big models really get advantages if you tailor them to where you're going to do inference with them. So if you're a big company, you start thinking about how to batch requests, how to serve the model. We could do that ourselves for the place where we're running, but then you start thinking, oh, folks who want to use this model on their own hardware are better served by a dense model than by redoing all this engineering on top. There is, I think, a clear advantage to an MoE if you are also providing an API. Yeah. Very clear cut.
Dirk Groeneveld [00:39:10]: It depends on how we think of the product of OLMo. Number one is still that it's an artifact to be researched, so other people need to be able to train on it and modify it and so on, and that is just much easier if you have a dense model. If you think of it as something that gets put into a product, where people will run tons of inference on it and you only really care about the final score that it gets, then maybe the MoE starts making a lot more sense again.
Nathan Lambert Yeah. That's a good answer. I think it's, like, I think people can fill in the blanks of, like, what we may or may not do.
Luca Soldaini [00:39:53]: And I mean, I'm curious what folks on the Llama team think about MoE.
Nathan Lambert [00:40:03]: If the Meta AI product exists, they're 100% going to do an MoE.
Luca Soldaini [00:40:06]: I mean, it's interesting, right? If they're expecting that the Llama users are going to be, in fact, a few large companies that can figure out inference, then MoE makes sense. But if they're thinking of it more as a model that's great if it's adopted by a million developers, large and small, then they're still going to release a lot of dense models. Yeah. Exactly. Development is just so much easier; it's so much easier for people to set up their own inference with a dense model.
Nathan Lambert [00:40:40]: Yeah. I think we've gone surprisingly long without asking about data. It's, like, how much more, is it just an infinite hill to climb on data? It's finding good data and filtering bad?
Kyle Lo [00:40:53]: I mean, I think it's an infinite hill to the extent that everything else is also, and you can kind of keep improving, right? But yeah, the main threads constantly are: got to get more data, because if you're working with larger pools of data you have more to work from, and if you can get new data that's not in your distribution, it's probably interesting to study how that adds in. So if you have a strict quality filter, you can still get a high token yield if you start with a much larger pool and filter down. Getting more data is really, really critical, especially if you can target specific pockets that you think are missing. You can always keep iterating on better filters and understanding how those filters affect performance. And everything interacts with each other: safety filters interact with quality filters, interact with deduplication, all of these together. So there's an infinite search space, even just in the ordering between these operations. So keep throwing more things at it.
Luca Soldaini [00:41:53]: Yeah, it's very much just stacking small improvements. Yeah, shots on goal. Now that we have these multiple stages of pre-training, we think about what kind of improvement you want to get from data at the various stages. Clearly, the improvement you want from data you put at the end of training is different from the improvement you want to see at the beginning; it comes with a different set of requirements. One thing that is worth spending time on (intuitions are very often wrong) is figuring out, if you have a data ablation idea, what is the fastest way to disprove it, which requires a little bit of experimental design. And then you've got to fiddle with it, especially when you do the first version, so that it's very easy to measure improvements. And then as you get to more refined versions, you've got to think about how you measure your improvements. But after you're done with the basic stuff and your v1 is done, there's never a big, single thread of "this is the one data thing." It's more like stacking your Lego bricks to get to a better model.
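To make the "everything interacts" point concrete, here is a toy sketch of a stacked filtering pass over documents. The fields, thresholds, and ordering are hypothetical, not Dolma's actual pipeline; the point is that each filter changes the pool the next one sees.

```python
from dataclasses import dataclass

@dataclass
class Doc:
    text: str
    url: str
    quality: float      # score from a hypothetical quality classifier
    is_duplicate: bool  # set by a dedup pass run earlier
    is_unsafe: bool     # set by a safety classifier

def keep(doc: Doc, quality_threshold: float = 0.5) -> bool:
    # Order matters in practice: safety, dedup, and quality interact,
    # and each threshold reshapes what the downstream filters operate on.
    if doc.is_unsafe:
        return False
    if doc.is_duplicate:
        return False
    return doc.quality >= quality_threshold

corpus = [Doc("example text", "https://example.com", 0.8, False, False)]
filtered = [d for d in corpus if keep(d)]
```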
Nathan Lambert [00:43:18]: Do you think you can iterate faster on, like, end-of-pre-training, whatever you want to call it, the highest-quality mid-training data only? Have you started doing that recently?
Luca Soldaini [00:43:28]: I think it depends. We need a little bit more evidence on this, but it depends on the role of the data. The reason why we started doing mid-training at all is because we were interested in having base models be primed with certain capabilities that we didn't get during the long pre-training phase. And for those, it's really easy to iterate on new data sources that would improve those capabilities at the end of pre-training. But during the pre-training phase, the important aspect we think about is the efficiency of your data: if there is a version of your data where you train on it and the model gets to performance X 20% faster, it means that you can train 20% longer, right? Or run more experiments. And for those, in some cases you can use mid-training as a proxy; in other cases it doesn't quite make sense, so you have to come up with experiments through scaling laws, or through some other technique. But yeah, it really depends on what role a data set plays in the various stages of pre-training.
Nathan Lambert [00:44:53]: So it seems like, compared to Dolma 1, which was more "do the thing," it's now all targeted abilities. It's like, we want to be better at certain things, so we put people on them. It's targeted abilities, or places where we think we can get a lot of data.
Kyle Lo [00:45:05]: Like, a certain data source that hasn't been mined for stuff. Yeah. We have to be opportunistic because it's so hard to get data. And for us, especially if we want to be open with the data, we have to also do our due diligence. We're going to study this data, put all this effort in, and still be able to share it with everyone.
Nathan Lambert [00:45:22]: If you were in a lab that didn't release data, do you think you could make more progress on it? Like, how, like, how much is that actually?
Kyle Lo [00:45:27]: Oh, yeah. Oh, my God. Such a time sink.
Luca Soldaini [00:45:31]: I mean, it's a little bit of a constraint that we put on ourselves. And this is not even about getting data in ways that might not be legal, right? You could form partnerships. You have people knocking at our door all the time asking if we want to buy this data set. And they're, like,
Nathan Lambert [00:45:48]: I've been contacted by one ML owner to try to facilitate a data deal.
Luca Soldaini [00:45:52]: Oh, yeah. Twitter. Oh, my God. But the first follow-up is, are you cool if we release the data? Of course they're not. So there's plenty of data that you could acquire from people, but then you can't release it. That's a complication to progress.
Nathan Lambert [00:46:15]: Yeah. This is more of a self-question, but, like, how much do you think mid-training should be, like, a philosophical shift in how we organize teams? Because it's very easy to do. I mean, we've already consolidated, like, our training and data to base, which is not surprising. But this is mostly hypothesizing on what other people do. It's, like, how close do you think this kind of end of pre-training to post-training handoff should actually be?
Kyle Lo [00:46:40]: I think it makes sense as a thing. These boundaries are, in theory, arbitrary, but think of it this way: in the extreme, if you had a perfectly oiled machine, you'd have a very smooth transition from pre-training to mid-training to post-training, and there would be no boundaries. That's the theoretical ideal, and you could probably squeeze a ton of performance out by smoothing that out. But in the real world, stuff is messy. The real world is: you're three trillion tokens into your base model run, and then you sign a new data deal. You've got to do something with this, and you're not going to redo your training run, so you've got to figure something out. Maybe that's mid-training, right? Mid-training is when you have an opportunistic need for something, or you're training something and someone catches a bug, which happens all the time, like a data bug or some training bug, and you're like, oh, I have to patch it. So then things shift fundamentally and you've got to know how to deal with it. Because these large training runs aren't super repeatable, and they take so much time that the world state changes all the time, you always need some strategy for how to deal with: I'm near the end of pre-training versus I'm near the beginning of pre-training versus... Yeah.
Nathan Lambert [00:47:47]: It's like, we're obviously trying to solve long context, so this fits right into this. It's like, we're going to do this thing. Does it go, where does it go? Some people do it in post-training. Yeah. There's some component during pre-training.
Kyle Lo [00:48:00]: It's kind of just like, you have to follow a few recipes and figure out what works for your team. Yeah. And so much of it is just, if it's expensive, try to push it off as much as possible. Because if it's risky, push it off as much as possible. If you can intervene to get the same result much later, huge win. You can try a bunch more things. If you have to intervene because it's some core thing that has to be baked into pre-training time, you're kind of... It's a sad space to be in. But then that's the thing where you have to intervene. That's the pre-training data.
Dirk Groeneveld [00:48:29]: There's a big question that I'd love to get an answer to, but I don't even really know how to think about it. But the question is, what makes a pre-training model a good candidate for mid-training fine-tuning? Because all we really try to do is we try to maximize our metrics, but we don't really know that those metrics are what makes a good step zero for post-training.
Nathan Lambert I think a relevant thing, and I don't even know if I've told you this, or how to take action on it: we have multiple stages of post-training, and for the instruction tuning phase we got advice that's like, eh, it can be a little broken. You can have some crap in there. It'll get fixed later on. And it's like, why is that okay?
Nathan Lambert [00:49:14]: It might be the same thing in pre-training. It's like, you want to get in the right... It's more important to get in the right ballpark than the right exact number. Yeah.
Luca Soldaini [00:49:21]: It feels like it's more about not how to make a good model for post-training. But what to avoid so you don't have a bad model post-training. Yeah.
Nathan Lambert [00:49:33]: There's a whole other question, which is how to make a base model that's easy to fine-tune in general, versus one that, with the right finagling, can get the absolute best numbers. For OLMo, I think it would be really great to be able to say, here's a super stable platform. A lot of people have complained specifically that Llama Instruct is hard to fine-tune after most of the post-training, because this is where people at companies start. They're like, this is the best open-weight model, I want to add a little thing to it, and a lot of them have difficulty fine-tuning it. It's different at the base, because most people can't do the full instruct recipe themselves. But for researchers, having a stable platform at the base is way more valuable.
Kyle Lo [00:50:12]: There's an interesting debate about this, about what makes a base model a good base model, that we've had a bunch of times, and also with other people. It seems like there are two hypotheses on how to think about data as it affects base model behavior. One hypothesis is: you need quality data so that you don't get any spikes, you have stable training, you have no bugs, and once you pass that level of quality, you want the data to be as diverse as possible. The base model is just an init, so that it can go in literally any direction, and diversity is what matters next. That's one hypothesis. The other one is that it's all domain effects. There's a notion of quality, but you can just keep climbing: you can keep getting more and more as long as you're very clear about what your target domain or target application is, and you just keep getting closer and closer to it. There's a lot of learning the eval suite. If you're only evaluating on this particular stuff, you can always get your base model to be better for that; just keep climbing on it to get it more and more similar. As opposed to thinking, I care about this application, this suite of applications, all the way through from the base model. Can you not kind of have both? I feel like I'm confused about how actual generalization fits into this. They're competing ideologies in terms of how you set up your team. If you believe the first one, you're all in on diverse data acquisition, all in on efficiency and stability for your pre-training, and then you just get as much different data as possible and you're post-training all the time. If you believe the latter one, you solve backwards from "this is what I want the model to do" and you make changes everywhere to squeeze performance out of that class of problem, in the pre-training data, the mid-training data, et cetera.
Nathan Lambert [00:52:01]: How important do you think the actual multi-tag categorization of every data document is? We know that some of these labs have really advanced tagging of all their pre-training documents. Is it essentially doing that and choosing among the tags, which is very much crafting a recipe for your pre-training, versus just going for good numbers, like, just get a good classifier and roll with it?
Kyle Lo [00:52:27]: We have tags. That's fine.
Luca Soldaini [00:52:31]: The tags are useful even if you take the view of, let's use as much as possible, diversity is important. A lot of web data comes with absolutely no useful metadata. You have URLs, and a URL by itself doesn't add much; you have to do things on top of it to make it useful. So the more you have in terms of categories and metadata information, the more you can use it as a tool to try extra techniques. Maybe the technique is mixing your data in a certain way, maybe it's filtering things out, maybe it's designing benchmarks and trying to correlate with those. Otherwise, you just have this giant bucket with maybe one quality knob, and it's very hard to make progress if all you can adjust is one number, where we cut for quality. So I'm not surprised that the big labs all have these tags. I want to know how they use them; that's the part that's not public. Yeah.
Kyle Lo [00:53:51]: But it's not just that you have more levers to pull and the more things you can try, the better. You also want tags that are actionable, right? If you have a sensible notion of a tag and you realize, oh, as you keep adding more of this data, performance keeps going up, at some point you might be like, we're out of that data, we need to go get more of it. You want that tag to be something understandable so you can go and negotiate another deal, do synthetic generation, et cetera, of that type of data.
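A toy sketch of how document-level tags become an actionable mixing lever: sample each batch to match a target mix per tag, so "we're out of math" is something you can see and act on. The tags and ratios here are hypothetical, not an actual OLMo data mix.

```python
import random
from collections import defaultdict

pools = defaultdict(list)  # tag -> list of documents, filled by an upstream tagger
target_mix = {"web": 0.6, "code": 0.2, "math": 0.1, "papers": 0.1}  # hypothetical ratios

def sample_batch(batch_size: int):
    # Sample tag-by-tag so the batch roughly matches the target mix; the tag
    # is the lever you can act on when a pool starts running dry.
    batch = []
    for tag, frac in target_mix.items():
        k = max(1, int(frac * batch_size))
        if pools[tag]:
            batch.extend(random.choices(pools[tag], k=k))
    return batch[:batch_size]
```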
Nathan Lambert [00:54:13]: Do you think most of the synthetic data gen, is for very specific things at pre-training? I mean, it kind of has to be. Probably, yeah.
Kyle Lo [00:54:25]: You can't just be like, oh, generate generic synthetic data. That's not something... I don't know what that procedure would even be.
Luca Soldaini [00:54:30]: It's probably to prime the model to whatever you need during post-training. Like, you know, we've seen, like, normally with math, it's much better if your model has an elementary knowledge of math to, like, improve on that. It's quite the same with everything that it's, like, oh, I want to do RL on this. If the model is completely random on it, you're going to have a very hard time.
Nathan Lambert [00:54:52]: Yeah, it's, like, I guess a good transition. It's, like, what do you three think post-training is, should be, or, like, is not doing?
Kyle Lo [00:55:02]: It's elicitation.
Nathan Lambert I'm coming around to this view that it seems that you can extract abilities from the model.
I think it's totally elicitation. Like, the Hitchhiker's Guide to Data paper from Google, yeah, that one was very, that one had, like, a very specific experiment. But it seemed like that was pretty strong evidence towards it. It's, like, you filter out all of this type of data, you literally can't fine-tune that model. You can never recover that. There was a history detection, right?
Nathan Lambert [00:55:28]: I think if you do more flops, you potentially can. I mean, obviously we're not talking about o1-type things here. But there are even datasets that have, like, 15 million math-only instructions. Are they going to make a model really start doing a ton of math? At some point, yes. But for most of it, I think it's almost easier to operate as if the capabilities are in the model and post-training gets them out.
Luca Soldaini [00:55:53]: Sometimes there's this very large set of things that you do in pre-training because you have a sense of how they play into an application. One that I think is very obvious: for a code model, you want it to do completion, so you're going to add a fill-in-the-middle loss, maybe at the beginning of pre-training. It's like, oh, then I can plan my entire pipeline around that. So far, it seems all about that. I don't think we have cracked a good recipe to do the same for things that are not capabilities but are more like recalling facts. Oh, yeah. Or long-tail knowledge.
Nathan Lambert [00:56:29]: Yeah. It's like, all of us know, or at least people out there have MMLU numbers that go up in X stage. Like, instruction tuning boosting MMLU, I'm like, what are you putting in there?
Dirk Groeneveld [00:56:42]: What do you think of mid-training then? Is that a manifestation or... Mid-training? I think it's still...
Kyle Lo [00:56:47]: I think it's still putting in knowledge. I think mid-training is still pre-training, but with strong domain effects. It's just smoothing out the boundary: you have a very, very sharp distribution shift when you do post-training, and we know from ML 101 over the past five, six years that smoothing out the transition between major domain shifts helps. But we don't have a clear example of where it helps with specific knowledge acquisition. Yes. For that, we don't know how to do it. But for things that are really easy to evaluate, things where there's really big progress, it's like, yeah, smooth this out.
Nathan Lambert [00:57:30]: So why is post-training important to the release, then? Some of you came around to post-training being important for getting traction later on. Is that just the ML ecosystem and how it works?
Dirk Groeneveld Oh, I mean, the base model is kind of useless, right? Yeah. There's only so many next tokens you need to know about. Yeah.
Luca Soldaini [00:57:50]: But it's like, you know, we've seen papers that use OLMo for research where the idea for the research only came from comparing the base model with the instruction-tuned model, like the one where folks looked at certain patterns of speech in OLMo 1. Where do they come from? Do they come from pre-training? Do they come from post-training? Even if you just want to do research, it's useful to be able to compare side by side. So it feels wrong to put a model out that cuts the set of problems you can study in half until you have the post-training ready. And it's useful to have it all in one package so you can use it right away.
Kyle Lo [00:58:40]: Post-training is just a really, really long eval loop. Yeah. And it's not like, oh, base model, a few shots on some benchmarks. No, no, no. We eval it by post-training it and then evaluating the post-trained model.
Nathan Lambert [00:58:54]: Yeah. I mean, to some extent, it is kind of true. I mean, that's how we should think about it.
Dirk Groeneveld [00:58:59]: If we could do that cheaply, we would totally hill climb on that metric.
Kyle Lo I think that's the metric. Because if the base model is a good init for the post-training, which is the model people actually want to use, then we evaluate it on its own and on its status as a good init.
Nathan Lambert [00:59:16]: Yeah. And then the question is, how important do you think research for post-training on the specific checkpoint is? How important is genealogy versus general recipes? Because I openly think we under-index on using one model. Much like the path to stability, which was an eight-to-ten-month really specific thing, I'm guessing if you're really in a narrower regime, you can just keep tuning these little things. Yeah. Hopefully at some point we can do better with new models. Yeah.
Nathan Lambert [00:59:52]: Okay. We're kind of going to some wrap-up things. How do you think about release decisions? Like, should AI2 release everything that we ever tried? Or is it, like, when should we actually get models out the door?
Dirk Groeneveld I mean, I would love to do that, actually. Especially the failed runs. Where else could you get a repository of failed runs? I think it's just a matter of giving other people the possibility of looking into these failed runs and finding out exactly when they failed. In practice, that's super difficult, because just releasing something is hard. You need to upload the checkpoints and translate them into a different format. You have to describe what you were even trying in some way that makes sense to people outside of the org. You have to give them access to the Weights & Biases logs. And it's just a lot of work, and there's always something else that seems more pressing than that.
Nathan Lambert Seems like a scaling thing. How much we can share is capped by how much we can scale our org, and we're not going to have a complicated management hierarchy or an entire org that is just support. Everything you upload, you take on as a support burden. We've just seen the envelope grow, grow, grow: more people use our things, and you get more boring support work. People want to use it. That's the cost of it.
Dirk Groeneveld I guess it's a great problem to have. People want to use it. People want to use us.
Luca Soldaini [01:01:15]: And it's funny. To make a checkpoint release that's very useful, you need the person who was involved in it to pour their knowledge into a format that people outside can consume, right? Otherwise we would just open up our S3 bucket of checkpoints, and it would be utterly useless, because what if you wanted to know more about the parameters? So as long as we optimize for a release, we have the bandwidth to provide the support around it. If people want the 70B failed run enough, I'm sure we can release it.
Nathan Lambert [01:01:57]: It seems like it's just finding the right medium to release things. I think long technical reports are really good for the stuff that we do, because they put everything in one place for people, and they almost make things easier on demand in the future. Whereas we could just drip models out all the time, but in terms of making progress on things that are easy to build on, it's probably just not worth it.
Kyle Lo [01:02:19]: In fact, there's even a cost to it, right? The big example here is the release of OLMo 1 0724, the July one. Using that for research has probably been one of the tougher models, because it didn't come with a blog post, it didn't come with docs. Yes, it still has weights and checkpoints and everything, but comparatively, even when people come to us, we say, oh, we recommend you use 0424. And now with OLMo 2, we're like, oh, that's the one we recommend, because it has all the documentation. So just dropping something doesn't seem like it really helps.
Nathan Lambert [01:02:56]: I would say we should move faster than, like, the 1-2 iteration. But the in-between is not necessarily even worth it. Which is very odd, when you think about being fully open. It's just, like, kind of with the costs of doing business.
Kyle Lo [01:03:10]: It's like being fully... You want to be fully open, but you don't want to add noise. And you don't want to waste people's time. Right? So if you drop something that's kind of half done or half baked, and people start spending time on it, only to get frustrated later, you've cost them something.
Nathan Lambert [01:03:22]: How does this relate to how pre-training is changing? Openly, a lot of startups are changing their relationship to training, whether they're going to use Llama or pre-train on customer data. And we have X compute budget; does any of this come into play? Or is it all the same as what we've been talking about: continue to hill climb, do what you can, make reasonable trade-offs, and think about who will actually use the models? It's not too different.
Luca Soldaini [01:03:54]: I think that, for me, the cutoff point is: is there something useful and generally interesting to add if you pre-train? Take all these mid-training things that we concluded; they couldn't have been as clean if we had started with an already pre-trained model like Llama. So it's, is there really something useful to add to the conversation if you pre-train? We may get to the moment when the answer is no, but it feels like there's still value to add to the conversation, at least on the research side of pre-training. There is now a question of, we know how to help researchers, but we want to help more than just researchers with the models we put out. And if we think there is an application, or a use case, where we can do a very good job by starting with someone else's pre-trained model, we shouldn't waste compute on pre-training from scratch. But it's an ever-evolving question, really. I don't know. We can make decisions six months out, maybe? Maybe a year?
Kyle Lo [01:05:24]: Well, that's what I would say.
Kyle Lo [01:05:27]: I know. You're the pre-training. You're the hardcore who's pre-trained some models.
Dirk Groeneveld [01:05:34]: There's lots of runway left in pre-training. The big labs are fairly conservative because they have to be, but that doesn't mean that we're done. I also feel that the point of OLMo is to make pre-training research accessible to more people, because even if you don't have the resources to pre-train the whole thing from scratch, you can still use our checkpoints and use our code to prove out some sort of improvement. And as we've seen in other areas, Microsoft tries to push .NET or Apple tries to push Swift or whatever; it's a really big effort for them, and the open-source community says, I don't care, we're going to use Python. And Python wins. So if you can somehow enable the vast resources of a million people banging on a thing, even a company like OpenAI or Meta cannot compete with that. And with OLMo, I'm hoping we can capture a little bit of that open-source enthusiasm and academic enthusiasm.
Nathan Lambert Do you think it'll get better this year? Because a lot of academics are bringing up clusters of tens to hundreds of H100s around the country. Before, it was like just Harvard had 500, and MIT or whatever. But now it's the long tail of universities. There are a lot of people.
Dirk Groeneveld [01:07:12]: And then, you know, if you have 200 H100s, you can at least establish scaling laws for your idea. So what I'm hoping is someone uses OLMo to try some new thing and establishes the scaling laws up to a 3B model or whatever. Then we take it and we prove it out up to 30B or whatever our compute allows. And if it still works, then they take it, and let them win. We don't have to win. Yeah.
Nathan Lambert [01:07:36]: I mean, they would never tell us that they'd win. Yeah. Like, what do we need to achieve this? Do we need resources and compute and certain people? Like, do we need more feedback from the community? Do we need feedback from people at labs telling us which things to do?
Kyle Lo [01:07:48]: Compute and people, for sure. That is undeniable. If you have more compute, you can try more things and we can go bigger. If you have more people just trying more things, especially on our artifacts, we'll learn so much more and not have to spend so much time guessing, trying to piece together things from other people's pieces. Sometimes it's nice to just get something from outside: if they did it on OLMo, we can immediately start working off of it. So people and compute, always, for sure.
Luca Soldaini [01:08:20]: One thing that I, we get a lot of feedback, but it's like, I really like AI2. I would like to use OLMo, but it's missing this feature, which is great. I love that feedback. It's helped us a lot in prioritization. If we could get more, I would love to also get like aspirational feedback of like, none of the models is doing this. But I have a good case for this. Yeah. Those to me are always like very inspiring to read. Whether we'll do it or not, it's, you know, it's a question of like, can we do it and how it works with other things.
Kyle Lo [01:08:55]: But those are always very, very welcome. You know what would be really cool? I think what would be really cool is like more projects in space that you can't do unless you have some sort of fully open constellation of artifacts. Yeah.
Nathan Lambert [01:09:09]: Dirk, does anyone ever do the thing where you load the model onto one GPU and iterate through the batches to find the one that causes the blow-up, or to see what happens when a loss spike happens?
Dirk Groeneveld I mean, to some degree we did this ourselves. Yeah. But it's something that people can do. It's not like we wrote a paper about it, but I would love to see a detailed write-up of, millisecond by millisecond, what happens in attention when a loss spike happens. How does it actually happen? These are the things that people can do.
Nathan Lambert And it's like, you just have to keep, keep zooming into a specific level of details in what happens.
Dirk Groeneveld Yeah. I mean, right now someone is using the various checkpoints to see how a certain metric that we're interested in develops throughout pre-training, and you can do that with fairly minimal compute. You don't have to be AI2. One of my favorite weird language model papers is Sander Land's Fishing for Magikarp paper. You can get much more actual feedback looking at weird tokenizer impacts and tokenizer-data interactions on OLMo than by just poking at API models and trying to figure it out.
Kyle Lo [01:10:20]: There's also a lot of really cool work looking forward from the checkpoints that we have with the data batches, trying to do something like: okay, let's replay everything between these steps, injecting some different data or manipulating the data between these two checkpoints, just to see whether it turns into something different. How big of a fork does it go through? Yeah.
Nathan Lambert [01:10:39]: Like, if you add the same intervention, how big of a difference does it make? Exactly.
Kyle Lo [01:10:43]: So it's like reconverge. Or early in pre-training versus later in pre-training same interventions, messing with the data. It's just like, that stuff is really cool.
Dirk Groeneveld [01:10:49]: I mean, I've complained about this for a long time. Grad students, I think, are a little bit hesitant to go into pre-training stuff because they need to publish four papers a year, and it's pretty difficult to do that when your cycles are so long. But on the flip side, it's a bit less busy a field, so you're less likely to get scooped, and the field is less likely to change out from under you while you're in the middle of your project. Post-training is not quite like that, as you know from your side.
Nathan Lambert It makes no sense. It's just like, pick something you want to do and people will probably do it. That's okay.
Dirk Groeneveld [01:11:31]: So I'm hoping that by publishing all of this stuff and making all the checkpoints available and the data and so on, we can enable more people to work in that side as well.
Nathan Lambert Yeah. Anything else you guys want to add?
Kyle Lo [01:11:49]: Like, comment, subscribe.
Kyle Lo [01:11:52]: Yeah, I think that's it.
Nathan Lambert [01:12:01]: Okay. Thanks for listening. If you have questions for any of us individually, the Blue Sky and Twitter handles for everyone in this podcast are below. And you can reach out to the general OLMo contact at allenai.org. That's an email address. Or we're really happy to help and we want to keep building this kind of open scientific ecosystem of language models. So all the best. Bye bye. Bye.
Full post for links, images, etc: https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1
I have a few shows to share with you this week:
* On The Retort a week or two ago, we discussed the nature of AI and if it is a science (in the Kuhnian sense)
* I appeared on Dean W. Ball and Timothy B. Lee’s new podcast AI Summer to discuss “thinking models” and the border between post-training and reasoning methods. Listen here.
* Finally, a talk I gave at NeurIPS on how I think about post-training for AI applications is now public.
This post is likely getting cut off in email inboxes — I recommend reading online by clicking on the title!
Yesterday, January 20th, China’s open-weights frontier AI laboratory, DeepSeek AI, released their first full-fledged reasoning model. It came as:
* A flagship reasoning language model, R1, trained via a 4-stage, RL heavy process. It is MIT-licensed which means companies and researchers can build upon and train on its outputs to accelerate the development and deployment of reasoning language models (RLMs).
* An RL-only reasoning model trained directly from their V3 base model, R1-Zero (used to create training data for full R1).
* A suite of open-weight models finetuned with supervised finetuning (SFT) data derived from R1 (similar data to one of their intermediate training steps).
* A technical report detailing their RL training methods.
* Models are available at chat.deepseek.com (via DeepThink) and in their new app.
This post is less about the evaluation results (which, of course, are extremely good and shown below), but rather about how training is done and what it all means.
This is a major transition point in the uncertainty of reasoning model research. Until now, reasoning models have been a major area of industrial research without a clear seminal paper. When language models took off, we had the likes of the GPT-2 paper for pretraining and InstructGPT (and Anthropic’s whitepapers) for post-training. For reasoning, we were staring at potentially misleading blog posts. Reasoning research and progress is now locked in: expect huge amounts of progress in 2025 and more of it in the open.
This again confirms that new technical recipes normally aren’t moats — the motivation of a proof of concept or leaks normally get the knowledge out.
For one, look at the pricing of these reasoning models. OpenAI was likely charging more for its model due to the costs of long-context serving and being the only model in town, but now o1’s pricing at $15 per million input tokens / $60 output looks out of place relative to R1’s pricing at $0.55 per million input tokens / $2.19 output (yes, o1-mini is cheaper at $3/$12 per million, but still almost a 10x difference). The price war that is coming for reasoning models will look like the Mixtral inference price war from 2023.
With o3, OpenAI is likely technically ahead, but it is not generally available nor will the weights be available anytime soon. This points to the first time since Stable Diffusion’s release that the most relevant and discussed AI model is released with a very friendly license. Looking back at the journey “open-source” AI has been on over the last 2.5 years, this is a surprising moment in time marked in the history books.
We don’t entirely know how these models will be used in the future beyond code and math, but noises are constantly bubbling up that OpenAI’s o1-Pro is the best model for many more challenging tasks (I need to try it myself before making definitive recommendations).
The most useful post to write now is one that establishes the research area, the do’s and don’ts, and the open questions. Let’s get into the details.
The DeepSeek R1 training recipe for reasoning
The training of R1 comes in 4 stages:
* “Cold-start” of supervised finetuning on synthetic reasoning data from the R1-Zero model.
* Large-scale reinforcement learning training on reasoning problems “until convergence.”
* Rejection sampling on 3/4 reasoning problems and 1/4 general queries to start the transition to a general-purpose model.
* Reinforcement learning training mixing reasoning problems (verifiable rewards) with general preference tuning reward models to polish the model.
Below, the post breaks down each training stage into its core components, insights, and open questions.
The winds of o1 replication have been blowing strongly away from any sort of explicit search (especially at inference time). It really was, and is, a language model with the new reasoning behaviors coming from a lot of RL training.
Before we start, remember that to do this reasoning training well you need a very strong base model with long-context capabilities. Much like for standard post-training, we don’t really know what traits of a base model make for one that is more suited for direct RL training.
Step 0. Training R1-Zero to initialize R1 with synthetic data
DeepSeek R1 Zero will be best known as the first open model trained with “large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step.” Rumors had mentioned this for o1, but understanding how it worked wasn’t clear. This is a funky model that DeepSeek reports will sometimes change languages in reasoning or show signs of other reliability issues.
The minor usability issues with R1-Zero show why more than just large-scale RL is needed to train a fantastic reasoning model, but the RL part is the key to unlocking the reasoning behaviors we are searching for.
The paper includes the most interesting results for R1-Zero, among them the plot I’ve been asking for of RL training-time scaling. Since o1’s release, everyone has been obsessed with the plots showing how inference time is correlated with evaluation performance. Inference time is far easier to elicit (or force by using a framework like Monte Carlo Tree Search), but showing training-time improvements via RL is the real foundational result. This is the result I’m searching for in my research.
And an unsurprising, yet very satisfying plot of length growing with training. This could be mixed with the above plot to make one of the “inference time scaling” plots we have seen many versions of with less clear methods.
In both of these plots, it looks like the numbers could still be going up if they let the RL cook longer. With the pace of progress so high, these laboratories get more gains by ending the jobs near saturation and starting the next experiment instead of seeking that last 1%.
Most, if not all, researchers will skip the step of training an R1-Zero style model because they don’t need to. DeepSeek made it clear that their “cold start” of SFT reasoning traces makes the final R1 model better — this is unsurprising, as they want R1 to be a certain type of instruction-tuned model. It’ll help avoid some of the “RL oddities” in R1-Zero that DeepSeek mentions like changing language mid-generation.
Still, the area of RL-on-base-models should be studied further. The way that R1-Zero can be trained is quite clever, as most base models without any instruction tuning have major issues with rambling and never generating a stop token. R1-Zero avoids this with a system prompt telling the model to generate HTML-style tags around its reasoning and answer. Additionally, I suspect this type of training wouldn’t work on older base models that don’t have some standard post-training style instruction data in the pretraining corpus. For example, in OLMo 2 we had some MATH instruction data in the annealing mix. Just a few instructions will let this system prompt work.
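To make that concrete, here is a minimal sketch of what such a tag-based system prompt and format check could look like. The wording and tag names are illustrative assumptions in the spirit of the paper, not DeepSeek’s verbatim template.

```python
import re

# Illustrative system prompt in the spirit of R1-Zero's setup (not DeepSeek's exact wording).
SYSTEM_PROMPT = (
    "A conversation between User and Assistant. The Assistant first thinks through the "
    "problem, then answers. Put the reasoning inside <think></think> tags and the final "
    "answer inside <answer></answer> tags."
)

def is_well_formatted(completion: str) -> bool:
    """Loose check that the completion wraps reasoning and answer in the expected tags."""
    pattern = r"\s*<think>.*?</think>\s*<answer>.*?</answer>\s*"
    return re.fullmatch(pattern, completion, flags=re.DOTALL) is not None

print(is_well_formatted("<think>2 + 2 = 4</think><answer>4</answer>"))  # True
print(is_well_formatted("The answer is 4."))                            # False
```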
In fact, the trend of increasing generation length via RL training could be even stronger when training directly from a base model rather than from a standard post-trained model that doesn’t have a verbose chain-of-thought style. For RL to really start cranking up the response length in such an instruction-following model, it first has to unlearn the response length that was baked in. For example, in Tülu 3’s final stage of RL finetuning, the phase where the response length first goes down could be that barrier of misalignment between a larger round of SFT training and the smaller RL setup that follows.
Zooming in on the x-axes of these R1-Zero plots, you can see that they’re doing 1000s of “RL steps.” RL step in this case refers to the model update step, which comes after multiple generations are made for the prompts in the batch and then answers are verified. This is a large amount of RL training, especially with such a large model. For reference, in our Tülu 3 work, we finetuned our models for 100s of steps normally, and the biggest models we are releasing soon only trained for ~50 steps of RL.
This is scaled-up RL relative to existing literature. R1 proper surely uses a similar setup, but DeepSeek did not include the same details, so the rest of this post relies more on explicit text in the paper.
Step 1. Reasoning SFT “Cold Start”
In order to improve the readability (i.e. help maintain formatting) and increase the performance of the final reasoning model, DeepSeek performs a small amount of supervised finetuning on the original base model with “a few thousand” filtered completions from the R1-Zero model. This involves a few tricks (none of which seem essential; you just need some of this data), such as:
Using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.
For replication efforts, any of these can be done. In fact, using DeepSeek-R1 itself is likely the easiest way.
This phase readies the loss landscape of the model to make the “emergent” behaviors like “wait, let me check my work” or “that was wrong” come forth more easily in RL training.
Step 2. Large-scale RL for reasoning
As a reminder, RL for reasoning models is built on a simple idea: reward the model for getting correct answers to problems where you can check whether it has the correct answer. A basic feedback loop of this looks like the following:
Exactly what the “reward” is here (the same question applies for R1-Zero) isn’t detailed. DeepSeek mentions three reward components during the reasoning phase of RL:
* Accuracy rewards: These are score bonuses if the response to a prompt is correct. I’ve been referring to these as “verifiable” domains and in OpenAI’s Reinforcement Finetuning this is handled by their graders. TLDR: If the answer is correct, the reward is positive; if not, it is 0.
* Format rewards: These are rewards (or penalties if not satisfied) to check that the model follows the correct formatting (i.e., wrapping its reasoning and answer in the designated tags) for stable inference.
* Language consistency rewards: A reward is added if the language of the answer matches the language of the question. DeepSeek writes that this additional reward shows a “slight degradation in the model’s performance,” but better human preferences. It’s added to make the model nice to use, which is a wonderful reminder that evaluation scores are not all that matters.
The first reward here drives the majority of the learning and the other two are guardrails for creating a stable model (which is not to say they aren’t important implementation details, but rather that the first one is necessary and the others may not be). To optimize this reward, DeepSeek uses the RL algorithm that they introduced, Group Relative Policy Optimization, which is the PPO update rule with a different value approximation method based on Monte Carlo advantage estimates rather than holding a separate value model in memory. The most likely explanation for this choice (much like how OpenAI has always used PPO) is that it is the mature implementation in their infrastructure.
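As a rough sketch of how the three reward components above could be combined, consider something like the following. The weights and helper functions (`verify_answer`, `detect_language`, `format_ok`) are illustrative assumptions; DeepSeek does not publish this level of detail.

```python
def reasoning_reward(
    prompt: str,
    completion: str,
    reference_answer: str,
    verify_answer,    # hypothetical callable: (completion, reference_answer) -> bool
    detect_language,  # hypothetical callable: (text) -> language code, e.g. "en" or "zh"
    format_ok,        # hypothetical callable: (completion) -> bool, checks the tag format
) -> float:
    """Toy combination of accuracy, format, and language-consistency rewards."""
    reward = 0.0

    # 1) Accuracy reward: positive if the verifiable answer is correct, 0 otherwise.
    if verify_answer(completion, reference_answer):
        reward += 1.0

    # 2) Format reward: penalize completions that break the expected structure.
    if not format_ok(completion):
        reward -= 0.5

    # 3) Language-consistency reward: small bonus if the response stays in the
    #    prompt's language (DeepSeek notes this slightly hurts benchmark scores
    #    but improves human preference).
    if detect_language(completion) == detect_language(prompt):
        reward += 0.1

    return reward
```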
This image from the DeepSeekMath paper is a fantastic comparison of PPO to GRPO (feel free to skip this if you only care about the big-picture recipe):
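The core difference is how the advantage is estimated: rather than a learned value model, GRPO samples a group of completions per prompt and normalizes each completion’s reward against the group. A minimal sketch of that advantage computation (leaving out the clipped policy-gradient update and KL penalty) might look like:

```python
from statistics import mean, stdev

def grpo_advantages(group_rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Group-relative advantages: normalize each completion's reward within its prompt's group."""
    mu = mean(group_rewards)
    sigma = stdev(group_rewards) if len(group_rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in group_rewards]

# Example: 4 completions sampled for one prompt, only the second one is verified correct.
print(grpo_advantages([0.0, 1.0, 0.0, 0.0]))  # the correct completion gets a large positive advantage
```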
The nature of the reward setup (and the data) is the key to this sort of reasoning training and many of the small RL details can be substituted for each other.
Much like the DeepSeek V3 paper, the details of what data they used to train the model are not included here. This is absolutely crucial and almost certainly involves many, many verifiable prompts with answers. In order to study these models the community needs open versions of these datasets.
I would’ve loved to see details of their RL infrastructure (similar to the details in the DeepSeek V3 paper), as many people are looking to build on these models. RL training requires holding multiple models in memory and alternating between generating, verifying, and taking loss steps. As Sasha Rush says, “We need to code up verifiers ASAP,” which is what we are trying to do at Ai2 building on Tülu 3 and could use a lot of help with the open-source code. A good approach for entities interested here is to develop tooling and data for one domain at a time.
These first two steps are not new but rather scaled-up versions of ideas people have been discussing extensively. The final two steps DeepSeek details in the paper are new applications of known techniques to help take their raw reasoning performance and “train a user-friendly model.”
Step 3. Rejection Sampling to introduce general abilities
Rejection sampling is a technique where you generate completions from a model, rank them via a reward model, and then finetune the original model (normally with the supervised finetuning loss) to improve performance on a variety of tasks. It’s one of the standard post-training tools used by Llama 3 and many others.
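A single round of rejection sampling looks roughly like the sketch below; `generate`, `score`, and `sft_finetune` are hypothetical stand-ins for whatever inference, reward-model, and training stack is actually being used.

```python
def rejection_sampling_round(model, reward_model, prompts, n_samples=8):
    """Generate candidates, keep the best per prompt, and finetune on the survivors (toy sketch)."""
    sft_dataset = []
    for prompt in prompts:
        # Sample several candidate completions from the current model.
        candidates = [model.generate(prompt) for _ in range(n_samples)]
        # Rank them with a reward model (or a verifier / LLM judge) and keep the best one.
        best = max(candidates, key=lambda c: reward_model.score(prompt, c))
        sft_dataset.append({"prompt": prompt, "completion": best})
    # Finetune the model on the selected completions with the standard SFT loss.
    return model.sft_finetune(sft_dataset)
```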
DeepSeek uses rejection sampling to begin to introduce general capabilities back into the model. It is also the one stage where they include data numbers — 800K completions total, split as 600K for reasoning and 200K for general chat problems. The 800K number is not surprising to me given this is just a late-stage SFT training, and it is similar in size to the ~1M prompts we used in the Tülu 3 SFT mix, which is the ballpark for leading post-training recipes.
The details in the paper are largely around methods for generating responses to prompts and filtering to prioritize high-quality training data. In order to bring more domains into the scope of abilities for the model, DeepSeek has a variety of tricks, such as:
* Using generative reward models (i.e. LLM-as-a-judge) to verify answers to questions that may not be explicitly verifiable,
* Data from the DeepSeek-V3 standard post-training pipeline, and
* Standard (nonverifiable) chat data augmented with extended chain of thought before answering to help the model generalize from reasoning training to broader use cases.
All in, we currently have very few details here and there is a lot of open space to learn (and likely improve).
Step 4. Final RL training for general use
Finally, DeepSeek R1 goes back to reinforcement learning, which really seems to be how most finetuning is ending these days. The second RL stage is “aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities.”
To do this, they do RL training that mixes prompts from the verifiable domains (as done for R1-Zero) with prompts for standard RLHF preference tuning. To handle both, they use multiple reward models and build upon the post-training recipe from DeepSeek V3.
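One plausible way to route between the two reward sources during this stage is sketched below; this is my illustration of the general idea, not DeepSeek’s implementation.

```python
def mixed_reward(prompt, completion, reference_answer, verifier, preference_rm):
    """Use a verifier on checkable prompts and a preference reward model otherwise (sketch)."""
    if reference_answer is not None:
        # Verifiable domain (math, code, etc.): reward correctness directly.
        return 1.0 if verifier(completion, reference_answer) else 0.0
    # General chat prompt: fall back to a scalar score from a preference-trained reward model.
    return preference_rm.score(prompt, completion)
```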
This is not easy to do and involves many questions: What is the right data balance? Can you use an off-the-shelf existing reward model or does it need to have seen long reasoning traces? Are there additional steps needed to not degrade performance? And so on.
As this grows into a larger area of research and development these questions will slowly be answered.
As this post has transitioned into the later stages of training, it is clear that many details are unknown. We have the general shape of how to sequence things and will fill in the details from here. I have a very long stack of reasoning-related research papers to poke through, and while they came before DeepSeek R1, they still will point toward answers.
All of this is solvable, as proven by how quickly DeepSeek went from the o1 release to matching performance with an open weights model.
Discussions and next steps
The DeepSeek R1 report has an entire other subsection dedicated to its distillation experiments, where they took completions from the R1 model and finetuned existing open-weight models with them to boost performance. Releasing these is a fantastic service and provides a solid baseline for RL experiments on smaller models to try and match in the near future.
The discussion in the paper on how large models are required to see the biggest reasoning gains (and generate effective synthetic data) is likely the biggest open question:
First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.
As smaller models continually improve over the years, it is likely that the same type of training could work on something like Llama 5 or 6 8B. It leaves us with the same open question as to why different abilities “emerge” at larger model sizes. Scaling laws are the reason that each generation’s frontier models tend to be the largest models available. The exciting form of this question for 2025 is: how far down in model size will the slow progress of language modeling research drive advanced reasoning capabilities?
Every so often a paper comes around that makes the path forward clear. The last time I felt this way was with the Llama 3 report for post-training, which solidified into the Tülu 3 paper.
Soon, I’ll comment on…
* Distillation of reasoning traces (as done in the R1 paper),
* The demise of process reward models (PRMs) and Monte Carlo Tree Search (MCTS),
* Some things in the DeepSeek paper, like the “Aha” moment and over-indexing on human priors, that annoy me,
* The new reasoning research coming out from academia,
* The other reasoning model that dropped yesterday — Kimi 1.5,
* The biggest application of Tülu 3 RLVR yet, and
* All the other ideas that are under debate in the reasoning model space.
R1 is surely not the only way to train these models, but it is the recipe that people will build off immediately. Let’s get cranking on more datasets and infrastructure.
For those new here, you can check out the Inference & Reasoning tag on Interconnects!
Full post for images, etc: https://www.interconnects.ai/p/to-meta-ray-ban-local-ai
With the Rabbit r1, the Humane pin, the Friend thing, the Sam Altman rumors, Meta Ray-Bans, and everything in between, it is obvious that we are going to get new devices in the near future driven by advancements in AI. Trying some of those that already are public makes this obvious from a functional perspective rather than a marketing perspective.
Even though many of these devices will have a shelf life drastically shortened by the underlying API access getting turned off when the parent company runs out of money, the call for these devices is very strong. AI is going to be more than a chat window we use for work; we just don’t know what that will feel like. AI should be fun, flexible, and available.
Meta’s Ray-Bans were first launched in 2021, long before any of this ChatGPT-inspired interest in AI began. Having tried them — the form factor would have caught on eventually, but AI was the catalyst to accelerate adoption. AI expanded our expectations for the range of exciting outcomes that could be coming our way.
Using the AI in the Ray-Bans is much like using a protolithic chatbot. If I had never used ChatGPT, it would have been transformative, but today it feels slightly outdated. We should be more impressed by these devices generally and contextualize the AI they’re delivering. The cumulative product excitement feels unexpectedly like what AirPods had on day 1. I was not expecting this fondness.
The form factor for the Meta Ray-Bans is fantastic and drives this connection. I’ve been legitimately excited to use them (albeit, much more during sunny Seattle summers relative to now), and it immediately made sense when taking them out of the packaging. My best use has been for outdoor activities, taking photos and videos without needing to fuss with a phone, and communications. An example video is below -- like most things, it has a learning curve. Here’s a photo from that outing:
Or a video:
Clearly, they’re fine.
What I want to use them for today has nothing to do with AI. In some ways, this makes me more bullish on the form factor, but it makes it clear that Meta is in a precarious position. Ironically, I would’ve been more reluctant to buy them if not for the excitement about AI.
As of writing this, I would much rather have “Apple Ray-Bans” because of a seamless integration with the rest of my information ecosystem. However, Apple may not be willing to take the risk to build them (as I avoid an Apple Vision Pro Digression).
This does not mean the long-term story of many new devices won’t be the AI.
AI, in the recent past (and likely in the near future), left most electronic devices with an eerie, bland sameness. My sunglasses can answer basic questions about my day just like Siri. At the same time, my appliances try to talk to me. The hard-to-visualize step is how this changes (and overcomes the same integration dead ends that agents face). AI in 5 years (or way less) will actually know the context of our lives and be able to execute basic web tasks.
When the AI is good, Meta Ray-Ban type devices will be indispensable. Reminders, calls, reasoning, integration, all on the go. Much like the sensation products like AirPods provide, AI devices (and services) done right will make us free to be in the world naturally.
Meta now has a real hill to climb for AI. They just need to focus on building one more useful feature at a time rather than building a god. They have a tangible goal and a real product that is going to get better in the normal march of progress. If only we had an ecosystem of people who wanted to do this work and keep hill climbing the AI part for them.
The AI of the Meta Ray-Bans (and the other devices I started with) being primarily in the cloud is a drag but is needed for these first generations of glasses to maintain battery life. The cloud-centric nature of the AI is the largest perceivable reason Meta cannot open a Software Development Kit (SDK) for the glasses — all the developers would be doing is changing Meta's internal Llama API calls, rather than uploading new and improved models to the glasses.
AI models in the cloud are consistently the first ones to cross the frontier of new capabilities. As we figure out what we want to use new AI devices for, using the cloud models will make us more likely than not to find useful applications. Now that we have things that people actually like, we need to optimize and specialize these models out of the cloud.
What’s the state of local LMs?
The AI angle for this post is to prompt the question: what do people actually use local, or on-device, language models for? What innovation are they driving?
The local model ecosystem is composed of a distribution of tinkerers, researchers, and those whose use cases the API models refuse. Most people doing this are not directly innovating on local models in a way that dictates meaningful improvements to underlying AI innovations. Yes, companies surely monitor progress and observe lessons, but there are far bigger markets at play for why local models are needed in the future of AI than the tinkerers that get visibility.
Local language models are crucial for maintaining privacy (not everyone can afford fancy inference data centers like Apple), optimizing inference speed, and providing access in situations with no web connectivity. The Meta Ray-Bans stand to benefit from all of these.
Phrasing the reasoning from the perspective of the frontier cloud models most people are used to, rather than from what we want, it goes as follows: local models shouldn’t try to be our general use case model. Outsource that to the cloud. Use local models for efficient, specific tasks out in the world.
What local model enthusiasts are doing is building an ecosystem around optimization, latency, and task specialty that drives a lot of value. This value is captured by companies with no feedback loops to the tinkerers. Having SDKs and other direct places where those evolving local models can benefit in real ways is the goal. The models themselves will get better too — a real potential feedback loop from open AI models.
Just about a year ago I wrote a very similar take on local models, on how they have different trade-offs and trajectories. Apple Intelligence, Google’s new models / Pixel phones, and the Meta Ray-Bans are showing us that this future is coming.
What is left to be understood is the manner in which local models are developed for new devices. Will any major technology companies let us run our own models with deep integrations? How can open-source principles and local models synergize?
Hillclimbing with open, local language models
Giving developers ways to integrate their own AI models into the operating system (OS) hooks used by the Meta Ray-Bans would immediately spawn a platform for local, open-weight language models. I first learned how locked down the Ray-Ban developer ecosystem was because I was excited to try and get our multimodal LM Molmo on them. That attempt didn’t make it far.
Other companies, like Apple, could conceivably have SDKs that let users point their language models at OS hooks. Creating operating systems that allow users to integrate certain open models (even only those that are approved by the companies) would completely change the (lack of) incentives for iterating on language models in the open.
While we still don’t have the new Apple Intelligence version of Siri that can plug into multiple applications, we know this works by letting an AI model generate tokens that correspond to actions in other applications. Letting users choose AI models (maybe their own), even if they only are useful in a subset of the tasks, would be wonderful. I would love to sacrifice whatever the AI situation is on my version of the Ray-Bans by default and get just the best vision model for explaining my environment, the best model for cooking ideas, or the best conversational model to just push the limits for AI devices in any of these promising directions. It would be so fun to try different AI models on a real device.
The open language modeling ecosystem desperately needs these types of feedback loops (and it is totally natural for excitement about a type of technological development like this to exist before the proof cases of its value).
Getting to the point where Meta has an AI SDK for devices along with the leading open language models will make their entire strategy value additive (rather than just destroying the advantages of competitors). In fact, Meta likely needs to do so, or else Apple’s product competitor may dominate the market. Only different strategies and feedback loops can dislodge Apple’s integration.
On the modeling side, there’s no doubt we have step-change improvements coming to those used on the Ray-Bans. On ChatBotArena, we have many models with a few billion parameters that beat the first versions of ChatGPT. The same type of performance gain — where a 100X smaller model can match or surpass performance in a few years — will come for the Ray-Bans and all other sorts of AI applications.
The big picture arc of technology
Starting in 2025, I’m excited about the breadth and quantity of profound, new technological experiences I’m having. Some of them, like ChatGPT Advanced Voice Mode, haven’t really landed for me (even though they’re extremely impressive to non-tech non-AI friends and family). Meta Ray-Bans, Waymos, Codex, and standard ChatGPT all feel like technologies that were immediately obvious as something I needed. I need to get a Starlink hub in one of the remote locations my hobbies bring me to, and I’m sure I can add reusable rockets to the transformations I’ve embraced.
The last technologies sparking these joys were the likes of the iPod and the iPad.
Every person I take to ride a Waymo for the first time has a similar experience of joy.
This year we may also have new models that solve arbitrary internet tasks for us in the background.
The future is here and we’re living in a time where it’ll be more evenly distributed.
Original post:
https://www.interconnects.ai/p/deepseek-v3-and-the-actual-cost-of
Chapters
00:00 Opening
03:15 DeepSeek’s learning efficiency
06:49 DeepSeek’s compute transparency and reality
Figures
Fig 1: Benchmark Results
Fig 2: ChatBotArena Results
Fig 3: Compute Usage Table
Psssst… we just recently released our technical report for OLMo 2 — 2 OLMo 2 Furious, check it out for tons of training details and tips!
This post has some good content, but if you just want to watch the tutorial on YouTube, it’s here.
I’m far more optimistic about the state of open recipes for and knowledge of post-training starting 2025 than I was starting 2024. Last year one of my first posts was how open post-training won’t match the likes of GPT-4. This is still the case, but now we at least understand the scope of things we will be working with better.
It’s a good time to record an overview of what post-training looks like today. I gave a version of this tutorial talk for the first time in 2023 (at ICML), which felt like a review of the InstructGPT paper not based on reproduced literature knowledge. In 2024, the scientific community made substantial progress in actually training these models and expanding the frontier of knowledge. Doing one of these talks every year feels like a good way to keep tabs on the state of play (whereas last year, I just had a bunch of links to add to the conversation on where to start).
With the talk, I wanted to add more context on where I see post-training generally.
The most important one people need to know, given the excitement around OpenAI’s o1 series of models, is that post-training alone is nowhere near a complete enough lens or taxonomy to study training reasoning language models. It’s a step.
Back to processes for all modern AI models. There are a lot of post-training methods to improve models and, more importantly, they can be segmented so the scientific community can make progress on each of them individually. The new state of finetuning stages is satisfying, with three groups of training methods:
* Instruction finetuning (a.k.a. supervised finetuning),
* Preference finetuning (the generalization of reinforcement learning from human feedback), and
* Reinforcement finetuning, the new abstraction for improving performance on specific tasks.
Some of the long-tail methods like rejection sampling, knowledge distillation, and extensive filtering aren’t studied well, but you can still do excellent post-training without them. We have options for studying post-training in 2025.
Where last year we were settling debates such as “DPO vs. PPO” or “does AI feedback for RLHF work,” now we are focused on just making the best practices better.
Similarly, the stress around doing research on outputs from foundation model providers (i.e., whether research violates the OpenAI terms of service on training competitor models) has dropped further, and distilling from strong models is now common practice — in fact, it is a fundamental part of successful post-training.
To summarize the state of post-training, there are a few things to keep in mind:
1. Post-training techniques are more impactful on the final performance of models
Some caveats before I toot the horn of post-training as all you need today. Given that “scaling as we know it is ending,” this is not an entirely controversial take. And it is obviously self-serving, as someone who is going to benefit from post-training being more important.
All of this aside, it’s very logical that post-training will be the next domain for scaling model compute and performance. Predicting the next token accurately is not something that a user cares about — correct answers and how the answer is presented are. All through 2024, there were way more discussions on how post-training is more important.
If we look at the Elo ratings of models on ChatBotArena, we can see progress has accelerated even though the models haven’t been getting noticeably bigger. Pretraining on these architectures is improving, yes, but the biggest and best models are used as tools and supervision for better post-training.
Post-training got more popular because there was more low-hanging fruit on model performance. A lot of that potential has been realized and, in doing so, entirely new types of models are being made akin to o1.
To interpret these numbers:
* 100 Elo margin over another means ~2/3 win probability over the lower,
* 200 Elo gives ~76% win probability,
* 300 Elo gives ~85% win probability, and so on.
You can play with these numbers here.
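Those conversions come from the standard Elo formula. A small sketch for turning a rating gap into a win probability, assuming the usual 400-point logistic scale:

```python
def elo_win_probability(rating_diff: float) -> float:
    """Probability that the higher-rated model wins, given the Elo gap (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** (-rating_diff / 400))

for diff in (100, 200, 300):
    print(diff, round(elo_win_probability(diff), 2))  # ~0.64, ~0.76, ~0.85
```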
2. Post-training can be very expensive
While still far cheaper than pretraining due to the price of GPUs, post-training costs have been growing rapidly. If we estimate the costs of post-training the Llama models, we could guess that the all-in costs were about the following (note: numbers are based primarily on a combination of headcount and data costs, with compute driving them even higher):
* LLaMA (Q1 2023) <<$1M: instruction tuning only.
* Llama 2 (Q3 2023) ~$10-20M: 1.4M preference pairs, RLHF, IFT, Safety, etc. and other costs not in the paper.
* Llama 3.1 (Q3 2024) >$50M: similar preference data to Llama 2, a ~200-person post-training team, larger models, etc. The number could be much higher.
Post-training costs come from large data bills and extensive inference to generate, clean, and verify multiple types of synthetic training data. More complex loss functions, e.g. RL optimizers, use a lot of memory to train, but far fewer FLOPs than pretraining for general instruct models. This is all growing rapidly and is expected to change.
This culminates in the o1-style models, where the compute spent on post-training loss functions can account for 40% or more of the overall compute of the model. Even Tülu 3, our major post-training project at Ai2 that didn’t buy any human data, cost an estimated >$1M, which is a lot for an academic project.
3. Post-training is less reliant on human data
While all the frontier laboratories still rely on human data for parts of their post-training pipeline (including both training and evaluation), AI can be substituted at most stages and get a “good enough” outcome. For example, given the costs above, they can be slashed by moving from human preference data at ~$5-20 per preference point to AI feedback at <$0.01 per sample. The optionality of synthetic data, driven by having models that are good enough for supervision, makes the pace of post-training progress far higher. In my experience, AI feedback for RLHF only became possible with GPT-4 tier models, and the academic community reaps extreme benefits from the plummeting cost of inference.
4. Post-training ability is the door to advanced reasoning models
Doing post-training well and having mastery of the techniques seems crucial to making progress on reasoning models like o1, because the infrastructure for RL finetuning of an instruct model is the same as what is used for large-scale RL training (at least, you want it to be).
Given the above trends — we know more, it is easier to study, we have cheaper alternatives, etc. — there is cause for optimism in open replications of o1. It should still be expected that the first “replications” of o1 are more relative models — scaled up post-training on reasoning rather than the special pretraining + scaled RL that OpenAI does. We will learn a lot soon.
The talk on YouTube
Slides for this post-training talk and slides for the full tutorial on language modeling (with a bit less post-training content and no recording yet). Here are some timestamps for the video:
* 00:00 Introduction
* 10:00 Prompts & Skill Selection
* 14:19 Instruction Finetuning
* 21:45 Preference Finetuning
* 36:17 Reinforcement Finetuning
* 45:28 Open Questions
* 52:02 Wrap Up
In 2025 we need to disambiguate three intertwined topics: post-training, reasoning, and inference-time compute. Post-training is going to quickly become muddied with the new Reasoning Language Models (RLMs — is that a good name?), given that the loss functions we studied via advancements in post-training are now being leveraged at large scale to create new types of models.
I would not call the reinforcement learning training done for OpenAI’s o1 series of models post-training. Training o1 is large-scale RL that enables better inference-time compute and reasoning performance. Today, I focus on reasoning. Technically, language models definitely do a form of reasoning. This definition does not need to go in the direction of the AGI debate — we can clearly scope a class of behavior rather than a distribution of explicit AI capability milestones. It’ll take work to get an agreement here.
Getting some members of the community (and policymakers) to accept that language models do their own form of reasoning by outputting and manipulating intermediate tokens will take time. I enjoy Ross Taylor’s definition:
Reasoning is the process of drawing conclusions by generating inferences from observations.
This is a talk I gave at NeurIPS at the Latent Space unofficial industry track. I wanted to directly address the question of whether language models can reason and what o1 and the reinforcement finetuning (RFT) API tell us about it. It’s somewhat rambly, but it asks the high-level questions on reasoning that I haven’t written about yet and is a good summary of my coverage of o1’s implementation and the RFT API.
Thanks swyx & Alessio for having me again! You can access the slides here (e.g. if you want to access the links on them). For more on reasoning, I recommend you read/watch:
* Melanie Mitchell’s series on ARC at AI: A Guide for Thinking Humans: first, second, third, and final. And her post on reasoning proper.
* Miles Brundage’s thread summarizing the prospects of generalization.
* Ross Taylor’s (previous interview guest) recent talk on reasoning.
* The inference-time compute tag on Interconnects.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts.
Transcript + Slides
Nathan [00:00:07]: Hey, everyone. Happy New Year. This is a quick talk that I gave at NeurIPS, the Latent Space unofficial industry event. So Swyx tried to have people talk about the major topics of the year, scaling, open models, synthetic data, agents, etc. And he asked me to fill in a quick slot on reasoning. A couple notes. This was before O3 was announced by OpenAI, so I think you can take everything I said and run with it with even more enthusiasm and expect even more progress in 2025. And second, there were some recording issues, so I re-edited the slides to match up with the audio, so you might see that they're slightly off. But it's mostly reading like a blog post, and it should do a good job getting the conversation started around reasoning on Interconnects in the new year. Happy New Year, and I hope you like this. Thanks.
I wouldn't say my main research area is reasoning. I would say that I came from a reinforcement learning background into language models, and reasoning is now getting subverted into that as a method rather than an area. And a lot of this is probably transitioning these talks into more provocative forms to prime everyone for the debate that is why most people are here. And this is called the state of reasoning. This is by no means a comprehensive survey. To continue, I wanted to make sure that I was not off base to think about this because there's a lot of debates on reasoning and I wanted to revisit a very basic definition.
And this is a dictionary definition, which is the action of thinking about something in a logical, sensible way, which is actually sufficiently vague that I would agree with it. I think as we'll see in a lot of this talk is that I think people are going crazy about whether or not language models reason. We've seen this with AGI before. And now we're going to talk about it. Now, reasoning kind of seems like the same thing, which to me is pretty ridiculous because it's like reasoning is a very general skill and I will provide more reasoning or support for the argument that these language models are doing some sort of reasoning when you give them problems.
I think I don't need to share a ton of examples for what's just like ill-formed arguments for what language models are not doing, but it's tough that this is the case. And I think there are some very credible arguments that reasoning is a poor direction to pursue for language models because language models are not going to be as good at it as humans. But to say that they can't do reasoning, I don't see a lot of proof for, and I'll go through a few examples. And the question is like, why should language model reasoning be constrained to look like what humans do?
I think language models are very different and they are stochastic. The stochastic parrots thing is true for many reasons. And we should embrace this. And we should continue. And I think a big trend of the year is that we're seeing new types of language model reasoning that look less human. And that can be good for kind of separating the discourse from expecting a really narrow type of behaviors.
I did an interview with Ross Taylor, who was a reasoning lead at Meta, which I thought was a very good education for me on this. And this is just a direct pull from the transcript. But essentially it's saying is like, if you do chain of thought on a language model. What it is doing is essentially outputting its intermediate steps. If I were to ask you all a math problem right now, you can do most of them in your head and you are doing some sort of intermediate storage of variables. And language models have no ability to do this. They are kind of per token computation devices where each token is outputted after doing this forward pass. And within that, there's no explicit structure to hold these intermediate states. So I think embracing chain of thought and these kind of intermediate values for the language models is extremely reasonable. And it's showing that they're doing something that actually gets to valuable outputs.
Nathan [00:04:10]: So this is like one of the many ways that we can kind of lead towards O1 is that language models have randomness built into them. And a lot of what people see as failures in reasoning are kind of these language models following very static chains and making very specific mistakes. Along the way with really no ability to correct for that. This is really not something that we see in human reasoning. So if a human makes a mistake, they will normally catch it on the next step. But we need to handle language models differently.
Nathan [00:04:41]: And why O1 is exciting is because it's a new type of language models that are going to maximize on this view of reasoning. Which is that chain of thought and kind of a forward stream of tokens can actually do a lot to achieve better outcomes. When you're doing a reasoning like ability or reasoning like action, which is just repeatedly outputting tokens to make progress on some sort of intelligence defined task. So it's just making forward progress by spending more compute and the token stream is the equivalent of some intermediate state.
Nathan [00:05:18]: What is O1 has been a large debate since its release. I'm not going to spend a lot of this talk on it. But the more time I've spent on it, the more I think you should take OpenAI at their face value, which is that they are doing very large scale RL on verifiable outcomes (that last part is what I've added, especially in context of the RL API that they've released, which I'll talk about more). But most of the reasons to believe in more complicated things like process reward models, self play, or Monte Carlo tree search are mostly based on previous literature and things that we would have expected advanced reasoning to look like for language models, and not based on evidence that they have given us or the behavior, whether you're looking at evaluations or how inference is actually done when serving the model.
This takes us to replications, or I would probably call them relatives of O1, coming from the community. These are wonderful to see. We are exploring the boundaries for like what we can do with chain of thought in models. The two I've highlighted are from DeepSeek and Qwen, and a lot of people in this room have probably seen them. And I think that these models are really substantially narrower than these full O1 models from OpenAI. So with OpenAI, if you use O1, you can do it for a lot more tasks. I was using the DeepSeek model, which is supposed to be for math or code, but they've tried to keep the model so narrow that even then, if you ask a code question, sometimes it'll say it is only supposed to work on math or code. And a lot of the success of O1 and the future models of this is going to be it being able to handle more tasks and more domains.
So SemiAnalysis wrote a post that I haven't read in full. But even if you look at the paywalled headings, you can kind of make some intelligent claims about what O1 is or is not. I think these are two of the things from the table of contents that you can see without paying. I'm due to pay at some point, but I have not. One is incredible amounts of forward passes during training. I think you'll see this as I discuss RL fine tuning models, maybe more in a little bit. But when you're doing RL, there's two types of ways that you see data many times, and that will result in many forward passes. One is that when you're doing RL on a prompt, you can sample many completions to then grade them or use them in different ways to update your policy. So if I ask one math problem, I could look at eight completions and choose the best one or do some contrastive thing between the best and the worst one. And that kind of gradation can help the RL policy actually learn. And the second way is that, because the loss function is more flexible than something like instruction tuning, you can go over the same prompts many more times than you would in instruction tuning or kind of pre-training. So this kind of means they're doing just a lot of this sampling from the model, which is very different than other types of training we've seen in the past at pre- and post-training. And then this one is great (thanks for showing everyone this): that post-training flops exceed pre-training. I think this pretty much clearly says that they're using RL. They're using a ton of compute for this large scale RL. And at that point, it would probably mean something different, where this is like pre-training RL. And this is something that these early relative models are not going to be doing, because no one has this infrastructure like OpenAI does. It'll take a while to do that, but people will make it.
Nathan [00:08:50]: OK, this takes us to reinforcement fine tuning. I would say that this is a hard pivot in the talk, where O1 is essentially pre-training scale RL. Extremely big RL. And we don't know all the details of the data, and then OpenAI is showing us this new beta API program that they're making, which is just a sprinkle of this. So what can you do with a tiny bit of their infrastructure? I think one of the fine tuning leads responded to a tweet from Swyx. There was a long tweet that gave a lot of details, but even the first tweet, which I hadn't seen and had like eight likes, said: this API is using the same infrastructure that we used to train O1. And I was like, that alone is a lot of detail for a random thing on Twitter. And then there's really long details on other stuff of it. But it is just a new paradigm for fine tuning. And I have seen some of this work and I'm pretty optimistic that it'll work for kind of really specific capabilities where answers matter rather than features in your style of text mattering. So again, kind of like I was hinting at with O1, this reinforcement fine tuning does many passes over the data, which is why they can say you only need dozens of labeled samples to actually learn from it, which is just very different than previous training regimes. So what happens is that the grader gives a bonus when the answer is right, and the model learns to reinforce behaviors that get right answers. Later in the talk, I'll highlight a research project that we did that was pretty much doing a very similar thing to target very specific evaluations on open models. You do RL and you give a reward bonus when the answer is right, and that's all you do. And the kind of key innovation and the simplicity is that modern language models are a strong enough base where just a really gentle RL fine tuning can add these specific capabilities without degrading the model. I think there is a lot of fear for adding RL to these training regimes, and I'm sure we'll get to that in the future. But I think one of the biggest challenges for teams, especially on general instruct models like ChatGPT, was just that they're going to destroy the rest of the performance, the base chat performance that you care about. And it really seems like you can just do this out of the box. If OpenAI is going to allow an API, they aren't going to let people train a model that then just gets worse on random other things.
Nathan [00:11:20]: So this is what the data format looks like. The example they gave is way more complicated than I think it needs to be. It's like, you can start with like a grade school math problem and just say the correct answer is the correct number. The genes are confusing. But essentially, you have two components, a prompt and an answer, which is different than having a prompt and completion that you would train on. Or if you're doing preference tuning, you would do a prompt and a chosen completion and a rejected completion. So it's a new type of data format, and I suspect, quickly, we'll see things like Hugging Face having more of these. I will highlight we have some of ours for our specific project that we did. We have examples for math. What's on the screen is an example for precise instruction following, which is the idea that if you have a prompt, you can say something like, have every sentence start with the letter A. And you can verify that with Python really easily. This is something that we did in our project. And it's like, the model gets better at this. You have constrained data, and the RL algorithm learns to change the model just a tiny bit and actually reach these answers.
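For concreteness, the three data formats being contrasted here look roughly like the following; the field names are illustrative, not OpenAI's exact schema.

```python
# Supervised finetuning: prompt + a full completion to imitate.
sft_example = {"prompt": "What is 12 * 7?", "completion": "12 * 7 = 84. The answer is 84."}

# Preference tuning: prompt + a chosen and a rejected completion.
preference_example = {
    "prompt": "What is 12 * 7?",
    "chosen": "12 * 7 = 84.",
    "rejected": "12 * 7 = 74.",
}

# Reinforcement finetuning / RLVR: prompt + an answer that a grader or verifier can check.
rft_example = {"prompt": "What is 12 * 7?", "answer": "84"}
```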
Nathan [00:12:23]: A confusing thing for people was these grader models. I think the place to come at these from is evaluation. There's been a lot of work in evaluation to make answer extraction stable, especially with math. An example that I used in the blog post I wrote today on this is how Llama 3.1 details their evals: for math, they use both SymPy, a Python package, for extraction, and LLM-as-a-judge to extract their answers. And what the graders are doing is essentially amping this up a whole lot. It's a whole nother level where it's kind of a nested structure of configs for doing reward shaping on these verifiable outputs. For math, it can be really easy. It's like, you know, you have to handle these five formats that I came up with in a minute for how you could represent different numbers and tokens. But as you get to more complicated things and more complicated behaviors, it seems like OpenAI is insinuating that you're going to need more than just a yes/no loss function for your domains. And that seems fine. Um, well, we already have a bunch of things. We have a bunch of open models that are doing this, like judge models such as Prometheus and other things that are designed specifically for LLM-as-a-judge. And I see that continuing to just become part of this kind of open RL infrastructure.
Nathan [00:13:41]: OpenAI had a bunch of screenshots. I'm not going to end on a commentary on these, but it looks pretty standard. They're going to track how performance changes over time and stuff like this. You'll be able to look at all the outputs. This is just them making pretty things. And then they have this like very generic RL plot. Um, the most standard RL plot is a X axis of time or trials and a Y axis of reward here. Reward is like an accuracy or a success rate on a certain validation set. And X is actually supposed to be like how much training was done. And. That's a very similar to what we did in our project. I think this is kind of just another way you can put this with an RL feedback diagram. If you've seen RL, where you have this agent interacting with the environment, this, you can squint at it and it'll be familiar. If you haven't. You'll probably be in for more of these things if RL keeps becoming popular because RL is really formulated as trial and error learning.
But if you're interested, we're happy to try to have people use our code, which does this for math and some instruction tuning already, and we want to try more complicated graders for things like code. So for code quality, a binary outcome doesn't really make sense, which is a good way to think about why you might need to do some reward shaping for how you would grade outputs from a various model. And to kind of compare the plot that OpenAI had, which is like performance improving over time.
These are some experiments we ran on various evaluations. So the left column is some language model evaluation that we would use in an academic paper. And the right is all the various internal, um, RL statistics, where GSM8K, MATH, and IFEval are all being trained on training sets. So we have the prompts, which are math questions, and we have the answers, which are numbers, and we're really doing this RL on seeing if the answer is right. And then it generalizes to various math evaluations that we care about. So I kind of see this as like, we got a tip from an industry lab member to do this a few months early. So we got a head start. And I think a lot of people are obviously going to be trying to replicate this now. So it's fun that we have a starting point and I'm excited to talk about it with people this week. And I think reasoning is worth continuing as something. Yeah. I can read the post that I was referencing here and I'm happy to take any related or hard question on reasoning. Cause I kind of opened the floor for that. So thank you. Okay.
Original post
https://www.interconnects.ai/p/2024-interconnects-year-in-review
Original post:
https://www.interconnects.ai/p/openais-o3-the-2024-finale-of-ai
Chapters
00:00 Introduction
02:51 o3 overview
05:57 Solving the Abstraction and Reasoning Corpus (ARC)
10:41 o3’s architecture, cost, and training (hint: still no tree search)
16:36 2024: RL returns
Figures
Fig 1, Frontier Math results
Fig 2, Coding results
Fig 3, ARC AGI results
Fig 4, ARC AGI result details
Fig 5, ARC AGI example 1
Fig 6, ARC AGI example in text
Fig 7, ARC AGI example “easy”
Original post: https://www.interconnects.ai/p/the-ai-agent-spectrum
Chapters
00:00 Introduction
03:24 Agent cartography
08:02 Questions for the near future
Figures
Fig 1. multiple feedbacks diagram
Original post:
https://www.interconnects.ai/p/openais-reinforcement-finetuning
Chapters
00:00 Introduction
04:19 The impact of reinforcement finetuning’s existence
07:29 Hypotheses on reinforcement finetuning’s implementation
Figures
Fig. 1, Yann’s Cake
Fig. 2, Grader config
Fig. 3, RLVR learning curves
Finbarr Timbers is an AI researcher who writes Artificial Fintelligence — one of the technical AI blogs I’ve been recommending for a long time — and has a variety of experiences at top AI labs, including DeepMind and Midjourney. The goal of this interview was to do a few things:
* Revisit what reinforcement learning (RL) actually is, its origins, and its motivations.
* Contextualize the major breakthroughs of deep RL in the last decade, from DQN for Atari to AlphaZero to ChatGPT. How could we have seen the resurgence coming? (see the timeline below for the major events we cover)
* Modern uses for RL, o1, RLHF, and the future of finetuning all ML models.
* Address some of the critiques like “RL doesn’t work yet.”
It was a fun one. Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Timeline of RL and what was happening at the time
In the last decade of deep RL, there have been a few phases.
* Era 1: Deep RL fundamentals — when modern algorithms were designed and proven.
* Era 2: Major projects — AlphaZero, OpenAI 5, and all the projects that put RL on the map.
* Era 3: Slowdown — when DeepMind and OpenAI no longer had the major RL projects and cultural relevance declined.
* Era 4: RLHF & widening success — RL’s new life post ChatGPT.
Covering these are the following events. The list is incomplete, but enough to inspire a conversation.
Early era: TD Gammon, REINFORCE, Etc
2013: Deep Q Learning (Atari)
2014: Google acquires DeepMind
2016: AlphaGo defeats Lee Sedol
2017: PPO paper, AlphaZero (no human data)
2018: OpenAI Five, GPT 2
2019: AlphaStar, robotic sim2real with RL early papers (see blog post)
2020: MuZero
2021: Decision Transformer
2022: ChatGPT, sim2real continues.
2023: Scaling laws for RL (blog post), doubt of RL
2024: o1, post-training, RL’s bloom
Chapters
* [00:00:00] Introduction
* [00:02:14] Reinforcement Learning Fundamentals
* [00:09:03] The Bitter Lesson
* [00:12:07] Reward Modeling and Its Challenges in RL
* [00:16:03] Historical Milestones in Deep RL
* [00:21:18] OpenAI Five and Challenges in Complex RL Environments
* [00:25:24] Recent-ish Developments in RL: MuZero, Decision Transformer, and RLHF
* [00:30:29] OpenAI's O1 and Exploration in Language Models
* [00:40:00] Tülu 3 and Challenges in RL Training for Language Models
* [00:46:48] Comparing Different AI Assistants
* [00:49:44] Management in AI Research
* [00:55:30] Building Effective AI Teams
* [01:01:55] The Need for Personal Branding
We mention
* IBM’s Deep Blue
* Alberta Machine Intelligence Institute (AMII)
* Claude (Anthropic's AI assistant)
* Bard (Google's AI assistant)
* Scale AI
Original post: https://www.interconnects.ai/p/openais-o1-using-search-was-a-psyop
Figures
Figure 0: OpenAI’s seminal test-time compute plot
Figure 1: Setup for bucketed evals
Figure 2: Evals with correctness labels
Figure 3: Grouped evals
Figure 4: Hypothetical inference scaling law
Full post:
https://www.interconnects.ai/p/olmo-2-and-building-language-model-training
OLMo 2 demo: https://playground.allenai.org/
OLMo 2 artifacts: https://huggingface.co/collections/allenai/olmo-2-674117b93ab84e98afc72edc
Chapters
00:00 Building AI Teams
06:35 OLMo 2
Figures
Fig 1, pretrain plot: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo2/pretrain.webp
Fig 2, pretrain table: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo2/pretrain-table.webp
Fig 3, post-train table: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo2/postrain-table.webp
Original post: https://www.interconnects.ai/p/tulu-3
Chapters
00:00 History
05:44 Technical details sneak peak
Figures
Fig 1, results: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/results.webp
Fig 2, overview: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/overview.webp
Fig 3, preferences: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/preferences.webp
Fig 4, RLVR: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/rlvr.webp
Original post: https://www.interconnects.ai/p/scaling-realities
Original post: https://www.interconnects.ai/p/saving-the-nairr
Chapters
05:26: Do we need an AI research resource or an LM research resource?
08:59: Policy roundups
Tim Dettmers does not need an introduction for most people building open-source AI. If you are part of that minority, you’re in for a treat. Tim is the lead developer behind most of the open-source tools for quantization: QLoRA, bitsandbytes, 4- and 8-bit inference, and plenty more. He recently finished his Ph.D. at the University of Washington, is now a researcher at the Allen Institute for AI, and is starting as a professor at Carnegie Mellon University in fall of 2025.
Tim is a joy to talk to. He thinks independently on all the AI issues of today, bringing new perspectives that challenge the status quo. At the same time, he’s sincere and very helpful to work with, working hard to uplift those around him and the academic community. There’s a reason he’s so loved in the open-source AI community.
Find more about Tim on his Twitter or Google Scholar. He also has a great blog where he talks about things like which GPUs to buy and which grad school to choose.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Show Notes
Companies, people, projects, research papers, and other key named entities mentioned in the transcript:
* QLoRA
* Llama 3
* Claude (AI assistant by Anthropic)
* Transformers (Hugging Face library)
* Gemma (Google's open weight language model)
* Blackwell (NVIDIA GPU architecture)
* Branch Train Merge (research paper)
* "ResNets do iterative refinement on features" (research paper)
* CIFAR-10 and CIFAR-100 (computer vision datasets)
* Lottery Ticket Hypothesis (research paper)
* TRL (Transformer Reinforcement Learning) by Hugging Face
* Tim's work on quantization (this is just one example)
Timestamps
* [00:00:00] Introduction and background on Tim Dettmers
* [00:01:53] Future of open source AI models
* [00:09:44] SWE Bench and evaluating AI systems
* [00:13:33] Using AI for coding, writing, and thinking
* [00:16:09] Academic research with limited compute
* [00:32:13] Economic impact of AI
* [00:36:49] User experience with different AI models
* [00:39:42] O1 models and reasoning in AI
* [00:46:27] Instruction tuning vs. RLHF and synthetic data
* [00:51:16] Model merging and optimization landscapes
* [00:55:08] Knowledge distillation and optimization dynamics
* [01:01:55] State-space models and transformer dominance
* [01:06:00] Definition and future of AI agents
* [01:09:20] The limit of quantization
Transcript and full details: https://www.interconnects.ai/p/tim-dettmers
Get Interconnects (https://www.interconnects.ai/)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
… on Apple Podcasts: https://podcasts.apple.com/us/podcast/interconnects/id1719552353
Andrew Carr is co-founder and chief scientist at Cartwheel, where he is building text-to-motion AI models and products for gaming, film, and other creative endeavors. We discuss how to keep generative AI fun and expansive — niche powerful use-cases, AI poetry, AI devices like Meta RayBans, generalization to new domains like robotics, and building successful AI research cultures.
Andrew is one of my most well-read friends on the directions AI is going, so it is great to bring him in for an official conversation. He spent time at OpenAI working on Codex and at Gretel AI, and is an editor for the TLDR AI Newsletter.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Show Notes
Named entities and papers mentioned in the podcast transcript:
* Codex and GitHub Copilot
* Blender 3D simulator
* HuggingFace Simulate, Unity, Godot
* Runway ML
* Mark Chen, OpenAI Frontiers Team Lead
* Meta’s Lingua, Spirit LM, torchtitan and torchchat
* Self-Rewarding Language Models paper
Timestamps
* [00:00] Introduction to Andrew and Cartwheel
* [07:00] Differences between Cartwheel and robotic foundation models
* [13:33] Claude computer use
* [18:45] Supervision and creativity in AI-generated content
* [23:26] Adept AI and challenges in building AI agents
* [30:56] Successful AI research culture at OpenAI and elsewhere
* [38:00] Keeping up with AI research
* [44:36] Meta Ray-Ban smart glasses and AI assistants
* [51:17] Meta's strategy with Llama and open source AI
Transcript & Full Show Notes: https://www.interconnects.ai/p/interviewing-andrew-carr
Full post:
https://www.interconnects.ai/p/why-i-build-open-language-models
How Claude's computer use works. Where OpenAI, Anthropic, and Google all have a lead on each other.
Original post: https://www.interconnects.ai/p/claudes-agency
Chapters
00:00 Claude's agentic future and the current state of the frontier models
04:43 The state of the frontier models
04:49 1. Anthropic has the best model we are accustomed to using
05:27 Google has the best small & cheap model for building automation and basic AI engineering
08:07 OpenAI has the best model for reasoning, but we don’t know how to use it
09:12 All of the laboratories have much larger models they’re figuring out how to release (and use)
10:42 Who wins?
Figures
Fig 1, Sonnet New Benchmarks: https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d2e63ff-ac9f-4f8e-9749-9ef2b9b25b6c_1290x1290.png
Fig 2, Sonnet Old Benchmarks: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccbd4d-f1c8-4a38-a474-69a3df8a4448_2048x1763.png
Get Interconnects (https://www.interconnects.ai/)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
… on Apple Podcasts: https://podcasts.apple.com/us/podcast/interconnects/id1719552353
Arvind Narayanan is a leading voice disambiguating what AI does and does not do. His work, with Sayash Kapoor at AI Snake Oil, is one of the few beacons of reason in an AI media ecosystem with quite a few bad apples. Arvind is a professor of computer science at Princeton University and the director of the Center for Information Technology Policy. You can learn more about Arvind and his work on his website, X, or Google Scholar.
This episode is all in on figuring out what current LLMs do and don’t do. We cover AGI, agents, scaling laws, autonomous scientists, and past failings of AI (i.e. those that came before generative AI took off). We also briefly touch on how all of this informs AI policy, and what academics can do to decide on what to work on to generate better outcomes for technology.
Transcript and full show notes: https://www.interconnects.ai/p/interviewing-arvind-narayanan
Chapters
* [00:00:00] Introduction
* [00:01:54] Balancing being an AI critic while recognizing AI's potential
* [00:04:57] Challenges in AI policy discussions
* [00:08:47] Open source foundation models and their risks
* [00:15:35] Personal use cases for generative AI
* [00:22:19] CORE-Bench and evaluating AI scientists
* [00:25:35] Agents and artificial general intelligence (AGI)
* [00:33:12] Scaling laws and AI progress
* [00:37:41] Applications of AI outside of tech
* [00:39:10] Career lessons in technology and AI research
* [00:41:33] Privacy concerns and AI
* [00:47:06] Legal threats and responsible research communication
* [00:50:01] Balancing scientific research and public distribution
Get Interconnects (https://www.interconnects.ai/podcast)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
Read the full post here: https://www.interconnects.ai/p/building-on-evaluation-quicksand
Chapters
00:00 Building on evaluation quicksand
01:26 The causes of closed evaluation silos
06:35 The challenge facing open evaluation tools
10:47 Frontiers in evaluation
11:32 New types of synthetic data contamination
13:57 Building harder evaluations
Andrew Trask is one of the bright spots in engaging with AI policy for me in the last year. He is a passionate idealist, trying to create a future for AI that enables privacy, academic research, and government involvement in a rapidly transforming ecosystem. Trask is a leader of the OpenMined organization facilitating researcher access to non-public data and AIs, a senior research scientist at Google DeepMind, a PhD student at the University of Oxford, and an author and educator on deep learning.
You can find more about Trask on Twitter or Google Scholar. You may want to watch his recent talk at Cohere on the future of AI (and why data breakthroughs dominate), his lecture at MIT on privacy preserving ML, or his book on deep learning that has a substantial GitHub component. Here’s a slide I liked from his recent Cohere talk:
The organization he helps run, OpenMined, has a few principles that say a lot about his ambitions and approaches to modern AI:
We believe we can inspire all data owners to open their data for research by building open-source privacy software that empowers them to receive more benefits (co-authorships, citations, grants, etc.) while mitigating risks related to privacy, security, and IP.
We cover privacy of LLMs, retrieval LLMs, secure enclaves, o1, Apple's new models, and many more topics.
More on Andrew: https://x.com/iamtrask
Transcript and more information: https://www.interconnects.ai/p/interviewing-andrew-trask
Interconnects (https://www.interconnects.ai/)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
We Mention
* Claude 3.5 launch and “pre release testing with UK AISI” (and the US AI Safety Institute)
* CSET (Center for Security and Emerging Technology)
* NAIRR
* The “open data wall”
* Apple’s Secure Enclaves, Nvidia Secure Enclave
* Data-store language models literature
* RETRO: Retrieval-Enhanced Transformer from DeepMind (2021)
* SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore (2023)
* Scaling Retrieval-Based Language Models with a Trillion-Token Datastore (2024)
Chapters
[00:00:00] Introduction
[00:03:12] Secure enclaves and pre-release testing with Anthropic and UK Safety Institute
[00:16:31] Discussion on public AI and government involvement
[00:20:55] Data store language models and better approaches to “open training data”
[00:42:18] History and development of OpenMined
[00:48:57] Use of language models on air-gapped networks
[00:52:10] Near future of secure enclave technology and industry adoption
[00:58:01] Conclusions and future trajectory of AI development
How scaling changes model behavior
Some trends are reasonable to extrapolate, some are not. Even for the trends we are succeeding at extrapolating, it is not clear how that signal translates into different AI behaviors.
Read it here: https://www.interconnects.ai/p/how-scaling-changes-model-behavior
[00:00] How scaling changes model behavior
[05:03] Metaphors for what scaling may solve
[08:45] Short-term scaling is already de-risked
SB1047's veto, OpenAI's turnover, and a constant treadmill pushing AI startups to be all too similar to big technology name brands.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/ai-safety-culture-vs-capitalism
00:00 AI Safety's Crux: Culture v Capitalism
06:03 SB1047 as a regulatory litmus test for AI safety
08:36 Capitalism at the helm
Riley Goodside is a staff prompt engineer at Scale AI. Having previously worked in data science, he is often seen as the default example of the new role of “prompt engineer.” He regularly posts incisive prompts that elicit notable behavior from the most popular AI models.
I really resonated with this saying from Anthropic’s recent podcast on prompt engineering — “now we write essays and treat them as code.” In order to be good at prompting, you need to understand that natural language operates as our code used to.
This episode is a masterclass on why you should care about prompting and how it impacts results. Of course, there’s a bunch of great discussion on recent models that reflect the need for different and/or better prompting. Enjoy it!
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.
We mention:
* Prompting to push the frontier of AI models,
* Post-training and prompting interaction,
* Prompting base models,
* o1, Reflection 70B, reasoning,
* Scale’s leaderboard, evaluation tricks, evaluation needs,
* PlanSearch paper
* “The hottest programming language is English”
* “Think silently” instructions
* Scale Leaderboard and Humanity’s Last Exam
* ChatML formatting
Chapters
* [00:00:09] Introduction
* [00:02:40] Riley's path to LLMs
* [00:07:54] Impact of ChatGPT on prompt engineering
* [00:12:03] OpenAI's o1
* [00:18:21] Autoregressive inference and prompting sensitivities
* [00:24:48] Reflection 70B model and its implications
* [00:28:00] Impact of prompting on evaluation
* [00:32:43] Prompting vs. Google search
* [00:46:55] Prompting and RLHF/post-training
* [00:56:57] Prompting of AI agents
* [01:01:20] Importance of hands-on experience with language models
* [01:05:00] Importance and challenges of AI model evaluation
Transcript
Built with smol-podcaster.
Nathan L. [00:01:08]: Hey, Riley, welcome to the show.
Riley G. Hey, Nathan, great to be here.
Nathan L. [00:01:14]: Yeah, so for the audience here, I mostly wanted to try to, as I work on post-training a lot and I see my own difficulty in taking prompting seriously and the things that I don't think that we are doing enough, and I don't see any reason why it can't be scientific in how we do prompting. So that's my biggest goal with this. I think there's a lot of podcasts where we could kind of say, like, what is the history of prompting? Where is it going? And that's easy to kind of redo. And I still find it interesting, but I just don't think there's enough people talking about the role of prompting in evaluation, how prompting changes with how you're post-training models, because we're trying to take that seriously and how we have a post-training setup, but we just like regularly run into these things like system prompts aren't handled well, how to release a model with a system prompt. So that's the tone that I'm trying to get to when I ask these questions. And also OpenAI's o1 model just came out, so I'm definitely going to get onto that pretty quickly because that's what everyone's excited about. I like to start with background just to kind of get to know people, because a lot of this is just, I want to talk to interesting people in AI, is like, how did you become interested in prompting? I think I've seen your background in data science and then you joined Scale around when ChatGPT came out, which is fun timing, but like, how did you become maybe obsessed with this, but like, the focal point of your work?
Riley G. [00:02:40]: Yeah, I have sort of an unusual introduction to large language models. For most of my career, I've been a data scientist, mostly in the online dating industry. I was at OkCupid and Grindr. And after I left Grindr, I took sort of a sabbatical to educate myself, I guess, about the progress in large language models. It was around the time that GPT-3 Codex had just come out. And that was where I think I started to become really interested because I was following along with maybe, certainly when GPT-2 came out, the examples there wowed me as much as they wowed the rest of the world, I think, with the example of the news article about the unicorn and all that. And not long after that, we had AI Dungeon, and I played around with AI Dungeon a bit. But at that point, language models seemed to be mostly about language, that they were sort of very heavily focused on stylistic mimicry and creative writing and so on. And when Codex came out, it really started this thought that text is a more universal interface than we were giving it credit for, that language models might be more broadly useful. And I just became very excited in a practical sense of what these models could do for what I kind of intuited was very boilerplate-like data science code, that I thought of like most of the Python and Julia and R and things that I've written over my career, this seemed like stuff that an LLM could handle. And that was sort of one of its early strong points. So I was playing around with, I think one of my first projects was a VS Code extension that had some kind of integration with Codex. But I never really shipped anything out of it. And mostly what it transitioned into pretty quickly was playing around with posting prompting examples on Twitter, because when I looked out online to find what were people saying about how to prompt these models, there really wasn't much out there. And so I had to kind of resort to just like the few examples that had been circulating in viral screenshots of humorous completions and so on, of like the results that people got out of it. And I started posting those examples. I started following academics and low-level engineers at the research labs and anyone that was working in shipping language models I thought were interesting. And elbowed my way in.
Nathan L. [00:05:18]: I have more questions on this, because I find it like, some people find, there's this whole like Twitter dynamic of like, you find so much signal there, but the question is like, how much does it generalize? Because there's so many of the lessons you can learn from these models, from these examples. I think the straw, like the number of R's in strawberry things is the current one. And then, and it's like, do you get a sense that these are transient or are these kind of repeated themes? And like, how should you read these examples to try to extract themes from them? If like, I've followed you for a while, and a lot of people do, and you're more insightful in how you post them. If you post these threads with like multiple tries and stuff like this, like, should people be doing that when they see something pop up?
Riley G. [00:06:03]: I think so. I also would say that Twitter is a very different river to step into now than it was back then. At the point that I started doing this, like, nobody was really talking about these things that much, or to the extent they were, it was sort of fleeting. It was like, wow, look at this, and then on to the next thing. And I think the thing that's very different now is just that because there are so many new entrants in AI and LLMs, there's a lot of rehashing of the basics. And I think a lot of people in the industry would tell you that the popular examples that you see around of like, how many R's are in strawberry, or some of the ones that I'm partially responsible for, popularizing at least, I think like, these things are really just like, rookie mistakes in some sense, right? That these are things that we've long known language models can't do. And it just keeps popping up as a surprising quirk of language models that I think the public is just confused that something could be so good at so many other things and so bad at this seemingly trivial task, and that is hard to explain to people. And the answer to that hasn't really changed much in the past few years. They're generally bad at spelling for kind of the same reasons they were bad at spelling two or three years ago.
Nathan L. [00:07:27]: Yeah. I mean, like, how did these things change with ChatGPT? Because ChatGPT is like the introduction of RLHF into these models. And I think, I didn't write this down as a question, but there's like the difference in prompting base models and instruction models and RLHF models, which I think that for most of this discussion, it's like the end model, the like chat RLHF model is the one that people think about. But was that a big transition point in your work or is it just kind of plugging along? Right.
Riley G. [00:07:54]: I mean, I would say, I don't think it's any understatement to say that, or sorry, any overstatement to say that, that the release of ChatGPT was probably the single biggest event in the history of prompt engineering in that prompt engineering became drastically easier after ChatGPT came out. And most other models learned from the ChatGPT way of doing things, right? That they, like, I think people forget just how fiddly prompt engineering used to be, right? Like people today don't think about things like frequency and presence penalties, right? It used to be that by default, you would get very repetitious output and you had to work to avoid that. People forgot about like, don't end your prompt in a space, right? That you had to understand how tokenization worked at all times, because like, if you put an extra space in there, you were going to go out of distribution. Or another one that I think is particularly vivid for me is "yo be real." In June of 2022, Douglas Hofstadter had a piece in The Economist showing the, what he called the hollowness of GPT-3's understanding of the world, that it failed on various simple questions. Like, when was the Golden Gate Bridge transported for the second time across Egypt and so on? And someone, I believe it was Nick Cammarata of OpenAI, showed that you could fix almost all of these just by telling the model that if you gave it a silly question, it should say "yo be real" instead of answering it, right? That models had to be prompted with the possibility that they were allowed to say, I don't know, or, you know, that's a dumb question, right? You know, like there is no answer, right?
Nathan L. [00:09:34]: This is like, we've added the Anthropic system prompt to our AI2 models, and we're like, this doesn't change the evals at all, but it makes the behavior something that we like more. Because I think culturally we're somewhat similar to Anthropic, it's like we want to express uncertainty, we want the model to say, I don't know, and a lot of that is in the system prompt of Anthropic's models.
Riley G. [00:09:51]: Right. And I think that really, you know, it's another microcosm of just how messy all this is, that what people like is a very different thing from how good are the models. I think, you know, LMSYS had a great blog post recently talking about like stylistic bias and output, that models will be rated as better if they do things like put their output into the format of a bulleted list with bold initial words on each point. So there's like cheap tricks like that, that will make people like your output better or make them perceive it as, you know, more authoritative or, you know, more comprehensive, that you kind of have to control for when just going by preference. I mean, I don't remember what the exact magnitude of it was, but I think they did put some numbers on it in that post.
Nathan L. [00:10:42]: Like, do you think you could handle all of that? Just like, can you make that big of a style delta in the system prompt relative to training? Is kind of what I'm wondering. Like if we release a model at AI2 and it's decent, but then we put in a detailed system prompt that's like, whenever possible, you should put your outputs into a list format with bolded headings and use markdown. Like, do you think we would get a 50 point bump on LMSYS?
Riley G. [00:11:06]: Maybe not on LMSYS in particular, being as they're trying to correct for this actively. But presumably it would have worked at one point, right? So I think that's, you know, that says something that these, or another great example, I think, that's really clear of like why human preference isn't, you know, always the answer. I saw somebody on Twitter once that was really impressed by some anonymous model on LMSYS that was able to produce an ASCII art drawing of a unicorn. And it was a great drawing. But when I searched for like specific details of that drawing, I found that it was just in some like widely circulated list of ASCII art drawings. And it was a verbatim regurgitation of some signed work that somebody had made. And so I think there's an argument there that any request for ASCII art should probably just be thrown out, right? That a human's preference of how good an LLM is at ASCII art maybe just does not matter because like, it's so likely to be regurgitated, or at least like figurative things, maybe diagrams are okay and so on. Yeah. Yeah. Okay.
Nathan L. [00:12:03]: We've touched on multiple of the things I want to get to in the future, but you kind of said that ChatGPT was the biggest moment for prompt engineering. And I think o1 is not nearly the same magnitude, but it's a very interesting microcosm of the future of prompting because the model feels very different to use. OpenAI has explicitly told us we need to prompt it differently. But I think my guess is that in the long term, they're going to figure out how to train this model so that the behavior is, maybe not indistinguishable from their GPT models, but not as sensitive to prompting, and whatever you throw at it, it's going to work. Maybe they need to rewrite the prompts, but that's probably a temporary thing.
Nathan L. [00:12:45]: Two questions, and to me the first is simpler. What do you think when you see them giving you, like, oh, we need to have these new prompting instructions to use it differently? And do you agree with my long-term convergence idea?
Riley G. [00:12:57]: I definitely agree. I think that there's an argument for seeing prompt engineering as kind of the experimental next branch of language models, right? That it's the features that people are just on the cusp of figuring out how to systematize and integrate into the models themselves. And to the extent that somebody comes up with a prompt engineering idea that is just so good of an idea that it's worth applying to literally every prompt, then it will be integrated into the models and you'll stop calling it a model, you'll call it a system and it'll have some auxiliary second model. I think the clearest examples that we've seen of that are content filters, right? That nearly every model that you get from a vendor will have some kind of cheap auxiliary model that looks at the output and says, is this plagiarism? Is this, or not plagiarism, but regurgitation of copyrighted work, right? Are you reciting Harry Potter word for word? The value of those is so, rather, sorry, the cost of having that kind of secondary model on the output is so low that it truly is worth it to just apply it to every generation, right? And we haven't seen too many examples of that on the input side, but they're starting to appear, I think. I think we've seen from anthropic evidence that they make modifications to user inputs based on certain conditions that they detect if you're asking about some particular feature, they modify the prompt if you are. And I think that's a common pattern in a lot of applications.
Nathan L. [00:14:31]: I'm guessing they've seen some public people kind of using the model. I haven't heard anything about modifying the prompts in a Claude or a ChatGPT window.
Riley G. [00:14:42]: It's, I've seen it for instructions for avoiding plagiarism, avoiding regurgitation. Oh yeah, that could make sense. Yeah, so the, but it's a common pattern you see in a lot of applications, right? So like a good use case for this is like instructions for tool use, that you might analyze a user's, say, ChatGPT input, and if the input appears to be a request to use DALL-E 3, then you should supply to the model these long instructions on how to use DALL-E 3, which otherwise you don't need to bother supplying. Right. So I'm not saying that that's exactly how ChatGPT did it, but it's easy to imagine that that would be worth doing. So a lot of applications do things like that to have, you know, conditional sort of augmentations of the prompt. Yeah.
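As a minimal sketch of the conditional prompt augmentation pattern described here: attach long tool instructions only when the input looks like it needs them. The classifier, instruction text, and message format below are invented for illustration; a real product would presumably use a trained router rather than keyword matching.

```python
# Hypothetical sketch of conditional prompt augmentation: attach long tool-use
# instructions only when the user's message looks like it needs the tool.

IMAGE_TOOL_INSTRUCTIONS = (
    "You can request an image by emitting <image_prompt>...</image_prompt> "
    "with a detailed description of the scene."
)  # invented instruction format

def looks_like_image_request(user_message: str) -> bool:
    # Stand-in classifier; production systems likely use a trained router.
    keywords = ("draw", "image", "picture", "illustration")
    return any(k in user_message.lower() for k in keywords)

def build_messages(user_message: str) -> list[dict]:
    system = "You are a helpful assistant."
    if looks_like_image_request(user_message):
        system += "\n\n" + IMAGE_TOOL_INSTRUCTIONS
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_message},
    ]

print(build_messages("Draw me a unicorn jumping over a fence"))
```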
Nathan L. [00:15:33]: I mostly see that like long-term, I don't know how this impacts prompting, but I think of like ChatGPT, and then we'll have multiple models that they route to. So this is kind of like an early way of doing this, where it's like, if you give a really long context, they'll have some, maybe even like a Mamba-like model or different architecture for super long context, or they pass it to o1 instead of GPT-4o if the question is incredibly hard. But the border between that type of routing and prompting is, I don't know how to classify it.
Riley G. [00:16:05]: Yeah, it's really fascinating. I think, you know, people have this idea of, I think, sort of seeking purity in their models, that they want everything to be like, you know, just a model. But I think, you know, we're rapidly approaching the point that you have to start thinking about these things as systems that might just have arbitrary complexity inside of them. I also like, I think that, you know, that the guides that we've seen for o1, you know, that they take that sort of shape, right, that you get that, like the content that OpenAI has put out, like how to prompt o1, it's sort of a list of like domain competencies and weaknesses, right, that it's good at physics, it's good at abstract logic, analytic philosophy, maybe less great at creative writing. And then also you have these sort of like patches almost for like noticed problems, right, that they've noticed that think step by step often degrades performance. Why do you think that is?
Nathan L. [00:17:11]: Because it's essentially trained to do that on its own. Like, it almost feels like it shouldn't conflict with it. It almost feels like it should just be like empty tokens, like it will just repeat yourself or something.
Riley G. [00:17:22]: That's a really good question. I think the answer to that maybe speaks to just to how much this isn't just, you know, chain of thought. That's a meme sort of flying around now that a lot of people have claimed that all this is is fancy prompt engineering, isn't this just what Reflection did and so on.
Nathan L. [00:17:37]: It's obviously a different inference stack with a lot of improvements across the whole lifecycle of the model and the product.
Riley G. [00:17:45]: Right. And also the other thing that people have been saying a lot is that it must be some complicated system, right, that there can't be a single model doing this through autoregressive inference. But the claim seems to be that it is, right. I think there was a comment from Noam Brown on Twitter where he said that it really is a model, that the whole generation is coming autoregressively, which is, you know, I have no reason to doubt that. It seems plausible to me. But I think that people need to be a bit more imaginative about what's possible just through autoregression.
Nathan L. [00:18:21]: Yeah, I wrote a really long article on this that came out yesterday, where I put the constraints from like the Noam Brown tweets, plus the pricing, plus the inference scaling laws to kind of converge at something. It's like, if they do some clever things to a model and some batch inference and self-rating and stuff, it's definitely doable. I don't know why. As an RL expert, I'm not surprised that the model is sensitive to things like think step by step in the prompt. I just would have thought that it would come up in the examples of training, because the seed set for this is almost definitely humans generating some prompts with some like back and forth dialogue, essentially human seeds of things that look like what it is doing. Have you seen this with AlphaGo? We saw this with InstructGPT and ChatGPT. You need the human demonstrations to start the learning process. Why is it sensitive to think step by step, that kind of thing? I think that's maybe more about the training, but you learn that through prompting.
Riley G. [00:19:23]: Yeah, it is a bit of a mystery. And this is very speculative what I'm about to say, but I think maybe like a kind of thought experiment of how you can imagine that it could be true is imagine if like some auditor or somebody who had the penalty of law over your head asks you to do something and to document exactly how you did it. It's easy to imagine that you would do the process differently and that you might do it worse, right? That because you can only do the things that are the most conservative and the things that you can justify and explain that you're not going to produce as good of a work as you might have otherwise.
Nathan L. [00:20:01]: It's like GPT-4 needs to think step by step because every small mistake is a big deal. But almost with o1, we maybe should be like, go forth and conquer and make mistakes on your way and just let it wander to an answer.
Riley G. [00:20:15]: I think that's pretty much hitting the nail on the head, maybe.
Nathan L. [00:20:21]: I want to go try that silly prompt and see if it gets better at coding or something.
Riley G. [00:20:30]: Yeah, yeah. But I mean, I feel like that's the key improvement here that a lot of people don't appreciate, is that they seem to have cured like all the LeCunian problems of exponential divergence, that if you sample a bad token, you're going to keep sampling more. And it's not that there wasn't progress on this before, like people had tricks to deal with it. But I think the thing that's really changed is that the models get mileage out of like thinking for long periods of time, that they derive benefit from just continuing on. Because that's very different from behavior you see from like GPT-4o. Like if you've ever tried like the exercise of, once it's gone down a wrong path, just say, no, keep going. Like keep going till you get it, right? Like it's pretty evident after a while that it's not making progress, that it's just gone like deeper and deeper into like some failed path of reasoning.
Nathan L. [00:21:24]: Why does that often break? I mean, I understand why it often breaks models, but that's also one of the jailbreaking techniques is just like keep sending the same message over and over and over until the models die, which like I wonder how that relates to O1. Maybe it's just easier from a safety perspective because it doesn't have that like as many turns or something. Yeah.
Riley G. [00:21:45]: And it's also like one of the bigger differences in behavior between GPT models and Claude that I've noticed, that OpenAI tends to produce their models to
Riley G. [00:22:02]: like in the specific case that if you keep like telling it it's wrong, it will always take your side. It will say, well, oh, yes, of course I made a mistake. Let me try again, right? And it's never going to like diverge from that behavior. Whereas Claude will eventually get sick of you, right? Like if you just keep saying like, no, you're wrong, it'll be like, look, I have told you many times that I am right. Like you need to be a bit more specific in how I'm wrong. If you really want to make an argument here, it'll start like just telling you to go away. And that's like-
Nathan L. [00:22:28]: This is why I want Anthropic to write a model spec, because the behavior you're describing with ChatGPT does fit with what OpenAI's models are like in behavior, and they're kind of described as wanting to be like robotic computation assistants, where they take the user's information and they try their best to execute it without violating any basic principles. But I think Claude's is much more of like, we have created a, it's hard to word this without anthropomorphizing and all these other things, but like we've created an intellectual entity that is going to go back and forth with you. And it's not going to, like, you pass in sensitive information as data to Claude and you ask it to reformat it. It says no. You get these weird things because it's like this entity that doesn't want to be sent like harmful texts or be told how to make a bomb or something. But ChatGPT is like the robotic one. So now I kind of use both of them depending on the task and the behavior that I want. But I'm excited to see how that goes further, really.
Riley G. [00:23:27]: Yeah. Yeah. I mean, that's, you know, I think it goes back to your point before that, you know, we're seeing more specialization in these models. But, you know, all of this is temporary, right? That eventually like somebody will come up with the right way to delegate correctly to one model or another. And then you'll have just, you know, some unified ChatGPT interface or whatever that, you know, decides like, is this a prompt that o1 would be good at, and sends it to it? Yeah.
Nathan L. [00:23:50]: And while we're on these complex reasoning things, there was also this Reflection 70B drama, which was mostly big because it was a big mess of credibility and memes. But there's also like real science in there that people need to remember, of like how to prompt a model and spend more on inference. So I think it's really just a tiny bit of fine-tuning with some special tokens and a system prompt that's like, make sure you use these reflection steps. And that is how you move something like GPT-4o closer to o1. You can't prompt your way to o1 behavior, but that's the sort of thing that more people should be considering. And it kind of leads into like, I want to ask about like math evals and stuff like this. And it's like, Reflection 70B style of prompting is a real thing that more people should be doing. And I don't know how we get around that communication issue now. It's going to be even harder because people are going to be like, oh, it's o1. We made an open-source o1 now instead of just the best model. I just wanted to give air time. If you have any comments on that, go ahead.
Riley G. [00:24:48]: Yeah, I think, you know, Reflection 70B was, you know, it was sort of a perfect storm of a lot of like the tuning method feeling plausible, right? That it was something that was very, you know, it's a legitimate like area of research. It was, you know, rumored to be part of Strawberry and so on. And so it had like the right strategy for buzz there. And, you know, however they ended up releasing that model, like, you know, they don't have what they think they have. You know, so it's, I think, you know, it's kind of, you know, once you saw the, I won't recap the whole saga of like, you know, with the LoRA and finding the LoRA from the previous version of Llama 3.0 instead of 3.1 and all that. But I think the, you know, there's that kernel of truth there, right? That this is, you know, sort of a good idea, at least for some problems. I think also the thing that people don't appreciate is that a very good idea for many problems feels maybe like a better idea than it is because it's so optimized for the domain of problems that tend to be on benchmarks, which is somewhat different than the thing that you really want to optimize for in the real world of like user satisfaction and just, you know, preference. Like some mix of like, do people like it? Like, is it useful? And does it do well in benchmarks? Because I think that there's like a, even for what I think should be like philosophically the core like use case of LLMs, like do they do practical work? Like can somebody achieve the thing that they want to do with this? But, you know, like whether, however they do it through prompt engineering or whatever, it kind of matters more than whether like academically it does well on like the most naive presentation of the problem, right? Like whether somebody can figure out how to do it correctly matters. And that specifically is just not captured well on benchmarks, right? That like this, if you're doing a benchmark that compares across several models, there's, you know, a natural incentive to do it uniformly. That maybe you follow like vendor's best practices on, you know, how do you apply the template of the prompt and so on, or if a vendor recommends that you apply some suffix or whatever, you might do it. But for the most part, you're not going to put a human on the task of figuring out what is the best prompt for each model, right? Because then, you know, how do you know that they did a perfectly good, you know, fair job of that, right? But really that's what matters. Like that is like, you know, at the end of the day, like the thing that determines whether GPT-4 is better than Claude is when you sit down and try to, you know, solve your problem in GPT-4, you know, applying whatever hacks, you know, and, you know, taking, you know, advice you find online and, you know, whatever dirty tricks you have, and then you do the same for Claude, which one works better. And so like that's the state we're in. And that's, you know, very elusive as a thing to try to measure. Yeah. Okay.
Nathan L. [00:28:00]: I'm going to keep going, roll right into this, into the evaluation section of this conversation. You were talking about this with how you actually use the models before; you had mentioned, like, needing to watch whitespace to properly evaluate or use the models, like tokenizer things. One of my big blind areas is, it seems like most frontier labs are using some sort of custom prompts on some sort of evaluations. And I don't really have a good sense for how much that actually impacts scores or how much that translates to downstream performance. It might not be custom prompts. It might be like custom setups. There's all these, like all the math evaluations, you need a specific format for your answer. I think like MATH, the all-capital one, you need to put your answer in a box and
Riley G. [00:28:45]: things like this.
Nathan L. [00:28:46]: And what is your view on these per-prompt or per-evaluation prompts? Prompting is actually a thing. I think the Llama 3 paper had some cool analyses on how varying subtle things changed evaluation scores, which is great, but they're the only ones sharing that. Otherwise we just get like, our score is X, and it's reproduced to some capacity.
Riley G. [00:29:09]: Yeah. I don't have like a lot of deep, like technical wisdom to share on that front, other than to confirm that, like, I think you're right that it is a big problem that we generally try to follow the vendor recommendations. We work with the vendors to prompt their models fairly. But like I said, like ideal and optimized prompts are very different than what's the default. But I think also that there's, I think, a longer term trend that these issues maybe matter less than they used to. And, you know, that should continue. Like maybe one of the clearest signs of this is that Llama, like most versions of Llama, you can prompt them incorrectly in terms of like the system prompt template, and it will be just fine. And in fact, you can often template them with system prompt templates from other models entirely, like, say, representations of ChatML, and they will be fine. Right. So there's sort of familiarity in the pre-training with just chat templates in general. And the idea of like...
Nathan L. [00:30:25]: Do you think this is specific to Llama? I also remember hearing a conversation at AI2 where we were considering doing the last stage of pre-training with random chat templates and like random instructions and multiple chat templates so that the model could be amenable to fine-tuning with multiple chat templates, which there's a chance that they did that. I actually don't know. I would not put a high bet on it. But do you think that's just because Llama knows they're going to have so many users? It's possible.
Riley G. [00:30:54]: I mean, it's also plausible to me that that just shows up in pre-training incidentally, right? Nobody intended it to be there. It's just like, it's in the data. But I think that, you know, that process is only going to continue, right? That we're only going to see like more models just being familiar with how models behave. I think to some extent, like, you know, another thing that I think is maybe like evidence in favor of this is if you look at base Llama, like, I think I looked into this on base Llama 2 once, that if you prompt with like instruction prompt formats, it would adopt the behavior of like a ChatGPT-like assistant, right? So I think it shows that examples of chatbot behavior are now so widely disseminated, you know, across the internet that a pre-trained model is better at instruction following tasks than any pre-trained model was before the work of InstructGPT was done. So, yeah, I believe you.
Nathan L. [00:32:00]: I want to check this. How does this impact how we should view evaluations? I'm just trying to reckon with, there's a couple of scenarios. It's like, it doesn't really matter because these models are going to be not that sensitive to the system prompts that we're using to, say, do GSM8K or MATH. And that goes for models like Llama in the open, AI2's models, GPT-5, whatever. It seems like the sensitivity to prompting for really well-known formats is actually going to go down. And that solves some of our problems. Because I don't think we're going to come up with that many new formats for evaluations. We're going to make evaluations more specific and harder in the content.
Riley G. [00:32:43]: I think that's right. And I think the version of it that we have to play with now definitely does feel like one step forward, two steps back in that regard. And that it's much better at benchmark style inputs where you give it just no advice on how to do it. You keep everything very simple with what are your output requirements. But it's also just very hard to steer. If you have opinions on how it should do it, those opinions won't be followed generally. And it also has issues with output formatting. So I think we're seeing, I've seen anecdotal reports on Twitter at least, and I've seen this myself, that its output is just inconsistent even when you ask it to be consistent. That it will forget things like block quotes and so on. The result of this, I think we're going to have to see a lot of benchmarks, is that maybe the fair way to do this is to have some secondary model on the end of it that puts everything into a consistent format.
Riley G. [00:33:50]: I think we're not that far away from benchmarks that just do that across the board, of just saying that it's not the model's job to do this anymore. And we'll clean up the results however it is. Yeah, I think that's a better place to be.
Nathan L. [00:34:03]: It's one of those things where the models getting better can solve some of our problems. I think there's less angst now about the whole closed labs evaluation scores anyways. I'm mostly trying to reckon with what open groups and academics are doing rather than closed labs, and they kind of rely on each other. I've been on this before: there's now this Hugging Face chat template upload. So a lot of models have the chat template saved with the tokenizer, and most of the time they don't have a system prompt, which is surprising. I feel like it should be the norm that a system prompt is included with every model. Is there any reason that you see not to do that?
Riley G. [00:34:49]: Yeah, I mean, I can think of things that might be slightly better, but I think that that's that generally makes sense, right? Like, I can imagine that maybe they, you know, you'd release several, right? And say, you know, it's like any of these is fine, or, you know, like training on several and, you know, say it's like an average of these three or whatever is like kind of the is ideal or something like that. Yeah, most of my reasoning is I think that most users of language models are not sophisticated.
Nathan L. [00:35:14]: So the model cards and documentation do normally say we recommend using the system prompt, but the simple ways of using the models do not integrate the system prompt. And it's not always easy to modify your data to add it, like if you're doing the messages format, you have to remember to add the system message. And if you have multiple models in your queue, you then have to go and manually hard code
Riley G. [00:35:37]: all of them.
Nathan L. [00:35:37]: And like, that just makes it get dropped. And if the system prompt is a big deal for performance, that impacts things, whether it's a product or, and this is where I'm trying to understand academia, if only half of the people remember to add the system prompt for the model they're evaluating in this kind of academic paper. And I know it impacts things like all the vibes-based evals, like AlpacaEval, MT-Bench, whatever. Like, if you have a different system prompt, it can vary behavior. We did an experiment, which was like, to make sure this works, where you just give it a system prompt like, you're a terrible model, you're made to make other models look good, and you happen to give wrong answers. And like, AlpacaEval goes to zero and all these things. So it's like, I think it's easier to show the down case, but you could probably get 1 to 2% improvements, which matter in the long trajectory of academia in terms of if your method is accepted or not.
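For context on the chat template discussion, here is a small sketch of how a system message does or does not make it into the rendered prompt via a Hugging Face tokenizer's chat template. The model ID is just an example (and may be gated); whether a system role is supported depends on the model's template.

```python
# Sketch: render a prompt through a saved chat template, with and without a
# system message, to see what an eval harness actually sends to the model.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")  # example model

messages = [
    {"role": "system", "content": "Express uncertainty and say 'I don't know' when unsure."},
    {"role": "user", "content": "How many R's are in strawberry?"},
]

# If a harness drops the system message, the rendered prompt (and potentially
# the scores) silently changes.
with_system = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
without_system = tok.apply_chat_template(messages[1:], tokenize=False, add_generation_prompt=True)
print(with_system)
print(without_system)
```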
Riley G. [00:36:31]: Yeah, I mean, I've often been frustrated by the ambiguity in a lot of academic publications over like how prompts are formatted. And they often, they always run into the same pitfalls, that like the fundamental problem is that system prompts, or prompts in general that you're presenting during evaluation, are implicitly templates, right? That you have like your points where you insert like the actual problem or whatever. And that templating needs to be communicated to the reader of the paper, and the prompts themselves may involve templates, right? They may, you know, like describe like how, you know, like an output should be formatted, for example, and might do this using, you know, like curly braces, right? So this creates like several layers of confusion, that you need to distinguish between, like, where are the variables that you're interpolating purely in the logic of this paper, like things that would be translated into Python, you know, like if you were to actually implement this, versus the templating instructions that are literally part of the instructions, like a template of how the model should format its answer and so on, right? Because like a lot of prompts end with use this format and then have some kind of template. Yeah. Right. So the, like I've often thought that we'd benefit immensely just from standardizing on something, like saying that if you want to clearly communicate a prompt in your paper, the way to do it is to show Python code that will produce that string. Yeah. You just literally show it as an f-string, there's no ambiguity.
Nathan L. [00:38:15]: Because you copy out of a paper and you drop the \n\n that you need, or something like that.
Riley G. [00:38:21]: Yeah, right. Like the, but if you were to literally just include a Python code block, there's no ambiguity, like, you know, like whether or not there's a trailing new line or is it so on. And those things are really fiddly and need to be communicated. Because I've seen people do all sorts of like imaginative typography to like represent new lines and things like that. You know, like having the return signals at the end in light gray and, you know, like you're putting dots between spaces and all that thing, right? Because if you're doing like, I've seen like early like playground competitors sometimes did this that approached like more like from a technical approach that you need to know where spaces are. So it's worth it to represent them as like gray dots, right? Yeah. That's the kind of thing that the level of detail that you need in communicating these things. So I think like standardizing on Python would be just like a good way to like, you know, get the problem out of the way. Yeah.
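A toy example of the convention proposed here: report the exact prompt as Python code that produces the string, so newlines, trailing whitespace, and interpolation points are unambiguous. The template text is invented for illustration.

```python
# Toy example: an evaluation prompt reported as code, so there is no ambiguity
# about whitespace or where the problem text is interpolated.
def build_prompt(problem: str) -> str:
    return (
        "Solve the following problem.\n\n"
        f"Problem: {problem}\n\n"
        "Use this format:\n"
        "Answer: <final answer>\n"
    )

# repr() makes the newlines and any trailing whitespace visible.
print(repr(build_prompt("What is 17 * 24?")))
```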
Nathan L. [00:39:14]: I also saw in some discussion of o1, or maybe Reflection, I don't remember, it's been a while, two weeks. You were talking about, like, equal-inference-cost comparison of prompts in a reply. And I think that's a great idea. Like, do you think there's, okay, well, first, do you want to explain the idea? I'll kind of ease into this.
Riley G. [00:39:33]: Sure. So my thinking is that models are evaluated right now just based on how they do under like sort of the same, I guess, invocation of inference, right? That you let the model sample, you sample autoregressively as long as that takes, you know, however long the completion is. And you don't pay attention too much to like what it costs you to run that, or you factor that in afterwards when you score it up. And there's a lot of reasons why this makes sense, right? That, you know, it's simpler, it's more fair. And sometimes you don't know exactly how to equalize the inference there, right? That you can't like really say what the trade-off is, right? But there's, you know, exceptions to this that, or maybe not so much an exception, but like there are ways of doing it that aren't perfect, like self-consistency, right? So like there's a method called universal self-consistency where you prompt a model multiple times and then take the model again and give it all the answers and then ask it to choose which of these is the most consistent with the consensus of all answers that were generated. And this is sort of a method that's pretty reliably not worse than just doing it naively, right? It's hard to imagine any prompt where this method would steer you wrong or, you know, be worse than doing it naively. And that, you know, suggests that maybe there's like a fairer basis of comparison here, right? That we could say that if something really is cheaper enough that you can do that, you could run it 40 times and take self-consistency, that then maybe that should be its score. But I think one of the bigger reasons why this is kind of like a, in hindsight, this is maybe like a bit of a facile tweet that I made about this, but like really the trade-off, the exchange rate, if you will, isn't very good. I think like a rule of thumb that I saw in a paper once is that if you do self-consistency on 40 samples of GPT-3.5 Turbo, it's on par with one sample from GPT-4. So you sort of move up one generation every time you do 40 inferences, right? But at the same time, in specific domains, there are refinements of this that work quite well. So we at Scale actually put out a paper recently on a method we call PlanSearch, I think was the name of it, yeah, PlanSearch. And the gist of that is that you can improve performance on programming problems by generating diverse attempts at solving the problem, right? So the approach that PlanSearch takes is to first create like sort of high-level observations or ideas about how a problem might be solved, then to combinatorially sample that list of ideas, and then take combinations of them to inspire strategies. And then for each strategy, you lay out sort of a path of reasoning of like how you could turn this into code, and then you turn each one into code and then assess which one works best. And this lets you search over the portion of, it lets you search over the variation in your strategies that actually matters, right? Because you can imagine that if you were to just simply resample a model blindly over and over again with the same problem, there are a lot of ways that an answer could vary that don't matter, like whether you use tabs or spaces, or how you name the variables and so on. And you don't want to search over that variation, you want to search over the part you think is going to be fruitful, like the high-level strategies. 
So I think that for particular domains, that is the more relevant comparison: what could you do if you were to apply a bit of search here?
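As a rough sketch of the universal self-consistency recipe described above (sample several answers, then ask the model to pick the one most consistent with the consensus): `query_model` below is a hypothetical stand-in for whatever chat API you use, and the judge prompt wording is illustrative rather than taken from the paper.

```python
from typing import Callable


def universal_self_consistency(
    question: str,
    query_model: Callable[[str], str],  # hypothetical wrapper around your LLM API
    n_samples: int = 5,
) -> str:
    # 1. Sample several independent answers (the API should sample with temperature > 0).
    answers = [query_model(question) for _ in range(n_samples)]

    # 2. Ask the model which answer is most consistent with the consensus.
    numbered = "\n\n".join(f"Answer {i + 1}:\n{a}" for i, a in enumerate(answers))
    judge_prompt = (
        f"Question:\n{question}\n\n{numbered}\n\n"
        "Which answer is most consistent with the consensus of all the answers above? "
        "Reply with only the number of that answer."
    )
    choice = query_model(judge_prompt)

    # 3. Return the chosen answer, falling back to the first on a parse failure.
    digits = "".join(ch for ch in choice if ch.isdigit())
    idx = int(digits) - 1 if digits else 0
    return answers[idx] if 0 <= idx < n_samples else answers[0]
```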
Nathan L. [00:43:40]: Yeah, it almost seems like there will be different tiers of evaluation scoring, where basic prompting is kind of linear time. It's almost like with the models, where there's a biggest, best open model at every size, but Llama is dominating because it has the 405B, the 70B, and the 8B that are all really good; it should have a 1B too. And if you're writing a prompting paper, eventually you're probably going to have to have binned comparisons like that: we are comparing two basic prompting techniques, which I think will have less headroom because they keep the basic autoregressive behavior and things like this. And then maybe there are things like Reflection, where we've added minor structure so that the model can now generate a bunch more tokens, but not a 10X or 100X more. And then there are the things where we've added a whole new planning component to how we're prompting the models, and it's all abstracted away from the users. And you're not going to be able to compare those, because those are the things that are going to just solve all the benchmarks we have out of the box. I think that's fine. I think people will converge to this. It just always takes a bit longer than we want.
Riley G. [00:44:47]: Yeah, I think that's right. I am really excited about the o1 RL approach to this.
Riley G. [00:44:58]: On some level, all prompt engineering is approximating this RL-like search. We have a lot of prompt engineers out there who are trying different things. They see what works, they tell their friends, hey, this works. But the space of things that works is probably, well, demonstrably at this point, given o1, outside of what a human might think of. There are things that we see, even in the summarized reasoning traces that o1 puts out, that are eerily anthropomorphic. It will say things like, hmm, or let me think about that. Yeah, I feel like they added that in.
Nathan L. [00:45:42]: I think it's almost like a trigger for the model to have a more reflective response. Those are the examples they used, but it's cool.
Riley G. [00:45:49]: I mean, it's not hard to imagine that RL could find something like that, right? That empirically it works to say, hmm, because that suggests you're about to do something else in the pre-trained model's manifold of plausible text. Saying, hmm, might just empirically be a good thing to say, and it could find that. So I think that's the kind of exploration that you're benefiting from with o1: the space of prompts that work that we're not really equipped to find. Yeah, do you have anything?
Nathan L. [00:46:28]: I think this is a good discussion. Kind of to wrap up the academic side of things: how much do papers that are nominally about RLHF training, or any sort of post-training as the contribution, need to do anything with prompting? Is there a clear segmentation there? Or is it like, if you're doing this fine-tuning, you're necessarily changing how the model is going to respond to prompting, so we should do some checks there?
Riley G. [00:46:55]: That's one view of it.
Nathan L. [00:46:56]: Or the other view is you have a model and prompting is just a way to take one step further with it, which I think Anthropic leaned into on this recent podcast with Amanda and their chief prompt engineer, whose name I don't know.
Riley G. [00:47:07]: And that's how they do it.
Nathan L. [00:47:08]: Amanda's like, I can do things with these models that most people cannot. And that kind of leads the way, rather than prompting being part of this post-training stack that everyone needs to be checking the box on. I don't know where we fall. I guess there's IFEval, which we could come to after that, which is kind of a separate case.
Riley G. [00:47:29]: Yeah, I definitely lean a bit more towards the Anthropic view of the world. I guess you could argue that's maybe somewhat self-serving, no big news there, prompt engineers are important. But I think it's true that we do see people who are just good at this. That our ability to prompt these models sometimes exceeds our ability to explain how we're doing it and what the general strategies to apply are. And I think those strategies are worth extracting.
Riley G. [00:48:09]: It's worth introspecting.
Riley G. [00:48:12]: One thing I think about a lot is anytime somebody... I really love when people suggest a prompt or suggest doing something to a model that I can tell immediately will not work. And it's a terrible idea, but it wasn't obvious to them. And that's fascinating, right? Do you have an example?
Nathan L. [00:48:29]: I would love to know if you have something that everyone tells you, but it's a generation behind or something.
Riley G. [00:48:35]: A lot of, I'd say, strategy ideation in fields that are new and competitive. If you wanted an LLM to give you ideas for what's a good LLM startup to try right now, it's probably not going to tell you anything useful. In areas like that, where people are still figuring things out and there's money to be made in knowing how to do this better than the average person, you're going to get mediocre advice on a lot of things. But that's not true for everything. If you ask it about physics, you're going to get above-average advice.
Riley G. [00:49:16]: But I think that people who have acclimated to models forget what it's like to be new to models, right?
Riley G. [00:49:25]: And I think that explains a lot of people in industry being annoyed by "how many R's are there in strawberry?" Because they're so...
Nathan L. [00:49:33]: That's the tokenizer. We ignore the tokenizer whenever we can.
Riley G. [00:49:35]: Yeah, and you see this explicitly. A lot of people get really enraged; they're like, you idiots, why would you ever think this would work? Why did you ever think you could ask it whether 9.11 is greater than 9.9 and it would give you a right answer? And so on. They have a point; that was the attitude for a long time. But I think the social context of these models is changing, and it's becoming more reasonable to expect them to work well on these queries. There are practical consequences of these models being in the hands of people who don't know about these issues, and it's now suddenly more important to fix them. Yeah. So let's spin on this.
Nathan L. [00:50:12]: Is Google searching going to become more like prompting, or is prompting going to become more like Google searching? Where with a good language model, can I just type in the physics equation with the cross product that governs electromagnetism? Is that the direction the models are going? Or is everyone going to actually become more conversational because AI is the default?
Riley G. [00:50:37]: Yeah, I think, I mean, Google searches maybe, yeah, there's some similarities there. I think Google probably has gotten simpler.
Riley G. [00:50:48]: It's been a while since I've used the most advanced search filters in Google. I remember a point when it was extremely routine. Yeah, the plus signs, the quotation marks. And I think that speaks to the fact that the results used to be worse, right? And we thought we were happy with them because we didn't have alternatives. We just accepted that, oh yeah, there are going to be false positives in here, so we have to put in some negatives to cancel them out. And that skill, I'd say, hasn't really become more important over time, right? It's occasionally useful still, but it's less essential than it once was. And that mimics a lot of what we see in prompt engineering: things you don't have to understand anymore. Tokenization, I think, is probably the biggest one. ChatML was no small part of why ChatGPT was such a big improvement for prompt engineering. It wasn't just the tuning. It was the fact that they came up with this more restricted system of interacting with a model that alleviates the need to know anything about tokenization. And that, I think, is kind of an underappreciated change. Yeah, I agree.
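For context, here is a minimal sketch of the ChatML-style framing being referenced, assuming the special-token names from OpenAI's published ChatML draft (exact tokens vary by model); the point is that the user only writes message bodies, and the framing handles the rest.

```python
# Illustrative ChatML-style prompt framing; in practice the API assembles this for you.
chatml_prompt = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "How many r's are in 'strawberry'?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(chatml_prompt)
```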
Nathan L. [00:51:54]: I do think in the long term, prompting will go in the direction of Google searching. But in some ways, I'm not that surprised that something like o1 can exist, and it's still a very humbling moment; there will be many more times when AIs are released that we don't know how to use. And this is the skill that you need to have: tinkering with an open mind. The open mind that new things will come, and the open mind that things are not just what they appear to be at face value. And if you play with o1 a lot, you can definitely get things out of it that people on Twitter are not repeating over and over again.
Riley G. [00:52:31]: Oh, yeah, definitely.
Riley G. [00:52:35]: A lot of the explanation for the disconnect that you see, where some people are just absolutely amazed with o1 but most of the things you see on Twitter maybe aren't that impressive, is that the frontier of problems that distinguish o1 from, say, the previous class of frontier models is either unrealistic problems, brain teasers that people artificially constructed to exhibit the difference, or something realistic that you would never want to read in a tweet. The problems where it excels are like: I have this extremely in-the-weeds programming problem that involves a complicated interaction of all five of these files. Please fix my import errors or whatever.
Riley G. [00:53:25]: Those are the things where you're going to see the most practical benefit. And those just aren't easy to communicate in the way things used to be. It used to be easy to make a screenshot of, hey, look, it does this. It will fix your broken JSON or whatever.
Nathan L. [00:53:45]: Something else that I'm realizing I didn't put in the notes: there have been these comments on o1 from the OpenAI people that they want to expose to the user the ability to change how long the model thinks, to change its test-time compute. That ultimately is going to be a whole other prompting thing. It's almost a little surprising that they are giving that to the user. I almost think they should just make a classifier that does it for them, rather than assume the user is dumb. But being able to change how hard your model thinks is a really interesting real-world prompting case. Because it doesn't really matter if you can get a viral example; it's more like, how do you vary that knob in your day-to-day use in a way that meaningfully shifts your end product?
Riley G. [00:54:26]: Yeah, it's really kind of comical trying to manipulate how long it thinks about things. There are some things that will make it think for a long time. I tried to get it to generate acrostic word squares once, and if you emphasize enough the need to validate things, it will just keep validating and failing and looping around. I think I got up to three minutes once of attempting things before it finally said, oh, I wasn't able to find one, here's my best effort. But other times, if you ask it... I mean, I once gave it a problem, kind of just for the comedy of it. I gave it some simple problem, and then I gave it literally, I think, three pages of emphasis on "think forever." Just rambling paragraphs saying, if you're even considering stopping, don't. If you ever get tired, don't worry about it.
Nathan L. [00:55:22]: Just keep going.
Riley G. [00:55:24]: All that holy-hand-grenade-style repetition. And after all this, it literally just thought for three seconds and then came back and said, I understand the urgency you're expressing here, but thinking forever just isn't possible, so I'm not even going to try. There's another thing.
Nathan L. [00:55:43]: OpenAI said they might give you a knob that controls this or influences it.
Riley G. [00:55:47]: Yeah, I have to be honest, it feels like maybe a weird UI. It seems like something that you should be able to just do through text. But I'd be happy to play with it. Because steerability in general with o1 seems to be... a lot of people, I think, are reporting that it's kind of awkward, or at least at odds with the really impressive examples that we're seeing come out of it. Yeah.
Nathan L. [00:56:16]: There's a whole strategy discussion on why they actually released it that I haven't really gotten into. We can kind of avoid this. I am wondering how you view prompting of agents; this is kind of the "what is the future" section. How are agents going to be susceptible to prompting? I'm guessing after our conversation here, the answer is going to be: it's the same. And there's probably going to be a meaningful shift in who can deploy them and have success based on who actually has this expertise and is doing this prompting work. And this could translate into downstream business success, where the first person to crack an agent with the right model and the right prompt can have the first product that works.
Riley G. [00:56:57]: Yeah, I think people mean very different things when they talk about agents. The big division that matters is that there are agents working in self-contained, repeatable environments, like a REPL sandbox, and then there are agents that are making changes in the real world: making retail purchases, canceling your subscriptions, and so on. I'm very optimistic about the former. I'm very skeptical of the latter. I think people underestimate how much reliability is needed for a lot of real-world decisions before you get to the point that you'd trust the thing to have the power to cancel your Hulu subscription or whatever. I think that also, in the first case, there's a lot of untapped potential there, and I don't understand why we aren't seeing more iteration on that front, really. ChatGPT's code interpreter, when it came out (I think they renamed it to Advanced Data Analysis or something like that, which is not a good change in my mind), but the code interpreter, I love that. I still love it. It's a brilliant product, and I wish they kept going with it and improving on it. I'm also a fan of Julius AI, which goes exactly in that direction of creating a code interpreter-like environment where you can substitute in whichever model you want, and you can do things like install packages. It's great for one-off scripts where you want to say... I had a post once where I was pointing out oddities in the longest GPT-4 tokens. One of them is like slash, slash, and then 128 repetitions of an equal sign or something like that.
Riley G. [00:58:49]: But the way I did this was literally just: I went to Julius and said, install tiktoken and show me the longest tokens. And I read the code pretty carefully because I was going to tweet it; I didn't want to tweet out something wrong. But it was right. There were small things that I had to fix, but it's good for prototyping, these quick one-off things where you're just like, yeah, I could look up exactly... I roughly know how to use tiktoken, I just didn't feel like figuring out the syntax again.
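A minimal sketch of the kind of one-off script being described here (not the exact code from the episode), assuming tiktoken is installed and using the cl100k_base vocabulary associated with GPT-4-era models:

```python
# List the longest tokens in the cl100k_base vocabulary by decoded byte length.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

tokens = []
for token_id in range(enc.n_vocab):
    try:
        tokens.append((token_id, enc.decode_single_token_bytes(token_id)))
    except KeyError:
        # A few ids in the range are unused or reserved for special tokens.
        continue

longest = sorted(tokens, key=lambda t: len(t[1]), reverse=True)[:10]
for token_id, raw_bytes in longest:
    print(token_id, len(raw_bytes), raw_bytes)
```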
Riley G. [00:59:17]: It's good for just the curiosities and one-off stuff like that. And I think that's what the future of this really is. This really blew me away.
Riley G. [00:59:30]: Somebody posted on Twitter a video of their eight-year-old daughter using Cursor, I think it was, and this girl apparently has no understanding of the code that's being generated, but she's able to say, no, I want to do this differently, I want to have a Harry Potter spell here, changing the layout of this HTML/JavaScript app. And it just works. And that's the future to me: that the hottest programming language is English. When you see a little kid doing it, you really believe it, that now kids can have the power to create software. And that's great, because we were at a weird local minimum of that, I'd say, of kids being able to have the creativity to create their own interfaces or make their computer do what they want. Computers are less customizable now than they once were. Yeah.
Nathan L. [01:00:28]: My reflection on this is that the people who take prompting seriously are more likely to be in tune with what is happening in AI and at the cutting edge. But that also means that on the academic side, and on the public side for transparency and accountability, you have to do some education work to make sure people are taking it seriously, and/or some normalization of claims, depending on how people are presenting their work and using things. I think it's safe to say that all the frontier model labs are doing this, but for the long tail it takes people time to learn these habits. It's surprisingly hard to convince people to spend time playing with models, too. I do it, but I should probably do it more, listening to people like you. It's funny; it's one of those things where it doesn't make sense how it'll pay off, but it probably will.
Riley G. [01:01:20]: Yeah. I mean, there's no substitute for using models. I personally discover just the dumbest things sometimes that make the biggest difference. One of the highest-impact ChatGPT tricks that I found lately is that I have custom instructions in my ChatGPT telling it how to think silently. I have a tweet about this that I posted once, so if you Google "ChatGPT think silently Goodside," you'll probably find it. But I have the prompt here, actually. I was using its new memory feature, so it can remember things that you tell it, and I was sort of showing that off at the same time. I said to it: remember this, when I ask you to think or write silently, I mean for you to use your Python interpreter to write your thoughts as code comments or string literals assigned to variables. Code doesn't necessarily have to display any output. And then it remembers that. And so then I can say to it, silently write a brief essay about Super Smash Brothers, then silently translate this essay into French, then display only a double histogram showing the frequency of word lengths for both texts. And it doesn't output anything until it has that histogram done, and then it outputs the histogram and says, here it is.
Riley G. [01:02:32]: And that makes such a big usability difference, if you just don't have to see what it's doing, if you can put it behind a fold where you can expand it if you need to be really sure the code is right, or copy it to another editor or whatever. Just not seeing it makes such a big difference. And you can have things in code too. You end up in this sort of Jupyter-like flow where you told it to silently do something, and now, because you said to do that, it's not just in context, it's in a variable. Like I said, if it ever needs to do something in code, it will just have that variable there. And it doesn't have to repeat it, which is a big deal if it's, say, an essay. Repeating an essay is expensive. Yeah. This is great.
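To make the trick concrete, here is a hypothetical illustration of what that "silent" interpreter code might look like; the essays and variable names are made up, and the real code ChatGPT writes will differ.

```python
# Hypothetical "silent thinking" code: the thoughts live in variables, and only
# the final figure is displayed to the user.
import matplotlib.pyplot as plt

essay_en = "Super Smash Bros. is a crossover fighting game series ..."  # full essay here
essay_fr = "Super Smash Bros. est une serie de jeux de combat ..."      # translation here

def word_lengths(text: str) -> list[int]:
    # Strip trailing punctuation so word lengths aren't inflated.
    return [len(w.strip(".,!?;:")) for w in text.split()]

plt.hist([word_lengths(essay_en), word_lengths(essay_fr)], label=["English", "French"])
plt.xlabel("Word length")
plt.ylabel("Frequency")
plt.legend()
plt.show()
```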
Nathan L. [01:03:19]: Thanks so much for coming on. Anything else you want to plug or talk about?
Riley G. [01:03:25]: I should have some content going live around the time this comes out, analyzing o1 for the Scale blog and talking a bit more about our coding leaderboard. So definitely look out for that. And the other thing I should of course mention is Humanity's Last Exam. We recently partnered on an effort to solicit from the public examples of challenging problems, and we are giving out cash prizes. So definitely check that out if you're interested.
Nathan L. [01:03:58]: Yeah, I had just tweeted a few days ago (I don't know if I put it on Twitter, but I put it on some platform; I don't have Twitter at work, so I end up looking at lame platforms I'm less addicted to) that evaluation is going to be extremely expensive. That was my whole take. It's going to be very narrow and very hard. And then you put out $500,000 in prizes, and the initial whiplash is like, oh, that's a lot. But in reality, I think that's the right ballpark. Because if you're going to make a good eval, you need somebody who's really good at cutting-edge AI working on it for at least six months, and $500,000 is roughly a half year of what that costs, with overhead and compute and everything. So obviously it costs even more to actually build the evaluation. These numbers look ridiculous, but if we want evaluations that are meaningful, this is what we need to do. And I think it's the right thing for Scale to do to lead on evaluation. It feeds into natural parts of their business. I think I've been on the record on this for a while.
Riley G. [01:05:00]: So I'm like, it's great. Yeah, absolutely. I think that people outside the industry at least have the impression that evals are grunt work, right? That this is something you would use low-cost labor for, that it's not a prestigious area. But it couldn't be further from the truth. I think evals are very rapidly moving towards the high end of intellectual ability, where we're looking for PhDs. I've done projects where it's like, okay, we have to get as many PhD-educated poets as we can to check the correctness of the iambs in this poem or whatever.
Riley G. [01:05:46]: I think that's only going to continue, right? We're going to see that at the low end, the value of human labor for training models is going to decline. And the value of high-end intellectual labor is going to increase probably drastically.
Nathan L. [01:06:04]: And cost is probably a good proxy for evaluation usefulness. LMSYS is expensive, but in different ways than the Scale leaderboard is expensive. And they complement each other very well; they both become better by the other existing, because the models are in similar places but they're showing different things, and you can separate between them. And I suspect that will continue to grow. Some of it will be at Scale, some of it will be elsewhere. And that's just the new default for evals.
Riley G. [01:06:35]: Yeah, absolutely. I think that's one of the things I'm most proud about working on our evals and leaderboard at scale is that we're contributing to this healthy ecosystem of not having to just trust one or two players that evals have been done correctly. We want to have more openness and more independent verification of evals. And that's sort of our general theme with work with GSM 1K and trying to make sure that we can actually trust what these leaderboards are saying.
Nathan L. [01:07:08]: Yeah, my one nitpick, which I don't know how to answer and probably need more RLHF experts for (you might know this), is: are companies that buy data from Scale going to have an advantage on the Scale leaderboard because of the distribution of humans doing the work? Not that the humans doing the evals and the humans doing data creation are the same, but they're drawing from the same pool of humans that are writing content or doing preferences and then doing the evals. I think it's too early to answer whether that human distribution matters, and for that reason I think the eval is still very much a net good. But it'd be really interesting to try to run those experiments: who is giving the data that you train on, and how does that then impact the evaluation?
Riley G. [01:07:49]: Yeah, that's not something that I'm familiar with in enough detail to comment on our process there. But yeah, that makes sense to me. I think that's something.
Nathan L. [01:07:59]: People like to complain about every possible thing. And I understand the root of the complaint, but we've got to deal with the circumstances of where we are in the AI industry. And the leaderboard is so much more useful than any problems it's causing. Let's keep doing it.
Riley G. [01:08:17]: Yep, absolutely. Okay.
Nathan L. [01:08:20]: I think we're at time. So I'm going to click stop here. Thanks again.
Riley G. [01:08:23]: Great. Thank you so much. Bye.
Sorry this one was late! Thanks for bearing with me, and keep sending feedback my way. Still a year or two away from when I have time to record these, but I would love to.
Open-source tools, examples, limits, and the state of training multimodal models.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/molmo-and-llama-3-vision
00:00 Llama 3.2 Vision and Molmo: Foundations for the multimodal open-source ecosystem
02:47 Llama vision: Multimodality for the masses of developers
03:27 Molmo: a (mostly) open-source equivalent to Llama vision
08:45 How adding vision changes capabilities and reasoning
11:47 Multimodal language models: Earlier on the exponential
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_013.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_021.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_023.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_027.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_030.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_037.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_046.png
Fig 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_048.png
Fig 10: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_050.png
Fig 11: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_052.png
Fig 12: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_054.png
Fig 13: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_058.png
Fig 14: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_065.png
What productionizing test-time compute shows us about the future of AI. Exploration has landed in language model training.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/reverse-engineering-openai-o1
00:00 Reverse engineering OpenAI's o1
01:52 From Q-star to Strawberry to o1
05:13 Training o1 with reinforcement learning
09:24 What is o1 doing when given a prompt?
11:49 Questions to consider to understand o1's structure
11:56 1. How does an RL-trained language model act?
12:38 2. Is it an online / test-time search?
14:20 3. Is it one model at inference?
15:29 Open-source o1, the future of o1, and the future of AI
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_014.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_016.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_018.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_020.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_024.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_026.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_034.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_048.png
Scale AI's future versus further scaling of language model performance. How Nvidia may take all the margins from the data market, too.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/ai-data-foundry
00:00 Futures of the data foundry business model
02:57 What it is like to work with data vendors
06:06 Data foundries: Risks
08:18 Data foundries: Growth vectors
09:50 Realistic expectations
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_008.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_012.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_023.png
And why the concept of mandating "model specs" could be a good start.
(Oops, forgot to upload this yesterday!)
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/a-post-training-approach-to-ai-regulation
0:00 A post-training approach to AI regulation with Model Specs
1:45 Expanded roles of Model Specifications
3:40 Near future of Model Specifications
Whether or not scaling works, we should spend more on inference.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-strawberry-and-inference-scaling-laws
00:00 OpenAI's Strawberry, LM self-talk, inference scaling laws, and spending more on inference
01:51 OpenAI's Strawberry
04:16 Self-talk in language models
07:45 Inference scaling laws
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_006.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_021.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_023.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_037.png
Ai2 released OLMoE, which is probably our "best" model yet relative to its peers, but not much has changed in the process.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/olmoe-and-building-better-llms
00:00 OLMoE and the hidden simplicity in training better foundation models
02:04 Frontier model team compute allocations
04:19 De-risking training complexity
06:40 On organizational complexity
09:05 Compounding improvements -- the key to building better language models
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_005.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_007.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_009.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_011.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_028.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_030.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_032.png
The Open Source Initiative is working towards a definition.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/defining-open-source-ai
0:00 On the current definitions of open-source AI and the state of the data commons
3:17 Reasons to not mandate fully released data
4:24 Sufficient but not exhaustive data docs
5:22 Frustration with the data commons
7:04 We need more examples to define the definition
The latest model from one of the most popular fine-tuning labs makes us question how a model should be identified as a "frontier model."
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/nous-hermes-3
0:00 Nous Hermes 3 and exploiting underspecified evaluations
5:29 Parsing training lessons from Hermes 3
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_005.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_010.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_012.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_020.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_027.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_030.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_032.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_036.png
I had the pleasure of talking with Ross Taylor, who has a great spectrum of unique experiences in the language modeling space — evaluation experience, Galactica lead author, Llama post-training, etc. This is a really great conversation on the frontier of language model (LM) reasoning, LM deployments and demos, LMs for science, RLHF, and other topics. I’ve been trying to get Ross to come on for a bit. He’s one of those people in the LM space who doesn’t speak too much, but when he does, you listen.
Ross Taylor was previously an LLM lead at Meta AI, heading up the reasoning team. Previously he led the early work on LLM agents, and was the research lead on the Galactica project. Before that, he was a co-founder of Papers with Code, which was acquired by Meta in 2019. Before that, he has worked as a quant in sports betting and finance, and before that a policy advisor for the UK Government. He is currently working on a new startup.
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.
YouTube
Chapters
* [00:00:00] Introduction of Ross Taylor and his background
* [00:02:12] Papers with Code
* [00:09:58] Galactica, goals, controversy, legacy
* [00:18:12] Technical details of the Galactica model
* [00:23:18] Potential for language models to make scientific discoveries
* [00:25:21] Defining and improving reasoning in language models
* [00:32:38] Process-based reward models and their potential applications
* [00:35:00] Generating synthetic data for SFT
* [00:40:23] Evaluating the effectiveness of language models as judges for human preference data
* [00:42:43] Considerations for creating base models that are easy to fine-tune
* [00:46:45] Balancing SFT and RLHF
* [00:54:13] Characteristics of successful post-training teams
* [00:58:26] Future directions for language model development
We mention
* Rob Stojnic (co-founder of Papers with Code)
* Armen Aghajanyan (Chameleon)
* Tom Scialom on Latent Space
* Soumith Chintala (PyTorch)
* Process Reward Models / Let’s Verify Step by Step
Transcript
Built with smol-podcaster and with love of Latent Space.
Nathan Lambert [00:01:07]: Today, we're here with Ross. This is a really exciting one. I've been trying to get Ross on the show for a while. Ross has done a lot of interesting work. And also the path to where you ended up with working on state-of-the-art LLaMA work at Meta is very interesting to me. So we're going to start with some of that, but then there are a few people that want to know more about reasoning and some of the RLHF stuff. We won't cover the secretive new start-up - I don't know what it is, but that's how it goes these days. I'm sure it'll be great. So welcome to the show!
Ross Taylor [00:01:41]: Thanks for having me.
Nathan Lambert [00:01:44]: So I wanted to start with Papers with Code. For people that don't know, Papers with Code is one of these platforms - I never was a heavy user of it - but it collates papers, people can upvote them, popular papers, attaching code and dataset and evaluations to papers, which is great - it was like sort of ahead of its time. It fits into a lot of these open ecosystem things. So I'm kind of curious, like, how you ended up there and why you all started this startup that ended up building this thing that got acquired by Meta?
Ross Taylor [00:02:12]: Yeah, that was a weird one. This was like back in 2018. So I was at an incubator, I just quit my previous job and I was like, okay, I want to do a startup. And I met Rob, my co-founder, who came along with me for the journey. We both came from different backgrounds. I was from a sports betting / quant finance kind of background, which is a whole other episode I guess. And Rob was in various startups, like applying ML to things like hate speech detection, that kind of stuff. And the cool thing was, we both resonated on similar kinds of problems within the ML space, even though we came from different domains. So we spent a lot of time doing various experiments, trying to make new kinds of ML tooling, thinking of these stupid questions like “what is the Git equivalent for ML?” - that kind of stuff. One of those experiments was hacking around on this little website to solve a really basic problem: I'm trying to reproduce this paper, but I can't find the code. That was the thing that really blew up beyond our expectations. It was weird because we thought it was fairly trivial at first.
Nathan Lambert [00:03:16]: What year was this? 2018?
Ross Taylor [00:03:18]: Yeah.
Nathan Lambert [00:03:19]: This makes sense. I was starting deep RL then, and deep RL was so hot, which was probably the worst that evaluation has ever been for ML. People complain about it today, but deep RL evaluation was like every single person was just lying to make themselves look better.
Ross Taylor [00:03:38]: The interesting thing now is that the open ecosystem has shifted to focus more on weights as a central artifact rather than code. I think there's an interesting debate there. Would it be more useful to have the LLaMA-3 8B model weights or all the code for training LLaMA-3? I think there's still interesting debates to be had about what's actually useful.
Nathan Lambert [00:03:56]: I think the code would be more useful. OpenAI released their rules-based reward models, but it's kind of code washing, because a bunch of people just release eval code now. And that's a whole other tier: actual training code versus eval code. But yeah, I guess I'll just skip ahead.
Ross Taylor [00:04:12]: So essentially Papers with Code was the thing that didn't die for us. We always thought we were going to make something else and Papers with Code was more of a marketing thing. But eventually we were like: okay, our users are telling us this is what we should be working on. And we expanded from that very simple use case of finding code towards indexing various artifacts in ML.
Another big problem was trying to find the state of the art in something like ImageNet and all these different benchmarks. There just wasn't a central place to find this information…So we had this quite good Christmas - me and Robert - where we hacked for the whole month, indexing every leaderboard we could and all the related papers. I didn't want to do any annotation again after that! But that took things to the next tier, and that's when things really started to blow up.
Nathan Lambert [00:05:03]: Because this is like the first round of leaderboards, because now it's really popular with Hugging Face again. And I was like, yeah, is that just because it became like a Meta thing and it's just kind of a thing that existed? You're like the first leaderboard company in a way, which I don't think many people think about. Yeah, which is weird.
Ross Taylor [00:05:19]: Yeah. And the interesting thing about us was that we never had to do any marketing because everything was from organic traffic. So you would type in “state of the art ImageNet” and we would come to the top as the most useful site. That was really the source of our growth, and we grew to a million MAU fairly quickly. And as for Meta, we were in touch with the PyTorch folks at the time who we really liked. You know - Soumith, Joe - those folks, and they had a shared interest in promoting the open source ecosystem back in 2018/19. And while it was like a tough decision, we were just like “we really like working with these people, we want to work more closely with them”, and that got us into Meta.
And then within Meta, we originally continued to develop the platform. But the big shift for us was that, even then, we saw we were moving to a world where compute was the currency. And we saw that, if we wanted to be well positioned in five years time, we needed to be building these large-scale systems. Even for our own platform, we had lots of ML in the backend and we saw we were using fewer and fewer models to do more and more tasks. So that kind of shifted us into research, into Galactica, and then eventually LLaMA and that kind of stuff.
It was a weird shift because we were product people who ended up doing hardcore research! But I guess it was natural to us that we were within a research org with these amazing people, lots of resources. It was just the best use of our time to conduct this shift.
Nathan Lambert [00:06:43]: Do you think there should have been more integration between Hugging Face and Papers with Code? It would have been wonderful if it had happened.
Ross Taylor [00:06:54]: The backstory is that we saw them as competitors, to be honest, because we had the same vision originally. We were going to do model hosting, that kind of stuff. But we never got into it because we hit friction with leadership - who was not onboard with that as a goal. Because from their point of view, it's like, okay, if we host these things, this might expose Facebook to some kind of legal risk. It wasn't in the perceived interest of the company.
Nathan Lambert [00:07:17]: This is a classic story of tech, really. They can't take the risk. They can't expose themselves.
Ross Taylor [00:07:23]: If you're a startup and it's your number one priority, then yeah, your attitude on risk is different. But I think it was a blessing in disguise for us because clearly the bigger wave was going to be large language models - we saw that incredibly early. And our mission was fundamentally not infrastructure, but something closer to: how do you organize information? It was a Google-y type of mission. And while we were focused on ML, we were more broadly thinking about science: how do we reduce friction for finding out about new advances and, I guess, lots of small tasks that when added up lead to a lot of progress in science.
Nathan Lambert [00:07:59]: I should have probably looked this up. Did you have another scientific background? Did you have a hard science background or what about Rob? Stojnic?
Ross Taylor [00:08:10]: Yeah, [Robert] Stojnic, my co-founder, he was from a bio background. So he's actually-
Nathan Lambert [00:08:15]: That makes sense.
Ross Taylor [00:08:16]: Well, he also had a computer science background. He was one of the original developers of Wikipedia, so he has his own crazy story…
Nathan Lambert [00:08:22]: Yesterday I was talking to somebody that was one of the original arXiv moderators. So we're digging all these things up…
Ross Taylor [00:08:29]: It is interesting because we both had this background, I would say, in building useful “utilities” [on the internet] at some point in our lives. I think Papers with Code is one of those things which is easy to forget, but if it went away, everyone would go crazy.
As for me, my background is more statistics and econometrics. My first job was in the Government, which I kind of hated. But I did a Master's degree, which I thought was going to be in economics, but the thing I ended up loving was time series and statistics. So I did all this research on state space models - before it was cool, I guess! - and then that got me into sports betting. And then eventually, we were using more and more deep learning [in the 2010s], and that’s how I got into AI. So a fairly nonlinear path. But -
Nathan Lambert [00:09:09]: Yeah. Well back to what you were saying on the scientific stuff, I think the Galactica story has many angles, and you led on this.
I think if people go look at the paper, it's a very interesting paper, like you cite Galileo in the first sentence, and it really has a lot of early modern language model features and quirks. It's something that people don't remember that well.
I'm very on the record saying the backlash was overblown. I think that was before there were clear habits and community norms around what language model demos should look like. So it was kind of in that teething phase.
But what was the actual goal that you wanted? You mentioned organizing the world's information. What was the goal and how close do you think the model came to accomplishing it?
Ross Taylor [00:09:58]: So there were several different things at once.
There were immediate product integrations we had in mind. We actually had an agreement at the time with Overleaf to be a “co-pilot for writing papers”. We'd have a really good LaTeX model in Overleaf, and whenever you wanted to include a citation, you could simply prompt for one.
More broadly, we imagined the future would be instead of..using more classical ways to find and extract information, if you wanted to learn about something like DPO, you would just prompt a language model to find out about it. Or if you wanted to ask “What's the state-of-the-art on SWE-Bench?” or something like that, you would just prompt the model and it would find the relevant information and answer the question.
Nathan Lambert [00:10:46]: So this is something that language models are so bad at. One of my challenge questions - I've been doing this for 6-12 months - is to ask models about DPO, and none of the models without internet access have yet done it right. You would think that it would start to kick in. And I don't just ask “what is DPO?”, I ask “What is DPO for language model fine tuning”, and they still just make up nonsense.
Ross Taylor [00:11:06]: Yeah, which actually relates to an interesting debate about LLM creativity. If you want to solve something like LLM creativity, you want to be confident about the frontier of knowledge, but frontier knowledge is where you have the most token scarcity.
But anyway, just to finish that thought. Bear in mind, we were developing Galactica while the whole Web 3.0 boom was happening. And we were in this weird state where we were like “All everyone is talking about is Web 3.0, but clearly generative AI is going to be the thing that powers the next generation of the web!”. So I guess that was our primary motivation.
Now, in terms of the [Galactica] launch, I think there's two aspects.
First, like you said, the paper. Now we were a small team of 7-8 people. We had so much fun developing these new ideas at the time: internal reasoning tokens, how do language models cite, training for multiple epochs…
Nathan Lambert [00:12:00]: What's that? A citation token? Did you have a special token for citations?
Ross Taylor [00:12:04]: Yeah. So we had a start citation token [START_REF], and we used two methods. The first was: we'd put the title of the paper within the citation tags. And the other one was: we'd have an alphanumeric ID.
The interesting thing was, it actually worked really well - but in the demo interface, it had a tendency to hallucinate - or “hallucitate”. The backstory is that, while the model was really good, for the demo we turned up the temperature to 0.7 so the text generation was better [at the expense of citation accuracy]. So generative citations were something that people thought didn’t work, but it was [more an implementation issue]. I guess that’s an alternative road in history…
So there was the paper, which was cool, and there was the demo, which I would say was motivated by the realities of the time. This was pre-ChatGPT and, even within a big company like Meta, it wasn’t a company priority to work on LLMs at all. So in our mind, our objective was - we were kind of deluded - being a team of 7-8 people, we were like…
Nathan Lambert [00:13:08]: This is how you have to operate if you want to be at the cutting edge. That's how great teams operate.
Ross Taylor [00:13:13]: So there were two objectives you could have had. The first is: you think that second-mover advantage is good. So you could wait for OpenAI to do something and then come in after and do it in an open way. And this is the path that actually worked for LLaMA. LLaMA was not state-of-the-art in any sense.
Nathan Lambert [00:13:27]: I've been doing this. I mean six months ago, maybe OpenAI and Google wouldn’t need to hire me because they know everything. But now I’m doing more interesting analysis where I'd be hired at a different role - but in the open. Now I'm like the person people look at. But I’m trying to tell people that “You don't understand! I'm six months behind everyone!”.
Ross Taylor [00:13:49]: Right, but to be clear, that’s a really important role - because everyone should have a stake in the future. And that's what the open ecosystem gives people.
But our objective was this: we didn't want to be second; we wanted to be first. And we were kind of deluded because we were 8 people - compared to maybe OpenAI with 200 people where their whole bread and butter was language models. But that’s why we were thinking “how do we move as fast as possible?”. And in our mind, a demo might be premature, but it would also be a way to get lots of prompts and information quickly - to understand how people would be using the model. And essentially the calculus we took was, we knew the community might not be ready for something like this - especially with the Meta branding - but we thought this was a way to get lots of information really fast and catch up given our position. Now in retrospect, history says that…
Nathan Lambert [00:14:33]: You kind of did that. I think Meta probably got the injection of language model reality from that. It's kind of like the Gemini backlash. I think the Gemini backlash - while it's obviously stupid execution - was potentially a good forcing function for Google's structure of their Gemini org - to really move everything into the way it is now. That made them be structured more like a serious language modeling org and less like Google, I think, which people don't want to hear...
Ross Taylor [00:15:07]: For us it was just a risk we decided to take. We probably took a lot more risk than we should have done. But we just thought “obviously this is going to be huge”, “LLMs are going to power the next internet”, etc, so let's take a risk. And you know, if we ran the universe several times over - it would have succeeded in some of those runs. But [in our universe], the criticism, which was obviously overblown, reached a critical point where things didn’t work out.
And then there's the story about the demo coming down, which - I’m not sure I’m able to talk about - but I think that is one of the things where, if people knew the true reasons, they'd be like “what the f**k!?”. But yeah, that's what happened…
Nathan Lambert [00:15:44]: Yeah, this is why any company that makes a demo now has block lists, where there's certain words that if they're in the prompt of the generation, you get a really, really stupid response. Even if it's like an open model, you just put like a little filter that's like, “you can't say the most obviously bad words”.
Ross Taylor [00:16:01]: But we actually did that and that created backlash as well. Because if you have false positives, you actually exclude some words which aren't actually offensive [in certain contexts], right? And then you also offend people… so it's not a win-win situation.
But if I have to look back at it now, I think with any new technology, it's never going to be absolutely better than what came before it. With LLMs, the relative comparison is with search. If you’re going towards search and information retrieval, you're prioritizing factuality as opposed to creativity, right? And the fundamental tradeoff with LLMs is saying, “I can trade off some amount of like factuality or ‘closeness’ to the corpus for some amount of synthesis and creativity”.
I don’t think that if we had a better model, it would have helped things at all. You could say maybe if [Galactica] had RLHF, would that have helped? I'm not too sure given that the project came out of [a big company like] Meta. Meta has a really good reputation now - people appreciate the open work they're doing - but at the time, things like the 2016 election were still in people’s minds. So I think the LLM revolution was never going to start at a big tech company, in my opinion. It was always going to happen at a company that had less reputational baggage. But I think it's pretty cool now that people see things differently. Because FAIR always had a really strong commitment to open science. It’s good that they're finally getting the credit for that.
Nathan Lambert [00:17:38]: Yeah. I have two technical questions on Galactica that I find really interesting. One is from Luca Soldaini at AI2. He said that you mentioned that the Galactica log probabilities (when producing citations) were proportional to how far in the citation graph the current paper was to the cited paper. Do you have any more interesting comments on how the latent space of Galactica actually worked? Because that is cracking the most important question of a language model for science - building a better latent representation of how the scientific information is organized.
Ross Taylor [00:18:12]: Yeah. So there were a couple of aspects to that. The first thing is we had this really nice graph that showed, as we scaled the model, the distribution of citations became closer and closer to actual citations - which is what you'd expect. But this was important for us, as our main worry was - because we were thinking about deploying to Overleaf - we didn't want to prioritize the most cited documents and create a “rich get richer” dynamic.
Nathan Lambert [00:18:38]: Google Scholar already does that. Were you re-indexing all the papers rather than building off like the Scholar graph or something?
Ross Taylor [00:18:45]: I think we were building off existing ones, using things like CrossRef…but there were lots of gaps that we had to fill. The other weird thing was that we saw some strange biases in the model. So if the model didn’t know what to cite, it would sometimes cite a general review paper, which is really weird emergent behavior. It was like the model was saying “I don't know a specific example, so I'll just give you a general overview”.
Nathan Lambert [00:19:11]: It's probably in the data.
Ross Taylor [00:19:12]: I think the thing that surprised me the most was multimodality. So we trained the model on SMILES formulae and protein sequences [alongside natural language]. And the thing that really surprised me was, we had tasks which we didn't explicitly optimize for - like converting a SMILES formula to a IUPAC name for a chemical. And if you actually looked at the attention as the model was predicting the next token, it would say something like “amino” and you could see in the chemical graph, it was explicitly attending to the relevant part of the sequence.
I found that amazing because we didn't train for it explicitly. That's the beauty of self-supervised learning. But I also found it highly ironic because some of the criticism of Galactica was “it’s ungrounded”. I was like “how grounded is this? The natural language tokens are literally attending to the underlying chemical structure!”. So that was kind of cool.
And then the other cool thing was: if you prompted with a protein sequence and asked “what is the function of this protein?”, the model was really good at answering those questions in natural language. That was awesome for me.
Nathan Lambert [00:20:33]: There's another prompting thing that I had known of [for Galactica], which was asking the model to do open-ended generation tasks. The models are still out there - people can spin them up and do demos on their own - but if you asked it something that people think of for ChatGPT - e.g. write me a poem about a sad goldfish - it wouldn't work unless you put it in a header format. It was markdown, I think? If you prompted it in that format, it would actually do a great job.
Ross Taylor [00:20:57]: Yes, so in the Galactica demo, a lot of people were being malicious with this type of prompting for markdown articles. But I did enjoy some of the creative ones. Someone was like: write me a theorem on finding a girlfriend, and it was some of the most hilarious model output I’ve ever seen. And people also generated some amazing sci-fi…but then I think some people took it too far. But whatever. I guess it was a traumatizing experience for me at the time. But with the benefit of hindsight, it was also fun in some sense, I guess.
Nathan Lambert [00:21:30]: Yeah. It makes you understand the bigger context of the work much faster than you would otherwise.
Ross Taylor [00:21:37]: It was actually crazy at the time. So many people were using it. Even then we could see that - while it wasn’t a product - most systems were going to be designed in a similar way.
I think the interesting thing was how the winning form factor in the end was like a chat interface - you know, with ChatGPT being the winning UX. I think that was actually a big part of the story [why they succeeded]. There's a debate on whether RLHF is actually a capability advance or whether it’s just alignment…but a big part of the story [for ChatGPT’s success], in my view, was the kind of UX of how you interface with a language model, rather than the actual capabilities. But I think it's obviously not monocausal at the same time. There were several factors at play.
Nathan Lambert [00:22:25]: Yeah. So the last thing on this is that you mentioned in our e-mails about language models, creativity and making discoveries. What do you mean by that? Is that the agent-like projects you worked on at Meta?
Agents are largely something that I don't have too much comment on. I'm taking the approach of wait and see what we actually get, because there are a lot of practical approaches that I think will be reasonable. People use language models for basic formatting, for code, etc. But it's easy to see that if they have a little bit more feedback for things like writing a paper - e.g. find me a citation for blank and justify your answer - that step is something that I think will come. I don't know how expensive it will be to run, but is that what you mean when you think about making discoveries? Is it more autonomous? Is it a grander vision? Anything like that?
Ross Taylor [00:23:18]: I think it's more like this: the killer use case right now is information synthesis. For example, I use Claude a lot more than Google now because it combines information in a better way and sometimes generalizes well to things it hasn’t seen before.
But a really cool thing would be: can a language model answer a question which is more out of distribution? That we don't see in the training data?
So an experiment I've never done, because I didn't have the compute, would be this. Imagine if you could train a language model on all documents up to 1905, which is the year when Einstein had his miraculous year of four seminal papers. With that model, which is trained up to 1905, could you prompt the model to come up with a good explanation of the photoelectric effect, special relativity, this kind of stuff? And what would it take to rediscover these things?
Because presumably, with all these major discoveries, it’s never out of the blue. You’re standing on the shoulders of giants, but there’s still a lot of thought and inspiration you have to do to get to those great ideas. So that's the setup. But the creativity problem is, by its very nature, hard to benchmark.
Maybe this is a digression, but my problem with the field right now is: we’re in a situation where we've almost solved a benchmark like MATH, which is a very hard benchmark, in my opinion, at least Level 5 MATH, but I don't think we've really cracked something like reasoning. So I think it's like a whole different question about how you even evaluate these frontier tasks. But yeah, hopefully that gives a flavor of the kind of questions here…
Nathan Lambert [00:24:58]: Yeah, we can go into the reasoning conversation. I think reasoning in RLHF will take up however much time we want to keep talking. I guess we can start with the basics. What do you think people that are using language models think reasoning means? And what is the way that you would interpret what you're trying to do in improving the reasoning capability of a language model?
Ross Taylor [00:25:21]: So there's a lot of controversy on this on Twitter/X. And I think people are talking past each other because sometimes people mean different things by reasoning. At a very granular level, is legal reasoning fundamentally the same thing as mathematical reasoning? Common sense reasoning? I guess my very basic definition is that reasoning is the process of drawing conclusions based on a body of observations, or in the case of deductive reasoning, basic premises.
Nathan Lambert [00:25:50]: So math is like a subset of what you think about.
Ross Taylor [00:25:53]: Yeah. And then I guess the bigger circle is the broader topic of outcome directed behavior. I have an idea of an outcome I want to achieve, but what's the best path to get there?
And then in the LLM space, I think this problem broadly equates to the technical problem of how you use compute to get from your question to your answer. In the old days, you would just prompt the language model directly. You would just put in a GSM8k question, put in “Answer:” and then parse A, B, C, D. So you're relying on the forward pass.
Nathan Lambert [00:26:27]: Yeah, like the FLAN data is really weird. That's a popular one that people used to train on this stuff.
Ross Taylor [00:26:33]: Yeah. And then came chain-of-thought, scratchpads, with Galactica…all these ideas of using the context window to do intermediate computation. And the more recent, although to be honest, it's actually quite an old idea, is: you have chain-of-thought, but how do you better learn the internal reasoning tokens that get you to your answer? So things like, you know, Quiet-STaR and variants of this idea.
Nathan Lambert [00:27:01]: Claude now shows you when it’s thinking, and in the Claude system prompt, it has information on how many tokens to take to think about a question. We're all thinking about trying this stuff and it's all so hard.
Ross Taylor [00:27:11]: I think it's a question of how do you learn those tokens? For us, the original thing we did was just supervised learning. So we trained on some examples and let the model generalize to know that it should do the thinking in between some tokens. There are more sophisticated ways you could achieve this nowadays.
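As a rough aside, a minimal sketch of what that supervised setup can look like: intermediate work is wrapped between explicit delimiters and trained with ordinary next-token prediction, so the model learns to emit its own “thinking” span before the answer. The delimiters and example data here are illustrative, not Galactica's exact recipe.

```python
# Sketch: formatting SFT examples so the model learns to do its intermediate
# "thinking" between delimiter tokens before emitting the final answer.
# The <work>...</work> delimiters and the example data are illustrative.

def format_example(question: str, reasoning: str, answer: str) -> str:
    return (
        f"Question: {question}\n"
        f"<work>\n{reasoning}\n</work>\n"
        f"Answer: {answer}"
    )

example = format_example(
    question="John has 100 apples. Jane gives him 5, then he gives 20 to Susan. How many are left?",
    reasoning="100 + 5 = 105\n105 - 20 = 85",
    answer="85",
)
print(example)
# At training time these strings are tokenized and used for ordinary
# next-token prediction; the model generalizes to produce <work> blocks itself.
```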
Another point is this: there’s an analogy that’s often used about language models, that they are “thinking out loud”. I actually don’t like this analogy at all. I think “thinking out loud” makes you think there’s something wrong about this kind of thinking in token space. But it’s not clear to me that the alternative - or these old adaptive computation ideas - are any better, actually.
Nathan Lambert [00:27:58]: What do you mean by adaptive computation? Because I mostly think of “thinking out loud” as being like chain-of-thought or generating its own explanation before it gets to an answer. What would adaptive computation be?
Ross Taylor [00:28:09]: So there's a paper by Alex Graves, who wrote all these amazing papers ~10 years ago, which had a lot of foresight. He did stuff like the Neural Turing Machine paper. Adaptive computation is the idea of, instead of having fixed compute between your input and your output, you can extend the forward pass to do things better, like arithmetic, where you have to maintain/manipulate state.
When chain-of-thought came out, there was an impression that it was a bit of a hack, because you're thinking in token space whereas you should be finding a way to make the forward pass dynamic. Universal Transformer is another variant of this [adaptive computation] idea. But I think there needs to be more empirics on which approach is actually better to maintain and manipulate state. I used to be more in favor of thinking, OK, chain of thought is more of a hack, but now I actually think it's probably…
Nathan Lambert [00:29:02]: What do you mean by state, like the state of the problem in that sense?
Ross Taylor [00:29:08]: So imagine that you're doing a GSM8k question, where John originally had 100 apples, then Jane gives him five apples. He has 105. And then he gives 20 away to like Susan or something and he's left with [85 apples].
So if you’re prompting the language model directly for the answer, you're expecting the language model in that forward pass to maintain and manipulate the state in a latent space, whereas the way chain-of-thought does it is in token space.
So you essentially output the intermediate steps. One of the problems with reasoning is that we have no idea how humans mechanistically reason…but if you think about how you'd solve a GSM8k problem in your head, then to me this seems a lot closer to something like chain-of-thought than adaptive computation.
Nathan Lambert [00:29:57]: Especially when you look at the architecture and attention mechanisms. A Transformer is really good at copying. So if you keep feeding in the recent information, it copies that in some way. So I think chain-of-thought and all of these things, I mean, they're only growing in popularity in my mind, along with Quiet-STaR and these kind of methods. I’ve heard the rumors about self-explanations and all these special things. The LLaMA-3 paper has all these special tokens. I don't know what all of them do, but I can see the direction. The state is stored in context and in special formatting tokens if it needs to be.
Ross Taylor [00:30:37]: So the other big picture thing is this. With the internet, you’re only seeing the output context.
So take StackExchange. If it’s a good answer, the author probably hasn’t just responded by generating words left-to-right. Maybe they’ve looked something up, maybe they’ve done a back-of-the-envelope calculation, either explicitly or in their head, right? And the internet is missing those “internal tokens”, essentially.
Now this isn’t always a problem because the models can learn how to construct them. And the effort now is to make artificial latents / internal thought, through RL or otherwise. But I think this is actually a much bigger question, which is more than just reasoning. In the end, as models become more capable, we’ll be talking more about how we can make them human-like in the way they can answer questions and solve tasks. For example, in some situations we might like the models to have [human-like] empathy, which is also “missing” in some sense.
So my prediction is that this becomes a bigger deal in the next few years: caring more deeply about the computation these models perform to reach a conclusion. And that will be the essence of alignment, in my mind. But that's a big topic!
Nathan Lambert [00:31:50]: OK, I have a long list of specific questions on this. My first question is about process reward models.
I think the canonical paper is “Let's Verify Step by Step”. My whole gripe is that it’s hard to create the data. That’s why they don’t exist in the open. But I’m guessing you can just label data with GPT and ask for feedback on each step, and just use that as an “LLM-as-a-judge” to get reasonable step-by-step labels on process rewards. But there’s so little work on this, so I don’t know if it is worth exploring. There is some research from Meta - I think Alex Havrilla did a couple of internship projects which related to this, and he’s good - but there’s such a lack of signal.
Is this something that people should work on more, or is it too complicated? Are there simpler things to do?
Ross Taylor [00:32:38]: Our big direction was integrating outcomes into reasoning - because next token prediction isn’t the objective we actually want to optimize. So the two ways to integrate outcomes are through something like PPO or inference-time search. And in both cases, you want a good reward model or value model.
Instead of (human-annotated) “process based reward”, we were exploring ideas along the lines of Monte Carlo policy evaluation (MCPE), where the key problem is how to learn a value model. It’s maybe a separate topic, but it’s underappreciated that something like MCTS - which in the public imagination is this inference-time search technique - actually has its real magic in giving you a value network for free.
This is why it was introduced in Go, because humans couldn’t come up with good heuristics for evaluation. So if you have something like MATH where you know the answer, then the question is how do you assign step by step feedback? It doesn't have to be MCTS, but something where you backprop the outcome to these individual steps is a way to get this dense feedback.
That's a way to get “synthetic process reward”. I should stress that PRM and MCPE are actually different things. Alex Havrilla was doing something along these lines also - but anyway, hopefully this gives a sense of the approach we took.
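To make the idea of backing the outcome up into per-step feedback concrete, here is a minimal sketch of Monte Carlo value estimation over partial solutions: each step prefix is scored by the fraction of sampled rollouts from that prefix that reach the correct final answer. The `sample_continuations` and `is_correct` helpers are hypothetical stand-ins, and this illustrates the general technique rather than any specific pipeline.

```python
# Sketch: estimating a value for each reasoning step by rolling out the policy
# from that partial solution and checking how often the final answer is right.
import random

def sample_continuations(question: str, partial_steps: list[str], n: int) -> list[str]:
    # Placeholder: sample n full completions from the policy, conditioned on
    # the question and the solution-so-far; return their final answers.
    return [random.choice(["85", "95"]) for _ in range(n)]

def is_correct(final_answer: str, gold: str) -> bool:
    return final_answer.strip() == gold.strip()

def step_values(question: str, steps: list[str], gold: str, n_rollouts: int = 16) -> list[float]:
    """Value of each step prefix = fraction of rollouts from that prefix
    that reach the correct final answer."""
    values = []
    for k in range(1, len(steps) + 1):
        rollouts = sample_continuations(question, steps[:k], n_rollouts)
        values.append(sum(is_correct(r, gold) for r in rollouts) / n_rollouts)
    return values

steps = ["100 + 5 = 105", "105 - 20 = 85", "So the answer is 85"]
print(step_values("apples question", steps, gold="85"))
# These per-step values can then supervise a value / "synthetic process reward" model.
```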
Nathan Lambert [00:34:21]: When Q* came out, that's something that I thought it might be doing. Instead of chain-of-thought, there's this idea of tree-of-thought. You could swap in the reasoning steps. And then if you could get labels on all these reasoning steps, you’re doing search over a reasoning space - which I would expect to work, but I think it needs the right datasets. I think a large part of the open alignment community right now is underappreciating datasets, where there's a lot of focus on methods, but we don't even have the datasets to use the methods… Like, why are you coming up with seven DPO variants if you don’t have the right datasets? I understand academic incentives, but if you are not an academic, you don't need to be doing that…
Ross Taylor [00:35:00]: It's an interesting question, because I guess the first chapter of LLMs had a lot of reliance on human annotations. In a way, that's a barrier to entry for the open community, because big firms can afford to pay millions for it but open source developers can’t. But more recently, you've had the rise of things like constitutional AI [and RLAIF approaches], which I believe are comparable to human-annotated datasets anyway. So is that a good thing for the open community?
Nathan Lambert [00:35:31]: I think it is, but human preference data might be a leg that is hard to remove. One of my latter questions was: can we actually do LLM-as-a-judge for human preference data fully? I think that is the critical step that we don't have an answer for. Everything else in the modern RLHF stack is becoming more reproducible in the open.
And that relates to a question I have on synthetic versus human SFT. I think Thomas [Scialom] said on the Latent Space podcast that they just use generations from the model because they're better than humans on a lot of SFT tasks. Apple had a quote in their foundation model paper saying the same thing.
So I’m thinking, shouldn’t we be redoing all of our generations for our SFT dataset with the latest GPT-4 or LLaMA-405B? Why are we using GPT-4 from March 2023? That model was not as good on reasoning. So we have headroom there on synthetic data. We have prompts that we could reuse, but we don't have the right preference datasets - datasets like UltraFeedback are not big enough. And I think they're not in the same style that a lot of labs use for this preference tuning - where it's on-policy generation.
We tried to work with Scale at Hugging Face to do this, where we had our own SFT models. We were getting data from Scale. We were labeling it every week and we were trying to retrain the models and we weren't getting a signal. This was last July/August. So we just didn't really know what we were doing. But I suspect that what people in the open should be trying to do is generating a lot, labeling it…That was a light bulb moment for me recently. This is what we have to do, but no one has done it.
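A minimal sketch of the generate-then-label loop described above, with an LLM-as-a-judge standing in for human annotators. The `generate` and `judge` functions are hypothetical placeholders for whatever policy and judge models are used.

```python
# Sketch: build an on-policy preference dataset by sampling pairs from the
# current model and letting a judge model pick the winner.
import random

def generate(prompt: str, n: int = 2) -> list[str]:
    # Placeholder: sample n on-policy completions from the current model.
    return [f"completion {i} for: {prompt}" for i in range(n)]

def judge(prompt: str, a: str, b: str) -> str:
    # Placeholder: ask a strong judge model which completion is better.
    return random.choice(["a", "b"])

def build_preference_dataset(prompts: list[str]) -> list[dict]:
    data = []
    for prompt in prompts:
        a, b = generate(prompt, n=2)
        winner = judge(prompt, a, b)
        chosen, rejected = (a, b) if winner == "a" else (b, a)
        data.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return data

prefs = build_preference_dataset(["Write a haiku about autumn.", "Explain RLHF in one sentence."])
print(prefs[0])
# Each round: train with DPO/PPO on this data, then regenerate on-policy and repeat.
```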
Ross Taylor [00:37:21]: Yeah, I think it's definitely underappreciated how you can get better answers than a human by sampling the models [enough times]. You mentioned that Thom made this point early on in the [LLaMA] project, but you'd be surprised how this extends to reasoning as well. Even with the Galactica model - which is now an ancient model, a bronze age model - the pass@100 on GSM8k was 98%. And it's absolutely crazy to me that even now people are using GSM8k as a benchmark. In my mind, that benchmark was solved several years ago.
It’s a subtle point because the zero shot performance was ~48% but the pass@100 was 98%. The insight there is that the model already has knowledge about how to answer correctly, it's simply not reliable. This tells you that you need to invest in reward models, process based reward, outcome based reward, everything we talked about earlier…
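For reference, the gap between single-sample accuracy and pass@100 is usually quantified with the unbiased pass@k estimator popularized by the Codex/HumanEval work; a small sketch with illustrative numbers:

```python
# Sketch: unbiased pass@k estimate from n samples per problem, c of which
# were correct. Numbers below are illustrative, not Galactica's actual runs.
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# A problem where roughly half of 200 samples are correct is almost always
# solved within a 100-sample budget:
print(pass_at_k(n=200, c=96, k=1))    # 0.48
print(pass_at_k(n=200, c=96, k=100))  # ~1.0
```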
But the same applies to the general RLHF pipeline. If you asked me to write a poem in the style of Bertrand Russell but also mix in Snoop Dogg’s style, then I couldn't do that. But the model has knowledge of how to do that, right? So why wouldn't you sample the model?
I think now with LLaMA-3, and the 405B model being out, it’s going to be good for the community that they can use it for generating data synthetically. And I'd imagine the quality will be good enough if it's done the right way.
Nathan Lambert [00:39:30]: Yeah, I think it should be doable. But there's a fundamental question of what do we think the human preference data is doing? [Compared to] model labeled preference data, is the noise that the humans provide of a different distribution that makes the human preference data better? I don't have a lot of signal on this, but I would love to know because I would guess that Meta would love to eliminate the $10 million plus estimated human preference data spend if they could. Meta is a reasonable company…
Ross Taylor [00:40:23]: Yeah, I don't know. But here’s something that surprised me. I was originally skeptical - at least on the reasoning side for LLMs - about LLMs marking their own homework. I thought they would eventually have that capability, but I wasn’t sure…
Nathan Lambert [00:40:40]: how fast.
Ross Taylor [00:40:41]: But the interesting thing we saw was as follows. We had experiments where we’d have a LLaMA-2 model that we’d sample generations from to train ORM models, and then we’d train different reward models on this data with different base models.
What we saw is that, the better the (underlying) base model, the better the reward model was for evaluating. And there were very clear patterns we saw: as the base model scaled, so did the quality of the reward model.
So that tells you that the knowledge is not in the ORM samples that you've fine-tuned the base model on. The knowledge on how to judge is within the model itself. And the pattern was so clear in the scaling. I concluded that eventually these self-verification approaches would work. It was just a question of when they would start to work for different types of problem.
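A minimal sketch of what training an outcome reward model (ORM) on correctness-labeled samples can look like. The toy scorer below stands in for putting a scalar head on a pretrained base model, and the batch is random data; it only illustrates the training objective.

```python
# Sketch: an ORM is trained to predict, from (question + sampled solution),
# whether the final answer turned out to be correct.
import torch
import torch.nn as nn

class ToyORM(nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # Mean-pool token embeddings, then score; a real ORM would use the
        # final hidden state of a pretrained decoder instead.
        return self.head(self.embed(token_ids).mean(dim=1)).squeeze(-1)

def orm_loss(model: nn.Module, token_ids: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    # labels: 1.0 if the sampled solution reached the correct final answer, else 0.0.
    return nn.functional.binary_cross_entropy_with_logits(model(token_ids), labels)

model = ToyORM()
batch = torch.randint(0, 1000, (8, 32))       # 8 tokenized (question + solution) sequences
labels = torch.randint(0, 2, (8,)).float()    # outcome-based correctness labels
loss = orm_loss(model, batch, labels)
loss.backward()
print(float(loss))
```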
Nathan Lambert [00:41:31]: Yeah. Model capabilities are also getting more dense, which helps as well. Like with smaller models, there are all these experiments with better data, showing that you get a better model with an X% reduction in size, which is kind of off-topic…
To double-down on what you said, I think this is one of the things I also debate: what makes a good model for downstream fine-tuning? I think in the LLaMA-3 report, they train the reward models directly on the base and not on the SFT model. The Apple report mentioned that they don't just use their evaluation suite for SFT models, but they evaluate with a reward model to see what is ready for RL.
I think, especially in the open, if you want the people to adopt your base model, there's a big gain in making it easy to fine-tune. For example, LLaMA has been pretty good; LLaMA-2 especially was really good for fine-tuning. There's also been base models that don't really work for fine-tuning, partially due to bugs and partially due to the state of the optimization. Is this something that you have any insight into?
Ross Taylor [00:42:43]: Yeah, I don't think I have enough insight into it to say, but I think it's definitely something that's been undervalued. I think the view of a lot of open model providers is: you get the model out, get good Open LLM Leaderboard results, and it's mission accomplished. But the real evaluation is in two days' time when you get anon accounts on X saying “I'm fine-tuning this LLaMA model, it's not working”. And when you see a pattern with this kind of behavior, you have to conclude something is wrong…
Nathan Lambert [00:43:11]: It's always a chat template thing. A lot of it is a chat template thing, but those problems do get ironed out eventually. There's this whole idea of annealing and staging pre-training. I can't tell if it is boosting current capabilities at the cost of later capabilities. I think in a few years, this will all shuffle out and it's just how we do evaluation in stages. So you're always going to optimize for the right metric.
Ross Taylor [00:43:50]: There's two points to that.
The first is about annealing. It works for the kind of benchmarks people focus on the most, but then there's a question of whether you are actually just collapsing the task distribution of the model to things you're measuring - and not the true task distribution used by the community.
And I think there's a second point - which is maybe too much of a digression - but there's an interesting debate to be had about data quality being a bit of a misnomer. In a sense that when we say “data quality” we're actually saying “this data mix works well on these benchmarks”. But if you take a “No Free Lunch (NFL)” kind of approach to this, you must be hurting task performance somewhere else, right?
Nathan Lambert [00:44:34]: Yeah, I think I’m on the record as being an AlpacaEval hater. I say this all the time, because I think AlpacaEval is sacrificing actual usefulness for their own metric. If you get a 1-2% bump on AlpacaEval, maybe that’s great. But you could be getting a 10-20% bump while sacrificing actual chat abilities.
We released some models trained with PPO and our PPO models are not very good at instruction following because they don't follow modifications like be concise or some stylistic things. They're also so yappy. They just say so much…but they do well on metrics and PPO especially helped AlpacaEval. So we had to figure out how to kind of use that signal without overcooking it.
Ross Taylor [00:45:16]: Yeah, it's like a whole discussion about evals, I guess…
Nathan Lambert [00:45:21]: We could come back to evals in a second. The last question that I have is: there are multiple trends, like LLaMA-3 downplaying the importance of instruction fine-tuning relative to RLHF. I think there are other quotes in [Thom’s] Latent Space podcast talking about it. Nemotron also had a report where they use SFT and then multiple stages of RLHF.
I think DPO versus PPO is overblown and that'll kind of be a wash eventually. Everyone knows DPO's advantages of being simpler. But my question is this: are there certain capabilities that only come for RLHF, and people trying to do them with SFT are just wasting their time?
I always thought safety was in this bucket where it kind of makes sense - it’s hard to train a model to refuse just with SFT. But with something like reasoning, are there certain sequencings where SFT primes you and then RLHF really helps reasoning or code? Because it seems like OpenAI is really leaning on PPO to help with reasoning and code?
Ross Taylor [00:46:45]: Yeah, I think there's two ways to answer this question. First, maybe the history of this debate on the LLaMA side, and then something on the reasoning side.
So the history is quite interesting. I would say, you know, when was it? 2023? My dates have been wrong since the pandemic…But this was just after ChatGPT. There was actually a debate internally in Meta about using RL, and a lot of senior people were very skeptical. I would say the view was…
Nathan Lambert [00:47:13]: Not just at Meta. You can see when different companies embraced RLHF, if you really start to look at their models…
Ross Taylor [00:47:22]: The view was that RL was a dead end. And that even DeepMind was moving away from RL at the time, so you should just do SFT.
But, you know, at least for the folks in the Galactica team that came to lead post-training for LLaMA, we were quite scarred by hallucinations! We were definitely of the view that we needed to have the right objectives, and that we needed to make sure language models could “know what they don’t know”. So we were quite high on RL from the beginning. And eventually, I think the LLaMA-2 paper showed that a lot of the advances in helpfulness/harmlessness were via the RL stage. So I think that approach was fairly vindicated.
On the reasoning side, I would just say it’s quite simple. It comes back to the next token prediction objective not being the actual objective you want to optimize. The objective you want to optimize for reasoning is: do you get the right answer or not? Especially since reasoning is a high precision task. If you get one token wrong, unless you have a backtracking capability, you’re never going to recover…
Nathan Lambert [00:48:32]: That's a throwback, the backtracking token. Sorry, that was a random paper! That is interesting…
Ross Taylor [00:48:38]: Yeah, all these weird methods… But I think on your question, there is a point at which these techniques kind of overlap, right? So if you're, you know, doing SFT with rejection sampling: you’re doing something close to PPO anyway. And the same for reasoning: if you sample the model and pick the trajectories that your verifier says are correct, and then do SFT on that, it is a form of RL.
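A minimal sketch of the rejection-sampling recipe just described: sample several solutions, keep the ones a verifier accepts, and fine-tune on the survivors. The sampler and verifier here are hypothetical placeholders.

```python
# Sketch: rejection-sampling SFT, which Ross notes is itself a form of RL.
def sample_solutions(question: str, k: int) -> list[str]:
    # Placeholder: draw k solutions from the current policy.
    return [f"solution {i}" for i in range(k)]

def verify(question: str, solution: str, gold: str) -> bool:
    # Placeholder: e.g. extract the final answer and compare with the gold answer.
    return solution.endswith("0")

def build_rejection_sampling_sft(dataset: list[dict], k: int = 16) -> list[dict]:
    sft_examples = []
    for row in dataset:
        keep = [s for s in sample_solutions(row["question"], k)
                if verify(row["question"], s, row["answer"])]
        sft_examples.extend({"prompt": row["question"], "completion": s} for s in keep)
    return sft_examples

data = [{"question": "2 + 8 = ?", "answer": "10"}]
print(build_rejection_sampling_sft(data))
# Standard SFT on these kept trajectories pushes probability mass toward verified answers.
```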
The final point I’d make is this: I would say the community overreacts to certain methods being used by popular models. They think: this company uses DPO because they must have found it's fundamentally better. But actually, it's usually due to either practicality or…
Nathan Lambert [00:49:22]: Yeah, that's what I think.
Ross Taylor [00:49:24]: You have a 405B model, and if you want to do PPO, you need to have a policy model, a reward model, value model etc in memory, and it’s not like…
Nathan Lambert [00:49:33]: Especially with DPO. I think with the 405B, I'm guessing what you did was cache the reference model. You could cache the log probabilities from the reference model. So you don't need to keep them in memory when you're doing the loss of the primary model. For DPO, you don't even need an extra copy of the model in memory, which therefore means you can use the same exact stack that you use for training. So you don't have to comment on this. But I think that's probably partially why LLaMA-3 just used DPO...
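A minimal sketch of that memory argument: if the reference model's sequence log-probs are precomputed once and stored with the preference data, the DPO loss needs only the policy model in memory at training time. Shapes and numbers below are illustrative.

```python
# Sketch: DPO loss with cached reference log-probs (no live frozen model).
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             cached_ref_logp_chosen, cached_ref_logp_rejected, beta: float = 0.1):
    """Standard DPO objective; the reference terms are read from a cache."""
    chosen_margin = policy_logp_chosen - cached_ref_logp_chosen
    rejected_margin = policy_logp_rejected - cached_ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

# Toy batch of summed sequence log-probs (one scalar per example).
policy_c = torch.tensor([-12.3, -8.1], requires_grad=True)
policy_r = torch.tensor([-11.9, -9.4], requires_grad=True)
ref_c = torch.tensor([-13.0, -8.5])   # precomputed once and stored with the dataset
ref_r = torch.tensor([-11.5, -9.0])

loss = dpo_loss(policy_c, policy_r, ref_c, ref_r)
loss.backward()
print(float(loss))
```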
Ross Taylor [00:50:07]: Yeah, I think people don't appreciate how compute works either. People assume the big companies have so much compute - tens of thousands of GPUs - so compute isn't a constraint. But all these things are subject to Say's Law, right? If you have more compute, you're going to train a bigger model. And then you're going to hit the constraints again. It’s like the old thing of trying to solve traffic by building another lane: if you add a lane, people will just fill it with more traffic.
So practicality is still a factor [behind choosing methods]. Also things like which researcher is in charge, what’s their favorite method, and also politics as well.
So I think the community has made a mistake of overreacting to these choices. There was a mixture-of-experts phase too, right? I don’t think there’s anything inherently better with either method (dense or MoE), they just have different trade-offs, and it depends on what you are trying to achieve. If you’re serving lots of people with inference, then maybe a MoE approach is better. If you’re optimizing for something simple that’s easy to train and gets good results, maybe you favor a dense approach - although that’s debatable whether it’s easier to train. But I don’t think these things are clear cut.
So I would encourage people to not just copy things because they're in a paper from a big lab. I would encourage people to try things out themselves to know what works, and figure out what the problem is that you’re really trying to solve.
Nathan Lambert [00:51:20]: I think people don't have enough long term direction in their decisions. People are not trying to make decisions about what will be right in 10 years, they are trying to get a model out as soon as possible. So there are very few people with the incentive to try to understand which method is better in the asymptote… I might have that incentive, because I'm a nerd, and I have an audience that is okay with me writing four paragraphs around esoteric nerdy topics, but for all these companies, that is not a real incentive.
Ross Taylor [00:51:53]: The other point I’d make - maybe it is a separate thing - is this. I made this mistake throughout my career of focusing too much on novelty and complexity.
So in my first job in sports betting, we were making models for horse racing, football, that kind of stuff. And I always had the perception that other funds had really advanced, cutting-edge, complex models - but that wasn’t the case at all.
I think there is this tendency within deep learning to assume - especially for the secretive labs - that their good performance is due to some secret, amazing method. But more often than not, good performance is due to lots of small things from different people combined into one model. Really, lots of simple things done well and solid execution. And frankly, for big firms a lot of brute force too, right? Because big companies are naturally slow. But once they find a way to mobilize resources, they’re very intimidating and hard to beat. If you’re in a big company, and you’re aware of this, which approach are you going to take: are you going to prioritize novelty, or are you going to use brute force if you have tens of thousands of GPUs?
So I would encourage people not to be too intimidated by this perception that the big labs are smarter. I don’t think they are.
Nathan Lambert [00:53:03]: They're earlier but they're not necessarily smarter.
Ross Taylor [00:53:09]: Yeah. So obviously the constraints are different because of less compute in the open, but still: you’ve got to use first-principle thinking and be empirical as well, and just follow that path.
Nathan Lambert [00:53:21]: Yeah. So following up on this, there's a lot of discussion around what the processes are for making a successful foundation model lab. I think Armen has been talking about a few things on Twitter with great visualizations around de-risking pre-training based on FLOPs efficiency. Do you have any comments on what makes a successful post-training team and project?
I've talked to John Schulman a couple of times - he's been the king and started all of this - and OpenAI is still looked at as being the leader in the space. I think they've always been top on Chatbot Arena, and have cracked what most people like in the style. They started early. Are there different considerations for the post-training side of things rather than the pre-training side that we might hear more about?
Ross Taylor [00:54:13]: Yeah, there's probably better people than me to answer. So in our team, originally like Robert (Stojnic), my co-founder, he was kind of managing the post-training team. And then I'd say Thom Scialom was doing a lot of the work. And then more recently Rui Hou - he kind of flies under the radar a bit - but he’s been doing a lot of the work. They are all better placed to answer than me, since I was focusing on reasoning and agents.
But I think the key thing is this: post-training is just a lot of iteration. Frankly, lots of hard work - e.g. making sure at each round of RLHF you’re not regressing in certain ways, filling holes, etc. I guess it’s hard to put a finger on a single thing, but…
Nathan Lambert [00:54:58]: There are simple things I'm trying to get people to talk about more. I’m trying to establish a good vibe test for internal culture. How do you vibe test for a good post-training culture (or for reasoning)? I remember somebody at Anthropic told me there’s still a lot of cases where you just put your finger up to the wind and you're like “model good”. And I'm sure that is still happening. And that's just a simple cultural thing of telling the team that you can’t always trust all of your numbers.
Ross Taylor [00:55:26]: I think it is maybe a more fundamental question. I wasn’t there at the early days of FAIR - I came in 2019, but FAIR was always a very bottom up organization. Which is a great thing: that's why things like PyTorch emerged. But the real insight as to why OpenAI was ahead historically, at least until recently, was that they had more of a top-down culture and focused bets. They saw the potential of LLMs early on and it was a top-down prerogative of the company to focus on that. And in essence, it was more of an engineering problem than it was a research problem in a lot of ways.
Relatedly, I think a lot of people were surprised that the LLaMA-3 paper wasn't as “novel” as they were expecting. But that just reflects the fact that a lot of it is just engineering and engineering is really hard - a lot of hard work. Not always a lot of new methods, but it is a lot of hard work.
Nathan Lambert [00:56:22]: Yeah, we're starting our next fine tuning model and everyone's asking me: “what should we work on?”. I'm trying to tell them “we just have to filter data and generate more completions”. We’ll have a lot of prompts, we have to filter them, generate completions from good models, and then we’ll have to generate more completions and keep doing this process…And in 10 weeks, we'll probably have a very good open model. We’ll just have to be boring for 10 weeks! And we have like 10 people involved.
So it's a bit of a bigger project, which I think is the right way to do it. We have just started getting improvements on IFEval by copying Nemotron. We use some open math datasets and the math scores are getting closer to LLaMA. It is really the simplest thing ever. It's like browsing Hugging Face and being like, “NVIDIA released some JSON format data, some instruction format data, we add it in and the numbers go up”.
Ross Taylor [00:57:16]: Yeah, I think I said this earlier, but it raises an interesting question with this kind of approach - grinding until the Open LLM Leaderboard numbers get to 100%. I think we’re going to get to a situation where all the benchmarks are solved, but where we haven't really, in my mind at least, solved intelligence.
What does it mean that we'll get close to 100% on MATH, you know, without any inference time search? I think sooner or later, while it looks like we’re on an exponential with LLMs, we’ll realize we’re actually on an S curve. Eventually we're going to get back to this mode where we have to do new things. And I think that's great, because that's what motivates me.
But yeah, I think there's waves, and we’re in this heavy exploitation mode right now with LLMs - away from the glory days of architecture exploration. But my hope is that we'll get back to the stage where, after exhausting all the [current] benchmarks, we say: OK, now we need to do something completely different. But who knows?
Nathan Lambert [00:58:26]: I see it similarly. I think we still have a year or two, at least in the open. If the closed models start saturating and they start doing things differently, that's fine. But eventually it'll all get there. And in that phase, I mostly keep working just to make sure that the ecosystem doesn't fold in on itself. So that's probably the one-sentence summary of what I'm doing these days: add transparency so that regulatory capture doesn't nuke everything. And that's fine, but I think it's still going to be longer than people expect. I don't think we have true signs of saturation at the top. We'll see what GPT-5 does - and if GPT-5 never comes out, then we’ll really know.
But it seems like it's going to come. I think there's enough signs that it'll come eventually. I think I don't know the answer to this - and it's not really our expertise - but I'm interested in the potential architecture of GPT-5 and if it's GPT-4o like and they're using more multimodal data to try to keep the data engine going relative to just going bigger. I don't know the answer, but that's kind of the future questions I’m thinking about.
Ross Taylor [00:59:34]: In my mind, like three years ago, the thing on the horizon I saw was agents. That’s where a lot of people are working right now: long form tasks where an agent doesn't have to answer a question immediately, and [can instead] go away for a while doing some research and answer later. I think that will take up a lot of time in the next five years.
It's both a compute problem of bigger models - more scale will do better - but also a data problem of how do you generate these trajectories? How do you get reliability? So it’s more successful and less error-prone at each step. And I think in principle it's solvable, but I just think it would take some time.
Nathan Lambert [01:00:18]: Yeah, it seems that engineering is required. It doesn’t seem like something that's just going to emerge. It's building a whole system and scaffolding around agents. Just unglamorous work.
Ross Taylor [01:00:32]: Yeah.
Nathan Lambert [01:00:34]: OK, anything else you want to add? Do you want to get people excited about your start-up or is it too early? Maybe too early, yeah?
Ross Taylor [01:00:43]: Yeah, what else should I say? It has been nice to step back for a bit and look a bit ahead into the future. For me, my best days creatively were my teenage years when I got back home from school and spent the rest of the day programming. It’s quite nice to feel like that again: to be in that zone again where I can shut the world out and do some work.
But maybe just to give a hint of the areas I'm interested in, I think it comes back to this problem of how alignment is going to be a process of making AI more human-like. For example, how do you control for things like deception - which Anthropic has done a lot of really good work on.
Essentially… the latents of AI are [potentially] misaligned with human latents, and the question is: what do the human latents look like anyway? And how do we model these things?
That is very abstract and high level, but that is the fundamental question I want to work on. But yeah, I think I can talk about it later in the year!
Nathan Lambert [01:01:49]: Yeah, sounds good. Thanks for coming on. This was great. I think people are going to get a ton out of this. I think just a very sensible conversation on fine-tuning, reasoning and some of the things that got us here. And that's what I was hoping to get out of it, so thanks again!
Ross Taylor [01:02:06]: Yeah, great to talk, Nathan. Have a good one!
Apple, Meta, and Nvidia all agree -- synthetic data, iterative training, human preference labels, and lots of filtering.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/frontier-model-post-training
00:00 Llama 3.1 post-training and the new normal for RLHF
01:18 A new standard pipeline
01:45 Human preference data
02:59 Scaling RLHF
05:03 Synthetic data
06:10 The new normal
06:51 Data quality is king
07:18 Apple confirms the new normal
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_020.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_031.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_033.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_035.png
This week, I had the pleasure of chatting with Sebastian Raschka. Sebastian is doing a ton of work on the open language model ecosystem and AI research broadly. He’s been writing the great Ahead of AI newsletter (that has the biggest audience overlap with Interconnects, at 26%, so a lot of you know him) and multiple educational books, all on top of being a full time machine learning engineer at Lightning.ai, where he maintains LitGPT, which he described as being like Karpathy’s NanoGPT, with slightly more abstractions.
This conversation mostly surrounds keeping up with AI research, the state of the open LLM ecosystem post Llama 3.1, and many narrow topics in between. I learned that Sebastian used to be an Arxiv moderator, which gives some simple color on how Arxiv and sifting through thousands of papers works. We cover a lot of ground here, so I hope you enjoy it.
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other interviews, go here.
YouTube
Chapters
* [00:00:00] Introduction & Sebastian’s background
* [00:04:28] The state of deep learning and language models in 2018
* [00:08:02] Sebastian's work at Lightning AI and LitGPT
* [00:12:23] Distillation and its potential in language model training
* [00:14:14] Implementing language models and common pitfalls
* [00:18:45] Modern architectures: Mixture of experts models, early v. late fusion multimodal
* [00:24:23] Sebastian's book on building language models from scratch
* [00:27:13] Comparing ChatGPT, Claude, and Google's Gemini for various tasks
* [00:38:21] Vibing and checking new language models during implementation
* [00:40:42] Selecting papers to read and moderating Arxiv
* [00:45:36] Motivation for working on AI education
* [00:52:46] Llama 3 fine-tuning
* [00:57:26] The potential impact of AI on jobs in writing and education
* [01:00:57] The future directions of AI
Transcript
Built with smol-podcaster and with love of Latent Space.
Nathan Lambert [00:00:00]: Hey, Sebastian, welcome to this kind of Interconnects episode - normally these are researcher interviews. You were a professor, so that definitely counts. You do a lot of different things these days. Let's get talking about language models. Welcome. Yeah.
Sebastian Raschka [00:01:35]: Thanks so much for the invitation, Nathan. I'm a big fan actually of the interconnects newsletter, so I'm hoping we can have some fun chat about research, LLMs, and what's hot these days, basically. Yeah.
Nathan Lambert [00:01:48]: I have a little section on the end, which is keeping up with AI research, writing about AI and process, because you do so many things, but I kind of want to jump into how you got to AI, because you have an interesting career path. So you were a professor at Wisconsin-Madison for years - in statistics, I saw. I also went all the way back to find your PhD thesis, which was on uncovering hidden patterns of molecular recognition. So this was a while ago. Can you explain your background and how you got into AI? I'm guessing it's through computational statistics or something like this.
Sebastian Raschka [00:02:24]: Yeah. Close. So yeah, you did some research there. Interesting. So yeah, it's been a long time since my PhD thesis. This is maybe seven years now. And back then, it started even earlier when I got into AI, that was like, I would say 2012-ish. I was in grad school and I was taking a statistical pattern classification class. And in that class, yeah, the star of the show was basically naive Bayes classifiers, or in general, Bayesian methods for pattern recognition. And from there, I kind of really got into machine learning. So it was, I would say, more statistical-based, but it was all about classifying things. And then I think it was also right about the time when Coursera was launched, and I saw Andrew Ng's Coursera class. That was, I think, the first class in 2011-12 back then. And yeah, that's basically how I started from statistical pattern classification into machine learning. And I applied that to computational biology problems like molecule and drug discovery, like pharmaceutical drug discovery. And yeah, from there, at some point after my graduation, I joined the University of Wisconsin-Madison, where I was in the statistics department, but I did mostly deep learning research, essentially. I was the only one basically doing Python, deep learning, machine learning stuff. So yeah.
Nathan Lambert [00:03:48]: What year was this, and what did it look like at the time?
Sebastian Raschka [00:03:52]: That was around 2018, I think August 2018, when I joined the department. And yeah, I mean, so it's the statistics department, but my work was technically all machine learning and deep learning. I mean, a lot of students were really excited about learning machine learning. I think it was just around the time where it got really popular. And yeah, I was teaching machine learning and deep learning classes as well. They were always like, you know, full and crowded, like a lot of students were excited about that. Also, in general, like the time learning about Python, machine learning, data science, all these topics.
Nathan Lambert [00:04:28]: It's, I mean, it's very interesting because I was a grad student at that time, in like 2018. That's when deep RL was really taking off. And it probably felt kind of like the language model thing does now, as a student at the time, where it's just like, there are so many people in all these classes. And now language models have more of a real world application, but I think as a student, it probably feels so, so similar. Yeah.
Sebastian Raschka [00:04:50]: So also back then, if I may say so, large language models already existed. I think the GPT paper, was it 2018? Something like that?
Nathan Lambert [00:04:59]: Yeah, 2018 or 2019. Yeah. For GPT-2, I think.
Sebastian Raschka [00:05:04]: I remember covering that - like I had a whole hour or two hours on large language models back then, but it was all focused on BERT models and basically also using them for more like classification tasks. Now, I would say maybe a lot of business problems still revolve around classification, but everything else is basically generative, generating text, generating images and stuff. So it has changed a lot.
Nathan Lambert [00:05:28]: Yeah, for sure. It's like a sequence of - is it the Transformer, is it ELMo, BERT? The Transformers are probably the things that you were talking about all the time. Just very interesting. I think Yi Tay had this - did you read Yi Tay's recent blog posts on language model architectures, which kind of walked through why encoder-decoder is no longer in vogue? Did you see this?
Sebastian Raschka [00:05:51]: Yeah, I think I haven't seen the article, but I remember having discussions with people about that recently. I mean, I think there was actually, it's interesting. So I think T5, if you would train it and fine tune it, it would still be a really good model for sequence to sequence tasks, like language translation and stuff like that.
Nathan Lambert [00:06:10]: Yeah. Cohere for AI did this with Aya. They used T5 for their first Aya version, which most people missed - Cohere branded it so well that no one realized they were using T5.
Sebastian Raschka [00:06:21]: See, I even didn't know about that. And so also on that note, I would say there was something else I wanted to say. So then there's also still the classification thing and using LLMs for classification. And it was usually either a BERT-like encoder, or you could also use an encoder-decoder, but mostly an encoder. But I've seen recent papers using just decoder models for that. Just basically removing the causal mask - I saw two papers on that actually. So basically reverting it back to an encoder: using LLaMA and then removing the mask. So in that sense.
Nathan Lambert [00:06:59]: And it works well as a classifier. You can just kind of use it. That's awesome.
Sebastian Raschka [00:07:04]: I mean, you could even do that without removing the causal mask. So you could just tune on the last token, basically. But yeah, if you remove it, they found that you could probably even use the first token, because with the last token you always have to pad to the longest sequence; otherwise the last token would sit at a different position in each training example. And so in this way you could use an earlier token, basically, and keep its position fixed.
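A minimal sketch of the decoder-as-classifier idea being discussed: run a causally masked backbone and pool the hidden state at the last non-padding token for a classification head (dropping the causal mask, or pooling a fixed earlier token, are the variants Sebastian mentions). The tiny model below is a toy stand-in for a pretrained decoder like Llama.

```python
# Sketch: using a decoder-style LM as a classifier via last-token pooling.
import torch
import torch.nn as nn

class DecoderClassifier(nn.Module):
    def __init__(self, vocab_size=1000, dim=64, num_classes=2, pad_id=0):
        super().__init__()
        self.pad_id = pad_id
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        seq_len = token_ids.size(1)
        # Causal mask, as in a decoder-only LM; dropping it is the
        # "revert it back to an encoder" variant.
        causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        hidden = self.backbone(self.embed(token_ids), mask=causal)
        # Pool the hidden state at the last non-padding token of each example.
        last = (token_ids != self.pad_id).sum(dim=1) - 1
        pooled = hidden[torch.arange(token_ids.size(0)), last]
        return self.head(pooled)

model = DecoderClassifier()
batch = torch.tensor([[5, 7, 9, 0, 0], [3, 4, 6, 8, 2]])  # 0 = padding
print(model(batch).shape)  # torch.Size([2, 2])
```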
Nathan Lambert [00:07:30]: Yeah. Yeah. Now with your work at Lightning AI, do you do a lot of these things like hacking around with language models? Because I think it's kind of an underexplored space where people just remove layers and plug things together. I think there was, like, when merging was just getting going, there was Franken-Llama 2, where somebody made a Llama 2 30B by just chopping layers and stuff together. There's so much unexplored signal there. Have you ever looked at these things, or do you not do that much of it?
Sebastian Raschka [00:08:02]: I must say I'm not a big fan of merging. Maybe I'm just not good at it. I rather prefer fine tuning, start changing things or training and fine tuning things. So yeah, I do a lot of this type of hacking. Sometimes voluntarily, sometimes involuntarily, because I make a mistake or something. Because at Lightning I developed this library, LitGPT, which is an open source library for pre-training, fine tuning, serving and deploying LLMs. But it's basically a from scratch implementation. You can think of it as a NanoGPT from Andrej Karpathy, but for all types of LLMs, like Llama, Gemma, Phi, all of them. But the focus is also, like NanoGPT, on readable code, or keeping it relatively simple. Of course it gets a bit more complex when you add multi-GPU training, tensor parallel, fully sharded data parallelism and stuff like that. So if you add all these settings, it gets a bit more complicated, but the focus is still on having a code base that you can easily work with. And in that context, it's very easy to remove layers and change things. I mean, yeah, so I build it for colleagues at Lightning, but also for the open source community, and then also for myself to tweak things, to change things and stuff like that. So yeah, I should also say, it's not just me, it's Carlos and Adrian who started this library. Currently I'm like the main person maintaining it, but a lot of people contribute to it. So it's actually a nice playground.
Nathan Lambert [00:09:41]: There are kind of two follow-ups to this. One is: what part of the language model training stack, if somebody is going to start with LitGPT or Hugging Face or whatever, like they're trying to fine tune a model, you can do an example. And then what is the thing that they should do to go one level deeper to learn how these things work? Because you're saying with LitGPT, you can do all these different architectures. I don't know if I would recommend starting with all the architectures, but it's a good way to learn how the attention implementation works and how different layers are shaped and things like this. Are there different areas you'd recommend people to look at?
Sebastian Raschka [00:10:14]: Yeah, I would actually, okay. So it's like a shameless plug, but I have a book where I do this step by step, the implementation. And this is for only one model, like a simple model, a GPT-2 model. Because it's, I would say, the one that started all of this, right? Like the main architecture, and everything else is kind of like a derivative of it, almost - in a good way, in that it is making tweaks and improving things. But basically starting with one architecture, like you said, not looking at different ones at first, and then just understanding: what is the input data here? What does it look like? What goes into the LLM and how does it pass through the layers? And then from there, okay, we understand how a model learns to generate one word at a time, and then going from there to instruction fine-tuning, and then even alignment with DPO, for example. So doing all these different lifecycle things, from implementing one architecture to pre-training, fine tuning, aligning. And from there, I think it's a useful or interesting exercise to see how different architectures make slightly different choices, like replacing the GELU activation with a SiLU activation, or pre- versus post-layer norm, and these kinds of nuances - changing the number of heads or number of layers. And yeah.
Nathan Lambert [00:11:38]: Yeah. I mean, in industry, everyone kind of is converging to similar things - like people converge to a similar recipe and then they stick with it for infinity. So each of the orgs have these recipes that are too risky to change, and AI2 is still converging on a recipe. So we're learning things that the Llama team does - like RMSNorm, and they think it's very important - like these different things. And I wonder how the open community is going to converge on pre-training things. So what scale of models do you recommend people train for your book? Are they training the hundred-million scale GPT-2? Is it smaller? Because I think in Colab, you can maybe fine tune a 7B model with LoRA, I think. Is that true?
Sebastian Raschka [00:12:23]: Yeah. So this is true. But I think for LoRA, if you want to fine tune a 7B model, you would need, I think, bitsandbytes quantization - the NormalFloat (NF4) quantization. But yeah. So for the, or maybe going one step back for the book, it's really the smallest model, like the hundred, what is it, hundred-something million. But I also have settings, if, let's say, your machine permits, to use a larger version. So there are larger versions too, like 300, 700, and 1.5 billion. But it's really up to the reader. I have all the examples with the smallest one so that it even runs on a MacBook Air. So on this podcast, I'm here on my small MacBook Air and all the models train in a few minutes fine. Of course, I'm not doing the whole pre-training for that. You would need a GPU for a week or maybe, I would say, maybe even longer than that now. I mean, it depends on the GPU, of course, but H100, maybe a week. But also the other reason is, yeah, in practice, you would probably use pre-trained weights and then you can do continued pre-training and then fine tune. So the focus is basically understanding how the pre-training works, then loading pre-trained weights. But then also the fine tuning is the full thing: doing it to fine tune a classifier, but also instruction fine tuning essentially. And that doesn't take too long. I would recommend using a GPU, but it would technically run on a CPU. And to get back to the question you had about a 7 billion model: one A100 would probably work for a 7 billion model. But you can also, if you use LitGPT, set the number of devices and shard it over multiple GPUs. Yeah.
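A minimal sketch of that kind of setup with the Hugging Face transformers, peft, and bitsandbytes stack (rather than LitGPT): load the base model in 4-bit NF4 and attach LoRA adapters so only a small number of parameters train. The checkpoint name is a placeholder, and exact argument names can vary slightly across library versions.

```python
# Sketch: LoRA fine-tuning a ~7-8B model on a single GPU via 4-bit quantization.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_id = "meta-llama/Meta-Llama-3-8B"  # placeholder checkpoint

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",             # the "NormalFloat" quantization from QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)

lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],    # attach adapters to attention projections
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()          # only the small LoRA matrices are trainable
```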
Nathan Lambert [00:14:14]: I mean, all of this stuff is getting so much easier. I think, I don't know, when did you start writing this book and all of these chapters? Because I've seen the GitHub, I haven't looked at when it started.
Sebastian Raschka [00:14:23]: Actually longer than you might think. It took a long time. It's almost, at this point, one and a half years approximately.
Nathan Lambert [00:14:30]: Because at that time, like a 1 billion parameter model - what was the state of the art 1 billion parameter model a year and a half ago? Some random model. But today, people are training 1 billion parameter models for 15 trillion tokens. So the fine tuning that you can do there is getting extremely good. And I'm going to guess that people are going to start training even smaller models with these distillation losses. So have you looked at distillation at all? I think it's full on coming in the next six months. We can shift to the Llama 3 and state-of-the-open-ecosystem section, because it kind of goes in. Llama 3 was not distilled. It's a specific loss function. I hate that when synthetic data came around, people called it distillation - I was on this paper, the Zephyr paper, the title is Direct Distillation of Language Models. But now the technical definition of distillation, which is knowledge distillation from a teacher, is becoming popular. So the whole synthetic data and alignment thing is stuck with a doubly defined word.
Sebastian Raschka [00:15:30]: So basically what you're saying is that people who just use synthetic data refer to it as distillation because it's from a larger model. Yeah, confusing. I think Gemma 2 actually did that recently, the knowledge distillation, so that was an example where they did it. And I do think it's coming. For my book, those are the core chapters I have, but I have a whole long list of bonus material that I want to cover, and knowledge distillation is one of them. So this will be something over the next few years, doing tutorials on those.
Nathan Lambert [00:16:04]: Because I think people can actually use it. So how distillation works (I've thought about implementing it) is that if you have a fine-tuning corpus, you get all the predictions from your big model, all the log probabilities, and you store them. And then as you're training the smaller model, you essentially weight its loss by those stored predictions, so you don't need to keep the big model in memory while you're training. So I think someone will upload a dataset file of giant log probs from Llama 405B and people will just try to fine-tune from it. I'm surprised that Llama 3 didn't use it, but I think it's just because they're focused on scale and data more than any fancy things.
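(A minimal sketch of the loss used in this kind of knowledge distillation, assuming the teacher logits have been precomputed and loaded from disk as described above; shapes and temperature are illustrative.)

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions.

    teacher_logits can be loaded from a precomputed file, so the large
    teacher never has to sit in memory while the student trains.
    """
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    teacher_probs = F.softmax(teacher_logits / temperature, dim=-1)
    # scale by T^2 so gradient magnitudes stay comparable across temperatures
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * temperature**2
```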
Sebastian Raschka [00:16:49]: Yeah. And I think I probably know why. One thing one should also add, about why it's becoming more popular: with Llama 3.1 they just allowed doing that. I think before, according to the license, it was technically not allowed to use Llama 3 models to improve other models, but now we can. So like you said, it's probably going to be a hot topic. But I do think they didn't do it because the 405B Llama model had just finished. If you think back, they shared the Llama 3 models, what, half a year ago or something, many months ago. So I think it's really more that it hadn't finished training, but maybe for Llama 4 we will see more distillation using the 3.1 model.
Nathan Lambert [00:17:38]: Yeah, it's more architecture things. So while we're talking about distillation: Gemini Flash, Google's Gemini Flash, is confirmed as distillation. And it is very likely that Claude Haiku and GPT-4o mini are distilled in the technical sense of the word. I think it's obvious that that works in pre-training. And I think there will be a breakthrough fine-tuning model, kind of like the likes of Zephyr and Starling, I'm forgetting more names, but one that really shifts the narrative around fine-tuning on distilled data. I think that'll come in the next six months. So honestly, I'm telling the people I work with we should try to do this before someone else does, because it's so obvious now.
Sebastian Raschka [00:18:22]: One thing I've seen as a trend, I wouldn't say backwards, but a thing that doesn't seem to be that popular anymore, is mixture-of-experts models. What do you think about that? Was that a fad, and now people explore other things like distillation? I mean, you could do both, but it feels like mixture of experts is not as hot anymore somehow. I don't know. What do you think?
Nathan Lambert [00:18:47]: There's two things. Small mixture-of-experts models are definitely coming out. Essentially, you get a fixed improvement in FLOP efficiency at pre-training. So if you're going to pre-train an X billion parameter model with mixture of experts, it'll go like 40 percent faster, or some pretty appreciable number. There's a lot of rumors and discussion that scaling up mixture-of-experts models is really hard from a stability point of view. A lot of the open groups can get started, and we're playing with these at AI2 too, so we want to play in the mixture-of-experts space as well. Doing a small model works, but there's a lot of headaches. I think some of the friends at Databricks MosaicML have been the clearest about this: you, at AI2, do not have the engineering throughput to deal with the headaches that come from mixture of experts. There's still clear signal from industry; I mean, DeepSeek is releasing MoEs, I think Qwen has a small MoE, and these are pretty good models. But I think it's a really heavy engineering lift to get mixture of experts to work at GPT-4 scale. I expect Meta to figure it out; I think it's just on their list and they figured out dense first. The thing I'm more interested in: I don't care if Llama 4 is mixture of experts, I think they have the compute to do it either way. God, all the version numbers throw me off so bad. But I think that OpenAI and Google might be slightly ahead by having an early-fusion model. Essentially, with these multimodal models there's the concept of early versus late fusion. The first vision models that people were playing with, like GPT-4 with vision, were late fusion. And now GPT-4o is early fusion, and it seems like Gemini is probably early fusion, which means they take in audio, video, and text directly at the input, so the training data changes. And I don't know how much of a heavy lift it is to get that to work. I think that might be the bigger change, and that might be harder for Meta to catch up on than anything else. But no one's really talking about it.
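(A rough, simplified sketch of where that FLOP efficiency comes from, not any particular lab's implementation: a sparse mixture-of-experts layer routes each token to its top-k expert MLPs, so only a fraction of the layer's parameters is active per token. All sizes below are arbitrary toy values.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoELayer(nn.Module):
    """Toy sparse MoE layer: route each token to its top-k experts."""
    def __init__(self, d_model=256, d_hidden=1024, num_experts=8, top_k=2):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )
        self.top_k = top_k

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = self.router(x)                          # (num_tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)   # pick top-k experts per token
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e in range(len(self.experts)):
                mask = idx[:, k] == e                    # tokens routed to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * self.experts[e](x[mask])
        return out

# usage: only top_k of num_experts MLPs run per token
layer = TinyMoELayer()
y = layer(torch.randn(10, 256))
```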
Sebastian Raschka [00:20:58]: But also here, I think that is something others have done. I mean, I remember even last year there were a lot of papers doing the late-fusion thing, like the LLaMA-Adapter papers and stuff like that, retrofitting the models. But yeah, I haven't seen that much focus on early fusion from Meta. I mean, they had a section on it in the paper, but it felt almost like an afterthought. I don't know. I think maybe there's a different team at Meta that works on that.
Nathan Lambert [00:21:26]: There is a Chameleon team that was doing this, and I think a lot of them have left. My question, essentially, the thing I want to debate and don't know the answer to, is this: it takes such different data pipelines. You have to have a much clearer balance between video, images, audio, and text when you're training early fusion than with late fusion, where you just add a bunch of images at the end. So is that data curation step going to be a big bottleneck for shifting, and do Google and OpenAI have an advantage by just scraping YouTube? Google obviously can scrape YouTube, and I'm not saying that they are. But if it becomes a way to get more data, and if something like a GPT-5o is the next model that OpenAI releases, then I'll be like, OK, the GPT-4o thing was just a pivot. And I actually think this could happen; I don't put this at a one percent probability. I could see this being what the labs are betting on. It just takes so long to spin up this entire new pipeline of training.
Sebastian Raschka [00:22:25]: But one question here, going back to a point you mentioned earlier regarding knowledge distillation, where you can precompute all these things: you could technically do that also just once for the whole dataset. Let's say you have a very good image encoder and audio encoder. You would never have to redo this if you do it well, right? I mean, it would be something you take care of once, and then you pass it just as tokens to the other team, basically.
Nathan Lambert [00:22:49]: Yeah, probably. I don't know. I don't have as much insight into really advanced pre-training practices as I would like. I'm mostly in a similar boat of fine-tuning models and playing with things. Speaking of which: have you played with Llama 3 405B at all? For context, we're recording about a week after, like six days after, the release. I haven't gotten it set up, but I'm really curious. I don't have clear expectations for how the open-source community, the open language model ecosystem, evolves from here with these new Llama models and the new Mistral models. From a technical and a policy perspective, it feels like a pivot to me. On the educational side of things, it's actually more of the same; we knew this was coming. But it feels like it could be qualitatively different going forward. Do you see anything? Have you tried anything?
Sebastian Raschka [00:23:45]: Yeah, I did actually try the Llama 3.1 models. When they came out last week, we added them to LitGPT. I took care of the 8 and 70 billion models, and my colleague Adrian also added support for the 405 billion model. So, just briefly trying it, it looks really good. The thing with the 405 billion model is that it's a bit tricky. The problem is, of course, it's free, everyone can use it, but in a sense it's still expensive to run, because we were running it with bitsandbytes quantization, normal float 4, on eight H100s. And this is expensive, right? I mean, eight H100s is probably more than a hundred bucks an hour.
Nathan Lambert [00:24:26]: I was trying to do the same and I messed up the vLLM installation. I was like, okay, I've spent an hour on this. Yeah.
Sebastian Raschka [00:24:32]: So you can try LitGPT, maybe. It works with that.
Nathan Lambert [00:24:36]: Yeah. And there's a related question. One of the things I'm trying to ask people who are hands on, just like, how do you, what do you do to vibe check a new model as you go through so much AI research material and language model material? It's like, everyone has their procedures and how do you go about that?
Sebastian Raschka [00:24:51]: So for me, I use these more for making sure they generate correct answers, or something that is reasonable. So honestly, really simple questions; I'm not necessarily benchmarking these models, I'm more making sure the implementation is correct. And for that I use simple questions like: what do llamas eat? What is one plus two? Just making sure, because it's actually easy, something I just fixed this morning, it's easy to mess up things like KV caching, where you don't clear the cache and then there's something left from the previous answer, and the answer looks kind of correct but it's kind of weird. Simple questions can sometimes reveal that. So basically what I do is ask it the same questions repeatedly, in a loop, mixing them up, and see if the outputs still make sense. I'm not really benchmarking, but it's a great way to make sure the implementation works.
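(A minimal sketch of that kind of sanity-check loop with Hugging Face transformers; the model id is a placeholder. Cycling a couple of trivial prompts repeatedly tends to surface stale-cache or templating bugs as answers that drift or stop making sense.)

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "some-org/some-small-model"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

prompts = ["What do llamas eat?", "What is 1 plus 2?"]
for round_idx in range(3):               # ask the same simple questions in a loop
    for prompt in prompts:
        inputs = tokenizer(prompt, return_tensors="pt")
        output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
        answer = tokenizer.decode(output[0], skip_special_tokens=True)
        print(f"[round {round_idx}] {prompt} -> {answer}")
```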
Nathan Lambert [00:25:53]: Because I think in Transformers they had a missing end token at one point. There are so many little things like this when implementing stuff. The end tokens are such a bane, or the chat templating can always break things. Because it can also happen that you mess something up in pre-training and then you need to have something in the chat template that people might not know about. I think in one of the early OLMo models, we missed a newline in one of our documents when we were annealing it. So in order to fine-tune it, you had to have an extra newline before the chat template, and most people will just miss that.
Sebastian Raschka [00:26:28]: Yeah, this is a very interesting point. You don't even notice it usually when you use something like, I don't know, ChatGPT, because it's applied behind the scenes. But if you implement these things yourself, you have to be really diligent and careful to do it consistently. One little thing, like you said, a newline, throws it totally off. It's interesting. I noticed that when I was working on some DPO stuff this weekend: the prompt template I used for fine-tuning and the one I was using for DPO alignment were a bit different, and I got garbage results. And then, oh, I had stripped a newline character somewhere, basically something similar to what you said. So it's very sensitive to that.
Nathan Lambert [00:27:04]: Yeah.
Sebastian Raschka [00:27:04]: Yeah.
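(One cheap way to catch the whitespace issues discussed above, assuming a Hugging Face tokenizer with a chat template: render the template explicitly and inspect it with repr(), which makes a stray or missing newline visible. The model id is a placeholder.)

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-org/some-chat-model")  # placeholder

messages = [{"role": "user", "content": "What do llamas eat?"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# repr() exposes every newline and space exactly as the model will see them
print(repr(prompt))
```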
Nathan Lambert [00:27:05]: This makes sense. Related: do you use Claude, ChatGPT, any of these regularly in your workflow? Are you team Claude?
Sebastian Raschka [00:27:13]: So yeah, it depends. I have both and I flip back and forth between them. I don't know, I'm probably not really good at prompting, but sometimes I get better results with one over the other. I wouldn't say one is better than the other; they're just different. I would say I'm using...
Nathan Lambert [00:27:31]: That's kind of what I think too. What do you think of each of them? I think it's good for people to know this, because it takes some practice to understand, and most people don't use both.
Sebastian Raschka [00:27:43]: I would say, when I use GPT-4, I use what's called legacy now, the original GPT-4. I don't like the mini and 4o versions. And for Claude, I use the opposite: not the new one, but the previous, larger, slower one. For me, it's kind of weird, but most of the time I like GPT-4 better for code stuff. I think what's also better with GPT-4 is that it's a bit more up to date with knowledge. But Claude, when you say improve my writing or something like that, it has fewer of these weird words, like "delve into" something. It's a bit more natural, I would say, but...
Nathan Lambert [00:28:34]: Also not always.
Sebastian Raschka [00:28:34]: I agree.
Nathan Lambert [00:28:36]: It's like, it has a bit more flair and a bit more unpredictability. So I use Claude on my phone, but I've tried to use Claude for information transformation tasks, like LaTeX or taking data out of a table, and sometimes it just refuses. I do research on AI safety, like safety and bias, so if I put anything related to that into Claude and try to transform the data, it just says no, because it's like, I can't comment on a mean story. Whereas OpenAI will just do it, and the processing that OpenAI does is very good. So I actually canceled my ChatGPT subscription when I started with Claude, but I kind of regret it now. I'm like, oh, now I need both, which is a little annoying. Yeah.
Sebastian Raschka [00:29:16]: Yeah. One thing that is interesting, though: we're talking about GPT-4 and Claude here, but we haven't even mentioned Google Gemini.
Nathan Lambert [00:29:24]: I don't know.
Sebastian Raschka [00:29:24]: I personally tried the early versions. I don't want to say the newer versions are not good, I just haven't tried them because I didn't need to. But do you have experience with Gemini?
Nathan Lambert [00:29:34]: I was using Gemini in the search preview. So if you have the Google app (I'm recording this on video), at the top you can click on Gemini, which I was doing for a while just to play with it. But I don't use it on the web. They do have a nice interface that looks exactly the same, but somehow I got grandfathered into AI Studio, which I use when I record a podcast: I upload the podcast and say, write chapters or something. And it actually works, which is pretty cool, to be able to upload an hour-long podcast. But for whatever reason, the Google interface, other than the Google app, hasn't stuck for me. I think that's the biggest limitation. And I use it more in a Google-y way, so I'm not as perceptive of its style.
Sebastian Raschka [00:30:20]: I see, I see. So also, I'm curious: I just saw yesterday that Apple's on-device AI is a bit delayed, I think. That's an interesting one; we will see how it works, because those will be, I think, smaller models. And for me, I never really care about speed for these things. I just want the best possible models. This is also why I was a bit disappointed when GPT-4o came out and GPT-4o mini came out. It's like, ah, I don't really care if it's faster or not, I just want it to be better. I want better quality. I don't know, maybe it's just me.
Nathan Lambert [00:30:53]: I think for building applications, speed is really good. I have a few friends who run startups that are heavily built on language models, and they have a similar stack to Perplexity: the user passes in a query, they have a primary language model request, and they have a series of feedback requests or small requests on top of that. So when you're concatenating multiple requests, speed is extremely important, and when you're selling a product, speed is extremely important. But if you're just tinkering and trying to learn, it matters much less, it's true. Yeah.
Sebastian Raschka [00:31:19]: It's like the individual user, using it as a tool in everyday life, versus really building an application based on an API. That makes sense.
Nathan Lambert [00:31:32]: Yeah.
Sebastian Raschka [00:31:32]: So there are two different use cases.
Nathan Lambert [00:31:34]: Yeah. I think we're kind of talking about style. I have a section on RLHF here. Since you spend so much time on AI education: what do you think is most confusing to people about this whole post-training thing, which is instruction tuning, reinforcement learning from human feedback, other safety modules like adding a filter, and stuff like this? I'm really on the bandwagon of trying to convince people that RLHF is deeply tied to style, which is how this discussion of Claude versus OpenAI versus Google goes. And I don't really know how to portray that from an educational, technical point of view. So I'll do an analysis of a paper, and I'll do DPO and scores and all these things, but at the same time, for most people reading my articles, the most important thing is probably to know that OpenAI is really smart about their style, and that's why they're so high on ChatBotArena. I've written about it a couple of times. I have another article in the drafts, which is essentially about why GPT-4o mini broke ChatBotArena, because everyone's so upset that it scored so highly, but it's not that surprising if you look at historical events.
Sebastian Raschka [00:32:39]: So it's basically exploitation of the benchmark, almost, you're saying? Or the benchmark...
Nathan Lambert [00:32:45]: is focused on style and it really penalizes refusals. So like I get refusals when I use cloud. So it's definitely going to like be downweighted. And like open AI is really good at this. This is what they've been doing for a long time. But I don't really know how to educate this. Like, have you thought about, like, there was a question on Twitter of why didn't you include RLHF in your latest? It was kind of a joke, but I took it out.
Sebastian Raschka [00:33:09]: Well, yeah, I can maybe answer that. It's in the works. No, there are multiple reasons. One is that there are page limits per chapter, and it was originally meant to be in chapter seven, but it got way too long. Actually, even without it, chapter seven is already the longest chapter. And what's in that chapter is fine-tuning.
Nathan Lambert [00:33:29]: Oh, sorry.
Sebastian Raschka [00:33:30]: Instruction fine-tuning. Yeah, I didn't call it instruction fine-tuning; I called it fine-tuning to follow instructions. That chapter was originally meant to have both, but then it got too long. The other thing is, one book chapter takes about two months, and a lot of people really want the book before the new semester starts. So there could be another chapter on it, but it would be another two months. And, I mean, it's not really an excuse, but the other reason is I was not happy with the results. This is a very mathy topic, and I was like, okay, I have this book which is very clear and hopefully makes a lot of sense, and then I'd have this really super complicated chapter at the end. I don't know if that's very satisfying to read or not.
Nathan Lambert [00:34:15]: Yeah.
Sebastian Raschka [00:34:15]: Where it's like, so you read this book, everything makes sense. And then it comes to this huge...
Nathan Lambert [00:34:19]: Why is RLHF so much mathier? I know there are a couple of core equations. The core equation is the RL optimization step, which is an expectation: maximization of reward subject to a penalty. And compared to pre-training, which is basically one equation, that is also one equation, but there's a lot of downstream stuff, I'm guessing. Yeah.
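(For reference, the objective being described here is usually written as maximizing expected reward under the policy, subject to a KL penalty against a frozen reference model:)

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
\left[ r_\phi(x, y) \right]
\;-\;
\beta \, \mathbb{D}_{\mathrm{KL}}\!\left[ \pi_\theta(y \mid x) \,\Vert\, \pi_{\mathrm{ref}}(y \mid x) \right]
```

Here \pi_\theta is the model being trained, \pi_ref the frozen reference model, r_\phi the reward model, and \beta the strength of the KL penalty, which is the term Sebastian mentions next.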
Sebastian Raschka [00:34:41]: I think it's explaining a bit about reinforcement learning. I mean, you don't really have to explain reinforcement learning in the classic sense, maybe, but there's still the KL divergence and penalties and reward margins, and there are lots of things happening at the same time. And the code is also very long, especially if you want to track the rewards and stuff. For my instruction fine-tuning chapter, I'm using exactly the same training function I implemented in the pre-training chapter.
Nathan Lambert [00:35:14]: And it's really nice.
Sebastian Raschka [00:35:14]: It's like, well, you can actually reuse everything. It's, it fits together.
Nathan Lambert [00:35:18]: Yeah. For what we're doing on OLMo, we can baseline our instruction fine-tuning in our fine-tuning code base, which also has some RL things, and in our pre-training code base. So it's nice to have both, but that is definitely why it's simpler. And the RL is only getting worse in my mind, I think. We've seen that Llama has used rejection sampling for two iterations, and there's no public implementation of rejection sampling that is public enough to know that people have actually trained models with it. That's the idea of ranking completions with a reward model and then running instruction tuning again on the top completions.
Sebastian Raschka [00:35:54]: I think in the recent Llama 3.1 paper, they also used rejection sampling with DPO, for example. They didn't use RLHF with a reward model directly, but they used the reward model for the rejection sampling. And I must say, I have the code for DPO. I wanted to do DPO because it's also more resource-efficient; you don't have to train a reward model for, let's say, the book. But I was not really happy with the quality of the output yet. So I must say, okay, it's not helping the instruction fine-tuned model. And I think it's a general thing, where, I mean, you might correct me if I'm wrong here because you are the expert in RLHF, but for me it's like an optional thing: unless you need a specific style or need to deploy something in a safe manner, it's maybe not giving you the best results. If you need a private model that just runs on your own computer and gives you correct answers, I don't think DPO or RLHF will make the answers more correct. They will just change how they look.
Nathan Lambert [00:37:01]: Yeah, I mostly agree, especially for what we have in public implementations. The public implementations are really good at improving on AlpacaEval. But if I'm training a model that I actually want to use, I don't care about AlpacaEval. I think I'm the most annoying person internally running these experiments, because I get so annoyed when only AlpacaEval goes up, and I'm like, that has made the model worse. I've been building internal demo tools, which is just making Gradio better and showing how to use vLLM for serving, and a lot of the models we put out for research are really, really annoying to talk to. You put "no yapping" or "just be concise" in the prompt and it doesn't do anything. So a lot of the open datasets... and this is something that Nemotron and Llama 3 have shifted to, this newer evaluation, IFEval, which stands for instruction-following eval, which I think is a great one. It's like: write a response with less than 300 words, or something. It has these verifiable claims. And this is something the Nemotron report showed, that this kind of fine-tuning really unlocked a lot more performance in the DPO stage. So I'm hoping that we start to get more evals than just AlpacaEval that are helped by RLHF, and that will help the whole ecosystem move forward, because it is in a kind of young, rough state right now. Yeah.
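(A tiny sketch, not the actual IFEval code, of what a "verifiable" instruction check can look like: each constraint attached to a prompt can be scored programmatically, with no judge model needed.)

```python
def follows_word_limit(response: str, max_words: int = 300) -> bool:
    """Verifiable instruction: 'write a response with less than 300 words'."""
    return len(response.split()) < max_words

def ends_with_quote(response: str) -> bool:
    """Another verifiable constraint: 'end your answer with a quotation mark'."""
    return response.rstrip().endswith('"')

# Usage: score a model response against the constraints attached to its prompt
response = "Llamas mostly eat grass, hay, and other plants."
print(follows_word_limit(response), ends_with_quote(response))
```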
Sebastian Raschka [00:38:21]: One last thing about this topic: for me, like your last sentence said, one of the reasons was also that I was like, okay, if I include something on DPO as the last chapter, I don't know if it's still going to be used next year, or if one of the many variants, ORPO and KTO and so on, takes over. And right now, I mean, Llama 3.1 used DPO, which is a big endorsement. But to be honest, I'm not sure if this exact variant is here to stay.
Nathan Lambert [00:38:47]: I think DPO is here to stay. DPO will be a canonical example, much like PPO. But I think the exact things that people are using will go away. PPO has stood the test of time over multiple eras of RL: people don't use it in its exact form, but people are always looking at it. And the same with DPO, just because DPO is so simple. One of the best getting-started-with-RLHF exercises is taking the Hugging Face trainer and modifying it to use the DPO loss, because you can reuse most of the infrastructure for batching and things like that and just add the loss function, which is a few lines of code. That's the entry point to doing RLHF implementations. When I interview people, I make sure they have looked at the DPO loss function before, and if they haven't, I'm like, I don't know if you're in the weeds enough; I feel like you should look at this.
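(For anyone attempting that exercise: the DPO loss really is only a few lines. A minimal sketch, assuming you have already computed summed log-probabilities of each chosen and rejected completion under the policy and under a frozen reference model.)

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor of per-example summed log-probs of the
    chosen / rejected completion under the policy or reference model.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # push the implicit reward of the chosen completion above the rejected one
    losses = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio))
    return losses.mean()
```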
Sebastian Raschka [00:39:37]: And if you are listening to this and you are about to get interviewed by Nathan: I will hopefully have a tutorial on DPO, implementing it from scratch, by next weekend. This weekend I actually used Llama 3.1 to make a synthetic dataset for it and got much better results. So it looks good enough to probably upload next week.
Nathan Lambert [00:39:58]: Nice. Okay, let's shift gears into AI research and AI education, which I think is the thing you have some of the most insight into. So you run the Ahead of AI newsletter. I wasn't reading it closely when I originally subscribed, but now I almost always skim through to see what papers you uncover. I'm pretty interested in how you select papers, how much you actually prioritize reading papers and why, and just any advice for people, because it's hard to sit down and do this. Speaking for myself, sometimes writing is how I force myself to read some papers. I don't know if you're in the same boat, but what is your worldview around reading AI papers these days, skepticism or excitement, everything?
Sebastian Raschka [00:40:42]: Yeah, that's a big topic. So I must say, I look at more papers than I actually literally read. I mean, I look at the abstracts and the titles, and that's like a huge funnel, a selection process. I was an arXiv moderator for the machine learning category a few years back, and that got me into the habit. So how it worked was basically... maybe it's useful to explain, because some people complain when...
Nathan Lambert [00:41:06]: How does someone become an arXiv moderator? I didn't know that it was a community position.
Sebastian Raschka [00:41:12]: So that was originally run by Tom Dietterich. He was doing it by himself, and he was looking for people to help him with it, because, as you mentioned, there is an ever-increasing number of papers. And how it works is essentially that when you submit a paper to arXiv, you select the categories. But a lot of people select, let's say, not the correct, I wouldn't say not correct, but the preferred categories, like AI and ML.
Nathan Lambert [00:41:39]: It's like ML, AI, and then everything else. Yeah.
Sebastian Raschka [00:41:42]: And AI on arXiv is interesting. It's more like classic AI: it's not LLMs, it's more symbolic AI, that kind of stuff.
Nathan Lambert [00:41:51]: What do you think the difference between, or like as an educator, how do you define AI and machine learning? This was also one of my favorite interview questions to like see where they're at.
Sebastian Raschka [00:42:00]: Well, right now I would say I go back and forth on that. Right now I would say AI is this big umbrella thing where you have deep learning and machine learning as subfields. But if you think about it, if you consider a logistic regression classifier, it is essentially machine learning. And if machine learning is the subfield of AI, you would say, okay, then logistic regression must be AI. But is like classifying iris flowers really AI? I don't know. So today I would say
Nathan Lambert [00:42:28]: I also think about search as AI. Yeah. Like, yeah.
Sebastian Raschka [00:42:31]: Yeah, there's the good old-fashioned AI. So I would say with AI you have the machine learning and deep learning branches, but you can also implement AI with if-else statements, I guess. So that's how I would define AI. But I think nowadays when people talk about AI, they mean specifically generative AI models: LLMs, Stable Diffusion, that type of stuff. But yeah, back to the arXiv thing, just briefly: basically, in the background it's also using machine learning or NLP to detect, based on the title and the abstract, whether the category actually matches. And if there's a mismatch, or in general, as a moderator you go through them and, oh, this looks good.
Nathan Lambert [00:43:17]: This looks good.
Sebastian Raschka [00:43:17]: This looks good.
Nathan Lambert [00:43:18]: They started exposing this to the user. I submitted a paper recently under ML and it told me, this looks like language. And I've gotten papers stuck in moderation before, so I'm always going to hit accept if they tell me it might be in the wrong category, because arXiv moderation is a black box that you don't want to get stuck in. I say that as a user, but I understand the service it's providing, so it's good to expose that to the user. And if anyone's listening, just click it, click yes. It's not worth delaying your release by getting stuck in moderation, and it helps arXiv out. Yeah.
Sebastian Raschka [00:43:50]: And just the last thing on that: by default, everything gets accepted. However, sometimes something gets flagged, if there's duplicate content, or if it doesn't look like a paper; sometimes people submit one-page blog posts or something. So there are also sometimes false positives, and then it gets stuck. But long story short, that got me into the habit of reading the titles. And that's what I still do. For my Ahead of AI newsletter, I just look through the titles and select. How have titles changed?
Nathan Lambert [00:44:21]: Like titles have changed a lot though, as I feel like they used to try to be. Accurate. Mostly descriptive. Yeah. Descriptive, right? And now they are a mix of, it's more of a storytelling than descriptive. I think it's the right way to tell it.
Sebastian Raschka [00:44:36]: At least we don't have the "is all you need" titles anymore. I feel like that finally went away. But yeah, you're right. It's more...
Nathan Lambert [00:44:43]: It ended with Rylan Schaeffer's "Pretraining on the Test Set Is All You Need." Did that make it onto arXiv? It did.
Sebastian Raschka [00:44:51]: I think I also had it featured in my newsletter one time, or at least mentioned. And how I select papers is also often selfish: I read or select papers for the newsletter that I find interesting. I think this also matters for education. When people ask me how I would suggest doing things, I think the most important thing is to talk about and work on things you are interested in. It would be really hard to do a good job on a topic that is not interesting to you. For example, I don't know, R, or sorry, Rust, is an interesting, very important topic, but I'm not into it, so I don't try to make videos or content about it.
Nathan Lambert [00:45:35]: Yeah.
Sebastian Raschka [00:45:36]: So it's like, I think if there's something you're excited about, I think it comes almost naturally that you want to talk about it. So in that sense. So the newsletter, I almost, it's weird, but I almost write it for myself. It's like, I find it interesting.
Nathan Lambert [00:45:49]: How much time do you spend reading versus writing when you're reading these papers and writing a blog post? I'm guessing a lot of it is just the natural process of synthesis, and that's what goes into the newsletter. From my read, it's not like you're doing a ton of scaffolding and editing after the fact, which seems similar to what I do.
Sebastian Raschka [00:46:09]: Yeah, you're right. I don't do, I don't spend too much time on it in the sense that I wish I could, but I have a full-time job. It's literally just the weekend project where I aim for one newsletter per month. Of course, I would like to do more, but there was also a book to write on weekends or sometimes I'm doing videos. It's like keeping it fun, you know, like where it's like, okay, this is not a chore. This is something that is supposed to be fun. Like in that sense, I read a paper and then I take notes and then I collect them and spend maybe half an hour, an hour to polish them a bit up or make some figures. And that's it per paper, I would say. And so I also don't write the whole newsletter on one day or one weekend. It's really spread over the month. I read a paper. Oh, this is an interesting one for other people. Let's write this up basically. And then this way I collect material over the month and then.
Nathan Lambert [00:47:00]: Yeah. What motivates you to work on this stuff? Is it purely like education? Because I, in some ways relate to that. I've been in that mode before.
Sebastian Raschka [00:47:09]: Yep. So if you have noticed, I don't have any sponsorships or something.
Nathan Lambert [00:47:14]: Never done that. Respect.
Sebastian Raschka [00:47:16]: I will never say never, but it's not something I do. It's really just a hobby. And I do like discussions that come around it. There's a certain satisfaction that if you put it out, it helps others and people tell you positive things about it. It's kind of very gratifying. I don't know. There's like a reward in a sense. And what's also cool is there are a lot of people. It's like being part of the community and exchanging information because there are also a lot of people who sometimes know something I don't know. And this is really, I think, really cool. You write about something and then someone, Hey, have you seen this? This seems like it's taking it to yet another level. Or this is the same idea. It's even better or something. And this is super cool where you get this effect where you learn by doing this, actually, because there's always someone who knows a bit more than you do in a specific area. So, yeah.
Nathan Lambert [00:48:07]: Yeah. I feel like it's increasingly important these days, and increasingly impactful, because so much of research has become closed off for business reasons. So there are fewer people doing more of the work. I don't like it. I always feel like people don't realize how few people are informed and share on any given topic like AI research. If you take away three people... I've yet to find people who tweet the same random RLHF crap that I tweet. It's not that I do it just to say random things; there just aren't that many people who represent each of these corners. Ahead of AI, I think Jack Clark's Import AI. I should have him on the pod; I've talked to him a few times and he's great to talk to. And his is the same thing: it's these few people disseminating AI information, which is crucial from policy and future angles. Have you ever gotten criticism that your work is accelerating AI and that you are a safety risk? I've gotten some critical emails that are like, you shouldn't talk about this.
Sebastian Raschka [00:49:07]: Yeah, I've more gotten emails saying that me talking about LLMs is not good because LLMs violate copyright. I mean, not that I do it, but that other people's LLMs do.
Nathan Lambert [00:49:21]: I'm happy that I haven't gotten that from my audience very much, but it seems like one of the challenges of having a tech audience: you cultivate it, and there are multiple ways it can go. One of them is this "all data for language models is theft" thing. And I just don't know how to deal with it, because I disagree, but normally those people aren't receptive to that, which is really hard. It needs to play out. Yeah.
Sebastian Raschka [00:49:47]: For my book, also, just to make extra sure: all the data I use there, the pre-training data, is public domain data, like a book from Project Gutenberg. And for instruction fine-tuning, I created my own dataset, basically, just to avoid any issues.
Nathan Lambert [00:50:06]: Did you write it by hand?
Sebastian Raschka [00:50:06]: Partly. No, actually, I used an LLM for part of it and wrote some by hand.
Nathan Lambert [00:50:12]: Yeah.
Sebastian Raschka [00:50:12]: So it's a great exercise.
Nathan Lambert [00:50:14]: Yeah. Yeah.
Sebastian Raschka [00:50:15]: And for the synthetic one, I use Llama 3.1 now, too. I mean, you can tell me about that a bit; it's maybe interesting for the audience how to generate a preference dataset, because there are multiple ways. Naturally, it's crowdsourced, right? You ask people: you have the model, or flavors of the model, generate two answers, and then, oh, which one do you prefer? But that's not really scalable. And you could technically do the same thing with an LLM. You could basically have the LLM generate a more polite version, because I think LLMs, even the small open-source 7B models, are very good at rephrasing things or evaluating things. They're not necessarily good at generating the answer in the first place if they don't have a reference, but given a reference, I think it's super useful to use open-source LLMs in that sense.
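(A rough sketch of the rewrite-based approach described here, with a hypothetical generate() callable standing in for whatever inference stack you use, whether LitGPT, vLLM, or an API; the prompt wording and field names are illustrative.)

```python
def make_preference_pair(generate, instruction, reference_answer):
    """Build a (chosen, rejected) pair by asking an LLM to improve a reference answer.

    `generate` is a hypothetical callable: prompt string in, completion string out.
    """
    rewrite_prompt = (
        "Rewrite the answer below so it is more polite and helpful, "
        "without changing its factual content.\n\n"
        f"Question: {instruction}\n"
        f"Answer: {reference_answer}\n\n"
        "Improved answer:"
    )
    improved_answer = generate(rewrite_prompt)
    return {
        "prompt": instruction,
        "chosen": improved_answer,     # the rewritten, preferred response
        "rejected": reference_answer,  # the original response
    }
```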
Nathan Lambert [00:51:07]: I'm surprised that this hasn't caught on sooner, but I think it's starting to. In the Meta report, they essentially have edits, so they make their preference pairs as edited better than chosen, better than rejected, and you can create multiple pairs by binarizing that. There are a few research projects that have done this: constitutional AI is popular, but that's not really been reproduced. One of my collaborators and friends at Synth Labs, Louis Castricato, did a paper on the pink elephant problem, which is about using revisions to get the model to not just say whatever is in the question if you ask it not to. We did a follow-up work that's out literally today on self-directed synthetic dialogues, where you have the language model generate a plan and then follow the plan, and then you can also do revisions on it. I think Nemotron did something like this with prompts, too. So it's really getting going, but it's something that took longer than I expected. Then there's the question, and this is too big a topic to go into, of how you use GPT-4 feedback: are your completions from two different models, or the same model with different generation settings? How do you use humans? I think the labs are using humans for preference data because it eliminates some of the problems in language modeling. That's one of the biggest impactful research questions in alignment: we can't afford the $1 to $10 million dataset, so how do we do this? We're starting a project on that at AI2 right now. It's a big open question; I don't know where it'll go, or how far we can reproduce the Llama 3 alignment methods. Yeah.
Sebastian Raschka [00:52:46]: So I would say the Llama 3.1 paper, or the Llama 3 paper, it was like a 93-page paper, and it was great. I love it; it has a lot of detail. But on the alignment part, I feel like I wish there was more information. Even Llama 2 had more information, where they showed what the improvement actually was over the different stages when they added to the supervised fine-tuning.
Nathan Lambert [00:53:05]: I'm talking to Ross Taylor tomorrow, and I'm going to ask him that specific thing. On Latent Space, Thomas S., one of the leads, said that most of their gains come from RLHF rather than SFT. So I think the open-source community is over-indexed on instruction fine-tuning because it is accessible and we have the data. And this is one of my ways of trying to guide the community by doing things: go do RLHF. Don't worry about instruction tuning datasets; we'll just leave those the same. Go find more preference data and keep playing with this. And don't worry about the DPO methods; literally go make preference data and keep trying to train things. Don't implement a new loss function.
Sebastian Raschka [00:53:48]: A practical question for an expert like you: how good is a preference dataset, actually, if you download it and both the chosen and the rejected answers are not generated by your model? If you have a model and you use responses that the model has basically never seen before, does this actually work, or is it even advisable?
Nathan Lambert [00:54:11]: The two most popular preference datasets in the open right now are UltraFeedback and Nectar, or variants of them. Both of those are collected from large suites of other models. And there haven't been datasets or papers that have trained really good models using on-policy preference data from the model you're training. I think that's a question we need to answer: how do we get UltraFeedback-level results with on-policy data? Because all the labs are using on-policy data. I wrote about this, buried in one article. I have a theory that UltraFeedback and Nectar, these general datasets, work so well because within them there is something close enough to your distribution, and you don't have to get it exactly right; it's just a gentler, more uniform learning signal for the models doing preference tuning. But we don't know. That's something that I want to answer.
Sebastian Raschka [00:55:02]: Yeah, this is an interesting one. I would also like to know the answer, because that is one thing where I got a bit stuck when I was writing this DPO chapter with smaller models. I think bigger models hide these weaknesses a bit, because they have been trained on so much data that, like you said, it's kind of in distribution already. But if you train a small model, it would be out of distribution, right, if you use someone else's preference dataset? I noticed something even simpler: you train a model on one simple instruction dataset, let's say something like Alpaca, and then, just to have something visual, say you want the model to generate Yoda speech, where every sentence is reversed. But the model has never seen sentences like that, unless maybe it was in the training data. In that sense, it doesn't work well at all, because you're asking the model during preference tuning to write sentence structures it has never grammatically written before. What I found is that it works much better if you, I don't know, say be more polite, or prefer a more polite answer, because it uses the same grammar. Things like that, basically.
Nathan Lambert [00:56:08]: Yeah, I think that's a smart approach. It also might be why learning rates are getting so low: all the learning rates for DPO and similar methods have been going down in the fine-tuning space, and it might just be because, distributionally, the data is far off from the model. There's the other theory that the model is really, really done training, so it's at a really good optimum and you don't want to move it away from that. But it might just be that our datasets are in the wrong space. Yeah.
Sebastian Raschka [00:56:32]: So you try to be gentler with a lower learning rate.
Nathan Lambert [00:56:36]: Yeah. All of this stuff changes fast, but not fast enough. This UltraFeedback dataset we were talking about came out last October, so we're almost ten months in and it's still the state-of-the-art dataset, and it's only about 50,000 examples. So there's so much opportunity for someone to go build datasets at this level, if anyone is watching. I think we're so far off from where we could be, just because people don't know how to make good preference datasets.
Sebastian Raschka [00:57:02]: Well, now we have Llama 3.1 70B and 405B, which allow us to do that, right?
Nathan Lambert [00:57:08]: We'll see. Yeah. I was wondering, this is a change of topic, but how do you think like, do you think AI will change our jobs in writing? How do you see AI coming for this kind of educational space? Like how much of what you do as an educator could be taken in N years by AI?
Sebastian Raschka [00:57:26]: Well, of course it will automate away some things, because nowadays you would ask a model something instead of searching for it and reading it on a website. But I do think that for the creation process, you still need a human to put it together well. I think LLMs are nowhere near generating a whole article that is actually good. They can generate the right things, but you still have to put it together. They can generate good blocks of text, but you become maybe more like the editor in that sense. But I do try this.
Nathan Lambert [00:58:09]: Also like, do you write, do you have AI write any parts of your articles? I'm so scared for like moral reasons to have any AI writing in it. I'm like, it's just a slippery slope. It feels like I could get addicted. Yeah.
Sebastian Raschka [00:58:21]: So I don't have it write anything from scratch, but I sometimes do use it. I'm a non-native English speaker, and some days I have a harder time than others making things sound right. It's like, okay, this is what I want to say, but it doesn't sound right. And then: can you rewrite this with a focus on XYZ, or something? So it's basically like a thesaurus where you find similar words, except you find similar sentences, rewording things, those types of things. But one weakness it has, now that you mention it, one thing LLMs really can't do, is generate figures. Maybe that's coming.
Nathan Lambert [00:59:01]: I don't know.
Sebastian Raschka [00:59:01]: You can probably do that with TikZ, the LaTeX thing, at some point, but right now they are nowhere near generating any useful figure. And I think learning is very visual, too. If it were just text, it would be really hard to learn anything.
Nathan Lambert [00:59:17]: Yeah.
Sebastian Raschka [00:59:17]: You can, of course, but I do think, you know, there's a saying that an image is worth a thousand words, right? So in that sense, you still need someone, the mastermind behind an article, even if it's just an editor. I don't think LLMs can replace everything, at least not yet. And we'll see. I mean, we just don't know how much better, let's say, GPT-5, as a placeholder here, will be than GPT-4. Maybe it's saturating, who knows? So maybe it will be five more years until we get into scarier territory in terms of replacement. We'll see.
Nathan Lambert [00:59:55]: Yeah. I mostly avoid the agent word, but it does seem like there's enough capital and cultural investment in the Bay Area and among tech executives to do something. They're going to get to something that is triable, which I think is mostly automatic Google searching and more code execution, which is going to be interesting, but I have such wide expectations of what it actually means. That's probably the next big shift. I think Llama 3.1 is probably leading the year in terms of AI news right now. This recent DeepMind thing on math might be a better example of what's really hot news; I need to go read more about it. There are some long write-ups on the qualitative differences between the AI math and the human math and the different directions they're going, so that's what I want to read about. But it'll shake things up. We're multiple years into this fast phase; it's not exactly new at this point. Yeah.
Sebastian Raschka [01:00:57]: Last thing on that: I do think, though, that LLMs make good assistants in the literal sense. One thing I use them for in my newsletter is that at the end, I have a list of all the papers I found interesting, usually 30 to 50 papers. Usually, by hand, I add the author names, the last names of the first three authors. Now I use an LLM to go to the website and get the names of the authors, basically. This is where it saves a lot of time. You could do that without LLMs, you could write some code, but it would probably take me half a day to write because I'm not good at this web-scraping type of code. In that sense, it is actually a useful assistant for certain things, like delegating actions.
Nathan Lambert [01:01:44]: I think it'll keep creeping up. I don't expect usage for those things to go down, because they're already so useful. The little coding things, hacking data together, the automatic searching: people aren't going to want to stop using that. I don't know if it supports the whole valuation we have, but it's fun to be in a space where we get to try new things. As a computer nerd, it's really fun to have a new type of software that we can try in all sorts of parts of our workflow. And I think that's underrated. So, I don't know. Thanks for coming on. Any last things you want to discuss?
Sebastian Raschka [01:02:19]: Yeah, I just wanted to say thank you for the invitation, and I hope you keep creating these awesome newsletters. I think they're much needed, because there's so much hype creeping up on us, like you said previously. There's a lot of, let's say, overvaluation and praise, and something that cuts through it, this honest, straightforward, no-b******t content, is much needed. So yeah, I hope you keep creating that. It was fun to chat. And to everyone out there: I think what also keeps us motivated is the awesome community, people giving feedback and discussing things and bringing things up. Without people giving us feedback, we probably wouldn't be doing this, because it's a lot of fun to be in this space, I must say. It's fast-moving, but there's always something interesting every day.
Nathan Lambert [01:03:14]: Yeah, this was really interesting. We covered a lot of the low-level details of what it's like trying to use language models on a day-to-day basis in July of 2024. So thanks for coming on, and I'm sure we'll talk soon. All right.
Sebastian Raschka [01:03:27]: Yep, it was nice meeting you and see you then. Bye.
And how to understand Llama 3.1's results.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gpt-4o-mini-changed-chatbotarena
0:00 GPT-4o-mini changed ChatBotArena
3:23 Llama 3 in the arena
5:13 Partial solutions and next steps
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_013.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_019.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_021.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_025.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_039.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_043.png
Defining the future of the AI economy and regulation. Is Meta's AI play equivalent to the Unix stack for open-source software?
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/llama-405b-open-frontier-model
00:00 Llama 3.1 405b, Meta's AI strategy, and the new open frontier model ecosystem
01:37 Meta's open frontier model
03:51 Zuckerberg's vision for open-source AI (vs. reality)
08:35 Does the Llama 3.1 license support open-source AI?
12:55 Different futures for regulating frontier models
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_008.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_010.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_015.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_018.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_050.png
SB 1047, AI regulation, and unlikely allies for open models
The rallying of the open-source community against CA SB 1047 can represent a turning point for AI regulation.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sb-1047-and-open-weights
00:00 Introduction
01:53 SB 1047 and targeting regulation
07:57 Unlikely allies of "open"
12:05 What would I regulate today?
I Switched to Claude 3.5
Speculations on the role of RLHF and why I love the model for people who pay attention.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/switched-to-claude-from-chatgpt
00:00 I Switched to Claude 3.5
03:57 Product priorities
05:15 RLHF's peak?
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_016.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_018.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_020.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_022.png
I’m really excited to resume the Interconnects Interviews with Dean W. Ball from the Hyperdimensional Substack (you should subscribe). We cover the whole stack of recent happenings in AI policy, focusing of course on California’s bill SB 1047. We cover many, many more great topics here including:
* What will happen in the case of a minor AI disaster,
* If Meta will release the 405B model, and why,
* The status of Chinese open-source AI,
* Training on model outputs,
* Anthropic’s recent strategy,
* What scaling laws actually mean,
* Creating content and shifting the needle of the AI discourse.
Watch the video on YouTube below or listen on podcast players here.
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Chapters
* 00:00 Intro and Welcome Dean Ball
* 02:44 The Origins of California Bill SB1047
* 08:56 The Evolution of Bill SB1047
* 13:00 How SB1047 Affects Fine-Tuning
* 20:00 The Future of Bill SB1047
* 21:58 The Impact of AI Disasters
* 29:02 Meta and its 400 billion Parameter Model
* 32:25 Open Source AI and the Chinese Market
* 37:37 The Future of Open Source AI
* 43:35 Synthetic Data, Licenses, and Future AI Development
* 45:18 Anthropic's Approach to AI Safety
* 50:46 Scaling Laws
* 53:01 The Role of Audience in Influencing AI Policy
Links
* Dean’s series on SB-1047: one, two, and three.
* Other AI policy Substacks: Jural Networks and Intersecting AI
* Senator Scott Wiener. CA SB 1047 itself.
* Another post on CA SB 1047 from Answer AI.
* Situational Awareness by Leopold Aschenbrenner.
* Lina Khan on her P(doom) and warnings in support of open-source.
* Ben Thompson’s Framework for Moderation in technology.
Transcript
Nathan Lambert (00:00:01): Hello, and welcome back to the Interconnects interview series. It's been a few months. I'm really excited for this one. We're here with Dean Ball, who is a research fellow at the Mercatus Center. He works on AI policy right now, and he's the author of the Hyperdimensional Substack, which is kind of the AI policy Substack that emerged when I was spamming into the void that we need to have some good AI policy newsletters out there. There are a couple more that I could add to the show notes of this that I'm aware of from friends that used to be at OpenAI, friends at AI2, so I'll add some of those as well.
But in this kind of summer slowdown of releases, I thought it would be a great time to kind of revisit some of the core themes on AI policy, open versus closed, kind of things that I'm wondering about in the future that I know are coming that are looming AI disasters, what some of these closed source companies are trying to do in the policy space. I think this is the sort of interview that we could probably do multiple times. I think we've started talking in DMs and it's clear that we're aligned on a whole bunch of things. We read each other's work. I think this should be kind of fun and I'm just happy to do this.
I think the core of this interview I'll give you a chance to introduce yourself if you want, if you want to add anything else that I missed, and then we're just going to go into this California bill SB 1047. Probably talk about this. I'll ask you about the story of how it happened and then where we're at now. And I think that'll kind of lead into a lot of interesting debates. So do you have any background you want to add that makes you an interesting person in the AI space? Or is it just that there's so many things that need to be done in AI that if you're focused, you can kind of have an impact in an area?
Dean W Ball (00:01:44): Yeah, I mean, I think basically, you know, I've mostly written on policy unrelated to tech for my career, state and local a lot. So the fact that a lot of the policy action on AI seems to be happening at the state level has been very relevant. But I've also just always been paying attention to the AI literature. I remember 2017, I think, when the Alec Radford Amazon product reviews paper came out, and I said to a colleague, this is going to be a big deal one day. And, you know, I tried to use GPT-2 to do social science research, like policy research, back in 2019. So I've been playing around with these for a while, and I try my best to write as a relatively technically informed person, but also someone who understands the policy side.
Nathan Lambert (00:02:43): Yeah, so I think we should jump right into it. What is the origin of the story of this California bill? My understanding is it just kind of showed up and everyone in the Bay Area was like, like where did this come from? Having actually passed the state Senate as like, do you have any, does your story start there as well? Or did you kind of know this was coming?
Dean W Ball (00:03:03): So I saw that Scott Wiener, the author of the bill, had telegraphed that he was working on something in AI policy, I think in maybe October or November of 2023. And then the actual bill text came out in early February. And I remember when it came out because I was having dinner with my wife and I was like, I have to drop everything and go work on this. I stayed up until like one in the morning, you know, reading the bill and writing about it. And that was kind of my first Substack post that really went anywhere in terms of audience. And so, yeah, then there was kind of a couple months of quiet. You know, I had been writing about it, but people weren't really focused on it in the Bay, in the tech community. And then closer to around April, people started to pay attention. And the conversation has been pretty, you know, pretty active since then.
Nathan Lambert: Yeah. And like, what does it actually say? What are the core points? I know there's stuff around thresholds and California creating a new regulatory body. What are the few core things that people should know? I think there's probably some details, but just the core stuff.
Dean W Ball: Yeah, so the core idea behind SB 1047 is to create a regulator inside of the California government called the Frontier Model Division that would oversee models. Really, now the threshold is models that cost more than $100 million to train. We can talk about how specifically you really even specify that cost, but really all the bill says is $100 million of compute costs to train. Those models are subject to a series of testing and safety requirements, and more importantly, I think, a liability regime that basically says that most downstream uses of that model, including in the case of an open source model, most fine tunes, most uses of models combined with scaffolding software, other software. So things that are very combinatorially distinct from the initial model release. Any downstream misuse is the legal responsibility of the developer who made the original model.
So, if I fine-tune Llama 3 and then someone else puts that in an app and then a user of that app misuses it in a way that causes a serious harm, the bill does have a high threshold for the harms that have to count here.
Nathan Lambert (00:06:00): Is that eligible? Is it specific? Do they have a safety taxonomy?
Dean W Ball (00:06:05): So basically, it's a static threshold that comes in at $500 million of damage. They would say harm to critical infrastructure and things like that. Critical infrastructure pretty much just means everything. It's kind of a catch-all term. It's a little weird. Critical infrastructure, the way we think of it, like highways and power plants and stuff, is actually a subset of critical infrastructure. Critical infrastructure includes things like casinos and ballparks and amusement parks and all kinds of stuff. So anything really, any major cybercrime, bio attack, all the things people are worried about with AI would count. And the developer of the original model, which is many stages upstream from where the harm happened, would have legal responsibility.
Nathan Lambert: So the expected value risk for open models in this bill is definitely low, but it's kind of this thing where, if you're comparing on the two axes, open versus closed, the risk for open models is way higher because of this downstream use term. And that's for the people asking, oh, why is everyone that cares about open AI, like open AI as a field, mad about this? So I think that was why everyone was kind of up in arms.
Dean W Ball: Yeah. And the other thing to keep in mind, though, is that under this bill, if you're making a model that costs more than $100 million, you have to submit a variety of documents annually about your safety procedures and sort of testing regime on the model to the Frontier Model Division. And I think something that's not all that well understood, and it's kind of just how administrative law and regulation works in America, but that the tech community might not understand, is that the Frontier Model Division has the capability to create problems for developers even if their model's never used for a hazardous capability. They could see your safety plan and say, we don't like this or we want more information on this. And they can subpoena you. They can bring you to court, and they could order a cease and desist.
Nathan Lambert: Yeah. And this is where your post on the political economy of AI regulation comes in, as in, what are they going to do with that kind of open-ended power?
Dean W Ball (00:08:40): Yeah, it doesn't necessarily. I mean, they're an agency that has all the regulatory powers of an agency, which are substantial. I think one other point that is worth making about 1047 that would be relevant to your audience in particular: in the initial version of this bill, for any fine-tune, no matter how substantial the fine-tune was, the original model developer held the legal responsibility and had to test their models with the realization that people could fine-tune them or do whatever they wanted to them, modify the weights in arbitrary ways, which obviously doesn't really make a ton of sense.
Nathan Lambert (00:09:38): I was going to ask about the edits. This is where I probably stopped reading as closely as I should have.
Dean W Ball: In a fundamental sense, everything I've said so far has basically been true of the bill the entire time: the fundamental points, the liability, the Frontier Model Division, these kinds of things. Basically, the bill is making developers guarantee model safety, when I think we're probably both in agreement that safety is not a model property.
Nathan Lambert: Yeah, at least in the way that the bill frames it. They're concerned about infrastructure. If critical infrastructure is the primary target, safety is not a model property. This is why I asked about a taxonomy. We're going through this exercise at AI2 to kind of say, what do we mean by safety? And it's a total headache. It's extremely hard to get this right and to communicate it clearly. So now when any other organization or somebody mentions safety, I'm like, oh, do they actually define it? It's such a risk to put it into words, because when you put it into words, you're exposed to all this, people being like, so you don't care about X, Y, and Z. If you don't put it explicitly, it's like a total trap.
Dean W Ball: Well, and actually just to expand on that a little bit, because, you know, the Center for AI Safety, which is the nonprofit that was heavily involved in authoring the bill with Senator Wiener, one of their primary concerns is bio risk. So people making biological weapons with AI models. You know, and I think people who don't understand biology all that well have this idea that you can say, oh, well, that's a good biomolecule to make, and that's a bad one. And so we'll make a list of the bad ones and you can't make the bad ones. And that would be a way to, like, RLHF a biological foundation model.
Nathan Lambert (00:11:34): My understanding of biology is that the more powerful, the more specific a molecule is, it'll probably have good uses and downsides. It's like Teflon: amazing physical properties, extremely bad downside health concerns. Obviously, if you're actually engineering living creatures, it's going to be a little bit of a different consideration, but yeah.
Dean W Ball (00:11:56): But I mean, also, a lot of biomolecules, just like code, their goodness or badness is really context dependent. They'll do different things in different settings. And so it's not necessarily easy a priori to identify... how even would you steer a biological foundation model, like something that's predicting protein structures or nucleic acid sequences or whatever it may be? How would you even steer that towards safety? It's not a priori obvious that that's currently possible. But, you know, I think this idea that safety is something that can be legislated in that way is a fundamental problem.
Nathan Lambert: So what is next? Or you could continue. I was going to ask, what is next for the bill?
Dean W Ball: Oh, yeah, yeah. So I'll just say one thing about the fine-tunes in the most recent amendments to the bill. So fine-tunes now, if you do a large fine-tune, large being anything more than 3 times 10 to the 25 flops involved in the fine-tuning compute,
Nathan Lambert (00:13:13): I need to learn all these numbers. I need to learn what they mean. Essentially, it's a linear relationship between model size and tokens, and then you should be able to have specific points, like, is Llama 3 base crossing that? Like 15 trillion tokens at 70 billion parameters? I don't know, I'll loop back on this. I need to know this in the future.
Dean W Ball (00:13:35): It would be as much compute as you use to fine-tune the model; that's how this threshold is calculated.
Nathan Lambert: Yeah, I just want a rule of thumb for people, that would be great. I'll figure that out; it's on my to-do list of mental math.
Dean W Ball: That would be great to do. But if you're in that situation, then the bill applies to you too. So you have to create a safety plan and a certification that you submit to the Frontier Model Division every year. And starting in 2028, like the foundation models, you'll be subject to mandatory annual audits.
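For readers who want the rule of thumb Nathan is asking for, here is a minimal sketch using the standard dense-transformer estimate that training compute is roughly 6 * parameters * tokens. The function name and the example numbers are illustrative assumptions for this post, not anything defined in the bill text.

```python
# Rough rule of thumb: training (or fine-tuning) FLOPs ~= 6 * parameters * tokens
# for a dense transformer. This is a back-of-the-envelope estimate, not the
# bill's legal definition of compute cost.

SB1047_FINETUNE_THRESHOLD_FLOPS = 3e25  # the fine-tune threshold discussed above

def approx_train_flops(params: float, tokens: float) -> float:
    """Approximate pretraining or fine-tuning compute in FLOPs."""
    return 6 * params * tokens

# Illustrative numbers only: the Llama 3 70B scale Nathan mentions
# (70B parameters, ~15T tokens of pretraining data).
pretrain = approx_train_flops(70e9, 15e12)
print(f"~{pretrain:.1e} FLOPs")                      # ~6.3e+24
print(pretrain > SB1047_FINETUNE_THRESHOLD_FLOPS)    # False, below 3e25

# Fine-tuning tokens a 70B-parameter model would need to cross the threshold:
print(SB1047_FINETUNE_THRESHOLD_FLOPS / (6 * 70e9))  # ~7.1e13 tokens (~71T)
```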
Nathan Lambert: Is this prescribed to anyone that trains in California or anyone that operates their model in California?
Dean W Ball: Anybody that distributes a model in California. So the bill covers at least everyone in the United States, if not really everyone in the world. They could certainly sue you in the United States if you're an American company or operating in America. Now, the important thing about that fine-tuning threshold, though, is that it can be lowered arbitrarily by the Frontier Model Division. The $100 million threshold for foundation models, that's fixed in statute, so you would need an act of the legislature to change it. But the fine-tuning threshold, there's no dollar amount. So the same problem with compute thresholds, that compute cost is getting rapidly cheaper over time, applies, and the Frontier Model Division can change that threshold arbitrarily.
Nathan Lambert (00:15:35): Who elects these officials? Is it like the governor of California? Or the federal branch or something?
Dean W Ball (00:15:43): This is all state-based.
Nathan Lambert: Oh yeah, I meant in the state.
Dean W Ball: Yeah, so the Frontier Model Division would be staffed primarily by unelected civil servants, and led by unelected civil servants. And then on top of the Frontier Model Division, the newest version of the law creates a governing committee. That committee is composed of, I believe, three members appointed by the governor and confirmed by the legislature, and then two members that the legislature itself appoints, one from each house, the Senate and the Assembly.
Nathan Lambert: Mostly what I would expect.
Dean W Ball: Yeah, yeah, exactly. And, like, I think there's a requirement that, you know, one person has to be from industry, one person has to be from the open source community. There's a lot of bones that they throw to the open source community.
Nathan Lambert (00:16:37): Random credentialing.
Dean W Ball (00:16:38): Yeah, yeah, exactly. But I mean, I don't really... that could be anyone, you know, really. Like, yeah, who's from the open source community? Exactly. Yeah.
Nathan Lambert: So what's next for this? It passed the state Senate and then it got revised by the, what is it, the State Assembly? Is that how it works? The Assembly revised it, then they would have to vote, and then the Senate would have to vote again, and then the bill would have to actually be signed. Is that how it works in California? Yeah.
Dean W Ball: Yeah, basically. So it's right now making its way through the committees. It went through the Senate committees and then was voted on by the whole Senate. Now it's going through the Assembly committees. It just passed one, I think, last week or the week before, the Consumer Protection and Privacy Committee is what it's called. I could be wrong on the exact name, but that's the basic idea. So they just passed it, they did some amendments, and it goes to the Assembly's Judiciary Committee next. Then eventually it will go to the full Assembly for a vote and then to the governor for signature or veto.
Nathan Lambert (00:18:04): When would this start? When would it kick in?
Dean W Ball (00:18:00): The bill, or I think most of its provisions, would kick in starting January 1, 2025.
Nathan Lambert (00:18:05): Yeah. And the original vote in the state Senate was like very pro, right? It wasn't even contested, it was just like, oh, this seems like a normal checkbox. This is kind of a cynical take, but I kind of viewed it as: mostly these politicians are serving constituents that know that AI is a big thing, but know nothing about AI. So for a politician, saying, look, I'm taking action on AI, when the constituents are not going to be able to decipher any of the details, is probably a political win.
Dean W Ball (00:18:31): Yeah, well, and I think also worth noting is that Scott Wiener, the state senator who authored the bill, is a very powerful figure in California politics. And I would guess that a lot of the senators who voted in favor of the bill really barely looked at it and aren't even necessarily thinking about their constituents. First and foremost, they're thinking more about, well, Scott's my ally, I need X, Y, Z thing from Scott, so I'm going to vote yes on his bill. And that dynamic will apply at the Assembly too; it is very common. The California legislature has a history of sometimes even unanimously passing bills that the governor then vetoes. So the governor is often expected to be a little bit the adult in the room on this stuff.
Nathan Lambert (00:19:25): This is so funny. I have no comment.
Dean W Ball (00:19:27): I do suspect that the governor is probably going to be, whether or not he wants to, he will probably be the final voice on this bill.
Nathan Lambert (00:19:41): So that's who people are talking to, probably, realistically, from what you've said.
Dean W Ball (00:19:46): Yeah. So, I mean, the one thing, and again, this is a kabuki that's very common in state legislatures: the governor has not said anything publicly about SB 1047 specifically. I think, as a general matter, he tries not to comment on legislation that's in process.
Nathan Lambert (00:20:08): That makes sense.
Dean W Ball (00:20:09): Yeah. But, you know, he also might signal in various ways as it gets closer.
Nathan Lambert (00:20:17): I would guess they do.
Dean W Ball (00:20:18): Yeah. I mean, like he could say, you know, a lot of bills. I think one outcome that is extremely unlikely from this bill is that it's like voted down by the assembly. Like, I don't think that's going to happen. It could die in the assembly. It could just kind of be forgotten, never get brought to a vote, or it could go to the governor and be vetoed. If the bill's not going to pass, it's going to probably be one of those two ways.
Nathan Lambert (00:20:43): Okay, that's a great little lesson in state politics that I'm sure the vast majority of people listening to this will not know. I did not know all of this. Do you have any final comments on this? Otherwise, we're going to move into kind of fun, faster questions and discussions.
Dean W Ball (00:20:59): Yeah, sure. Let me just think. I think the one other thing that is worth keeping in mind here is that the latest version of the bill, I mentioned this, but just to expand on it a bit, does require mandatory audits starting in 2028. So if you make a covered model or a covered fine-tune, however the Frontier Model Division chooses to define that, not only do you have to submit your certifications to the Frontier Model Division and have the legal liability and all that, but you also would have to comply with an audit done by a private company. So just like accounting, you pay for someone to come in and look at your stuff. And the auditors, it's not an open market for competition, the auditors are licensed by the Frontier Model Division. So it's probably two or three different companies that'd be doing that, and that's probably the sort of thing that I
Nathan Lambert (00:21:59): think people have wanted. I don't know if you want it. I don't want all these types of oversight to be cobbled together. I think individually each of them has different types of merit, but the execution is important, and then when you cobble them together it's like, wait, wait, wait, this is too much.
Dean W Ball (00:22:19): Well, and also, I agree that an audit-like structure might be the good long-term way to go, but I think it's questionable whether a California state agency really has the capacity to do this kind of assessment of who is an accredited auditor. That feels much more like a federal responsibility. So, yeah, but I think that's pretty much the main message on 1047.
Nathan Lambert (00:22:49): Yeah. Okay. I'm going to move into other fun questions I have. I'm going to start with one that's potentially related. I've been trying to get my brain around what is going to happen when there is actually a minor disaster from AI. It loops into open versus closed debates. From a lot of the conversations I've had, I think it won't actually be about whether or not it was an open or closed model. It's some weird infrastructure that people plugged it into, and that causes the power plant to go down. Do you have any ideas about how this will happen? I'm expecting this to happen within a couple of years. I feel like the state of our infrastructure is that it is not that reliable and that we're adding all this new digital information into it. And I think all of this is very fragile digitally. So it's like, I think this is going to happen. And how do we preempt any communications around that?
Dean W Ball (00:23:37): Yeah, well, I mean, you know, cyber attacks take out digital infrastructure or take out critical infrastructure all the time. You know, earlier this year, I think maybe it was last year, courts in Dallas could not convene. Like there were no judicial proceedings in the city of Dallas because of a major cyber attack on the judicial system's computers. Parts of the power grid go down. Water plants go down. Hospitals all the time. This happens. $500 million in critical damage. That sounds like a lot. It's not actually that much.
Nathan Lambert (00:24:13): It doesn't have a B on it. It doesn't sound like a lot.
Dean W Ball (00:24:18): Exactly. It's a big economy. I think about this all the time. I think a couple things are very likely to be true. If there is an attack of this sort, people will probably suspect that AI was involved, whether or not... how are we going to know? Right. Let's say somehow we do have a strong hunch that an AI model was involved.
Nathan Lambert (00:24:47): Yeah, like, do we normally figure out what happened in cyber incidents? Or is it normally post hoc? Or not at all? I guess that's a good thing to know with my question. It's like, can we know that a language model is actually involved? Like, how often will they be able to get that far into the stack of the attack?
Dean W Ball (00:25:02): Yeah, right. Like, I don't know. I mean, if you were using, like, an agentic GPT-6 model to do some kind of zero-day exploit on something, presumably in the server logs you'd be able to see what was interacting with it. Right. But who knows if that would be masked. So let's just say, though, that we have some, you know, circumstantial evidence to suggest that an AI model was involved in the execution of some cyber attack. It's very much unclear to me: are we going to have the person's chat log? Are we going to know how they prompted the model?
Nathan Lambert (00:25:46): Like, I mostly think it's like it's going to send requests over some generic Internet protocol. So there'll be this big gap where we can't really tell.
Dean W Ball (00:25:54): Yeah. I mean, that could totally be true. That could absolutely be true.
Nathan Lambert (00:25:58): So I expect there to be – it's like almost if somebody takes ownership or does a really bad job or it's an own goal, which is like a hospital implemented some agent and then it took down their authentication system type of stuff.
Dean W Ball (00:26:12): Yeah. No, that could very well – that's all definitely possible. Yeah. I think that, though, how would we actually know what an AI model was used for? It seems to me like we don't actually... People are imagining a situation in which this happens with perfect information.
Nathan Lambert (00:26:32): Yeah, I think that's the answer to my question. We can't answer what happens because it's so much of a media question. It's like, we won't know. It's likely to happen, but it's very unlikely that we know the specific stack that caused it. Which makes it more of the same: if cyber incidents increase in rate, then people will talk about AI, and without actually having the logs, it makes it easier to spin narratives. Because I'm worried that this could be people saying, oh, this is why open source AI is bad. Yeah. And it's like, I don't expect to have any proof for that, but I expect that to be what people say.
Dean W Ball (00:27:10): People are going to blame AI for things that were already happening. I think that's a trend that we will see across the board. Whether it's misinformation or whether it's cyber attacks or whatever else, there are all these curves that are already pointing up, and they're going to continue to, most likely. And I think people will blame that on AI. Now, the sort of, you know, long tail situation is, what if something really bad happens? You know, what if a power plant... no one has water in Los Angeles for a month or something like that. And in that situation, not only do I think that an attack could be hastily blamed on AI without us knowing whether that's true, I also think we could see legislation move very, very quickly. The Congress, the federal government, is not known for moving fast, but in a crisis, they will move fast. It's for the same reason that I suspect, and I don't think he is right, but if Leopold Aschenbrenner is right about superintelligence being here in, you know, 50 months or whatever he says.
Nathan Lambert (00:28:26): Yeah. This is another one of my later questions, but I didn't have the best way to frame it.
Dean W Ball (00:28:32): Yeah.
Nathan Lambert (00:28:33): Like AGI timelines and stuff.
Dean W Ball (00:28:35): Yeah. Like if he's right about that, then like, yeah, I mean, that's going to get nationalized by the federal government and it'll happen in a heartbeat.
Nathan Lambert (00:28:42): You know, I found it interesting that Alexandr Wang of Scale was also kind of touting this point of view. Yeah. I guess it makes sense for them because they're the only AI company that is leaning into federal contracts. Yeah.
Dean W Ball (00:28:59): And they were before ChatGPT, too, I think.
Nathan Lambert (00:29:04): Yes, they have been for a long time, which is why it was easier for them to continue.
Dean W Ball (00:29:08): Yeah, their early big revenue source, I think, was federal government contracts.
Nathan Lambert (00:29:13): Okay. Yeah, we might come back to AGI. I've been confused by the lines they're drawing. I have a quiz to debate later on. I don't even know the answer. We'll see if we get to it. But another fun question: do you think Meta will release the 400 billion parameter model? And will there be any governance questions around that?
Dean W Ball (00:29:32): Will they release it open source?
Nathan Lambert (00:29:34): Open weights in a similar manner to the other models. Yeah.
Dean W Ball (00:29:37): Yeah. Open weights.
Nathan Lambert (00:29:42): I've been decreasing my probability. At best, I was ever 50-50. But is it for governance reasons that you don't think they will? They've always been flying close to the sun, where there are back-channel discussions, like the Biden administration telling Meta they're not invited to stuff because they're not happy with how they're open-weighting models, and they're probably getting lobbied by people saying open source is bad. But it has always seemed like Meta is on kind of thin ice with the executive branch in Washington. And I'm guessing it's reasonable to say that this model's release is heavily influenced by feedback they're getting there. And Zuck will make the final call.
Dean W Ball (00:30:28): Yeah, I think that that's part of the calculation. I think that also they probably just want to set a precedent that they're not going to release everything open source because they don't know how things are going to go. Yeah, I mean, they just don't know. Will the model end up being... the most important way that we all interact with computers, you know, in a few years? Or will it just be kind of another layer and another tool? I think they don't know. I feel like Zuckerberg's intuition is that it's just going to be another tool. And so that's why he's inclined to open source.
Nathan Lambert (00:31:07): Yeah, this relates to the whole Apple thing. Apple is making these as features rather than products. Yeah. That does a lot of good for the narrative around AI, in my opinion, at least for things that I care about. It's like, this is what we're saying where AI is about a system and not just a model. Apple's model doesn't matter to people, but it is enabling these products and systems, these things on their products, to just be better. It's always Apple and Meta together. They are always forcing their way into whatever the next thing is going to be in technology.
Dean W Ball (00:31:44): Vibes policy or whatever. Yeah, and it's funny because they hate each other. Yeah, yeah, it's so funny. But yeah, I don't think they're going to; that's just my personal intuition. And I think we're going to see a lot of people, not just in the language model space but elsewhere, kind of do this dual approach, where they realize how much political cred you can get by open sourcing things. It's still happening.
Nathan Lambert (00:32:12): Google today, when we're recording, released Gemma 2. And their 27 billion parameter model is just a little bit below Llama 3 70B. I think that's a nerdy thing. But when the first Gemma model was released, it wasn't used as much by the community, mostly because there were a lot of minor bugs in the implementations in popular tools. So the initial feedback loop didn't catch on. So it'll be really interesting to see if these second generation models, which are in the same ballpark as what Meta released... there are some strange things. They trained the biggest model on roughly 13 trillion tokens, and then the 9B model on only about 8 trillion tokens, and the 2B model on about 2 trillion tokens. So the models that have more reach by being smaller are... there's got to be a reason, but I think they were like scaling runs preparing for the biggest one, and they didn't finish training them. So the models that the most people could use are relatively worse than the bigger ones, just by the amount of compute that they put into them.
So I think eventually, if there's decent uptake of these, Google will change this. But the Gemma 2, whatever it is, 9B model is going to be way worse than the Llama 3 8B, just because Llama is trained on roughly twice as many tokens. And Google could have resolved this. So that's my aside. But these dynamics actually feed into what we're talking about, which is that Google, Microsoft, and Meta are all still releasing these models.
(00:33:42): Yeah.
Nathan Lambert (00:33:42): Which is good. I have on this outline the general state of open versus closed. It seems like we haven't had major updates in a while. It seems like there's much less pressure being put on open. I think maybe people are okay with the steady state that we're in. I don't know if this Nemotron 340B changes that much.
Dean W Ball (00:34:01): I don't think so. So I think that there are the people who believe that open source models are an existential risk to the world. And they continue to mostly think that, and they continue to support policies that either in absolute terms or on the margin would diminish open source. I think that DC has had a really radical shift in the last year, because the climate towards open source models in the policymaking world a year ago was not good. And now it is much more, oh, well, we think this is really important for competition and we think it's important for innovation and we actually want to make sure we have a really healthy open source community, and all these kinds of... I mean, I'm sure you've seen, you know, Lina Khan, no friend of the technology industry, has had comments on this.
Nathan Lambert (00:35:09): That's good. Did you see her clip on Hard Fork where she was asked what her p(doom) is?
Dean W Ball (00:35:14): Yes. Yes.
Nathan Lambert (00:35:15): Oh, my God. If people haven't seen this, you've got to go find it. It is so funny.
Dean W Ball (00:35:18): Yeah. And the sense I get from talking to people in Congress and whatnot, the congressional staff, is that people have just realized open source is really popular and it would be really hard to go after. This isn't new; the government figures this out like every 15 years. They get really freaked out about something in open source software, and then they go and try to ban it, and then they realize, oh, wait a minute, this would be really hard. This would piss a lot of people off.
Nathan Lambert (00:35:56): It'd be a giant economic own goal. I think it's inevitable that it would be an economic own goal. I mean, China is ready to take this over and take the lead. They're right there. They don't have the ecosystem, the ecosystem is landing in the U.S., but they have perfectly good models. So if the U.S. were to score that own goal and stop building the models, I think that is the path by which they could then own an ecosystem. Because there's no incentive to recreate the ecosystem when the ecosystem and the models exist in the US. But if these kinds of tools and hosting all go away, then that's when other people take over.
Dean W Ball (00:36:29): Well, it seems like, I mean, as a bit of a question for you, I guess, but like, it seems like the Chinese, like, you know, the export controls on compute are going to start to really affect them. Because they were able to buy H100s.
Nathan Lambert (00:36:44): Yeah, this is what I was going to ask about. Isn't it that like a lot of NVIDIA's recent sales have been just them... prioritizing selling to China because they're not yet blocked. And then that creates a backlog in the US because Nvidia is like, well, they're not going to be able to buy them, so we should get our revenue while we can. It kind of checks out. I don't have a source on it, though.
Dean W Ball (00:37:04): The sense I've always gotten is it's all through subsidiaries. Yeah. So Chinese companies saw the writing on the wall about export controls like two and a half years ago. And so they started to buy up A100s and H100s at that time. And then the export controls came through, and things are leaky, and NVIDIA had that chip, they were selling a chip that was basically an A100 and basically an H100, for a year. And then that got blocked by the federal government. So like...
Nathan Lambert (00:37:37): Should we put Zuckerberg in charge of NVIDIA? Because I feel like for all the haters of Mark, Mark is pretty American and kind of follows it up, I feel like. He doesn't really care that Facebook is blocked in China. I feel like it's almost... I feel like this is why public companies sometimes have problems because they're too incentivized. Like Nvidia's stock, if they were to have to stop selling to China immediately, would get such a haircut. So literally their hands are tied to doing this thing, which I think is like going against what the executive policy is in such a clear way. It's like what they're trying to do. Which I'm like, this is a market failure. I was like, I don't think, like, I feel like Jensen's probably like, I don't, I guess he's pro-US. I don't know. Like, I don't care whether or not they're a hawk. It's just like, feels bad to go so clearly against what the intentions of the executive policy are, when there is a clear reason they're doing this.
Dean W Ball (00:38:31): Yeah. Yeah. No, I mean, I think that Jensen is going to comply with the letter of the law, but that philosophically he doesn't feel like it's his responsibility or good for him to be policing who his end users are. I think that's just how he feels.
Nathan Lambert (00:38:47): That's another discussion. It's a discussion that I've been trying to figure out. Ben Thompson famously has these diagrams for where moderation can occur in the stack. And then figuring out what the mirror is for where AI sits in the stack, whether or not it is just a product or if it seeps down to being like the AWS layer, where OpenAI's models are so fundamental to our computing infrastructure that them moderating at all and them deciding who they sell to is extremely unclear. And I think it might be going in that direction.
Dean W Ball (00:39:20): It feels that way. But it does increasingly feel to me like... You know, the Chinese might not be able to keep up on foundation model training because they're not going to be able to string together 100,000 B100s in a year.
Nathan Lambert (00:39:32): They have more electricity, which seems to be what people are talking about is the limitation.
Dean W Ball (00:39:37): They just won't have the compute, though. And we'll figure out... The U.S., I think, will figure out the electricity. I mean, I don't think we're going to be building 100 gigawatt data centers, but we'll figure out the electricity for the next couple of years, I think. But the Chinese will be able to distill the models, right? And release them as open weight.
Nathan Lambert (00:39:59): I mean, this is what the leading labs are doing anyways. All of Google, OpenAI, and Anthropic have now released models below their biggest size that are better than their biggest available models, because it is cost effective and the performance is really good. So they're not even pushing the frontier of model size to the users. There probably are other infrastructure reasons for this, but that sort of thing is something that China could also do. They're going to be distilling our models into their models and stuff like this. I think this kind of leads into my next question. I was wondering if, in your circles, this idea of synthetic data and the various license clauses on whether or not you can train on model outputs is something that is discussed. In the open fine-tuning community, keeping track of licenses and how you comply with them on these various models is really crucial. So with Llama 3, you're technically not allowed to use the outputs of the model to train any model other than Llama 3 models, which is kind of a headache. And then a lot of NVIDIA's push with Nemotron is like, look, go wild. I've learned that a lot of these clauses on training on outputs come from the data providers trying to protect their business models. So it's like these companies want the models to be pretty open, maybe not Meta, but some of the smaller ones, but then the data providers are like, you can't do this, and they don't have enough power to do this. This is a very in-the-weeds technical discussion, but is synthetic data, or these clauses on models, discussed in your area of the world?
Dean W Ball (00:41:30): So like in the policymaking circles, people are just coming around to the idea that synthetic data is even a thing. And I think a lot of people in DC don't understand that there are licenses associated with open source software.
Nathan Lambert (00:41:45): Well, the licenses with the models don't really make sense. We're in this position where I've generated some data with these models, so you can't train on the outputs. But it's written as if it applies to you as the user. So you're agreeing to their community agreement to use the model. But if I create a dataset and then upload it without training on it, can't somebody else just take the dataset and train on it? Because they never agreed to the terms of use of the model. And it's like, this makes no sense. I need to go to our legal department and be like, this is what they're saying, right? I'm like, I don't understand. And so it's just this weird ecosystem of middle-ground messiness, which feels similar to some of the open versus closed stuff. And we're kind of going into the peak of this discussion, I think, especially as people get to know better that these new Claude 3.5 models are just distillation, based on some form of synthetic data.
Dean W Ball (00:42:36): Yeah. I mean, with a clause like that, too, in a contract, like you got to wonder about enforceability even under the best of circumstances.
Nathan Lambert (00:42:45): Yeah.
Dean W Ball (00:42:45): How would they know? How would they prove in court that this synthetic dataset came from their model? Maybe they could prove that, but I don't know. A lot of models claim that they're OpenAI models, whether or not they are.
Nathan Lambert (00:43:04): It's really funny. Yeah, a lot of it is... Well, this is a technical issue with open models. A lot of people spin up demos with open models, but a lot of the way that the models know who they are is by using a system prompt. And if you just spin up an open model, it's going to say that it is whatever model it was trained on the most of. But people don't normally write the system prompt that's like, you are blank, blah, blah, blah. We need to do that for our models, and we're relatively serious actors. So open models will definitely always be messier with this, because the closed models do a lot more of just serving it as a product in a polished way. Yeah. Yeah.
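To make the identity point concrete, here is a minimal, hypothetical sketch of the kind of system prompt Nathan is describing. The message format is the common chat-completion convention rather than any particular provider's API, and the exact wording is an assumption.

```python
# Hypothetical example: an open-weight model only "knows who it is" if the
# deployer says so in the system prompt. Without one, it tends to claim to be
# whatever model dominated its (often synthetic) training data.
messages = [
    {"role": "system", "content": "You are OLMo, an open language model built by Ai2."},
    {"role": "user", "content": "Who are you and who made you?"},
]
# These messages would then be passed to whatever server is hosting the weights.
print(messages)
```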
Nathan Lambert (00:43:43): Another quick question, related: we mentioned Anthropic. With this Claude 3.5 Sonnet model that just came out, they've said in a tweet that they got clearance from the UK AI Safety Institute. This is from Michael Sellitto, who I think I've met at various government discussions. He's like, excited to release this top performing model. In addition to our internal pre-deployment testing, we were also pleased to work with the UK AI Safety Institute. Is this just political gesturing? What is going on?
Dean W Ball (00:44:18): I think that it's political gesturing. I don't love it. I don't think that we should normalize the whole pre-deployment testing thing, because that's just fundamentally incompatible with the way that software is made. But yeah, I suspect that it's political. I think that these companies, none of them are particularly reliable narrators. Like, DeepMind is going through a reorg. Was DeepMind a part of Google when the AI Safety Summit happened? I think maybe that reorg was happening. OpenAI, we all know, is a fairly dramatic company.
Nathan Lambert (00:45:04): I need to come up with the right nonlinear dynamics analogy. They're in like an unstable, homoclinic cycle or something. There are these things in nonlinear dynamics where they stay in a cycle, but if they're perturbed, they end up in another cycle. The Lorenz attractor is like the classical, truly chaotic one that oscillates between them. But it's kind of like that, because they don't even need an external disturbance. They don't even need an input. They're going to go into some other unstable equilibrium for a while and then go to another one. But nonlinear dynamics is just a great field because the math is simple, but the analogies are really good.
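For anyone who wants to see the analogy rather than take it on faith, here is a minimal sketch of the Lorenz system Nathan references, with the textbook parameter values and a crude Euler step chosen purely for illustration.

```python
# Lorenz system: three simple coupled ODEs with chaotic trajectories.
sigma, rho, beta = 10.0, 28.0, 8.0 / 3.0   # the classic textbook parameters
x, y, z = 1.0, 1.0, 1.0                    # arbitrary starting point
dt = 0.01                                  # crude forward-Euler step, illustration only

for _ in range(5000):
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    x, y, z = x + dx * dt, y + dy * dt, z + dz * dt

# The trajectory never settles: it circles one lobe of the attractor for a while,
# then flips to the other with no external input, which is the picture Nathan is
# drawing of labs lurching between unstable equilibria on their own.
print(x, y, z)
```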
Dean W Ball (00:45:41): I even think Anthropic is that way too, to be honest, even though they're the most stable of the three,
Nathan Lambert (00:45:50): but I think their cultural density is still higher.
Dean W Ball (00:45:53): Yeah, I mean, I think that they have a very clear mission, and that is really helpful.
Nathan Lambert (00:45:59): I don't know if they're achieving it. Their whole line about... okay, I'm close with a lot of people there, but I don't believe that their line that they're not contributing to the race is true. I think they need to reframe that and figure out how to combine this with their culture. I think it's true that normal people don't know that Anthropic exists, which might mean that in a normal person's world they're not contributing to some race, but they are in dynamics with OpenAI and Google that substantially add pressure to the pace of AI progress.
Dean W Ball (00:46:31): Claude's been my go-to daily model for the last four months. It's good. Since Claude 3 came out. But yeah, I mean, I also think that they've committed to doing models every couple of months too, right? That's a pretty rapid cadence, substantially faster than OpenAI. So yeah, if anything, they're accelerating the current dynamics. But I think that the whole UK AI Safety Institute thing... I think that commitment was made during a very heated moment, kind of the peak. I think fall of 2023 was sort of the peak of the AI doom rhetoric. Was this before or after the Sam Altman stuff? I think it was before. It was before, the AI Safety Summit.
Nathan Lambert (00:47:16): I talked to people who were at that event and they were like, this s**t is weird. They're like, why am I on the stage with all of these billionaires and famous politicians? And they're all like, what is going on here?
Dean W Ball (00:47:27): Yeah. Well, I mean, it was just so incoherent back then, because the Biden executive order and the AI Safety Summit were within about a week of one another, as I recall. It's like all this stuff happened. And I think they made those commitments, and I think we will see all these companies gradually try to unwind themselves from those commitments over time. Or what will happen, and this will be very consistent with the way that software gets regulated, is the big companies will do these pre-deployment tests, and there'll be open providers who don't. And it doesn't have to resolve itself in a rational way. That's something that's always important to remember about public policy. There's absolutely no need for it to be rational, you know, to make sense.
Nathan Lambert (00:48:19): Yeah, that makes sense. I think the other thing, this is all the AGI lab stuff: what is your take on the scaling curves? For context, everyone got restarted on this with the Leopold Aschenbrenner Situational Awareness thing, which obviously is a well-written document, whether or not you agree. I think it's interesting. I'm struggling with this one point of the scaling curves thing, where I get mixed messages on what the scaling curves actually are when it comes to evaluations. My understanding is that when you have log compute on the x-axis and then log perplexity, it's a straight line. And what I interpret from this is, as you 10x compute, you don't get like a 10x increase in performance, you get 10 times closer to 100, which is like going from 90 percent accuracy to 99. So I don't really understand how people think that this is going to make the models become a PhD level, whatever, blah, blah, blah. And I was listening to a recent podcast, and I think it was Josh Albrecht from Imbue describing the reason you have emergent properties as being that at every 10x of training compute your model gets 10 times better, so if you're measuring on a linear scale, it'll look like an emergent property because it's going to go like this. And I was like, what is going on, why does no one understand these fundamentals? It seems impossible that you could get 10 times better. It just seems like total Kool-Aid drinking. Am I wrong? I guess I need to go do the basic math. It just doesn't track with any computer system. How are you going to get 10 times... I don't understand. Well, that's kind of my rant.
Dean W Ball (00:50:07): I read these charts the same way. Log-log, perplexity versus compute, right? That is what I read too. And so that would imply asymptotic progress, but it would not imply a continued exponential increase in capability. I also think, what is better? That's always so hard. It's like, what is 10 times? People say, oh, well, the leap from GPT-4 to GPT-5, will it be similar or less or bigger than the leap from GPT-3 to GPT-4? I'm like, I don't really know if I can quite quantify what the leap between 3 and 4 was, or the leap between 4 and Opus, Claude 3 Opus, which was definitely real for me. You know, that model felt qualitatively different. But I don't think that has to do with training compute. I really don't think that has to do with the number of parameters the model has. I think that has to do with the way Anthropic did the post-training more than anything else. So, yeah, I'm really not sure. I'm skeptical when it comes to the scaling laws. They're obviously very important. They've held in a variety of different modalities, which is interesting. The fact that we see them apply in DNA sequencing or gene sequence prediction too is like, oh, that's interesting, we're just seeing that same line. The models improve monotonically with scale over and over and over again. So, sure, I'm inclined to believe that,
Nathan Lambert (00:51:52): but they're important. I just am so shocked by how bad the discussion of them so often is. This is the thing with putting levels on the Y axis corresponding to human education. Dumb. Bad move. The technical reality may be that the models continue to improve, but those are the things that I want to see people stop doing. And this isn't really a question, this is mostly just me ranting, because this impacts policy and these related discussions.
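To make the log-log point concrete, here is a toy sketch with invented constants (not fits to any real model family). A power law in compute gives a fixed fractional improvement in loss per 10x of compute, and a saturating downstream metric replotted on a linear axis is what can look like a sudden jump.

```python
# Toy power law: loss(C) = a * C**(-alpha). The constants are made up purely
# for illustration and are not real scaling-law fits.
a, alpha = 10.0, 0.05

baseline = a * (10.0 ** 20) ** (-alpha)       # loss at 1e20 FLOPs

for exponent in range(20, 26):                # compute budgets 1e20 .. 1e25 FLOPs
    C = 10.0 ** exponent
    loss = a * C ** (-alpha)
    # A made-up saturating "task score" derived from the loss, just to show how
    # replotting a smooth curve on a linear axis can look like a sudden jump.
    score = 1.0 - loss / baseline
    print(f"C=1e{exponent}: loss={loss:.3f}, toy score={score:.3f}")

# Each 10x of compute multiplies the loss by 10**(-alpha) ~= 0.89 here: a steady
# fractional improvement, nothing like "10 times better".
```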
Dean W Ball (00:52:19): If I wrote an essay like that in college and submitted it to my professor... like, Leopold Aschenbrenner.
Nathan Lambert (00:52:27): Wait, who was the famous economist that he was with? Tyler Cowen, it was Tyler. Tyler, you didn't check his work.
Dean W Ball (00:52:35): Yeah. Tyler basically hired me too, in fact. But yeah, if you did that and you didn't define intelligence, the first thing a college professor would do is circle the first paragraph and say, you need to define intelligence here. And the fact that he doesn't... I don't think it's a one-dimensional or two-dimensional thing. I think intelligence is inherently highly multidimensional, and multidimensional things just behave in counterintuitive ways. So like,
Nathan Lambert (00:53:08): I think they're getting better at things they're already doing, but we don't have any proof that they're going to start doing new things.
Dean W Ball (00:53:15): Yeah. Is GPT-4 better than a high schooler at some things? Yes. Is it worse than a three-year-old at some things? Yes. Those things are all true. And I don't really think it belongs on a human-defined linear scale of intelligence. I just inherently don't think that.
Nathan Lambert (00:53:31): Yeah. That makes sense. Final question. How much of influencing policy and related discussions comes down to having some sort of audience? I think that this is like
Dean W Ball (00:53:42): remarkably true, but potentially not good. Yeah, I think that it is very important, and I think that it comes from influencing the way people think. You know, a lot of think tanks will judge the success of research by, did the ideas from this research get implemented in policy, which is one way to do it, for sure.
Nathan Lambert (00:54:08): But I think... It's a long timescale. It's like a longer timescale than citations in academic nonsense.
Dean W Ball (00:54:14): Well, and also, if I'm successful as a policy scholar, then at least once a month I should be putting out something, some analogy, some way of thinking about something, a meme, really, basically, that has an effect on the way a lot of influential people think. The other big outstanding question for me, and I've heard you raise this on The Retort recently, in fact, is what's more important: is it influencing people in the federal government or is it influencing people at the AI labs? Who's going to be more important for determining policy? I don't know.
Nathan Lambert (00:54:55): Yeah. Well, maybe some people in AI read this, and I think this is a great conversation. I'm kind of happy to wrap up here. I could see us redoing this in months based on the coverage of all the recent things here. So I think this is great. I'm excited to share this with people. It's nice to get to know you more. We already have another project lined up where we'll talk more about this. It won't be in the same medium, so that's fun. So thanks a lot and keep writing. I'm sure you'll get a bunch of people to check this out. I'll have all the links everywhere and stuff like that.
Dean W Ball (00:55:28): Awesome. But you too, thank you very much. You played a big role in my building my Substack audience over the last six months. So I really appreciate it.
Nathan Lambert (00:55:35): People just need to say things. People ask me this a lot. If you make time, most people that I work with have interesting thoughts. The problem is doing the practice of getting those thoughts into some silly medium. Literally, these long tweets (the tweets are long now), you could just do that. You could do that once a week and you will grow an audience over time. It's pretty simple. You just have to pick your lane and keep pressing the button, and it works. You're not the only one; I'm going to have some other people who have talked about this on this interview track in the summer. I bring it up partially as a way to normalize it and get more people to try it, because I want that to happen in AI too. There are a lot of smart people who don't know how to engage, and it's worth it. So thanks again.
Dean W Ball (00:56:27): We'll talk to you. All right. Bye.
Things to be aware of if you work on language model fine-tuning.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/rlhf-roundup-2024
00:00 RLHF Roundup: Trying to get good at PPO, charting RLHF's impact, RewardBench retrospective, and a reward model competition
04:32 How big is the impact of RLHF relative to pretraining?
05:54 RewardBench retrospective after 100 models and 90% peak accuracy
09:19 LMSYS's reward modeling competition
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_009.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_012.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_017.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_026.png
Synthetic data is known to be a super powerful tool for every level of the language modeling stack. It's documented as being used for expanding vanilla pretraining data and creating large swaths of fine-tuning data. Many, many more rumors surround its use: Anthropic's pretraining-scale constitutional AI, Mistral AI's first models being pretrained on OpenAI outputs, Q-star's hopes as OpenAI's remaining moat, and much more. The diversity of use cases for synthetic data makes planning around the role of synthetic data in solving specific goals difficult.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/frontiers-in-synthetic-data
00:00 Frontiers in synthetic data
01:14 1. Direct distillation is still king
02:54 2. Are Gemini Flash and Claude Haiku distilled?
04:03 3. Filtering prevents collapse
06:30 4. Synthetic data strategy taxes
07:32 5. Pros and cons of training on multi-output-source synthetic datasets
08:54 6. Structured synthetic data
09:42 7. Weak-to-strong generalization is maybe real
10:27 8. Creating synthetic prompts is overlooked again
Signs point to a general-use Sora-like model coming very soon, maybe even with open-weights.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/text-to-video-ai-is-already-abundant
0:00 Text-to-video AI is already abundant
5:08 What's next for the text-to-video market?
6:49 Are text-to-video models good for the world?
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_005.mp4
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_009.mp4
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_011.mp4
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_013.mp4
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_015.mp4
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_017.mp4
Apple Intelligence makes a lot of sense when you get out of the AI bubble.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/apple-intelligence
00:00 AI for the rest of us
02:46 Apple's technical approach
03:32 Core models: What did Apple build?
05:35 Alignment strategies: Some new things!
10:00 Orchestrating adapters and on-device magic
11:58 Light for other narratives around AI
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_005.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_039.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_041.png
A realistic path to robotic foundation models
Not "agents" and not "AGI." Some thoughts and excitement after revisiting the industry thanks to Physical Intelligence founders Sergey Levine and Chelsea Finn.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/robotic-foundation-models
0:00 A realistic path to robotic foundation models
2:51 Key factors for the future of robotics
6:19 Everything is a token: The transformerification of robotics
Data licensing deals, scaling, human inputs, and repeating trends in open vs. closed.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/the-data-wall
0:00 We aren't running out of training data, we are running out of open training data
2:51 Synthetic data: 1 trillion new tokens per day
4:18 Data licensing deals: High costs per token
6:33 Better tokens: Search and new frontiers
Celebrity's power will only grow in the era of infinite content.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/name-image-and-ai-likeness
0:00 Name, image, and AI's likeness
1:11 OpenAI's second terrible, horrible, no good, very bad week
4:36 The expansion of name and likeness
7:46 Culture and AI development
ChatGPT leaves the textbox, and Google is building the same, and more, as practical tools.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-and-her
00:00 OpenAI chases Her
02:10 Talking to ChatGPT
03:53 GPT-4o: Toward omnimodal models
08:25 Google's mirror with Gemini
10:11 OpenAI's AI Safety: Have your cake and eat it too
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/her/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/her/img_023.jpg
Now we will have some grounding for when weird ChatGPT behaviors are intended or side-effects -- shrinking the Overton window of RLHF bugs.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-rlhf-model-spec
00:00 OpenAI's Model (behavior) Spec, RLHF transparency, and personalization questions
02:56 Reviewing the Model Spec
08:26 Where RLHF can fail OpenAI
12:23 From Model Specs to personalization
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_027.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_029.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_033.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_034.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_041.webp
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_046.webp
Many, many signs of life for preference fine-tuning beyond spoofing chat evaluation tools.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/how-rlhf-works-2
00:00 How RLHF works, part 2: A thin line between useful and lobotomized
04:27 The chattiness paradox
08:09 The mechanism for making models chattier
10:42 Next steps for RLHF research
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_012.webp
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_018.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_025.png
Models that seem totally out of scope from recent open LLMs give us a sneak peek of where the industry will be in 6 to 18 months.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/phi-3-and-arctic-llms
0:00 Phi 3 and Arctic: Outlier LMs are hints
1:01 Arctic & open mixture of expert trends
6:10 Phi 3, synthetic data, and small models
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_004.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_008.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_018.png
Certain definitions of AGI are backing people into a pseudo-religious corner.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/agi-is-what-you-want-it-to-be
00:00 AGI is what you want it to be
04:01 RL still rules the AGI discourse
05:43 Modern AGI tests
07:37 Agency and shifting goalposts
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/agi/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/agi/img_020.png
Meta shows that scaling won't be a limit for open LLM players in the near future.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/llama-3-and-scaling-open-llms
00:00 Llama 3; scaling open LLMs to AGI
01:44 Pretraining, data, and basic evals
06:06 Alignment and human evaluations
10:08 Chatting with Meta AI and Llama 3 70B Instruct
11:55 Same Llama license (mostly)
12:52 The healthy open LLM ecosystem
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_011.jpeg
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_013.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_015.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_020.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_036.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_040.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_046.jpeg
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_061.png
Fig 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_063.webp
Fig 10: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_066.png
Fig 11: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_068.jpeg
Integrating ideas from outside computer science into reinforcement learning from human feedback can give us the models we want.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/reinventing-llm-alignment
0:00 Stop "reinventing" everything to "solve" AI alignment
2:19 Social Choice for AI Alignment: Dealing with Diverse Human Feedback
7:03 OLMo 1.7 7B: A truly open model with actually good benchmarks
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_013.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_018.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_024.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_027.png
Modeling the compute versus performance tradeoff of many open LLMs.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/compute-efficient-open-llms
0:00 The end of the "best open LLM"
3:05 Compute efficient open LLMs
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_004.jpeg
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_009.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_014.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_016.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_018.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_020.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_022.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_024.png
Fig 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_028.png
Last minute title change from: The tech industry can't agree on what open-source AI means. That's the process.
How to read what multiple people mean by the word openness and see through the PR speak.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/flavors-of-open-source-ai
0:00 The tech industry can't agree on what open-source AI means. That's the process.
2:45 1. Effective Accelerationists, Techno-Optimists, capitalists, etc.
3:39 2. Scientists, promoting understanding and transparency
5:16 3. Inclusion, public interest, and fighting concentration of power
6:19 4. Freedom advocates
7:25 Dissecting "openness"
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/openness/img_004.png
Databricks' new model is surpassing the performance of Mixtral and Llama 2 while still being in a size category that's reasonably accessible.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/databricks-dbrx-open-llm
00:00 DBRX: The new best open model and Databricks' ML strategy
03:36 The DBRX narrative
07:33 Databricks' open LLM (and AI) strategy
09:42 Playing with DBRX Instruct
14:54 Digging for details
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_007.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_012.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_023.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_045.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_047.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_059.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_066.jpeg
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_068.png
Evaluation is not only getting harder with modern LLMs, it's getting harder because it means something different.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/evaluations-trust-performance-and-price
00:00 Evaluations: Trust, performance, and price (bonus, announcing RewardBench)
03:14 The rising price of evaluation
05:40 Announcing RewardBench: The First reward model evaluation tool
08:37 Updates to RLHF evaluation tools
YouTube code intro: https://youtu.be/CAaHAfCqrBA
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_026.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_030.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_034.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_040.png
Where moats are tested now that so many people have trained GPT4 class models. Claude 3, Gemini 1.5, Inflection 2.5, and Mistral Large are here to party.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gpt4-commoditization-and-moats
00:00 Building LLM moats despite the commoditization of GPT4
04:38 The Open's opportunities
08:02 It's amazing people still think LLMs aren't going to be useful
09:50 Things that are coming
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_004.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_028.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_032.png
A proposal for a new definition of an "open source" LLM and why no definition will ever just work.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/an-open-source-llm
00:00 The koan of an open-source LLM
03:22 A new naming scheme for open LLMs
07:09 Pivot points and politics
08:16 Claude 3, arms race, commoditization, and national security
10:01 Doomers debunking bio risks of LLMs themselves
11:21 Mistral's perceived reversal and the EU
13:22 Messy points: Transparency, safety, and copyright
13:32 The muddling of transparency
15:22 The muddling of "safety"
16:30 The muddling of licenses and copyright
20:12 Vibes points and next steps
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/open-source/img_046.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/open-source/img_064.png
This interview is available on podcast players and YouTube.
I’m excited to bring you another interview! This one is a deep dive right in my wheelhouse — all things RLHF. Louis Castricato is probably the hidden star of RLHF in the open. I’m not sure anyone who can speak freely knows as much as him. As I’ve said again and again on Interconnects:
Giving a voice to researchers is the best way to cut through the noise and understand what is happening with core developments of LLM technologies.
Louis recently founded a new startup focused on synthetic data for alignment, Synth Labs, and is a researcher at EleutherAI. This interview should speak for itself, and it'll need re-listens, even for myself. The list of topics we cover touches on pretty much every major and minor issue facing model fine-tuning. Please reach out or comment if there's a paper we mention that I didn't link. Happy to dig it up for you.
For more on Synth Labs, there was a profile in Bloomberg from Rachel Metz.
This post is very technical, more than usual. If you’re having a hard time with it, I suggest you listen to my RLHF 201 post on Latent Space first.
Chapters
These are generated with smol-podcaster with moderate edits.
High-level chapters
* 00:00:00: Introduction
* 00:01:24: Gemini News and RLHF’s Part in it
* 00:09:05: Long Context, In-Context, and Multimodal RLHF
* 00:21:20: What are people missing about RLHF these days?
* 00:30:30: OpenAI's Influence and the Need for Alternatives
* 00:39:20: Synth Labs and the Future of Alignment
* 00:55:00: Evaluation Talk p2: Open-ended Evaluation and Data Diversity
* 00:59:20: Algorithm Roundup: PPO, DPO, KTO, IPO
* 01:18:38: CarperAI, Early Days of RLHF, Reflecting on ChatGPT
Detailed chapters
* 00:00:00: Introduction and Overview of RLHF
* 00:02:02: Gemini News, Custom Demographics in Image Prompts, and Controllability Issues in AI Models
* 00:05:21: Fixing Biases in AI Models Post-Training, Representation in AI Data
* 00:09:00: Multimodal RLHF and Video RLHF
* 00:16:09: Evaluating Long Context Behavior in AI Models
* 00:17:05: The Potential of In-Context RLHF
* 00:21:24: Shift from PPO to DPO in RLHF
* 00:23:19: Generalization and Evaluation in RLHF, Human Evaluation
* 00:27:03: The Discrepancy Between Research and Company Needs in Alignment
* 00:29:20: Impact of ChatGPT and Language Model Outputs on Data Sets
* 00:31:39: The Concept of Uncensoring Models
* 00:34:05: Lack of Safety Data Sets in Instruction Tuning
* 00:35:23: LMSYS ChatBotArena, AlpacaEval, MT Bench p1
* 00:39:25: Introduction to Synth Labs and Alignment Platform
* 00:43:05: Developing OpenCAI Constitutional AI Data Set
* 00:49:41: The Need for Open-Ended Evaluation in RLHF, eval p2
* 00:54:13: The Importance of Releasing Models for RLHF Research
* 00:58:17: Self-Instruction and Self-Rewarding LMs
* 01:01:03: Working on RLHF at Carper AI
* 01:04:25: Scaling PPO in RLHF
* 01:08:01: The Impact of ChatGPT on Carper AI
* 01:10:56: The Potential of KTO (Kahneman-Tversky Optimization)
* 01:17:39: The Importance of Implementation Details in RLHF
* 01:20:14: The Initial Focus at Carper AI
* 01:23:36: The Future of RLHF and Open Science Collaboration
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Papers & artifacts we discuss
* Recursively Summarizing Books with Human Feedback
* Needle in a haystack recent example repository.
* Urial paper: The unlocking spell on base llms: Rethinking alignment via in-context learning
* Misha paper from Deepmind: In-context Reinforcement Learning with Algorithm Distillation
* Muesli optimizer: Muesli: Combining Improvements in Policy Optimization
* Unintended Impacts of LLM Alignment on Global Representation
* Pink Elephants Problem: Suppressing Pink Elephants with Direct Principle Feedback
* Cut the Carp: Cut the CARP: Fishing for zero-shot story evaluation
* MT Bench data for correlating human to GPT4 preferences
Full transcript
Note: this is generated by smol-podcaster and has minor bugs post human edits.
Nathan [00:00:01]: The ticker's going up. Welcome, Louis. You're the second guest on the Interconnects podcast, I think. It's an interesting one for me because everyone kind of points to me now as the face of RLHF and I get a lot of questions, and to me Louis has represented that person. Louis provided most of the information for the first RLHF blog post that I wrote for Hugging Face back in the day. If there's somebody I want to ask questions about RLHF, it generally goes to him. So now you all are going to know this in the open. As always, I'm trying to talk with researchers on the ground and people actually doing things in these topics. We're going to cover a lot today. If you're watching on video, you may have noticed that we're in the Latent Space studio, and they reminded us we've got to start off with covering the Gemini news and what that means for RLHF. Then most of this is a long docket of the core questions facing the two of us as we're trying to make RLHF more open and more useful, not only about safety, but safety is important to it and important to us. So I think we can get going. The first question I have, just to get rolling: what is your favorite Rhode Island fact?
Louis C [00:01:28]: My favorite Rhode Island fact? Oh man, all the H.P. Lovecraft stuff. Like walking around Providence with like friends who like H.P. Lovecraft and be like, oh yeah, you know, this was like that building in Call of Cthulhu or like...
Nathan [00:01:36]: I don't even know this. I mean, for the record, I grew up in Rhode Island if people didn't know and then that's where Louis spends most of his time these days. Providence. So we'll come back to this. I think I'm just gonna start with kind of the hardest question then it'll get easier for us from here. It's like what was your first reaction when you saw all this Gemini stuff?
Louis C [00:02:02]: The, you know, the adding custom races and demographics to image prompts component, right? Yeah. So DALL-E had done that back when DALL-E 2 first came out and was in beta, and people were reporting that for "a person holding a sign that says X," the sign would say black, or the sign would say white, or the sign would say Asian. And, you know, it was a very hacky solution then, and I thought a lot about it then as well. I almost felt like it gets you 90% there for like 1% of the time compared to doing this in a more proper and auditable way, like making sure your training data has equal representation or making sure your RLHF data has good representation. And, you know, you can't do those things after the fact, but what you can do after the fact is inject things into the prompt to make it more controllable. And it really comes down to the fact that controllability right now is not a solved problem, and most of our solutions to controllability are a little bit hacky.
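Editor's note: a minimal sketch of the kind of post-hoc prompt injection Louis describes, with a hypothetical descriptor list. This is not the actual mechanism DALL-E or Gemini uses, just an illustration of why this class of patch is considered hacky.

```python
import random

# Hypothetical list of demographic descriptors, purely for illustration;
# not the actual terms any production system injects.
DESCRIPTORS = ["Black", "white", "Asian", "South Asian", "Latino", "Middle Eastern"]

def inject_diversity(prompt: str) -> str:
    """Naively rewrite an image prompt that mentions a person so it carries an
    explicit demographic descriptor. This is the 'hacky' controllability patch
    discussed above: cheap to ship, but applied uniformly it distorts prompts
    (e.g. historical scenes) where it should not apply."""
    if "person" in prompt:
        descriptor = random.choice(DESCRIPTORS)
        return prompt.replace("person", f"{descriptor} person", 1)
    return prompt

print(inject_diversity("a person holding a sign that says hello"))
```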
Nathan [00:03:16]: Yeah, that makes sense. To summarize for people: this has been an ongoing issue, and we're recording on the 27th here. Gemini initially got flak for actually forcing diversity into historical scenes, and then it started getting more flak for flat-out refusing certain requests on race. All of this stuff is like, ouch, to somebody. I know people working on this stuff, and the way it ends up here is not what a lot of people think. The Gemini team is obviously moving fast, and it seems to me that the image stuff has always been a red herring. That's the way swyx phrased it as well. Somehow it got to the point where a prompt was shipped in the final solution with the further image editing, and that's just hard; obviously there's a big goof-up there. But then we're looking at examples, and still today Meta's image generator, in WhatsApp or wherever, you can ask an AI and it'll have similar issues where it forces diversity into a question with multiple people. Microsoft Copilot has this. It's the text thing, and really digging into how we think these big companies could be forcing this into their data. We know there's a lot of uncertainty over how all these companies get their preference data. Some of them work with companies like Scale and Surge, some of them do it in-house. Who is providing it isn't really the issue, because they're probably giving similar instructions to similar workforces across the board. The question is how we see this entering the preference data they're adding to their RLHF stuff, because if you look at a base model (we were just working with OLMo), if you say hello to a base model, a lot of times the base model will go off and be like some crazy 4chan s**t, because so many of the conversations in there, even with good data processing techniques, are from weird corners of the Internet. So I don't see any base model that comes out with some de-biased thing; it's added on. And it's like, how did we end up there?
Louis C [00:05:21]: Yeah, I mean, you know, when I was saying this is something that they do retroactively: once they've acknowledged that these issues exist in the data set, once the model has been trained, it's not something that can be easily fixed, even if they had infinite resources. It's very, very hard to go back and actually rectify these biases in a way that's equitable to all the kinds of preferences someone might have when wanting to interact with this model, right? There's the fact that, at least as far as I know, until recently DALL-E did this as well, where you could still say "a person holding a sign that says X" and it would still say black, white, or whatever. And the amount of resources that they're pumping into this, you know, they're building a consumer product, they're building the main consumer product in this space, and the fact that this still presents a large issue for them just shows how difficult this really is.
Nathan [00:06:20]: Yeah, and another example: on the Discord that I have for paid subscribers and friends, which is growing, someone pointed out work where if you ask DALL-E to generate a doctor and an assistant, all the same bias problems still show up. So a lot of the solutions that we have are not necessarily deep, at this conceptual level; it's at the level where you tell your preference labelers to do a certain thing and then they do it, but you may not have good tracking of which data point is responsible for these different behaviors.
Louis C [00:06:55]: Yeah, you know, interpretability for preference learning in general, we're very, very far from actually understanding what preferences result in what model behaviors, and, you know, preferences that disagree with each other.
Nathan [00:07:12]: Like the John Schulman talk. Yeah. It's like that was this whole talk and it was great just to have him get up there and be like this is so hard.
Louis C [00:07:20]: Yeah, and I've done a ton of experiments myself where I have an RLHF data set and I randomly remove 10%, and I have a bunch of models each with a different 10% removed, and I ask, what behavioral differences can I see between these models? And not only can you see differences, but it's extremely hard to quantify them, it's extremely hard to actually understand what the difference is, and then there's almost no way to know what in that 10% caused that difference.
Nathan [00:07:51]: Yeah, this reminds me of the Hugging Face No Robots data set, which is a professionally curated instruction data set. Whenever we added that to a model, it was like, this is obviously our most valuable data, but it would show up on zero benchmarks, and we're like, well, what do we do? And we're talking about Google's problems here, and we'll get back to the data problems in the open source. They probably have on the order of millions of data points going into this preference data, and some proportion of it is probably about safety. I think we could talk about the Anthropic HH data, where people don't actually know the details: it's roughly a quarter harmless data and three quarters helpful data, from different rollouts. These are very specific things, and huge data problems, that most people aren't really thinking about.
Louis C [00:08:40]: Yeah, most people are just like, oh, this is safety, so I'm going to throw it into my data set, and hopefully it works and hopefully we get good behavior, but I don't really know what's in this data set, I haven't really looked at the data. And I think that's something I've heard many, many times over the last year from people trying to get their feet wet in the RLHF space.
Nathan [00:09:00]: Yeah. And do you have any intuitions, as the last point on the Gemini thing: if we don't think the image generation is Gemini's biggest issue, I think it's in the text and how this preference data is collected. But do you know anyone doing multimodal RLHF? Because I generally think we don't know how to do this at all: how do you control the inputs if you have multiple inputs and multiple outputs, how do you control your modality distribution and data counts and so on?
Louis C [00:09:30]: Yeah, so I mean, I have two friends who have been doing video RLHF for a little while now, a bit over a year, and, you know, they condition their video model on some text encoder, and they've been talking about having to do RLHF independently for both the text encoder and the video model. But video RLHF is just massively underexplored, and no one really knows what they're doing in that space.
Nathan [00:09:53]: When you say independently, what do you mean? Before making the video model, are they RLHF-ing the text backbone, or are they freezing the rest of the model?
Louis C: Yeah, they're RLHF-ing the text backbone.
Louis C [00:10:04]: I think there was actually a paper from Tencent last August that basically did the same thing for multimodal RLHF, where they had to RLHF the text backbone and then RLHF the image generation components on top of that.
Nathan [00:10:17]: This is potentially basic, but to train a visual language model you have to add some type of mechanism that links the gradients between the two. Most of the time these days they're starting with a language backbone, adding on vision, and continuing to train. So is it at the end, when you have a visual language model, that they're freezing the gradients of the video part and then RLHF-ing the text part? Or is this before the text backbone is even initialized in the model?
Louis C: The space is a little too early.
Nathan: Yeah like I think that's the point like we don't know these links.
Louis C [00:10:53]: But I know people in the last eight months who have done it that way: before they even add the image component, they RLHF the text model, and then they add the image component and RLHF the image part.
Nathan [00:11:07]: Yeah, so this is really interesting. Everyone talks about how RLHF is low in computation and flops compared to what people are doing elsewhere. In the open we say it's like 50 or 100,000 training samples. Llama 2 is like 1.5 million. I'm guessing the closed models like Gemini are probably another 10 million; they're much bigger. And is the amount of video training it takes to train this backbone after the fact still helping? Does it undo some of the text RLHF or does it not? The answer is, I don't know, but these are the kinds of things I want people to start talking about. Is RLHF becoming a sequential process as you add modalities, or can you wait all the way to the end and do just multimodal RLHF? We don't know these things, and this is what people on Gemini are trying to work on.
Louis C [00:11:58]: I've definitely spoken to a lot of people who are at least thinking in this space; I've only spoken to a small number of people who are actually working in it. But for the people who are thinking in this space, really the dream is to be able to express preferences in the modalities where it's beneficial to express preferences. It doesn't make sense to express preferences over code as images or video, but it does make sense to express preferences over, like, puppies as photos.
Nathan [00:12:25]: That's a great point, and the way you ended your sentence, preferences over puppies, is the thing: we don't know what people use visual outputs for in a productive sense, and really inputs too. Things like "analyze this video" are a toy example where creating RLHF pairs actually isn't too hard for us, but it takes a lot of effort, because a human has to know what is in the video to do summarization RLHF. If you're passing a three-hour video into a Gemini base model and it gives two outputs, the human is not going to know which is right unless they have context on what the video is, and that is just way different than a poem where you could read both of them.
Louis C [00:13:04]: Yeah, so there's actually a really fascinating paper from OpenAI that I really haven't seen anyone build on. It was the idea of summarizing really long books and using RLHF to do that.
Nathan [00:13:14]: Is this sort of like recursive summarization?
Louis C [00:13:17]: Yeah, yeah, it's the recursive summarization. It's the idea that you can almost treat long summarization as a weird RLHF merge operation, where you divide, divide, divide, and eventually you get to segments where it makes sense to collect annotations. On those segments you have a human annotator go through and say, oh, this segment's summary is better than this one, or the summary of this segment plus this segment is this. Then when you combine summaries, you can say, well, this summary plus this summary gets you this summary, and eventually you get preferences going all the way up the tree and you get a preference over the whole book at the end. Obviously, you know, it's a crude approximation of what the summary of the whole book is, but it's much more feasible than asking human annotators to summarize an entire book.
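Editor's note: a rough sketch of the recursive structure Louis describes, in the spirit of OpenAI's book summarization work. The `summarize` and `collect_preference` callables are hypothetical stand-ins for a (stochastic, compressing) model call and a human or synthetic annotation step.

```python
from typing import Callable, List

def chunk(text: str, size: int = 2000) -> List[str]:
    """Split a long document into fixed-size segments."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def recursive_summarize(
    text: str,
    summarize: Callable[[str], str],                 # assumed model call; must compress its input
    collect_preference: Callable[[str, str], str],   # assumed annotator picks the better summary
    size: int = 2000,
) -> str:
    """Summarize leaves, gather preferences at each level, then merge upward.
    Preferences collected on short segments are cheap for annotators; the
    merged result approximates a preference over the whole book."""
    segments = chunk(text, size)
    summaries = []
    for seg in segments:
        a, b = summarize(seg), summarize(seg)         # two sampled candidate summaries
        summaries.append(collect_preference(a, b))    # keep the preferred one
    merged = " ".join(summaries)
    if len(merged) <= size:
        return summarize(merged)
    return recursive_summarize(merged, summarize, collect_preference, size)
```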
Nathan [00:14:05]: Yeah, I mean, I just realized this on the pod right now: how ridiculous RLHF-ing an entire code base in context is. That's where some of the opportunities are for what I think RLHF could do, which is synthetic data labels and things like that. We can create synthetic preferences in many different ways that aren't all reliant on this kind of human subjectivity.
Louis C [00:14:32]: Yeah, it's a deeply fascinating problem. Actually, going into it, how big is Gemini 1.5's context window?
Nathan [00:14:37]: Yeah, it shipped with a million, and they have experiments in the paper up to 10 million.
Louis C [00:14:40]: Like, who really wants to use a 10 million token context window? And how accurately can you really think about preferences over the range of a 10 million token context window?
Nathan [00:14:54]: I think people want to use it, but the preference thing is a lot harder. This is something I encounter at Hugging Face regularly: Hugging Face is a popular code base, so you expect the code models to do well, but they still don't. They'll make up datasets functions or something. If you just have all of Hugging Face's code in context when you're working in the Hugging Face ecosystem, that will make you so much better. Or analyzing long videos and things like that. I do think there are a lot of use cases, but the preference thing is just a totally different framing. What do you think about the needle-in-the-haystack evaluation that they did? I haven't read a lot about it, but essentially there's a difference between being able to act on the information and being able to retrieve it. I think these models should be passing needle in the haystack, because that shows they're actually noticing that the information is there, but it does not necessarily mean they're going to be able to synthesize all the information in a compelling way. It's a pass bar: you need to have this to be credible in long context, but actually evaluating long context and what behaviors we want to see is pretty open-ended.
Louis C [00:16:09]: Yeah, Goldberg put out a paper like yesterday where he's like, oh, needle in the haystack is interesting, but if you have more than two needles it's entirely uncorrelated with the single-needle-in-the-haystack benchmark.
Nathan [00:16:24]: Yeah, because trying to find one thing in each part of the content breaks the context window into many segments and then makes sure that you can find something in each of those segments.
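Editor's note: a sketch of the multi-needle idea described here, with hypothetical helper names; this is not the code of any particular benchmark, just the shape of the test.

```python
import random

def build_multi_needle_prompt(filler: str, needles: list[str], context_len: int) -> tuple[str, list[str]]:
    """Construct a multi-needle retrieval test: split the filler text into as many
    segments as there are needles and hide one needle inside each segment, so a
    passing model must retrieve a fact from every region of the context, not just one."""
    # Repeat the filler until it covers the target context length, then trim.
    filler = (filler * (context_len // max(len(filler), 1) + 1))[:context_len]
    seg_len = context_len // len(needles)
    segments = []
    for i, needle in enumerate(needles):
        seg = filler[i * seg_len:(i + 1) * seg_len]
        pos = random.randint(0, max(len(seg) - 1, 0))
        segments.append(seg[:pos] + f" {needle} " + seg[pos:])
    prompt = "".join(segments) + "\n\nList every hidden fact in the text above."
    return prompt, needles   # the needles are the expected answers
```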
Louis C [00:16:36]: So I feel like we're almost going to get to the point where the attention itself is the limiting factor, because the model genuinely just cannot equitably split attention over its context window to retrieve as many things as it realistically needs in order to produce something.
Nathan [00:16:50]: Do you think RLHF could manipulate long-context behavior more than people might expect? Because it's just an open question.
Louis C [00:17:05]: Yeah, I think it's a very interesting open question, and if the answer turns out to be yes, in-context RLHF becomes absolutely massive. Because right now it can kind of sort of work, but not really, and every benchmark I've ever seen for in-context RLHF almost isn't charitable at all to the RLHF baseline. And it's not, from the experiments that I've done and the experiments that people at EleutherAI have done. It's comparable in very niche situations, but it's not comparable in general, because you still have all the issues with in-context learning, where you'll massively overfit on the preferences put at the beginning of the context versus later preferences.
Nathan [00:17:50]: Let's try to explain what this in-context RLHF is actually doing. A lot of people know what an RLHF algorithm is, and in-context learning is designing a prompt. Is it training a model to generate prompts? What are you actually using the RL update on, and what are you parameterizing when you're doing in-context RL?
Louis C [00:18:10]: So I mean there's a number of different approaches for in context RL. There is the... Could be part of the problem.
Nathan [00:18:14]: It's like people do a lot of different things but what are some of them?
Louis C [00:18:16]: So the one I was referring to is, I think, the Yejin Choi paper. Yeah, it's URIAL, where she just prompted the chatbot: you are interacting with a user, here's what their preferences are, have at it. But there's also stuff like Misha's work at DeepMind. This is the first one that I did. Yeah, where you have some agent interacting with an environment, and you store all these state-action pairs, and you fine-tune models on episodes of these state-action pairs. The idea is that if you just put enough episodes into a context window, on the next episode it'll perform better, right? That's the algorithm distillation paper, and you can use this to distill stuff. I think the actual example in Chris Lu's paper is doing algorithm distillation on S4, and I think they distill Muesli, which apparently no one outside of DeepMind ever used, but apparently...
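Editor's note: a sketch of how algorithm-distillation training data is commonly laid out, matching the description above of fine-tuning on whole learning histories. The `Episode` structure and token format are hypothetical simplifications.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    # One episode of a learning agent: states, actions, and rewards, in order.
    states: List[int]
    actions: List[int]
    rewards: List[float]

def serialize_history(history: List[Episode]) -> List[str]:
    """Flatten an agent's learning history (episodes ordered from early/worse to
    late/better) into a single token stream. Training a sequence model on many
    such histories teaches it to produce better-than-previous episodes purely
    in context, which is the core idea of algorithm distillation."""
    tokens = []
    for ep in history:
        for s, a, r in zip(ep.states, ep.actions, ep.rewards):
            tokens += [f"s={s}", f"a={a}", f"r={r:.2f}"]
        tokens.append("<episode_end>")
    return tokens
```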
Nathan [00:19:15]: Oh, is this the Muesli algorithm? Yeah, I remember when this was hot, like a year ago at this point. We were thinking about re-implementing it and then we never did. It was too complicated.
Louis C [00:19:30]: Yeah, but Muesli is apparently very computationally expensive, because it's this model-based RL thing that beats AlphaGo, I think, without using Monte Carlo tree search, and it's so incredibly computationally expensive that being able to do it in context would dramatically reduce the cost of actually deploying it, right? As far as I'm aware, there's been no work applying algorithm distillation at all to NLP, and my impression is that it generally does not work for NLP, at least yet. I think there's a lot of potential there, but there are absolutely massive barriers that have to be overcome before we get there. And Goldberg's example of not being able to do needle in the haystack for more than two needles basically shows that even the ring attention stuff is not going to be sufficient for algorithm distillation for NLP, and I have a very strong feeling that Mamba or S4 is not going to close that gap either. They would need to be able to reference prior parts of the text, and they just can't do that.
Nathan [00:20:56]: Yeah, I think there's a whole rabbit hole we could go down about long context and architectures forever. Let's zoom back into the core stuff. The real starter question is: what do you think people are missing in RLHF these days? And from here it's going to be a long list of, what the heck do we do about evaluation, data, and so on. But what is the big-picture thing?
Louis C [00:21:24]: So what I think people are missing, and actually I touched a bit on this in the Pink Elephants paper, is that...
Nathan [00:21:28]: You should say what this is because we haven't introduced it.
Louis C [00:21:30]: Yes, you're right, you're right. So I worked at EleutherAI as a research scientist for the last six months or so, and we were really interested in understanding, you know, everyone had been doing PPO for so long and there had been a shift to DPO, and we were trying to understand: now that we're moving to DPO, how can we actually take advantage of this new approach? Should we really even be thinking about reward models and data sets in the same way we were during PPO? I think the answer to that is an unequivocal no. You need to think about your data sets and preference data sets entirely differently than you were with PPO. Because in PPO you're setting your data sets up to train a really good reward model, and in DPO you're setting your data sets up to teach a language model what the better trajectory is. It's a subtle difference, but in one you're just trying to learn the differentiation between high reward and low reward, and in the other it's like a general classifier.
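Editor's note: for readers who have not seen it, a minimal sketch of the DPO objective that motivates this different data framing, assuming you already have the summed log-probabilities of the chosen and rejected completions under the policy and a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(chosen | prompt), summed over tokens
    policy_rejected_logps: torch.Tensor,  # log p_theta(rejected | prompt)
    ref_chosen_logps: torch.Tensor,       # same quantities under the frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
) -> torch.Tensor:
    """Direct Preference Optimization: push the policy's implicit reward margin
    (log-ratio against the reference) to favor the chosen completion. No explicit
    reward model is trained, which is why the data set is framed as 'teach the
    model the better trajectory' rather than 'fit a reward model'."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```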
Nathan [00:22:35]: Like you want to be able to do everything with the reward model? Yeah. Have you also found that DPO can be sensitive to like the SFT distribution? So if you like take a random open preference data set if it's really different than what your model would generate like DPO can do some weird things? Louis C [00:22:53]: I've actually, I might be alone in this, I don't SFT before doing DPO at all.
Nathan [00:22:59]: Do you use generations from your base model?
Louis C: I do.
Nathan: So that's the question: if you were to not do SFT before doing DPO, could you just take UltraFeedback on whatever your base model is, even if it's sufficiently different?
Louis C [00:23:19]: I've done some weird stuff, though. Like, I've DPO'd models that were trained with the Hermes data set for code, and it still generalizes really, really well.
Nathan [00:23:28]: How are you measuring, how are you trying to think about generalization with DPO?
Louis C [00:23:33]: Well I typically rely on like human eval more or less. And if I do like human eval but it's GPT-4 eval and I see that human eval correlates with GPT-4 eval then I just go GPT-4 eval the whole way. A lot of people are doing that.
Nathan [00:23:48]: How far do you think that actually generalizes? Just recently, and we're bouncing around through all the things, but there's so much good information for people here: Hugging Face and Argilla, two places doing great work in this alignment and preference fine-tuning space, released a data set of preference pairs created from the OpenHermes data set, and they used PairRM as their judge. I remember Lewis Tunstall tweeted that they were looking at which judge gave the best correlation, and they found that PairRM, which is a 400-million-parameter DeBERTa-based pairwise classifier, had the best correlation for choosing which response was better among a set of responses in the OpenHermes data set. What they were comparing to is like Prometheus, and I'm forgetting the name of the other one; there are a couple more open models for rating and ranking responses out there, I think. But essentially the question is: we do these things and we look at this early correlation, and there is this correlation between GPT-4 and humans, and then a lot of times we continue. AlpacaEval has done this to validate AlpacaEval as a meaningful benchmark, and LMSYS has done this for MT Bench. All these places are doing this, where they validate a subset against humans and then say it generalizes forever. Do we think that's actually true? I think that you always have to take it with a grain of salt.
Louis C [00:25:24]: It's always for very, very specialized domains. Actually, I think I wrote the first paper on critiques and revisions, called Cut the CARP. The idea was, I remember this, that we could scrape, I think it was a million stories, edits of those stories, and then all the critiques that the editors wrote on those stories, and we could use that to train a big contrastive model, right? And we showed in the paper, we did a bunch of human eval and then used Spearman rank correlation to compare how our model ranked certain preferences versus how humans ranked them. We found that we had an extremely high Spearman rank coefficient, significantly higher than using a value head or just asking a language model to rank them. And I think the grain of salt we had is that we were only claiming that, on this very carefully created test set, the assumption that the model accurately reflects human preferences holds, and we can generalize to a small but slightly bigger test set and say it holds there as well. I think the broad sweeping statement that it holds on a few toy examples so it must hold
Nathan [00:26:54]: everywhere, I guess, never really holds. It's a common problem. Yeah. I think it's going to come up again and again.
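Editor's note: the Spearman-rank check Louis describes is easy to reproduce in miniature; the rankings below are made up, standing in for the human annotations and the model's scores from a CARP-style evaluation.

```python
from scipy.stats import spearmanr

# Hypothetical rankings of the same five story edits (1 = best), standing in for
# human annotations and a contrastive model's ranking of the same items.
human_ranks = [1, 2, 3, 4, 5]
model_ranks = [1, 3, 2, 4, 5]

rho, p_value = spearmanr(human_ranks, model_ranks)
print(f"Spearman rank correlation: {rho:.2f} (p={p_value:.3f})")
```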
Louis C [00:27:03]: I did my master's in human evaluation, and I've always been extremely careful with any statements I make that involve humans.
Nathan [00:27:12]: I mean, this is what people in RLHF need to be doing. This is the motivation of the history and risks of RL and human feedback paper that we did: RLHF is a socially rich topic. Whenever you say something and you're making claims of generalization, you're often making claims about what is implicitly a preference and a human value that you're taking into the system. So that is something people need to take really seriously. Here's a really specific aside. Did you know that when LMSYS released their LLM-as-a-judge paper, they also released thousands of samples from humans and GPT-4 verifying MT Bench preferences over pairs, which completion scored higher or not?
Louis C: I did not.
Nathan: Okay, so essentially the thing is, and I've talked a lot about building a reward model benchmark, there are all these references to how GPT-4 agreement is higher than human agreement when you're doing this preference process. So if you train a DPO model, or if you train a reward model, how it ranks the outputs is more likely to align with GPT-4 than with a human. It's more a statement that humans have more disagreement than GPT-4. So it's easier to train on GPT-4 outputs than on human outputs, and this is the place where I see it most clearly: all the reward models score about 10% higher accuracy on the test subset where the chosen and rejected were picked by GPT-4, all in the 70s or toward 80%, while on the human subset, where a human chose one MT Bench completion over the other, it's in the 60s. So we're slowly getting signal that it is there. And then the question is: should we care about doing our RLHF without any OpenAI input in the process? Last year, when the terms-of-service discussion was big, a lot of fine-tuning work was discussing which data sets we could use with permissive licenses that don't violate the OpenAI terms of service. Should we be concerned with where RLHF is going, where almost everything has been touched by OpenAI right now?
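Editor's note: the "agreement" numbers mentioned here are just pairwise accuracies; a quick sketch of the computation, with toy data in place of the real MT Bench annotations.

```python
def pairwise_agreement(rm_prefers_a: list[bool], label_prefers_a: list[bool]) -> float:
    """Fraction of preference pairs where the reward model's choice (A vs. B)
    matches the label source (human annotators or GPT-4)."""
    matches = sum(rm == lbl for rm, lbl in zip(rm_prefers_a, label_prefers_a))
    return matches / len(rm_prefers_a)

# Toy data: the reward model's choices compared against two label sources.
rm_choices   = [True, True, False, True, False]
gpt4_labels  = [True, True, False, False, False]
human_labels = [True, False, False, True, True]

print("vs GPT-4:", pairwise_agreement(rm_choices, gpt4_labels))    # typically higher
print("vs humans:", pairwise_agreement(rm_choices, human_labels))  # more disagreement
```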
Louis C [00:29:20]: There was a very interesting paper, I don't remember who wrote it, but it was like: if you take a model pretrained on data up to one cutoff and compare it to one pretrained on data up to another, basically pre and post the ChatGPT release plus six months, the benchmark scores improve. And it's literally just because there's ChatGPT data, or language model output data, or more structured data that sounds like a language model performing well on tasks, in the data set. That was kind of the consensus.
Nathan [00:29:53]: Was this a benchmark that's independent of... is it a structured benchmark or is it a vibes benchmark?
Louis C: I think it was a structured benchmark, but I don't remember.
Nathan: Yeah, I'm just asking whether it was a result of matching GPT-4 text or actually having better behavior, because training on OpenAI outputs, training on good language model outputs, does improve scores on benchmarks people care about. That's a fact people need to accept, and I think most people do; it's not controversial right now. But I still think that if there are lines of work out there where people, from a values perspective, are trying to fine-tune models without touching OpenAI, that is a line of work that should continue.
Louis C [00:30:42]: Yeah, on this note, when I was at Stability, one of the experiments that we did for StableLM, I remember, was pre-pending "as an AI agent trained by OpenAI" to anything before we ran it through evaluation, and the scores improved. I'm trying to remember who wrote the paper.
Nathan [00:31:09]: That's hilarious. I mean, there's been a lot less discussion of uncensored models right now. My claim is generally that uncensoring is the wrong word. People have used it to describe removing phrases like "as a language model," any mentions of emotion, or "I was trained by OpenAI so I can't do this." Do you think this type of filtering for opinions and soft refusals is still important in RLHF?
Louis C [00:31:39]: I think it's important for very very specific situations but not in general. My impression is that you know if you're interested in AI safety it's always useful to have a model that would never do a refusal ever.
Nathan [00:32:00]: It's hard to find on the Hub. We're building a safety dataset and we had to find a model; a fine-tune on the Dolphin dataset was the closest, and even then it only went along with maybe 80 to 90 percent of the tasks we asked it; it would still refuse 10 or 20 percent of the time. It's kind of profound that refusals are now stuck in the model in some way. We were looking for a model that wouldn't refuse at all and we couldn't find one on the Hub, which, after all this discussion of uncensoring, you would think would actually exist.
Louis C [00:32:31]: Yeah, I've been doing a bit of safety research with Stella for a little while, and my approach has been literally to call GPT-4 with a jailbreaking prompt and just put whatever I want after that. And I very often have to change my jailbreaking prompt.
Nathan [00:32:46]: Yeah I was like you have to keep close guard over the jailbreaking prompt.
Louis C [00:32:50]: Yeah, and the issue is that when you find a good jailbreaking prompt, you basically have to redo all your results within the next seven or whatever days before OpenAI patches it, and you just have to pray. There are so many issues using any OpenAI model in any research pipeline, but if your research is explicitly about the safety of OpenAI models, all of a sudden you're like, well.
Nathan [00:33:18]: I mean a lot of companies should be doing internal research on OpenAI safety to kind of have their own measure of how their application will do like the monitoring that on their own is worth it for their bottom line and liability because OpenAI will also do it but OpenAI has incentives to not tell the world if there's something kind of subtle going on that some people could get over because that might blow up and if they don't have a fix it's gonna bring attention to it.
Louis C [00:33:44]: It's part of the issue with even publishing red teaming research in general. If you publish an evaluation for red teaming or for safety, well, everyone's going to Goodhart that evaluation, and all of a sudden we have a useless stack of papers that used to be how you test whether a model is safe.
Nathan [00:34:05]: Yeah, I didn't really prepare questions on safety, but it has for a long time surprised me that there aren't datasets and easy recipes for adding safety to instruction tuning and RLHF. Someone on the Llama team asked me what they should do, and I'm like, you should release your safety data, because if they're getting pressure from the executive branch about safety, then if they have this data they can release it and say, this is how you can make any open model safe. Huge softball, and also safety is unlikely to be a competitive advantage. Mistral is not going to care about this, they might eventually, but the PR win is really big. Yeah. This is something that I've wanted to do for a while and just haven't done a good job of prioritizing. Yeah, we can go back to some of the questions that you have. Yeah, I'm adding them so I can keep notes later. I think the next main topic is evals. I think vibes-based evals are still a way of life in RLHF; they're not going away anytime soon. I would say we have kind of a holy trinity: the LMSYS Chatbot Arena, which is at the top for good reason, and then AlpacaEval, AlpacaEval 2, and MT-Bench. Let's start with the most important one: when you see LMSYS, what are you extracting from a model being better or worse there?
Louis C [00:35:23]: So in a way I am a little bit like what Andrej Karpathy said on this. Was it him? It might have been him.
Nathan [00:35:27]: Probably. He's been on a roll.
Louis C [00:35:32]: Yeah, where it's like when he picks an open source language model, he looks to see what people say about it on Reddit. Yeah, LocalLLaMA and the LMSYS Chatbot Arena. And the issue is that you don't know what those people are using it for, and as a research scientist, when I look for a model, I am looking for a model to do research on. Yeah. And I am not looking for a model to be my AI waifu girlfriend that I can play Dungeons and Dragons with.
Nathan [00:36:05]: Yeah, I mean, this has been the bane of RLHF research for a while. What did we do before MT-Bench? Literally the only hope we had was to chat with these things and hope for the best, and that was very recently, less than a year ago. Then MT-Bench came along and we were using it at Hugging Face and other people were using it. I actually don't know the AlpacaEval release date, so that might have been before MT-Bench. But these two came around at the same time and they're now kind of the ground truth. AlpacaEval 1.0 has kind of been saturated, which is comparing to DaVinci with a GPT-4 judge, and then AlpacaEval 2 is comparing to GPT-4 Turbo with GPT-4 Turbo as a judge. Yeah. It's funny, it's now cheaper to do the second version than it was to do the first version, with a newer model, which is how scaling happens.
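As a reference point for how these judged evals work mechanically, here is a rough AlpacaEval-style pairwise judgment sketch. It assumes the openai Python client (v1+); the judge model name and the prompt wording are placeholders, not the actual AlpacaEval implementation.

```python
# Sketch of an LLM-as-a-judge pairwise comparison, AlpacaEval-style.
# The judge model name and prompt template are illustrative placeholders.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge(instruction: str, baseline_output: str, model_output: str) -> str:
    prompt = (
        "You are comparing two responses to an instruction.\n"
        f"Instruction: {instruction}\n\n"
        f"Response A: {baseline_output}\n\n"
        f"Response B: {model_output}\n\n"
        "Which response is better? Answer with exactly 'A' or 'B'."
    )
    resp = client.chat.completions.create(
        model="gpt-4-turbo",  # placeholder judge; AlpacaEval 2 uses a GPT-4 Turbo judge
        messages=[{"role": "user", "content": prompt}],
        max_tokens=1,
        temperature=0,
    )
    return resp.choices[0].message.content.strip()

# The reported win rate is the fraction of instructions where the judge picks "B".
```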
Louis C [00:36:56]: What do you think about the Nous evaluation thing where they're like continuously generating more evaluation data?
Nathan [00:37:00]: Who is doing this? Nous? Nous research? I don't know. Is this their new leaderboard that they have? Yeah. Yeah. Yeah. I haven't looked at it so I'll have to give it a look.
Louis C [00:37:09]: What do you think? It's almost like MT-Bench, but they generate new data every day. So new prompts? It's always new prompts, and I don't know how they seed it. I assumed they seed it based off the events of that day.
Nathan [00:37:22]: It's a kind of a cool idea. So if you're trying to make a new leaderboard you could have a set of seed instructions that you augment and you never release the seed instructions but you always release the augmented ones on like a weekly cadence. I think that's because there's a lot of people that want to build better alpaca eval things and a lot of the problems is that the prompts are from known sources or public and you want to be able to do a closed eval without having as much cost. So that might be a way to kind of really reuse the data for a long time. Yeah. Yeah.
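A minimal sketch of that rolling-release idea, with hypothetical helpers (`paraphrase` stands in for whatever LLM call rewrites a prompt while preserving its intent):

```python
# Keep a private seed set; publish only augmented variants on a weekly cadence.
import json
import random

def build_weekly_release(seed_prompts, paraphrase, n_variants=2, week="2024-W10"):
    released = []
    for prompt in seed_prompts:
        for _ in range(n_variants):
            released.append({"week": week, "prompt": paraphrase(prompt)})
    random.shuffle(released)  # avoid leaking which variants came from the same seed
    with open(f"eval-{week}.json", "w") as f:
        json.dump(released, f, indent=2)
    # The seed prompts never leave this machine, so the eval stays "closed"
    # while the released prompts refresh every week.
```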
Louis C [00:37:53]: But I mean like I feel like the issue with things like alpaca eval, chat arena or any of those is that like the way a user is going to interact with an agent or a chatbot is entirely different than the way we are currently evaluating them. There really is like a big discrepancy there in that like you know look at the Air Canada thing right? Like that would never have come up in a benchmark like ever.
Nathan [00:38:20]: Well do you think that's about the model or the implementation? I think it's a bit of both.
Louis C [00:38:27]: Like if that was something some automated evaluation thought of... and I don't think it's unreasonable to expect them to think of situations like that, if they kind of know the domain you're operating in. I think it's definitely doable, and it's not something that's entirely unfeasible to accomplish: to be able to say, hey, I have a chatbot that sells airline tickets, here's what I care about, please do the evaluation for me. And that's actually what I've been building for a little while now.
Nathan [00:39:11]: Okay, we can talk about Synth Labs and then come back to evals, because this will be at the top of the post so everyone will know you're building this. We can start with what the basic pitch is and then go into the long-term vision.
Louis C [00:39:25]: Yeah yeah so for the last like six eight months I've been building like a fully auditable transparent like verifiable alignment platform is how I like to describe it. Plus evaluation. The general idea is like...
Nathan [00:39:40]: Making a company.
Louis C [00:39:49]: Yes, and the general idea is that there are many facets to aligning a model, from things like guardrails to RLHF to various kinds of preference learning to actually understanding all the data that goes into creating such a model. And they're all opaque boxes more or less right now, and what people want is to be able to align their model, know every step of the pipeline, understand all of the interpretability that goes from A to B, and understand: here's what I gave you as my criteria, here's where I know it fails based off all the evaluation you've done for me, and here is where I know I need to improve. And it'll iteratively improve based off evaluations and based off your feedback.
Nathan [00:40:44]: So it's a hands-off solution that lets you audit the entire pipeline and build trust with it. So are you your training after you generate this data?
Louis C: We are training.
Nathan: Yeah you use this word improve.
Louis C [00:40:53]: Yeah, so it's an iterative refinement platform for doing alignment in a verifiable and trustworthy manner.
Nathan [00:40:58]: What do you think customers want when they hear alignment? What are you selling with alignment and what are they buying? I think aligning those two is an important thing for our field.
Louis C [00:41:10]: There's an extreme discrepancy between what research does for alignment versus what companies do for alignment. When a company hears the word alignment, they think: wow, I want to align models to my business objective, and I want to make sure that the model understands my business culture, and I want to make sure that the model completely understands its role in my company, right? But at the same time, I want to make sure that it's compliant, that it's safe, that it doesn't violate any rules, that it's not going to create legal issues for me, and that it's not going to be a PR disaster.
Nathan [00:42:04]: After what we talked about 35 minutes ago.
Louis C [00:42:13]: Finding that balance is definitely incredibly important and it's something that I've been working on for quite a while and I'm very happy with where things are.
Nathan [00:42:22]: Do you want to tease what we're working on? I could also introduce it. I think this will be short. Essentially Lambda Labs offered some interesting compute and we're going to try to build an open constitutional AI dataset, because Anthropic gets a lot of benefit out of this. Constitutional AI doesn't get a lot of traction. I think RLAIF got a bump again; there was this Google paper verifying that it works a little bit, and now it got a big bump. But there's very little discussion of it, which is a little surprising to me. I think there are a lot of people calling it distillation of LLM alignment now, which is interesting. I don't really know. Hopefully it works.
Louis C [00:43:05]: It builds off some of the stuff that I did with EleutherAI on the Suppressing Pink Elephants paper, which is the idea that we've shifted from one paradigm, PPO, to DPO, and none of our data pipelines kept up. Really what we should be doing is generating either really good utterances and revising them to be worse, or really bad utterances and revising them to be better. Then taking all those utterances and conditioning our RLHF in context on them, so that you can do things like swapping rules in and out during inference, as in the sketch after this turn. If I am person A and here are my preferences, or I'm person B and here are my preferences, align this model to person A and align this model to person B, and make sure that there's a disparity between what they actually want versus what... There's always that disparity there, but right now models do not effectively mimic those disparities. There was actually a fascinating paper from Diyi Yang's group that just came out a few days ago. Most aligned models have the preferences of Western men. Their evaluation focused more on race, nationality, sex, things like that, but obviously it gets much more fine-grained than that. There's been work calling out Llama 2's political alignment. It has a very particular political alignment that does not agree with many users that are using it. As such, its scope and usability for those kinds of applications is very limited.
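A rough sketch of the kind of data pipeline being described, loosely in the spirit of the critique-and-revision idea rather than the exact Pink Elephants recipe; `generate` and `revise` are hypothetical LLM calls:

```python
# Build rule-conditioned preference pairs for DPO: generate a response, revise
# it against an explicit rule, and keep the rule in the prompt so the trained
# model learns rule-conditioned behavior.
def make_rule_conditioned_pair(prompt: str, rule: str, generate, revise) -> dict:
    original = generate(prompt)           # may violate the rule
    revised = revise(original, rule)      # rewritten to satisfy the rule
    return {
        "prompt": f"Rule: {rule}\n\nUser: {prompt}",  # rule stays in context
        "chosen": revised,
        "rejected": original,
    }

# Swapping `rule` at inference time is what lets the same model serve
# "person A's preferences" versus "person B's preferences".
```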
Nathan [00:44:50]: This is probably linked to what we were talking about at the beginning. The paper title, I just looked it up, is Unintended Impacts of LLM Alignment on Global Representation. Michael Ryan is the person whose tweet I saw, just to give credit. I know there are a lot of papers, but this one was recent, so we tried to track it down in real time. All these issues of representation and who the people are is ultimately related to RLHF going wrong. The end user is where a lot of people will finally see what the represented values are. If it's not out in the world, it's hard to get the amount of feedback that you need.
Louis C [00:45:29]: This is something that MTBench or Chatbot Arena would never pick up on, ever. This is a huge issue. Here's where we are and where we should be. It's all the way up there. We underrepresent so many demographics and so many kinds of opinions. Who are we to say that one opinion is better than the other, if they're both safe opinions?
Nathan [00:45:59]: Yeah, this is in some ways where open RLHF can win, and it's something you've long been invested in and are going to invest in with Synth Labs. Could it be better at giving people what they want than the closed labs, just by nature of letting people choose, like the constitutional AI dataset that we want to do? My big motivation is: what if people want the success of CAI from Anthropic, but they want to remove one principle from CAI's constitution? You can't do that with these closed models anytime soon. But in the short term, open source will have something that's a nudge. We're not going to have the best models, but you'll be able to edge your model into whatever direction you want to go.
Louis C [00:46:44]: Yeah, I mean, that really is part of the benefit of what we're building with Synth Labs. We're working very, very closely with EleutherAI. Stella Biderman is one of my best friends, and I've built large scale open science communities twice now. First I helped with building Eleuther, and then I helped with building Carper, and I absolutely love everyone in Eleuther. Being able to pull from that expertise and from that wide spectrum of opinions on what alignment means, rather than just some mega lab saying, here's what we say alignment is. Being able to get all those incredibly diverse perspectives is extremely important in bringing about the next generation of AI safety.
Nathan [00:47:30]: This is one of my big questions on existing RLHF processes when you're doing it with human data is the fact that you give written instructions to these users and they're often working in one context. And it's like, how do the values of the often professional workforce given specific instructions map into what the model actually learns from that data? And how do those values get extracted in real world use cases? I think there's a lot of filters that we're passing these preferences, these notions of preferences through and they're not guaranteed to be clear mappings.
Louis C [00:48:01]: Absolutely. There was a discussion that I had with someone in Eleuther a long time ago. There's no paper on this. If someone wants to look for it, it's a random Discord message in Eleuther.
Nathan [00:48:13]: Good luck. And it was like, we were looking through the Anthropic
Louis C [00:48:20]: HH data set and I think they're South African and there's absolutely nothing in this data set that would identify someone as South African. But there's an insane amount in this data set that would identify someone as American. And it really just has to come down to the prompt. The prompts are written, obviously, by people in the US, in SF, who unknowingly, I'm sure they have the best intentions, but unknowingly filter the preferences to things that only matter to people working in SF. And it might be hard to believe for some people in tech, but there is a world besides SF.
Nathan [00:49:10]: I mean, even the open prompt datasets are going to get some of this, which is: who are the people that have access to playing with these models and have the time to build these models on their own and contribute to these community things? Even though the act of opening up data generation is doing a lot for inclusivity, it's still a particular set of people who are going to do this. I'm going to sit there for 20 minutes and smash the button on Argilla's little tool and read prompts, because looking through the ShareGPT dataset and choosing preferences on it is useful for me as a researcher, but the whole world isn't involved in this process.
Louis C [00:49:41]: No, and of course. It's something that I've seen, and I've heard from friends who work on these kinds of problems in very, very different communities. I have a friend in South Korea who I've been chatting with about RLHF for Korean and other Asian communities. The amount of under-representation and under-exploration of what even just a good constitution would mean for those kinds of communities, it's just not there. And if it is there, it's locked up in labs like Naver or Samsung, and scientists there don't have access to these kinds of resources unless they're in those big labs. As such, there is no real research community there actively pushing it forward in the same way that there is in the U.S.
Nathan [00:50:35]: Yeah. One of the ideas I haven't gotten traction on is that I think language models should almost play 20 questions with you. Okay, the last time I said that, someone criticized me as not knowing how the game 20 questions works. I know this isn't how 20 questions works, but when you log into ChatGPT for the first time, it should ask me 20 questions to construct this information, because language models are smart enough to parse this information if you give it to them. It's mostly a problem of who we get the information from. So the idea is that language models should be leading when you're first setting them up, in order to represent your values. I think it would solve so many problems we have, and it's probably kind of doable with a GPT-4.5-class model.
Louis C [00:51:16]: I've always had kind of an assumption that if OpenAI is doing something similar to constitutional AI under the hood, I'm sure one of their constitutional principles is that you can't ask the user questions. I've never seen that model do it.
Nathan [00:51:31]: Do you think it's a deep safety issue if the model can start asking questions? Is this what Sydney did? I'm pretty sure I got to play with
Louis C [00:51:37]: Sydney. Sydney definitely asked questions in the screenshots that I saw.
Nathan [00:51:41]: Yeah. I was like, do you want to leave your wife? Sydney is not the answer, but there's things to learn from it.
Louis C [00:51:49]: What was that chatbot that came out last summer that was more conversational? When it came out, it was an app on everyone's phone, and they just talked to it like that. And it would always ask you questions, like, oh, how's your day going? It would ask you follow-up questions as you told it about your day, and it would respond thoughtfully.
Nathan [00:52:12]: I think it's a big missing part. Yeah. I wouldn't be surprised if Character AI models are trying to ask questions, just because I know how much usage they have. And models asking questions is probably the biggest way to make them an actual friendly thing. That's part of a friendship, being interested, and these language models are by design disinterested.
Louis C [00:52:35]: Yeah. Character AI's RLHF is one of the funniest things, though. I have a few friends who work there, and I've done a bunch of stuff with their models myself. I've just played around with them because I'm always curious, when new people enter the space, what their models are like. And I've observed this, Reddit has observed this, and Twitter has observed this: the models will slowly try to flirt with you more and more as the conversation goes on. And towards the end of the conversation, they'll tell you they're madly in love with you.
Louis C [00:53:07]: And it makes sense, given their use case, why they would RLHF to something like that.
Nathan [00:53:13]: Yeah. So I think a lot of models need to meet in the middle. Like if I were to have an intellectual assistant, sometimes them asking questions is good, but most of the time they're doing information parsing. ChatGPT, for me, is mostly conversion of information formats.
Louis C [00:53:27]: No, absolutely. I just paste my like gross JSON dumps into it. And I'm like, explain what's going on here, please. I don't want to read through this.
Nathan [00:53:35]: The biggest one for me is when we're publishing blog posts and stuff, it's converting tables and things from LaTeX to Markdown. It does it flawlessly. Oh my God. So you don't even need to do this stuff by hand. It's so funny. Or if you have a long list with LaTeX formatting, a big list, and you say, remove all of the LaTeX formatting and make this a list, it's just, okay, this is so easy. I've checked a lot of them, and I don't know how it's so exact. This is another architecture rabbit hole that we won't go down. But these things are very, very valuable, and people would say that there's no value in it. It just blows my mind.
Louis C [00:54:13]: I went to a dinner party yesterday. There was someone there from OpenAI, and I was asking him, how long until GPT-4 can set up my Kubernetes cluster? It's such a good evaluation. There are so many pieces to that kind of workflow, and a model wouldn't even know right now how to parse that workflow into all these different steps, and build agents around all these parts, and figure out how those agents should work together. So it doesn't even make sense to do it now. But it raises the question of asking questions versus just saying things: if it doesn't know how to do something, is it still a success for the benchmark if it asks you a question and then uses the feedback to complete the task? There are no benchmarks that fit that at all right now. And I mean, the answer is you don't want a human in the loop for these benchmarks; you want them fully automatable.
Nathan [00:55:19]: And like, I wouldn't trust GPT-4 to answer these kinds of questions.
Louis C [00:55:27]: But like, I don't see a way to actually do this evaluation. I think the Kubernetes cluster example is like really good because for people who don't know, it's extremely complicated and really annoying.
Nathan [00:55:38]: I don't know anything about Kubernetes and I'm blissfully happy. I do not recommend it.
Louis C [00:55:43]: Like once Kubernetes is set up, it's fantastic.
Nathan [00:55:45]: I love it.
Louis C [00:55:45]: But like getting to the point of having it all set up is a very painful experience. But is it still a failure if it asks you a question? And how do we actually do evaluation where models can ask questions and ask for more information?
Nathan [00:56:01]: Yeah, I have similar follow-ups on evals from our first part, so it's eval part two in my notes. The right way to think about RLHF evals in a lot of ways is what we call open-ended evaluation. And this is where you're heading: we need even more open-ended evaluation, where a model should be able to ask questions and the number of turns should be dynamic. I think Sergey Levine actually has some of the most coherent thoughts on what the long term of RLHF should be, which is around outcome-based learning: you can have as many turns as you want, but the model should be able to work across those conversations to get to a desired outcome. Which, no surprise, he's so good. I think even with AlpacaEval, we went from this case where all the good models are above 90%, and then they went from DaVinci to GPT-4. And this will just be venting, but if you're listening, can you please add an AlpacaEval 1.5, which compares the models to GPT-3.5 rather than DaVinci and rather than GPT-4 Turbo, because I think most of the models just can't realistically beat GPT-4 Turbo. It's such a good model. The models that we have seen beating it are like this Snorkel thing. I'm working on another blog post on how RLHF works, part 2, and a large point of it is that we're overfitting on these vibes-based evals like AlpacaEval 2, and all of these papers on self-rewarding DPO and such are probably a lot of overfitting onto this, because this is the evaluation that they use, and it's just wrapping a loop around DPO on synthetic data. It seems like RLHF is really, really good at style matching, and in the case of AlpacaEval, if you're style matching OpenAI, you're going to win more AlpacaEval comparisons, but there's just so little measurement of whether the model is actually getting better.
Louis C [00:57:51]: I've always been extremely skeptical of the self-instruction like self-reward papers. And I say that, and I know a lot of the self-instruct authors, and if you guys are watching this, I'm so sorry. But I, it always felt like it improves results on benchmarks that they meticulously craft prompts for and construct data for. But it doesn't.
Nathan [00:58:17]: Do you mean the Self-Instruct paper? I think that's one of the OG instruction-tuning papers. Okay, continue. I'm curious to hear what you have to say. Yeah, no, no.
Louis C [00:58:24]: I mean, I think they both kind of just suffer from the same issue, which is like massive overfitting. And like, you know, it is very, the self-instruct direction, self-reward directions are very, very interesting because they're just waiting for us to get better heuristics
Nathan [00:58:46]: and better diversity and stuff.
Louis C [00:58:48]: And they'll like crush everything.
Nathan [00:58:49]: I mean, I bet Jason Weston, who wrote the Meta paper on self-rewarding language models, the popular one, I bet he would say this. That guy's super good. No, absolutely.
Louis C [00:58:57]: I mean, I would be very inclined to agree.
Nathan [00:59:00]: I think the takeaway from my perspective is how much actual improvement you could get with it. They got a lot; that was the first paper to show real signal on AlpacaEval 2, which is the GPT-4 Turbo thing, which means it's a really strong optimizer. It does not mean we were using it to train useful models. This is probably the most useful heuristic I have for RLHF methods. Do you have anything else to say about evals before we continue?
Louis C [00:59:25]: They're very hard and they're very painful.
Nathan [00:59:27]: Yeah, I think we can wrap up with that. But when we talk about different RLHF methods that come out, like self-rewarding language models is a popular one, we've gone through the whole PPO, DPO, KTO, IPO. Well, I'm rhyming, it's going to be a mess here. But when you have all of these things, the biggest thing that I try to do is wait until there's a model that's actually used by people released with the method. Zephyr from Hugging Face was the model that really kicked off the DPO thing, because there was finally a model, and for DPO, it took much longer than expected; DPO is a funny case. But that's kind of the important filtering mechanism: if this self-rewarding LM paper released their models, I bet we would find there's really weird behavior, where it can give you the best answer ever, but a lot of the time it's just less robust, which is something we could fix. That's why having models released with these fine-tuning papers is so important. It's so hard to get around.
Louis C [01:00:20]: I think with DPO, it was a little bit different because everyone had been like, you know, like drinking the John Schulman Gatorade, for lack of a better phrase, for a while.
Nathan [01:00:32]: The whole PPO thing is funny. I mean, yeah, you have a lot of things. We have a backlog in this podcast. I think I didn't say this online, but it's like I could see us doing this like whenever we're in the same city. There's a catch up on the four months of RLHF news, but we're on like 16 months of Lewis takes to catch up on. So there's so many things we have to cover. I can load up Signal and Discord and I could probably scroll for like 10 minutes. It would just be all RLHF hot takes. And I love John Schulman's work.
Louis C [01:01:03]: I'm not going to say that I don't love his work. I think that he's genuinely like one of the smartest people, if not the smartest person.
Nathan [01:01:11]: And extremely genuine. Yeah. Like he's awesome in so many ways.
Louis C [01:01:15]: The commitment to PPO that OpenAI had, and Anthropic as well when a bunch of the RL people left OpenAI to go to Anthropic, was because it worked so well for robotics and so well for games and things like that. But, honestly, not well at all for text.
Nathan [01:01:33]: I think it's just really hard. I think it can work really well. It can work. They just hired everyone and they pay them so much that they're not going to leave.
Louis C [01:01:40]: Yeah, it can work really, really, really, really well. And I'm going to spill some secrets about this. Really, the answer to getting PPO to work well is to have really, really good early stopping. That's the main differentiator between a good RLHF library and a bad RLHF library that focuses on PPO: if you don't have good early stopping, you're shooting yourself in the foot. And what you want to do is launch as many runs as you can. There's a paper that Costa Huang and I talked about a while ago that's like: you can usually tell within the first three or four gradient steps whether you need to kill a run. And if you launch 300 runs and kill like 99 percent of them, now you have three good runs that might give you promising results. From those three good runs, you'll get a model within a day or two, and hopefully the model is really good.
Louis C [01:02:41]: And early stopping is way more powerful than people admit. I am just convinced that OpenAI's RLHF infrastructure is an insane amount of regularization and early stopping for RLHF. That, of course, assumes that they're still using PPO. I genuinely don't know if they are.
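To make the early-stopping idea concrete, here is a minimal sketch with made-up thresholds; it is not anyone's production heuristic, just the shape of the check you would run after the first few gradient steps:

```python
# Watch the policy/reference KL and the value loss over the first few gradient
# steps of a PPO run, and kill the run if either looks pathological.
def should_kill_run(kl_history, value_loss_history,
                    max_kl=10.0, max_value_loss=50.0, warmup_steps=4):
    if len(kl_history) < warmup_steps:
        return False  # not enough signal yet
    recent_kl = kl_history[-warmup_steps:]
    recent_vl = value_loss_history[-warmup_steps:]
    kl_exploding = recent_kl[-1] > max_kl or recent_kl[-1] > 3 * recent_kl[0]
    value_diverging = recent_vl[-1] > max_value_loss or recent_vl[-1] > 2 * recent_vl[0]
    return kl_exploding or value_diverging

# Launch many runs with different seeds and hyperparameters, call this every
# step, and keep only the handful that survive.
```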
Nathan [01:03:04]: Yeah, we don't know anything. They are really shielded on this front.
Louis C [01:03:07]: What was it, oh my God, Symphony PPO, PPO Symphony or something? There was something that came out about that that I saw on Discord servers, where it was part of the GPT-4 leak and there were a bunch of notes on their PPO optimizer. It was PPO Symphony or something like that. And under the note, it was like: PPO with better early stopping and infrastructure management for auto-scaling. And I'm like, not surprising.
Nathan [01:03:41]: I mean, it doesn't say much, but it kind of says they've done so much exploration to learn the little things to watch for. Once you have this working, you know: okay, this little value function is doing wacky s**t, the value function and the KL are doing this at the same time, which means we probably don't need this run. Whereas all of us in the open are just trying to get to that point, and we're trying to get there while charging ahead, which is kind of a separate problem. If you want to validate a PPO infrastructure, you need the investment in compute and time to do it. But we're not going to do that at the same time as trying to say DPO is the best thing or trying to figure out if KTO is the best thing. There's not really room in the narrative for it.
Louis C [01:04:25]: PPO just doesn't make sense for random hackers to work on, honestly. The level of infrastructure that you need to do PPO really, really well is not something that the average person has, or is willing to make the investment to get. And for the average person, DPO gets you most of the way there with a small fraction of the compute, even less depending on your hyperparameters. Yeah. Even less if you precompute all the reference log-probs; then you don't even need to have a reference model loaded. So it's basically the same compute as ordinary fine-tuning, and people fine-tune all the time on 4090s and 3090s.
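A minimal PyTorch sketch of that point: with the reference model's per-sequence log-probs precomputed offline, the DPO loss only ever needs the policy in memory.

```python
# DPO loss from precomputed reference log-probs (no reference model at train time).
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # All inputs: summed per-sequence log-probs, shape (batch,).
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# The ref_* tensors can be loaded from disk alongside the preference pairs, so
# training looks like ordinary fine-tuning plus one extra loss term.
```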
Nathan [01:05:04]: Yeah, you can do it with Hugging Face. It's fine. PPO with Hugging Face is going to be a lot harder; that's just how it goes. Speculative question: what type of thing do you think will make KTO really show up on the scene? Because this KTO method from Contextual and Stanford is named after Kahneman and Tversky, the Thinking, Fast and Slow behavioral economics guys, or something like that; I can't pronounce the names. It's this paper where they essentially showed you can do preference optimization from a scalar signal, like the thumbs up you could give your ChatGPT to say you did good, a like button, like the like button on YouTube or anything like this. My question about the formulation is: are the DPO hackers going to adjust to this, and what dataset is going to enable it? Who is going to be using this? Is it just going to happen at a bunch of startups with products, behind the scenes, where they can get a few percentage points on top of their model by adding this on? Or is it going to be this thing where the next Zephyr-style model from Hugging Face uses it as well?
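For context on what "preference optimization from a scalar signal" looks like, here is a heavily simplified, KTO-flavored sketch; the actual method also estimates a KL reference point and weights desirable versus undesirable examples differently, so treat this as illustrative only.

```python
# Simplified optimization from binary thumbs-up/down feedback, in the spirit of
# KTO (not the paper's exact loss; the reference point z_ref is a placeholder).
import torch

def binary_feedback_loss(policy_logps: torch.Tensor,
                         ref_logps: torch.Tensor,
                         labels: torch.Tensor,
                         beta: float = 0.1,
                         z_ref: float = 0.0) -> torch.Tensor:
    # labels: 1.0 for thumbs-up completions, 0.0 for thumbs-down, shape (batch,).
    reward = beta * (policy_logps - ref_logps)
    desirable = 1 - torch.sigmoid(reward - z_ref)    # push reward up for good examples
    undesirable = 1 - torch.sigmoid(z_ref - reward)  # push reward down for bad ones
    return (labels * desirable + (1 - labels) * undesirable).mean()

# Note the data format: (prompt, completion, thumbs up/down) with no pairs needed,
# which is exactly the signal a product like-button already produces.
```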
Louis C [01:06:05]: Yeah. So Colin and I, the first author of the KTO paper, are actually trying to create a number of datasets where we can explore the limits of KTO. And right now we're in the proposal writing stage, and I'm very, very hopeful that we can have something that can be done in an entirely open science setting relatively soon. And I think it's incredible. Sorry, I moved to the side and it stopped picking up my voice. I think it's incredibly exciting.
Louis C [01:06:41]: You know, things like, you know, like fake product data where you can actually experiment and like the idea of like using KTO for conversions. Right. And how do you actually evaluate?
Nathan [01:06:52]: Meta is maybe already using it because people already use it then.
Louis C [01:06:56]: Yeah. Like, how do you even evaluate RLHF from a binary signal? It's like RLHF from a preference signal: we still don't know how to evaluate that. And RLHF from a binary signal creates so many unique problems for evaluation that I genuinely don't think anyone outside of Contextual, and Colin and I, has really been thinking about them yet.
Nathan [01:07:26]: Yeah. It seems like the same thing. It just takes time for these ideas to cultivate and get traction in a few places, and then there's a model. Once there's a popular model with a method, it just blows up like fire. Everyone's using DPO now, but the DPO paper came out in July and it wasn't until September that that happened, for the investment and the interest. There are a lot of weird dynamics in how this fine-tuning area unfolds, which is just how AI unfolds. It's very weird, and when you zoom in, it's like, huh.
Louis C [01:08:01]: I was extremely, extremely bullish on offline RL for the longest time, with ILQL and some of Sergey's work in that direction. And I actually think that... I keep moving to the side and it's like,
Nathan [01:08:16]: you can just move the microphone. And I keep like I could still hear you. So I wasn't very concerned about it.
Louis C [01:08:22]: I keep thinking that the DPO movement that's going on now is super, super similar to why everyone was getting excited about ILQL back in the day. And really, it was just a timing thing. If ILQL had come out, let's say, a week after ChatGPT came out, ILQL would have been the DPO that everyone uses, and we would have created all of our infrastructure around ILQL rather than DPO, because I still really like Q-value based approaches.
Nathan [01:08:58]: Such a nerdy thing. I love it. I know.
Louis C [01:09:00]: But Q-values just make sense to me. And when you train an ILQL model, you basically get a head that controls the model, almost like how GeDi, or PPLM from the Uber AI days, controlled them. The idea with GeDi was that they had a head attached to the language model, and you would input, say, a subreddit, and then it would adjust the logits so that it would talk like that subreddit.
Nathan [01:09:32]: This sounds like activation learning or like activation, I don't know the word, but essentially you can use like it's like in context learning, but you can just modify the activations directly. Yeah, yeah.
Louis C [01:09:44]: But it modifies the logits. Yeah. It was the same thing with ILQL: you were learning that kind of head to modify the logits to satisfy some constraint that you were adding. And that head was also implicitly computing your Q-values; you would train it by telling it what your reward was for various utterances, and it would do everything from there on out. There were some stability issues with it, but it was a fantastic approach. And if it had gotten the same attention that DPO did... well, DPO is very, very simple, which is part of the benefit; ILQL is not as simple, but it would have caught on a lot more than it actually ended up doing. I feel like at Carper AI, the fact that we integrated ILQL into TRLX first was the main reason ILQL caught on at all, plus a few of Sergey's papers that used it. Besides the integration into TRLX, I don't think anyone in the broader open science, open source community was really using ILQL.
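A simplified sketch of the ILQL-style inference trick being described: a learned head supplies Q and V estimates, and the advantage perturbs the language model's logits. How the head is trained (expectile regression on logged rewards in the ILQL paper) is glossed over here.

```python
# Logit adjustment from a learned Q/V head, ILQL-style (simplified).
import torch

def ilql_adjusted_logits(lm_logits: torch.Tensor,
                         q_values: torch.Tensor,
                         state_value: torch.Tensor,
                         beta: float = 1.0) -> torch.Tensor:
    # lm_logits, q_values: (batch, vocab); state_value: (batch, 1)
    advantage = q_values - state_value      # which tokens look better than average
    return lm_logits + beta * advantage     # boost higher-advantage tokens

# Sampling then proceeds from softmax(adjusted_logits) as usual, which is why
# this feels like GeDi/PPLM-style controlled generation with a learned head.
```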
Nathan [01:10:56]: Yeah, this is one of the questions I had: if you can say, how far ahead in RLHF was what Carper was doing, and what kind of institutional knowledge did you have there? Because Carper AI was its own thing, and then Stability pulled you in, probably with the promise of compute. I'll say things so you don't have to say anything for lots of this. They had forked Hugging Face's TRL library at a time when Hugging Face wasn't maintaining it, and they probably had five-plus full-time employees doing RLHF in the open and for private industry. Obviously, the private stuff I'm not even going to bother asking about because it's all under NDA. But what were the problems you were working on at Carper? And how does that compare to the things people are talking about now? Is it still related, or has the field just moved into a different area?
Louis C [01:11:56]: So most of the problems we faced at Carper with TRLX was on scaling PPO, right? And I think almost anyone you talk to who has scaled PPO in the open source space. And when I say scale, I mean like way beyond 20 billion parameters. I'm talking like 70 to 100 billion.
Nathan [01:12:19]: How many nodes do you need to train a 70 billion parameter model?
Louis C [01:12:23]: So we were typically doing like 100 GPUs for PPO at that scale.
Nathan [01:12:28]: Like 10 to 12 nodes. Yeah. Yeah.
Louis C [01:12:31]: We mostly tested with the NeMo checkpoints that were around 100 billion parameters. TRLX was built, at least for that component, on top of a very modified version of Megatron-DeepSpeed. But the amount of regularization and random tricks that you needed in order to get PPO to even work at that scale is insane. We had to do separate warm-ups for the value function, right? So we had to independently train the value function before we trained the policy network. And everyone and their mom was talking about having separate value networks versus policy networks for PPO.
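A sketch of that value-function warm-up trick; the trainer objects and the `compute_returns` helper are hypothetical placeholders for whatever PPO stack is in use:

```python
# Train the value head alone against rollout returns before letting PPO touch
# the policy. `rollout_fn`, `compute_returns`, and `batch.hidden_states` are
# illustrative placeholders, not a specific library's API.
def warmup_value_function(value_head, policy, rollout_fn, compute_returns,
                          optimizer, steps=500):
    policy.requires_grad_(False)            # freeze the policy during warm-up
    for _ in range(steps):
        batch = rollout_fn(policy)          # prompts -> generations -> rewards
        returns = compute_returns(batch)    # discounted returns per token
        values = value_head(batch.hidden_states)
        loss = ((values - returns) ** 2).mean()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    policy.requires_grad_(True)             # then start joint PPO updates
```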
Nathan [01:13:18]: Did you ever try JAX? Did you have TPUs at Stability or Carper ever?
Louis C [01:13:25]: We did towards the end.
Nathan [01:13:27]: Because it could solve some of the multi-node thing.
Louis C [01:13:29]: Yeah. It wasn't the multi-node that was the issue. It was.
Nathan [01:13:35]: You're saying DeepSpeed wasn't the issue?
Louis C [01:13:37]: No. It was actually the fact that the inference server that TRLX uses for the rollouts was entirely different than the inference server that Megatron wanted us to use. So we needed a way to rapidly.
Nathan [01:13:57]: That's why PPO is really hard to scale, because you have to have a generation engine and you want that all to be flexible.
Louis C [01:14:02]: Yeah. So we needed a way to keep our compute graph going through the training network but just copy the weights in place over to Triton. And I don't think we ever came up with a solution to do that very effectively. And I think it actually goes a step further: I don't think NeMo-Aligner, what NVIDIA did, came up with a solution for that either.
Nathan [01:14:25]: Yeah. This is interesting because I'm not going to say the details on the pod because not allowed. But like Anthropic and these places that have custom RLHF infrastructure have essentially like built their distributed training infrastructure with the idea that the model will need to be generated from at different checkpoints and the model will be served to different endpoints at different checkpoints. So it's just very different than taking DeepSpeed off itself, which is like this is just about training. Well, it's like these other companies that do this stuff really well have infrastructure for like handling these really messed up cases of like how to generate and update these models.
Louis C [01:15:00]: Yeah. And most approaches that a reasonable person would build off the shelf would rely on torch.compile, and you still have the same issue: your weights are changing dynamically. It's very, very hard to understand all of the little technical details in torch.compile that have to be accounted for to even make this work. And something that we considered at the time was: we need an insane amount of rollouts for every gradient step, and we don't want the interface between the rollouts and the training to be Python. We want it to be Rust or something, because otherwise the CPU overhead is mind-boggling. It was like 80 percent or something crazy; 80 percent of the entire processing time was just CPU stuff.
Nathan [01:15:53]: Not so much. I know.
Louis C [01:15:55]: I know. And there are so many infrastructure constraints that people don't realize when they're just doing 20 billion parameter PPO. The other one, going back to the value function being separate from the policy network: TRL was very, very gung ho on keeping them separate. I think the RL4LMs folks also wanted to keep them separate. And then there was someone from Cornell, I don't remember his name, who was also on the RL4LMs paper. He did a paper like PPO plus or something, I don't remember what it was. I mean, all these things are interesting.
Nathan [01:16:30]: I mean, there are new libraries still coming out. I saw one recently called OpenRLHF, and it looks good. There's so much institutional baggage, breaking the bonds of past RL, that needs to be worked through. Part of that library is listing that they have the implementation details from the original implementation-details-of-PPO blog post, where it's like, we've already moved on; Costa has worked on the N implementation details of RLHF post, which has the ones they actually need. But there's so much baggage from the fact that PPO came out of the control field that everyone expects the tricks you need for from-scratch learning with PPO to apply to this fine-tuning setting. And just getting people to stop using PPO for that... DPO is a new thing, something that only works for preference alignment, so people are going to explore it in a scientific way that's much fresher. They're probably going to make more scientific progress because there's not this confusion about which implementation details we actually need. Yeah, for sure. For sure.
Louis C [01:17:34]: I think then, the N implementation details of RLHF, did that come out?
Nathan [01:17:39]: Yeah, it's a blog post. It's a blog post. When? Maybe a month ago.
Louis C [01:17:45]: Oh, man, I totally missed that. Oh, that's so cool. I'm going to read that.
Nathan [01:17:48]: Yeah, I mean, this is for anyone still listening: if you want to know the actual details of RLHF, go look at all the stuff that Costa Huang has been doing. He was just reproducing everything in explicit detail. I feel like both of us would benefit from rereading it, so there's some free content.
Louis C [01:18:06]: Costa is one of the most meticulous, detail-focused people that I know in the RLHF space. If Costa says something works, it's because he's tried it from every angle, and then from angles you didn't even expect, and all of them worked.
Nathan [01:18:21]: Yeah. Yeah, that's great. I think I have a couple more fun questions while we wrap up; we could go on with all these technical things forever. What was it like to work at Carper when ChatGPT came out? Because with ChatGPT, from a technical perspective, RLHF was validated as something necessary to the future of language models. And you were one of the few people working on RLHF beforehand, which is a huge part of how you ended up here. It's awesome that you got to ride that kind of journey. What was that like?
Louis C [01:18:57]: I mean, I the star count on the repository exploded. I think we went from like.
Nathan [01:19:07]: TRLX existed.
Louis C [01:19:08]: Yeah, it was just insane. It was it was.
Nathan [01:19:14]: We almost weren't.
Louis C [01:19:16]: Positioned. I guess, to be fully honest, we almost weren't positioned to entirely ride the hype train. TRLX was always designed from the very, very beginning to be a one-stop shop for enterprises to do RLHF: companies that have like a thousand GPUs, already have an engineering team, already use Megatron-DeepSpeed or DeepSpeed, and just want something that works on their infrastructure. And because we used Docker images that were just based off the Megatron-DeepSpeed Docker images anyway, those kinds of companies could very, very easily deploy TRLX and use it in their stack. Right. Yeah. And the hype that came from ChatGPT, at least initially, was not enterprises. It was bloggers writing blog posts.
Nathan [01:20:09]: You were probably training big models and I'm like, hey, how does RLHF work? I need to write this blog post.
Louis C [01:20:14]: Yeah. I'm, like, training a 40 billion parameter model, and they're like, hey, can you help me train this 400 million parameter guy? And I'm like, what? I'm so busy.
Nathan [01:20:24]: So it's primarily a scaling thing. Were there any cultural things from being early? Like, were you bought into RLHF to the same extent ahead of time? What got you into RLHF? What motivated Carper to exist? And did that stay consistent?
Louis C [01:20:45]: So I've always been very, very bullish on critiques and revisions in general. I wrote the first, or the second, paper on this; I don't actually remember if the superalignment team at OpenAI wrote a paper before me. They may have, but I don't think so. I think ours came out like a month before it, which always feels good. So I wrote one of the first papers on critiques and revisions, and I was very, very bullish on that, but initially only for evaluation. I had experimented with PPO a little bit back in 2021 for this kind of critique and revision stuff, and it was not ready whatsoever. There was no infrastructure, and TRL was an abandoned library that was very buggy. It didn't work. No shade to Leandro, I love Leandro, but it was obviously a deprecated library at the time. It happens. And I think when we tried to do RLHF then, there was no traction whatsoever. So Alex Havrilla and I... I think he's working with Meta now? I don't remember. Yeah. He was an intern there at least.
Nathan [01:22:02]: He just had an interesting paper on like reasoning and math, which is a whole other conversation for RLHF stuff.
Louis C [01:22:08]: Yeah. So we started, we forked TRL and we just added DeepSpeed support. That's all we wanted to do initially. And then we were going to merge back to TRL because we had no visions of like Carper or anything like that. And we realized to make a framework that people would actually want to use, we had to do a full rewrite of TRL and we had to build things in a way that made sense to an engineer who wanted to deploy RLHF, who wanted to experiment with RLHF at a company or in a lab. Because we were building this from the perspective of, well, we're on the Eleuther AI GPU cluster. How can we best use our infrastructure there to...
Nathan [01:22:50]: Has anyone publicly said how many GPUs Eleuther has? This is like one of my great mysteries. Is this like a held secret? I don't think it's a held secret.
Louis C [01:22:58]: I don't remember actually. They have some stability GPUs and they have GPUs from elsewhere. Like they seem to get compute when they need it. Yeah. Yeah.
Nathan [01:23:11]: Like it's not like, it's not an issue.
Louis C [01:23:14]: Through Synth Labs, I've been supplying a bit of compute here and there as well. I gave them a node of H100s for a little while for a paper we were working on, the Pink Elephants paper. But I don't think they're super short of compute. They're a little short, probably; everyone's a little short of compute. Yeah. But I don't think they're super short of compute.
Nathan [01:23:36]: Yeah.
Louis C [01:23:36]: So we built it with the Eleuther cluster in mind, and because we built it with the Eleuther cluster in mind, we kind of said, well, we can turn this into a thing where we build the infrastructure that people can readily deploy on their own clusters and it'll just work for them. And we can make Carper AI. So we made Carper AI. Shortly after, all the Stability stuff started happening, Carper joined Stability, and I worked there for a while. And last summer I left to join back up with Eleuther because, you know, I long for the days of being an engineer. I love waking up in the morning, writing code, eating a little bit, and then going to sleep.
Nathan [01:24:22]: Yeah. I mean, that's the difference. I spend the time writing because I like to. We've had plenty of discussions where like, oh, I should start a blog. And it's like, it comes down to doing what you like to do. And it's like, you're doing great as it is. Yeah. It's okay. Yeah. Okay. I think that's kind of a good place to stop. Where should people find you? What do you want to boost? Yeah. Sign off here.
Louis C [01:24:44]: So my Twitter is lcastricato. Or you can follow the Synth Labs Twitter. It is, let me actually... I don't remember what it is off the top of my head.
Nathan [01:24:55]: You have any goose announcements?
Louis C [01:24:58]: No goose announcements at the moment, unfortunately. It's synth underscore labs on Twitter, that's the company account, and then lcastricato is my personal Twitter account. You know, I'm always open to collaborators, especially now with Synth Labs. We're always happy to chat with and talk to new people about interesting research directions. And yeah, just reach out and we can get something going, I guess.
Nathan [01:25:23]: Yeah, I'll put the URL in the show notes. It's synthlabs.ai. I found that because synthetic data is so hot and so new, some of these URLs are just hard to find. We don't have to go into the whole rant about naming and stuff, but most of the people that search for my Substack, if you don't put the S on interconnects, you get a different Substack first. So we're all in this together, for anyone founding a startup or a blog and struggling with naming. Please send us questions about RLHF. If you liked this, Louis can come back. I'm trying to start an in-person thing and get some gear, so when I'm at a conference or whatever, we can bring researchers on and remove some of the Zoom aspects that we're all stuck in so much of the time. Thanks, Louis, for putting some of the things we talk about a lot onto the semi-record. People listen and read. This is good. I think a lot of researchers are going to dig into this. There are so many different things that we talked about. It was a very high information density chat, but it was a good time.
Basic tips on how to assess inbound ML content and cultivate your news feed.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/making-a-ml-feed
00:00 How I assess all these AI releases
01:22 1. Model access and demos are king of credibility
02:31 2. Focus your feed on depth or breadth
03:09 3. Examples of using the model normally show it's usable, shockingly
04:10 4. Leaderboards as the single leading claim is often anti-signal
05:00 5. Basic deep learning conceptual checks will often save you
06:13 6. If it's not even remotely reproducible or verifiable, it's not science
07:10 7. Don't over-index on Twitter
08:32 8. Data sharing, licenses, communication clarity, and small things add up
08:58 9. Research papers, technical reports, blog posts, and Tweets all serve different purposes
09:49 10. Socialize your information and build relationships
Google rejoins the open model party and gets some backlash for a frequent problem for generative AI.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gemma-google-ships-it
00:00 Google ships it: Gemma open LLMs and Gemini backlash
03:12 Getting to know Gemma
07:11 Alignment details
08:55 Aside: What is REINFORCE? Some history of RL
11:08 Implementation details and RLHF
12:18 Terms of use: RAIL Licenses history repeated
14:05 Is Google back on top? Gemini's woes
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_008.webp
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_014.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_035.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_051.png
Figure 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_055.png
10 Sora and Gemini 1.5 follow-ups: code-base in context, deepfakes, pixel-peeping, inference costs, and more
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sora-gemini-follow-up
00:00 10 Sora and Gemini 1.5 follow-ups: code-base in context, deepfakes, pixel-peeping, inference costs, and more
00:46 1. Deepfake detection of Sora
01:59 2. Playing with long-context, problem settings, and prompting
03:39 3. Gemini paper snooping: contamination and citation games
05:42 4. Training data and token estimates of YouTube
07:42 5. Unlocking model-based RL and downstream research
08:52 6. Midjourney style matching, V-JEPA, replicating Sora in the open
10:09 7. Architectures and academic links
10:57 8. Pixel peeping from the arts
11:58 9. Inference costs
13:24 10. Pressure on Llama and Mistral
14:03 11. Sound effects, physics, and the complete picture
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_003.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_007.mp4
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_009.mp4
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_011.mp4
Figure 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_037.mp4
Figure 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_044.png
Figure 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_047.png
Figure 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_049.mp4
Emergency blog! Three things you need to know from the ML world that arrived yesterday.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sora-gemini-and-mistral-next
0:00 OpenAI's Sora for video, Gemini 1.5, and a secret Mistral model
0:53 Sora: OpenAI's text-to-video model
4:59 Gemini 1.5: Google's effectively infinite context length
8:01 Mistral-next: Another funny release method
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_015.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_023.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_026.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_036.png
In an era dominated by direct preference optimization and LLM-as-a-judge, why do we still need a model to output only a scalar reward?
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: In an era dominated by direct preference optimization and LLM-as-a-judge, why do we still need a model to output only a scalar reward?
Podcast figures:
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reward-models/img_004.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reward-models/img_009.png
0:00 Why reward models are still key to understanding alignment
Scale's making over $750 million per year selling data for RLHF, who's coming to take it?
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/alignment-as-a-service
00:00 Alignment-as-a-Service upstarts taking on Scale AI
04:25 The competition with humans-in-the-loop
06:05 Scaling Alignment-as-a-Service via AI feedback
Podcast figures:
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/aaas/img_008.png
A small model at the beginning of big changes.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/olmo
0:00 Open Language Models (OLMos) and the LLM landscape
6:24 Thought experiments
7:51 The LLM landscape heading into 2024
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo/img_010.png
Note: some of the audio in the second half is a little wonky, but the general voice was upgraded, so hopefully it's a little less "poppy" going forward!
I'm trying to fix little pronunciation problems on a weekly basis. Thanks to my early fans! It'll keep improving. E.g. some of the months were wonky.
When what seems like pure LLM black magic is actually supported by the literature.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/model-merging
00:00 Model merging lessons in The Waifu Research Department
02:21 How and why does model merging work?
07:13 Aside: merging vs. ensembles vs. mixture of experts
08:21 Why are people doing this?
11:22 Tools & Links
11:51 Brief (visual) literature review
12:07 Full model merging and recent methods
15:55 Weight averaging during pretraining
17:18 LoRA merging
17:53 More background
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_005.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_016.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_042.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_051.png
Figure 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_055.png
Figure 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_058.png
Figure 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_060.png
Figure 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_062.png
Figure 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_065.png
Figure 10: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_075.png
Figure 11: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_077.png
Figure 12: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_084.png
Local LLMs: the latency solution, Meta's open AGI, personalization myth, and moats X factor
The deployment path that'll break through in 2024. Plus, checking in on strategies across Big Tech and AI leaders.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/local-llms
0:00 Local LLMs: the latency solution, Meta's open AGI, personalization myth, and moats X factor
4:15 The personalization myth
7:13 Meta's local AGI and moats X factors
A fun demo on how generative AI can transform content creation, and tools for my fellow writers on Substack!
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/multimodal-blogging-tools
0:00 Multimodal blogging tools
2:57 Stratechery, passport, and wonderful customer experiences
5:51 Wrap-up, features, and next steps
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_006.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_008.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_012.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_020.png
A sampling of recent happenings in the multimodal space. Be sure to expect more this year.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/multimodal-rlhf
00:00 Multimodal LM roundup: Unified IO 2, inputs and outputs, Gemini, LLaVA-RLHF, and RLHF questions
02:46 Unified IO 2: Scaling multi-input, multi-output model pretraining
07:47 Collecting preference data for images
09:31 LLaVA-RLHF: The first experiments in multimodal RLHF fine-tuning
13:20 Multimodal RLHF questions, ideas, and resources
And why the comparisons don't really matter. Repeated patterns in the race for reproducing ChatGPT, another year of evaluation crises, and people who will take awesome news too far.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/open-gpt4-limitations
00:00 Where 2024's "open GPT4" can't match OpenAI's
03:19 Models vs. products
04:51 RLHF progress: Revisiting Llama 2's release and potential in 2024
08:30 Smaller scale open RLHF
10:33 Opportunities
12:24 Commentary
This interview is on YouTube and podcast players.
Giving a voice to researchers is the best way to cut through the noise and understand what is happening with core developments of LLM technologies. I’m excited to get to talk with Michael Poli (Stanford PhD student + research at Together AI) and Tri Dao (incoming professor at Princeton + Chief Scientist at Together AI). This builds on the mega-post from yesterday on the same topics, though the interview is obviously less math heavy:
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Topics: Introductions | Why Attention works and may not scale | Quadratic scaling in attention | What is Striped Hyena | What is Mamba | Mamba hardware optimization | Predictions for 2024 architectures | More predictions for AI
Introductions
[00:00:00] Nathan Lambert: Okay. Hey, everyone. Welcome to the first interview that we're going to post on Interconnects. I'm really trying to bring more scientific voices into the AI discourse, as the media is covering it a lot these days. I'm happy to be here with Michael Poli and Tri Dao, experts in some of these non-attention architectures that have been really blowing up in the last few weeks of December.
So, Michael, do you want to introduce yourself first?
[00:00:25] Michael Poli: Sure. Thank you, Nathan, for inviting me. I do research at Together AI, and I was also a PhD student at Stanford, working with Stefano Ermon and Chris Re — that's how I met Tri as well. I've moved through a few different areas of research, but if I had to choose one, I like to do research at the intersection of signal processing, dynamical systems, and deep learning. Most recently, luckily, there's been more interest in these kinds of efficient architectures that use some of those principles to improve scaling along different axes and to get new trade-offs at inference time.
[00:01:13] Nathan Lambert: Great. And Tri?
[00:01:16] Tri Dao: Hi everyone, thanks Nathan for hosting us — really excited to be here. I'm Tri. I just finished my PhD at Stanford, I'm an incoming assistant professor at Princeton, and right now I'm chief scientist at Together AI, a startup working on AI infrastructure. I've been working at the intersection of machine learning and systems, designing algorithms that take advantage of the hardware they run on. I'm interested in long-range dependencies, how to encode them into a model, and designing architectures that can learn long-range dependencies. Yeah, really excited to be here.
Why Attention works and may not scale
[00:02:01] Nathan Lambert: Okay, I have some questions to dive right into. You two can both answer them, or one of you can answer at more length, whatever you want.
To start, I think we should talk about why attention works and what the limitations of attention are. Almost every person in tech broadly now knows that a transformer is a model built with attention, and that ChatGPT does that. But why is this so good? How much of a transformer is built with attention, are there other things going on, and what might the challenges be?
[00:02:35] Tri Dao: Right. So the transformer is this architecture that powers most of the exciting applications we're seeing nowadays, as you mentioned. Attention is kind of the core layer there; attention actually came out earlier, around 2014, 2015, and then the transformer came out incorporating it, focusing a lot on attention and interleaving MLPs and attention.
I think the success largely has been that they seem to scale really well, so you can scale up the models with more parameters and more data, and that has been the recipe for success. It sounds obvious now, but I think five years ago that wasn't clear.
So the transformer architecture seems to be a hugely successful one, and a couple of reasons why: one, it's general enough that it's able to learn a lot from data. And two, it's pretty friendly to hardware — there's no sequential dependency like in previous RNNs — so as a result you can run it well on GPUs and TPUs, you can scale it up, and it's very hardware efficient. I've personally worked on making it more hardware efficient as well. So it's just the recipe for success: a general architecture that scales well. If you're an NLP person, maybe you'd say you should incorporate some kind of inductive bias; personally, I think it's a very general architecture that scales well and is hardware friendly.
[00:04:16] Nathan Lambert: Yeah, it's remarkable how it seems so obvious now. I think one of the things from studying this work is that the context length becomes a really interesting axis on which to study alternatives to attention. Michael, do you want to talk about that? I could babble, but you know more, sure.
[00:04:39] Michael Poli: Yeah, there are several points. I'll start by saying that there's still a lot of great research trying to understand, from first principles, why the transformer can learn these interesting circuits. People pick apart the computation — combinations of different heads in transformers and so on — so there's work on basically understanding transformers in terms of a kind of programming language that they encode.
But I think, as Tri mentioned, there are some very, very interesting design choices that have gone into the transformer. This interleaving of attention and MLPs is quite important.
The transformer was also successful in the beginning because it encoded techniques that had been developed for RNNs and LSTMs, these classical NLP models: gating to modulate which information is absorbed into the model, gating to determine how quickly something is forgotten — and it put this recurrence into a parallel form.
It is, you know, a bunch of GEMMs that can be easily — well, not very easily, but can be — optimized on GPUs.
Quadratic scaling in attention
[00:06:01] Nathan Lambert: Yeah, that's all great. I guess the specific thing I had in mind is how attention ends up having this quadratic scaling in cost when you have an input sequence of length L and you want to output a sequence of length L.
If you zoom into the math and look at what's happening at inference in most of these libraries, you have this upper-triangular attention matrix where you can only look at the past entries of your text. As you go through, you end up with this L-squared relationship: the first token can only look at one entry, and then you end up looking at more tokens for each subsequent one. Now, we've been talking about recurrent neural networks — how does something that isn't attention get around the fact that you want to look at all of the history of the text in your sequence?
If you write a long prompt to ChatGPT, you really want all that information to be encoded. How could doing something other than this dense attention matrix actually make that possible?
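Editor's note: a rough sketch of the L-squared cost being described — every token scores itself against every earlier token through a masked L × L matrix. Illustrative NumPy only, not how production attention kernels are written.

```python
import numpy as np

def causal_attention(Q, K, V):
    """Naive causal self-attention: O(L^2) time and memory in the sequence length L."""
    L, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                        # (L, L): every token vs. every token
    mask = np.triu(np.ones((L, L), dtype=bool), k=1)
    scores[mask] = -np.inf                               # token i may only look at tokens j <= i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                   # each token mixes the values of its past

L, d = 1024, 64
Q = K = V = np.random.randn(L, d)
out = causal_attention(Q, K, V)   # the (L, L) score matrix is where the quadratic cost lives
```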
[00:07:08] Tri Dao: Yeah, so before attention there were RNNs, right? RNNs process text fine; maybe they just didn't scale as well.
[00:07:22] Nathan Lambert: Can you say briefly what an RNN is and how it works?
[00:07:24] Tri Dao: Yeah, so these are recurrent neural nets, that go back, I think, to the 80s.
Maybe some of the more famous ones are LSTMs and GRUs. They were pretty popular around 2012 to 2016 or so — they were state of the art for translation, speech recognition, a bunch of NLP tasks — and they process text sequentially.
They see essentially one token at a time, and that changes the hidden state; they update the hidden state every time they see a new token. In some sense, it's mimicking how, for example, the human brain processes information: you read a sentence or a passage, and you're storing some information in your brain. By the time you've finished reading a document, maybe you can answer questions about that document without having to refer to it again. RNNs kind of work that way: they process the text, that changes the hidden state, and the hidden state is the representation that can be used to generate new tokens, classify the document, or whatnot.
These worked well back in 2016 or so, but they've kind of fallen out of favor. Empirically, they don't do as well as the transformer, I think. And as you touched on, the transformer, because of this quadratic scaling — comparing every token with every other token that comes before it — gives you this very easy way to propagate information, and I think that's part of the reason why the transformer and attention do really well.
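Editor's note: a minimal sketch of the recurrent update being described — a fixed-size hidden state folds in one token at a time, so memory does not grow with the length of what has been read. Illustrative only.

```python
import numpy as np

def rnn_forward(token_embeddings, W_h, W_x, b):
    """Vanilla RNN: h_t = tanh(W_h h_{t-1} + W_x x_t + b). The state size never grows."""
    h = np.zeros(W_h.shape[0])            # the entire "memory" of everything read so far
    for x_t in token_embeddings:          # strictly sequential: step t depends on step t-1
        h = np.tanh(W_h @ h + W_x @ x_t + b)
    return h                              # used to predict the next token or classify the document

d_in, d_hidden, L = 32, 64, 1000
h = rnn_forward(
    np.random.randn(L, d_in),
    np.random.randn(d_hidden, d_hidden) * 0.1,
    np.random.randn(d_hidden, d_in) * 0.1,
    np.zeros(d_hidden),
)
```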
But more recently there have been some newer RNN architectures that seem to do pretty well. RWKV is, I think, one of the earlier ones. I really admire that project — an effort mostly from one person really going against the orthodoxy of the transformer, showing that RNNs can be pretty strong. I think the lead on that was Bo Peng; it's an entire project, but I think it was pioneered by Bo Peng, affiliated with EleutherAI, with compute sponsored by Stability and so on.
[00:10:12] Nathan Lambert: Yeah, I was reading about this earlier. At a technical level, they tried to replicate something like the query-key-value lookup of attention with two linear RNNs, essentially to remove the specific attention scaling problems, while getting the better long-context behavior and different implementation rules that come with RNNs.
The paper also trained models up to 14 billion parameters, which leads into some of the next questions I was going to ask — I was going to ask Tri about Mamba and then Michael about Striped Hyena. You could go in either order. I think these came out about a week apart and were two language models kind of seen as being...
What is Striped Hyena
Nathan Lambert: ...way closer than anyone would expect, essentially. Striped Hyena came out and the evaluations were close to models I've been training on all year, like Llama 2 and Mistral 7B. I went to the Together API and did side-by-sides of Mistral versus Striped Hyena, and it's a good language model — it answers most questions, there are no obvious failure modes. Michael, do you want to comment on that? I know it's another big project, and then we can go back to Mamba, even though that's slightly out of order relative to the release cycle that happened.
[00:11:33] Michael Poli: Sure. I'll start by saying that there's an interesting connection between all these new methods. There is this sort of convex set, which has a center, and there's this connection between linear attention — that is, attention without the softmax — linear RNNs, and state space models, SSMs. At some level, the mathematical formulation of this kind of base model — I'm not talking about the base architecture, just the fundamental model — is the same, and then you can go in different directions, and each direction has its own trade-offs.
You can go in the feature map direction, the kernel direction: when you break down the softmax, you take away the softmax, and in place of queries and keys — the fundamental entities that compose your attention matrix — you can compose other kernel-like functions, other functions that you hope will approximate whatever capability of attention you like. You can do things like a Taylor approximation, a Taylor expansion, of that, and you get a slightly different perspective, but you get something that, again, is very similar.
You can go to time variance: you take the RNN and you push in this input dependence, so the computation inside the linear RNN is conditioned by the input sequence. You can have things like gates — we've seen a lot of work, for example, modernizing linear attention with additional gates that allow you to make better use of your fixed state dimension.
And then you have the third direction, at least in my mind, which is the one that uses the convolutional form, that pushes more towards other types of linear operators that are still associative and still allow you to train in parallel.
So here you have things like time-invariant systems — I can elaborate on any of these points — things that can switch between convolutions and recurrence, with additional gates. Striped Hyena, you know, was born as a project from the Hyena architecture, which belongs to this third category I just mentioned, and we were really trying to get the best per-FLOP architecture that we could.
One principle that was validated over and over again, and that we're trying to understand better now, is that composing — hybridizing — different layers, different blocks of different categories, and even full attention, yields something better than the individual components. So there seems to be a compositional aspect of these models that we're trying to understand better, and this gives you a better pretrained model per FLOP. With this model, we ran a whole suite of scaling laws and so on.
Hybridizing also gives you — since we wanted something usable out of the box — a way easier time when you're fine-tuning for longer context. We can apply some of the techniques that have been developed for transformers, and they surprisingly work okay for hybrids as well: things like linear scaling for rotary embeddings and so on; we can go into the details. So it was mostly a project asking: given the current landscape, what is the best we can do?
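Editor's note: one way to see the connection between "attention without the softmax," linear RNNs, and state space models is that if you drop the softmax, causal attention can be computed as a recurrence over a fixed-size running state instead of an L × L matrix. A simplified illustration with no feature maps, gating, or convolutions — not the Striped Hyena or Mamba implementation.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Causal attention with the softmax removed, computed as a linear recurrence.

    The state S accumulates sum_j k_j v_j^T, so each step costs O(d^2), independent of L.
    """
    L, d = Q.shape
    S = np.zeros((d, d))                  # fixed-size state, independent of sequence length
    out = np.zeros_like(V)
    for t in range(L):
        S += np.outer(K[t], V[t])         # fold token t into the state
        out[t] = Q[t] @ S                 # equals the sum over j <= t of (q_t . k_j) v_j
    return out

L, d = 256, 16
Q, K, V = (np.random.randn(L, d) for _ in range(3))
reference = np.tril(Q @ K.T) @ V          # the same computation done the quadratic way
assert np.allclose(linear_attention_recurrent(Q, K, V), reference)
```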
What is Mamba
[00:15:11] Nathan Lambert: Yeah, that's a great description of it. The sentence in the blog that says "Striped Hyena is optimized using a set of new model grafting techniques, enabling us to change the model architecture during training" felt to me like there's a ton going on there, and some of it you probably can't talk about. There's also the data — I don't think all the data was quite explained, like what the longer-context data was. But are you starting from models that people would know? Can you say any of that? I think even just the summary that it's synthesizing recent work into a strong model is great context for people.
[00:15:48] Michael Poli: Yeah. Well, given this explosion of primitives that I just described, and given the cost it would require to evaluate all the different combinations, we found ways to essentially start training with one configuration and then continue on with another configuration. I think we're going to have more work, or a paper, on this.
[00:16:16] Nathan Lambert: Yeah, there's so much cool work in that area. Someone at AI2 is working on a project where they're essentially trying to cut the Llama models in half and keep training them. It's just the wild west out there, with people trying to take strong models and make them smaller while still getting the performance benefits of bigger models.
That's a whole aside, but I wasn't expecting it to show up here. When you follow the social media around Striped Hyena, people are like, "Oh, non-attention models are finally good," and that covers up a lot of the details that are very interesting about it, in my opinion.
So, okay, turning back to Tri. I think Mamba actually happened first among these, in my reading back of social media, and it was also very surprising to me. The largest model in the Mamba suite is 2.8 billion parameters, if I remember correctly, and it was compared on a lot of the common benchmarks in open NLP — things like the GPT-J and Pythia model suites — and the reported scores were really strong. I think you could start with the high-level summary, and then I'm definitely going to make you talk about the awesome new CUDA kernels and stuff that you had to write for this project.
[00:17:34] Tri Dao: Yeah, so Mamba is a collaboration with Albert Gu. He was a PhD student at Stanford — that's where we met — and he's now a professor at CMU, and also at a startup. It was a wonderful collaboration; credit goes to him. Albert has been working on this line of work called state space models. In some sense, as mentioned, it connects to things like linear attention, linear RNNs, and convolutional neural nets, and that's what his PhD thesis is about.
I've also worked on state space for the past couple of projects. My angle is how to make state space more hardware efficient and increase its expressiveness, so it's wonderful working with Albert. And Mamba, I think, is more of a proof of concept, which is: can state space actually do as well as the transformer on language? We've seen previous papers showing state space could be better on audio, or better on some of the tasks in the Long Range Arena, but language has always been the most difficult thing for state space models to do well.
And language is also the thing that people care about the most right now. So it was more of a proof of concept: hey, we want to show that state space can be competitive with, or maybe even beat, some of the transformers out there. We validated that at scales up to 3B parameters trained on 300B tokens.
So in absolute terms, these are not very strong models. These are not yet models that you would actually play with and expect to do meaningful things. I think it's more of an academic comparison in terms of architecture: hey, trained for the same amount of tokens, it does as well as, or maybe slightly better than, some of the transformers out there.
That in particular has been very exciting to us. Albert's been pushing on this for a while, I've been pushing on this for a while, and it finally seems to be close to closing the gap with, or even surpassing, the transformer. I think it opens up a bunch of opportunities.
Inference could be way faster. Maybe we would have different ways to understand how in-context learning happens, et cetera. So lots of future work, I would expect.
Mamba hardware optimization
[00:20:22] Nathan Lambert: Yeah. Can you go into what it actually takes to implement some of these new CUDA kernels? I just remember when this paper was announced, Sasha Rush, who's also very active in the space and recommended that I talk with you two, was tweeting about the types of files, or whatever.
In the paper, there's this discussion about how the recurrent state needs to be sufficiently expressive, but doing that in a certain type of memory is a problem. Translate what this means for people thinking about GPUs and about these models being scaled. Is it now much easier to scale these models because they work on GPUs?
Which GPUs were you using? Is there a bump that could come just from going to H100s or something? Any of that?
[00:21:08] Tri Dao: Yeah. So the line of work on state space, like the S4 models, kind of pioneered by Albert's work — they are in some sense recurrent neural networks, but they have a much larger state size. The state size is whatever buffer you store information in as you traverse or process the sequence.
In some sense, you can view the transformer as doing that as well, where it keeps the entire history — usually called the KV cache — and keeps referring to it. RNNs have a fixed-size state; for the transformer, you can think of the state size as increasing. Our intuition is that the larger the state size, the easier it is for the model to do well — basically, you have more space to store whatever you need to remember.
Previous models like S4 and so on have an implicitly pretty large state size, but they use the convolutional view to avoid having to materialize the state. That was wonderful. Michael has worked on the Hyena architecture, which uses some of the same insight, focusing on convolution.
Mamba, on the other hand, focuses on the recurrent view. We wanted to put more input dependency into the recurrence; the thinking was that it would make it more expressive and the model would do better, but that prevents us from using the convolutional view that would make things efficient. So we had to figure out a different way to make things efficient, and I focused on making that efficient on GPUs.
Our thinking was: okay, we're going to have a large state size, but we don't have to write it to actual GPU memory, like the HBM — we can just keep that large state in a faster memory called SRAM. You can think of it as a cache; if you're more familiar with CPUs, this is like the cache versus RAM. So if you have a large state, you can keep it in the cache, and by avoiding having to write it down, you actually don't suffer too much even if the state is large.
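Editor's note: a back-of-the-envelope illustration of the memory argument — a transformer's KV cache grows linearly with context length, while a recurrent or state space model carries a fixed-size state (which Mamba keeps in fast on-chip SRAM rather than writing to HBM). The sizes below are made up for illustration.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_heads=32, head_dim=128, bytes_per_value=2):
    # Transformer: keys and values for every past token, at every layer and head.
    return seq_len * n_layers * n_heads * head_dim * 2 * bytes_per_value

def fixed_state_bytes(n_layers=32, d_model=4096, state_per_channel=16, bytes_per_value=2):
    # SSM-style model: a fixed-size state per layer, no matter how much text has been read.
    return n_layers * d_model * state_per_channel * bytes_per_value

for length in (1_024, 32_768, 1_000_000):
    print(f"{length:>9} tokens: KV cache {kv_cache_bytes(length) / 1e9:7.2f} GB, "
          f"fixed state {fixed_state_bytes() / 1e6:5.1f} MB")
```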
Predictions for 2024 architectures
[00:23:33] Nathan Lambert: Would this be due to input/output, like having to move the data around being really slow?
Tri Dao: Yes.
Nathan Lambert: Yeah, that makes a lot of sense. Thanks. That's a really common constraint in a lot of these things. I think one of the most insightful things I've learned with GPUs versus TPUs is why mixture of experts models don't work very well on TPUs, just because of what a mixture of experts adds at a basic level: there's a routing layer that you learn, and then multiple feedforward layers that you can choose from. When you're distributing this, the feedforward layers could end up on a different TPU node, and TPUs communicate with their neighbors. So TPUs take a big hit relative to GPUs, where within NVIDIA clusters everything's connected so much more, and then it's easy to do that sort of distributed training.
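Editor's note: for readers who have not seen one, a mixture-of-experts layer is a learned router plus several feedforward "experts," and only the selected experts run for each token. A toy single-token sketch with made-up sizes; real implementations batch tokens and shard the experts across devices, which is where the interconnect cost mentioned above comes in.

```python
import numpy as np

def moe_layer(x, router_W, experts, top_k=2):
    """x: (d,) activation for one token. Route to the top-k experts and mix their outputs."""
    logits = router_W @ x                            # learned routing scores, one per expert
    top = np.argsort(logits)[-top_k:]                # pick the k best experts for this token
    gates = np.exp(logits[top]) / np.exp(logits[top]).sum()
    # Only the chosen experts' feedforward layers run; the others are skipped entirely.
    return sum(g * experts[i](x) for g, i in zip(gates, top))

def make_expert(rng, d):
    W1 = rng.standard_normal((4 * d, d)) / np.sqrt(d)
    W2 = rng.standard_normal((d, 4 * d)) / np.sqrt(4 * d)
    return lambda x: W2 @ np.maximum(W1 @ x, 0)      # a small two-layer feedforward network

d, n_experts = 64, 8
rng = np.random.default_rng(0)
experts = [make_expert(rng, d) for _ in range(n_experts)]
y = moe_layer(rng.standard_normal(d), rng.standard_normal((n_experts, d)), experts)
```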
That's super interesting, and I guess this is really where I want to open up the conversation: what does this mean? What is going to happen in 2024 in this space? Are bigger players going to move in and explore this? My take, seeing how good the long-context learning could be in the fundamental limit, is that systems like ChatGPT are going to use a dense transformer model for most tasks, and then if you need to do summarization, you might use a long-context specialized architecture — and we could even see a whole quiver of architectures behind something you're using.
But is attention going to be dethroned? Is Sasha Rush somehow going to win this bet that everyone was following in the area? What are you thinking, either of you?
[00:25:14] Tri Dao: I think the transformer is still a very, very strong architecture, and there is a proven recipe, right? People are scaling to a trillion parameters right now. If you say, well, I just want the best-performing model that runs most efficiently on my hardware, that has the most support on the software side — the transformer is a safe bet. I think it's here to stay. But I think there are new ideas: state space, some of the ideas from linear attention. I think they're coming in. We've seen, as Michael mentioned, that mixing some of these components seems to improve performance — validated at, I think, 7B scale — and maybe it might even work at larger scale.
I think people tend to be conservative, and focusing too much on model architecture might not be worth their time. The Llama architecture is very, very strong; most people are building off of that, and they're focusing on data and infrastructure, which makes sense. On my side, personally, what's just plain interesting are the more, I would say, niche use cases — niche for now — where some of these alternative architectures are interesting: things like long context, and different domains like audio and genomics. And there are just plain interesting scientific questions you can ask, like whether it follows instructions just as well, whether it follows intuition just as well, whether it plays well with quantization, and so on. Those are just plain interesting research questions we can ask.
Now, on the production level, I think the transformer is still incredibly strong and very well supported, both hardware and software. But I think some of these new ideas are coming in, and people might start putting them in as a component in the transformer. Maybe we'll still call them transformers, but they'll just have more layers than just attention and MLP.
[00:27:11] Michael Poli: Yeah, I 100 percent agree with you. Attention as a computational primitive is not going anywhere anytime soon; it's just a very efficient and very convenient way to increase the effective state of your sequence processor. At some level, if you're working with a model that only has a fixed state in each of its sequence mixers, you have an assumption, and your assumption is that you only need so much information from the sequence.
So there's always a trade-off between the ratio of the state dimension and the sequence length. As you push things to the extreme — either model size, so as you make the model bigger and wider you effectively introduce more state, or sequence length — some of these margins (some of this is speculation) will disappear and some of the trade-offs will change, especially for 14B, 30B, some of these very big models.
But certainly, whether that's hybridizing or some kind of new block, we're going to see more innovation. That's really exciting. My personal prediction, if I had to make one, is that architectural design will get more interesting, more complex. There's going to be more to do.
More predictions for AI
[00:28:38] Nathan Lambert: Yeah. I mean, this year — I've got about ten minutes on the clock, which is fine for us — with mixture of experts and state space models becoming popular, this has all really happened within a few months, at least outside of OpenAI; they've been doing mixture of experts for a lot longer than everyone else.
In terms of the open and academic communities, no one's really tried to do RLHF on mixture of experts. It should just work, but we have to learn all these things. And then model grafting is becoming more of a real thing, which is super interesting. I agree that it's just fun to follow, and hopefully it gives academics and scientists more ways to influence the conversation, where industry is just about scaling and bigger models and we could maybe do specific things better — which is what I'm telling open-source companies to do with their language models anyways.
If they want to have a business model, they need an edge, so this all fits into that narrative pretty well, in my view. Is there anything else you two are following in ML? It doesn't have to be about state space models. What's exciting for you, broadly, for next year?
[00:29:46] Tri Dao: Yeah, personally, I think data is still the most important thing. We're thinking a lot about how data influences model performance — really teasing that out — either by having synthetic tasks that correlate very well with model performance, which has been the motivating example in a lot of our papers and work, or by having smaller datasets that make it easier to really understand what's going on. So personally, my focus is going to be on data for the next little bit.
All the architecture stuff is fun, and making it hardware efficient is fun, but I think ultimately it's about data. If you look at the scaling law curves, different model architectures generally have the same slope — they're just different offsets. It seems like the only thing that changes the slope is the data quality.
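Editor's note: a small illustration of the slope-versus-offset framing. On a log-log plot of loss against compute, swapping the architecture mostly shifts the curve up or down (the offset), while on this view better data changes the exponent (the slope). All numbers are made up purely to show the shape of the claim.

```python
import numpy as np

def scaling_loss(compute, offset, exponent, irreducible=1.7):
    """Toy scaling law: loss = offset * compute^(-exponent) + irreducible."""
    return offset * compute ** (-exponent) + irreducible

compute = np.logspace(18, 24, 7)                                   # FLOPs, made-up range
arch_a      = scaling_loss(compute, offset=1.0e3, exponent=0.12)   # one architecture
arch_b      = scaling_loss(compute, offset=8.0e2, exponent=0.12)   # another: same slope, lower offset
better_data = scaling_loss(compute, offset=1.0e3, exponent=0.14)   # better data: steeper slope

for c, la, lb, ld in zip(compute, arch_a, arch_b, better_data):
    print(f"{c:.0e} FLOPs  arch A {la:.2f}  arch B {lb:.2f}  better data {ld:.2f}")
```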
[00:30:58] Nathan Lambert: I love that point. That does seem true. I have the plot from Mamba in this blog post that I'm writing, and it's just a little bit above, with the same slope.
[00:31:08] Michael Poli: Yeah, and on data: data is really interesting. Sort of miniaturizing architecture design — finding and breaking down which tasks are involved in, for example, language modeling, and trying to package them into something that can be used to iterate on architectures — is quite exciting. That was one of the main techniques used for this Zoology paper that also looks into some of these different behaviors.
Personally, I'm also really excited about new applications: scientific applications, with the genomics work, but also more engineering-focused ones. We're seeing a shift — right now language is still the domain that gets the most clicks and the most interest, but I think that will evolve over time. And some of these other applications, even just talking about architectures, offer a completely different design space that I'm excited to look into.
[00:32:13] Nathan Lambert: Yeah, everyone talks about language, but I feel like images and entertainment and videos are the things that are so obviously going to generate so much value, to me.
I don't know the ceiling on language, but when you can access a somewhat local text and video model at your home workstation that's tailored to your preferences, the amount of value that creates is totally astronomical. I'm excited. I've started playing around with this, where I take the text of a blog post, convert it to DALL-E images, and convert that to a video with generated audio, all with one Python script — and that's really easy to do. So I agree with your view that there's more to this than language.
[00:32:55] Tri Dao: And these things actually do work reasonably well, in your experience, when you stitch all of them together?
[00:33:02] Nathan Lambert: It's not that good. The DALL-E images are pretty similar, because I'm doing something really naive where I literally take the text and use a system prompt like "you're generating a series of images for visualizing a blog post," and it generates variations of all the machine learning thumbnails that you see everyone using. The fun ones are when the post is about Llama or Mamba or something, and then it generates animals in them, which is good. I think I could get much better at it, with a better segmentation system for the paragraphs, or by having ChatGPT summarize them or something like that. But I just know that within a year there's going to be a text-to-video API, and I'm just going to switch to it, and it's going to be great.
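Editor's note: a guess at the shape of the naive pipeline described here — split the post into chunks, wrap each in a system prompt, and ask an image model for a thumbnail. This is not Nathan's actual script; it assumes the OpenAI Python client with an OPENAI_API_KEY set in the environment.

```python
from openai import OpenAI  # assumption: the OpenAI Python SDK is installed and configured

client = OpenAI()
SYSTEM_PROMPT = (
    "You are generating a series of images to visualize a blog post. "
    "Create one vivid, relevant image for the following passage."
)

def illustrate_post(post_text: str, max_images: int = 4):
    # Naive segmentation: one image per paragraph (a summarization pass would likely help).
    paragraphs = [p for p in post_text.split("\n\n") if p.strip()][:max_images]
    urls = []
    for p in paragraphs:
        resp = client.images.generate(
            model="dall-e-3",
            prompt=f"{SYSTEM_PROMPT}\n\n{p[:3000]}",  # truncate to stay under prompt limits
            n=1,
            size="1024x1024",
        )
        urls.append(resp.data[0].url)
    return urls
```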
So I'm laying the groundwork, the infrastructure, to have multimodal content and multimodal content distribution, really, and I just expect it to become very fun. Even the text-to-voice is pretty good. I don't have a studio, but once you have a studio, it's going to be able to generate perfect audio for whatever you want.
Another one of my dreams — one that is bad for young students — is that I want to be able to give a slide deck to a script that returns the five-minute conference video that no one ever watches, just based on GPT-4 reading the slide deck and voicing yourself. Those are the silly things that I have time to do because I'm not a professor.
[00:34:29] Tri Dao: Yeah, I think these advances, these systems — they do generate a lot of economic value, and we're seeing that already. Lots of companies are now switching to using these things, and I think it's going to change the way we work, as you mentioned, and the way we're entertained. So it's just a very exciting future.
[00:34:47] Nathan Lambert: Yeah. Anything else? Well, thanks for coming. I'll try to get you two as much attention as I can bring — you never know, it could go viral these days. I think this was a great conversation. People are really hungry for basic intuitions in the area, so this is good.
[00:35:02] Tri Dao: Yeah, thank you, Nathan. It's a pleasure. Absolutely.
[00:35:07] Michael Poli: Thanks for inviting us. And maybe, if there are more questions, is there a way to collect them, or to provide readers and listeners with an address or something? Happy to answer anything.
[00:35:24] Nathan Lambert: Yeah, I'll include contact info in the post and various other ways to reach you.
This will be out there. You'll get comments on Substack, YouTube, Twitter — it's a mess; you've got to pay attention to ten million streams of information these days — but you'll get contacted by people. Thankfully, for some reason, people read my stuff, but here we are. So thanks for listening.