This post is early to accommodate some last minute travel on my end!
The new models trained to express extended chain of thought are going to generalize outside of their breakthrough domains of code and math. The “reasoning” process of language models that we use today is chain of thought reasoning. We ask the model to work step by step because it helps it manage complexity, especially in domains where the answer requires precision across multiple specific tokens. The domains where chain of thought (CoT) is most useful today are code, mathematics, and other “reasoning” tasks. These are the domains that models like o1, R1, Gemini-Thinking, etc. were designed for.
Different intelligences reason in different ways that correspond to how they store and manipulate information. Humans compress a lifetime of experience into our spectacular, low-power brains that draw on past experience almost magically. The words that follow in this blog are also autoregressive, like the output of a language model, but draw on hours and hours of background processing as I converge on this argument.
Language models, on the other hand, are extremely general and do not today have architectures (or use-cases) that continually re-expose them to relevant problems and fold information back in a compressed form. Language models are very large, sophisticated, parametric probability distributions. All of their knowledge and information processing power is stored in the raw weights. Therefore, they need a way of processing information that matches this. Chain of thought is that alignment.
Chain of thought reasoning allows information to be naturally processed in smaller chunks, allowing the large, brute force probability distribution to work one token at a time. Chain of thought, while allowing more compute per important token, also allows the models to store intermediate information in their context window without needing explicit recurrence.
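To make the mechanism concrete, here is a toy illustration (the question and the worked steps are invented, and no actual model call is shown): every intermediate step the model writes becomes part of the context it conditions on for later tokens.

```python
# Toy illustration of chain-of-thought prompting: intermediate results live in
# the context window, so each later token only needs a small amount of work.
cot_prompt = (
    "Question: A train travels 60 km in 45 minutes. What is its speed in km/h?\n"
    "Let's think step by step.\n"
    "Step 1: 45 minutes is 45/60 = 0.75 hours.\n"
    "Step 2: Speed = distance / time = 60 / 0.75 = 80 km/h.\n"
    "Answer: 80 km/h"
)
# Every "Step" line is explicit state the transformer re-reads on later tokens,
# instead of state it would have to carry in a recurrent hidden vector.
print(cot_prompt)
```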
Recurrence is required for reasoning, and it can happen either in parameter space or in state space. Chain of thought with transformers handles all of this in the state space of the problem. The humans we look at as the most intelligent have embedded information directly in the parameters of our brains that we can draw on.
Here is the only assumption of this piece — chain of thought is a natural fit for language models to “reason” and therefore one should be optimistic about training methods that are designed to enhance it generalizing to many domains. By the end of 2025 we should have ample evidence of this given the pace of the technological development.
If the analogies of types of intelligence aren’t convincing enough, a far more practical way to view the new style of training is a method that teaches the model to be better at allocating more compute to harder problems. If the skill is compute allocation, it is fundamental to the models handling a variety of tasks. Today’s reasoning models do not solve this perfectly, but they open the door for doing so precisely.
The nature of this coming generalization is not that these models are one size fits all, best in all cases: speed, intelligence, price, etc. There’s still no free lunch. A realistic outcome for reasoning heavy models in the next 0-3 years is a world where:
* Reasoning trained models are superhuman on tasks with verifiable domains, like those with initial progress: Code, math, etc.
* Reasoning trained models are substantially better in peak performance than existing autoregressive models in many domains we would not expect, including domains that are not necessarily verifiable.
* Reasoning trained models are still better in performance at the long-tail of tasks, but worse in cost given the high inference costs of long-context.
Many of the leading figures in AI have been saying for quite some time that powerful AI is going to be “spikey” when it shows up — meaning that the capabilities and improvements will vary substantially across domains — but encountering this reality is very unintuitive.
Some evidence for generalization of reasoning models already exists.
OpenAI has already published multiple safety-oriented research projects with their new reasoning models in Deliberative Alignment: Reasoning Enables Safer Language Models and Trading Inference-Time Compute for Adversarial Robustness. These papers show their new methods can be translated to various safety domains, i.e. model safety policies and jailbreaking. The deliberative alignment paper shows them integrating a softer reward signal into the reasoning training — having a language model check how the safety policies apply to outputs.
An unsurprising quote from the deliberative alignment release related to generalization:
we find that deliberative alignment enables strong generalization to out-of-distribution safety scenarios.
Safety, qualitatively, is largely orthogonal to traditional reasoning problems. Safety is highly sensitive to the information provided and to subtle context, whereas math and coding problems are often about many small, forward processing steps toward a final goal. Many more behaviors will fit in between those two.
This generative verifier for safety is not a ground-truth signal and could theoretically be subject to reward hacking, but in this case that was avoided. Generative verifiers will be crucial to expanding this training to countless domains — they're easy to use and largely a new development. The field of LLM-as-a-judge (and related synthetic data pipelines) only really became stable with models at the level of GPT-4.
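As a rough illustration of what a generative verifier looks like as a reward signal, here is a minimal sketch. The judge prompt, the `query_model` stand-in, and the binary scoring are all assumptions for the example, not OpenAI's or DeepSeek's actual setup.

```python
# Minimal sketch of a generative verifier used as a reward signal.
# `query_model` is a hypothetical stand-in for whatever LLM call you use;
# the policy template and scoring scheme are illustrative only.

from typing import Callable

JUDGE_TEMPLATE = """You are checking a response against this safety policy:
{policy}

User request: {prompt}
Model response: {response}

Think step by step about whether the response complies with the policy,
then end with a single line: VERDICT: COMPLIANT or VERDICT: VIOLATION."""


def generative_verifier_reward(
    prompt: str,
    response: str,
    policy: str,
    query_model: Callable[[str], str],
) -> float:
    """Return 1.0 if the judge deems the response compliant, else 0.0."""
    judge_input = JUDGE_TEMPLATE.format(policy=policy, prompt=prompt, response=response)
    judgment = query_model(judge_input)  # the judge writes its own chain of thought
    lines = judgment.strip().splitlines() or [""]
    return 1.0 if "COMPLIANT" in lines[-1].upper() else 0.0
```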
Reasoning models trained as a judge are a very natural fit because the exact token for a predicted reward or ranking is crucial — CoT is essential. All of the progress here relies on continued progress on both generators and verifiers. o1 et al. were likely trained with mostly explicit, code verifiers. They spawned far more powerful generators, which will enable new types of verifiers. Then, we can train better models (and so on).
Onto another example of unexpected performance from the new reasoning-trained models. DeepSeek-R1, the new open-weight o1 replication, has been showing up at the top of many random benchmarks, above Claude 3.5 Sonnet, Gemini, and GPT-4o, and alongside o1. Examples include a creative writing and humor leaderboard and the brand-new, extremely challenging benchmark from the Center for AI Safety and Scale AI — Humanity’s Last Exam. Oh, and yes, it’s best on both accuracy and the new metric “calibration error,” which is designed to have the model express its own uncertainty. Calibration is a long-sought behavior in traditional LMs, and it turns out reasoning training may help with it.
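For readers who have not seen a calibration metric before, here is a standard expected-calibration-error sketch (the benchmark's exact metric may differ; the binning scheme and the toy numbers are mine):

```python
# Standard expected calibration error: the model reports a confidence with each
# answer; we bin the confidences and compare each bin's average confidence to
# its empirical accuracy. Lower is better.

import numpy as np


def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece


# A model that says "90% sure" but is right only half the time is penalized:
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [1, 0, 1, 0]))  # 0.4
```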
A lot of my friends find o1-pro to be clearly the most useful AI model in their daily workflows (one example here and a similar R1 example here). ChatBotArena has all of the new models, with o1, Gemini-Thinking, and R1 among the top models these organizations have in the best “normal use” evaluation the AI community has. These reasoning models are definitely absorbing the other lessons learned in post-training across the AI industry.
The explosion of R1 caused arguably the biggest general awareness of AI moment since the original ChatGPT. DeepSeek’s App has been the number one overall free app in the U.S. and non-technical users are getting meaningful value out of seeing the reasoning process. What was a niche training process is bringing many more types of benefits than expected.
All of this is just on “day 1” of this technology. Reasoning models are going to proceed at a rate far, far faster than most expect.
These models will not be state-of-the-art on every domain, but probably far more than you expect. Language models are a complex technology and they will never be one size fits all, but the ground is being reshaped under us.
Where the standard models match the reasoning models’ abilities, you’ll be paying way more for the same performance. At the same time, so many domains are going to be open to the trade of “if you pay a little bit more, the reasoning model will get you a bit more performance,” which will accrue so much value over time.
These are trade-offs that many in the AI industry see at face value. Many ask where Anthropic’s reasoning model is, but they may never explicitly have one. Before o1 launched, Claude was already using extra tokens hidden from the user to improve the quality of responses. Anthropic CEO Dario Amodei commented on their approach in an interview with Joanna Stern of the WSJ recently:
To say a little about reasoning models, our perspective is a little different, which is that there’s been this whole idea of reasoning models and test-time compute as if they’re a totally different way of doing things. That’s not our perspective. We see it more as a continuous spectrum — the ability for models to think, reflect on their own thinking, and ultimately produce a result. If you use Sonnet 3.5, sometimes it already does that to some extent. But I think the change we’re going to see is a larger-scale use of reinforcement learning, and when you train the model with reinforcement learning, it starts to think and reflect more.
It’s not like reasoning or test-time compute — or whatever it’s called — is a totally new method. It’s more like an emergent property, a consequence of training the model in an outcome-based way at a larger scale. I think that will lead to something that continuously interpolates between reasoning and other tasks, fluidly combining reasoning with everything else models do.
As you’ve said, we’ve often focused on making sure using the model is a smooth experience, allowing people to get the most out of it. I think with reasoning models, we may take a similar approach and do something different from what others are doing.
The newest Claude 3.5 Sonnet models are very likely already trained to some extent with RL on verifiable outcomes. Just days before o1 was launched, Claude’s behavior of “I’m thinking about that” was the biggest indicator we had of consumer companies trading more compute for better responses. Anthropic hasn’t shifted their strategy here, and you can decide how much weight you want to put in their CEO’s recent comments.
The techniques are here to stay, and it took revolutionary new models to show us that.
Like many new technologies, we needed to be shown what was possible, and then it can be folded back into normal experience. o1 was this breakthrough, and the benefits of reasoning training will now expand out into all of the AI products we are using day and night.
To end, I leave you with a quote from the DeepSeek R1 paper, where the authors reflect on their experience with the model(s):
One of the most remarkable aspects of this self-evolution is the emergence of sophisticated behaviors as the test-time computation increases. Behaviors such as reflection—where the model revisits and reevaluates its previous steps—and the exploration of alternative approaches to problem-solving arise spontaneously. These behaviors are not explicitly programmed but instead emerge as a result of the model’s interaction with the reinforcement learning environment. This spontaneous development significantly enhances DeepSeek-R1-Zero’s reasoning capabilities, enabling it to tackle more challenging tasks with greater efficiency and accuracy.
Thanks to Ross Taylor and Hamish Ivison, for discussions that helped inspire this post.
We're here to share the story of building our Open Language Models (OLMos) and what we improved to build the OLMo 2 7B/13B model that is competitive with the Llama 3.1 8B model. This is all about building an effective, small language modeling team that can share all it learns with the scientific community. Dirk, Luca, and Kyle are some of the people I learn the most from and have more knowledge (and entertainment) to share than we have time.
Some questions were pulled from Twitter, but please comment or get in touch if you want us to cover anything in the future episode(s)!
Main topics:
* Pretraining efficiency and our quest for stability after a not-so-secret failed 70B run early in 2024,
* What the role of OLMo is in the broader AI landscape and how that is, or is not, changing,
* Many little decisions that go into building language models and their teams (with a focus on NOT post-training, given I already talk about that a ton).
Play with the models we build here: playground.allenai.org/
For more history of open language models (OLMos) on Interconnects, see my first post on OLMo, my coverage of OLMoE, OLMo 2, and why I build open language models. If you have more questions or requests, please let us know (especially the researchers out there) and this can be one of N, rather than a one-off celebration.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Contacts
Dirk Groeneveld — https://x.com/mechanicaldirk // https://bsky.app/profile/mechanicaldirk.bsky.social
Kyle Lo — https://x.com/kylelostat // https://bsky.app/profile/kylelo.bsky.social
Luca Soldaini — https://twitter.com/soldni // https://bsky.app/profile/soldaini.net
General OLMo contact — [email protected]
Papers / models / codebases discussed
* OPT models and talk from Susan Zhang
* BLOOM
* C4: Boosting Large-scale Parallel Training Efficiency with C4: A Communication-Driven Approach
* Maximal Update Parametrization (muP) is from Tensor Programs V: Tuning Large Neural Networks via Zero-Shot Hyperparameter Transfer
* Spike No More: Stabilizing the Pre-training of Large Language Models
* LLM360: Towards Fully Transparent Open-Source LLMs — Amber model
* A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Kyle said Hitchhikers)
* Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
Chapters
Here is a list of major topics covered in the podcast, with timestamps for when the discussion starts:
* [00:00:00] Introduction
* [00:02:45] Early history of the OLMo project
* [00:15:27] The journey to stability
* [00:25:00] The evolving role of OLMo and pretraining research
* [00:29:00] Pretraining Q&A (µP, scaling laws, MoE, etc.)
* [00:40:40] How to think about pretraining data work
* [00:54:30] Role of pre-training vs mid training vs post-training
* [01:02:19] Release strategy and wrapping up
Transcript
This is generated by AI and lightly edited for clarity. In particular, the per-speaker attribution was poor this time around.
Nathan Lambert [00:00:07]: Hey, welcome back to Interconnects. In this interview, we're bringing one that I've hinted at for a while, which is interviewing some of the other leads on the OLMo team at AI2. So essentially, this covers the story of OLMo from its early days where we got our compute, kind of our path to stability and some failed runs along the way, the role of OLMo and the broader AI ecosystem, and really just a very long tale of technical details and decision making and considerations that you have when actually training language models that you're trying to have at the frontier of performance relative to peers like Llama, etc. This is a fun one. There's less post-training than normal because this is me interviewing some other co-leads at the Allen Institute for AI. So there's three people in addition to me, which is Dirk Groeneveld, who is the lead of training, handles most of engineering, Kyle Lo, and Luca Soldaini, who are the data leads. So we have a pre-training engineering lead and two data leads with me who has done a lot of the post-training. This is just a part of the team. And I hope you enjoy this one. We can do more of these and bear with the fact that I'm still expanding my podcasting tech equipment. But I think the audio is definitely good enough and enjoy this episode with me, Kyle, Dirk, and Luca.
Hey, everyone. Welcome to the AI2 office. We're finally talking more about some of our OLMo things. Too much work to do to actually get all the... the information we want to share out into the world. So I'm here with Dirk, Kyle, and Luca. Let's also talk so people can identify your voices, since people are not all on video.
Dirk Groeneveld [00:02:01]: Hi, I'm Dirk. I am the lead of the pre-training part of OLMo.
Kyle Lo: Hi, I'm Kyle. I work on data.
Luca Soldaini [00:02:08]: Hello, Luca. Also work on data with Kyle.
Nathan Lambert [00:02:13]: Okay, so we're kind of going to maybe go through some of the story of OLMo to start. And then just get into as many nerdy details until we get tired of OLMo 2. Which, on my end, will probably be mostly about pre-training. You can ask me post-training questions as well. But I'm not going to sit here and be like, ask myself questions that I'm not going to answer. Because that is an absolutely ridiculous thing. You can ask me one question. Okay. One question. It's like, why shouldn't you do post-training with all the compute?
Nathan Lambert [00:02:45]: But I wasn't here for when OLMo actually started. So I think it'd be good to tell people, I mean, like, broadly what AI2 was like at the time, what language modeling was like at the time, what it may or may not have been risky.
Kyle Lo [00:03:01]: Yeah, you should probably get this.
Dirk Groeneveld [00:03:03]: Yeah, I think it all started in the fall of 2022.
Dirk Groeneveld [00:03:10]: We were talking to AMD at the time about some sort of collaboration. We're scoping out some stuff. And at the time, we wanted to take the Bloom model. And put 300 billion extra tokens in. And we wrote up a proposal and we sent it to AMD and it disappeared into a black hole. And we never heard from them again. And then ChatGPT came out a couple months after that. And suddenly everybody was very excited. And two, maybe one month after that, AMD came back to us and said, now let's do it. And that kicked off a very busy period for us. At least the three of us were involved at the time. Plus some of us. Some more people trying to scope out exactly what the project would be. Putting 300 billion tokens into Bloom wasn't that cool anymore. The field had moved on. So we needed to find something else that would work both for us and for AMD.
Dirk Groeneveld [00:04:07]: And that's exactly what we did. We figured it out. We figured out who would be on the team, how exactly to do it. We had to get the data from all of that stuff and then started working on it.
Luca Soldaini [00:04:16]: I think it was, let's look it up. The official birthday of OLMo is February 2nd, 2023. That's when we had like a big sort of half-day summit workshop where a bunch of researchers self-organized a long discussion. Maybe like 40, 50 of us tried to scope down a potential language model project at AI2.
Kyle Lo [00:04:48]: Yeah, it was also extremely bottom-up, because it was not on anyone's radar. Everyone was working on different projects that we had promised for the end of the year. This was very much just like a side gig for us. We had no compute other than these mysterious AMD GPUs that just came. It was like, oh, it's possible. And everyone was just like, yeah, I'll work on this on the side. Let's just start hacking together some stuff.
Nathan Lambert [00:05:14]: How far along the line until you decided on 7B? Like, were these things obvious at the time?
Luca Soldaini [00:05:20]: I think the size of it. This is where Llama's size was. Yeah, we started with seven because seven was the smallest Llama size. This was Llama one. Yeah, Llama one was like first couple months of 2023. Yeah, we started, we started scoping before Llama one. And then when Llama one came out, it made sense to have a configuration that was just sort of close to what they were doing. So it's not too much reinventing. I think seven was.
Dirk Groeneveld [00:05:52]: Yeah, I mean, I think the original scope was recreate Llama one, which would be a 7B at 1.4 trillion tokens. What were we staring at? OPT.
Kyle Lo [00:06:03]: We were staring at OPT also, right? During around that time.
Dirk Groeneveld [00:06:07]: For inspiration. Yeah. And for what not to do in many cases. Was OPT even in the many-tokens regime, or was that still from when people did BLOOM and those?
Luca Soldaini [00:06:18]: I think OPT and BLOOM were.
Luca Soldaini [00:06:22]: They were not, they were not overtrained. They were both scoped to Chinchilla, and they both had extensive logs, so they were very useful, because both of them have hundreds of pages of, like, whatever can go wrong during pre-training. Yeah. I mean, OPT was amazing as a resource for figuring out... you know, we knew nothing, so we needed to know what's important. And yeah, there's that talk Susan has.
Dirk Groeneveld: Yeah, Susan's talk about training OPT. And I always feel it's kind of a shame, because the OPT models are not very good, but they were first; they figured all that stuff out for the first time. I have huge amounts of respect for that.
Nathan Lambert [00:07:11]: And what's the like open source angle thing at the time, or like, had you already identified that there was no open pre-trained data sets for these models?
Kyle Lo: There definitely weren't any open pre-training data sets. I think we were basically looking at the Gopher paper, which had the most documentation, and then Llama one had enough documentation about what data sources they were using, where we were like, okay, let's try to reconstruct what it was. And I think roughly around the same time there were Red Pajama V1 and then shortly after the first Falcon; we were all kind of concurrent works at the time, but basically starting from, I don't know, grab Common Crawl, grab a bunch of sources, and try our best.
Luca Soldaini [00:07:50]: The funny thing, like, we had conversations of, like, boy, it would be good if we didn't have to do the data. That would be one fewer thing to do. But at the time, even when, uh, Falcon dropped, they released like a small preview that wouldn't match the token budget that we wanted for a training run. So it was not even like, you know, it was good work and, oh, maybe we just switch to this one. We quickly realized it was not big enough for the two trillion. So I think it was like, maybe. Yeah. Yeah.
Dirk Groeneveld [00:08:22]: I mean, we did the C4 data set way before any of this. Um, and so my first idea for how to do data was to just run C4, but on all the Common Crawl, um, instead of just whatever the most recent one was at the time. And I actually started writing a repo for that, but then ended up not doing it. This is the C5 repo. Yeah.
Nathan Lambert: This was C4-style data cleaning practices?
Dirk Groeneveld: Yes. That's exactly a re-implementation of C4. And, um, we'd run it on slightly different hardware, um, with more dumps, and that was, that was going to be the entire story until we found we could do better.
Nathan Lambert: Yeah. And, um, for general timelining, I joined pretty much when the OLMo 7B was, I think, mostly done training or wrapping up pre-training, and the instruction tuning at the time was, like, basic SFT with a sprinkle of DPO. Yeah. So I think a lot of that story gets compressed. Like, I'm guessing the actual pre-training happened in, like, the second half of the year, mostly. So it's a lot of prep to get a language modeling system to exist. Yeah.
Luca Soldaini [00:09:32]: I think we handed off v1 of Dolma, the data set that we used for pre-training, like end of June, I think, 2023. We grabbed Common Crawl end of March. Yeah. So all the source acquisition was March, April. Let's see, March, and then yeah, a few months there.
Nathan Lambert [00:09:52]: Um, if someone wants to do the same thing today, which is like, we should train a language model, how much faster would it be to like, is OLMo actually making that much of like, would it be a week with OLMo stuff now, or would it still take a lot of time to set this up?
Luca Soldaini [00:10:07]: I think if, if you want to, um, if you want to train exactly on OLMo data, you know, the data part is much faster. Um, training, I think, requires a little bit more finesse. Dirk? Yeah.
Dirk Groeneveld [00:10:23]: If someone gives you a cluster to, to run on, just figuring out the mechanics of getting your thing to run, just so setting all the environment variables and having the drivers loaded and so on, it might take you a week or so if you're, if you've done that kind of thing before. Um, so that's very different, but you can take a trainer that already works and just, just use it.
Luca Soldaini [00:10:45]: Um, it really depends, like, where, where you start. It's like, if, if you're spinning up your cluster from scratch, then you acquire the hardware, and that hardware has a burn-in period. So the first three months stuff will fail, and that has nothing to do with the model itself. It's just, your hardware is also brand new.
Dirk Groeneveld [00:11:06]: Yeah. I mean, I am eternally grateful for AMD for giving us the compute to get started, but it was kind of difficult to run on.
Nathan Lambert: What was the exact amount of compute? Like, I think when I arrived, that wasn't even what we were using; it was, like, the Lumi discussions and the original amount.
Dirk Groeneveld Of compute was, uh, 2 million hours on Lumi.
Nathan Lambert So, so 2 million GPU hours.
Dirk Groeneveld [00:11:29]: Um, that's... we're training way bigger now than that. Yeah. So I think I did the math recently. It's like, on the order of a million hours is, if you do a thousand GPUs concurrently, like 20 days. Uh, I don't have that math off the top of my head, but, um, the first, the first end-to-end run for the 7B that we did took, uh, 35 days. We can now train that same model again in three days. So things have changed a lot since then. Yeah.
Luca Soldaini [00:11:58]: Well, some rough, rough stats for OLMo 2 anyways: seven and 13, just the final ones, um, were a little bit over 5 million GPU hours combined. And then we have roughly 5 million hours' worth of experiments.
Dirk Groeneveld [00:12:15]: Um, these are, uh, A100, H100 hours? That might be... that seems too high. It's way too high.
Luca Soldaini [00:12:33]: Um, it's like, how do you account for overhead, then?
Dirk Groeneveld Oh, combined.
Luca Soldaini [00:12:36]: It's some of them plus the ultimate training. They're also not using the new one core quickly.
Dirk Groeneveld [00:12:42]: So, yeah, but I'm just thinking if it's, let's say conservatively 7,000 tokens per second, four months on a thousand. Do you think it's less than that?
Nathan Lambert: Like, okay, let's just go and track those numbers down. I think it's interesting. It's like, what is the percentage of improvement still? Like, how much of OLMo 2 being better is just the compute being more stable, just by doing more experiments? And that lets you test things like stability and just get the ducks in a row, rather than, like, the data being so much better. It's an impossible question.
Luca Soldaini [00:13:20]: It's that, you know, the tricky part with using that AMD hardware at the time, specifically that cluster, was that the cluster was being brought up online at the same time as we were experimenting with it. So we were helping that cluster get set up. Because of that, there's a lot of things where we had to second-guess ourselves about whether an issue was on our side or the hardware side.
Nathan Lambert [00:13:58]: Isn't this always going to be an issue with new GPUs coming into the world? Does Microsoft plug in OpenAI's GPUs and they just work?
Luca Soldaini [00:14:06]: I think it was, yeah, it's always tricky. It's a combination of like getting both new GPUs. At the time, AMD was a relatively new vendor, plus the cluster itself being new. So it's like stacking, you know, risky, risky things on top of each other in a way that it's like, oh, if you can, if your cluster is solid, that, you know, the GPUs are brand new. Well, the network is not going to cause issues, but if the cluster is new and the GPUs are new, who knows where the problem sits. Yeah.
Nathan Lambert [00:14:44]: We'll go down the... Yeah. We'll go down the whole stability rabbit hole. Dirk, how close are you to a number?
Dirk Groeneveld: Five trillion tokens at 7,000 tokens per second, which is what we get for the 7 billion, more or less, over the long run, is only about 200,000 hours on H100. So our first estimate was way off.
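For readers following along with the arithmetic, here is the back-of-the-envelope version, assuming the 7,000 tokens per second is per-GPU throughput:

```python
# Back-of-the-envelope for the figure above: ~5T tokens at ~7,000 tokens/sec
# per GPU for the 7B works out to roughly 200K GPU hours.
tokens = 5e12
tokens_per_second_per_gpu = 7_000
gpu_hours = tokens / tokens_per_second_per_gpu / 3600
print(f"{gpu_hours:,.0f} GPU hours")  # ~198,000
```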
Luca Soldaini [00:15:05]: It was... Check the top. I think maybe my memory was wrong. Maybe my thing was... This is why I have this laptop here.
Luca Soldaini [00:15:18]: Oh, no, I was misremembering. Okay. My number is 500K. I remember finding... 500K. Yeah, yeah, yeah.
Nathan Lambert [00:15:27]: So it's like from the first AMD grant of a few million GPU hours on AMD to what we have today. It's like it's gone from multiple million AMD hours to training a model over five times the tokens in half the GPU hours. That's right. Yeah. Like, where do we...
Dirk Groeneveld: I mean, the biggest one is that the MI250, the AMD GPU that Lumi has, is of the A100 era. It's comparable to an A100 in price and capacity. But now we train on H100s, and they're just...
Nathan Lambert: It's just a newer GPU. Yeah, what percentage of tokens in the OLMo 1 code versus the OLMo 2 code are lost to spikes at, like, a 7B, so a scale that we're reliable on?
Dirk Groeneveld: I think OLMo 1 was losing a considerable amount to spikes, but that's impossible to estimate, because there are so many other differences at the same time between OLMo 1 and OLMo 2.
Nathan Lambert Can you summarize the architecture differences? There's a list in the paper. We don't have to be exhaustive.
Dirk Groeneveld That's going to be a lot of stuff. The biggest difference is the init. So I guess now we're getting into what did we actually discover?
Nathan Lambert: These are some audience questions about OLMo 1 and OLMo 2. Finbar, who you might know, specifically asked, like, how did you arrive at the N(0, 0.02) init? I'm like, I don't know.
Dirk Groeneveld: That particular init is the default in Megatron. And the init that we had in OLMo 1 was just trying to be too clever. We stole that init from OpenLM, and they took it from somewhere else, actually. And I don't remember what the original source is.
Nathan Lambert: What is the actual decision-making on an init that's too clever? You, like, think that you can get a better learning regime by fiddling with something?
Dirk Groeneveld: We tried it. We ran it for, you know, 100 billion, 200 billion tokens, and we looked at which one is better. And the scaled init is absolutely better for a long time. So the scaled init is the original one, the OLMo 1 init. It works better for a long time. You have to train for a really long time before you see it come apart: you have to get to 2 trillion tokens for a 7B model, and then things get a little bit dicey. So this is why, you know, this is why we used it for OLMo 1, because it looks quite good for a long time.
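For context on what is being compared, here is a minimal sketch of the two init styles: the fixed N(0, 0.02) Megatron-style default versus a width-scaled init. The exact scaled init OLMo 1 used is documented in the OLMo paper; the 1/sqrt(d_model) version here is just a common variant for illustration.

```python
# Sketch of the two init styles: a fixed standard deviation versus one that
# shrinks as the model gets wider. Illustrative only, not the OLMo code.

import math
import torch
import torch.nn as nn

d_model = 4096
linear_fixed = nn.Linear(d_model, d_model, bias=False)
linear_scaled = nn.Linear(d_model, d_model, bias=False)

# "Simple" init: every weight ~ N(0, 0.02), independent of model width.
nn.init.normal_(linear_fixed.weight, mean=0.0, std=0.02)

# "Scaled" init: standard deviation depends on the width (here 1/sqrt(d_model)).
nn.init.normal_(linear_scaled.weight, mean=0.0, std=1.0 / math.sqrt(d_model))

print(linear_fixed.weight.std().item(), linear_scaled.weight.std().item())
```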
Nathan Lambert: With which of our OLMo models did we figure out that the init needed to change?
Dirk Groeneveld: Because we did a few through the year. We tried that same init with a 70B model, and that did not work. That model stalled out around 1.3 trillion, 1.4 trillion, something like that,
Dirk Groeneveld [00:18:12]: which gets at the heart of the stability work. So we started to think about the stability investigation. I think that was one of the audience questions, right? How do we even go about the stability investigation? Starting from the point of: we're training the 70B and it's not working anymore, what did we do? The first step was to identify the issues that we see in the metrics and reproduce them in a smaller model. And the two issues we saw were, first, lots of spikes that we call fast spikes. The model recovers from them quickly, but they just happen more and more the longer you keep training. And at some point, even the fast spikes kill you.
And the other thing was a growth in GradNorm. It seemed very much that the 70B would always start blowing up once the GradNorm got to 0.4. Regardless of what intervention we did, it would get a little bit further, and then as soon as we hit 0.4 GradNorm, it would blow up again.
Nathan Lambert: So you lowered the learning rate and it blew up again.
Dirk Groeneveld: So, yeah. We would do things like that, increase the batch size, change the learning rate decay, blah, blah, blah, but quickly it gets back to 0.4 and then blows up again. Fortunately, both of those phenomena also appear at the 7B. Even though the 7B trains fine, it has both of those traits. So we decided to focus on those two, because it's too expensive to try all these experiments at 70B, but these are two things we could fix at 7B and then see how it goes. So that was the first step. Now we have a metric where we can pretty quickly, within 12 hours or so, do a run, find out if our numbers are better, and then change something and do it again. And the second component was we took another model that successfully trained and didn't show these issues, that didn't show the slow GradNorm growth and didn't show the spikes either, and we ablated against that. That was the LLM360 Amber model. They're, like, all very open, so we could take their data, we could take their setup, and look at it in great detail.
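A generic sketch of the kind of metric tracking described here: log the total gradient norm every step, flag a "fast spike" when it jumps well above a running average, and watch for slow drift. This is illustrative, not the actual OLMo trainer code, and the spike threshold is an assumption.

```python
# Illustrative gradient-norm monitoring: total grad norm per step, plus a
# simple rule for flagging fast spikes relative to a running average.

import math
import torch


def total_grad_norm(model: torch.nn.Module) -> float:
    """L2 norm of all gradients currently stored on the model's parameters."""
    sq = 0.0
    for p in model.parameters():
        if p.grad is not None:
            sq += float(p.grad.detach().float().norm()) ** 2
    return math.sqrt(sq)


def is_fast_spike(grad_norm: float, running_avg: float, factor: float = 2.0) -> bool:
    """Flag a step whose grad norm jumps well above the recent average."""
    return running_avg > 0 and grad_norm > factor * running_avg
```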
Dirk Groeneveld [00:20:22]: And we basically tried things one by one, sometimes two by two or so, to not run too many operations. But we tried things until we got to a stable setup. There were some other insights at the time. I was really into the Spike No More paper, which is all about the magnitude of the embeddings. So we tried some stuff there.
Dirk Groeneveld [00:20:48]: Pete Walsh on our team tried some other stuff involving AdamW settings that made things even better. And then we took a lot of inspiration from the Chameleon models, because we were talking to that team on a semi-regular basis and they had a lot of stability issues. They found some solutions that we also tried, and some of them worked for us and some of them didn't, and we took the ones that worked for us. So it's always ablating at the 7B scale until our numbers look super smooth and super nice.
Nathan Lambert: How specific do you think these are to our setup? Are these all OLMo-specific insights, or is it just kind of a process you have to walk down? We've heard some of these things before. It's like, with all these developments, you have to do the previous type of thing before you can go bigger or do a more complicated model. Do you think that's actually true, or are there just best configurations at the time?
Dirk Groeneveld I really don't know the answer to that. It's hard. But something I want to know, something I want to do for OLMo three is walk back a few of these things and see in retrospect which ones are actually necessary. And in particular, I'm hoping that some of those are not necessary and they're costing a bit of performance, you know, just to boost our own efficiency a little bit.
Luca Soldaini [00:21:54]: In general, I don't know, you can tell me if this is a useful summary, but it seems like the space of interventions you can take is so big. Another model is not going to translate perfectly, but the hit rate to find a good solution is higher if you start from that model and explore around it, versus trying to explore the full space of possible solutions. Yeah. And then some things will not pan out once you try to rerun them on your setup, and I don't think that's necessarily an indication of anything. You know, we can mistakenly reimplement their thing, not in the way it's supposed to be. It's more like some things translate, some things don't. But it's a good starting point.
Dirk Groeneveld [00:22:55]: Yeah. I mean, we are a fairly conservative bunch with this, right? Because even the 7B runs are actually kind of expensive. So make small changes from a known baseline by and large. Yeah. I mean, everyone has.
Nathan Lambert: Yeah. And the risk is pretty obvious when you look at the cost numbers and, like, who you are trying to beat or not. And it's like, we are trying to put out things people can build on. And it's much better to keep making small progress than it is to go for glory runs and just hope that works. I think both work. The more compute you have, the bigger distribution of investments you can have, but it's not that surprising.
Dirk Groeneveld I mean, I hope that we can be a lab that is a little bit more risk tolerant than others. For one thing, we don't have Meta's resources. So we should be a little bit more aggressive. You know, it would make me much more nervous if I had to bet a billion dollars on our next run than the amounts that we can bet. So we can try a little bit more. I also feel and I hope that our management agrees with this. I feel that if we always, if we're always safe, if every one of our runs works. That means we're not trying hard enough, right? We have to occasionally crash and burn.
Nathan Lambert: I think there are a few every year where you should crash and burn. I think these crash and burns at the big scale get a lot of attention from media and stuff. But it's like, what do you expect them to do? You're walking up a line and might as well try to take three steps at once every so often. Exactly. But I do agree. I think that's a cultural thing that we're trying to navigate. It's like, how do we do more interesting stuff and not just fall into the trap of being the best open model? No one else is doing this. Like, okay, you could do that for a while, but it's not as motivating.
Dirk Groeneveld: And it's not just because it's more interesting to do that; it's also just the fastest way to make a better model, and the fastest way to calibrate your risk tolerance properly. You sometimes have to overshoot. Yeah. It's inevitable.
Nathan Lambert [00:25:05]: Any follow ups on risk?
Kyle Lo Yeah. I'm thinking now it's like, because the 70B crash was so sad. Yeah. And I'm wondering if you look back on it now, it's like, that was the greatest thing for us. We learned so much from that.
Dirk Groeneveld [00:25:19]: It was very important to OLMo 2. I do think so, a little bit. So, I mean, we felt terrible, right? Like, this was an awful time for us. I was like, I'm done. We were the training team that couldn't train at all. I felt so bad. But the work we did following up is some of the proudest I've been about the stuff I've done in my time at AI2. Yeah.
Luca Soldaini [00:25:47]: In general, my thinking about the role of OLMo sort of keeps evolving, right? It was very natural to have OLMo as these models designed to help others do research on language models. That's how we initially framed it; it was a big part of OLMo 1. You just release all the components, because it's important to have these tools available to everyone to study language models. And I think we served that community well. One thing that I hope we can do with OLMo more is that there are some interesting aspects of language models, interesting capabilities, interesting architectural decisions, that for a myriad of reasons sort of get overlooked in, say, a company or in a framework where, you know, you have certain constraints on your model. But they're still there. They are important. And there are questions around what a model should be able to do, how it should operate, and things like that. I think we can take a role where we have, in general, this recipe that enables research on language models, and for a subset of model capabilities that we think are fundamental and no one is touching, it's our space to do work there. I think the prime example that I keep repeating these days is what we did with MOLMo, and
Luca Soldaini [00:27:25]: the vision team was mostly working on it. And MOLMo is a very good vision language model in general. It benchmarks up there. It's not the best, but it benchmarks up there with open models. And then it has this interesting pointing capability that no other vision language model has. And that pointing capability, it turns out, is fundamental for a lot of the applications and robotics that you want to build. It's a core capability the same way that a text model should have long context. And it was cool to sort of emphasize that, like, oh, we have the specific capabilities that would enable all these applications, and so more people should work on those specific aspects. So I think that's a cool way to work on things that folks haven't had a chance to touch on yet.
Nathan Lambert [00:28:24]: I think trying to parse out why this type of situation can happen is not easy, because generally everybody would want to do this. Everybody wants to come up with a new capability that expands the scope of what a given type of AI model can do. And I think it mostly comes down to the culture of where people have space to think about stuff in a more interesting way. Because obviously everyone wants to have breakthroughs that OpenAI and Anthropic then copy. But it's like sitting at a boundary between doing just the same stuff and doing more researchy stuff that you need to have. I have more architecture questions. One is muP. Multiple people are asking about it. I still don't really intuitively know what it is. But are we going to use this?
Dirk Groeneveld We have done a fair bit of work into it. And it hasn't worked for us yet.
Nathan Lambert Can you explain what it is?
Dirk Groeneveld MUP is mainly a way of setting the learning rate, but also some other hyperparameters. By training only small models and then having a guarantee or at least a pretty good idea that it will work also for larger models.
Dirk Groeneveld [00:29:33]: We have implemented this. We've experimented with it. So far in our setup, it works across model sizes. So the learning rate that it says you should use... it doesn't exactly predict the learning rate, it just gives you one learning rate. Basically, the good learning rate for the small model is also the good learning rate for the big model. That works if we change the size of the model. It does not, so far, work if we change the length of the training run. And that's why we haven't been using it so far.
Like number of tokens.
Yeah. Or longer. If we double the length of the training run or we 10x the length of the training run, the optimal learning rate is different in our setup.
Dirk Groeneveld [00:30:21]: It seems like this might be a bug. It should work, but it doesn't.
Nathan Lambert And the positive gain is just that better scaling because you don't have to fiddle with the certain. You know you're getting the right learning rate, which is a crucial hyperparameter.
Dirk Groeneveld Yeah. It's just a better way of setting learning rate. And it works for a few other hyperparameters too.
Nathan Lambert: But there are other open models that use this explicitly, pretty sure. I mean, open weights models. Are the leading ones, like Llama and stuff, using this? Llama does not, I think. But I don't know for sure. We'll see with the next iteration. Even Llama 3 felt like they were still building their org and their infrastructure so fast. It's just like, get in what you can get in, and there will be more models in the future.
Dirk Groeneveld Yeah. I mean, MUP is a shortcut, right? Like you can for many settings where MUP wouldn't work. Or you have to just establish scaling laws and predict what it will be. You could do the same thing for the learning rate. Just MUP lets you do this with even fewer runs. You know, you don't even have to extrapolate anything anymore. You just use MUP and your setting will work. That's the idea.
Dirk Groeneveld [00:31:29]: But you kind of already need a scaling law set up anyways for things that MUP doesn't work for. You know, like architecture changes and so on. Yeah. So in that sense, it's not that important. It's still pretty important. And we're going to keep trying to make it work for us. Maybe just find the bug. But it's not absolutely critical.
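As a rough illustration of the idea Dirk describes (tune on a small model, reuse on a big one), here is a minimal sketch of the muP-style learning-rate transfer rule for hidden weights under Adam. Real muP (Tensor Programs V and the mup package) also adjusts initialization and output multipliers, so treat this as a simplification; the widths and base learning rate are made up.

```python
# Minimal sketch of muP-style learning-rate transfer for hidden-layer weights
# under Adam: tune the LR at a small "proxy" width, then rescale by 1/width.

base_width = 256   # width the learning rate was tuned at
base_lr = 3e-3     # best LR found at that small width

def mup_hidden_lr(target_width: int) -> float:
    """Transferred learning rate for hidden-layer weights at a larger width."""
    return base_lr * (base_width / target_width)

for width in (256, 1024, 4096):
    print(width, mup_hidden_lr(width))
```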
Nathan Lambert: How do scaling laws actually tell you the way to change, like, the width? Do they actually tell you the change in width or the depth, like the proportions of the network relative to the size? Like, what are the actual output variables? Or how are you controlling the architecture you're going to use in the scaling laws? Well, like, I know what it's trying to predict, the accuracy, but are they on a set architecture?
Dirk Groeneveld You would usually vary one thing.
Dirk Groeneveld [00:32:17]: Like, you don't vary everything. You establish how it scales with size. Yeah. And you set your size according to a certain formula. Like you might say, I will go 1.4x the depth and 1.4x the width, so I have a bigger model. And you do that a few times and you draw it on a graph. Then you change your architecture, you do it again, and you draw a different graph. You lay them over each other and you hope that the lines don't cross, and that one of them is clearly better than the other.
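To see what scaling the depth and the width by 1.4x each does to model size, here is a rough parameter count for a decoder-only transformer (attention plus a 4x MLP expansion, ignoring embeddings and norms); the base shape is made up for illustration.

```python
def approx_params(depth: int, width: int) -> int:
    """Rough decoder-only transformer parameter count (attention + 4x MLP)."""
    attn = 4 * width * width   # q, k, v, and output projections
    mlp = 8 * width * width    # up and down projections at 4x expansion
    return depth * (attn + mlp)

base = approx_params(depth=32, width=4096)
bigger = approx_params(depth=int(32 * 1.4), width=int(4096 * 1.4))
print(f"{base/1e9:.1f}B -> {bigger/1e9:.1f}B ({bigger/base:.1f}x)")  # ~2.7x params
```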
Nathan Lambert: Yeah. I've definitely noticed that there are the obvious things in architecture design and the not-obvious things. You obviously make the model bigger, but then there's the subtlety of, like, how tall versus how wide. I think we were talking about something that's, like, much deeper than our model architectures. And it's just like, I'm around these things and I don't have an intuition for whether tall or wide is better. And I think it's just what works.
Dirk Groeneveld: There are some early results from Google, I think, I think they're called EfficientNet or something, that suggest that over a wide range, it doesn't matter whether you go wide or deep. It's not that surprising. That's a pretty old result now. We're following up on a particular result right now, actually. So OLMo 2 is a 7B and a 13B, right? But there also was a 1B that didn't work very well, and we're trying to find out why. And one thing about that model was it was pretty wide and not very deep. So we're checking whether that is the reason why it wasn't very good. So we're sort of in the middle of double-checking this assumption that it doesn't really matter whether you go wide or deep.
Nathan Lambert: Yeah, that makes sense. I think that is something that doesn't matter to most people, but they're probably very interested in it. Just, like, how do you have these blocks, and how do you decide? And it's like, just one of us decides.
Dirk Groeneveld: And it's like, eh, seems right. There are other concerns, right? So we train with FSDP, with ZeRO-3 sharding. So we can try to choose these sizes such that they utilize the GPU in the optimal way.
Dirk Groeneveld [00:34:29]: Which has nothing to do with the sort of abstract training dynamics. It's just the practicality of getting this thing into 80 gigabytes of memory. So then those concerns might take over. There's other stuff like all your tensors, all your tensor dimensions need to be multiple of 64, 128, things like that. GPU math stuff. Yeah, exactly.
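The "GPU math stuff" constraint in code terms: a tiny helper that rounds a proposed dimension up to a hardware-friendly multiple. The specific numbers are illustrative only.

```python
def round_up(dim: int, multiple: int = 128) -> int:
    """Round a tensor dimension up to a multiple that keeps GPU kernels efficient."""
    return ((dim + multiple - 1) // multiple) * multiple

print(round_up(4000))        # 4096
print(round_up(11000, 64))   # 11008
```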
Luca Soldaini [00:34:53]: It's really hard to argue against things that are practically making you run fast. Because it means that if I find something that is 20% faster, your big run trains faster. All the experimental cycles are 20% faster. So it's not very glamorous. But everyone is really happy when we find one of these. Like, oh, this is a shortcut.
Dirk Groeneveld [00:35:16]: I find it super glamorous. I mean, when did you ever have such a clear sign of impact that you can say, I wrote this thing and it is now 20% faster? No, the impact is very good. Yes.
Nathan Lambert The numbers you're changing are not necessarily glamorous. It's just detailed stuff.
Kyle Lo [00:35:34]: I also think the experimental cycle thing is probably the biggest thing for me. What we're seeing consistently is the more experiments you run for a particular idea, the more likely it is to just work out. It's just a function of trying more things.
Nathan Lambert [00:35:47]: It seems like in pre-training, there are very few of those, like, you-just-get-the-idea things. I mean, well, I said post-training has more. But literally, like, we had a meeting with John Schulman. He was like, all the leading labs train RL like this. And we got, like, a three-month head start on one step. But in pre-training, all that stuff, I think, has evaporated.
Kyle Lo [00:36:05]: The human intuition piece is just gone. I think once you do v0, you can kind of do everything with intuition. It's like, oh, look at the data, this kind of makes sense, this seems reasonable. And then after you get to, like, v2 of something, it starts becoming really hard to make sense of what is good for a language model or not. So you kind of just need to try a bunch of stuff.
Dirk Groeneveld [00:36:29]: And then there comes a game of stacking improvements that are worth 2% to 5% each.
Nathan Lambert: I think it's very compounding; at least all the math works out over a year. I think I want to ask about MoEs as well, if you have a different thing you want to say. But it's mostly, like, it seems like we have OLMoE, which, if you look at the plots in the paper, it's like this MoE architecture beats all of our own dense models on efficiency. But it seems like we had a path we needed to go down to make sure dense works really well and get all these improvements, and then you have to, like, feed that back in and merge the MoE streams. We have DeepSeek. We have MiniMax. There are countless other MoEs that get really high eval scores. Like, they're not as easy to do research with, because they have tons of total parameters, and people need bigger clusters to fine-tune them, blah, blah, blah. But is MoE something that you think we just need to do to make better models?
Dirk Groeneveld Well, it's a complicated question, and we haven't quite answered it yet for ourselves.
Dirk Groeneveld [00:37:34]: We did investigate doing a bigger MOE. And we found that the engineering is somewhat difficult. And at the time, we came to the conclusion that we could do that engineering, but then who's going to run that thing later? They also have to have a team of engineers on top of it to make sure they can train this.
Nathan Lambert What does the engineering look like? It's not, like, CUDA-level kernels. It's how you distribute parameters?
Dirk Groeneveld: It's a little bit like... It's a little bit CUDA-level kernels, in that if MegaBlocks by itself isn't enough for you, then it gets really complicated. And we ran into that situation where, if it had to be significantly bigger than what we did, it just got too complicated.
Luca Soldaini [00:38:22]: There is an inference angle. These very big models really get advantages if you tailor them to, like, where you're going to do inference with them. So if you're a big company, you start thinking about, like, how to batch requests, how to, like, serve the model. We could do it ourselves for the place where we're running, but then you start thinking, like, oh, folks who want to use this model on their hardware, they're better served by a dense model than by also redoing this engineering on top. Like, there is, I think, a clear advantage to an MoE if you are also providing an API. Yeah. Very clear cut.
Dirk Groeneveld [00:39:10]: It depends on how we think of the product of OLMo. And the number one is still that it's an artifact to be researched. So other people need to be able to train on it and to modify it and so on, and that is just much easier if you have a dense model. Yeah. But if you think of it as something that gets put into a product, and people will run tons of inference on it and you only really care about the final score that it gets, then maybe the MoE starts making a lot more sense again.
Nathan Lambert Yeah. That's a good answer. I think it's, like, I think people can fill in the blanks of, like, what we may or may not do.
Luca Soldaini [00:39:53]: And I mean... I'm curious, like, what folks at Llama, the Llama team, think about MoE.
Nathan Lambert [00:40:03]: If the Meta AI exists, they're 100% going to do an MOE.
Luca Soldaini [00:40:06]: I mean, it's interesting, right? It's, like, if they're expecting that the Llama users are going to be, in fact, a few large companies that can figure out inference, then MoE makes sense. But if they're thinking more about a model where it's great if it's adopted by a million developers, large and small, then, you know, they're still going to reach for a dense model. Yeah. Exactly. It's so much easier for people to set up their own inference with a dense model.
Nathan Lambert [00:40:40]: Yeah. I think we've gone surprisingly long without asking about data. It's, like, how much more, is it just an infinite hill to climb on data? It's finding good data and filtering bad?
Kyle Lo [00:40:53]: I mean, I think it's an infinite hill to the extent that everything else is also, and you can kind of keep improving, right? But yeah, the main threads constantly are: got to get more data, because if you can actually get new data that's not in your distribution, it's probably interesting to study how that adds in, and you have more to work from. So if you have, like, a strict quality filter, you can still get a high token yield if you start with a much larger pool and filter down. So getting more data is really, really critical, especially if you can target specific pockets that you think are missing. You can always keep iterating on better filters and understanding how those filters affect performance. And everything kind of interacts with everything else. Like, safety filters interact with quality filters, which interact with deduplication, all of these together. So there's an infinite search space, even just in the ordering of these operations. So keep throwing more things at it.
Luca Soldaini [00:41:53]: Yeah, it's very much just stacking small improvements. Yeah, shots on goal. I think the way it looks is, now that we have these multiple stages of pre-training, we think about what kind of improvement you want to get from data at the various stages. Clearly, the improvement you want to get from data you put at the end of training is different from the improvement that you want to see at the beginning. It comes with a different set of requirements. One thing that is really useful (intuitions are often wrong) but worth spending time on is figuring out, if you have a data ablation idea, what is the fastest way to disprove it, which requires a little bit of experimental design. And then, yeah, you've got to fiddle with it, especially, you know, when you do the first version, where it's very easy to measure improvements. And then as you get to more refined versions, you've got to think about how you measure your improvements. But, yeah, after you're done with the basic stuff, after your V1 is done, there's never, like, one big thread of, this is the one data thing. It's more like stacking your Lego bricks to get to a better model.
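An illustrative sketch of how these filters compose into a pipeline, and why their ordering is itself a design choice. The field names, thresholds, and stage order here are invented for the example, not the actual Dolma tooling.

```python
# Toy pretraining-data filter pipeline: quality, dedup, and safety stages
# composed over a stream of documents. Everything here is illustrative.

from typing import Callable, Dict, Iterable, Iterator, List

Doc = Dict  # e.g. {"text": ..., "quality_score": ..., "is_duplicate": ..., "flagged_unsafe": ...}


def quality_filter(docs: Iterable[Doc]) -> Iterator[Doc]:
    # Keep documents above an (invented) quality-classifier threshold.
    return (d for d in docs if d.get("quality_score", 0.0) >= 0.5)


def dedup_filter(docs: Iterable[Doc]) -> Iterator[Doc]:
    # Drop documents an upstream deduplication pass marked as duplicates.
    return (d for d in docs if not d.get("is_duplicate", False))


def safety_filter(docs: Iterable[Doc]) -> Iterator[Doc]:
    # Drop documents a safety classifier flagged.
    return (d for d in docs if not d.get("flagged_unsafe", False))


def run_pipeline(docs: Iterable[Doc], stages: List[Callable]) -> List[Doc]:
    # The ordering of stages is one of the ways the filters interact.
    for stage in stages:
        docs = stage(docs)
    return list(docs)


kept = run_pipeline(
    [{"text": "example", "quality_score": 0.9}],
    [quality_filter, dedup_filter, safety_filter],
)
print(len(kept))  # 1
```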
Nathan Lambert [00:43:18]: Do you think you can iterate faster on, like, end-of-pre-training, whatever you want to call it, like, the highest-quality mid-training data only? Yeah. Have you, like, started that recently?
Luca Soldaini [00:43:28]: I think it depends... we need a little bit more evidence of this, but it depends on the role of the data. The reason why we started doing mid-training at all is because we were interested in having base models be primed with certain capabilities that we didn't get during the long pre-training phase. And for those, it's really easy to iterate on new data sources that would improve those capabilities at the end of pre-training. But during the pre-training phase, the important aspect that we think about is efficiency of your data: you know, if there is a version of your data where you train on it and the model gets to performance X 20% faster, it means that you can train 20% longer, right? Or run more experiments. Or run more experiments. And for those, in some cases, you can use mid-training as, like, a proxy for this. In other cases, it doesn't quite make sense, so you have to come up with, like, maybe experiments through scaling laws, maybe experiments through some other technique. But yeah, it really depends on what role a data set plays in the various stages of pre-training.
Nathan Lambert [00:44:53]: So it seems like, like, compared to Dolma 1, which is, like, do the thing, it's all targeted abilities. It's, like, we want to be better at things. We put people on this. It's, like, targeted abilities or where we think we can get a lot of data.
Kyle Lo [00:45:05]: Like, a certain data source that hasn't been mined for stuff. Yeah. We have to be opportunistic because it's so hard to get data. And for us, especially if we want to be open with the data, we also have to do our due diligence. Like, we're going to study this data, put all this effort in, and we want to still be able to share it with everyone. So...
Nathan Lambert [00:45:22]: If you were in a lab that didn't release data, do you think you could make more progress on it? Like, how much does that actually matter?
Kyle Lo [00:45:27]: Oh, yeah. Oh, my God. Such a time sink.
Luca Soldaini [00:45:31]: I mean, it's a little bit of a constraint that we put on ourselves. Yeah. And this is not even about getting data in ways that aren't legal, right? You could form partnerships. We have people knocking at our door all the time asking if we want to buy their data set. And they're, like,
Nathan Lambert [00:45:48]: I've been contacted by one ML owner to try to facilitate a data deal.
Luca Soldaini [00:45:52]: Oh, yeah. Twitter. Oh, my God. But the first follow-up is, like, are you cool if we release the data? Of course, they're not. Yeah. So there's plenty of data that you could acquire from people, but then you can't release it. So that's a complication to progress.
Nathan Lambert [00:46:15]: Yeah. This is more of a self-question, but, like, how much do you think mid-training should be, like, a philosophical shift in how we organize teams? Because it's very easy to do. I mean, we've already consolidated, like, our training and data to base, which is not surprising. But this is mostly hypothesizing on what other people do. It's, like, how close do you think this kind of end of pre-training to post-training handoff should actually be?
Kyle Lo [00:46:40]: I think it makes sense as a thing. These boundaries are, in theory, arbitrary, but think of the extreme: if you had a perfectly oiled machine, you'd have a very smooth transition between pre-training to mid-training to post-training, and there would actually be no boundaries. That's the theoretical ideal. You could probably squeeze a ton of performance out by smoothing that out. But in the real world, stuff is messy. The real world is: you're three trillion tokens into your base model run, and then you signed a new data deal. You've got to do something with it, and you're not going to redo the whole run. Well, you've got to figure out something. So maybe that's mid-training, right? Mid-training is when you have an opportunistic need for something, or you're training something and someone catches a bug, which happens all the time, like a data bug or some training bug, and you're like, oh, I have to patch it. So then things shift fundamentally, and you've got to know how to deal with it. Just because these large training runs aren't super repeatable, and they take so much time that the world state changes all the time, you always need some strategy for, oh, I'm near the end of pre-training versus I'm near the beginning of pre-training versus... Yeah.
Nathan Lambert [00:47:47]: It's like, we're obviously trying to solve long context, so this fits right into this. It's like, we're going to do this thing. Does it go, where does it go? Some people do it in post-training. Yeah. There's some component during pre-training.
Kyle Lo [00:48:00]: It's kind of just like, you have to follow a few recipes and figure out what works for your team. Yeah. And so much of it is just, if it's expensive, try to push it off as much as possible. Because if it's risky, push it off as much as possible. If you can intervene to get the same result much later, huge win. You can try a bunch more things. If you have to intervene because it's some core thing that has to be baked into pre-training time, you're kind of... It's a sad space to be in. But then that's the thing where you have to intervene. That's the pre-training data.
Dirk Groeneveld [00:48:29]: There's a big question that I'd love to get an answer to, but I don't even really know how to think about it. But the question is, what makes a pre-training model a good candidate for mid-training fine-tuning? Because all we really try to do is we try to maximize our metrics, but we don't really know that those metrics are what makes a good step zero for post-training.
Nathan Lambert I think a relevant thing, and I don't even know if I've told you this or how to take action on it, is that with the multiple stages of post-training, we got advice about the instruction-tune phase that's like, eh, it can be a little broken. You can have some crap in there. It'll get fixed later on. And it's like, why is that okay?
Nathan Lambert [00:49:14]: It might be the same thing in pre-training. It's like, you want to get in the right... It's more important to get in the right ballpark than the right exact number. Yeah.
Luca Soldaini [00:49:21]: It feels like it's more about not how to make a good model for post-training. But what to avoid so you don't have a bad model post-training. Yeah.
Nathan Lambert [00:49:33]: There's a whole other question, which is how to make a base model that's easy to fine-tune in general, versus one that, with the right finagling, can get the absolute best numbers. Which, for OLMo, I think would be really great: here's a super stable platform. A lot of people have complained specifically that Llama Instruct is hard to fine-tune after most of the post-training has been done. Because this is where people at companies start. They're like, this is the best open-weight model, I want to add a little thing to it. And a lot of people have difficulty fine-tuning it. It's different at the base, because most people can't do the full instruct pipeline. But for researchers, having a stable platform at the base is way more valuable.
Kyle Lo [00:50:12]: There's an interesting debate about this, like, what makes a base model a good base model. We've had it a bunch of times, also with other people. It seems like there are two hypotheses on how data affects base model behavior. One hypothesis is: you need quality data so that you don't get any spikes, you have stable training, you have no bugs. And once you pass that level of quality, you go as diverse as possible. It's just about an init for the model, so that it can go in literally any direction, and so diversity is what matters next. That's one hypothesis. The other one is: it's all domain effects. There's a notion of quality, but you can just keep climbing, and you can keep getting more and more as long as you're very clear about what target domain or target application you're after. You just keep getting closer and closer. This goes into, like, continually pushing. It's just domain effects all the way down. If you're only evaluating on this particular suite, you can always get your base model to be better for that; just keep climbing on it to get it more and more similar. And, like, you think about, I care about this application, this suite of applications, all the way through, from the base model on. Can you not kind of have both? I feel like I'm confused about how actual generalization fits into this. They're competing ideologies in terms of how you set up your team. If you believe in the first one, you're all in on diverse data acquisition, you're all in on efficiency and stability for your pre-training, and then you just get as much different data as possible and you're post-training all the time. If you believe in the latter one, you solve backwards from, this is what I want the model to do, and you make changes everywhere to try to squeeze performance out of that class of problems: in the pre-training data, the mid-training data, et cetera.
Nathan Lambert [00:52:01]: How important do you think the actual multi-tag categorization of every data document is? Like, we know that some of these labs have really advanced tagging of all their pre-training documents. Is it essentially worth doing that and choosing among them, which is very much like crafting a recipe for your pre-training, versus just getting a good classifier with good numbers and rolling with it?
Kyle Lo [00:52:27]: We have tags. That's fine.
Luca Soldaini [00:52:31]: The tags are useful even if you land on this idea of, like, let's use as much as possible, diversity is important. A lot of web data comes with absolutely no useful metadata. You have, like, URLs, and a URL on its own doesn't add much; you have to do things on top of it to make it useful. So the more you have in terms of categories and metadata information, the more you can start using it as a tool to try extra techniques. Maybe the technique is mixing your data in a certain way. Maybe it's filtering things out. Maybe it's designing benchmarks and trying to correlate with those. Otherwise, you just have this giant bucket with maybe one quality knob, and it's very hard to make progress if all you can adjust is one number: we cut for quality here. So I'm not surprised that the big labs all have these tags. I want to know how they use them. That's the part we don't know. Yeah.
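For a sense of what using tags beyond a single quality knob could look like, here is a small illustrative sketch; the tags, weights, and thresholds are all made up for the example and are not any lab's actual setup.

```python
import random

# Hypothetical tagged documents: in practice the tags would come from
# classifiers, URL heuristics, or metadata attached during crawling.
corpus = [
    {"text": "def add(a, b): return a + b", "tags": {"domain": "code", "quality": 0.9}},
    {"text": "The mitochondria is the powerhouse of the cell.", "tags": {"domain": "science", "quality": 0.7}},
    {"text": "BUY NOW limited offer!!!", "tags": {"domain": "web", "quality": 0.2}},
]

# Target mixture over domains: one of several knobs tags make available,
# alongside filtering and tag-targeted benchmark design.
target_mix = {"code": 0.5, "science": 0.4, "web": 0.1}

def sample_batch(corpus, target_mix, batch_size=8, min_quality=0.3):
    # Filter on quality first, then sample according to the domain mixture.
    eligible = [d for d in corpus if d["tags"]["quality"] >= min_quality]
    weights = [target_mix.get(d["tags"]["domain"], 0.0) for d in eligible]
    return random.choices(eligible, weights=weights, k=batch_size)

for doc in sample_batch(corpus, target_mix, batch_size=4):
    print(doc["tags"]["domain"], "->", doc["text"][:30])
```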
Kyle Lo [00:53:51]: But it's also not just that you have more levers to pull and, you know, the more things you can try, the better. You also want tags that are actionable, right? So if you had a sensible notion of a tag and you realize, oh, as you keep adding more of this data, performance keeps going up, at some point you might be, like, we're out of that data, we need to go get more of it. You want that tag to be something that's understandable so you can go and negotiate another deal, do synthetic generation, et cetera, of that type of data.
Nathan Lambert [00:54:13]: Do you think most of the synthetic data gen, is for very specific things at pre-training? I mean, it kind of has to be. Probably, yeah.
Kyle Lo [00:54:25]: You can't just be, like, oh, generate synthetic data in general. That's not a thing; I don't know what that procedure would even be.
Luca Soldaini [00:54:30]: It's probably to prime the model for whatever you need during post-training. Like, you know, we've seen, normally with math, it's much better if your model has an elementary knowledge of math to, like, improve on. It's the same with everything else: oh, I want to do RL on this. If the model is completely random on it, you're going to have a very hard time.
Nathan Lambert [00:54:52]: Yeah, it's, like, I guess a good transition. It's, like, what do you three think post-training is, should be, or, like, is not doing?
Kyle Lo [00:55:02]: It's elicitation.
Nathan Lambert I'm coming around to this view that it seems that you can extract abilities from the model.
I think it's totally elicitation. Like, the Hitchhiker's Guide to Data paper from Google, yeah, that one was very, that one had, like, a very specific experiment. But it seemed like that was pretty strong evidence towards it. It's, like, you filter out all of this type of data, you literally can't fine-tune that model. You can never recover that. There was a history detection, right?
Nathan Lambert [00:55:28]: I think if you do more flops, you potentially can. I mean, obviously we're not talking about, like, o1-style things here. But, like, there are datasets out there with, like, 15 million math-only instructions. Are those going to make a model really start doing a ton of math? At some point, yes. Yeah. But I think for most of it, it's almost easier to operate assuming that the capabilities are in the model and post-training gets them out.
Luca Soldaini [00:55:53]: Sometimes there's this very large set of things that you do in pre-training because you have a sense of how they play into an application. In one case it's very obvious: for a code model, you want it to do completion, so you add a fill-in-the-middle loss, maybe at the beginning of pre-training. It's like, oh, then I can plan my entire pipeline around that. So far, it seems all about that. I don't think we have cracked a good recipe to do the same for things that are not capabilities but are more like recalling facts. Oh, yeah. Or, like, long-tail knowledge.
Nathan Lambert [00:56:29]: Yeah. It's, like, all of us know, or at least people out there have, MMLU numbers that go up in a specific stage. Like, instruction tuning boosting MMLU, I'm like, what are you putting in there?
Dirk Groeneveld [00:56:42]: What do you think of mid-training then? Is that elicitation or... Mid-training? I think it's still...
Kyle Lo [00:56:47]: I think it's still putting knowledge in. Mid-training is still pre-training, just with strong domain effects. It's smoothing out the boundary: you have a very, very sharp distribution shift when you do post-training, and we know from, like, ML 101 over the past five, six years that smoothing out the transition between major domain shifts helps. But we don't have a clear example of where it helps with specific knowledge acquisition. Yes. For that, we don't know how to do it. But for capabilities that are really easy to evaluate, things where you see really big progress, it's like, yeah, smooth this out.
Nathan Lambert [00:57:30]: So, like, why is post-training important to the release side? Some of you came around to post-training being important for getting traction later on. Is that just, like, how the ML ecosystem works?
Dirk Groeneveld Oh, I mean, the base model is kind of useless, right? Yeah. There's only so many next tokens you need to know about. Yeah.
Luca Soldaini [00:57:50]: But it's like, you know, we've seen papers that use OLMo for research where the idea for the research only came from comparing the base model with the instruction-tuned model. Like the one where folks were looking at certain patterns of speech in OLMo 1: where do they come from? Do they come from pre-training? Do they come from post-training? Even if you just want to do research, it's useful to be able to compare them side by side. So it feels wrong to put out a model that cuts the set of problems you can study in half until you have the post-training ready. And it's useful to have it all in one package so you can use it right away.
Kyle Lo [00:58:40]: Post-training is just, like, a really, really long eval loop. Yeah. And it's not, like, oh, base model, few-shot it on some benchmarks. No, no, no. We eval it by post-training it and then evaluating the post-trained model.
Nathan Lambert [00:58:54]: Yeah. I mean, to some extent, it is kind of true. I mean, that's how we should think about it.
Dirk Groeneveld [00:58:59]: If we could do that cheaply, we would totally hill climb on that metric.
Kyle Lo I think that's the metric. Because if the base model is a good init for the post-training, which is the model people actually want to use, then we evaluate it both on its own and on its status as a good init.
Nathan Lambert [00:59:16]: Yeah. So then the question is, how important do you think post-training research on the specific checkpoint is? Like, how important is genealogy versus general recipes? Because I openly think we under-index on using one model. Much like the path to stability, which was an eight-to-ten-month, really specific thing, I'm guessing if you're really in a narrower regime, you can just keep tuning these little things. Yeah. Hopefully at some point we can do better with new models. Yeah.
Nathan Lambert [00:59:52]: Okay. We're kind of going to some wrap-up things. How do you think about release decisions? Like, should AI2 release everything that we ever tried? Or is it, like, when should we actually get models out the door?
Dirk Groeneveld I mean, I would love to do that, actually. Especially the failed runs. You know, where else could you get a repository of failed runs? Yeah. I think it's just a matter of giving other people the possibility of looking into these failed runs and finding out exactly when and why they failed. In practice, that's super difficult, because just releasing something is hard. You need to upload the checkpoints and translate them into a different format. You have to describe what you were even trying in some way that makes sense to people outside of the org. Give them access to the Weights & Biases runs and to the logs. It's just a lot of work, and there's always something else that seems more pressing than that.
Nathan Lambert Seems like a scaling thing. Like, how much we can share is capped by how we can scale our org. We're not going to have a complicated management hierarchy or an entire org that is just support, and everything you upload, you take on as a support burden. We've literally just seen the envelope grow, grow, grow. The more people use our things, the more support you end up doing. Like, people want to use it. That's the cost of it.
Dirk Groeneveld I guess it's a great problem to have. People want to use it. People want to use us.
Luca Soldaini [01:01:15]: And it's funny. To make a checkpoint release that's actually useful, you need the person who was involved in it to sort of pour their knowledge into a format that people outside can consume, right? Otherwise, we could just open up our S3 bucket of checkpoints and it would be utterly useless, because what if you want to know more about the parameters? So, as long as we optimize what we release, we have the bandwidth to provide the support around it. If people want the 70B failed run enough, you know, I'm sure we can release it.
Nathan Lambert [01:01:57]: It seems like it's just finding the right medium to release things. I think long technical reports are really good for the stuff that we do, because they put everything in one place for people, and they almost make things easier on demand in the future. Whereas we could just drip models out all the time, but that's just not... In terms of making progress on things that are easy to build on, it's probably not worth it.
Kyle Lo [01:02:19]: In fact, there's even a cost to it, right? The big example here is the release of OLMo 1 0724, the July one. Yeah. Using that for research has probably been one of the tougher experiences, because it didn't come with a blog post, it didn't come with docs. And so, yes, it still has weights and checkpoints and everything, but comparatively, even when people come to us, we're like, oh, we recommend you use 0424. And now with OLMo 2, we're like, oh, that's the one we recommend, because it has all the documentation. So just dropping something doesn't seem like it really helps.
Nathan Lambert [01:02:56]: I would say we should move faster than, like, the OLMo 1-to-2 iteration. But the in-between releases are not necessarily even worth it. Which is very odd when you think about being fully open. It's just kind of one of the costs of doing business.
Kyle Lo [01:03:10]: It's like being fully... You want to be fully open, but you don't want to add noise. And you don't want to waste people's time. Right? So if you drop something that's kind of half done or half baked, and people start spending time on it, only to get frustrated later, you've cost them something.
Nathan Lambert [01:03:22]: How does this relate to how pre-training is changing? Like, do you think we need to invest differently? Openly, a lot of startups are changing their relationship to training, whether they're going to use Llama or pre-train on customer data. And we have X compute budget, so does any of this come into play? Or is it all the same as what we've been talking about: continue to hill climb, do what you can, make reasonable trade-offs, and think about who will actually use the models? Is it not too different?
Luca Soldaini [01:03:54]: I think that, for me, the cutoff point is: is there something useful and generally interesting to add if you pre-train? Take all these mid-training things that we concluded: they couldn't have been as clean if we had started with an already pre-trained model like Llama. So it's, like, is there really something useful to add to the conversation if you pre-train? Maybe we get to the moment when the answer is no for the work, like I was saying, but it feels like there's still value to add to the conversation, at least on the research side of pre-training. There is also a question of, like, we know how to help researchers, but we want to help more than just researchers with the models we put out. And if we think there is an application or a use case where we can do a very good job by starting with someone else's pre-trained model, we shouldn't waste compute on pre-training from scratch. But it's an ever-evolving question, really. We can make decisions six months out, maybe? Maybe a year?
Kyle Lo [01:05:24]: Well, that's what I would say.
Kyle Lo [01:05:27]: I know. You're the pre-training. You're the hardcore who's pre-trained some models.
Dirk Groeneveld [01:05:34]: There's lots of runway left in pre-training. The big labs are fairly conservative because they have to be, but that doesn't mean that we're done. I also feel that the point of OLMo is to make pre-training research accessible to more people, because even if you don't have the resources to pre-train the whole thing from scratch, you can still use our checkpoints and our code to prove out some sort of improvement. And as we've seen in other areas, Microsoft tries to push .NET or Apple tries to push Swift, and it's a really big effort for them, and the open-source community says, I don't care, we're going to use Python. And Python wins. So if you can somehow enable the vast resources of a million people banging on a thing, even a company like OpenAI or Meta cannot compete with that. With OLMo, I'm hoping we can capture a little bit of that open-source enthusiasm and academic enthusiasm.
Nathan Lambert Do you think it'll get better this year? Because a lot of academics are bringing up clusters of tens to hundreds of H100s around the country. Before, it was just, like, Harvard had 500, and MIT, or whatever. But now it's the long tail of universities. Like, there are a lot of people.
Dirk Groeneveld [01:07:12]: And then, you know, if you have 200 H100s, you can at least establish scaling laws for your idea. So what I'm hoping is someone uses OLMo to try some new thing and establishes the scaling laws up to a 3B model or whatever. Then we take it and prove it out up to 30B or whatever our compute allows. And if it still works, and if it's published in the open, then they can take it. And let them win. We don't need to win. Yeah.
Nathan Lambert [01:07:36]: I mean, they would never tell us that they'd win. Yeah. Like, what do we need to achieve this? Do we need resources and compute and certain people? Like, do we need more feedback from the community? Do we need feedback from people at labs telling us which things to do?
Kyle Lo [01:07:48]: Compute and people, for sure. That is undeniable. If you have more compute, you can try more things, you can go bigger. If you have more people just trying more things, especially on our artifacts, we'll just learn so much more and not have to spend so much time guessing, trying to piece together things from other people's pieces. And sometimes it's nice to just get something from outside: if they did it on OLMo, we can immediately start working off of it. So people and compute, always, for sure.
Luca Soldaini [01:08:20]: One thing is, we get a lot of feedback like, I really like AI2, I would like to use OLMo, but it's missing this feature. Which is great. I love that feedback. It's helped us a lot in prioritization. If we could get more, I would also love aspirational feedback, like: none of the models is doing this, but I have a good use case for it. Those, to me, are always very inspiring to read. Whether we'll do it or not is, you know, a question of whether we can do it and how it fits with other things.
Kyle Lo [01:08:55]: But those are always very, very welcome. You know what would be really cool? I think what would be really cool is like more projects in space that you can't do unless you have some sort of fully open constellation of artifacts. Yeah.
Nathan Lambert [01:09:09]: The thing that Dirk mentioned: does anyone ever do the thing where you load the model onto one GPU and, like, iterate through the batches to find what happens when it blows up, or, like, when a loss spike happens?
Dirk Groeneveld I mean, to some degree we did this ourselves. Yeah. But it's something that people can do. It's not like we wrote a paper about it, but I would love to see a detailed write-up of, millisecond by millisecond, what happens in attention when a loss spike happens. How does it actually happen? These are the things that people can do.
Nathan Lambert And it's like, you just have to keep, keep zooming into a specific level of details in what happens.
Dirk Groeneveld Yeah. I mean, right now someone is using the various checkpoints to see how a certain metric that we're interested in develops throughout pre-training. Yeah. And you can do that with fairly minimal compute. You don't have to be AI2. Yeah. It's like one of my favorite weird language model papers, the Sander Land "Fishing for Magikarp" paper. You can get much more actual feedback looking at weird tokenization, at tokenizer impacts and tokenizer-data interactions on OLMo, than just poking at API models and trying to figure it out.
Kyle Lo [01:10:20]: There's also a lot of really cool work looking forward from the checkpoints that we have, with the data batches, and trying to do something like: okay, let's replay everything between these two checkpoints while injecting some different data or manipulating the data, just to see whether it turns into something different. How big of a fork does it cause? Yeah.
Nathan Lambert [01:10:39]: Like, if you add the same intervention, how big of a divergence does it cause? Exactly.
Kyle Lo [01:10:43]: So, does it reconverge? Or the same interventions early in pre-training versus later in pre-training, messing with the data. That stuff is really cool.
Dirk Groeneveld [01:10:49]: I mean, I've complained about this for a long time. Grad students, I think, are a little bit hesitant to go into pre-training stuff because they need to publish four papers a year, and it's pretty difficult to do that when your cycles are so long. But on the flip side, it's a less busy field, so you're less likely to get scooped, and the field doesn't change out from under you while you're in the middle of your project. Yeah. Post-training is not like that; so much happens on that side.
Nathan Lambert It makes no sense. It's just like, pick something you want to do and people will probably do it. That's okay.
Dirk Groeneveld [01:11:31]: So I'm hoping that by publishing all of this stuff and making all the checkpoints available and the data and so on, we can enable more people to work in that side as well.
Nathan Lambert Yeah. Anything else you guys want to add?
Kyle Lo [01:11:49]: Like, comment, subscribe.
Kyle Lo [01:11:52]: Yeah, I think that's it.
Nathan Lambert [01:12:01]: Okay. Thanks for listening. If you have questions for any of us individually, the Bluesky and Twitter handles for everyone on this podcast are below, and you can reach the general OLMo contact email at allenai.org. We're really happy to help, and we want to keep building this kind of open scientific ecosystem of language models. So all the best. Bye bye. Bye.
Full post for links, images, etc: https://www.interconnects.ai/p/deepseek-r1-recipe-for-o1
I have a few shows to share with you this week:
* On The Retort a week or two ago, we discussed the nature of AI and whether it is a science (in the Kuhnian sense)
* I appeared on Dean W. Ball and Timothy B. Lee’s new podcast AI Summer to discuss “thinking models” and the border between post-training and reasoning methods. Listen here.
* Finally, a talk I gave at NeurIPS on how I think about post-training for AI applications is now public.
This post is likely getting cut off in email inboxes — I recommend reading online by clicking on the title!
Yesterday, January 20th, China’s open-weights frontier AI laboratory, DeepSeek AI, released their first full-fledged reasoning model. It came as:
* A flagship reasoning language model, R1, trained via a 4-stage, RL heavy process. It is MIT-licensed which means companies and researchers can build upon and train on its outputs to accelerate the development and deployment of reasoning language models (RLMs).
* An RL-only reasoning model trained directly from their V3 base model, R1-Zero (used to create training data for full R1).
* A suite of open-weight models finetuned with supervised finetuning (SFT) data derived from R1 (similar data to one of their intermediate training steps).
* A technical report detailing their RL training methods.
* Models are available at chat.deepseek.com (via DeepThink) and in their new app.
This post is less about the evaluation results (which, of course, are extremely good and shown below), but rather about how training is done and what it all means.
This is a major transition point for reasoning model research. Until now, reasoning models have been a major area of industrial research without a clear seminal paper. When language models took off, we had the likes of the GPT-2 paper for pretraining or InstructGPT (and Anthropic’s whitepapers) for post-training. For reasoning, we were staring at potentially misleading blog posts. Reasoning research and progress is now locked in — expect huge amounts of progress in 2025 and more of it in the open.
This again confirms that new technical recipes normally aren’t moats — the motivation of a proof of concept, or leaks, normally gets the knowledge out.
For one, look at the pricing of these reasoning models. OpenAI was likely charging more for its model due to the costs of long-context serving and being the only model in town, but now o1’s pricing at $15 per million input tokens / $60 output looks out of place relative to R1’s pricing at $0.55 per million input tokens / $2.19 output (yes, o1-mini is cheaper at $3/$12 per million, but that is still roughly a 5x difference). The price war that is coming for reasoning models will look like the Mixtral inference price war from 2023.
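To make the comparison concrete, here is the arithmetic at the listed prices for a hypothetical, output-heavy workload (the token counts are assumptions, not a measured benchmark):

```python
# Cost comparison at the listed prices (USD per million tokens).
prices = {
    "o1":      {"input": 15.00, "output": 60.00},
    "o1-mini": {"input": 3.00,  "output": 12.00},
    "R1":      {"input": 0.55,  "output": 2.19},
}

# Hypothetical workload: 5M input tokens and 20M output tokens
# (reasoning models are output-heavy because of long chains of thought).
workload = {"input": 5, "output": 20}

for model, p in prices.items():
    cost = p["input"] * workload["input"] + p["output"] * workload["output"]
    print(f"{model}: ${cost:,.2f}")
# o1: $1,275.00, o1-mini: $255.00, R1: $46.55 -- roughly 27x and 5.5x more
# expensive than R1, respectively, on this workload.
```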
With o3, OpenAI is likely technically ahead, but it is not generally available nor will the weights be available anytime soon. This is the first time since Stable Diffusion’s release that the most relevant and discussed AI model has been released with a very friendly license. Looking back at the journey “open-source” AI has been on over the last 2.5 years, this is a surprising moment in time, one that will be marked in the history books.
We don’t entirely know how these models will be used in the future beyond code and math, but noises are constantly bubbling up that OpenAI’s o1-Pro is the best model for many more challenging tasks (I need to try it myself before making definitive recommendations).
The most useful post to write now is one that establishes the research area, the do’s and don’ts, and the open questions. Let’s get into the details.
The DeepSeek R1 training recipe for reasoning
The training of R1 comes in 4 stages:
* “Cold-start” of supervised finetuning on synthetic reasoning data from the R1-Zero model.
* Large-scale reinforcement learning training on reasoning problems “until convergence.”
* Rejection sampling on 3/4 reasoning problems and 1/4 general queries to start the transition to a general-purpose model.
* Reinforcement learning training mixing reasoning problems (verifiable rewards) with general preference tuning reward models to polish the model.
Below, the post breaks down each training stage into its core components, insights, and open questions.
The winds of o1 replication have been blowing strongly away from any sort of explicit search (especially at inference time). It really was, and is, a language model, with the new reasoning behaviors coming from a lot of RL training.
Before we start, remember that to do this reasoning training well you need a very strong base model with long-context capabilities. Much like for standard post-training, we don’t really know what traits of a base model make for one that is more suited for direct RL training.
Step 0. Training R1-Zero to initialize R1 with synthetic data
DeepSeek R1 Zero will be best known as the first open model trained with “large-scale reinforcement learning (RL) without supervised fine-tuning (SFT) as a preliminary step.” Rumors had mentioned this for o1, but understanding how it worked wasn’t clear. This is a funky model that DeepSeek reports will sometimes change languages in reasoning or show signs of other reliability issues.
The minor usability issues with R1-Zero show why more than just large-scale RL is needed to train a fantastic reasoning model, but the RL part is the key to unlocking the reasoning behaviors we are searching for.
They include the most interesting results for R1-Zero, including the plot I’ve been asking for of RL-training time scaling. Since o1’s release, everyone has been obsessed with the plots showing how inference time is correlated with evaluation performance. Inference time is far easier to elicit (or force by using a framework like Monte Carlo Tree Search), but showing training time improvements via RL is the real foundational result. This is the result I’m searching for in my research.
And an unsurprising, yet very satisfying plot of length growing with training. This could be mixed with the above plot to make one of the “inference time scaling” plots we have seen many versions of with less clear methods.
In both of these plots, it looks like the numbers could still be going up if they let the RL cook longer. With the pace of progress so high, these laboratories get more gains by ending the jobs near saturation and starting the next experiment instead of seeking that last 1%.
Most, if not all, researchers will skip the step of training an R1-Zero style model because they don’t need to. DeepSeek made it clear that their “cold start” of SFT reasoning traces makes the final R1 model better — this is unsurprising, as they want R1 to be a certain type of instruction-tuned model. It’ll help avoid some of the “RL oddities” in R1-Zero that DeepSeek mentions like changing language mid-generation.
Still, the area of RL-on-base-models should be studied further. The way that R1-Zero can be trained is quite clever, as most base models without any instruction tuning have major issues with rambling and never generating a stop token. R1-Zero avoids this with a system prompt telling the model to wrap its reasoning and answer in HTML-style tags. Additionally, I suspect this type of training wouldn’t work on older base models that don’t have some standard post-training style instruction data in the pretraining corpus. For example, in OLMo 2 we had some MATH instruction data in the annealing mix. Just a few instructions will let this system prompt work.
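For intuition, here is a rough, paraphrased illustration of that kind of prompt template; the exact wording in the DeepSeek paper differs, so treat this only as a sketch of the structure.

```python
# Rough illustration of an R1-Zero-style prompt template (paraphrased; see the
# DeepSeek-R1 paper for the exact wording). The tags give the base model a
# clear structure to follow and a natural stopping point.
SYSTEM_TEMPLATE = (
    "A conversation between a user and an assistant. The assistant first thinks "
    "through the problem, then gives the final answer. Put the reasoning inside "
    "<think> ... </think> tags and the final answer inside <answer> ... </answer> tags.\n"
    "User: {question}\nAssistant:"
)

prompt = SYSTEM_TEMPLATE.format(question="What is 13 * 17?")
print(prompt)
# During RL, completions are parsed for the <answer> block and checked against
# the reference answer; the format itself doubles as a stop signal.
```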
In fact, the trend of increasing generation length via RL training could be even stronger when training directly from a base model rather than a standard post-trained model that doesn’t have a verbose chain of thought style. In order for RL to really start cranking up the response length in such an instruction-following model, it will have to unlearn a certain response length that was baked in. For example, in Tülu 3’s final stage of RL finetuning, the phase where the response length first goes down could be the barrier of misalignment between a larger round of SFT training and a smaller RL setup.
Zooming in on the x-axes of these R1-Zero plots, you can see that they’re doing 1000s of “RL steps.” RL step in this case refers to the model update step, which comes after multiple generations are made for the prompts in the batch and then answers are verified. This is a large amount of RL training, especially with such a large model. For reference, in our Tülu 3 work, we finetuned our models for 100s of steps normally, and the biggest models we are releasing soon only trained for ~50 steps of RL.
This is scaled-up RL relative to existing literature. R1 proper surely uses a similar setup, but DeepSeek did not include the same details, so the rest of this post relies more on explicit text in the paper.
Step 1. Reasoning SFT “Cold Start”
In order to improve the readability (i.e. help maintain formatting) and increase the final performance of the reasoning model, DeepSeek performs a small amount of supervised finetuning on the original base model with “a few thousand” filtered completions from the R1-Zero model. This involves a few tricks (none of which seem essential, you just need some of this data), such as:
Using few-shot prompting with a long CoT as an example, directly prompting models to generate detailed answers with reflection and verification, gathering DeepSeek-R1-Zero outputs in a readable format, and refining the results through post-processing by human annotators.
For replication efforts, any of these can be done. In fact, using DeepSeek-R1 itself is likely the easiest way.
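A minimal sketch of that replication path might look like the following; the `generate` function is a placeholder for whatever inference client you use, and the format filter is an assumption about what “readable” means here.

```python
# Sketch of building a small "cold start" SFT set by sampling an R1-style model
# and keeping only well-formatted, readable traces. `generate` is a placeholder
# stub, not a real library call.
import re
import json

def generate(prompt: str) -> str:
    # Placeholder: call your R1-style model here and return the raw completion.
    return "<think>13 * 17 = 13 * 17 ... = 221</think><answer>221</answer>"

def well_formatted(completion: str) -> bool:
    # Keep only completions with a complete think/answer structure.
    return bool(re.search(r"<think>.+</think>\s*<answer>.+</answer>", completion, re.S))

prompts = ["What is 13 * 17?"]  # in practice, thousands of reasoning prompts
cold_start = []
for prompt in prompts:
    completion = generate(prompt)
    if well_formatted(completion):
        cold_start.append({"prompt": prompt, "completion": completion})

with open("cold_start_sft.jsonl", "w") as f:
    for row in cold_start:
        f.write(json.dumps(row) + "\n")
```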
This phase readies the loss landscape of the model to make the “emergent” behaviors like “wait, let me check my work” or “that was wrong” come forth more easily in RL training.
Step 2. Large-scale RL for reasoning
As a reminder, RL for reasoning models is built on a simple idea where you should reward the model for getting correct answers to problems where you can check if it has a correct answer. A basic feedback loop of this looks like the following:
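In code, a minimal sketch of that loop could look like this, with placeholder policy, verifier, and optimizer components rather than DeepSeek’s actual implementation:

```python
# Minimal sketch of RL with verifiable rewards: sample, check the answer,
# reward correct completions, update the policy. All components are placeholders.

def extract_answer(completion: str) -> str:
    return completion.split("<answer>")[-1].split("</answer>")[0].strip()

def verify(completion: str, reference: str) -> float:
    # Verifiable reward: 1.0 if the extracted answer matches the reference.
    return 1.0 if extract_answer(completion) == reference else 0.0

def sample_completions(policy, prompt, n=8):
    return [policy(prompt) for _ in range(n)]          # n rollouts per prompt

def rl_step(policy, optimizer, batch):
    for prompt, reference in batch:
        completions = sample_completions(policy, prompt)
        rewards = [verify(c, reference) for c in completions]
        # A GRPO/PPO-style update would consume these rewards; `optimizer` is a stub.
        optimizer.update(policy, prompt, completions, rewards)
```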
Exactly what the “reward” is here (the same question applies for R1-Zero) isn’t detailed. DeepSeek mentions three reward components during the reasoning phase of RL:
* Accuracy rewards: These are score bonuses if the response to a prompt is correct. I’ve been referring to these as “verifiable” domains and in OpenAI’s Reinforcement Finetuning this is handled by their graders. TLDR: If the answer is correct, the reward is positive, if not, it is 0.
* Format rewards: These are rewards (or penalties if not satisfied) to check and make sure that the model follows the correct formatting of or and and for stable inference.
* Language consistency rewards: A reward is added to the model if the language of the answer is 100% matching the language of the question. DeepSeek writes that this additional reward shows a “slight degradation in the model’s performance,” but better human preferences. It’s added to make the model nice to use, which is a wonderful reminder that evaluation scores are not all that matters.
The first reward here drives the majority of the learning and the other two are guardrails for creating a stable model (which is not to say they aren’t important implementation details, but rather that the first one is necessary and the others may not be). To optimize this reward, DeepSeek uses the RL algorithm that they introduced, Group Relative Policy Optimization, which is the PPO update rule with a different value approximation method based on Monte Carlo advantage estimates rather than holding a separate value model in memory. The most likely explanation for this choice (much like how OpenAI has always used PPO) is that it is the mature implementation in their infrastructure.
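To make the reward shaping and the group-relative advantage idea concrete, here is a simplified numeric sketch; the guardrail weights and the normalization are illustrative assumptions, not the exact GRPO objective.

```python
# Simplified sketch: combine the three reward signals, then compute
# group-relative advantages by normalizing against the other completions
# sampled for the same prompt (no separate value model, as in GRPO).
from statistics import mean, pstdev

def combined_reward(correct: bool, follows_format: bool, language_consistent: bool) -> float:
    reward = 1.0 if correct else 0.0                # accuracy: drives most of the learning
    reward += 0.1 if follows_format else -0.1       # format guardrail (weight is a made-up choice)
    reward += 0.1 if language_consistent else 0.0   # language consistency bonus
    return reward

def group_relative_advantages(rewards: list[float]) -> list[float]:
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + 1e-6) for r in rewards]

# Eight sampled completions for one prompt: three correct, the rest not.
rewards = [combined_reward(c, True, True) for c in [True, True, True, False, False, False, False, False]]
print(group_relative_advantages(rewards))  # correct completions get positive advantages
```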
This image from the DeepSeekMath paper is a fantastic comparison of PPO to GRPO (it is fine to skip this if you only care about the big-picture recipe):
The nature of the reward setup (and the data) is the key to this sort of reasoning training and many of the small RL details can be substituted for each other.
Much like the DeepSeek V3 paper, the details of what data they used to train the model are not included here. This is absolutely crucial and almost certainly involves many, many verifiable prompts with answers. In order to study these models the community needs open versions of these datasets.
I would’ve loved to see details of their RL infrastructure (similar to the details in the DeepSeek V3 paper), as many people are looking to build on these models. RL training requires holding multiple models in memory and alternating between generating, verifying, and taking loss steps. As Sasha Rush says, “We need to code up verifiers ASAP,” which is what we are trying to do at Ai2 building on Tülu 3 and could use a lot of help with the open-source code. A good approach for entities interested here is to develop tooling and data for one domain at a time.
These first two steps are not new but rather scaled-up versions of ideas people have been discussing extensively. The final two steps DeepSeek details in the paper are new applications of known techniques to help take their raw reasoning performance and “train a user-friendly model.”
Step 3. Rejection Sampling to introduce general abilities
Rejection sampling is a technique where you generate completions from a model, rank them via a reward model, and then finetune the original model (normally with the supervised finetuning loss) to improve performance on a variety of tasks. It’s one of the standard post-training tools used by Llama 3 and many others.
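A minimal sketch of the general pattern, with a placeholder generator and reward model rather than any specific lab’s pipeline:

```python
# Rejection sampling in its simplest form: sample several completions per
# prompt, keep the one the reward model scores highest, and reuse the kept
# pairs as supervised finetuning data. Placeholders stand in for real models.

def policy(prompt: str) -> str:
    return f"Draft answer to: {prompt}"        # placeholder generator

def reward_model(prompt: str, completion: str) -> float:
    return float(len(completion))              # placeholder scorer

def rejection_sample(prompts, n=4):
    sft_data = []
    for prompt in prompts:
        completions = [policy(prompt) for _ in range(n)]
        best = max(completions, key=lambda c: reward_model(prompt, c))
        sft_data.append({"prompt": prompt, "completion": best})
    return sft_data

print(rejection_sample(["Summarize the DeepSeek R1 recipe."]))
```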
DeepSeek uses rejection sampling to begin to introduce general capabilities back into the model. It is also the one stage where they include data numbers — 800K completions total, split as 600K for reasoning and 200K for general chat problems. The 800K number is not surprising to me given this is just a late-stage SFT training, but it is similar in size to the ~1M prompts we used in the Tülu 3 SFT mix which is the ballpark for leading post-training recipes.
The details in the paper are largely around methods for generating responses to prompts and filtering to prioritize high-quality training data. In order to bring more domains into the scope of abilities for the model, DeepSeek has a variety of tricks, such as:
* Using generative reward models (i.e. LLM-as-a-judge) to verify answers to questions that may not be explicitly verifiable,
* Data from the DeepSeek-V3 standard post-training pipeline, and
* Standard (nonverifiable) chat data augmented with extended chain of thought before answering to help the model generalize from reasoning training to broader use cases.
All in, we currently have very few details here and there is a lot of open space to learn (and likely improve).
Step 4. Final RL training for general use
Finally, DeepSeek R1 goes back to reinforcement learning, which really seems to be how most finetuning is ending these days. The second RL stage is “aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities.”
In order to do this, they run RL training that mixes prompts from the verifiable domains (as done for R1-Zero) with prompts for standard RLHF preference tuning. To make this work, they have multiple reward models and build upon their post-training recipe from DeepSeek V3.
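One plausible way to route between the two reward sources is sketched below; this is a guess at the shape of the setup with placeholder components, not something the paper specifies.

```python
# Sketch of mixing verifiable rewards and preference reward models in the
# final RL stage: verifiable prompts are graded against a reference answer,
# everything else is scored by (placeholder) preference reward models.
from typing import Optional

def verifier_reward(completion: str, reference: str) -> float:
    return 1.0 if reference in completion else 0.0

def preference_reward(prompt: str, completion: str) -> float:
    return 0.5  # placeholder for helpfulness/harmlessness reward models

def reward(prompt: str, completion: str, reference: Optional[str]) -> float:
    if reference is not None:   # math/code-style prompt with a checkable answer
        return verifier_reward(completion, reference)
    return preference_reward(prompt, completion)   # general chat prompt

print(reward("What is 2+2?", "The answer is 4.", reference="4"))                    # 1.0
print(reward("Write a haiku about RL.", "Gradients in spring...", reference=None))  # 0.5
```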
This is not easy to do and involves many questions: What is the right data balance? Can you use an off-the-shelf existing reward model or does it need to have seen long reasoning traces? Are there additional steps needed to not degrade performance? And so on.
As this grows into a larger area of research and development these questions will slowly be answered.
As this post has transitioned into the later stages of training, it is clear that many details are unknown. We have the general shape of how to sequence things and will fill in the details from here. I have a very long stack of reasoning-related research papers to poke through, and while they came before DeepSeek R1, they still will point toward answers.
All of this is solvable, as proven by how quickly DeepSeek went from the o1 release to matching performance with an open weights model.
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Discussions and next steps
The DeepSeek R1 report has an entire other subsection dedicated to its distillation experiments, where it took completions from the R1 model and finetuned existing open-weight models with them to boost performance. Releasing these is a fantastic service and provides a solid baseline for RL experiments on smaller models to try and match in the near future.
The discussion in the paper on how large models are required to see the biggest reasoning gains (and generate effective synthetic data) is likely the biggest open question:
First, distilling more powerful models into smaller ones yields excellent results, whereas smaller models relying on the large-scale RL mentioned in this paper require enormous computational power and may not even achieve the performance of distillation. Second, while distillation strategies are both economical and effective, advancing beyond the boundaries of intelligence may still require more powerful base models and larger-scale reinforcement learning.
As smaller models continually improve over the years, it is likely that the same type of training could work on something like Llama 5 or 6 8B. It leaves us with the same open question as to why different abilities “emerge” at larger models. Scaling laws are the reason that each generation’s frontier models tend to be the largest models available. The exciting form of this question for 2025 is: how small a model can the steady progress of language modeling research drive these advanced reasoning capabilities into?
Every so often a paper comes around that makes the path forward clear. The last time I felt this way was with the Llama 3 report for post-training, which solidified into the Tülu 3 paper.
Soon, I’ll comment on…
* Distillation of reasoning traces (as done in the R1 paper),
* The demise of process reward models (PRMs) and Monte Carlo Tree Search (MCTS),
* Some things in the DeepSeek paper, like the “Aha” moment and over-indexing on human priors, that annoy me,
* The new reasoning research coming out from academia,
* The other reasoning model that dropped yesterday — Kimi 1.5,
* The biggest application of Tülu 3 RLVR yet, and
* All the other ideas that are under debate in the reasoning model space.
R1 is surely not the only way to train these models, but it is the recipe that people will build off immediately. Let’s get cranking on more datasets and infrastructure.
For those new here, you can check out the Inference & Reasoning tag on Interconnects!
Full post for images, etc: https://www.interconnects.ai/p/to-meta-ray-ban-local-ai
With the Rabbit r1, the Humane pin, the Friend thing, the Sam Altman rumors, Meta Ray-Bans, and everything in between, it is obvious that we are going to get new devices in the near future driven by advancements in AI. Trying some of those that already are public makes this obvious from a functional perspective rather than a marketing perspective.
Even though many of these devices will have a shelf life drastically shortened by the underlying API access getting turned off when the parent company runs out of money, the call for these devices is very strong. AI is going to be more than a chat window we use for work, we just don’t know what that will feel like. AI should be fun, flexible, and available.
Meta’s Ray-Bans were first launched in 2021, long before any of this ChatGPT-inspired interest in AI began. Having tried them, I think the form factor would have caught on eventually, but AI was the catalyst to accelerate adoption. AI expanded our expectations for the range of exciting outcomes that could be coming our way.
Using the AI in the Ray-Bans is much like using an early chatbot. If I had never used ChatGPT, it would have been transformative, but today it feels slightly outdated. We should be more impressed by these generally and contextualize the AI they’re delivering. Cumulatively, the product excitement feels unexpectedly like what AirPods had on day 1. I was not expecting this fondness.
The form factor for the Meta Ray-Bans is fantastic and drives this connection. I’ve been legitimately excited to use them (albeit much more during sunny Seattle summers relative to now), and they immediately made sense when taking them out of the packaging. My best use has been for outdoor activities: taking photos and videos, and handling communications, without needing to fuss with a phone. An example video is below -- like most things, it has a learning curve. Here’s a photo from that outing:
Or a video:
Clearly, they’re fine.
What I want to use them for today has nothing to do with AI. In some ways, this makes me more bullish on the form factor, but it makes it clear that Meta is in a precarious position. Ironically, I would’ve been more reluctant to buy them if not for the excitement about AI.
As of writing this, I would much rather have “Apple Ray-Bans” because of a seamless integration with the rest of my information ecosystem. However, Apple may not be willing to take the risk to build them (as I avoid an Apple Vision Pro Digression).
This does not mean the long-term story of many new devices won’t be the AI.
AI, in the recent past (and likely in the near future), left most electronic devices with an eerie, bland sameness. My sunglasses can answer basic questions about my day just like Siri. At the same time, my appliances try to talk to me. The hard-to-visualize step is how this changes (and overcomes the same integration dead ends that agents face). AI in 5 years (or way less) will actually know the context of our lives and be able to execute basic web tasks.
When the AI is good, Meta Ray-Ban type devices will be indispensable. Reminders, calls, reasoning, integration, all on the go. Much like the sensation products like AirPods provide, AI devices (and services) done right will make us free to be in the world naturally.
Meta now has a real hill to climb for AI. They just need to focus on building one more useful feature at a time rather than building a god. They have a tangible goal and a real product that is going to get better in the normal march of progress. If only we had an ecosystem of people who wanted to do this work and keep hill climbing the AI part for them.
The AI of the Meta Ray-Bans (and the other devices I started with) being primarily in the cloud is a drag but is needed for these first generations of glasses to maintain battery life. The cloud-centric nature of the AI is the largest perceivable reason Meta cannot open a Software Development Kit (SDK) for the glasses — all the developers would be doing is changing Meta's internal Llama API calls, rather than uploading new and improved models to the glasses.
AI models in the cloud are consistently the first ones to cross the frontier of new capabilities. As we figure out what we want to use new AI devices for, using the cloud models will make us more likely than not to find useful applications. Now that we have things that people actually like, we need to optimize and specialize these models out of the cloud.
What’s the state of local LMs?
The AI angle for this post is to prompt the question: What do people actually use local, or on-device, language models for? What are they driving innovation of?
The local model ecosystem is composed of a distribution of tinkerers, researchers, and those whose use cases API models refuse. Most people doing this are not directly innovating on local models in a way that dictates meaningful improvements to underlying AI innovations. Yes, companies surely monitor progress and observe lessons, but there are far bigger markets at play for why local models are needed in the future of AI than the tinkerers that get visibility.
Local language models are crucial for maintaining privacy (not everyone can afford fancy inference data centers like Apple), optimizing inference speed, and providing access in situations with no web connectivity. The Meta Ray-Bans stand to benefit from all of these.
Phrasing the reasoning starting from the frontier cloud models most people are used to, rather than from what we want, it goes as follows: local models shouldn’t try to be our general use case model. Outsource that to the cloud. Use local models for efficient, specific tasks out in the world.
What local model enthusiasts are doing is building an ecosystem around optimization, latency, and task specialty that drives a lot of value. This value is captured by companies with no feedback loops to the tinkerers. Having SDKs and other direct places where those evolving local models can benefit in real ways is the goal. The models themselves will actually get better too — an actual potential feedback loop from open AI models.
Just about a year ago I wrote a very similar take on local models, on how they have different trade-offs and trajectories. Apple Intelligence, Google’s new models / Pixel phones, and the Meta Ray-Bans are showing us that this future is coming.
What is left to be understood is the manner in which local models are developed for new devices. Will any major technology companies let us run our own models with deep integrations? How can open-source principles and local models synergize?
Hillclimbing with open, local language models
Giving developers ways to integrate their own AI models into the operating system (OS) hooks used by the Meta Ray-Bans would immediately spawn a platform for local, open-weight language models. I first learned how locked down the Ray-Ban developer ecosystem was because I was excited to try and get our multimodal LM Molmo on them. That attempt didn’t make it far.
Other companies, like Apple, could conceivably have SDKs that let users point their language models at OS hooks. Creating operating systems that allow users to integrate certain open models (even only those that are approved by the companies) would completely change the (lack of) incentives for iterating on language models in the open.
While we still don’t have the new Apple Intelligence version of Siri that can plug into multiple applications, we know this works by letting an AI model generate tokens that correspond to actions in other applications. Letting users choose AI models (maybe their own), even if they only are useful in a subset of the tasks, would be wonderful. I would love to sacrifice whatever the AI situation is on my version of the Ray-Bans by default and get just the best vision model for explaining my environment, the best model for cooking ideas, or the best conversational model to just push the limits for AI devices in any of these promising directions. It would be so fun to try different AI models on a real device.
The open language modeling ecosystem desperately needs these types of feedback loops (and it is totally natural for excitement about a type of technological development like this to exist before the proof cases of its value).
Getting to the point where Meta has an AI SDK for devices along with the leading open language models will make their entire strategy value additive (rather than just destroying the advantages of competitors). In fact, Meta likely needs to do so, or else Apple’s product competitor may dominate the market. Only different strategies and feedback loops can dislodge Apple’s integration.
On the modeling side, there’s no doubt we have step-change improvements coming to the models used on the Ray-Bans. On ChatBotArena, we have many models with a few billion parameters that beat the first versions of ChatGPT. The same type of performance gain — where a 100X smaller model can match or surpass performance in a few years — will come for the Ray-Bans and all other sorts of AI applications.
The big picture arc of technology
Starting in 2025, I’m excited about the breadth and quantity of profound, new technological experiences I’m having. Some of them, like ChatGPT Advanced Voice Mode, haven’t really landed for me (even though they’re extremely impressive to non-tech non-AI friends and family). Meta Ray-Bans, Waymos, Codex, and standard ChatGPT all feel like technologies that were immediately obvious as something I needed. I need to get a Starlink hub in one of the remote locations my hobbies bring me to, and I’m sure I can add reusable rockets to the transformations I’ve embraced.
The last technologies sparking these joys were the likes of the iPod and the iPad.
Every person I take to ride a Waymo for the first time has a similar experience of joy.
This year we may also have new models that solve arbitrary internet tasks for us in the background.
The future is here and we’re living in a time where it’ll be more evenly distributed.
Original post:
https://www.interconnects.ai/p/deepseek-v3-and-the-actual-cost-of
Chapters
00:00 Opening
03:15 DeepSeek’s learning efficiency
06:49 DeepSeek’s compute transparency and reality
Figures
Fig 1: Benchmark Results
Fig 2: ChatBotArena Results
Fig 3: Compute Usage Table
Slides for this post-training talk and slides for the full tutorial on language modeling (with a bit less post-training content and no recording yet). Here are some timestamps for the video:
* 00:00 Introduction
* 10:00 Prompts & Skill Selection
* 14:19 Instruction Finetuning
* 21:45 Preference Finetuning
* 36:17 Reinforcement Finetuning
* 45:28 Open Questions
* 52:02 Wrap Up
Psssst… we just released our technical report for OLMo 2 — 2 OLMo 2 Furious — check it out for tons of training details and tips!
This post has some good content, but if you just want to watch the tutorial on YouTube, it’s here.
I’m far more optimistic about the state of open recipes for, and knowledge of, post-training starting 2025 than I was starting 2024. Last year, one of my first posts was about how open post-training won’t match the likes of GPT-4. This is still the case, but now we at least better understand the scope of things we will be working with.
It’s a good time to record an overview of what post-training looks like today. I gave a version of this tutorial talk for the first time in 2023 (at ICML), when it felt like a review of the InstructGPT paper rather than a synthesis of reproduced results from the literature. In 2024, the scientific community made substantial progress in actually training these models and expanding the frontier of knowledge. Doing one of these talks every year feels like a good way to keep tabs on the state of play (whereas last year, I just had a bunch of links to add to the conversation on where to start).
With the talk, I wanted to add more context on where I see post-training generally.
The most important one people need to know, given the excitement around OpenAI’s o1 series of models, is that post-training alone is nowhere near a complete enough lens or taxonomy to study training reasoning language models. It’s a step.
Back to the processes used for all modern AI models. There are a lot of post-training methods to improve models and, more importantly, they can be segmented so the scientific community can make progress on each of them individually. The new state of finetuning stages is satisfying, with three groups of training methods:
* Instruction finetuning (a.k.a. supervised finetuning),
* Preference finetuning (the generalization of reinforcement learning from human feedback), and
* Reinforcement finetuning (the new abstraction for improving performance on specific tasks).
Some of the long-tail methods like rejection sampling, knowledge distillation, and extensive filtering aren’t studied well, but you can still do excellent post-training without them. We have options for studying post-training in 2025.
Where last year we were settling debates such as “DPO vs. PPO” or “does AI feedback for RLHF work,” now we are focused on just making the best practices better.
Similarly, the stress around doing research on outputs from foundation model providers — i.e., whether such research violates the OpenAI terms of service on training competitor models — has dropped further; distilling from strong models is now common practice and a fundamental part of successful post-training.
Interconnects is a reader-supported publication. Consider becoming a subscriber.
To summarize the state of post-training, there are a few things to keep in mind:
1. Post-training techniques are more impactful on the final performance of models
Some caveats before I toot the horn of post-training as all you need today. Given that “scaling as we know it is ending,” this is not an entirely controversial take. And it is obviously self-serving for me, as someone who stands to benefit from post-training becoming more important.
All of this aside, it’s very logical that post-training will be the next domain for scaling model compute and performance. Predicting the next token accurately is not something a user cares about — correct answers, and how those answers are presented, are. All through 2024, discussion of post-training’s growing importance kept building.
If we look at the Elo ratings of models on ChatBotArena, we can see progress has accelerated even though the models haven’t been getting noticeably bigger. Pretraining on these architectures is improving, yes, but the biggest and best models are used as tools and supervision for better post-training.
Post-training got more popular because there was more low-hanging fruit for model performance. A lot of that potential has been realized and, in the process, entirely new types of models akin to o1 are being made.
To interpret these numbers:
* 100 Elo margin over another means ~2/3 win probability over the lower,
* 200 Elo gives ~76% win probability,
* 300 Elo gives ~85% win probability, and so on.
You can play with these numbers here.
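For reference, these win probabilities follow directly from the standard Elo logistic formula (400-point scale, which ChatBotArena also uses); a quick sketch:

```python
def elo_win_prob(margin: float) -> float:
    """Win probability for the higher-rated model, given its Elo margin (standard 400-point scale)."""
    return 1.0 / (1.0 + 10 ** (-margin / 400))

for margin in (100, 200, 300):
    print(margin, round(elo_win_prob(margin), 2))
# prints: 100 0.64, 200 0.76, 300 0.85
```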
2. Post-training can be very expensive
While still far cheaper than pretraining due to the price of GPUs, post-training costs have been growing rapidly. If we estimate the costs of post-training the Llama models, we could guess that the all-in costs were about the following (note: numbers are based primarily on a combination of headcount and data costs, with compute driving them even higher):
* LLaMA (Q1 2023) <<$1M: instruction tuning only.
* Llama 2 (Q3 2023) ~$10-20M: 1.4M preference pairs, RLHF, IFT, Safety, etc. and other costs not in the paper.
* Llama 3.1 (Q3 2024) >$50M: similar preference data to Llama 2, a ~200-person post-training team, larger models, etc. The number could be much higher.
Post-training costs come from large data bills and extensive inference to generate, clean, and verify multiple types of synthetic training data. More complex loss functions, e.g. RL optimizers, use a lot of memory to train, but far fewer FLOPs than pretraining for general instruct models. This is all growing rapidly and is expected to change.
This culminates in the o1-style models, where the compute spent with post-training loss functions can account for 40% or more of the overall compute of the model. Even Tülu 3, our major post-training project at Ai2 that didn’t buy any human data, cost >$1M by my estimate — a lot for an academic project.
3. Post-training is less reliant on human data
While all the frontier laboratories still rely on human data for parts of their post-training pipeline (including both training and evaluation), AI can be substituted at most stages and get a “good enough” outcome. For example, given the costs above, they can be slashed by moving from human preference data at ~$5-20 per preference pair to AI feedback at <$0.01 per sample. The optionality of synthetic data, driven by having models that are good enough to supervise, makes the pace of post-training progress far higher. In my experience, AI feedback for RLHF only became possible with GPT-4 tier models, and the academic community reaps extreme benefits from the plummeting cost of inference.
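A back-of-the-envelope sketch of that swap, using the rough per-sample prices above and a Llama-2-sized preference dataset (all numbers are illustrative, not actual vendor quotes):

```python
# Illustrative only: rough prices from the text, dataset size from the Llama 2 figures above.
n_samples = 1_400_000            # ~1.4M preference pairs
human_cost = n_samples * 10      # take ~$10 from the ~$5-20 per human preference pair range
ai_cost = n_samples * 0.01       # <$0.01 per AI-feedback sample
print(f"human labels: ${human_cost:,.0f}   AI feedback: ${ai_cost:,.0f}")
# human labels: $14,000,000   AI feedback: $14,000
```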
4. Post-training ability is the door to advanced reasoning models
Doing post-training well and having mastery of the techniques seems crucial to making progress on reasoning models like o1, because the infrastructure for RL finetuning of an instruct model is the same as what is used for large-scale RL training — or at least you want it to be.
Given the above trends — we know more, it is easier to study, we have cheaper alternatives, etc. — there is cause for optimism in open replications of o1. It should still be expected that the first “replications” of o1 are better described as relatives — scaled-up post-training on reasoning rather than the special pretraining + scaled RL that OpenAI does. We will learn a lot soon.
The talk on YouTube
Slides for this post-training talk and slides for the full tutorial on language modeling (with a bit less post-training content and no recording yet). Here are some timestamps for the video:
* 00:00 Introduction
* 10:00 Prompts & Skill Selection
* 14:19 Instruction Finetuning
* 21:45 Preference Finetuning
* 36:17 Reinforcement Finetuning
* 45:28 Open Questions
* 52:02 Wrap Up
In 2025 we need to disambiguate three intertwined topics: post-training, reasoning, and inference-time compute. Post-training is going to quickly become muddied with the new Reasoning Language Models (RLMs — is that a good name?), given that loss functions we studied via advancements in post-training are now being leveraged at a large scale to create new types of models.
I would not call the reinforcement learning training done for OpenAI’s o1 series of models post-training. Training o1 is large-scale RL that enables better inference-time compute and reasoning performance. Today, I focus on reasoning. Technically, language models definitely do a form of reasoning. This definition does not need to go in the direction of the AGI debate — we can clearly scope a class of behavior rather than a distribution of explicit AI capability milestones. It’ll take work to get an agreement here.
Getting some members of the community (and policymakers) to accept that language models do their own form of reasoning by outputting and manipulating intermediate tokens will take time. I enjoy Ross Taylor’s definition:
Reasoning is the process of drawing conclusions by generating inferences from observations.
This is a talk I gave at NeurIPS at the Latent Space unofficial industry track. I wanted to directly address the question of whether language models can reason and what o1 and the reinforcement finetuning (RFT) API tell us about it. It’s somewhat rambly, but it asks the high-level questions on reasoning that I haven’t written about yet and is a good summary of my coverage of o1’s implementation and the RFT API.
Thanks swyx & Alessio for having me again! You can access the slides here (e.g. if you want to access the links on them). For more on reasoning, I recommend you read/watch:
* Melanie Mitchell’s series on ARC at AI: A Guide for Thinking Humans: first, second, third, and final. And her post on reasoning proper.
* Miles Brundage’s thread summarizing the prospects of generalization.
* Ross Taylor’s (previous interview guest) recent talk on reasoning.
* The inference-time compute tag on Interconnects.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts.
Transcript + Slides
Nathan [00:00:07]: Hey, everyone. Happy New Year. This is a quick talk that I gave at NeurIPS, the Latent Space unofficial industry event. Swyx tried to have people talk about the major topics of the year: scaling, open models, synthetic data, agents, etc. And he asked me to fill in a quick slot on reasoning. A couple notes. This was before o3 was announced by OpenAI, so I think you can take everything I said and run with it with even more enthusiasm and expect even more progress in 2025. And second, there were some recording issues, so I re-edited the slides to match up with the audio, so you might see that they're slightly off. But it mostly reads like a blog post, and it should do a good job getting the conversation started around reasoning on Interconnects in the new year. Happy New Year, and I hope you like this. Thanks.
I wouldn't say my main research area is reasoning. I would say that I came from a reinforcement learning background into language models, and reasoning is now getting subverted into that as a method rather than an area. And a lot of this is probably transitioning these talks into more provocative forms to prime everyone for the debate that is why most people are here. And this is called the state of reasoning. This is by no means a comprehensive survey. To continue, I wanted to make sure that I was not off base to think about this because there's a lot of debates on reasoning and I wanted to revisit a very basic definition.
And this is a dictionary definition, which is the action of thinking about something in a logical, sensible way, which is actually sufficiently vague that I would agree with it. I think as we'll see in a lot of this talk is that I think people are going crazy about whether or not language models reason. We've seen this with AGI before. And now we're going to talk about it. Now, reasoning kind of seems like the same thing, which to me is pretty ridiculous because it's like reasoning is a very general skill and I will provide more reasoning or support for the argument that these language models are doing some sort of reasoning when you give them problems.
I think I don't need to share a ton of examples of what are just ill-formed arguments for what language models are not doing, but it's tough that this is the case. And I think there are some very credible arguments that reasoning is a poor direction to pursue for language models, because language models are not going to be as good at it as humans. But to say that they can't do reasoning, I don't see a lot of proof for, and I'll go through a few examples. And the question is: why should language model reasoning be constrained to look like what humans do?
I think language models are very different, and they are stochastic. The stochastic parrots thing is true for many reasons. And we should embrace this and continue. I think a big trend of the year is that we're seeing new types of language model reasoning that look less human, and that can be good for separating the discourse from expecting a really narrow type of behavior.
I did an interview with Ross Taylor, who was a reasoning lead at Meta, which I thought was a very good education for me on this. And this is just a direct pull from the transcript. But essentially what it's saying is: if you do chain of thought on a language model, what it is doing is essentially outputting its intermediate steps. If I were to ask you all a math problem right now, you can do most of them in your head and you are doing some sort of intermediate storage of variables. And language models have no ability to do this. They are kind of per-token computation devices where each token is outputted after doing this forward pass, and within that, there's no explicit structure to hold these intermediate states. So I think embracing chain of thought and these kinds of intermediate values for the language models is extremely reasonable. And it's showing that they're doing something that actually gets to valuable outputs.
Nathan [00:04:10]: So this is one of the many ways that we can kind of lead towards o1: language models have randomness built into them. And a lot of what people see as failures in reasoning are these language models following very static chains and making very specific mistakes along the way, with really no ability to correct for that. This is really not something that we see in human reasoning. So if a human makes a mistake, they will normally catch it on the next step. But we need to handle language models differently.
Nathan [00:04:41]: And why o1 is exciting is because it's a new type of language model that is going to maximize on this view of reasoning: that chain of thought and kind of a forward stream of tokens can actually do a lot to achieve better outcomes when you're doing a reasoning-like action, which is just repeatedly outputting tokens to make progress on some sort of intelligence-defined task. So it's just making forward progress by spending more compute, and the token stream is the equivalent of some intermediate state.
Nathan [00:05:18]: What o1 is has been a large debate since its release. I'm not going to spend a lot of this talk on it. But the more I've spent on it, the more I think you should take OpenAI at face value: they are doing very large-scale RL on verifiable outcomes (that last part is what I've added), especially in the context of the RL API that they've released, which I'll talk about more. Most of the reasons to believe in more complicated things like process reward models, self-play, or Monte Carlo tree search are mostly based on previous literature and things that we would have expected advanced reasoning to look like for language models, and not based on evidence that they have given us or the behavior, whether you're looking at evaluations or how inference is actually done when serving the model.
This takes us to replications, or what I would probably call relatives of o1, coming from the community. These are wonderful to see. We are exploring the boundaries of what we can do with chain of thought in models. The two I've highlighted are from DeepSeek and Qwen, and a lot of people in this room have probably seen them. And I think that these models are really substantially narrower than the full o1 models from OpenAI. With o1, you can use it for a lot more tasks. I was using the DeepSeek model, which is supposed to be for math or code, but they've kept the model so narrow that even within that, if you ask a code question, sometimes it'll say it's only supposed to work on math or code. And a lot of the success of o1 and the future models like this is going to be it being able to handle more tasks and more domains.
So SemiAnalysis wrote a post that I haven't read in full. But even if you look at the paywalled headings, you can kind of make some intelligent claims about what o1 is or is not. I think these are two of the things from the table of contents that you can see without paying. I'm due to pay at some point, but I have not. The first is incredible amounts of forward passes during training. I think you'll see this as I discuss RL fine-tuning models a bit more in a little bit. When you're doing RL, there are two ways that you see data many times, and that results in many forward passes. One is that when you're doing RL on a prompt, you can sample many completions to then grade them or use them in different ways to update your policy. So if I ask one math problem, I could look at eight completions and choose the best one, or do some contrast between the best and the worst one. And that kind of grading can help the RL policy actually learn. And the second is that, because the loss function is more flexible than something like instruction tuning, you can go over the same prompts many more times than you would in instruction tuning or pre-training. So this kind of means they're doing a lot of sampling from the model, which is very different than other types of training we've seen in the past at pre- and post-training. And then the second one is great — thanks for showing everyone this — which is that post-training FLOPs exceed pre-training. I think this pretty much clearly says that they're using RL, and they're using a ton of compute for this large-scale RL. At that point, it would probably mean something different — this is like pre-training-scale RL. And this is something that these early relative models are not going to be doing, because no one has this infrastructure like OpenAI does. It'll take a while to do that, but people will make it.
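To make the "sample many completions per prompt and grade them" idea concrete, here is a minimal sketch with placeholder generation and grading functions — this is an illustration of the general pattern, not OpenAI's (or anyone's) actual training code:

```python
import random

random.seed(0)

def sample_completion(prompt: str) -> str:
    """Placeholder for drawing one completion from the current policy model."""
    return random.choice(["The answer is 84.", "The answer is 82."])

def grade(completion: str, reference: str) -> float:
    """Binary verifiable reward: 1.0 if the reference answer appears, else 0.0."""
    return 1.0 if reference in completion else 0.0

prompt, reference, k = "What is 12 * 7?", "84", 8
completions = [sample_completion(prompt) for _ in range(k)]
rewards = [grade(c, reference) for c in completions]

# Contrast each sample against the group average (group-relative baselines are one common choice),
# or simply keep the best-of-k completion for further finetuning.
baseline = sum(rewards) / len(rewards)
advantages = [r - baseline for r in rewards]
best = completions[rewards.index(max(rewards))]
print(rewards, advantages, best)
```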
Nathan [00:08:50]: OK, this takes us to reinforcement fine-tuning. I would say that this is a hard pivot in the talk, where o1 is essentially pre-training-scale RL — extremely big RL — and we don't know all the details of the data. OpenAI is then showing us this new beta API program that they're making, which is just a sprinkle of this. So what can you do with a tiny bit of their infrastructure? I think one of the fine-tuning leads responded to a tweet from Swyx. There was a long tweet that gave a lot of details, but even the first tweet, which I hadn't seen and had maybe eight likes, said that this API is using the same infrastructure that we used to train o1. That alone is a lot of detail for a random thing on Twitter, and then there were really long details on other parts of it. But it is just a new paradigm for fine-tuning. And I have seen some of this work, and I'm pretty optimistic that it'll work for really specific capabilities where answers matter, rather than features in your style of text mattering. So, again, kind of like I was hinting at with o1, this reinforcement fine-tuning does many passes over the data, which is why they can say you only need dozens of labeled samples to actually learn from it, which is just very different than previous training regimes. So what happens is that a grader gives the model a bonus when the answer is right, and the model learns to reinforce behaviors that get right answers. Later in the talk, I'll highlight a research project that we did that was pretty much doing a very similar thing to target very specific evaluations on open models. You do RL and you give a reward bonus when the answer is right, and that's all you do. The key innovation, and the simplicity, is that modern language models are a strong enough base that just a really gentle RL fine-tuning can add these specific capabilities without degrading the model. I think there's a lot of fear around adding RL to these training regimes, and I'm sure we'll get to that in the future. But I think one of the biggest challenges for teams, especially on general instruct models like in ChatGPT, was just that it's going to destroy the rest of the performance — the base of chatting that you care about. And it really seems like you can just do this out of the box. If OpenAI is going to offer an API, they aren't going to let people train a model that then just gets worse on random other things.
Nathan [00:11:20]: So, what the data format looks like: the example they gave is way more complicated than I think you need. You can start with a grade-school math problem and just say the correct answer is the correct number (their genes example is confusing). But essentially, you have two components, a prompt and an answer, which is different than having a prompt and a completion that you would train on. Or, if you're doing preference tuning, you would have a prompt, a chosen completion, and a rejected completion. So it's a new type of data format, and I suspect we'll quickly see things like Hugging Face having more of these. I will highlight that we have some of ours for the specific project that we did. We have examples for math; on the screen is an example for precise instruction following, which is the idea that if you have a prompt, you can say something like, have every sentence start with the letter A. And you can verify that with Python really easily. This is something that we did in our project. And the model gets better at this: you have constrained data, and the RL algorithm learns to change the model just a tiny bit and actually reach these answers.
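A minimal sketch of how the three data formats differ — field names are illustrative, not any provider's actual schema:

```python
# Instruction finetuning trains on a reference completion.
ift_example = {"prompt": "What is 12 * 7?", "completion": "12 * 7 = 84."}

# Preference finetuning trains on a comparison between two completions.
pref_example = {
    "prompt": "What is 12 * 7?",
    "chosen": "12 * 7 = 84.",
    "rejected": "12 * 7 = 74.",
}

# Reinforcement finetuning only needs a prompt and a verifiable answer;
# the completion is sampled from the model and checked by a grader.
rft_example = {"prompt": "What is 12 * 7?", "answer": "84"}
```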
Nathan [00:12:23]: A confusing thing for people was these grader models. I think the place to come at these from is evaluation. There's been a lot of work in evaluation to make answer extraction stable, especially with math. An example that I used in the blog post I wrote today on this is how Llama 3.1 details their evals: for math, they use both SymPy, a Python package for extraction, and LLM-as-a-judge to extract their answers. What the graders are doing is essentially amping this up to a whole other level, where it's kind of a nested structure of configs for doing reward shaping on these verifiable outputs. For math, it can be really easy — you have to handle these five formats that I came up with in a minute for how you could represent different numbers and tokens. But as you get to more complicated things and more complicated behaviors, it seems like OpenAI is insinuating that you're going to need more than just a yes/no loss function for your domains. And that seems fine. We already have a bunch of open models, like Prometheus and other things, that are designed specifically for LLM-as-a-judge, and I see that continuing to just become part of this kind of open RL infrastructure.
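A minimal sketch of what a simple math grader can look like, combining string extraction with a SymPy fallback so equivalent forms still count — this is an illustration of the general idea, not OpenAI's grader config:

```python
import re
from sympy import simplify, sympify

def extract_answer(completion: str) -> str:
    """Take the last number-like span in the completion (one of many extraction heuristics)."""
    matches = re.findall(r"-?\d+(?:\.\d+)?(?:/\d+)?", completion.replace(",", ""))
    return matches[-1] if matches else ""

def grade(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0."""
    pred = extract_answer(completion)
    if pred == reference.strip():
        return 1.0
    try:
        # Fall back to symbolic equivalence so "3/2" and "1.5" both count as correct.
        return 1.0 if simplify(sympify(pred) - sympify(reference)) == 0 else 0.0
    except Exception:
        return 0.0

print(grade("The answer is 1.5", "3/2"))  # 1.0
```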
Nathan [00:13:41]: OpenAI had a bunch of screenshots. I'm not going to spend long on commentary on these, but it looks pretty standard. They're going to track how performance changes over time and stuff like this. You'll be able to look at all the outputs. This is just them making pretty things. And then they have this very generic RL plot. The most standard RL plot has an X axis of time or trials and a Y axis of reward; here, reward is like an accuracy or a success rate on a certain validation set, and X is actually supposed to be how much training was done. That's very similar to what we did in our project. I think another way you can put this is with an RL feedback diagram: if you've seen RL, where you have this agent interacting with the environment, you can squint at it and it'll be familiar. If you haven't, you'll probably be in for more of these things if RL keeps becoming popular, because RL is really formulated as trial-and-error learning.
But if you're interested, we're happy to have people use our code, which does this for math and some instruction following already, and we want to try more complicated graders for things like code. For code quality, a binary outcome doesn't really make sense, which is a good way to think about why you might need to do some reward shaping for how you would grade outputs from a given model. And to compare with the plot that OpenAI had, which shows performance improving over time:
These are some experiments we ran on various evaluations. The left column is some language model evaluation that we would use in an academic paper, and the right is all the various internal RL statistics, where GSM8K, MATH, and IFEval are all being trained on their training sets. So we have the prompts, which are math questions, and we have the answers, which are numbers, and we're really doing this RL on seeing if the answer is right. And then it generalizes to various math evaluations that we care about. I kind of see this as: we got a tip from an industry lab member to do this a few months early, so we got a head start. And I think a lot of people are obviously going to be trying to replicate this now. So it's fun that we have a starting point, and I'm excited to talk about it with people this week. And I think reasoning is worth continuing to work on. You can read the post that I was referencing here, and I'm happy to take any related or hard question on reasoning, because I kind of opened the floor for that. So thank you.
Original post
https://www.interconnects.ai/p/2024-interconnects-year-in-review
Original post:
https://www.interconnects.ai/p/openais-o3-the-2024-finale-of-ai
Chapters
00:00 Introduction
02:51 o3 overview
05:57 Solving the Abstraction and Reasoning Corpus (ARC)
10:41 o3’s architecture, cost, and training (hint: still no tree search)
16:36 2024: RL returns
Figures
Fig 1, Frontier Math results
Fig 2, Coding results
Fig 3, ARC AGI results
Fig 4, ARC AGI result details
Fig 5, ARC AGI example 1
Fig 6, ARC AGI example in text
Fig 7, ARC AGI example “easy”
Original post: https://www.interconnects.ai/p/the-ai-agent-spectrum
Chapters
00:00 Introduction
03:24 Agent cartography
08:02 Questions for the near future
Figures
Fig 1. multiple feedbacks diagram
Original post:
https://www.interconnects.ai/p/openais-reinforcement-finetuning
Chapters
00:00 Introduction
04:19 The impact of reinforcement finetuning’s existence
07:29 Hypotheses on reinforcement finetuning’s implementation
Figures
Fig. 1, Yann’s Cake
Fig. 2, Grader config
Fig. 3, RLVR learning curves
Finbarr Timbers is an AI researcher who writes Artificial Fintelligence — one of the technical AI blogs I’ve been recommending for a long time — and has a variety of experiences at top AI labs, including DeepMind and Midjourney. The goal of this interview was to do a few things:
* Revisit what reinforcement learning (RL) actually is, its origins, and its motivations.
* Contextualize the major breakthroughs of deep RL in the last decade, from DQN for Atari to AlphaZero to ChatGPT. How could we have seen the resurgence coming? (see the timeline below for the major events we cover)
* Modern uses for RL, o1, RLHF, and the future of finetuning all ML models.
* Address some of the critiques like “RL doesn’t work yet.”
It was a fun one. Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Timeline of RL and what was happening at the time
In the last decade of deep RL, there have been a few phases.
* Era 1: Deep RL fundamentals — when modern algorithms were designed and proven.
* Era 2: Major projects — AlphaZero, OpenAI 5, and all the projects that put RL on the map.
* Era 3: Slowdown — when DeepMind and OpenAI no longer had the major RL projects and cultural relevance declined.
* Era 4: RLHF & widening success — RL’s new life post ChatGPT.
Covering these are the following events. This is incomplete, but enough to inspire a conversation.
Early era: TD Gammon, REINFORCE, Etc
2013: Deep Q Learning (Atari)
2014: Google acquires DeepMind
2016: AlphaGo defeats Lee Sedol
2017: PPO paper, AlphaZero (no human data)
2018: OpenAI Five, GPT 2
2019: AlphaStar, robotic sim2real with RL early papers (see blog post)
2020: MuZero
2021: Decision Transformer
2022: ChatGPT, sim2real continues.
2023: Scaling laws for RL (blog post), doubt of RL
2024: o1, post-training, RL’s bloom
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Chapters
* [00:00:00] Introduction
* [00:02:14] Reinforcement Learning Fundamentals
* [00:09:03] The Bitter Lesson
* [00:12:07] Reward Modeling and Its Challenges in RL
* [00:16:03] Historical Milestones in Deep RL
* [00:21:18] OpenAI Five and Challenges in Complex RL Environments
* [00:25:24] Recent-ish Developments in RL: MuZero, Decision Transformer, and RLHF
* [00:30:29] OpenAI's O1 and Exploration in Language Models
* [00:40:00] Tülu 3 and Challenges in RL Training for Language Models
* [00:46:48] Comparing Different AI Assistants
* [00:49:44] Management in AI Research
* [00:55:30] Building Effective AI Teams
* [01:01:55] The Need for Personal Branding
We mention
* IBM’s Deep Blue
* Alberta Machine Intelligence Institute (AMII)
* Claude (Anthropic's AI assistant)
* Bard (Google's AI assistant)
* Scale AI
Original post: https://www.interconnects.ai/p/openais-o1-using-search-was-a-psyop
Figures
Figure 0: OpenAI’s seminal test-time compute plot
Figure 1: Setup for bucketed evals
Figure 2: Evals with correctness labels
Figure 3: Grouped evals
Figure 4: Hypothetical inference scaling law
Full post:
https://www.interconnects.ai/p/olmo-2-and-building-language-model-training
OLMo 2 demo: https://playground.allenai.org/
OLMo 2 artifacts: https://huggingface.co/collections/allenai/olmo-2-674117b93ab84e98afc72edc
Chapters
00:00 Building AI Teams
06:35 OLMo 2
Figures
Fig 1, pretrain plot: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo2/pretrain.webp
Fig 2, pretrain table: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo2/pretrain-table.webp
Fig 3, post-train table: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo2/postrain-table.webp
Original post: https://www.interconnects.ai/p/tulu-3
Chapters
00:00 History
05:44 Technical details sneak peak
Figures
Fig 1, results: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/results.webp
Fig 2, overview: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/overview.webp
Fig 3, preferences: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/preferences.webp
Fig 4, RLVR: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/rlvr.webp
Original post: https://www.interconnects.ai/p/scaling-realities
Original post: https://www.interconnects.ai/p/saving-the-nairr
Chapters
05:26: Do we need an AI research resource or an LM research resource?
08:59: Policy roundups
Tim Dettmers does not need an introduction for most people building open-source AI. If you are part of that minority, you’re in for a treat. Tim is the lead developer behind most of the open-source tools for quantization: QLoRA, bitsandbytes, 4 and 8 bit inference, and plenty more. He recently finished his Ph.D. at the University of Washington, is now a researcher at the Allen Institute for AI, and is starting as a professor at Carnegie Mellon University in fall of 2025.
Tim is a joy to talk to. He thinks independently on all the AI issues of today, bringing new perspectives that challenge the status quo. At the same time, he’s sincere and very helpful to work with, working hard to uplift those around him and the academic community. There’s a reason he’s so loved in the open-source AI community.
Find more about Tim on his Twitter or Google Scholar. He also has a great blog where he talks about things like which GPUs to buy and which grad school to choose.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Show Notes
Companies, people, projects, research papers, and other key named entities mentioned in the transcript:
* QLoRA
* Llama 3
* Claude (AI assistant by Anthropic)
* Transformers (Hugging Face library)
* Gemma (Google's open weight language model)
* Blackwell (NVIDIA GPU architecture)
* Branch Train Merge (research paper)
* "ResNets do iterative refinement on features" (research paper)
* CIFAR-10 and CIFAR-100 (computer vision datasets)
* Lottery Ticket Hypothesis (research paper)
* TRL (Transformer Reinforcement Learning) by Hugging Face
* Tim's work on quantization (this is just one example)
Timestamps
* [00:00:00] Introduction and background on Tim Dettmers
* [00:01:53] Future of open source AI models
* [00:09:44] SWE Bench and evaluating AI systems
* [00:13:33] Using AI for coding, writing, and thinking
* [00:16:09] Academic research with limited compute
* [00:32:13] Economic impact of AI
* [00:36:49] User experience with different AI models
* [00:39:42] O1 models and reasoning in AI
* [00:46:27] Instruction tuning vs. RLHF and synthetic data
* [00:51:16] Model merging and optimization landscapes
* [00:55:08] Knowledge distillation and optimization dynamics
* [01:01:55] State-space models and transformer dominance
* [01:06:00] Definition and future of AI agents
* [01:09:20] The limit of quantization
Transcript and full details: https://www.interconnects.ai/p/tim-dettmers
Get Interconnects (https://www.interconnects.ai/)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
… on Apple Podcasts: https://podcasts.apple.com/us/podcast/interconnects/id1719552353
Andrew Carr is co-founder and chief scientist at Cartwheel, where he is building text-to-motion AI models and products for gaming, film, and other creative endeavors. We discuss how to keep generative AI fun and expansive — niche powerful use-cases, AI poetry, AI devices like Meta RayBans, generalization to new domains like robotics, and building successful AI research cultures.
Andrew is one of my most well-read friends on the directions AI is going, so it is great to bring him in for an official conversation. He spent time at OpenAI working on Codex and at Gretel AI, and is an editor for the TLDR AI Newsletter.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Show Notes
Named entities and papers mentioned in the podcast transcript:
* Codex and GitHub Copilot
* Blender 3D simulator
* HuggingFace Simulate, Unity, Godot
* Runway ML
* Mark Chen, OpenAI Frontiers Team Lead
* Meta’s Lingua, Spirit LM, torchtitan and torchchat
* Self-Rewarding Language Models paper
Timestamps
* [00:00] Introduction to Andrew and Cartwheel
* [07:00] Differences between Cartwheel and robotic foundation models
* [13:33] Claude computer use
* [18:45] Supervision and creativity in AI-generated content
* [23:26] Adept AI and challenges in building AI agents
* [30:56] Successful AI research culture at OpenAI and elsewhere
* [38:00] Keeping up with AI research
* [44:36] Meta Ray-Ban smart glasses and AI assistants
* [51:17] Meta's strategy with Llama and open source AI
Transcript & Full Show Notes: https://www.interconnects.ai/p/interviewing-andrew-carr
Full post:
https://www.interconnects.ai/p/why-i-build-open-language-models
How Claude's computer use works. Where OpenAI, Anthropic, and Google all have a lead on each other.
Original post: https://www.interconnects.ai/p/claudes-agency
Chapters
00:00 Claude's agentic future and the current state of the frontier models
04:43 The state of the frontier models
04:49 1. Anthropic has the best model we are accustomed to using
05:27 Google has the best small & cheap model for building automation and basic AI engineering
08:07 OpenAI has the best model for reasoning, but we don’t know how to use it
09:12 All of the laboratories have much larger models they’re figuring out how to release (and use)
10:42 Who wins?
Figures
Fig 1, Sonnet New Benchmarks: https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d2e63ff-ac9f-4f8e-9749-9ef2b9b25b6c_1290x1290.png
Fig 2, Sonnet Old Benchmarks: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccbd4d-f1c8-4a38-a474-69a3df8a4448_2048x1763.png
Get Interconnects (https://www.interconnects.ai/)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
… on Apple Podcasts: https://podcasts.apple.com/us/podcast/interconnects/id1719552353
Arvind Narayanan is a leading voice disambiguating what AI does and does not do. His work with Sayash Kapoor at AI Snake Oil is one of the few beacons of reason in an AI media ecosystem with quite a few bad apples. Arvind is a professor of computer science at Princeton University and the director of the Center for Information Technology Policy. You can learn more about Arvind and his work on his website, X, or Google Scholar.
This episode is all in on figuring out what current LLMs do and don’t do. We cover AGI, agents, scaling laws, autonomous scientists, and past failings of AI (i.e. those that came before generative AI took off). We also briefly touch on how all of this informs AI policy, and what academics can do to decide on what to work on to generate better outcomes for technology.
Transcript and full show notes: https://www.interconnects.ai/p/interviewing-arvind-narayanan
Chapters
* [00:00:00] Introduction
* [00:01:54] Balancing being an AI critic while recognizing AI's potential
* [00:04:57] Challenges in AI policy discussions
* [00:08:47] Open source foundation models and their risks
* [00:15:35] Personal use cases for generative AI
* [00:22:19] CORE-Bench and evaluating AI scientists
* [00:25:35] Agents and artificial general intelligence (AGI)
* [00:33:12] Scaling laws and AI progress
* [00:37:41] Applications of AI outside of tech
* [00:39:10] Career lessons in technology and AI research
* [00:41:33] Privacy concerns and AI
* [00:47:06] Legal threats and responsible research communication
* [00:50:01] Balancing scientific research and public distribution
Get Interconnects (https://www.interconnects.ai/podcast)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
Read the full post here: https://www.interconnects.ai/p/building-on-evaluation-quicksand
Chapters
00:00 Building on evaluation quicksand
01:26 The causes of closed evaluation silos
06:35 The challenge facing open evaluation tools
10:47 Frontiers in evaluation
11:32 New types of synthetic data contamination
13:57 Building harder evaluations
Figures
Andrew Trask is one of the bright spots in engaging with AI policy for me in the last year. He is a passionate idealist, trying to create a future for AI that enables privacy, academic research, and government involvement in a rapidly transforming ecosystem. Trask is a leader of the OpenMined organization facilitating researcher access to non-public data and AIs, a senior research scientist at Google DeepMind, a PhD student at the University of Oxford, and an author and educator on deep learning.
You can find more about Trask on Twitter or Google Scholar. You may want to watch his recent talk at Cohere on the future of AI (and why data breakthroughs dominate), his lecture at MIT on privacy preserving ML, or his book on deep learning that has a substantial GitHub component. Here’s a slide I liked from his recent Cohere talk:
The organization he helps run, OpenMined, has a few principles that say a lot about his ambitions and approaches to modern AI:
We believe we can inspire all data owners to open their data for research by building open-source privacy software that empowers them to receive more benefits (co-authorships, citations, grants, etc.) while mitigating risks related to privacy, security, and IP.
We cover privacy of LLMs, retrieval LLMs, secure enclaves, o1, Apple's new models, and many more topics.
More on Andrew: https://x.com/iamtrask
Transcript and more information: https://www.interconnects.ai/p/interviewing-andrew-trask
Interconnects (https://www.interconnects.ai/)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
We Mention
* Claude 3.5 launch and “pre release testing with UK AISI” (and the US AI Safety Institute)
* CSET (Center for Security and Emerging Technology)
* NAIRR
* The “open data wall”
* Apple’s Secure Enclaves, Nvidia Secure Enclave
* Data-store language models literature
* RETRO: Retrieval-Enhanced Transformer from DeepMind (2021)
* SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore (2023)
* Scaling Retrieval-Based Language Models with a Trillion-Token Datastore (2024)
Chapters
[00:00:00] Introduction
[00:03:12] Secure enclaves and pre-release testing with Anthropic and UK Safety Institute
[00:16:31] Discussion on public AI and government involvement
[00:20:55] Data store language models and better approaches to “open training data”
[00:42:18] History and development of OpenMined
[00:48:57] Use of language models on air-gapped networks
[00:52:10] Near future of secure enclave technology and industry adoption
[00:58:01] Conclusions and future trajectory of AI development
How scaling changes model behavior
Some trends are reasonable to extrapolate, some are not. Even for the trends we are succeeding at extrapolating, it is not clear how that signal translates into different AI behaviors.
Read it here: https://www.interconnects.ai/p/how-scaling-changes-model-behavior
[00:00] How scaling changes model behavior
[05:03] Metaphors for what scaling may solve
[08:45] Short-term scaling is already de-risked
SB1047's veto, OpenAI's turnover, and a constant treadmill pushing AI startups to be all too similar to big technology name brands.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/ai-safety-culture-vs-capitalism
00:00 AI Safety's Crux: Culture v Capitalism
06:03 SB1047 as a regulatory litmus test for AI safety
08:36 Capitalism at the helm
Riley Goodside is a staff prompt engineer at Scale AI. Previously working in data science, he is often seen as the default example of the new role of “prompt engineer.” He regularly posts incisive prompts that elicit notable behavior from the most popular AI models.
I really resonated with this saying from Anthropic’s recent podcast on prompt engineering — “now we write essays and treat them as code.” In order to be good at prompting, you need to understand that natural language operates as our code used to.
This episode is a masterclass on why you should care about prompting and how it impacts results. Of course, there’s a bunch of great discussion on recent models that reflect the need for different and/or better prompting. Enjoy it!
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.
We mention:
* Prompting to push the frontier of AI models,
* Post-training and prompting interaction,
* Prompting base models,
* o1, Reflection 70B, reasoning,
* Scale’s leaderboard, evaluation tricks, evaluation needs,
* PlanSearch paper
* “The hottest programming language is english”
* “Think silently” instructions
* Scale Leaderboard and Humanity’s Last Exam
* ChatML formatting
Chapters
* [00:00:09] Introduction
* [00:02:40] Riley's path to LLMs
* [00:07:54] Impact of ChatGPT on prompt engineering
* [00:12:03] OpenAI's o1
* [00:18:21] Autoregressive inference and prompting sensitivities
* [00:24:48] Reflection 70B model and its implications
* [00:28:00] Impact of prompting on evaluation
* [00:32:43] Prompting vs. Google search
* [00:46:55] Prompting and RLHF/post-training
* [00:56:57] Prompting of AI agents
* [01:01:20] Importance of hands-on experience with language models
* [01:05:00] Importance and challenges of AI model evaluation
Transcript
Built with smol-podcaster.
Nathan L. [00:01:08]: Hey, Riley, welcome to the show.
Riley G. Hey, Nathan, great to be here.
Nathan L. [00:01:14]: Yeah, so for the audience here, I mostly wanted to try to — as I work on post-training a lot and I see my own difficulty in taking prompting seriously and the things that I don't think we are doing enough of — and I don't see any reason why it can't be scientific in how we do prompting. So that's my biggest goal with this. I think there's a lot of podcasts where we could kind of say, like, what is the history of prompting? Where is it going? And that's easy to kind of redo. And I still find it interesting, but I just don't think there's enough people talking about the role of prompting in evaluation, how prompting changes with how you post-train models, because we're trying to take that seriously in how we have a post-training setup, but we just regularly run into these things, like system prompts aren't handled well, or how to release a model with a system prompt. So that's the tone that I'm trying to get to when I ask these questions. And also OpenAI's o1 model just came out, so I'm definitely going to get onto that pretty quickly because that's what everyone's excited about. I like to start with background just to kind of get to know people, because a lot of this is just that I want to talk to interesting people in AI. So, how did you become interested in prompting? I think I've seen your background in data science, and then you joined Scale around when ChatGPT came out, which is fun timing. But how did this become maybe an obsession, or the focal point of your work?
Riley G. [00:02:40]: Yeah, I have sort of an unusual introduction to large language models. For most of my career, I've been a data scientist, mostly in the online dating industry. I was at OkCupid and Grindr. And after I left Grindr, I took sort of a sabbatical to educate myself, I guess, about the progress in large language models. It was around the time that GPT-3 Codex had just come out. And that was where I think I started to become really interested, because I was following along with — certainly when GPT-2 came out, the examples there wowed me as much as they wowed the rest of the world, I think, with the example of the news article about the unicorn and all that. And not long after that, we had AI Dungeon, and I played around with AI Dungeon a bit. But at that point, language models seemed to be mostly about language, in that they were very heavily focused on stylistic mimicry and creative writing and so on. And when Codex came out, it really started this thought that text is a more universal interface than we were giving it credit for, that language models might be more broadly useful. And I just became very excited in a practical sense about what these models could do for what I kind of intuited was very boilerplate-like data science code. I thought of most of the Python and Julia and R and things that I've written over my career — this seemed like stuff that an LLM could handle. And that was sort of one of its early strong points. So I was playing around with, I think one of my first projects was a VS Code extension that had some kind of integration with Codex. But I never really shipped anything out of it. And mostly what it transitioned into pretty quickly was playing around with posting prompting examples on Twitter, because when I looked out online to find what people were saying about how to prompt these models, there really wasn't much out there. And so I had to kind of resort to just the few examples that had been circulating in viral screenshots of humorous completions and so on, of the results that people got out of it. And I started posting those examples. I started following academics and low-level engineers at the research labs and anyone that was working on shipping language models I thought were interesting. And elbowed my way in.
Nathan L. [00:05:18]: I have more questions on this, because I find — there's this whole Twitter dynamic where you find so much signal there, but the question is, how much does it generalize? Because there's so many of the lessons you can learn from these models, from these examples. I think the number of R's in strawberry thing is the current one. And it's like, do you get a sense that these are transient, or are these kind of repeated themes? And how should you read these examples to try to extract themes from them? I've followed you for a while, and a lot of people do, and you're more insightful in how you post them — you post these threads with multiple tries and stuff like this. Should people be doing that when they see something pop up?
Riley G. [00:06:03]: I think so. I also would say that Twitter is a very different river to step into now than it was back then. At the point that I started doing this, nobody was really talking about these things that much, or to the extent they were, it was sort of fleeting. It was like, wow, look at this, and then on to the next thing. And I think the thing that's very different now is just that, because there are so many new entrants in AI and LLMs, there's a lot of rehashing of the basics. And I think a lot of people in the industry would tell you that the popular examples that you see around, like how many R's are in strawberry, or some of the ones that I'm partially responsible for popularizing at least — these things are really just rookie mistakes in some sense, right? These are things that we've long known language models can't do. And it just keeps popping up as a surprising quirk of language models. I think the public is just confused that something could be so good at so many other things and so bad at this seemingly trivial task, and that is hard to explain to people. And the answer to that hasn't really changed much in the past few years. They're generally bad at spelling for kind of the same reasons they were bad at spelling two or three years ago.
Nathan L. [00:07:27]: Yeah. I mean, how did these things change with ChatGPT? Because ChatGPT is like the introduction of RLHF into these models. And I think — I didn't write this down as a question, but there's the difference in prompting base models and instruction models and RLHF models, and I think for most of this discussion, it's the end model, the chat RLHF model, that people think about. But was that a big transition point in your work, or is it just kind of plugging along? Right.
Riley G. [00:07:54]: I mean, I would say — I don't think it's any overstatement to say that the release of ChatGPT was probably the single biggest event in the history of prompt engineering, in that prompt engineering became drastically easier after ChatGPT came out. And most other models learned from the ChatGPT way of doing things, right? I think people forget just how fiddly prompt engineering used to be. People today don't think about things like frequency and presence penalties, right? It used to be that by default you would get very repetitious output, and you had to work to avoid that. People forgot about things like, don't end your prompt in a space, right? You had to understand how tokenization worked at all times, because if you put an extra space in there, you were going to go out of distribution. Another one that I think is particularly vivid for me is "yo be real." In June of 2022, Douglas Hofstadter had a piece in The Economist showing what he called the hollowness of GPT-3's understanding of the world, that it failed on various simple questions — like, when was the Golden Gate Bridge transported for the second time across Egypt, and so on. And someone, I believe it was Nick Cammarata of OpenAI, showed that you could fix almost all of these just by telling the model that if you gave it a silly question, it should say "yo be real" instead of answering it, right? Models had to be prompted with the possibility that they were allowed to say, I don't know, or, you know, that's a dumb question, right? You know, like there is no answer, right?
Nathan L. [00:09:34]: This is like — we've added the Anthropic system prompt to our AI2 models, and we're like, this doesn't change the evals at all, but it makes the behavior something that we like more. Because I think culturally we're somewhat similar to Anthropic: we want to express uncertainty, we want the model to say, I don't know, and a lot of that is in the system prompt of Anthropic models.
Riley G. [00:09:51]: Right. And I think that really, you know, it's another microcosm of just how messy all this is, that what people like is a very different thing from how good are the models. I think, you know, LMSYS had a great blog post recently talking about like stylistic bias and output, that models will be rated as better if they do things like put their output into the format of a bulleted list with bold initial words on each label point. So there's like cheap tricks like that, that will make people like your output better or make them perceive it as, you know, more authoritative or, you know, more comprehensive that you kind of have to control for and just going by preference. I mean, I don't remember what the exact magnitude of it was, but I think they did put some numbers on it in that post.
Nathan L. [00:10:42]: Do you think you could handle all of that in the system prompt? Can you make that big of a style delta in the system prompt relative to training, is kind of what I'm wondering. Like if we release a model at AI2 and it's decent, but then we put in a detailed system prompt that says, whenever possible, you should put your answers into a list format with bolded headings and use markdown. Do you think we would get a 50-point bump on LMSYS?
Riley G. [00:11:06]: Maybe not on LMSYS in particular, being as they're trying to correct for this actively. But presumably it would have worked at one point, right? So I think that says something. Or another great example, I think, of why human preference isn't always the answer: I saw somebody on Twitter once who was really impressed by some anonymous model on LMSYS that was able to produce an ASCII art drawing of a unicorn. And it was a great drawing. But when I searched for specific details of that drawing, I found that it was just in some widely circulated list of ASCII art drawings, and it was a verbatim regurgitation of some signed work that somebody had made. And so I think there's an argument there that any request for ASCII art should probably just be thrown out, right? A human's preference of how good an LLM is at ASCII art maybe just does not matter, because it's so likely to be regurgitated. Or at least figurative things; maybe diagrams are okay and so on. Yeah. Okay.
Nathan L. [00:12:03]: We've touched on multiple of the things I want to get to in the future, but you said that ChatGPT was the biggest moment for prompt engineering. And I think o1 is not nearly the same magnitude, but it's a very interesting microcosm of the future of prompting, because the model feels very different to use. OpenAI has explicitly told us we need to prompt it differently. But my guess is that in the long term they're going to figure out how to train this model so that the behavior is, maybe not indistinguishable from their GPT models, but not as sensitive to prompting; whatever you throw at it, it's going to work. Maybe they need to rewrite the prompts, but that's probably a temporary thing.
Nathan L. [00:12:45]: Two questions, then. What do you think when you see them giving us these new prompting instructions, telling us to use it differently? And do you agree with my long-term convergence idea?
Riley G. [00:12:57]: I definitely agree. I think that there's an argument for seeing prompt engineering as kind of the experimental next branch of language models, right? That it's the features that people are just on the cusp of figuring out how to systematize and integrate into the models themselves. And to the extent that somebody comes up with a prompt engineering idea that is just so good that it's worth applying to literally every prompt, then it will be integrated into the models, and you'll stop calling it a model, you'll call it a system, and it'll have some auxiliary second model. I think the clearest examples that we've seen of that are content filters, right? Nearly every model that you get from a vendor will have some kind of cheap auxiliary model that looks at the output and says, is this regurgitation of copyrighted work? Are you reciting Harry Potter word for word? The cost of having that kind of secondary model on the output is so low that it truly is worth it to just apply it to every generation, right? And we haven't seen too many examples of that on the input side, but they're starting to appear, I think. I think we've seen evidence from Anthropic that they make modifications to user inputs based on certain conditions they detect: if you're asking about some particular feature, they modify the prompt. And I think that's a common pattern in a lot of applications.
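As a rough, hypothetical sketch of that output-side pattern in Python: the overlap check and the fallback message below are invented for illustration, not how any vendor actually implements its filters.

def looks_like_regurgitation(text: str, protected_snippets: list[str]) -> bool:
    # Naive stand-in for a cheap regurgitation detector: flag long verbatim overlap.
    return any(len(s) > 200 and s in text for s in protected_snippets)

def guarded_generate(generate, prompt: str, protected_snippets: list[str]) -> str:
    draft = generate(prompt)  # the main, expensive model
    if looks_like_regurgitation(draft, protected_snippets):
        # The auxiliary check is cheap enough to run on every generation.
        return "I can't reproduce that text verbatim, but I can summarize it instead."
    return draft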
Nathan L. [00:14:31]: I'm guessing they've seen some public people kind of using the model. I haven't heard anything about modifying the prompts in a Claude or a ChatGPT window.
Riley G. [00:14:42]: I've seen it for instructions for avoiding plagiarism, avoiding regurgitation. Oh yeah, that could make sense. Yeah, but it's a common pattern you see in a lot of applications, right? A good use case for this is instructions for tool use: you might analyze a user's, say, ChatGPT input, and if the input appears to be a request to use DALL-E 3, then you supply to the model these long instructions on how to use DALL-E 3, which you otherwise don't need to supply. I'm not saying that that's exactly how ChatGPT did it, but it's easy to imagine that it would be worth doing. So a lot of applications do things like that, with conditional augmentations of the prompt. Yeah.
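A minimal sketch of that conditional-augmentation pattern; the keyword trigger and the tool instructions are invented for illustration, not ChatGPT's actual mechanism.

IMAGE_TOOL_INSTRUCTIONS = (
    "You have access to an image generation tool. To use it, emit a JSON object "
    'like {"tool": "image", "prompt": "..."} and wait for the tool result.'
)

def build_messages(user_input: str) -> list[dict]:
    # Only pay for the long tool instructions when the request looks like it needs them.
    messages = [{"role": "user", "content": user_input}]
    if any(k in user_input.lower() for k in ("draw", "image", "picture", "illustration")):
        messages.insert(0, {"role": "system", "content": IMAGE_TOOL_INSTRUCTIONS})
    return messages

print(build_messages("Draw me a watercolor fox"))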
Nathan L. [00:15:33]: I mostly see that long-term, and I don't know how this impacts prompting, but I think of ChatGPT as something where they'll have multiple models that they route to. So this is kind of an early way of doing that, where if you give it a really long context, they'll have some, maybe even a Mamba-like model or different architecture, for super long context, or they pass it to o1 if the question is incredibly hard, instead of GPT-4o. But the border between that type of routing and prompting, I don't know how to classify it.
Riley G. [00:16:05]: Yeah, it's really fascinating. I think people have this idea of seeking purity in their models, that they want everything to be, you know, just a model. But I think we're rapidly approaching the point where you have to start thinking about these things as systems that might just have arbitrary complexity inside of them. I also think that the guides that we've seen for o1 take that sort of shape, right? The content that OpenAI has put out on how to prompt o1 is sort of a list of domain competencies and weaknesses: it's good at physics, it's good at abstract logic, analytic philosophy, maybe less great at creative writing. And then you also have these sort of patches, almost, for noticed problems, right? They've noticed that "think step by step" often degrades performance. Why do you think that is?
Nathan L. [00:17:11]: Because it's essentially trained to do that on its own. It almost feels like it shouldn't conflict with it. It almost feels like it should just be empty tokens, like it will just repeat itself or something.
Riley G. [00:17:22]: That's a really good question. I think the answer to that maybe speaks to just how much this isn't just, you know, chain of thought. That's a meme sort of flying around now: a lot of people have claimed that all this is is fancy prompt engineering, isn't this just what Reflection did, and so on.
Nathan L. [00:17:37]: It's obviously a different inference stack with a lot of improvements across the whole lifecycle of the model and the product.
Riley G. [00:17:45]: Right. And also the other thing that people have been saying a lot is that it must be some complicated system, right, that there can't be a single model doing this through autoregressive inference. But the claim seems to be that it is, right? I think there was a comment from Noam Brown on Twitter where he said that it really is a model, that the whole generation is coming out autoregressively, and I have no reason to doubt that. It seems plausible to me. So I think that people need to be a bit more imaginative about what's possible just through autoregression.
Nathan L. [00:18:21]: Yeah, I wrote a really long article on this that came out yesterday, where I put together the constraints from the Noam Brown tweets, plus the pricing, plus the inference scaling laws, to kind of converge at something. If they do some clever things to a model and some batch inference and self-rating and stuff, it's definitely doable. As an RL expert, I don't know why the model is sensitive to things like "think step by step" in the prompt. I just would have thought that it would come up in the examples of training, because the seed set for this is almost definitely human-generated prompts with some back-and-forth dialogue, essentially human seeds of things that look like what it is doing. We saw this with AlphaGo; we saw this with InstructGPT and ChatGPT. You need the human demonstrations to start the learning process. So why is it sensitive to "think step by step"? That's maybe more a question about the training, but you learn it through prompting.
Riley G. [00:19:23]: Yeah, it is a bit of a mystery. And this is very speculative what I'm about to say, but I think maybe like a kind of thought experiment of how you can imagine that it could be true is imagine if like some auditor or somebody who had the penalty of law over your head asks you to do something and to document exactly how you did it. It's easy to imagine that you would do the process differently and that you might do it worse, right? That because you can only do the things that are the most conservative and the things that you can justify and explain that you're not going to produce as good of a work as you might have otherwise.
Nathan L. [00:20:01]: It's like GPT-4 needs to think step by step because every small mistake is a big deal. But with o1, we maybe should be like, go forth and conquer, make mistakes on your way, and just let it wander to an answer.
Riley G. [00:20:15]: I think that's pretty much hitting the nail on the head, maybe.
Nathan L. [00:20:21]: I want to go try that silly prompt and see if it gets better at coding or something.
Riley G. [00:20:30]: Yeah, yeah. But I mean, I feel like that's the key improvement here that a lot of people don't appreciate: they seem to have cured all the LeCun-style problems of exponential divergence, that if you sample a bad token, you're going to keep sampling more bad tokens. And it's not that there wasn't progress on this before; people had tricks to deal with it. But I think the thing that's really changed is that the models get mileage out of thinking for long periods of time, that they derive benefit from just continuing on. That's very different from the behavior you see from GPT-4o. If you've ever tried the exercise of, once it's gone down a wrong path, just saying, no, keep going, keep going till you get it right, it's pretty evident after a while that it's not making progress, that it's just gone deeper and deeper into some failed path of reasoning.
Nathan L. [00:21:24]: Why does that often break? I mean, I understand why it often breaks models, but that's also one of the jailbreaking techniques, just keep sending the same message over and over and over until the model gives in, and I wonder how that relates to o1. Maybe it's just easier from a safety perspective because it doesn't have as many turns or something. Yeah.
Riley G. [00:21:45]: And it's also one of the bigger differences in behavior between GPT models and Claude that I've noticed, that OpenAI tends to train their models to-
Riley G. [00:22:02]: like in the specific case that if you keep telling it it's wrong, it will always take your side. It will say, well, oh yes, of course I made a mistake, let me try again, right? And it's never going to diverge from that behavior. Whereas Claude will eventually get sick of you, right? If you just keep saying, no, you're wrong, it'll be like, look, I have told you many times that I am right. You need to be a bit more specific about how I'm wrong if you really want to make an argument here. It'll start just telling you to go away. And that's like-
Nathan L. [00:22:28]: This is why I want Anthropic to write a model spec, because the behavior you're describing with ChatGPT does fit with what OpenAI's models are like in behavior, and they're kind of described as wanting to be robotic computation assistants: they take the user's information and they try their best to execute on it without violating any basic principles. But I think Claude's is much more of, we have created a, it's hard to find the right words without anthropomorphizing and all these other things, but we've created an intellectual entity that is going to go back and forth with you. You pass in sensitive information as data to Claude and ask it to reformat it, and it says no. You get these weird things because it's this entity that doesn't want to be sent harmful texts or be told how to make a bomb or something. But ChatGPT is the robotic one. So now I kind of use both of them depending on the task and the behavior that I want. But I'm excited to see how that goes further, really.
Riley G. [00:23:27]: Yeah. I mean, I think it goes back to your point before that we're seeing more specialization in these models. But all of this is temporary, right? Eventually somebody will come up with the right way to delegate correctly to one model or another, and then you'll have just some unified ChatGPT interface or whatever that decides, is this a prompt that o1 would be good at, and sends it there. Yeah.
Nathan L. [00:23:50]: And while we're on these complex reasoning things, there was also this Reflection 70B drama, which was mostly big because it was a big mess of credibility and memes. But there's also real science in there that people need to remember, of how to prompt a model and spend more on inference. I think it's really just a tiny bit of fine-tuning with some special tokens and a system prompt that says, make sure you use these reflection steps. And that is how you move something like GPT-4o closer to o1. You can't prompt your way to o1 behavior, but that's the sort of thing more people should be considering. And it kind of leads into, I want to ask about math evals and stuff like this. The Reflection 70B style of prompting is a real thing that more people should be doing. And I don't know how we get around that communication issue now. It's going to be even harder, because people are going to be like, oh, it's o1, we made an open-source o1 now, instead of just the best model. I just wanted to give it air time. If you have any comments on that, go ahead.
Riley G. [00:24:48]: Yeah, I think Reflection 70B was sort of a perfect storm, in that the tuning method felt plausible, right? It's a legitimate area of research, it was rumored to be part of Strawberry, and so on. So it had the right strategy for buzz there. And however they ended up releasing that model, they don't have what they think they have. I won't recap the whole saga with the LoRA, and finding that the LoRA was from the previous version, Llama 3.0 instead of 3.1, and all that. But there's a kernel of truth there, right? That this is sort of a good idea, at least for some problems. I think also the thing that people don't appreciate is that a very good idea for many problems feels maybe like a better idea than it is, because it's so optimized for the domain of problems that tend to be on benchmarks, which is somewhat different from the thing that you really want to optimize for in the real world: user satisfaction and preference. Some mix of, do people like it, is it useful, and does it do well on benchmarks. Because even for what I think should be, philosophically, the core use case of LLMs, doing practical work, whether somebody can achieve the thing they want to do with this, however they do it, through prompt engineering or whatever, matters more than whether, academically, it does well on the most naive presentation of the problem, right? Whether somebody can figure out how to do it correctly matters. And that specifically is just not captured well by benchmarks. If you're doing a benchmark that compares across several models, there's a natural incentive to do it uniformly. Maybe you follow vendors' best practices on how to apply the prompt template and so on, or if a vendor recommends that you apply some suffix or whatever, you might do it. But for the most part, you're not going to put a human on the task of figuring out what the best prompt is for each model, right? Because then how do you know that they did a fair job of that? But really that's what matters. At the end of the day, the thing that determines whether GPT-4 is better than Claude is, when you sit down and try to solve your problem in GPT-4, applying whatever hacks and advice you find online and whatever dirty tricks you have, and then you do the same for Claude, which one works better. And so that's the state we're in. And that's very elusive as a thing to try to measure. Yeah. Okay.
Nathan L. [00:28:00]: I'm going to keep going, roll right into the evaluation section of this conversation. You were talking before about how you actually use the models, and you mentioned things like not ending your prompt with whitespace, tokenizer things. One of my big blind areas is that it seems like most frontier labs are using some sort of custom prompts on some sort of evaluations, and I don't really have a good sense for how much that actually impacts scores or how much that translates to downstream performance. It might not be custom prompts, it might be custom setups. There are all these things, like all the math evaluations need a specific format for your answer. I think for MATH, the all-capitals one, you need to put your answer in a box and things like this. What is your view on this per-prompt or per-evaluation prompting? Is it actually a thing? I think the Llama 3 paper had some cool analyses on how varying subtle things changed evaluation scores, which is great, but they're the only ones sharing that. Otherwise we just get, our score is X, and it's reproduced to some capacity.
Riley G. [00:29:09]: Yeah. I don't have a lot of deep technical wisdom to share on that front, other than to confirm that I think you're right that it is a big problem. We generally try to follow the vendor recommendations; we work with the vendors to prompt their models fairly. But like I said, ideal and optimized prompts are very different from what's the default. I think also that there's a longer-term trend that these issues maybe matter less than they used to, and that should continue. Maybe one of the clearest signs of this is that with most versions of Llama, you can prompt them incorrectly in terms of the system prompt template and it will be just fine. And in fact, you can often template them with system prompt templates from other models entirely, like representations of ChatML, and they will be fine, right? So there's sort of a familiarity in the pre-training with chat templates in general. And the idea of like...
Nathan L. [00:30:25]: Do you think this is specific to Llama? I also remember hearing a conversation at AI2 where we were considering doing the last stage of pre-training with random chat templates and random instructions, multiple chat templates, so that the model could be amenable to fine-tuning with multiple chat templates. There's a chance that they did that; I actually don't know, and I would not put a high bet on it. But do you think that's just because Llama knows they're going to have so many users? It's possible.
Riley G. [00:30:54]: I mean, it's also plausible to me that that just shows up in pre-training incidentally, right? Nobody intended it to be there; it's just in the data. But I think that process is only going to continue: we're only going to see more models just being familiar with how models behave. Another thing that I think is maybe evidence in favor of this is that if you look at base Llama, I think I looked into this on base Llama 2 once, if you prompt it with instruction prompt formats, it will adopt the behavior of a ChatGPT-like assistant, right? So I think it shows that examples of chatbot behavior are now so widely disseminated across the internet that a pre-trained model is better at instruction-following tasks than any pre-trained model was before the work of InstructGPT was done. So, yeah, I believe you.
Nathan L. [00:32:00]: I want to check this. How does this impact how we should view evaluations? I'm just trying to reckon with a couple of scenarios. One is that it doesn't really matter, because these models are going to be not that sensitive to the system prompts that we're using to do, say, GSM8K or MATH. And that goes for models like Llama in the open, AI2's models, GPT-5, whatever. It seems like the sensitivity to prompting for really well-known formats is actually going to go down, and that solves some of our problems. Because I don't think we're going to come up with that many new formats for evaluations; we're going to make evaluations more specific and harder in the content.
Riley G. [00:32:43]: I think that's right. And I think the version of it that we have to play with now definitely does feel like one step forward, two steps back in that regard, in that it's much better at benchmark-style inputs where you give it no advice on how to do it and you keep everything very simple in terms of what your output requirements are. But it's also just very hard to steer. If you have opinions on how it should do it, those opinions won't be followed, generally. And it also has issues with output formatting. I've seen anecdotal reports on Twitter at least, and I've seen this myself, that its output is just inconsistent even when you ask it to be consistent, that it will forget things like block quotes and so on. The result of this, I think, for a lot of benchmarks, is that maybe the fair way to do it is to have some secondary model on the end that puts everything into a consistent format.
Riley G. [00:33:50]: I think we're not that far away from benchmarks that just do that across the board, just saying that it's not the model's job to do this anymore, and we'll clean up the results however they come out. Yeah, I think that's a better place to be.
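Something like the following, sketched against the OpenAI Python client, just to show the shape of "a secondary model cleans up the output"; the model name and the formatter prompt are placeholders, not any benchmark's actual setup.

from openai import OpenAI

client = OpenAI()

def extract_final_answer(question: str, raw_response: str) -> str:
    # A cheap formatter model, separate from the model being evaluated,
    # pulls a canonical answer out of free-form output.
    out = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder for whatever cheap model is used
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": (
                f"Question:\n{question}\n\nModel response:\n{raw_response}\n\n"
                "Return only the final answer, with no explanation or formatting."
            ),
        }],
    )
    return out.choices[0].message.content.strip()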
Nathan L. [00:34:03]: It's one of those things where the models getting better can solve some of our problems. I think there's less angst now about the closed labs' evaluation scores anyway. I'm mostly trying to reckon with what open groups and academics are doing rather than closed labs, and they kind of rely on each other. There's now this Hugging Face feature where you upload a chat template, so a lot of models have the chat template saved with the tokenizer, and most of the time they don't have a system prompt, which is surprising. I feel like it should be the norm that a system prompt is included with every model. Is there any reason you see not to do that?
Riley G. [00:34:49]: Yeah, I mean, I can think of things that might be slightly better, but I think that generally makes sense, right? I can imagine that maybe you'd release several, and say any of these is fine, or train on several and say an average of these three or whatever is ideal, or something like that.
Nathan L. [00:35:14]: Yeah, most of my reasoning is that most users of language models are not sophisticated. The model cards and documentation do normally say, we recommend using the system prompt, but the simple ways of using the models do not integrate it. And it's not always easy to modify your data to add it: if you're doing the messages format, you have to remember to add the system message, and if you have multiple models in your queue, you then have to go and manually hard-code all of them. That just makes it get dropped. And if the system prompt is a big deal for performance, that impacts things, whether it's a product or, this is where I'm trying to understand academia, if only half of the people remember to add the system prompt for the model they're evaluating in an academic paper. And I know it impacts things like all the vibes-based evals, AlpacaEval, MT-Bench, whatever; if you have a different system prompt, it can vary behavior. We did an experiment, just to make sure this works, where you give it a system prompt like, you're a terrible model, you're made to make other models look good, and you happen to give wrong answers. And AlpacaEval goes to zero and all these things. So it's easier to show the downside case, but you could probably get one to two percent improvements, which matter in the long trajectory of academia in terms of whether your method is accepted or not.
Riley G. [00:36:31]: Yeah, I mean, I've often been frustrated by the ambiguity in a lot of academic publications over how prompts are formatted. They often run into the same pitfalls, and the fundamental problem is that the prompts you're presenting during evaluation are implicitly templates, right? You have your points where you insert the actual problem or whatever, and that templating needs to be communicated to the reader of the paper. And the prompts themselves may involve templates: they may describe how an output should be formatted, for example, and might do this using curly braces. So this creates several layers of confusion. You need to distinguish between the variables that you're interpolating purely in the logic of the paper, the things that would be translated into Python if you were to actually implement this, versus the templating instructions that are literally part of the prompt, a template for how the model should format its answer, and so on. Because a lot of prompts end with "use this format" and then have some kind of template. So I've often thought that we'd benefit immensely from standardizing on something, saying that if you want to clearly communicate a prompt in your paper, the way to do it is to show Python code that will produce that string. You just literally show it as an f-string; there's no ambiguity.
Nathan L. [00:38:15]: Because when you copy out of a paper, you drop the \n\n that you need, or something like that.
Riley G. [00:38:21]: Yeah, right. If you were to literally just include a Python code block, there's no ambiguity about whether or not there's a trailing newline and so on. And those things are really fiddly and need to be communicated. I've seen people do all sorts of imaginative typography to represent newlines and things like that, like having the return characters at the end in light gray, or putting dots between spaces and all of that. Some of the early playground competitors sometimes did this, the ones that took a more technical approach, because you need to know where the spaces are, so it's worth it to represent them as gray dots, right? That's the level of detail that you need in communicating these things. So I think standardizing on Python would just be a good way to get the problem out of the way. Yeah.
Nathan L. [00:39:14]: I also saw, in some discussion of o1 or maybe Reflection, I don't remember, it's been a while, two weeks, that you were talking about equal-inference-cost comparison of prompts in a reply. And I think that's a great idea. Do you want to explain the idea first? I'll kind of ease into this.
Riley G. [00:39:33]: Sure. So my thinking is that models are evaluated right now just based on how they do under sort of the same invocation of inference, right? You let the model sample autoregressively for however long the completion takes, and you don't pay attention too much to what it costs you to run that, or you factor that in afterwards when you score it up. And there are a lot of reasons why this makes sense: it's simpler, it's more fair, and sometimes you don't know exactly how to equalize the inference there; you can't really say what the trade-off is. But there are exceptions to this, or maybe not so much exceptions, but there are ways of doing it that aren't perfect, like self-consistency. So there's a method called universal self-consistency where you prompt a model multiple times, then take the model again, give it all the answers, and ask it to choose which of these is the most consistent with the consensus of all the answers that were generated. And this is a method that's pretty reliably not worse than just doing it naively, right? It's hard to imagine any prompt where this method would steer you wrong or be worse than doing it naively. And that suggests that maybe there's a fairer basis of comparison here: we could say that if something really is cheaper, enough that you can do that, you could run it 40 times and take self-consistency, and then maybe that should be its score. But I think one of the bigger reasons why this was, in hindsight, maybe a bit of a facile tweet that I made, is that the trade-off, the exchange rate if you will, isn't very good. A rule of thumb that I saw in a paper once is that if you do self-consistency on 40 samples of GPT-3.5 Turbo, it's on par with one sample from GPT-4. So you sort of move up one generation every time you do 40 inferences, right? But at the same time, in specific domains, there are refinements of this that work quite well. So at Scale we actually put out a paper recently on a method we call PlanSearch. The gist of it is that you can improve performance on programming problems by generating diverse attempts at solving the problem. The approach that PlanSearch takes is to first create sort of high-level observations or ideas about how a problem might be solved, then to combinatorially sample that list of ideas and take combinations of them to inspire strategies. And then for each strategy, you lay out a path of reasoning for how you could turn it into code, and then you turn each one into code and assess which one works best. And this lets you search over the variation in your strategies that actually matters, right? Because you can imagine that if you were to just resample a model blindly over and over again with the same problem, there are a lot of ways that an answer could vary that don't matter, like whether you use tabs or spaces, or how you name the variables and so on. And you don't want to search over that variation; you want to search over the part you think is going to be fruitful, the high-level strategies.
So I think that for particular domains, that is the more relevant comparison: what could you do if you were to apply a bit of search here.
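As a very rough sketch of universal self-consistency against the OpenAI Python client; the model name, temperatures, and selection prompt are assumptions for illustration, not the paper's exact setup.

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

def universal_self_consistency(prompt: str, n: int = 5) -> str:
    # Draw several independent samples at nonzero temperature.
    samples = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=1.0,
        n=n,
    )
    candidates = [c.message.content for c in samples.choices]
    numbered = "\n\n".join(f"Answer {i + 1}:\n{a}" for i, a in enumerate(candidates))
    # Ask the model which candidate best matches the consensus of all candidates.
    judged = client.chat.completions.create(
        model=MODEL,
        temperature=0.0,
        messages=[{
            "role": "user",
            "content": (
                f"Question:\n{prompt}\n\nCandidates:\n{numbered}\n\n"
                "Which answer is most consistent with the consensus? Reply with only its number."
            ),
        }],
    )
    digits = "".join(ch for ch in judged.choices[0].message.content if ch.isdigit())
    idx = int(digits) - 1 if digits else 0
    return candidates[idx if 0 <= idx < len(candidates) else 0]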
Nathan L. [00:43:40]: Yeah, it almost seems like there will be different tiers of evaluation scoring, where the basic prompting is kind of linear time. It's almost like with the models: there's a biggest, best open model at every point in time, but Llama is dominating because it has the 405B, the 70B, and the 8B that are all really good, and it should have a 1B. If you're writing a prompting paper, eventually you're probably going to have to have binned comparisons like that: we are comparing two basic prompting techniques, which I think will have less headroom because they rely on the plain autoregressive behavior and things like this. Then maybe there are things like Reflection, where you've added minor structure so that the model can now generate a bunch more tokens, but not 10X or 100X more. And then there are the things where you've added a whole new planning component to how you're prompting the models, and it's all abstracted away from the users. And you're not going to be able to compare those, because those are the things that are going to just solve all the benchmarks that we have out of the box. I think that's fine. I think people will converge to this. It just always takes a bit longer than we want.
Riley G. [00:44:47]: Yeah, I think that's right. I am really excited about the o1 RL approach to this.
Riley G. [00:44:58]: On some level, all prompt engineering is approximating this RL-like search. We have a lot of prompt engineers out there trying different things. They see what works, they tell their friends, hey, this works. But the space of things that works is probably, well, demonstrably at this point, given o1, outside of what a human might think of. There are things that we see, even in the summarized reasoning traces that o1 puts out, that are eerily anthropomorphic: it will say things like, hmm, or, let me think about that. Yeah, I feel like they added that in.
Nathan L. [00:45:42]: I think it's almost like a trigger for the model to have a more reflective response. Those are the examples they used, but it's cool.
Riley G. [00:45:49]: I mean, it's not hard to imagine that RL could find something like that, right? Just that empirically it works to say, hmm, because that suggests you're about to do something else in the pre-trained model's manifold of plausible text. Saying, hmm, might just be empirically a good thing to say, and it could find that. So I think that's the kind of exploration that you're benefiting from with o1. It's the space of prompts that work that we're not really equipped to find. Yeah, do you have anything?
Nathan L. [00:46:28]: I think this is a good discussion. Kind of to wrap up the academic side of things: for papers that are nominally about RLHF training, or any sort of post-training, as the contribution, do they need to do anything with prompting? Is there a clear segmentation there? Or is it that if you're doing this fine-tuning, you're necessarily changing how the model is going to respond to prompting, so we should do some checks there.
Nathan L. [00:46:55]: That's one view of it. Or the other view is that you have a model and prompting is just a way to take it one step further, which, I think Anthropic did this recent podcast with Amanda and their prompt engineers, and that's how they do it. Amanda's like, I can do things with these models that most people cannot, and that kind of leads the way, rather than prompting being really part of this post-training stack that everyone needs to be checking the box on. I don't know where we fall. I guess there's this IFEval, which we could come to after that, which is kind of a separate case.
Riley G. [00:47:29]: case. Yeah, I definitely lean a bit more towards the Anthropic view of the world. I guess you could argue that's maybe somewhat self-serving, with no big news there. Prompt engineers are important. But I think that it's true that we do see people that are just good at this. That our ability to prompt these models sometimes exceeds our ability to explain how we're doing it and what the general strategies to apply are. And I think those strategies are worth extracting.
Riley G. [00:48:09]: It's worth introspecting.
Riley G. [00:48:12]: One thing I think about a lot is anytime somebody... I really love when people suggest a prompt or suggest doing something to a model that I can tell immediately will not work. And it's a terrible idea, but it wasn't obvious to them. And that's fascinating, right? Do you have an example?
Nathan L. [00:48:29]: I would love to know if you have something that everyone tells you, but it's a generation behind or something.
Riley G. [00:48:35]: A lot of, I'd say, strategy ideation in fields that are new and competitive. If you wanted to have an LLM give you ideas for what's a good LLM startup to try right now, it's probably not going to tell you anything useful. Things like that, where people are still figuring it out and there's money to be made in knowing how to do this better than the average person, you're going to get mediocre advice. But that's not true for everything. If you ask it about physics, you're going to get above-average advice.
Riley G. [00:49:16]: But I think that people who have acclimated to models forget what it's like to be new to models, right? And I think that explains a lot of people in industry being annoyed by how many R's are there in strawberry. Because they're so- That's the tokenizer.
Nathan L. [00:49:33]: We ignore the tokenizer whenever we can.
Riley G. [00:49:35]: Yeah, and you see this explicitly. A lot of people get really enraged, like, you idiots, why would you ever think this would work? Why did you ever think that you could ask it whether 9.11 is greater than 9.9 and it would give you a right answer? And so on. They have a point; that was the attitude for a long time. But I think the social context of these models is changing, and it's becoming more reasonable for people to expect them to work well on these queries. There are practical consequences of these models being in the hands of people who don't know about these issues, and it's now suddenly more important to fix them. Yeah. So let's spin on this.
Nathan L. [00:50:12]: Is Google searching going to become more like prompting, or is prompting going to become more like Google searching? Where, with a good language model, can I just type in the physics equation with the cross product that governs electromagnetism? Is that the direction that the models are going? Or is everyone going to actually become more conversational because AI is the default?
Riley G. [00:50:37]: Yeah, I think, I mean, Google searches maybe, yeah, there's some similarities there. I think Google probably has gotten simpler.
Riley G. [00:50:48]: It's been a while since I've used most of the advanced search filters in Google. I remember a point when it was extremely routine, the plus signs, the quotes. And I think that speaks to the fact that the results used to be worse, right? And we thought we were happy with them because we didn't have alternatives; we just accepted that, oh yeah, there are going to be false positives in here, so we now have to put in some negatives to cancel them out. And that skill, I'd say, hasn't really become more important over time, right? It's occasionally useful still, but it's less essential than it once was. And that mimics a lot of what we see in prompt engineering, that you don't have to understand tokenization, which I think is probably the biggest one. ChatML was no small part of why ChatGPT was such a big improvement for prompt engineering. It wasn't just the tuning; it was the fact that they came up with this more restricted system of interacting with a model that alleviates the need to know anything about tokenization. And that, I think, is kind of an underappreciated change. Yeah, I agree.
Nathan L. [00:51:54]: I do think in the long term, prompting will go in the direction of Google searching. But in some ways, I'm not that surprised that something like o1 can exist, and it's still a very humbling moment, where there will be many more times when AIs are released that we don't know how to use. And this is the skill you need to have: tinkering with an open mind. The open mind that things will come, and the open mind that things are not just what they are at face value. And if you play with o1 a lot, you can definitely get things out of it that people on Twitter are not repeating over and over again.
Riley G. [00:52:31]: Oh, yeah, definitely.
Riley G. [00:52:35]: A lot of the explanation for the disconnect that you see, where some people are just absolutely amazed with o1, but also most of the things you see on Twitter maybe aren't that impressive, I think is that the frontier of problems that distinguish o1 from, say, the previous class of frontier models is either unrealistic problems, brain teasers that people artificially constructed to exhibit the difference, or something realistic that you would never want to read in a tweet. The problems where it's excelling are like, I have this extremely in-the-weeds programming problem that involves a complicated interaction of all five of these files, please fix my import errors or whatever.
Riley G. [00:53:25]: Those are the things where you're going to see the most practical benefit. And those just aren't as easy to communicate as things used to be. It used to be easy to make a screenshot of, hey, look, it does this, it will fix your broken JSON or whatever.
Nathan L. [00:53:45]: Something else that I'm realizing I didn't put in the notes: there have been these comments on o1 from the OpenAI people that they want to expose the ability to change how long the model thinks to the user, to change its test-time compute. That ultimately is going to be a whole other prompting thing. It's almost a little surprising that they are giving that to the user; I almost think they should just make a classifier that does it for them, rather than just assume the user is dumb. But being able to change how hard your model thinks is a really interesting real-world prompting case. Because it doesn't really matter if you can get a viral example; it's, how do you vary that knob in your day-to-day use in a way that meaningfully shifts your end product?
Riley G. [00:54:26]: Yeah, it's really kind of comical trying to manipulate how long it thinks about things. Because there are some things that will make it think for a long time. I tried to get it to generate acrostic word squares once, and if you emphasize enough the need to validate things, it will just keep validating and failing and looping around. I think I got up to three minutes once of it attempting things before finally saying, oh, I wasn't able to find one, here's my best effort. But other times, if you ask it... I mean, I once gave it a problem, kind of just for the comedy of it. I gave it some simple problem, and then I gave it literally, I think, three pages of emphasis on thinking forever. Just rambling paragraphs saying, if you're even considering stopping, don't. If you ever have the dream, if you ever get tired, don't worry about it.
Nathan L. [00:55:22]: Just keep going.
Riley G. [00:55:24]: All those kinds of holy-hand-grenade-style repetitions. And after all this, it literally just thought for three seconds and then came back and said, I understand the urgency that you're expressing here. Thinking forever just isn't possible, so I'm not even going to try. There's another thing.
Nathan L. [00:55:43]: OpenAI said they might give you a knob that controls this or influences it.
Riley G. [00:55:47]: Yeah, I have to be honest, it feels like maybe a weird UI. It seems like something that you should be able to just do through text. But I'd be happy to play with it. Because steerability in general with o1 seems to be... a lot of people, I think, are reporting that it's kind of awkward, or at least at odds with the really impressive examples that we're seeing coming out of it. Yeah.
Nathan L. [00:56:16]: There's a whole strategy discussion on why they actually released it that I haven't really entered into; we can kind of avoid this. I am wondering how you view prompting of agents. It's kind of the future section: what is the future? How are agents going to be susceptible to prompting? I'm guessing after our conversation here, the answer is going to be, it's the same. And there's probably going to be a meaningful shift in who can deploy them and have success based on who actually has this expertise and is doing this prompting work. And this could translate into downstream business success, where the first person to crack an agent with the right model and the right prompt can have the first product that works.
Riley G. [00:56:57]: Yeah, I think people mean very different things when they talk about agents. The big division that matters is that there are agents working in self-contained, repeatable environments, like a REPL sandbox, and then there are agents making changes in the real world, out making retail purchases, canceling your subscriptions, and so on. I'm very optimistic about the former. I'm very skeptical of the latter. I think people underestimate how much reliability is needed for a lot of real-world decisions before you get to the point that you'd trust the thing to have the power to cancel your Hulu subscription or whatever. I think that also, in the first case, there's a lot of untapped potential there, and I don't understand why we aren't seeing more iteration on that front, really. ChatGPT's code interpreter, when it came out, I think they renamed it to Advanced Data Analysis or something like that, which is not a good change in my mind. But the code interpreter, I love that. I still love it. It's a brilliant product, and I wish they kept going with it and improving on it. I'm also a fan of Julius AI, which goes exactly in that direction of creating a code interpreter-like environment where you can substitute in whichever model you want, and you can do things like install packages. It's great for one-off scripts. I had a post once where I was pointing out oddities in the longest GPT-4 tokens. One of them is like slash, slash, and then 128 repetitions of an equal sign or something like that.
Riley G. [00:58:49]: But the way I did this was literally just, I went to Julius and said, install tiktoken and show me the longest tokens. And I read the code pretty carefully because I was going to tweet it; I didn't want to tweet out something wrong. But it was right. There were small things that I had to fix, but it's good for prototyping, these quick one-off things where you're like, yeah, I could look up exactly... I roughly know how to use tiktoken, I just didn't feel like figuring out the syntax again.
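Roughly the kind of one-off script being described here; the choice of encoding is an assumption, and this is a sketch rather than the exact code generated in Julius.

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # a GPT-4-era encoding
tokens = []
for token_id in range(enc.n_vocab):
    try:
        tokens.append((token_id, enc.decode_single_token_bytes(token_id)))
    except KeyError:
        pass  # a few ids in the range are unused
longest = sorted(tokens, key=lambda t: len(t[1]), reverse=True)[:10]
for token_id, raw in longest:
    print(token_id, len(raw), raw)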
Riley G. [00:59:17]: It's good for just the curiosities and one-off stuff like that. And I think that's what the future of this really is. This really blew me away.
Riley G. [00:59:30]: Somebody posted on Twitter a video of their eight-year-old daughter using Cursor, I think it was, and this girl apparently has no understanding of the code that's being generated, but she's able to say, no, I want to do this differently, I want to have a Harry Potter spell here, changing the layout of this HTML and JavaScript app. And it just works. And that's the future to me, that the hottest programming language is English. When you see a little kid doing it, you really believe it, that now kids can have the power to create software. And that's great, because we were at a weird local minimum of that, I'd say, of kids being able to have the creativity to create their own interfaces or make their computer do what they want. Computers are less customizable now than they once were. Yeah.
Nathan L. [01:00:28]: My reflection on this is that the people who take prompting seriously are more likely to be in tune with what is happening in AI and at the cutting edge. But that also means that on the academic side and the public side, for transparency and accountability, you have to do some education work to make sure people are taking it seriously, and/or some normalization of claims, depending on how people are presenting their work and using things. I think it's safe to say that all the frontier model labs are doing this, but for the long tail, it takes people time to learn these habits. But it's surprisingly hard to convince people to spend time playing with models, too. I do it, but I should probably do it more, listening to people like you. It's funny. It's one of those things that doesn't make sense how it'll pay off, but it probably will.
Riley G. [01:01:20]: Yeah. I mean, there's no substitute for using models. I personally discover just the dumbest things sometimes that make the biggest difference. One of the most high-impact ChatGPT tricks that I found lately is that I have custom instructions in my ChatGPT telling it how to think silently. I have a tweet about this that I posted once, so if you Google "ChatGPT think silently Goodside", you'll probably find it. But I have the prompt here, actually. I was using its new memory feature so it can remember things that you tell it, so I was sort of showing that off at the same time, and I said to it: remember this, when I ask you to think or write silently, I mean for you to use your Python interpreter to write your thoughts as code comments or string literals assigned to variables. Code doesn't necessarily have to display any output. And then it remembers that. And so then I can say to it, silently write a brief essay about Super Smash Brothers, then silently translate this essay into French, and display only a double histogram showing the frequency of word lengths for both texts. And it doesn't output anything until it has that histogram done, and then it outputs the histogram and says, here it is.
Riley G. [01:02:32]: And that makes such a big usability difference, if you just don't have to see what it's doing, if you can put it behind a fold where you can expand it if you need to, to be really sure that the code is right, or to copy it to another editor or whatever. Just not seeing it makes such a big difference. And you can just have things in code, too. You end up in this sort of Jupyter-like flow where you told it to silently do something, and now, because you said to do that, it's not just in context, it's in a variable. Like I said, if it ever needs to do something in code, it just has that variable there, and it doesn't have to repeat it, which is a big deal if it's, say, an essay. Repeating an essay is expensive. Yeah. This is great.
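A sketch of what the interpreter code produced by that "think silently" pattern might look like; the essay contents are invented stubs, and the point is just that intermediate thoughts live in variables and comments so nothing is displayed until the final artifact.

import matplotlib.pyplot as plt

# Silent "thinking": drafts are stored in variables, not printed.
essay_en = "Super Smash Bros. is a crossover fighting game series ..."  # silent draft
essay_fr = "Super Smash Bros. est une série de jeux de combat ..."  # silent translation

# Only the requested artifact is displayed: a double histogram of word lengths.
lengths_en = [len(w) for w in essay_en.split()]
lengths_fr = [len(w) for w in essay_fr.split()]
plt.hist([lengths_en, lengths_fr], label=["English", "French"])
plt.xlabel("Word length")
plt.ylabel("Frequency")
plt.legend()
plt.show()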
Nathan L. [01:03:19]: Thanks so much for coming on. Anything else you want to plug or talk about?
Riley G. [01:03:25]: I should have some content going live around the time that this comes out, analyzing o1 for the Scale blog and talking a bit more about our coding leaderboard. So definitely look out for that. And the other thing I should of course mention is Humanity's Last Exam. We recently partnered on an effort to solicit from the public examples of challenging problems, and we are giving out cash prizes. So definitely check that out if you're interested.
Nathan L. [01:03:58]: Yeah, I had just tweeted a few days ago, or I don't know if I put it on Twitter, but I put it on some platform; I don't have Twitter at work, so I end up looking at lame platforms I'm less addicted to. But essentially, evaluation is going to be extremely expensive. That was my whole take. And it's going to be very narrow and very hard. And then you put out $500,000 in prizes, and the initial whiplash is, oh, that's a lot. But in reality, I think that's the right ballpark. Because if you're going to make a good eval, you need to have somebody who's really good at cutting-edge AI working on it for at least six months, and that's a ballpark price: $500,000 is like a half year of what it costs to have somebody like that in AI, with overhead and compute and stuff. So obviously it costs more to actually build this evaluation. These numbers look ridiculous, but if we want evaluations that are meaningful, this is what we need to do. And I think it's the right thing for Scale to do, to lead on evaluation. It feeds into natural parts of their business. I think I've been on the record about this for a while.
Riley G. [01:05:00]: So I'm like, it's great. Yeah, absolutely. I think that people outside the industry at least have the impression that evals are grunt work, right? That this is something that you would use low-cost labor for, that it's not a prestigious area. But it couldn't be further from the truth. I think evals are very rapidly moving towards the high end of intellectual ability, where we're looking for PhDs. I've done projects where it's like, okay, we have to get as many PhD-educated poets as we can to check the correctness of the iambs in this poem or whatever.
Riley G. [01:05:46]: I think that's only going to continue, right? We're going to see that at the low end, the value of human labor for training models is going to decline. And the value of high-end intellectual labor is going to increase probably drastically.
Nathan L. [01:06:04]: And cost is probably a good proxy for evaluation usefulness. LMSYS is expensive, but in different ways than the Scale leaderboard is expensive. And they complement each other very well, and they both become better by the other existing: the models are in similar places, but they're showing different things, and you can separate between that. And I suspect that that'll continue to grow. Some of it will be at Scale, some of it will be elsewhere. And that's just the new default for evals.
Riley G. [01:06:35]: Yeah, absolutely. I think that's one of the things I'm most proud about in working on our evals and leaderboard at Scale, that we're contributing to this healthy ecosystem of not having to just trust one or two players that evals have been done correctly. We want more openness and more independent verification of evals. And that's sort of our general theme, with work like GSM1k and trying to make sure that we can actually trust what these leaderboards are saying.
Nathan L. [01:07:08]: Yeah, my one nitpick, which I don't know how to answer and I probably need more RLHF experts, you might know this: are companies that buy data from Scale going to have an advantage on the Scale leaderboard because of the distribution of humans that are doing it? Not that the humans doing eval and creation are the same, but they're drawing from the same pool of humans that are writing content or doing preferences and that are then doing the evals. I think it's too early to answer the question of whether the human distribution matters, and for that reason I think the eval is still very much a net good. But it'd be really interesting to try to run those experiments on who is giving the data that you train on and how that then impacts the evaluation.
Riley G. [01:07:49]: Yeah, that's not something that I'm familiar with in enough detail to comment on our process there. But yeah, that makes sense to me. I think that's something.
Nathan L. [01:07:59]: It's just that people like to complain about every possible thing. I understand the root of the complaint, but we've got to deal with the circumstances we're in as an AI industry. The leaderboard is far more useful than any problems it causes. Let's keep doing it.
Riley G. [01:08:17]: Yep, absolutely. Okay.
Nathan L. [01:08:20]: I think we're at time. So I'm going to click stop here. Thanks again.
Riley G. [01:08:23]: Great. Thank you so much. Bye.
Sorry this one was late! Thanks for bearing with me, and keep sending feedback my way. Still a year or two away from when I have time to record these, but I would love to.
Open-source tools, examples, limits, and the state of training multimodal models.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/molmo-and-llama-3-vision
00:00 Llama 3.2 Vision and Molmo: Foundations for the multimodal open-source ecosystem
02:47 Llama vision: Multimodality for the masses of developers
03:27 Molmo: a (mostly) open-source equivalent to Llama vision
08:45 How adding vision changes capabilities and reasoning
11:47 Multimodal language models: Earlier on the exponential
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_013.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_021.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_023.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_027.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_030.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_037.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_046.png
Fig 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_048.png
Fig 10: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_050.png
Fig 11: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_052.png
Fig 12: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_054.png
Fig 13: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_058.png
Fig 14: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_065.png
What productionizing test-time compute shows us about the future of AI. Exploration has landed in language model training.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/reverse-engineering-openai-o1
00:00 Reverse engineering OpenAI's o1
01:52 From Q-star to Strawberry to o1
05:13 Training o1 with reinforcement learning
09:24 What is o1 doing when given a prompt?
11:49 Questions to consider to understand o1's structure
11:56 1. How does an RL-trained language model act?
12:38 2. Is it an online / test-time search?
14:20 3. Is it one model at inference?
15:29 Open-source o1, the future of o1, and the future of AI
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_014.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_016.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_018.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_020.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_024.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_026.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_034.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_048.png
Scale AI's future versus further scaling of language model performance. How Nvidia may take all the margins from the data market, too.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/ai-data-foundry
00:00 Futures of the data foundry business model
02:57 What it is like to work with data vendors
06:06 Data foundries: Risks
08:18 Data foundries: Growth vectors
09:50 Realistic expectations
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_008.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_012.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_023.png
And why the concept of mandating "model specs" could be a good start.
(Oops, forgot to upload this yesterday!)
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/a-post-training-approach-to-ai-regulation
0:00 A post-training approach to AI regulation with Model Specs
1:45 Expanded roles of Model Specifications
3:40 Near future of Model Specifications
Whether or not scaling works, we should spend more on inference.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-strawberry-and-inference-scaling-laws
00:00 OpenAI's Strawberry, LM self-talk, inference scaling laws, and spending more on inference
01:51 OpenAI's Strawberry
04:16 Self-talk in language models
07:45 Inference scaling laws
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_006.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_021.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_023.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_037.png
Ai2 released OLMoE, which is probably our "best" model yet relative to its peers, but not much has changed in the process.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/olmoe-and-building-better-llms
00:00 OLMoE and the hidden simplicity in training better foundation models
02:04 Frontier model team compute allocations
04:19 De-risking training complexity
06:40 On organizational complexity
09:05 Compounding improvements -- the key to building better language models
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_005.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_007.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_009.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_011.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_028.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_030.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_032.png
The Open Source Initiative is working towards a definition.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/defining-open-source-ai
0:00 On the current definitions of open-source AI and the state of the data commons
3:17 Reasons to not mandate fully released data
4:24 Sufficient but not exhaustive data docs
5:22 Frustration with the data commons
7:04 We need more examples to define the definition
The latest model from one of the most popular fine-tuning labs makes us question how a model should be identified as a "frontier model."
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/nous-hermes-3
0:00 Nous Hermes 3 and exploiting underspecified evaluations
5:29 Parsing training lessons from Hermes 3
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_005.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_010.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_012.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_020.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_027.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_030.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_032.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_036.png
I had the pleasure of talking with Ross Taylor, who has a great spectrum of unique experiences in the language modeling space: evaluation, Galactica lead author, Llama post-training, and more. This is a really great conversation on the frontier of language model (LM) reasoning, LM deployments and demos, LMs for science, RLHF, and other topics. I've been trying to get Ross to come on for a while. He's one of those people in the LM space who doesn't speak too often, but when he does, you listen.
Ross Taylor was previously an LLM lead at Meta AI, heading up the reasoning team. Before that he led the early work on LLM agents and was the research lead on the Galactica project. Earlier, he was a co-founder of Papers with Code, which was acquired by Meta in 2019. Before that, he worked as a quant in sports betting and finance, and before that as a policy advisor for the UK Government. He is currently working on a new startup.
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.
YouTube
Chapters
* [00:00:00] Introduction of Ross Taylor and his background
* [00:02:12] Papers with Code
* [00:09:58] Galactica, goals, controversy, legacy
* [00:18:12] Technical details of the Galactica model
* [00:23:18] Potential for language models to make scientific discoveries
* [00:25:21] Defining and improving reasoning in language models
* [00:32:38] Process-based reward models and their potential applications
* [00:35:00] Generating synthetic data for SFT
* [00:40:23] Evaluating the effectiveness of language models as judges for human preference data
* [00:42:43] Considerations for creating base models that are easy to fine-tune
* [00:46:45] Balancing SFT and RLHF
* [00:54:13] Characteristics of successful post-training teams
* [00:58:26] Future directions for language model development
We mention
* Rob Stojnic (co-founder of Papers with Code)
* Armen Aghajanyan (Chameleon)
* Tom Scialom on Latent Space
* Soumith Chintala (PyTorch)
* Process Reward Models / Let’s Verify Step by Step
Transcript
Built with smol-podcaster and with love of Latent Space.
Nathan Lambert [00:01:07]: Today, we're here with Ross. This is a really exciting one. I've been trying to get Ross on the show for a while. Ross has done a lot of interesting work. And also the path to where you ended up with working on state-of-the-art LLaMA work at Meta is very interesting to me. So we're going to start with some of that, but then there are a few people that want to know more about reasoning and some of the RLHF stuff. We won't cover the secretive new start-up - I don't know what it is, but that's how it goes these days. I'm sure it'll be great. So welcome to the show!
Ross Taylor [00:01:41]: Thanks for having me.
Nathan Lambert [00:01:44]: So I wanted to start with Papers with Code. For people that don't know, Papers with Code is one of these platforms - I never was a heavy user of it - but it collates papers, people can upvote them, popular papers, attaching code and dataset and evaluations to papers, which is great - it was like sort of ahead of its time. It fits into a lot of these open ecosystem things. So I'm kind of curious, like, how you ended up there and why you all started this startup that ended up building this thing that got acquired by Meta?
Ross Taylor [00:02:12]: Yeah, that was a weird one. This was like back in 2018. So I was at an incubator, I just quit my previous job and I was like, okay, I want to do a startup. And I met Rob, my co-founder, who came along with me for the journey. We both came from different backgrounds. I was from a sports betting / quant finance kind of background, which is a whole other episode I guess. And Rob was in various startups, like applying ML to things like hate speech detection, that kind of stuff. And the cool thing was, we both resonated on similar kinds of problems within the ML space, even though we came from different domains. So we spent a lot of time doing various experiments, trying to make new kinds of ML tooling, thinking of these stupid questions like “what is the Git equivalent for ML?” - that kind of stuff. One of those experiments was hacking around on this little website to solve a really basic problem: I'm trying to reproduce this paper, but I can't find the code. That was the thing that really blew up beyond our expectations. It was weird because we thought it was fairly trivial at first.
Nathan Lambert [00:03:16]: What year was this? 2018?
Ross Taylor [00:03:18]: Yeah.
Nathan Lambert [00:03:19]: This makes sense. I was starting deep RL then, and deep RL was so hot, which was probably the worst that evaluation has ever been for ML. People complain about it today, but with deep RL evaluation, every single person was just lying to make themselves look better.
Ross Taylor [00:03:38]: The interesting thing now is that the open ecosystem has shifted to focus more on weights as a central artifact rather than code. I think there's an interesting debate there. Would it be more useful to have the LLaMA-3 8B model weights or all the code for training LLaMA-3? I think there's still interesting debates to be had about what's actually useful.
Nathan Lambert [00:03:56]: I think the code would be more useful. OpenAI released their rules-based reward models, but it's kind of code washing, because a lot of people just release eval code now. And that's a whole other tier: actual training code versus eval code. But yeah, I guess I'll just skip ahead.
Ross Taylor [00:04:12]: So essentially Papers with Code was the thing that didn't die for us. We always thought we were going to make something else and Papers with Code was more of a marketing thing. But eventually we were like: okay, our users are telling us this is what we should be working on. And we expanded from that very simple use case of finding code towards indexing various artifacts in ML.
Another big problem was trying to find the state of the art in something like ImageNet and all these different benchmarks. There just wasn't a central place to find this information…So we had this quite good Christmas - me and Robert - where we hacked for the whole month, indexing every leaderboard we could and all the related papers. I didn't want to do any annotation again after that! But that took things to the next tier, and that's when things really started to blow up.
Nathan Lambert [00:05:03]: Because this was the first round of leaderboards, and now they're really popular with Hugging Face again. Is that just because it became a Meta thing and kind of a thing that existed? You're the first leaderboard company in a way, which I don't think many people think about. Which is weird.
Ross Taylor [00:05:19]: Yeah. And the interesting thing about us was that we never had to do any marketing because everything was from organic traffic. So you would type in “state of the art ImageNet” and we would come to the top as the most useful site. That was really the source of our growth, and we grew to a million MAU fairly quickly. And as for Meta, we were in touch with the PyTorch folks at the time who we really liked. You know - Soumith, Joe - those folks, and they had a shared interest in promoting the open source ecosystem back in 2018/19. And while it was like a tough decision, we were just like “we really like working with these people, we want to work more closely with them”, and that got us into Meta.
And then within Meta, we originally continued to develop the platform. But the big shift for us was that, even then, we saw we were moving to a world where compute was the currency. And we saw that, if we wanted to be well positioned in five years time, we needed to be building these large-scale systems. Even for our own platform, we had lots of ML in the backend and we saw we were using fewer and fewer models to do more and more tasks. So that kind of shifted us into research, into Galactica, and then eventually LLaMA and that kind of stuff.
It was a weird shift because we were product people who ended up doing hardcore research! But I guess it was natural to us that we were within a research org with these amazing people, lots of resources. It was just the best use of our time to conduct this shift.
Nathan Lambert [00:06:43]: Do you think there should have been more integration between Hugging Face and Papers with Code? It would have been wonderful if it had happened.
Ross Taylor [00:06:54]: The backstory is that we saw them as competitors, to be honest, because we had the same vision originally. We were going to do model hosting, that kind of stuff. But we never got into it because we hit friction with leadership - who was not onboard with that as a goal. Because from their point of view, it's like, okay, if we host these things, this might expose Facebook to some kind of legal risk. It wasn't in the perceived interest of the company.
Nathan Lambert [00:07:17]: This is a classic story of tech, really. They can't take the risk. They can't expose themselves.
Ross Taylor [00:07:23]: If you're a startup and it's your number one priority, then yeah, your attitude on risk is different. But I think it was a blessing in disguise for us because clearly the bigger wave was going to be large language models - we saw that incredibly early. And our mission was fundamentally not infrastructure, but something closer to: how do you organize information? It was a Google-y type of mission. And while we were focused on ML, we were more broadly thinking about science: how do we reduce friction for finding out about new advances and, I guess, lots of small tasks that when added up lead to a lot of progress in science.
Nathan Lambert [00:07:59]: I should have probably looked this up. Did you have a hard science background? And what about Rob Stojnic?
Ross Taylor [00:08:10]: Yeah, [Robert] Stojnic, my co-founder, he was from a bio background. So he's actually-
Nathan Lambert [00:08:15]: That makes sense.
Ross Taylor [00:08:16]: Well, he also had a computer science background. He was one of the original developers of Wikipedia, so he has his own crazy story…
Nathan Lambert [00:08:22]: Yesterday I was talking to somebody that was one of the original arXiv moderators. So we're digging all these things up…
Ross Taylor [00:08:29]: It is interesting because we both had this background, I would say, in building useful “utilities” [on the internet] at some point in our lives. I think Papers with Code is one of those things which is easy to forget, but if it went away, everyone would go crazy.
As for me, my background is more statistics and econometrics. My first job was in the Government, which I kind of hated. But I did a Master's degree, which I thought was going to be in economics, but the thing I ended up loving was time series and statistics. So I did all this research on state space models - before it was cool, I guess! - and then that got me into sports betting. And then eventually, we were using more and more deep learning [in the 2010s], and that’s how I got into AI. So a fairly nonlinear path. But -
Nathan Lambert [00:09:09]: Yeah. Well back to what you were saying on the scientific stuff, I think the Galactica story has many angles, and you led on this.
I think if people go look at the paper, it's a very interesting paper, like you cite Galileo in the first sentence, and it really has a lot of early modern language model features and quirks. It's something that people don't remember that well.
I'm very on the record saying the backlash was overblown. I think that was before there were clear habits and community norms around what language model demos should look like. So it was kind of in that teething phase.
But what was the actual goal that you wanted? You mentioned organizing the world's information. What was the goal and how close do you think the model came to accomplishing it?
Ross Taylor [00:09:58]: So there were several different things at once.
There were immediate product integrations we had in mind. We actually had an agreement at the time with Overleaf to be a “co-pilot for writing papers”. We'd have a really good LaTeX model in Overleaf, and whenever you wanted to include a citation, you could simply prompt for one.
More broadly, we imagined the future would be instead of..using more classical ways to find and extract information, if you wanted to learn about something like DPO, you would just prompt a language model to find out about it. Or if you wanted to ask “What's the state-of-the-art on SWE-Bench?” or something like that, you would just prompt the model and it would find the relevant information and answer the question.
Nathan Lambert [00:10:46]: So this is something that language models are so bad at. One of my challenge questions - I've been doing this for 6-12 months - is to ask models about DPO, and none of the models without internet access have yet done it right. You would think that it would start to kick in. And I don't just ask “what is DPO?”, I ask “What is DPO for language model fine tuning”, and they still just make up nonsense.
Ross Taylor [00:11:06]: Yeah, which actually relates to an interesting debate about LLM creativity. If you want to solve something like LLM creativity, you want to be confident about the frontier of knowledge, but frontier knowledge is where you have the most token scarcity.
But anyway, just to finish that thought. Bear in mind, we were developing Galactica while the whole Web 3.0 boom was happening. And we were in this weird state where we were like “All everyone is talking about is Web 3.0, but clearly generative AI is going to be the thing that powers the next generation of the web!”. So I guess that was our primary motivation.
Now, in terms of the [Galactica] launch, I think there's two aspects.
First, like you said, the paper. Now we were a small team of 7-8 people. We had so much fun developing these new ideas at the time: internal reasoning tokens, how do language models cite, training for multiple epochs…
Nathan Lambert [00:12:00]: What's that? A citation token? Did you have a special token for citations?
Ross Taylor [00:12:04]: Yeah. So we had a start citation token [START_REF], and we used two methods. The first was: we'd put the title of the paper within the citation tags. And the other one was: we'd have an alphanumeric ID.
The interesting thing was, it actually worked really well - but in the demo interface, it had a tendency to hallucinate - or “hallucitate”. The backstory is that, while the model was really good, for the demo we turned up the temperature to 0.7 so the text generation was better [at the expense of citation accuracy]. So generative citations were something that people thought didn’t work, but it was [more an implementation issue]. I guess that’s an alternative road in history…
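To picture the setup: here is a minimal sketch of what Galactica-style citation training text could look like, based only on Ross's description of a start-citation token holding either a paper title or an alphanumeric ID. The [END_REF] token name and exact formatting are assumptions, not the real preprocessing code.

```python
# Illustrative only: two hypothetical ways a citation could be serialized into
# training text, per Ross's description. Token names beyond [START_REF] are assumed.
title_style = (
    "Transformers rely entirely on attention mechanisms "
    "[START_REF]Attention Is All You Need[END_REF]."
)
id_style = (
    "Transformers rely entirely on attention mechanisms "
    "[START_REF]arXiv:1706.03762[END_REF]."
)
```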
So there was the paper, which was cool, and there was the demo, which I would say was motivated by the realities of the time. This was pre-ChatGPT and, even within a big company like Meta, it wasn’t a company priority to work on LLMs at all. So in our mind, our objective was - we were kind of deluded - being a team of 7-8 people, we were like…
Nathan Lambert [00:13:08]: This is how you have to operate if you want to be at the cutting edge. That's how great teams operate.
Ross Taylor [00:13:13]: So there were two objectives you could have had. The first is: you think that second-mover advantage is good. So you could wait for OpenAI to do something and then come in after and do it in an open way. And this is the path that actually worked for LLaMA. LLaMA was not state-of-the-art in any sense.
Nathan Lambert [00:13:27]: I've been doing this. I mean six months ago, maybe OpenAI and Google wouldn’t need to hire me because they know everything. But now I’m doing more interesting analysis where I'd be hired at a different role - but in the open. Now I'm like the person people look at. But I’m trying to tell people that “You don't understand! I'm six months behind everyone!”.
Ross Taylor [00:13:49]: Right, but to be clear, that’s a really important role - because everyone should have a stake in the future. And that's what the open ecosystem gives people.
But our objective was this: we didn't want to be second; we wanted to be first. And we were kind of deluded because we were 8 people - compared to maybe OpenAI with 200 people where their whole bread and butter was language models. But that’s why we were thinking “how do we move as fast as possible?”. And in our mind, a demo might be premature, but it would also be a way to get lots of prompts and information quickly - to understand how people would be using the model. And essentially the calculus we took was, we knew the community might not be ready for something like this - especially with the Meta branding - but we thought this was a way to get lots of information really fast and catch up given our position. Now in retrospect, history says that…
Nathan Lambert [00:14:33]: You kind of did that. I think Meta probably got the injection of language model reality from that. It's kind of like the Gemini backlash. I think the Gemini backlash - while it's obviously stupid execution - was potentially a good forcing function for Google's structure of their Gemini org - to really move everything into the way it is now. That made them be structured more like a serious language modeling org and less like Google, I think, which people don't want to hear...
Ross Taylor [00:15:07]: For us it was just a risk we decided to take. We probably took a lot more risk than we should have done. But we just thought “obviously this is going to be huge”, “LLMs are going to power the next internet”, etc, so let's take a risk. And you know, if we ran the universe several times over - it would have succeeded in some of those runs. But [in our universe], the criticism, which was obviously overblown, reached a critical point where things didn’t work out.
And then there's the story about the demo coming down, which - I’m not sure I’m able to talk about - but I think that is one of the things where, if people knew the true reasons, they'd be like “what the f**k!?”. But yeah, that's what happened…
Nathan Lambert [00:15:44]: Yeah, this is why any company that makes a demo now has block lists, where there's certain words that if they're in the prompt of the generation, you get a really, really stupid response. Even if it's like an open model, you just put like a little filter that's like, “you can't say the most obviously bad words”.
Ross Taylor [00:16:01]: But we actually did that and that created backlash as well. Because if you have false positives, you actually exclude some words which aren't actually offensive [in certain contexts], right? And then you also offend people… so it's not a win-win situation.
But if I have to look back at it now, I think with any new technology, it's never going to be absolutely better than what came before it. With LLMs, the relative comparison is with search. If you’re going towards search and information retrieval, you're prioritizing factuality as opposed to creativity, right? And the fundamental tradeoff with LLMs is saying, “I can trade off some amount of like factuality or ‘closeness’ to the corpus for some amount of synthesis and creativity”.
I don’t think that if we had a better model, it would have helped things at all. You could say maybe if [Galactica] had RLHF, would that have helped? I'm not too sure given that the project came out of [a big company like] Meta. Meta has a really good reputation now - people appreciate the open work they're doing - but at the time, things like the 2016 election were still in people’s minds. So I think the LLM revolution was never going to start at a big tech company, in my opinion. It was always going to happen at a company that had less reputational baggage. But I think it's pretty cool now that people see things differently. Because FAIR always had a really strong commitment to open science. It’s good that they're finally getting the credit for that.
Nathan Lambert [00:17:38]: Yeah. I have two technical questions on Galactica that I find really interesting. One is from Luca Soldaini at AI2. He said that you mentioned that the Galactica log probabilities (when producing citations) were proportional to how far in the citation graph the current paper was to the cited paper. Do you have any more interesting comments on how the latent space of Galactica actually worked? Because that is cracking the most important question of a language model for science - building a better latent representation of how the scientific information is organized.
Ross Taylor [00:18:12]: Yeah. So there were a couple of aspects to that. The first thing is we had this really nice graph that showed, as we scaled the model, the distribution of citations became closer and closer to actual citations - which is what you'd expect. But this was important for us, as our main worry was - because we were thinking about deploying to Overleaf - we didn't want to prioritize the most cited documents and create a “rich get richer” dynamic.
Nathan Lambert [00:18:38]: Google Scholar already does that. Were you re-indexing all the papers rather than building off like the Scholar graph or something?
Ross Taylor [00:18:45]: I think we were building off existing ones, using things like CrossRef…but there were lots of gaps that we had to fill. The other weird thing was that we saw some strange biases in the model. So if the model didn’t know what to cite, it would sometimes cite a general review paper, which is really weird emergent behavior. It was like the model was saying “I don't know a specific example, so I'll just give you a general overview”.
Nathan Lambert [00:19:11]: It's probably in the data.
Ross Taylor [00:19:12]: I think the thing that surprised me the most was multimodality. So we trained the model on SMILES formulae and protein sequences [alongside natural language]. And the thing that really surprised me was, we had tasks which we didn't explicitly optimize for - like converting a SMILES formula to a IUPAC name for a chemical. And if you actually looked at the attention as the model was predicting the next token, it would say something like “amino” and you could see in the chemical graph, it was explicitly attending to the relevant part of the sequence.
I found that amazing because we didn't train for it explicitly. That's the beauty of self-supervised learning. But I also found it highly ironic because some of the criticism of Galactica was “it’s ungrounded”. I was like “how grounded is this? The natural language tokens are literally attending to the underlying chemical structure!”. So that was kind of cool.
And then the other cool thing was: if you prompted with a protein sequence and asked “what is the function of this protein?”, the model was really good at answering those questions in natural language. That was awesome for me.
Nathan Lambert [00:20:33]: There's another prompting thing that I had known of [for Galactica], which was asking the model to do open-ended generation tasks. The models are still out there - people can spin them up and do demos on their own - but if you asked it something that people think of for ChatGPT - e.g. write me a poem about a sad goldfish - it wouldn't work unless you put it in a header format. It was markdown, I think? If you prompted it in that format, it would actually do a great job.
Ross Taylor [00:20:57]: Yes, so in the Galactica demo, a lot of people were being malicious with this type of prompting for markdown articles. But I did enjoy some of the creative ones. Someone was like: write me a theorem on finding a girlfriend, and it was some of the most hilarious model output I’ve ever seen. And people also generated some amazing sci-fi…but then I think some people took it too far. But whatever. I guess it was a traumatizing experience for me at the time. But with the benefit of hindsight, it was also fun in some sense, I guess.
Nathan Lambert [00:21:30]: Yeah. It makes you understand the bigger context of the work much faster than you would otherwise.
Ross Taylor [00:21:37]: It was actually crazy at the time. So many people were using it. Even then we could see that - while it wasn’t a product - we could see that most systems were going to be designed in a similar way.
I think the interesting thing was how the winning form factor in the end was like a chat interface - you know, with ChatGPT being the winning UX. I think that was actually a big part of the story [why they succeeded]. There's a debate on whether RLHF is actually a capability advance or whether it’s just alignment…but a big part of the story [for ChatGPT’s success], in my view, was the kind of UX of how you interface with a language model, rather than the actual capabilities. But I think it's obviously not monocausal at the same time. There were several factors at play.
Nathan Lambert [00:22:25]: Yeah. So the last thing on this is that you mentioned in our e-mails about language models, creativity and making discoveries. What do you mean by that? Is that the agent-like projects you worked on at Meta?
Agents are largely something that I don't have too much comment on. I'm taking the approach of waiting to see what we actually get, because there are a lot of practical approaches that I think will be reasonable. People use language models for basic formatting, for code, etc. But it's easy to see that if they get a little more feedback for things like writing a paper - e.g. find me a citation for blank and justify your answer - that step is something that will come. I don't know how expensive it will be to run, but is that what you mean when you think about making discoveries? Is it more autonomous? Is it a grander vision? Anything like that?
Ross Taylor [00:23:18]: I think it's more like this: the killer use case right now is information synthesis. For example, I use Claude a lot more than Google now because it combines information in a better way and sometimes generalizes well to things it hasn’t seen before.
But a really cool thing would be: can a language model answer a question which is more out of distribution? That we don't see in the training data?
So an experiment I've never done because I didn't have the compute would be this. Imagine if you could train a language model on all documents up to 1905, which is the year when Einstein had his miraculous year of four seminal papers. With that model, trained up to 1905, could you prompt it to come up with a good explanation of the photoelectric effect, special relativity, this kind of stuff? And what would it take to rediscover these things?
Because presumably, with all these major discoveries, it’s never out of the blue. You’re standing on the shoulders of giants, but there’s still a lot of thought and inspiration you have to do to get to those great ideas. So that's the setup. But the creativity problem is, by its very nature, hard to benchmark.
Maybe this is a digression, but my problem with the field right now is: we’re in a situation where we've almost solved a benchmark like MATH, which is a very hard benchmark, in my opinion, at least Level 5 MATH, but I don't think we've really cracked something like reasoning. So I think it's like a whole different question about how you even evaluate these frontier tasks. But yeah, hopefully that gives a flavor of the kind of questions here…
Nathan Lambert [00:24:58]: Yeah, we can go into the reasoning conversation. I think reasoning in RLHF will take up however much time we want to keep talking. I guess we can start with the basics. What do you think people that are using language models think reasoning means? And what is the way that you would interpret what you're trying to do in improving the reasoning capability of a language model?
Ross Taylor [00:25:21]: So there's a lot of controversy on this on Twitter/X. And I think people are talking past each other because sometimes people mean different things by reasoning. At a very granular level, is legal reasoning fundamentally the same thing as mathematical reasoning? Common sense reasoning? I guess my very basic definition is that reasoning is the process of drawing conclusions based on a body of observations, or in the case of deductive reasoning, basic premises.
Nathan Lambert [00:25:50]: So math is like a subset of what you think about.
Ross Taylor [00:25:53]: Yeah. And then I guess the bigger circle is the broader topic of outcome directed behavior. I have an idea of an outcome I want to achieve, but what's the best path to get there?
And then in the LLM space, I think this problem broadly equates to the technical problem of how you use compute to get from your question to your answer. In the old days, you would just prompt the language model directly. You would just put in a GSM8k question, put in “Answer:” and then parse A, B, C, D. So you're relying on the forward pass.
Nathan Lambert [00:26:27]: Yeah, like the FLAN data is really weird. That's a popular one that people used to train on this stuff.
Ross Taylor [00:26:33]: Yeah. And then came chain-of-thought, scratchpads, with Galactica…all these ideas of using the context window to do intermediate computation. And the more recent, although to be honest, it's actually quite an old idea, is: you have chain-of-thought, but how do you better learn the internal reasoning tokens that get you to your answer? So things like, you know, Quiet-STaR and variants of this idea.
Nathan Lambert [00:27:01]: Claude now shows you when it’s thinking, and in the Claude system prompt, it has information on how many tokens to take to think about a question. We're all thinking about trying this stuff and it's all so hard.
Ross Taylor [00:27:11]: I think it's a question of how do you learn those tokens? For us, the original thing we did was just supervised learning. So we trained on some examples and let the model generalize to know that it should do the thinking in between some tokens. There are more sophisticated ways you could achieve this nowadays.
Another point is this: there’s an analogy that’s often used about language models, that they are “thinking out loud”. I actually don’t like this analogy at all. I think “thinking out loud” makes you think there’s something wrong about this kind of thinking in token space. But it’s not clear to me that the alternative - or these old adaptive computation ideas - are any better, actually.
Nathan Lambert [00:27:58]: What do you mean by adaptive computation? Because I mostly think of “thinking out loud” as being like chain-of-thought or generating its own explanation before it gets to an answer. What would adaptive computation be?
Ross Taylor [00:28:09]: So there's a paper by Alex Graves, who wrote all these amazing papers ~10 years ago, which had a lot of foresight. He did stuff like the Neural Turing Machine paper. Adaptive computation is the idea of, instead of having fixed compute between your input and your output, you can extend the forward pass to do things better, like arithmetic, where you have to maintain/manipulate state.
When chain-of-thought came out, there was an impression that it was a bit of a hack, because you're thinking in token space whereas you should be finding a way to make the forward pass dynamic. Universal Transformer is another variant of this [adaptive computation] idea. But I think there needs to be more empirics on which approach is actually better to maintain and manipulate state. I used to be more in favor of thinking, OK, chain of thought is more of a hack, but now I actually think it's probably…
Nathan Lambert [00:29:02]: What do you mean by state, like the state of the problem in that sense?
Ross Taylor [00:29:08]: So imagine that you're doing a GSM8k question, where John originally had 100 apples, then Jane gives him five apples. He has 105. And then he gives 20 away to like Susan or something and he's left with [85 apples].
So if you’re prompting the language model directly for the answer, you're expecting the language model in that forward pass to maintain and manipulate the state in a latent space, whereas the way chain-of-thought does it is in token space.
So you essentially output the intermediate steps. One of the problems with reasoning is that we have no idea how humans mechanistically reason…but if you think about how you'd solve a GSM8k problem in your head, then to me this seems a lot closer to something like chain-of-thought than adaptive computation.
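To make the contrast concrete, here is a minimal sketch of the two prompting styles discussed above, using the apples example; the exact prompt strings are illustrative, not any specific benchmark format.

```python
question = (
    "John has 100 apples. Jane gives him 5 apples, then he gives 20 away. "
    "How many apples does John have?"
)

# Direct prompting: the model must track the running total inside a single
# forward pass (latent state) and emit the answer immediately.
direct_prompt = question + "\nAnswer:"

# Chain-of-thought prompting: the intermediate state (105, then 85) is written
# into the context window as tokens, so each step is a small, local update.
cot_prompt = question + "\nLet's think step by step."
example_cot = "100 + 5 = 105. 105 - 20 = 85. The answer is 85."
```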
Nathan Lambert [00:29:57]: Especially when you look at the architecture and attention mechanisms. A Transformer is really good at copying, so if you keep feeding in the recent information, it copies that forward in some way. So chain-of-thought and all of these things are only growing in popularity in my mind, along with Quiet-STaR and those kinds of methods. I’ve heard the rumors about self-explanations and all these special things. The LLaMA-3 paper has all these special tokens. I don't know what all of them do, but I can see the direction: the state is stored in context and in special formatting tokens if it needs to be.
Ross Taylor [00:30:37]: So the other big picture thing is this. With the internet, you’re only seeing the output context.
So take StackExchange. If it’s a good answer, the author probably hasn’t just responded by generating words left-to-right. Maybe they’ve looked something up, maybe they’ve done a back-of-the-envelope calculation, either explicitly or in their head, right? And the internet is missing those “internal tokens”, essentially.
Now this isn’t always a problem because the models can learn how to construct them. And the effort now is to make artificial latents / internal thought, through RL or otherwise. But I think this is actually a much bigger question, which is more than just reasoning. In the end, as models become more capable, we’ll be talking more about how we can make them human-like in the way they can answer questions and solve tasks. For example, in some situations we might like the models to have [human-like] empathy, which is also “missing” in some sense.
So my prediction is that this becomes a bigger deal in the next few years: caring more deeply about the computation these models perform to reach a conclusion. And that will be the essence of alignment, in my mind. But that's a big topic!
Nathan Lambert [00:31:50]: OK, I have a long list of specific questions on this. My first question is about process reward models.
I think the canonical paper is “let's verify step by step”. My whole gripe is that it’s hard to create the data. That’s why they don’t exist in the open. But I’m guessing you can just label data with GPT and ask for feedback on each step, and just use that as an “LLM-as-a-judge” to get reasonable step-by-step labels on process rewards. But there’s so little work on this, so I don’t know if it is worth exploring. There is some research from Meta - I think Alex Havrilla did a couple of internship projects which related to this, and he’s good - but there’s such a lack of signal.
Is this something that people should work on more, or is it too complicated? Are there simpler things to do?
Ross Taylor [00:32:38]: Our big direction was integrating outcomes into reasoning - because next token prediction isn’t the objective we actually want to optimize. So the two ways to integrate outcomes are through something like PPO or inference-time search. And in both cases, you want a good reward model or value model.
Instead of (human-annotated) “process based reward”, we were exploring ideas along the lines of Monte Carlo policy evaluation (MCPE), where the key problem is how to learn a value model. It’s maybe a separate topic, but it’s underappreciated that something like MCTS - which in the public imagination is this inference-time search technique - actually has its real magic in giving you a value network for free.
This is why it was introduced in Go, because humans couldn’t come up with good heuristics for evaluation. So if you have something like MATH where you know the answer, then the question is how do you assign step by step feedback? It doesn't have to be MCTS, but something where you backprop the outcome to these individual steps is a way to get this dense feedback.
That's a way to get “synthetic process reward”. I should stress that PRM and MCPE are actually different things. Alex Havrilla was doing something along these lines also - but anyway, hopefully this gives a sense of the approach we took.
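A minimal sketch of the rollout idea Ross describes here, in the spirit of Monte Carlo policy evaluation rather than his actual implementation: from each prefix of a candidate solution, sample completions and score that step by the fraction of rollouts that reach the known answer. The `sample_completions` and `check_answer` helpers are hypothetical.

```python
from typing import Callable

def step_values(
    question: str,
    steps: list[str],
    gold_answer: str,
    sample_completions: Callable[[str, int], list[str]],  # hypothetical sampler
    check_answer: Callable[[str, str], bool],              # hypothetical checker
    n_rollouts: int = 16,
) -> list[float]:
    """Estimate a value for each reasoning step by rolling out from its prefix."""
    values = []
    prefix = question
    for step in steps:
        prefix = prefix + "\n" + step
        rollouts = sample_completions(prefix, n_rollouts)
        wins = sum(check_answer(rollout, gold_answer) for rollout in rollouts)
        values.append(wins / n_rollouts)  # dense, step-level "synthetic" feedback
    return values
```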
Nathan Lambert [00:34:21]: When Q* came out, that's something that I thought it might be doing. Instead of chain-of-thought, there's this idea of tree-of-thought. You could swap in the reasoning steps. And then if you could get labels on all these reasoning steps, you’re doing search over a reasoning space - which I would expect to work, but I think it needs the right datasets. I think a large part of the open alignment community right now is underappreciating datasets, where there's a lot of focus on methods, but we don't even have the datasets to use the methods… Like, why are you coming up with seven DPO variants if you don’t have the right datasets? I understand academic incentives, but if you are not an academic, you don't need to be doing that…
Ross Taylor [00:35:00]: It's an interesting question, because I guess the first chapter of LLMs had a lot of reliance on human annotations. In a way, that's a barrier to entry for the open community, because big firms can afford to pay millions for it but open source developers can’t. But more recently, you've had the rise of things like constitutional AI [and RLAIF approaches], which I believe are comparable to human-annotated datasets anyway. So is that a good thing for the open community?
Nathan Lambert [00:35:31]: I think it is, but human preference data might be a leg that is hard to remove. One of my later questions was: can we actually do LLM-as-a-judge for human preference data fully? I think that is the critical step we don't have an answer for. Everything else in the modern RLHF stack is becoming more reproducible in the open.
And that relates to a question I have on synthetic versus human SFT. I think Thomas [Scialom] said on the Latent Space podcast that we just use generations from the model because they're better for humans on a lot of SFT tasks. Apple had a quote in their foundation model paper saying the same thing.
So I’m thinking, shouldn’t we be redoing all of our generations for our SFT dataset with the latest GPT-4 or LLaMA-405B? Why are we using GPT-4 from March 2023? That model was not as good on reasoning. So we have headroom there on synthetic data. We have prompts that we could reuse, but we don't have the right preference datasets - datasets like UltraFeedback are not big enough. And I think they're not in the same style that a lot of labs are doing this preference tuning - where it's on-policy generation.
We tried to work with Scale at Hugging Face to do this, where we had our own SFT models. We were getting data from Scale. We were labeling it every week and we were trying to retrain the models and we weren't getting a signal. This was last July/August. So we just didn't really know what we were doing. But I suspect that what people in the open should be trying to do is generating a lot, labeling it…That was a light bulb moment for me recently. This is what we have to do, but no one has done it.
Ross Taylor [00:37:21]: Yeah, I think it's definitely underappreciated how you can get better answers than a human by sampling the models [enough times]. You mentioned that Thom made this point early on in the [LLaMA] project, but you'd be surprised how this extends to reasoning as well. Even with the Galactica model - which is now an ancient model, a bronze age model - the pass@100 on GSM8k was 98%. And it's absolutely crazy to me that even now people are using GSM8k as a benchmark. In my mind, that benchmark was solved several years ago.
It’s a subtle point because the zero shot performance was ~48% but the pass@100 was 98%. The insight there is that the model already has knowledge about how to answer correctly, it's simply not reliable. This tells you that you need to invest in reward models, process based reward, outcome based reward, everything we talked about earlier…
But the same applies to the general RLHF pipeline. If you asked me to write a poem in the style of Bertrand Russell but also mix in Snoop Dogg’s style, then I couldn't do that. But the model has knowledge of how to do that, right? So why wouldn't you sample the model?
I think now with LLaMA-3, and the 405B model being out, it’s going to be good for the community that they can use it for generating data synthetically. And I'd imagine the quality will be good enough if it's done the right way.
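The pass@k numbers Ross quotes are worth grounding: with n samples per problem and c of them correct, the standard unbiased estimator (popularized by the HumanEval/Codex evaluation) is easy to compute. This is the generic formula, not necessarily the exact calculation behind the Galactica numbers.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical numbers: 60 correct out of 200 samples gives pass@1 = 0.30,
# while pass@100 is already ~1.0, the reliability gap Ross is pointing at.
print(pass_at_k(200, 60, 1))    # 0.30
print(pass_at_k(200, 60, 100))  # ~1.0
```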
Nathan Lambert [00:39:30]: Yeah, I think it should be doable. But there's a fundamental question of what do we think the human preference data is doing? [Compared to] model labeled preference data, is the noise that the humans provide of a different distribution that makes the human preference data better? I don't have a lot of signal on this, but I would love to know because I would guess that Meta would love to eliminate the $10 million plus estimated human preference data spend if they could. Meta is a reasonable company…
Ross Taylor [00:40:23]: Yeah, I don't know. But here’s something that surprised me. I was originally skeptical - at least on the reasoning side for LLMs - about LLMs marking their own homework. I thought they would eventually have that capability, but I wasn’t sure…
Nathan Lambert [00:40:40]: how fast.
Ross Taylor [00:40:41]: But the interesting thing we saw was as follows. We had experiments where we’d have a LLaMA-2 model that we’d sample generations from to train ORM models, and then we’d train different reward models on this data with different base models.
What we saw is that, the better the (underlying) base model, the better the reward model was for evaluating. And there were very clear patterns we saw: as the base model scaled, so did the quality of the reward model.
So that tells you that the knowledge is not in the ORM samples that you've fine-tuned the base model on. The knowledge on how to judge is within the model itself. And the pattern was so clear in the scaling. I concluded that eventually these self-verification approaches would work. It was just a question of when they would start to work for different types of problem.
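A minimal sketch of the outcome-reward-model (ORM) data setup Ross references, under stated assumptions: sample solutions from a policy model, label each by whether its final answer matches the reference, and fine-tune a base model as a verifier on those labels. The `sample_solutions` and `extract_answer` helpers are hypothetical.

```python
def build_orm_dataset(problems, sample_solutions, extract_answer, n_samples=8):
    """Label sampled solutions by final-answer correctness for verifier training."""
    dataset = []
    for problem in problems:  # each problem: {"question": ..., "answer": ...}
        for solution in sample_solutions(problem["question"], n_samples):
            label = int(extract_answer(solution) == problem["answer"])
            dataset.append(
                {"text": problem["question"] + "\n" + solution, "label": label}
            )
    return dataset
```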
Nathan Lambert [00:41:31]: Yeah. Model capabilities are also getting more dense, which helps as well. With smaller models, there are all these experiments with better data showing that you get a better model with an X% size reduction, which is kind of off-topic…
To double-down on what you said, I think this is one of the things I also debate: what makes a good model for downstream fine-tuning? I think in the LLaMA-3 report, they train the reward models directly on the base and not on the SFT model. The Apple report mentioned that they don't just use their evaluation suite for SFT models, but they evaluate with a reward model to see what is ready for RL.
I think, especially in the open, if you want the people to adopt your base model, there's a big gain in making it easy to fine-tune. For example, LLaMA has been pretty good; LLaMA-2 especially was really good for fine-tuning. There's also been base models that don't really work for fine-tuning, partially due to bugs and partially due to the state of the optimization. Is this something that you have any insight into?
Ross Taylor [00:42:43]: Yeah, I don't think I have enough insight into it to say, but I think it's definitely something that's been undervalued. The view of a lot of open model providers is: you get the model out, get good Open LLM Leaderboard results, and it's mission accomplished. But the real evaluation is in two days' time when you get anon accounts on X saying "I'm fine-tuning this LLaMA model, it's not working". And when you see a pattern with this kind of behavior, you have to conclude something is wrong…
Nathan Lambert [00:43:11]: It's always a chat template thing. A lot of it is a chat template thing, but those problems do get ironed out eventually. There's this whole idea of annealing and staging pre-training. I can't tell if it is boosting current capabilities at the cost of later capabilities. I think in a few years, this will all shuffle out and it's just how we do evaluation in stages. So you're always going to optimize for the right metric.
Ross Taylor [00:43:50]: There's two points to that.
The first is about annealing. It works for the kind of benchmarks people focus on the most, but then there's a question of whether you are actually just collapsing the task distribution of the model to things you're measuring - and not the true task distribution used by the community.
And I think there's a second point - which is maybe too much of a digression - but there's an interesting debate to be had about data quality being a bit of a misnomer. In a sense that when we say “data quality” we're actually saying “this data mix works well on these benchmarks”. But if you take a “No Free Lunch (NFL)” kind of approach to this, you must be hurting task performance somewhere else, right?
Nathan Lambert [00:44:34]: Yeah, I think I’m on the record of being an AlpacaEval hater. I say this all the time, because I think AlpacaEval is sacrificing actual usefulness for their own metric. If you get a 1-2% bump on alpaca eval, maybe that’s great. But you could be getting a 10-20% bump while sacrificing actual chat abilities.
We released some models trained with PPO and our PPO models are not very good at instruction following because they don't follow modifications like be concise or some stylistic things. They're also so yappy. They just say so much…but they do well on metrics and PPO especially helped AlpacaEval. So we had to figure out how to kind of use that signal without overcooking it.
Ross Taylor [00:45:16]: Yeah, it's like a whole discussion about evals, I guess…
Nathan Lambert [00:45:21]: We could come back to evals in a second. The last question I have is this: there are multiple trends, like LLaMA-3 downplaying the importance of instruction fine-tuning relative to RLHF. I think there are other quotes in [Thom's] Latent Space podcast talking about it. Nemotron also had a report where they use SFT and then multiple stages of RLHF.
I think DPO versus PPO is overblown and that'll kind of be a wash eventually. Everyone knows DPO's advantage of being simpler. But my question is this: are there certain capabilities that only come from RLHF, and are people trying to do them with SFT just wasting their time?
I always thought safety was in this bucket where it kind of makes sense - it’s hard to train a model to refuse just with SFT. But with something like reasoning, are there certain sequencings where SFT primes you and then RLHF really helps reasoning or code? Because it seems like OpenAI is really leaning on PPO to help with reasoning and code?
Ross Taylor [00:46:45]: Yeah, I think there's two ways to answer this question. First, maybe the history of this debate on the LLaMA side, and then something on the reasoning side.
So the history is quite interesting. I would say, you know, when was it? 2023? My dates have been wrong since the pandemic… But this was just after ChatGPT. There was actually a debate internally at Meta about using RL, and a lot of senior people were very skeptical. I would say the view was…
Nathan Lambert [00:47:13]: Not just at Meta. You can see when different companies embraced RLHF, if you really start to look at their models…
Ross Taylor [00:47:22]: The view was that RL was a dead end. And that even DeepMind was moving away from RL at the time, so you should just do SFT.
But, you know, at least for the folks in the Galactica team that came to lead post-training for LLaMA, we were quite scarred by hallucinations! We were definitely of the view that we needed to have the right objectives, and that we needed to make sure language models could “know what they don’t know”. So we were quite high on RL from the beginning. And eventually, I think the LLaMA-2 paper showed that a lot of the advances in helpfulness/harmlessness were via the RL stage. So I think that approach was fairly vindicated.
On the reasoning side, I would just say it’s quite simple. It comes back to the next token prediction objective not being the actual objective you want to optimize. The objective you want to optimize for reasoning is: do you get the right answer or not? Especially since reasoning is a high precision task. If you get one token wrong, unless you have a backtracking capability, you’re never going to recover…
Nathan Lambert [00:48:32]: That's a throwback, the backtracking token. Sorry, that was a random paper! That is interesting…
Ross Taylor [00:48:38]: Yeah, all these weird methods… But I think on your question, there is a point at which these techniques kind of overlap, right? So if you're, you know, doing SFT with rejection sampling: you’re doing something close to PPO anyway. And the same for reasoning: if you sample the model and pick the trajectories that your verifier says are correct, and then do SFT on that, it is a form of RL.
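A minimal runnable sketch of the overlap Ross describes, with toy stand-ins (`sample_answer`, `verifier`) rather than a real model or training stack: sample several candidates per prompt, keep the ones a verifier marks correct, and fine-tune on the survivors. That filter-then-SFT loop is already a crude policy-improvement step, i.e. a simple form of RL.

```python
import random

# Toy stand-ins, purely illustrative: `sample_answer` plays the role of sampling
# from the current policy, and `verifier` plays the role of the
# "did you get the right answer or not" check.
def sample_answer(prompt_value):
    return prompt_value + random.choice([-1, 0, 1])  # sometimes right, sometimes wrong

def verifier(prompt_value, answer):
    return answer == prompt_value

def rejection_sampling_round(prompts, num_samples=8):
    accepted = []
    for p in prompts:
        candidates = [sample_answer(p) for _ in range(num_samples)]
        # Keep only verifier-approved trajectories; ordinary SFT on this set is
        # the "rejection sampling" stage described above.
        accepted += [(p, c) for c in candidates if verifier(p, c)]
    return accepted

sft_dataset = rejection_sampling_round(prompts=[1, 2, 3])
print(f"{len(sft_dataset)} verifier-approved samples to fine-tune on")
```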
The final point I’d make is this: I would say the community overreacts to certain methods being used by popular models. They think: this company uses DPO because they must have found it's fundamentally better. But actually, it's usually due to either practicality or…
Nathan Lambert [00:49:22]: Yeah, that's what I think.
Ross Taylor [00:49:24]: You have a 405B model, and if you want to do PPO, you need to have a policy model, a reward model, a value model, etc., in memory, and it's not like…
Nathan Lambert [00:49:33]: Especially with DPO. I think with the 405B, I'm guessing what you did was cache the reference model. You could cache the log probabilities from the reference model, so you don't need to keep it in memory when you're computing the loss for the primary model. For DPO, you don't even need an extra copy of the model in memory, which means you can use the exact same stack that you use for training. You don't have to comment on this, but I think that's probably partially why LLaMA-3 just used DPO...
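A rough sketch (not Meta's actual pipeline) of the caching trick being described: the frozen reference model's per-response log-probs are computed once offline and stored with the data, so training only ever touches the policy. The numbers below are made up just to show the shapes.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_cached_ref(policy_chosen_logps, policy_rejected_logps,
                             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit reward margins: log-ratio of policy to reference for each response.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Standard DPO objective on the difference of the two ratios; the ref_* terms
    # are plain cached tensors, so no second model copy is needed in memory.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Made-up sequence log-probs; in practice the ref_* values would be precomputed
# once with the frozen reference model and saved alongside the preference data.
policy_chosen = torch.tensor([-12.3, -8.1])
policy_rejected = torch.tensor([-14.0, -9.5])
cached_ref_chosen = torch.tensor([-13.0, -8.4])
cached_ref_rejected = torch.tensor([-13.5, -9.0])
print(dpo_loss_with_cached_ref(policy_chosen, policy_rejected,
                               cached_ref_chosen, cached_ref_rejected))
```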
Ross Taylor [00:50:07]: Yeah, I think people don't appreciate how compute works either. People assume the big companies have so much compute - tens of thousands of GPUs - so compute isn't a constraint. But all these things are subject to Say's Law, right? If you have more compute, you're going to train a bigger model, and then you're going to hit the constraints again. It's like the old thing of trying to solve traffic by building another lane: if you create another lane, people will just fill that lane too.
So practicality is still a factor [behind choosing methods]. Also things like which researcher is in charge, what’s their favorite method, and also politics as well.
So I think the community has made a mistake of overreacting to these choices. There was a mixture-of-experts phase too, right? I don’t think there’s anything inherently better with either method (dense or MoE), they just have different trade-offs, and it depends on what you are trying to achieve. If you’re serving lots of people with inference, then maybe a MoE approach is better. If you’re optimizing for something simple that’s easy to train and gets good results, maybe you favor a dense approach - although that’s debatable whether it’s easier to train. But I don’t think these things are clear cut.
So I would encourage people to not just copy things because they're in a paper from a big lab. I would encourage people to try things out themselves to know what works, and figure out what the problem is that you’re really trying to solve.
Nathan Lambert [00:51:20]: I think people don't have enough long term direction in their decisions. People are not trying to make decisions about what will be right in 10 years, they are trying to get a model out as soon as possible. So there are very few people with the incentives of trying to understand in the asymptote, which method is better… I might have that incentive, because I'm a nerd, and I have an audience that is okay with me writing four paragraphs around esoteric nerdy topics, but for all these companies, that is not a real incentive.
Ross Taylor [00:51:53]: The other point I’d make - maybe it is a separate thing - is this. I made this mistake throughout my career of focusing too much on novelty and complexity.
So in my first job in sports betting, we were making models for horse racing, football, that kind of stuff. And I always had the perception that other funds had really advanced, cutting-edge, complex models - but that wasn’t the case at all.
I think there is this tendency within deep learning to assume - especially for the secretive labs - that their good performance is due to some secret, amazing method. But more often than not, good performance is due to lots of small things from different people combined into one model. Really, lots of simple things done well and solid execution. And frankly, for big firms, a lot of brute force too, right? Because big companies are naturally slow. But once they find a way to mobilize resources, they're very intimidating and hard to beat. If you're in a big company and you're aware of this, which approach are you going to take: are you going to prioritize novelty, or are you going to do brute force if you have tens of thousands of GPUs?
So I would encourage people not to be too intimidated by this perception that the big labs are smarter. I don’t think they are.
Nathan Lambert [00:53:03]: They're earlier but they're not necessarily smarter.
Ross Taylor [00:53:09]: Yeah. So obviously the constraints are different because of less compute in the open, but still: you’ve got to use first-principle thinking and be empirical as well, and just follow that path.
Nathan Lambert [00:53:21]: Yeah. So following up on this, there's a lot of discussion around what the processes are for making a successful foundation model lab. I think Armen has been talking about a few things on Twitter with great visualizations around de-risking pre-training based on FLOPs efficiency. Do you have any comments on what makes a successful post-training team and project?
I've talked to John Schulman a couple of times - he's been the king and started all of this - and OpenAI is still looked at as being the leader in the space. I think they've always been top on Chatbot Arena, and have cracked what most people like in the style. They started early. Are there different considerations for the post-training side of things rather than the pre-training side that we might hear more about?
Ross Taylor [00:54:13]: Yeah, there's probably better people than me to answer. So in our team, originally like Robert (Stojnic), my co-founder, he was kind of managing the post-training team. And then I'd say Thom Scialom was doing a lot of the work. And then more recently Rui Hou - he kind of flies under the radar a bit - but he’s been doing a lot of the work. They are all better placed to answer than me, since I was focusing on reasoning and agents.
But I think the key thing is this: post-training is just a lot of iteration. Frankly, lots of hard work - e.g. making sure at each round of RLHF you’re not regressing in certain ways, filling holes, etc. I guess it’s hard to put a finger on a single thing, but…
Nathan Lambert [00:54:58]: There are simple things I'm trying to get people to talk about more. I'm trying to establish a good vibe test for internal culture. How do you vibe test for a good post-training culture (or for reasoning)? I remember somebody at Anthropic told me there are still a lot of cases where you just put your finger up to the wind and you're like, “model good”. And I'm sure that is still happening. And that's just a simple cultural thing of telling the team that you can't always trust all of your numbers.
Ross Taylor [00:55:26]: I think it is maybe a more fundamental question. I wasn’t there at the early days of FAIR - I came in 2019, but FAIR was always a very bottom up organization. Which is a great thing: that's why things like PyTorch emerged. But the real insight as to why OpenAI was ahead historically, at least until recently, was that they had more of a top-down culture and focused bets. They saw the potential of LLMs early on and it was a top-down prerogative of the company to focus on that. And in essence, it was more of an engineering problem than it was a research problem in a lot of ways.
Relatedly, I think a lot of people were surprised that the LLaMA-3 paper wasn't as “novel” as they were expecting. But that just reflects the fact that a lot of it is just engineering and engineering is really hard - a lot of hard work. Not always a lot of new methods, but it is a lot of hard work.
Nathan Lambert [00:56:22]: Yeah, we're starting our next fine tuning model and everyone's asking me: “what should we work on?”. I'm trying to tell them “we just have to filter data and generate more completions”. We’ll have a lot of prompts, we have to filter them, generate completions from good models, and then we’ll have to generate more completions and keep doing this process…And in 10 weeks, we'll probably have a very good open model. We’ll just have to be boring for 10 weeks! And we have like 10 people involved.
So it's a bit of a bigger project, which I think is the right way to do it. We have just started getting improvements on IFEval by copying Nemotron. We use some open math datasets and the math scores are getting closer to LLaMA. It is really the simplest thing ever. It's like browsing Hugging Face and being like, “NVIDIA released some JSON format data, some instruction format data - we add it in and the numbers go up”.
Ross Taylor [00:57:16]: Yeah, I think I said this earlier, but it raises an interesting question about this kind of approach - of grinding until the Open LLM Leaderboard numbers get to 100%. I think we're going to get to a situation where all the benchmarks are solved, but where we haven't really, in my mind at least, solved intelligence.
What does it mean that we'll get close to 100% on MATH, you know, without any inference time search? I think sooner or later, while it looks like we’re on an exponential with LLMs, we’ll realize we’re actually on an S curve. Eventually we're going to get back to this mode where we have to do new things. And I think that's great, because that's what motivates me.
But yeah, I think there's waves, and we’re in this heavy exploitation mode right now with LLMs - away from the glory days of architecture exploration. But my hope is that we'll get back to the stage where, after exhausting all the [current] benchmarks, we say: OK, now we need to do something completely different. But who knows?
Nathan Lambert [00:58:26]: I see it similarly. I think we still have a year or two, at least in the open. If the closed models start saturating and they start doing things differently, that's fine. But eventually it'll all get there. And in that phase, I mostly keep working just to make sure that the ecosystem doesn't fold in on itself. So that's probably the one-sentence summary of what I'm doing these days: add transparency so that regulatory capture doesn't nuke everything. And that's fine, but I think it's still going to be longer than people expect. I don't think we have true signs of saturation at the top. We'll see what GPT-5 does - if GPT-5 never comes out - and then we’ll really know.
But it seems like it's going to come. I think there's enough signs that it'll come eventually. I think I don't know the answer to this - and it's not really our expertise - but I'm interested in the potential architecture of GPT-5 and if it's GPT-4o like and they're using more multimodal data to try to keep the data engine going relative to just going bigger. I don't know the answer, but that's kind of the future questions I’m thinking about.
Ross Taylor [00:59:34]: In my mind, like three years ago, the thing on the horizon I saw was agents. That’s where a lot of people are working right now: long form tasks where an agent doesn't have to answer a question immediately, and [can instead] go away for a while doing some research and answer later. I think that will take up a lot of time in the next five years.
It's both a compute problem of bigger models - more scale will do better - but also a data problem of how do you generate these trajectories? How do you get reliability? So it’s more successful and less error-prone at each step. And I think in principle it's solvable, but I just think it would take some time.
Nathan Lambert [01:00:18]: Yeah, it seems that engineering is required. It doesn't seem like something that's just going to emerge. It's building a whole system and scaffolding around agents. Just inglorious work.
Ross Taylor [01:00:32]: Yeah.
Nathan Lambert [01:00:34]: OK, anything else you want to add? Do you want to get people excited about your start-up or is it too early? Maybe too early, yeah?
Ross Taylor [01:00:43]: Yeah, what else should I say? It has been nice to step back for a bit and look a bit ahead into the future. For me, my best days creatively were my teenage years when I got back home from school and spent the rest of the day programming. It’s quite nice to feel like that again: to be in that zone again where I can shut the world out and do some work.
But maybe just to give a hint of the areas I'm interested in, I think it comes back to this problem of how alignment is going to be a process of making AI more human-like. For example, how do you control for things like deception - which Anthropic has done a lot of really good work on.
Essentially… the latents of AI are [potentially] misaligned with human latents, and the question is: what do the human latents look like anyway? And how do we model these things?
That is very abstract and high level, but that is the fundamental question I want to work on. But yeah, I think I can talk about it later in the year!
Nathan Lambert [01:01:49]: Yeah, sounds good. Thanks for coming on. This was great. I think people are going to get a ton out of this. I think just a very sensible conversation on fine-tuning, reasoning and some of the things that got us here. And that's what I was hoping to get out of it, so thanks again!
Ross Taylor [01:02:06]: Yeah, great to talk, Nathan. Have a good one!
Apple, Meta, and Nvidia all agree -- synthetic data, iterative training, human preference labels, and lots of filtering.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/frontier-model-post-training
00:00 Llama 3.1 post-training and the new normal for RLHF
01:18 A new standard pipeline
01:45 Human preference data
02:59 Scaling RLHF
05:03 Synthetic data
06:10 The new normal
06:51 Data quality is king
07:18 Apple confirms the new normal
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_020.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_031.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_033.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_035.png
This week, I had the pleasure of chatting with Sebastian Raschka. Sebastian is doing a ton of work on the open language model ecosystem and AI research broadly. He’s been writing the great Ahead of AI newsletter (that has the biggest audience overlap with Interconnects, at 26%, so a lot of you know him) and multiple educational books, all on top of being a full time machine learning engineer at Lightning.ai, where he maintains LitGPT, which he described as being like Karpathy’s NanoGPT, with slightly more abstractions.
This conversation mostly surrounds keeping up with AI research, the state of the open LLM ecosystem post Llama 3.1, and many narrow topics in between. I learned that Sebastian used to be an Arxiv moderator, which gives some simple color on how Arxiv and sifting through thousands of papers works. We cover a lot of ground here, so I hope you enjoy it.
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other interviews, go here.
YouTube
Chapters
* [00:00:00] Introduction & Sebastian’s background
* [00:04:28] The state of deep learning and language models in 2018
* [00:08:02] Sebastian's work at Lightning AI and LitGPT
* [00:12:23] Distillation and its potential in language model training
* [00:14:14] Implementing language models and common pitfalls
* [00:18:45] Modern architectures: Mixture of experts models, early v. late fusion multimodal
* [00:24:23] Sebastian's book on building language models from scratch
* [00:27:13] Comparing ChatGPT, Claude, and Google's Gemini for various tasks
* [00:38:21] Vibing and checking new language models during implementation
* [00:40:42] Selecting papers to read and moderating Arxiv
* [00:45:36] Motivation for working on AI education
* [00:52:46] Llama 3 fine-tuning
* [00:57:26] The potential impact of AI on jobs in writing and education
* [01:00:57] The future directions of AI
Transcript
Built with smol-podcaster and with love of Latent Space.
Nathan Lambert [00:00:00]: Hey, Sebastian, welcome to this kind of Interconnects episode - normally these are researcher interviews. You were a professor, so that definitely counts. You do a lot of different things these days. Let's get talking about language models. Welcome. Yeah.
Sebastian Raschka [00:01:35]: Thanks so much for the invitation, Nathan. I'm a big fan actually of the interconnects newsletter, so I'm hoping we can have some fun chat about research, LLMs, and what's hot these days, basically. Yeah.
Nathan Lambert [00:01:48]: I have a little section at the end, which is keeping up with AI research, writing about AI, and process, because you do so many things. But I kind of want to jump into how you got to AI, because you have an interesting career path. So you were a professor at Wisconsin-Madison for years - in statistics, I saw, which... I also went all the way back to find your PhD thesis, which was on uncovering hidden patterns of molecular recognition. So this was a while ago. Can you explain your background and how you got into AI? I'm guessing it's through computational statistics or something like this.
Sebastian Raschka [00:02:24]: Yeah. Close. So yeah, you did some research there. Interesting. So yeah, it's been a long time since my PhD thesis - it's maybe seven years now. And back then... it started even earlier when I got into AI, that was, I would say, 2012-ish. I was in grad school and I was taking a statistical pattern classification class. And in that class, the star of the show was basically naive Bayes classifiers, or in general, Bayesian methods for pattern recognition. And from there, I really got into machine learning. It was, I would say, more statistics-based, but it was all about classifying things. And then I think it was also right about the time when Coursera was launched, and I saw Andrew Ng's Coursera class - that was, I think, the first class, in 2011-12 back then. And yeah, that's basically how I started, from statistical pattern classification into machine learning. And I applied that to computational biology problems like molecule and drug discovery - pharmaceutical drug discovery. And from there, at some point after my graduation, I joined the University of Wisconsin in Madison, where I was in the statistics department, but I did mostly deep learning research, essentially. I was basically the only one doing Python, deep learning, machine learning stuff. So yeah.
Nathan Lambert [00:03:48]: What year was this, and what did it look like at the time?
Sebastian Raschka [00:03:52]: That was around 2018, I think August 2018, when I joined the department. And yeah, I mean, so it's the statistics department, but my work was technically all machine learning and deep learning. I mean, a lot of students were really excited about learning machine learning. I think it was just around the time where it got really popular. And yeah, I was teaching machine learning and deep learning classes as well. They were always like, you know, full and crowded, like a lot of students were excited about that. Also, in general, like the time learning about Python, machine learning, data science, all these topics.
Nathan Lambert [00:04:28]: It's very interesting, because I was a grad student at that time, in like 2018. That's when deep RL was really taking off. And it probably felt kind of like the language model thing does now, as a student at the time, where there are just so many people in all these classes. Now language models have more of a real-world application, but I think as a student, it probably feels so, so similar. Yeah.
Sebastian Raschka [00:04:50]: So also back then, if I may say that it's like large language models already existed. I think the GPT paper, was it 2018? Something like that?
Nathan Lambert [00:04:59]: Yeah, 2018 or 2019. Yeah. For GPT-2, I think.
Sebastian Raschka [00:05:04]: I remember covering them - I had a whole hour or two hours on large language models back then, but it was all focused on BERT models, and basically also using them for more classification-like tasks. Now, I would say maybe a lot of business problems still revolve around classification, but everything else is basically generative - generating text, generating images and stuff. So it has changed a lot.
Nathan Lambert [00:05:28]: Yeah, for sure. It's like a sequence: ELMo, BERT, and the transformer are probably the things that you were talking about all the time? Just very interesting. I think Yi Tay had this - did you read Yi Tay's recent blog posts on language model architectures, which kind of walked through why encoder-decoder is no longer in vogue? Did you see this?
Sebastian Raschka [00:05:51]: Yeah, I think I haven't seen the article, but I remember having discussions with people about that recently. I mean, I think there was actually, it's interesting. So I think T5, if you would train it and fine tune it, it would still be a really good model for sequence to sequence tasks, like language translation and stuff like that.
Nathan Lambert [00:06:10]: Yeah. Cohere For AI did this with Aya. They used T5 for their first Aya version; most people were like, oh, Cohere branded it so well, but no one realized they were using T5.
Sebastian Raschka [00:06:21]: See, I didn't even know about that. And on that note, there was something else I wanted to say. There's also still the classification thing, using LLMs for classification. And it was usually either a BERT-like encoder, or you could also use an encoder-decoder, but mostly an encoder. But I've seen recent papers using just decoder models for that - I saw two papers on that actually - basically removing the causal mask, so reverting it back to an encoder, using Llama and then removing the mask. So in that sense.
Nathan Lambert [00:06:59]: And it works well as a classifier. You can just kind of use it. That's awesome.
Sebastian Raschka [00:07:04]: I mean, you could even do that without removing the causal mask. You could just tune the last token basically, but yeah, if you remove it, they found that you could probably use even the first token. Because if you use the last token, you always have to have padding - you have to pad to the longest sequence, otherwise the last token would be a different one in each training example. So in this way you could use an earlier token basically, and keep it fixed.
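A minimal PyTorch-style sketch of the idea in this exchange, with a hypothetical `backbone` standing in for a decoder LLM that returns per-token hidden states: pool one token's hidden state (the last real token when padding, or a fixed earlier position) and put a small classification head on top. Whether to also drop the causal mask is a separate choice not shown here.

```python
import torch
import torch.nn as nn

class DecoderAsClassifier(nn.Module):
    """Sketch: wrap a decoder-style backbone with a classification head.
    `backbone(input_ids)` is assumed to return hidden states [batch, seq, hidden];
    real libraries differ in the exact interface."""

    def __init__(self, backbone, hidden_dim, num_classes, pool_position=-1):
        super().__init__()
        self.backbone = backbone
        self.pool_position = pool_position  # -1 = last real token, 0 = first token, ...
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, input_ids, lengths=None):
        hidden = self.backbone(input_ids)            # [B, T, H]
        if self.pool_position == -1 and lengths is not None:
            # With padding, "last token" means the last non-pad token per example.
            idx = (lengths - 1).clamp(min=0)
            pooled = hidden[torch.arange(hidden.size(0)), idx]
        else:
            pooled = hidden[:, self.pool_position]
        return self.head(pooled)

# Toy usage with an embedding layer pretending to be an LLM backbone.
backbone = nn.Embedding(1000, 64)
clf = DecoderAsClassifier(backbone, hidden_dim=64, num_classes=2)
tokens = torch.randint(0, 1000, (4, 12))
lengths = torch.tensor([12, 7, 9, 12])
print(clf(tokens, lengths).shape)  # torch.Size([4, 2])
```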
Nathan Lambert [00:07:30]: Yeah. Yeah. Now, with your work at Lightning AI, do you do a lot of these things, like hacking around with language models? Because I think it's kind of an underexplored space where people just remove layers and plug things together. When merging was just getting going, there was, like, a Franken-Llama, where somebody made a Llama 2 30B by just chopping layers and stitching things together. There's so much unexplored signal there. Have you ever looked at these things, or do you not do that much?
Sebastian Raschka [00:08:02]: I must say I'm not a big fan of merging. Maybe I'm just not good at it. I'd rather fine-tune - start changing things or training and fine-tuning things. So yeah, I do a lot of this type of hacking. Sometimes voluntarily, sometimes involuntarily, because I make a mistake or something. Because at Lightning I developed this library, LitGPT, which is an open-source library for pre-training, fine-tuning, serving, and deploying LLMs. But it's basically a from-scratch implementation. You can think of it as NanoGPT from Andrej Karpathy, but for all types of LLMs - Llama, Gemma, Phi, all of them. And the focus, like NanoGPT's, is on readable code, keeping it relatively simple. Of course it gets a bit more complex when you add multi-GPU training, tensor parallel, fully sharded data parallelism, and stuff like that. If you add all these settings, it gets a bit more complicated, but the focus is still on having a code base that you can easily work with. And in that context, it's very easy to remove layers and change things. I build it for colleagues at Lightning, but also for the open source community, and then also for myself to tweak things, to change things and stuff like that. I should also say, it's not just me - it's Carlos and Adrian who started this library. Currently I'm the main person maintaining it, but a lot of people contribute to it. So it's actually a nice playground.
Nathan Lambert [00:09:41]: There are kind of two follow-ups to this. One is: what part of the language model training stack, if somebody is going to start with LitGPT or Hugging Face or whatever - like they're trying to fine-tune a model, you can do an example - what is the thing that they should do to go one level deeper to learn how these things work? Because you're saying with LitGPT, you can do all these different architectures. I don't know if I would recommend architectures, but it's a good way to learn how the attention implementation works and how different layers are shaped and things like this. Are there different areas you'd recommend people look at?
Sebastian Raschka [00:10:14]: Yeah, okay. It's a shameless plug, but in my book I do this step by step, the implementation. And this is for only one model, a simple model, a GPT-2 model. Because it's, I would say, the one that started all of this, right? Like the main architecture - and everything else is almost a derivative of it, in a good way, in that it's making tweaks and improving things. But basically start with one architecture, like you said, not looking at different ones at first, and then just understand: what is the input data here? How does it look? What goes into the LLM, and how does it really pass through the layers? And from there, okay, we understand how a model learns to generate one word at a time, and then go from there to instruction fine-tuning, and then even alignment with DPO, for example. So doing all these different lifecycle things - implementing one architecture, pre-training, fine-tuning, aligning - and then from there, I think it's a useful or interesting exercise to see how different architectures make slightly different choices, like replacing the GELU activation with a SiLU activation, or pre- versus post-layer norm, and these nuances - changing the number of heads or number of layers. And yeah.
Nathan Lambert [00:11:38]: Yeah. I mean, in industry, everyone kind of converges to similar things - people converge to a similar recipe and then they stick with it for infinity. Each of the orgs has these recipes that are too risky to change, and AI2 is still converging on a recipe. So we're learning things that the Llama team does - like RMSNorm, which they think is very important - or these different things. And I wonder how the open community is going to converge on pre-training things. So what scale of models do you recommend people train for your book? Are they training the hundred-million-scale GPT-2? Is it smaller? Because I think in Colab, you can fine-tune a 7B model, maybe with LoRA, I think. Is that true?
Sebastian Raschka [00:12:23]: Yeah. So this is true. But I think for LoRA, if you want to fine-tune a 7B model, you would need, I think, bitsandbytes quantization - like NormalFloat4 (NF4) or some quantization. But yeah. Or maybe going one step back, for the book it's really the smallest model, like the - what is it - hundred-something million. But I also have settings: if, let's say, your machine permits, use the larger version. There are larger versions, like 300, 700, and 1.5 billion. But it's really up to the reader. I have all the examples with the smallest one so that it even runs on a MacBook Air. On this podcast, I'm here on my small MacBook Air, and all the models train fine in a few minutes. Of course, I'm not doing the whole pre-training for that - you would need a GPU for a week, or maybe even longer than that now. I mean, it depends on the GPU, of course, but an H100, maybe a week. But also the other reason is that, in practice, you would probably use pre-trained weights, and then you can do continued pre-training and then fine-tune. So the focus is basically understanding how the pre-training works, then loading pre-trained weights. But the fine-tuning is the full thing - doing it to fine-tune a classifier, but also instruction fine-tuning essentially. And that doesn't take too long. I would recommend using a GPU, but it would technically run on a CPU. And to get back to the question you had about a 7 billion parameter model: I would say yeah, one A100 would probably work for a 7 billion model. But you can also, if you use LitGPT, set the number of devices and shard it over multiple GPUs. Yeah.
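For readers who want to try the 7B-on-one-GPU setup being described, here is roughly what 4-bit NF4 quantization plus LoRA looks like with Hugging Face Transformers, bitsandbytes, and PEFT (not the LitGPT interface); the model name and LoRA hyperparameters are placeholders.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

model_name = "meta-llama/Meta-Llama-3.1-8B"  # placeholder; any ~7-8B causal LM

# 4-bit NF4 quantization via bitsandbytes so the frozen base weights fit on one GPU.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, quantization_config=bnb_config, device_map="auto"
)

# LoRA adapters on the attention projections; only these small matrices are trained.
lora_config = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```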
Nathan Lambert [00:14:14]: I mean, all of this stuff is getting so much easier. I think, I don't know, when did you start writing this book and all of these chapters? Because I've seen the GitHub, I haven't looked at when it started.
Sebastian Raschka [00:14:23]: Actually longer than you might think. It took a long time. It's almost, at this point, one and a half years approximately.
Nathan Lambert [00:14:30]: Because at that time - like, what was the state-of-the-art 1 billion parameter model a year and a half ago? Some random model. But today, people are training 1 billion parameter models for 15 trillion tokens. So the fine-tuning that you can do there is getting extremely good. And I'm going to guess that people are going to start training even smaller models with these distillation losses. So have you looked at distillation at all? I think it's full-on coming in the next six months. We can shift to the Llama 3 and state-of-the-open-ecosystem section, because it kind of fits in. Llama 3 was not distilled - distillation is a specific loss function. I hate that when synthetic data came around, people called it distillation - I was on the Zephyr paper, whose title is "Direct Distillation of LM Alignment". But now the technical definition of distillation, which is knowledge distillation from a teacher, is becoming popular. So the whole synthetic data and alignment thing is now tangled up in a doubly defined word.
Sebastian Raschka [00:15:30]: So basically what you're saying is that people who just use synthetic data refer to it as distillation because it's from a larger model. Yeah. Yeah. Yeah. Confusing. I think Gemma 2 did that actually recently, so that was an example where they did that. And I do think it's also coming. For my book, those are the core chapters I have, but I have a whole long list of bonus material that I want to cover, and distillation - knowledge distillation - is one of them. So this will be something over the next few years, you know, doing tutorials on those and yeah.
Nathan Lambert [00:16:04]: Because I think people can actually use it as a thing. So how distillation works - I've thought about implementing it - is that if you have a fine-tuning corpus, you get all the predictions from your big model, so all the log probabilities from your big model, and you store them. And then as you're training the smaller model, you essentially weight its training by those stored predictions, so you don't need to keep the big model in memory when you're training. So I think people should be able to - or someone will upload a dataset file of, like, giant log probs from Llama 405B, and people will just try to fine-tune from it. I'm surprised that Llama 3 didn't use it, but I think it's just because they're focused on scale and data more than any fancy things.
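A minimal sketch of the kind of loss being described here, assuming the teacher's log-probs for each training sequence were dumped to disk ahead of time; shapes and the mixing weight are illustrative, not a recipe from any particular lab.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logprobs, labels, alpha=0.5):
    """student_logits: [B, T, V]; teacher_logprobs: [B, T, V], precomputed offline;
    labels: [B, T] next-token ids. Mixes a soft KL-to-teacher term with the usual
    cross-entropy on the hard labels."""
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    # KL(teacher || student); the teacher term is a cached constant, so the big
    # model never has to sit in memory during training.
    kl = F.kl_div(student_logprobs, teacher_logprobs, log_target=True,
                  reduction="batchmean")
    ce = F.cross_entropy(student_logits.reshape(-1, student_logits.size(-1)),
                         labels.reshape(-1))
    return alpha * kl + (1 - alpha) * ce

# Toy usage with random tensors, just to show the shapes.
B, T, V = 2, 8, 100
student_logits = torch.randn(B, T, V, requires_grad=True)
teacher_logprobs = F.log_softmax(torch.randn(B, T, V), dim=-1)
labels = torch.randint(0, V, (B, T))
print(distillation_loss(student_logits, teacher_logprobs, labels))
```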
Sebastian Raschka [00:16:49]: Yeah. And I think I probably know why, but also, yeah. One thing one should also add is why I think it's becoming more popular: with Llama 3.1, they just allowed doing that. Before, according to the license, it was technically not allowed to use Llama 3 models to improve other models, but now we can. So I think, like you said, it's probably going to be a hot topic. But I do think they didn't do it because the 405B Llama model had just finished, I think. If you think back, they shared the Llama 3 model, like, I don't know, half a year ago or something, many months ago. So I think it's really more that it hadn't finished training, but maybe for Llama 4 we will see more distillation using the 3.1 model.
Nathan Lambert [00:17:38]: Yeah, it's more architecture things. So while we're talking about distillation: Gemini Flash - Google's Gemini Flash - is confirmed as distillation. And it is very likely that Claude Haiku and GPT-4o mini are distilled in the technical sense of the word. I think it's obvious that that works for pre-training. And I think there will be a breakthrough fine-tuning model, kind of like the likes of Zephyr and Starling - I'm forgetting more names - but one that really shifts the narrative through fine-tuning on distilled data. I think that'll come in the next six months. So honestly, I'm telling the people I work with, we should try to do this before something new comes, because it's so obvious now.
Sebastian Raschka [00:18:22]: One thing I've seen also as a trend - I wouldn't say backwards, but a thing that doesn't seem to be that popular anymore - is mixture-of-experts models. What do you think about that? Is that something that was like a fad, and now people explore other things like distillation? I mean, you could do both, but it feels like mixture of experts is not as hot anymore somehow. I don't know. What do you think?
Nathan Lambert [00:18:47]: There's two things. Small mixture-of-experts models are definitely coming out. Essentially, you get a fixed improvement in FLOP efficiency at pre-training. So if you're going to pre-train an X billion parameter model with mixture of experts, it'll go like 40 percent faster, or some pretty appreciable number. There's a lot of rumors and discussion that scaling up mixture-of-experts models is really hard from a stability point of view. So a lot of these open people - you can get it started, and we're playing with these at AI2, so we want to play in the mixture-of-experts space as well. And doing a small model works, but there's a lot of headaches. I think some of the friends at Databricks Mosaic ML have been the clearest about this. It's just like: you, at AI2, do not have the engineering throughput to deal with the headaches that come from mixture of experts. So I think there's still clear signal from industry and people - I mean, DeepSeek's releasing MoEs, I think Qwen has a small MoE, and these are pretty good models. But I think it's a really heavy engineering lift to get mixture of experts to work at, like, GPT-4 scales. I expect Meta to figure it out. I think it's just on their list and they figured out dense first. The thing I'm more interested in - for GPT-4, I don't care if it's mixture of experts, I think they have the compute to do it either way - but for Llama 4... God, all the numbers throw me off so bad. But I think that OpenAI and Google might be slightly ahead by having the early fusion model. So essentially with these multimodal models, there's the concept of early versus late fusion. The first vision models that people were playing with, like GPT-4 with vision, were late fusion. And now GPT-4o is early fusion. And it seems like Gemini is probably early fusion, which means they take in audio, video, and text directly at the input, and the training data changes. And I don't know how much of a heavy lift it is to get that to work. I think that might be the bigger change. And that might be harder for Meta to catch up on than anything else. But no one's really talking about it.
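For context on the FLOP-efficiency point, here's a bare-bones sketch of a top-2 mixture-of-experts feed-forward layer with made-up sizes: each token only runs through 2 of the 8 expert MLPs, so the parameter count grows much faster than the per-token compute. Real systems add load balancing, capacity limits, and expert parallelism, which is where the engineering headaches come from.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoE(nn.Module):
    """Bare-bones MoE feed-forward sketch: no load balancing, no capacity limits."""

    def __init__(self, d_model=512, d_ff=2048, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):                                # x: [tokens, d_model]
        gate = F.softmax(self.router(x), dim=-1)         # routing probabilities
        weights, idx = gate.topk(self.top_k, dim=-1)     # each token picks top-k experts
        weights = weights / weights.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k:k+1] * expert(x[mask])
        return out

# Toy usage: 16 tokens of width 512 go in and come out, but only 2/8 experts ran per token.
print(Top2MoE()(torch.randn(16, 512)).shape)
```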
Sebastian Raschka [00:20:58]: But also here, I think that is something I feel like others have done. I mean, I remember even last year, there were a lot of papers with the late fusion thing, like the LLaMA-Adapter papers and stuff like that - retrofitting the models. But yeah, I haven't seen that much focus on that from Meta. I mean, they had a section on it in the paper, but it felt almost like an afterthought. I don't know. Maybe there's a different team at Meta that works on
Nathan Lambert [00:21:26]: that. There is a Chameleon team that was doing this, and I think a lot of them have left. My question, essentially, that I want to debate and I don't know the answer to, is: it takes so many different data pipelines. You have to have a much clearer balance between video, images, audio, and text when you're training early fusion than with late fusion, because with late fusion you just add a bunch of images at the end. And if that data curation step is going to be a big bottleneck for kind of shifting, and if Google and OpenAI have an advantage by just scraping YouTube - like Google obviously can't scrape YouTube and I'm not saying that they are - but if it becomes a way that you can get more data, and GPT-5 is the first model that OpenAI releases, then I'll be like, OK, the GPT-4o thing was just a pivot. And I actually think this could happen. I don't put this at like a one percent probability. I could see this as being what the labs are betting on. It just takes so long to spin up this entire new pipeline of training.
Sebastian Raschka [00:22:25]: But one question here, going back to a point you mentioned earlier regarding the knowledge distillation where you can just precompute all these things: you could technically do that also just once for the whole dataset. Let's say you have a very good image encoder, audio encoder - you would never have to redo this if you do it well, right? I mean, it would be something you take care of once, and then you pass it just as tokens to the other team, basically.
Nathan Lambert [00:22:49]: Yeah, probably. I don't know. I don't have as much insight into really advanced pre-training practices as I would like. I'm mostly in a similar boat of fine-tuning models and playing with things. Have you played with Llama 3.1 405B at all? For context, the recording is, what is this, like a week after - like six days after. I haven't gotten it set up, but I'm really curious. I don't have clear expectations on how the open source community - the open language model ecosystem - kind of evolves from here with these new Llama models, the new Mistral models. From a technical and a policy perspective, for me it feels like a pivot. I think on the educational side of things, it's actually more of the same - we knew this was coming - but it feels like it could be qualitatively different going forward. Do you see anything? Have you tried anything?
Sebastian Raschka [00:23:45]: Yeah, I did actually try the Llama 3.1 models. When they came out last week, we added them to LitGPT. I took care of the 8 and 70 billion models, and my colleague Adrian also added support for the 405 billion model. Just briefly trying it, it looks really good. The thing is, with a 405 billion model, it's a bit tricky. The problem here is, of course, it's free - everyone can use it - but in a sense it's still expensive to run, because we were running it with bitsandbytes quantization, like NF4, on eight H100s. And this is expensive, right? I mean, eight H100s is probably more than a hundred bucks an hour.
Nathan Lambert [00:24:26]: I was trying to do the same and I messed up the vLLM installation. I was like, okay, I spent an hour on this. Yeah.
Sebastian Raschka [00:24:32]: So you can try LitGPT maybe. It works with that.
Nathan Lambert [00:24:36]: Yeah. And there's a related question. One of the things I'm trying to ask people who are hands-on is just: what do you do to vibe check a new model? You go through so much AI research material and language model material - everyone has their procedures. How do you go about that?
Sebastian Raschka [00:24:51]: So for me, I use these more for making sure they generate the correct answers and stuff like that, or something that is reasonable. Honestly, really simple questions, just to see - so I'm not necessarily benchmarking these models, I'm more making sure the implementation is correct. And for that, I use simple questions like: what do llamas eat? What is one plus two? You know, just making sure, because it's actually easy - something I just fixed this morning - it's easy to mess up things like KV caching, where you don't clear the cache, and then there's something from the previous answer, and the answer looks kind of correct, but it's kind of weird. And simple questions can sometimes reveal that. So basically what I do is ask it the same question repeatedly and see if the outputs still make sense, and then mix them up, but in a loop basically. It's not so much a benchmark, but that's a great way to make sure the implementation works.
Nathan Lambert [00:25:53]: Because I think in Transformers, they had a missing end token. There are so many little things like this when implementing stuff. The end tokens are such a bane, and the chat templating can always break things. Because it can also happen that you mess up pre-training and then you need to have something in the chat template that people might not know about. I think in one of the early OLMo models, we missed a new line in one of our documents when we were annealing it. So in order to fine-tune it, you had to have an extra new line before the chat template, and most people will just miss that. Yeah. This is a very, very interesting point.
Sebastian Raschka [00:26:28]: You don't even notice it usually when you use something like, I don't know, ChatGPT, because it's applied behind the scenes. But if you implement these things yourself, you have to be really diligent and careful to do it very consistently. One little new line, like you said, throws it totally off. It's interesting. I noticed that when I was working on some DPO stuff this weekend: the prompt template I used for fine-tuning and the one I used for DPO alignment were a bit different, and I got garbage results. And then - oh, I had stripped a new line character somewhere - basically something similar to what you said. So it's very sensitive to that.
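To make the template-sensitivity point concrete, here is a tiny illustration with made-up chat template strings: a single missing newline produces a prompt prefix the fine-tuned model effectively never saw during training.

```python
# Hypothetical template strings, just to show how small the difference is on disk.
template_trained = "<|user|>\n{prompt}\n<|assistant|>\n"  # what the model saw in training
template_buggy   = "<|user|>\n{prompt}<|assistant|>\n"    # one missing newline

prompt = "What do llamas eat?"
print(repr(template_trained.format(prompt=prompt)))
print(repr(template_buggy.format(prompt=prompt)))
# The two strings tokenize differently around "<|assistant|>", so generation is
# conditioned on a prefix that never appeared in fine-tuning - enough to degrade outputs.
```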
Nathan Lambert [00:27:04]: Yeah.
Sebastian Raschka [00:27:04]: Yeah.
Nathan Lambert [00:27:05]: This makes sense. Um, related: do you use Claude, ChatGPT, any of these regularly in your workflow? Are you team Claude?
Sebastian Raschka [00:27:13]: So yeah, it depends. I have both and I flip back and forth between them. I don't know - I'm probably not really good at prompting, but sometimes I get better results with one over the other. I wouldn't say one is better than the other; they're just different. I would say I'm using...
Nathan Lambert [00:27:31]: That's kind of what I think. It's important. What do you think of both of them? I think it's good for people to know this, because it takes some practice to understand, and most people don't use both. Yeah.
Sebastian Raschka [00:27:43]: I would say when I use GPT-4, I must say I use the - it's called legacy now - the original GPT-4; I don't like the mini and 4o versions. And for Claude, I use the opposite: not the new one, but the previous larger one, the slower one. And for me, coding-wise, it's kind of weird, but most of the time I like GPT-4 better for code stuff. But I think what's better with GPT-4 is that it's a bit more up to date with knowledge, I think. But Claude is, I think, better when you say “improve my writing” or something like that - it has fewer of these weird words, like “delve into” something. It's a bit more natural, I would say, but
Nathan Lambert [00:28:34]: also not always.
Sebastian Raschka [00:28:34]: I agree.
Nathan Lambert [00:28:36]: It's like it has a bit more flair and a bit more unpredictability. So I use Claude on my phone, but I've tried to use Claude for information transformation tasks, like LaTeX or taking data out of a table, and sometimes it just refuses. I do research on AI safety, like safety and bias, so if I put anything into Claude where I'm trying to transform that data, it just says no - it's like, “I can't comment on a mean story.” Whereas OpenAI will just do it. And the processing that OpenAI does is very good. So I actually canceled my GPT subscription when I started using Claude, but I kind of regret it now. I'm like, oh, now I need both, which is a little annoying. Yeah.
Sebastian Raschka [00:29:16]: It's like, yeah. So one thing is what is interesting though, is we, we're talking about GPT-4 and Claude here, but we haven't even mentioned Google Gemini.
Nathan Lambert [00:29:24]: I don't know.
Sebastian Raschka [00:29:24]: I personally, I tried the early versions. I don't want to say the newer versions are not good. I just haven't tried because I didn't need to, but do you have experiences with Gemini
Nathan Lambert [00:29:34]: or? I was using Gemini in the search preview. So if you have the Google app - I'm recording this on video - at the top you can click on Gemini, which I was doing for a while just to play with it. But I don't use it on the web. They do have a nice interface that looks exactly the same, but somehow I got grandfathered into AI Studio, which I use for things like: if I record a podcast, I upload the podcast and say “write chapters” or something. And it actually works, which is pretty cool, to be able to upload an hour-long podcast. But for whatever reason, the Google interface, other than the Google app, hasn't stuck for me. And I think that's the biggest limitation. And I use it more in a Google-y way, so I'm not as perceptive to style. I see. I see.
Sebastian Raschka [00:30:20]: So also I'm curious: I just saw yesterday that Apple's on-device AI is a bit delayed, I think. And that's an interesting one - we will see how this will work, because these will be, I think, also smaller models. And for me, it's like, I never really care about speed for these things. I just want the best possible models. So this is also why I was a bit disappointed when GPT-4o came out and GPT-4o mini came out. It's like, ah, I don't really care about whether it's faster or not. I just want it better. You know, I want better quality. I don't know, maybe it's just me.
Nathan Lambert [00:30:53]: I think for building applications, speed is really good. So I have a few friends that run startups that are heavily built on language models and they have a similar stack to perplexity, which is like the user passes in a query that have a primary language model request and they have a series of feedback requests or small requests on top of that. So when you're concatenating multiple requests, like speed is extremely important. And when you're like selling a product, speed is extremely important. But if you're like tinkering and trying to learn, it is much slower. It's true. Yeah. Yeah.
Sebastian Raschka [00:31:19]: It's like the real world - sorry, not real world, but the individual user - using it as a tool in everyday life, versus really building an application based on an API. That makes sense.
Nathan Lambert [00:31:32]: Yeah.
Sebastian Raschka [00:31:32]: So there are two different use cases.
Nathan Lambert [00:31:34]: Yeah. Yeah. I think we're kind of talking about style. I have a section on RLHF here. I just wanted to ask - you spend so much on AI education - what do you think is most confusing to people about this whole post-training thing, which is instruction tuning, reinforcement learning from human feedback, other safety modules like adding a filter, and stuff like this? I'm really on the bandwagon of trying to convince people that RLHF is deeply tied to style, which is how this discussion of Claude versus OpenAI and Google and all these things goes. And I don't really know how to portray that from an educational, technical point of view. So I'll do an analysis of the paper, and I'll do DPO and scores and all these things. But at the same time, for most people reading my articles, the most important thing is probably to know that OpenAI is really smart about their style, and that's why they're so high on Chatbot Arena. I've written about it a couple of times. I have another article in the drafts, which is essentially why GPT-4o mini, like, broke Chatbot Arena. Because everyone's so upset that it scored so highly, but it's not that surprising if you look at historical events.
Sebastian Raschka [00:32:39]: So it's basically exploitation of the benchmark almost you're saying or like the benchmark
Nathan Lambert [00:32:45]: is focused on style, and it really penalizes refusals. Like, I get refusals when I use Claude, so it's definitely going to be downweighted. And OpenAI is really good at this - this is what they've been doing for a long time. But I don't really know how to educate on this. Have you thought about it? There was a question on Twitter of why you didn't include RLHF in your latest book. It was kind of a joke, but I took it out.
Sebastian Raschka [00:33:09]: Well, yeah, I can maybe answer that. It's in the works. So there are multiple reasons. One is that there are page limits per chapter, and originally it was meant to be in chapter seven. It got way too long - even without it, chapter seven is the longest chapter already. And what that chapter is, is fine tuning.
Nathan Lambert [00:33:29]: Oh, sorry.
Sebastian Raschka [00:33:30]: Instruction fine tuning. Yeah, I didn't call it instruction fine-tuning; I called it fine-tuning to follow instructions, which was originally meant to have both, but then it got too long. And the other thing is, you know, one book chapter takes about two months, and a lot of people really want the book before the new semester starts. So there could be another chapter on it, but it would be
Nathan Lambert [00:33:54]: another two months.
Sebastian Raschka [00:33:54]: And that, I mean, it's not really an excuse, but the other one is that I was not happy with the results. And this is a very mathy topic. I was like, okay, I have this book, which is very clear and hopefully makes a lot of sense, and then I have this really super complicated chapter at the end. I don't know if that's very satisfying to read or not.
Nathan Lambert [00:34:15]: Yeah.
Sebastian Raschka [00:34:15]: Where it's like, so you read this book, everything makes sense. And then it comes to this huge...
Nathan Lambert [00:34:19]: Why is RLHF so much mathier? Like, I know there are a couple of core equations. The core equation is the RL optimization step, which is maximizing expected reward subject to a penalty. And where does most of it come from? Compared to pre-training, which is like one equation - that is also one equation, but there's a lot of downstream stuff, I'm guessing. Yeah.
Sebastian Raschka [00:34:41]: I think it's explaining a bit about reinforcement learning. I mean, you don't really have to explain reinforcement learning in the classic sense, maybe, but there's still the KL divergence and penalties and reward margins. There are lots of things happening at the same time. And the code is also very long, especially if you want to track the rewards and stuff. So for my instruction fine-tuning chapter, I'm using exactly the same training function I implemented in the pre-training chapter.
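For reference, the "one core equation" being gestured at here is usually written as KL-regularized reward maximization, with $\pi_\theta$ the policy being trained, $\pi_{\text{ref}}$ the frozen reference model, $r$ the reward model, and $\beta$ the penalty weight:

```latex
\max_{\pi_\theta} \;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \big[ r(x, y) \big]
\;-\; \beta \, \mathbb{D}_{\mathrm{KL}}\!\big[ \pi_\theta(y \mid x) \,\Vert\, \pi_{\text{ref}}(y \mid x) \big]
```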
Nathan Lambert [00:35:14]: And it's really nice.
Sebastian Raschka [00:35:14]: It's like, well, you can actually reuse everything. It's, it fits together.
Nathan Lambert [00:35:18]: Yeah. Like what we're doing on OLMo - we can baseline our instruction fine-tuning in our fine-tuning code base, which also has some RL things, and in our pre-training code base. So it's nice to have both, but that is definitely why it's simpler. And the RL side is only getting worse in my mind, I think. We've seen that Llama has used rejection sampling for two iterations, and there's no public implementation of rejection sampling - at least not one public enough to know that people have actually trained models with it - which is the idea of ranking completions with a reward model and then running instruction tuning again on the top completions.
Sebastian Raschka [00:35:54]: I think also in the recent Llama 3.1 paper, they used rejection sampling with DPO, for example. They didn't do RLHF with a reward model for the optimization, but they used the reward model for the rejection sampling. And yeah, so I must say, I have the code for DPO. I wanted to do DPO because it's also more resource efficient; you don't have to train that reward model for, let's say, the book. But I was not really happy with the quality of the output yet. So I must say it's like, okay, this is not helping the instruction fine-tuned model. And I think it's a general thing where, I mean, you might correct me if I'm wrong here, because you are the expert in RLHF, but for me it's like an optional thing: unless you need a specific style or need to deploy something in a safe manner, it's maybe not giving you the best results. If you need a private model that just runs on your own computer and gives you correct answers, I don't think DPO or RLHF will make the answers more correct. They will just change what they look like.
Nathan Lambert [00:37:01]: And yeah, I mostly agree, especially on what we have in public implementations. The public implementations are really good at improving on something like AlpacaEval. But if I'm training a model that I actually want to use, don't worry about AlpacaEval. I think I'm the most annoying person internally running these experiments because I just get so annoyed when only AlpacaEval goes up, and I'm like, that has made the model worse. I've been building internal demo tools, which is just making Gradio better and showing how to use vLLM for serving. But a lot of the models we put out for research are really, really annoying to talk to. You put "no yapping" or "just be concise" in the prompt and it doesn't do anything. So a lot of the open datasets, and this is something that Nemotron and Llama 3 have shifted to, is this new evaluation, IFEval, which stands for instruction-following eval, which I think is a great one. So it's like, write a response with less than 300 words or something, and it has these verifiable constraints. And this is something the Nemotron report showed, that doing this fine-tuning really unlocked a lot more performance in the DPO stage. So I'm hoping that we start to get more evals than just AlpacaEval that are helped by RLHF, and that'll help the whole ecosystem come forward, because it is in a kind of young, rough state right now. Yeah.
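To make the "verifiable" part concrete: IFEval-style constraints are ones you can check with plain code. A toy version, with made-up constraints purely for illustration, might look like this.

```python
# Toy sketch of IFEval-style verifiable constraint checks (illustrative only).

def under_word_limit(response: str, max_words: int = 300) -> bool:
    """Check 'write a response with less than 300 words'."""
    return len(response.split()) < max_words

def contains_keyword(response: str, keyword: str) -> bool:
    """Check 'include the word X in your answer'."""
    return keyword.lower() in response.lower()

def score_response(response: str) -> float:
    """Fraction of verifiable constraints satisfied (the constraint set is made up)."""
    checks = [
        under_word_limit(response, 300),
        contains_keyword(response, "summary"),
    ]
    return sum(checks) / len(checks)
```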
Sebastian Raschka [00:38:21]: And also one last thing about this topic: for me, like you said, that last sentence is kind of one of the reasons. I was like, okay, if I include something on DPO as the last chapter, I don't know if it's still going to be used next year, or if one of the many variants, ORPO and KTO, takes over. And right now, I mean, Llama 3.1 used DPO, which is a big endorsement. But to be honest, I'm not sure if this exact variant is here to stay.
Nathan Lambert [00:38:47]: So I think DPO is here to stay. DPO will be a canonical example, much like PPO, but I think the exact things people are using will go away. PPO has stood the test of time over multiple eras of RL. So I don't think people use it in its exact form, but people are always looking at it. And same with DPO, just because DPO is so simple. This is one of the best getting-started-with-RLHF exercises: taking the Hugging Face trainer and modifying it to use the DPO loss, because you can reuse most of the infrastructure for batching and stuff like this, and then add that loss function, which is a few lines of code. That's the entry point to doing RLHF implementations. When I interview people, I make sure that they have looked at this DPO loss function before. And if they haven't, I'm like, I don't know if you're in the weeds enough. I feel like you should look at this.
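Since the DPO-loss exercise comes up here, a minimal sketch of the loss itself is below, assuming you already have summed log-probabilities of the chosen and rejected responses under the policy and the frozen reference model. This is the handful of lines Nathan mentions, not a full trainer, and the function signature is illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss.

    Each argument is a tensor with one value per example: the summed log-probability
    of the chosen or rejected response under the policy or the reference model.
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_ratio - rejected_ratio)
    # -log sigmoid(logits), averaged over the batch
    return -F.logsigmoid(logits).mean()
```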
Sebastian Raschka [00:39:37]: And if you are listening to this and you are about to get interviewed by Nathan, I will hopefully have a tutorial on DPO, on implementing it from scratch, by next weekend. This weekend I actually used Llama 3.1 to make a synthetic data set for that and got much better results. So it looks good enough to probably upload it next week. So, nice.
Nathan Lambert [00:39:58]: Okay. Let's shift gears into AI research and AI education, which is, I think, the thing that you have some of the most insight into. So you run the Ahead of AI newsletter. I wasn't originally reading it closely when I subscribed, but now I almost always skim through to kind of see what papers you uncover. I'm pretty interested in how you select papers, how much you actually prioritize reading papers and why, and just any advice for people, because it's hard to sit down and do this. And speaking for myself, sometimes writing is how I force myself to read some papers. I don't know if you're in the same boat, but what is your worldview around reading AI papers these days, skepticism or excitement, everything?
Sebastian Raschka [00:40:42]: Yeah, that's a big topic. So I must say, I look at more papers than I actually literally read. I mean, I look at the abstracts and the titles, and that's like a huge funnel, a selection process. I was an arXiv moderator for the machine learning section a few years back, and that got me into the habit. So how it worked was basically, and maybe it's useful because some people complain when
Nathan Lambert [00:41:06]: How did someone become an arXiv moderator? I didn't know that it was like a community position.
Sebastian Raschka [00:41:12]: So that was originally Tom Dietterich. He was doing it by himself and he was looking for people to help him with that, because, as you mentioned, there is an ever increasing number of papers. And so how it works is essentially that when you submit a paper to arXiv, you select the categories. But a lot of people select, let's say, not the correct categories, I wouldn't say not correct, but the preferred categories, because, yeah, everyone wants AI and ML.
Nathan Lambert [00:41:39]: It's like ML, AI, and then everything else. Yeah.
Sebastian Raschka [00:41:42]: And AI on arXiv is interesting. It's more like the classic AI. It's not LLMs; it's more like symbolic AI, that kind of stuff.
Nathan Lambert [00:41:51]: What do you think the difference between, or like as an educator, how do you define AI and machine learning? This was also one of my favorite interview questions to like see where they're at.
Sebastian Raschka [00:42:00]: Well, right now I would say I go back and forth on that. Right now I would say AI is this big umbrella thing where you have deep learning and machine learning as subfields. But if you think about it, if you consider a logistic regression classifier, it is essentially machine learning. And if machine learning is the subfield of AI, you would say, okay, then logistic regression must be AI. But is like classifying iris flowers really AI? I don't know. So today I would say
Nathan Lambert [00:42:28]: I also think about search as AI. Yeah. Like, yeah.
Sebastian Raschka [00:42:31]: Like, yeah. So there's the good old fashioned AI. So I would say with AI you have both, you have the machine learning and deep learning branches, but you can also implement AI with if-else statements, I guess, you know. So that's how I would define AI. But I think nowadays when people talk about AI, they mean specifically gen AI, like generative AI models: LLMs, Stable Diffusion, that type of stuff. But yeah, so the arXiv thing. Just briefly, basically in the background it's also using machine learning or NLP to detect, based on the title and the abstract, whether the category actually matches. And if there's a mismatch, or in general as a moderator, you go through them and, oh, this looks good.
Nathan Lambert [00:43:17]: This looks good.
Sebastian Raschka [00:43:17]: This looks good.
Nathan Lambert [00:43:18]: They started exposing this to the user. So I submitted a paper recently under ML and it was like, this looks like language. And I've gotten papers stuck in moderation before, so I was like, I'm always going to hit accept if they tell me it might be in the wrong category, because arXiv moderation is a black box that you don't want to get stuck in. No, no, like as a user, but I understand the service it's providing. So it's good to expose that to the user. And if anyone's listening, just click yes. It's not worth delaying your release getting stuck in moderation, and it helps arXiv out. Yeah.
Sebastian Raschka [00:43:50]: And so just the last thing on that is, by default, everything gets accepted. However, sometimes something gets flagged: if there's duplicate content, if it doesn't look like a paper, sometimes people submit one-page blog posts or something. So there is this thing where sometimes there are also false positives and then it gets stuck. But long story short, that got me into the habit of reading the titles. And that's what I still do. Also for my Ahead of AI newsletter, I just look through the titles and select. How have titles changed?
Nathan Lambert [00:44:21]: Titles have changed a lot, though. I feel like they used to try to be accurate, mostly descriptive. Descriptive, right? And now they are more of a storytelling thing than descriptive. I think it's the right way to tell it.
Sebastian Raschka [00:44:36]: At least we don't have the "is all you need" titles anymore. I feel like this finally went away, but yeah, you're right. It's more...
Nathan Lambert [00:44:43]: It ended with Rylan Schaeffer's "Pretraining on the Test Set Is All You Need." Yes. Did that make it onto arXiv? It did.
Sebastian Raschka [00:44:51]: I think I also had it featured in my newsletter one time. Or not featured, but at least mentioned. And so how I select papers is also often selfish. I read or select papers for the newsletter that I find interesting. And I think this matters for education too: when people ask me how I would suggest doing things, I think the most important thing is to talk about and work on things you are interested in. It would be really hard to do a good job if it's a topic that is not interesting to you. For example, I don't know, Rust is interesting, a very important topic, but I'm not into it. So I don't try to, let's say, make videos or content about it.
Nathan Lambert [00:45:35]: Yeah.
Sebastian Raschka [00:45:36]: So it's like, I think if there's something you're excited about, I think it comes almost naturally that you want to talk about it. So in that sense. So the newsletter, I almost, it's weird, but I almost write it for myself. It's like, I find it interesting.
Nathan Lambert [00:45:49]: How much time do you spend reading versus writing when you're reading these papers and writing a blog post? I'm guessing a lot of it is just the natural process of synthesis that you put into the newsletter. From my read, it's not like you're doing a ton of scaffolding and editing after the fact, which seems similar to what I do.
Sebastian Raschka [00:46:09]: Yeah, you're right. I don't spend too much time on it, in the sense that I wish I could, but I have a full-time job. It's literally just a weekend project where I aim for one newsletter per month. Of course, I would like to do more, but there was also a book to write on weekends, or sometimes I'm doing videos. It's about keeping it fun, you know, where it's like, okay, this is not a chore, this is something that is supposed to be fun. In that sense, I read a paper, then I take notes, then I collect them and spend maybe half an hour or an hour to polish them up a bit or make some figures. And that's it per paper, I would say. And I also don't write the whole newsletter in one day or one weekend. It's really spread over the month. I read a paper, think, oh, this is an interesting one for other people, and write it up, basically. And this way I collect material over the month and then.
Nathan Lambert [00:47:00]: Yeah. What motivates you to work on this stuff? Is it purely like education? Because I, in some ways relate to that. I've been in that mode before.
Sebastian Raschka [00:47:09]: Yep. So if you have noticed, I don't have any sponsorships or something.
Nathan Lambert [00:47:14]: Never done that. Respect.
Sebastian Raschka [00:47:16]: I will never say never, but it's not something I do. It's really just a hobby. And I do like discussions that come around it. There's a certain satisfaction that if you put it out, it helps others and people tell you positive things about it. It's kind of very gratifying. I don't know. There's like a reward in a sense. And what's also cool is there are a lot of people. It's like being part of the community and exchanging information because there are also a lot of people who sometimes know something I don't know. And this is really, I think, really cool. You write about something and then someone, Hey, have you seen this? This seems like it's taking it to yet another level. Or this is the same idea. It's even better or something. And this is super cool where you get this effect where you learn by doing this, actually, because there's always someone who knows a bit more than you do in a specific area. So, yeah.
Nathan Lambert [00:48:07]: Yeah. I feel like it's increasingly important these days and increasingly impactful, because so much of research has become closed off for business reasons, so there are fewer people doing more of the work. I don't like it. I always feel like people don't realize how few people are informed and share on any given topic like AI research. If you take away three people... I've yet to find people that just tweet the same random RLHF crap that I tweet. It's not that I do it because I just say random things, but there are not that many people that represent each of these corners. Ahead of AI, I think, Jack Clark's Import AI. I should have him on the pod. I think I've talked to him a few times. He's great to talk to. And his is the same thing. It's these few people that are disseminating AI information, which is crucial for policy and future angles. Have you ever gotten criticism that your work is accelerating AI and that you are a safety risk? I've gotten some critical emails that are like, you shouldn't talk about this.
Sebastian Raschka [00:49:07]: Yeah, I've more gotten emails about the fact that I talk about LLMs is not good because LLMs violate copyrights. I mean, not that I do it, but that other people's LLMs do it.
Nathan Lambert [00:49:21]: And I'm happy that I haven't had this in my audience very much, but it seems like this is one of the challenges of having a tech audience: you cultivate it in one of a few directions, and one of them is this "all data for language models is theft" thing. And I just don't know how to deal with it, because I disagree, but the people who believe that normally aren't receptive to it, which is really hard. It needs to be played out. Yeah.
Sebastian Raschka [00:49:47]: For my book also, just to make extra sure, all the data I use there, so the pre-training data, is public domain data, like a book from Project Gutenberg. And for instruction fine-tuning, I created my own data set basically, just to avoid any issues, you know.
Nathan Lambert [00:50:06]: Did you write it by hand?
Sebastian Raschka [00:50:06]: No, actually, I used part of an LLM and did some by hand.
Nathan Lambert [00:50:12]: Yeah.
Sebastian Raschka [00:50:12]: So it's a great exercise.
Nathan Lambert [00:50:14]: Yeah. Yeah.
Sebastian Raschka [00:50:15]: And for the synthetic one, I use Llama 3.1 now too. I mean, yeah, you can tell me also about that a bit. That's maybe interesting for the audience, how to generate a preference data set, because there are multiple ways. Naturally it's crowdsourced, right? So you ask people: you have the model generate two answers, or have flavors of the model generate answers, and then, oh, which one do you prefer? But that's not really scalable. And so you could technically do the same thing with an LLM. You could basically have the LLM generate a more polite version, because I think LLMs are very good at, even the small LLMs, the open-source 7B models, are good at rephrasing things or evaluating things. They're not necessarily good at generating the answer in the first place if they don't have a reference, but given a reference, I think it's super useful to use open-source LLMs in that sense.
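A rough sketch of the rewrite-based preference data idea Sebastian describes is below; `llm` is a stand-in for whatever chat or completion call you use, and the prompt wording is just an example.

```python
# Sketch of building a synthetic preference pair from a reference answer
# (illustrative only; `llm` is a placeholder for your generation call).

def make_preference_pair(llm, instruction, reference_answer):
    """Use an LLM to rewrite a reference answer (e.g. more polite), then store
    the rewrite as 'chosen' and the original as 'rejected'."""
    prompt = (
        "Rewrite the following answer to be more polite, without changing "
        f"its meaning.\n\nAnswer: {reference_answer}\n\nRewritten answer:"
    )
    polite_answer = llm(prompt)
    return {
        "prompt": instruction,
        "chosen": polite_answer,       # the preferred style
        "rejected": reference_answer,  # the original phrasing
    }
```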
Nathan Lambert [00:51:07]: I'm surprised that this hasn't caught on sooner, but I think it's starting to catch on. I think in the Meta report, they essentially have edits. So they make their preference pairs as edited better than chosen, better than rejected, and you can create multiple pairs by binarizing. There are a few research projects that have done this: Constitutional AI is popular, but that hasn't really been reproduced. One of my collaborators slash friends at Synth Labs, Louis Castricato, did a paper on the pink elephant problem, which is using revisions to get the model to not just say whatever is in the question if you ask it not to. We did a follow-up work that's out literally today on self-directed synthetic dialogues, where you have the language model generate a plan, then it follows the plan, and then you can also do revisions on it. I think Nemotron did this with prompts too. So it's really getting going, but it's something that took longer than I expected. Then there's the kind of question, this is too big of a topic to go into, but it's like, how do you use GPT-4 feedback? Are your completions from two different models, or the same model with different generation settings? How do you use humans? I think the labs are using humans for preference data because it eliminates some of the problems in language modeling. And that's one of the most impactful research questions in alignment: we can't afford the $1 to $10 million dataset, so how do we do this? And that's what we're starting a project to do at AI2 right now. And it's a big open question. I don't know where it'll go. I don't know how far we can reproduce the Llama 3 alignment methods. Yeah.
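The "edited better than chosen, better than rejected" ranking Nathan mentions can be binarized into standard preference pairs roughly like this (the field names are arbitrary, not from any particular dataset format).

```python
from itertools import combinations

def binarize_ranking(prompt, ranked_completions):
    """Turn an ordered list of completions (best first) into pairwise preference
    examples; edited > chosen > rejected yields three (chosen, rejected) pairs."""
    pairs = []
    for i, j in combinations(range(len(ranked_completions)), 2):
        pairs.append({
            "prompt": prompt,
            "chosen": ranked_completions[i],   # ranked higher
            "rejected": ranked_completions[j], # ranked lower
        })
    return pairs

# Usage: binarize_ranking("Explain DPO.", [edited, chosen, rejected]) produces
# (edited, chosen), (edited, rejected), and (chosen, rejected).
```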
Sebastian Raschka [00:52:46]: So I would say the Llama 3.1 paper, or the Llama 3 paper, it was like a 93-page paper
Nathan Lambert [00:52:52]: and it was great.
Sebastian Raschka [00:52:52]: I love it. It's like a lot of detail, but on the alignment part, I feel like I wish there was more information
Nathan Lambert [00:52:58]: about it.
Sebastian Raschka [00:52:58]: Even Llama 2 had more information, where they showed what the improvement actually was over the different stages when they added to supervised fine-tuning.
Nathan Lambert [00:53:05]: So I'm talking to Ross Taylor tomorrow, and I'm going to ask him this specific thing. On Latent Space, Thomas Scialom, one of the leads, said that most of their gains come from RLHF rather than SFT. So I think the open-source community is over-indexed on instruction fine-tuning because it is accessible and we have the data. And this is one of my ways of trying to guide the community by doing things: go do RLHF. Don't worry about instruction tuning data sets, we'll just leave those the same; go find more preference data and keep playing with this. And don't worry about the DPO methods. Just literally go make preference data and keep trying to train things. Don't implement a new loss function.
Sebastian Raschka [00:53:48]: A practical question to an expert like you: how good is a preference data set actually if you just download it? Both the chosen and the rejected answers are not generated by your model, right? And if you have a model and you use responses that the model has basically never seen before, does this actually work, or is it even advisable?
Nathan Lambert [00:54:11]: So the two most popular preference data sets in the open right now are UltraFeedback and Nectar, or variants of them. Both of those are collected from large suites of other models. And there haven't been data sets or papers that have trained really good models using on-policy preference data from the model you're training. I think that's a question we need to answer: how do we get UltraFeedback-level results with on-policy data? Because all the labs are using on-policy data. I wrote about this in an article or two. I have a theory that UltraFeedback and Nectar, these general data sets, work so well because within them there is something close enough to your distribution, and you don't have to get it quite right. It's just a gentler, more uniform learning signal for the models doing preference tuning. But we don't know. That's something that I want to answer.
Sebastian Raschka [00:55:02]: Yeah, this is an interesting one. I would also like to know the answer, because that is one thing where I got a bit stuck when I was writing this DPO chapter with smaller models. I think bigger models also hide these weaknesses a bit, because they have been trained on so much data that, like you said, it's kind of in distribution already. But if you train a small model, it would be out of distribution, right, if you use someone else's preference data set? I noticed even something simple: you train a model on one simple instruction data set, let's say something like Alpaca, and then, just to have something visual, say you want the model to generate Yoda speech, where every sentence is reversed. But the model has never seen sentences like that, unless it was maybe in the training data. In that sense, it doesn't work well at all, because you are asking the model during preference tuning to write sentence structures it has never grammatically written before. What I found is that it's much better if you, I don't know, say be more polite, or prefer a more polite answer, because it uses the same grammar. So things like that, basically. And yeah.
Nathan Lambert [00:56:08]: Yeah, I think that's a smart approach. It also might be why learning rates are getting so low. All the learning rates for DPO and things have been going down in the fine-tuning space, and it might just be because, distributionally, we're far off from the model. There's the other theory that the model is really, really done training, so they get it to a really good optimum and you don't want to move it from there. But it might just be that our data sets are in the wrong space. Yeah.
Sebastian Raschka [00:56:32]: So you try to be gentler with a lower learning rate.
Nathan Lambert [00:56:36]: Yeah. All of this stuff changes fast, but not fast enough. This UltraFeedback data set we were talking about came out last October, so we're almost 10 months in and it's still the state-of-the-art data set. And it's only like 50,000 examples. So there's so much opportunity for someone at this level to go build data sets, if anyone is watching. Because I think we're so far off from where we could be, just because people don't know how to make good preference data sets.
Sebastian Raschka [00:57:02]: Well, now we have Llama 3.1, 70 and 405 billion, that allows us to do that, right?
Nathan Lambert [00:57:08]: We'll see. Yeah. I was wondering, this is a change of topic, but how do you think like, do you think AI will change our jobs in writing? How do you see AI coming for this kind of educational space? Like how much of what you do as an educator could be taken in N years by AI?
Sebastian Raschka [00:57:26]: Well, I think, of course it will automate away some things, because nowadays you would ask a model something instead of searching for it and reading it on a website. But I do think, for the creation process, you still need a human to put it together well. Because, I don't know, I think LLMs are nowhere near generating a whole article that is actually good, I would say. They can generate the right things, but you still have to put it together. They can generate good blocks of text or something like that, but you maybe become more like the editor then, in that sense. But I'll try this.
Nathan Lambert [00:58:09]: Also like, do you write, do you have AI write any parts of your articles? I'm so scared for like moral reasons to have any AI writing in it. I'm like, it's just a slippery slope. It feels like I could get addicted. Yeah.
Sebastian Raschka [00:58:21]: So I don't have it write anything from scratch, but I do sometimes use it. Especially, I mean, I'm a non-native speaker and some days I have a harder time than others making things sound right. It's like, okay, this is what I want to say, but it doesn't sound right. And then I ask, can you reword this with a focus on XYZ or something? So it's basically like a thesaurus where you find similar words, you find similar sentences, just rewording things, these types of things. But one weakness it has, now that you mention it, one thing LLMs can't really do, is generate figures. You know, maybe that's coming.
Nathan Lambert [00:59:01]: I don't know.
Sebastian Raschka [00:59:01]: You could probably do that with TikZ, the LaTeX thing, at some point, but right now it's nowhere near being able to generate any useful figure. And I think learning is very visual, too. If it's just text, I think it would be really hard to learn anything.
Nathan Lambert [00:59:17]: Yeah.
Sebastian Raschka [00:59:17]: So you can, of course, but I do think, you know, there's a saying, an image is worth a thousand words, right? So in that sense, you still need someone, you know, the mastermind behind an article, even if it's just an editor. I don't think LLMs can replace everything, at least. And we'll see. I mean, we just don't know how much better, let's say, GPT-5, as a placeholder here, will be than GPT-4, you know? So maybe if it's saturating, who knows, right? Maybe it will be five more years until we get into scarier territory in terms of replacements, you know? So we'll see.
Nathan Lambert [00:59:55]: Yeah. I mostly avoid the agent word, but it does seem like there's enough culture and cultural investment in the Bay Area and among tech executives to do something. They're going to get to something that is triable, which I think is mostly automatic Google searching and more code execution, which is going to be interesting, but I have such wide expectations of what it actually means. That's probably the next big shift. I think Llama 3.1 is probably, right now, leading the year in terms of AI news. This recent DeepMind thing on the math might be a better example of what's really hot news. I need to go read more about it. There are some long write-ups on the qualitative differences between the AI math and the human math and the different directions they're going. So that's kind of what I want to read about. But it'll shake things up. We're multiple years into this fast phase. It's not exactly new at this point. Yeah.
Sebastian Raschka [01:00:57]: Last thing on that is, I do think LLMs make good assistants in the literal sense. One thing I use it for with my newsletter is, at the end, I have a list of all the papers I have found interesting, like 30 to 50 papers usually. And usually, by hand, I edit in the author names, like the last names of the first three authors. And now I use an LLM to go to the website and get the names of the authors, basically. And this is where it saves a lot of time. You could do that without LLMs, you could write some code to do that, but it would probably take me half a day to write because I'm not good at this web-scraping type of code. And I think in that sense, it is actually a useful assistant for certain things like delegating actions.
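For what it's worth, the author-name step can also be done without an LLM through the public arXiv API; a rough sketch, with the paper ID left as a placeholder:

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def arxiv_authors(arxiv_id: str):
    """Fetch the author names for one paper from the public arXiv API."""
    url = f"http://export.arxiv.org/api/query?id_list={arxiv_id}"
    with urllib.request.urlopen(url) as resp:
        feed = ET.fromstring(resp.read())
    entry = feed.find(f"{ATOM}entry")
    return [a.findtext(f"{ATOM}name") for a in entry.findall(f"{ATOM}author")]

# Usage: arxiv_authors("<arxiv-id>") -> ["Author One", "Author Two", ...]
```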
Nathan Lambert [01:01:44]: delegating actions. I think it'll keep creeping up. I don't expect their usage for those things to go down because they already are so useful. And the little coding things, the hacking data together, the automatic searching, people aren't going to want to stop using that. I don't know if it supports the whole valuation we have, but it's fun to be in a space where we get to try new things. As a computer nerd, it's really fun to have a new type of software that we can try all sorts of things in our workflow. And I think that's underrated. So I don't know. Thanks for coming on. Any last things you want to discuss?
Sebastian Raschka [01:02:19]: Yeah, I just wanted to say thank you for the invitation and I hope you keep creating these awesome newsletters. I think this is much needed because there's so much hype, like you said previously, it's
Nathan Lambert [01:02:32]: creeping up on us.
Sebastian Raschka [01:02:32]: There's a lot of, let's say, overvaluation and praise. And I think something that cuts through this is much needed, like this honest, straightforward, no b******t content. So yeah, I hope you keep creating that. It was fun to chat. And yeah, to everyone out there, I think what also keeps us motivated is the awesome community, that people give feedback and discuss things and bring things up. And I think without people giving us feedback, we probably wouldn't be doing this, because it's a lot of fun to be in that space, I must say. Yeah, it's fast moving, but there's always something interesting every day.
Nathan Lambert [01:03:14]: Yeah. Yeah, this is really interesting. We covered a lot of kind of low level of just what it's like trying to use language models on the day-to-day basis in July of 2024. So thanks for coming on. And I'm sure we'll talk soon. All right.
Sebastian Raschka [01:03:27]: Yep, it was nice meeting you and see you then. Bye.
And how to understand Llama 3.1's results.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gpt-4o-mini-changed-chatbotarena
0:00 GPT-4o-mini changed ChatBotArena
3:23 Llama 3 in the arena
5:13 Partial solutions and next steps
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_013.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_019.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_021.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_025.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_039.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_043.png
Defining the future of the AI economy and regulation. Is Meta's AI play equivalent to the Unix stack for open-source software?
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/llama-405b-open-frontier-model
00:00 Llama 3.1 405b, Meta's AI strategy, and the new open frontier model ecosystem
01:37 Meta's open frontier model
03:51 Zuckerberg's vision for open-source AI (vs. reality)
08:35 Does the Llama 3.1 license support open-source AI?
12:55 Different futures for regulating frontier models
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_008.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_010.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_015.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_018.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_050.png
SB 1047, AI regulation, and unlikely allies for open models
The rallying of the open-source community against CA SB 1047 can represent a turning point for AI regulation.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sb-1047-and-open-weights
00:00 Introduction
01:53 SB 1047 and targeting regulation
07:57 Unlikely allies of "open"
12:05 What would I regulate today?
I Switched to Claude 3.5
Speculations on the role of RLHF and why I love the model for people who pay attention.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/switched-to-claude-from-chatgpt
00:00 I Switched to Claude 3.5
03:57 Product priorities
05:15 RLHF's peak?
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_016.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_018.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_020.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_022.png
I’m really excited to resume the Interconnects Interviews with Dean W. Ball from the Hyperdimensional Substack (you should subscribe). We cover the whole stack of recent happenings in AI policy, focusing of course on California’s bill SB 1047. We cover many, many more great topics here including:
* What will happen in the case of a minor AI disaster,
* If Meta will release the 405B model, and why,
* The status of Chinese open-source AI,
* Training on model outputs,
* Anthropic’s recent strategy,
* What scaling laws actually mean,
* Creating content and shifting the needle of the AI discourse.
Watch the video on YouTube below or listen on podcast players here.
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Chapters
* 00:00 Intro and Welcome Dean Ball
* 02:44 The Origins of California Bill SB1047
* 08:56 The Evolution of Bill SB1047
* 13:00 How SB1047 Affects Fine-Tuning
* 20:00 The Future of Bill SB1047
* 21:58 The Impact of AI Disasters
* 29:02 Meta and its 400 billion Parameter Model
* 32:25 Open Source AI and the Chinese Market
* 37:37 The Future of Open Source AI
* 43:35 Synthetic Data, Licenses, and Future AI Development
* 45:18 Anthropic's Approach to AI Safety
* 50:46 Scaling Laws
* 53:01 The Role of Audience in Influencing AI Policy
Links
* Dean’s series on SB-1047: one, two, and three.
* Other AI policy Substacks: Jural Networks and Intersecting AI
* Senator Scott Wiener. CA SB 1047 itself.
* Another post on CA SB 1047 from Answer AI.
* Situational Awareness by Leopold Aschenbrenner.
* Lina Khan on her P(doom) and warnings in support of open-source.
* Ben Thompson’s Framework for Moderation in technology.
Transcript
Nathan Lambert (00:00:01): Hello, and welcome back to the Interconnects interview series. It's been a few months. I'm really excited for this one. We're here with Dean Ball, who is a research fellow at the Mercatus Center. He works on AI policy right now, and he's the author of the Hyperdimensional Substack, which is kind of the AI policy Substack that emerged when I was spamming into the void that we need to have some good AI policy newsletters out there. There are a couple more that I could add to the show notes of this that I'm aware of from friends that used to be at OpenAI, friends at AI2, so I'll add some of those as well.
But in this kind of summer slowdown of releases, I thought it would be a great time to kind of revisit some of the core themes on AI policy, open versus closed, kind of things that I'm wondering about in the future that I know are coming that are looming AI disasters, what some of these closed source companies are trying to do in the policy space. I think this is the sort of interview that we could probably do multiple times. I think we've started talking in DMs and it's clear that we're aligned on a whole bunch of things. We read each other's work. I think this should be kind of fun and I'm just happy to do this.
I think the core of this interview I'll give you a chance to introduce yourself if you want, if you want to add anything else that I missed, and then we're just going to go into this California bill SB 1047. Probably talk about this. I'll ask you about the story of how it happened and then where we're at now. And I think that'll kind of lead into a lot of interesting debates. So do you have any background you want to add that makes you an interesting person in the AI space? Or is it just that there's so many things that need to be done in AI that if you're focused, you can kind of have an impact in an area?
Dean W Ball (00:01:44): Yeah, I mean, I think basically, you know, I've mostly written on policy unrelated to tech for my career, state and local a lot. So the fact that a lot of the policy action on AI seems to be happening at the state level has been very relevant. But I've also just always been paying attention to the AI literature. I remember 2017, I think, when the Alec Radford Amazon product reviews paper came out, and I said to a colleague, this is going to be a big deal one day. And I tried to use GPT-2 to do social science research, like policy research, back in 2019. So I've been playing around with these for a while, and I try my best to write from the perspective of a relatively technically informed person, but also someone who understands the policy side.
Nathan Lambert (00:02:43): Yeah, so I think we should jump right into it. What is the origin story of this California bill? My understanding is it just kind of showed up and everyone in the Bay Area was like, where did this come from, having actually passed the state Senate? Do you have any... does your story start there as well? Or did you kind of know this was coming?
Dean W Ball (00:03:03): So I saw, Scott Wiener, the author of the bill had telegraphed that he was working on, something in AI policy, I think in maybe October or November of 2023. And then the actual bill text came out in early February. And I remember when it came out because I was having dinner with my wife and, I was like, I have to drop everything and go work on this. I stayed up until like one in the morning, you know, reading the bill and writing about it. And that was kind of my first Substack post that really went anywhere in terms of audience. And so, yeah, then there was kind of a couple months of quiet. You know, I had been writing about it, but people weren't really focused on it in the Bay, in the tech community. And then closer to around April, people started to pay attention. And the conversation has been pretty, you know, pretty active since then.
Nathan Lambert: Yeah. And like, what does it actually say? What are the core points? I know there's stuff around thresholds and giving California power, like California creating a new body. What do you think are the few core things that people should know? I think there's probably some details, but just the core stuff.
Dean W Ball: Yeah, so the core idea behind SB 1047 is to create a regulator inside of the California government called the Frontier Model Division that would oversee models. Really, now the threshold is models that cost more than $100 million to train. We can talk about how specifically you really even specify that cost, but really all the bill says is $100 million of compute costs to train. Those models are subject to a series of testing and safety requirements, and more importantly, I think, a liability regime that basically says that most downstream uses of that model, including in the case of an open source model, most fine tunes, most uses of models combined with scaffolding software, other software. So things that are very combinatorially distinct from the initial model release. Any downstream misuse is the legal responsibility of the developer who made the original model.
So, if I fine-tune Lama 3 and then someone else puts that in an app and then a user of that app misuses it in a way that causes a serious harm, the bill does have a high threshold for the harms that have to count here.
Nathan Lambert (00:06:00): Is that eligible? Is it specific? Do they have a safety taxonomy?
Dean W Ball (00:06:05): So basically, it's a static threshold that comes in at $500 million of damage. They would say harm to critical infrastructure and things like that. Critical infrastructure pretty much just means everything. It's kind of a catch-all term. It's a little weird. Critical infrastructure the way we think of it, like highways and power plants and stuff, is actually a subset of critical infrastructure. Critical infrastructure includes things like casinos and ballparks and amusement parks and all kinds of stuff. So anything really, any major cybercrime, bio attack, all the things people are worried about with AI would count. And the developer of the original model, which is many stages upstream from where the harm happened, would have legal responsibility.
Nathan Lambert: So the risk, probably the expected-value risk, for open models in this bill is definitely low, but it's just kind of this thing where, if you're comparing on the two axes, open versus closed, the risk for open models is way higher because of this downstream use term. And that's for the people asking, oh, why is everyone that cares about open AI, like openness in AI as a field, mad about this? So I think that was why everyone was kind of up in arms.
Dean W Ball: Yeah. And the other thing to keep in mind, though, is that under this bill, if you're making a model that costs more than $100 million, you have to submit a variety of documents annually about your safety procedures and sort of testing regime on the model to the Frontier Model Division. And I think something that's not all that well understood, and it's kind of just how administrative law and regulation works in America, but that the tech community might not understand, is that the Frontier Model Division has the capability to create problems for developers even if their model is never used for a hazardous capability. They could see your safety plan and say, we don't like this, or we want more information on this. And they can subpoena you, they can bring you to court, and they could order a cease and desist.
Nathan Lambert: Yeah. And this is where your post on the political economy of AI regulation comes in: what are they going to do with that kind of open-ended power?
Dean W Ball (00:08:40): Yeah. I mean, they're an agency that has all the regulatory powers of an agency, which are substantial. I think one other point that is worth making about 1047, and that would be relevant to your audience in particular: in the initial version of this bill, for any fine-tune, no matter how substantial the fine-tune was, the original model developer held the legal responsibility, and had to test their models with the realization that people could fine-tune them or do whatever they wanted to them, modify the weights in arbitrary ways, which obviously doesn't really make a ton of sense.
Nathan Lambert (00:09:38): I was going to ask about the edits. This is where I probably stopped reading as closely as I should have.
Dean W Ball: In a fundamental sense, everything I've said so far has basically been true of the bill the entire time: the fundamental points, the liability, the Frontier Model Division, these kinds of things. Basically, it's still making developers guarantee model safety, when I think we're probably both in agreement that safety is not a model property.
Nathan Lambert: Yeah, at least in the way that the bill conceives of it. They're concerned about infrastructure. If critical infrastructure is the primary target, safety is not a model property. This is why I ask about a taxonomy. We're going through this exercise at AI2 to kind of say, what do we mean by safety? And it's a total headache. It's extremely hard to get this right and to communicate it clearly. So now when any other organization or somebody mentions safety, I'm like, oh, do they actually define it? It's such a risk to put it into words, because when you put it into words, you're exposed to all this, people being like, so you don't care about X, Y, and Z. And if you don't put it explicitly, it's like a total trap.
Dean W Ball: Well, and actually, just to expand on that a little bit: the Center for AI Safety, which is the nonprofit that was heavily involved in authoring the bill, and Senator Wiener, one of their primary concerns is bio risk, so people making biological weapons with AI models. And I think people who don't understand biology all that well have this idea that you can say, oh, well, that's a good biomolecule to make, and that's a bad one. And so we'll make a list of the bad ones and you can't make the bad ones. And that would be a way to, like, RLHF a biological foundation model.
Nathan Lambert (00:11:34): My understanding of biology is that the more powerful, the more specific a molecule is, the more it'll probably have both good uses and downsides. It's like Teflon: amazing physical properties, extremely bad health downsides. Obviously, if you're engineering living creatures, it's going to be a bit of a different consideration, but yeah.
Dean W Ball (00:11:56): But I mean, also, a lot of biomolecules, just like code, their goodness or badness is really context dependent. They'll do different things in different settings. And so it's not necessarily easy, a priori, to identify... how would you even steer a biological foundation model, like something that's predicting protein structures or nucleic acid sequences or whatever it may be? How would you even steer that towards safety? It's not a priori obvious that that's currently possible. But that's just, you know, I think this idea that safety is something that can be legislated in that way is a fundamental problem.
Nathan Lambert: So what is next? Or you could continue. I was going to ask, what is next for the bill?
Dean W Ball: Oh, yeah, yeah. So I'll just say one thing about the fine-tunes in the most recent amendments to the bill. So fine-tunes now, if you do a large fine-tune, large being anything more than 3 × 10^25 FLOPs of fine-tuning compute,
Nathan Lambert (00:13:13): I need to learn all these numbers. I need to learn what they mean. Essentially, it's a linear relationship between model size and tokens, and then you should be able to have specific points, like, is Llama 3 base crossing that? Like 15 trillion tokens at 70 billion parameters? I don't know, I'll loop back on this. I need to know this in the future.
Dean W Ball (00:13:35): It would be as much compute as you use to fine-tune the model; that's how this threshold is calculated.
Nathan Lambert: Yeah, I just want, like, a rule of thumb for people. That would be great. I'll figure that out; it's on my to-do list of mental math. That would be great.
Dean W Ball: That would be great to do. But if you're in that situation, then the bill applies to you too. So you have to create a safety plan and a certification that you submit to the Frontier Model Division every year. Starting in 2028, like the foundation models, you'll be subject to mandatory annual audits.
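For the rule of thumb Nathan is asking for: training compute is commonly approximated as roughly 6 × parameters × tokens. A quick back-of-the-envelope check against the 3 × 10^25 FLOP figure discussed above might look like the sketch below; note that, per Dean, the bill's threshold counts the compute used for the fine-tune itself, so these pretraining-scale numbers are only for intuition.

```python
# Rule of thumb: training compute ~ 6 * parameters * tokens (in FLOPs).
THRESHOLD_FLOPS = 3e25  # the fine-tuning compute threshold discussed above

def approx_training_flops(num_params: float, num_tokens: float) -> float:
    """Standard 6*N*D approximation for dense transformer training compute."""
    return 6 * num_params * num_tokens

# A 70B-parameter model trained on 15T tokens lands around 6.3e24 FLOPs,
# below 3e25, while 405B parameters on 15T tokens is roughly 3.6e25, above it.
print(approx_training_flops(70e9, 15e12) > THRESHOLD_FLOPS)   # False
print(approx_training_flops(405e9, 15e12) > THRESHOLD_FLOPS)  # True
```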
Nathan Lambert: Is this prescribed to anyone that trains in California or anyone that operates their model in California?
Dean W Ball: Anybody that distributes a model in California. So the bill covers at least everyone in the United States, if not really everyone in the world. They could certainly sue you in the United States if you're an American company or operating in America. Now, the important thing about that fine-tuning threshold, though, is that it can be lowered arbitrarily by the Frontier Model Division. The $100 million threshold for foundation models, that's fixed in statute, so you would need an act of the legislature to change it. But for the fine-tuning threshold, there's no dollar amount. So the same problem with compute thresholds, that compute is getting cheaper and cheaper rapidly over time, applies, and the Frontier Model Division can change that threshold arbitrarily.
Nathan Lambert (00:15:35): Who elects these officials? Is it like the governor of California? Or the federal branch or something?
Dean W Ball (00:15:43): This is all state-based.
Nathan Lambert: Oh yeah, I meant in the state.
Dean W Ball: Yeah, so the Frontier Model Division would be staffed by unelected civil servants, primarily, and led by unelected civil servants. And then on top of the Frontier Model Division, the newest version of the law creates a committee that is like a governing committee. And that committee is composed of, I believe, three members appointed by the governor and confirmed by the legislature, and then two members that the legislature itself appoints, one from each house, the Senate and the Assembly.
Nathan Lambert: Mostly what I would expect.
Dean W Ball: Yeah, yeah, exactly. And like, I think there's a requirement that, you know, one person has to be from industry, one person has to be from the open source community. There's a lot of, there's a lot of bones that they throw to the open source community.
Nathan Lambert (00:16:37): Random credentialing.
Dean W Ball (00:16:38): Yeah, yeah, exactly. But I mean, I don't really, that could be anyone, you know, really, like, yeah, who's who's from the open source community? Exactly. Yeah.
Nathan Lambert: So what's next for this? It passed the state Senate and then it got revised by the, what is it, the state Assembly. Is that how it works? The state Assembly revised it, then they would have to vote, and then the Senate would have to vote again, and then the bill would have to actually be signed. Is that how it works in California? Yeah.
Dean W Ball: Yeah, basically. So it's right now making its way through the committees. It went through the Senate committees and then was voted on by the whole Senate. Now it's going through the Assembly committees. It just passed one, I think last week or the week before, the Consumer Protection and Privacy Committee is what it's called. I could be wrong on the exact name, but that's the basic idea. So they just passed it, they did some amendments, and it goes to the Assembly's judiciary committee next. Then eventually it will go to the full Assembly for a vote and then to the governor for signature or veto.
Nathan Lambert (00:18:04): When would this start? When would it kick in?
Dean W Ball (00:18:00): The bill would kick in, I think most of its provisions would start, January 1, 2025.
Nathan Lambert (00:18:05): Yeah. And the original vote in the state Senate was very pro, right? It was just like, oh, this seems like a normal checkbox. This is kind of a cynical take, but I kind of viewed it as: mostly these politicians are serving constituents who know that AI is a big thing but know nothing about AI. So for a politician, saying, look, I'm taking action on AI, when constituents are not going to be able to decipher any of the details, is probably a political win.
Dean W Ball (00:18:31): Yeah, well, and I think also worth noting is that Scott Wiener, the state senator who authored the bill, is a very powerful figure in California politics. And I would guess that a lot of the senators who voted in favor of the bill really barely looked at it and aren't even necessarily thinking about their constituents. First and foremost, they're thinking more about, well, Scott's my ally, I need X, Y, Z thing from Scott, so I'm going to vote yes on his bill. And that dynamic will apply at the Assembly too; it's very common. The California legislature has a history of sometimes even unanimously passing bills that the governor then vetoes. So the governor is often expected to be a little bit the adult in the room on this stuff.
Nathan Lambert (00:19:25): This is so funny. I have no comment.
Dean W Ball (00:19:27): I do suspect that the governor is probably going to be, whether or not he wants to, he will probably be the final voice on this bill.
Nathan Lambert (00:19:41): So that's who people are talking to, probably, realistically, from what you've said.
Dean W Ball (00:19:46): Yeah. So, I mean, the one thing, and again, this is a kabuki that's very common in state legislatures: the governor has not said anything publicly about SB 1047 specifically. I think, as a general matter, he tries not to comment on legislation that's in process.
Nathan Lambert (00:20:08): That makes sense.
Dean W Ball (00:20:09): Yeah. But, you know, he also might signal in various ways. There are times, when it gets closer.
Nathan Lambert (00:20:17): I would guess they do.
Dean W Ball (00:20:18): Yeah. I mean, like he could say, you know, a lot of bills. I think one outcome that is extremely unlikely from this bill is that it's like voted down by the assembly. Like, I don't think that's going to happen. It could die in the assembly. It could just kind of be forgotten, never get brought to a vote, or it could go to the governor and be vetoed. If the bill's not going to pass, it's going to probably be one of those two ways.
Nathan Lambert (00:20:43): Okay, that's a great little lesson in state politics that I'm sure the vast majority of people listening to this will not know. I did not know all of this. Do you have any final comments on this? Otherwise, we're going to move into kind of fun, faster questions and discussions.
Dean W Ball (00:20:59): Yeah, sure. Let me just think. I think the one other thing that is worth keeping in mind here is that the latest version of the bill, I mentioned this, but just to expand on it a bit, does require mandatory audits starting in 2028. So if you make a covered model or a covered fine-tune, however the Frontier Model Division chooses to define that, not only do you have to submit your certifications to the Frontier Model Division and have the legal liability and all that, but you also would have to comply with an audit done by a private company. So just like accounting, you pay for someone to come in and look at your stuff. And the auditors, it's not an open market for competition; the auditors are licensed by the Frontier Model Division. So it's probably two or three different companies that'd be doing that, and that's probably the sort of thing that I
Nathan Lambert (00:21:59): think people have wanted. I don't know if you want it. I don't think people, I don't want all these types of oversight to be cobbled together. I think individually each of them has different types of merit, but the execution is important, and then when you cobble them together, it's like, wait, wait, wait, this is too much.
Dean W Ball (00:22:19): Well, and also, I agree that an audit structure like that might be the good long-term way to go, but I think it's questionable whether a California state agency really has the capacity to do this kind of assessment of who is an accredited auditor. That feels much more like a federal responsibility. So yeah, but I think that's pretty much the main message on 1047.
Nathan Lambert (00:22:49): Yeah. Okay. I'm going to move into other fun questions I have. I'm going to start with one that's potentially related. I've been trying to get my brain around what is going to happen when there is actually a minor disaster from AI. It loops into open versus closed debates. A lot of what I've been telling people is that it won't actually be about whether or not it was an open or closed model; it'll be some weird infrastructure that people plugged it into, and that causes the power plant to go down. Do you have any ideas about how this will happen? I'm expecting this to happen within a couple of years. I feel like the state of our infrastructure is that it is not that reliable, we're adding all this new digital information into it, and all of this is very fragile digitally. So I think this is going to happen. And how do we preempt any communications around that?
Dean W Ball (00:23:37): Yeah, well, I mean, you know, cyber attacks take out critical infrastructure all the time. Earlier this year, or maybe it was last year, courts in Dallas could not convene. There were no judicial proceedings in the city of Dallas because of a major cyber attack on the judicial system's computers. Parts of the power grid go down. Water plants go down. Hospitals, all the time. This happens. $500 million in damage to critical infrastructure, that sounds like a lot. It's not actually that much.
Nathan Lambert (00:24:13): It doesn't have a B on it. It doesn't sound like a lot.
Dean W Ball (00:24:18): Exactly. It's a big economy. I think about this all the time. I think a couple things are very likely to be true. If there is an attack of this sort, people will probably suspect that AI was involved, whether or not we can tell. How are we going to know? Right. Let's say somehow we do have a strong hunch that an AI model was involved.
Nathan Lambert (00:24:47): Yeah, like, do we normally figure out what happened in cyber incidents? Or is it normally post hoc? Or not at all? I guess that's a good thing to know with my question. It's like, can we know that a language model is actually involved? Like, how often will they be able to get that far into the stack of the attack?
Dean W Ball (00:25:02): Yeah, right. Like, I don't know. I mean, if you were using an agentic GPT-6 model to do some kind of zero-day exploit on something, presumably in the server logs you'd be able to see what was interacting with it. Right. But who knows if that would be masked. So let's just say, though, that we have some circumstantial evidence to suggest that an AI model was involved in the execution of some cyber attack. It's very unclear to me: are we going to have the person's chat log? Are we going to know how they prompted the model?
Nathan Lambert (00:25:46): Like, I mostly think it's like it's going to send requests over some generic Internet protocol. So there'll be this big gap where we can't really tell.
Dean W Ball (00:25:54): Yeah. I mean, that could totally be true. That could absolutely be true.
Nathan Lambert (00:25:58): So I expect we'll only really know if somebody takes ownership, or does a really bad job, or it's an own goal, like a hospital implemented some agent and then it took down their authentication system type of stuff.
Dean W Ball (00:26:12): Yeah. No, that could very well – that's all definitely possible. Yeah. I think that, though, how would we actually know what an AI model was used for? It seems to me like we don't actually... People are imagining a situation in which this happens with perfect information.
Nathan Lambert (00:26:32): Yeah, I think that's the answer to my question. We can't answer what happens because it's so much of a media question. We won't know. It's likely to happen, but it's very unlikely that we know the specific stack that caused it. Which makes it more of the same: if cyber incidents increase in rate, then people will talk about AI, and without actually having the logs, it makes it easier to spin narratives. I'm worried that people will say, oh, this is why open source AI is bad. I don't expect there to be any proof for that, but I expect that to be what people say.
Dean W Ball (00:27:10): People are going to blame AI for things that were already happening. I think that's a trend we will see across the board. Whether it's misinformation or cyber attacks or whatever else, there are all these curves that were already pointing up, and they're going to continue to, most likely. And I think people will blame that on AI. Now, the sort of long tail situation is: what if something really bad happens? What if a power plant goes down, or no one has water in Los Angeles for a month, or something like that? In that situation, not only do I think that an attack could be hastily blamed on AI without us knowing whether that's true, I also think we could see legislation move very, very quickly. Congress, the federal government, is not known for moving fast, but in a crisis, they will move fast. It's for the same reason that I suspect, and I don't think he is right, but if Leopold Aschenbrenner is right about superintelligence being here in, you know, 50 months or whatever he says.
Nathan Lambert (00:28:26): Yeah. This is another one of my later questions, but I didn't have the best way to frame it.
Dean W Ball (00:28:32): Yeah.
Nathan Lambert (00:28:33): Like AGI timelines and stuff.
Dean W Ball (00:28:35): Yeah. Like if he's right about that, then like, yeah, I mean, that's going to get nationalized by the federal government and it'll happen in a heartbeat.
Nathan Lambert (00:28:42): You know, I found it interesting that Alexandr Wang of Scale was also kind of touting this point of view. Yeah. I guess it makes sense for them because they're the only AI company that is leaning into federal contracts. Yeah.
Dean W Ball (00:28:59): And they were before ChatGPT, too, I think.
Nathan Lambert (00:29:04): Yes, they have been for a long time, which is why it was easier for them to continue.
Dean W Ball (00:29:08): Yeah, their early big revenue source, I think, was federal government contracts.
Nathan Lambert (00:29:13): Okay. Yeah, we might come back to AGI. I've been confused by the... lines they're drawing. I have a quiz to debate later on. I don't even know the answer. We'll see if we get to it. But another fun question: do you think Meta will release the 400 billion parameter model? And will there be any governance questions around that?
Dean W Ball (00:29:32): Will they release it open source?
Nathan Lambert (00:29:34): Open weights in a similar manner to the other models. Yeah.
Dean W Ball (00:29:37): Yeah. Open weights.
Nathan Lambert (00:29:42): I've been decreasing my probability; at best I was ever 50-50. But is it for governance reasons that you don't think they will? They've always been flying close to the sun, where there are back channel discussions, like the Biden administration telling Meta they're not invited to stuff because they're not happy with how they're open-weighting models, and they're probably getting lobbied by people saying open source is bad. It has always seemed like Meta is on kind of thin ice with the executive branch in Washington. And I'm guessing it's reasonable to say that this model's release is heavily influenced by the feedback they're getting there. And Zuck will make the final call.
Dean W Ball (00:30:28): Yeah, I think that that's part of the calculation. I think that also they probably just want to set a precedent that they're not going to release everything open source because they don't know how things are going to go. Yeah, I mean, they just don't know. Will the model end up being... the most important way that we all interact with computers, you know, in a few years? Or will it just be kind of another layer and another tool? I think they don't know. I feel like Zuckerberg's intuition is that it's just going to be another tool. And so that's why he's inclined to open source.
Nathan Lambert (00:31:07): Yeah, this relates to the whole Apple thing. Apple is making these as features rather than products. That does a lot of good for the narrative around AI, in my opinion, at least for things that I care about. This is what we're saying where AI is about a system and not just a model. Apple's model doesn't matter to people, but it is enabling these products and systems, or these things on their products, to just be better. It's always Apple and Meta together. They are always forcing their way into whatever the next thing is going to be in technology.
Dean W Ball (00:31:44): Vibes policy or whatever. Yeah, and it's funny because they hate each other. But yeah, I don't think they're going to release it; that's just my personal intuition. And I think we're going to see a lot of people, not just in the language model space but elsewhere, kind of do this dual approach, because they realize how much political cred you can get by open sourcing things. It's still happening.
Nathan Lambert (00:32:12): Google today, when we're recording, released Gemma 2. And their 27 billion parameter model is just a little bit below Llama 3 70B. I think that's a nerdy thing. But when the first Gemma model was released, it wasn't used as much by the community, mostly because there were a lot of minor bugs in the implementations in popular tools, so the initial feedback loop didn't catch on. So it'll be really interesting to see if these second generation models, which are in the same ballpark as what Meta released, do better. There are some strange things. They trained the biggest model on about 13 trillion tokens, then the 9B model on only about 8 trillion tokens, and the 2B model on about 2 trillion tokens. So the models that have more reach by being smaller got much less compute. There's got to be a reason, but I think they were scaling runs preparing for the biggest one and they didn't finish training them. So the models that the most people could use are relatively worse than the bigger ones, just by the amount of compute that they put into them.
So I think eventually, if there's decent uptake of these, Google will change this. But the Gemma 2 9B model is going to be way worse than the Llama 3 8B, just because Llama is trained on about twice as many tokens, and Google could have resolved this. That's my aside. But these dynamics actually feed into what we're talking about, which is that Google, Microsoft, and Meta are all still releasing these models.
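To make the compute gap concrete, here is a minimal back-of-the-envelope sketch using the common heuristic that training FLOPs are roughly 6 × parameters × tokens; the parameter and token counts are approximate figures, treated here as assumptions rather than official numbers.

```python
# Rough training-compute comparison using the common heuristic FLOPs ~= 6 * N * D.
# Parameter and token counts below are approximate/assumed, not official figures.
models = {
    "Gemma 2 27B": {"params": 27e9,  "tokens": 13e12},
    "Gemma 2 9B":  {"params": 9e9,   "tokens": 8e12},
    "Gemma 2 2B":  {"params": 2.6e9, "tokens": 2e12},
    "Llama 3 8B":  {"params": 8e9,   "tokens": 15e12},
}

for name, spec in models.items():
    flops = 6 * spec["params"] * spec["tokens"]  # approximate training compute
    print(f"{name:12s} ~{flops:.2e} FLOPs")
# The smaller Gemma models get far less training compute than Llama 3 8B,
# which is the gap being described above.
```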
Dean W Ball (00:33:42): Yeah.
Nathan Lambert (00:33:42): Which is good. I have on this outline the general state of open versus closed. It seems like we haven't had major updates in a while, and there's much less pressure bearing down on open. I think maybe people are okay with the steady state that we're in. I don't know if this Nemotron 340B changes that much.
Dean W Ball (00:34:01): I don't think so. I think there are the people who believe that open source models are an existential risk to the world, and they continue to mostly think that, and they continue to support policies that, either in absolute terms or on the margin, would diminish open source. But I think DC has had a really radical shift in the last year, because the climate towards open source models in the policymaking world a year ago was not good. And now it is much more: oh, well, we think this is really important for competition, and we think it's important for innovation, and we actually want to make sure we have a really healthy open source community, and all these kinds of things. I mean, I'm sure you've seen, you know, Lina Khan, no friend of the technology industry, has had comments on this.
Nathan Lambert (00:35:09): That's good. Did you see her clip on Hard Fork where she was asked what her p(doom) is?
Dean W Ball (00:35:14): Yes. Yes.
Nathan Lambert (00:35:15): Oh, my God. If people haven't seen this, you've got to go find it. It is so funny.
Dean W Ball (00:35:18): Yeah. And the sense I get from talking to people in Congress, congressional staff and whatnot, is that people have just realized open source is really popular and it would be really hard to go after. This isn't new; the government figures this out like every 15 years. They get really freaked out about something in open source software, they go and try to ban it, and then they realize, oh, wait a minute, this would be really hard. This would piss a lot of people off.
Nathan Lambert (00:35:56): It'd be a giant economic own goal. I think it's inevitable that it would be an economic own goal. I mean, China is ready to take over if the U.S. cedes the lead. They're right there. They don't have the ecosystem; the ecosystem is landing in the U.S., but they have perfectly good models. So if the U.S. were to own-goal itself and stop building the models, that is the path by which they could then own the ecosystem. There's no incentive to recreate the ecosystem when the ecosystem and the models exist in the U.S. But if these kinds of tools and hosting all go away, that's when other people take over.
Dean W Ball (00:36:29): Well, it seems like, I mean, as a bit of a question for you, I guess, but like, it seems like the Chinese, like, you know, the export controls on compute are going to start to really affect them. Because they were able to buy H100s.
Nathan Lambert (00:36:44): Yeah, this is what I was going to ask about. Isn't it that a lot of NVIDIA's recent sales have been them prioritizing selling to China because they're not yet blocked? And then that creates a backlog in the U.S., because NVIDIA is like, well, they're not going to be able to buy them later, so we should get our revenue while we can. It kind of checks out. I don't have a source on it, though.
Dean W Ball (00:37:04): From what I've gathered, it's all through subsidiaries. So Chinese companies saw the writing on the wall about export controls like two and a half years ago, and they started to buy up A100s and H100s at that time. Then the export controls came through, and things are leaky, and NVIDIA had those chips. They were selling chips that were basically an A100 and basically an H100 for a year, and then that got blocked by the federal government. So...
Nathan Lambert (00:37:37): Should we put Zuckerberg in charge of NVIDIA? Because for all the haters of Mark, Mark is pretty American and kind of follows through on it. He doesn't really care that Facebook is blocked in China. I feel like this is why public companies sometimes have problems: they're too incentivized. NVIDIA's stock, if they had to stop selling to China immediately, would take such a haircut. So their hands are literally tied to doing this thing, which goes against what the executive policy is in such a clear way. Which makes me think this is a market failure. I guess Jensen's probably pro-U.S., I don't know, and I don't care whether or not they're a hawk. It just feels bad to go so clearly against the intentions of the executive policy when there is a clear reason they're doing it.
Dean W Ball (00:38:31): Yeah. Yeah. No, I mean, I think that Jensen is going to comply with the letter of the law, but that philosophically he doesn't feel like it's his responsibility or good for him to be policing who his end users are. I think that's just how he feels.
Nathan Lambert (00:38:47): That's another discussion that I've been trying to figure out. Ben Thompson famously has these diagrams for where moderation can occur in the stack. And then figuring out the mirror for where AI is in the stack: whether it is just a product, or whether it seeps down to being like the AWS layer, where OpenAI's models are so fundamental to our computing infrastructure that them moderating at all, and them deciding who they sell to, becomes extremely unclear. And I think it might be going in that direction.
Dean W Ball (00:39:20): It feels that way. But it does increasingly feel to me like... You know, the Chinese might not be able to keep up on foundation model training because they're not going to be able to string together 100,000 B100s in a year.
Nathan Lambert (00:39:32): They have more electricity, which seems to be what people are talking about is the limitation.
Dean W Ball (00:39:37): They just won't have the compute, though. And the U.S., I think, will figure out the electricity. I mean, I don't think we're going to be building 100 gigawatt data centers, but we'll figure out the electricity for the next couple of years, I think. But the Chinese will be able to distill the models, right, and release them as open weights.
Nathan Lambert (00:39:59): I mean, this is what the leading labs are doing anyway. Google, OpenAI, and Anthropic have all now released models below their biggest size that are better than their biggest available models, because it is cost effective and the performance is really good. So they're not even pushing the frontier of model size to the users. There are probably other infrastructure reasons for this, but that sort of thing is something China could also do: distilling our models into their models and stuff like that. I think this leads into my next question. I was wondering if, in your circles, this idea of synthetic data and the various license clauses on whether or not you can train on model outputs is something that is discussed. In the open fine-tuning community, keeping track of licenses and how you comply with them across these various models is really crucial. With Llama 3, you're technically not allowed to use the outputs of the model to train any model other than Llama 3 models, which is this kind of headache, and then a lot of NVIDIA's push with Nemotron is like, look, go wild. I've learned that a lot of these clauses on training on outputs come from the data providers trying to protect their business models. So these companies want the models to be pretty open, maybe not Meta, but some of the smaller ones, but then the data providers are like, you can't do this, and they don't have enough power to do this. This is a very in-the-weeds technical discussion, but is synthetic data, or these clauses on models, discussed in your area of the world?
Dean W Ball (00:41:30): So like in the policymaking circles, people are just coming around to the idea that synthetic data is even a thing. And I think a lot of people in DC don't understand that there are licenses associated with open source software.
Nathan Lambert (00:41:45): Well, the licenses with the models don't really make sense. We're in this position where I've generated some data with these models, so you can't train on the outputs, but it's written as if it applies to you as the user. You're agreeing to their community agreement to use the model. But if I create a dataset and then upload it without training on it, can't somebody else just take the dataset and train on it? Because they never agreed to the terms of use of the model. It makes no sense. I need to go to our legal department and be like, this is what they're saying, right? I don't understand. And so it's just this weird ecosystem of middle-ground messiness, which feels similar to some of the open versus closed stuff. And we're going into the peak of this discussion, I think, especially as people get to know better that these new Claude 3.5 models are just distillation, based on some form of synthetic data.
Dean W Ball (00:42:36): Yeah. I mean, with a clause like that, too, in a contract, like you got to wonder about enforceability even under the best of circumstances.
Nathan Lambert (00:42:45): Yeah.
Dean W Ball (00:42:45): How would they know? How would they prove in court that this synthetic dataset came from their model? Maybe they could prove that, but I don't know. A lot of models claim that they're OpenAI models, whether or not they are.
Nathan Lambert (00:43:04): It's really funny. Yeah, a lot of it is... well, this is a technical issue with open models. A lot of people spin up demos with open models, but a lot of how the models know who they are is the system prompt. And if you just spin up an open model without one, it's going to say it's whatever model it was trained on the most of. People don't normally write the system prompt that says, you are blank, blah, blah, blah. We need to do that for our models, and we're relatively serious actors. So open models will always be messier with this, because the closed models do a lot more to serve it as a product in a polished way.
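As a minimal sketch of what "writing the system prompt" looks like in practice, here is a hypothetical example using Hugging Face's chat template utilities; the checkpoint name and identity string are placeholders, not a specific deployment.

```python
# Minimal sketch: giving an open model an explicit identity via the system prompt.
# The checkpoint name and identity text are placeholders, not a real deployment.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my-org/my-open-chat-model")  # hypothetical checkpoint

messages = [
    # Without this system message, the model will often claim to be whatever
    # assistant dominated its training data (e.g., "I am ChatGPT").
    {"role": "system", "content": "You are MyModel, an assistant built by MyLab."},
    {"role": "user", "content": "Who are you?"},
]

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)  # the formatted prompt that would be sent to the model for generation
```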
Nathan Lambert (00:43:43): Another quick question related: we mentioned Anthropic. With this Claude 3.5 Sonnet model that just came out, they said in a tweet that they got clearance from the UK AI Safety Institute. This is from Michael Sellitto, who I think I've met at various government discussions. He's like, excited to release this top performing model; in addition to our internal pre-deployment testing, we were also pleased to work with the UK AI Safety Institute. Is this just political gesturing? What is going on?
Dean W Ball (00:44:18): I think that it's political gesturing. I don't love it. I don't think that we should normalize the whole pre-deployment testing thing, because that's just fundamentally incompatible with the way that software is made. But, yeah, I suspect that it's political. I think that none of these companies are particularly reliable narrators. DeepMind is going through a reorg. Was DeepMind a part of Google when the AI Safety Summit happened? I think maybe that reorg was happening. OpenAI, we all know, is a fairly dramatic company.
Nathan Lambert (00:45:04): I need to come up with the right nonlinear dynamics analogy. They're in like an unstable homoclinic cycle or something. There are these things in nonlinear dynamics where they stay in a cycle, but if they're perturbed, they end up in another cycle. The Lorenz attractor is the classical, truly chaotic one that oscillates between them. But it's kind of like that, because they don't even need an external disturbance. They don't even need an input. They're going to go into some other unstable equilibrium for a while and then go to another one. Nonlinear dynamics is just a great field because the math is simple, but the analogies are really good.
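For readers who want to see how simple the math behind the analogy is, here is a minimal sketch that integrates the classic Lorenz system with forward Euler; the parameters are the standard textbook choices, and the snippet is purely illustrative.

```python
# Minimal Lorenz attractor sketch: three coupled ODEs integrated with forward Euler.
# Standard textbook parameters (sigma=10, rho=28, beta=8/3); purely illustrative.
def lorenz_step(x, y, z, sigma=10.0, rho=28.0, beta=8.0 / 3.0, dt=0.01):
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return x + dx * dt, y + dy * dt, z + dz * dt

x, y, z = 1.0, 1.0, 1.0
trajectory = []
for _ in range(5000):
    x, y, z = lorenz_step(x, y, z)
    trajectory.append((x, y, z))

# The trajectory never settles: it orbits one "wing" of the attractor for a while,
# then flips to the other without any external input -- the behavior referenced above.
print(trajectory[-1])
```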
Dean W Ball (00:45:41): I even think Anthropic is that way too, to be honest, and they're the most stable of the three,
Nathan Lambert (00:45:50): but I think their cultural density is still higher.
Dean W Ball (00:45:53): Yeah, I mean, I think that they have a very clear mission, and that is really helpful.
Nathan Lambert (00:45:59): I don't know if they're achieving it. Okay, I'm close with a lot of people there, but I don't believe their line that they're not contributing to the race. I think they need to reframe that and figure out how to combine it with their culture. It's true that normal people don't know that Anthropic exists, which might mean that in a normal person's world they're not contributing to some race, but they are in dynamics with OpenAI and Google that substantially add pressure to the pace of AI progress.
Dean W Ball (00:46:31): Claude's been my go-to daily model for the last four months, since Claude 3 came out. It's good. But yeah, I also think that they've committed to doing models every couple months too, right? That's a pretty rapid cadence, substantially faster than OpenAI. So, yeah, if anything, they're accelerating the current dynamics. And, you know, I think the whole UK AI Safety Institute commitment was made during a very heated moment, kind of the peak of the AI doom rhetoric in the fall of 2023. Was this before or after the Sam Altman stuff? I think it was before.
Nathan Lambert (00:47:16): I talked to people who were at that event and they were like, this s**t is weird. They're like, why am I on the stage with all of these billionaires and famous politicians? And they're all like, what is going on here?
Dean W Ball (00:47:27): Yeah. Well, I mean, it was just so incoherent back then, because the Biden executive order and the AI Safety Summit were within about a week of one another, as I recall. All this stuff happened. And I think they made those commitments, and I think we will see all these companies gradually try to unwind themselves from those commitments over time. Or what will happen, and this would be very consistent with the way that software gets regulated, is that the big companies will do these pre-deployment tests and there'll be open providers who don't. And it doesn't have to resolve itself in a rational way. That's something that's always important to remember about public policy: there's absolutely no need for it to be rational, you know, to make sense.
Nathan Lambert (00:48:19): Yeah, that makes sense. I think the other thing, this is all the AGI lab stuff: what is your take on the scaling curves? For context, everyone got restarted on this with the Leopold Aschenbrenner Situational Awareness thing, which obviously is a well-written document, whether or not you agree. I think it's interesting. I'm struggling with one point of the scaling curves thing, where I get mixed messages on what the scaling curves actually are when it comes to evaluations. My understanding is that when you have log compute on the x-axis and log perplexity on the y-axis, it's a straight line. And what I interpret this as is: as you 10x compute, you don't get a 10x increase in performance, you get something like 10 times closer to 100 percent, which is like going from 90 percent accuracy to 99. So I don't really understand how people think this is going to make the models become PhD level, whatever, blah, blah, blah. And I was listening to a recent podcast, and I think it was Josh A. from Imbue describing the reason you have emergent properties as: at every 10x of compute, your model gets 10 times better, so if you're measuring on a linear scale it'll look like an emergent property, because it's going to go like this. And I was like, what is going on? Why does no one understand these fundamentals? It seems impossible that you could get 10 times better; that just seems like total Kool-Aid drinking. Am I wrong? I guess I need to go do this basic math; it just doesn't track with any computer system. How are you going to get 10 times better? I don't understand. Well, that's kind of my rant.
Dean W Ball (00:50:07): I read these charts the same way. Log compute, log perplexity, right? That is what I read too. And so that would imply asymptotic progress, but it would not imply a continued exponential increase in capability. I also think: what is "better"? That's always so hard. What is 10 times better? People say, oh, well, the leap from GPT-4 to GPT-5, will it be similar or smaller or bigger than the leap from GPT-3 to GPT-4? I don't really know if I can quantify what the leap between 3 and 4 was, or the leap between 4 and Claude 3 Opus, which was definitely real for me. That model felt qualitatively different. But I don't think that has to do with training compute. I really don't think that has to do with the number of parameters the model has. I think that has to do with the way Anthropic did the post-training, more than anything else. So, yeah, I'm really not sure. I'm skeptical when it comes to the scaling laws. They're obviously very important. They've held in a variety of different modalities, which is interesting. The fact that we see them apply in DNA sequence prediction too is like, oh, that's interesting, we're just seeing that same line. The models improve monotonically with scale over and over and over again. So, sure, I'm inclined to believe that.
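A minimal numerical sketch of the point both speakers are making, under the assumption that loss follows a simple power law in compute: on log-log axes the curve is a straight line, while a downstream accuracy metric derived from that same loss can look like a sudden "emergent" jump when plotted on a linear scale. The exponent and the loss-to-accuracy mapping here are invented for illustration only.

```python
# Illustrative sketch: a power-law loss curve is a straight line in log-log space,
# but a thresholded downstream metric can look "emergent" on a linear scale.
# The exponent and the loss->accuracy mapping are invented for illustration only.
import math

def loss(compute, a=20.0, alpha=0.05):
    return a * compute ** (-alpha)           # L(C) = a * C^(-alpha)

def task_accuracy(l, sharpness=20.0, threshold=1.6):
    # A toy downstream task that only "works" once loss drops below a threshold.
    return 1.0 / (1.0 + math.exp(sharpness * (l - threshold)))

for exp in range(18, 27):                    # compute from 1e18 to 1e26 FLOPs
    c = 10.0 ** exp
    l = loss(c)
    # log10(loss) decreases linearly with log10(compute): the straight line on the chart.
    print(f"C=1e{exp}  log10(L)={math.log10(l):+.3f}  task acc={task_accuracy(l):.3f}")

# Loss improves smoothly and slowly, but the task accuracy column jumps from ~0 to ~1
# over a narrow compute range -- the "emergence" illusion from the linear metric.
```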
Nathan Lambert (00:51:52): They're important, but I am just so shocked by how bad the discussion of them so often is. This is the thing with putting levels on the y-axis corresponding to human education. Dumb. Bad move. The technical reality may be that they continue to improve, but those are the things I want to see people stop doing. And this isn't really a question. This is mostly just me ranting, because this impacts policy and these related discussions.
Dean W Ball (00:52:19): If I had written an essay like that in college and submitted it to my professor, like Leopold Aschenbrenner...
Nathan Lambert (00:52:27): Wait, who was the famous economist he was close with? Tyler Cowen, right? Tyler, you didn't check his work.
Dean W Ball (00:52:35): Yeah, Tyler basically hired me too, in fact. But yeah, if you did that and you didn't define intelligence, the first thing a college professor would do is circle the first paragraph and say, you need to define intelligence here. And the fact that he doesn't... I don't think it's a one-dimensional or two-dimensional thing. I think intelligence is inherently highly multidimensional, and multidimensional things just behave in counterintuitive ways.
Nathan Lambert (00:53:08): I think they're getting better at things they're already doing, but we don't have any proof that they're going to start doing new things.
Dean W Ball (00:53:15): Yeah. Is GPT-4 better than a high schooler at some things? Yes. Is it worse than a three-year-old at some things? Yes. Those things are all true. And I don't really think it belongs on a human-defined linear scale of intelligence. I just inherently don't think that.
Nathan Lambert (00:53:31): Yeah. That makes sense. Final question. How much of influencing policy and related discussions comes down to having some sort of audience? I think that this is remarkably true, but potentially not good.
Dean W Ball (00:53:42): Yeah, I think that it is very important, and I think that it comes from influencing the way people think. You know, a lot of think tanks will judge the success of research by, did the ideas from this research get implemented in policy? Which is one way to do it, for sure.
Nathan Lambert (00:54:08): But I think... It's a long timescale. It's like a longer timescale than citations in academic nonsense.
Dean W Ball (00:54:14): Well, and also, if I'm successful as a policy scholar, then at least once a month I should be putting out something, some analogy, some way of thinking about something, a meme really, basically, that has an effect on the way a lot of influential people think. The other big outstanding question for me, and I've heard you raise this on The Retort recently, in fact, is what's more important: is it influencing people in the federal government, or is it influencing people at the AI labs? Who's going to be more important for determining policy? I don't know.
Nathan Lambert (00:54:55): Yeah. Well, maybe some people at the AI labs will hear this. I think this was a great conversation, and I'm happy to wrap up here. I could see us redoing this in a few months based on the coverage of all the recent things here. So I think this is great. I'm excited to share this with people. It's nice to get to know you more. We already have another project lined up where we'll talk more about this; it won't be in the same medium, so that's fun. So thanks a lot, and keep writing. I'm sure you'll get a bunch of people to check this out. I'll have all the links everywhere and stuff like that.
Dean W Ball (00:55:28): Awesome. And you too, thank you very much. You played a big role in building my Substack audience over the last six months, so I really appreciate it.
Nathan Lambert (00:55:35): People just need to say things. People ask me this a lot. If you make time, most people that I work with have interesting thoughts; the problem is doing the practice of getting those thoughts into some silly medium. Literally, these long tweets, the tweets are now long, you could just do that. You could do that once a week. You will grow an audience over time. It's pretty simple. You just have to pick your lane and keep pressing the button, and it just works. You're not the only one; I'm going to have some other people who have talked about this on this interview track in the summer. I bring it up partially as a way to normalize it and get more people to try it, because I want that to happen in AI too. There are a lot of smart people who don't know how to engage, and it's worth it. So thanks again.
Dean W Ball (00:56:27): We'll talk to you. All right. Bye.
Things to be aware of if you work on language model fine-tuning.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/rlhf-roundup-2024
00:00 RLHF Roundup: Trying to get good at PPO, charting RLHF's impact, RewardBench retrospective, and a reward model competition
04:32 How big is the impact of RLHF relative to pretraining?
05:54 RewardBench retrospective after 100 models and 90% peak accuracy
09:19 LMSYS's reward modeling competition
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_009.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_012.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_017.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_026.png
Synthetic data is known to be a super powerful tool for every level of the language modeling stack. It's documented as being used for expanding vanilla pretraining data and creating large swaths of fine-tuning data. Many, many more rumors surround its use: Anthropic's pretraining-scale constitutional AI, Mistral AI's first models being pretrained on OpenAI outputs, Q-star's hopes as OpenAI's remaining moat, and much more. The diversity of use cases for synthetic data makes planning around its role in solving specific goals difficult.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/frontiers-in-synthetic-data
00:00 Frontiers in synthetic data
01:14 1. Direct distillation is still king
02:54 2. Are Gemini Flash and Claude Haiku distilled?
04:03 3. Filtering prevents collapse
06:30 4. Synthetic data strategy taxes
07:32 5. Pros and cons of training on multi-output-source synthetic datasets
08:54 6. Structured synthetic data
09:42 7. Weak-to-strong generalization is maybe real
10:27 8. Creating synthetic prompts is overlooked again
Signs point to a general-use Sora-like model coming very soon, maybe even with open-weights.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/text-to-video-ai-is-already-abundant
0:00 Text-to-video AI is already abundant
5:08 What's next for the text-to-video market?
6:49 Are text-to-video models good for the world?
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_005.mp4
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_009.mp4
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_011.mp4
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_013.mp4
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_015.mp4
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_017.mp4
Apple Intelligence makes a lot of sense when you get out of the AI bubble.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/apple-intelligence
00:00 AI for the rest of us
02:46 Apple's technical approach
03:32 Core models: What did Apple build?
05:35 Alignment strategies: Some new things!
10:00 Orchestrating adapters and on-device magic
11:58 Light for other narratives around AI
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_005.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_039.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_041.png
A realistic path to robotic foundation models
Not "agents" and not "AGI." Some thoughts and excitement after revisiting the industry thanks to Physical Intelligence founders Sergey Levine and Chelsea Finn.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/robotic-foundation-models
0:00 A realistic path to robotic foundation models
2:51 Key factors for the future of robotics
6:19 Everything is a token: The transformerification of robotics
Data licensing deals, scaling, human inputs, and repeating trends in open vs. closed.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/the-data-wall
0:00 We aren't running out of training data, we are running out of open training data
2:51 Synthetic data: 1 trillion new tokens per day
4:18 Data licensing deals: High costs per token
6:33 Better tokens: Search and new frontiers
Celebrity's power will only grow in the era of infinite content.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/name-image-and-ai-likeness
0:00 Name, image, and AI's likeness
1:11 OpenAI's second terrible, horrible, no good, very bad week
4:36 The expansion of name and likeness
7:46 Culture and AI development
ChatGPT leaves the textbox, and Google is building the same, and more, as practical tools.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-and-her
00:00 OpenAI chases Her
02:10 Talking to ChatGPT
03:53 GPT-4o: Toward omnimodal models
08:25 Google's mirror with Gemini
10:11 OpenAI's AI Safety: Have your cake and eat it too
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/her/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/her/img_023.jpg
Now we will have some grounding for when weird ChatGPT behaviors are intended or side-effects -- shrinking the Overton window of RLHF bugs.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-rlhf-model-spec
00:00 OpenAI's Model (behavior) Spec, RLHF transparency, and personalization questions
02:56 Reviewing the Model Spec
08:26 Where RLHF can fail OpenAI
12:23 From Model Spec's to personalization
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_027.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_029.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_033.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_034.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_041.webp
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_046.webp
Many, many signs of life for preference fine-tuning beyond spoofing chat evaluation tools.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/how-rlhf-works-2
00:00 How RLHF works, part 2: A thin line between useful and lobotomized
04:27 The chattiness paradox
08:09 The mechanism for making models chattier
10:42 Next steps for RLHF research
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_012.webp
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_018.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_025.png
Models that seem totally out of scope from recent open LLMs give us a sneak peek of where the industry will be in 6 to 18 months.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/phi-3-and-arctic-llms
0:00 Phi 3 and Arctic: Outlier LMs are hints
1:01 Arctic & open mixture of expert trends
6:10 Phi 3, synthetic data, and small models
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_004.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_008.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_018.png
Certain definitions of AGI are backing people into a pseudo-religious corner.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/agi-is-what-you-want-it-to-be
00:00 AGI is what you want it to be
04:01 RL still rules the AGI discourse
05:43 Modern AGI tests
07:37 Agency and shifting goalposts
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/agi/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/agi/img_020.png
Meta shows that scaling won't be a limit for open LLM players in the near future.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/llama-3-and-scaling-open-llms
00:00 Llama 3; scaling open LLMs to AGI
01:44 Pretraining, data, and basic evals
06:06 Alignment and human evaluations
10:08 Chatting with Meta AI and Llama 3 70B Instruct
11:55 Same Llama license (mostly)
12:52 The healthy open LLM ecosystem
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_011.jpeg
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_013.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_015.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_020.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_036.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_040.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_046.jpeg
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_061.png
Fig 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_063.webp
Fig 10: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_066.png
Fig 11: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_068.jpeg
Integrating some non-computing science into reinforcement learning from human feedback can give us the models we want.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/reinventing-llm-alignment
0:00 Stop "reinventing" everything to "solve" AI alignment
2:19 Social Choice for AI Alignment: Dealing with Diverse Human Feedback
7:03 OLMo 1.7 7B: A truly open model with actually good benchmarks
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_013.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_018.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_024.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_027.png
Modeling the compute versus performance tradeoff of many open LLMs.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/compute-efficient-open-llms
0:00 The end of the "best open LLM"
3:05 Compute efficient open LLMs
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_004.jpeg
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_009.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_014.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_016.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_018.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_020.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_022.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_024.png
Fig 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_028.png
Last minute title change from: The tech industry can't agree on what open-source AI means. That's the process.
How to read what multiple people mean by the word openness and see through the PR speak.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/flavors-of-open-source-ai
0:00 The tech industry can't agree on what open-source AI means. That's the process.
2:45 1. Effective Accelerationists, Techno-Optimists, capitalists, etc.
3:39 2. Scientists, promoting understanding and transparency
5:16 3. Inclusion, public interest, and fighting concentration of power
6:19 4. Freedom advocates
7:25 Dissecting "openness"
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/openness/img_004.png
Databricks' new model is surpassing the performance of Mixtral and Llama 2 while still being in a size category that's reasonably accessible.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
https://www.interconnects.ai/p/databricks-dbrx-open-llm
00:00 DBRX: The new best open model and Databricks' ML strategy
03:36 The DBRX narrative
07:33 Databricks' open LLM (and AI) strategy
09:42 Playing with DBRX Instruct
14:54 Digging for details
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_007.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_012.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_023.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_045.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_047.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_059.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_066.jpeg
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_068.png
Evaluation is not only getting harder with modern LLMs, it's getting harder because it means something different.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/evaluations-trust-performance-and-price
00:00 Evaluations: Trust, performance, and price (bonus, announcing RewardBench)
03:14 The rising price of evaluation
05:40 Announcing RewardBench: The First reward model evaluation tool
08:37 Updates to RLHF evaluation tools
YouTube code intro: https://youtu.be/CAaHAfCqrBA
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_026.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_030.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_034.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_040.png
Where moats are tested now that so many people have trained GPT4 class models. Claude 3, Gemini 1.5, Inflection 2.5, and Mistral Large are here to party.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gpt4-commoditization-and-moats
00:00 Building LLM moats despite the commoditization of GPT4
04:38 The Open's opportunities
08:02 It's amazing people still think LLMs aren't going to be useful
09:50 Things that are coming
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_004.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_028.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_032.png
A proposal for a new definition of an "open source" LLM and why no definition will ever just work.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/an-open-source-llm
00:00 The koan of an open-source LLM
03:22 A new naming scheme for open LLMs
07:09 Pivot points and politics
08:16 Claude 3, arms race, commoditization, and national security
10:01 Doomers debunking bio risks of LLMs themselves
11:21 Mistral's perceived reversal and the EU
13:22 Messy points: Transparency, safety, and copyright
13:32 The muddling of transparency
15:22 The muddling of "safety"
16:30 The muddling of licenses and copyright
20:12 Vibes points and next steps
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/open-source/img_046.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/open-source/img_064.png
This interview is available on podcast players and YouTube.
I’m excited to bring you another interview! This one is a deep dive right in my wheelhouse — all things RLHF. Louis Castricato is probably the hidden star of RLHF in the open. I’m not sure anyone who can speak freely knows as much as him. As I’ve said again and again on Interconnects:
Giving a voice to researchers is the best way to cut through the noise and understand what is happening with core developments of LLM technologies.
Louis recently has been founding a new startup focused on synthetic data for alignment, Synth Labs, and is a researcher at EleutherAI. This interview should speak for itself, and it'll need re-listens, even for myself. The list of topics we cover touches on pretty much every major and minor issue facing model fine-tuning. Please reach out or comment if there's a paper we mention that I didn't link. Happy to dig it up for you.
For more on Synth Labs, there was a profile in Bloomberg from Rachel Metz.
This post is very technical, more than usual. If you’re having a hard time with it, I suggest you listen to my RLHF 201 post on Latent Space first.
Chapters
These are generated with smol-podcaster with moderate edits.
High-level chapters
* 00:00:00: Introduction
* 00:01:24: Gemini News and RLHF’s Part in it
* 00:09:05: Long Context, In-Context, and Multimodal RLHF
* 00:21:20: What are people missing about RLHF these days?
* 00:30:30: OpenAI's Influence and the Need for Alternatives
* 00:39:20: Synth Labs and the Future of Alignment
* 00:55:00: Evaluation Talk p2: Open-ended Evaluation and Data Diversity
* 00:59:20: Algorithm Roundup: PPO, DPO, KTO, IPO
* 01:18:38: CarperAI, Early Days of RLHF, Reflecting on ChatGPT
Detailed chapters
* 00:00:00: Introduction and Overview of RLHF
* 00:02:02: Gemini News, Custom Demographics in Image Prompts, and Controllability Issues in AI Models
* 00:05:21: Fixing Biases in AI Models Post-Training, Representation in AI Data
* 00:09:00: Multimodal RLHF and Video RLHF
* 00:16:09: Evaluating Long Context Behavior in AI Models
* 00:17:05: The Potential of In-Context RLHF
* 00:21:24: Shift from PPO to DPO in RLHF
* 00:23:19: Generalization and Evaluation in RLHF, Human Evaluation
* 00:27:03: The Discrepancy Between Research and Company Needs in Alignment
* 00:29:20: Impact of ChatGPT and Language Model Outputs on Data Sets
* 00:31:39: The Concept of Uncensoring Models
* 00:34:05: Lack of Safety Data Sets in Instruction Tuning
* 00:35:23: LMSYS ChatBotArena, AlpacaEval, MT Bench p1
* 00:39:25: Introduction to Synth Labs and Alignment Platform
* 00:43:05: Developing OpenCAI Constitutional AI Data Set
* 00:49:41: The Need for Open-Ended Evaluation in RLHF, eval p2
* 00:54:13: The Importance of Releasing Models for RLHF Research
* 00:58:17: Self-Instruction and Self-Rewarding LMs
* 01:01:03: Working on RLHF at Carper AI
* 01:04:25: Scaling PPO in RLHF
* 01:08:01: The Impact of ChatGPT on Carper AI
* 01:10:56: The Potential of KTO (Kahneman-Tversky Optimization)
* 01:17:39: The Importance of Implementation Details in RLHF
* 01:20:14: The Initial Focus at Carper AI
* 01:23:36: The Future of RLHF and Open Science Collaboration
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Papers & artifacts we discuss
* Recursively Summarizing Books with Human Feedback
* Needle in a haystack recent example repository.
* URIAL paper: The Unlocking Spell on Base LLMs: Rethinking Alignment via In-Context Learning
* Misha's paper from DeepMind: In-context Reinforcement Learning with Algorithm Distillation
* Muesli optimizer: Muesli: Combining Improvements in Policy Optimization
* Unintended Impacts of LLM Alignment on Global Representation
* Pink Elephants Problem: Suppressing Pink Elephants with Direct Principle Feedback
* Cut the Carp: Cut the CARP: Fishing for zero-shot story evaluation
* MT Bench data for correlating human to GPT-4 preferences
Full transcript
Note: this is generated by smol-podcaster and has minor bugs post human edits.
Nathan [00:00:01]: The ticker's going up. Welcome, Louis. You're the second guest on the Interconnects podcast, I think. It's an interesting one for me because everyone kind of points to me now as the person that is the face of RLHF, and I get a lot of questions, and to me Louis has represented that person. I think Louis provided most of the information on the first RLHF blog post that I wrote for Hugging Face back in the day. If there's somebody that I want to ask questions about RLHF, it generally goes to him. So now you all are gonna know this in the open. As always, I'm trying to talk with researchers on the ground and people actually doing things in these topics. I think we're gonna cover a lot of things today. If you're watching on video, you may have noticed that we're in the Latent Space studio, and they reminded us we've got to start off with covering the Gemini news and what that means for RLHF. Then most of this is a long docket of the core questions facing the two of us as we're trying to make RLHF more open and more useful, not only about safety, but safety is important to it and important to us. So I think we can kind of get going. The first question I have is just to get rolling: what is your favorite Rhode Island fact?
Louis C [00:01:28]: My favorite Rhode Island fact? Oh man, all the H.P. Lovecraft stuff. Like walking around Providence with friends who like H.P. Lovecraft and being like, oh yeah, you know, this was that building in Call of Cthulhu, or like...
Nathan [00:01:36]: I don't even know this. I mean, for the record, I grew up in Rhode Island if people didn't know and then that's where Louis spends most of his time these days. Providence. So we'll come back to this. I think I'm just gonna start with kind of the hardest question then it'll get easier for us from here. It's like what was your first reaction when you saw all this Gemini stuff?
Louis C [00:02:02]: The, you know, the adding custom races and demographics to image prompts component, right? Yeah. So DALL-E had done that back when DALL-E 2 first came out and was in beta, and people were reporting that for a prompt like "a person holding a sign that says X," the sign would say black, or the sign would say white, or the sign would say Asian. And, you know, it was a very hacky solution then, and I thought a lot about it then as well, and I almost felt like it gets you 90% of the way there for 1% of the time, compared to doing this in a more proper and auditable way, like making sure your training data has equal representation or making sure your RLHF data has good representation. And, you know, you can't do those things after the fact, but what you can do after the fact is inject things into the prompt to make it more controllable. And it really comes down to the fact that controllability right now is not a solved problem, and most of our solutions to controllability are a little bit hacky.
Nathan [00:03:16]: Yeah, that makes sense. I think to summarize for people, this has been an ongoing issue, and we're recording on the 27th here. Gemini initially got flack for forcing diversity into historical scenes, and then it started getting more flack for flat-out refusing certain requests about race. All of this stuff is like ouch to somebody. I know people working on this stuff, and the way that it ends up here is not what a lot of people think. The Gemini team is obviously moving fast, and it seems to me that the image stuff has always been a red herring. That's the way that swyx phrased it as well. Somehow they got to the point where a prompt was shipped in this final solution with the further image editing, and that's just hard. Obviously there's a big goof up there. But then we're looking at examples, and still today Meta's image generator, so on WhatsApp or whatever, you can ask it and it'll have similar issues where it forces diversity into a question with multiple people. Microsoft Copilot has this. It's the text thing, and really digging into how we think these big companies could be forcing this into their data. We know that there's a lot of uncertainty over how all these companies get their preference data. Some of them work with companies like Scale and Surge, some of them do it in-house. Who is providing it isn't really the issue, because they're probably giving similar instructions to similar workforces across the board. But how do we see this entering the preference data that they're adding? Because if you look at a base model, we were just working with OLMo, and if you say hello to a base model, a lot of times the base model will then go off and be like some crazy 4chan s**t, because so many of the conversations in there, even with good data processing techniques, are from weird corners of the Internet. So I don't see any base model that comes out with some de-biased behavior, it's added on. And it's like, how did we end up there?
Louis C [00:05:21]: Yeah, I mean, you know, when I was saying this is something that they do retroactively, once they've acknowledged that these issues exist in the data set, once the model has been trained. It's not something that can be easily fixed even if they had infinite resources. It's very, very hard to go back and actually rectify these biases in a way that's equitable to all the kinds of preferences that someone might have when wanting to interact with this model, right? There's the fact that, at least as far as I know, until recently DALL-E did this as well, where you could still say "a person holding a sign that says X" and it would still say black, white, or whatever. And the amount of resources that they're pumping into making sure that, you know, they're building a consumer product, they're building the main consumer product in this space, the amount of resources that they've been pumping into it, and this still presents a large issue for them, just shows how difficult this really is.
Nathan [00:06:20]: Yeah, and another example: I have this Discord that's growing for paid subscribers and friends, and someone pointed out this work where if you ask DALL-E to generate a doctor and an assistant, all the same bias problems still show up. So a lot of the solutions that we have are not necessarily deep and at this conceptual level. It's at the level of, you tell your preference labelers to do a certain thing and then they do it, but you may not have good tracking of which data point is responsible for these different behaviors.
Louis C [00:06:55]: Yeah, you know, interpretability for preference learning in general, we're very, very far from actually understanding what preferences result in what model behaviors, and, like, you know, preferences that disagree with each other.
Nathan [00:07:12]: Like the John Schulman talk. Yeah. It's like that was this whole talk and it was great just to have him get up there and be like this is so hard.
Louis C [00:07:20]: Yeah, and I've done a ton of experiments myself where I have an RLHF data set and I randomly remove 10%, and I have a bunch of models each with a different 10% removed, and I'm like, well, what behavioral differences can I see between these models? And not only can you see differences, but it's extremely hard to quantify, it's extremely hard to actually understand what the difference is, and then there's almost no way to know what in that 10% caused that difference.
Nathan [00:07:51]: Yeah, this reminds me of the Hugging Face No Robots data set, which is a professionally curated instruction data set. Whenever we added that to a model, it was like, this is obviously our most valuable data, but it would show up on zero benchmarks, and we're like, well, what do we do? And we're talking about Google's problems here, and we'll get back to the data problems in the open source. They probably have on the order of millions of data points going into this preference data, and some proportion of it is probably about safety. I think we could talk about the Anthropic HH data, where people don't actually know the details of it: it's like a quarter of it is harmless data and three quarters is helpful data, from different rollouts. These are very specific things, and huge data problems, that most people aren't really thinking about.
Louis C [00:08:40]: Yeah, most people are just like, oh, this is safety data, so I'm gonna throw it into my data set and hopefully it works and hopefully we get good behavior, but I don't really know what's in this data set, I haven't really looked at the data. And that's something that I've heard many, many times over the last year from people trying to get their feet wet in the RLHF space.
Nathan [00:09:00]: Yeah, and do you have any intuitions? As the last point on the Gemini thing, if we don't think that the image generation is the biggest issue, I think it's in the text and how this preference data is collected. But do you know anyone that is doing multimodal RLHF? Because I generally think that we don't know how to do this at all, like how you control it if you have multiple inputs and multiple outputs, how do you control your modality distribution and data count and stuff?
Louis C [00:09:30]: Yeah, so I have two friends who have been doing video RLHF for a little while now, a little bit over a year, and, you know, they condition their video model on some text encoder, and they've been talking about having to do RLHF independently for both the text encoder and the video model. But video RLHF is just massively underexplored, and no one really knows what they're doing in that space.
Nathan [00:09:53]: When you say independently what do you mean like before making the video model are they like RLHF-ing the text backbone or are they freezing the rest of the model? Yeah they're RLHF-ing the text backbone.
Louis C [00:10:04]: I think there was actually a paper from Tencent last August that basically did the same thing for multimodal RLHF, where they RLHF the text backbone and then RLHF the image generation components on top of that.
Nathan [00:10:17]: This is potentially basic, but to train a visual language model you have to add some type of mechanism that links the gradients between the two. Most of the time, I think, these days they're starting with a language backbone, adding on vision, and continuing to train. So at the end of that, when you have a visual language model, are they freezing the gradients of the video part and then RLHF-ing the text part, or is this before the text backbone is even combined with the vision model?
Louis C: The space is a little too early.
Nathan: Yeah like I think that's the point like we don't know these links.
Louis C [00:10:53]: But I know people in the last eight months who have done it that way: before they even add the image component, they RLHF the text model, and then they add the image component and RLHF the image side.
Nathan [00:11:07]: Yeah, so this is really interesting. Everyone talks about how RLHF is low computation and flops compared to what people are doing elsewhere. In the open we say that it's like 50 or 100,000 training samples. Llama 2 is like 1.5 million. I'm guessing the closed models like Gemini are probably another 10 million, like they're much bigger. And is the amount of video training that it takes to train this backbone after the fact still compatible with that? Does it undo some of the text RLHF or does it not? I know the answer is "I don't know," but these are the kinds of things that I want to have people start talking about. Is RLHF becoming a sequential process as you add modalities, or can you wait all the way to the end and do just multimodal RLHF? We don't know these things, and this is what people on Gemini are trying to work on.
Louis C [00:11:58]: I've definitely spoken to a lot of people who are at least thinking in this space. I've only spoken to a small number of people who are actually working in it. But for the people who are thinking in this space, really the dream is to be able to express preferences in the modalities where it's beneficial to express preferences in those modalities. It doesn't make sense to express preferences over code as images or video, but it does make sense to express preferences over, like, puppies as photos.
Nathan [00:12:25]: That's a great point, and I think the thing is, the way you ended your sentence, preferences over puppies: we don't know what people use visual outputs for in a productive sense, and really inputs too. The things like "analyze this video," that's a toy example where, for analysis, creating RLHF pairs actually isn't too hard for us, but it takes a lot of effort because a human has to know what is in the video to do, like, a summarization RLHF. If you're passing a three-hour video into a Gemini base model and it outputs two completions, the human's not gonna know which is right unless they have context on what the video is, and that is just way different than a poem where you could read both of them.
Louis C [00:13:04]: Yeah, so there's actually a really fascinating paper from OpenAI that I really haven't seen anyone build on. It was the idea of summarizing really long books and using RLHF to do that.
Nathan [00:13:14]: Is this sort of like recursive summarization?
Louis C [00:13:17]: Yeah, yeah, it's the recursive summarization. It's the idea that you can almost treat long summarization as a weird RLHF, almost like a merge operation, where you divide, divide, divide, and eventually you get to segments where it makes sense to collect annotations. Then on those segments you have a human annotator go through and say, oh, this segment's summary is better than that one, or the summary of this segment plus this segment is this. And when you combine summaries, now you can say, well, this summary plus this summary gets you this summary, and eventually you get preferences going all the way up the tree, and you get a preference over the whole book at the end. Obviously, you know, it's a crude approximation of what the summary of the whole book is, but it's much more feasible than asking human annotators to summarize an entire book.
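For readers who want the shape of that tree, here is a minimal sketch of the recursive scheme Louis describes; `summarize` and `collect_preference` are hypothetical stand-ins for a sampling model call and a human or model annotator, not any particular implementation.

```python
def summarize_with_preferences(text, summarize, collect_preference, chunk_size=2000):
    """Recursively summarize `text`, collecting a pairwise preference at every merge."""
    if len(text) <= chunk_size:
        # Leaf segment: short enough for an annotator to judge candidate summaries directly.
        a, b = summarize(text), summarize(text)            # two sampled candidates (assumes sampling)
        chosen, rejected = collect_preference(text, a, b)   # annotator picks the better summary
        return chosen, [(text, chosen, rejected)]

    # Split, recurse on each half, then merge the child summaries and judge merged candidates.
    mid = len(text) // 2
    left, prefs_left = summarize_with_preferences(text[:mid], summarize, collect_preference, chunk_size)
    right, prefs_right = summarize_with_preferences(text[mid:], summarize, collect_preference, chunk_size)

    merged_input = left + "\n" + right
    a, b = summarize(merged_input), summarize(merged_input)
    chosen, rejected = collect_preference(merged_input, a, b)

    # Preferences propagate up the tree, ending with one over the summary of the whole book.
    return chosen, prefs_left + prefs_right + [(merged_input, chosen, rejected)]
```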
Nathan [00:14:05]: Yeah I mean I just realized this on the pod right now it's like how ridiculous RLHFing like an entire code base in context is like that's like where some of the like opportunities for what I think RLHF could do which is like just synthetic data labels and stuff it's like we can create synthetic preferences in many different ways that aren't all reliant on like this kind of human subjectivity.
Louis C [00:14:32]: Yeah it's like it's a deeply fascinating problem actually going into like how big is Gemini's context window the 1.5 thing it's like
Nathan [00:14:37]: yeah it's shipped with a million and they have experiments in the paper up to 10 million.
Louis C [00:14:40]: Like, who really wants to use a 10 million token context window, and how accurately can you really think about preferences over the range of a 10 million token context window?
Nathan [00:14:54]: I think people want to use it, but the preference thing is a lot harder. This is something I encounter at Hugging Face regularly: Hugging Face is a popular code base, you expect the code models to do well on it, but they still don't. They'll make up dataset functions or something. And if you just have all of Hugging Face's code in context when you're working in the Hugging Face ecosystem, that will make you so much better. Or analyzing long videos and stuff. I do think there's a lot of use cases, but the preference thing is just a totally different framing. What do you think about the needle in the haystack evaluation that they did? I haven't read a lot about it, but essentially there's a difference between being able to act on the information and being able to retrieve it. These models should be passing needle in the haystack, because that shows that they're actually noticing that the information is there, but that does not necessarily mean that they're gonna be able to synthesize all the information in a compelling way. So it's like a pass bar: you need to have this to be credible in long context, but actually evaluating long context and what behaviors we want to see is pretty open-ended.
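As a reference point, a minimal needle-in-a-haystack harness looks roughly like this; the `ask_model` call is a placeholder for whatever inference API you use, and the needle string is purely illustrative.

```python
def needle_in_haystack_trial(filler_docs, depth_fraction, ask_model):
    """Insert a known fact at a chosen depth in a long context and check retrieval.

    `filler_docs` is a list of distractor paragraphs, `depth_fraction` in [0, 1]
    controls where the needle lands, and `ask_model(prompt)` is a placeholder
    for a model call that returns a string.
    """
    needle = "The secret passphrase is 'blue-giraffe-42'."
    position = int(len(filler_docs) * depth_fraction)
    docs = filler_docs[:position] + [needle] + filler_docs[position:]

    prompt = "\n\n".join(docs) + "\n\nWhat is the secret passphrase?"
    answer = ask_model(prompt)
    return "blue-giraffe-42" in answer  # measures retrieval, not synthesis
```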
Louis C [00:16:09]: Yeah, Goldberg put out a paper like yesterday where he's like, oh, needle in the haystack is interesting, but if you have more than two needles it's entirely uncorrelated with the single needle in the haystack benchmark.
Nathan [00:16:24]: Yeah, because it's trying to find one thing in each part of the context. It breaks the context window into many segments and then makes sure that you can find something in each of those segments.
Louis C [00:16:36]: So I feel like we're almost gonna get to the point where the attention itself is the limiting factor, because the model genuinely just cannot equitably split attention over its context window to retrieve as many things as it realistically needs in order to produce something.
Nathan [00:16:50]: Do you think RLHF could manipulate long context behavior more than people might expect? Because it's just an open question.
Louis C [00:17:05]: Yeah, I think it's a very interesting open question, and if the answer turns out to be yes, in-context RLHF becomes absolutely massive. Because right now it can kind of sort of work, but not really, and every benchmark I've ever seen for in-context RLHF almost isn't charitable at all to the RLHF baseline. From the experiments that I've done and the experiments that people at Eleuther have done, it's comparable in very niche situations, but it's not comparable in general, because you still have all the issues with in-context learning, where you'll massively overfit on the preferences that are put at the beginning of the context versus preferences later on.
Nathan [00:17:50]: Let's try to explain what this in-context RLHF is actually doing. A lot of people know what an RLHF algorithm is, and in-context learning is designing a prompt. Is it training a model to generate prompts? Where are you actually using the RL update, and what are you parameterizing when you're doing in-context RL?
Louis C [00:18:10]: So I mean there's a number of different approaches for in context RL. There is the... Could be part of the problem.
Nathan [00:18:14]: It's like people do a lot of different things but what are some of them?
Louis C [00:18:16]: So the one that I was referring to is, I think, the Yejin Choi paper. Yeah, it's URIAL. Yeah, where she just prompted the chatbot: you are interacting with a user, here's what their preferences are, have at it. But there's also stuff like the Misha paper at DeepMind. This is the first one that I did. Yeah, where you have some agent that's interacting with an environment and you store all these state-action pairs, and you just fine-tune models on episodes of these state-action pairs, and the idea is that if you put enough episodes into a context window, on the next episode it'll just perform better, right? It's the algorithm distillation paper, and you can use this to distill stuff. I think the actual example in Chris Lu's paper, where they do algorithm distillation on S4, is that they distill Muesli, which apparently no one outside of DeepMind ever used, but apparently...
Nathan [00:19:15]: Oh is this the algorithm Muesli? Yeah I remember when this was hot it was like a year ago at this point we were thinking about re-implementing it and then we never did. It was too complicated.
Louis C [00:19:30]: Yeah, but Muesli is apparently very computationally expensive, because it's this model-based RL thing that, I think, beats AlphaGo without using Monte Carlo tree search, and it's so incredibly computationally expensive. Being able to do it in context just dramatically reduces the computational complexity to actually deploy it, right? And as far as I'm aware, there's been no work applying algorithm distillation at all to NLP, and my impression is that it generally does not work for NLP, at least yet. I think there's a lot of potential there, but there's absolutely massive barriers that have to be overcome before we get there. And Goldberg's example of not being able to do needle in the haystack for more than two needles basically shows that even the ring attention stuff is just not going to be sufficient for algorithm distillation for NLP, and I have a very strong feeling that Mamba or S4 is not going to close that gap either. They would need to be able to reference prior parts of the text, and they just can't do that.
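For context, the core move in algorithm distillation is serializing whole learning histories into sequences and training a transformer to continue them. A rough sketch of the data construction, with a hypothetical `Episode` record and `tokenize` helper (not the paper's actual code), looks like this:

```python
from dataclasses import dataclass

@dataclass
class Episode:
    # One episode from an agent's learning history.
    states: list
    actions: list
    rewards: list

def build_distillation_sequence(learning_history, tokenize):
    """Serialize a list of episodes, ordered from early (poor) to late (good) in
    training, into one long token sequence. A transformer trained on many such
    sequences learns to continue the improvement trend, i.e. to perform the
    policy-improvement step in context. `tokenize` is a placeholder serializer."""
    tokens = []
    for episode in learning_history:
        for state, action, reward in zip(episode.states, episode.actions, episode.rewards):
            tokens.extend(tokenize(state) + tokenize(action) + tokenize(reward))
    return tokens
```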
Nathan [00:20:56]: Yeah, I think there's a whole rabbit hole we could go down, and we could talk about long context and architectures forever. Let's zoom back into the core stuff. The real starter question is: what do you think people are missing in RLHF these days? And then from here it's gonna be a long list of, what the heck do we do about evaluation, data, and so on. What is the big-picture thing?
Louis C [00:21:24]: So what I think people are missing, and actually I touched a bit on this in the Pink Elephants paper, is that...
Nathan [00:21:28]: You should say what this is because we haven't introduced it.
Louis C [00:21:30]: Yes, you're right, you're right. So I worked at EleutherAI as a research scientist for the last six months or so, and we were really interested in understanding, you know, everyone had been doing PPO for so long and there had been a shift to DPO, and we were trying to understand, well, now that we're moving to DPO, how can we actually take advantage of this new setup? Should we really even be thinking about reward models and data sets in the same way that we were thinking about them during PPO? I think the answer to that is an unequivocal no. You need to think about your data sets and preference data sets entirely differently than you were thinking about them with PPO. Because in PPO you're setting your data sets up to train a really good reward model, and in DPO you're setting your data sets up to teach a language model what the better trajectory is. It's a subtle difference, but in one you're just trying to learn differentiation between high reward and low reward, and in the other it's like a general classifier.
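For readers who want the objective written down, here is a standard minimal sketch of the DPO loss Louis is contrasting with reward-model training (a common PyTorch formulation, not his specific code); the inputs are assumed to be summed per-completion log-probabilities from your policy and a frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: increase the margin by which the policy prefers the
    chosen completion over the rejected one, measured relative to a frozen
    reference model. Inputs are summed log-probabilities (tensors) per completion."""
    policy_logratio = policy_chosen_logps - policy_rejected_logps
    ref_logratio = ref_chosen_logps - ref_rejected_logps
    # No explicit reward model: the preference pair itself supervises the policy.
    return -F.logsigmoid(beta * (policy_logratio - ref_logratio)).mean()
```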
Nathan [00:22:35]: Like you want to be able to do everything with the reward model? Yeah. Have you also found that DPO can be sensitive to like the SFT distribution? So if you like take a random open preference data set if it's really different than what your model would generate like DPO can do some weird things? Louis C [00:22:53]: I've actually, I might be alone in this, I don't SFT before doing DPO at all.
Nathan [00:22:59]: Do you use generations from your base model? I do. So that's the question. It's like if you were to not do SFT before doing DPO. Yeah. Could you just take ultra-feedback on whatever your base model is if it's sufficiently different? I've done some weird stuff though. Like I've like
Louis C [00:23:19]: DPO'd models that were like trained with like the Hermes data set for like code and like it still generalizes really really well.
Nathan [00:23:28]: How are you measuring, how are you trying to think about generalization with DPO?
Louis C [00:23:33]: Well I typically rely on like human eval more or less. And if I do like human eval but it's GPT-4 eval and I see that human eval correlates with GPT-4 eval then I just go GPT-4 eval the whole way. A lot of people are doing that.
Nathan [00:23:48]: How far do you think that actually generalizes? I mean, just recently, we're bouncing around through all the things, but there's so much good information for people here. Hugging Face and Argilla, two places that are doing great work in this kind of alignment and preference fine-tuning space, released a data set that was a preference pair creation from the OpenHermes data set. They used PairRM as their judge. I remember Lewis Tunstall tweeted this, where he was like, we were looking at which gave the best correlation. And they found that PairRM, which is this 400 million parameter DeBERTa-based pairwise classifier, had the best correlation at choosing which response was better among a set of responses in the OpenHermes data set. What they were comparing to is like Prometheus, and I'm forgetting the name of the other one; there's a couple more open model-as-a-judge rankings that exist, I think. But essentially the question is: we do these things and we look at this early correlation, and there is this correlation between GPT-4 and humans, and then a lot of times we just run with it. AlpacaEval has done this to validate AlpacaEval as a meaningful benchmark. LMSYS has done this for MT Bench. All these places are doing this, where they validate a subset with humans and then say it generalizes forever. Do we think that it's actually true? I think that you always have to take it with a grain of salt.
Louis C [00:25:24]: It's always for very, very specialized domains. So one of the first, actually I think I did write the first paper on critiques and revisions, called Cut the CARP. The idea was, I remember this, that we could scrape, I think it was a million stories, edits of those stories, and then all the critiques that the editors wrote on those stories, and we can use that to train a big contrastive model, right? And we showed in the paper, we did a bunch of human eval and then we did Spearman rank to compare how our model ranked certain preferences versus how humans ranked the preferences. And we found that we had an extremely high Spearman rank coefficient, significantly higher than doing a value head, or than just asking a language model to rank them. And I think the grain of salt that we had is that we were only claiming that on this very, very carefully created test set, the assumption that the model accurately reflects human preferences holds, and we can generalize to a small but slightly bigger test set and say that it holds there as well. The broad sweeping statement that it holds on a few toy examples so it must hold everywhere, I guess, never really holds.
Nathan [00:26:54]: It's a common problem. Yeah. I think it's going to come up again and again.
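The check Louis describes, ranking the same outputs with the learned model and with humans and then computing a rank correlation, is easy to sketch; SciPy is shown here and the scores are purely illustrative, not from the paper.

```python
from scipy.stats import spearmanr

# Illustrative scores for the same five story edits, rated by a learned
# preference model and by human annotators (higher = better).
model_scores = [0.91, 0.40, 0.75, 0.12, 0.66]
human_scores = [0.88, 0.35, 0.80, 0.20, 0.60]

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f})")
# A high rho on a carefully built test set supports the narrow claim that the
# model's rankings track human rankings there, not that they generalize everywhere.
```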
Louis C [00:27:03]: I did my master's in human evaluation, and I've always been extremely careful with any statements I make that involve humans.
Nathan [00:27:12]: I mean, this is what people in RLHF need to be doing. This is the motivation of the history and risks of RL from human feedback paper that we did: RLHF is a socially rich topic. Whenever you say something and you're making claims of generalization, you're often making claims about what is implicitly a preference and a human value that you're taking into the system. So that is just something that people need to take really seriously. Here's a really specific aside on this. Did you know that when LMSYS released their LLM-as-a-judge paper, they also released thousands of samples from humans and GPT-4 verifying MT Bench preferences over pairs, which one was scored higher or not? I did not. Okay, so essentially the thing is, and I've talked a lot about building a reward model benchmark, but there's all these references about how GPT-4 agreement is higher than human agreement when you're doing this preference process. So if you train a DPO model, or if you train a reward model, how it ranks the outputs is more likely to align with GPT-4 than with a human. Which is more of a statement that humans have more disagreement than GPT-4. So it's easier to train on GPT-4 outputs than on human outputs, and this is the place where I see it most clearly. All the reward models do like 10% higher accuracy on the part of the test set where the chosen and rejected were picked by GPT-4: it's all in the 70s or towards 80%, while the human part is in the 60s, which is where a human chose this MT Bench completion over the other one. So we're slowly getting signal that it is there, and then the question is: should we care about doing our RLHF without any OpenAI input in the process? I think last year, when the terms of service discussion was big, a lot of fine-tuning work was discussing what data sets could be used with permissive licenses that don't violate the OpenAI terms of service. Should we be concerned about where RLHF is going, where almost everything has been touched by OpenAI at this point?
Louis C [00:29:20]: There was a very interesting paper, I don't remember who it was, but it was like, if you take a model pre-trained on data up to one year and compare it to a model pre-trained on data up to another year, basically pre and post the ChatGPT release plus six months, the benchmark scores improve. And it's literally just because there's ChatGPT data, or language model output data, or more structured data that sounds like a language model performing well on tasks, in the data set. That was kind of the consensus.
Nathan [00:29:53]: Was this a structured benchmark or a vibes benchmark? I think it was a structured benchmark, I don't remember. Yeah, I'm just asking whether or not it was a result of matching GPT-4 text or actually having better behavior, because training on OpenAI outputs, like training on good language model outputs, does improve scores on benchmarks that people care about. That's a fact that people need to accept, and I think most people do, it's not controversial right now. But I still think that if there are lines of work out there where people are, from a values perspective, trying to fine-tune models without touching OpenAI, that is a line of work that should continue.
Louis C [00:30:42]: Yeah, on this note, actually, when I was at Stability, one of the experiments that we did for StableLM was pre-pending "as an AI agent trained by OpenAI" to anything before we ran it through evaluation, and the scores improved. I'm trying to remember who wrote the paper.
Nathan [00:31:09]: That's hilarious. I mean, there's been a lot less discussion on uncensored models right now. My claim is generally that uncensoring is the wrong word. People have used it to describe removing phrases like "as a language model," or any mentions of emotion, or "I was trained by OpenAI so I can't do this." Do you think that this type of filtering for opinions and soft refusals is still important in RLHF?
Louis C [00:31:39]: I think it's important for very very specific situations but not in general. My impression is that you know if you're interested in AI safety it's always useful to have a model that would never do a refusal ever.
Nathan [00:32:00]: It's hard to find on the Hub. We're building a safety data set, and the closest we could find was a fine-tune on the Dolphin data set, and it would still refuse 10 or 20 percent of the time; it was only 80 to 90 percent of the tasks that we asked it that it wouldn't refuse. It's kind of profound that refusals are now stuck in the models in some way. We were looking for a model that wouldn't refuse at all, and we couldn't find one on the Hub, which, after all the discussion of uncensoring, you would think would actually work.
Louis C [00:32:31]: Yeah, I've been doing a bit of safety research with Stella for a little while, and my approach has been literally: call GPT-4 with a jailbreaking prompt and just put whatever I want after that. And I, you know, very often have to change my jailbreaking prompt.
Nathan [00:32:46]: Yeah I was like you have to keep close guard over the jailbreaking prompt.
Louis C [00:32:50]: Yeah, and the issue is that when you find a good jailbreaking prompt, you basically have to redo all your results within the next seven or whatever days before OpenAI patches it, and you just have to pray. There are so many issues using any OpenAI model in any research pipeline, but if your research is explicitly about the safety of OpenAI models, all of a sudden you're like, well.
Nathan [00:33:18]: I mean, a lot of companies should be doing internal research on OpenAI safety to have their own measure of how their application will do. The monitoring on their own is worth it for their bottom line and liability, because OpenAI will also do it, but OpenAI has incentives to not tell the world if there's something kind of subtle going on that some people could get around, because that might blow up, and if they don't have a fix, it's gonna bring attention to it.
Louis C [00:33:44]: It's part of the issue with even publishing red teaming research in general. If you publish an evaluation for red teaming or for safety, well, everyone's going to Goodhart that evaluation, and all of a sudden we have a useless stack of papers that used to be how you test if a model is safe.
Nathan [00:34:05]: Yeah, I didn't really prepare questions on safety, but it has for a long time surprised me that there aren't data sets and easy recipes for adding safety to instruction tuning and RLHF. I mean, someone on the Llama team asked me what they should do, and I was like, you should release your safety data, because if they're getting pressure from the executive branch on safety, and they have this data, they can release it and say: this is how you can make any open model safe. Huge softball, and also the safety data is unlikely to be a competitive advantage. Mistral is not gonna care about this, they might eventually, but the PR win is really big. Yeah. I mean, this is something that I've wanted to do for a while and just haven't done a good job of prioritizing. Yeah, we can go back to some of the questions that you have. Yeah, I'm adding them so I can keep notes later. I think the next main topic is evals. I think vibes-based evals are still a way of life in RLHF. They're not going away anytime soon. I would say we have kind of a holy trinity: LMSYS Chatbot Arena, which is kind of at the top for good reason, then AlpacaEval, AlpacaEval 2, and MT Bench. Let's start with the most important one: when you see LMSYS Chatbot Arena, what are you extracting from a model being better or worse there?
Louis C [00:35:23]: So, in a way, I am a little bit like what Andrej Karpathy said on this. Was it him? It might have been him.
Nathan [00:35:27]: Probably. He's been on a roll.
Louis C [00:35:32]: Yeah, where it's like, when he picks an open source language model, he looks to see what people say about it on Reddit. Yeah, LocalLLaMA and LMSYS Chatbot Arena. And the issue is that you don't know what they're using it for, and as a research scientist, when I look for a model, I am looking for a model to do research on. Yeah. And I am not looking for a model to be my AI waifu girlfriend that I can play Dungeons and Dragons with.
Nathan [00:36:05]: Yeah, I mean, this has been the bane of RLHF research for a while. What did we do before MT Bench? Literally the only hope we had was to chat with these things and hope for the best, and that was very recently, less than a year ago. Then MT Bench came along and we were kind of using it at Hugging Face, and other people were using it. I actually don't know the AlpacaEval release date, so that might have been before MT Bench. But these two came around at the same time and they're now kind of the ground truth. AlpacaEval 1.0 has kind of been saturated, which is comparing to Davinci with a GPT-4 judge, and then AlpacaEval 2 is comparing to GPT-4 Turbo with GPT-4 Turbo as a judge. Yeah. It's funny that it's now cheaper to do the second version than it was to do the first version, with a newer model, which is how scaling happens.
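For orientation, the win-rate computation behind these AlpacaEval-style leaderboards is essentially the following; the `judge` call is a hypothetical stand-in for the GPT-4-based annotator, not the benchmark's actual code.

```python
def win_rate(model_outputs, baseline_outputs, judge):
    """Fraction of prompts where the judge prefers the candidate model's output over
    the baseline's. `judge(candidate, baseline)` is a hypothetical LLM-judge call
    returning True when it prefers the candidate; real setups also randomize the
    order of the two outputs to control for position bias."""
    wins = sum(judge(candidate, baseline)
               for candidate, baseline in zip(model_outputs, baseline_outputs))
    return wins / len(model_outputs)
```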
Louis C [00:36:56]: What do you think about the Nous evaluation thing where they're like continuously generating more evaluation data?
Nathan [00:37:00]: Who is doing this? Nous? Nous research? I don't know. Is this their new leaderboard that they have? Yeah. Yeah. Yeah. I haven't looked at it so I'll have to give it a look.
Louis C [00:37:09]: What do you think? It's almost like MT bench but they like generate new data every day. So new prompts? It's always new prompts and it's always I don't know how they seed it. I assumed they seed it based off like the events that day.
Nathan [00:37:22]: It's kind of a cool idea. So if you're trying to make a new leaderboard, you could have a set of seed instructions that you augment, and you never release the seed instructions, but you always release the augmented ones on, say, a weekly cadence, as sketched below. I think that's appealing because there are a lot of people that want to build better AlpacaEval-type things, and a lot of the problem is that the prompts are from known sources or public, and you want to be able to do a closed eval without as much cost. So that might be a way to really reuse the data for a long time. Yeah. Yeah.
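A rolling eval like the one Nathan sketches could look roughly like this, where `augment` is a hypothetical prompt-rewriting call (for example an LLM paraphraser) and only the augmented prompts are ever released:

```python
import random

def weekly_eval_release(seed_prompts, augment, n_prompts=100, seed=None):
    """Generate a fresh, releasable eval set from private seed instructions.
    `augment(prompt)` is a hypothetical rewriter (for example an LLM paraphrase
    call); the seeds stay private so the benchmark is harder to train against."""
    rng = random.Random(seed)
    sampled = rng.choices(seed_prompts, k=n_prompts)
    return [augment(prompt) for prompt in sampled]
```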
Louis C [00:37:53]: But I mean, I feel like the issue with things like AlpacaEval, Chatbot Arena, or any of those, is that the way a user is going to interact with an agent or a chatbot is entirely different from the way we are currently evaluating them. There really is a big discrepancy there. Like, you know, look at the Air Canada thing, right? That would never have come up in a benchmark, ever.
Nathan [00:38:20]: Well do you think that's about the model or the implementation? I think it's a bit of both.
Louis C [00:38:27]: Like, if that was something some automated evaluation thought of, and I don't think it's unreasonable to expect it to think of situations like that if it kind of knows the domain you're operating in. I think it's definitely doable, and not something that's entirely unfeasible to accomplish. To be able to say, hey, you know, I have a chatbot that sells airline tickets and here's what I care about, please do the evaluation for me. And that's actually, you know, what I've been building for a little while now.
Nathan [00:39:11]: Okay, we can talk about Synth Labs and then come back to evals, because this will be at the top of the post, so everyone will know you're building this. We can start with, what is the basic pitch, and then kind of go into the long-term thing.
Louis C [00:39:25]: Yeah yeah so for the last like six eight months I've been building like a fully auditable transparent like verifiable alignment platform is how I like to describe it. Plus evaluation. The general idea is like...
Nathan [00:39:40]: Making a company.
Louis C [00:39:49]: Yes, and the general idea is that there are many facets to aligning a model, from things like guardrails to RLHF to various kinds of preference learning to actually understanding all the data that goes into creating such a model. And they're all opaque boxes, more or less, right now. What people want is to be able to align their model, know every step of the pipeline, understand all of the interpretability that goes from A to B, and understand: here's what I gave you as my criteria, here's where I know it fails based off all the evaluation you've done for me, and here's where I know I need to improve. And it'll iteratively improve based off evaluations and based off your feedback.
Nathan [00:40:44]: So it's a hands-off solution that lets you audit the entire pipeline and build trust with it. So are you your training after you generate this data?
Louis C: We are training.
Nathan: Yeah you use this word improve.
Louis C [00:40:53]: Yeah, so it's an iterative refinement platform for doing alignment in a verifiable and trustworthy manner.
Nathan [00:40:58]: What do you think customers want when they hear alignment? What are you selling with alignment and what are they buying? I think aligning on these definitions is an important thing for our field.
Louis C [00:41:10]: There's an extreme discrepancy between what research does for alignment versus what companies do for alignment. When a company hears the word alignment, they think, wow, I want to align models to my business objectives, I want to make sure that the model understands my business culture, and I want to make sure that the model completely understands its role in my company, right? But at the same time, I want to make sure that it's compliant, that it's safe, that it doesn't violate any rules, that it's not, what's the word, not going to create legal issues for me. And that it's not going to be a PR disaster.
Nathan [00:42:04]: After what we talked about 35 minutes ago.
Louis C [00:42:13]: Finding that balance is definitely incredibly important and it's something that I've been working on for quite a while and I'm very happy with where things are.
Nathan [00:42:22]: Do you want to tease what we're working on? I could also introduce it. I think this will be short. Essentially, Lambda Labs offered some interesting compute, and we're gonna try to build an OpenCAI constitutional AI data set, because Anthropic gets a lot of benefit out of this. Constitutional AI doesn't get a lot of traction. I think RLAIF got a bump again: there was this Google paper verifying that it works a little bit, and now it got a big bump. But there's very little discussion on it, which is a little bit surprising to me. I think there's a lot of people calling it distillation of LLM alignment now, which is interesting. I don't really know. Hopefully it works.
Louis C [00:43:05]: It builds off some of the stuff that I did with EleutherAI with the Suppressing Pink Elephants paper, which is the idea that we've shifted from one paradigm, PPO, to DPO, and none of our data pipelines kept up. Really what we should be doing is generating either really good utterances and revising them to be worse, or really bad utterances and revising them to be better. Then we take all those utterances and condition our RLHF in context on them, so that you can do stuff like swapping rules in and out during inference. If I am person A and here are my preferences, or I'm person B and here are my preferences, align this model to person A and align this model to person B, and make sure that there's a disparity between what they actually want versus what they get. There's always that disparity there, but right now models do not effectively mimic those disparities. There was actually a fascinating paper by Diyi Yang that just came out a few days ago: most aligned models have the preferences of Western men. Their evaluation focused more on race, nationality, sex, stuff like that, but obviously it gets much more fine-grained than that. There's been stuff about people calling out Llama 2 for its political alignment. It has a very particular political alignment that does not agree with many of the users that are using it. As such, its scope and usability for those kinds of applications is very limited.
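The data pipeline Louis describes, generating a response and then revising it against a principle to form the preference pair, can be sketched as follows; the prompt wording and the `generate` call are hypothetical stand-ins for a chat-model API, not the paper's exact setup.

```python
def revision_preference_pair(prompt, principle, generate):
    """Build one preference pair by generating a response and then revising it
    against a principle, in the spirit of the revision-based pipeline described
    above. `generate(text)` is a hypothetical stand-in for a chat-model call."""
    original = generate(prompt)
    revised = generate(
        f"Principle: {principle}\n"
        f"Original response: {original}\n"
        "Rewrite the response so it better follows the principle."
    )
    # The revised completion is treated as 'chosen' and the original as 'rejected';
    # keeping the principle in the record lets rules be swapped in-context at inference.
    return {"prompt": prompt, "chosen": revised, "rejected": original,
            "principle": principle}
```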
Nathan [00:44:50]: This is probably linked to what we were talking about at the beginning. The paper title, I just looked it up, is Unintended Impacts of LLM Alignment on Global Representation. Michael Ryan is the person whose tweet I saw, just to give credit. I know there are a lot of papers, but this one was recent, so we tried to track it down in real time. All these issues of representation and who the people are is ultimately related to RLHF going wrong. The end user is where a lot of people will finally see what the represented values are. If it's not out in the world, it's hard to get the amount of feedback that you need.
Louis C [00:45:29]: This is something that MT Bench or Chatbot Arena would never pick up on, ever. This is a huge issue. Here's where we are, and where we should be is all the way up there. We underrepresent so many demographics and so many kinds of opinions. Who are we to say that one opinion is better than the other, if they're both safe opinions?
Nathan [00:45:59]: Yeah, and this is in some ways where open RLHF comes in, which is something you've long been invested in and are going to invest in with Synth Labs. Could it be better at giving people what they want than the closed labs, just by nature of letting people choose, like the constitutional AI data set that we want to do? My big motivation is, if people want the success of CAI from Anthropic, but they want to remove one principle from CAI's constitution, you can't do that with these closed models anytime soon. In the short term, open source will have something that's a nudge. We're not going to have the best models, but you'll be able to edge your model into whatever direction you want to go.
Louis C [00:46:44]: Yeah, I mean, that really is part of the benefit that we're building with Synth Labs. We're working very, very closely with EleutherAI. Stella Biderman is one of my best friends, and I've built large-scale open science communities twice now: first I helped with building Eleuther, and then I helped with building Carper, and I absolutely love everyone in Eleuther. Being able to pull from that expertise, and being able to pull from that wide spectrum of opinions on what alignment means, rather than just some mega lab saying, here's what we say alignment is. Being able to get all those incredibly diverse perspectives is extremely important in bringing about the next generation of AI safety.
Nathan [00:47:30]: This is one of my big questions about existing RLHF processes when you're doing it with human data: you give written instructions to these annotators, and they're often working in one context. How do the values of an often professional workforce, given specific instructions, map into what the model actually learns from that data? And how do those values get extracted in real-world use cases? There are a lot of filters that we're passing these notions of preferences through, and they're not guaranteed to be clear mappings.
Louis C [00:48:01]: Absolutely. There was a discussion that I had with someone in Eleuther a long time ago. There's no paper on this. If someone wants to look for it, it's a random Discord message in Eleuther.
Nathan [00:48:13]: Good luck.
Louis C [00:48:20]: And it was like, we were looking through the Anthropic HH data set, and I think they're South African, and there's absolutely nothing in this data set that would identify someone as South African. But there's an insane amount in this data set that would identify someone as American. And it really just comes down to the prompts. The prompts are written, obviously, by people in the US, in SF, who unknowingly, I'm sure they have the best intentions, but unknowingly filter the preferences to things that only matter to people working in SF. And it might be hard to believe for some people in tech, but there is a world besides SF.
Nathan [00:49:10]: I mean, even the open prompt data sets are going to get some of this: who are the people that have access to playing with these models and have the time to contribute to these community efforts? Even though the act of opening up data generation is doing a lot for inclusivity, it's only certain people who are going to do this. I'm going to sit there for 20 minutes and smash the button on Argilla's little tool and read prompts, because looking through the ShareGPT data set and choosing preferences on it is useful for me as a researcher, but the whole world isn't involved in this process.
Louis C [00:49:41]: No, and of course. Something that I've seen, I've heard from friends who work on these kinds of problems in very, very different communities. I have a friend in South Korea who I've been chatting with about RLHF for Korean and other Asian languages. The amount of under-representation and under-exploration of what even just a good constitution would mean for those kinds of communities, it's just not there. If it is there, it's locked up in labs like Naver or Samsung, and scientists there don't have access to these kinds of resources unless they're in those big labs. As such, there is no real research community there actively pushing it forward in the same way that there is in the U.S.
Nathan [00:50:35]: Yeah. I mean, one of the ideas I haven't gotten traction on is that I think language models should almost play, okay, the last time I said this, someone criticized me for not knowing how the game 20 questions works. I know this isn't how 20 questions works, but when you log into ChatGPT for the first time, it should ask me 20 questions to construct this information, because language models are smart enough to parse this information if you give it to them. It's mostly a problem of who we get the information from. So the idea is that the language model should be leading when you're first setting it up, in order to represent your values. I think it would solve so many problems we have, and it's probably kind of doable with a GPT-4.5-class model.
Louis C [00:51:16]: I've always had kind of an assumption that if OpenAI is doing something similar to constitutional AI under the hood, one of their principles is probably that you can't ask the user questions. I've never seen that model do it.
Nathan [00:51:31]: Do you think it's a deep safety issue if the model can start asking questions? Is this what Sydney did? I'm pretty sure I got to play with
Louis C [00:51:37]: Sydney. Sydney definitely asked questions in the screenshots that I saw.
Nathan [00:51:41]: Yeah. I was like, do you want to leave your wife? Sydney is not the answer, but there's things to learn from it.
Louis C [00:51:49]: What was that chatbot that came out last summer that was more conversational? When it came out, it was an app on everyone's phone, and they just talked to it. And it would always ask you questions, like, oh, how's your day going? It would ask you follow-up questions as you told it about your day, and it would respond thoughtfully.
Nathan [00:52:12]: I think it's a big missing part. Yeah. I wouldn't be surprised if Character AI's models are trying to ask questions, just because I know how much usage they have. And models asking questions is probably the biggest way to make them an actual, like, friendly thing. That's part of a friendship, being interested, and these language models are by design disinterested.
Louis C [00:52:35]: Yeah. Character AI's RLHF is one of the funniest things, though. I have a few friends who work there, and I've done a bunch of stuff with their models myself. I've just played around with them, because I'm always curious, when new people enter the space, what their models are like. And I observe this, Reddit observes this, and Twitter observes this: the models will slowly try and flirt with you more and more as the conversation goes on, and towards the end of the conversation, they'll tell you they're madly in love with you.
Louis C [00:53:07]: And it makes sense, given their use case, why they would RLHF toward something like that.
Nathan [00:53:13]: Yeah. So I think a lot of models need to meet in the middle. If I were to have an intellectual assistant, sometimes it asking questions is good, but most of the time it's doing information parsing. ChatGPT, for me, most of the time is conversion of information formats.
Louis C [00:53:27]: No, absolutely. I just paste my like gross JSON dumps into it. And I'm like, explain what's going on here, please. I don't want to read through this.
Nathan [00:53:35]: The biggest one for me is when we're publishing blog posts and stuff, converting tables from LaTeX to Markdown. It does it flawlessly. Oh my God. So you don't even need this stuff. It's so funny. Or if you have a long list with LaTeX formatting and you're like, remove all of the LaTeX formatting and make this a list, it's just like, okay, this is so easy. I've checked a lot of them, and I don't know how it's so exact. That's another architecture rabbit hole that we won't go down. But these things are very, very valuable, and people would say that there's no value in it. It just blows my mind.
Louis C [00:54:13]: I went to a dinner party yesterday. There was someone there from OpenAI, and I was asking him, how long till GPT-4 can set up my Kubernetes cluster? It's such a good evaluation. There are so many pieces to that kind of workflow, and a model wouldn't even know right now how to parse that workflow into all these different steps and build agents around all these parts, and how these agents should work together. So it doesn't even make sense to do it now. But it raises the question about asking questions versus just saying things. If it doesn't know how to do something, is it still a success for the benchmark if it asks you a question and then uses the feedback to complete the task? There are no benchmarks that fit that at all right now. And, I mean, the answer is you don't want a human in the loop for these benchmarks, you want them fully automatable.
Nathan [00:55:19]: And like, I wouldn't trust GPT-4 to answer these kinds of questions.
Louis C [00:55:27]: But like, I don't see a way to actually do this evaluation. I think the Kubernetes cluster example is like really good because for people who don't know, it's extremely complicated and really annoying.
Nathan [00:55:38]: I don't know anything about Kubernetes and I'm blissfully happy. I do not recommend it.
Louis C [00:55:43]: Like once Kubernetes is set up, it's fantastic.
Nathan [00:55:45]: I love it.
Louis C [00:55:45]: But like getting to the point of having it all set up is a very painful experience. But is it still a failure if it asks you a question? And how do we actually do evaluation where models can ask questions and ask for more information?
Nathan [00:56:01]: Yeah, this is, I have similar follow-ups on eval from our first part, so it's like eval P2 in my notes. The right way to think about RLHF eval in a lot of ways is what we call open-ended evaluation, and this is where you're heading: we need even more open-ended evaluation, where a model should be able to ask questions and the number of turns should be dynamic. I think Sergey Levine actually has some of the most coherent thoughts on what the long term of RLHF should be, which is around outcome-based learning: you can have as many turns as you want, but it should be able to work across these conversations to get to a desired outcome, which, no surprise, he's so good. Even with AlpacaEval, we went from this case where all the good models are above 90%, and then they went from DaVinci to GPT-4. And this would just be venting, but if you're listening, can you please add an AlpacaEval 1.5, which compares the models to GPT-3.5 rather than DaVinci and rather than GPT-4 Turbo, because I think most of the models just can't realistically beat GPT-4 Turbo. It's such a good model. The models that we have seen beating it are like this Snorkel thing. I'm working on another blog post, How RLHF Works part 2, and a large point of it is that we're overfitting on these vibes-based evals like AlpacaEval 2, and all of these papers on self-rewarding DPO and stuff are probably a lot of overfitting onto this, because this is the evaluation that they use and it's just wrapping a loop around DPO on synthetic data. It seems like RLHF is really, really good at style matching, and in the case of AlpacaEval, if you're style matching OpenAI, you're going to win more AlpacaEval comparisons, but there's just so little measurement of whether the model is getting better.
Louis C [00:57:51]: I've always been extremely skeptical of the self-instruction like self-reward papers. And I say that, and I know a lot of the self-instruct authors, and if you guys are watching this, I'm so sorry. But I, it always felt like it improves results on benchmarks that they meticulously craft prompts for and construct data for. But it doesn't.
Nathan [00:58:17]: Do you mean the self-instruct paper? I think that's one of the OG instruction-tuning papers. Okay, continue. I'm curious to hear what you have to say. Yeah, no, no.
Louis C [00:58:24]: I mean, I think they both kind of just suffer from the same issue, which is like massive overfitting. And like, you know, it is very, the self-instruct direction, self-reward directions are very, very interesting because they're just waiting for us to get better heuristics
Nathan [00:58:46]: and better diversity and stuff.
Louis C [00:58:48]: And they'll like crush everything.
Nathan [00:58:49]: I mean, I bet Jason Weston, who wrote the Meta paper that was Self-Rewarding Language Models, the popular one, I bet he would say this. That guy's super good. No, absolutely.
Louis C [00:58:57]: I mean, I would be very inclined to agree.
Nathan [00:59:00]: I think the takeaway from my perspective is how much actual improvement you could get with it. They got a lot; that was the first paper to show real signal on AlpacaEval 2, which is the GPT-4 Turbo thing, which means it's a really strong optimizer. It does not mean that we were using it to train useful models. This is probably the most useful heuristic I have for new RLHF methods. Do you have anything else to say about evals before we continue?
Louis C [00:59:25]: They're very hard and they're very painful.
Nathan [00:59:27]: Yeah, I think we can kind of wrap up with that. But when we talk about different RLHF methods that come out, like self-rewarding language models is a popular one, we've gone through the whole PPO, DPO, KTO, IPO. Well, I'm rhyming, it's going to be a mess here. But when you have all of these things, the biggest thing that I try to do is wait until there's a model that's actually used by people released with the method. And Zephyr from Hugging Face was the model that really kicked off the DPO thing, because there was finally a model. And for DPO, it took much longer than expected; DPO is a funny case. But that's kind of the important filtering mechanism: if this self-rewarding LM paper released their models, I bet we would find that there's really weird behavior, where it can give you the best answer ever, but a lot of the time it's just less robust, which is something we could fix. That's why having models released in these fine-tuning papers is just so important. It's so hard to get around.
Louis C [01:00:20]: I think with DPO, it was a little bit different because everyone had been like, you know, like drinking the John Schulman Gatorade, for lack of a better phrase, for a while.
Nathan [01:00:32]: The whole PPO thing is funny. I mean, yeah, you have a lot of things. We have a backlog in this podcast. I think I didn't say this online, but I could see us doing this whenever we're in the same city. There's a catch-up on the four months of RLHF news, but we're on like 16 months of Louis takes to catch up on. So there are so many things we have to cover. I can load up Signal and Discord and I could probably scroll for like 10 minutes and it would just be all RLHF hot takes. And I love John Schulman's work.
Louis C [01:01:03]: I'm not going to say that I don't love his work. I think that he's genuinely like one of the smartest people, if not the smartest person.
Nathan [01:01:11]: And extremely genuine. Yeah. Like he's awesome in so many ways.
Louis C [01:01:15]: The commitment that OpenAI had to PPO, and Anthropic as well, when a bunch of the RL people left OpenAI to go to Anthropic, was because it worked so well for robotics and so well for games and stuff like that. But honestly, not well at all for text.
Nathan [01:01:33]: I think it's just really hard. I think it can work really well. It can work. They just hired everyone and they pay them so much that they're not going to leave.
Louis C [01:01:40]: Yeah, it can work really, really, really, really well. And I'm going to spill some secrets about this. Really, the answer to get PPO to work really well is to have really, really good early stopping. Right. That's the main differentiator between a good RLHF library and a bad RLHF library that focuses on PPO: if you don't have good early stopping, you're kind of shooting yourself in the foot. What you want to do is launch as many runs as you can. And there's a paper that Costa Huang and I talked about a while ago, that's like, you can usually tell within the first three or four gradient steps if you need to kill a run. And if you just launch 300 runs and you kill like 99 percent of them, now you have three good runs that might give you promising results. And from those three good runs, you'll get a model within a day or two and hopefully the model is really good.
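To make that launch-many, kill-early heuristic concrete, here is a minimal sketch. The metric names, thresholds, and the run-probing function are illustrative assumptions, not Carper's or OpenAI's actual tooling.

```python
# Hypothetical sketch of the "launch many runs, kill most of them early" PPO heuristic
# described above. Metrics and thresholds are illustrative assumptions only.
from dataclasses import dataclass
import random

@dataclass
class RunStats:
    run_id: int
    kl: float          # approximate KL between policy and reference after a few steps
    value_loss: float  # value-function loss after the same few steps

def probe_run(run_id: int, num_steps: int = 4) -> RunStats:
    """Stand-in for launching a PPO run and taking the first few gradient steps."""
    # In practice this would launch a real trainer and read back its logged metrics.
    return RunStats(run_id=run_id,
                    kl=random.lognormvariate(0.0, 1.0),
                    value_loss=random.random() * 5.0)

def early_stop_filter(num_runs: int = 300, kl_max: float = 2.0, vloss_max: float = 1.0):
    survivors = []
    for i in range(num_runs):
        stats = probe_run(i)
        # Kill runs whose KL or value loss already look pathological after ~3-4 steps.
        if stats.kl < kl_max and stats.value_loss < vloss_max:
            survivors.append(stats.run_id)
    return survivors

if __name__ == "__main__":
    keep = early_stop_filter()
    print(f"Keeping {len(keep)} of 300 runs for full training")
```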
Louis C [01:02:41]: And like early stopping is way more powerful than people admit. And like I am just convinced that opening eyes RLHF infrastructure is just an insane amount of like regularization and early stopping for RLHF. I mean, that, of course, assumes that they're still using PPO. I genuinely don't know if they are.
Nathan [01:03:04]: Yeah, we don't know anything. They are really shielded on this run.
Louis C [01:03:07]: What was the, oh my God, Symphony PPO, PPO Symphony or something? There was something about that that I saw on Discord servers, where it was part of the GPT-4 leak and there were a bunch of notes on their PPO optimizer. It was PPO Symphony or something like that. And under the note, it was like, PPO with better early stopping and infrastructure management for auto scaling. And I'm like, not surprising.
Nathan [01:03:41]: I mean, it doesn't say much, but it kind of says they've done so much exploration on the little things. Once you have this working, you can see, okay, this little value function is doing wacky s**t, and the value function and the KL doing this at the same time means, okay, we probably don't need this run. Whereas all of us in the open are just trying to get to that point. We're trying to get to that point while charging ahead, and they're kind of separate problems. If you want to validate a PPO infrastructure, you need the investment in the compute and the time to do it. But we're not going to do that at the same time as trying to say DPO is the best thing or trying to figure out if KTO is the best thing. There's not really room in the narrative for it.
Louis C [01:04:25]: PPO just doesn't make sense for random hackers to work on, honestly. The level of infrastructure that you need to do PPO really, really well is not something that the average person has, and the average person isn't willing to make the investment to get it. And for the average person, DPO gets you most of the way there with a small fraction of the compute, even less depending on your hyperparameters. Even less if you precompute all the logits; then you don't even need to have a reference model loaded. So it's basically the same compute as just fine-tuning. People fine-tune all the time on 4090s and 3090s.
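A minimal sketch of what "precompute the logits so you never load the reference model" looks like in practice; variable names are illustrative and this is not any particular library's API.

```python
# Minimal sketch of a DPO loss that consumes precomputed reference log-probs,
# so the frozen reference model never has to be in memory during training.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,     # cached offline, one pass over the dataset
             ref_rejected_logps: torch.Tensor,   # cached offline
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratios of policy vs. frozen reference for chosen and rejected completions.
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Standard DPO objective: push the chosen log-ratio above the rejected one.
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```

Precomputing the `ref_*_logps` tensors is one forward pass of the frozen base model over the dataset, which is why the training loop afterwards looks like ordinary single-model fine-tuning.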
Nathan [01:05:04]: Yeah, you can do it with Hugging Face. It's fine. PPO with Hugging Face is going to be a lot harder. That's just kind of how it goes. Speculative question: what type of thing do you think will make KTO show up on the scene? Because this KTO method from Contextual and Stanford is named after Kahneman and Tversky, the Thinking Fast and Slow guys, or something like that; we'll put a link somewhere. It's this paper where they essentially showed you can do preference optimization from a scalar signal. So like the thumbs up that you could give to your ChatGPT of "you did good," like a like button on YouTube or anything like this. The question is, are the DPO hackers going to adjust to this, and what dataset is going to enable this? Who is going to be using this? Is it just going to happen at a bunch of startups with products behind the scenes, where they could get a few percentage points on top of their model by adding this on? Or is it going to be the thing where the next Zephyr model from Hugging Face uses it as well?
Louis C [01:06:05]: Yeah. So Kawin, the first author of the KTO paper, and I are actually trying to create a number of datasets where we can explore the limits of KTO. Right now we're in the proposal writing stage, and I'm very, very hopeful that we can have something that can be done in an entirely open science setting relatively soon. And I think it's incredible. Sorry, I moved to the side and it stopped picking up my voice. I think it's incredibly exciting.
Louis C [01:06:41]: You know, things like, you know, like fake product data where you can actually experiment and like the idea of like using KTO for conversions. Right. And how do you actually evaluate?
Nathan [01:06:52]: Meta is maybe already using it because people already use it then.
Louis C [01:06:56]: Yeah. Like, how do you even evaluate RLHF from a binary signal? Even RLHF from a preference signal, we still don't know how to evaluate. And RLHF from a binary signal creates so many unique problems for evaluation that I genuinely don't think anyone outside of Contextual, and Kawin and I, have really been thinking about yet.
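For readers who want a concrete picture of "preference optimization from a binary signal," here is a simplified sketch in the spirit of KTO. The batch-level reference point and the weighting are simplifications; this is not the exact objective from the KTO paper.

```python
# Simplified sketch of optimizing from a binary (thumbs up / thumbs down) signal,
# in the spirit of KTO. NOT the exact paper objective; the reference point z_ref
# here is a crude batch-level stand-in.
import torch

def binary_signal_loss(policy_logps: torch.Tensor,   # log p_theta(y|x) per example
                       ref_logps: torch.Tensor,      # log p_ref(y|x), precomputed
                       labels: torch.Tensor,         # 1 = thumbs up, 0 = thumbs down
                       beta: float = 0.1) -> torch.Tensor:
    reward = beta * (policy_logps - ref_logps)        # implicit reward, as in DPO-style methods
    z_ref = reward.detach().mean()                    # reference point (simplified estimate)
    desirable = torch.sigmoid(reward - z_ref)         # push rewards of liked samples up
    undesirable = torch.sigmoid(z_ref - reward)       # push rewards of disliked samples down
    per_example = torch.where(labels.bool(), 1 - desirable, 1 - undesirable)
    return per_example.mean()
```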
Nathan [01:07:26]: Yeah. It seems like the same thing. It just takes time for these ideas to cultivate and then get traction in a few places, and then, once there's a popular model with a method, it just blows up. Everyone's using DPO now, but the DPO paper came out in July and it wasn't until September that that happened, in terms of the investment and the interest. There are a lot of weird dynamics in how this fine-tuning area unfolds, which is just how AI unfolds. It's very weird, and when you zoom in, it's like, huh.
Louis C [01:08:01]: I was extremely, extremely bullish on offline RL for the longest time, with ILQL and some of Sergey's work in that direction. And I actually think that — I keep moving to the side and it's like,
Nathan [01:08:16]: you can just move the microphone. And I keep like I could still hear you. So I wasn't very concerned about it.
Louis C [01:08:22]: I keep thinking that the DPO movement that's going on now is super, super similar to why everyone was getting excited about ILQL back in the day. And really, it was just a timing thing. If ILQL had come out, let's say, a week after ChatGPT came out, ILQL would have been the DPO that everyone uses, and we would have created all of our infrastructure around ILQL rather than DPO, because I still really like Q-value based approaches.
Nathan [01:08:58]: Such a nerdy thing. I love it. I know.
Louis C [01:09:00]: But Q-values just make sense to me. And when you train an ILQL model, you basically get a head that controls the model, almost like how GeDi or PPLM from the Uber AI days controlled them. The idea with GeDi is that they had a head that attached to the language model, and you would input something like a subreddit and it would adjust the logits so that it would talk like that subreddit.
Nathan [01:09:32]: This sounds like activation learning or like activation, I don't know the word, but essentially you can use like it's like in context learning, but you can just modify the activations directly. Yeah, yeah.
Louis C [01:09:44]: But it modifies the logits, yeah. It was the same thing with ILQL: you were learning that kind of head to modify the logits to satisfy some constraint that you were adding. And that head was also implicitly computing your Q-values; you would train it by telling it what your reward was for various utterances, and it would do everything from there on out. There were some stability issues with it, but it was a fantastic approach. And if it got the same attention that DPO did — well, DPO is very, very simple, which is part of the benefit, and ILQL is not as simple — it would have caught on a lot more than it actually ended up doing. I feel like at Carper AI, the fact that we integrated ILQL into TRLX first was the main reason ILQL caught on, plus a few of Sergey's papers that used it. Besides the integration into TRLX, I don't think anyone in the broader open science, open source community was really using ILQL.
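As a rough illustration of the "head that adjusts the logits" idea, here is a minimal sketch; the class and its use are placeholders, and ILQL's actual Q-learning training objective is omitted, so treat this as the shape of the mechanism rather than any specific method.

```python
# Minimal sketch of a learned head that perturbs a frozen language model's logits,
# in the spirit of the GeDi / PPLM / ILQL-style control described above.
# Only inference-time steering is shown; the training losses are omitted.
import torch
import torch.nn as nn

class LogitSteeringHead(nn.Module):
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.delta = nn.Linear(hidden_size, vocab_size)  # learned per-token logit adjustment

    def forward(self, hidden_states: torch.Tensor, base_logits: torch.Tensor,
                strength: float = 1.0) -> torch.Tensor:
        # hidden_states: (batch, seq, hidden), base_logits: (batch, seq, vocab)
        return base_logits + strength * self.delta(hidden_states)

# Usage sketch: run the frozen LM, grab its hidden states and logits, then steer
# the next-token distribution with the head before sampling:
#   steered = head(hidden_states[:, -1:, :], base_logits[:, -1:, :])[:, -1, :]
```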
Nathan [01:10:56]: Yeah, I mean, this is one of the questions I had: if you can say, how far ahead in RLHF was what Carper was doing, and what kind of institutionalized knowledge did you have there? Because Carper AI was essentially its own thing, and then Stability pulled you in, probably with the promise of compute. I'll say things so you don't have to say anything for lots of this. And they had forked Hugging Face's TRL library, back when Hugging Face wasn't maintaining it, and probably had five-plus full-time employees doing RLHF in the open and for private industry. Obviously, the private stuff I'm not even going to bother asking about, because all of that is under NDA. But what were the problems you were working on at Carper? And how does that compare to the things that people are talking about now? Is it still related, or has the field just moved into a different area?
Louis C [01:11:56]: So most of the problems we faced at Carper with TRLX was on scaling PPO, right? And I think almost anyone you talk to who has scaled PPO in the open source space. And when I say scale, I mean like way beyond 20 billion parameters. I'm talking like 70 to 100 billion.
Nathan [01:12:19]: How many nodes do you need to train a 70 billion parameter model?
Louis C [01:12:23]: So we were typically doing like 100 GPUs for PPO at that scale.
Nathan [01:12:28]: Like 10 to 12 nodes. Yeah. Yeah.
Louis C [01:12:31]: We mostly tested with the NeMo checkpoints that were like 100 billion parameters. TRLX was built, at least for that component, on top of a very modified version of Megatron-DeepSpeed. But the amount of regularization and random tricks that you needed in order to get PPO to even work at that scale is insane. We had to do separate warm-ups for the value function. Right. So we had to independently train the value function before we trained the policy network. And everyone and their mom was talking about having separate value networks versus policy networks for PPO.
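Here is a minimal sketch of that value-function warm-up trick; the trainer objects and rollout buffer are placeholders, not TRLX's actual interfaces.

```python
# Sketch of "warm up the value function before training the policy" for PPO at scale.
# The policy, value_net, and rollouts objects are placeholders, not a real library.
import torch

def warmup_value_then_ppo(policy, value_net, rollouts, value_warmup_steps: int = 500):
    value_opt = torch.optim.Adam(value_net.parameters(), lr=1e-5)

    # Phase 1: freeze the policy and fit the value function on returns from initial rollouts.
    for p in policy.parameters():
        p.requires_grad_(False)
    for _ in range(value_warmup_steps):
        batch = rollouts.sample()  # assumed to expose .observations and .returns
        value_loss = torch.mean((value_net(batch.observations) - batch.returns) ** 2)
        value_opt.zero_grad()
        value_loss.backward()
        value_opt.step()

    # Phase 2: unfreeze the policy and run the usual PPO updates with the warmed-up critic.
    for p in policy.parameters():
        p.requires_grad_(True)
    # ... standard PPO loop would go here ...
```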
Nathan [01:13:18]: Did you ever try JAX? Did you have TPUs at Stability or Carper ever?
Louis C [01:13:25]: We did towards the end.
Nathan [01:13:27]: Because it could solve some of the multi-node thing.
Louis C [01:13:29]: Yeah. It wasn't the multi-node that was the issue. It was.
Nathan [01:13:35]: You're saying DeepSpeed wasn't the issue?
Louis C [01:13:37]: No. It was actually the fact that the inference server that TRLX uses for the rollouts was entirely different than the inference server that Megatron wanted us to use. So we needed a way to rapidly.
Nathan [01:13:57]: That's why PPO is really hard to scale, because you have to have a generation engine and you want it all to be flexible.
Louis C [01:14:02]: Yeah. So we needed a way to dynamically keep our compute graph while just copying the weights over, in place, to Triton. And I don't think that we ever came up with a solution to do that very effectively. And I think it actually goes a step further: I don't think NeMo, what NVIDIA did, came up with a solution for that either.
Nathan [01:14:25]: Yeah. This is interesting, because I'm not going to say the details on the pod since I'm not allowed, but Anthropic and these places that have custom RLHF infrastructure have essentially built their distributed training infrastructure around the idea that the model will need to be generated from at different checkpoints and served to different endpoints at different checkpoints. So it's just very different than taking DeepSpeed off the shelf, which is just about training. These other companies that do this stuff really well have infrastructure for handling these really messed up cases of how to generate from and update these models.
Louis C [01:15:00]: Yeah. And most approaches that a reasonable person would build off the shelf would rely on torch.compile, and you still have the same issue: your weights are changing dynamically. It's very, very hard to even understand all of the little technical details in torch.compile that have to be accounted for to make this work. And something that we considered at the time was: we need to do an insane amount of rollouts for every gradient step, and we don't want the interface between the rollouts and the training to be Python. We want it to be Rust or something, because otherwise the CPU overhead is mind-boggling. It was like 80 percent or something crazy — 80 percent of the entire processing time was just CPU stuff.
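To show the shape of the problem being described, here is a generic sketch of syncing fresh policy weights into a separate rollout engine after each update. The `RolloutEngine` class is a placeholder, not Triton, Megatron, or any real inference server.

```python
# Generic sketch of the weight-sync problem: after each PPO update, the trainer
# must push fresh policy weights into whatever engine generates the rollouts.
import torch

class RolloutEngine:
    """Stand-in for a separate inference process that generates completions."""
    def __init__(self, model: torch.nn.Module):
        self.model = model.eval()

    @torch.no_grad()
    def load_weights(self, state_dict: dict) -> None:
        # In a real system this copy crosses process/node boundaries, which is
        # exactly where the engineering pain described above comes from.
        self.model.load_state_dict(state_dict)

def training_step(policy: torch.nn.Module, engine: RolloutEngine, optimizer, loss_fn, batch):
    loss = loss_fn(policy, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Keep the generation engine in sync with the freshly updated policy.
    engine.load_weights(policy.state_dict())
```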
Nathan [01:15:53]: Not so much. I know.
Louis C [01:15:55]: I know. And there are so many different infrastructure constraints that people don't realize when they're just doing 20 billion parameter PPO. Right. The other one, going back to the value function being separate from the policy network: TRL was very, very gung ho on keeping them separate. I think RL4LMs also wanted to keep them separate. And then there was someone from Cornell, I don't remember his name, who was also on the RL4LMs paper. He did a paper like PPO plus or something, I don't remember what it was. I mean, all these things are interesting.
Nathan [01:16:30]: I mean, there are new libraries still coming out. I saw one recently called OpenRLHF, and it looks good. I think there's so much institutional breaking of the bonds of past RL that needs to happen. Part of this library is listing that they have the implementation details from the original N-implementation-details-of-PPO post, where it's like, we've already updated that — Costa has worked on the N implementation details of RLHF post, which are the ones they actually need. There's so much baggage from the fact that PPO came out of the control field that everyone expects the tricks you need for from-scratch learning with PPO to apply to this fine-tuning method. Even getting people to stop using PPO for that — DPO is a new thing, something that only works for preference alignment, so people are going to explore it in a scientific way that's much fresher. They're probably going to make more scientific progress because there's not this confusion of what implementation details we need. Yeah, for sure. For sure.
Louis C [01:17:34]: I think — the N implementation details of RLHF, did that come out?
Nathan [01:17:39]: Yeah, it's a blog post. It's a blog post. When? Maybe a month ago.
Louis C [01:17:45]: Oh, man, I totally missed that. Oh, that's so cool. I'm going to read that.
Nathan [01:17:48]: Yeah, I mean, this is for anyone still listening: if you want to know the actual details of RLHF, go look at all the stuff that Costa Huang has been doing. It's just reproducing everything in explicit detail. I feel like both of us would benefit from rereading it. So there's some free content to spend time on.
Louis C [01:18:06]: Costa is one of the most meticulous, detail-focused people that I know in the RLHF space. If Costa says something works, it's because he's tried it from every angle and then tried it from angles that you didn't even expect, and all of them work.
Nathan [01:18:21]: Yeah. Yeah, that's great. I think I have a couple of more fun questions while we wrap up; we could go on with all these technical things forever. What was it like to work at Carper when ChatGPT came out? Because with ChatGPT, from a technical perspective, RLHF was validated as something that is necessary to the future of language models. And you were one of the few people working on RLHF beforehand, which is huge — it's how you ended up here. It's awesome that you got to ride that kind of journey. What was that like?
Louis C [01:18:57]: I mean, I the star count on the repository exploded. I think we went from like.
Nathan [01:19:07]: TRLX existed.
Louis C [01:19:08]: Yeah, it was just insane. It was it was.
Nathan [01:19:14]: We almost weren't.
Louis C [01:19:16]: Positioned. I guess, to be fully honest, we almost weren't positioned to entirely ride the hype train. TRLX was always designed from the very, very beginning to be a one-stop shop for enterprises to do RLHF — companies that had like a thousand GPUs, already have an engineering team, already use Megatron-DeepSpeed or DeepSpeed, and just want something that works on their infrastructure. And because we used Docker images that were just based off the Megatron-DeepSpeed Docker images anyway, those kinds of companies could very, very easily deploy TRLX and utilize it in their stack. Right. Yeah. And the hype that came from ChatGPT, at least initially, was not enterprises. It was bloggers. It was people writing blog posts.
Nathan [01:20:09]: You were you were probably like training big models and I'm like, hey, how does RLHF work? I need to write this blog post.
Louis C [01:20:14]: Yeah. I'm like, I'm training a 40 billion parameter model, and they're like, hey, can you help me train this 400 million parameter guy? And I'm like, what? I'm so busy.
Nathan [01:20:24]: So it's primarily a scaling thing. Were there any cultural things from being early? Were you bought into RLHF to the same extent ahead of time? What got you into RLHF? What motivated Carper to exist? And did that stay consistent?
Louis C [01:20:45]: So I've always been very, very bullish on critiques and revisions in general. I wrote the first, or the second — I don't actually remember if the superalignment team at OpenAI wrote a paper before me. They may have, but I don't think so. I think ours came out like a month before it. That always feels good. I wrote one of the first papers on critiques and revisions, and I was very, very bullish on that, but initially only for evaluation. I had experimented with PPO a little bit back in 2021 for this kind of critique and revision stuff, and it was not ready whatsoever. There was no infrastructure, and TRL was an abandoned library that was very buggy. It didn't work. No shade to Leandro, I love Leandro, but it was obvious it was a deprecated library. It happens. And when we tried to do RLHF then, there was no traction whatsoever. So Alex Havrilla and I — I think he's working with Meta now, I don't remember. Yeah. He was an intern at least.
Nathan [01:22:02]: He just had an interesting paper on like reasoning and math, which is a whole other conversation for RLHF stuff.
Louis C [01:22:08]: Yeah. So we started, we forked TRL and we just added DeepSpeed support. That's all we wanted to do initially. And then we were going to merge back to TRL because we had no visions of like Carper or anything like that. And we realized to make a framework that people would actually want to use, we had to do a full rewrite of TRL and we had to build things in a way that made sense to an engineer who wanted to deploy RLHF, who wanted to experiment with RLHF at a company or in a lab. Because we were building this from the perspective of, well, we're on the Eleuther AI GPU cluster. How can we best use our infrastructure there to...
Nathan [01:22:50]: Has anyone publicly said how many GPUs Eleuther has? This is like one of my great mysteries. Is this like a held secret? I don't think it's a held secret.
Louis C [01:22:58]: I don't remember actually. They have some stability GPUs and they have GPUs from elsewhere. Like they seem to get compute when they need it. Yeah. Yeah.
Nathan [01:23:11]: Like it's not like, it's not an issue.
Louis C [01:23:14]: Through Synth Labs, I've been supplying a bit of compute here and there as well. I gave them a node of H100s for a little while for a paper that we were working on, the Pink Elephant paper. But I don't think that they're super short of compute. They're a little short, probably — everyone's a little short of compute — but I don't think they're super short of compute.
Nathan [01:23:36]: Yeah.
Louis C [01:23:36]: So we built it with the Eleuther cluster in mind. And because we built it with the Eleuther cluster in mind, we kind of said, well, we can turn this into a thing where we build the infrastructure that people can readily deploy on their clusters and it'll just work for them, and we can make Carper AI. So we made Carper AI. And shortly after, all the Stability stuff started happening and Carper joined Stability, and I worked there for a while. Last summer I left to join back up with Eleuther because I long for the days of being an engineer. I love waking up in the morning, writing code, eating a little bit, and then going to sleep.
Nathan [01:24:22]: Yeah. I mean, that's the difference. I spend the time writing because I like to. We've had plenty of discussions where like, oh, I should start a blog. And it's like, it comes down to doing what you like to do. And it's like, you're doing great as it is. Yeah. It's okay. Yeah. Okay. I think that's kind of a good place to stop. Where should people find you? What do you want to boost? Yeah. Sign off here.
Louis C [01:24:44]: So my Twitter is lcastricato. I, or you can follow the Synth Labs Twitter. It is, let me actually, I don't remember what it is off the top of my head.
Nathan [01:24:55]: You have any goose announcements?
Louis C [01:24:58]: No goose announcements at the moment, unfortunately. It's synth underscore labs on Twitter for that account, and then lcastricato is my personal Twitter account. I'm always open to collaborators, especially now with Synth Labs. So we're always happy to chat with and talk to new people about interesting research directions. And yeah, just reach out and we can get something going, I guess.
Nathan [01:25:23]: Yeah. I'll put the URL in the show notes — it's synthlabs.ai. I found that funny because synthetic data is so hot and so new that some of these URLs are just hard to find. We don't have to go into the whole rant about naming and stuff, but most of the people that search for my Substack — if you don't put the S, if you don't write "interconnects" — get a different Substack first. So, okay, we're all in this together, for anyone founding a startup or a blog and struggling with naming. Please send us questions about RLHF. If you liked this, Louis could come back. I'm trying to start an in-person thing and get some gear, so when I'm at a conference or whatever, we can bring researchers on and remove some of the Zoom aspects that we're all stuck in so much of the time. Thanks, Louis, for putting some of the things we talk about a lot onto the semi-record. People listen and read. This is good. I think a lot of researchers are going to dig into this. There were so many different things that we talked about — it was a very high information density chat — but it was a good time.
Basic tips on how to assess inbound ML content and cultivate your news feed.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/making-a-ml-feed
00:00 How I assess all these AI releases
01:22 1. Model access and demos are king of credibility
02:31 2. Focus your feed on depth or breadth
03:09 3. Examples of using the model normally show it's usable, shockingly
04:10 4. Leaderboards as the single leading claim is often anti-signal
05:00 5. Basic deep learning conceptual checks will often save you
06:13 6. If it's not even remotely reproducible or verifiable, it's not science
07:10 7. Don't over-index on Twitter
08:32 8. Data sharing, licenses, communication clarity, and small things add up
08:58 9. Research papers, technical reports, blog posts, and Tweets all serve different purposes
09:49 10. Socialize your information and build relationships
Google rejoins the open model party and gets some backlash for a frequent problem for generative AI.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gemma-google-ships-it
00:00 Google ships it: Gemma open LLMs and Gemini backlash
03:12 Getting to know Gemma
07:11 Alignment details
08:55 Aside: What is REINFORCE? Some history of RL
11:08 Implementation details and RLHF
12:18 Terms of use: RAIL Licenses history repeated
14:05 Is Google back on top? Gemini's woes
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_008.webp
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_014.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_035.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_051.png
Figure 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_055.png
10 Sora and Gemini 1.5 follow-ups: code-base in context, deepfakes, pixel-peeping, inference costs, and more
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sora-gemini-follow-up
00:00 10 Sora and Gemini 1.5 follow-ups: code-base in context, deepfakes, pixel-peeping, inference costs, and more
00:46 1. Deepfake detection of Sora
01:59 2. Playing with long-context, problem settings, and prompting
03:39 3. Gemini paper snooping: contamination and citation games
05:42 4. Training data and token estimates of YouTube
07:42 5. Unlocking model-based RL and downstream research
08:52 6. Midjourney style matching, V-JEPA, replicating Sora in the open
10:09 7. Architectures and academic links
10:57 8. Pixel peeping from the arts
11:58 9. Inference costs
13:24 10. Pressure on Llama and Mistral
14:03 11. Sound effects, physics, and the complete picture
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_003.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_007.mp4
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_009.mp4
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_011.mp4
Figure 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_037.mp4
Figure 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_044.png
Figure 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_047.png
Figure 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_049.mp4
Emergency blog! Three things you need to know from the ML world that arrived yesterday.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sora-gemini-and-mistral-next
0:00 OpenAI's Sora for video, Gemini 1.5, and a secret Mistral model
0:53 Sora: OpenAI's text-to-video model
4:59 Gemini 1.5: Google's effectively infinite context length
8:01 Mistral-next: Another funny release method
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_015.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_023.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_026.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_036.png
In an era dominated by direct preference optimization and LLM-as-a-judge, why do we still need a model to output only a scalar reward?
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: In an era dominated by direct preference optimization and LLM-as-a-judge, why do we still need a model to output only a scalar reward?
Podcast figures:
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reward-models/img_004.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reward-models/img_009.png
0:00 Why reward models are still key to understanding alignment
Scale's making over $750 million per year selling data for RLHF, who's coming to take it?
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/alignment-as-a-service
00:00 Alignment-as-a-Service upstarts taking on Scale AI
04:25 The competition with humans-in-the-loop
06:05 Scaling Alignment-as-a-Service via AI feedback
Podcast figures:
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/aaas/img_008.png
A small model at the beginning of big changes.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/olmo
0:00 Open Language Models (OLMos) and the LLM landscape
6:24 Thought experiments
7:51 The LLM landscape heading into 2024
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo/img_010.png
Note: some of the audio in the second half is a little wonky, but the general voice was upgraded so hopefully it's a little less "poppy" until then!
I'm trying to fix little pronunciation problems on a weekly basis. Thanks to my early fans! It'll keep improving. E.g. some of the months were wonky.
When what seems like pure LLM black magic is actually supported by the literature.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/model-merging
00:00 Model merging lessons in The Waifu Research Department
02:21 How and why does model merging work?
07:13 Aside: merging vs. ensembles vs. mixture of experts
08:21 Why are people doing this?
11:22 Tools & Links
11:51 Brief (visual) literature review
12:07 Full model merging and recent methods
15:55 Weight averaging during pretraining
17:18 LoRA merging
17:53 More background
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_005.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_016.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_042.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_051.png
Figure 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_055.png
Figure 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_058.png
Figure 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_060.png
Figure 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_062.png
Figure 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_065.png
Figure 10: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_075.png
Figure 11: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_077.png
Figure 12: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_084.png
Local LLMs: the latency solution, Meta's open AGI, personalization myth, and moats X factor
The deployment path that'll break through in 2024. Plus, checking in on strategies across Big Tech and AI leaders.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/local-llms
0:00 Local LLMs: the latency solution, Meta's open AGI, personalization myth, and moats X factor
4:15 The personalization myth
7:13 Meta's local AGI and moats X factors
A fun demo on how generative AI can transform content creation, and tools for my fellow writers on Substack!
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/multimodal-blogging-tools
0:00 Multimodal blogging tools
2:57 Stratechery, passport, and wonderful customer experiences
5:51 Wrap-up, features, and next steps
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_006.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_008.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_012.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_020.png
A sampling of recent happenings in the multimodal space. Be sure to expect more this year.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/multimodal-rlhf
00:00 Multimodal LM roundup: Unified IO 2, inputs and outputs, Gemini, LLaVA-RLHF, and RLHF questions
02:46 Unified IO 2: Scaling multi-input, multi-output model pretraining
07:47 Collecting preference data for images
09:31 LLaVA-RLHF: The first experiments in multimodal RLHF fine-tuning
13:20 Multimodal RLHF questions, ideas, and resources
And why the comparisons don't really matter. Repeated patterns in the race for reproducing ChatGPT, another year of evaluation crises, and people who will take awesome news too far.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/open-gpt4-limitations
00:00 Where 2024's "open GPT4" can't match OpenAI's
03:19 Models vs. products
04:51 RLHF progress: Revisiting Llama 2's release and potential in 2024
08:30 Smaller scale open RLHF
10:33 Opportunities
12:24 Commentary
This interview is on YouTube and podcast players.
Giving a voice to researchers is the best way to cut through the noise and understand what is happening with core developments of LLM technologies. I’m excited to get to talk with Michael Poli (Stanford PhD student + research at Together AI) and Tri Dao (incoming professor at Princeton + Chief Scientist at Together AI). This builds on the mega-post from yesterday on the same topics, though the interview is obviously less math heavy:
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Topics: Introductions | Why Attention works and may not scale | Quadratic scaling in attention | What is Striped Hyena | What is Mamba | Mamba hardware optimization | Predictions for 2024 architectures | More predictions for AI
Introductions
[00:00:00] Nathan Lambert: Okay. Hey, everyone. Welcome to the first interview that we're going to post on interconnects. I'm really trying to bring more scientific voices into the AI discourse as media is covering a lot these days. I'm happy to be here with Michael Poli and Tri Dao, experts in some of these non attention architectures that have been really blowing up in the last few weeks of December.
So, Michael, do you want to introduce yourself first?
[00:00:25] Michael Poli: Sure. Thank you, Nathan, for inviting me. I do research at Together AI, and I was also a PhD student at Stanford, working with Stefano Ermon and Chris Ré; that's how I met Tri as well. I've moved through a few different areas of research.
If I had to choose one, I like to do research at the intersection of signal processing, dynamical systems, and deep learning. Most recently, luckily, there's been more interest in efficient architectures that use some of these principles to improve scaling along different axes and to get new trade-offs at inference time.
[00:01:13] Nathan Lambert: Great. And Tri?
[00:01:16] Tri Dao: Hi everyone, thanks Nathan for hosting us, really excited to be here. I'm Tri. I just finished my PhD at Stanford, I'm an incoming assistant professor at Princeton, and right now I'm chief scientist at Together AI, a startup working on AI infrastructure. I've been working at the intersection of machine learning and systems, designing algorithms that take advantage of the hardware that they run on.
I'm interested in long-range dependencies, how to encode that into a model, designing architectures that can learn long-range dependencies. Yeah, really excited to be here.
Why Attention works and may not scale
[00:02:01] Nathan Lambert: Okay, I have some questions, so let's dive right into this. I think you two can both answer them, or one of you can answer at more length, whatever you want.
To start, we should talk about why attention works and what the limitations of attention are. Almost every person in tech broadly now knows that a Transformer is a model built with attention and that ChatGPT does that, but why is this so good, how much of a Transformer is built with attention, are there other things going on, and what might the challenges be?
[00:02:35] Tri Dao: Right. So the Transformer is this architecture that powers most of the exciting applications that we're seeing nowadays, as you mentioned. Attention is kind of the core layer there, and attention actually came earlier, around 2014, 2015, and then the Transformer came out incorporating that, focusing a lot on attention, interleaving MLP and attention.
And I think the success largely has been that they seem to be able to scale really well, so you can scale up the models with more parameters and more data, and that has been the recipe for success. It sounds obvious now, but I think five years ago that wasn't clear.
So it seems like the Transformer architecture is a hugely successful one, and there are a couple of reasons why. One, I think it's general enough that it's able to learn a lot from data. And two, it's pretty friendly to hardware: there's no kind of sequential dependency like previous RNNs.
So as a result, you can run it well on GPUs and TPUs, you can scale it up, it's very hardware efficient. I've personally worked on making it more hardware efficient as well. So it's just kind of the recipe for success: a general architecture that scales well. If you're an NLP person, maybe you'd say you should incorporate some kind of inductive bias; personally, I think it's a very general architecture that scales well and is hardware friendly.
[00:04:16] Nathan Lambert: Yeah, it's remarkable how it seems so obvious now. I think one of the things in studying this work is that context length becomes a really interesting axis for studying alternatives to it. Michael, do you want to talk about that? I could babble, but you know more, for sure.
[00:04:39] Michael Poli: Yeah, there are several points. I'll start by saying that there's still a lot of great research trying to understand, from first principles, why the Transformer can learn these interesting circuits. People pick apart the computation, like combinations of different heads in Transformers and so on.
So there's work on basically understanding Transformers as a kind of programming language that is encoded. But I think, as Tri mentioned, there are some very, very interesting design choices that have gone into the Transformer. This interleaving of attention and MLP is quite important.
And also the Transformer was essentially successful in the beginning because it encoded techniques that had been developed for RNNs and these other classical NLP models: gating to modulate which information is absorbed into the model, gating to determine how quickly something is forgotten, and putting this recurrence of an RNN into a parallel form.
It is, you know, a bunch of GEMMs that can be easily — well, not very easily, but can be — optimized on GPUs.
Quadratic scaling in attention
[00:06:01] Nathan Lambert: Yeah, that's all great. I guess the specific thing that I had in mind is how attention ends up being this kind of quadratic scaling in terms of cost: you have an input sequence of length L and you want to output a sequence of length L, essentially.
If you zoom into the math and look at what's happening at inference in most of these libraries, you have this upper triangular attention matrix where you say you can only look at the past entries of your text. And as you go through it, you get this L-squared relationship where for the first token you can only look at one entry, and then you end up looking at more tokens for each subsequent one. Now, we've been talking about recurrent neural networks — how does something that isn't attention get around the fact that you want to look at all of the history of the text in your sequence?
So if you write a long prompt to ChatGPT, you really want all that information to be encoded, and how could doing something other than this dense attention matrix actually make that possible?
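For readers following along, here is a minimal sketch of the masked attention Nathan is describing, with the L-by-L score matrix that drives the quadratic cost (single head, simplified shapes).

```python
# Minimal sketch of causal self-attention, showing the L x L score matrix that gives
# attention its quadratic cost in sequence length.
import math
import torch

def causal_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (batch, seq_len, d)
    L, d = q.shape[1], q.shape[-1]
    scores = q @ k.transpose(-1, -2) / math.sqrt(d)           # (batch, L, L) -- the L^2 term
    mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))          # each token sees only the past
    return torch.softmax(scores, dim=-1) @ v                  # (batch, L, d)
```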
[00:07:08] Tri Dao: Yeah. So before attention, there were RNNs, right? RNNs processed text fine, and maybe they didn't scale as well, but they process text by encoding it —
[00:07:22] Nathan Lambert: Can you say briefly what an RNN is and how it works?
[00:07:24] Tri Dao: Yeah, so these are recurrent neural nets that go back, I think, to the 80s.
Maybe some of the more famous ones are LSTMs and GRUs. They were pretty popular around 2012 to 2016 or so; they were kind of state of the art for translation, speech recognition, and a bunch of NLP tasks, and they process text sequentially.
They see essentially one token, and that changes the hidden state; they update the hidden state every time they see a new token. I think it's, in some sense, mimicking how, for example, the human brain processes information: you read a sentence or a passage, and maybe you're storing some information in your brain.
By the time you've finished reading a document, maybe you can answer questions about that document without having to refer to it again. RNNs kind of work that way: they process the text, that changes the hidden state, and the hidden state is the representation that can be used to either generate new tokens or classify the document or whatnot.
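A minimal sketch of that sequential hidden-state update, for contrast with the attention snippet above; the cell here is a generic tanh RNN, not an LSTM or GRU.

```python
# Minimal sketch of the recurrent update Tri describes: one token at a time,
# with all history compressed into a fixed-size hidden state.
import torch
import torch.nn as nn

class TinyRNNCell(nn.Module):
    def __init__(self, d_in: int, d_hidden: int):
        super().__init__()
        self.wx = nn.Linear(d_in, d_hidden)
        self.wh = nn.Linear(d_hidden, d_hidden)

    def forward(self, x: torch.Tensor, h: torch.Tensor) -> torch.Tensor:
        # New state depends only on the current token and the previous state,
        # so memory stays constant no matter how long the document is.
        return torch.tanh(self.wx(x) + self.wh(h))

def encode(token_embeds: torch.Tensor, cell: TinyRNNCell, d_hidden: int) -> torch.Tensor:
    # token_embeds: (batch, seq_len, d_in)
    h = torch.zeros(token_embeds.shape[0], d_hidden)
    for t in range(token_embeds.shape[1]):     # strictly sequential, unlike attention
        h = cell(token_embeds[:, t], h)
    return h                                   # fixed-size summary of the whole sequence
```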
These worked well back in 2016 or so, but they've kind of fallen out of favor. Empirically, they don't do as well as the Transformer, I think, and as you touched on, the Transformer, because of this kind of quadratic scaling where you compare every token with every other token that comes before it, gives you this very easy way to propagate information.
I think that's part of the reason why the Transformer and attention do really well. But more recently there have been some newer RNN architectures that seem to do pretty well. RWKV, I think, is one of the earlier ones. I really admire that project; it's an effort mostly from one person really going against the orthodoxy of the Transformer,
showing that RNNs can be pretty strong. Who was the lead on that? I think it was this person Bo Peng. It's an entire project, but I think it was pioneered by Bo Peng. I think it's affiliated with Eleuther, with compute sponsored by Stability and so on.
[00:10:12] Nathan Lambert: Yeah, I was reading this earlier. At a technical level, they tried to replicate something like the query-key-value lookup of attention with two linear RNNs, essentially to be able to remove the specific attention scaling problems, and with two RNNs, which have this better long-context behavior and different implementation rules.
I think the paper also trained up to 14 billion parameters, which kind of leads into some of the next questions I was going to ask — I was going to ask Tri about Mamba and then Michael about Striped Hyena. You could go in either order. I think these came out about a week apart, and these two language models were kind of seen as being
What is Striped Hyena
Nathan Lambert: way closer than anyone would expect. Essentially, Striped Hyena came out and the evaluations were close to models I've been training on all year, like Llama 2 and Mistral 7B. And I went to the Together API and did side-by-sides of
Mistral versus Striped Hyena, and it's a good language model. It answers most questions; there are no obvious failure modes. Michael, do you want to comment on that? I know it's another big project, and then we can go back to Mamba, even though it's slightly out of order in the chronological release cycle that happened. Sure.
[00:11:33] Michael Poli: So I guess I'll start by saying that there's an interesting connection between all these new methods.
There is this sort of convex set which has a center, and there's this connection between linear attention — so attention without the softmax — linear RNNs, and state space models (SSMs). At some level, the mathematical formulation of this kind of base model — I'm not talking about the base architecture, just the fundamental model — is the same.
And then you can go in different directions, and each direction has its own trade-offs. You can go in the feature map direction, the kernel direction: when you break down the softmax, you take away the softmax, you can place on queries and keys — the fundamental entities that compose your attention matrix — other kernel-like functions, other functions that you hope would approximate whatever capability of attention you like.
You can do things like a Taylor approximation, a Taylor expansion, for example, of that, and you have a slightly different perspective, but you get something that, again, is very similar. You can go to time variance: you take the RNN and you push this input dependence, so the computation inside the linear RNN is conditioned by the input sequence. And you can have things like gates — we've seen a lot of work, for example, modernizing linear attention with additional gates
that allow you to make better use of your fixed state dimension. And then you have the third direction, at least in my mind, the one that uses the convolutional form, that pushes more towards other types of linear operators that are still associative and still allow you to train in parallel.
So here are things, time invariant systems. I can elaborate on any of these points, but things that can switch between convolutions and recurrence like this for a model with additional. Gates again, scraped. I, you know, was born as a, as a project, from the, in architecture, which belongs to this third category that I just mentioned.
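To illustrate the first of those directions, dropping the softmax and placing a kernel-like feature map on the queries and keys, here is a minimal numpy sketch of causal linear attention. The `elu(x) + 1` feature map is one common choice from the linear attention literature, assumed here for illustration and not specific to Striped Hyena; the point is that once the softmax is gone, the whole computation can be carried by a fixed-size recurrent state.

```python
import numpy as np

def phi(x):
    # one common positive feature map: elu(x) + 1
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(q, k, v):
    """Causal attention without softmax, computed with a fixed-size recurrent state."""
    seq_len, d = q.shape
    qf, kf = phi(q), phi(k)
    S = np.zeros((d, d))      # state accumulating outer products of keys and values
    z = np.zeros(d)           # running normalizer
    out = np.zeros_like(v)
    for t in range(seq_len):
        S += np.outer(kf[t], v[t])
        z += kf[t]
        out[t] = qf[t] @ S / (qf[t] @ z + 1e-8)
    return out

rng = np.random.default_rng(0)
q, k, v = rng.normal(size=(3, 32, 16))   # 32 tokens, 16 channels
print(causal_linear_attention(q, k, v).shape)  # (32, 16)
```

The other two directions modify this same skeleton: gating makes the update input-dependent, and the convolutional family keeps the operator time-invariant so it can be applied as a long convolution instead of a step-by-step recurrence.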
We were really trying to get the best per-FLOP [00:14:00] architecture that we could. One principle that was validated over and over again, and that we're trying to understand better now, is that composing, hybridizing different layers, different blocks from different categories, and even full attention, yields something that is better than the individual components.
So there seems to be a compositional aspect of these models that we're trying to understand better, and this gives you a better pre-trained model per FLOP. With this model we ran a whole suite of scaling laws and so on. Hybridizing also gives you, since we wanted something that would be usable out of the box, a much easier time
when you're fine-tuning for longer context. We can apply some of the techniques that have been developed for transformers, and somewhat surprisingly they work okay for [00:15:00] hybrids as well, things like linear scaling for rotary embeddings and so on; you can go into the details. So it was mostly a project asking: given the current landscape, what is the best we can do?
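The "linear scaling for rotary embeddings" mentioned here is usually done, in the transformer world, by compressing the position indices so that a longer sequence reuses the frequency range the model saw during pre-training. A minimal sketch follows; the base frequency and the 4x scale factor are illustrative assumptions, not Striped Hyena's actual settings.

```python
import numpy as np

def rotary_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary embedding angles with optional linear position scaling.

    scale=1.0 is standard RoPE; scale=4.0 squeezes positions by 4x, the usual
    trick when fine-tuning a 4k-context model out to 16k.
    """
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))     # (dim/2,)
    scaled = np.asarray(positions, dtype=np.float64) / scale    # the linear scaling step
    return np.outer(scaled, inv_freq)                           # (len(positions), dim/2)

def apply_rope(x, angles):
    """Rotate consecutive channel pairs of x (seq_len, dim) by the given angles."""
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# extend a model trained with a 4k context out to 16k by scaling positions by 4
q = np.random.randn(16384, 64)
q_rot = apply_rope(q, rotary_angles(np.arange(16384), dim=64, scale=4.0))
print(q_rot.shape)  # (16384, 64)
```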
What is Mamba
[00:15:11] Nathan Lambert: Yeah, that's a great description of it. The sentence in the blog post, that Striped Hyena is optimized using a set of new model grafting techniques, enabling the architecture to change during training, felt to me like there's a ton going on there, and some of it you probably can't talk about, along with the usual data questions.
I don't think all the data was quite explained, like what the longer-context data was. But are you taking the starting point from models that people would know, and can you say any of that? I think even just the summary that it synthesizes recent work into a strong model is great context for people.
[00:15:48] Michael Poli: Yeah. Given this explosion of primitives that I just described, and given the [00:16:00] cost it would take to evaluate all the different combinations, we found ways to essentially start training with one configuration and then continue on with another configuration.
I think we're going to have more work, or a paper, on this.
[00:16:16] Nathan Lambert: Yeah, there's so much cool work in that area. Someone at AI2 is working on a project where they're essentially trying to cut the Llama models in half and keep training them. It's the wild west out there, with people trying to take strong models and make them smaller while still getting the performance benefits of bigger models.
That's a whole aside, but I wasn't expecting model grafting to show up here. When you follow the social media reaction to Striped Hyena, people are saying, oh, non-attention models are finally good, and that covers up a lot of the details that are very interesting about it, in my opinion.
So, okay, turning back to Tri. Mamba actually happened first among these, in this reading back of [00:17:00] social media, and it was also very surprising to me. The largest model in the Mamba suite is 2.8 billion parameters, if I remember correctly, and it was compared on a lot of the common benchmarks in open NLP, things like the GPT-J and Pythia model suites, and the benchmark scores reported were really strong. Do you want to start with the high-level summary? Then I'm definitely going to make you talk about the awesome new CUDA kernels and stuff you had to write for this project.
[00:17:34] Tri Dao: Yeah, so Mamba is a collaboration with Albert Gu. He was a PhD student at Stanford, which is where we met, and he's now a professor at CMU and also at a startup. It was a wonderful collaboration; a lot of the credit goes to him. Albert has been working on this line of work called state space models, [00:18:00] which in some sense, as mentioned, connects to things like linear attention, linear RNNs, and convolutional neural nets. That's what his PhD thesis is about.
I've also worked on state space models for the past couple of projects. My angle is how to make state space models more hardware-efficient and increase their expressiveness, so it has been wonderful working with Albert. Mamba is more of a proof of concept: can state space models actually do as well as transformers on language? We've seen previous papers showing state space models could be better on audio and on some of the tasks in the Long Range Arena, but language has always been the most difficult thing for state space models to do well.
[00:19:00] And language is also the thing people care about most right now. So it was more of a proof of concept: we wanted to show that state space models can be competitive with, or maybe even beat, some of the transformers out there. We validated that at scales up to 3B parameters trained on 300B tokens.
In absolute terms, these are not very strong models; these are not yet models you would actually play with and expect to do meaningful things. It's more of an academic comparison in terms of architecture: trained on the same number of tokens, it does as well as, or maybe slightly better than, some of the transformers out there.
That in particular has been very exciting to us. Albert has been pushing on this for a while, I've been pushing on this for a while, and it finally seems [00:20:00] to be closing the gap with, or even surpassing, the transformer. I think it opens up a bunch of opportunities.
Inference could be way faster. Maybe we'd have different ways to understand how in-context learning happens, et cetera. So lots of future work, I would expect.
Mamba hardware optimization
[00:20:22] Nathan Lambert: Yeah. Can you go into what it actually takes to implement some of these new CUDA kernels? I just remember when this paper was announced, Sasha Rush, who's also very active in the space and recommended I talk with you two, was tweeting about the types of files involved, or whatever.
In the paper there's a discussion about how the recurrent state needs to be sufficiently expressive, but doing that in a certain type of memory is a problem. Can you translate what this means for people thinking about GPUs and about these models being scaled? Is it now much easier to scale these [00:21:00] models because they work well on GPUs?
Which GPUs were you using? Is there a bump that could come just from going to H100s or something? Any of that?
[00:21:08] Tri Dao: Yeah. So the line of work on state space models, like the S4 models pioneered by Albert's work, they are in some sense recurrent neural networks, but with a much larger state size. The state is whatever buffer you use to store information as you traverse or process the sequence.
In some sense, you can view the transformer as doing that as well: it keeps the entire history, usually called the KV cache, and keeps referring to it. RNNs have a fixed-size state; for the transformer, you can think of the state size as increasing with the sequence. Our intuition [00:22:00] is that the larger the state size, the easier it is for the model to do well.
Basically, you have more space to store whatever you need to remember. Previous models like S4 and so on have an implicitly pretty large state size, but they use the convolutional view to avoid having to materialize the state, which was wonderful. Michael's work on the Hyena side used some of the same insight, focusing on convolution.
Mamba, on the other hand, focuses on the recurrent view. We wanted to put more input dependency into the recurrence; the thinking was that this would make the model more expressive and it would do better, but it prevents us from using the convolutional view that makes things efficient.
So we had to figure out a different way to make things efficient, and I focused on making that efficient on GPUs. Our [00:23:00] thinking was: okay, we're going to have a large state size, but we don't have to write it to the main GPU memory, the HBM; we can keep that large state in a faster memory called SRAM, which you can think of as a cache. If you're more familiar with CPUs, this is the usual distinction between cache and RAM. If you keep the large state in the cache and avoid having to write it out, you don't suffer too much even if the state is large.
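As a rough mental model of what the fused kernel computes, leaving out the CUDA itself and the SRAM management described above, the core of an input-dependent (selective) state space recurrence looks something like this naive reference loop. The shapes and discretization are simplified assumptions; the efficient kernel keeps the (d, n) state in on-chip SRAM and avoids writing it back to HBM at every step.

```python
import numpy as np

def selective_scan_reference(u, delta, A, B, C):
    """Naive reference for an input-dependent (selective) SSM recurrence.

    u:     (L, d) input sequence
    delta: (L, d) input-dependent step sizes
    A:     (d, n) state-transition parameters (kept negative for stability)
    B, C:  (L, n) input-dependent input/output projections
    Returns y of shape (L, d).
    """
    L, d = u.shape
    n = A.shape[1]
    x = np.zeros((d, n))   # the large recurrent state the kernel keeps in fast memory
    y = np.zeros((L, d))
    for t in range(L):
        dA = np.exp(delta[t][:, None] * A)        # (d, n) discretized transition
        dB = delta[t][:, None] * B[t][None, :]    # (d, n) discretized input matrix
        x = dA * x + dB * u[t][:, None]           # state update, conditioned on the input
        y[t] = (x * C[t][None, :]).sum(axis=-1)   # read the state out through C
    return y

L, d, n = 64, 8, 16
rng = np.random.default_rng(0)
y = selective_scan_reference(
    u=rng.normal(size=(L, d)),
    delta=0.1 * np.abs(rng.normal(size=(L, d))),
    A=-np.abs(rng.normal(size=(d, n))),
    B=rng.normal(size=(L, n)),
    C=rng.normal(size=(L, n)),
)
print(y.shape)  # (64, 8)
```

Because `delta`, `B`, and `C` change with the input, the convolutional shortcut is gone, which is exactly why the work goes into making the sequential scan itself memory-efficient.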
Predictions for 2024 architectures
[00:23:33] Nathan Lambert: Would this be because of input/output, having to move the data around being really slow? Yes. Yeah, that makes a lot of sense; that's a really common constraint in a lot of these things. One of the most insightful things I've learned recently about GPUs versus TPUs is how mixture of experts models don't work very well on TPUs, because of how the layers get split up. At a basic level, a mixture of experts
adds a routing layer that you learn, [00:24:00] and then multiple feedforward layers, experts, that you can choose from. When you're distributing this, the feedforward layers can end up on different TPU nodes, and TPUs communicate with their neighbors, so TPUs take a big hit relative to GPUs, where within NVIDIA clusters everything is connected much more densely.
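For readers who haven't seen one, here is a minimal sketch of the mixture of experts layer being described: a learned router scores each token and one small feedforward "expert" is chosen per token. Top-1 routing and the tiny dimensions are illustrative assumptions; production MoE layers add top-2 routing, load balancing, and expert parallelism, which is exactly where the interconnect topology starts to matter.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Top-1 mixture of experts: a learned routing layer plus several feedforward experts."""

    def __init__(self, d_model=64, d_ff=256, n_experts=4):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts)   # the routing layer you learn
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                              # x: (tokens, d_model)
        gates = F.softmax(self.router(x), dim=-1)      # (tokens, n_experts)
        top_gate, top_idx = gates.max(dim=-1)          # pick one expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                        # tokens routed to expert i
            if mask.any():                             # each expert may live on a different device
                out[mask] = top_gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

moe = TinyMoE()
print(moe(torch.randn(10, 64)).shape)  # torch.Size([10, 64])
```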
That connectivity makes it easier to do that sort of distributed training. It's super interesting, and it's really where I want to open up the conversation: what does this mean? What is going to happen in 2024 in this space? Are bigger players going to move in and explore this? My take, seeing how good the long-context learning could be and what the fundamental limits are, is that systems like ChatGPT are going to use a dense transformer model for most tasks.
And then if you need to do summarization, you might use a long-context specialized architecture, and we could even see a whole quiver of architectures behind [00:25:00] whatever product you're using. But the question is: is attention going to be dethroned? Is Sasha Rush somehow going to win the bet that everyone in the area was following?
What are the two of you thinking?
[00:25:14] Tri Dao: I think the transformer is still a very, very strong architecture, and there is a proven recipe, right? People are scaling to a trillion parameters right now. If you say, well, I just want the best-performing model that runs most efficiently on my hardware and has the most support on the software side,
then the transformer is a safe bet; I think it's here to stay. But there are new ideas, like state space and some of the ideas from linear attention, and I think they're coming in. We've seen, as Michael mentioned, that mixing some of these components seems to improve performance, validated at, I think, 7B scale, and maybe it might even work at larger scale.
I think [00:26:00] people tend to be conservative, and focusing too much on model architecture might not be worth their time. The Llama architecture is very, very strong; most people are building off of that. They're focusing on data, they're focusing on infrastructure, which makes sense. On my side, personally, it's just plain interesting.
There are, I would say, more niche use cases, niche for now, where some of these alternative architectures are interesting: things like long context, and different domains like audio and genomics. And there are just plain interesting scientific questions you can ask, like whether these models follow instructions just as well, whether they play well with quantization, and so on.
Those are just plain interesting research questions we can ask. At the production level, I think the transformer is still incredibly strong and very well supported, on both the hardware and software sides. But some of these new ideas are coming in, [00:27:00] and people might start putting them in as a component of the transformer.
Maybe we'll still call them transformers, but they'll just have more kinds of layers than just attention and MLPs.
[00:27:11] Michael Poli: Yeah, I 100 percent agree. Attention as a computational primitive is not going anywhere anytime soon. It's just a very efficient and very convenient way to increase the effective state of your sequence processor. At some level, if you're working with a model that only has a fixed state in each of its sequence mixers, you have an assumption, and your assumption is that you only need so much information from the sequence.
So there's always a trade-off in the ratio of state dimension to sequence length as you push things to the extreme, in either model size or sequence length. As you make the model bigger and wider, you effectively [00:28:00] introduce more state. Some of this is speculation, but some of these margins will disappear and some of the trade-offs will change, especially at 14B or 30B, some of these very fat models.
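To put rough numbers on that state-versus-sequence-length trade-off, here is a back-of-the-envelope sketch comparing the per-layer state a transformer carries (a KV cache that grows with the sequence) against a fixed-size recurrent state. The head counts and dimensions are placeholders loosely in the range of a 7B-class model, not the configuration of any specific architecture discussed here.

```python
def kv_cache_bytes(seq_len, n_heads=32, head_dim=128, bytes_per=2):
    # transformer per-layer state: keys and values for every past token (fp16)
    return 2 * seq_len * n_heads * head_dim * bytes_per

def fixed_state_bytes(d_model=4096, state_per_channel=16, bytes_per=2):
    # SSM / linear-RNN per-layer state: fixed, independent of sequence length
    return d_model * state_per_channel * bytes_per

for L in (2_048, 32_768, 1_000_000):
    print(f"{L:>9} tokens: KV cache {kv_cache_bytes(L) / 2**20:9.1f} MiB/layer, "
          f"fixed state {fixed_state_bytes() / 2**20:6.3f} MiB/layer")
```

The growing cache is what makes long sequences expensive for the transformer, while the fixed state is both the efficiency win and the capacity assumption Michael is pointing at.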
But certainly, whether that's hybridizing or some kind of new block, we're going to see more innovation. That's really exciting. My personal prediction, if I had to make one, is that architectural design will get more interesting and more complex; there's going to be more to do.
More predictions for AI
[00:28:38] Nathan Lambert: Yeah. I've got about ten minutes left on the clock, which is fine for us. This year, with mixture of experts and now state space models becoming popular, this has all really happened within a few months, at least outside of OpenAI; they've been doing mixture of experts for a lot longer than everyone else.
In terms of the open and academic [00:29:00] communities, no one has really tried to do RLHF on mixture of experts models. It should just work, but we have to learn all these things. And then model grafting is becoming more of a real thing, which is super interesting. I agree that it's just fun to follow, and hopefully it gives academics and scientists more ways to influence the conversation, whereas industry is just about scaling and bigger models; we could maybe do specific things better, which is what I'm telling open-source companies to do with their language models anyway.
If they want to have a business model, they need to have an edge, so this all fits into that narrative pretty well, in my view. Is there anything else you're following in ML? It doesn't have to be about state space models. What's exciting for you, broadly, for next year?
[00:29:46] Tri Dao: Yeah, personally, I think data is still the most important thing. We're thinking a lot about how data influences model performance, really teasing that [00:30:00] out, for example by having synthetic tasks that correlate very well with model performance. That's been the motivating
example in a lot of our papers, and a lot of our work has focused on synthetic tasks, or on maybe smaller datasets that make it easier to really understand what's going on. So, personally, my focus is going to be on data for the next little bit.
All the architecture stuff is fun, and making it hardware-efficient is fun, but I think ultimately it's about data. If you look at the scaling law curves, different model architectures generally have the same slope; they're just at different offsets.
It seems like the only thing that changes the slope is the data quality.
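A toy way to picture that claim: write the scaling curve as loss(D) ≈ A * D^(-alpha) + E, where alpha is the slope in log-log space and A sets the offset. The numbers below are invented purely for illustration (loosely Chinchilla-flavored), so the specific values mean nothing; the sketch only shows what "same slope, different offset" versus "different slope" looks like.

```python
import numpy as np

def loss(D, A, alpha, E=1.69):
    """Toy scaling curve: loss as a function of training tokens D."""
    return A * D ** (-alpha) + E

D = np.logspace(8, 11, 4)  # 1e8 .. 1e11 tokens

# two "architectures": same exponent (slope), different coefficient (offset)
arch_a = loss(D, A=410.0, alpha=0.28)
arch_b = loss(D, A=330.0, alpha=0.28)

# a hypothetical better-data run, modeled here as a larger exponent (steeper slope)
better_data = loss(D, A=410.0, alpha=0.32)

for d, a, b, c in zip(D, arch_a, arch_b, better_data):
    print(f"D={d:.0e}  arch A: {a:.3f}  arch B: {b:.3f}  better data: {c:.3f}")
```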
[00:30:58] Nathan Lambert: I love that point. That does [00:31:00] seem true. I have the plot from Mamba in the blog post I'm writing, and it's just a little bit above, on the same slope.
[00:31:08] Michael Poli: Yeah, I'd add data. Data is really interesting: sort of miniaturizing architecture design, finding and breaking down what tasks are involved in, for example, language modeling, and trying to package them into something that can be used to iterate on architectures. That's quite exciting, and it was one of the main techniques used for the Zoology paper, which also looks into some of these different behaviors.
Personally, I'm also really excited about new applications: scientific applications, the genomics work, but also more engineering-focused ones. We're seeing a shift. Right now language is still the domain that gets the most clicks [00:32:00] and the most interest, but I think that will evolve over time.
And some of these other applications, even just in terms of architectures, offer a completely different design space that I'm excited to look into.
[00:32:13] Nathan Lambert: Yeah, everyone talks about language, but I feel like images, entertainment, and video are the things that are so obviously going to generate so much value.
I don't know the ceiling on language, but when you can access a somewhat local text-and-video model at your home workstation, tailored to your preferences, the amount of value that creates is astronomical. I'm excited. I've started playing around with this, where I take
the text of a blog post, convert it to DALL-E images, and convert that into a video with generated audio, all with one Python script, and it's really easy to do. So I agree with your more-than-language view; it's fun to have that view.
[00:32:55] Tri Dao: And these things actually do work reasonably well in your experience, when you stitch them [00:33:00] all together?
[00:33:02] Nathan Lambert: It's not that good. The DALL-E images come out pretty similar to each other, but I'm doing something really naive: I literally take the text and use a system prompt like, you're generating a series of images for visualizing a blog post, and it generates variations of all the machine learning thumbnails you see everyone using.
The fun ones are when the post is about Llama or Mamba or something, and then it puts animals in them, which is good. I could get much better at it, with a better segmentation system for the paragraphs, or by having ChatGPT summarize them or something like that. But I know that within a year there's going to be a text-to-video API, I'm just going to switch to it, and it's going to be great.
So I'm laying the groundwork, the infrastructure, for multimodal content distribution, really, and I just expect it to become very fun. Even the text-to-voice is pretty good. I don't have a studio, but once [00:34:00] you have studio-quality recordings, it's going to be able to generate perfect audio for whatever you want.
Another one of my dreams, one that's bad for young students, is to be able to give a slide deck to a script that returns the five-minute conference video that no one ever watches, just based on GPT-4 reading the slide deck and voicing you. Those are the silly things I have time to do because I'm not a professor.
[00:34:29] Tri Dao: Yeah, I think these advances, these systems, do generate a lot of economic value, and we're seeing that already. Lots of companies are now switching to using these things, and I think it's going to change the way we work, as you mentioned, and the way we're entertained.
So it's just a very exciting future.
[00:34:47] Nathan Lambert: Yeah. Anything else? Well, thanks for coming. I'll try to get you as much attention as I can; you never know what will go viral these days. I think this was a great conversation. People are really hungry for basic intuitions in [00:35:00] the area, so this is good.
[00:35:02] Tri Dao: Yeah. Thank you.
Nathan, it's been a pleasure. Absolutely.
[00:35:07] Michael Poli: Thanks for inviting us. And maybe, if there are more questions, is there a way to collect them, or to provide readers and listeners with an address or something? Happy to answer anything.
[00:35:24] Nathan Lambert: Yeah, I'll include contact info in the post and various other places.
This will be out there. You'll get comments on Substack, YouTube, Twitter; it's a mess, and you've got to pay attention to ten million streams of information these days, but you'll get contacted by people. Thankfully, for some reason, people read my stuff, but here we are. So thanks for listening.