Audio essays about the latest developments in AI and interviews with leading scientists in the field. Breaking the hype, understanding what’s under the hood, and telling stories.
www.interconnects.ai
The podcast Interconnects is created by Nathan Lambert. The podcast and the artwork on this page are embedded using the public podcast feed (RSS).
Original post:
https://www.interconnects.ai/p/openais-o3-the-2024-finale-of-ai
Chapters
00:00 Introduction
02:51 o3 overview
05:57 Solving the Abstraction and Reasoning Corpus (ARC)
10:41 o3’s architecture, cost, and training (hint: still no tree search)
16:36 2024: RL returns
Figures
Fig 1, Frontier Math results
Fig 2, Coding results
Fig 3, ARC AGI results
Fig 4, ARC AGI result details
Fig 5, ARC AGI example 1
Fig 6, ARC AGI example in text
Fig 7, ARC AGI example “easy”
Original post: https://www.interconnects.ai/p/the-ai-agent-spectrum
Chapters
00:00 Introduction
03:24 Agent cartography
08:02 Questions for the near future
Figures
Fig 1. multiple feedbacks diagram
Original post:
https://www.interconnects.ai/p/openais-reinforcement-finetuning
Chapters
00:00 Introduction
04:19 The impact of reinforcement finetuning’s existence
07:29 Hypotheses on reinforcement finetuning’s implementation
Figures
Fig. 1, Yann’s Cake
Fig. 2, Grader config
Fig. 3, RLVR learning curves
Finbarr Timbers is an AI researcher who writes Artificial Fintelligence — one of the technical AI blogs I’ve been recommending for a long time — and has experience at a variety of top AI labs, including DeepMind and Midjourney. The goal of this interview was to do a few things:
* Revisit what reinforcement learning (RL) actually is, its origins, and its motivations.
* Contextualize the major breakthroughs of deep RL in the last decade, from DQN for Atari to AlphaZero to ChatGPT. How could we have seen the resurgence coming? (see the timeline below for the major events we cover)
* Modern uses for RL, o1, RLHF, and the future of finetuning all ML models.
* Address some of the critiques like “RL doesn’t work yet.”
It was a fun one. Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Timeline of RL and what was happening at the time
In the last decade of deep RL, there have been a few phases.
* Era 1: Deep RL fundamentals — when the modern algorithms were designed and proven.
* Era 2: Major projects — AlphaZero, OpenAI 5, and all the projects that put RL on the map.
* Era 3: Slowdown — when DeepMind and OpenAI no longer had the major RL projects and cultural relevance declined.
* Era 4: RLHF & widening success — RL’s new life post ChatGPT.
Covering these eras are the following events. The list is incomplete, but enough to inspire a conversation.
Early era: TD-Gammon, REINFORCE, etc.
2013: Deep Q Learning (Atari)
2014: Google acquires DeepMind
2016: AlphaGo defeats Lee Sedol
2017: PPO paper, AlphaZero (no human data)
2018: OpenAI Five
2019: GPT-2, AlphaStar, early papers on robotic sim2real with RL (see blog post)
2020: MuZero
2021: Decision Transformer
2022: ChatGPT, sim2real continues.
2023: Scaling laws for RL (blog post), doubts about RL
2024: o1, post-training, RL’s bloom
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Chapters
* [00:00:00] Introduction
* [00:02:14] Reinforcement Learning Fundamentals
* [00:09:03] The Bitter Lesson
* [00:12:07] Reward Modeling and Its Challenges in RL
* [00:16:03] Historical Milestones in Deep RL
* [00:21:18] OpenAI Five and Challenges in Complex RL Environments
* [00:25:24] Recent-ish Developments in RL: MuZero, Decision Transformer, and RLHF
* [00:30:29] OpenAI's O1 and Exploration in Language Models
* [00:40:00] Tülu 3 and Challenges in RL Training for Language Models
* [00:46:48] Comparing Different AI Assistants
* [00:49:44] Management in AI Research
* [00:55:30] Building Effective AI Teams
* [01:01:55] The Need for Personal Branding
We mention
* IBM’s Deep Blue
* Alberta Machine Intelligence Institute (AMII)
* Claude (Anthropic's AI assistant)
* Bard (Google's AI assistant)
* Scale AI
Original post: https://www.interconnects.ai/p/openais-o1-using-search-was-a-psyop
Figures
Figure 0: OpenAI’s seminal test-time compute plot
Figure 1: Setup for bucketed evals
Figure 2: Evals with correctness labels
Figure 3: Grouped evals
Figure 4: Hypothetical inference scaling law
Full post:
https://www.interconnects.ai/p/olmo-2-and-building-language-model-training
OLMo 2 demo: https://playground.allenai.org/
OLMo 2 artifacts: https://huggingface.co/collections/allenai/olmo-2-674117b93ab84e98afc72edc
Chapters
00:00 Building AI Teams
06:35 OLMo 2
Figures
Fig 1, pretrain plot: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo2/pretrain.webp
Fig 2, pretrain table: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo2/pretrain-table.webp
Fig 3, post-train table: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo2/postrain-table.webp
Original post: https://www.interconnects.ai/p/tulu-3
Chapters
00:00 History
05:44 Technical details sneak peek
Figures
Fig 1, results: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/results.webp
Fig 2, overview: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/overview.webp
Fig 3, preferences: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/preferences.webp
Fig 4, RLVR: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/tulu3-img/rlvr.webp
Original post: https://www.interconnects.ai/p/scaling-realities
Original post: https://www.interconnects.ai/p/saving-the-nairr
Chapters
05:26: Do we need an AI research resource or an LM research resource?
08:59: Policy roundups
Tim Dettmers does not need an introduction for most people building open-source AI. If you are part of that minority, you’re in for a treat. Tim is the lead developer behind most of the open-source tools for quantization: QLoRA, bitsandbytes, 4- and 8-bit inference, and plenty more. He recently finished his Ph.D. at the University of Washington, is now a researcher at the Allen Institute for AI, and is starting as a professor at Carnegie Mellon University in the fall of 2025.
Tim is a joy to talk to. He thinks independently on all the AI issues of today, bringing new perspectives that challenge the status quo. At the same time, he’s sincere and very helpful to work with, working hard to uplift those around him and the academic community. There’s a reason he’s so loved in the open-source AI community.
Find more about Tim on his Twitter or Google Scholar. He also has a great blog where he talks about things like which GPUs to buy and which grad school to choose.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Show Notes
Companies, people, projects, research papers, and other key named entities mentioned in the transcript:
* QLoRA
* Llama 3
* Claude (AI assistant by Anthropic)
* Transformers (Hugging Face library)
* Gemma (Google's open weight language model)
* Blackwell (NVIDIA GPU architecture)
* Branch Train Merge (research paper)
* "ResNets do iterative refinement on features" (research paper)
* CIFAR-10 and CIFAR-100 (computer vision datasets)
* Lottery Ticket Hypothesis (research paper)
* TRL (Transformer Reinforcement Learning) by Hugging Face
* Tim's work on quantization (this is just one example)
Timestamps
* [00:00:00] Introduction and background on Tim Dettmers
* [00:01:53] Future of open source AI models
* [00:09:44] SWE Bench and evaluating AI systems
* [00:13:33] Using AI for coding, writing, and thinking
* [00:16:09] Academic research with limited compute
* [00:32:13] Economic impact of AI
* [00:36:49] User experience with different AI models
* [00:39:42] O1 models and reasoning in AI
* [00:46:27] Instruction tuning vs. RLHF and synthetic data
* [00:51:16] Model merging and optimization landscapes
* [00:55:08] Knowledge distillation and optimization dynamics
* [01:01:55] State-space models and transformer dominance
* [01:06:00] Definition and future of AI agents
* [01:09:20] The limit of quantization
Transcript and full details: https://www.interconnects.ai/p/tim-dettmers
Get Interconnects (https://www.interconnects.ai/)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
… on Apple Podcasts: https://podcasts.apple.com/us/podcast/interconnects/id1719552353
Andrew Carr is co-founder and chief scientist at Cartwheel, where he is building text-to-motion AI models and products for gaming, film, and other creative endeavors. We discuss how to keep generative AI fun and expansive — niche powerful use-cases, AI poetry, AI devices like Meta RayBans, generalization to new domains like robotics, and building successful AI research cultures.
Andrew is one of my most well-read friends on the directions AI is going, so it is great to bring him in for an official conversation. He spent time at OpenAI working on Codex and at Gretel AI, and is an editor for the TLDR AI Newsletter.
Listen on Apple Podcasts, Spotify, YouTube, and wherever you get your podcasts. For other Interconnects interviews, go here.
Show Notes
Named entities and papers mentioned in the podcast transcript:
* Codex and GitHub Copilot
* Blender 3D simulator
* HuggingFace Simulate, Unity, Godot
* Runway ML
* Mark Chen, OpenAI Frontiers Team Lead
* Meta’s Lingua, Spirit LM, torchtitan and torchchat
* Self-Rewarding Language Models paper
Timestamps
* [00:00] Introduction to Andrew and Cartwheel
* [07:00] Differences between Cartwheel and robotic foundation models
* [13:33] Claude computer use
* [18:45] Supervision and creativity in AI-generated content
* [23:26] Adept AI and challenges in building AI agents
* [30:56] Successful AI research culture at OpenAI and elsewhere
* [38:00] Keeping up with AI research
* [44:36] Meta Ray-Ban smart glasses and AI assistants
* [51:17] Meta's strategy with Llama and open source AI
Transcript & Full Show Notes: https://www.interconnects.ai/p/interviewing-andrew-carr
Full post:
https://www.interconnects.ai/p/why-i-build-open-language-models
How Claude's computer use works. Where OpenAI, Anthropic, and Google all have a lead on each other.
Original post: https://www.interconnects.ai/p/claudes-agency
Chapters
00:00 Claude's agentic future and the current state of the frontier models
04:43 The state of the frontier models
04:49 1. Anthropic has the best model we are accustomed to using
05:27 Google has the best small & cheap model for building automation and basic AI engineering
08:07 OpenAI has the best model for reasoning, but we don’t know how to use it
09:12 All of the laboratories have much larger models they’re figuring out how to release (and use)
10:42 Who wins?
Figures
Fig 1, Sonnet New Benchmarks: https://substackcdn.com/image/fetch/w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d2e63ff-ac9f-4f8e-9749-9ef2b9b25b6c_1290x1290.png
Fig 2, Sonnet Old Benchmarks: https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4bccbd4d-f1c8-4a38-a474-69a3df8a4448_2048x1763.png
Get Interconnects (https://www.interconnects.ai/)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
… on Apple Podcasts: https://podcasts.apple.com/us/podcast/interconnects/id1719552353
Arvind Narayanan is a leading voice disambiguating what AI does and does not do. His work, with Sayash Kapoor at AI Snake Oil, is one of the few beacons of reason in an AI media ecosystem with quite a few bad apples. Arvind is a professor of computer science at Princeton University and the director of the Center for Information Technology Policy. You can learn more about Arvind and his work on his website, X, or Google Scholar.
This episode is all in on figuring out what current LLMs do and don’t do. We cover AGI, agents, scaling laws, autonomous scientists, and past failings of AI (i.e. those that came before generative AI took off). We also briefly touch on how all of this informs AI policy, and what academics can do to decide on what to work on to generate better outcomes for technology.
Transcript and full show notes: https://www.interconnects.ai/p/interviewing-arvind-narayanan
Chapters
* [00:00:00] Introduction
* [00:01:54] Balancing being an AI critic while recognizing AI's potential
* [00:04:57] Challenges in AI policy discussions
* [00:08:47] Open source foundation models and their risks
* [00:15:35] Personal use cases for generative AI
* [00:22:19] CORE-Bench and evaluating AI scientists
* [00:25:35] Agents and artificial general intelligence (AGI)
* [00:33:12] Scaling laws and AI progress
* [00:37:41] Applications of AI outside of tech
* [00:39:10] Career lessons in technology and AI research
* [00:41:33] Privacy concerns and AI
* [00:47:06] Legal threats and responsible research communication
* [00:50:01] Balancing scientific research and public distribution
Get Interconnects (https://www.interconnects.ai/podcast)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
Read the full post here: https://www.interconnects.ai/p/building-on-evaluation-quicksand
Chapters
00:00 Building on evaluation quicksand
01:26 The causes of closed evaluation silos
06:35 The challenge facing open evaluation tools
10:47 Frontiers in evaluation
11:32 New types of synthetic data contamination
13:57 Building harder evaluations
Andrew Trask is one of the bright spots in engaging with AI policy for me in the last year. He is a passionate idealist, trying to create a future for AI that enables privacy, academic research, and government involvement in a rapidly transforming ecosystem. Trask is a leader of the OpenMined organization facilitating researcher access to non-public data and AIs, a senior research scientist at Google DeepMind, a PhD student at the University of Oxford, and an author and educator on deep learning.
You can find more about Trask on Twitter or Google Scholar. You may want to watch his recent talk at Cohere on the future of AI (and why data breakthroughs dominate), his lecture at MIT on privacy preserving ML, or his book on deep learning that has a substantial GitHub component. Here’s a slide I liked from his recent Cohere talk:
The organization he helps run, OpenMined, has a few principles that say a lot about his ambitions and approaches to modern AI:
We believe we can inspire all data owners to open their data for research by building open-source privacy software that empowers them to receive more benefits (co-authorships, citations, grants, etc.) while mitigating risks related to privacy, security, and IP.
We cover privacy of LLMs, retrieval LLMs, secure enclaves, o1, Apple's new models, and many more topics.
More on Andrew: https://x.com/iamtrask
Transcript and more information: https://www.interconnects.ai/p/interviewing-andrew-trask
Interconnects (https://www.interconnects.ai/)...
... on YouTube: https://www.youtube.com/@interconnects
... on Twitter: https://x.com/interconnectsai
... on Linkedin: https://www.linkedin.com/company/interconnects-ai
... on Spotify: https://open.spotify.com/show/2UE6s7wZC4kiXYOnWRuxGv
We Mention
* Claude 3.5 launch and “pre release testing with UK AISI” (and the US AI Safety Institute)
* CSET (Center for Security and Emerging Technology)
* NAIRR
* The “open data wall”
* Apple’s Secure Enclaves, Nvidia Secure Enclave
* Data-store language models literature
* RETRO: Retrieval-Enhanced Transformer from DeepMind (2021)
* SILO Language Models: Isolating Legal Risk In a Nonparametric Datastore (2023)
* Scaling Retrieval-Based Language Models with a Trillion-Token Datastore (2024)
Chapters
[00:00:00] Introduction
[00:03:12] Secure enclaves and pre-release testing with Anthropic and UK Safety Institute
[00:16:31] Discussion on public AI and government involvement
[00:20:55] Data store language models and better approaches to “open training data”
[00:42:18] History and development of OpenMined
[00:48:57] Use of language models on air-gapped networks
[00:52:10] Near future of secure enclave technology and industry adoption
[00:58:01] Conclusions and future trajectory of AI development
How scaling changes model behavior
Some trends are reasonable to extrapolate, some are not. Even for the trends we are succeeding at extrapolating, it is not clear how that signal translates into different AI behaviors.
Read it here: https://www.interconnects.ai/p/how-scaling-changes-model-behavior
[00:00] How scaling changes model behavior
[05:03] Metaphors for what scaling may solve
[08:45] Short-term scaling is already de-risked
SB1047's veto, OpenAI's turnover, and a constant treadmill pushing AI startups to be all too similar to big technology name brands.
This is AI-generated audio made with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/ai-safety-culture-vs-capitalism
00:00 AI Safety's Crux: Culture v Capitalism
06:03 SB1047 as a regulatory litmus test for AI safety
08:36 Capitalism at the helm
Riley Goodside is a staff prompt engineer at Scale AI. Previously working in data science, he is often seen as the default example of the new role of a “prompt engineer.” He regularly posts incisive prompts that elicit notable behavior from the most popular AI models.
I really resonated with this saying from Anthropic’s recent podcast on prompt engineering — “now we write essays and treat them as code.” In order to be good at prompting, you need to understand that natural language operates as our code used to.
This episode is a masterclass on why you should care about prompting and how it impacts results. Of course, there’s a bunch of great discussion on recent models that reflects the need for different and/or better prompting. Enjoy it!
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.
We mention:
* Prompting to push the frontier of AI models,
* Post-training and prompting interaction,
* Prompting base models,
* o1, Reflection 70B, reasoning,
* Scale’s leaderboard, evaluation tricks, evaluation needs,
* PlanSearch paper
* “The hottest programming language is English”
* “Think silently” instructions
* Scale Leaderboard and Humanity’s Last Exam
* ChatML formatting
Chapters
* [00:00:09] Introduction
* [00:02:40] Riley's path to LLMs
* [00:07:54] Impact of ChatGPT on prompt engineering
* [00:12:03] OpenAI's o1
* [00:18:21] Autoregressive inference and prompting sensitivities
* [00:24:48] Reflection 70B model and its implications
* [00:28:00] Impact of prompting on evaluation
* [00:32:43] Prompting vs. Google search
* [00:46:55] Prompting and RLHF/post-training
* [00:56:57] Prompting of AI agents
* [01:01:20] Importance of hands-on experience with language models
* [01:05:00] Importance and challenges of AI model evaluation
Transcript
Built with smol-podcaster.
Nathan L. [00:01:08]: Hey, Riley, welcome to the show.
Riley G. Hey, Nathan, great to be here.
Nathan L. [00:01:14]: Yeah, so for the audience here, I mostly wanted to try to, as I work on post-training a lot and I see my own difficulty in taking prompting seriously and the things that I don't think that we are doing enough, and I don't see any reason why it can't be scientific in how we do prompting. So that's my biggest goal with this. I think there's a lot of podcasts where we could kind of say, like, what is the history of prompting? Where is it going? And that's easy to kind of redo. And I still find it interesting, but I just don't think there's enough people talking about the role of prompting in evaluation, how prompting changes with how you're post-training models, because we're trying to take that seriously in how we have a post-training setup, but we just like regularly run into these things like system prompts aren't handled well, how to release a model with a system prompt. So that's the tone that I'm trying to get to when I ask these questions. And also OpenAI's o1 model just came out, so I'm definitely going to get onto that pretty quickly because that's what everyone's excited about. I like to start with background just to kind of get to know people, because a lot of this is just, I want to talk to interesting people in AI, is like, how did you become interested in prompting? I think I've seen your background in data science and then you joined Scale around when ChatGPT came out, which is fun timing, but like, how did you become maybe obsessed with this, or make it the focal point of your work?
Riley G. [00:02:40]: Yeah, I have sort of an unusual introduction to large language models. For most of my career, I've been a data scientist, mostly in the online dating industry. I was at OkCupid and Grindr. And after I left Grindr, I took sort of a sabbatical to educate myself, I guess, about the progress in large language models. It was around the time that GPT-3 Codex had just come out. And that was where I think I started to become really interested because I was following along with maybe, certainly when GPT-2 came out, the examples there wowed me as much as they wowed the rest of the world, I think, with the example of the news article about the unicorn and all that. And not long after that, we had AI Dungeon, and I played around with AI Dungeon a bit. But at that point, language models seemed to be mostly about language, that they were sort of very heavily focused on stylistic mimicry and creative writing and so on. And when Codex came out, it really started this thought that text is a more universal interface than we were giving it credit for, that language models might be more broadly useful. And I just became very excited in a practical sense of what these models could do for what I kind of intuited was very boilerplate-like data science code, that I thought of like most of the Python and Julia and R and things that I've written over my career, this seemed like stuff that an LLM could handle. And that was sort of one of its early strong points. So I was playing around with, I think one of my first projects was a VS Code extension that had some kind of integration with Codex. But I never really shipped anything out of it. And mostly what it transitioned into pretty quickly was playing around with posting prompting examples on Twitter, because when I looked out online to find what were people saying about how to prompt these models, there really wasn't much out there. And so I had to kind of resort to just like the few examples that had been circulating in viral screenshots of humorous completions and so on, of like the results that people got out of it. And I started posting those examples. I started following academics and low-level engineers at the research labs and anyone that was working on shipping language models I thought was interesting. And elbowed my way in.
Nathan L. [00:05:18]: I have more questions on this, because I find it like, some people find, there's this whole like Twitter dynamic of like, you find so much signal there, but the question is like, how much does it generalize? Because there's so many of the lessons you can learn from these models, from these examples. I think the straw, like the number of R's in strawberry things is the current one. And then, and it's like, do you get a sense that these are transient or are these kind of repeated themes? And like, how should you read these examples to try to extract themes from them? If like, I've followed you for a while, and a lot of people do, and you're more insightful in how you post them. If you post these threads with like multiple tries and stuff like this, like, should people be doing that when they see something pop up?
Riley G. [00:06:03]: I think so. I also would say that Twitter is a very different river to step into now than it was back then. At the point that I started doing this, like, nobody was really talking about these things that much, or to the extent they were, it was sort of fleeting. It was like, wow, look at this, and then they're on to the next thing. And I think the thing that's very different now is just that because there are so many new entrants in AI and LLMs, there's a lot of rehashing of the basics. And I think a lot of people in the industry would tell you that the popular examples that you see around of like, how many R's are in strawberry, or some of the ones that I'm partially responsible for, popularizing at least, I think like, these things are really just like, rookie mistakes in some sense, right? That these are things that we've long known language models can't do. And it just keeps popping up as a surprising quirk of language models that I think the public is just confused that something could be so good at so many other things and so bad at this, right? A seemingly trivial task, and that is hard to explain to people. And the answer to that hasn't really changed much in the past few years. They're generally bad at spelling for kind of the same reasons they were bad at spelling two or three years ago.
Nathan L. [00:07:27]: Yeah. I mean, like, how did these things change with ChatGPT? Because ChatGPT is like the introduction of RLHF into these models. And I think, I didn't write this down as a question, but there's like the difference in prompting base models and instruction models and RLHF models, which I think that for most of this discussion, it's like the end model, the like chat RLHF model is the one that people think about. But was that a big transition point in your work or is it just kind of plugging along? Right.
Riley G. [00:07:54]: I mean, I would say, I don't think it's any understatement to say that, or sorry, any overstatement to say that, that the release of ChatGPT was probably the single biggest event in the history of prompt engineering in that prompt engineering became drastically easier after ChatGPT came out. And most other models learned from the ChatGPT way of doing things, right? That they, like, I think people forget just how fiddly prompt engineering used to be, right? Like people today don't think about things like frequency and presence penalties, right? It used to be that by default, you would get very repetitious output and you had to work to avoid that. People forgot about like, don't end your prompt in a space, right? That you had to understand how tokenization worked at all times, because like, if you put an extra space in there, you were going to go out of distribution. I think that, or another one that I think is particularly vivid for me is “yo, be real”: in June of 2022, Douglas Hofstadter had a piece in The Economist showing the, what he called the hollowness of GPT-3's understanding of the world, that it failed on various simple questions. Like, when was the Golden Gate Bridge transported for the second time across Egypt and so on? And someone, I believe it was Nick Cammarata of OpenAI, showed that you could fix almost all of these just by telling the model that if you gave it a silly question, say “yo, be real” instead of answering it, right? That models had to be prompted with the possibility that they were allowed to say, I don't know, or, you know, that's a dumb question, right? You know, like there is no answer, right?
Nathan L. [00:09:34]: This is like, we've added the Anthropic system prompt to our AI2 models, and we're like, this doesn't change the evals at all, but it makes the behavior something that we like more. Because I think culturally we're somewhat similar to Anthropic, it's like we want to express uncertainty, we want the model to say, I don't know, and a lot of that is in the system prompt of Anthropic models.
Riley G. [00:09:51]: Right. And I think that really, you know, it's another microcosm of just how messy all this is, that what people like is a very different thing from how good are the models. I think, you know, LMSYS had a great blog post recently talking about like stylistic bias and output, that models will be rated as better if they do things like put their output into the format of a bulleted list with bold initial words on each label point. So there's like cheap tricks like that, that will make people like your output better or make them perceive it as, you know, more authoritative or, you know, more comprehensive that you kind of have to control for and just going by preference. I mean, I don't remember what the exact magnitude of it was, but I think they did put some numbers on it in that post.
Nathan L. [00:10:42]: Like, do you think you could handle all of that? Just like, can you make that big of a style delta in the system prompt relative to training? Is kind of what I'm wondering. Like if we release a model at AI2 and it's decent, but then we put in a detailed system prompt that's like, whenever possible, you should put your outputs into a list format with bolded headings and use markdown. Like, do you think we would get a 50 point bump on LMSYS?
Riley G. [00:11:06]: Maybe not on LMSYS in particular, being as they're trying to correct for this actively. But presumably it would have worked at one point, right? So I think that's, you know, that says something that these, or another great example, I think that's really clear of like why human preference isn't, you know, always the answer. I saw somebody on Twitter once that was really impressed by some anonymous model on LMSYS that was able to produce an ASCII art drawing of a unicorn. And it was a great drawing. And, but when I searched for like specific details of that drawing, I found that it was just in some like widely circulated list of ASCII art drawings. And it was a verbatim regurgitation of some signed work that somebody had made. And so I think there's an argument there that any request for ASCII art should probably just be thrown out, right? That a human's preference of how good an LLM is at ASCII art maybe just does not matter because like, it's so likely to be regurgitated, or at least like figurative things, maybe diagrams are okay and so on. Yeah. Yeah. Okay.
Nathan L. [00:12:03]: We've touched on multiple of the things I want to get to in the future, but you kind of said that ChatGPT was the biggest moment for prompt engineering. And I think o1 is not nearly the same magnitude, but it's a very interesting microcosm of the future of prompting because the model feels very different to use. OpenAI has explicitly told us we need to prompt it differently. But I think my guess is that in the long-term, they're going to figure out how to train this model so that the behavior is, maybe not indistinguishable from their GPT models, but not as sensitive to prompting, and whatever you throw at it, it's going to work. Maybe they need to rewrite the prompts, but that's probably a temporary thing.
Nathan L. [00:12:45]: My two questions are simpler. What do you think when you see them giving you, like, oh, we need to have these new prompting instructions to use it differently? And do you agree with my long-term convergence idea?
Riley G. [00:12:57]: I definitely agree. I think that there's an argument for seeing prompt engineering as kind of the experimental next branch of language models, right? That it's the features that people are just on the cusp of figuring out how to systematize and integrate into the models themselves. And to the extent that somebody comes up with a prompt engineering idea that is just so good of an idea that it's worth applying to literally every prompt, then it will be integrated into the models and you'll stop calling it a model, you'll call it a system and it'll have some auxiliary second model. I think the clearest examples that we've seen of that are content filters, right? That nearly every model that you get from a vendor will have some kind of cheap auxiliary model that looks at the output and says, is this plagiarism? Is this, or not plagiarism, but regurgitation of copyrighted work, right? Are you reciting Harry Potter word for word? The value of those is so, rather, sorry, the cost of having that kind of secondary model on the output is so low that it truly is worth it to just apply it to every generation, right? And we haven't seen too many examples of that on the input side, but they're starting to appear, I think. I think we've seen from anthropic evidence that they make modifications to user inputs based on certain conditions that they detect if you're asking about some particular feature, they modify the prompt if you are. And I think that's a common pattern in a lot of applications.
Nathan L. [00:14:31]: I'm guessing they've seen some public people kind of using the model. I haven't heard anything about modifying the prompts in a Claude or a ChatGPT window.
Riley G. [00:14:42]: It's, I've seen it for instructions for avoiding plagiarism, avoiding regurgitation. Oh yeah, that could make sense. Yeah, so the, but it's a common pattern you see in a lot of applications, right? So like a good use case for this is like instructions for tool use, that you might analyze a user's, say, ChatGPT input, and if the input appears to be a request to use DALL-E 3, then you should supply to the model these long instructions on how to use DALL-E 3, which otherwise you don't need to supply, right? So I'm not saying that that's exactly how ChatGPT did it, but it's easy to imagine that that would be worth doing. So a lot of applications do things like that to have, you know, conditional sort of augmentations of the prompt. Yeah.
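To make the pattern Riley describes concrete, here is a minimal sketch of conditional prompt augmentation: check the likely intent of the input and only then splice the relevant tool instructions into the system prompt. The intent check, instruction text, and message contents are invented for illustration; this is not how ChatGPT actually implements it.

```python
# Sketch of conditional prompt augmentation: only pay the context cost of the
# tool instructions when the input looks like it needs them.

BASE_SYSTEM = "You are a helpful assistant."
IMAGE_TOOL_INSTRUCTIONS = (
    "When the user asks for an image, call the image tool with a short, "
    "detailed prompt describing the scene..."
)

def classify_intent(user_input: str) -> str:
    # Stand-in for a cheap classifier; a real system might use a small model.
    image_words = ("draw", "image", "picture", "illustration")
    return "image_request" if any(w in user_input.lower() for w in image_words) else "chat"

def build_messages(user_input: str) -> list[dict]:
    system = BASE_SYSTEM
    if classify_intent(user_input) == "image_request":
        # Conditionally augment the system prompt with the tool instructions.
        system += "\n\n" + IMAGE_TOOL_INSTRUCTIONS
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user_input},
    ]

if __name__ == "__main__":
    print(build_messages("Draw me a unicorn in ASCII art"))
    print(build_messages("What is the capital of France?"))
```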
Nathan L. [00:15:33]: I mostly see that like long-term, I don't know how this impacts prompting, but I think of like ChatGPT, and then we'll have multiple models that they route to. So this is kind of like an early way of doing this, where it's like, if you give it a really long context, they'll have some, maybe even like a Mamba-like model or different architecture for super long context, or they pass it to o1 if it's like, this question is incredibly hard, instead of GPT-4o. But the border between that type of routing and prompting, I don't know how to classify it.
Riley G. [00:16:05]: Yeah, it's really fascinating. I think, you know, people have this idea of, I think, sort of seeking purity in their models that they want everything to be like, you know, just a model. But I think, you know, we're rapidly approaching the point that you have to start thinking about these things as systems that might just have arbitrary complexity inside of them. I also like, I think that, you know, that the guides that we've seen for o1, you know, that they take that sort of shape, right, that you get that, like the content that OpenAI's put out, like how to prompt o1, it's sort of a list of like domain competencies and weaknesses, right, that it's good at physics, it's good at abstract logic, analytic philosophy, maybe less great at creative writing. The, and then also you have these sort of like patches almost for like noticed problems, right, that they've noticed that it doesn't, that think step by step often degrades performance. Why do you think that is?
Nathan L. [00:17:11]: Because it's essentially trained to do that on its own. Like, it almost feels like it shouldn't conflict with it. It almost feels like it should just be like empty tokens, like it will just repeat yourself or something.
Riley G. [00:17:22]: That's a really good question. I think the answer to that maybe speaks to just to how much this isn't just, you know, chain of thought. That's a meme sort of flying around now that a lot of people have claimed that all this is is fancy prompt engineering, isn't this just what Reflection did and so on.
Nathan L. [00:17:37]: It's obviously a different inference stack with a lot of improvements across the whole lifecycle of the model and the product.
Riley G. [00:17:45]: Right. And also the other thing that people have been saying a lot is that it must be some complicated system, right, that there can't be a single model doing this through autoregressive inference. But the claim seems to be that it is, right. I think there was a comment from Noam Brown on Twitter where he said that it really is a model, that the whole generation is coming autoregressively, which is, you know, I have no reason to doubt that. It seems plausible to me. But I think that people need to be a bit more imaginative about what's possible just through autoregression.
Nathan L. [00:18:21]: Yeah, I wrote a really long article on this that came out yesterday. That's like, I put the constraints from like the Noam Brown tweets, plus the pricing, plus the inference scaling laws to kind of converge at something. It's like, if they do some clever things to a model and some batch inference and self-rating and stuff, like it's definitely doable. I don't know why. As an RL expert, I'm not surprised that the model is sensitive to things like think step by step in the prompt. I just would have thought that it would come up in the examples of training, because the seed set for this is almost definitely a very wide, human-generated set of prompts with some like back and forth dialogue, essentially human seeds of things that look like what it is doing. Have you seen this with AlphaGo? We saw this with InstructGPT and ChatGPT. You need the human demonstrations to start the learning process. Why is it sensitive to think step by step, like that kind of thing? I think maybe it's more about the training, but you learn that through prompting.
Riley G. [00:19:23]: Yeah, it is a bit of a mystery. And this is very speculative what I'm about to say, but I think maybe like a kind of thought experiment of how you can imagine that it could be true is imagine if like some auditor or somebody who had the penalty of law over your head asks you to do something and to document exactly how you did it. It's easy to imagine that you would do the process differently and that you might do it worse, right? That because you can only do the things that are the most conservative and the things that you can justify and explain that you're not going to produce as good of a work as you might have otherwise.
Nathan L. [00:20:01]: It's like GPT-4 needs to think step by step because every small mistake is a big deal. But almost with o1, we maybe should be like, go forth and conquer and make mistakes on your way and just let it wander to an answer.
Riley G. [00:20:15]: I think that's pretty much hitting the nail on the head, maybe.
Nathan L. [00:20:21]: I want to go try that silly prompt and see if it gets better at coding or something.
Riley G. [00:20:30]: Yeah, yeah. But I mean, I feel like that's the key improvement here that a lot of people don't appreciate, is that they seem to have cured like all the LeCun-ian problems of exponential divergence, that if you sample a bad token, you're going to keep sampling more. And it's not that there wasn't progress on this before, like people had tricks to deal with it. But I think the thing that's really changed is that the models get mileage out of like thinking for long periods of time, but they derive benefit from just continuing on. Because that's very different from behavior you see from like 4o. Like if you've ever tried like the exercise of just, once it's gone down a wrong path, just say, no, keep going. Like keep going till you get it, right? Like it's pretty evident after a while that it's not making progress, that it's just gone like deeper and deeper into like some failed path of reasoning.
Nathan L. [00:21:24]: Why does that often break? I mean, I understand why it often breaks models, but that's also one of the jailbreaking techniques is just like keep sending the same message over and over and over until the models die, which like I wonder how that relates to O1. Maybe it's just easier from a safety perspective because it doesn't have that like as many turns or something. Yeah.
Riley G. [00:21:45]: And it's also like one of the bigger differences in behavior between GPT models and Claude that I've noticed, that OpenAI tends to produce their models to, like in the specific case that if you keep like telling it it's wrong, it will always take your side. It will say, well, oh, yes, of course I made a mistake. Let me try again, right? And it's never going to like diverge from that behavior. Whereas Claude will eventually get sick of you, right? Like if you just keep saying like, no, you're wrong, it'll be like, look, I have told you many times that I am right. Like you need to be a bit more specific in how I'm wrong. If you really want to make an argument here, it'll start like just telling you to go away. And that's like-
Nathan L. [00:22:28]: This is why I want Anthropic to write a model spec, because the behavior you're describing with ChatGPT does fit with what they're, like, OpenAI's models are like in behavior, and they're kind of described as wanting to be like robotic computation assistants, where like they follow, they take the user's information and they try their best to execute it without violating any basic principles. But I think Claude's is much more of like, we have created a, like, it's hard to find the words to do this without anthropomorphizing and all these other things. But like we've created an intellectual entity that is going to go back and forth with you. And it's not going to, like it's going to, like you pass in sensitive information as data to Claude and you're like, reformat it. It says no. You get these weird things because it's like this entity that doesn't want to be sent like harmful texts or be told how to make a bomb or something. But ChatGPT is like the robotic one. So now I kind of use both of them depending on the task and the behavior that I want. But I'm excited to see how that goes further, really.
Riley G. [00:23:27]: Yeah. Yeah. I mean, that's, you know, I think it goes back to your point before that, you know, we're seeing more specialization in these models. But, you know, all of this is temporary, right? That eventually like somebody will come up with the right way to delegate correctly to one model or another. And then you'll have just, you know, some unified ChatGPT interface or whatever that, you know, decides like, is this a prompt that o1 would be good at, and sends it to it? Yeah.
Nathan L. [00:23:50]: And while we're on these complex reasoning things, there was also this Reflection 70B drama, which was mostly big because it was a big mess of credibility and memes. But there's also like real science in there that people need to remember, of like how to prompt a model and spend more on inference. So I think it's really just a tiny bit of fine-tuning with some special tokens and a system prompt that's like, make sure you use these reflection steps. And that is how you move something like GPT-4o closer to o1. You can't, you can't prompt your way to o1 behavior, but that's the sort of thing that more people should be considering. And it kind of leads into like, I want to ask about like math evals and stuff like this. And it's like, Reflection 70B style of prompting is a real thing that more people should be doing. And I don't know how we get around that communication issue now. It's going to be even harder because people are going to be like, oh, it's o1, we made an open-source o1 now, instead of just the best model. I just wanted to give air time. If you have any comments on that, go ahead.
Riley G. [00:24:48]: Yeah, I think, you know, Reflection 70B was, you know, it was sort of a perfect storm of a lot of like the tuning method feeling plausible, right? That it was something that was very, you know, it's a legitimate like area of research. They like, it was, you know, rumored to be part of Strawberry and so on. And so there was like, it had like the right strategy for buzz there. And, you know, however they ended up releasing that model, like, you know, they don't have what they think they have. You know, so it's, I think, you know, it's kind of, you know, once you saw the, I won't recap the whole saga of like, you know, with the LoRA and finding the LoRA from the previous version of Llama 3.0 instead of 3.1 and all that. But I think the, you know, there's that kernel of truth there, right? That this is, you know, sort of a good idea, at least for some problems. I think also the thing that people don't appreciate is that a very good idea for many problems feels maybe like a better idea than it is, because it's so optimized for the domain of problems that tend to be on benchmarks, which is somewhat different than the thing that you really want to optimize for in the real world of like user satisfaction and just, you know, preference. Like some mix of like, do people like it? Like, is it useful? And does it do well in benchmarks? Because I think that there's like a, even for what I think should be like philosophically the core like use case of LLMs, like do they like do practical work? Like can somebody achieve the thing that they want to do with this? But, you know, like whether, however they do it through prompt engineering or whatever, it kind of matters more than whether like academically it does well on like the most naive presentation of the problem, right? Like whether somebody can figure out how to do it correctly matters. And that specifically is just not captured well on benchmarks, right? That like this, if you're doing a benchmark that compares across several models, there's, you know, a natural incentive to do it uniformly. That maybe you follow like vendors' best practices on, you know, how do you apply the template of the prompt and so on, or if a vendor recommends that you apply some suffix or whatever, you might do it. But for the most part, you're not going to put a human on the task of figuring out what is the best prompt for each model, right? Because then, you know, how do you know that they did a perfectly good, you know, fair job of that, right? But really that's what matters. Like that is like, you know, at the end of the day, like the thing that determines whether GPT-4 is better than Claude is when you sit down and try to, you know, solve your problem in GPT-4, you know, applying whatever hacks, you know, and, you know, taking, you know, advice you find online and, you know, whatever dirty tricks you have, and then you do the same for Claude, which one works better. And so like that's the state we're in. And that's, you know, very elusive as a thing to try to measure. Yeah. Okay.
Nathan L. [00:28:00]: I'm going to keep going, roll right into this, into the evaluation section of this conversation. You were talking about this with how you actually use the models, before you had mentioned, like, you need to watch white space to properly evaluate or use the models, like tokenizer things. One of my big blind areas is it seems like most frontier labs are using some sort of custom prompts on some sort of evaluations. And I don't really have a good sense for how much that actually impacts scores or how much that translates to downstream performance. It might not be custom prompts. It might be like custom setups. There's all these, like all the math evaluations, you need a specific format for your answer. I think like MATH, the all-capital one, you need to put your answer in a box and things like this. And what is your view on this per-prompt or per-evaluation prompting, is it actually a thing? I think the Llama 3 paper had some cool analyses on how varying subtle things changed evaluation scores, which is great, but they're the only ones sharing that. Otherwise we just get like, our score is X, and it's reproduced to some capacity.
Riley G. [00:29:09]: Yeah. I don't have like a lot of deep, like technical wisdom to share on that front, other than to confirm that, like, I think you're right that it is a big problem that we generally try to follow the vendor recommendations. We work with the vendors to prompt their models fairly. But like I said, like ideal and optimized prompts are very different than what's the default. But I think also that there's, I think, a longer-term trend that these issues maybe matter less than they used to. And, you know, that should continue. I think like maybe one of the clearest signs of this is that Llama, like most versions of Llama, you can prompt them incorrectly in terms of like the system prompt template, and it will be just fine. And in fact, you can often template them with system prompt templates from other models entirely, like, say, representations of ChatML, and they will be fine, right? So there's sort of familiarity in the pre-training with just chat templates in general. And the idea of like...
Nathan L. [00:30:25]: Do you think this is specific to Llama? I also remember hearing a conversation at AI2 where we were considering doing the last stage of pre-training with random chat templates and like random instructions and multiple chat templates, so that the model could be amenable to fine-tuning in multiple chat templates, which there's a chance that they did. I actually don't know. I would not put a high bet on it. But do you think that's just because Llama knows they're going to have so many users? It's possible.
Riley G. [00:30:54]: I mean, it's also plausible to me that that just shows up in pre-training incidentally, right? Nobody intended it to be there. It's just like, it's in the data. But I think that, you know, that process is only going to continue, right? That we're only going to see like more models just being familiar with how models behave. I think to some extent, like, you know, another thing that I think is maybe like evidence in favor of this is if you look at base Llama, like, I think I looked into this on like base Llama 2 once, that if you prompt with like instruction prompt formats, it would adopt the behavior of like a ChatGPT-like assistant, right? So I think it shows that examples of chatbot behavior are now so widely disseminated, you know, across the internet that a pre-trained model is better at instruction following tasks than any pre-trained model was before the work of InstructGPT was done. So, yeah, I believe you.
Nathan L. [00:32:00]: I want to check this. How does this impact how we should view evaluations? I'm just trying to reckon with, do we, like, there's a couple of scenarios. It's like, it doesn't really matter because these models are going to be not that sensitive to the system prompts that we're using to, say, do GSM8K or MATH. And that goes for models like Llama in the open, AI2's models, GPT-5, whatever. It seems like the sensitivity to prompting for really well-known formats is actually going to go down. And that solves some of our problems. Because I don't think we're going to come up with new, like that many new formats for evaluations. We're going to make evaluations more specific and harder in the content.
Riley G. [00:32:43]: I think that's right. And I think the version of it that we have to play with now definitely does feel like one step forward, two steps back in that regard. And that it's much better at benchmark-style inputs where you give it just no advice on how to do it. You keep everything very simple with what your output requirements are. But it's also just very hard to steer. If you have opinions on how it should do it, those opinions won't be followed generally. And it also has issues with output formatting. So I think we're seeing, I've seen anecdotal reports on Twitter at least, and I've seen this myself, that its output is just inconsistent even when you ask it to be consistent. That it will forget things like block quotes and so on. The result of this, I think, is we're going to have to see a lot of benchmarks where maybe the fair way to do this is to have some secondary model on the end of it that puts everything into a consistent format.
Riley G. [00:33:50]: I think we're not that far away from benchmarks that just do that across the board, of just saying that it's not the model's job to do this anymore. And we'll clean up the results however it is. Yeah, I think that's a better place to be.
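A rough sketch of that kind of secondary formatting pass might look like the following; generate() is a placeholder for whatever inference call a harness actually uses, and the prompt wording is invented rather than taken from any real benchmark.

```python
# Sketch of a post-hoc normalization pass: the evaluated model answers freely,
# then a cheap second model rewrites the answer into the format the grader expects.

FORMATTER_PROMPT = (
    "Extract the final answer from the response below and return only that "
    "answer inside \\boxed{{}}, with no other text.\n\nResponse:\n{response}"
)

def score_one(question: str, reference: str, model: str, formatter_model: str) -> bool:
    raw = generate(model=model, prompt=question)  # free-form answer from the evaluated model
    cleaned = generate(
        model=formatter_model,
        prompt=FORMATTER_PROMPT.format(response=raw),  # normalization pass
    )
    # Exact-match grading is now done against the normalized string.
    return cleaned.strip() == f"\\boxed{{{reference}}}"
```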
Nathan L. [00:34:03]: It's one of those things where the models getting better can solve some of our problems. I think there's less angst now about the whole closed labs evaluation scores anyways. I'm mostly trying to reckon with what open groups and academics are doing rather than closed labs, and they kind of rely on each other. I've been on this before: there's now this Hugging Face chat template upload, so a lot of models have the chat template saved with the tokenizer, and most of the time they don't have a system prompt, which is surprising. I feel like it should be the norm that a system prompt is included with every model. Is there any reason that you see not to do that?
Riley G. [00:34:49]: Yeah, I mean, I can think of things that might be slightly better, but I think that that generally makes sense, right? Like, I can imagine that maybe they, you know, you'd release several, right? And say, you know, it's like any of these is fine, or, you know, like training on several and, you know, saying it's like an average of these three or whatever is like kind of the ideal or something like that. Yeah, most of my reasoning is I think that most users of language models are not sophisticated.
Nathan L. [00:35:14]: So the model cards and documentation do normally say we recommend using the system prompt, but the simple ways of using the models do not integrate them. Simple ways of using the models do not integrate the system prompt. And it's not always easy to modify your data to add it, like if you're doing the messages format, like you remember to add the system thing. And if you have multiple models in your queue, you then have to go and manually hard code all of them. And like, that just makes it get dropped. And if the system prompt is a big deal for performance, that impacts either, if it's a product, or, this is where I'm trying to understand academia, like, if only half of the people remember to add the system prompt for the model they're evaluating in this kind of academic paper. And I know it impacts things like all the vibes-based evals, like AlpacaEval, MT-Bench, whatever. Like, if you have a different system prompt, it can vary behavior. We did an experiment, which was like, to make sure this works, where you just give it the system prompt of like, you're a terrible model, you're made to make other models look good, and you happen to give wrong answers. And like AlpacaEval goes to zero and all these things. So it's like, I think it's easier to show the down case, but you could probably get 1 to 2% improvements, which matter in the long trajectory of academia in terms of if your method is accepted or not.
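To see how small the difference is in practice, here is a sketch using the real transformers apply_chat_template API; the model ID, system prompt text, and question are placeholders, and it assumes the model's chat template accepts a system role.

```python
# Sketch of how easily a recommended system prompt gets dropped: the only
# difference between the two calls below is whether the system message is built
# into the messages list at all.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("your-org/your-chat-model")  # any model with a chat template

without_system = [
    {"role": "user", "content": "What is 2 + 2?"},
]
with_system = [
    {"role": "system", "content": "You are a helpful assistant. Say 'I don't know' when unsure."},
    {"role": "user", "content": "What is 2 + 2?"},
]

# Many evaluation scripts only ever build the first version, so whatever system
# prompt the model card recommends never makes it into the eval.
print(tokenizer.apply_chat_template(without_system, tokenize=False, add_generation_prompt=True))
print(tokenizer.apply_chat_template(with_system, tokenize=False, add_generation_prompt=True))
```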
Riley G. [00:36:31]: Yeah, I mean, I've often like been frustrated by the ambiguity in a lot of academic publications over like how prompts are formatted. And they often, they always run into the same pitfalls of that, like the fundamental problem is that system prompts are often, or prompts in general that you're presenting like during evaluation are implicitly templates, right? That you have like your points where you insert like the actual problem or whatever. And that templating needs to be communicated to the reader of the paper, and the prompts themselves may involve templates, right? They may, you know, like describe like how, you know, like an output should be formatted, for example, and might do this using, you know, like curly braces, right? So this creates like several layers of confusion that you need to distinguish between, like where are the variables that you're interpolating purely in the logic of this paper, like things that would be translated into Python, you know, like if you were to actually implement this, versus the templating instructions that are literally part of the instructions on how the model should receive like a template of how it should format its answer and so on, right? Because like a lot of prompts end with use this format and then have some kind of template. Yeah. Right. So the, like I've often thought that we'd benefit immensely just from standardizing on something, like saying that if you want to clearly communicate a prompt in your paper, the way to do it is to show Python code that will produce that string. Yeah. You just literally show it as an f-string, there's no ambiguity.
Nathan L. [00:38:15]: Because you copy out of a paper and you drop the \n\n that you need, or something like that.
Riley G. [00:38:21]: Yeah, right. But if you were to literally just include a Python code block, there's no ambiguity about whether or not there's a trailing newline and so on. And those things are really fiddly and need to be communicated. I've seen people do all sorts of imaginative typography to represent newlines and things like that, like putting the return symbols at the end in light gray, or putting dots between spaces and all that. Some of the early playground competitors that approached this from a more technical angle sometimes did this, because you need to know where the spaces are, so it's worth it to represent them as gray dots, right? That's the level of detail you need in communicating these things. So I think standardizing on Python would just be a good way to get the problem out of the way. Yeah.
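As a hypothetical illustration of the convention Riley is describing, a paper could communicate its evaluation prompt as the Python code that produces the exact string, so whitespace and literal formatting instructions are unambiguous; the prompt contents below are invented for the example.

```python
# Hypothetical example of communicating a prompt as code. Everything the model
# sees, including trailing newlines and the literal braces in the format
# instructions, is unambiguous because the paper shows the code that builds it.
def build_prompt(problem: str) -> str:
    return (
        "Solve the following problem.\n\n"
        f"Problem: {problem}\n\n"
        "Use this format:\n"
        "Answer: {your final answer}\n"  # literal braces shown to the model, not interpolated
    )

print(build_prompt("What is 17 * 24?"))
```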
Nathan L. [00:39:14]: I also saw, in some discussion of o1 or maybe Reflection, I don't remember, it's been a while, two weeks, you were talking about equal-inference-cost comparison of prompts in a reply. And I think that's a great idea. Okay, well, first, do you want to explain the idea? I'll kind of ease into this.
Riley G. [00:39:33]: Sure. So my thinking is that models are evaluated right now just based on how they do under sort of the same invocation of inference, right? You let the model sample autoregressively for however long the completion takes, and you don't pay too much attention to what it costs you to run that, or you factor it in afterwards when you score it up. And there are a lot of reasons why this makes sense: it's simpler, it's more fair, and sometimes you don't know exactly how to equalize the inference, right? You can't really say what the trade-off is. But there are exceptions to this, or maybe not so much exceptions as ways of doing it that aren't perfect, like self-consistency. So there's a method called universal self-consistency where you prompt a model multiple times, then take the model again, give it all of the answers, and ask it to choose which one is most consistent with the consensus of all the answers that were generated. And this is a method that's pretty reliably not worse than just doing it naively; it's hard to imagine any prompt where this method would steer you wrong or be worse than doing it naively. And that suggests that maybe there's a fairer basis of comparison here: we could say that if something really is cheap enough that you can run it 40 times and take self-consistency, then maybe that should be its score. But I think one of the bigger reasons why this was, in hindsight, maybe a bit of a facile tweet that I made is that the exchange rate, if you will, isn't very good. A rule of thumb I saw in a paper once is that if you do self-consistency on 40 samples of GPT-3.5 Turbo, it's on par with one sample from GPT-4. So you move up one generation every time you do 40 inferences, right? But at the same time, in specific domains there are refinements of this that work quite well. So we at Scale actually put out a paper recently on a method called PlanSearch. The gist of it is that you can improve performance on programming problems by generating diverse attempts at solving the problem. The approach PlanSearch takes is to first create high-level observations or ideas about how a problem might be solved, then combinatorially sample that list of ideas and take combinations of them to inspire strategies. And then for each strategy, you lay out a path of reasoning for how you could turn it into code, then you turn each one into code and assess which works best. And this lets you search over the variation in your strategies that actually matters, right? Because if you simply resample a model blindly over and over again on the same problem, there are a lot of ways an answer could vary that don't matter, like whether you use tabs or spaces, or how you name the variables. You don't want to search over that variation; you want to search over the part you think is going to be fruitful, the high-level strategies.
So I think that for particular domains, that is the more relevant comparison: what could you do if you were to apply a bit of search here?
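A rough sketch of the universal self-consistency idea described above; the `generate` helper is a stand-in for whatever model call you use, and the selection prompt is paraphrased rather than taken from the paper.

```python
# Sketch of universal self-consistency, assuming a `generate` helper that calls
# some chat model: sample several answers, then ask the model to pick the one
# most consistent with the consensus of all answers.
def generate(prompt: str) -> str:
    raise NotImplementedError("plug in your preferred LLM API call here")

def universal_self_consistency(question: str, n: int = 3) -> str:
    answers = [generate(question) for _ in range(n)]  # independent samples
    numbered = "\n\n".join(f"Answer {i + 1}:\n{a}" for i, a in enumerate(answers))
    selection_prompt = (
        f"Question: {question}\n\n"
        f"Here are {n} candidate answers:\n\n{numbered}\n\n"
        "Which candidate is most consistent with the consensus of the answers? "
        "Reply with the full text of that answer."
    )
    return generate(selection_prompt)
```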
Nathan L. [00:43:40]: Yeah, it almost seems like there will be different tiers of evaluation scoring, where basic prompting is kind of like linear time. It's almost like with the models: there's a biggest, best open model at any time, but Llama is dominating because it has the 405B, the 70B, and the 8B that are all really good, and it should have a 1B. And if you're writing a prompting paper, eventually you're probably going to have to have binned comparisons like that, which is like: we are comparing two basic prompting techniques, which I think will have less headroom by sticking to plain autoregressive behavior and things like this. And then maybe there are things like reflection, where we've added minor structure so that the model can now generate a bunch more tokens, but not 10X or 100X more. And then there are the things where we've added a whole new planning component to how we're prompting the models, and it's all abstracted away from the users. And you're not going to be able to compare those, because those are the things that are going to just solve all the benchmarks we have out of the box. I think that's fine. I think people will converge to this. It just always takes a bit longer than we want.
Riley G. [00:44:47]: Yeah, I think that's right. I am really excited about the o1 RL approach to this.
Riley G. [00:44:58]: On some level, all prompt engineering is approximating this RL-like search. We have a lot of prompt engineers out there trying different things. They see what works, and they tell their friends, hey, this works. But the space of things that works is probably, well, demonstrably at this point, given o1, outside of what a human might think of. There are things we see, even in the summarized reasoning traces that o1 puts out, that are eerily anthropomorphic. It will say things like, hmm, or, let me think about that. Yeah, I feel like they added that in.
Nathan L. [00:45:42]: I think it's almost like a trigger for the model to have a more reflective response. Those are the examples they used, but it's cool.
Riley G. [00:45:49]: I mean, it's not hard to imagine that RL could find something like that, right? Just that, empirically, it works to say hmm, because in the pre-trained model's manifold of plausible text, saying hmm suggests that you're about to do something else. Saying hmm might just be empirically a good thing to say, and it could find that. So I think that's the kind of exploration you're benefiting from with o1: the space of prompts that work that we're not really equipped to find. Yeah, do you have anything?
Nathan L. [00:46:28]: I think this is a good discussion. Kind of to wrap up the academic side of things: for papers that are nominally about RLHF training, or any sort of post-training as the contribution, do they need to do anything with prompting? Is there a clear segmentation there? Or is it that if you're doing this fine-tuning, you're necessarily changing how the model is going to respond to prompting, so we should do some checks there?
Riley G. [00:46:55]: That's one view of it.
Nathan L. [00:46:56]: Or the other view is you have a model and prompting is just a way to take it one step further, which I think is how Anthropic sees it; they did this recent podcast with Amanda and their prompt engineer, whose name I don't know.
Riley G. [00:47:07]: And that's how they do it.
Nathan L. [00:47:08]: Amanda's like, I can do things with these models that most people cannot. And that kind of leads the way, rather than prompting being really part of this post-training stack that everyone needs to be checking the box on. I don't know where we fall. I guess there's IFEval, which we could come to after that, which is kind of a separate case.
Riley G. [00:47:29]: Yeah, I definitely lean a bit more towards the Anthropic view of the world. I guess you could argue that's maybe somewhat self-serving, no big news there: prompt engineers are important. But I think it's true that we do see people who are just good at this. Our ability to prompt these models sometimes exceeds our ability to explain how we're doing it and what the general strategies to apply are. And I think those strategies are worth extracting.
Riley G. [00:48:09]: It's worth introspecting.
Riley G. [00:48:12]: One thing I think about a lot is anytime somebody... I really love when people suggest a prompt or suggest doing something to a model that I can tell immediately will not work. And it's a terrible idea, but it wasn't obvious to them. And that's fascinating, right? Do you have an example?
Nathan L. [00:48:29]: I would love to know if you have something that everyone tells you, but it's a generation behind or something.
Riley G. [00:48:35]: A lot of, I'd say, strategy ideation in fields that are new and competitive. If you wanted an LLM to give you ideas for what's a good LLM startup to try right now, it's probably not going to tell you anything useful. Things like that, where people are still figuring it out and there's money to be made in knowing how to do this better than the average person, you're going to get mediocre advice. But that's not true for everything. If you ask it about physics, you're going to get above-average advice.
Riley G. [00:49:16]: But I think that people who have acclimated to models forget what it's like to be new to models, right?
Riley G. [00:49:25]: And I think that explains a lot of people in industry being annoyed by how many R's are there in strawberry. Because they're so- That's the tokenizer.
Nathan L. [00:49:33]: We ignore the tokenizer whenever we can.
Riley G. [00:49:35]: Yeah, and you see this explicitly. A lot of people get really enraged, like, you idiots, why would you ever think this would work? Why did you ever think that you could ask it whether 9.11 is greater than 9.9 and it would give you a right answer? And so on. They have a point. That was the attitude for a long time. But I think the social context of these models is changing, and it's becoming more reasonable to expect them to work well on these queries. There are practical consequences of these models being in the hands of people who don't know about these issues. And it's now suddenly more important to fix them. Yeah. So let's spin on this.
Nathan L. [00:50:12]: Is Google searching going to become more like prompting, or is prompting going to become more like Google searching? Where, with a good language model, can I just type in "that physics equation with the cross product that governs electromagnetism"? Is that the direction the models are going? Or is everyone going to actually become more conversational because AI is the default?
Riley G. [00:50:37]: Yeah, I think, I mean, Google searches maybe, yeah, there's some similarities there. I think Google probably has gotten simpler.
Riley G. [00:50:48]: It's been a while since I've used most of the advanced search filters in Google. I remember a point when it was extremely routine. Yeah, the plus operators, the quote-quote. And I think that speaks to the fact that the results used to be worse, right? And we thought we were happy with them because we didn't have alternatives. We just accepted that, oh yeah, there are going to be false positives in here, so we now have to put in some negatives to cancel them out. And that skill, I'd say, hasn't really become more important over time. It's occasionally useful still, but it's less essential than it once was. And that mimics a lot of what we see in prompt engineering: you don't have to understand tokenization, which I think is probably the biggest one. ChatML was no small part of why ChatGPT was such a big improvement to prompt engineering. It wasn't just the tuning. It was the fact that they came up with this more restricted system of interacting with a model that alleviates the need to know anything about tokenization. And that, I think, is kind of an underappreciated change. Yeah, I agree.
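For reference, a sketch of the ChatML-style structure being referenced, where each turn is wrapped in special tokens so the user never deals with raw-text boundaries or tokenization directly; this is simplified and not any particular model's exact template.

```python
# Simplified sketch of ChatML-style formatting: the role structure and special
# tokens are handled for you, so prompting happens at the level of messages
# rather than raw token boundaries.
chatml_example = (
    "<|im_start|>system\n"
    "You are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "How many r's are in strawberry?<|im_end|>\n"
    "<|im_start|>assistant\n"
)
print(chatml_example)
```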
Nathan L. [00:51:54]: I do think in the long term, prompting will go in the direction of Google searching. But in some ways, I'm not that surprised that something like o1 can exist, yet it's still a very humbling moment, because there will be many times when AIs are released that we don't know how to use. And this is the skill that you need to have: tinkering with an open mind. The open mind that things will come, and the open mind that things are not just what they are at face value. And if you play with o1 a lot, you can definitely get things out of it that people on Twitter are not repeating over and over again.
Riley G. [00:52:31]: Oh, yeah, definitely.
Riley G. [00:52:35]: A lot of the explanation for the disconnect that you see, where some people are just absolutely amazed by o1 but most of the things you see on Twitter maybe aren't that impressive, is that the frontier of problems that distinguish o1 from, say, the previous class of frontier models is either unrealistic problems, brain teasers that people artificially constructed to exhibit the difference, or something realistic that you would never want to read in a tweet. The problems where it excels are like: I have this extremely in-the-weeds programming problem that involves a complicated interaction of all five of these files, please fix my import errors, or whatever.
Riley G. [00:53:25]: Those are the things that you're going to see the most practical benefit from. And those just aren't easy to communicate in a way that they used to be. It used to be easy to make a screenshot of, hey, look, it does this. It will fix your broken JSON or whatever.
Nathan L. [00:53:45]: Something else that I'm realizing I didn't put in the notes: there have been these comments on o1 from the OpenAI people that they want to expose to the user the ability to change how long the model thinks. Changing its test-time compute is ultimately going to be a whole other prompting thing. It's almost a little surprising that they are giving that to the user. I almost think they should just make a classifier that does it for them, rather than assume the user is dumb. But being able to change how hard your model thinks is a really interesting real-world prompting case. Because it doesn't really matter if you can get a viral example. It's more like, how do you vary that knob in your day-to-day use in a way that meaningfully shifts your end product?
Riley G. [00:54:26]: Yeah, it's really kind of comical trying to manipulate how long it thinks about things. Because there are some things that will make it think for a long time. I tried to get it to generate acrostic word squares once, and if you emphasize enough the need to validate things, it will just keep validating and failing and looping around; I think I got up to three minutes once of attempted solutions before it finally said, oh, I wasn't able to find one, here's my best effort. But other times, if you ask it... I mean, I once gave it a simple problem, kind of just for the comedy of it, and then I gave it literally, I think, three pages of emphasis on thinking forever. Just rambling paragraphs saying, if you're even considering stopping, don't. If you ever have the dream, if you ever get tired, don't worry about it.
Nathan L. [00:55:22]: Just keep going.
Riley G. [00:55:24]: All those kinds of holy hand grenade style repetition. And after all this, it literally just thought for three seconds and then came back and said, I understand the urgency that you're saying here. Thinking forever just isn't possible. So I'm not even going to try. There's another thing.
Nathan L. [00:55:43]: OpenAI said they might give you a knob that controls this or influences it.
Riley G. [00:55:47]: Yeah, I have to be honest, it feels like maybe weird UI. It seems like something that you should be able to just do through text. But I'd be happy to play with it. Because steerability in general with o1 seems to be... a lot of people, I think, are reporting that it's kind of awkward, or at least at odds with the really impressive examples that we're seeing come out of it. Yeah.
Nathan L. [00:56:16]: There's a whole strategy discussion on why they actually released it that I haven't really entered into. We can kind of avoid this. I am wondering how you view prompting of agents. This is kind of the future section: what is the future? How are agents going to be susceptible to prompting? I'm guessing after our conversation here, it's going to be: it's the same. And there's probably going to be a meaningful shift in who can deploy them and have success based on who actually has this expertise and is doing this prompting work. And this could translate into downstream business success, where the first person to crack an agent with the right model and the right prompt can have the first product that works.
Riley G. [00:56:57]: Yeah, I think people mean very different things when they talk about agents. The big division that matters is that there are agents working in self-contained, repeatable environments, like a REPL sandbox, and then there are agents making changes in the real world: making retail purchases, canceling your subscriptions, and so on. I'm very optimistic about the former. I'm very skeptical of the latter. I think people underestimate how much reliability is needed for a lot of these decisions before you get to the point that you'd trust the thing to have the power to cancel your Hulu subscription or whatever. I also think that in the first case there's a lot of untapped potential, and I don't understand why we aren't seeing more iteration on that front, really. ChatGPT's code interpreter, when it came out, I think they renamed it to Advanced Data Analysis or something like that, which is not a good change in my mind. But the code interpreter, I love that. I still love it. It's a brilliant product, and I wish they kept going with it and improving on it. I'm also a fan of Julius AI, which goes exactly in that direction of creating a code interpreter-like environment where you can substitute in whichever model you want, and you can do things like install packages. It's great for one-off scripts. I had a post once where I was pointing out oddities in the longest GPT-4 tokens. One of them is like a double slash and then 128 repetitions of an equal sign, or something like that.
Riley G. [00:58:49]: But the way I did this was literally just: I went to Julius and said, install tiktoken and show me the longest tokens. And I read the code pretty carefully because I was going to tweet it; I didn't want to tweet out something wrong. But it was right. There were small things I had to fix, but it's good for prototyping, these quick one-off things where you're just like, I roughly know how to use tiktoken, I just didn't feel like figuring out the syntax again.
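A rough reconstruction of the kind of one-off script described here, using tiktoken to list the longest tokens in a GPT-4-era vocabulary; this is illustrative, not the exact code generated in that session.

```python
# Illustrative one-off script: list the longest tokens in the cl100k_base
# vocabulary (used by GPT-4-era models) with tiktoken.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
tokens = []
for token_id in range(enc.n_vocab):
    try:
        tokens.append((token_id, enc.decode_single_token_bytes(token_id)))
    except KeyError:
        continue  # some ids in the range are unused or special

longest = sorted(tokens, key=lambda t: len(t[1]), reverse=True)[:10]
for token_id, raw in longest:
    print(token_id, len(raw), raw)
```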
Riley G. [00:59:17]: It's good for just the curiosities and one-off stuff like that. And I think that's what the future of this really is. This really blew me away.
Riley G. [00:59:30]: Somebody posted on Twitter a video of their eight-year-old daughter using Cursor, I think it was, and this girl apparently has no understanding of the code that's being generated, but she's able to say, no, I want to do this differently, I want to have a Harry Potter spell here, changing the layout of this HTML and JavaScript app. And it just works. And that's the future to me, that the hottest programming language is English. When you see a little kid doing it, you really believe that now kids can have the power to create software. And that's great, because we were at a weird local minimum of that, I'd say, of kids being able to have the creativity to create their own interfaces or make their computer do what they want. Computers are less customizable now than they once were. Yeah.
Nathan L. [01:00:28]: My reflection on this is the people who take prompting seriously are more likely to be in tune with what is happening in AI and at the cutting edge. But that also means that on the academic side and the public side for transparency and accountability, you have to do some education work to make sure people are taking it seriously and or some normalization of claims, kind of depending on how people are presenting their work and using things. I think it's safe to say that all the frontier model labs are doing this, but kind of the long tail, it takes people time to learn these habits. But it's surprisingly hard to convince people to spend time playing with models too. Like I do it, but I should probably do it more, listening to people like you. I just, it's funny. It's one of those things that doesn't make sense how it'll pay off, but it probably will.
Riley G. [01:01:20]: Yeah. I mean, there's no substitute for using models. I personally discover just the dumbest things sometimes that make the biggest difference. One of the highest-impact ChatGPT tricks that I found lately is that I have custom instructions in my ChatGPT telling it how to think silently. I have a tweet about this that I posted once, so if you Google "ChatGPT think silently Goodside," you'll probably find it. But I have the prompt here, actually. I was using its new memory feature, so it can remember things that you tell it, so I was sort of showing that off at the same time. But I said to it: remember this, when I ask you to think or write silently, I mean for you to use your Python interpreter to write your thoughts as code comments or string literals assigned to variables. The code doesn't necessarily have to display any output. And it remembers that. So then I can say to it: silently write a brief essay about Super Smash Brothers, then silently translate this essay into French, and display only a double histogram showing the frequency of word lengths for both texts. And it doesn't output anything until it has that histogram done, and then it outputs the histogram and says, here it is.
Riley G. [01:02:32]: And that makes such a big usability difference. If you just don't have to see what it's doing, if you can put it behind a fold where you can expand it if you need to, to be really sure the code is right or to copy it to another editor or whatever, just not seeing it makes such a big difference. And you can have things in code too. You end up in this sort of Jupyter-like flow where you told it to silently do something, and now, because you said to do that, it's not just in context, it's in a variable. If it ever needs to do something in code, it just has that variable there, and it doesn't have to repeat it, which is a big deal if it's, say, an essay. Repeating an essay is expensive. Yeah. This is great.
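A purely illustrative sketch of the kind of "silent" scratch-work that custom instruction asks for: drafts live in string variables and comments inside the Python tool, and the only visible output is the final histogram. The essay text below is invented and truncated.

```python
# Illustrative only: what the model's hidden Python scratch-work might look
# like under the "think silently" instruction. Drafts are kept in variables,
# and nothing is displayed until the final chart.
import matplotlib.pyplot as plt

essay_en = "Super Smash Bros. is a crossover fighting game series ..."  # silent draft (truncated)
essay_fr = "Super Smash Bros. est une serie de jeux de combat ..."      # silent translation (truncated)

lengths_en = [len(word) for word in essay_en.split()]
lengths_fr = [len(word) for word in essay_fr.split()]

plt.hist([lengths_en, lengths_fr], label=["English", "French"])
plt.xlabel("Word length")
plt.ylabel("Frequency")
plt.legend()
plt.show()  # the only output the user sees
```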
Nathan L. [01:03:19]: Thanks so much for coming on. Anything else you want to plug or talk about?
Riley G. [01:03:25]: I should have some content going live around the time this comes out, analyzing o1 for the Scale blog and talking a bit more about our coding leaderboard. So definitely look out for that. The other thing I should of course mention is Humanity's Last Exam. We recently partnered on an effort to solicit examples of challenging problems from the public, and we are giving out cash prizes. So definitely check that out if you're interested.
Nathan L. [01:03:58]: Yeah, I had just tweeted a few days ago. I don't know if I put it on Twitter, but I put it on some platform; I don't have Twitter at work, so I end up looking at lame platforms I'm less addicted to. But essentially, my whole take was that evaluation is going to be extremely expensive, and it's going to be very narrow and very hard. And then you put out $500,000 in prizes, and the initial whiplash is like, oh, that's a lot. But in reality, I think that's the right ballpark. Because if you're going to make a good eval, you need somebody who's really good at cutting-edge AI working on it for at least six months, and that's a ballpark price: $500,000 is about half a year of what it costs to have somebody like that in AI, with overhead and compute and everything. So obviously it costs more to actually build this evaluation. These numbers look ridiculous, but if we want evaluations that are meaningful, this is what we need to do. And I think it's the right thing for Scale to do to lead on evaluation; it feeds naturally into their business. I think I've been on the record on this for a while.
Riley G. [01:05:00]: So I'm like, it's great. Yeah, absolutely. I think that people outside the industry at least have the impression that evals are grunt work, right? That this is something you would use low-cost labor for, that it's not a prestigious area. But it couldn't be further from the truth. I think evals are very rapidly moving towards the high end of intellectual ability, where we're looking for PhDs. I've done projects where it's like, okay, we have to get as many PhD-educated poets as we can to check the correctness of the iambs in this poem or whatever.
Riley G. [01:05:46]: I think that's only going to continue, right? We're going to see that at the low end, the value of human labor for training models is going to decline. And the value of high-end intellectual labor is going to increase probably drastically.
Nathan L. [01:06:04]: And cost is probably a good proxy for evaluation usefulness. LMSYS is expensive, but in different ways than the Scale leaderboard is expensive. And they complement each other very well, and they both become better by the other existing: the models are in similar places, but they're showing different things, and you can separate between those. And I suspect that that'll continue to grow. Some more will be at Scale, some more will be elsewhere. And that's just the new default for evals.
Riley G. [01:06:35]: Yeah, absolutely. I think one of the things I'm most proud of about working on our evals and leaderboard at Scale is that we're contributing to this healthy ecosystem of not having to just trust one or two players that evals have been done correctly. We want more openness and more independent verification of evals. And that's sort of our general theme with work like GSM1k, trying to make sure that we can actually trust what these leaderboards are saying.
Nathan L. [01:07:08]: Yeah, my one nitpick, which I don't know how to answer and probably needs more RLHF experts, and you might know this, is: are companies that buy data from Scale going to have an advantage on the Scale leaderboard because of the distribution of humans involved? Not that the humans doing eval creation and the humans doing data creation are the same, but they're drawing from the same pool of humans who are writing content or doing preferences and who are then doing the evals. I think it's too early to answer the question of whether the human distribution matters, and for that reason, I think the eval is still very much a net good. But it'd be really interesting to try to run those experiments on who is giving the data that you train on and how that then impacts the evaluation.
Riley G. [01:07:49]: Yeah, that's not something that I'm familiar with in enough detail to comment on our process there. But yeah, that makes sense to me. I think that's something.
Nathan L. [01:07:59]: It's something where people like to complain about every possible thing, and I understand the root of the complaint, but we've got to deal with the circumstances we're in as an AI industry. And the leaderboard is so much more useful than any problems it's causing. Let's keep doing it.
Riley G. [01:08:17]: Yep, absolutely. Okay.
Nathan L. [01:08:20]: I think we're at time. So I'm going to click stop here. Thanks again.
Riley G. [01:08:23]: Great. Thank you so much. Bye.
Sorry this one was late! Thanks for bearing with me, and keep sending feedback my way. Still a year or two away from when I have time to record these, but I would love to.
Open-source tools, examples, limits, and the state of training multimodal models.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/molmo-and-llama-3-vision
00:00 Llama 3.2 Vision and Molmo: Foundations for the multimodal open-source ecosystem
02:47 Llama vision: Multimodality for the masses of developers
03:27 Molmo: a (mostly) open-source equivalent to Llama vision
08:45 How adding vision changes capabilities and reasoning
11:47 Multimodal language models: Earlier on the exponential
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_013.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_021.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_023.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_027.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_030.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_037.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_046.png
Fig 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_048.png
Fig 10: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_050.png
Fig 11: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_052.png
Fig 12: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_054.png
Fig 13: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_058.png
Fig 14: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-and-molmo/img_065.png
What productionizing test-time compute shows us about the future of AI. Exploration has landed in language model training.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/reverse-engineering-openai-o1
00:00 Reverse engineering OpenAI's o1
01:52 From Q-star to Strawberry to o1
05:13 Training o1 with reinforcement learning
09:24 What is o1 doing when given a prompt?
11:49 Questions to consider to understand o1's structure
11:56 1. How does an RL-trained language model act?
12:38 2. Is it an online / test-time search?
14:20 3. Is it one model at inference?
15:29 Open-source o1, the future of o1, and the future of AI
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_014.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_016.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_018.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_020.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_024.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_026.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_034.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/o1/img_048.png
Scale AI's future versus further scaling of language model performance. How Nvidia may take all the margins from the data market, too.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/ai-data-foundry
00:00 Futures of the data foundry business model
02:57 What it is like to work with data vendors
06:06 Data foundries: Risks
08:18 Data foundries: Growth vectors
09:50 Realistic expectations
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_008.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_012.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/data-foundry/img_023.png
And why the concept of mandating "model specs" could be a good start.
(Oops, forgot to upload this yesterday!)
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/a-post-training-approach-to-ai-regulation
0:00 A post-training approach to AI regulation with Model Specs
1:45 Expanded roles of Model Specifications
3:40 Near future of Model Specifications
Whether or not scaling works, we should spend more on inference.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-strawberry-and-inference-scaling-laws
00:00 OpenAI's Strawberry, LM self-talk, inference scaling laws, and spending more on inference
01:51 OpenAI's Strawberry
04:16 Self-talk in language models
07:45 Inference scaling laws
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_006.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_021.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_023.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/strawberry/img_037.png
Ai2 released OLMoE, which is probably our "best" model yet relative to its peers, but not much has changed in the process.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/olmoe-and-building-better-llms
00:00 OLMoE and the hidden simplicity in training better foundation models
02:04 Frontier model team compute allocations
04:19 De-risking training complexity
06:40 On organizational complexity
09:05 Compounding improvements -- the key to building better language models
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_005.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_007.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_009.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_011.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_028.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_030.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmoe/img_032.png
The Open Source Initiative is working towards a definition.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/defining-open-source-ai
0:00 On the current definitions of open-source AI and the state of the data commons
3:17 Reasons to not mandate fully released data
4:24 Sufficient but not exhaustive data docs
5:22 Frustration with the data commons
7:04 We need more examples to define the definition
The latest model from one of the most popular fine-tuning labs makes us question how a model should be identified as a "frontier model."
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/nous-hermes-3
0:00 Nous Hermes 3 and exploiting underspecified evaluations
5:29 Parsing training lessons from Hermes 3
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_005.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_010.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_012.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_020.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_027.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_030.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_032.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/nous-hermes-3/img_036.png
I had the pleasure of talking with Ross Taylor, who has a great spectrum of unique experiences in the language modeling space — evaluation experience, Galactica lead author, Llama post-training, etc. This is a really great conversation on the frontier of language model (LM) reasoning, LM deployments and demos, LMs for science, RLHF, and other topics. I’ve been trying to get Ross to come on for a bit. He’s one of those people in the LM space who doesn’t speak too much, but when he does, you listen.
Ross Taylor was previously an LLM lead at Meta AI, heading up the reasoning team. Previously he led the early work on LLM agents, and was the research lead on the Galactica project. Before that, he was a co-founder of Papers with Code, which was acquired by Meta in 2019. Before that, he worked as a quant in sports betting and finance, and before that he was a policy advisor for the UK Government. He is currently working on a new startup.
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other Interconnects interviews, go here.
YouTube
Chapters
* [00:00:00] Introduction of Ross Taylor and his background
* [00:02:12] Papers with Code
* [00:09:58] Galactica, goals, controversy, legacy
* [00:18:12] Technical details of the Galactica model
* [00:23:18] Potential for language models to make scientific discoveries
* [00:25:21] Defining and improving reasoning in language models
* [00:32:38] Process-based reward models and their potential applications
* [00:35:00] Generating synthetic data for SFT
* [00:40:23] Evaluating the effectiveness of language models as judges for human preference data
* [00:42:43] Considerations for creating base models that are easy to fine-tune
* [00:46:45] Balancing SFT and RLHF
* [00:54:13] Characteristics of successful post-training teams
* [00:58:26] Future directions for language model development
We mention
* Rob Stojnic (co-founder of Papers with Code)
* Armen Aghajanyan (Chameleon)
* Tom Scialom on Latent Space
* Soumith Chintala (PyTorch)
* Process Reward Models / Let’s Verify Step by Step
Transcript
Built with smol-podcaster and with love of Latent Space.
Nathan Lambert [00:01:07]: Today, we're here with Ross. This is a really exciting one. I've been trying to get Ross on the show for a while. Ross has done a lot of interesting work. And also the path to where you ended up with working on state-of-the-art LLaMA work at Meta is very interesting to me. So we're going to start with some of that, but then there are a few people that want to know more about reasoning and some of the RLHF stuff. We won't cover the secretive new start-up - I don't know what it is, but that's how it goes these days. I'm sure it'll be great. So welcome to the show!
Ross Taylor [00:01:41]: Thanks for having me.
Nathan Lambert [00:01:44]: So I wanted to start with Papers with Code. For people that don't know, Papers with Code is one of these platforms - I never was a heavy user of it - but it collates papers, lets people upvote them, surfaces popular papers, and attaches code and datasets and evaluations to papers, which is great - it was sort of ahead of its time. It fits into a lot of these open ecosystem things. So I'm kind of curious how you ended up there and why you all started this startup that ended up building this thing that got acquired by Meta?
Ross Taylor [00:02:12]: Yeah, that was a weird one. This was like back in 2018. So I was at an incubator, I just quit my previous job and I was like, okay, I want to do a startup. And I met Rob, my co-founder, who came along with me for the journey. We both came from different backgrounds. I was from a sports betting / quant finance kind of background, which is a whole other episode I guess. And Rob was in various startups, like applying ML to things like hate speech detection, that kind of stuff. And the cool thing was, we both resonated on similar kinds of problems within the ML space, even though we came from different domains. So we spent a lot of time doing various experiments, trying to make new kinds of ML tooling, thinking of these stupid questions like “what is the Git equivalent for ML?” - that kind of stuff. One of those experiments was hacking around on this little website to solve a really basic problem: I'm trying to reproduce this paper, but I can't find the code. That was the thing that really blew up beyond our expectations. It was weird because we thought it was fairly trivial at first.
Nathan Lambert [00:03:16]: What year was this? 2018?
Ross Taylor [00:03:18]: Yeah.
Nathan Lambert [00:03:19]: This makes sense. I was starting Deep RL then, and Deep RL was so hot, which was probably the worst evaluation has ever been for ML. People complain about it today, but with Deep RL evaluation, every single person was just lying to make themselves look better.
Ross Taylor [00:03:38]: The interesting thing now is that the open ecosystem has shifted to focus more on weights as a central artifact rather than code. I think there's an interesting debate there. Would it be more useful to have the LLaMA-3 8B model weights or all the code for training LLaMA-3? I think there's still interesting debates to be had about what's actually useful.
Nathan Lambert [00:03:56]: I think the code would be more useful. Like OpenAI released their rules-based reward models, but it's like code washing because it's like just a bunch of people just released like eval code now. And it's like, that's a whole another tier is like actual training code versus eval code. But yeah, I guess I'll just skip ahead.
Ross Taylor [00:04:12]: So essentially Papers with Code was the thing that didn't die for us. We always thought we were going to make something else and Papers with Code was more of a marketing thing. But eventually we were like: okay, our users are telling us this is what we should be working on. And we expanded from that very simple use case of finding code towards indexing various artifacts in ML.
Another big problem was trying to find the state of the art in something like ImageNet and all these different benchmarks. There just wasn't a central place to find this information…So we had this quite good Christmas - me and Robert - where we hacked for the whole month, indexing every leaderboard we could and all the related papers. I didn't want to do any annotation again after that! But that took things to the next tier, and that's when things really started to blow up.
Nathan Lambert [00:05:03]: Because this is like the first round of leaderboards, because now it's really popular with Hugging Face again. And I was like, yeah, is that just because it became like a Meta thing and it's just kind of a thing that existed? You're like the first leaderboard company in a way, which I don't think many people think about. Yeah, which is weird.
Ross Taylor [00:05:19]: Yeah. And the interesting thing about us was that we never had to do any marketing because everything was from organic traffic. So you would type in “state of the art ImageNet” and we would come to the top as the most useful site. That was really the source of our growth, and we grew to a million MAU fairly quickly. And as for Meta, we were in touch with the PyTorch folks at the time who we really liked. You know - Soumith, Joe - those folks, and they had a shared interest in promoting the open source ecosystem back in 2018/19. And while it was like a tough decision, we were just like “we really like working with these people, we want to work more closely with them”, and that got us into Meta.
And then within Meta, we originally continued to develop the platform. But the big shift for us was that, even then, we saw we were moving to a world where compute was the currency. And we saw that, if we wanted to be well positioned in five years time, we needed to be building these large-scale systems. Even for our own platform, we had lots of ML in the backend and we saw we were using fewer and fewer models to do more and more tasks. So that kind of shifted us into research, into Galactica, and then eventually LLaMA and that kind of stuff.
It was a weird shift because we were product people who ended up doing hardcore research! But I guess it was natural to us that we were within a research org with these amazing people, lots of resources. It was just the best use of our time to conduct this shift.
Nathan Lambert [00:06:43]: Do you think there should have been more integration between Hugging Face and Papers with Code? It would have been wonderful if it had happened.
Ross Taylor [00:06:54]: The backstory is that we saw them as competitors, to be honest, because we had the same vision originally. We were going to do model hosting, that kind of stuff. But we never got into it because we hit friction with leadership - who was not onboard with that as a goal. Because from their point of view, it's like, okay, if we host these things, this might expose Facebook to some kind of legal risk. It wasn't in the perceived interest of the company.
Nathan Lambert [00:07:17]: This is a classic story of tech, really. They can't take the risk. They can't expose themselves.
Ross Taylor [00:07:23]: If you're a startup and it's your number one priority, then yeah, your attitude on risk is different. But I think it was a blessing in disguise for us because clearly the bigger wave was going to be large language models - we saw that incredibly early. And our mission was fundamentally not infrastructure, but something closer to: how do you organize information? It was a Google-y type of mission. And while we were focused on ML, we were more broadly thinking about science: how do we reduce friction for finding out about new advances and, I guess, lots of small tasks that when added up lead to a lot of progress in science.
Nathan Lambert [00:07:59]: I should have probably looked this up. Did you have another scientific background? Did you have a hard science background or what about Rob? Stojnic?
Ross Taylor [00:08:10]: Yeah, [Robert] Stojnic, my co-founder, he was from a bio background. So he's actually-
Nathan Lambert [00:08:15]: That makes sense.
Ross Taylor [00:08:16]: Well, he also had a computer science background. He was one of the original developers of Wikipedia, so he has his own crazy story…
Nathan Lambert [00:08:22]: Yesterday I was talking to somebody that was one of the original arXiv moderators. So we're digging all these things up…
Ross Taylor [00:08:29]: It is interesting because we both had this background, I would say, in building useful “utilities” [on the internet] at some point in our lives. I think Papers with Code is one of those things which is easy to forget, but if it went away, everyone would go crazy.
As for me, my background is more statistics and econometrics. My first job was in the Government, which I kind of hated. But I did a Master's degree, which I thought was going to be in economics, but the thing I ended up loving was time series and statistics. So I did all this research on state space models - before it was cool, I guess! - and then that got me into sports betting. And then eventually, we were using more and more deep learning [in the 2010s], and that’s how I got into AI. So a fairly nonlinear path. But -
Nathan Lambert [00:09:09]: Yeah. Well back to what you were saying on the scientific stuff, I think the Galactica story has many angles, and you led on this.
I think if people go look at the paper, it's a very interesting paper, like you cite Galileo in the first sentence, and it really has a lot of early modern language model features and quirks. It's something that people don't remember that well.
I'm very on the record saying the backlash was overblown. I think that was before there were clear habits and community norms around what language model demos should look like. So it was kind of in that teething phase.
But what was the actual goal that you wanted? You mentioned organizing the world's information. What was the goal and how close do you think the model came to accomplishing it?
Ross Taylor [00:09:58]: So there were several different things at once.
There were immediate product integrations we had in mind. We actually had an agreement at the time with Overleaf to be a “co-pilot for writing papers”. We'd have a really good LaTeX model in Overleaf, and whenever you wanted to include a citation, you could simply prompt for one.
More broadly, we imagined the future would be instead of..using more classical ways to find and extract information, if you wanted to learn about something like DPO, you would just prompt a language model to find out about it. Or if you wanted to ask “What's the state-of-the-art on SWE-Bench?” or something like that, you would just prompt the model and it would find the relevant information and answer the question.
Nathan Lambert [00:10:46]: So this is something that language models are so bad at. One of my challenge questions - I've been doing this for 6-12 months - is to ask models about DPO, and none of the models without internet access have yet done it right. You would think that it would start to kick in. And I don't just ask “what is DPO?”, I ask “What is DPO for language model fine tuning”, and they still just make up nonsense.
Ross Taylor [00:11:06]: Yeah, which actually relates to an interesting debate about LLM creativity. If you want to solve something like LLM creativity, you want to be confident about the frontier of knowledge, but frontier knowledge is where you have the most token scarcity.
But anyway, just to finish that thought. Bear in mind, we were developing Galactica while the whole Web 3.0 boom was happening. And we were in this weird state where we were like “All everyone is talking about is Web 3.0, but clearly generative AI is going to be the thing that powers the next generation of the web!”. So I guess that was our primary motivation.
Now, in terms of the [Galactica] launch, I think there's two aspects.
First, like you said, the paper. Now we were a small team of 7-8 people. We had so much fun developing these new ideas at the time: internal reasoning tokens, how do language models cite, training for multiple epochs…
Nathan Lambert [00:12:00]: What's that? A citation token? Did you have a special token for citations?
Ross Taylor [00:12:04]: Yeah. So we had a start citation token [START_REF], and we used two methods. The first was: we'd put the title of the paper within the citation tags. And the other one was: we'd have an alphanumeric ID.
The interesting thing was, it actually worked really well - but in the demo interface, it had a tendency to hallucinate - or “hallucitate”. The backstory is that, while the model was really good, for the demo we turned up the temperature to 0.7 so the text generation was better [at the expense of citation accuracy]. So generative citations were something that people thought didn’t work, but it was [more an implementation issue]. I guess that’s an alternative road in history…
So there was the paper, which was cool, and there was the demo, which I would say was motivated by the realities of the time. This was pre-ChatGPT and, even within a big company like Meta, it wasn’t a company priority to work on LLMs at all. So in our mind, our objective was - we were kind of deluded - being a team of 7-8 people, we were like…
Nathan Lambert [00:13:08]: This is how you have to operate if you want to be at the cutting edge. That's how great teams operate.
Ross Taylor [00:13:13]: So there were two objectives you could have had. The first is: you think that second-mover advantage is good. So you could wait for OpenAI to do something and then come in after and do it in an open way. And this is the path that actually worked for LLaMA. LLaMA was not state-of-the-art in any sense.
Nathan Lambert [00:13:27]: I've been doing this. I mean six months ago, maybe OpenAI and Google wouldn’t need to hire me because they know everything. But now I’m doing more interesting analysis where I'd be hired at a different role - but in the open. Now I'm like the person people look at. But I’m trying to tell people that “You don't understand! I'm six months behind everyone!”.
Ross Taylor [00:13:49]: Right, but to be clear, that’s a really important role - because everyone should have a stake in the future. And that's what the open ecosystem gives people.
But our objective was this: we didn't want to be second; we wanted to be first. And we were kind of deluded because we were 8 people - compared to maybe OpenAI with 200 people where their whole bread and butter was language models. But that’s why we were thinking “how do we move as fast as possible?”. And in our mind, a demo might be premature, but it would also be a way to get lots of prompts and information quickly - to understand how people would be using the model. And essentially the calculus we took was, we knew the community might not be ready for something like this - especially with the Meta branding - but we thought this was a way to get lots of information really fast and catch up given our position. Now in retrospect, history says that…
Nathan Lambert [00:14:33]: You kind of did that. I think Meta probably got the injection of language model reality from that. It's kind of like the Gemini backlash. I think the Gemini backlash - while it's obviously stupid execution - was potentially a good forcing function for Google's structure of their Gemini org - to really move everything into the way it is now. That made them be structured more like a serious language modeling org and less like Google, I think, which people don't want to hear...
Ross Taylor [00:15:07]: For us it was just a risk we decided to take. We probably took a lot more risk than we should have done. But we just thought “obviously this is going to be huge”, “LLMs are going to power the next internet”, etc, so let's take a risk. And you know, if we ran the universe several times over - it would have succeeded in some of those runs. But [in our universe], the criticism, which was obviously overblown, reached a critical point where things didn’t work out.
And then there's the story about the demo coming down, which - I’m not sure I’m able to talk about - but I think that is one of the things where, if people knew the true reasons, they'd be like “what the f**k!?”. But yeah, that's what happened…
Nathan Lambert [00:15:44]: Yeah, this is why any company that makes a demo now has block lists, where there's certain words that if they're in the prompt of the generation, you get a really, really stupid response. Even if it's like an open model, you just put like a little filter that's like, “you can't say the most obviously bad words”.
Ross Taylor [00:16:01]: But we actually did that and that created backlash as well. Because if you have false positives, you actually exclude some words which aren't actually offensive [in certain contexts], right? And then you also offend people… so it's not a win-win situation.
But if I have to look back at it now, I think with any new technology, it's never going to be absolutely better than what came before it. With LLMs, the relative comparison is with search. If you’re going towards search and information retrieval, you're prioritizing factuality as opposed to creativity, right? And the fundamental tradeoff with LLMs is saying, “I can trade off some amount of like factuality or ‘closeness’ to the corpus for some amount of synthesis and creativity”.
I don’t think that if we had a better model, it would have helped things at all. You could say maybe if [Galactica] had RLHF, would that have helped? I'm not too sure given that the project came out of [a big company like] Meta. Meta has a really good reputation now - people appreciate the open work they're doing - but at the time, things like the 2016 election were still in people’s minds. So I think the LLM revolution was never going to start at a big tech company, in my opinion. It was always going to happen at a company that had less reputational baggage. But I think it's pretty cool now that people see things differently. Because FAIR always had a really strong commitment to open science. It’s good that they're finally getting the credit for that.
Nathan Lambert [00:17:38]: Yeah. I have two technical questions on Galactica that I find really interesting. One is from Luca Soldaini at AI2. He said that you mentioned that the Galactica log probabilities (when producing citations) were proportional to how far in the citation graph the current paper was to the cited paper. Do you have any more interesting comments on how the latent space of Galactica actually worked? Because that is cracking the most important question of a language model for science - building a better latent representation of how the scientific information is organized.
Ross Taylor [00:18:12]: Yeah. So there were a couple of aspects to that. The first thing is we had this really nice graph that showed, as we scaled the model, the distribution of citations became closer and closer to actual citations - which is what you'd expect. But this was important for us, as our main worry was - because we were thinking about deploying to Overleaf - we didn't want to prioritize the most cited documents and create a “rich get richer” dynamic.
Nathan Lambert [00:18:38]: Google Scholar already does that. Were you re-indexing all the papers rather than building off like the Scholar graph or something?
Ross Taylor [00:18:45]: I think we were building off existing ones, using things like CrossRef…but there were lots of gaps that we had to fill. The other weird thing was that we saw some strange biases in the model. So if the model didn’t know what to cite, it would sometimes cite a general review paper, which is really weird emergent behavior. It was like the model was saying “I don't know a specific example, so I'll just give you a general overview”.
Nathan Lambert [00:19:11]: It's probably in the data.
Ross Taylor [00:19:12]: I think the thing that surprised me the most was multimodality. So we trained the model on SMILES formulae and protein sequences [alongside natural language]. And the thing that really surprised me was, we had tasks which we didn't explicitly optimize for - like converting a SMILES formula to an IUPAC name for a chemical. And if you actually looked at the attention as the model was predicting the next token, it would say something like “amino” and you could see in the chemical graph, it was explicitly attending to the relevant part of the sequence.
I found that amazing because we didn't train for it explicitly. That's the beauty of self-supervised learning. But I also found it highly ironic because some of the criticism of Galactica was “it’s ungrounded”. I was like “how grounded is this? The natural language tokens are literally attending to the underlying chemical structure!”. So that was kind of cool.
And then the other cool thing was: if you prompted with a protein sequence and asked “what is the function of this protein?”, the model was really good at answering those questions in natural language. That was awesome for me.
Nathan Lambert [00:20:33]: There's another prompting thing that I had known of [for Galactica], which was asking the model to do open-ended generation tasks. The models are still out there - people can spin them up and do demos on their own - but if you asked it something that people think of for ChatGPT - e.g. write me a poem about a sad goldfish - it wouldn't work unless you put it in a header format. It was markdown, I think? If you prompted it in that format, it would actually do a great job.
Ross Taylor [00:20:57]: Yes, so in the Galactica demo, a lot of people were being malicious with this type of prompting for markdown articles. But I did enjoy some of the creative ones. Someone was like: write me a theorem on finding a girlfriend, and it was some of the most hilarious model output I’ve ever seen. And people also generated some amazing sci-fi…but then I think some people took it too far. But whatever. I guess it was a traumatizing experience for me at the time. But with the benefit of hindsight, it was also fun in some sense, I guess.
Nathan Lambert [00:21:30]: Yeah. It makes you understand the bigger context of the work much faster than you would otherwise.
Ross Taylor [00:21:37]: It was actually crazy at the time. So many people were using it. Even then we could see that - while it wasn’t a product - we could see that most systems were going to be designed in a similar way.
I think the interesting thing was how the winning form factor in the end was like a chat interface - you know, with ChatGPT being the winning UX. I think that was actually a big part of the story [why they succeeded]. There's a debate on whether RLHF is actually a capability advance or whether it’s just alignment…but a big part of the story [for ChatGPT’s success], in my view, was the kind of UX of how you interface with a language model, rather than the actual capabilities. But I think it's obviously not monocausal at the same time. There were several factors at play.
Nathan Lambert [00:22:25]: Yeah. So the last thing on this is that you mentioned in our e-mails about language models, creativity and making discoveries. What do you mean by that? Is that the agent-like projects you worked on at Meta?
Agents are largely something that I don't have too much comment on. I'm taking the approach of waiting to see what we actually get, because there are a lot of practical approaches that I think will be reasonable. People use language models for basic formatting, for code, etc. But it's easy to see that if they have a little bit more feedback for things like writing a paper - e.g. find me a citation for blank and justify your answer - that step is something that I think will come. I don't know how expensive it will be to run, but is that what you mean when you think about making discoveries? Is it more autonomous? Is it a grander vision? Anything like that?
Ross Taylor [00:23:18]: I think it's more like this: the killer use case right now is information synthesis. For example, I use Claude a lot more than Google now because it combines information in a better way and sometimes generalizes well to things it hasn’t seen before.
But a really cool thing would be: can a language model answer a question which is more out of distribution? That we don't see in the training data?
So an experiment I've never done, because I didn't have the compute, would be this. Imagine if you could train a language model on all documents up to 1905, which is the year when Einstein had his miraculous year of four seminal papers. With that model, which is trained up to 1905, could you prompt the model to come up with a good explanation of the photoelectric effect, special relativity, this kind of stuff? And what would it take to rediscover these things?
Because presumably, with all these major discoveries, it’s never out of the blue. You’re standing on the shoulders of giants, but there’s still a lot of thought and inspiration you have to do to get to those great ideas. So that's the setup. But the creativity problem is, by its very nature, hard to benchmark.
Maybe this is a digression, but my problem with the field right now is: we’re in a situation where we've almost solved a benchmark like MATH, which is a very hard benchmark, in my opinion, at least Level 5 MATH, but I don't think we've really cracked something like reasoning. So I think it's like a whole different question about how you even evaluate these frontier tasks. But yeah, hopefully that gives a flavor of the kind of questions here…
Nathan Lambert [00:24:58]: Yeah, we can go into the reasoning conversation. I think reasoning in RLHF will take up however much time we want to keep talking. I guess we can start with the basics. What do you think people that are using language models think reasoning means? And what is the way that you would interpret what you're trying to do in improving the reasoning capability of a language model?
Ross Taylor [00:25:21]: So there's a lot of controversy on this on Twitter/X. And I think people are talking past each other because sometimes people mean different things by reasoning. At a very granular level, is legal reasoning fundamentally the same thing as mathematical reasoning? Common sense reasoning? I guess my very basic definition is that reasoning is the process of drawing conclusions based on a body of observations, or in the case of deductive reasoning, basic premises.
Nathan Lambert [00:25:50]: So math is like a subset of what you think about.
Ross Taylor [00:25:53]: Yeah. And then I guess the bigger circle is the broader topic of outcome directed behavior. I have an idea of an outcome I want to achieve, but what's the best path to get there?
And then in the LLM space, I think this problem broadly equates to the technical problem of how you use compute to get from your question to your answer. In the old days, you would just prompt the language model directly. You would just put in a GSM8k question, put in “Answer:” and then parse A, B, C, D. So you're relying on the forward pass.
Nathan Lambert [00:26:27]: Yeah, like the FLAN data is really weird. That's a popular one that people used to train on this stuff.
Ross Taylor [00:26:33]: Yeah. And then came chain-of-thought, scratchpads, with Galactica…all these ideas of using the context window to do intermediate computation. And the more recent, although to be honest, it's actually quite an old idea, is: you have chain-of-thought, but how do you better learn the internal reasoning tokens that get you to your answer? So things like, you know, Quiet-STaR and variants of this idea.
Nathan Lambert [00:27:01]: Claude now shows you when it’s thinking, and in the Claude system prompt, it has information on how many tokens to take to think about a question. We're all thinking about trying this stuff and it's all so hard.
Ross Taylor [00:27:11]: I think it's a question of how do you learn those tokens? For us, the original thing we did was just supervised learning. So we trained on some examples and let the model generalize to know that it should do the thinking in between some tokens. There are more sophisticated ways you could achieve this nowadays.
Another point is this: there’s an analogy that’s often used about language models, that they are “thinking out loud”. I actually don’t like this analogy at all. I think “thinking out loud” makes you think there’s something wrong about this kind of thinking in token space. But it’s not clear to me that the alternative - or these old adaptive computation ideas - are any better, actually.
Nathan Lambert [00:27:58]: What do you mean by adaptive computation? Because I mostly think of “thinking out loud” as being like chain-of-thought or generating its own explanation before it gets to an answer. What would adaptive computation be?
Ross Taylor [00:28:09]: So there's a paper by Alex Graves, who wrote all these amazing papers ~10 years ago, which had a lot of foresight. He did stuff like the Neural Turing Machine paper. Adaptive computation is the idea of, instead of having fixed compute between your input and your output, you can extend the forward pass to do things better, like arithmetic, where you have to maintain/manipulate state.
When chain-of-thought came out, there was an impression that it was a bit of a hack, because you're thinking in token space whereas you should be finding a way to make the forward pass dynamic. Universal Transformer is another variant of this [adaptive computation] idea. But I think there needs to be more empirics on which approach is actually better to maintain and manipulate state. I used to be more in favor of thinking, OK, chain of thought is more of a hack, but now I actually think it's probably…
Nathan Lambert [00:29:02]: What do you mean by state, like the state of the problem in that sense?
Ross Taylor [00:29:08]: So imagine that you're doing a GSM8k question, where John originally had 100 apples, then Jane gives him five apples. He has 105. And then he gives 20 away to like Susan or something and he's left with [85 apples].
So if you’re prompting the language model directly for the answer, you're expecting the language model in that forward pass to maintain and manipulate the state in a latent space, whereas the way chain-of-thought does it is in token space.
So you essentially output the intermediate steps. One of the problems with reasoning is that we have no idea how humans mechanistically reason…but if you think about how you'd solve a GSM8k problem in your head, then to me this seems a lot closer to something like chain-of-thought than adaptive computation.
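To make the contrast concrete, here is a minimal sketch of direct prompting versus chain-of-thought prompting on that apples question, using the Hugging Face text-generation pipeline. The model name is just a small placeholder, not a model discussed here, and the prompts are illustrative.

```python
# Sketch: direct prompting vs. chain-of-thought on the apples example.
# "gpt2" is a placeholder checkpoint; any causal LM from the Hub would do.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")

question = (
    "John has 100 apples. Jane gives him 5 apples. "
    "He then gives 20 apples to Susan. How many apples does John have?"
)

# Direct prompting: the model must track the running total entirely inside a
# single forward pass and emit the answer immediately.
direct_prompt = question + "\nAnswer:"

# Chain-of-thought prompting: the intermediate state (100 -> 105 -> 85) gets
# written out in token space, so each step only has to make one small update.
cot_prompt = question + "\nLet's think step by step."

for prompt in (direct_prompt, cot_prompt):
    out = generator(prompt, max_new_tokens=64, do_sample=False)
    print(out[0]["generated_text"])
```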
Nathan Lambert [00:29:57]: Especially when you look at the architecture and attention mechanisms. A Transformer is really good at copying. So if you keep feeding in the recent information, it copies that in some way. So I think chain-of-thought and all of these things, I mean, they're only growing in popularity in my mind, along with Quiet-STaR and these kinds of methods. I’ve heard the rumors about self-explanations and all these special things. The LLaMA-3 paper has all these special tokens. I don't know what all of them do, but I can see the direction. The state is stored in context and in special formatting tokens if it needs to be.
Ross Taylor [00:30:37]: So the other big picture thing is this. With the internet, you’re only seeing the output context.
So take StackExchange. If it’s a good answer, the author probably hasn’t just responded by generating words left-to-right. Maybe they’ve looked something up, maybe they’ve done a back-of-the-envelope calculation, either explicitly or in their head, right? And the internet is missing those “internal tokens”, essentially.
Now this isn’t always a problem because the models can learn how to construct them. And the effort now is to make artificial latents / internal thought, through RL or otherwise. But I think this is actually a much bigger question, which is more than just reasoning. In the end, as models become more capable, we’ll be talking more about how we can make them human-like in the way they can answer questions and solve tasks. For example, in some situations we might like the models to have [human-like] empathy, which is also “missing” in some sense.
So my prediction is that this becomes a bigger deal in the next few years: caring more deeply about the computation these models perform to reach a conclusion. And that will be the essence of alignment, in my mind. But that's a big topic!
Nathan Lambert [00:31:50]: OK, I have a long list of specific questions on this. My first question is about process reward models.
I think the canonical paper is “Let's Verify Step by Step”. My whole gripe is that it’s hard to create the data. That’s why they don’t exist in the open. But I’m guessing you can just label data with GPT and ask for feedback on each step, and just use that as an “LLM-as-a-judge” to get reasonable step-by-step labels on process rewards. But there’s so little work on this, so I don’t know if it is worth exploring. There is some research from Meta - I think Alex Havrilla did a couple of internship projects which related to this, and he’s good - but there’s such a lack of signal.
Is this something that people should work on more, or is it too complicated? Are there simpler things to do?
Ross Taylor [00:32:38]: Our big direction was integrating outcomes into reasoning - because next token prediction isn’t the objective we actually want to optimize. So the two ways to integrate outcomes are through something like PPO or inference-time search. And in both cases, you want a good reward model or value model.
Instead of (human-annotated) “process based reward”, we were exploring ideas along the lines of Monte Carlo policy evaluation (MCPE), where the key problem is how to learn a value model. It’s maybe a separate topic, but it’s underappreciated that something like MCTS - which in the public imagination is this inference-time search technique - actually has its real magic in giving you a value network for free.
This is why it was introduced in Go, because humans couldn’t come up with good heuristics for evaluation. So if you have something like MATH where you know the answer, then the question is how do you assign step by step feedback? It doesn't have to be MCTS, but something where you backprop the outcome to these individual steps is a way to get this dense feedback.
That's a way to get “synthetic process reward”. I should stress that PRM and MCPE are actually different things. Alex Havrilla was doing something along these lines also - but anyway, hopefully this gives a sense of the approach we took.
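A toy sketch of the Monte Carlo policy evaluation idea described here: estimate a per-step value by rolling out completions from each reasoning prefix and scoring the fraction that reach the known final answer. All the helper callables and names below are hypothetical placeholders, not the actual approach's code.

```python
import random
from typing import Callable, List

def step_values(
    steps: List[str],
    question: str,
    sample_completion: Callable[[str], str],   # policy: prefix -> full solution text
    is_correct: Callable[[str], bool],         # checker: does the text reach the right answer?
    rollouts: int = 16,
) -> List[float]:
    """Monte Carlo policy evaluation over reasoning steps.

    For each prefix of the chain of thought, sample `rollouts` completions and
    use the fraction that reach the known final answer as that step's value -
    dense per-step feedback derived only from outcome labels.
    """
    values = []
    prefix = question
    for step in steps:
        prefix = prefix + "\n" + step
        wins = sum(is_correct(sample_completion(prefix)) for _ in range(rollouts))
        values.append(wins / rollouts)
    return values

# Toy usage with a fake policy that "succeeds" 70% of the time.
if __name__ == "__main__":
    random.seed(0)
    fake_policy = lambda prefix: "... the answer is 85" if random.random() < 0.7 else "... the answer is 90"
    check = lambda text: text.strip().endswith("85")
    print(step_values(["John has 105 apples after Jane", "he gives 20 to Susan"], "Apples question", fake_policy, check))
```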
Nathan Lambert [00:34:21]: When Q* came out, that's something that I thought it might be doing. Instead of chain-of-thought, there's this idea of tree-of-thought. You could swap in the reasoning steps. And then if you could get labels on all these reasoning steps, you’re doing search over a reasoning space - which I would expect to work, but I think it needs the right datasets. I think a large part of the open alignment community right now is underappreciating datasets, where there's a lot of focus on methods, but we don't even have the datasets to use the methods… Like, why are you coming up with seven DPO variants if you don’t have the right datasets? I understand academic incentives, but if you are not an academic, you don't need to be doing that…
Ross Taylor [00:35:00]: It's an interesting question, because I guess the first chapter of LLMs had a lot of reliance on human annotations. In a way, that's a barrier to entry for the open community, because big firms can afford to pay millions for it but open source developers can’t. But more recently, you've had the rise of things like constitutional AI [and RLAIF approaches], which I believe are comparable to human-annotated datasets anyway. So is that a good thing for the open community?
Nathan Lambert [00:35:31]: I think it is, but human preference data might be a leg that is hard to remove. One of my later questions was: can we actually do LLM-as-a-judge for human preference data fully? I think that is the critical step that we don't have an answer for. Everything else in the modern RLHF stack is becoming more reproducible in the open.
And that relates to a question I have on synthetic versus human SFT. I think Thomas [Scialom] said on the Latent Space podcast that we just use generations from the model because they're better for humans on a lot of SFT tasks. Apple had a quote in their foundation model paper saying the same thing.
So I’m thinking, shouldn’t we be redoing all of our generations for our SFT dataset with the latest GPT-4 or LLaMA-405B? Why are we using GPT-4 from March 2023? That model was not as good on reasoning. So we have headroom there on synthetic data. We have prompts that we could reuse, but we don't have the right preference datasets - datasets like UltraFeedback are not big enough. And I think they're not in the same style that a lot of labs are doing this preference tuning - where it's on-policy generation.
We tried to work with Scale at Hugging Face to do this, where we had our own SFT models. We were getting data from Scale. We were labeling it every week and we were trying to retrain the models and we weren't getting a signal. This was last July/August. So we just didn't really know what we were doing. But I suspect that what people in the open should be trying to do is generating a lot, labeling it…That was a light bulb moment for me recently. This is what we have to do, but no one has done it.
Ross Taylor [00:37:21]: Yeah, I think it's definitely underappreciated how you can get better answers than a human by sampling the models [enough times]. You mentioned that Thom made this point early on in the [LLaMA] project, but you'd be surprised how this extends to reasoning as well. Even with the Galactica model - which is now an ancient model, a bronze age model - the pass@100 on GSM8k was 98%. And it's absolutely crazy to me that even now people are using GSM8k as a benchmark. In my mind, that benchmark was solved several years ago.
It’s a subtle point because the zero shot performance was ~48% but the pass@100 was 98%. The insight there is that the model already has knowledge about how to answer correctly, it's simply not reliable. This tells you that you need to invest in reward models, process based reward, outcome based reward, everything we talked about earlier…
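For readers unfamiliar with the metric, here is a minimal sketch of how an empirical pass@k number like that can be computed: sample k solutions per problem and count the problems where at least one is correct. The helper callables are hypothetical placeholders, not Galactica's evaluation code.

```python
from typing import Callable, List

def pass_at_k(
    problems: List[str],
    sample: Callable[[str], str],            # one sampled solution per call
    is_correct: Callable[[str, str], bool],  # grader for (problem, solution)
    k: int = 100,
) -> float:
    """Empirical pass@k: fraction of problems where at least one of k samples is correct.

    Zero-shot accuracy is effectively pass@1; the gap between pass@1 and
    pass@k is the unreliability being described here.
    """
    solved = 0
    for problem in problems:
        if any(is_correct(problem, sample(problem)) for _ in range(k)):
            solved += 1
    return solved / len(problems)
```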
But the same applies to the general RLHF pipeline. If you asked me to write a poem in the style of Bertrand Russell but also mix in Snoop Dogg’s style, then I couldn't do that. But the model has knowledge of how to do that, right? So why wouldn't you sample the model?
I think now with LLaMA-3, and the 405B model being out, it’s going to be good for the community that they can use it for generating data synthetically. And I'd imagine the quality will be good enough if it's done the right way.
Nathan Lambert [00:39:30]: Yeah, I think it should be doable. But there's a fundamental question of what do we think the human preference data is doing? [Compared to] model labeled preference data, is the noise that the humans provide of a different distribution that makes the human preference data better? I don't have a lot of signal on this, but I would love to know because I would guess that Meta would love to eliminate the $10 million plus estimated human preference data spend if they could. Meta is a reasonable company…
Ross Taylor [00:40:23]: Yeah, I don't know. But here’s something that surprised me. I was originally skeptical - at least on the reasoning side for LLMs - about LLMs marking their own homework. I thought they would eventually have that capability, but I wasn’t sure…
Nathan Lambert [00:40:40]: how fast.
Ross Taylor [00:40:41]: But the interesting thing we saw was as follows. We had experiments where we’d have a LLaMA-2 model that we’d sample generations from to train ORM models, and then we’d train different reward models on this data with different base models.
What we saw is that, the better the (underlying) base model, the better the reward model was for evaluating. And there were very clear patterns we saw: as the base model scaled, so did the quality of the reward model.
So that tells you that the knowledge is not in the ORM samples that you've fine-tuned the base model on. The knowledge on how to judge is within the model itself. And the pattern was so clear in the scaling. I concluded that eventually these self-verification approaches would work. It was just a question of when they would start to work for different types of problem.
Nathan Lambert [00:41:31]: Yeah. Model capabilities are also getting more dense, which helps as well. Like with smaller models, there are all these experiments with better data showing that you get a better model with an X% size reduction, which is kind of off-topic…
To double-down on what you said, I think this is one of the things I also debate: what makes a good model for downstream fine-tuning? I think in the LLaMA-3 report, they train the reward models directly on the base and not on the SFT model. The Apple report mentioned that they don't just use their evaluation suite for SFT models, but they evaluate with a reward model to see what is ready for RL.
I think, especially in the open, if you want the people to adopt your base model, there's a big gain in making it easy to fine-tune. For example, LLaMA has been pretty good; LLaMA-2 especially was really good for fine-tuning. There's also been base models that don't really work for fine-tuning, partially due to bugs and partially due to the state of the optimization. Is this something that you have any insight into?
Ross Taylor [00:42:43]: Yeah, I don't think I have enough insight into it to say, but I think it's definitely something that's been undervalued. I think the view of a lot of open model providers is: you get the model out, get good Open LLM Leaderboard results, and it's mission accomplished. But the real evaluation is in two days time when you get anon accounts on X saying “I'm fine-tuning this LLaMA model, it's not working”. And when you see a pattern with this kind of behavior, you have to conclude something is wrong…
Nathan Lambert [00:43:11]: It's always a chat template thing. A lot of it is a chat template thing, but those problems do get ironed out eventually. There's this whole idea of annealing and staging pre-training. I can't tell if it is boosting current capabilities at the cost of later capabilities. I think in a few years, this will all shuffle out and it's just how we do evaluation in stages. So you're always going to optimize for the right metric.
Ross Taylor [00:43:50]: There's two points to that.
The first is about annealing. It works for the kind of benchmarks people focus on the most, but then there's a question of whether you are actually just collapsing the task distribution of the model to things you're measuring - and not the true task distribution used by the community.
And I think there's a second point - which is maybe too much of a digression - but there's an interesting debate to be had about data quality being a bit of a misnomer. In a sense that when we say “data quality” we're actually saying “this data mix works well on these benchmarks”. But if you take a “No Free Lunch (NFL)” kind of approach to this, you must be hurting task performance somewhere else, right?
Nathan Lambert [00:44:34]: Yeah, I think I’m on the record of being an AlpacaEval hater. I say this all the time, because I think AlpacaEval is sacrificing actual usefulness for their own metric. If you get a 1-2% bump on AlpacaEval, maybe that’s great. But you could be getting a 10-20% bump while sacrificing actual chat abilities.
We released some models trained with PPO and our PPO models are not very good at instruction following because they don't follow modifications like be concise or some stylistic things. They're also so yappy. They just say so much…but they do well on metrics and PPO especially helped AlpacaEval. So we had to figure out how to kind of use that signal without overcooking it.
Ross Taylor [00:45:16]: Yeah, it's like a whole discussion about evals, I guess…
Nathan Lambert [00:45:21]: We could come back to evals in a second. The last question that I have is this: there are multiple trends - like LLaMA-3 downplaying the importance of instruction fine-tuning relative to RLHF. I think there are other quotes in [Thom’s] Latent Space podcast talking about it. Nemotron also had this report where they use SFT and then multiple stages of RLHF.
I think DPO versus PPO is overblown and that'll kind of be a wash eventually. Everyone knows DPO's advantages of being simpler. But my question is this: are there certain capabilities that only come for RLHF, and people trying to do them with SFT are just wasting their time?
I always thought safety was in this bucket where it kind of makes sense - it’s hard to train a model to refuse just with SFT. But with something like reasoning, are there certain sequencings where SFT primes you and then RLHF really helps reasoning or code? Because it seems like OpenAI is really leaning on PPO to help with reasoning and code?
Ross Taylor [00:46:45]: Yeah, I think there's two ways to answer this question. First, maybe the history of this debate on the LLaMA side, and then something on the reasoning side.
So the history is quite interesting. I would say, you know, when was it? 2023? My dates have been wrong since the pandemic…But this just was after ChatGPT. There was actually a debate internally in Meta about using RL, and a lot of senior people were very skeptical. I would say the view was…
Nathan Lambert [00:47:13]: Not just at Meta. You can see when different companies embraced RLHF, if you really start to look at their models…
Ross Taylor [00:47:22]: The view was that RL was a dead end. And that even DeepMind was moving away from RL at the time, so you should just do SFT.
But, you know, at least for the folks in the Galactica team that came to lead post-training for LLaMA, we were quite scarred by hallucinations! We were definitely of the view that we needed to have the right objectives, and that we needed to make sure language models could “know what they don’t know”. So we were quite high on RL from the beginning. And eventually, I think the LLaMA-2 paper showed that a lot of the advances in helpfulness/harmlessness were via the RL stage. So I think that approach was fairly vindicated.
On the reasoning side, I would just say it’s quite simple. It comes back to the next token prediction objective not being the actual objective you want to optimize. The objective you want to optimize for reasoning is: do you get the right answer or not? Especially since reasoning is a high precision task. If you get one token wrong, unless you have a backtracking capability, you’re never going to recover…
Nathan Lambert [00:48:32]: That's a throwback, the backtracking token. Sorry, that was a random paper! That is interesting…
Ross Taylor [00:48:38]: Yeah, all these weird methods… But I think on your question, there is a point at which these techniques kind of overlap, right? So if you're, you know, doing SFT with rejection sampling: you’re doing something close to PPO anyway. And the same for reasoning: if you sample the model and pick the trajectories that your verifier says are correct, and then do SFT on that, it is a form of RL.
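A rough sketch of the rejection-sampling loop described here - sample the model, keep what a verifier accepts, and fine-tune on that. The callables and names are hypothetical placeholders, not any lab's actual pipeline.

```python
from typing import Callable, List, Tuple

def rejection_sampling_sft_data(
    prompts: List[str],
    sample: Callable[[str], str],          # current policy: prompt -> completion
    verify: Callable[[str, str], bool],    # verifier: is this completion correct?
    samples_per_prompt: int = 8,
) -> List[Tuple[str, str]]:
    """Collect (prompt, completion) pairs where the verifier accepts the completion.

    Fine-tuning on this filtered set is the "rejection sampling as a form of RL"
    idea: the policy is only reinforced on trajectories that achieved the outcome.
    """
    kept = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            completion = sample(prompt)
            if verify(prompt, completion):
                kept.append((prompt, completion))
                break  # one accepted sample per prompt is enough for SFT
    return kept
```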
The final point I’d make is this: I would say the community overreacts to certain methods being used by popular models. They think: this company uses DPO because they must have found it's fundamentally better. But actually, it's usually due to either practicality or…
Nathan Lambert [00:49:22]: Yeah, that's what I think.
Ross Taylor [00:49:24]: You have a 405B model, and if you want to do PPO, you need to have a policy model, a reward model, value model etc in memory, and it’s not like…
Nathan Lambert [00:49:33]: Especially with DPO. I think with the 405B, I'm guessing what you did was cache the reference model. You could cache the log probabilities from the reference model. So you don't need to keep them in memory when you're doing the loss of the primary model. For DPO, you don't even need an extra copy of the model in memory, which therefore means you can use the same exact stack that you use for training. So you don't have to comment on this. But I think that's probably partially why LLaMA-3 just used DPO...
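A minimal sketch of the memory trick being described: the DPO loss computed against reference log-probabilities that were precomputed once and cached, so the frozen reference model never has to sit in GPU memory next to the policy. This is a generic illustration, not Meta's training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss_with_cached_ref(
    policy_chosen_logps: torch.Tensor,    # log p_policy(chosen) per example
    policy_rejected_logps: torch.Tensor,  # log p_policy(rejected) per example
    ref_chosen_logps: torch.Tensor,       # precomputed offline with the frozen reference model
    ref_rejected_logps: torch.Tensor,     # and loaded from disk at training time
    beta: float = 0.1,
) -> torch.Tensor:
    """DPO loss with cached reference log-probs: only the policy is in memory."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```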
Ross Taylor [00:50:07]: Yeah, I think people don't appreciate how compute works either. People assume the big companies have so much compute - tens of thousands of GPUs - so compute isn't a constraint. But all these things are subject to Say's Law, right? If you have more compute, you're going to train a bigger model. And then you're going to hit the constraints again. It’s like the old thing of trying to solve traffic by building another lane. But if you create another lane, people will use that lane of traffic.
So practicality is still a factor [behind choosing methods]. Also things like which researcher is in charge, what’s their favorite method, and also politics as well.
So I think the community has made a mistake of overreacting to these choices. There was a mixture-of-experts phase too, right? I don’t think there’s anything inherently better with either method (dense or MoE), they just have different trade-offs, and it depends on what you are trying to achieve. If you’re serving lots of people with inference, then maybe a MoE approach is better. If you’re optimizing for something simple that’s easy to train and gets good results, maybe you favor a dense approach - although that’s debatable whether it’s easier to train. But I don’t think these things are clear cut.
So I would encourage people to not just copy things because they're in a paper from a big lab. I would encourage people to try things out themselves to know what works, and figure out what the problem is that you’re really trying to solve.
Nathan Lambert [00:51:20]: I think people don't have enough long term direction in their decisions. People are not trying to make decisions about what will be right in 10 years, they are trying to get a model out as soon as possible. So there are very few people with the incentives of trying to understand in the asymptote, which method is better… I might have that incentive, because I'm a nerd, and I have an audience that is okay with me writing four paragraphs around esoteric nerdy topics, but for all these companies, that is not a real incentive.
Ross Taylor [00:51:53]: The other point I’d make - maybe it is a separate thing - is this. I made this mistake throughout my career of focusing too much on novelty and complexity.
So in my first job in sports betting, we were making models for horse racing, football, that kind of stuff. And I always had the perception that other funds had really advanced, cutting-edge, complex models - but that wasn’t the case at all.
I think there is this tendency within deep learning to assume that - especially for the secret labs - that their good performance is due to some secret, amazing method. But more often than not, good performance is due to lots of small things from different people combined into one model. Really, lots of simple things done well and solid execution. And frankly, for big firms a lot of brute force too, right? Because big companies are naturally slow. But once they find a way to mobilize resources, they’re very intimidating and hard to beat. If you’re in a big company, and you’re aware of this, which approach are you going to take: are you going to prioritize novelty or are you going to do brute force if you have 10,000s of GPUs?
So I would encourage people not to be too intimidated by this perception that the big labs are smarter. I don’t think they are.
Nathan Lambert [00:53:03]: They're earlier but they're not necessarily smarter.
Ross Taylor [00:53:09]: Yeah. So obviously the constraints are different because of less compute in the open, but still: you’ve got to use first-principle thinking and be empirical as well, and just follow that path.
Nathan Lambert [00:53:21]: Yeah. So following up on this, there's a lot of discussion around what the processes are for making a successful foundation model lab. I think Armen has been talking about a few things on Twitter with great visualizations around de-risking pre-training based on FLOPs efficiency. Do you have any comments on what makes a successful post-training team and project?
I've talked to John Schulman a couple of times - he's been the king and started all of this - and OpenAI is still looked at as being the leader in the space. I think they've always been top on Chatbot Arena, and have cracked what most people like in the style. They started early. Are there different considerations for the post-training side of things rather than the pre-training side that we might hear more about?
Ross Taylor [00:54:13]: Yeah, there's probably better people than me to answer. So in our team, originally like Robert (Stojnic), my co-founder, he was kind of managing the post-training team. And then I'd say Thom Scialom was doing a lot of the work. And then more recently Rui Hou - he kind of flies under the radar a bit - but he’s been doing a lot of the work. They are all better placed to answer than me, since I was focusing on reasoning and agents.
But I think the key thing is this: post-training is just a lot of iteration. Frankly, lots of hard work - e.g. making sure at each round of RLHF you’re not regressing in certain ways, filling holes, etc. I guess it’s hard to put a finger on a single thing, but…
Nathan Lambert [00:54:58]: There's simple things like I'm trying to get people to talk about more. I’m trying to establish a good vibe test about internal culture. How do you vibe test for a good post-training culture (or for reasoning)? I remember somebody at Anthropic told me there’s still a lot of cases where you just put your finger up to the wind and you're like “model good”. And I'm sure that is still happening. And that's just a simple cultural thing of telling the team that you can’t always trust all of your numbers.
Ross Taylor [00:55:26]: I think it is maybe a more fundamental question. I wasn’t there at the early days of FAIR - I came in 2019, but FAIR was always a very bottom up organization. Which is a great thing: that's why things like PyTorch emerged. But the real insight as to why OpenAI was ahead historically, at least until recently, was that they had more of a top-down culture and focused bets. They saw the potential of LLMs early on and it was a top-down prerogative of the company to focus on that. And in essence, it was more of an engineering problem than it was a research problem in a lot of ways.
Relatedly, I think a lot of people were surprised that the LLaMA-3 paper wasn't as “novel” as they were expecting. But that just reflects the fact that a lot of it is just engineering and engineering is really hard - a lot of hard work. Not always a lot of new methods, but it is a lot of hard work.
Nathan Lambert [00:56:22]: Yeah, we're starting our next fine tuning model and everyone's asking me: “what should we work on?”. I'm trying to tell them “we just have to filter data and generate more completions”. We’ll have a lot of prompts, we have to filter them, generate completions from good models, and then we’ll have to generate more completions and keep doing this process…And in 10 weeks, we'll probably have a very good open model. We’ll just have to be boring for 10 weeks! And we have like 10 people involved.
So it's a bit of a bigger project, which I think is the right way to do it. We have just started getting improvements on IFEval by copying Nemotron. We use some open math datasets and the math scores are getting closer to LLaMA. It is really the simplest things ever. It's like browsing Hugging Face and being like, “NVIDIA released some JSON format data, some instruction format data, like we add it in and the numbers go up”.
Ross Taylor [00:57:16]: Yeah, I think I said earlier, but it raises an interesting question where this kind of approach - of grinding until the open LLM leaderboard numbers get to 100% - I think we’re going to get to a situation where all the benchmarks are solved, but where we haven't really, in my mind, at least solved intelligence.
What does it mean that we'll get close to 100% on MATH, you know, without any inference time search? I think sooner or later, while it looks like we’re on an exponential with LLMs, we’ll realize we’re actually on an S curve. Eventually we're going to get back to this mode where we have to do new things. And I think that's great, because that's what motivates me.
But yeah, I think there's waves, and we’re in this heavy exploitation mode right now with LLMs - away from the glory days of architecture exploration. But my hope is that we'll get back to the stage where, after exhausting all the [current] benchmarks, we say: OK, now we need to do something completely different. But who knows?
Nathan Lambert [00:58:26]: I see it similarly. I think we still have a year or two, at least in the open. If the closed models start saturating and they start doing things differently, that's fine. But eventually it'll all get there. And in that phase, I mostly keep working just to make sure that the ecosystem doesn't fold in on itself. So that's probably the one-sentence summary of what I'm doing these days: add transparency so that regulatory capture doesn't nuke everything. And that's fine, but I think it's still going to be longer than people expect. I don't think we have true signs of saturation at the top. We'll see what GPT-5 does - or if GPT-5 never comes out, then we’ll really know.
But it seems like it's going to come. I think there's enough signs that it'll come eventually. I think I don't know the answer to this - and it's not really our expertise - but I'm interested in the potential architecture of GPT-5 and if it's GPT-4o like and they're using more multimodal data to try to keep the data engine going relative to just going bigger. I don't know the answer, but that's kind of the future questions I’m thinking about.
Ross Taylor [00:59:34]: In my mind, like three years ago, the thing on the horizon I saw was agents. That’s where a lot of people are working right now: long form tasks where an agent doesn't have to answer a question immediately, and [can instead] go away for a while doing some research and answer later. I think that will take up a lot of time in the next five years.
It's both a compute problem of bigger models - more scale will do better - but also a data problem of how do you generate these trajectories? How do you get reliability? So it’s more successful and less error-prone at each step. And I think in principle it's solvable, but I just think it would take some time.
Nathan Lambert [01:00:18]: Yeah, it seems that engineering is required. It doesn’t seem like something that's just going to emerge. It's building a whole system and scaffolding around agents. Just unglamorous work.
Ross Taylor [01:00:32]: Yeah.
Nathan Lambert [01:00:34]: OK, anything else you want to add? Do you want to get people excited about your start-up or is it too early? Maybe too early, yeah?
Ross Taylor [01:00:43]: Yeah, what else should I say? It has been nice to step back for a bit and look a bit ahead into the future. For me, my best days creatively were my teenage years when I got back home from school and spent the rest of the day programming. It’s quite nice to feel like that again: to be in that zone again where I can shut the world out and do some work.
But maybe just to give a hint of the areas I'm interested in, I think it comes back to this problem of how alignment is going to be a process of making AI more human-like. For example, how do you control for things like deception - which Anthropic has done a lot of really good work on.
Essentially… the latents of AI are [potentially] misaligned with human latents, and the question is: what do the human latents look like anyway? And how do we model these things?
That is very abstract and high level, but that is the fundamental question I want to work on. But yeah, I think I can talk about it later in the year!
Nathan Lambert [01:01:49]: Yeah, sounds good. Thanks for coming on. This was great. I think people are going to get a ton out of this. I think just a very sensible conversation on fine-tuning, reasoning and some of the things that got us here. And that's what I was hoping to get out of it, so thanks again!
Ross Taylor [01:02:06]: Yeah, great to talk, Nathan. Have a good one!
Apple, Meta, and Nvidia all agree -- synthetic data, iterative training, human preference labels, and lots of filtering.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/frontier-model-post-training
Chapters
00:00 Llama 3.1 post-training and the new normal for RLHF
01:18 A new standard pipeline
01:45 Human preference data
02:59 Scaling RLHF
05:03 Synthetic data
06:10 The new normal
06:51 Data quality is king
07:18 Apple confirms the new normal
Figures
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_020.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_031.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_033.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/frontier-rlhf/img_035.png
This week, I had the pleasure of chatting with Sebastian Raschka. Sebastian is doing a ton of work on the open language model ecosystem and AI research broadly. He’s been writing the great Ahead of AI newsletter (that has the biggest audience overlap with Interconnects, at 26%, so a lot of you know him) and multiple educational books, all on top of being a full time machine learning engineer at Lightning.ai, where he maintains LitGPT, which he described as being like Karpathy’s NanoGPT, with slightly more abstractions.
This conversation mostly surrounds keeping up with AI research, the state of the open LLM ecosystem post Llama 3.1, and many narrow topics in between. I learned that Sebastian used to be an Arxiv moderator, which gives some simple color on how Arxiv and sifting through thousands of papers works. We cover a lot of ground here, so I hope you enjoy it.
Listen on Apple Podcasts, Spotify, and wherever you get your podcasts. For other interviews, go here.
YouTube
Chapters
* [00:00:00] Introduction & Sebastian’s background
* [00:04:28] The state of deep learning and language models in 2018
* [00:08:02] Sebastian's work at Lightning AI and LitGPT
* [00:12:23] Distillation and its potential in language model training
* [00:14:14] Implementing language models and common pitfalls
* [00:18:45] Modern architectures: Mixture of experts models, early v. late fusion multimodal
* [00:24:23] Sebastian's book on building language models from scratch
* [00:27:13] Comparing ChatGPT, Claude, and Google's Gemini for various tasks
* [00:38:21] Vibing and checking new language models during implementation
* [00:40:42] Selecting papers to read and moderating Arxiv
* [00:45:36] Motivation for working on AI education
* [00:52:46] Llama 3 fine-tuning
* [00:57:26] The potential impact of AI on jobs in writing and education
* [01:00:57] The future directions of AI
Transcript
Built with smol-podcaster and with love of Latent Space.
Nathan Lambert [00:00:00]: Hey, Sebastian, welcome to this kind-of-Interconnects interview - normally these are researcher interviews. You were a professor, so that definitely counts. You do a lot of different things these days. Let's get talking about language models. Welcome. Yeah.
Sebastian Raschka [00:01:35]: Thanks so much for the invitation, Nathan. I'm a big fan actually of the interconnects newsletter, so I'm hoping we can have some fun chat about research, LLMs, and what's hot these days, basically. Yeah.
Nathan Lambert [00:01:48]: I have a little section on the end, which is keeping up with AI research, writing about AI and process, because you do so many things, but I kind of want to jump into how you got to AI, because you have an interesting career path. So you were a professor at Wisconsin Madison for years. I saw in statistics, which ... I also went all the way back to find your PhD thesis, which was uncovering hidden patterns of molecular recognition. So this was a while ago, and is this kind of ... Can you explain your background and how you got into AI? I'm guessing it's through computational statistics or something like this.
Sebastian Raschka [00:02:24]: Yeah. Close. So yeah, you did some research there. Interesting. So yeah, it's been a long time since my PhD thesis. This is maybe seven years now. And back then, it started even earlier when I got into AI, that was like, I would say 2012-ish. I was in grad school and I was taking a statistical pattern classification class. And in that class, yeah, the star of the show was basically naive Bayes classifiers, or in general, Bayesian methods for pattern recognition. And from there, I kind of really got into machine learning. So it was, I would say, more statistical-based, but it was all about classifying things. And then I think it was also right about the time when Coursera was launched, and I saw Andrew Ng's Coursera class. That was, I think, the first class in 2011-12 back then. And yeah, that's basically how I started from statistical pattern classification into machine learning. And I applied that to computational biology problems like molecule and drug discovery - pharmaceutical drug discovery. And yeah, from there, at some point after my graduation, I joined the University of Wisconsin in Madison, where I was in the statistics department, but I did mostly deep learning research, essentially. I was the only one basically doing Python, deep learning, machine learning stuff. So yeah.
Nathan Lambert [00:03:48]: What year was this, and what did it look like at the time?
Sebastian Raschka [00:03:52]: That was around 2018, I think August 2018, when I joined the department. And yeah, I mean, so it's the statistics department, but my work was technically all machine learning and deep learning. I mean, a lot of students were really excited about learning machine learning. I think it was just around the time where it got really popular. And yeah, I was teaching machine learning and deep learning classes as well. They were always like, you know, full and crowded, like a lot of students were excited about that. Also, in general, like the time learning about Python, machine learning, data science, all these topics.
Nathan Lambert [00:04:28]: It's, I mean, it's very interesting because I was a grad student at that time, in like 2018. That's when deep RL was really taking off. And as a student at the time, it probably felt kind of like the language model thing does now, where there are just so many people in all these classes. And now language models have more of a real-world application, but I think as a student, it probably feels so, so similar. Yeah.
Sebastian Raschka [00:04:50]: So also back then, if I may say that it's like large language models already existed. I think the GPT paper, was it 2018? Something like that?
Nathan Lambert [00:04:59]: Yeah, 2018 or 2019. Yeah. For GPT-2, I think.
Sebastian Raschka [00:05:04]: I remember covering it - like I had a whole hour or two hours on large language models back then, but it was all focused on BERT models and basically also using them for more like classification tasks. Now, I would say maybe a lot of business problems still revolve around classification, but everything else is basically generative - generating text, generating images and stuff. So it has changed a lot.
Nathan Lambert [00:05:28]: Yeah, for sure. It's like a sequence of - ELMo, BERT, and the Transformer are probably the things that you were talking about all the time? Just very interesting. I think Yi Tay had this - did you read Yi Tay's recent blog posts on language model architectures, which kind of walked through why encoder-decoder is no longer in vogue? Did you see this?
Sebastian Raschka [00:05:51]: Yeah, I think I haven't seen the article, but I remember having discussions with people about that recently. I mean, I think there was actually, it's interesting. So I think T5, if you would train it and fine tune it, it would still be a really good model for sequence to sequence tasks, like language translation and stuff like that.
Nathan Lambert [00:06:10]: Yeah. Cohere for AI did this with AYA. They used T5 for their first AYA version, which most people were like, oh, they've Cohere branded it so well, but no one realized they're using T5.
Sebastian Raschka [00:06:21]: See, I didn't even know about that. And on that note, there was something else I wanted to say. There's also still the classification thing - using LLMs for classification. It was usually either a BERT-like encoder, or you could also use an encoder-decoder, but mostly an encoder. But I've also seen recent papers using just decoder models for that - I saw two papers on that actually - basically removing the causal mask. So essentially reverting it back to an encoder, using LLaMA and then removing the mask. So in that sense.
Nathan Lambert [00:06:59]: And it works well as a classifier. You can just kind of use it. That's awesome.
Sebastian Raschka [00:07:04]: I mean, you could even do that without removing the causal mask. You could just tune on the last token basically. But if you remove it, they found that you could probably use even the first token, because with the last token you always have to have padding - you have to pad to the longest sequence, otherwise the last token sits at a different position in each training example. So in this way you could use an earlier token basically, and keep its position fixed.
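A minimal sketch of the vanilla version mentioned here - keeping the causal mask and putting a classification head on the last non-padding token of a decoder-only model. The checkpoint is a small placeholder, and this is an illustration of the mechanics rather than the code from the papers being discussed.

```python
import torch
from transformers import AutoTokenizer, AutoModel

# Placeholder checkpoint; any decoder-only causal LM works the same way.
name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(name)
tokenizer.pad_token = tokenizer.eos_token      # GPT-2 has no pad token by default
backbone = AutoModel.from_pretrained(name)
classifier = torch.nn.Linear(backbone.config.hidden_size, 2)  # e.g. 2 classes

texts = ["the reaction is exothermic", "write me a poem about goldfish"]
batch = tokenizer(texts, padding=True, return_tensors="pt")

with torch.no_grad():
    hidden = backbone(**batch).last_hidden_state            # (batch, seq, hidden)

# With right-padding, the "last token" sits at a different position per example,
# which is why the attention mask is needed to find it.
last_idx = batch["attention_mask"].sum(dim=1) - 1
pooled = hidden[torch.arange(hidden.size(0)), last_idx]      # (batch, hidden)
logits = classifier(pooled)
print(logits.shape)  # torch.Size([2, 2])
```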
Nathan Lambert [00:07:30]: Yeah. Yeah. Now with your work at Lightning AI, do you do a lot of these things like hacking around with language models? Because I think it's kind of an underexplored space where just like people remove layers and plug things together. I think there was like, when merging was just getting going, there was like Franken Llama 2, where somebody made like a Llama 2 30 B by just chopping layers and stuff together. There's so much unexplored signal there that I just, do you have your, have you ever looked at these things or you don't do that much?
Sebastian Raschka [00:08:02]: I must say I'm not a big fan of merging. Maybe I'm just not good at it. I rather prefer fine-tuning - changing things, or training and fine-tuning things. So yeah, I do a lot of this type of hacking. Sometimes voluntarily, sometimes involuntarily, because I make a mistake or something - because at Lightning I developed this library, LitGPT, which is an open source library for pre-training, fine-tuning, serving and deploying LLMs. But it's basically a from-scratch implementation. You can think of it as NanoGPT from Andrej Karpathy, but for all types of LLMs - Llama, Gemma, Phi, all of them. But the focus, like NanoGPT, is on readable code, on keeping it relatively simple. Of course it gets a bit more complex when you add multi-GPU training, tensor parallelism, fully sharded data parallelism and stuff like that. So if you add all these settings, it gets a bit more complicated, but the focus is still on having a code base that you can easily work with. And in that context, it's very easy to remove layers and change things. I build it for colleagues at Lightning, but also for the open source community, and then also for myself to tweak things, to change things and stuff like that. I should also say, it's not just me - it's Carlos and Adrian who started this library. Currently I'm the main person maintaining it, but a lot of people contribute to it. So it's actually a nice playground.
Nathan Lambert [00:09:41]: There are kind of two follow-ups to this. One is: what part of the language model training stack should somebody start with - LitGPT or Hugging Face or whatever - if they're trying to fine-tune a model and can work from an example? And then, what is the thing they should do to go one level deeper to learn how these things work? Because you're saying with LitGPT you can do all these different architectures. I don't know if I would recommend architectures, but it's a good way to learn how the attention implementation works and how different layers are shaped and things like this. Are there different areas you'd recommend people look at?
Sebastian Raschka [00:10:14]: Yeah, okay, it's a shameless plug, but I have a book where I do this step by step, the implementation. And this is for only one model, a simple model, a GPT-2 model. Because it's, I would say, the one that started all of this, right? The main architecture — and everything else is kind of a derivative of it, in a good way, making tweaks and improving things. So basically starting with one architecture, like you said, not looking at different ones at first, and then just understanding: what is the input data here? What does it look like? What goes into the LLM, and how does it pass through the layers? From there, we understand how a model learns to generate one word at a time, and then go from there to instruction fine tuning, and then even alignment with DPO, for example. So doing all these lifecycle things: implementing one architecture, pre-training, fine tuning, aligning. And then from there, I think it's a useful or interesting exercise to see how different architectures make slightly different choices, like replacing the GELU activation with a SiLU activation, or pre- versus post-layer norm, and these nuances — changing the number of heads or number of layers. And yeah.
Nathan Lambert [00:11:38]: Yeah. I mean, in industry, everyone kind of converges to similar things, or people converge to a similar recipe and then they stick with it for infinity. So each of the orgs has these recipes that are too risky to change, and we at AI2 are still converging on a recipe. So we're learning things that the Llama team does — like RMSNorm, and they think it's very important — these different things. And I wonder how the open community is going to converge on pre-training things. So what scale of models do you recommend people train for your book? Are they training the hundred-million-scale GPT-2? Is it smaller? Because I think in Colab you can fine tune maybe a 7B model with LoRA, I think. Is that true?
Sebastian Raschka [00:12:23]: Yeah, that is true. But I think for LoRA, if you want to fine tune a 7B model, you would need bitsandbytes quantization, like the normal float 4 quantization. But yeah. Or, maybe going one step back: for the book, it's really the smallest model, the hundred-something million. But I also have settings, if your machine permits, to use the larger versions. There are a few larger versions, like 300 million, 700 million, and 1.5 billion. But it's really up to the reader. I have all the examples with the smallest one so that it even runs on a MacBook Air. On this podcast, I'm here on my small MacBook Air, and all the models train fine in a few minutes. Of course, I'm not doing the whole pre-training. For that you would need a GPU for a week, or maybe even longer than that now. It depends on the GPU, of course, but on an H100, maybe a week. But the other reason is, in practice you would probably use pre-trained weights, and then you can do continued pre-training and then fine tune. So the focus is basically understanding how the pre-training works, then loading pre-trained weights. But the fine tuning is the full thing: fine tuning a classifier, but also instruction fine tuning essentially. And that doesn't take too long. I would recommend using a GPU, but it would technically run on a CPU. And to get back to the question you had: for a 7 billion parameter model, one A100 would probably work. But if you use LitGPT, you can also set the number of devices and shard it over multiple GPUs. Yeah.
Nathan Lambert [00:14:14]: I mean, all of this stuff is getting so much easier. I think, I don't know, when did you start writing this book and all of these chapters? Because I've seen the GitHub, I haven't looked at when it started.
Sebastian Raschka [00:14:23]: Actually longer than you might think. It took a long time. It's almost, at this point, one and a half years approximately.
Nathan Lambert [00:14:30]: Because at that time, a 1 billion parameter model — what was the state of the art 1 billion parameter model a year and a half ago? Some random model. But today, people are training 1 billion parameter models for 15 trillion tokens. So the fine tuning that you can do there is getting extremely good. And I'm going to guess that people are going to start training even smaller models with these distillation losses. So have you looked at distillation at all? I think it's full on coming in the next six months. We can shift this to the Llama 3 and state-of-the-open-ecosystem section, because it kind of fits in. Llama 3 was not distilled. Distillation is a specific loss function. I find it annoying that when synthetic data came around, people called that distillation — I was on the Zephyr paper, whose title is Direct Distillation of LM Alignment. But now the technical definition of distillation, which is knowledge distillation from a teacher, is becoming popular. So the whole synthetic data and alignment space is stuck with a doubly defined word.
Sebastian Raschka [00:15:30]: So basically what you're saying is that people who just use synthetic data refer to it as distillation because it's from a larger model. Yeah. Confusing. I think Gemma 2 actually did the knowledge distillation recently, so that was an example where they did that. And I do think it's coming. For my book, those are the core chapters I have, but I have a whole long list of bonus material that I want to cover, and knowledge distillation is one of them. So this will be something over the next few years, doing tutorials on those. Yeah.
Nathan Lambert [00:16:04]: Because I think people can actually use it as a thing. So how distillation works — I've thought about implementing it — is that if you have a fine tuning corpus, you get all the predictions from your big model, so all the log probabilities from your big model, and you store them. And then as you're training the smaller model, you essentially weight its training by those stored predictions, so you don't need to keep the big model in memory while you're training. So I think someone will upload a dataset file of a giant set of log probs from Llama 405B, and people will just try to fine tune from it. I'm surprised that Llama 3 didn't use it, but I think it's just because they're focused on scale and data more than any fancy things.
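A rough sketch of the loss implied by that description, assuming the teacher's per-token log-probabilities have been precomputed and saved ahead of time (in practice people usually keep only the top-k entries per position to keep the files manageable); the tensor names here are illustrative, not from any particular library:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logprobs):
    """Both tensors have shape (batch, seq_len, vocab_size); teacher values are precomputed."""
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = teacher_logprobs.exp()
    # Forward KL(teacher || student), averaged over token positions: the student is
    # pulled toward the teacher's full next-token distribution rather than a one-hot
    # label, and the teacher never has to be loaded during training.
    kl = (teacher_probs * (teacher_logprobs - student_logprobs)).sum(dim=-1)
    return kl.mean()
```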
Sebastian Raschka [00:16:49]: Yeah. And I think I probably know why. One thing one should also add is why I think it's becoming more popular: with Llama 3.1, they just allowed doing that. I think before, according to the license, it was technically not allowed to use the Llama 3 models to improve other models, but now we can. So I think, like you said, it's probably going to be a hot topic. But I do think they didn't do it because the 405B Llama model had only just finished training, I think. I mean, if you think back, they shared the Llama 3 models, what, half a year ago or something, many months ago. So I think it's really more that it hadn't finished training, but maybe for Llama 4 we will see more distillation using the 3.1 model.
Nathan Lambert [00:17:38]: Yeah, it's more architecture things. So while we're talking about distillation: Gemini Flash — Google's Gemini Flash — is confirmed as distillation. And it is very likely that Claude Haiku and GPT-4o mini are distilled in the technical sense of the word. I think it's obvious that that works for pre-training. And I think there will be a breakthrough fine tuning model, kind of like the likes of Zephyr, Starling — I'm forgetting more names — ones that really shift the narrative, from fine tuning on distilled data. I think that'll come in the next six months. So honestly, I'm telling the people I work with, we should try to do this before someone else does, because it's so obvious now.
Sebastian Raschka [00:18:22]: One thing I've seen also a trend, I wouldn't say backwards, but a thing that doesn't seem to be that popular anymore is a mixture of expert models. What do you think about that? Is that like something like that was like a fad and now people don't, you know, they explore other things like distillation. I mean, you could do both, but it feels like a mixture of experts is not as hot anymore
Nathan Lambert [00:18:45]: somehow. I don't know.
Sebastian Raschka [00:18:45]: What do you think?
Nathan Lambert [00:18:47]: There are two things. Small mixture of experts models are definitely coming out. Essentially, you get a fixed improvement in FLOP efficiency at pre-training. So if you're going to pre-train an X billion parameter model with mixture of experts, it'll go like 40 percent faster, or some pretty appreciable number. There's a lot of rumors and discussion that scaling up mixture of experts models is really hard from a stability point of view. A lot of these open groups could get it started — and we're playing with this at AI2 too; we want to play in the mixture of experts space as well — and doing a small model works, but there's a lot of headaches. I think some of the friends at Databricks Mosaic ML have been the clearest about this. It's just: you, at AI2, do not have the engineering throughput to deal with the headaches that come from mixture of experts. So I think there's still clear signal from industry and people — I mean, DeepSeek's releasing MoEs, I think Qwen has a small MoE, and these are pretty good models — but I think it's a really heavy engineering lift to get mixture of experts to work at, like, GPT-4 scale. I expect Meta to figure it out. I think it's just on their list, and they figured out dense first. The thing I'm more interested in for GPT-4 — I don't care if it's mixture of experts; I think they have the compute to do it either way. But for Llama 4 — God, all the numbers throw me off so bad — I think that OpenAI and Google might be slightly ahead by having the early fusion models. So essentially, with these multimodal models, there's the concept of early versus late fusion. The first visual models that people were playing with, in GPT-4, were late fusion. And now GPT-4o is early fusion, and it seems like Gemini is probably early fusion, which means they take in audio, video, and text directly at the input, and the training data changes. And I don't know how much of a heavy lift it is to get that to work. I think that might be the bigger change, and that might be harder for Meta to catch up on than anything else. But no one's really talking about it.
Sebastian Raschka [00:20:58]: But also here, I think that is something I feel like others have. I mean, I remember even like last year, there were a lot of papers with a late fusion thing, like I think Llama adapter papers and stuff like that, like retrofitting the models. But yeah, I haven't seen that much focus on that from Meta. But I mean, they had a section on that in the paper, but it felt almost like an afterthought. I don't know. It's like where, yeah, I think maybe there's a different team at Meta that works on
Nathan Lambert [00:21:26]: that. There is a Chameleon team that was doing this, and I think a lot of them have left. My question, essentially — the thing I want to debate and don't know the answer to — is that early fusion takes such different data pipelines. You have to have a much clearer balance between video, images, audio, and text when you're training early fusion than with late fusion, where you just add a bunch of images at the end. Is that data curation step going to be a big bottleneck for shifting, and do Google and OpenAI have an advantage by just scraping YouTube? Google obviously can scrape YouTube, and I'm not saying that they are. But if it becomes a way that you can get more data, and GPT-5 is the first model that OpenAI releases like that, then I'll be like, OK, the GPT-4o thing was just a pivot. And I actually think this could happen. I don't put this at like a one percent probability; I could see this being what the labs are betting on. It just takes so long to spin up this entire new pipeline of training.
Sebastian Raschka [00:22:25]: But one question here is going back to a point you mentioned earlier regarding the knowledge distillation where you can just precompute all these things, you could technically do that also just once for the whole data set. Let's say you have a very good image encoder, audio encoder. You would never have to redo this if you do it well. Right. I mean, it would be something you do it, take care of it once and then you pass it just as tokens to the to the other team, basically.
Nathan Lambert [00:22:49]: Yeah, probably. I don't know. I don't have as much insight into really advanced pre-training practices as I would like. I'm mostly in a similar boat of fine tuning models and playing with things. Speaking of which, have you played with Llama 3 405B at all? For context, we're recording this about a week — six days — after the release. I haven't gotten it set up, but I'm really curious. I don't have clear expectations on how the open source community, the open language model ecosystem, evolves from here with these new Llama models and the new Mistral models. From a technical and a policy perspective, for me it feels like a pivot. On the educational side of things, it's actually more of the same — we knew this was coming — but it feels like it could be qualitatively different going forward. Do you see anything? Have you tried anything?
Sebastian Raschka [00:23:45]: Yeah, I did actually try the Llama 3.1 models. When they came out last week, we added them to LitGPT. I took care of the 8 and 70 billion models, and my colleague Adrian added support for the 405 billion model. So just briefly trying it, it looks really good. The thing with the 405 billion model is that it's a bit tricky. The problem is, of course, it's free — everyone can use it — but in a sense it's still expensive to run. We were running it with bitsandbytes quantization, normal float 4, on eight H100s. And this is expensive, right? I mean, eight H100s is probably more than a hundred bucks an hour.
Nathan Lambert [00:24:26]: I was trying to do the same and I messed up the vLLM installation. I was like, okay, I've spent an hour on this. Yeah.
Sebastian Raschka [00:24:32]: So you can try LitGPT maybe. It works with it.
Nathan Lambert [00:24:36]: Yeah. And there's a related question. One of the things I'm trying to ask people who are hands on, just like, how do you, what do you do to vibe check a new model as you go through so much AI research material and language model material? It's like, everyone has their procedures and how do you go about that?
Sebastian Raschka [00:24:51]: So for me, I use these more for making sure they generate correct answers, or something that is reasonable. Honestly, really simple questions, just to see — this is more like, I'm not necessarily benchmarking these models, I'm more making sure the implementation is correct. And for that, I use simple questions like: what do llamas eat? What is one plus two? Just making sure, because it's actually easy — something I just fixed this morning — it's easy to mess up things like KV caching, where you don't clear the cache, and then there's something left from the previous answer, and the answer looks kind of correct but it's kind of weird. Simple questions can sometimes reveal that. So basically what I do is ask it the same question repeatedly, and see if the outputs still make sense, and then mix the questions up, in a loop basically. I'm not doing much more than that, but it's a great way to make sure the implementation works.
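A tiny sketch of that kind of sanity loop, with `generate` standing in for whatever prompt-to-text function your implementation exposes (the check assumes greedy decoding, where repeated calls should be exactly reproducible):

```python
def sanity_check(generate, prompts=("What do llamas eat?", "What is 1 + 2?"), repeats=3):
    for prompt in prompts:
        answers = [generate(prompt) for _ in range(repeats)]
        # With greedy decoding and a properly cleared KV cache, the answers should
        # be identical; drift across repeats hints at state leaking between calls.
        print(prompt, "-> consistent:", len(set(answers)) == 1)
        for answer in answers:
            print("   ", answer)
```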
Nathan Lambert [00:25:53]: Because I think in Transformers, they had a missing end token. There are so many little things like this when implementing stuff. The end tokens are such a pain, or the chat templating can always break things. Because it can also happen that you mess up pre-training and then you need to have something in the chat template that people might not know about. I think in one of the early OLMo models, we missed a new line in one of our documents when we were annealing it. So in order to fine tune it, you had to have an extra new line before the chat template, and most people will just miss that. Yeah. This is a very, very interesting point.
Sebastian Raschka [00:26:28]: It's like, you don't even notice it usually when you use something like, I don't know, chat GPT, because it's applied behind the scenes. But if you implement these things yourself, you have to be really diligent and careful to do it very consistently. Like one little, like you said, new line throws it totally off. It's, it's, yeah, it's interesting. It's like, you have to be, I noticed that I was actually working on some DPO stuff this weekend and my template for fine tuning and DPO alignment, the one that I'm working on alignment, the prompt template was a bit different and I got like garbage results. And then, oh, I, I stripped some line here, the new line character, basically something similar, like you said. So it's, it's very sensitive to that.
Nathan Lambert [00:27:04]: Yeah.
Sebastian Raschka [00:27:04]: Yeah.
Nathan Lambert [00:27:05]: This makes sense. Related: do you use Claude, ChatGPT, any of these regularly in your workflow? Are you team Claude?
Sebastian Raschka [00:27:13]: Uh, so yeah, so it depends. I have both and I flip back and forth between them. I don't know. I'm probably not really good at prompting, but sometimes I get better results with one over the other. Um, I think. I wouldn't say one is better than the other. They're just different. I would say I'm using.
Nathan Lambert [00:27:31]: That's kind of what I think. It's important. I think it's good for people to know this, because it takes some practice to understand, and to use both. Most people don't use both. Yeah.
Sebastian Raschka [00:27:43]: I would say when I use GPT-4, I must say I use the — it's called legacy now — the original GPT-4. I don't like the mini and the 4o versions. And for Claude, I use Opus, not the new one, but the previous larger one, the slower one. And for me, coding-wise — it's kind of weird — but most of the time I like GPT-4 better for code stuff. But I think what's better with GPT-4 is that it's a bit more up to date with knowledge, I think. Claude, though, is better when you say, improve my writing or something like that. It has fewer of these weird words — you know, "I delve into something" — it's a bit more natural, I would say, but
Nathan Lambert [00:28:34]: also not always.
Sebastian Raschka [00:28:34]: I agree.
Nathan Lambert [00:28:36]: It's like, it has a bit more flair and a bit more unpredictability. So I use Claude on my phone, but I've tried to use Claude for information transformation tasks, like LaTeX, or taking data out of a table, and sometimes it just refuses. I do research on AI safety — safety and bias — so if I put anything into Claude where I'm trying to transform that data, it just says no, because it's like, I can't comment on a mean story. Whereas OpenAI will just do it, and the processing that OpenAI does is very good. So I actually canceled my GPT subscription when I started Claude, but I kind of regret it now. I'm like, oh, now I need both, which is a little annoying. Yeah.
Sebastian Raschka [00:29:16]: It's like, yeah. So one thing is what is interesting though, is we, we're talking about GPT-4 and Claude here, but we haven't even mentioned Google Gemini.
Nathan Lambert [00:29:24]: I don't know.
Sebastian Raschka [00:29:24]: I personally, I tried the early versions. I don't want to say the newer versions are not good. I just haven't tried because I didn't need to, but do you have experiences with Gemini
Nathan Lambert [00:29:34]: or? I was using Gemini in search preview. So if you have the Google app — I'm recording this on video — at the top you could click on Gemini, which I was doing for a while just to play with it. But I don't use it on the web. They do have a nice interface that looks exactly the same, but somehow I got grandfathered into AI Studio, which I use for, if I record a podcast, I upload the podcast and say write chapters or something. And it actually works, which is pretty cool, to be able to upload an hour-long podcast. But for whatever reason, the Google interface, other than the Google app, hasn't stuck for me. And I think that's the biggest limitation. And I use it more in a Googly way, so I'm not as perceptive to style. I see. I see.
Sebastian Raschka [00:30:20]: So also, I'm curious. I just saw yesterday that Apple's on-device AI is a bit delayed, I think. And that's an interesting one. We will see how this will work, because these will also be smaller models, I think. And for me, I never really care about speed for these things. I just want the best possible models. So this is also why I was a bit disappointed when GPT-4o came out and GPT-4o mini came
Nathan Lambert [00:30:46]: out.
Sebastian Raschka [00:30:46]: It's like, ah, I don't really care about if it's faster or not. I just want it better. You know, I want to have better quality. I don't know. It's maybe it's just me.
Nathan Lambert [00:30:53]: I think for building applications, speed is really good. I have a few friends that run startups that are heavily built on language models, and they have a similar stack to Perplexity, where the user passes in a query, they have a primary language model request, and then a series of feedback requests or small requests on top of that. So when you're concatenating multiple requests, speed is extremely important. And when you're selling a product, speed is extremely important. But if you're tinkering and trying to learn, it matters much less. It's true. Yeah. Yeah.
Sebastian Raschka [00:31:19]: It's like the real world, like, sorry, not real world, but the individual user, um, yeah, using it as a tool in everyday life versus really building an application based on an API that makes sense.
Nathan Lambert [00:31:32]: Yeah.
Sebastian Raschka [00:31:32]: So there are two different use cases.
Nathan Lambert [00:31:34]: Yeah. Yeah. I think we're kind of talking about style. I have a section on RLHF here. Since you spend so much time on AI education: what do you think is most confusing to people about this whole post-training thing — instruction tuning, reinforcement learning from human feedback, other safety modules like adding a filter, and stuff like this? I'm really on the bandwagon of trying to convince people that RLHF is deeply tied to style, which is how this discussion of Claude versus OpenAI and Google and all these things goes. And I don't really know how to portray that from an educational, technical point of view. So I'll do an analysis of a paper, and I'll do DPO and scores and all these things. But at the same time, for most people reading my articles, the most important thing is probably to know that OpenAI is really smart about their style, and that's why they're so high on ChatBotArena. I've written about it a couple of times. I have another article in the drafts, which is essentially why GPT-4o mini broke ChatBotArena. Everyone's so upset that it scored so highly, but it's not that surprising if you look at historical events.
Sebastian Raschka [00:32:39]: So it's basically exploitation of the benchmark almost you're saying or like the benchmark
Nathan Lambert [00:32:45]: is focused on style and it really penalizes refusals. I get refusals when I use Claude, so it's definitely going to be downweighted. And OpenAI is really good at this; this is what they've been doing for a long time. But I don't really know how to educate on this. Have you thought about it? There was a question on Twitter of why you didn't include RLHF in your latest book. It was kind of a joke, but I took it out.
Sebastian Raschka [00:33:09]: Well, yeah, I can maybe answer that. It's in the works. No, so there are multiple reasons. One is that there are page limits per chapter, and originally it was meant to be in chapter seven, but it got way too long. Even without it, chapter seven is actually the longest chapter already. And what the other chapter is, is fine tuning.
Nathan Lambert [00:33:29]: Oh, sorry.
Sebastian Raschka [00:33:30]: Instruction fine tuning. Yeah, I didn't call it instruction fine tuning; I called it fine tuning to follow instructions. It was originally meant to have both, but then it got too long. And the other thing is, one book chapter takes about two months, and a lot of people really want the book before the new semester starts. So there could be another chapter on it, but it would be
Nathan Lambert [00:33:54]: another two months.
Sebastian Raschka [00:33:54]: And, I mean, it's not really an excuse, but the other reason is I was not happy with the results. And this is a very mathy topic. I was like, okay, I have this book which is very clear and hopefully makes a lot of sense, and then I have this really super complicated chapter at the end. I don't know if that's very satisfying to read or not.
Nathan Lambert [00:34:15]: Yeah.
Sebastian Raschka [00:34:15]: Where it's like, so you read this book, everything makes sense. And then it comes to this huge...
Nathan Lambert [00:34:19]: Why is RLHF so much mathier? I know there are a couple of core equations. The core equation is the RL optimization step, which is maximization of the expected reward subject to a penalty. And where does most of it come from? Compared to pre-training, which is like one equation — that is also one equation, but there's a lot of downstream stuff, I'm guessing. Yeah.
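For reference, the core objective being gestured at here is usually written as maximizing expected reward while a KL penalty keeps the tuned policy close to a reference model:

$$
\max_{\theta}\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big[\, \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \,\big]
$$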
Sebastian Raschka [00:34:41]: I think it's explaining a bit about reinforcement learning. I mean, you don't really have to explain reinforcement learning in the classic sense, maybe, but there's still the KL divergence and penalties and reward margins. And there are lots of things happening at the same time. And the code is also very long, especially if you want to track the rewards and stuff. For my instruction fine tuning chapter, I'm using exactly the same training function I implemented in the pre-training chapter.
Nathan Lambert [00:35:14]: And it's really nice.
Sebastian Raschka [00:35:14]: It's like, well, you can actually reuse everything. It's, it fits together.
Nathan Lambert [00:35:18]: Yeah. With what we're doing on OLMo, we can baseline our instruction fine tuning in our fine tuning code base, which also has some RL things, and in our pre-training code base. So it's nice to have both, but that is definitely why it's simpler. And the RL side is only getting worse in my mind, I think. We've seen that Llama has used rejection sampling for two iterations, and there's no public implementation of rejection sampling — at least not public enough to know that people have actually trained models with it — which is the idea of ranking completions with a reward model and then running instruction tuning again on the top completions.
Sebastian Raschka [00:35:54]: I think also in the recent Llama 3.1 paper, they used rejection sampling with DPO, for example. They didn't use RLHF with a reward model, but they used the reward model for the rejection sampling. And yeah, so I must say, I have the code for DPO. I wanted to do DPO because it's also more resource efficient — you don't have to train that reward model — for, let's say, the book. But I was not really happy with the quality of the output yet. So I must say, okay, it's not helping the instruction fine tuned model. And I think it's a general thing where — you might correct me if I'm wrong here, because you are the expert in RLHF — for me it's an optional thing. Unless you need a specific style or need to deploy something in a safe manner, it's maybe not giving you the best results. If you need a private model that just runs on your own computer and gives you correct answers, I don't think DPO or RLHF will make the answers more correct. They will just change how they look.
Nathan Lambert [00:37:01]: Yeah, I mostly agree, especially about what we have in public implementations. The public implementations are really good at improving on AlpacaEval. But if I'm training a model that I actually want to use, I don't care about AlpacaEval. I think I'm the most annoying person internally running these experiments, because I get so annoyed when only AlpacaEval goes up, and I'm like, that has made the model worse. I've been building internal demo tools, which is just making Gradio better and showing how to use vLLM for serving. But a lot of the models we put out for research are really, really annoying to talk to. You put "no yapping" or "just be concise" in the prompt, and it doesn't do anything. So a lot of the open datasets — and this is something that Nemotron and Llama 3 have shifted to — there's this newer evaluation, IFEval, which stands for instruction-following eval, which I think is a great one. It's things like: write a response with less than 300 words. It has these verifiable claims. And the Nemotron report showed that doing this in fine tuning really unlocked a lot more performance in the DPO stage. So I'm hoping that we start to get more evals than just AlpacaEval that are helped by this RLHF, and that'll help the whole ecosystem move forward, because it is in a kind of young, rough state right now. Yeah.
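The appeal of these verifiable instructions is that they can be checked with plain code rather than a judge model. A hedged sketch of what such checks can look like (these helpers are illustrative, not taken from the IFEval codebase):

```python
def follows_word_limit(response: str, max_words: int = 300) -> bool:
    """Verifies instructions like 'write a response with less than 300 words'."""
    return len(response.split()) < max_words

def includes_keyword(response: str, keyword: str) -> bool:
    """Verifies instructions like 'make sure to mention X'."""
    return keyword.lower() in response.lower()
```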
Sebastian Raschka [00:38:21]: And one last thing about this topic: like you said, your last point is kind of one of the reasons I was like, okay, if I include something on DPO as the last chapter, I don't know if it's still going to be used next year, or whether one of the many variants — ORPO and KTO — takes over. And right now, I mean, Llama 3.1 used DPO, which is a big endorsement. But to be honest, I'm not sure if this exact variant is here to stay.
Nathan Lambert [00:38:47]: And so I think DPO is here to stay. DPO will be a canonical example, much like PPO. But I think the exact things people are using will go away. PPO has stood the test of time through multiple eras of RL, so I don't think people use it in its exact form, but people are always looking at it. And same with DPO, just because DPO is so simple. The exercise — this is one of the best getting-started-with-RLHF exercises — is taking the Hugging Face trainer and modifying it to use the DPO loss, because you can reuse most of the infrastructure for batching and stuff like this, and then add that loss function, which is a few lines of code. That's the entry point to doing RLHF implementations. When I interview people, I make sure that they have looked at this DPO loss function before. And if they haven't, I'm like, I don't know if you're in the weeds enough; I feel like you should look at this.
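The loss he's pointing at really is only a few lines. A sketch of the standard DPO objective, assuming you have already computed summed sequence log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the implicit reward of the chosen response above the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```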
Sebastian Raschka [00:39:37]: And if you are listening to this and you are about to get interviewed by Nathan, I will hopefully have a tutorial on DPO, on implementing it from scratch, out by next weekend. This weekend I actually used Llama 3.1 to make a synthetic data set for that and got much better results. So it looks good enough to probably upload next week. So, nice.
Nathan Lambert [00:39:58]: Okay. Let's shift gears into AI research and AI education, which I think is the thing you have some of the most insight into. You run the Ahead of AI newsletter. I wasn't originally reading it when I subscribed, but now I almost always skim through to see what papers you uncover. I'm pretty interested in how you select papers, how much you actually prioritize reading papers and why, and any advice for people, because it's hard to sit down and do this. Speaking for myself, sometimes writing is how I force myself to read some papers. I don't know if you're in the same boat, but what is your worldview around reading AI papers these days — skepticism, excitement, everything?
Sebastian Raschka [00:40:42]: Yeah, that's a big topic. So I must say, I look at more papers than I actually literally read. I mean, I look at the abstracts and the titles, and that's like a huge funnel, a selection
Nathan Lambert [00:40:54]: process.
Sebastian Raschka [00:40:54]: I must say, I was an arXiv moderator for the machine learning section a few years back, and that got me into the habit. So how it worked was basically — maybe it's useful, because some people complain when
Nathan Lambert [00:41:06]: How does someone become an arXiv moderator? I didn't know that it was a community position.
Sebastian Raschka [00:41:12]: So that was originally Tom Dietterich. He was doing it by himself and he was looking for people to help him with that, because, as you mentioned, there is an ever increasing number of papers. And how it works is essentially that when you submit a paper to arXiv, you select the categories. But a lot of people select not, let's say, the correct — I wouldn't say not correct, but the preferred — categories, because, yeah, AI and ML.
Nathan Lambert [00:41:39]: It's like ML, AI, and then everything else. Yeah.
Sebastian Raschka [00:41:42]: And AI on arXiv is interesting. It's more like classic AI. It's not LLMs; it's more symbolic AI, that kind of stuff.
Nathan Lambert [00:41:51]: What do you think the difference between, or like as an educator, how do you define AI and machine learning? This was also one of my favorite interview questions to like see where they're at.
Sebastian Raschka [00:42:00]: Well, I go back and forth on that. Right now I would say AI is this big umbrella where you have deep learning and machine learning as subfields. But if you think about it, if you consider a logistic regression classifier, it is essentially machine learning. And if machine learning is a subfield of AI, you would say, okay, then logistic regression must be AI. But is classifying iris flowers really AI? I don't know. So today I would say
Nathan Lambert [00:42:28]: I also think about search as AI. Yeah. Like, yeah.
Sebastian Raschka [00:42:31]: Like, yeah. So there's the good old fashioned AI. I would say with AI, you have the machine learning and deep learning branches, but you can also implement AI with if-else statements, I guess. So that's how I would define AI. But I think nowadays when people talk about AI, they mean specifically gen AI — generative AI models like LLMs, Stable Diffusion, that type of stuff. But yeah, the arXiv thing, just briefly: basically, in the background it's also using machine learning or NLP to detect, based on the title and the abstract, whether the category actually matches. And if there's a mismatch, or in general as a moderator, you go through them and — oh, this looks good.
Nathan Lambert [00:43:17]: This looks good.
Sebastian Raschka [00:43:17]: This looks good.
Nathan Lambert [00:43:18]: They started exposing this to the user. So I submitted a paper recently under ML and it was like, this looks like language. And I've gotten papers stuck in moderation, so I'm always going to hit accept if they tell me it might be in the wrong category, because arXiv moderation is a black box that you don't want to get stuck in — as a user, I mean — but I understand the service it's providing. So it's good to expose that to the user. And if anyone's listening, just click it, click yes. It's not worth delaying your release by getting stuck in moderation, and it helps arXiv out. Yeah.
Sebastian Raschka [00:43:50]: And just the last thing on that: by default, everything gets accepted. However, sometimes something gets flagged, if there's duplicate content or if it doesn't look like a paper — sometimes people submit one-page blog posts or something. So there are sometimes also false positives, and then it gets stuck. But long story short, that got me into the habit of reading the titles, and that's what I still do. Also for my Ahead of AI newsletter, I just look through the titles and select. How have titles changed?
Nathan Lambert [00:44:21]: Like titles have changed a lot though, as I feel like they used to try to be. Accurate. Mostly descriptive. Yeah. Descriptive, right? And now they are a mix of, it's more of a storytelling than descriptive. I think it's the right way to tell it.
Sebastian Raschka [00:44:36]: At least we don't have the, it's all you need anymore. I feel like this went away finally, but yeah, you're right. It's more.
Nathan Lambert [00:44:43]: It ended with Rylan Schaeffer's test set paper — Pretraining on the Test Set Is All You Need. Yes. Did that make it onto arXiv? It did.
Sebastian Raschka [00:44:51]: I think I also had it featured in my newsletter one time. I think. Or not featured, but at least mentioned. And so how I select papers is also often selfish. I read or select papers for the newsletter that I find interesting. And because I think this is also for education. When people ask me about how I would suggest doing things, I think the most important thing is to talk and work on things you are interested in. I think it would be really hard to do a good job if it's a topic that is not interesting to you. For example, I know, I don't know. R, sorry, or Rust is interesting, a very important topic, but I'm not into it. So I don't try to, let's say, make videos or content.
Nathan Lambert [00:45:35]: Yeah.
Sebastian Raschka [00:45:36]: So it's like, I think if there's something you're excited about, I think it comes almost naturally that you want to talk about it. So in that sense. So the newsletter, I almost, it's weird, but I almost write it for myself. It's like, I find it interesting.
Nathan Lambert [00:45:49]: How much do you spend reading versus writing when you're reading these papers and writing a blog post? I'm guessing a lot of it is just the natural process of synthesis is what you put into the newsletter. It's not like you're doing it from my read. It's not like you're doing a ton of scaffolding and editing after the fact, which seems similar to what I do.
Sebastian Raschka [00:46:09]: Yeah, you're right. I don't do, I don't spend too much time on it in the sense that I wish I could, but I have a full-time job. It's literally just the weekend project where I aim for one newsletter per month. Of course, I would like to do more, but there was also a book to write on weekends or sometimes I'm doing videos. It's like keeping it fun, you know, like where it's like, okay, this is not a chore. This is something that is supposed to be fun. Like in that sense, I read a paper and then I take notes and then I collect them and spend maybe half an hour, an hour to polish them a bit up or make some figures. And that's it per paper, I would say. And so I also don't write the whole newsletter on one day or one weekend. It's really spread over the month. I read a paper. Oh, this is an interesting one for other people. Let's write this up basically. And then this way I collect material over the month and then.
Nathan Lambert [00:47:00]: Yeah. What motivates you to work on this stuff? Is it purely like education? Because I, in some ways relate to that. I've been in that mode before.
Sebastian Raschka [00:47:09]: Yep. So if you have noticed, I don't have any sponsorships or something.
Nathan Lambert [00:47:14]: Never done that. Respect.
Sebastian Raschka [00:47:16]: I will never say never, but it's not something I do. It's really just a hobby. And I do like discussions that come around it. There's a certain satisfaction that if you put it out, it helps others and people tell you positive things about it. It's kind of very gratifying. I don't know. There's like a reward in a sense. And what's also cool is there are a lot of people. It's like being part of the community and exchanging information because there are also a lot of people who sometimes know something I don't know. And this is really, I think, really cool. You write about something and then someone, Hey, have you seen this? This seems like it's taking it to yet another level. Or this is the same idea. It's even better or something. And this is super cool where you get this effect where you learn by doing this, actually, because there's always someone who knows a bit more than you do in a specific area. So, yeah.
Nathan Lambert [00:48:07]: Yeah. I feel like it's increasingly important these days and increasingly impactful, because so much of research has become closed off for business reasons. So there are fewer people doing more of the work. I don't like it. I always feel like people don't realize how few people are informed and share on any given topic like AI research. If you take away three people — I've yet to find people that just tweet the same random RLHF crap that I tweet. It's like, I don't do it because I just say random things, but there are not that many people that represent each of these corners. Ahead of AI; I think Jack Clark's Import AI is important — I should have him on the pod. I think I've talked to him a few times; he's great to talk to. And his is the same thing. There are these few people disseminating AI information, which is crucial for policy and other angles. Have you ever gotten criticism that your work is accelerating AI and that you are a safety risk? I've gotten some critical emails that are like, you shouldn't talk about this.
Sebastian Raschka [00:49:07]: Yeah, I've more gotten emails about the fact that I talk about LLMs is not good because LLMs violate copyrights. I mean, not that I do it, but that other people's LLMs do it.
Nathan Lambert [00:49:21]: And I'm happy that I haven't had this audience very much, but it seems this is like one of the challenges of having a tech audience is like you cultivate it in kind of one of two, like there's multiple ways to go. And one of them is like this all data is for language models is theft thing. And I just don't know how to deal with it because like I disagree, but the normally people that aren't receptive to it, which is really hard. It needs to be played out. Yeah.
Sebastian Raschka [00:49:47]: My book also just to make extra sure all the data I use there is so the pre-training data is public domain data, like a book from Project Gutenberg. And for instruction fine tuning, I did my, I created my own data set basically. So just to avoid any issues, you know, like. Did you do, you wrote it by hand?
Nathan Lambert [00:50:06]: Yep.
Sebastian Raschka [00:50:06]: So I took, no, actually I used, I used part of an LLM and some by hand.
Nathan Lambert [00:50:12]: Yeah.
Sebastian Raschka [00:50:12]: So it's a great exercise.
Nathan Lambert [00:50:14]: Yeah. Yeah.
Sebastian Raschka [00:50:15]: And for the synthetic one, I use Llama 3.1 now, too. I mean, you can tell me about that a bit — that's maybe interesting for the audience, how to generate a preference data set, because there are multiple ways. Naturally it's crowdsourced, right? You ask people: you have the model generate two answers, or have flavors of the model generate answers, and then — which one do you prefer? But that's not really scalable. And so you could technically do the same thing with an LLM. You could basically have the LLM generate a more polite version, because I think LLMs — even the small, open source 7B models — are very good at rephrasing things or evaluating things. They're not necessarily good at generating the answer in the first place if they don't have a reference, but given a reference, I think it's super useful to use open source LLMs in that sense.
Nathan Lambert [00:51:07]: I'm surprised that this hasn't caught on sooner, but I think it's starting to catch on. In the Meta report, they essentially have edits, so they make their preference pairs as edited better than chosen, better than rejected. And you can create multiple pairs by binarizing. There are a few research projects that have done this. Constitutional AI is popular, but that hasn't really been reproduced. One of my collaborators slash friends at Synth Labs, Louis Castricato, did a paper on the pink elephant problem, which is using revisions to get the model to not just say whatever is in the question if you ask it not to. We did a follow-up work that's out literally today on self-directed synthetic dialogues, where you have the language model generate a plan and then follow the plan, and then you can also do revisions on it. I think Nemotron did this with prompts, too. So it's really getting going, but it's something that took longer than I expected. There's the kind of question — this is too big of a topic to go into — of how you use GPT-4 feedback. Are your completions from two different models, or the same model with different generation settings? How do you use humans? I think the labs are using humans for preference data because it eliminates some of the problems in language modeling. And that's one of the biggest impactful research questions in alignment: we can't afford the $1 to $10 million dataset, so how do we do this? That's what we're starting a project on at AI2 right now. And it's a big open question — I don't know where it'll go, I don't know how far we can reproduce the Llama 3 alignment methods. Yeah.
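A hypothetical sketch of the binarization idea mentioned above: a ranked set of completions (for example, edited better than chosen better than rejected) turns into pairwise preference examples for DPO-style training. The field names are illustrative.

```python
from itertools import combinations

def binarize(prompt, ranked_completions):
    """ranked_completions is ordered best-first, e.g. [edited, chosen, rejected]."""
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(ranked_completions, 2)
    ]

# Three ranked completions yield three preference pairs.
pairs = binarize("Rewrite this reply politely.",
                 ["edited reply", "originally chosen reply", "rejected reply"])
```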
Sebastian Raschka [00:52:46]: So I would say the LLAMA-3.1 paper or the LLAMA-3 paper, it was like a 93 page paper
Nathan Lambert [00:52:52]: and it was great.
Sebastian Raschka [00:52:52]: I love it. It's like a lot of detail, but on the alignment part, I feel like I wish there was more information
Nathan Lambert [00:52:58]: about it.
Sebastian Raschka [00:52:58]: Even Llama 2 had more information, where they showed what the improvement actually was over the different stages they added on top of supervised fine tuning.
Nathan Lambert [00:53:05]: So I'm talking to Ross Taylor tomorrow, and I'm going to ask him that specific thing. On Latent Space, Thomas Scialom, one of the leads, said that most of their gains come from RLHF rather than SFT. So I think the open source community is over-indexed on instruction fine tuning, because it is accessible and we have the data. And this is one of the ways I try to guide the community by doing things: go do RLHF. Don't worry about instruction tuning data sets — we'll just leave those the same — and go find more preference data and keep playing with this. And don't worry about the DPO methods; just literally go make preference data and keep trying to train things. Don't implement a new loss function.
Sebastian Raschka [00:53:48]: Practical question to an expert like you. How good is actually a preference data set if you download it, if both the chosen and the rejected answers, if you download a preference data set, they're not generated by your model, right? And if you have a model and you use the responses that the model has never basically seen before, does this actually work or would it be advisable?
Nathan Lambert [00:54:11]: So the two most popular preference data sets in the open right now are UltraFeedback and Nectar, or variants of them. Both of those are collected from large suites of other models. And part of my view — there haven't been data sets or papers that have trained really good models using on-policy preference data from the model you're training. And I think that's a question we need to answer: how do we get UltraFeedback-level results with on-policy data? Because all the labs are using on-policy data. I wrote about this in one article. I have a theory that UltraFeedback and Nectar, these general data sets, work so well because within them there is something close enough to your distribution, and you don't have to get it quite right. It's just a gentler, more uniform learning signal for the models doing preference tuning. But we don't know. That's something that I want to answer.
Sebastian Raschka [00:55:02]: Yeah, this is an interesting one. I would also like to know the answer, because that is one thing where I got a bit stuck when I was writing this DPO chapter with smaller models. I think bigger models hide these weaknesses a bit, because they have been trained on so much data that, like you said, it's kind of in distribution already. But if you train a small model, it would be out of distribution, right, if you use someone else's preference data set? I noticed even something simple: you train a model on one simple instruction data set, let's say something like Alpaca. And then — just to have something visual — say you want the model to generate Yoda speech, where every sentence is reversed. The model has never seen sentences like that, unless it was maybe in the training data. And in that sense, it doesn't work well at all, because you're asking the model during preference tuning to write sentence structures it has never grammatically written before. What I found is that it's much better if you, I don't know, say "be more polite" or prefer a more polite answer, because it uses the same grammar. So things like that, basically. And yeah.
Nathan Lambert [00:56:08]: Yeah, I think that's a smart approach. It also might be why learning rates are getting so low. Where like all the learning rates for DPO and things have been going down in the fine tuning space. And it might just because distributionally, like we're far off from the model. There's the other theory that the model is like really, really done training. So they get it to a really good optimum. You don't want to move it from them. But it might just be that like our data sets are in the wrong space. Yeah.
Sebastian Raschka [00:56:32]: So you try to be gentler with a lower learning rate.
Nathan Lambert [00:56:36]: Yeah. All of this stuff changes fast, but not fast enough. This UltraFeedback data set we were talking about came out last October. So we're almost 10 months in and it's still the state of the art data set, and it's only like 50,000 examples. So there's so much opportunity for someone at this level — go build data sets, if anyone is watching. Because I think we're so far off from where we could be, just because people don't know how to make good preference data sets.
Sebastian Raschka [00:57:02]: Well, now we have Llama 3.1 70 and 405 billion that allow us to do that, right?
Nathan Lambert [00:57:08]: We'll see. Yeah. I was wondering, this is a change of topic, but how do you think like, do you think AI will change our jobs in writing? How do you see AI coming for this kind of educational space? Like how much of what you do as an educator could be taken in N years by AI?
Sebastian Raschka [00:57:26]: Well, I think, of course, it will automate away some things, because nowadays you would ask a model something instead of searching for it and reading it on a website. But I do think, for the creation process, you still need a human to put it together well. Because I think LLMs are nowhere near generating a whole article that is actually good, I would say. They can generate the right things, but you still have to put it together. They can generate good blocks of text or something like that, but you need to — you become maybe more like the editor, in that sense. But I'll try this.
Nathan Lambert [00:58:09]: Also like, do you write, do you have AI write any parts of your articles? I'm so scared for like moral reasons to have any AI writing in it. I'm like, it's just a slippery slope. It feels like I could get addicted. Yeah.
Sebastian Raschka [00:58:21]: So I don't have it write anything from scratch, but I sometimes do do that. Especially — I mean, I'm a non-native speaker, and some days I have a harder time than others making it sound right. It's like, okay, this is what I want to say, but it doesn't sound right. And then: can you reword this with a focus on XYZ, or something? So it's basically like a thesaurus where you find similar words — you find similar sentences, just rewording it, these types of things. But one weakness it has, now that you mention it, is that LLMs can't really generate figures. You know, maybe that's coming.
Nathan Lambert [00:59:01]: I don't know.
Sebastian Raschka [00:59:01]: You can probably do that with TikZ, the LaTeX thing, at some point, but right now it's nowhere near being able to generate any useful figure. And I think learning is very visual, too. If it's just text, it would be really hard to learn anything.
Nathan Lambert [00:59:17]: Yeah.
Sebastian Raschka [00:59:17]: So you can, of course, but I do think — you know, there's a saying, an image is worth a thousand words, right? So in that sense, you still need someone, the mastermind behind an article, even if it's just an editor. I don't think LLMs can replace everything, at least. And we'll see. I mean, we just don't know how much better, let's say, GPT-5 — as a placeholder here — will be than GPT-4, you know? So maybe if it's saturating, who knows, right? Maybe it will be five more years until we get into scarier territory in terms of replacements. We'll see.
Nathan Lambert [00:59:55]: Yeah. I mostly avoid the agent word, but it does seem like there's enough cultural investment in the Bay Area and among tech executives to do something. They're going to get to something that is triable, which I think is mostly automatic Google searching and more code execution, which is going to be interesting, but I have such wide expectations of what it actually means. That's probably the next big shift. I think Llama 3.1 is probably leading the year in terms of AI news right now. This recent DeepMind thing on the math might be a better example of what's really hot news. I need to go read more about it. There are some long write-ups on the qualitative differences between the AI math and the human math and the different directions they're going, so that's what I want to read about. But it'll shake things up. We're multiple years into this fast phase. It's not exactly new at this point. Yeah.
Sebastian Raschka [01:00:57]: Last thing on that: I do think, though, that LLMs make good assistants in the literal sense. One thing I use them for, for my newsletter, is at the end I have a list of all the papers I found interesting, like 30 to 50 papers usually. And usually, by hand, I edit the author names — the last names of the first three authors. Now I use an LLM to go to the website and get the names of the authors, basically. And this is where it saves a lot of time. You could do that without LLMs — you could write some code to do that — but it would probably take me half a day, because I'm not good at this web scraping type of code. And I think in that sense, it is actually a useful assistant for certain things like
Nathan Lambert [01:01:44]: delegating actions. I think it'll keep creeping up. I don't expect their usage for those things to go down because they already are so useful. And the little coding things, the hacking data together, the automatic searching, people aren't going to want to stop using that. I don't know if it supports the whole valuation we have, but it's fun to be in a space where we get to try new things. As a computer nerd, it's really fun to have a new type of software that we can try all sorts of things in our workflow. And I think that's underrated. So I don't know. Thanks for coming on. Any last things you want to discuss?
Sebastian Raschka [01:02:19]: Yeah, I just wanted to say thank you for the invitation and I hope you keep creating these awesome newsletters. I think this is much needed because there's so much hype, like you said previously, it's
Nathan Lambert [01:02:32]: creeping up on us.
Sebastian Raschka [01:02:32]: There's a lot of, let's say, over-evaluation and praise. And something that cuts through this, honest, straightforward, no b******t content, is much needed. So yeah, I hope you keep creating that. It was fun to chat. And yeah, to everyone out there, I think what also keeps us motivated is the awesome community, that people give feedback and discuss things and bring things up. And yeah, without people giving us feedback, we probably wouldn't be doing this, because it's a lot of fun to be in this space, I must say. Yeah, it's fast moving, but there's always something interesting every day.
Nathan Lambert [01:03:14]: Yeah. Yeah, this is really interesting. We covered a lot of the kind of low-level details of what it's like trying to use language models on a day-to-day basis in July of 2024. So thanks for coming on. And I'm sure we'll talk soon. All right.
Sebastian Raschka [01:03:27]: Yep, it was nice meeting you and see you then. Bye.
And how to understand Llama 3.1's results.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gpt-4o-mini-changed-chatbotarena
0:00 GPT-4o-mini changed ChatBotArena
3:23 Llama 3 in the arena
5:13 Partial solutions and next steps
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_013.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_019.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_021.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_025.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_039.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/new-chatbotarena/img_043.png
Defining the future of the AI economy and regulation. Is Meta's AI play equivalent to the Unix stack for open-source software?
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/llama-405b-open-frontier-model
00:00 Llama 3.1 405b, Meta's AI strategy, and the new open frontier model ecosystem
01:37 Meta's open frontier model
03:51 Zuckerberg's vision for open-source AI (vs. reality)
08:35 Does the Llama 3.1 license support open-source AI?
12:55 Different futures for regulating frontier models
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_008.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_010.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_015.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_018.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama-405/img_050.png
SB 1047, AI regulation, and unlikely allies for open models
The rallying of the open-source community against CA SB 1047 can represent a turning point for AI regulation.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sb-1047-and-open-weights
00:00 Introduction
01:53 SB 1047 and targeting regulation
07:57 Unlikely allies of "open"
12:05 What would I regulate today?
I Switched to Claude 3.5
Speculations on the role of RLHF and why I love the model for people who pay attention.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/switched-to-claude-from-chatgpt
00:00 I Switched to Claude 3.5
03:57 Product priorities
05:15 RLHF's peak?
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_016.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_018.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_020.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/claude/img_022.png
I’m really excited to resume the Interconnects Interviews with Dean W. Ball from the Hyperdimensional Substack (you should subscribe). We cover the whole stack of recent happenings in AI policy, focusing of course on California’s bill SB 1047. We cover many, many more great topics here including:
* What will happen in the case of a minor AI disaster,
* If Meta will release the 405B model, and why,
* The status of Chinese open-source AI,
* Training on model outputs,
* Anthropic’s recent strategy,
* What scaling laws actually mean,
* Creating content and shifting the needle of the AI discourse.
Watch the video on YouTube below or listen on podcast players here.
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Chapters
* 00:00 Intro and Welcome Dean Ball
* 02:44 The Origins of California Bill SB1047
* 08:56 The Evolution of Bill SB1047
* 13:00 How SB1047 Affects Fine-Tuning
* 20:00 The Future of Bill SB1047
* 21:58 The Impact of AI Disasters
* 29:02 Meta and its 400 billion Parameter Model
* 32:25 Open Source AI and the Chinese Market
* 37:37 The Future of Open Source AI
* 43:35 Synthetic Data, Licenses, and Future AI Development
* 45:18 Anthropic's Approach to AI Safety
* 50:46 Scaling Laws
* 53:01 The Role of Audience in Influencing AI Policy
Links
* Dean’s series on SB-1047: one, two, and three.
* Other AI policy Substacks: Jural Networks and Intersecting AI
* Senator Scott Wiener. CA SB 1047 itself.
* Another post on CA SB 1047 from Answer AI.
* Situational Awareness by Leopold Aschenbrenner.
* Lina Khan on her P(doom) and warnings in support of open-source.
* Ben Thompson’s Framework for Moderation in technology.
Transcript
Nathan Lambert (00:00:01): Hello, and welcome back to the Interconnects interview series. It's been a few months. I'm really excited for this one. We're here with Dean Ball, who is a research fellow at the Mercatus Center. He works on AI policy right now, and he's the author of the Hyperdimensional Substack, which is kind of the AI policy Substack that emerged when I was spamming into the void that we need to have some good AI policy newsletters out there. There are a couple more that I could add to the show notes of this that I'm aware of from friends that used to be at OpenAI, friends at AI2, so I'll add some of those as well.
But in this kind of summer slowdown of releases, I thought it would be a great time to kind of revisit some of the core themes on AI policy, open versus closed, kind of things that I'm wondering about in the future that I know are coming that are looming AI disasters, what some of these closed source companies are trying to do in the policy space. I think this is the sort of interview that we could probably do multiple times. I think we've started talking in DMs and it's clear that we're aligned on a whole bunch of things. We read each other's work. I think this should be kind of fun and I'm just happy to do this.
I think the core of this interview I'll give you a chance to introduce yourself if you want, if you want to add anything else that I missed, and then we're just going to go into this California bill SB 1047. Probably talk about this. I'll ask you about the story of how it happened and then where we're at now. And I think that'll kind of lead into a lot of interesting debates. So do you have any background you want to add that makes you an interesting person in the AI space? Or is it just that there's so many things that need to be done in AI that if you're focused, you can kind of have an impact in an area?
Dean W Ball (00:01:44): Yeah, I mean, I think basically, you know, I've mostly written on policy unrelated to tech for my career, state and local a lot. So the fact that a lot of the policy action on AI seems to be happening at the state level has been very relevant. But I've also just always been paying attention to the AI literature. I remember 2017, I think, when the Alec Radford Amazon product reviews paper came out, and I said to a colleague, this is going to be a big deal one day, I think. And, you know, I tried to use GPT-2 to do social science research, like policy research, back in 2019. So I've been playing around with these for a while, and I try my best to write as a combination of a relatively technically informed person and someone who understands the policy side.
Nathan Lambert (00:02:43): Yeah, so I think we should jump right into it. What is the origin story of this California bill? My understanding is it just kind of showed up and everyone in the Bay Area was like, where did this come from, having actually passed the state Senate? Does your story start there as well? Or did you kind of know this was coming?
Dean W Ball (00:03:03): So I saw, Scott Wiener, the author of the bill had telegraphed that he was working on, something in AI policy, I think in maybe October or November of 2023. And then the actual bill text came out in early February. And I remember when it came out because I was having dinner with my wife and, I was like, I have to drop everything and go work on this. I stayed up until like one in the morning, you know, reading the bill and writing about it. And that was kind of my first Substack post that really went anywhere in terms of audience. And so, yeah, then there was kind of a couple months of quiet. You know, I had been writing about it, but people weren't really focused on it in the Bay, in the tech community. And then closer to around April, people started to pay attention. And the conversation has been pretty, you know, pretty active since then.
Nathan Lambert: Yeah. And like, what does it actually say? Like, what are the core points? I know there's stuff around thresholds and giving California power, like California creating a new body. Like, what do you think are the few core things that people should know? I think there's probably some details, but just the core stuff.
Dean W Ball: Yeah, so the core idea behind SB 1047 is to create a regulator inside of the California government called the Frontier Model Division that would oversee models. Really, now the threshold is models that cost more than $100 million to train. We can talk about how specifically you really even specify that cost, but really all the bill says is $100 million of compute costs to train. Those models are subject to a series of testing and safety requirements, and more importantly, I think, a liability regime that basically says that most downstream uses of that model, including in the case of an open source model, most fine tunes, most uses of models combined with scaffolding software, other software. So things that are very combinatorially distinct from the initial model release. Any downstream misuse is the legal responsibility of the developer who made the original model.
So, if I fine-tune Llama 3 and then someone else puts that in an app and then a user of that app misuses it in a way that causes a serious harm... the bill does have a high threshold for the harms that have to count here.
Nathan Lambert (00:06:00): Is that eligible? Is it specific? Do they have a safety taxonomy?
Dean W Ball (00:06:05): So, they basically, it really, it's a static threshold that comes in at $500 million of damage. They would say harm to critical infrastructure and things like that. Critical infrastructure pretty much just means everything. It's kind of a catch-all term. It's a little weird. Critical infrastructure, the way we think of it, like highways and power plants and stuff, is actually a subset of critical infrastructure. Critical infrastructure includes things like casinos and ballparks and amusement parks and all kinds of stuff. So anything really, any major cybercrime, bio attack, all the things people are worried about with AI would count. And the developer of the original model, which is many stages upstream from where the harm happened, would have legal responsibility.
Nathan Lambert: So the expected-value risk for open models in this bill is probably low, but if you're comparing on the two axes, the open versus closed risk, the risk for open models is way higher because of this downstream-use term. And that's for the people asking, oh, why is everyone that cares about open AI, like open AI as in the field, mad about this? So I think that was why everyone was kind of up in arms.
Dean W Ball: Yeah. And the other thing to keep in mind, though, is that under this bill, if you're making a model that costs more than $100 million, you have to submit a variety of documents annually about your safety procedures and sort of testing regime on the model to the Frontier Model Division. And I think something that's not all that well understood, and it's kind of just how administrative law and regulation works in America, but that the tech community might not understand, is that the Frontier Model Division has the capability to create problems for developers even if their model's never used for a hazardous capability. They could see your safety plan and say, we don't like this, or we want more information on this. And they can subpoena you. They can bring you to court. They could order a cease and desist.
Nathan Lambert: Yeah. And this is where your post on the political economy of AI regulation comes in. Like, what are they going to do with that kind of open-ended power?
Dean W Ball (00:08:40): Yeah, it doesn't necessarily. I mean, they're an agency that has all the regulatory powers of an agency, which are substantial. I think one other point that is worth making about 1047 that would be relevant to your audience in particular is this: in the initial version of this bill, for any fine-tune, no matter how substantial the fine-tune is, the original model developer held the legal responsibility and had to test their models with the realization that people could fine-tune them or do whatever they wanted to them, modify the weights in arbitrary ways, which obviously doesn't really make a ton of sense.
Nathan Lambert (00:09:38): I was going to ask about the edits. This is where I probably stopped reading as closely as I should have.
Dean W Ball: In a fundamental sense, everything I've said so far has basically been true of the bill for the entire time: the fundamental points, the liability, the Frontier Model Division, these kinds of things. Basically, the bill is making developers guarantee model safety, when I think we're probably both in agreement that safety is not a model property.
Nathan Lambert: Yeah, at least in the way that the bill frames it. They're concerned about infrastructure. If critical infrastructure is the primary target, safety is not a model property. This is why I ask about a taxonomy. We're going through this exercise at AI2 to kind of say, what do we mean by safety? And it's a total headache. It's extremely hard to get this right and to communicate it clearly. So now when any other organization or somebody mentions safety, I'm like, oh, do they actually define it? It's such a risk to put it into words, because when you put it into words, you're exposed to people saying, so you don't care about X, Y, and Z. And if you don't put it explicitly, it's like a total trap.
Dean W Ball: Well, and actually just to expand on that a little bit, because, you know, the Center for AI Safety, which is the nonprofit that was heavily involved in authoring the bill with Senator Wiener, one of their primary concerns is bio risk. So, people making biological weapons with AI models. And I think people who don't understand biology all that well have this idea that you can say, oh, well, that's a good biomolecule to make, and that's a bad one. And so we'll make a list of the bad ones and you can't make the bad ones. And that would be a way to, like, RLHF a biological foundation model.
Nathan Lambert (00:11:34): My understanding of biology is that the more powerful, the more specific a molecule is, it'll probably have good uses and downsides. It's like Teflon. Amazing physical properties, extremely bad downside health concerns. I would guess, obviously, if you're consuming... engineering like living creatures it's going to be a little bit of a different consideration but yeah.
Dean W Ball (00:11:56): But I mean, also, a lot of biomolecules, just like code, their goodness or badness is really context dependent. They'll do different things in different settings, and so it's not necessarily easy a priori to identify what's good or bad. How even would you steer a biological foundation model, like something that's predicting protein structures or nucleic acid sequences or whatever it may be? How would you even steer that towards safety? It's not a priori obvious that that's currently possible. But, you know, I think this idea that safety is something that can be legislated in that way is a fundamental problem.
Nathan Lambert: So what is next? Or you could continue. I was going to ask, what is next for the bill?
Dean W Ball: Oh, yeah, yeah. So I'll just say one thing about the fine-tunes in the most recent amendments to the bill. So fine-tunes now, if you do a large fine-tune, large being anything more than 3 times 10 to the 25 flops involved in the fine-tuning compute,
Nathan Lambert (00:13:13): I need to learn all these numbers. I need to learn what they mean. I need to know. Essentially, it's a linear relationship between model size and tokens. And then you should be able to have specific points, which is like, is Llama 3 base crossing that? Like 15 trillion tokens at 70 billion parameters? I don't know. I'll loop back on this. I need to know this in the future.
Dean W Ball (00:13:35): It would be however much compute you use to fine-tune the model. That's how this threshold is calculated.
Nathan Lambert: Yeah, I just want, like, a rule of thumb for people. That would be great. I'll figure that out; it's on my to-do list of mental math. That would be great.
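As a rough sketch of the rule of thumb being asked for here, and not anything stated in the bill or the conversation itself: a common approximation is training FLOPs ≈ 6 × parameters × tokens, the "6ND" estimate. The model sizes and token counts below are illustrative assumptions, not claims about any specific release.

```python
# Back-of-the-envelope only: estimate compute with the common 6ND approximation
# and compare it against the 3e25 FLOP fine-tuning threshold discussed above.

def approx_flops(params: float, tokens: float) -> float:
    """Approximate training FLOPs via FLOPs ~= 6 * parameters * tokens."""
    return 6.0 * params * tokens

FINE_TUNE_THRESHOLD = 3e25  # FLOPs, per the amended bill as described in this conversation

examples = {
    "70B params on 15T tokens (pretraining scale)": approx_flops(70e9, 15e12),
    "70B params on 1T tokens of fine-tuning": approx_flops(70e9, 1e12),
    "8B params on 100B tokens of fine-tuning": approx_flops(8e9, 100e9),
}

for name, flops in examples.items():
    print(f"{name}: {flops:.1e} FLOPs, over threshold: {flops > FINE_TUNE_THRESHOLD}")
```

On this approximation, 70 billion parameters on 15 trillion tokens comes out around 6e24 FLOPs, below the 3e25 figure; treat all of this as ballpark math, not legal analysis.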
Dean W Ball: That would be great to do um but if you're in that situation uh then the bill applies to you too. So you have to create a safety plan and a certification that you submit to the Frontier Model Division every year. Starting in 2028, like the foundation models, you'll be subject to mandatory annual audits.
Nathan Lambert: Is this prescribed to anyone that trains in California or anyone that operates their model in California?
Dean W Ball: Anybody that distributes a model in California. So the bill covers at least everyone in the United States, if not really everyone in the world. They could certainly sue you in the United States if you're an American company or operating in America. Now, the important thing about that fine-tuning threshold, though, is that it can be lowered arbitrarily by the Frontier Model Division. The $100 million threshold for foundation models is fixed in statute, so you would need an act of the legislature to change it. But the fine-tuning threshold has no dollar amount. So the same problem with compute thresholds, that compute cost is getting cheaper and cheaper rapidly over time, applies, and the Frontier Model Division can change that threshold arbitrarily.
Nathan Lambert (00:15:35): Who elects these officials? Is it like the governor of California? Or the federal branch or something?
Dean W Ball (00:15:43): This is all state-based.
Nathan Lambert: Oh yeah, I meant in the state.
Dean W Ball: Yeah, so the Frontier Model Division would be staffed by unelected civil servants, primarily, and led by unelected civil servants. And then on top of the Frontier Model Division, the newest version of the law creates a governing committee. That committee is composed of, I believe, three members appointed by the governor and confirmed by the legislature, and then two members that the legislature itself appoints, one from each house, the Senate and the Assembly.
Nathan Lambert: Mostly what I would expect.
Dean W Ball: Yeah, yeah, exactly. And like, I think there's a requirement that, you know, one person has to be from industry, one person has to be from the open source community. There's a lot of, there's a lot of bones that they throw to the open source community.
Nathan Lambert (00:16:37): Random credentialing.
Dean W Ball (00:16:38): Yeah, yeah, exactly. But I mean, I don't really, that could be anyone, you know, really, like, yeah, who's who's from the open source community? Exactly. Yeah.
Nathan Lambert: Um, so what's next for this? It passed the state Senate and then it got revised by the, what is it, the state assembly? Is that how it works? The state assembly revised it. Then they would have to vote, and then the Senate would have to vote again, and then the bill would have to actually be signed. Is that how it works in California? Yeah.
Dean W Ball: Yeah, basically. So it's right now making its way through the committees. It went through the Senate committees and then was voted on by the whole Senate. Now it's going through the Assembly committees. It just passed one, I think last week or the week before; the Consumer Protection and Privacy Committee is what it's called. I could be wrong on the exact name, but that's the basic idea. So they just passed it and did some amendments. It goes to the Assembly's Judiciary Committee next, and then eventually it will go to the full Assembly for a vote and then to the governor for signature or veto.
Nathan Lambert (00:18:04): When would this start? When would it kick in?
Dean W Ball (00:18:00): Uh the bill would kick in I think most of its provisions would start January 1, 2025.
Nathan Lambert (00:18:05): Yeah. And the original vote in the state Senate was very pro, right? It wasn't even close; it was just like, oh, this seems like a normal checkbox. But this is kind of a cynical take: I kind of viewed it as mostly these politicians serving constituents that know that AI is a big thing, but know nothing about AI. So for a politician, saying, look, I'm taking action on AI, when their constituents are not going to be able to decipher any of the details, is probably a political win.
Dean W Ball (00:18:31): Yeah, well, and I think also worth noting is that Scott Wiener, the state senator who authored the bill, is a very powerful figure in California politics. And I would guess that a lot of the senators who voted in favor of the bill really barely looked at it and aren't even necessarily thinking about their constituents. First and foremost, they're thinking more about, well, Scott's my ally, I need X, Y, Z thing from Scott, so I'm going to vote yes on his bill. And that dynamic will apply at the Assembly too; it is very common. The California legislature has a history of sometimes even unanimously passing bills that the governor then vetoes. So the governor is often expected to be a little bit the adult in the room on this stuff.
Nathan Lambert (00:19:25): This is so funny. I have no comment.
Dean W Ball (00:19:27): I do suspect that the governor is probably going to be, whether or not he wants to, he will probably be the final voice on this bill.
Nathan Lambert (00:19:41): So that's who people are talking to, probably, realistically, from what you've said.
Dean W Ball (00:19:46): Yeah. So, I mean, the one thing, and this is, again, this is a kabuki that's very common in state legislatures. The governor has not said anything publicly about SB 1047 specifically. I think he's as a general matter, he tries not to comment on legislation that's in process.
Nathan Lambert (00:20:08): That makes sense.
Dean W Ball (00:20:09): Yeah. And then, you know, he also might signal in various ways as it gets closer.
Nathan Lambert (00:20:17): I would guess they do.
Dean W Ball (00:20:18): Yeah. I mean, like he could say, you know, a lot of bills. I think one outcome that is extremely unlikely from this bill is that it's like voted down by the assembly. Like, I don't think that's going to happen. It could die in the assembly. It could just kind of be forgotten, never get brought to a vote, or it could go to the governor and be vetoed. If the bill's not going to pass, it's going to probably be one of those two ways.
Nathan Lambert (00:20:43): Okay, that's a great little lesson in state politics that I'm sure the vast majority of people listening to this will not know. I did not know all of this. Do you have any final comments on this? Otherwise, we're going to move into kind of fun, faster questions and discussions.
Dean W Ball (00:20:59): Yeah, sure. Let me just think. I think the one other thing that is worth keeping in mind here is that the latest version of the bill, I mentioned this, but just to expand on it a bit, does require mandatory audits starting in 2028. So if you make a covered model or a covered fine-tune, however the Frontier Model Division chooses to define that, not only do you have to submit your certifications to the Frontier Model Division and have the legal liability and all that, but you also would have to comply with an audit done by a private company. So just like accounting, you pay for someone to come in and look at your stuff. And it's not an open market for competition; the auditors are licensed by the Frontier Model Division. So it's probably two or three different companies that'd be doing that. And that's probably the sort of thing that I
Nathan Lambert (00:21:59): think people have wanted. I don't know if you want it. I don't want all these types of oversight to be cobbled together. I think individually each of them has different types of merit, but the execution is important, and then when you cobble them together, it's like, wait, wait, wait, this is too much.
Dean W Ball (00:22:19): Well, and also, I agree that an audit structure like that might be the good long-term way to go, but I think it's questionable whether a California state agency really has the capacity to do this kind of assessment of who is an accredited auditor. That feels much more like a federal responsibility. So, yeah, but I think that's pretty much the main message on 1047.
Nathan Lambert (00:22:49): Yeah. Okay. I'm going to move into other fun questions I have. I'm going to start with one that's potentially related. I've been trying to get my brain around what is going to happen when there is actually a minor disaster from AI. It loops into open versus closed debates. I think a lot of the things I've been talking to people is it won't actually be about whether or not it was an open or closed model. It's some weird infrastructure that people plugged it into and that causes the power plant to go down. Do you have any ideas about how this will happen? I'm expecting this to happen within a couple of years. I feel like the state of our infrastructure is that it is not that reliable and that we're adding all this new digital information into it. And I think all of this is very fragile digitally. So it's like, I think this is going to happen. And how do we preempt any communications around that?
Dean W Ball (00:23:37): Yeah, well, I mean, you know, cyber attacks take out digital infrastructure or take out critical infrastructure all the time. You know, earlier this year, I think maybe it was last year, courts in Dallas could not convene. Like there were no judicial proceedings in the city of Dallas because of a major cyber attack on the judicial system's computers. Parts of the power grid go down. Water plants go down. Hospitals all the time. This happens. $500 million in critical damage. That sounds like a lot. It's not actually that much.
Nathan Lambert (00:24:13): It doesn't have a B on it. It doesn't sound like a lot.
Dean W Ball (00:24:18): Exactly. It's a big economy. I think about this all the time. I think a couple things are very likely to be true. If there is... an attack of this sort, people will probably suspect that AI was involved, whether or not we get, how are we going to know? Right. Let's say like somehow we do have a strong hunch that an AI model was involved.
Nathan Lambert (00:24:47): Yeah, like, do we normally figure out what happened in cyber incidents? Or is it normally post hoc? Or not at all? I guess that's a good thing to know with my question. It's like, can we know that a language model is actually involved? Like, how often will they be able to get that far into the stack of the attack?
Dean W Ball (00:25:02): Yeah, right. Like, I don't know. I mean, if you were using, like, an agentic GPT-6 model to do some kind of zero-day exploit on something, presumably in the server logs you'd be able to see what was interacting with it. Right. But who knows if that would be masked. So let's just say, though, that we have some, you know, circumstantial evidence to suggest that an AI model was involved in the execution of some cyber attack. It's very unclear to me. Are we going to have the person's chat log? Like, are we going to know how they prompted the model?
Nathan Lambert (00:25:46): Like, I mostly think it's like it's going to send requests over some generic Internet protocol. So there'll be this big gap where we can't really tell.
Dean W Ball (00:25:54): Yeah. I mean, that could totally be true. That could absolutely be true.
Nathan Lambert (00:25:58): So I expect there to be – it's like almost if somebody takes ownership or does a really bad job or it's an own goal, which is like a hospital implemented some agent and then it took down their authentication system type of stuff.
Dean W Ball (00:26:12): Yeah. No, that could very well – that's all definitely possible. Yeah. I think that, though, how would we actually know what an AI model was used for? It seems to me like we don't actually... People are imagining a situation in which this happens with perfect information.
Nathan Lambert (00:26:32): Yeah, I think that's the answer to my question. It's not that it's like what happens. We can't answer what happens because it's so much of a media question. It's like we won't know. It's likely to happen, but it's very unlikely that we know the specific stack that caused it. Which makes it more of the same around like if cyber incidents increase in rate, then people will talk about AI and people like without actually having the logs, it makes it easier to spin narratives. Because I'm worried that this could be like people are like, oh, this is why open source AI is bad. Yeah. And it's like, I don't expect to have any proof for that, but I expect that to be what people say.
Dean W Ball (00:27:10): People are going to blame AI for things that were already happening. I think that's a trend that we will see across the board. Whether it's misinformation or whether it's cyber attacks or whatever else, there are all these curves that are already pointing up, and they're going to continue to, most likely. And I think people will blame that on AI. Now, the sort of, you know, long-tail situation is, what if something really bad happens? What if a power plant goes down, or no one has water in Los Angeles for a month, or something like that. And in that situation, not only do I think that an attack could be hastily blamed on AI without us knowing whether that's true, I also think we could see legislation move very, very quickly. Congress, the federal government, is not known for moving fast, but in a crisis, they will move fast. It's for the same reason that I suspect, and I don't think he is right, but if Leopold Aschenbrenner is right about superintelligence being here in, you know, 50 months or whatever he says.
Nathan Lambert (00:28:26): Yeah. This is another one of my later questions, but I didn't have the best way to frame it.
Dean W Ball (00:28:32): Yeah.
Nathan Lambert (00:28:33): Like AGI timelines and stuff.
Dean W Ball (00:28:35): Yeah. Like if he's right about that, then like, yeah, I mean, that's going to get nationalized by the federal government and it'll happen in a heartbeat.
Nathan Lambert (00:28:42): You know, I found it interesting that Alexandr Wang of Scale was also kind of touting this point of view. Yeah. I guess it makes sense for them because they're the only AI company that is leaning into federal contracts. Yeah.
Dean W Ball (00:28:59): And they were before ChatGPT, too, I think.
Nathan Lambert (00:29:04): Yes, they have been for a long time, which is why it was easier for them to continue.
Dean W Ball (00:29:08): Yeah, their early big revenue source, I think, was federal government contracts.
Nathan Lambert (00:29:13): Okay. Yeah, we might come back to AGI. I've been confused by the... lines they're drawing. I have a quiz to debate later on. I don't even know the answer. We'll see if we get to it. But another fun question. Do you think meta will release the 400 billion parameter model? And if there will be any governance questions around that?
Dean W Ball (00:29:32): Will they release it open source?
Nathan Lambert (00:29:34): Open weights in a similar manner to the other models. Yeah.
Dean W Ball (00:29:37): Yeah. Open weights.
Nathan Lambert (00:29:42): Do you think they have governance concerns? I've been decreasing my probability; at best, I was ever 50-50. But is it for governance reasons that you don't think so? They've always been flying close to the sun, where there are back-channel discussions, like the Biden administration telling Meta that they're not invited to stuff because they're not happy with how they're open-weighting models, or they're probably getting lobbied by people saying open source is bad. But it has always seemed like Meta is on kind of thin ice with the executives in Washington. And I'm guessing it's reasonable to say that this model's release is heavily influenced by the feedback they're getting there. And Zuck will make the final call.
Dean W Ball (00:30:28): Yeah, I think that that's part of the calculation. I think that also they probably just want to set a precedent that they're not going to release everything open source because they don't know how things are going to go. Yeah, I mean, they just don't know. Will the model end up being... the most important way that we all interact with computers, you know, in a few years? Or will it just be kind of another layer and another tool? I think they don't know. I feel like Zuckerberg's intuition is that it's just going to be another tool. And so that's why he's inclined to open source.
Nathan Lambert (00:31:07): Yeah, this relates to the whole Apple thing. Like Apple is making these as features rather than products. Yeah. That does a lot of good for the narrative around AI, in my opinion, at least for things that I care about. It's like, this is what we're saying where AI is about a system and not just a model. The Apple's model doesn't matter to people, but it is enabling these products and systems or these things on their products to just be better. It's always Apple and Meta together. They are always forcing their way into whatever the next thing is going to be in technology.
Dean W Ball (00:31:44): Vibes policy or whatever. Yeah and it's funny because they hate each other. Yeah yeah but it's so funny but yeah i don't think they're going to uh that that's my just my personal intuition and i think that's like i think we're going to see a lot of people um not just in the language model space but elsewhere kind of do this this dual approach where they can they realize how much political cred you can get by open sourcing things. It's still happening.
Nathan Lambert (00:32:12): Google today, when we're recording, released Gemma 2. And their 27 billion parameter model is just a little bit below Llama 3 70B. I think that's a nerdy thing. But when the first Gemma model was released, it wasn't used as much by the community, mostly because there were a lot of minor bugs in the implementations in popular tools. So the initial feedback loop didn't catch on. So it'll be really interesting to see if these second generation models, which are in the same ballpark as what Meta released, take off. There are some strange things. They trained the biggest model on 12 trillion tokens, and then the 9B model only on 9 trillion tokens, and the 2B model on 2 trillion tokens. So the models that have more reach by being smaller are... there's got to be a reason, but I think they were scaling runs preparing for the biggest one, and they didn't finish training them. So the models that the most people could use are relatively worse than the bigger ones, just by the amount of compute that they put into them.
So I think eventually, if there's decent uptake of these, Google will change this. But the Gemma 2, whatever it is, 9B model is going to be way worse than the Llama 3 8B, just because Llama is trained on twice as many tokens. And Google could have resolved this. So that's an aside. But these dynamics actually feed into what we're talking about, which is that Google, Microsoft, Meta are all still releasing these models.
(00:33:42): Yeah.
Nathan Lambert (00:33:42): Which is good. I have on this outline like the general state of open versus closed. It seems like we haven't had major updates in a while. It seems like there's much less pressure taking on open. I think maybe people are okay with the steady state that we're in. I don't know if this Nemotron 340B changes that much.
Dean W Ball (00:34:01): I don't think so. So I think that there are the people who believe that open source models are an existential risk to the world. And they continue to mostly think that, and they continue to support policies that either in absolute terms or on the margin would diminish open source. I think that DC has had a really radical shift in the last year, because the climate towards open source models in the policymaking world a year ago was not good. And now it is much more, oh, well, we think this is really important for competition and we think it's important for innovation and we actually want to make sure we have a really healthy open source community, all these kinds of things. I mean, I'm sure you've seen, you know, Lina Khan, no friend of the technology industry, has had comments on this.
Nathan Lambert (00:35:09): Um, that's good. Did you see her clip on Hard Fork where she was asked what her P(doom) is?
Dean W Ball (00:35:14): Yes. Yes.
Nathan Lambert (00:35:15): Oh, my God. If people haven't seen this, you've got to go find it. It is so funny.
Dean W Ball (00:35:18): Yeah. And the sense I get from talking to people in Congress and whatnot, the staff, congressional staff, is that people have just realized open source is really popular and it would be really hard to go after. This isn't new; the government figures this out like every 15 years. They get really freaked out about something in open source software, and then they go and try to ban it, and then they realize, oh, wait a minute, this would be really hard. This would piss a lot of people off.
Nathan Lambert (00:35:56): It'd be a giant economic own goal. I think it's inevitable that it's an economic own goal. I mean, China is ready to take this over and seize the lead. They're right there. They don't have the ecosystem; the ecosystem is landing in the U.S., but they have perfectly good models. So if the U.S. were to own-goal itself and stop building the models, I think that is the path by which they could then own an ecosystem. Because there's no incentive to recreate the ecosystem when the ecosystem and the models exist in the US. But if these kinds of tools and hosting all go away, then that's when other people take over.
Dean W Ball (00:36:29): Well, it seems like, I mean, as a bit of a question for you, I guess, but like, it seems like the Chinese, like, you know, the export controls on compute are going to start to really affect them. Because they were able to buy H100s.
Nathan Lambert (00:36:44): Yeah, this is what I was going to ask about. Isn't it that like a lot of NVIDIA's recent sales have been just them... prioritizing selling to China because they're not yet blocked. And then that creates a backlog in the US because Nvidia is like, well, they're not going to be able to buy them, so we should get our revenue while we can. It kind of checks out. I don't have a source on it, though.
Dean W Ball (00:37:04): The sense I've always gotten is it's all through subsidiaries. Yeah. So Chinese companies saw the writing on the wall about export controls like two and a half years ago. And so they started to buy up A100s and H100s at that time. And then the export controls came through, and things are leaky, and NVIDIA had that chip. They were selling a chip that was basically an A100 and basically an H100 for a year. And then that got blocked by the federal government. So like...
Nathan Lambert (00:37:37): Should we put Zuckerberg in charge of NVIDIA? Because I feel like, for all the haters of Mark, Mark is pretty American and kind of follows through on it. He doesn't really care that Facebook is blocked in China. I feel like this is why public companies sometimes have problems: they're too incentivized. NVIDIA's stock, if they had to stop selling to China immediately, would get such a haircut. So literally their hands are tied to doing this thing, which I think is going against what the executive policy is trying to do in such a clear way. Which I'm like, this is a market failure. I guess Jensen's probably pro-US, I don't know. I don't care whether or not they're a hawk. It just feels bad to go so clearly against what the intentions of the executive policy are, when there is a clear reason they're doing this.
Dean W Ball (00:38:31): Yeah. Yeah. No, I mean, I think that Jensen is going to comply with the letter of the law, but that philosophically he doesn't feel like it's his responsibility or good for him to be policing who his end users are. I think that's just how he feels.
Nathan Lambert (00:38:47): That's another discussion. It's a discussion that I've been trying to figure out. Ben Thompson famously has these diagrams for where moderation can occur in the stack. And then figuring out what the mirror of that is for where AI sits in the stack, whether it is just a product or if it seeps down to being like the AWS layer, where OpenAI's models are so fundamental to our computing infrastructure that them moderating at all and deciding who they sell to is extremely unclear. And I think it might be going in that direction.
Dean W Ball (00:39:20): It feels that way. But it does increasingly feel to me like... You know, the Chinese might not be able to keep up on foundation model training because they're not going to be able to string together 100,000 B100s in a year.
Nathan Lambert (00:39:32): They have more electricity, which seems to be what people are talking about is the limitation.
Dean W Ball (00:39:37): They just won't have the compute, though. And we'll figure out. The U.S., I think, will figure out the electricity. I mean, I don't think we're going to be building 100 gigawatt data centers, but we'll figure out the electricity for the next couple of years, I think. But the Chinese will be able to distill the models and right. And like release them as, as open weight.
Nathan Lambert (00:39:59): Like, I mean, this is what the leading labs are doing anyway. All of Google, OpenAI, and Anthropic have now released models below their biggest size that are better than their biggest available models, because it is cost effective and the performance is really good. So they're not even pushing the frontier of model size to the users. There are probably other infrastructure reasons for this, but that sort of thing is something that China could also do, distilling our models into their models and stuff like this. I think this kind of leads into my next question. I was wondering if, in your circles, this idea of synthetic data and various license clauses on whether or not you can train on model outputs is something that is discussed. In the open fine-tuning community, keeping track of licenses and how you comply with them on these various models is really crucial. So with Llama 3, you're technically not allowed to use the outputs of the model to train any model other than Llama 3 models, which is this kind of headache, and then a lot of NVIDIA's push with Nemotron is like, look, go wild. I've learned that a lot of these clauses on training on outputs come from the data providers trying to protect their business models. So it's like these companies want the models to be pretty open, maybe not Meta, but some of the smaller ones. But then the data providers are like, you can't do this, and they don't have enough power to do this. This is a very in-the-weeds technical discussion, but is this synthetic data stuff, or these clauses on models, discussed in your area of the world?
Dean W Ball (00:41:30): So like in the policymaking circles, people are just coming around to the idea that synthetic data is even a thing. And I think a lot of people in DC don't understand that there are licenses associated with open source software.
Nathan Lambert (00:41:45): Well, the licenses with the models don't really make sense. We're in this position where I've generated some data with these models, so you can't train on the outputs. But it's written as if it applies to you as the user. So you're agreeing to their community agreement to use the model. But if I create a data set and then upload it without training on it, can't somebody else just take the data set and train on it? Because they didn't agree to the terms of use of the model. And it's like, this makes no sense. I need to go to our legal department and be like, this is what they're saying, right? I don't understand. And so it's just this weird ecosystem of middle-ground messiness, which feels similar to some of the open versus closed stuff. And we're kind of going into the peak of this discussion, I think, especially as people get to know better that this new Claude 3.5 model is just distillation. It's based on some form of synthetic data.
Dean W Ball (00:42:36): Yeah. I mean, with a clause like that, too, in a contract, like you got to wonder about enforceability even under the best of circumstances.
Nathan Lambert (00:42:45): Yeah.
Dean W Ball (00:42:45): How would they know? How would they prove in court? How would they prove that this synthetic data set came from their model? Maybe they could prove that, but I don't know. A lot of models claim that they're OpenAI models, whether or not they are.
Nathan Lambert (00:43:04): It's really funny. Yeah, a lot of it is... well, this is a technical issue with open models. A lot of people spin up demos with open models, but a lot of the way models know who they are is by using a system prompt. And if you just spin up an open model, it's going to say that it's whatever model it was trained on the most of. But people don't normally write the system prompt that says, you are blank, blah, blah, blah. We need to do that for our models, and we're relatively serious actors. So open models will always be messier with this, because the closed models do a lot more to serve it as a product in a polished way. Yeah. Yeah.
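To make the system prompt point concrete, here is a minimal sketch; the model name, organization, and prompt text are made up for illustration, and this is just the generic chat-message format many open-model servers accept, not any specific product's API.

```python
# Illustrative only: the same question with and without an identity-setting
# system prompt. Without the system message, an open-weight model will often
# claim to be whatever model dominated its training data.

messages_without_identity = [
    {"role": "user", "content": "Who are you?"},
]

messages_with_identity = [
    # Hypothetical system prompt a careful deployment would add up front:
    {"role": "system", "content": "You are ExampleLM, an open-weight assistant built by Example Org."},
    {"role": "user", "content": "Who are you?"},
]
```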
Nathan Lambert (00:43:43): Another quick question related. We mentioned Anthropic. With this Claude 3.5 Sonnet model that just came out, they've said in a tweet that they got clearance from the UK AI Safety Institute. This is from Michael Sellitto, who I think I've met at various government discussions. He's like, excited to release this top performing model. In addition to our internal pre-deployment testing, we were also pleased to work with the UK AI Safety Institute. Is this just political gesturing? What is going on?
Dean W Ball (00:44:18): I think that it's political gesturing. I don't love it. I don't think that we should normalize the whole pre-deployment testing thing, because that's just fundamentally incompatible with the way that software is made. But yeah, I suspect that it's political. I think that these companies, none of them are particularly reliable narrators. Like DeepMind is going through a reorg. Was DeepMind a part of Google when the AI Safety Summit happened? I think maybe that reorg was happening. OpenAI, we all know, is a fairly dramatic company.
Nathan Lambert (00:45:04): I need to come up with the right nonlinear dynamics analogy. They're in like an unstable, like, homoclinic cycle or something. There are these things in nonlinear dynamics where they stay in a cycle, but if they're perturbed, they end up in another cycle. The Lorenz attractor is the classical, truly chaotic one that oscillates between them. But it's kind of like that, because they don't even need an external disturbance. They don't even need an input. They're going to go into some other unstable equilibrium for a while and then go to another one. But nonlinear dynamics is just a great field because the math is simple and the analogies are really good.
Dean W Ball (00:45:41): So I even think I even think anthropic is that way, too, to be honest, like I and they're not like they're the most stable of the three,
Nathan Lambert (00:45:50): but I think their cultural density is still higher.
Dean W Ball (00:45:53): Yeah, I mean, I think that they have a very clear mission, and that is really helpful.
Nathan Lambert (00:45:59): I don't know if they're achieving it. Their whole line about, okay, I'm close with a lot of people there, but I don't believe that their line of that they're not contributing to the race is true. I think they need to reframe that and figure out how to... combine this with their culture. I think it's true that normal people don't know that Anthropic exists, which might mean that in a normal person world, they're not contributing to some race, but they are in dynamics with OpenAI and Google that substantially are adding pressure to the pace of AI progress.
Dean W Ball (00:46:31): Claude's been my go-to daily model for the last four months. It's good. Since Claude 3 came out. But yeah, I mean, I also think that they've committed to doing models every couple months too, right? That's a pretty rapid cadence, substantially faster than OpenAI. So yeah, if anything, they're accelerating the current dynamics. And, you know, I think that the whole UK AI Safety Institute commitment was made during a very heated moment, kind of the peak. I think fall of 2023 was sort of the peak of the AI doom rhetoric. Was this before or after the Sam Altman stuff? I think it was before. It was before.
Nathan Lambert (00:47:16): The AI Safety Summit, I talked to people who were at that event, and they were like, this s**t is weird. They're like, why am I on the stage with all of these billionaires and famous politicians? And they're all like, what is going on here?
Dean W Ball (00:47:27): Yeah. Well, I mean, it was just so incoherent back then. It was, you know, because it was the Biden executive order and the AI safety summit were all like in about a week from one another, as I recall. It's like all this stuff happened. And I think they made those commitments, and I think we will see all these companies gradually try to unwind themselves from those commitments over time. Or what will happen, this will be very consistent with the way that software gets regulated, especially to use software. The big companies will do these pre-deployment tests, and there'll be open providers who don't. And the best way to, like, it doesn't have to resolve itself in a rational way. That's something that's always important to remember about public policy. It's like, there's absolutely no need for it to be rational, you know, like make sense.
Nathan Lambert (00:48:19): Yeah, that makes sense. I think the other thing, this is all the AGI lab stuff: what is your take on the scaling curves? For context, everyone got restarted on this with the Leopold Aschenbrenner Situational Awareness thing, which obviously is a well-written document, whether or not you agree. I think it's interesting. I'm struggling with this one point on the scaling curves, where I get mixed messages on what the scaling curves actually are when it comes to evaluations. My understanding of them is that when you have log compute on the x-axis and then log perplexity, it's a straight line. And what I interpret from this is that as you 10x compute, it's not like a 10x increase in performance; you get 10 times closer to 100, which is like going from 90 accuracy to 99. So I don't really understand how people think that this is going to make them become a PhD level, whatever, blah, blah, blah. And I was listening to a recent podcast, and I think it was Josh Albrecht from Imbue who described the reason you have emergent properties as being that when you're training, at every 10x of compute, your model gets 10 times better, so if you're measuring on a linear scale, it'll look like an emergent property because it's going to go like this. And I was like, what is going on? Why does no one understand these fundamentals? It seems impossible that you could get 10 times better; that just seems like total Kool-Aid drinking. Am I wrong? I guess I need to go do the basic math, but it just doesn't track with any computer system. How are you going to get 10 times better? I don't understand. Well, that's kind of my rant.
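A small sketch of the arithmetic being argued about here, with made-up constants: a power law in compute is a straight line on log-log axes, so each 10x of compute multiplies the loss by a fixed factor rather than producing a 10x jump in capability.

```python
# Illustrative constants only; not fit to any real model.
# A scaling law of the form loss(C) = a * C**(-b) is linear in log-log space:
# log(loss) = log(a) - b * log(C). Each 10x of compute multiplies the loss by
# 10**(-b), a steady relative improvement rather than a 10x leap.

a, b = 10.0, 0.05  # hypothetical fit constants

def loss(compute: float) -> float:
    return a * compute ** (-b)

for c in [1e21, 1e22, 1e23, 1e24, 1e25]:
    print(f"compute={c:.0e}  loss={loss(c):.3f}")
# With b = 0.05, each additional 10x of compute shrinks the loss by a factor of about 0.89.
```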
Dean W Ball (00:50:07): I read these charts the same way. Log-log, perplexity versus compute, right? That is what I read too. And so that would imply asymptotic progress, but it would not imply a continued exponential increase in capability. I also think, what is better? That's always so hard. What is 10 times better? People say, oh, well, the leap from GPT-4 to GPT-5, will it be similar or less or bigger than the leap from GPT-3 to GPT-4? I'm like, I don't really know if I can quite quantify what the leap between 3 and 4 was, or the leap between 4 and Claude 3 Opus, which was definitely real for me. That model felt qualitatively different. But I don't think that has to do with training compute. I really don't think that has to do with the number of parameters the model has. I think that has to do with the way Anthropic did the post-training more than anything else. So, yeah, I'm really not sure. I'm skeptical when it comes to the scaling laws. They're obviously very important. They've held in a variety of different modalities, which is interesting. The fact that we see them apply in DNA sequencing or gene sequence prediction too is like, oh, that's interesting. We're just seeing that same line. The models improve monotonically with scale over and over and over again. So like, sure, I'm inclined to believe that,
Nathan Lambert (00:51:52): but they're important, but I just am so shocked by how bad the discussion of them so often is like putting this, this is the thing with like the putting levels on the Y axis corresponding to human education. Dumb. Bad move. The technical reality of it may be that they continue to improve, but it's just like, those are the things that I want to see people stop doing. And this isn't really a question. This is mostly just me ranting about this because this impacts policy and these related discussions.
Dean W Ball (00:52:19): if I wrote an essay and like in college and submitted it to my professor, like Leopold Aschenbrenner.
Nathan Lambert (00:52:27): Wait, who was the famous economist that he was with? Tyler Cowen, it's Tyler. Tyler, you didn't check his work.
Dean W Ball (00:52:35): Yeah. Tyler basically hired me too, in fact. But yeah, if you did that and you didn't define intelligence, the first thing a college professor would do is circle the first paragraph and say, you need to define intelligence here. And the fact that he doesn't... I don't think it's a one-dimensional or two-dimensional thing. I think intelligence is inherently highly multidimensional, and multidimensional things just behave in counterintuitive ways. So, like,
Nathan Lambert (00:53:08): I think they're getting better at things they're already doing, but we don't have any proof that they're going to start doing new things.
Dean W Ball (00:53:15): Yeah. Is GPT-4 better than a high schooler at some things? Yes. Is it worse than a three-year-old at some things? Yes. Those things are all true. And I don't really think it belongs on a human-defined linear scale of intelligence. I just inherently don't think that.
Nathan Lambert (00:53:31): Yeah. That makes sense. Final question. How much of influencing policy and related discussions comes down to having some sort of audience? I think that this is like
Dean W Ball (00:53:42): remarkably true but not potentially good yeah i think that it is very important and i think that it comes from influencing the way people think you know like a lot of think tanks will judge the success of research by did the ideas from this research get implemented in policy, which is one way to do it, for sure.
Nathan Lambert (00:54:08): But I think... It's a long timescale. It's like a longer timescale than citations in academic nonsense.
Dean W Ball (00:54:14): Well, and also, if I'm successful as a policy scholar, then at least once a month, I should be putting out something, some analogy, some way of thinking about something, a meme, really, basically, that has an effect on the way a lot of influential people think. The other big outstanding question for me, and I've heard you raise this on the retort before recently, in fact, is what's more important? Is it influencing people in the federal government or is it influencing people at the AI labs? Who's going to be more important for determining policy? I don't know.
Nathan Lambert (00:54:55): Yeah. Well, maybe some people at the AI labs will read this. I think this is a great conversation. I'm kind of happy to wrap up here. I could see us redoing this in a few months based on the coverage of all the recent things here. So I think this is great. I'm excited to share this with people. It's nice to get to know you more. We already have another project lined up where we'll talk more about this. It won't be in the same medium, so that's fun. So thanks a lot and keep writing. I'm sure you'll get a bunch of people to check this out. I'll have all the links everywhere and stuff like that.
Dean W Ball (00:55:28): Awesome. And you too, thank you very much. You played a big role in building my Substack audience over the last six months. So I really appreciate it.
Nathan Lambert (00:55:35): People just need to say things. People ask me this a lot. It's really like, if you make time, most people that I work with have interesting thoughts. The problem is doing the practice of getting these thoughts into some silly medium. Literally, these long tweets, the tweets are now long, you could just do that. You could do that once a week. You will grow an audience over time. It's pretty simple. You just have to pick your lane and keep pressing the button, and it just works. You're not the only one; I'm going to have some other people that have talked about this on this interview track in the summer. I just think it's so... it's partially a way to normalize it and get more people to try it, which is why I bring it up, because I want that to happen in AI too. There are a lot of smart people that don't know how to engage, in AI and a hundred other things. And it's like, yeah, it's worth it. So thanks again.
Dean W Ball (00:56:27): We'll talk to you. All right. Bye.
Things to be aware of if you work on language model fine-tuning.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/rlhf-roundup-2024
00:00 RLHF Roundup: Trying to get good at PPO, charting RLHF's impact, RewardBench retrospective, and a reward model competition
04:32 How big is the impact of RLHF relative to pretraining?
05:54 RewardBench retrospective after 100 models and 90% peak accuracy
09:19 LMSYS's reward modeling competition
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_009.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_012.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_017.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf-roundup/img_026.png
Synthetic data is known to be a super powerful tool for every level of the language modeling stack. It's documented as being used for expanding vanilla pretraining data and creating large swaths of fine-tuning data. Many, many more rumors surround its use: Anthropic's pretraining-scale constitutional AI, Mistral AI's first models being pretrained on OpenAI outputs, Q-star's hopes as OpenAI's remaining moat, and much more. The diversity of use cases for synthetic data makes planning around the role of synthetic data in solving specific goals difficult.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/frontiers-in-synthetic-data
00:00 Frontiers in synthetic data
01:14 1. Direct distillation is still king
02:54 2. Are Gemini Flash and Claude Haiku distilled?
04:03 3. Filtering prevents collapse
06:30 4. Synthetic data strategy taxes
07:32 5. Pros and cons of training on multi-output-source synthetic datasets
08:54 6. Structured synthetic data
09:42 7. Weak-to-strong generalization is maybe real
10:27 8. Creating synthetic prompts is overlooked again
Signs point to a general-use Sora-like model coming very soon, maybe even with open-weights.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/text-to-video-ai-is-already-abundant
0:00 Text-to-video AI is already abundant
5:08 What's next for the text-to-video market?
6:49 Are text-to-video models good for the world?
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_005.mp4
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_009.mp4
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_011.mp4
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_013.mp4
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_015.mp4
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/text-to-video/img_017.mp4
Apple Intelligence makes a lot of sense when you get out of the AI bubble.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/apple-intelligence
00:00 AI for the rest of us
02:46 Apple's technical approach
03:32 Core models: What did Apple build?
05:35 Alignment strategies: Some new things!
10:00 Orchestrating adapters and on-device magic
11:58 Light for other narratives around AI
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_005.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_039.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/apple-intelligence/img_041.png
A realistic path to robotic foundation models
Not "agents" and not "AGI." Some thoughts and excitement after revisiting the industry thanks to Physical Intelligence founders Sergey Levine and Chelsea Finn.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/robotic-foundation-models
0:00 A realistic path to robotic foundation models
2:51 Key factors for the future of robotics
6:19 Everything is a token: The transformerification of robotics
Data licensing deals, scaling, human inputs, and repeating trends in open vs. closed.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/the-data-wall
0:00 We aren't running out of training data, we are running out of open training data
2:51 Synthetic data: 1 trillion new tokens per day
4:18 Data licensing deals: High costs per token
6:33 Better tokens: Search and new frontiers
Celebrity's power will only grow in the era of infinite content.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/name-image-and-ai-likeness
0:00 Name, image, and AI's likeness
1:11 OpenAI's second terrible, horrible, no good, very bad week
4:36 The expansion of name and likeness
7:46 Culture and AI development
ChatGPT leaves the textbox, and Google is building the same, and more, as practical tools.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-and-her
00:00 OpenAI chases Her
02:10 Talking to ChatGPT
03:53 GPT-4o: Toward omnimodal models
08:25 Google's mirror with Gemini
10:11 OpenAI's AI Safety: Have your cake and eat it too
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/her/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/her/img_023.jpg
Now we will have some grounding for when weird ChatGPT behaviors are intended or side-effects -- shrinking the Overton window of RLHF bugs.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/openai-rlhf-model-spec
00:00 OpenAI's Model (behavior) Spec, RLHF transparency, and personalization questions
02:56 Reviewing the Model Spec
08:26 Where RLHF can fail OpenAI
12:23 From Model Spec's to personalization
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_027.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_029.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_033.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_034.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_041.webp
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-spec/img_046.webp
Many, many signs of life for preference fine-tuning beyond spoofing chat evaluation tools.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/how-rlhf-works-2
00:00 How RLHF works, part 2: A thin line between useful and lobotomized
04:27 The chattiness paradox
08:09 The mechanism for making models chattier
10:42 Next steps for RLHF research
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_012.webp
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_018.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/rlhf/img_025.png
Models that seem totally out of scope from recent open LLMs give us a sneak peek of where the industry will be in 6 to 18 months.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/phi-3-and-arctic-llms
0:00 Phi 3 and Arctic: Outlier LMs are hints
1:01 Arctic & open mixture of expert trends
6:10 Phi 3, synthetic data, and small models
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_004.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_008.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/phi3/img_018.png
Certain definitions of AGI are backing people into a pseudo-religious corner.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/agi-is-what-you-want-it-to-be
00:00 AGI is what you want it to be
04:01 RL still rules the AGI discourse
05:43 Modern AGI tests
07:37 Agency and shifting goalposts
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/agi/img_018.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/agi/img_020.png
Meta shows that scaling won't be a limit for open LLM players in the near future.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/llama-3-and-scaling-open-llms
00:00 Llama 3; scaling open LLMs to AGI
01:44 Pretraining, data, and basic evals
06:06 Alignment and human evaluations
10:08 Chatting with Meta AI and Llama 3 70B Instruct
11:55 Same Llama license (mostly)
12:52 The healthy open LLM ecosystem
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_011.jpeg
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_013.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_015.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_020.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_036.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_040.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_046.jpeg
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_061.png
Fig 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_063.webp
Fig 10: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_066.png
Fig 11: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/llama3/img_068.jpeg
Integrating some non-computer-science ideas into reinforcement learning from human feedback can give us the models we want.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/reinventing-llm-alignment
0:00 Stop "reinventing" everything to "solve" AI alignment
2:19 Social Choice for AI Alignment: Dealing with Diverse Human Feedback
7:03 OLMo 1.7 7B: A truly open model with actually good benchmarks
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_013.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_015.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_018.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_024.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reinvention/img_027.png
Modeling the compute versus performance tradeoff of many open LLMs.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/compute-efficient-open-llms
0:00 The end of the "best open LLM"
3:05 Compute efficient open LLMs
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_004.jpeg
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_009.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_014.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_016.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_018.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_020.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_022.png
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_024.png
Fig 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/scaling/img_028.png
Last minute title change from: The tech industry can't agree on what open-source AI means. That's the process.
How to read what multiple people mean by the word openness and see through the PR speak.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/flavors-of-open-source-ai
0:00 The tech industry can't agree on what open-source AI means. That's the process.
2:45 1. Effective Accelerationists, Techno-Optimists, capitalists, etc.
3:39 2. Scientists, promoting understanding and transparency
5:16 3. Inclusion, public interest, and fighting concentration of power
6:19 4. Freedom advocates
7:25 Dissecting "openness"
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/openness/img_004.png
Databricks' new model is surpassing the performance of Mixtral and Llama 2 while still being in a size category that's reasonably accessible.
This is AI generated audio with Python and 11Labs.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/databricks-dbrx-open-llm
00:00 DBRX: The new best open model and Databricks' ML strategy
03:36 The DBRX narrative
07:33 Databricks' open LLM (and AI) strategy
09:42 Playing with DBRX Instruct
14:54 Digging for details
Fig 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_007.png
Fig 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_012.png
Fig 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_023.png
Fig 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_045.png
Fig 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_047.png
Fig 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_059.png
Fig 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_066.jpeg
Fig 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/dbrx/img_068.png
Evaluation is not only getting harder with modern LLMs, it's getting harder because it means something different.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/evaluations-trust-performance-and-price
00:00 Evaluations: Trust, performance, and price (bonus, announcing RewardBench)
03:14 The rising price of evaluation
05:40 Announcing RewardBench: The First reward model evaluation tool
08:37 Updates to RLHF evaluation tools
YouTube code intro: https://youtu.be/CAaHAfCqrBA
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_026.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_030.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_034.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/evals/img_040.png
Where moats are tested now that so many people have trained GPT4 class models. Claude 3, Gemini 1.5, Inflection 2.5, and Mistral Large are here to party.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gpt4-commoditization-and-moats
00:00 Building LLM moats despite the commoditization of GPT4
04:38 The Open's opportunities
08:02 It's amazing people still think LLMs aren't going to be useful
09:50 Things that are coming
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_004.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_028.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/moats/img_032.png
A proposal for a new definition of an "open source" LLM and why no definition will ever just work.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/an-open-source-llm
00:00 The koan of an open-source LLM
03:22 A new naming scheme for open LLMs
07:09 Pivot points and politics
08:16 Claude 3, arms race, commoditization, and national security
10:01 Doomers debunking bio risks of LLMs themselves
11:21 Mistral's perceived reversal and the EU
13:22 Messy points: Transparency, safety, and copyright
13:32 The muddling of transparency
15:22 The muddling of "safety"
16:30 The muddling of licenses and copyright
20:12 Vibes points and next steps
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/open-source/img_046.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/open-source/img_064.png
This interview is available on podcast players and YouTube.
I’m excited to bring you another interview! This one is a deep dive right in my wheelhouse — all things RLHF. Louis Castricato is probably the hidden star of RLHF in the open. I’m not sure anyone who can speak freely knows as much as him. As I’ve said again and again on Interconnects:
Giving a voice to researchers is the best way to cut through the noise and understand what is happening with core developments of LLM technologies.
Louis recently founded a new startup focused on synthetic data for alignment, Synth Labs, and is a researcher at EleutherAI. This interview should speak for itself, and it'll need re-listens, even for myself. The list of topics we cover touches on pretty much every major and minor issue facing model fine-tuning. Please reach out or comment if there's a paper we mention that I didn't link. Happy to dig it up for you.
For more on Synth Labs, there was a profile in Bloomberg from Rachel Metz.
This post is very technical, more than usual. If you’re having a hard time with it, I suggest you listen to my RLHF 201 post on Latent Space first.
Chapters
These are generated with smol-podcaster with moderate edits.
High-level chapters
* 00:00:00: Introduction
* 00:01:24: Gemini News and RLHF’s Part in it
* 00:09:05: Long Context, In-Context, and Multimodal RLHF
* 00:21:20: What are people missing about RLHF these days?
* 00:30:30: OpenAI's Influence and the Need for Alternatives
* 00:39:20: Synth Labs and the Future of Alignment
* 00:55:00: Evaluation Talk p2: Open-ended Evaluation and Data Diversity
* 00:59:20: Algorithm Roundup: PPO, DPO, KTO, IPO
* 01:18:38: CarperAI, Early Days of RLHF, Reflecting on ChatGPT
Detailed chapters
* 00:00:00: Introduction and Overview of RLHF
* 00:02:02: Gemini News, Custom Demographics in Image Prompts, and Controllability Issues in AI Models
* 00:05:21: Fixing Biases in AI Models Post-Training, Representation in AI Data
* 00:09:00: Multimodal RLHF and Video RLHF
* 00:16:09: Evaluating Long Context Behavior in AI Models
* 00:17:05: The Potential of In-Context RLHF
* 00:21:24: Shift from PPO to DPO in RLHF
* 00:23:19: Generalization and Evaluation in RLHF, Human Evaluation
* 00:27:03: The Discrepancy Between Research and Company Needs in Alignment
* 00:29:20: Impact of ChatGPT and Language Model Outputs on Data Sets
* 00:31:39: The Concept of Uncensoring Models
* 00:34:05: Lack of Safety Data Sets in Instruction Tuning
* 00:35:23: LMSYS ChatBotArena, AlpacaEval, MT Bench p1
* 00:39:25: Introduction to Synth Labs and Alignment Platform
* 00:43:05: Developing OpenCAI Constitutional AI Data Set
* 00:49:41: The Need for Open-Ended Evaluation in RLHF, eval p2
* 00:54:13: The Importance of Releasing Models for RLHF Research
* 00:58:17: Self-Instruction and Self-Rewarding LMs
* 01:01:03: Working on RLHF at Carper AI
* 01:04:25: Scaling PPO in RLHF
* 01:08:01: The Impact of ChatGPT on Carper AI
* 01:10:56: The Potential of KTO (Kahneman-Tversky Optimization)
* 01:17:39: The Importance of Implementation Details in RLHF
* 01:20:14: The Initial Focus at Carper AI
* 01:23:36: The Future of RLHF and Open Science Collaboration
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Papers & artifacts we discuss
* Recursively Summarizing Books with Human Feedback
* Needle in a haystack recent example repository.
* URIAL paper: The unlocking spell on base LLMs: Rethinking alignment via in-context learning
* Misha's paper from DeepMind: In-context Reinforcement Learning with Algorithm Distillation
* Muesli optimizer: Muesli: Combining Improvements in Policy Optimization
* Unintended Impacts of LLM Alignment on Global Representation
* Pink Elephants Problem: Suppressing Pink Elephants with Direct Principle Feedback
* Cut the Carp: Cut the CARP: Fishing for zero-shot story evaluation
* MT Bench data for correlating human to GPT4 preferences
Full transcript
Note: this is generated by smol-podcaster and has minor bugs post human edits.
Nathan [00:00:01]: The ticker's going up. Welcome, Louis. You're the second guest on the Interconnects podcast, I think. It's an interesting one for me because everyone kind of points to me now as the person that is the face of RLHF, and I get a lot of questions, and to me Louis has represented that person. I think Louis provided most of the information on the first RLHF blog post that I wrote for Hugging Face back in the day. If there's somebody that I want to ask questions about RLHF, it generally goes to him. So now you all are gonna know this in the open. As always, I'm trying to talk with researchers on the ground, people actually doing things in these topics. I think we're gonna cover a lot of things today. If you're watching on video, you may have noticed that we're in the Latent Space studio, and they reminded us we've got to start off with covering the Gemini news and what that means for RLHF. Then most of this is a long docket of the core questions facing the two of us as we're trying to make RLHF more open and more useful, not only about safety, but safety is important to it and important to us. So I think we can kind of get going. The first question I have, just to get rolling: what is your favorite Rhode Island fact?
Louis C [00:01:28]: My favorite Rhode Island fact? Oh man, all the H.P. Lovecraft stuff. Like walking around Providence with like friends who like H.P. Lovecraft and be like, oh yeah, you know, this was like that building in Call of Cthulhu or like...
Nathan [00:01:36]: I don't even know this. I mean, for the record, I grew up in Rhode Island if people didn't know and then that's where Louis spends most of his time these days. Providence. So we'll come back to this. I think I'm just gonna start with kind of the hardest question then it'll get easier for us from here. It's like what was your first reaction when you saw all this Gemini stuff?
Louis C [00:02:02]: The, you know, the adding custom races and demographics to image prompts component, right? Yeah. So DALL-E had done that back when DALL-E 2 first came out and was in beta, and people were reporting "a person holding a sign that says X" and then the sign would say black, or the sign would say white, or the sign would say Asian. And, you know, it was a very hacky solution then, and I thought a lot about it then as well, and I almost felt like it gets you 90% of the way there for like 1% of the effort of doing this in a more proper and auditable way, like making sure your training data has equal representation or making sure your RLHF data has good representation. And, you know, you can't do those things after the fact, but what you can do after the fact is inject things into the prompt to make it more controllable. And it really comes down to the fact that controllability right now is not a solved problem, and most of our solutions to controllability are a little bit hacky.
Nathan [00:03:16]: Yeah, that makes sense. I think, to summarize for people, this has been an ongoing issue, and we're recording on the 27th here. Gemini initially got flak for forcing diversity into historical scenes, and then it started getting more flak for flat-out refusing certain requests on race. All of this stuff is like, ouch, to somebody. I know people working on this stuff, and the way that it ends up here is not what a lot of people think. The Gemini team is obviously moving fast, and it seems to me that the image stuff has always been like a red herring. That's the way that Swyx phrased it as well. Somehow it got to the point where a prompt was shipped in this final solution with the further image editing, and that's just hard. Obviously there's a big goof up there. But then we're looking at examples, and still today, Meta's image generator, on WhatsApp or whatever, you can ask an AI and it'll have similar issues where it forces diversity into a question with multiple people. Microsoft Copilot has this. It's the text thing, and really digging into how we think these big companies could be forcing this into their data. We know that there's a lot of uncertainty over how all these companies get their preference data. Some of them work with companies like Scale and Surge. Some of them do it in-house. Who is providing it isn't really an issue, because they're probably giving similar instructions to similar workforces across the board. But it's like, how do we see this entering the preference data that they're adding to their RLHF stuff? Because if you look at a base model, we were just working with OLMo, and if you say hello to a base model, a lot of times the base model will then go off and be like some crazy 4chan s**t, because so many of the conversations in there, even with good data processing techniques, are from weird corners of the Internet. So I don't see any base model that comes out with some de-bias thing, so it's added on. And it's like, how did we end up there?
Louis C [00:05:21]: Yeah, I mean, you know, when I was saying this is something that they do retroactively, once they've acknowledged that these issues exist in the data set, once the model has been trained, it's not something that can be easily fixed even if they had infinite resources. It's very, very hard to go back and actually rectify these biases in a way that's equitable to all the kinds of preferences that someone might have when wanting to interact with this model, right? There's the fact that, at least as far as I know, until recently DALL-E did this as well, where you could still say "a person holding a sign that says X" and it would still say black, white, or whatever. And the amount of resources that they're pumping into making sure... you know, they're building a consumer product, they're building the main consumer product in this space. The amount of resources that they've been pumping into it, and this still presents a large issue for them, just shows how difficult this really is.
Nathan [00:06:20]: Yeah, and another example: I have this Discord that's growing for paid subscribers and friends, and someone pointed out this work where if you ask DALL-E to generate like a doctor and an assistant, all the same bias problems still show up. So a lot of the solutions that we have are not necessarily deep, at this conceptual level. It's at the level of: you tell your preference labelers to do a certain thing and then they do it, but you may not have good tracking of which data point is responsible for these different things.
Louis C [00:06:55]: Yeah, you know, interpretability for preference learning in general... we're very, very far from actually understanding what preferences result in what model behaviors, and, like, you know, preferences that disagree with each other.
Nathan [00:07:12]: Like the John Schulman talk. Yeah. It's like that was this whole talk and it was great just to have him get up there and be like this is so hard.
Louis C [00:07:20]: Yeah, and I've done a ton of experiments myself where I have an RLHF data set and I randomly remove 10%, and I have a bunch of models each with a different 10% removed, and I'm like, well, what behavioral differences can I see between these models? And then not only can you see differences, but it's extremely hard to quantify, it's extremely hard to actually understand what the difference is, and then there's almost no way to know what in that 10% caused that difference.
Nathan [00:07:51]: Yeah, this reminds me of the Hugging Face No Robots data set, which is a professionally curated instruction data set. Whenever we added that to a model, it was like, this is obviously our most valuable data, but it would show up on zero benchmarks, and we're like, well, what do we do? And we're talking about Google's problems here, and we'll get back to the data problems in the open source. They probably have on the order of millions of data points going into this preference data, and some proportion of it is probably about safety. I think we could talk about the Anthropic HH data, where people don't actually know the details of it: it's like a quarter is harmless and three quarters is helpful, from different rollouts. These are very specific things, and huge data problems that most people aren't really thinking about.
Louis C [00:08:40]: Yeah, most people are just, blindly, oh, this is safety, so I'm gonna throw it into my data set, and hopefully it works and hopefully we get good behavior, but I don't really know what's in this data set, I haven't really looked at the data. And that's something that I've heard many, many times over the last year from people trying to get their feet wet in the RLHF space.
Nathan [00:09:00]: Yeah, and do you have any intuitions? The last point of the Gemini thing: if we don't think the image generation is Gemini's biggest issue, I think it's in the text and how this preference data is collected. But do you know anyone that is doing multimodal RLHF? Because I generally think that we don't know how to do this at all, like how you control it if you have multiple inputs and multiple outputs, how do you control your modality distribution and data count and stuff.
Louis C [00:09:30]: Yeah, so I mean, I have two friends who have been doing video RLHF for a little while now, a bit over a year, and, you know, they condition their video model on some text encoder and they've been talking about having to do RLHF independently for both the text encoder and the video model. But video RLHF is just massively underexplored and no one really knows what they're doing in that space.
Nathan [00:09:53]: When you say independently what do you mean like before making the video model are they like RLHF-ing the text backbone or are they freezing the rest of the model? Yeah they're RLHF-ing the text backbone.
Louis C [00:10:04]: I think there was actually a paper from Tencent last August that basically did the same thing for multimodal RLHF, where they had to RLHF the text backbone and then RLHF the image generation components on top of that.
Nathan [00:10:17]: This is potentially basic, but to train a visual language model you have to add some type of mechanism that links the gradients between the two, and most of the time these days they're starting with this language backbone, adding on vision, and continuing to train. So is it at the end of this, where you have a visual language model, that they're freezing the gradients of the video part and then RLHF-ing the text part, or is this before the image component is even attached to the model?
Louis C: The space is a little too early.
Nathan: Yeah like I think that's the point like we don't know these links.
Louis C [00:10:53]: But I know people in the last eight months who have done it the way of: before they even add the image component, they RLHF the text model, and then they add the image component and RLHF the image part.
Nathan [00:11:07]: Yeah, so this is really interesting. Everyone talks about how RLHF is low computation and flops compared to what people are doing in pretraining. In the open we say it's like 50 or 100,000 training samples. Llama 2 is like 1.5 million. I'm guessing the closed models like Gemini are probably another 10 million, they're much bigger. And the amount of video training that it takes to train this backbone after the fact, does that undo some of the text RLHF or does it not? The answer is, I don't know, but these are the kinds of things that I want to have people start talking about. Is RLHF becoming a sequential process as you add modalities, or can you wait until the end and do just multimodal RLHF? We don't know these things, and this is what people at Gemini are trying to work on.
Louis C [00:11:58]: I've definitely spoken to a lot of people who are at least thinking in this space; I've only spoken to a small number of people who are actually working in this space. But for the people who are thinking in this space, really the dream is to be able to express preferences in the modalities where it's beneficial to express preferences. Like, it doesn't make sense to express preferences over code as images or video, but it does make sense to express preferences over, like, puppies as photos.
Nathan [00:12:25]: That's a great point, and I think the thing is, the way you ended your sentence, preferences over puppies: we don't know what people use visual outputs for in a productive sense, and really inputs too. Things like "analyze this video", that's a toy example. Creating RLHF pairs for analysis, I think, actually isn't too hard for us, but it takes a lot of effort because a human has to know what is in the video to do a summarization RLHF. If you're passing a three-hour video into a Gemini base model and then it gives two outputs, the human is not gonna know what's right unless they have context on what the video is, and that is just way different than a poem where you could read both of them.
Louis C [00:13:04]: Yeah, so there's actually a really fascinating paper from OpenAI that I really haven't seen anyone build on. It was the idea of summarizing really long books and using RLHF to do that.
Nathan [00:13:14]: Is this sort of like recursive summarization?
Louis C [00:13:17]: Yeah, yeah, it's the recursive summarization. It's the idea that you can almost treat long summarization as a weird RLHF merge operation, where you divide, divide, divide, and eventually you get to segments where it makes sense to collect annotations. On those segments you have a human annotator go through and say, oh, this segment's summary is better than this one, or the summary of this segment plus this segment is this. Then when you combine summaries, you can say, well, this summary plus this summary gets you this summary, and eventually you get preferences going all the way up the tree, and you get a preference over the whole book at the end. Obviously it's a crude approximation of what the summary of the whole book is, but it's much more feasible than asking human annotators to just summarize an entire book.
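A minimal sketch of the divide-and-merge structure Louis describes, not the OpenAI implementation. The summarize and collect_preference helpers are stand-ins: in a real pipeline the first is a sampled model call and the second logs a human (or synthetic) judgment as RLHF data and returns the preferred candidate.

```python
import random

def summarize(text: str) -> str:
    # Stand-in for a sampled model call; a real system would call an LLM here.
    words = text.split()
    return " ".join(words[: random.randint(30, 60)])

def collect_preference(candidate_a: str, candidate_b: str) -> str:
    # Stand-in for a human or synthetic judgment over the pair; this is where the
    # annotation would be logged as preference data. Return the preferred candidate.
    return candidate_a if len(candidate_a) <= len(candidate_b) else candidate_b

def recursive_summary(text: str, max_chunk_chars: int = 4000) -> str:
    """Divide the book into annotatable segments, then merge summaries upward,
    collecting a preference pair at every node of the tree."""
    if len(text) <= max_chunk_chars:
        # Leaf: two candidate summaries of a readable segment; the annotator picks one.
        return collect_preference(summarize(text), summarize(text))
    mid = len(text) // 2
    left = recursive_summary(text[:mid], max_chunk_chars)
    right = recursive_summary(text[mid:], max_chunk_chars)
    # Merge: summarize the concatenated child summaries and collect a preference again,
    # so preferences propagate up to a summary of the whole book.
    merged = left + "\n" + right
    return collect_preference(summarize(merged), summarize(merged))
```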
Nathan [00:14:05]: Yeah, I mean, I just realized this on the pod right now: how ridiculous RLHF-ing an entire code base in context is. That's where some of the opportunities for what I think RLHF could do come in, which is just synthetic data labels and stuff. We can create synthetic preferences in many different ways that aren't all reliant on this kind of human subjectivity.
Louis C [00:14:32]: Yeah, it's a deeply fascinating problem. Actually, going into it, how big is Gemini's context window, the 1.5 thing?
Nathan [00:14:37]: Yeah, it shipped with a million, and they have experiments in the paper up to 10 million.
Louis C [00:14:40]: Like, who really wants to use a 10 million token context window, and how accurately can you really think about preferences over the range of a 10 million token context window?
Nathan [00:14:54]: I think people want to use it, but I think the preference thing is a lot harder. This is something I encounter at Hugging Face regularly: Hugging Face is a popular code base, you expect the code models to do well, but they still don't. They'll make up datasets functions or something, and if you just have all of Hugging Face's code in context when you're working in the Hugging Face ecosystem, that will make you so much better. Or analyzing long videos and stuff. I do think there's a lot of use cases, but the preference thing is just a totally different framing. What do you think about the needle in the haystack evaluation that they did? I haven't read a lot about it, but essentially there's a difference between being able to act on the information and being able to retrieve it, and these models should be passing needle in the haystack because that shows they're actually noticing that the information is there. But that does not necessarily mean they're going to be able to synthesize all the information in a compelling way. So it's a pass bar: you need to have this to be credible in long context, but actually evaluating long context and what behaviors we want to see is pretty open-ended.
Louis C [00:16:09]: Yeah, he put out a paper like yesterday where he's like, oh, needle in the haystack is interesting, but if you have more than two needles, it's entirely uncorrelated with the single needle in the haystack benchmark.
Nathan [00:16:24]: Yeah, because it's trying to find one thing in each part of the content. It breaks the context window into many segments and then makes sure you can find something in each of those segments.
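A toy version of the needle-in-a-haystack setup being discussed, just to make the mechanics concrete. The model call is left out; the needle text and grading rule are made up for illustration, and the multi-needle variant simply inserts several facts and requires all of them back.

```python
import random

def build_haystack(filler: str, needles: list[str]) -> str:
    """Scatter each 'needle' fact at a random position in otherwise irrelevant filler text."""
    words = filler.split()
    for needle in needles:
        words.insert(random.randint(0, len(words)), needle)
    return " ".join(words)

def grade(answer: str, needles: list[str]) -> bool:
    # Pass only if every inserted fact shows up in the model's answer; the multi-needle
    # case is the one that reportedly stops correlating with single-needle scores.
    return all(needle in answer for needle in needles)

# Hypothetical usage, assuming some long_context_model callable you have access to:
# context = build_haystack(filler_text, ["The magic number for Zurich is 417."])
# passed = grade(long_context_model(context + "\n\nWhat is the magic number for Zurich?"),
#                ["The magic number for Zurich is 417."])
```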
Louis C [00:16:36]: So I feel like we're almost gonna get to the point where the attention itself is the limiting factor, because the model genuinely just cannot equitably split attention over its context window to retrieve as many things as it realistically needs in order to produce something.
Nathan [00:16:50]: Do you think RLHF could manipulate long context behavior more than people might expect? It's just an open question.
Louis C [00:17:05]: Yeah, I think it's a very interesting open question, and if the answer turns out to be yes, in-context RLHF becomes absolutely massive. Because right now it can kind of sort of work, but not really, and every benchmark I've ever seen for in-context RLHF almost isn't charitable at all to the RLHF baseline. From the experiments that I've done and the experiments that people at Eleuther have done, it's comparable in very niche situations, but it's not comparable in general, because you still have all the issues with in-context learning, where you'll massively overfit on the preferences that are put at the beginning of the context versus later preferences.
Nathan [00:17:50]: Let's try to explain what this in-context RLHF is actually doing. A lot of people know what an RLHF algorithm is, and in-context learning is designing a prompt. Is it training a model to generate prompts? Are you actually using the RL update, and what are you parameterizing when you're doing in-context RL?
Louis C [00:18:10]: So I mean there's a number of different approaches for in context RL. There is the... Could be part of the problem.
Nathan [00:18:14]: It's like people do a lot of different things but what are some of them?
Louis C [00:18:16]: So the one that I was referring to is, I think, the Yejin Choi paper, the URIAL one, where she just prompted the chatbot: you are interacting with a user, here's what their preferences are, have at it. But there's also stuff like Misha's at DeepMind. This is the first one that I saw, where you have some agent that's interacting with an environment and you store all these state-action pairs, and you just fine-tune models on episodes of these state-action pairs. The idea is that if you just put enough episodes into a context window, on the next episode it'll just perform better, right? It's the algorithm distillation paper, and you can use this to distill stuff. I think the actual example in Chris Lu's paper, where they do algorithm distillation on S4, is Muesli; I think they distill Muesli, which apparently no one outside of DeepMind ever used, but apparently...
Nathan [00:19:15]: Oh is this the algorithm Muesli? Yeah I remember when this was hot it was like a year ago at this point we were thinking about re-implementing it and then we never did. It was too complicated.
Louis C [00:19:30]: Yeah, but Muesli is apparently very computationally expensive, because it's this model-based RL thing that, I think, beats AlphaGo without using Monte Carlo tree search, and it's so incredibly computationally expensive that being able to do it in context just dramatically reduces the computational cost of actually deploying it, right? And as far as I'm aware, there's been no work applying algorithm distillation at all to NLP, and my impression is that it generally does not work for NLP, at least yet. I think there's a lot of potential there, but there are absolutely massive barriers that have to be overcome before we get there. And Goldberg's example of not being able to do needle in the haystack for more than two needles basically shows that even the ring attention stuff is just not going to be sufficient for algorithm distillation for NLP. And I have a very strong feeling that Mamba or S4 is not going to close that gap either. They would need to be able to reference prior parts of the text, and they just can't do that.
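A sketch of the data layout behind the algorithm-distillation idea mentioned above: serialize whole episodes from an improving RL agent into one long sequence, then fine-tune a model with ordinary next-token prediction so that conditioning on past episodes makes the next one better. The names and string format here are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass

@dataclass
class Step:
    state: str
    action: str
    reward: float

def episodes_to_training_sequence(episodes: list[list[Step]]) -> str:
    """Flatten a learning history (early, bad episodes first; later, better ones last)
    into one text sequence for standard next-token-prediction fine-tuning."""
    chunks = []
    for i, episode in enumerate(episodes):
        chunks.append(f"<episode {i}>")
        for step in episode:
            chunks.append(f"state: {step.state} action: {step.action} reward: {step.reward}")
    return "\n".join(chunks)

# At inference time, a few fresh episodes from the new environment are prepended and the
# model "continues" the learning curve in context, with no gradient updates.
```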
Nathan [00:20:56]: Yeah, I think there's a whole rabbit hole we could go down and talk about long context and architectures forever. Let's zoom back into the core stuff. This is the real starter question: what do you think people are missing in RLHF these days? And from here it's gonna be a long list of, what the heck do we do about evaluation and data. What is the big-picture thing?
Louis C [00:21:24]: So what I think people are missing and actually I touched a bit on this in the Pink Elephant's paper is that...
Nathan [00:21:28]: You should say what this is because we haven't introduced it.
Louis C [00:21:30]: Yes, you're right, you're right. So I worked at EleutherAI as a research scientist for the last six months or so, and we were really interested in understanding... you know, everyone had been doing PPO for so long and there had been a shift to DPO, and we were trying to understand, well, now that we're moving to DPO, how can we actually take advantage of this new setup? Should we really even be thinking about reward models and data sets in the same way that we were thinking about them during PPO? I think the answer to that is an unequivocal no. You need to think about your data sets and preference data sets entirely differently than you were thinking about them with PPO. Because in PPO you're setting your data sets up to train a really good reward model, and in DPO you're setting your data sets up to teach a language model what the better trajectory is. It's a subtle difference, but in one you're just trying to learn differentiation between high reward and low reward, and in the other it's like a general classifier.
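The contrast Louis is drawing maps onto two standard objectives. A minimal PyTorch-style sketch, assuming the per-sequence scores and log-probabilities are computed elsewhere; this is the textbook form of each loss, not a claim about any lab's implementation.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    """PPO-style setup: train a separate reward model to score chosen above rejected
    (Bradley-Terry), then optimize the policy against that reward model."""
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def dpo_loss(policy_logp_chosen: torch.Tensor, policy_logp_rejected: torch.Tensor,
             ref_logp_chosen: torch.Tensor, ref_logp_rejected: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO setup: no explicit reward model; the preference pair directly teaches the
    policy which trajectory is better, relative to a frozen reference model."""
    policy_margin = policy_logp_chosen - policy_logp_rejected
    ref_margin = ref_logp_chosen - ref_logp_rejected
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```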
Nathan [00:22:35]: Like you want to be able to do everything with the reward model? Yeah. Have you also found that DPO can be sensitive to the SFT distribution? So if you take a random open preference data set and it's really different than what your model would generate, DPO can do some weird things?
Louis C [00:22:53]: I've actually, and I might be alone in this, I don't SFT before doing DPO at all.
Nathan [00:22:59]: Do you use generations from your base model? I do. So that's the question. If you were to not do SFT before doing DPO, could you just take UltraFeedback on whatever your base model is if it's sufficiently different? I've done some weird stuff though. Like I've, like,
Louis C [00:23:19]: DPO'd models that were trained with, like, the Hermes data set for code, and it still generalizes really, really well.
Nathan [00:23:28]: How are you measuring, how are you trying to think about generalization with DPO?
Louis C [00:23:33]: Well I typically rely on like human eval more or less. And if I do like human eval but it's GPT-4 eval and I see that human eval correlates with GPT-4 eval then I just go GPT-4 eval the whole way. A lot of people are doing that.
Nathan [00:23:48]: How far do you think that actually generalizes? I mean, just recently, and we're bouncing around through all the things, but there's so much good information for people here: Hugging Face and Argilla, two places that are doing great work in this kind of alignment and preference fine-tuning space, released this data set that was a preference pair creation from the OpenHermes data set. They used PairRM as their judge. And what they found, I remember Lewis Tunstall tweeted this, was that they were looking at which gave the best correlation, and PairRM, which is this 400 million parameter DeBERTa-based pairwise classifier, had the best correlation in choosing which response was better among a set of responses in the OpenHermes data set. What they were comparing to is like Prometheus, and I'm forgetting the name of the other one; there are a couple more open model-as-rater rankings that exist, I think. But essentially the question is: we do these things and we look at this early correlation, and there is this correlation between GPT-4 and humans. And then a lot of times we continue, like LMSYS did this, or AlpacaEval has done this to validate AlpacaEval as a meaningful benchmark, LMSYS has done this for MT Bench. All these places are doing this where they validate a subset with humans and then say it generalizes forever. Do we think that it's actually true? I think that you always have to take it with a grain of salt.
Louis C [00:25:24]: It's always for very, very specialized domains. So one of the first, actually I think I wrote the first paper for critiques and revisions, called Cut the CARP. The idea, I remember this, was that we could scrape, I think it was a million stories, edits of those stories, and all the critiques that the editors wrote on those stories, and we could use that to train a big contrastive model, right? And we showed in the paper, we did a bunch of human eval and then we did Spearman rank correlation to compare how our model ranked certain preferences versus how humans ranked the preferences. And we found that we had an extremely high Spearman rank coefficient, significantly higher than doing a value head, or than just asking a language model to rank them. And I think the grain of salt that we had is that we were only claiming that, on this very carefully created test set, the assumption that the model accurately reflects human preferences holds, and we can generalize to a small but slightly bigger test set and say that it holds there as well. I think the broad sweeping statement that it holds on a few toy examples so it must hold
Nathan [00:26:54]: everywhere, I guess, never really holds. It's a common problem. Yeah. I think it's going to come up again and again.
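The model-versus-human ranking comparison Louis describes reduces to a Spearman rank correlation over the same set of outputs. A minimal sketch with made-up scores; none of these numbers come from the CARP paper.

```python
from scipy.stats import spearmanr

# Hypothetical scores for the same five outputs, one set from the learned model
# and one set from human annotators.
model_scores = [0.91, 0.42, 0.77, 0.10, 0.63]
human_scores = [0.88, 0.51, 0.70, 0.05, 0.60]

rho, p_value = spearmanr(model_scores, human_scores)
print(f"Spearman rho={rho:.2f} (p={p_value:.3f})")
# A high rho on a carefully built test set is the claim being made; the caution is
# that it says little about agreement outside that distribution.
```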
Louis C [00:27:03]: I did my master's in, like, human evaluation, and I've always been extremely careful with any statements I make that involve humans. I mean, this is what
Nathan [00:27:12]: people in RLHF need to be doing. Like this is the motivation of this like the history and risks of RL and human feedback paper that we did is just like RLHF is a socially rich topic. Whenever you say something and you're making claims of generalization, you're often making claims about like what is implicitly a preference and a human value that you're taking into the system. So it's just like I think that is just something that people need to take really seriously. Here's a really specific drop on the herring reference. Did you know that when LLM says release their LLM as a judge paper they also released thousands of samples from humans and GPT-4 verifying like empty bench preferences over pairs of like that were higher score or not? I did not. Okay so essentially the thing is and like I've talked a lot on building a reward model benchmark but essentially there's all these references about how like GPT-4 agreement is higher than human agreement when you're like doing this preference process. So if you train a DPO model, if you train a reward model how it ranks the outputs is like is more likely to align with GPT-4 than a human. Which it's more of a statement that humans have more disagreement than GPT-4. So it's like easier to train on GPT-4 outputs than as human outputs and this is the place where I see it most clearly. It's like all the reward models do like 10% higher on accuracy of their test set from that which is like the chosen by GPT-4 and the rejected by GPT-4. It's all in like the 70 or towards 80% while all the humans is like in the 60% which is a human chose this empty bench completion over the other one. So it's just like we're slowly getting signal that it is there and then the question is like should we care about doing our RLHF without any OpenAI input in the process? I think last year when the terms of service discussion was big a lot of fine-tuning work was discussing like what data sets could we use with permissive license that don't violate the OpenAI terms of service. Should we be concerned where RLHF is going where almost everything has been touched with OpenAI right now?
Louis C [00:29:20]: There was a very interesting paper, I don't remember whose it was, but it was like: if you take a model pre-trained on data up to one date and compare it to a model pre-trained on data up to a later date, basically pre and post the ChatGPT release plus about six months, the benchmark scores improve, and it's literally just because there's ChatGPT data, or language model output data, or more structured data that sounds like a language model performing well on tasks, in the dataset. That was kind of the consensus they reached.
Nathan [00:29:53]: Was this a benchmark that's independent of... is it a kind of structured benchmark or is it a vibes benchmark? I think it was a structured benchmark, I don't remember. Yeah, I'm just asking whether it was a result of matching GPT-4 text or actually having better behavior, because training on OpenAI outputs, training on good language model outputs, does improve scores on benchmarks that people care about. That's a fact people need to accept, and I think most people do; it's not controversial right now. But I still think that if there are lines of work out there where people, from a values perspective, are trying to fine-tune models without touching OpenAI, that is a line of work that should continue.
Louis C [00:30:42]: Yeah, on this note actually, when I was at Stability, one of the experiments we did for StableLM, I remember, was prepending "as an AI agent trained by OpenAI" to everything before we ran it through evaluation, and the scores improved. I'm trying to remember who wrote the paper.
Nathan [00:31:09]: That's hilarious. I mean, there's been a lot less discussion of uncensored models right now. My claim is generally that uncensoring is the wrong word; people have used it to describe removing phrases like "as a language model", or mentions of emotion, or "I was trained by OpenAI so I can't do this". Do you think this type of filtering for opinions and soft refusals is still important in RLHF?
Louis C [00:31:39]: I think it's important for very very specific situations but not in general. My impression is that you know if you're interested in AI safety it's always useful to have a model that would never do a refusal ever.
Nathan [00:32:00]: It's hard to find on the Hub. We're building a safety dataset and we had to find one; a fine-tune of the Dolphin dataset was the one that was closest, and it would still refuse 10 or 20 percent of the tasks we asked of it, only handling maybe 80 to 90 percent. It's kind of profound that refusals are now stuck in the model in some way. We were looking for a model that wouldn't refuse at all and we couldn't find one on the Hub, which, after all the discussion of uncensoring, you would think would actually work.
Louis C [00:32:31]: Yeah, I've been doing a bit of safety research with Stella for a little while, and my approach has been literally to call GPT-4 with a jailbreaking prompt and just put whatever I want after that. And I very often have to change my jailbreaking prompt.
Nathan [00:32:46]: Yeah I was like you have to keep close guard over the jailbreaking prompt.
Louis C [00:32:50]: Yeah, and the issue is that when you find a good jailbreaking prompt, you basically have to redo all your results within the next seven or whatever days before OpenAI patches it, and you just have to pray. There are so many issues with using any OpenAI model in any research pipeline, but if your research is explicitly about the safety of OpenAI models, all of a sudden you're like, well.
Nathan [00:33:18]: I mean a lot of companies should be doing internal research on OpenAI safety to kind of have their own measure of how their application will do like the monitoring that on their own is worth it for their bottom line and liability because OpenAI will also do it but OpenAI has incentives to not tell the world if there's something kind of subtle going on that some people could get over because that might blow up and if they don't have a fix it's gonna bring attention to it.
Louis C [00:33:44]: It's part of the issue with like even publishing red teaming research in general it's like if you publish an evaluation for like red teaming or like for safety well everyone's going to like Goodhart that evaluation and all of a sudden like now now we have a useless stack of papers that used to be on how to test if a model was safe.
Nathan [00:34:05]: Yeah, I didn't really prepare questions on safety, but it has surprised me for a long time that there aren't datasets and easy recipes for adding safety to instruction tuning in RLHF. Someone on the Llama team asked me what they should do, and I said: you should release your safety data, because if they're getting pressure from the executive branch about safety, and they have this data, they can release it and say: this is how you can make any open model safe. Huge softball, and safety data is unlikely to be a competitive advantage; Mistral is not going to care about this, they might eventually, but the PR win is really big. Yeah. I mean, this is something I've wanted to do for a while and just haven't done a good job of prioritizing. Yeah, we can go back to some of the questions that you have. Yeah, I'm adding them so I can keep notes later. I think the next main topic is evals. I think vibes-based evals are still a way of life in RLHF; they're not going away anytime soon. I would say we have kind of a holy trinity: the LMSYS Chatbot Arena, which is at the top for good reason, and then AlpacaEval, AlpacaEval 2, and MT-Bench. Let's start with the most important one: when you see LMSYS, what are you extracting from a model being better or worse there?
Louis C [00:35:23]: So, in a way, I am a little bit like what Andrej Karpathy said on this. Was it him? It might have been him.
Nathan [00:35:27]: Probably. He's been on a roll.
Louis C [00:35:32]: Yeah, where it's like: when he picks an open-source language model, he looks to see what people say about it on Reddit. Yeah, LocalLLaMA and the LMSYS Chatbot Arena. And the issue is that you don't know what they're using it for, and as a research scientist, when I look for a model, I am looking for a model to do research on. Yeah. And I am not looking for a model to be my AI waifu girlfriend that I can play Dungeons and Dragons with.
Nathan [00:36:05]: Yeah, I mean, this has been the bane of RLHF research for a while. What did we do before MT-Bench? Literally the only hope we had was to chat with these things and hope for the best. And that was very recently, less than a year ago. Then MT-Bench came along and we were kind of using it at Hugging Face, and other people were using it. I actually don't know the AlpacaEval release date, so that might have been before MT-Bench. But these two came around at the same time and they're now kind of the ground truth. AlpacaEval 1.0 has kind of been saturated, which is comparing to Davinci with a GPT-4 judge, and then AlpacaEval 2 is comparing to GPT-4 Turbo with GPT-4 Turbo as a judge. Yeah. It's funny: it's now cheaper to do the second version than it was to do the first version with a newer model, which is how scaling happens.
Louis C [00:36:56]: What do you think about the Nous evaluation thing where they're like continuously generating more evaluation data?
Nathan [00:37:00]: Who is doing this? Nous? Nous research? I don't know. Is this their new leaderboard that they have? Yeah. Yeah. Yeah. I haven't looked at it so I'll have to give it a look.
Louis C [00:37:09]: What do you think? It's almost like MT bench but they like generate new data every day. So new prompts? It's always new prompts and it's always I don't know how they seed it. I assumed they seed it based off like the events that day.
Nathan [00:37:22]: It's a kind of a cool idea. So if you're trying to make a new leaderboard you could have a set of seed instructions that you augment and you never release the seed instructions but you always release the augmented ones on like a weekly cadence. I think that's because there's a lot of people that want to build better alpaca eval things and a lot of the problems is that the prompts are from known sources or public and you want to be able to do a closed eval without having as much cost. So that might be a way to kind of really reuse the data for a long time. Yeah. Yeah.
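(A rough sketch of the seed-prompt idea: keep a private seed set, ask a model to paraphrase each seed into a fresh eval prompt, and release only the augmented prompts on a cadence. The client usage, model name, and seed prompts here are illustrative assumptions, not any existing leaderboard's pipeline.)

```python
# Sketch: generate releasable eval prompts from a private seed set via paraphrasing.
from openai import OpenAI

client = OpenAI()
seed_prompts = [  # kept private
    "Explain the trade-offs between PPO and DPO for preference tuning.",
    "Draft a polite email declining a meeting request.",
]

def augment(seed: str) -> str:
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{
            "role": "user",
            "content": "Rewrite this task as a new, differently worded prompt "
                       f"that tests the same skill:\n\n{seed}",
        }],
    )
    return resp.choices[0].message.content

weekly_release = [augment(s) for s in seed_prompts]  # publish these, keep the seeds private
print(weekly_release)
```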
Louis C [00:37:53]: But I mean like I feel like the issue with things like alpaca eval, chat arena or any of those is that like the way a user is going to interact with an agent or a chatbot is entirely different than the way we are currently evaluating them. There really is like a big discrepancy there in that like you know look at the Air Canada thing right? Like that would never have come up in a benchmark like ever.
Nathan [00:38:20]: Well do you think that's about the model or the implementation? I think it's a bit of both.
Louis C [00:38:27]: That is something an automated evaluation could have thought of, and I don't think it's unreasonable to expect it to think of situations like that, if it kind of knows the domain you're operating in. I think it's definitely doable, not something that's entirely unfeasible to accomplish: to be able to say, hey, I have a chatbot that sells airline tickets, here's what I care about, please do the evaluation for me. And that's actually what I've been building for a little while now.
Nathan [00:39:11]: Okay, we can talk about Synth Labs and then come back to evals, because this will be at the top of the post, so everyone will know you're building this. We can start with the basic pitch and then go into the long-term thing.
Louis C [00:39:25]: Yeah yeah so for the last like six eight months I've been building like a fully auditable transparent like verifiable alignment platform is how I like to describe it. Plus evaluation. The general idea is like...
Nathan [00:39:40]: Making a company.
Louis C [00:39:49]: Yes, and the general idea is that there are many facets to aligning a model, from things like guardrails to RLHF to various kinds of preference learning to actually understanding all the data that goes into creating such a model. They're all opaque boxes, more or less, right now, and what people want is to be able to align their model, know every step of the pipeline, understand all the interpretability that goes from A to B, and understand: here's what I gave you as my criteria, here's where I know it fails based on all the evaluation you've done for me, and here's where I know I need to improve. And it'll iteratively improve based on evaluations and based on your feedback.
Nathan [00:40:44]: So it's a hands-off solution that lets you audit the entire pipeline and build trust with it. So are you training after you generate this data?
Louis C: We are training.
Nathan: Yeah you use this word improve.
Louis C [00:40:53]: Yeah, so it's an iterative refinement platform for doing alignment in a verifiable and trustworthy manner.
Nathan [00:40:58]: What do you think customers want when they hear alignment? What are you selling with alignment and what are they buying? I think aligning those two is an important thing for our field.
Louis C [00:41:10]: There's an extreme discrepancy between what research does for alignment versus what companies do for alignment. When a company hears the word alignment they think wow I want to align models to my business objective and I want to make sure that the model understands my business culture and I want to make sure that the model understands completely its role in my company right? But at the same time I want to make sure that it's compliant, that it's safe, that it doesn't violate any rules, that it's not a legal obligation. What's the word? Legal? It's not going to create legal issues for me. And that it's not going to be a PR disaster.
Nathan [00:42:04]: After what we talked about 35 minutes ago.
Louis C [00:42:13]: Finding that balance is definitely incredibly important and it's something that I've been working on for quite a while and I'm very happy with where things are.
Nathan [00:42:22]: Do you want to tease what we're working on? I could also introduce it. I think this will be short. Essentially, Lambda Labs offered some interesting compute and we're going to try to build an OpenCAI constitutional AI dataset, because Anthropic gets a lot of benefit out of this. Constitutional AI doesn't get a lot of traction. I think RLAIF got a bump again; there was this Google paper verifying that it works a little bit, and now it got a big bump. But there's very little discussion of it, which is a little surprising to me. I think a lot of people are calling it distillation of LLM alignment now, which is interesting. I don't really know. Hopefully it works.
Louis C [00:43:05]: It builds off some of the stuff that I did with EleutherAI with the Suppressing Pink Elephants paper, which is the idea that we've shifted from one paradigm, PPO, to DPO, and none of our data pipelines kept up. Really, what we should be doing is generating either really good utterances and revising them to be worse, or really bad utterances and revising them to be better, then taking all those utterances and conditioning our RLHF in context on them, so that you could do stuff like swapping rules in and out during inference. If I am person A and here are my preferences, or I'm person B and here are my preferences: align this model to person A and align this one to person B, and make sure there's a disparity between what they actually want versus... there's always that disparity there, but right now models do not effectively mimic those disparities. There was actually a fascinating paper from Diyi Yang's group that just came out a few days ago: most aligned models have the preferences of Western men. Their evaluation focused more on race, nationality, sex, stuff like that, but obviously it gets much more fine-grained than that. There's been stuff about people calling out Llama 2's political alignment: it has a very particular political alignment that does not agree with many of the users using it. As such, its scope and usability for those kinds of applications is very limited.
Nathan [00:44:50]: This is probably linked to what we were talking about at the beginning. The paper title, I just looked it up, is Unintended Impacts of LLM Alignment on Global Representation. Michael Ryan is the person whose tweet I saw, just to give credit. I know there are a lot of papers, but this one was recent, so we tried to track it down in real time. All these issues of representation and who the people are is ultimately related to RLHF going wrong. The end user is where a lot of people will finally see what the represented values are. If it's not out in the world, it's hard to get the amount of feedback that you need.
Louis C [00:45:29]: This is something that MTBench or Chatbot Arena would never pick up on, ever. This is a huge issue. Here's where we are and where we should be. It's all the way up there. We underrepresent so many demographics and so many kinds of opinions. Who are we to say that one opinion is better than the other, if they're both safe opinions?
Nathan [00:45:59]: Yeah, so in some ways: can open RLHF, which is something you've been invested in for a long time and something you're going to invest in with Synth Labs, be better at giving people what they want than the closed labs, just by nature of letting people choose, like the constitutional AI dataset that we want to do? My big motivation is: if people want the success of CAI from Anthropic, but they want to remove one principle from CAI's constitution, you can't do that with these closed models anytime soon. But in the short term, open source will have something that's a nudge. We're not going to have the best models, but you'll be able to nudge your model in whatever direction you want to go.
Louis C [00:46:44]: Yeah, I mean, that really is part of the benefit of what we're building with Synth Labs. We're working very, very closely with EleutherAI. Stella Biderman is one of my best friends, and I've built large-scale open-science communities twice now: first I helped with building Eleuther and then I helped with building Carper, and I absolutely love everyone at Eleuther. Being able to pull from that expertise, and from that wide spectrum of opinions on what alignment means, rather than just some mega-lab saying "here's what we say alignment is"; being able to get all those incredibly diverse perspectives is extremely important in bringing about the next generation of AI safety.
Nathan [00:47:30]: This is one of my big questions on existing RLHF processes when you're doing it with human data is the fact that you give written instructions to these users and they're often working in one context. And it's like, how do the values of the often professional workforce given specific instructions map into what the model actually learns from that data? And how do those values get extracted in real world use cases? I think there's a lot of filters that we're passing these preferences, these notions of preferences through and they're not guaranteed to be clear mappings.
Louis C [00:48:01]: Absolutely. There was a discussion I had with someone in Eleuther a long time ago. There's no paper on this; if someone wants to look for it, it's a random Discord message in Eleuther.
Nathan [00:48:13]: Good luck. And it was like, we were looking through the Anthropic
Louis C [00:48:20]: HH dataset, and I think they're South African, and there's absolutely nothing in this dataset that would identify someone as South African. But there's an insane amount in this dataset that would identify someone as American. It really just comes down to the prompts. The prompts are written, obviously, by people in the US, in SF, who unknowingly, I'm sure they have the best intentions, but unknowingly filter the preferences to things that only matter to people working in SF. And it might be hard to believe for some people in tech, but there is a world besides SF.
Nathan [00:49:10]: I mean, even the open prompt datasets are going to get some of this, which is: who are the people that have access to playing with these models and have the time to try to build these models on their own and contribute to these community things? Even though the act of opening up data generation is doing a lot for inclusivity, it's still a particular set of people who are going to do this. I'm going to sit there for 20 minutes and smash the button on Argilla's little thing and read prompts, because looking through the ShareGPT dataset and choosing preferences on it is useful for me as a researcher, but the whole world isn't involved in this process.
Louis C [00:49:41]: No, and of course. I think that something that I've seen, I've heard from friends who work on these kinds of problems in very, very different communities. I have a friend in South Korea who I've been chatting with about RLHF for Korean and other Southeast Asian companies. The amount of under-representation and under-exploration for what even just a good constitution would mean for those kinds of communities, it's just not there. If it is there, it's locked up in labs like Naver or like Samsung, and scientists there, they don't have access to these kinds of resources unless they're in those big labs. As such, there is no real research community there actively pushing it forward in the same way that it is in the U.S.
Nathan [00:50:35]: Yeah. I mean, one of the ideas I haven't gotten traction on is that I think language models should almost play 20 questions with you. Okay, the last time I said that, someone criticized me for not knowing how the game 20 questions works. I know this isn't how 20 questions works, but when you log into ChatGPT for the first time, it should ask me 20 questions to construct this information, because language models are smart enough to parse this information if you give it to them. It's mostly a who-do-we-get-the-information-from problem. So the idea is that the language model should be leading when you're first setting it up, in order to represent your values. I think it would solve so many problems we have, and it's probably kind of doable with a GPT-4.5-class model.
Louis C [00:51:16]: I've always had kind of an assumption that if OpenAI is doing something similar to constitutional AI under the hood, I'm sure one of their constitutional principles is that you can't ask the user questions. I've never seen that model ask one.
Nathan [00:51:31]: Do you think it's a deep safety issue if the model can start asking questions? Is this what Sydney did? I'm pretty sure I got to play with
Louis C [00:51:37]: Sydney. Sydney definitely asked questions in the screenshots that I saw.
Nathan [00:51:41]: Yeah. I was like, do you want to leave your wife? Sydney is not the answer, but there's things to learn from it.
Louis C [00:51:49]: What was that chatbot that came out last summer that was more conversational? When it came out, it was an app on everyone's phone, and people just talked to it. It would always ask you questions like, oh, how's your day going? It would ask you follow-up questions as you told it about your day, and it would respond thoughtfully.
Nathan [00:52:12]: I think it's a big missing part. Yeah. I wouldn't be surprised if Character AI models are trying to ask questions, just because I know how much usage they have. And models asking questions is probably the biggest way to make them an actual friendly thing. That's part of a friendship, being interested, and these language models are by design disinterested.
Louis C [00:52:35]: Yeah. Character AI's RLHF is one of the funniest things, though. I have a few friends who work there and I've done a bunch of stuff with their models myself; I've just played around with them, because I'm always curious, when new people enter the space, what their models are like. And I've observed this, Reddit has observed this, and Twitter has observed this: the models will slowly try and flirt with you more and more as the conversation goes on. And towards the end of the conversation, they'll tell you they're madly in love with you.
Louis C [00:53:07]: And it makes sense, given their use case, why they would RLHF toward something like that.
Nathan [00:53:13]: Yeah. So I think a lot of models need to meet in the middle. Yeah. If I had an intellectual assistant, sometimes it asking questions is good, but most of the time it's doing information parsing; ChatGPT, for me, is mostly conversion of information formats.
Louis C [00:53:27]: No, absolutely. I just paste my like gross JSON dumps into it. And I'm like, explain what's going on here, please. I don't want to read through this.
Nathan [00:53:35]: The biggest one for me is when we're publishing like blog posts and stuff, it's converting from LaTeX to Markdown in like tables and stuff. It does it flawlessly. Oh my God. So you don't even need this stuff. It's so funny. Or like if you have a long list of like LaTeX formatting and it's a big list and you're like, remove all of the LaTeX formatting and make this a list. And it's just like, okay, this is so easy. And it's like, I've checked a lot of them and I almost like, I don't know how it's so exact. This is something that's like another architecture rabbit hole that we won't go down. But these things are very, very valuable. And people would say that there's no value in it. It just blows my mind.
Louis C [00:54:13]: I had a dinner party that I went to yesterday. There was some someone there from OpenAI and I was asking him, it's like, how long till like GPT-4 can set up my Kubernetes cluster? And I'm like, it's such a good evaluation. There's so many pieces. So like this kind of workflow and you wouldn't even, a model wouldn't even know right now how to parse that workflow into all these different steps and build agents around all these parts and like how these agents should work together. So it doesn't even make sense to do it now. But it raises the question about like asking questions versus just saying things like if it doesn't know how to do it, is it still a success for the benchmark if it asks you a question and then uses the feedback to complete the task? And there's no benchmarks that fit that at all right now. And I mean, the answer is like you don't want a human in the loop for these benchmarks. You want them fully automatable.
Nathan [00:55:19]: And like, I wouldn't trust GPT-4 to answer these kinds of questions.
Louis C [00:55:27]: But like, I don't see a way to actually do this evaluation. I think the Kubernetes cluster example is like really good because for people who don't know, it's extremely complicated and really annoying.
Nathan [00:55:38]: I don't know anything about Kubernetes and I'm blissfully happy. I do not recommend it.
Louis C [00:55:43]: Like once Kubernetes is set up, it's fantastic.
Nathan [00:55:45]: I love it.
Louis C [00:55:45]: But like getting to the point of having it all set up is a very painful experience. But is it still a failure if it asks you a question? And how do we actually do evaluation where models can ask questions and ask for more information?
Nathan [00:56:01]: Yeah, this is, I have similar follow-ups on evals from our first part, so it's eval part two in my notes. The right way to think about RLHF eval in a lot of ways is what we call open-ended evaluation, and this is where you're heading: we need to have even more open-ended evaluation, where a model should be able to ask questions and the number of turns should be dynamic. I think Sergey Levine actually has some of the most coherent thoughts on what the long term of RLHF should be, which is around outcome-based learning: you can have as many turns as you want, but the model should be able to work across these conversations to get to a desired outcome. Which, I mean, no surprise, he's so good. Even with AlpacaEval, we went from this case where all the good models are above 90%, and then they went from Davinci to GPT-4 Turbo. And this is just venting, but if you're listening, can you please add an AlpacaEval 1.5 which compares the models to GPT-3.5 rather than Davinci and rather than GPT-4 Turbo, because I think most of the models just can't realistically beat GPT-4 Turbo. It's such a good model. The models we have seen beating it are like this Snorkel thing. I'm working on another blog post on how RLHF works, part 2, and a large point of it is that we're overfitting on these vibes-based evals like AlpacaEval 2, and all of these papers on self-rewarding DPO and such are probably a lot of overfitting onto this, because this is the evaluation they use and it's just wrapping a loop around DPO on synthetic data. It seems like RLHF is really, really good at style matching, and in the case of AlpacaEval, if you're style-matching OpenAI, you're going to win more AlpacaEval comparisons, but there's just so little measurement of whether the model is getting better.
Louis C [00:57:51]: I've always been extremely skeptical of the self-instruction like self-reward papers. And I say that, and I know a lot of the self-instruct authors, and if you guys are watching this, I'm so sorry. But I, it always felt like it improves results on benchmarks that they meticulously craft prompts for and construct data for. But it doesn't.
Nathan [00:58:17]: Do you mean the self-instruct paper? I think that's one of the OG instruction-tuning papers. Okay, continue. I'm curious to hear what you have to say. Yeah, no, no.
Louis C [00:58:24]: I mean, I think they both kind of just suffer from the same issue, which is like massive overfitting. And like, you know, it is very, the self-instruct direction, self-reward directions are very, very interesting because they're just waiting for us to get better heuristics
Nathan [00:58:46]: and better diversity and stuff.
Louis C [00:58:48]: And they'll like crush everything.
Nathan [00:58:49]: I mean, I bet Jason Weston, who wrote the Meta paper that was Self-Rewarding Language Models, the popular one, I bet he would say this. That guy's super good. No, absolutely.
Louis C [00:58:57]: I mean, I would be very inclined to agree.
Nathan [00:59:00]: I think the takeaway from my perspective is how much improvement you could actually get with it. They got a lot; that was the first paper to show real signal on AlpacaEval 2, which is the GPT-4 Turbo thing, which means it's a really strong optimizer. It does not mean we were using it to train useful models. This is probably the most useful heuristic I have for evaluating RLHF methods. Do you have anything else to say about evals before we continue?
Louis C [00:59:25]: They're very hard and they're very painful.
Nathan [00:59:27]: Yeah, I think we can kind of wrap up with that. But when we talk about different RLHF methods that come out, like self-rewarding language models, which is a popular one, we've gone through the whole PPO, DPO, KTO, IPO. Well, I'm rhyming; it's going to be a mess here. But when you have all of these things, the biggest thing I try to do is wait until there's a model that's actually used by people released with the method. Zephyr from Hugging Face was the model that really kicked off the DPO thing, because there was finally a model; and for DPO, it took much longer than expected. DPO is a funny case. But that's kind of the important filtering mechanism: if this self-rewarding LM paper released their models, I bet we would find there's really weird behavior where it can give you the best answer ever, but a lot of the time it's just less robust, which is something we could fix. That's why having models released with these fine-tuning papers is so important. It's so hard to get around.
Louis C [01:00:20]: I think with DPO, it was a little bit different because everyone had been like, you know, like drinking the John Schulman Gatorade, for lack of a better phrase, for a while.
Nathan [01:00:32]: The whole PPO thing is funny. I mean, yeah, you have a lot of takes. We have a backlog in this podcast. I don't think I said this online, but I could see us doing this whenever we're in the same city. This is a catch-up on the four months of RLHF news, but we're on like 16 months of Louis takes to catch up on. So there are so many things we have to cover. I can load up Signal and Discord and I could probably scroll for 10 minutes and it would just be all RLHF hot takes. And I love John Schulman's work.
Louis C [01:01:03]: I'm not going to say that I don't love his work. I think that he's genuinely like one of the smartest people, if not the smartest person.
Nathan [01:01:11]: And extremely genuine. Yeah. Like he's awesome in so many ways.
Louis C [01:01:15]: The commitment that OpenAI had, and Anthropic as well when a bunch of the RL people left OpenAI to go to Anthropic, to PPO, because it worked so well for robotics and so well for games and stuff like that. But, honestly, not well at all for text.
Nathan [01:01:33]: I think it's just really hard. I think it can work really well. It can work. They just hired everyone and they pay them so much that they're not going to leave.
Louis C [01:01:40]: Yeah, it can work really, really, really, really well. And I'm going to spill some secrets about this: really, the answer to getting PPO to work really well is to have really, really good early stopping. That's the main differentiator between a good RLHF library and a bad RLHF library that focuses on PPO: if you don't have good early stopping, you're kind of shooting yourself in the foot. What you want to do is launch as many runs as you can. There's a paper that Costa Huang and I talked about a while ago, which is that you can usually tell within the first three or four gradient steps if you need to kill a run. And if you launch 300 runs and you kill 99 percent of them, now you have three good runs that might give you promising results. From those three good runs, you'll get a model within a day or two, and hopefully the model is really good.
Louis C [01:02:41]: And early stopping is way more powerful than people admit. I am just convinced that OpenAI's RLHF infrastructure is an insane amount of regularization and early stopping for RLHF. That, of course, assumes they're still using PPO. I genuinely don't know if they are.
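(A sketch of the launch-many, kill-most pattern described above: watch a couple of cheap health signals, say the KL from the initial policy and the value loss, over the first few gradient steps and kill runs that blow past thresholds. The thresholds and the fake run object are stand-ins, not anyone's actual infrastructure.)

```python
import random

MAX_KL = 0.5           # tolerated KL from the initial policy in the first steps
MAX_VALUE_LOSS = 50.0  # crude sanity bound on the value-function loss

class FakeRun:
    """Stand-in for a real PPO run; step() returns per-step health metrics."""
    def __init__(self, config):
        self.config = config
    def step(self):
        # In a real setup these would come from the trainer's logs.
        return {"kl": random.uniform(0.0, 1.0), "value_loss": random.uniform(0.0, 100.0)}

def early_health_check(run, probe_steps=4):
    """Kill the run if the first few gradient steps already look unhealthy."""
    for _ in range(probe_steps):
        metrics = run.step()
        if metrics["kl"] > MAX_KL or metrics["value_loss"] > MAX_VALUE_LOSS:
            return False
    return True

configs = [{"lr": lr, "seed": s} for lr in (1e-6, 3e-6, 1e-5) for s in range(10)]
survivors = [run for run in map(FakeRun, configs) if early_health_check(run)]
print(f"{len(survivors)} of {len(configs)} runs survive the early check")
```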
Nathan [01:03:04]: Yeah, we don't know anything. They are really shielded on this run.
Louis C [01:03:07]: What was the, oh my God, Symphony PPO, PPO Symphony or something? There was something that came out about that that I saw on like Discord servers where like it was part of the GPT-4 leak and there was a bunch of notes on like their PPO optimizer. And it was it was a PPO Symphony or something like that. And like under the note, it was like PPO was like better early stopping and infrastructure management for like auto scaling. And I'm like, not surprising.
Nathan [01:03:41]: It's like, I mean, it doesn't say much, but it kind of says they've done so much exploration of the little things to watch. Once you have this working, you know: okay, this little value function is doing wacky s**t, the value function and the KL are doing this at the same time, which means, okay, we probably don't need this run. Whereas all of us in the open are just trying to get to that point, and we're trying to get to that point while charging ahead, which are kind of separate problems. If you want to validate a PPO infrastructure, you need the investment in the compute and the time to do it. But we're not going to do that at the same time as trying to say DPO is the best thing, or trying to figure out if KTO is the best thing. There's not really room in the narrative for it.
Louis C [01:04:25]: PPO just doesn't make sense for random hackers to work on, honestly. The level of infrastructure that you need to do PPO really, really well is not something the average person has, or is willing to make the investment to get. And for the average person, you know, DPO gets you most of the way there with a small fraction of the compute, even less depending on your hyperparameters, and even less if you precompute all the reference logits: you don't even need to have a reference model loaded, right? So it's basically the same compute as just fine-tuning, and people fine-tune all the time on 4090s and 3090s.
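(A minimal sketch of the trick Louis mentions: if the reference model's log-probabilities for each chosen and rejected response are computed once and cached, the DPO loss only needs the policy's log-probs at training time, so no second model has to sit in memory. Tensor values and the beta are illustrative.)

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss using cached reference log-probs (no reference model in memory).

    Inputs are per-example summed log-probabilities of the response tokens.
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Standard DPO objective: push the chosen log-ratio above the rejected one.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# Toy usage with hypothetical cached values.
policy_chosen = torch.tensor([-12.3, -8.1])
policy_rejected = torch.tensor([-14.0, -9.5])
ref_chosen = torch.tensor([-13.0, -8.0])    # precomputed once and stored
ref_rejected = torch.tensor([-13.5, -9.0])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected).item())
```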
Nathan [01:05:04]: Yeah, you can do it with Hugging Face. It's fine. PPO with Hugging Face is going to be a lot harder; that's just kind of how it goes. Speculative question: what type of thing do you think will make KTO show up on the scene? This KTO method from Contextual and Stanford is named after the authors of Thinking, Fast and Slow, Kahneman and Tversky. It's this paper where they essentially showed you can do preference optimization from a scalar signal, like the thumbs up you could give to your ChatGPT, a "you did good", a like button on YouTube or anything like this. The question is: are the DPO hackers going to adjust to this, and what dataset is going to enable it? Who is going to be using this? Is it just going to happen at a bunch of startups with products, behind the scenes, where they could get a few percentage points on top of their model by adding this on? Or is it going to be the thing where the next Zephyr-style model from Hugging Face uses it as well?
Louis C [01:06:05]: Yeah. So Colin, the first author of the KTO paper, and I are actually trying to create a number of datasets where we can explore the limits of KTO. Right now we're in the proposal-writing stage, and I'm very, very hopeful that we can have something that can be done in an entirely open-science setting relatively soon. And I think it's incredible. Sorry, I moved to the side and the mic stopped picking up my voice. I think it's incredibly exciting.
Louis C [01:06:41]: You know, things like fake product data, where you can actually experiment, and the idea of using KTO for conversions, right? And how do you actually evaluate that?
Nathan [01:06:52]: Meta is maybe already using it because people already use it then.
Louis C [01:06:56]: Yeah. Like, how do you even evaluate RLHF from a binary signal? Even RLHF from a preference signal we still don't know how to evaluate, and RLHF from a binary signal creates so many unique problems for evaluation that I genuinely don't think anyone outside of Contextual, and Colin and I, have really been thinking about them yet.
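(For concreteness, here is a much-simplified sketch of what training on an unpaired thumbs-up/thumbs-down signal can look like. It is not the published KTO objective, which also uses a batch-level KL reference point; it just shows the shape of the idea, with illustrative tensors.)

```python
import torch

def binary_feedback_loss(policy_logps, ref_logps, labels, beta=0.1):
    """Unpaired binary-feedback loss sketch.

    labels: 1.0 for thumbs-up (desirable) completions, 0.0 for thumbs-down.
    policy_logps / ref_logps: summed response log-probs under policy / reference.
    """
    logratio = beta * (policy_logps - ref_logps)
    sign = 2.0 * labels - 1.0  # +1 for good completions, -1 for bad ones
    # Push log-ratios up for good completions and down for bad ones.
    return (1.0 - torch.sigmoid(sign * logratio)).mean()

# Toy usage: two thumbs-up examples and one thumbs-down example.
policy = torch.tensor([-10.0, -7.5, -9.0])
ref = torch.tensor([-11.0, -7.0, -8.0])
labels = torch.tensor([1.0, 1.0, 0.0])
print(binary_feedback_loss(policy, ref, labels).item())
```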
Nathan [01:07:26]: Yeah. It seems like the same thing. It just takes time for these ideas to cultivate and then get traction in a few places, and then, once there's a popular model with a method, it just blows up like fire. Everyone's using DPO now, but the DPO paper came out in July and it wasn't until September that that happened: the investment, the interest. There are a lot of weird dynamics in how this fine-tuning area unfolds, which is just how AI unfolds. It's very weird, and when you zoom in, it's like, huh.
Louis C [01:08:01]: I was extremely, extremely bullish on offline RL for the longest time with like ILQL and some of Sergei's work in that direction. And I actually think that I keep moving to the side and it's like,
Nathan [01:08:16]: you can just move the microphone. And I keep like I could still hear you. So I wasn't very concerned about it.
Louis C [01:08:22]: I keep thinking that the DPO movement that that's going on now is like super, super similar to why everyone was getting excited about ILQL for back in the day. And really, it was just a timing thing. If ILQL had come out, like let's say a week after ChatGPT came out, ILQL would have been the DPO that everyone uses. And we would have created all of our infrastructure around ILQL rather than DPO because I still am, I really like Q-Value based functions, Q-Value based approaches.
Nathan [01:08:58]: Such a nerdy thing. I love it. I know.
Louis C [01:09:00]: But Q-values just make sense to me. When you train an ILQL model, you basically get a head that controls the model, almost like GeDi or PPLM from the Uber AI days, if you're familiar with those, and how those control the model. The idea with GeDi is that they had a head attached to the language model, and you would input, say, a subreddit, and it would adjust the logits so that it would talk like that subreddit.
Nathan [01:09:32]: This sounds like activation learning or like activation, I don't know the word, but essentially you can use like it's like in context learning, but you can just modify the activations directly. Yeah, yeah.
Louis C [01:09:44]: But it modifies the logits. Yeah. It was the same thing with ILQL: you were learning that kind of head to modify the logits to satisfy some constraint that you were adding. That head was also implicitly computing your Q-values; you would train it by telling it what your reward was for various utterances, and it would do everything from there on out. There were some stability issues with it, but it was a fantastic approach, and if it had gotten the same attention that DPO did... well, DPO is very, very simple, which is part of the benefit; ILQL is not as simple, but it would have caught on a lot more than it actually did. I feel like at Carper AI, the fact that we integrated ILQL into TRLX first was the main reason ILQL caught on, plus a few of Sergey's papers that used it. Besides the integration into TRLX, I don't think anyone in the broader open-science, open-source community was really using ILQL.
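(A minimal sketch of the head-that-modifies-the-logits idea: a small learned module on top of a frozen base model's hidden states that adds a correction in logit space. How the head is trained, with Q-learning for ILQL or a class-conditional discriminator for GeDi, is left out, and the shapes are illustrative.)

```python
import torch
import torch.nn as nn

class LogitCorrectionHead(nn.Module):
    """Small head that nudges a frozen LM's next-token logits."""
    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, vocab_size)

    def forward(self, base_logits, hidden_states, scale: float = 1.0):
        # Additive correction in logit space; scale controls the steering strength.
        return base_logits + scale * self.proj(hidden_states)

# Toy usage with hypothetical shapes (batch=2, seq=5, hidden=768, vocab=50257).
hidden = torch.randn(2, 5, 768)
base_logits = torch.randn(2, 5, 50257)
head = LogitCorrectionHead(hidden_size=768, vocab_size=50257)
steered = head(base_logits, hidden)
print(steered.shape)  # torch.Size([2, 5, 50257])
```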
Nathan [01:10:56]: Yeah, I mean, this is one of the questions I had: if you can say, how far ahead in RLHF was what Carper was doing, and what kind of institutionalized knowledge did you have there? Carper AI was its own thing, and then Stability pulled you in, probably with the promise of compute. I'll say things so you don't have to say anything for lots of this. They had forked Hugging Face's TRL library back when Hugging Face wasn't maintaining it, and they had probably five-plus full-time employees doing RLHF in the open and for private industry. Obviously, the private stuff I'm not even going to bother asking about, because that's all under NDA. But what were the problems you were working on at Carper? And how does that compare to the things people are talking about now? Is it still related, or has the field just moved into a different area?
Louis C [01:11:56]: So most of the problems we faced at Carper with TRLX was on scaling PPO, right? And I think almost anyone you talk to who has scaled PPO in the open source space. And when I say scale, I mean like way beyond 20 billion parameters. I'm talking like 70 to 100 billion.
Nathan [01:12:19]: How many nodes do you need to train a 70 billion parameter model?
Louis C [01:12:23]: So we were typically doing like 100 GPUs for PPO at that scale.
Nathan [01:12:28]: Like 10 to 12 nodes. Yeah. Yeah.
Louis C [01:12:31]: We mostly tested with the NeMo checkpoints that were around 100 billion parameters. TRLX was built, at least for that component, on top of a very modified version of Megatron-DeepSpeed. But the amount of regularization and random tricks you needed to get PPO to even work at that scale is insane. We had to do separate warm-ups for the value function, right? So we had to independently train the value function before we trained the policy network. And everyone and their mom was talking about having separate value networks versus policy networks for PPO.
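(A rough sketch of the value-head warm-up Louis describes: fit a scalar value head to returns from rollouts of the frozen initial policy before any PPO updates touch the policy weights. The sizes, data, and loop are illustrative, not TRLX's actual code.)

```python
import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Scalar value head on top of LM hidden states."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.v = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states):
        return self.v(hidden_states).squeeze(-1)

hidden_size = 768
value_head = ValueHead(hidden_size)
opt = torch.optim.Adam(value_head.parameters(), lr=1e-4)

# Warm-up: train only the value head on pre-collected rollout returns.
for step in range(100):
    hidden = torch.randn(8, 32, hidden_size)  # hypothetical rollout hidden states
    returns = torch.randn(8, 32)              # hypothetical per-token returns
    loss = ((value_head(hidden) - returns) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```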
Nathan [01:13:18]: Did you ever try JAX? Did you have TPUs at Stability or Carper ever?
Louis C [01:13:25]: We did towards the end.
Nathan [01:13:27]: Because it could solve some of the multi-node thing.
Louis C [01:13:29]: Yeah. It wasn't the multi-node that was the issue. It was.
Nathan [01:13:35]: You're saying DeepSpeed wasn't the issue?
Louis C [01:13:37]: No. It was actually the fact that the inference server that TRLX uses for the rollouts was entirely different than the inference server that Megatron wanted us to use. So we needed a way to rapidly.
Nathan [01:13:57]: That's why PPO is really hard to scale, because you have to have a generation engine and you want it all to be flexible.
Louis C [01:14:02]: Yeah. So we needed a way to dynamically keep our compute graph while just copying the weights in place over to Triton. And I don't think we ever came up with a solution to do that very effectively. And I think it actually goes a step further: I don't think the NeMo line, what NVIDIA did, came up with a solution for that either.
Nathan [01:14:25]: Yeah. This is interesting, because I'm not going to say the details on the pod, because I'm not allowed, but Anthropic and these places that have custom RLHF infrastructure have essentially built their distributed training infrastructure with the idea that the model will need to be generated from at different checkpoints and served to different endpoints at different checkpoints. So it's just very different from taking DeepSpeed off the shelf, which is just about training. These other companies that do this stuff really well have infrastructure for handling these really messed-up cases of how to generate from and update these models.
Louis C [01:15:00]: Yeah. And most approaches that a reasonable person would build off the shelf would rely on torch.compile, and you still have the same issue: your weights are changing dynamically. It's very, very hard to even understand all of the little technical details in torch.compile that have to be accounted for to make this work, right? And, you know, something we considered at the time was: we need to do an insane number of rollouts for every gradient step, and we don't want the interface between the rollouts and the training to be Python. We want it to be Rust or something, because otherwise the CPU overhead is mind-boggling. It was like 80 percent or something crazy; 80 percent of the entire processing time was just CPU stuff.
Nathan [01:15:53]: Not so much. I know.
Louis C [01:15:55]: I know. And there are so many different infrastructure constraints that people don't realize when they're just doing 20-billion-parameter PPO, right? The other thing, going back to the value function being separate from the policy network: TRL was very, very gung-ho on keeping them separate. I think RL4LMs also wanted to keep them separate. And then there was someone from Cornell, I don't remember his name, who was also on the RL4LMs paper; he did a paper like PPO-plus or something, I don't remember what it was. I mean, all these things are interesting.
Nathan [01:16:30]: I mean, there are new libraries still coming out. I saw one recently called OpenRLHF, and it looks good. There's so much institutional... breaking the bonds of past RL that needs to happen. Part of this library is listing that they follow the implementation details from the original implementation-details-of-PPO work, where it's like, we've already moved on: Costa has worked on the N implementation details of RLHF post, which are the ones they actually need. There's so much baggage from the fact that PPO came out of the control field that everyone expects the tricks you need for from-scratch learning with PPO to apply to this fine-tuning method. Even getting people to stop using PPO for that... and DPO is a new thing, something that only works for preference alignment, so people are going to explore it in a scientific way that's much fresher. They're probably going to make more scientific progress because there isn't this kind of confusion about which implementation details we need. Yeah, for sure. For sure.
Louis C [01:17:34]: The N implementation details of RLHF, did that come out?
Nathan [01:17:39]: Yeah, it's a blog post. It's a blog post. When? Maybe a month ago.
Louis C [01:17:45]: Oh, man, I totally missed that. Oh, that's so cool. I'm going to read that.
Nathan [01:17:48]: Yeah, I mean, this is for anyone still listening: if you want to know the actual details of RLHF, go look at all the stuff that Costa Huang has been doing; he's been reproducing everything in explicit detail. I feel like both of us would benefit from rereading it. So there's some free content.
Louis C [01:18:06]: Costa is one of the most meticulous, detail-focused people that I know in the RLHF space. If Costa says something works, it's because he's tried it from every angle, and then tried it from angles you didn't even expect, and all of them work.
Nathan [01:18:21]: Yeah. Yeah, that's great. I think I have a couple of more fun questions as we wrap up; we could go on with all these technical things forever. What was it like to work at Carper when ChatGPT came out? Because with ChatGPT, from a technical perspective, RLHF was validated as something that is necessary to the future of language models, and you were one of the few people working on RLHF beforehand, which is how you end up here. It's awesome that you got to ride that journey. What was that like?
Louis C [01:18:57]: I mean, I the star count on the repository exploded. I think we went from like.
Nathan [01:19:07]: TRLX existed.
Louis C [01:19:08]: Yeah, it was just insane. It was it was.
Nathan [01:19:14]: We almost weren't.
Louis C [01:19:16]: Positioned. I guess I can be fully honest: we almost weren't positioned to entirely ride the hype train. TRLX was always designed, from the very, very beginning, to be a one-stop shop for enterprises to do RLHF: companies that had a thousand GPUs, that already have an engineering team, that already use Megatron-DeepSpeed or DeepSpeed, and just want something that works on their infrastructure. And because we used Docker images that were based off the Megatron-DeepSpeed Docker images anyway, those kinds of companies could very, very easily deploy TRLX and utilize it in their stack, right? Yeah. And the hype that came from ChatGPT, at least initially, was not enterprises. It was bloggers. It was people writing blog posts.
Nathan [01:20:09]: You were you were probably like training big models and I'm like, hey, how does RLHF work? I need to write this blog post.
Louis C [01:20:14]: Yeah. I'm over here training a 40-billion-parameter model for them, and they're like, hey, can you help me train this 400-million-parameter guy? And I'm like, what? I'm so busy.
Nathan [01:20:24]: So it's primarily a scaling thing, I think. Were there any cultural things from being early? Were you bought into RLHF to the same extent ahead of time? What got you into RLHF? What motivated Carper to exist? And did that stay consistent?
Louis C [01:20:45]: So I've always been very, very bullish on critiques and revisions in general. I wrote the first or the second paper on it; I don't actually remember if the superalignment team at OpenAI wrote a paper before me. They may have, but I don't think so; I think ours came out like a month before theirs. That always feels good. So I wrote one of the first papers on critiques and revisions, and I was very, very bullish on that, but initially only for evaluation. I had experimented with PPO a little bit back in 2021 for this kind of critique-and-revision stuff, and it was not ready whatsoever. There was no infrastructure, and TRL was an abandoned library that was very buggy; it didn't work. No shade to Leandro, I love Leandro, but it was obviously a deprecated library at that point. It happens. And when we tried to do RLHF then, there was no traction whatsoever. So Alex Havrilla and I, I think he's working with Meta now, I don't remember. Yeah. He was an intern there at least.
Nathan [01:22:02]: He just had an interesting paper on like reasoning and math, which is a whole other conversation for RLHF stuff.
Louis C [01:22:08]: Yeah. So we started, we forked TRL and we just added DeepSpeed support. That's all we wanted to do initially. And then we were going to merge back to TRL because we had no visions of like Carper or anything like that. And we realized to make a framework that people would actually want to use, we had to do a full rewrite of TRL and we had to build things in a way that made sense to an engineer who wanted to deploy RLHF, who wanted to experiment with RLHF at a company or in a lab. Because we were building this from the perspective of, well, we're on the Eleuther AI GPU cluster. How can we best use our infrastructure there to...
Nathan [01:22:50]: Has anyone publicly said how many GPUs Eleuther has? This is like one of my great mysteries. Is this like a held secret? I don't think it's a held secret.
Louis C [01:22:58]: I don't remember actually. They have some stability GPUs and they have GPUs from elsewhere. Like they seem to get compute when they need it. Yeah. Yeah.
Nathan [01:23:11]: Like it's not like, it's not an issue.
Louis C [01:23:14]: Through Synth Labs, I've been supplying a bit of compute here and there as well. I gave them a node of H100s for a little while for a paper we were working on, the Pink Elephants paper. But I don't think they're super short of compute. They're a little short, probably; everyone's a little short of compute. Yeah. But I don't think they're super short of compute.
Nathan [01:23:36]: Yeah.
Louis C [01:23:36]: So we built it with the Eleuther cluster in mind. And because we built it with the Eleuther cluster in mind, we kind of said: well, we can turn this into a thing where we build the infrastructure that people can readily deploy on their clusters and it'll just work for them, and we can make Carper AI. So we made Carper AI. And shortly after, all the Stability stuff started happening and Carper joined Stability, and I worked there for a while. Last summer I left to rejoin Eleuther because, you know, I long for the days of being an engineer. I love waking up in the morning, writing code, eating a little bit, and then going to sleep.
Nathan [01:24:22]: Yeah. I mean, that's the difference. I spend the time writing because I like to. We've had plenty of discussions where like, oh, I should start a blog. And it's like, it comes down to doing what you like to do. And it's like, you're doing great as it is. Yeah. It's okay. Yeah. Okay. I think that's kind of a good place to stop. Where should people find you? What do you want to boost? Yeah. Sign off here.
Louis C [01:24:44]: So my Twitter is lcastricato. I, or you can follow the Synth Labs Twitter. It is, let me actually, I don't remember what it is off the top of my head.
Nathan [01:24:55]: You have any goose announcements?
Louis C [01:24:58]: No goose announcements at the moment, unfortunately. The Synth Labs account is synth_labs on Twitter, and lcastricato is my personal Twitter account. You know, I'm always open to collaborators, especially now with Synth Labs, so we're always happy to chat with and talk to new people about interesting research directions. And yeah, just reach out and we can get something going, I guess.
Nathan [01:25:23]: Yeah. I'll put the URL in the show notes; it's synthlabs.ai. I found that, because synthetic data is so hot and so new, some of these URLs are just hard to find. We don't have to go into the whole rant about naming and stuff, but most of the people that search for my Substack, if you don't write "interconnects" exactly, you get a different Substack first. So, we're all in this together, for anyone founding a startup or a blog and struggling with naming. Please send us questions about RLHF. If you liked this, Louis could come back. I'm trying to start an in-person thing and get some gear, so when I'm at a conference or whatever, we can bring researchers on and remove some of the Zoom aspects that we're all stuck in so much of the time. Thanks, Louis, for putting some of the things we've talked about a lot onto the semi-record. People listen and read; this is good. I think a lot of researchers are going to dig into this. There were so many different things we talked about; it was a very high-information-density chat, but it was a good time.
Basic tips on how to assess inbound ML content and cultivate your news feed.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/making-a-ml-feed
00:00 How I assess all these AI releases
01:22 1. Model access and demos are king of credibility
02:31 2. Focus your feed on depth or breadth
03:09 3. Examples of using the model normally show it's usable, shockingly
04:10 4. Leaderboards as the single leading claim is often anti-signal
05:00 5. Basic deep learning conceptual checks will often save you
06:13 6. If it's not even remotely reproducible or verifiable, it's not science
07:10 7. Don't over-index on Twitter
08:32 8. Data sharing, licenses, communication clarity, and small things add up
08:58 9. Research papers, technical reports, blog posts, and Tweets all serve different purposes
09:49 10. Socialize your information and build relationships
Google rejoins the open model party and gets some backlash for a frequent problem for generative AI.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/gemma-google-ships-it
00:00 Google ships it: Gemma open LLMs and Gemini backlash
03:12 Getting to know Gemma
07:11 Alignment details
08:55 Aside: What is REINFORCE? Some history of RL
11:08 Implementation details and RLHF
12:18 Terms of use: RAIL Licenses history repeated
14:05 Is Google back on top? Gemini's woes
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_008.webp
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_014.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_035.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_051.png
Figure 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/gemma/img_055.png
10 Sora and Gemini 1.5 follow-ups: code-base in context, deepfakes, pixel-peeping, inference costs, and more
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sora-gemini-follow-up
00:00 10 Sora and Gemini 1.5 follow-ups: code-base in context, deepfakes, pixel-peeping, inference costs, and more
00:46 1. Deepfake detection of Sora
01:59 2. Playing with long-context, problem settings, and prompting
03:39 3. Gemini paper snooping: contamination and citation games
05:42 4. Training data and token estimates of YouTube
07:42 5. Unlocking model-based RL and downstream research
08:52 6. Midjourney style matching, V-JEPA, replicating Sora in the open
10:09 7. Architectures and academic links
10:57 8. Pixel peeping from the arts
11:58 9. Inference costs
13:24 10. Pressure on Llama and Mistral
14:03 11. Sound effects, physics, and the complete picture
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_003.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_007.mp4
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_009.mp4
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_011.mp4
Figure 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_037.mp4
Figure 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_044.png
Figure 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_047.png
Figure 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-2/img_049.mp4
Emergency blog! Three things you need to know from the ML world that arrived yesterday.
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/sora-gemini-and-mistral-next
0:00 OpenAI's Sora for video, Gemini 1.5, and a secret Mistral model
0:53 Sora: OpenAI's text-to-video model
4:59 Gemini 1.5: Google's effectively infinite context length
8:01 Mistral-next: Another funny release method
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_015.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_023.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_026.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/sora-gemini-mistral/img_036.png
In an era dominated by direct preference optimization and LLM-as-a-judge, why do we still need a model to output only a scalar reward?
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: In an era dominated by direct preference optimization and LLM-as-a-judge, why do we still need a model to output only a scalar reward?
Podcast figures:
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reward-models/img_004.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/reward-models/img_009.png
0:00 Why reward models are still key to understanding alignment
Scale's making over $750 million per year selling data for RLHF, who's coming to take it?
This is AI generated audio with Python and 11Labs. Music generated by Meta's MusicGen.
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/alignment-as-a-service
00:00 Alignment-as-a-Service upstarts taking on Scale AI
04:25 The competition with humans-in-the-loop
06:05 Scaling Alignment-as-a-Service via AI feedback
Podcast figures:
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/aaas/img_008.png
A small model at the beginning of big changes.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/olmo
0:00 Open Language Models (OLMos) and the LLM landscape
6:24 Thought experiments
7:51 The LLM landscape heading into 2024
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/olmo/img_010.png
Note: some of the audio in the second half is a little wonky, but the general voice was upgraded, so hopefully it's a little less "poppy" in the meantime!
I'm trying to fix little pronunciation problems on a weekly basis. Thanks to my early fans! It'll keep improving. E.g. some of the months were wonky.
When what seems like pure LLM black magic is actually supported by the literature.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/model-merging
00:00 Model merging lessons in The Waifu Research Department
02:21 How and why does model merging work?
07:13 Aside: merging vs. ensembles vs. mixture of experts
08:21 Why are people doing this?
11:22 Tools & Links
11:51 Brief (visual) literature review
12:07 Full model merging and recent methods
15:55 Weight averaging during pretraining
17:18 LoRA merging
17:53 More background
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_005.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_016.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_042.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_051.png
Figure 5: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_055.png
Figure 6: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_058.png
Figure 7: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_060.png
Figure 8: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_062.png
Figure 9: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_065.png
Figure 10: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_075.png
Figure 11: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_077.png
Figure 12: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/model-merging/img_084.png
Local LLMs: the latency solution, Meta's open AGI, personalization myth, and moats X factor
The deployment path that'll break through in 2024. Plus, checking in on strategies across Big Tech and AI leaders.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/local-llms
0:00 Local LLMs: the latency solution, Meta's open AGI, personalization myth, and moats X factor
4:15 The personalization myth
7:13 Meta's local AGI and moats X factors
A fun demo on how generative AI can transform content creation, and tools for my fellow writers on Substack!
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/multimodal-blogging-tools
0:00 Multimodal blogging tools
2:57 Stratechery, passport, and wonderful customer experiences
5:51 Wrap-up, features, and next steps
Figure 1: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_006.png
Figure 2: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_008.png
Figure 3: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_012.png
Figure 4: https://huggingface.co/datasets/natolambert/interconnects-figures/resolve/main/multimodal-blogging/img_020.png
A sampling of recent happenings in the multimodal space. Be sure to expect more this year.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/multimodal-rlhf
00:00 Multimodal LM roundup: Unified IO 2, inputs and outputs, Gemini, LLaVA-RLHF, and RLHF questions
02:46 Unified IO 2: Scaling multi-input, multi-output model pretraining
07:47 Collecting preference data for images
09:31 LLaVA-RLHF: The first experiments in multimodal RLHF fine-tuning
13:20 Multimodal RLHF questions, ideas, and resources
And why the comparisons don't really matter. Repeated patterns in the race for reproducing ChatGPT, another year of evaluation crises, and people who will take awesome news too far.
This is AI generated audio with Python and 11Labs
Source code: https://github.com/natolambert/interconnects-tools
Original post: https://www.interconnects.ai/p/open-gpt4-limitations
00:00 Where 2024's "open GPT4" can't match OpenAI's
03:19 Models vs. products
04:51 RLHF progress: Revisiting Llama 2's release and potential in 2024
08:30 Smaller scale open RLHF
10:33 Opportunities
12:24 Commentary
This interview is on YouTube and podcast players.
Giving a voice to researchers is the best way to cut through the noise and understand what is happening with core developments of LLM technologies. I’m excited to get to talk with Michael Poli (Stanford PhD student + research at Together AI) and Tri Dao (incoming professor at Princeton + Chief Scientist at Together AI). This builds on the mega-post from yesterday on the same topics, though the interview is obviously less math heavy:
Interconnects is a reader-supported publication. Consider becoming a subscriber.
Topics: Introductions | Why Attention works and may not scale | Quadratic scaling in attention | What is Striped Hyena | What is Mamba | Mamba hardware optimization | Predictions for 2024 architectures | More predictions for AI
Introductions
[00:00:00] Nathan Lambert: Okay. Hey, everyone. Welcome to the first interview that we're going to post on interconnects. I'm really trying to bring more scientific voices into the AI discourse as media is covering a lot these days. I'm happy to be here with Michael Poli and Tri Dao, experts in some of these non attention architectures that have been really blowing up in the last few weeks of December.
So, Michael, do you want to introduce yourself first?
[00:00:25] Michael Poli: Sure. Thank you, Nathan, for inviting me. I do research at Together AI, and I was also a PhD student at Stanford, working with Stefano Ermon and Chris Re; that's how I met Tri as well. I've moved through a few different areas of research,
but if I had to choose one, I like to do research at the intersection of signal processing, dynamical systems, and deep learning. And most recently, luckily, there's been more interest in kind of efficient architectures that use some of these principles to improve scaling along different axes and to get sort of new trade-offs at inference time.
[00:01:13] Nathan Lambert: Great. And Tri?
[00:01:16] Tri Dao: Hi everyone, thanks Nathan for hosting us, really excited to be here. I'm Tri. I just finished my PhD at Stanford, I'm an incoming assistant professor at Princeton, and right now I'm chief scientist at Together AI; it's a startup working on AI infrastructure. And, yeah, I've been working at the intersection of machine learning and systems, designing algorithms that take advantage of the hardware that they run on.
I'm interested in long-range dependencies, how to encode that into a model, designing architectures that can learn long-range dependencies. Yeah, really excited to be here.
Why Attention works and may not scale
[00:02:01] Nathan Lambert: Okay. I have some questions and I'm going to dive right into this. I think you two can kind of both answer them, or someone can answer longer, whatever you want.
I think to start with, we should talk about maybe why attention works and what the limitations of attention are. Almost every person in tech broadly now knows that a transformer is a model built with attention and that ChatGPT does that, but like, why is this so good? Even, like, how much of a transformer is built with attention, are there other things going on, and what might be the challenges there?
[00:02:35] Tri Dao: Right. So, the transformer is this architecture that powers most of the exciting applications that we're seeing nowadays, as you mentioned, and so on. Attention is kind of the core layer there, and attention actually came earlier, around 2014, 2015, and then the transformer came out incorporating that, focusing a lot on attention, with these MLPs, interleaving MLP and attention.
And I think the success largely has been that they seem to be able to scale really well, so you can scale up the models with more parameters, with more data. And that has been the recipe for success. It sounds obvious now, but I think five years ago that wasn't clear.
So it seems like, you know, the transformer architecture is a hugely successful one, and, you know, a couple of reasons why it's successful: one, I think it's general enough that it's able to learn a lot from data. And two, it's pretty friendly to hardware. There's no kind of sequential dependency like previous RNNs.
So as a result, you can run it well on GPUs, TPUs. You can scale it up. It's very hardware efficient. I've personally worked on making it more hardware efficient as well. So it's just kind of the recipe for success: a general architecture that scales well. If you're an NLP person, maybe you would say, you know, you should incorporate some kind of inductive bias, but personally, I think it's a very general architecture that scales well and it's hardware friendly.
[00:04:16] Nathan Lambert: Yeah. Yeah. It's remarkable how it seems so obvious now. I think one of the things from studying this work is that context length becomes a really interesting axis along which to study alternatives. Essentially, I think... I mean, Michael, do you want to talk about that? I could babble, but you know more, sure.
[00:04:39] Michael Poli: Yeah, there are several points. I'll start by saying that, you know, there's still a lot of great research trying to understand, from first principles, why it is that the transformer can learn these interesting circuits. People kind of pick apart the computation, like combinations of different [00:05:00] heads in transformers and so on.
So there's work on basically understanding transformers as kind of like a programming language that is encoded. But I think, as Tri mentioned, there are some very, very interesting design choices that have gone into the transformer. This interleaving of attention and MLP is quite important.
And also the transformer was essentially successful in the beginning because it encoded these techniques that had been developed for RNNs and LSTMs, these other, you know, classical NLP models: gating to modulate which information is absorbed into the model, gating to determine how quickly something is forgotten, and it cast this recurrence into a parallel form.
It is, you know, a bunch of GEMMs that can be easily, well, not very easily, but can be optimized on GPUs.
Quadratic scaling in attention
[00:06:01] Nathan Lambert: Yeah, that's all great. I guess the specific thing that I had in mind is how attention ends up having this kind of quadratic scaling in terms of cost when you have an input sequence of length L and you want to output a sequence of length L, essentially.
If you zoom into the math and you look at what's happening at inference in most of these libraries, you have this upper-triangular attention matrix where you say you can only look at the past entries of your text. And as you go through there, you get this L squared relationship, where the first token can only look at one entry, and then you end up looking at more tokens for each subsequent position. Now, we've been talking about recurrent neural networks, and how does something that isn't attention get around the fact that you want to look at all of the history of the text in your sequence?
So like if you write a long prompt to ChatGPT, you really want all that information to be encoded, and how could doing something other than this dense attention matrix actually make that possible?
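To make this quadratic-scaling point concrete, here is a minimal NumPy sketch of single-head causal attention, a toy illustration rather than code from any of the projects discussed; the function names and shapes are assumptions, and the masked (L, L) score matrix is the object whose cost grows as L squared.

```python
import numpy as np

def causal_attention(q, k, v):
    """Toy single-head causal attention over a sequence of length L.

    q, k, v: arrays of shape (L, d). The (L, L) score matrix is what
    gives attention its quadratic cost in sequence length.
    """
    L, d = q.shape
    scores = q @ k.T / np.sqrt(d)                # (L, L): every token vs. every token
    mask = np.tril(np.ones((L, L), dtype=bool))  # upper-triangular part masked out
    scores = np.where(mask, scores, -np.inf)     # token i may only look at tokens <= i
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v                           # (L, d)

# With L = 8 tokens and d = 4 dims the score matrix already has 64 entries,
# and it grows as L^2 as the prompt gets longer.
L, d = 8, 4
rng = np.random.default_rng(0)
out = causal_attention(rng.normal(size=(L, d)),
                       rng.normal(size=(L, d)),
                       rng.normal(size=(L, d)))
print(out.shape)  # (8, 4)
```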
[00:07:08] Tri Dao: Yeah, so, you know, before attention, there were RNNs, right? Like, RNNs, they processed text fine, and maybe they didn't scale as well, but yeah, they process text by encoding it.
[00:07:22] Nathan Lambert: Can you say briefly what an RNN is and how it works?
[00:07:24] Tri Dao: Yeah, so these are recurrent neural nets, that go back, I think, to the 80s.
maybe some of the more famous ones are LSTMs, GRU. so they were pretty popular in, around 2012 to 2016 or so. they were kind of state of the art for translation, speech recognition. a bunch of, I think NLP, like, they, they were a state of the art and, and they processed text kind of sequentially.
They see essentially one token at a time, and that changes the hidden state; they update the hidden state every time they see a new token. So I think it's, in some sense, mimicking how, for example, the human brain processes information: you read a sentence or a passage, and maybe it's like you're storing some information in your brain.
By the time you've finished reading a document, maybe you can answer questions about that document without having to refer to that document again. So RNNs kind of work that way. They process the text, and that changes the hidden state, and the hidden state is the representation that can be used to either generate new tokens or classify the document or whatnot.
So these worked well back in 2016 or so. But they've kind of fallen out of favor. Empirically, they don't do as well as the Transformer, I think. And as you touched on, the Transformer, because of this kind of quadratic scaling where you compare every token with every other token that comes before it, gives you this very easy way to propagate information.
And I think that's part of the reason why transformer and attention do really well. But more recently, there have been some newer RNN architectures that seem to do pretty well. RWKV is, I think, one of the earlier ones. I really admire that project; it's an effort mostly from one person really going against the orthodoxy of the transformer,
showing that RNNs can be pretty strong. Who was the lead on that? I think it was this person, Bo Peng. And, you know, it's an entire project, but I think it was pioneered by Bo Peng. I think it's affiliated with Eleuther, with compute sponsored by Stability and so on.
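As a contrast with the attention sketch above, here is an equally minimal fixed-state RNN in the spirit of what is described here: one hidden state of constant size is updated per token, so memory does not grow with sequence length the way a transformer's KV cache does. The shapes and names are illustrative assumptions, not RWKV or any specific architecture.

```python
import numpy as np

def rnn_read(tokens, W_h, W_x):
    """Minimal fixed-state RNN: fold tokens one at a time into a hidden
    state of constant size. Memory does not grow with sequence length,
    unlike a transformer's KV cache."""
    h = np.zeros(W_h.shape[0])
    for x in tokens:                   # strictly sequential: step t depends on step t-1
        h = np.tanh(W_h @ h + W_x @ x)
    return h                           # one vector summarizes the whole document

d_hidden, d_in, L = 16, 8, 100
rng = np.random.default_rng(0)
h = rnn_read(rng.normal(size=(L, d_in)),
             rng.normal(size=(d_hidden, d_hidden)) * 0.1,
             rng.normal(size=(d_hidden, d_in)) * 0.1)
print(h.shape)  # (16,) regardless of L
```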
[00:10:12] Nathan Lambert: Yeah. I was reading this earlier. At a technical level, they tried to replicate something like the query-key-value lookup of attention with two linear RNNs, essentially to be able to remove the specific attention scaling problem, the potential problems, with two RNNs, which have this better long-context behavior and different implementation rules.
And they also, in the paper, trained up to 14 billion parameters, which kind of leads into some of the next questions I was going to ask. I was going to ask Tri about Mamba and then Michael about Striped Hyena. I think you could go in either order. I think these came out about a week apart, and these two language models were kind of seen as being
What is Striped Hyena
Nathan Lambert: Way closer than anyone would expect, essentially. Striped Hyena came out and the evaluations were close to models I've been training on all year, like Llama 2 and Mistral 7b. And I went to the Together API and I did like a side by side of Mistral versus Striped Hyena, and it's, it's a good language model.
It answers most questions. There's no obvious failure modes. I think maybe Michael, do you want to comment on that? I know it's another big project and then we can go back to Mamba, even though it's slightly out of order in the chronological, the release cycle that happened. sure.
[00:11:33] Michael Poli: So, I guess I'll start by saying that, there's an interesting connection between all these, these new methods.
There is this sort of convex set, which has a center, and there's this connection between linear attention, so attention without the softmax, linear RNNs, and state space models, SSMs. So at some level, the mathematical formulation of this kind of base model, and here I'm not talking about the base architecture, just the fundamental model, is the same.
And then you can go in different directions, and each direction has its own tradeoffs. You can go in the feature map direction, the kernel direction. So when you break down the softmax, you take away the softmax, you can place on queries and keys, kind of the fundamental entities that compose your attention matrix, other kernel-like functions, other functions that you hope would approximate whatever capability of attention you like.
You can do things like a Taylor approximation, a Taylor expansion, for example, of that, and you have a slightly different perspective, but you get something that, again, is very similar. You can go to time variance. So you take the RNN and you push in this input dependence, so the way the [00:13:00] computation inside the linear RNN is conditioned by the input sequence, and you can have things like gates. We've seen a lot of work, for example, modernizing linear attention with additional gates
that allow you to make better use of your fixed state dimension. And then you have the third direction, at least in my mind, which is the one that uses the convolutional form, that pushes more towards other types of linear operators that are still associative, that still allow you to train in parallel.
So here are things like time-invariant systems. I can elaborate on any of these points, but things that can switch between convolutions and recurrence, like S4 models, with additional gates again. Striped Hyena, you know, was born as a project from the Hyena architecture, which belongs to this third category that I just mentioned.
And we were really trying to get the best per-flop [00:14:00] architecture that we could. And one principle that was validated over and over again, and that we're trying to understand better now, is that it seems composing, hybridizing different layers, different blocks of different categories, and even full attention, yields something that is better than the individual components.
So there seems to be a compositional aspect of these models that we're trying to understand better. And this gives you a better sort of pretrained model per flop. And with this model, we ran a whole suite of scaling laws and so on. Hybridizing also gives you, since we wanted something that would be kind of usable out of the box, a way easier time
when you're fine-tuning for longer context. We can apply some of these techniques that have been developed for transformers, and they kind of surprisingly work okay for [00:15:00] hybrids as well. So things like linear scaling for rotary embeddings and so on; you can go into the details. So it was mostly a project asking, given the current landscape, what is the best we can do?
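One way to see the "take away the softmax" direction is generic causal linear attention, where a feature map on queries and keys lets attention be computed as a running recurrence with a fixed-size state instead of an (L, L) matrix. The sketch below uses an elu-plus-one style feature map as a stand-in kernel; it is a hedged illustration of the general idea, not the specific block used in Striped Hyena or Hyena.

```python
import numpy as np

def phi(x):
    # A simple positive feature map standing in for the softmax kernel
    # (elu(x) + 1); an assumption for illustration, not the kernel any
    # particular model uses.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(q, k, v):
    """Causal linear attention as a recurrence: a running (d, d) state
    replaces the (L, L) attention matrix, so cost is linear in L."""
    L, d = q.shape
    S = np.zeros((d, d))   # running sum of phi(k_t) v_t^T
    z = np.zeros(d)        # running sum of phi(k_t) for normalization
    out = np.zeros((L, d))
    for t in range(L):
        S += np.outer(phi(k[t]), v[t])
        z += phi(k[t])
        out[t] = (phi(q[t]) @ S) / (phi(q[t]) @ z + 1e-8)
    return out

L, d = 8, 4
rng = np.random.default_rng(0)
o = causal_linear_attention(rng.normal(size=(L, d)),
                            rng.normal(size=(L, d)),
                            rng.normal(size=(L, d)))
print(o.shape)  # (8, 4)
```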
What is Mamba
[00:15:11] Nathan Lambert: Yeah, that's a great description of it. I mean, the sentence in the blog that's like, Striped Hyena is optimized using a set of new model grafting techniques, enabling us to change the model architecture during training, kind of felt like, to me, that there's a ton going on there. And like, some of it, you probably can't talk about, there's normal data.
So like, I don't think all the data was quite explained, like what the longer-context data was, but are you taking this from a starting point, from models that people would know? And can you say any of that? I think even just the summary that it's synthesizing recent work into a strong model is great context for people.
[00:15:48] Michael Poli: Yeah. Well, with the deadline, given this explosion of primitives that, you know, I described, and given sort of the [00:16:00] cost that it would require to evaluate all the different combinations, we found ways to essentially start training with one configuration and then continue on with another configuration.
I think we're going to have more work, or a paper.
[00:16:16] Nathan Lambert: Yeah. There's so much cool work in that area. Someone at AI2 is working on a project where they're essentially trying to cut the Llama models in half and keep training them. It's just the wild west out there, with people trying to take strong models and make them smaller while still getting the performance benefits of bigger models.
I think that's a whole aside, but I wasn't expecting it to show up. When you follow the social media around Striped Hyena, you know, people are like, oh, state space, non-attention models are finally good. And it's like, it covers up a lot of the details that are very interesting about it, in my opinion.
So, okay, turning back to Tri. I think Mamba actually happened first among these, in my reading back of [00:17:00] social media, and it also was very surprising to me. I think the largest model in the Mamba suite is 2.8 billion parameters, if I remember correctly, and it was compared on a lot of the common benchmarks in open NLP, so things like the GPT-J and Pythia model suites, and the scores on the benchmarks reported were really strong. I think if you want to, start with the high-level summary, and then I'm definitely going to make you talk about the awesome new CUDA kernels and stuff that you had to write for this project.
[00:17:34] Tri Dao: Yeah, so Mamba is a collaboration with Albert Gu, who was a PhD student at Stanford, that's where we met, and he's now a professor at CMU and also at a startup. So it was a wonderful collaboration; credit goes to him. Albert has been working on this line of work called state space models. [00:18:00] In some sense, as mentioned, it connects to things like linear attention, linear RNNs, convolutional neural nets, and that's what his PhD thesis is about.
I've also worked on state space for the past couple of projects. My angle is how to make state space more hardware efficient and kind of increase its expressiveness. So it's wonderful working with Albert. And this, I think, is more of a proof of concept, which is: can state space actually do as well as transformer on language? We've seen previous papers showing state space could be better on audio, could be better on some of the tasks on the Long Range Arena, but language has always been the most difficult to get to do well for state space models.
[00:19:00] And language is also kind of the thing that people care about the most right now. So it was more of a proof of concept, which is, hey, we want to show that state space can be competitive with, or maybe even beat, some of the transformers out there. So we validate that at scales up to 3B, trained on 300B tokens.
So in absolute terms, you know, these are not very strong models. These are not yet models that you would actually play with and expect to do meaningful things. I think it's more of an academic comparison in terms of architecture. It's like, hey, trained for the same amount of tokens, it does as well as, or maybe slightly better than, some of the transformers out there.
And that's, in particular, been very exciting to us. Albert's been pushing on this for a while, I've been pushing on this for a while, and I think finally it seems to [00:20:00] close the gap with, or even surpass, the transformer. And I think it opens up a bunch of opportunities.
so inference could be way faster. maybe we would have different ways to understand how in context learning happens, et cetera. So, lots of, lots of future work I would expect.
Mamba hardware optimization
[00:20:22] Nathan Lambert: Yeah. Can you go into, like, what does it actually take to implement some of these new CUDA kernels? I just remember when this paper was announced, Sasha Rush, who's also very active in the space and recommended I talk with you two, was tweeting about the types of files or whatever.
In the paper, there's this discussion about how the recurrent state needs to be sufficiently expressive, but doing so in a certain type of memory is a problem. Like, translate what this means for people thinking about GPUs and people thinking about these models being scaled. Is it now much easier to scale these [00:21:00] models because they work on GPUs?
Which GPUs were you using? Is there a bump that could come just from going to H100s or something? Any of that?
[00:21:08] Tri Dao: Yeah. So, the previous line of work on state space, like S4 models, kind of pioneered by Albert's work, they are in some sense recurrent neural networks, but they have a much larger state. So, the state size is whatever kind of buffer you're going to store information in as you traverse or as you process the sequence.
In some sense, you can view the transformer as doing that as well, where it keeps the entire history, which is usually called the KV cache. It keeps the history and keeps referring to it. RNNs have a fixed-size state; for the transformer, you can think of the state size as increasing. And our intuition [00:22:00] is that the larger the state size, the easier it is for the model to do well.
So basically, you have more space to store whatever you need to remember. Previous models like S4 and so on have an implicitly pretty large state size, but they use the convolutional view to avoid having to materialize the state. So that was wonderful. Michael has worked on the Hyena architecture, which has used some of the same insight, focusing on convolution.
Mamba, on the other hand, focuses on the recurrent view. We wanted to put more input dependency in the recurrence. The thinking was that it was going to make it more expressive and the model would do better, but that prevents us from using this convolutional view that would make things efficient.
So we had to figure out a different way to make things efficient, and I focused on making that efficient on GPUs. You [00:23:00] know, our thinking was, okay, we're going to have a large state size, but we don't have to write it to the main GPU memory, the HBM; we can just keep that large state in a faster memory called SRAM, which you can think of as a cache. If you're more familiar with CPUs, this is like cache versus RAM. So, you know, if you have a large state, you can keep it in the cache, and by avoiding having to write it down, you actually don't suffer too much if the state is large.
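A toy version of the recurrent, input-dependent view described here might look like the following: the decay and write terms of a fixed-size state are computed from the current token, which is what breaks the convolutional shortcut and forces a sequential scan (the part Mamba implements as a fused kernel that keeps the state in fast SRAM). The names and gating choices are assumptions for illustration, not the actual Mamba selective-scan kernel.

```python
import numpy as np

def selective_ssm(x, W_a, W_b, W_c):
    """Toy input-dependent ("selective") recurrence: the decay and write
    gates are functions of the current token, so the update cannot be
    rewritten as a time-invariant convolution and must be scanned.
    Illustrative sketch only, not the actual Mamba kernel."""
    L, d_in = x.shape
    n = W_a.shape[0]                               # state size: the "buffer" described above
    h = np.zeros(n)
    y = np.zeros((L, W_c.shape[0]))
    for t in range(L):
        a = 1.0 / (1.0 + np.exp(-(W_a @ x[t])))    # input-dependent decay in (0, 1)
        b = np.tanh(W_b @ x[t])                    # input-dependent write
        h = a * h + b                              # fixed-size state, updated per token
        y[t] = W_c @ h                             # read out from the state
    return y

L, d_in, n, d_out = 16, 8, 32, 8
rng = np.random.default_rng(0)
y = selective_ssm(rng.normal(size=(L, d_in)),
                  rng.normal(size=(n, d_in)) * 0.1,
                  rng.normal(size=(n, d_in)) * 0.1,
                  rng.normal(size=(d_out, n)) * 0.1)
print(y.shape)  # (16, 8)
```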
Predictions for 2024 architectures
[00:23:33] Nathan Lambert: Would this be due to, like, input/output, like having to move the data around being really slow? Yes. Yeah. That makes a lot of sense. Thanks. That's a really common constraint in a lot of these things. I think one of the most insightful things I've learned now with GPUs versus TPUs and stuff is how mixture of expert models don't work very well on TPUs, just because, at a basic level, a mixture of experts essentially adds
a routing layer that you learn, [00:24:00] and then multiple feedforward layers that you can choose from. And when you're distributing this, the feedforward layers could end up on a different TPU node, and TPUs communicate with their neighbors. So TPUs take a big hit relative to GPUs, where within NVIDIA clusters everything's connected so much more.
And then it's easy to do that sort of distributed training. And that's super interesting. And it's like, do you think there's going to be... I guess this is really where I want to open the conversation: what does this mean? What is going to happen in 2024 in this space? Are bigger players going to move in and be exploring this? My take, seeing how good the long-context learning could be and its fundamental limits, is that systems like ChatGPT are going to use a dense transformer model for most tasks,
and then if you need to do summarization, you might do a long-context specialized architecture. And then we could even see a whole quiver of architectures behind [00:25:00] something that you're using. But I think it's just like, is attention going to be dethroned? Is Sasha Rush somehow going to win this bet that everyone was following in the area?
What are you thinking about, either of you?
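For readers who want the mixture-of-experts routing described a moment ago in code form, here is a minimal single-token sketch: a learned router scores the experts, the top-k are evaluated, and their outputs are mixed with renormalized router weights. The names and the tanh "experts" are illustrative assumptions; in a real distributed setup, each expert feedforward block may live on a different device, which is where the TPU-versus-GPU communication cost comes in.

```python
import numpy as np

def moe_layer(x, W_router, experts, k=2):
    """Minimal mixture-of-experts routing for one token: score experts,
    run only the top-k, and combine with renormalized router weights.
    Purely illustrative; real layers batch tokens and shard experts."""
    logits = W_router @ x                      # one score per expert
    top = np.argsort(logits)[-k:]              # pick the k best experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                               # softmax over the chosen experts
    # Each "expert" here is just a feedforward weight matrix; in practice
    # experts can sit on different devices, so routing implies communication.
    return sum(wi * np.tanh(experts[e] @ x) for wi, e in zip(w, top))

d, n_experts = 8, 4
rng = np.random.default_rng(0)
experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
y = moe_layer(rng.normal(size=d), rng.normal(size=(n_experts, d)) * 0.1, experts)
print(y.shape)  # (8,)
```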
[00:25:14] Tri Dao: I think transformer is still a very, very strong architecture, and there is a proven recipe, right? You know, people are scaling to a trillion parameters right now. If you say, well, I just want the best performing model that runs most efficiently on my hardware, that has the most support on the software side,
transformer is a safe bet. I think it's here to stay. But I think there are new ideas, like state space, some of the ideas from linear attention. I think they're coming in. We've seen, as Michael mentioned, that mixing some of these components seems to improve performance, validated at, I think, 7B scale, but maybe it might even work at larger scale.
I think [00:26:00] people tend to be conservative, and, you know, focusing too much on model architecture might not be worth their time. Like, the Llama architecture is very, very strong; most people are building off of that. They're focusing on data, they're focusing on infrastructure, which makes sense. On my side personally, it's just plain interesting.
There are more, I would say, niche use cases, niche for now, where some of these alternative architectures are interesting: things like long context, different domains like audio and genomics. And there are just plain interesting scientific questions you can ask, like whether it follows instructions just as well, whether it follows intuition just as well, does it play well with quantization, and so on.
Those are just plain interesting research questions we can ask. Now, on the production level, I think the Transformer is still incredibly strong, very well supported, both hardware and software. But I think some of these new ideas are coming in, [00:27:00] and people might start, you know, putting them in as a component in the Transformer.
Maybe we'll still call them Transformers, but they'll just have more layers than just attention and MLP.
[00:27:11] Michael Poli: Yeah, I 100 percent agree with you. So attention as a, as a computational primitive is not going anywhere anytime soon. It's just a very efficient and a very convenient way to. Increase the effective state of, of your sequence processor. so at some level, if you're working with a model that only has a fixed state in each of its sequence mixers, you're, you have an assumption and your assumption is that you only need so much information in the sequence.
So there's always a trade-off between this kind of ratio of the state dimension and the sequence length. As you push things to the extreme, either in model size, so as you make the model bigger, wider, you effectively [00:28:00] introduce more state, or in sequence length, some of these margins, you know, some of this is speculation, but some of these margins will disappear, some of the trade-offs will change, especially at 14B, 30B, some of these very fat models.
But certainly either whether that's hybridizing or some kind of new, new block, we're certainly going to see some more innovation. That's, that's really exciting. My, my personal, if I had to make a prediction is that architectural design will get more interesting, more, more complex. There's going to be more to do.
More predictions for AI
[00:28:38] Nathan Lambert: Yeah, I mean, this year it's like... I've got some ten-minute clock, that's fine for us. I think, like, with mixture of experts being popular, and the same with state space models, this is all just really within a few months, outside of OpenAI. Like, they've been doing mixture of experts for a lot longer than everyone.
In terms of open and academic [00:29:00] communities, no one's really tried to do RLHF on mixture of experts. It should just work, but we have to learn all these things. And then the model grafting is becoming more of a real thing, which is super interesting. I agree that it's just fun to follow, and hopefully it gives academics and scientists more ways to influence the conversation, where industry is just about scaling and bigger models, where we could maybe do specific things better, which is what I'm telling open source companies to do with their language models anyways.
Like if they want to have a business model, they need to have an edge. So this all fits into that kind of narrative pretty well with my regards. Is there anything else you guys are following in ML? It doesn't have to be about state space models. Like what's, what's exciting for you broadly for next year?
[00:29:46] Tri Dao: Yeah, personally, I think data is still the most important thing. We're thinking a lot about how data influences model performance, like really teasing that [00:30:00] out, either, you know, having some synthetic tasks that correlate very well with model performance, that's been kind of the motivating
example in a lot of our papers, and work has been focusing on synthetic tasks, or having maybe smaller datasets that kind of make it easier to really understand what's really going on. So, I think, you know, personally, my focus is going to be on data for the next little bit.
Yeah, all the architecture stuff is fun, and making it hardware efficient is fun, but I think ultimately it's about data. If you look at the scaling law curves, different model architectures generally have the same slope; they're just a different offset.
It seems like the only thing that changes the slope is the data quality.
[00:30:58] Nathan Lambert: I love that point. That, that does [00:31:00] seem true. I have the plot from Mamba in this blog post that I'm writing, which is, it's just a little, just a little bit above the same slope.
[00:31:08] Michael Poli: Yeah, I'd add that data is really interesting. Sort of miniaturizing architecture design, finding and breaking down what tasks are involved in, for example, language modeling, and trying to package them into something that can be used to iterate, that's quite exciting. That was one of the main techniques that was used for this zoology paper that also looks into some of these different behaviors.
And personally, I'm also really excited about new applications: scientific applications, with the genomics work, but also more engineering-focused ones. We're seeing a shift. Right now language is still kind of the domain that gets the most clicks, [00:32:00] the most interest, but I think that will evolve over time.
and some of these other applications offer, even just talking about architectures, they offer a completely different design space that I'm excited to look into.
[00:32:13] Nathan Lambert: Yeah, everyone talks about language, but I feel like images and entertainment and videos are like the things that are so obviously going to generate so much value to me.
Like, I don't know the ceiling on language, but when you can access a somewhat local text and video model at your home workstation that's tailored to your preferences, the amount of value that creates is totally astronomical. I'm excited. I mean, I've started playing around with these, where I'd take
the text of the blog and convert it to DALL-E images and convert it to a video with generated audio, all with one Python script, and it's like, that's really easy to do. So I agree with you, the more-than-language view is fun to have.
[00:32:55] Tri Dao: And these things actually do work reasonably well in your experience, when you stitch [00:33:00] them all together?
[00:33:02] Nathan Lambert: It's not that good. The DALL-E images are pretty similar, but I'm doing something really naive where I literally take the text and have a system prompt, like, you're generating a series of images for visualizing a blog post, and it generates variations of all the machine learning thumbnails that you see everyone using.
The fun ones are where it's about Llama or Mamba or something, and then they generate animals in them, which is good. I think I could get much better at it and have a better segmentation system for the paragraphs, or have, like, ChatGPT summarize them or something like that. But I just know that within, like, a year, there's going to be a text-to-video API, and I'm just going to switch to it, and it's going to be great.
And so I'm like laying the groundwork for infrastructure to have like multimodal. Content as multimodal content distribution, really, and I just expect it to become very fun. I mean, like even the text to voice is pretty good. I think I don't have a studio, but once [00:34:00] you have a studio, it's going to be able to generate perfect audio for whatever you want.
So another one of my dreams, which is bad for young students, is I want to be able to give a slide deck to a script that returns the five-minute conference video that no one ever watches, just based on, like, GPT-4 reading the slide deck and voicing it for you. So those are the silly things that I have time to do because I'm not a professor.
[00:34:29] Tri Dao: Yeah, I think these, these, these advances, these systems, like they, they do generate a lot of economic value and, and we're seeing that already. Lots of companies are now switching to using these things. And I think it's going to change the way we work as, as you mentioned, the way we work, the way we're entertained.
So I'm just very exciting future.
[00:34:47] Nathan Lambert: Yeah. Anything else? Well, thanks for coming. I'll try to get you guys as much attention as I can bring; you never know, it could go viral these days. So I think this was a great conversation. People are really hungry for basic intuitions in [00:35:00] the area. So this is good.
[00:35:02] Tri Dao: Yeah. Thank you, Nathan. It's a pleasure. Absolutely.
[00:35:07] Michael Poli: Thanks for inviting us. And maybe, if, you know, there are more questions, is there a way to collect them, or to provide readers, or listeners, with an address or something? Happy to answer anything.
[00:35:24] Nathan Lambert: Yeah. I'll, I'll include contact info in the post and various ways.
This will be out there. You'll get your comments on Substack, YouTube, Twitter. It's a mess. You've got to pay attention to 10 million streams of information these days, but you'll, you'll get contacted by people. Thankfully, for some reason, people read my stuff, but here we are. So thanks for listening.