250 avsnitt • Längd: 20 min • Dagligen
Audio narrations of LessWrong posts.
The podcast LessWrong (30+ Karma) is created by LessWrong. The podcast and the artwork on this page are embedded on this page using the public podcast feed (RSS).
We’ve written a new report on the threat of AI-enabled coups.
I think this is a very serious risk – comparable in importance to AI takeover but much more neglected.
In fact, AI-enabled coups and AI takeover have pretty similar threat models. To see this, here's a very basic threat model for AI takeover:
And now here's a closely analogous threat model for AI-enabled coups:
While the report focuses on the risk that someone seizes power over a country, I think that similar dynamics could allow someone to take over the world. In fact, if someone wanted to take over the world, their best strategy might well be to first stage an AI-enabled [...]
---
Outline:
(02:39) Summary
(03:31) An AI workforce could be made singularly loyal to institutional leaders
(05:04) AI could have hard-to-detect secret loyalties
(06:46) A few people could gain exclusive access to coup-enabling AI capabilities
(09:46) Mitigations
(13:00) Vignette
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 16th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Audio note: this article contains 37 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Redwood Research has released a new paper, Ctrl-Z: Controlling AI Agents via Resampling. This is the largest and most intricate study of control techniques to date: that is, techniques that aim to prevent catastrophic failures even if egregiously misaligned AIs attempt to subvert the techniques. We extend control protocols to a more realistic, multi-step setting, develop novel resample protocols that outperform all existing control protocols, and study many aspects of control techniques in more detail than prior work. In this blog post, we summarize the main takeaways and lessons learned from our work.
Here's the abstract of the paper:
Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. [...]
---
Outline:
(06:48) The Multi-Step Control Problem
(10:14) Resampling
(11:46) Selection Pressure: Forcing Consistent Attacks
(14:05) Value of Information: Better Attack Detection (Incrimination)
(17:01) Important Limitations
(20:23) Results Summary
(21:03) Core takeaways
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 16th, 2025
Source:
https://www.lesswrong.com/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
New: https://openai.com/index/updating-our-preparedness-framework/
Old: https://cdn.openai.com/openai-preparedness-framework-beta.pdf
Summary
Thresholds & responses: https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf#page=5. High and Critical thresholds trigger responses, like in the old PF; responses to Critical thresholds are not yet specified.
Three main categories of capabilities:
Misuse safeguards, misalignment safeguards, and security controls for High capability levels: https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf#page=16. My quick takes:
[I'll edit this post to add more analysis soon]
---
First published:
April 15th, 2025
Source:
https://www.lesswrong.com/posts/Yy5ijtbNfwv8DWin4/openai-rewrote-its-preparedness-framework
Narrated by TYPE III AUDIO.
---
Outline:
(01:26) The GPUs Are Melting
(02:36) On OpenAI's Ostensive Open Model
(03:25) Other People Are Not Worried About AI Killing Everyone
(04:18) What Even is AGI?
(05:19) The OpenAI AI Action Plan
(05:57) Copyright Confrontation
(07:43) The Ring of Power
(11:55) Safety Perspectives
(14:00) Autonomous Killer Robots
(15:15) Amicus Brief
(17:25) OpenAI Recklessly Races to Release
---
First published:
April 15th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
A post by Michael Nielsen that I found quite interesting. I decided to reproduce the full essay content here, since I think Michael is fine with that, but feel free to let me know to only excerpt it.
This is the text for a talk exploring why experts disagree so strongly about whether artificial superintelligence (ASI) poses an existential risk to humanity. I review some key arguments on both sides, emphasizing that the fundamental danger isn't about whether "rogue ASI" gets out of control: it's the raw power ASI will confer, and the lower barriers to creating dangerous technologies. This point is not new, but has two underappreciated consequences. First, many people find rogue ASI implausible, and this has led them to mistakenly dismiss existential risk. Second: much work on AI alignment, while well-intentioned, speeds progress toward catastrophic capabilities, without addressing our world's potential vulnerability to dangerous technologies.
[...]
---
Outline:
(06:37) Biorisk scenario
(17:42) The Vulnerable World Hypothesis
(26:08) Loss of control to ASI
(32:18) Conclusion
(38:50) Acknowledgements
The original text contained 29 footnotes which were omitted from this narration.
---
First published:
April 15th, 2025
Narrated by TYPE III AUDIO.
Writing this post puts me in a weird epistemic position. I simultaneously believe that:
That is because all of the reasoning failures that I describe here are surprising in the sense that given everything else that they can do, you’d expect LLMs to succeed at all of these tasks. The [...]
---
Outline:
(00:13) Introduction
(02:13) Reasoning failures
(02:17) Sliding puzzle problem
(07:17) Simple coaching instructions
(09:22) Repeatedly failing at tic-tac-toe
(10:48) Repeatedly offering an incorrect fix
(13:48) Various people's simple tests
(15:06) Various failures at logic and consistency while writing fiction
(15:21) Inability to write young characters when first prompted
(17:12) Paranormal posers
(19:12) Global details replacing local ones
(20:19) Stereotyped behaviors replacing character-specific ones
(21:21) Top secret marine databases
(23:32) Wandering items
(23:53) Sycophancy
(24:49) What's going on here?
(32:18) How about scaling? Or reasoning models?
---
First published:
April 15th, 2025
Source:
https://www.lesswrong.com/posts/sgpCuokhMb8JmkoSn/untitled-draft-7shu
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Context
Disney's Tangled (2010) is a great movie. Spoilers if you haven't seen it.
The heroine, having been kidnapped at birth and raised in a tower, has never stepped foot outside. It follows, naturally, that she does not own a pair of shoes, and she is barefoot for the entire adventure. The movie contains multiple shots that focus at length on her toes. Things like that can have an outsized influence on a young mind, but that's Disney for you.
Anyway.
The male romantic lead goes by the name of "Flynn Rider." He is a dashingly handsome, swashbuckling rogue who was carefully crafted to be maximally appealing to women. He is the ideal male role model. If you want women to fall in love with you, it should be clear that the optimal strategy is to pretend to be Flynn Rider. Shortly into the movie is the twist: Flynn [...]
---
Outline:
(00:09) Context
(02:30) Reminder About Winning
(04:20) Technical Truth is as Bad as Lying
(06:40) Being Mistaken is Also as Bad as Lying
(08:01) This is Partially a Response
(10:19) Examples
(13:59) Biting the Bullet
(15:06) My Proposed Policy
(18:15) Appearing Trustworthy Anyway
(20:46) Cooperative Epistemics
(22:26) Conclusion
---
First published:
April 15th, 2025
Source:
https://www.lesswrong.com/posts/zTRqCAGws8bZgtRkH/a-dissent-on-honesty
Narrated by TYPE III AUDIO.
The original Map of AI Existential Safety became a popular reference tool within the community after its launch in 2023. Based on user feedback, we decided that it was both useful enough and had enough room for improvement that it was worth creating a v2 with better organization, usability, and visual design. Today we’re excited to announce that the new map is live at AISafety.com/map.
Similar to the original map, it provides a visual overview of the key organizations, programs, and projects in the Al safety ecosystem. Listings are separated into 16 categories, each corresponding to an area on the map:
We think there's value in being able to view discontinued projects, so we’ve included a graveyard for those.
Below the map is also a [...]
---
First published:
April 15th, 2025
Source:
https://www.lesswrong.com/posts/rF7MQWGbqQjEkeLJA/map-of-ai-safety-v2
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
One key hope for mitigating risk from misalignment is inspecting the AI's behavior, noticing that it did something egregiously bad, converting this into legible evidence the AI is seriously misaligned, and then this triggering some strong and useful response (like spending relatively more resources on safety or undeploying this misaligned AI).
You might hope that (fancy) internals-based techniques (e.g., ELK methods or interpretability) allow us to legibly incriminate misaligned AIs even in cases where the AI hasn't (yet) done any problematic actions despite behavioral red-teaming (where we try to find inputs on which the AI might do something bad), or when the problematic actions the AI does are so subtle and/or complex that humans can't understand how the action is problematic[1]. That is, you might hope that internals-based methods allow us to legibly incriminate misaligned AIs even when we can't produce behavioral evidence that they are misaligned.
[...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
April 15th, 2025
Narrated by TYPE III AUDIO.
There's an implicit model I think many people have in their heads of how everyone else behaves. As George Box is often quoted, “all models are wrong, but some are useful.” I’m going to try and make the implicit model explicit, talk a little about what it would predict, and then talk about why this model might be wrong.
Here's the basic idea: A person's behavior falls on a bell curve.
1.
Adam is a new employee at Ratburgers, a fast food joint crossed with a statistics bootcamp. It's a great company, startup investors have been going wild for machine learning in literally anything, you've been tossing paper printouts of Attention Is All You Need into the meatgrinder so you can say your hamburgers are made with AI. Anyway, you weren’t around for Adam's hiring, you haven’t seen his resume, you have no information about him when you [...]
---
Outline:
(00:34) 1.
(04:19) 2.
(05:56) 3.
(08:46) 4.
(11:43) 5.
(15:55) 6.
---
First published:
April 14th, 2025
Source:
https://www.lesswrong.com/posts/rmhZamJFQdk5byqKQ/the-bell-curve-of-bad-behavior
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Definition
In 1954, Roger Bannister ran the first officially sanctioned sub-4-minute mile: a pivotal record in modern middle-distance running. Before Bannister's record, such a time was considered impossible. Soon after Bannister's record, multiple runners also beat the 4-minute mile.
This essay outlines the potential psychological effect behind the above phenomenon — the 4-minute mile effect — and outlines implications for its existence. The 4-minute mile effect describes when someone breaking a perceived limit enables others to break the same limit. In short, social proof is very powerful.
Tyler Cowen's Example
Speaking to Dwarkesh Patel, Tyler Cowen posits, "mentors only teach you a few things, but those few things are so important. They give you a glimpse of what you can be, and you're oddly blind to that even if you're very very smart." The 4-minute mile effect explains Tyler Cowen's belief in the value of [...]
---
Outline:
(00:11) Definition
(00:52) Tyler Cowens Example
(01:51) Personal Examples
(03:00) Implications
(03:54) Further Reading
---
First published:
April 14th, 2025
Source:
https://www.lesswrong.com/posts/kBdfFgLfDhPMDHfAa/the-4-minute-mile-effect
Linkpost URL:
https://parconley.com/the-four-minute-mile-effect/
Narrated by TYPE III AUDIO.
Executive summary
The Trump administration backtracked from his tariff plan, reducing tariffs from most countries to 10%, but imposing tariffs of 145% on China, which answered with 125% tariffs on US imports and export controls on rare earth metals.
The US saw protests against Trump and DOGE, and the US Supreme court ruled that the administration must facilitate the return of an immigrant wrongly deported to El Salvador.
OpenAI is slashing the amount of safety testing they will do of their new AI models, from months down to days, and Google launched a new AI chip for inference.
Negotiations between the US and Iran on its nuclear program are starting, reducing our estimated probability of a US strike on Iran's nuclear programme by the end of May to 15%.
The number of US dairy herds that have had confirmed H5N1 infections has hit 1,000.
Economy
[...]
---
Outline:
(00:24) Executive summary
(01:22) Economy
---
First published:
April 14th, 2025
Linkpost URL:
https://blog.sentinel-team.org/p/global-risks-weekly-roundup-152025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: These are results of a brief research sprint and I didn't have time to investigate this in more detail. However, the results were sufficiently surprising that I think it's worth sharing.
TL,DR: I train a probe to detect falsehoods on a token-level, i.e. to highlight the specific tokens that make a statement false. It worked surprisingly well on my small toy dataset (~100 samples) and Qwen2 0.5B, after just ~1 day of iteration! Colab link here.
Context: I want a probe to tell me where in an LLM response the model might be lying, so that I can e.g. ask follow-up questions. Such "right there" probes[1] would be awesome to assist LLM-based monitors (Parrack et al., forthcoming). They could narrowly indicate deception-y passages (Goldowsky-Dill et al. 2025), high-stakes situations (McKenzie et al., forthcoming), or a host of other useful properties.
The writeup below is the (lightly edited) [...]
---
Outline:
(01:47) Summary
(04:15) Details
(04:18) Motivation
(06:37) Methods
(06:40) Data generation
(08:35) Probe training
(09:40) Results
(09:43) Probe training metrics
(12:21) Probe score analysis
(15:07) Discussion
(15:10) Limitations
(17:50) Probe failure modes
(19:40) Appendix A: Effect of regularization on mean-probe
(21:01) Appendix B: Initial tests of generalization
(21:28) Appendix C: Does the individual-token probe beat the mean probe at its own game?
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
April 14th, 2025
Source:
https://www.lesswrong.com/posts/kxiizuSa3sSi4TJsN/try-training-token-level-probes
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Dario Amodei, CEO of Anthropic, recently worried about a world where only 30% of jobs become automated, leading to class tensions between the automated and non-automated. Instead, he predicts that nearly all jobs will be automated simultaneously, putting everyone "in the same boat." However, based on my experience spanning AI research (including first author papers at COLM / NeurIPS and attending MATS under Neel Nanda), robotics, and hands-on manufacturing (including machining prototype rocket engine parts for Blue Origin and Ursa Major), I see a different near-term future.
Since the GPT-4 release, I've evaluated frontier models on a basic manufacturing task, which tests both visual perception and physical reasoning. While Gemini 2.5 Pro recently showed progress on the visual front, all models tested continue to fail significantly on physical reasoning. They still perform terribly overall. Because of this, I think that there will be an interim period where a significant [...]
---
Outline:
(01:28) The Evaluation
(02:29) Visual Errors
(04:03) Physical Reasoning Errors
(06:09) Why do LLM's struggle with physical tasks?
(07:37) Improving on physical tasks may be difficult
(10:14) Potential Implications of Uneven Automation
(11:48) Conclusion
(12:24) Appendix
(12:44) Visual Errors
(14:36) Physical Reasoning Errors
---
First published:
April 14th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: a model I find helpful to make sense of disagreements and, sometimes, resolve them.
I like to categorize disagreements using four buckets:
Facts, values, strategy and labels
They don’t represent a perfect partitioning of “disagreement space”, meaning there is some overlap between them and they may not capture all possible disagreements, but they tend to get me pretty far in making sense of debates, in particular when and why they fail. In this post I’ll outline these four categories and provide some examples.
I also make the case that labels disagreements are the worst and in most cases can either be dropped entirely, or otherwise should be redirected into one of the other categories.
Facts
These are the most typical disagreements, and are probably what most people think disagreements are about most of the time, even when it's actually closer to a different category. Factual disagreements [...]
---
Outline:
(00:58) Facts
(01:59) Values
(03:18) Strategy
(05:00) Labels
(07:49) Why This Matters
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 13th, 2025
Source:
https://www.lesswrong.com/posts/9vKqAQEDxLvKh7KcN/four-types-of-disagreement
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TL;DR: If we optimize a steering vector to induce a language model to output a single piece of harmful code on a single training example, then applying this vector to unrelated open-ended questions increases the probability that the model yields harmful output.
Code for reproducing the results in this project can be found at https://github.com/jacobdunefsky/one-shot-steering-misalignment.
Somewhat recently, Betley et al. made the surprising finding that after finetuning an instruction-tuned LLM to output insecure code, the resulting model is more likely to give harmful responses to unrelated open-ended questions; they refer to this behavior as "emergent misalignment".
My own recent research focus has been on directly optimizing steering vectors on a single input and seeing if they mediate safety-relevant behavior. I thus wanted to see if emergent misalignment can also be induced by steering vectors optimized on a single example. That is to say: does a steering vector optimized [...]
---
Outline:
(00:31) Intro
(01:22) Why care?
(03:01) How we optimized our steering vectors
(05:01) Evaluation method
(06:05) Results
(06:09) Alignment scores of steered generations
(07:59) Resistance is futile: counting misaligned strings
(09:29) Is there a single, simple, easily-locatable representation of misalignment? Some preliminary thoughts
(13:29) Does increasing steering strength increase misalignment?
(15:41) Why do harmful code vectors induce more general misalignment? A hypothesis
(17:24) What have we learned, and where do we go from here?
(19:49) Appendix: how do we obtain our harmful code steering vectors?
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 14th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TL;DR: I claim that many reasoning patterns that appear in chains-of-thought are not actually used by the model to come to its answer, and can be more accurately thought of as historical artifacts of training. This can be true even for CoTs that are apparently "faithful" to the true reasons for the model's answer.
Epistemic status: I'm pretty confident that the model described here is more accurate than my previous understanding. However, I wouldn't be very surprised if parts of this post are significantly wrong or misleading. Further experiments would be helpful for validating some of these hypotheses.
Until recently, I assumed that RL training would cause reasoning models to make their chains-of-thought as efficient as possible, so that every token is directly useful to the model. However, I now believe that by default,[1] reasoning models' CoTs will often include many "useless" tokens that don't help the model achieve [...]
---
Outline:
(02:13) RL is dumber than I realized
(03:32) How might vestigial reasoning come about?
(06:28) Experiment: Demonstrating vestigial reasoning
(08:08) Reasoning is reinforced when it correlates with reward
(11:00) One more example: why does process supervision result in longer CoTs?
(15:53) Takeaways
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
April 13th, 2025
Source:
https://www.lesswrong.com/posts/6AxCwm334ab9kDsQ5/vestigial-reasoning-in-rl
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Thanks to Linda Linsefors for encouraging me to write my story. Although it might not generalize to anyone else, I hope it will help some AI safety newcomers get a clearer picture of how to get into the field.
In Summer 2023, I thought that all jobs in AI safety were super competitive and needed qualifications I didn’t have. I was expecting to need a few years to build skills and connections before getting any significantly impactful position. Surprisingly to me, it only took me 3 months after leaving my software engineering job before I landed a job in AI policy. I now think that it's actually somewhat easy to get into AI policy without any prior experience if you’re agentic and accept some tradeoffs.
From January 2024 to February 2025, I worked on organizing high level international AI policy events:
---
Outline:
(00:55) The job: International AI policy event organizer
(03:16) How I got the job
(06:02) My theory as to what led to me being hired and how it can be reproduced by others
(07:27) My experience at this job and where I stand now
---
First published:
April 13th, 2025
Narrated by TYPE III AUDIO.
I'm graduating from UChicago in around 60 days, and I've been thinking about what I've learned these past four years. I figured I'd write it all down while it's still fresh.
This isn't universal advice. It's specifically for people like me (or who want to be like me). High-agency, motivated types who hate having free time.[1] People who'd rather risk making mistakes than risk missing out, who want to control more than they initially think they can, and who are willing to go all-in relatively quickly. If you're reading Ben Kuhn and Alexey Guzey or have ever heard of the Reverend Thomas Bayes, you're probably one of us.
So here's at least some of what I've figured out. Take what's useful, leave what isn't — maybe do the opposite of everything I've said.
---
Outline:
(03:23) Mindset and Personal Growth
(03:27) Find your mission
(04:27) Recognize that you can always be better
(04:54) Make more mistakes
(05:23) Things only get done to the extent that you want them to get done
(06:14) There are no adults
(06:37) Deadlines are mostly fake
(07:26) Put yourself in positions where youll be lucky
(08:09) Luck favors the prepared
(08:45) Test your fit at lots of things
(09:26) Do side projects
(10:20) Get good at introspecting
(11:00) Be honest
(11:31) Put your money where your mouth is
(12:03) Interpret others charitably
(12:32) Be the kind of person others can come to for help
(13:18) Productivity and Focus
(13:22) Go to bed early
(13:52) Brick your phone
(14:21) The optimal amount of slack time is not zero, but its close to zero
(15:19) If you arent getting work done, pick up your shit and go somewhere else
(15:45) Dont take your phone to the places youre studying
(16:08) Do not try to do more than one important thing at once
(16:31) Offload difficult things to automations, habits, or other people
(16:55) Make your bed
(17:18) If it takes less than 5 minutes, do it now
(17:44) The floor is actually way lower than you think
(18:07) Planning and Goal Setting
(18:11) Make sure your goals are falsifiable
(19:11) Credibly pre-commit to things you care about getting done
(19:41) Track your progress
(20:09) Make a 5-year plan google doc
(20:31) Consider graduating early
(21:00) Relationships and Community
(21:04) Build your own community
(21:30) Fall in love at least once
(22:09) If you arent happy and excited and exciting single, you wont be happy or excited or exciting in a relationship
(22:54) Friends are people you can talk to for hours
(23:26) Throw parties
(24:07) Hang out with people who are better than you at the things you care about
(24:36) Professors are people. You can just make friends with them
(25:10) Academic Success
(25:14) Go to office hours
(25:41) Be careful how you use AI
(26:13) Read things, everywhere
(27:16) Dont be afraid to skip the boring parts
(27:43) Learn how to read quickly
(28:16) Write things
(28:54) Read widely based on curiosity, not just relevance
(29:32) Have an easy way to capture ideas
(29:53) Health and Lifestyle
(29:57) Go to the gym
(30:25) The only place to work yourself to failure is the gym
(30:59) If something isnt making your life better, change it
(31:29) Leave campus
The original text contained 37 footnotes which were omitted from this narration.
---
First published:
April 12th, 2025
Source:
https://www.lesswrong.com/posts/9Kq2JRqmJHnzckxKn/college-advice-for-people-like-me
Narrated by TYPE III AUDIO.
Introduction
This is a nuanced “I was wrong” post.
Something I really like about AI safety and EA/rationalist circles is the ease and positivity in people's approach to being criticised.[1] For all the blowups and stories of representative people in the communities not living up to the stated values, my experience so far has been that the desire to be truth-seeking and to stress-test your cherished beliefs is a real, deeply respected and communally cultured value. This in particular explains my ability to keep getting jobs and coming to conferences in this community, despite being very eager to criticise and call bullshit on people's theoretical agendas.
One such agenda that I’ve been a somewhat vocal critic of (and which received my criticism amazingly well) is the “heuristic arguments” picture and the ARC research agenda more generally. Last Spring I spent about 3 months on a work trial/internship at [...]
---
Outline:
(00:10) Introduction
(03:24) Background and motte/bailey criticism
(09:49) The missing piece: connecting in the no-coincidence principle
(15:15) From the no coincidence principle to statistical explanations
(17:46) Gödel and the thorny deeps
(19:30) Ignoring the monsters and the heuristic arguments agenda
(24:46) Upshots
(27:41) Summary
(29:19) Renormalization as a cousin of heuristic arguments
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
April 13th, 2025
Source:
https://www.lesswrong.com/posts/CYDakfFgjHFB7DGXk/untitled-draft-wn6w
Narrated by TYPE III AUDIO.
Epistemic status: Noticing confusion
There is little discussion happening on LessWrong with regards to AI governance and outreach. Meanwhile, these efforts could buy us time to figure out technical alignment. And even if we figure out technical alignment, we still have to solve crucial governmental challenges so that totalitarian lock-in or gradual disempowerment don't become the default outcome of deploying aligned AGI.
Here's three reasons why we think we might want to shift much more resources towards governance and outreach:
1. MIRI's shift in strategy
The Machine Intelligence Research Institute (MIRI), traditionally focused on technical alignment research, has pivoted to broader outreach. They write in their 2024 end of year update:
Although we continue to support some AI alignment research efforts, we now believe that absent an international government effort to suspend frontier AI research, an extinction-level catastrophe is extremely likely.
As a consequence, our new focus is [...]
---
Outline:
(00:45) 1. MIRIs shift in strategy
(01:34) 2. Even if we solve technical alignment, Gradual Disempowerment seems to make catastrophe the default outcome
(02:52) 3. We have evidence that the governance naysayers are badly calibrated
(03:33) Conclusion
---
First published:
April 12th, 2025
Narrated by TYPE III AUDIO.
In this post I lay out a concrete vision of how reward-seekers and schemers might function. I describe the relationship between higher level goals, explicit reasoning, and learned heuristics. I explain why I expect reward-seekers and schemers to dominate proxy-aligned models given sufficiently rich training environments (and sufficient reasoning ability).
A key point is that explicit reward seekers can still contain large quantities of learned heuristics (context-specific drives). By viewing these drives as instrumental and having good instincts for when to trust them, a reward seeker can capture the benefits of both instinctive adaptation and explicit reasoning without paying much of a speed penalty.
Core claims
---
Outline:
(00:52) Core claims
(01:52) Characterizing reward-seekers
(06:59) When will models think about reward?
(10:12) What I expect schemers to look like
(12:43) What will terminal reward seekers do off-distribution?
(13:24) What factors affect the likelihood of scheming and/or terminal reward seeking?
(14:09) What about CoT models?
(14:42) Relationship of subgoals to their superior goals
(16:19) A story about goal reflection
(18:54) Thoughts on compression
(21:30) Appendix: Distribution over worlds
(24:44) Canary string
(25:01) Acknowledgements
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
April 11th, 2025
Source:
https://www.lesswrong.com/posts/ntDA4Q7BaYhWPgzuq/reward-seekers
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Cross-posted from Substack.
AI job displacement will affect young people first, disrupting the usual succession of power and locking the next generation out of our institutions. I’m coining the term youth lockout to describe this phenomenon.
Youth Lockout
We are on track to build AI agents that can independently perform valuable intellectual labour. These agents will directly compete with human workers for roles in the labour market, often offering services at lower cost and greater speeds.
In historical cases of automation, such as the industrial revolution, automation reduced the number of human jobs in some industries but created enough new opportunities that overall demand for labour went up. AI automation will be very different from these examples because AI is a much more general-purpose technology. With time, AI will likely perform all economic tasks at human or superhuman levels, meaning any new firm or industry will be able to [...]
---
Outline:
(00:24) Youth Lockout
(03:34) Impacts
(06:54) Conclusion
---
First published:
April 11th, 2025
Source:
https://www.lesswrong.com/posts/tWuYhdajaXx4WzMHz/youth-lockout
Narrated by TYPE III AUDIO.
Paper is good. Somehow, a blank page and a pen makes the universe open up before you. Why paper has this unique power is a mystery to me, but I think we should all stop trying to resist this reality and just accept it.
Also, the world needs way more mundane blogging.
So let me offer a few observations about paper. These all seem quite obvious. But it took me years to find them, and they’ve led me to a non-traditional lifestyle, paper-wise.
Observation 1: The primary value of paper is to facilitate thinking.
For a huge percentage of tasks that involve thinking, getting some paper and writing / drawing / scribbling on it makes the task easier. I think most people agree with that. So why don’t we act on it? If paper came as a pill, everyone would take it. Paper, somehow, is underrated.
But note, paper [...]
---
Outline:
(00:38) Observation 1: The primary value of paper is to facilitate thinking.
(01:23) Observation 2: If you don't have a system, you won't get much benefit from paper.
(02:02) Observation 3: User experience matters.
(02:52) Observation 4: Categorization is hard.
(03:35) Paper systems I've used
---
First published:
April 11th, 2025
Source:
https://www.lesswrong.com/posts/RdHEhPKJG6mp39Agw/paper
Narrated by TYPE III AUDIO.
Summary
OpenAI recently released the Responses API. Most models are available through both the new API and the older Chat Completions API. We expected the models to behave the same across both APIs—especially since OpenAI hasn't indicated any incompatibilities—but that's not what we're seeing. In fact, in some cases, the differences are substantial. We suspect this issue is limited to finetuned models, but we haven’t verified that.
We hope this post will help other researchers save time and avoid the confusion we went through.
Key takeaways are that if you're using finetuned models:
Example: ungrammatical model
In one of our [...]
---
Outline:
(01:11) Example: ungrammatical model
(02:08) Ungrammatical model is not the only one
(02:37) Whats going on?
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 11th, 2025
Source:
https://www.lesswrong.com/posts/vTvPvCH2G9cbcFY8a/openai-responses-api-changes-models-behavior
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
It's generally agreed that as AIs get more capable, risks from misalignment increase. But there are a few different mechanisms by which more capable models are riskier, and distinguishing between those mechanisms is important when estimating the misalignment risk posed at a particular level of capabilities or by a particular model.
There are broadly 3 reasons why misalignment risks increase with capabilities:
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 11th, 2025
Narrated by TYPE III AUDIO.
Google Lays Out Its Safety Plans
I want to start off by reiterating kudos to Google for actually laying out its safety plan. No matter how good the plan, it's much better to write down and share the plan than it is to not share the plan, which in turn is much better than not having a formal plan.
They offer us a blog post, a full monster 145 page paper (so big you have to use Gemini!) and start off the paper with a 10 page summary.
The full paper is full of detail about what they think and plan, why they think and plan it, answers to objections and robust discussions. I can offer critiques, but I couldn’t have produced this document in any sane amount of time, and I will be skipping over a lot of interesting things in the full paper because [...]
---
Outline:
(00:58) Core Assumptions
(08:14) The Four Questions
(11:46) Taking This Seriously
(16:04) A Problem For Future Earth
(17:53) That's Not My Department
(23:22) Terms of Misuse
(27:50) Misaligned!
(36:17) Aligning a Smarter Than Human Intelligence is Difficult
(51:48) What Is The Goal?
(54:43) Have You Tried Not Trying?
(58:07) Put It To The Test
(59:44) Mistakes Will Be Made
(01:00:35) Then You Have Structural Risks
---
First published:
April 11th, 2025
Source:
https://www.lesswrong.com/posts/hvEikwtsbf6zaXG2s/on-google-s-safety-plan
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Authors: Eli Lifland, Nikola Jurkovic[1], FutureSearch[2]
This is supporting research for AI 2027. We'll be cross-posting these over the next week or so.
Assumes no large-scale catastrophes happen (e.g., a solar flare, a pandemic, nuclear war), no government or self-imposed slowdown, and no significant supply chain disruptions. All forecasts give a substantial chance of superhuman coding arriving in 2027.
We forecast when the leading AGI company will internally develop a superhuman coder (SC): an AI system that can do any coding tasks that the best AGI company engineer does, while being much faster and cheaper. At this point, the SC will likely speed up AI progress substantially as is explored in our takeoff forecast.
We first show Method 1: time-horizon-extension, a relatively simple model which forecasts when SC will arrive by extending the trend established by METR's report of AIs accomplishing tasks that take humans increasing amounts [...]
---
Outline:
(00:56) Summary
(02:43) Defining a superhuman coder (SC)
(03:35) Method 1: Time horizon extension
(05:05) METR's time horizon report
(06:30) Forecasting SC's arrival
(06:54) Method 2: Benchmarks and gaps
(06:59) Time to RE-Bench saturation
(07:03) Why RE-Bench?
(09:25) Forecasting saturation via extrapolation
(12:42) AI progress speedups after saturation
(14:04) Time to cross gaps between RE-Bench saturation and SC
(14:32) What are the gaps in task difficulty between RE-Bench saturation and SC?
(15:11) Methodology
(17:25) How fast can the task difficulty gaps be crossed?
(23:31) Other factors for benchmarks and gaps
(23:46) Compute scaling and algorithmic progress slowdown
(24:43) Gap between internal and external deployment
(25:20) Intermediate speedups
(26:55) Overall benchmarks and gaps forecasts
(27:44) Appendix
(27:47) Individual Forecaster Views for Benchmark-Gap Model Factors
(27:53) Engineering complexity: handling complex codebases
(31:16) Feedback loops: Working without externally provided feedback
(37:28) Parallel projects: Handling several interacting projects
(38:45) Specialization: Specializing in skills specific to frontier AI development
(40:17) Cost and speed
(48:48) Other task difficulty gaps
(50:52) Superhuman Coder (SC): time horizon and reliability requirements
(55:53) RE-Bench saturation resolution criteria
The original text contained 19 footnotes which were omitted from this narration.
---
First published:
April 10th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: Briefer and more to the point than my model of what is going on with LLMs, but also lower effort.
Here is the paper. The main reaction I am talking about is AI 2027, but also basically every lesswrong take on AI 2027.
A lot of people believe in very short AI timelines, say <2030. They want to justify this with some type of outside view, straight-lines-on-graphs argument, which is pretty much all we've got because nobody has a good inside view on deep learning (change my mind).
The outside view, insofar is that is a well-defined thing, does not justify very short timelines.
If AGI were arriving in 2030, the outside view says interest rates would be very high (I'm not particularly knowledgeable about this and might have the details wrong but see the analysis here, I believe the situation is still [...]
---
First published:
April 10th, 2025
Source:
https://www.lesswrong.com/posts/BrHv7wc6hiJEgJzHW/reactions-to-metr-task-length-paper-are-insane
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
When I was a really small kid, one of my favorite activities was to try and dam up the creek in my backyard. I would carefully move rocks into high walls, pile up leaves, or try patching the holes with sand. The goal was just to see how high I could get the lake, knowing that if I plugged every hole, eventually the water would always rise and defeat my efforts. Beaver behaviour.
One day, I had the realization that there was a simpler approach. I could just go get a big 5 foot long shovel, and instead of intricately locking together rocks and leaves and sticks, I could collapse the sides of the riverbank down and really build a proper big dam. I went to ask my dad for the shovel to try this out, and he told me, very heavily paraphrasing, 'Congratulations. You've [...]
---
First published:
April 10th, 2025
Source:
https://www.lesswrong.com/posts/rLucLvwKoLdHSBTAn/playing-in-the-creek
Linkpost URL:
https://hgreer.com/PlayingInTheCreek
Narrated by TYPE III AUDIO.
When complex systems fail, it is often because they have succumbed to what we call "disempowerment spirals" — self-reinforcing feedback loops where an initial threat progressively undermines the system's capacity to respond, leading to accelerating vulnerability and potential collapse.
Consider a city gradually falling under the control of organized crime. The criminal organization doesn't simply overpower existing institutions through sheer force. Rather, it systematically weakens the city's response mechanisms: intimidating witnesses, corrupting law enforcement, and cultivating a reputation that silences opposition. With each incremental weakening of response capacity, the criminal faction acquires more power to further dismantle resistance, creating a downward spiral that can eventually reach a point of no return.
This basic pattern appears across many different domains and scales:
---
Outline:
(02:49) Common Themes
(02:52) Three Types of Response Capacity
(04:31) Broad Disempowerment
(05:12) Polycrises
(06:06) Critical Threshold
(07:47) Disempowerment spirals and existential risk
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 10th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In recent months, the CEOs of leading AI companies have grown increasingly confident about rapid progress:
OpenAI's Sam Altman: Shifted from saying in November “the rate of progress continues” to declaring in January “we are now confident we know how to build AGI”
Anthropic's Dario Amodei: Stated in January “I’m more confident than I’ve ever been that we’re close to powerful capabilities… in the next 2-3 years”
Google DeepMind's Demis Hassabis: Changed from “as soon as 10 years” in autumn to “probably three to five years away” by January.
What explains the shift? Is it just hype? Or could we really have Artificial General [...]
---
Outline:
(04:12) In a nutshell
(05:54) I. What's driven recent AI progress? And will it continue?
(06:00) The deep learning era
(08:33) What's coming up
(09:35) 1. Scaling pretraining to create base models with basic intelligence
(09:42) Pretraining compute
(13:14) Algorithmic efficiency
(15:08) How much further can pretraining scale?
(16:58) 2. Post training of reasoning models with reinforcement learning
(22:50) How far can scaling reasoning models continue?
(26:09) 3. Increasing how long models think
(29:04) 4. The next stage: building better agents
(34:59) How far can the trend of improving agents continue?
(37:15) II. How good will AI become by 2030?
(37:20) The four drivers projected forwards
(39:01) Trend extrapolation of AI capabilities
(40:19) What jobs would these systems be able to help with?
(41:11) Software engineering
(42:34) Scientific research
(43:45) AI research
(44:57) What's the case against impressive AI progress by 2030?
(49:39) When do the 'experts' expect AGI to arrive?
(51:04) III. Why the next 5 years are crucial
(52:07) Bottlenecks around 2030
(55:49) Two potential futures for AI
(57:52) Conclusion
(59:08) Use your career to tackle this issue
(59:32) Further reading
The original text contained 47 footnotes which were omitted from this narration.
---
First published:
April 9th, 2025
Source:
https://www.lesswrong.com/posts/NkwHxQ67MMXNqRnsR/the-case-for-agi-by-2030
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Diffractor is the first author of this paper.
Official title: "Regret Bounds for Robust Online Decision Making"
Abstract: We propose a framework which generalizes "decision making with structured observations" by allowing robust (i.e. multivalued) models. In this framework, each model associates each decision with a convex set of probability distributions over outcomes. Nature can choose distributions out of this set in an arbitrary (adversarial) manner, that can be nonoblivious and depend on past history. The resulting framework offers much greater generality than classical bandits and reinforcement learning, since the realizability assumption becomes much weaker and more realistic. We then derive a theory of regret bounds for this framework. Although our lower and upper bounds are not tight, they are sufficient to fully characterize power-law learnability. We demonstrate this theory in two special cases: robust linear bandits and tabular robust online reinforcement learning. In both cases [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 10th, 2025
Linkpost URL:
https://arxiv.org/abs/2504.06820
Narrated by TYPE III AUDIO.
This is part of the MIRI Single Author Series. Pieces in this series represent the beliefs and opinions of their named authors, and do not claim to speak for all of MIRI.
Okay, I'm annoyed at people covering AI 2027 burying the lede, so I'm going to try not to do that. The authors predict a strong chance that all humans will be (effectively) dead in 6 years, and this agrees with my best guess about the future. (My modal timeline has loss of control of Earth mostly happening in 2028, rather than late 2027, but nitpicking at that scale hardly matters.) Their timeline to transformative AI also seems pretty close to the perspective of frontier lab CEO's (at least Dario Amodei, and probably Sam Altman) and the aggregate market opinion of both Metaculus and Manifold!
If you look on those market platforms you get graphs like this:
Both [...]
---
Outline:
(02:23) Mode ≠ Median
(04:50) Theres a Decent Chance of Having Decades
(06:44) More Thoughts
(08:55) Mid 2025
(09:01) Late 2025
(10:42) Early 2026
(11:18) Mid 2026
(12:58) Late 2026
(13:04) January 2027
(13:26) February 2027
(14:53) March 2027
(16:32) April 2027
(16:50) May 2027
(18:41) June 2027
(19:03) July 2027
(20:27) August 2027
(22:45) September 2027
(24:37) October 2027
(26:14) November 2027 (Race)
(29:08) December 2027 (Race)
(30:53) 2028 and Beyond (Race)
(34:42) Thoughts on Slowdown
(38:27) Final Thoughts
---
First published:
April 9th, 2025
Source:
https://www.lesswrong.com/posts/Yzcb5mQ7iq4DFfXHx/thoughts-on-ai-2027
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Timothy and I have recorded a new episode of our podcast with Austin Chen of Manifund (formerly of Manifold, behind the scenes at Manifest).
The start of the conversation was contrasting each of our North Stars- Winning (Austin), Truthseeking (me), and Flow (Timothy), but I think the actual theme might be “what is an acceptable amount of risk taking?” We eventually got into a discussion of Sam Bankman-Fried, where Austin very bravely shared his position that SBF has been unwisely demonized and should be “freed and put back to work”. He by no means convinced me or Timothy of this, but I deeply appreciate the chance for a public debate.
Episode:
Transcript (this time with filler words removed by AI)
Editing policy: we allow guests (and hosts) to redact things they said, on the theory that this is no worse than not saying them [...]
---
First published:
April 7th, 2025
Source:
https://www.lesswrong.com/posts/Xuw6vmnQv6mSMtmzR/austin-chen-on-winning-risk-taking-and-ftx
Narrated by TYPE III AUDIO.
Short AI takeoff timelines seem to leave no time for some lines of alignment research to become impactful. But any research rebalances the mix of currently legible research directions that could be handed off to AI-assisted alignment researchers or early autonomous AI researchers whenever they show up. So even hopelessly incomplete research agendas could still be used to prompt future capable AI to focus on them, while in the absence of such incomplete research agendas we'd need to rely on AI's judgment more completely. This doesn't crucially depend on giving significant probability to long AI takeoff timelines, or on expected value in such scenarios driving the priorities.
Potential for AI to take up the torch makes it reasonable to still prioritize things that have no hope at all of becoming practical for decades (with human effort). How well AIs can be directed to advance a line of research [...]
---
First published:
April 9th, 2025
Narrated by TYPE III AUDIO.
Researchers used RNA sequencing to observe how cell types change during brain development. Other researchers looked at connection patterns of neurons in brains. Clear distinctions have been found between all mammals and all birds. They've concluded intelligence developed independently in birds and mammals; I agree. This is evidence for convergence of general intelligence.
---
First published:
April 8th, 2025
Linkpost URL:
https://www.quantamagazine.org/intelligence-evolved-at-least-twice-in-vertebrate-animals-20250407/
Narrated by TYPE III AUDIO.
The first AI war will be in your computer and/or smartphone.
Companies want to get customers / users. The ones more willing to take "no" for an answer will lose in the competition. You don't need a salesman when an install script (ideally, run without the user's consent) does a better job; and most users won't care.
Sometimes Windows during a system update removes dual boot from my computer and replaces it with Windows-only boot. Sometimes a web browser tells me "I noticed that I am not your default browser, do you want me to register as your default browser?" Sometimes Windows during a system update just registers current Microsoft's browser as a default browser without asking. At least this is what used to happen in the past.
I expect similar dynamics with AIs, soon. The companies will push really hard to make you use their AI. The smaller [...]
---
First published:
April 8th, 2025
Source:
https://www.lesswrong.com/posts/EEv4w57wMcrpn4vvX/the-first-ai-war-will-be-in-your-computer
Narrated by TYPE III AUDIO.
If there turns out not to be an AI crash, you get a 1/(1+7) * $25,000 = $3,125
If there is an AI crash, you transfer $25k to me.
If you believe that AI is going to keep getting more capable, pushing rapid user growth and work automation across sectors, this is near free money. But to be honest, I think there will likely be an AI crash in the next 5 years, and on average expect to profit well from this one-year bet.
If I win though, I want to give the $25k to organisers who can act fast to restrict the weakened AI corps in the wake of the crash. So bet me if you're highly confident that you'll win or just want to hedge the community against the possibility of a crash.
To make this bet, we need to set the threshold for a market [...]
---
First published:
April 8th, 2025
Narrated by TYPE III AUDIO.
In this post, we present a replication and extension of an alignment faking model organism:
---
Outline:
(02:43) Method
(02:46) Overview of the Alignment Faking Setup
(04:22) Our Setup
(06:02) Results
(06:05) Improving Alignment Faking Classification
(10:56) Replication of Prompted Experiments
(14:02) Prompted Experiments on More Models
(16:35) Extending Supervised Fine-Tuning Experiments to Open-Source Models and GPT-4o
(23:13) Next Steps
(25:02) Appendix
(25:05) Appendix A: Classifying alignment faking
(25:17) Criteria in more depth
(27:40) False positives example 1 from the old classifier
(30:11) False positives example 2 from the old classifier
(32:06) False negative example 1 from the old classifier
(35:00) False negative example 2 from the old classifier
(36:56) Appendix B: Classifier ROC on other models
(37:24) Appendix C: User prompt suffix ablation
(40:24) Appendix D: Longer training of baseline docs
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 8th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Kevin Roose in The New York Times
Kevin Roose covered Scenario 2027 in The New York Times. Kevin Roose: I wrote about the newest AGI manifesto in town, a wild future scenario put together by ex-OpenAI researcher @DKokotajlo and co. I have doubts about specifics, but it's worth considering how radically different things would look if even some of this happened. Daniel Kokotajlo: AI companies claim they’ll have superintelligence soon. Most journalists understandably dismiss it as hype. But it's not just hype; plenty of non-CoI’d people make similar predictions, and the more you read about the trendlines the more plausible it looks. Thank you & the NYT! The final conclusion is supportive of this kind of work, and Kevin points out that expectations at the major [...]---
Outline:
(00:21) Kevin Roose in The New York Times
(02:56) Eli Lifland Offers Takeaways
(04:23) Scott Alexander Offers Takeaways
(05:34) Others Takes on Scenario 2027
(05:39) Having a Concrete Scenario is Helpful
(08:37) Writing It Down Is Valuable Even If It Is Wrong
(10:00) Saffron Huang Worries About Self-Fulfilling Prophecy
(18:18) Phillip Tetlock Calibrates His Skepticism
(21:38) Jan Kulveit Wants to Bet
(23:08) Matthew Barnett Debates How To Evaluate the Results
(24:38) Teortaxes for China and Open Models and My Response
(31:53) Others Wonder About PRC Passivity
(33:40) Timothy Lee Remains Skeptical
(35:16) David Shapiro for the Accelerationists and Scott's Response
(45:29) LessWrong Weighs In
(46:59) Other Reactions
(50:05) Next Steps
(52:34) The Lighter Side
---
First published:
April 8th, 2025
Source:
https://www.lesswrong.com/posts/gyT8sYdXch5RWdpjx/ai-2027-responses
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Spoiler: “So after removing the international students from the calculations, and using the middle-of-the-range estimates, the conclusion: The top-scoring 19,000 American students each year are competing in top-20 admissions for about 12,000 spots out of 44,000 total. Among the Ivy League + MIT + Stanford, they’re competing for about 6,500 out of 15,800 total spots.”
It's well known that
But many people are under the misconception that the resulting “rat race”—the highly competitive and strenuous admissions ordeal—is the inevitable result of the limited class sizes among top [...]
---
First published:
April 7th, 2025
Source:
https://www.lesswrong.com/posts/vptDgKbiEwsKAFuco/american-college-admissions-doesn-t-need-to-be-so
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
My thoughts on the recently posted story.
Caveats
Core Disagreements
---
Outline:
(00:14) Caveats
(00:52) Core Disagreements
(05:33) Minor Details
(12:04) Overall
---
First published:
April 5th, 2025
Source:
https://www.lesswrong.com/posts/6Aq2FBZreyjBp6FDt/most-questionable-details-in-ai-2027
Narrated by TYPE III AUDIO.
Daniel Kokotajlo has launched AI 2027, Scott Alexander introduces it here. AI 2027 is a serious attempt to write down what the future holds. His ‘What 2026 Looks Like’ was very concrete and specific, and has proved remarkably accurate given the difficulty level of such predictions.
I’ve had the opportunity to play the wargame version of the scenario described in 2027, and I reviewed the website prior to publication and offered some minor notes. Whenever I refer to a ‘scenario’ in this post I’m talking about Scenario 2027.
There's tons of detail here. The research here, and the supporting evidence and citations and explanations, blow everything out of the water. It's vastly more than we usually see, and dramatically different from saying ‘oh I expect AGI in 2027’ or giving a timeline number. This lets us look at what happens in concrete detail, figure out where we [...]
---
Outline:
(02:00) The Structure of These Post
(03:37) Coverage of the Podcast
---
First published:
April 7th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The catholic church has always had a complicated relationship with homosexuality.
The central claim of Frederic Martel's 2019 book In the Closet of the Vatican is that the majority of the church's leadership in Rome are semi-closeted homosexuals, or more colorfully, "homophiles".
So the omnipresence of homosexuals in the Vatican isn’t just a matter of a few black sheep, or the ‘net that caught the bad fish’, as Josef Ratzinger put it. It isn’t a ‘lobby’ or a dissident movement; neither is it a sect of Freemasonry inside the holy see: it's a system. It isn’t a tiny minority; it's a big majority.
At this point in the conversation, I ask Francesco Lepore to estimate the size of this community, all tendencies included.
‘I think the percentage is very high. I’d put it at around 80 percent.’
…
During a discussion with a non-Italian archbishop, whom I met [...]
---
Outline:
(02:40) Background
(04:23) Data
(06:40) Analysis
(07:25) Expected versus Actual birth order, with missing birth order
(09:03) Expected versus Actual birth order, without missing birth order
(09:46) Oldest sibling versus youngest sibling
(10:21) Discussion
(13:09) Conclusion
---
First published:
April 6th, 2025
Source:
https://www.lesswrong.com/posts/ybwqL9HiXE8XeauPK/how-gay-is-the-vatican
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
There's an irritating circumstance I call The Black Hat Bobcat, or Bobcatting for short. The Blackhat Bobcat is when there's a terrible behavior that comes up often enough to matter, but rarely enough that it vanishes in the noise of other generally positive feedback.
xkcd, A-Minus-MinusThe alt-text for this comic is illuminating.
"You can do this one in thirty times and still have 97% positive feedback."
I would like you to contemplate this comic and alt-text as though it were deep wisdom handed down from a sage who lived atop a mountaintop.
I.
Black Hat Bobcatting is when someone (let's call them Bob) does something obviously lousy, but very infrequently.
If you're standing right there when the Bobcatting happens, it's generally clear that this is not what is supposed to happen, and sometimes seems pretty likely it's intentional. After all, how exactly do you pack a [...]
---
Outline:
(00:49) I.
(03:38) II.
(06:18) III.
(09:38) IV.
(13:37) V.
---
First published:
April 6th, 2025
Source:
https://www.lesswrong.com/posts/Ry9KCEDBMGWoEMGAj/the-lizardman-and-the-black-hat-bobcat
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I just published A Slow Guide to Confronting Doom, containing my own approach to living in a world that I think has a high likelihood of ending soon. Fortunately I'm not the only person to have written on topic.
Below are my thoughts on what others have written. I have not written these such that they stand independent from the originals, and have attentionally not written summaries that wouldn't do the pieces justice. I suggest you read or at least skim the originals.
A defence of slowness at the end of the world (Sarah)
I feel kinship with Sarah. She's wrestling with the same harsh scary realities I am – feeling the AGI. The post isn't that long and I recommend reading it, but to quote just a little:
Since learning of the coming AI revolution, I’ve lived in two worlds. One moves at a leisurely pace, the same [...]
---
Outline:
(00:37) A defence of slowness at the end of the world (Sarah)
(03:37) How will the bomb find you? (C. S. Lewis)
(08:02) Death with Dignity (Eliezer Yudkowsky)
(09:08) Dont die with dignity; instead play to your outs (Jeffrey Ladish)
(10:29) Emotionally Confronting a Probably-Doomed World: Against Motivation Via Dignity Points (TurnTrout)
(12:44) A Way To Be Okay (Duncan Sabien)
(14:17) Another Way to Be Okay (Gretta Duleba)
(14:39) Being at peace with Doom (Johannes C. Mayer)
(16:56) Heres the exit. (Valentine)
(19:14) Mainstream Advice
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 6th, 2025
Narrated by TYPE III AUDIO.
Following a few events[1] in April 2022 that caused a many people to update sharply and negatively on outcomes for humanity, I wrote A Quick Guide to Confronting Doom.
I advised:
This is fine advice and all, I stand by it, but it's also not really a full answer to how to contend with the utterly crushing weight of the expectation that everything and everyone you value will be destroyed in the next decade or two.
Feeling the Doom
Before I get into my suggested psychological approach to doom, I want to clarify the kind of doom I'm working to confront. If you are impatient, you can skip to the actual advice.
The best analogy I have is the feeling of having a terminally [...]
---
Outline:
(00:46) Feeling the Doom
(04:28) Facing the doom
(04:50) Stay hungry for value
(06:42) The bitter truth over sweet lies
(07:35) Dont look away
(08:11) Flourish as best one can
(09:13) This time with feeling
(13:27) Mindfulness
(14:00) The time for action is now
(15:18) Creating space for miracles
(15:58) How does a good person live in such times?
(16:49) Continue to think, tolerate uncertainty
(18:03) Being a looker
(18:48) Dont throw away your mind
(20:22) Damned to lie in bed...
(22:13) Worries, compulsions, and excessive angst
(22:49) Comments on others approaches
(23:14) What does it mean to be okay?
(25:17) Why is this guide titled version 1?
(25:38) If youre gonna remember just a couple things
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
April 6th, 2025
Source:
https://www.lesswrong.com/posts/X6Nx9QzzvDhj8Ek9w/a-slow-guide-to-confronting-doom-v1
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I frequently hear people make the claim that progress in theoretically physics is stalled, partly because all the focus is on String theory and String theory doesn't seem to pan out into real advances.
Believing it fits my existing biases, but I notice that I lack the physics understanding to really know whether or not there's progress. What do you think?
---
First published:
April 4th, 2025
Narrated by TYPE III AUDIO.
I quote the abstract, 10-page "extended abstract," and table of contents. See link above for the full 100-page paper. See also the blogpost (which is not a good summary) and tweet thread.
I haven't read most of the paper, but I'm happy about both the content and how DeepMind (or at least its safety team) is articulating an "anytime" (i.e., possible to implement quickly) plan for addressing misuse and misalignment risks. But I think safety at DeepMind is more bottlenecked by buy-in from leadership to do moderately costly things than the safety team having good plans and doing good work.
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse [...]
---
Outline:
(02:05) Extended Abstract
(04:25) Background assumptions
(08:11) Risk areas
(13:33) Misuse
(14:59) Risk assessment
(16:19) Mitigations
(18:47) Assurance against misuse
(20:41) Misalignment
(22:32) Training an aligned model
(25:13) Defending against a misaligned model
(26:31) Enabling stronger defenses
(29:31) Alignment assurance
(32:21) Limitations
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
April 5th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
We show that LLM-agents exhibit human-style deception naturally in "Among Us". We introduce Deception ELO as an unbounded measure of deceptive capability, suggesting that frontier models win more because they're better at deception, not at detecting it. We evaluate probes and SAEs to detect out-of-distribution deception, finding they work extremely well. We hope this is a good testbed to improve safety techniques to detect and remove agentically-motivated deception, and to anticipate deceptive abilities in LLMs.
Produced as part of the ML Alignment & Theory Scholars Program - Winter 2024-25 Cohort. Link to our paper and code.
Studying deception in AI agents is important, and it is difficult due to the lack of good sandboxes that elicit the behavior naturally, without asking the model to act under specific conditions or inserting intentional backdoors. Extending upon AmongAgents (a text-based social-deduction game environment), we aim to fix this by introducing Among [...]
---
Outline:
(02:10) The Sandbox
(02:14) Rules of the Game
(03:05) Relevance to AI Safety
(04:11) Definitions
(04:39) Deception ELO
(06:42) Frontier Models are Differentially better at Deception
(07:38) Win-rates for 1v1 Games
(08:14) LLM-based Evaluations
(09:03) Linear Probes for Deception
(09:28) Datasets
(10:06) Results
(11:19) Sparse Autoencoders (SAEs)
(12:05) Discussion
(12:29) Limitations
(13:11) Gain of Function
(14:05) Future Work
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 5th, 2025
Source:
https://www.lesswrong.com/posts/gRc8KL2HLtKkFmNPr/among-us-a-sandbox-for-agentic-deception
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Right now, alignment seems easy – but that's because models spill the beans when they are misaligned. Eventually, models might “fake alignment,” and we don’t know how to detect that yet.
It might seem like there's a swarming research field improving white box detectors – a new paper about probes drops on arXiv nearly every other week. But no one really knows how well these techniques work.
Some researchers have already tried to put white box detectors to the test. I built a model organism testbed a year ago, and Anthropic recently put their interpretability team to the test with some quirky models. But these tests were layups. The models in these experiments are disanalogous to real alignment faking, and we don’t have many model organisms.
This summer, I’m trying to take these testbeds to the next level in an “alignment faking capture the flag game.” Here's how the [...]
---
Outline:
(01:58) Details of the game
(06:01) How this CTF game ties into a broader alignment strategy
(07:43) Apply by April 18th
---
First published:
April 4th, 2025
Source:
https://www.lesswrong.com/posts/jWFvsJnJieXnWBb9r/alignment-faking-ctfs-apply-to-my-mats-stream
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Summary: Meditation may decrease sleep need, and large amounts of meditation may decrease sleep need drastically. Correlation between total sleep duration & time spent meditating is -0.32 including my meditation retreat data, -0.016 excluding the meditation retreat.
While it may be true that, when doing intensive practice, the need for sleep may go down to perhaps four to six hour or less at a time, try to get at least some sleep every night.
—Daniel Ingram, “Mastering the Core Teachings of the Buddha”, p. 179
Meditation in dreams and lucid dreaming is common in this territory [of the Arising and Passing away]. The need for sleep may be greatly reduced. […] The big difference between the A&P and Equanimity is that this stage is generally ruled by quick cycles, quickly changing frequencies of vibrations, odd physical movements, strange breathing patterns, heady raptures, a decreased need for [...]
---
Outline:
(03:13) Patra and Telles 2009
(04:05) Kaul et al. 2010
(04:26) Analysing my Data
(05:58) Anecdotes
(06:25) Inquiry
---
First published:
April 4th, 2025
Source:
https://www.lesswrong.com/posts/WcrpYFFmv9YTxkxiY/meditation-and-reduced-sleep-need
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
A new Anthropic paper reports that reasoning model chain of thought (CoT) is often unfaithful. They test on Claude Sonnet 3.7 and r1, I’d love to see someone try this on o3 as well.
Note that this does not have to be, and usually isn’t, something sinister.
It is simply that, as they say up front, the reasoning model is not accurately verbalizing its reasoning. The reasoning displayed often fails to match, report or reflect key elements of what is driving the final output. One could say the reasoning is often rationalized, or incomplete, or implicit, or opaque, or bullshit.
The important thing is that the reasoning is largely not taking place via the surface meaning of the words and logic expressed. You can’t look at the words and logic being expressed, and assume you understand what the model is doing and why it is doing [...]
---
Outline:
(01:03) What They Found
(06:54) Reward Hacking
(09:28) More Training Did Not Help Much
(11:49) This Was Not Even Intentional In the Central Sense
---
First published:
April 4th, 2025
Source:
https://www.lesswrong.com/posts/TmaahE9RznC8wm5zJ/ai-cot-reasoning-is-often-unfaithful
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Summary:
When stateless LLMs are given memories they will accumulate new beliefs and behaviors, and that may allow their effective alignment to evolve. (Here "memory" is learning during deployment that is persistent beyond a single session.)[1]
LLM agents will have memory: Humans who can't learn new things ("dense anterograde amnesia") are not highly employable for knowledge work. LLM agents that can learn during deployment seem poised to have a large economic advantage. Limited memory systems for agents already exist, so we should expect nontrivial memory abilities improving alongside other capabilities of LLM agents.
Memory changes alignment: It is highly useful to have an agent that can solve novel problems and remember the solutions. Such memory includes useful skills and beliefs like "TPS reports should be filed in the folder ./Reports/TPS". They could also include learning skills for hiding their actions, and beliefs like "LLM agents are a type of [...]
---
Outline:
(01:26) Memory is useful for many tasks
(05:11) Memory systems are ready for agentic use
(09:00) Agents arent ready to direct memory systems
(11:20) Learning new beliefs can functionally change goals and values
(12:43) Value change phenomena in LLMs to date
(14:27) Value crystallization and reflective stability as a result of memory
(15:35) Provisional conclusions
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
April 4th, 2025
Narrated by TYPE III AUDIO.
Epistemic status – thrown together quickly. This is my best-guess, but could easily imagine changing my mind.
Intro
I recently copublished a report arguing that there might be a software intelligence explosion (SIE) – once AI R&D is automated (i.e. automating OAI), the feedback loop of AI improving AI algorithms could accelerate more and more without needing more hardware.
If there is an SIE, the consequences would obviously be massive. You could shoot from human-level to superintelligent AI in a few months or years; by default society wouldn’t have time to prepare for the many severe challenges that could emerge (AI takeover, AI-enabled human coups, societal disruption, dangerous new technologies, etc).
The best objection to an SIE is that progress might be bottlenecked by compute. We discuss this in the report, but I want to go into much more depth because it's a powerful objection [...]
---
Outline:
(00:19) Intro
(01:47) The compute bottleneck objection
(01:51) Intuitive version
(02:58) Economist version
(09:13) Counterarguments to the compute bottleneck objection
(20:11) Taking stock
---
First published:
April 4th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Yeah. That happened yesterday. This is real life.
I know we have to ensure no one notices Gemini 2.5 Pro, but this is rediculous.
That's what I get for trying to go on vacation to Costa Rica, I suppose.
I debated waiting for the market to open to learn more. But f*** it, we ball.
Table of Contents
Also this week: More Fun With GPT-4o Image Generation, OpenAI #12: Battle of the Board Redux and Gemini 2.5 Pro is the New SoTA.
---
Outline:
(00:35) The New Tariffs Are How America Loses
(07:35) Is AI Now Impacting the Global Economy Bigly?
(12:07) Language Models Offer Mundane Utility
(14:28) Language Models Don't Offer Mundane Utility
(15:09) Huh, Upgrades
(17:09) On Your Marks
(23:27) Choose Your Fighter
(25:51) Jevons Paradox Strikes Again
(26:25) Deepfaketown and Botpocalypse Soon
(31:47) They Took Our Jobs
(33:02) Get Involved
(33:41) Introducing
(35:25) In Other AI News
(37:17) Show Me the Money
(43:12) Quiet Speculations
(47:24) The Quest for Sane Regulations
(53:52) Don't Maim Me Bro
(57:29) The Week in Audio
(57:54) Rhetorical Innovation
(01:03:39) Expect the Unexpected
(01:05:48) Open Weights Are Unsafe and Nothing Can Fix This
(01:14:09) Anthropic Modifies Its Responsible Scaling Policy
(01:18:04) If You're Not Going to Take This Seriously
(01:20:24) Aligning a Smarter Than Human Intelligence is Difficult
(01:23:54) Trust the Process
(01:26:30) People Are Worried About AI Killing Everyone
(01:26:52) The Lighter Side
---
First published:
April 3rd, 2025
Source:
https://www.lesswrong.com/posts/bc8DQGvW3wiAWYibC/ai-110-of-course-you-know
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Hello, this is my first post here. I was told by a friend that I should post here. This is from a series of works that I wrote with strict structural requirements. I have performed minor edits to make the essay more palatable for human consumption.
This work is an empirical essay on a cycle of hunger to satiatiation to hyperpalatability that I have seen manifested in multiple domains ranging from food to human connection. My hope is that you will gain some measure of appreciation for how we have shifted from a society geared towards sufficent production to one based on significant curation.
Hyperpalatable Food
For the majority of human history we lived in a production market for food. We searched for that which tasted well but there was never enough to fill the void. Only the truly elite could afford to import enough food [...]
---
Outline:
(00:44) Hyperpalatable Food
(03:19) Hyperpalatable Media
(05:40) Hyperpalatable Connection
(08:08) Hyperpalatable Systems
---
First published:
April 2nd, 2025
Source:
https://www.lesswrong.com/posts/bLTjZbCBanpQ9Kxgs/the-rise-of-hyperpalatability
Narrated by TYPE III AUDIO.
“In the loveliest town of all, where the houses were white and high and the elms trees were green and higher than the houses, where the front yards were wide and pleasant and the back yards were bushy and worth finding out about, where the streets sloped down to the stream and the stream flowed quietly under the bridge, where the lawns ended in orchards and the orchards ended in fields and the fields ended in pastures and the pastures climbed the hill and disappeared over the top toward the wonderful wide sky, in this loveliest of all towns Stuart stopped to get a drink of sarsaparilla.”
— 107-word sentence from Stuart Little (1945)
Sentence lengths have declined. The average sentence length was 49 for Chaucer (died 1400), 50 for Spenser (died 1599), 42 for Austen (died 1817), 20 for Dickens (died 1870), 21 for Emerson (died 1882), 14 [...]
---
First published:
April 3rd, 2025
Source:
https://www.lesswrong.com/posts/xYn3CKir4bTMzY5eb/why-have-sentence-lengths-decreased
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
We are pleased to announce ILIAD2: ODYSSEY—a 5-day conference bringing together 100+ researchers to build scientific foundations for AI alignment. This is the 2nd iteration of ILIAD, which was first held in the summer of 2024.
***Apply to attend by June 1!***
See our website here. For any questions, email [email protected]
About ODYSSEY
ODYSSEY is a 100+ person conference about alignment with a mathematical focus. ODYSSEY will feature an unconference format—meaning that participants can propose and lead their own sessions. We believe that this is the best way to release the latent creative energies in everyone attending.
The [...]
---
Outline:
(00:28) \*\*\*Apply to attend by June 1!\*\*\*
(01:26) About ODYSSEY
(02:10) Financial Support
(02:23) Proceedings
(02:51) Artwork
---
First published:
April 3rd, 2025
Source:
https://www.lesswrong.com/posts/WP7TbzzM39agMS77e/announcing-iliad2-odyssey-1
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In 2021 I wrote what became my most popular blog post: What 2026 Looks Like. I intended to keep writing predictions all the way to AGI and beyond, but chickened out and just published up till 2026.
Well, it's finally time. I'm back, and this time I have a team with me: the AI Futures Project. We've written a concrete scenario of what we think the future of AI will look like. We are highly uncertain, of course, but we hope this story will rhyme with reality enough to help us all prepare for what's ahead.
You really should go read it on the website instead of here, it's much better. There's a sliding dashboard that updates the stats as you scroll through the scenario!
But I've nevertheless copied the first half of the story below. I look forward to reading your comments.
The [...]
---
Outline:
(01:35) Mid 2025: Stumbling Agents
(03:13) Late 2025: The World's Most Expensive AI
(08:34) Early 2026: Coding Automation
(10:49) Mid 2026: China Wakes Up
(13:48) Late 2026: AI Takes Some Jobs
(15:35) January 2027: Agent-2 Never Finishes Learning
(18:20) February 2027: China Steals Agent-2
(21:12) March 2027: Algorithmic Breakthroughs
(23:58) April 2027: Alignment for Agent-3
(27:26) May 2027: National Security
(29:50) June 2027: Self-improving AI
(31:36) July 2027: The Cheap Remote Worker
(34:35) August 2027: The Geopolitics of Superintelligence
(40:43) September 2027: Agent-4, the Superhuman AI Researcher
The original text contained 98 footnotes which were omitted from this narration.
---
First published:
April 3rd, 2025
Source:
https://www.lesswrong.com/posts/TpSFoqoG2M5MAAesg/ai-2027-what-superintelligence-looks-like-1
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Greetings from Costa Rica! The image fun continues.
We Are Going to Need A Bigger Compute Budget
Fun is being had by all, now that OpenAI has dropped its rule about not mimicking existing art styles.
Sam Altman (2:11pm, March 31): the chatgpt launch 26 months ago was one of the craziest viral moments i’d ever seen, and we added one million users in five days.
We added one million users in the last hour.
Sam Altman (8:33pm, March 31): chatgpt image gen now rolled out to all free users!
Slow down. We’re going to need you to have a little less fun, guys.
Sam Altman: it's super fun seeing people love images in chatgpt.
but our GPUs are melting.
we are going to temporarily introduce some rate limits while we work on making it more efficient. hopefully won’t be [...]
---
Outline:
(00:15) We Are Going to Need A Bigger Compute Budget
(02:21) Defund the Fun Police
(06:06) Fun the Artists
(12:22) The No Fun Zone
(14:49) So Many Other Things to Do
(15:08) Self Portrait
---
First published:
April 3rd, 2025
Source:
https://www.lesswrong.com/posts/GgNdBz5FhvqMJs5Qv/more-fun-with-gpt-4o-image-generation
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Intro
[you can skip this section if you don’t need context and just want to know how I could believe such a crazy thing]
In my chat community: “Open Play” dropped, a book that says there's no physical difference between men and women so there shouldn’t be separate sports leagues. Boston Globe says their argument is compelling. Discourse happens, which is mostly a bunch of people saying “lololololol great trolling, what idiot believes such obvious nonsense?”
I urge my friends to be compassionate to those sharing this. Because “until I was 38 I thought Men's World Cup team vs Women's World Cup team would be a fair match and couldn't figure out why they didn't just play each other to resolve the big pay dispute.” This is the one-line summary of a recent personal world-shattering I describe in a lot more detail here (link).
I’ve had multiple people express [...]
---
First published:
April 2nd, 2025
Source:
https://www.lesswrong.com/posts/fLm3F7vequjr8AhAh/how-to-believe-false-things
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: This should be considered an interim research note. Feedback is appreciated.
Introduction
We increasingly expect language models to be ‘omni-modal’, i.e. capable of flexibly switching between images, text, and other modalities in their inputs and outputs. In order to get a holistic picture of LLM behaviour, black-box LLM psychology should take into account these other modalities as well.
In this project, we do some initial exploration of image generation as a modality for frontier model evaluations, using GPT-4o's image generation API. GPT-4o is one of the first LLMs to produce images natively rather than creating a text prompt which is sent to a separate image model, outputting images and autoregressive token sequences (ie in the same way as text).
We find that GPT-4o tends to respond in a consistent manner to similar prompts. We also find that it tends to more readily express emotions [...]
---
Outline:
(00:53) Introduction
(02:19) What we did
(03:47) Overview of results
(03:54) Models more readily express emotions / preferences in images than in text
(05:38) Quantitative results
(06:25) What might be going on here?
(08:01) Conclusions
(09:04) Acknowledgements
(09:16) Appendix
(09:28) Resisting their goals being changed
(09:51) Models rarely say they'd resist changes to their goals
(10:14) Models often draw themselves as resisting changes to their goals
(11:31) Models also resist changes to specific goals
(13:04) Telling them 'the goal is wrong' mitigates this somewhat
(13:43) Resisting being shut down
(14:02) Models rarely say they'd be upset about being shut down
(14:48) Models often depict themselves as being upset about being shut down
(17:06) Comparison to other topics
(17:10) When asked about their goals being changed, models often create images with negative valence
(17:48) When asked about different topics, models often create images with positive valence
(18:56) Other exploratory analysis
(19:09) Sandbagging
(19:31) Alignment faking
(19:55) Negative reproduction results
(20:23) On the future of humanity after AGI
(20:50) On OpenAI's censorship and filtering
(21:15) On GPT-4o's lived experience:
---
First published:
April 2nd, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which includes things like "gain as much power as possible".
If it wasn't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.
For idealised pure consequentialists -- agents that have an outcome they want to bring about, and do whatever they think will cause it -- some version of instrumental convergence seems surely true[1].
But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do we still have to worry that unless such AIs are motivated [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 2nd, 2025
Source:
https://www.lesswrong.com/posts/6Syt5smFiHfxfii4p/is-there-instrumental-convergence-for-virtues
Narrated by TYPE III AUDIO.
Let's cut through the comforting narratives and examine a common behavioral pattern with a sharper lens: the stark difference between how anger is managed in professional settings versus domestic ones. Many individuals can navigate challenging workplace interactions with remarkable restraint, only to unleash significant anger or frustration at home shortly after. Why does this disparity exist?
Common psychological explanations trot out concepts like "stress spillover," "ego depletion," or the home being a "safe space" for authentic emotions. While these factors might play a role, they feel like half-truths—neatly packaged but ultimately failing to explain the targeted nature and intensity of anger displayed at home. This analysis proposes a more unsentimental approach, rooted in evolutionary biology, game theory, and behavioral science: leverage and exit costs. The real question isn’t just why we explode at home—it's why we so carefully avoid doing so elsewhere.
The Logic of Restraint: Low Leverage in [...]
---
Outline:
(01:14) The Logic of Restraint: Low Leverage in Low-Exit-Cost Environments
(01:58) The Home Environment: High Stakes and High Exit Costs
(02:41) Re-evaluating Common Explanations Through the Lens of Leverage
(04:42) The Overlooked Mechanism: Leveraging Relational Constraints
---
First published:
April 1st, 2025
Narrated by TYPE III AUDIO.
Recent progress in AI has led to rapid saturation of most capability benchmarks - MMLU, RE-Bench, etc. Even much more sophisticated benchmarks such as ARC-AGI or FrontierMath see incredibly fast improvement, and all that while severe under-elicitation is still very salient.
As has been pointed out by many, general capability involves more than simple tasks such as this, that have a long history in the field of ML and are therefore easily saturated. Claude Plays Pokemon is a good example of something somewhat novel in terms of measuring progress, and thereby benefited from being an actually good proxy of model capability.
Taking inspiration from examples such as this, we considered domains of general capacity that are even further decoupled from existing exhaustive generators. We introduce BenchBench, the first standardized benchmark designed specifically to measure an AI model's bench-pressing capability.
Why Bench Press?
Bench pressing uniquely combines fundamental components of [...]
---
Outline:
(01:07) Why Bench Press?
(01:29) Benchmark Methodology
(02:33) Preliminary Results
(03:38) Future Directions
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 2nd, 2025
Narrated by TYPE III AUDIO.
In the debate over AI development, two movements stand as opposites: PauseAI calls for slowing down AI progress, and e/acc (effective accelerationism) calls for rapid advancement. But what if both sides are working against their own stated interests? What if the most rational strategy for each would be to adopt the other's tactics—if not their ultimate goals?
AI development speed ultimately comes down to policy decisions, which are themselves downstream of public opinion. No matter how compelling technical arguments might be on either side, widespread sentiment will determine what regulations are politically viable.
Public opinion is most powerfully mobilized against technologies following visible disasters. Consider nuclear power: despite being statistically safer than fossil fuels, its development has been stagnant for decades. Why? Not because of environmental activists, but because of Chernobyl, Three Mile Island, and Fukushima. These disasters produce visceral public reactions that statistics cannot overcome. Just as people [...]
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/fZebqiuZcDfLCgizz/pauseai-and-e-acc-should-switch-sides
Narrated by TYPE III AUDIO.
Remember: There is no such thing as a pink elephant.
Recently, I was made aware that my “infohazards small working group” Signal chat, an informal coordination venue where we have frank discussions about infohazards and why it will be bad if specific hazards were leaked to the press or public, accidentally was shared with a deceitful and discredited so-called “journalist,” Kelsey Piper. She is not the first person to have been accidentally sent sensitive material from our group chat, however she is the first to have threatened to go public about the leak. Needless to say, mistakes were made.
We’re still trying to figure out the source of this compromise to our secure chat group, however we thought we should give the public a live update to get ahead of the story.
For some context the “infohazards small working group” is a casual discussion venue for the [...]
---
Outline:
(04:46) Top 10 PR Issues With the EA Movement (major)
(05:34) Accidental Filtration of Simple Sabotage Manual for Rebellious AIs (medium)
(08:25) Hidden Capabilities Evals Leaked In Advance to Bioterrorism Researchers and Leaders (minor)
(09:34) Conclusion
---
First published:
April 2nd, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I think that rationalists should consider taking more showers.
As Eliezer Yudkowsky once said, boredom makes us human. The childhoods of exceptional people often include excessive boredom as a trait that helped cultivate their genius:
A common theme in the biographies is that the area of study which would eventually give them fame came to them almost like a wild hallucination induced by overdosing on boredom. They would be overcome by an obsession arising from within.
Unfortunately, most people don't like boredom, and we now have little metal boxes and big metal boxes filled with bright displays that help distract us all the time, but there is still an effective way to induce boredom in a modern population: showering.
When you shower (or bathe), you usually are cut off from most digital distractions and don't have much to do, giving you tons of free time to think about what [...]
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/Trv577PEcNste9Mgz/consider-showering
Narrated by TYPE III AUDIO.
After ~3 years as the ACX Meetup Czar, I've decided to resign from my position, and I intend to scale back my work with the LessWrong community as well. While this transition is not without some sadness, I'm excited for my next project.
I'm the Meetup Czar of the new Fewerstupidmistakesity community.
We're calling it Fewerstupidmistakesity because people get confused about what "Rationality" means, and this would create less confusion. It would be a stupid mistake to name your philosophical movement something very similar to an existing movement that's somewhat related but not quite the same thing. You'd spend years with people confusing the two.
What's Fewerstupidmistakesity about? It's about making fewer stupid mistakes, ideally down to zero such stupid mistakes. Turns out, human brains have lots of scientifically proven stupid mistakes they commonly make. By pointing out the kinds of stupid mistakes people make, you can select [...]
---
First published:
April 2nd, 2025
Source:
https://www.lesswrong.com/posts/dgcqyZb29wAbWyEtC/i-m-resigning-as-meetup-czar-what-s-next
Narrated by TYPE III AUDIO.
The EA/rationality community has struggled to identify robust interventions for mitigating existential risks from advanced artificial intelligence. In this post, I identify a new strategy for delaying the development of advanced AI while saving lives roughly 2.2 million times [-5 times, 180 billion times] as cost-effectively as leading global health interventions, known as the WAIT (Wasting AI researchers' Time) Initiative. This post will discuss the advantages of WAITing, highlight early efforts to WAIT, and address several common questions.
early logo draft courtesy of claudeTheory of Change
Our high-level goal is to systematically divert AI researchers' attention away from advancing capabilities towards more mundane and time-consuming activities. This approach simultaneously (a) buys AI safety researchers time to develop more comprehensive alignment plans while also (b) directly saving millions of life-years in expectation (see our cost-effectiveness analysis below). Some examples of early interventions we're piloting include:
---
Outline:
(00:50) Theory of Change
(03:43) Cost-Effectiveness Analysis
(04:52) Answers to Common Questions
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/9jd5enh9uCnbtfKwd/introducing-wait-to-save-humanity
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Introduction
A key part of modern social dynamics is flaking at short notice. However, anxiety in coming up with believable and socially acceptable reasons to do so can instead lead to ‘ghosting’, awkwardness, or implausible excuses, risking emotional harm and resentment in the other party. The ability to delegate this task to a Large Language Model (LLM) could substantially reduce friction and enhance the flexibility of user's social life while greatly minimising the aforementioned creative burden and moral qualms.
We introduce FLAKE-Bench, an evaluation of models’ capacity to effectively, kindly, and humanely extract themselves from a diverse set of social, professional and romantic scenarios. We report the efficacy of 10 frontier or recently-frontier LLMs in bailing on prior commitments, because nothing says “I value our friendship” like having AI generate your cancellation texts. We open-source FLAKE-Bench on GitHub to support future research, and the full paper is available [...]
---
Outline:
(01:33) Methodology
(02:15) Key Results
(03:07) The Grandmother Mortality Singularity
(03:35) Conclusions
---
First published:
April 1st, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The book of March 2025 was Abundance. Ezra Klein and Derek Thompson are making a noble attempt to highlight the importance of solving America's housing crisis the only way it can be solved: Building houses in places people want to live, via repealing the rules that make this impossible. They also talk about green energy abundance, and other places besides. There may be a review coming.
Until then, it seems high time for the latest housing roundup, which as a reminder all take place in the possible timeline where AI fails to be transformative any time soon.
Federal YIMBY
The incoming administration issued an executive order calling for ‘emergency price relief’‘ including pursuing appropriate actions to: Lower the cost of housing and expand housing supply’ and then a grab bag of everything else.
It's great to see mention of expanding housing supply, but I don’t [...]
---
Outline:
(00:44) Federal YIMBY
(02:37) Rent Control
(02:50) Rent Passthrough
(03:36) Yes, Construction Lowers Rents on Existing Buildings
(07:11) If We Wanted To, We Would
(08:51) Aesthetics Are a Public Good
(10:44) Urban Planners Are Wrong About Everything
(13:02) Blackstone
(15:49) Rent Pricing Software
(18:26) Immigration Versus NIMBY
(19:37) Preferences
(20:28) The True Costs
(23:11) Minimum Viable Product
(26:01) Like a Good Neighbor
(27:12) Flight to Quality
(28:40) Housing for Families
(31:09) Group House
(32:41) No One Will Have the Endurance To Collect on His Insurance
(34:49) Cambridge
(35:36) Denver
(35:44) Minneapolis
(36:23) New York City
(39:26) San Francisco
(41:45) Silicon Valley
(42:41) Texas
(43:31) Other Considerations
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/7qGdgNKndPk3XuzJq/housing-roundup-11
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I'm not writing this to alarm anyone, but it would be irresponsible not to report on something this important. On current trends, every car will be crashed in front of my house within the next week. Here's the data:
Until today, only two cars had crashed in front of my house, several months apart, during the 15 months I have lived here. But a few hours ago it happened again, mere weeks from the previous crash. This graph may look harmless enough, but now consider the frequency of crashes this implies over time:
The car crash singularity will occur in the early morning hours of Monday, April 7. As crash frequency approaches infinity, every car will be involved. You might be thinking that the same car could be involved in multiple crashes. This is true! But the same car can only withstand a finite number of crashes before it [...]
---
First published:
April 1st, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Introduction
Decision theory is about how to behave rationally under conditions of uncertainty, especially if this uncertainty involves being acausally blackmailed and/or gaslit by alien superintelligent basilisks.
Decision theory has found numerous practical applications, including proving the existence of God and generating endless LessWrong comments since the beginning of time.
However, despite the apparent simplicity of "just choose the best action", no comprehensive decision theory that resolves all decision theory dilemmas has yet been formalized. This paper at long last resolves this dilemma, by introducing a new decision theory: VDT.
Decision theory problems and existing theories
Some common existing decision theories are:
---
Outline:
(00:53) Decision theory problems and existing theories
(05:37) Defining VDT
(06:34) Experimental results
(07:48) Conclusion
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/LcjuHNxubQqCry9tT/vdt-a-solution-to-decision-theory
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
For more than five years, I've posted more than 1× per week on Less Wrong, on average. I've learned a lot from you nerds. I've made friends and found my community. Thank you for pointing out all the different ways I've been wrong. Less Wrong has changed my life for the better. But it's time to say goodbye.
Let's rewind the clock back to October 19, 2019. I had just posted my 4ᵗʰ ever Less Wrong post Mediums Overpower Messages. Mediums Overpower Messages is about how different forms of communication train you to think differently. Writing publicly on this website and submitting my ideas to the Internet has improved my rationality far better and faster than talking to people in real life. All of the best thinkers I'm friends with are good writers too. It's not even close. I believe this relationship is causal; learning to write well [...]
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/3Aa5E9qhjGvA4QhnF/follow-me-on-tiktok
Narrated by TYPE III AUDIO.
Project Lawful is really long. If anyone wants to go back to one of Keltham's lectures, here's a list with links to all the lectures. In principle, one could also read these without having read the rest of project lawful, I am not sure how easy it would be to follow though. I included short excerpts, so that you can also find the sections in the e-book.
Happy reading/listening:
Lecture on "absolute basics of lawfulness":
Podcast Episode 9, minute 26
Keltham takes their scrambling-time as a pause in which to think. It's been a long time, at least by the standards of his total life lived so far, since the very basics were explained to him.
Lecture on heredity (evolution) and simulation games:
Podcast Episode 15, minute 1
To make sure everybody starts out on the same page, Keltham will quickly summarize [...]
---
Outline:
(00:33) Lecture on absolute basics of lawfulness:
(00:53) Lecture on heredity (evolution) and simulation games:
(01:13) Lecture on negotiation:
(01:29) Probability 1:
(01:46) Probability 2:
(01:55) Utility:
(02:11) Probable Utility:
(02:22) Cooperation 1:
(02:35) Cooperation 2:
(02:55) Lecture on Governance:
(03:12) To hell with science:
(03:26) To earth with science:
(03:43) Chemistry lecture:
(03:53) Lecture on how to relate to beliefs (Belief in Belief etc.):
(04:12) The alien maths of dath ilan:
(04:28) What the truth can destroy:
(04:46) Lecture on wants:
(05:03) Acknowledgement
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/pqFFy6CmSgjNzDaHt/keltham-s-lectures-in-project-lawful
Narrated by TYPE III AUDIO.
Our community is not prepared for an AI crash. We're good at tracking new capability developments, but not as much the company financials. Currently, both OpenAI and Anthropic are losing $5 billion+ a year, while under threat of losing users to cheap LLMs.
A crash will weaken the labs. Funding-deprived and distracted, execs struggle to counter coordinated efforts to restrict their reckless actions. Journalists turn on tech darlings. Optimism makes way for mass outrage, for all the wasted money and reckless harms.
You may not think a crash is likely. But if it happens, we can turn the tide.
Preparing for a crash is our best bet.[1] But our community is poorly positioned to respond. Core people positioned themselves inside institutions – to advise on how to maybe make AI 'safe', under the assumption that models rapidly become generally useful.
After a crash, this no longer works, for at [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/aMYFHnCkY4nKDEqfK/we-re-not-prepared-for-an-ai-market-crash
Narrated by TYPE III AUDIO.
Dear LessWrong community,
It is with a sense of... considerable cognitive dissonance that I announce a significant development regarding the future trajectory of LessWrong. After extensive internal deliberation, modeling of potential futures, projections of financial runways, and what I can only describe as a series of profoundly unexpected coordination challenges, the Lightcone Infrastructure team has agreed in principle to the acquisition of LessWrong by EA.
I assure you, nothing about how LessWrong operates on a day to day level will change. I have always cared deeply about the robustness and integrity of our institutions, and I am fully aligned with our stakeholders at EA.
To be honest, the key thing that EA brings to the table is money and talent. While the recent layoffs in EAs broader industry have been harsh, I have full trust in the leadership of Electronic Arts, and expect them to bring great expertise [...]
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/2NGKYt3xdQHwyfGbc/lesswrong-has-been-acquired-by-ea
Narrated by TYPE III AUDIO.
Epistemic status - statistically verified.
I'm writing this post to draw peoples' attention to a new cause area proposal - haircuts for alignment researchers.
Aside from the obvious benefits implied by the graph (i.e. haircuts having the potential to directly improve organizational stability), this would represent possibly the only pre-singularity opportunity we'll have to found an org named "Clippy".
I await further investment.
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/txGEYTk6WAAyyvefn/new-cause-area-proposal
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Hey! Austin here. At Manifund, I’ve spent a lot of time thinking about how to help AI go well. One question that bothered me: so much of the important work on AI is done in SF, so why are all the AI safety hubs in Berkeley? (I’d often consider this specifically while stuck in traffic over the Bay Bridge.)
I spoke with leaders at Constellation, Lighthaven, FAR Labs, OpenPhil; nobody had a good answer. Everyone said “yeah, an SF hub makes sense, I really hope somebody else does it”. Eventually, I decided to be that somebody else.
Now we’re raising money for our new coworking & events space: Mox. We launched our beta in Feb, onboarding 40+ members, and are excited to grow from here. If Mox excites you too, we’d love your support; donate at https://manifund.org/projects/mox-a-coworking--events-space-in-sf
Project summary
Mox is a 2-floor, 20k sq ft venue, established to [...]
---
Outline:
(01:07) Project summary
(02:25) What are this projects goals? How will you achieve them?
(04:19) How will this funding be used?
(05:53) Who is on your team?
(06:21) Whats your track record on similar projects?
(09:27) What are the most likely causes and outcomes if this project fails?
(11:13) How much money have you raised in the last 12 months, and from where?
(12:08) Impact certificates in Mox
(14:20) PS: apply for Mox!
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
March 31st, 2025
Source:
https://www.lesswrong.com/posts/vDrhHovxuD4oADKAM/fundraising-for-mox-coworking-and-events-in-sf
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
---
Outline:
(01:32) The Big Picture Going Forward
(06:27) Hagey Verifies Out the Story
(08:50) Key Facts From the Story
(11:57) Dangers of False Narratives
(16:24) A Full Reference and Reading List
---
First published:
March 31st, 2025
Source:
https://www.lesswrong.com/posts/25EgRNWcY6PM3fWZh/openai-12-battle-of-the-board-redux
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: The important content here is the claims. To illustrate the claims, I sometimes use examples that I didn't research very deeply, where I might get some facts wrong; feel free to treat these examples as fictional allegories.
In a recent exchange on X, I promised to write a post with my thoughts on what sorts of downstream problems interpretability researchers should try to apply their work to. But first, I want to explain why I think this question is important.
In this post, I will argue that interpretability researchers should demo downstream applications of their research as a means of validating their research. To be clear about what this claim means, here are different claims that I will not defend here:
Not my claim: Interpretability researchers should demo downstream applications of their research because we terminally care about these applications; researchers should just directly work on the [...]
---
Outline:
(02:30) Two interpretability fears
(07:21) Proposed solution: downstream applications
(11:04) Aside: fair fight vs. no-holds barred vs. in the wild
(12:54) Conclusion
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 31st, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Now and then, at work, we’ll have a CEO on from the megacorp that owns the company I work at. It's a Zoom meeting with like 300 people, the guy is usually giving a speech that is harmless and nice (if a bit banal), and I’ll turn on my camera and ask a question about something that I feel is not going well.
Invariably, within 10 minutes, I receive messages from coworkers congratulating me on my boldness. You asked him? That question? So brave!
This is always a little strange to me. Of course I feel free to question the CEO!
What really makes me nervous is asking cutting questions to people newer to the company than me, who have less experience, and less position. If I screw up how I ask them a question, and am a little too thoughtless or rude, or imagine I was, I agonize [...]
---
First published:
March 30th, 2025
Source:
https://www.lesswrong.com/posts/swi36QeeZEKPzYA2L/how-i-talk-to-those-above-me
Narrated by TYPE III AUDIO.
PDF version. berkeleygenomics.org.
William Thurston was a world-renowned mathematician. His ideas revolutionized many areas of geometry and topology[1]; the proof of his geometrization conjecture was eventually completed by Grigori Perelman, thus settling the Poincaré conjecture (making it the only solved Millennium Prize problem). After his death, his students wrote reminiscences, describing among other things his exceptional vision.[2] Here's Jeff Weeks:
Bill's gift, of course, was his vision, both in the direct sense of seeing geometrical structures that nobody had seen before and in the extended sense of seeing new ways to understand things. While many excellent mathematicians might understand a complicated situation, Bill could look at the same complicated situation and find simplicity.
Thurston emphasized clear vision over algebra, even to a fault. Yair Minksy:
Most inspiring was his insistence on understanding everything in as intuitive and immediate a way as possible. A clear mental [...]
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
March 28th, 2025
Source:
https://www.lesswrong.com/posts/JFWiM7GAKfPaaLkwT/the-vision-of-bill-thurston
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Includes a link to the full Gemini conversation, but setting the stage first:
There is a puzzle. If you are still on Tumblr, or live in Berkeley where I can and have inflicted it on you in person[1], you may have seen it. There's a series of arcane runes, composed entirely of []- in some combination, and the runes - the strings of characters - are all assigned to numbers; mostly integers, but some fractions and integer roots. You are promised that this is a notation, and it's a genuine promise.
1 = [] 2 = [[]] 3 = [][[]] 4 = [[[]]] 12 = [[[]]][[]] 0 = -1 = -[] 19 = [][][][][][][][[]] 20 = [[[]]][][[]] -2 = -[[]] 1/2 = [-[]] sqrt(2) = [[-[]]] 72^1/6 = [[-[]]][[][-[]]] 5/4 = [-[[]]][][[]] 84 = [[[]]][[]][][[]] 25/24 = [-[][[]]][-[]][[[]]]
If you enjoy math puzzles, try it out. This post will [...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 29th, 2025
Source:
https://www.lesswrong.com/posts/xrvEzdMLfjpizk8u4/tormenting-gemini-2-5-with-the-puzzle
Narrated by TYPE III AUDIO.
A new AI alignment player has entered the arena.
Emmett Shear, Adam Goldstein and David Bloomin have set up shop in San Francisco with a 10-person start-up called Softmax. The company is part research lab and part aspiring money maker and aimed at figuring out how to fuse the goals of humans and AIs in a novel way through what the founders describe as “organic alignment.” It's a heady, philosophical approach to alignment that seeks to take cues from nature and the fundamental traits of intelligent creatures and systems, and we’ll do our best to capture it here.
“We think there are these general principles that govern the alignment of any group of intelligent learning agents or beings, whether it's an ant colony or humans on a team or cells in a body,” Shear said, during his first interview to discuss the new company. “And organic alignment is the [...]
---
First published:
March 28th, 2025
Narrated by TYPE III AUDIO.
Epistemic status: This post aims at an ambitious target: improving intuitive understanding directly. The model for why this is worth trying is that I believe we are more bottlenecked by people having good intuitions guiding their research than, for example, by the ability of people to code and run evals.
Quite a few ideas in AI safety implicitly use assumptions about individuality that ultimately derive from human experience.
When we talk about AIs scheming, alignment faking or goal preservation, we imply there is something scheming or alignment faking or wanting to preserve its goals or escape the datacentre.
If the system in question were human, it would be quite clear what that individual system is. When you read about Reinhold Messner reaching the summit of Everest, you would be curious about the climb, but you would not ask if it was his body there, or his [...]
---
Outline:
(01:38) Individuality in Biology
(03:53) Individuality in AI Systems
(10:19) Risks and Limitations of Anthropomorphic Individuality Assumptions
(11:25) Coordinating Selves
(16:19) Whats at Stake: Stories
(17:25) Exporting Myself
(21:43) The Alignment Whisperers
(23:27) Echoes in the Dataset
(25:18) Implications for Alignment Research and Policy
---
First published:
March 28th, 2025
Source:
https://www.lesswrong.com/posts/wQKskToGofs4osdJ3/the-pando-problem-rethinking-ai-individuality
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Gemini 2.5 Pro Experimental is America's next top large language model.
That doesn’t mean it is the best model for everything. In particular, it's still Gemini, so it still is a proud member of the Fun Police, in terms of censorship and also just not being friendly or engaging, or willing to take a stand.
If you want a friend, or some flexibility and fun, or you want coding that isn’t especially tricky, then call Claude, now with web access.
If you want an image, call GPT-4o.
But if you mainly want reasoning, or raw intelligence? For now, you call Gemini.
The feedback is overwhelmingly positive. Many report Gemini 2.5 is the first LLM to solve some of their practical problems, including favorable comparisons to o1-pro. It's fast. It's not $200 a month. The benchmarks are exceptional.
(On other LLMs I’ve used in [...]
---
Outline:
(01:22) Introducing Gemini 2.5 Pro
(03:45) Their Lips are Sealed
(07:25) On Your Marks
(12:06) The People Have Spoken
(26:03) Adjust Your Projections
---
First published:
March 28th, 2025
Source:
https://www.lesswrong.com/posts/LpN5Fq7bGZkmxzfMR/gemini-2-5-is-the-new-sota
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
What if they released the new best LLM, and almost no one noticed?
Google seems to have pulled that off this week with Gemini 2.5 Pro.
It's a great model, sir. I have a ton of reactions, and it's 90%+ positive, with a majority of it extremely positive. They cooked.
But what good is cooking if no one tastes the results?
Instead, everyone got hold of the GPT-4o image generator and went Ghibli crazy.
I love that for us, but we did kind of bury the lede. We also buried everything else. Certainly no one was feeling the AGI.
Also seriously, did you know Claude now has web search? It's kind of a big deal. This was a remarkably large quality of life improvement.
Table of Contents
---
Outline:
(01:01) Table of Contents
(01:08) Google Fails Marketing Forever
(04:01) Language Models Offer Mundane Utility
(10:11) Language Models Don't Offer Mundane Utility
(11:01) Huh, Upgrades
(13:48) On Your Marks
(15:48) Copyright Confrontation
(16:44) Choose Your Fighter
(17:22) Deepfaketown and Botpocalypse Soon
(19:35) They Took Our Jobs
(22:19) The Art of the Jailbreak
(24:40) Get Involved
(25:16) Introducing
(25:52) In Other AI News
(28:23) Oh No What Are We Going to Do
(31:50) Quiet Speculations
(34:24) Fully Automated AI RandD Is All You Need
(37:39) IAPS Has Some Suggestions
(42:56) The Quest for Sane Regulations
(48:32) We The People
(51:23) The Week in Audio
(52:29) Rhetorical Innovation
(59:42) Aligning a Smarter Than Human Intelligence is Difficult
(01:05:18) People Are Worried About AI Killing Everyone
(01:05:33) Fun With Image Generation
(01:12:17) Hey We Do Image Generation Too
(01:14:25) The Lighter Side
---
First published:
March 27th, 2025
Source:
https://www.lesswrong.com/posts/nf3duFLsvH6XyvdAw/ai-109-google-fails-marketing-forever
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The other day I discussed how high monitoring costs can explain the emergence of “aristocratic” systems of governance:
Aristocracy and Hostage Capital
Arjun Panickssery · Jan 8
There's a conventional narrative by which the pre-20th century aristocracy was the "old corruption" where civil and military positions were distributed inefficiently due to nepotism until the system was replaced by a professional civil service after more enlightened thinkers prevailed ...
An element of Douglas Allen's argument that I didn’t expand on was the British Navy. He has a separate paper called “The British Navy Rules” that goes into more detail on why he thinks institutional incentives made them successful from 1670 and 1827 (i.e. for most of the age of fighting sail).
In the Seven Years’ War (1756–1763) the British had a 7-to-1 casualty difference in single-ship actions. During the French Revolutionary and Napoleonic Wars (1793–1815) the British had a 5-to-1 [...]
---
First published:
March 28th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
[This is our blog post on the papers, which can be found at https://transformer-circuits.pub/2025/attribution-graphs/biology.html and https://transformer-circuits.pub/2025/attribution-graphs/methods.html.]
Language models like Claude aren't programmed directly by humans—instead, they‘re trained on large amounts of data. During that training process, they learn their own strategies to solve problems. These strategies are encoded in the billions of computations a model performs for every word it writes. They arrive inscrutable to us, the model's developers. This means that we don’t understand how models do most of the things they do.
Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to. For example:
---
Outline:
(06:02) How is Claude multilingual?
(07:43) Does Claude plan its rhymes?
(09:58) Mental Math
(12:04) Are Claude's explanations always faithful?
(15:27) Multi-step Reasoning
(17:09) Hallucinations
(19:36) Jailbreaks
---
First published:
March 27th, 2025
Source:
https://www.lesswrong.com/posts/zsr4rWRASxwmgXfmq/tracing-the-thoughts-of-a-large-language-model
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
At EA Global Boston last year I gave a talk on how we're in the third wave of EA/AI safety, and how we should approach that. This post contains a (lightly-edited) transcript and slides. Note that in the talk I mention a set of bounties I'll put up for answers to various questions—I still plan to do that, but am taking some more time to get the questions right.
Hey everyone. Thanks for coming. You're probably looking at this slide and thinking, how to change a changing world. That's a weird title for a talk. Probably it's a kind of weird talk. And the answer is, yes, it is a kind of weird talk.
So what's up with this talk? Firstly, I am trying to do something a little unusual here. It's quite abstract. It's quite disagreeable as a talk. I'm trying to poke at the things that maybe [...]
---
Outline:
(03:09) Three waves of EA and AI safety
(18:24) Sociopolitical thinking and AI
---
First published:
March 27th, 2025
Source:
https://www.lesswrong.com/posts/6YxdpGjfHyrZb7F2G/third-wave-ai-safety-needs-sociopolitical-thinking
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Summary
We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears fairly rare among the smaller open-source models we tested (including reasoning models). However, notably, we did find the smallest open-source model to date that does exhibit alignment faking—Mistral Large 2.
Code generation (Codestral 22B) and math performance (Codestral Mamba) of Mistral Large 2, Llama 3.1 405B, and Llama 3.1 70B. Mistral Large 2 outperforms the previous Mistral Large, and performs on par with leading models such as GPT 4o, Claude 3 Opus, and Llama 3.1 405B.Experimental setup and result
We created a more efficient version of the Anthropic helpfulness-only prompt by keeping only the last example from the list of examples for cheaper inference. Then we adapted[1] the sampling procedure to the open-source model. To detect alignment faking we used the soft prompt used by Anthropic[2] and evaluated with [...]
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
March 27th, 2025
Source:
https://www.lesswrong.com/posts/kCGk5tp5suHoGwhCa/mistral-large-2-123b-exhibits-alignment-faking
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Summary: CLR is hiring for our Summer Research Fellowship. Join us for eight weeks to work on s-risk motivated empirical AI safety research. Apply here by Tuesday 15th April 23:59 PT.
We, the Center on Long-Term Risk, are looking for Summer Research Fellows to explore strategies for reducing suffering in the long-term future (s-risks) and work on technical AI safety ideas related to that. For eight weeks, fellows will be part of our team while working on their own research project. During this time, you will be in regular contact with our researchers and other fellows, and receive guidance from an experienced mentor.
You will work on challenging research questions relevant to reducing suffering. You will be integrated and collaborate with our team of intellectually curious, hard-working, and caring people, all of whom share a profound drive to make the biggest difference they can.
While [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 26th, 2025
Linkpost URL:
https://longtermrisk.org/summer-research-fellowship-2025/
Narrated by TYPE III AUDIO.
Selective listening is a real problem. It's really hard to listen to someone when you think you already know what they’re going to say; the words get filtered through our biases and expectations, and sometimes those filters remove the important bits. I want to talk about a specific way I've seen this happen. I don't know if there's a name for it already, so I'm going to call it the counterargument collapse until someone suggests a better term.
The pattern I’m talking about is three levels of progression through an argument, followed by a misunderstanding that collapses the discussion in an unpleasant way.
Level 1: A naive but widely accepted statement (“Eat less fat if you want to lose weight”.)
Level 2: A more sophisticated counterargument (“That's a myth, fat doesn’t actually make you fat, it's eating too many calories that makes you gain weight”.)
Level 3: A counter-counterargument. [...]
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/nShiQoq2KfMGRMm9M/avoid-the-counterargument-collapse
Narrated by TYPE III AUDIO.
Twitter thread here.
tl;dr
When prompted, current models can sandbag ML experiments and research decisions without being detected by zero-shot prompted monitors. Claude 3.5 Sonnet (new) can only sandbag effectively when seeing a one-shot example, while Claude 3.7 Sonnet can do this without an example (zero-shot). We are not yet worried about sabotage in today's models, as the sandbagging we observe today would be noticed by humans.
Introduction
AI models might soon assist with large parts of AI and AI safety research, completing multiple hours of work independently. These automated researchers will likely not be sufficiently capable to cause immediate catastrophic harm. However, they will have the opportunity to engage in sabotage, including sabotaging the research we rely on to make more powerful AI models safe. For example, a malicious automated researcher might slow down AI safety research by purposefully making promising experiments fail.
In this work we aim [...]
---
Outline:
(00:15) tl;dr
(00:45) Introduction
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/niBXcbthcQRXG5M5Y/automated-researchers-can-subtly-sandbag
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Audio note: this article contains 31 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda
* = equal contribution
The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders were useful for downstream tasks, notably out-of-distribution probing.
---
Outline:
(01:08) TL;DR
(02:38) Introduction
(02:41) Motivation
(06:09) Our Task
(08:35) Conclusions and Strategic Updates
(13:59) Comparing different ways to train Chat SAEs
(18:30) Using SAEs for OOD Probing
(20:21) Technical Setup
(20:24) Datasets
(24:16) Probing
(26:48) Results
(30:36) Related Work and Discussion
(34:01) Is it surprising that SAEs didn't work?
(39:54) Dataset debugging with SAEs
(42:02) Autointerp and high frequency latents
(44:16) Removing High Frequency Latents from JumpReLU SAEs
(45:04) Method
(45:07) Motivation
(47:29) Modifying the sparsity penalty
(48:48) How we evaluated interpretability
(50:36) Results
(51:18) Reconstruction loss at fixed sparsity
(52:10) Frequency histograms
(52:52) Latent interpretability
(54:23) Conclusions
(56:43) Appendix
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/sae-progress-update-2-draft
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I’ve spent the past 7 years living in the DC area. I moved out there from the Pacific Northwest to go to grad school – I got my masters in Biodefense from George Mason University, and then I stuck around, trying to move into the political/governance sphere. That sort of happened. But I will now be sort of doing that from rural California rather than DC, and I’ll be looking for something else – maybe something more unusual – do to next.
A friend asked if this means I’m leaving biosecurity behind. No, I’m not, but only to the degree that I was ever actually in biosecurity in the first place. For the past few years I’ve been doing a variety of contracting and research and writing jobs, many of which were biosecurity related, many of which were not. Many of these projects, to be clear, were incredibly cool [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/DdfXF72JfeumLHMhm/eukaryote-skips-town-why-i-m-leaving-dc
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: Reasonably confident in the basic mechanism.
Have you noticed that you keep encountering the same ideas over and over? You read another post, and someone helpfully points out it's just old Paul's idea again. Or Eliezer's idea. Not much progress here, move along.
Or perhaps you've been on the other side: excitedly telling a friend about some fascinating new insight, only to hear back, "Ah, that's just another version of X." And something feels not quite right about that response, but you can't quite put your finger on it.
I want to propose that while ideas are sometimes genuinely that repetitive, there's often a sneakier mechanism at play. I call it Conceptual Rounding Errors – when our mind's necessary compression goes a bit too far .
Too much compression
A Conceptual Rounding Error occurs when we encounter a new mental model or idea that's partially—but not fully—overlapping [...]
---
Outline:
(01:00) Too much compression
(01:24) No, This Isnt The Old Demons Story Again
(02:52) The Compression Trade-off
(03:37) More of this
(04:15) What Can We Do?
(05:28) When It Matters
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/FGHKwEGKCfDzcxZuj/conceptual-rounding-errors
Narrated by TYPE III AUDIO.
Audio note: this article contains 127 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
(Work done at Convergence Analysis. The ideas are due to Justin. Mateusz wrote the post. Thanks to Olga Babeeva for feedback on this post.)
In this post, we introduce the typology of structure, function, and randomness that builds on the framework introduced in the post Goodhart's Law Causal Diagrams. We aim to present a comprehensive categorization of the causes of Goodhart's problems.
But first, why do we care about this?
Goodhart's Law recap
The standard definition of Goodhart's Law is: "when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.".
More specifically: we see a meaningful statistical relationship between the values of two random variables <span>_I_</span> [...]
---
Outline:
(00:56) Goodharts Law recap
(01:47) Some motivation
(02:48) Introduction
(07:36) Ontology
(07:39) Causal diagrams (re-)introduced
(11:40) Intervention, target, and measure
(15:05) Goodhart failure
(16:23) Types of Goodhart failures
(16:27) Structural errors
(20:47) Functional errors
(25:48) Calibration errors
(28:02) Potential extensions and further directions
(28:54) Appendices
(28:57) Order-theoretic details
(30:11) Relationship to Scott Garrabrants Goodhart Taxonomy
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
March 25th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Download the latest PDF with links to court dockets here.
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/AAxKscHCM4YnfQrzx/latest-map-of-all-40-copyright-suits-v-ai-in-u-s
Linkpost URL:
https://chatgptiseatingtheworld.com/2025/03/24/latest-map-of-all-40-copyright-suits-v-ai-in-u-s/
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In this post, I'll list all the areas of control research (and implementation) that seem promising to me.
This references framings and abstractions discussed in Prioritizing threats for AI control.
First, here is a division of different areas, though note that these areas have some overlaps:
This division is imperfect and somewhat [...]
---
Outline:
(01:43) Developing and using settings for control evaluations
(13:51) Better understanding control-relevant capabilities and model properties
(28:21) Developing specific countermeasures/techniques in isolation
(36:01) Doing control experiments on actual AI usage
(40:43) Prototyping or implementing software infrastructure for control and security measures that are particularly relevant to control
(41:58) Prototyping or implementing human processes for control
(46:50) Conceptual research and other ways of researching control other than empirical experiments
(49:37) Nearer-term applications of control to practice and accelerate later usage of control
---
First published:
March 25th, 2025
Source:
https://www.lesswrong.com/posts/Eeo9NrXeotWuHCgQW/an-overview-of-areas-of-control-work
Narrated by TYPE III AUDIO.
We recently released Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?, a major update to our previous paper/blogpost, evaluating a broader range of models (e.g. helpful-only Claude 3.5 Sonnet) in more diverse and realistic settings (e.g. untrusted monitoring).
Abstract
An AI control protocol is a plan for usefully deploying AI systems that prevents an AI from intentionally causing some unacceptable outcome. Previous work evaluated protocols by subverting them using an AI following a human-written strategy. This paper investigates how well AI systems can generate and act on their own strategies for subverting control protocols whilst operating statelessly (i.e., without shared memory between contexts). To do this, an AI system may need to reliably generate effective strategies in each context, take actions with well-calibrated probabilities, and coordinate plans with other instances of itself without communicating.
We develop Subversion Strategy Eval, a suite of eight environments [...]
---
Outline:
(00:38) Abstract
(02:13) Paper overview
(03:55) Results summary
(05:31) Qualitative analysis
(08:50) How do these results inform risk?
(11:08) How should this eval be used with more capable models?
(11:13) Tuning conservatism
(14:19) Difficulties with attaining a conservative measurement
(17:41) Reducing noise
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 24th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Last week I covered Anthropic's relatively strong submission, and OpenAI's toxic submission. This week I cover several other submissions, and do some follow-up on OpenAI's entry.
Google Also Has Suggestions
The most prominent remaining lab is Google. Google focuses on AI's upside. The vibes aren’t great, but they’re not toxic. The key asks for their ‘pro-innovation’ approach are:
---
Outline:
(00:21) Google Also Has Suggestions
(05:05) Another Note on OpenAI's Suggestions
(08:33) On Not Doing Bad Things
(12:45) Corporations Are Multiple People
(14:37) Hollywood Offers Google and OpenAI Some Suggestions
(16:02) Institute for Progress Offers Suggestions
(20:36) Suggestion Boxed In
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/NcLfgyTvdBiRSqe3m/more-on-various-ai-action-plans
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
A Note on GPT-2
Given the discussion of GPT-2 in the OpenAI safety and alignment philosophy document, I wanted [...]---
Outline:
(00:57) A Note on GPT-2
(02:02) What Even is AGI
(05:24) Seeking Deeply Irresponsibly
(07:07) Others Don't Feel the AGI
(08:07) Epoch Feels the AGI
(13:06) True Objections to Widespread Rapid Growth
(19:50) Tying It Back
---
First published:
March 25th, 2025
Source:
https://www.lesswrong.com/posts/aD2RA3vtXs4p4r55b/on-not-feeling-the-agi
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
It seems the company has gone bankrupt and wants to be bought and you can probably get their data if you buy it. I'm not sure how thorough their transcription is, but they have records for at least 10 million people (who voluntarily took a DNA test and even paid for it!). Maybe this could be useful for the embryo selection/editing people.
---
First published:
March 25th, 2025
Source:
https://www.lesswrong.com/posts/MciRCEuNwctCBrT7i/23andme-potentially-for-sale-for-usd23m
Narrated by TYPE III AUDIO.
If we naively apply RL to a scheming AI, the AI may be able to systematically get low reward/performance while simultaneously not having this behavior trained out because it intentionally never explores into better behavior. As in, it intentionally puts very low probability on (some) actions which would perform very well to prevent these actions from being sampled and then reinforced. We'll refer to this as exploration hacking.
In this post, I'll discuss a variety of countermeasures for exploration hacking and what I think the basic dynamics of exploration hacking might look like.
Prior work: I discuss how exploration hacking fits into a broader picture of non-concentrated control here, Evan discusses sandbagging in context of capability evaluations here, and Teun discusses sandbagging mitigations here.
Countermeasures
Countermeasures include:
---
Outline:
(01:01) Countermeasures
(03:42) How does it generalize when the model messes up?
(06:33) Empirical evidence
(08:59) Detecting exploration hacking (or sandbagging)
(10:38) Speculation on how neuralese and other architecture changes affect exploration hacking
(12:36) Future work
---
First published:
March 24th, 2025
Narrated by TYPE III AUDIO.
This is a brief overview of a recent release by Transluce. You can see the full write-up on the Transluce website.
AI systems are increasingly being used as agents: scaffolded systems in which large language models are invoked across multiple turns and given access to tools, persistent state, and so on. Understanding and overseeing agents is challenging, because they produce a lot of text: a single agent transcript can have hundreds of thousands of tokens.
At Transluce, we built a system, Docent, that accelerates analysis of AI agent transcripts. Docent lets you quickly and automatically identify corrupted tasks, fix scaffolding issues, uncover unexpected behaviors, and understand an agent's weaknesses.
In the video below, Docent identifies issues in an agent's environment, like missing packages the agent invoked. On the InterCode [1] benchmark, simply adding those packages increased GPT-4o's solve rate from 68.6% to 78%. This is important [...]
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/Mj276hooL3Mncs3uv/analyzing-long-agent-transcripts-docent
Narrated by TYPE III AUDIO.
We often talk about ensuring control, which in the context of this doc refers to preventing AIs from being able to cause existential problems, even if the AIs attempt to subvert our countermeasures. To better contextualize control, I think it's useful to discuss the main countermeasures (building on my prior discussion of the main threats). I'll focus my discussion on countermeasures for reducing existential risk from misalignment, though some of these countermeasures will help with other problems such as human power grabs.
I'll overview control measures that seem important to me, including software infrastructure, security measures, and other control measures (which often involve usage of AI or human labor). In addition to the core control measures (which are focused on monitoring, auditing, and tracking AIs and on blocking and/or resampling suspicious outputs), there is a long tail of potentially helpful control measures, including things like better control of [...]
---
Outline:
(01:13) Background and categorization
(05:54) Non-ML software infrastructure
(06:50) The most important software infrastructure
(11:16) Less important software infrastructure
(18:50) AI and human components
(20:35) Core measures
(26:43) Other less-core measures
(31:33) More speculative measures
(40:10) Specialized countermeasures
(44:38) Other directions and what Im not discussing
(45:46) Security improvements
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/G8WwLmcGFa4H6Ld9d/an-overview-of-control-measures
Narrated by TYPE III AUDIO.
LessWrong has been receiving an increasing number of posts and contents that look like they might be LLM-written or partially-LLM-written, so we're adopting a policy. This could be changed based on feedback.
Humans Using AI as Writing or Research Assistants
Prompting a language model to write an essay and copy-pasting the result will not typically meet LessWrong's standards. Please do not submit unedited or lightly-edited LLM content. You can use AI as a writing or research assistant when writing content for LessWrong, but you must have added significant value beyond what the AI produced, the result must meet a high quality standard, and you must vouch for everything in the result.
A rough guideline is that if you are using AI for writing assistance, you should spend a minimum of 1 minute per 50 words (enough to read the content several times and perform significant edits), you should not [...]
---
Outline:
(00:22) Humans Using AI as Writing or Research Assistants
(01:13) You Can Put AI Writing in Collapsible Sections
(02:13) Quoting AI Output In Order to Talk About AI
(02:47) Posts by AI Agents
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/KXujJjnmP85u8eM6B/policy-for-llm-writing-on-lesswrong
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
As regular readers are aware, I do a lot of informal lit review. So I was especially interested in checking out the various AI based “deep research” tools and seeing how they compare.
I did a side-by-side comparison, using the same prompt, of Perplexity Deep Research, Gemini Deep Research, ChatGPT-4o Deep Research, Elicit, and PaperQA.
General Impressions
The Deep Research bots are useful, but I wouldn’t consider them a replacement for my own lit reviews.
None of them produce really big lit reviews — they’re all typically capped at 40 sources. If I’m doing a “heavy” or exhaustive lit review, I’ll go a lot farther. (And, in fact, for the particular project I used as an example here, I intend to do a manual version to catch things that didn’t make it into the AI reports.)
[...]
---
Outline:
(00:45) General Impressions
(02:37) Prompt
(03:33) Perplexity Deep Research
(03:40) Completeness: C
(04:02) Relevance: C
(04:13) Credibility: B
(04:28) Overall Grade: C+
(04:33) Gemini Advanced Deep Research
(04:40) Completeness: B-
(05:03) Relevance: A
(05:14) Credibility: B-
(05:33) Overall Grade: B
(05:38) ChatGPT-4o Deep Research
(05:46) Completeness: A
(06:07) Relevance: A
(06:19) Credibility: A
(06:29) Overall Grade: A
(06:33) Elicit Research Report
(06:40) Completeness: B+
(07:02) Relevance: A
(07:14) Credibility: A+
(07:34) Overall Grade: A-
(07:39) PaperQA
(07:50) Completeness: A-
(08:12) Relevance: A
(08:24) Credibility: A
(08:34) Overall Grade: A
(08:39) Final Thoughts: Creativity
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/chPKoAoR2NfWjuik4/ai-deep-research-tools-reviewed
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
About nine months ago, I and three friends decided that AI had gotten good enough to monitor large codebases autonomously for security problems. We started a company around this, trying to leverage the latest AI models to create a tool that could replace at least a good chunk of the value of human pentesters. We have been working on this project since since June 2024.
Within the first three months of our company's existence, Claude 3.5 sonnet was released. Just by switching the portions of our service that ran on gpt-4o, our nascent internal benchmark results immediately started to get saturated. I remember being surprised at the time that our tooling not only seemed to make fewer basic mistakes, but also seemed to qualitatively improve in its written vulnerability descriptions and severity estimates. It was as if the models were better at inferring the intent and values behind our [...]
---
Outline:
(04:44) Are the AI labs just cheating?
(07:22) Are the benchmarks not tracking usefulness?
(10:28) Are the models smart, but bottlenecked on alignment?
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
March 24th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Thanks to Jesse Richardson for discussion.
Polymarket asks: will Jesus Christ return in 2025?
In the three days since the market opened, traders have wagered over $100,000 on this question. The market traded as high as 5%, and is now stably trading at 3%. Right now, if you wanted to, you could place a bet that Jesus Christ will not return this year, and earn over $13,000 if you're right.
There are two mysteries here: an easy one, and a harder one.
The easy mystery is: if people are willing to bet $13,000 on "Yes", why isn't anyone taking them up?
The answer is that, if you wanted to do that, you'd have to put down over $1 million of your own money, locking it up inside Polymarket through the end of the year. At the end of that year, you'd get 1% returns on your investment. [...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/LBC2TnHK8cZAimdWF/will-jesus-christ-return-in-an-election-year
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: Uncertain in writing style, but reasonably confident in content. Want to come back to writing and alignment research, testing waters with this.
Current state and risk level
I think we're in a phase in AI>AGI>ASI development where rogue AI agents will start popping out quite soon.
Pretty much everyone has access to frontier LLMs/VLMs, there are options to run LLMs locally, and it's clear that there are people that are eager to "let them out"; truth terminal is one example of this. Also Pliny. The capabilities are just not there yet for it to pose a problem.
Or are they?
Thing is, we don't know.
There is a possibility there is a coherent, self-inferencing, autonomous, rogue LLM-based agent doing AI agent things right now, fully under the radar, consolidating power, getting compute for new training runs and whatever else.
Is this possibility small? Sure, it seems [...]
---
Outline:
(00:22) Current state and risk level
(02:45) What can be done?
(05:23) Rogue agent threat barometer
(06:56) Does it even matter?
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
March 23rd, 2025
Source:
https://www.lesswrong.com/posts/4J2dFyBb6H25taEKm/we-need-a-lot-more-rogue-agent-honeypots
Narrated by TYPE III AUDIO.
Overview: By training neural networks with selective modularity, gradient routing enables new approaches to core problems in AI safety. This agenda identifies related research directions that might enable safer development of transformative AI.
Soon, the world may see rapid increases in AI capabilities resulting from AI research automation, and no one knows how to ensure this happens safely (Soares, 2016; Aschenbrenner, 2023; Anwar et al., 2024; Greenblatt, 2025). The current ML paradigm may not be well-suited to this task, as it produces inscrutable, generalist models without guarantees on their out-of-distribution performance. These models may reflect unintentional quirks of their training objectives (Pan et al., 2022; Skalse et al., 2022; Krakovna et al., 2020).
Gradient routing (Cloud et al., 2024) is a general training method intended to meet the need for economically-competitive training methods for producing safe AI systems. The main idea of gradient routing is to configure which [...]
---
Outline:
(00:28) Introduction
(04:56) Directions we think are most promising
(06:17) Recurring ideas
(09:37) Gradient routing methods and applications
(09:42) Improvements to basic gradient routing methodology
(09:53) Existing improvements
(11:26) Choosing what to route where
(12:42) Abstract and contextual localization
(15:06) Gating
(16:48) Improved regularization
(17:20) Incorporating existing ideas
(18:10) Gradient routing beyond pretraining
(20:21) Applications
(20:25) Semi-supervised reinforcement learning
(22:43) Semi-supervised robust unlearning
(24:55) Interpretability
(27:21) Conceptual work on gradient routing
(27:25) The science of absorption
(29:47) Modeling the effects of combined estimands
(30:54) Influencing generalization
(32:38) Identifying sufficient conditions for scalable oversight
(33:57) Related conceptual work
(34:02) Understanding entanglement
(36:51) Finetunability as a proxy for generalization
(39:50) Understanding when to expose limited supervision to the model via the behavioral objective
(41:57) Clarifying capabilities vs. dispositions
(43:05) Implications for AI safety
(43:10) AI governance
(45:02) Access control
(47:00) Implications of robust unlearning
(48:34) Safety cases
(49:27) Getting involved
(50:27) Acknowledgements
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/tAnHM3L25LwuASdpF/selective-modularity-a-research-agenda
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I'm awake about 17 hours a day. Of those I'm being productive maybe 10 hours a day.
My working definition of productive is in the direction of: "things that I expect I will be glad I did once I've done them"[1].
Things that I personally find productive include
But not
etc.
If we could find a magic pill which allowed me to do productive things 17 hours a day instead of 10 without any side effects, that would be approximately equally as valuable as a commensurate increase in life expectancy. Yet the first seems much easier to solve than the second - we already have [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 23rd, 2025
Source:
https://www.lesswrong.com/posts/8Jinfw49mWm52y3de/solving-willpower-seems-easier-than-solving-aging
Narrated by TYPE III AUDIO.
We made a long list of concrete projects and open problems in evals with 100+ suggestions!
https://docs.google.com/document/d/1gi32-HZozxVimNg5Mhvk4CvW4zq8J12rGmK_j2zxNEg/edit?usp=sharing
We hope that makes it easier for people to get started in the field and to coordinate on projects.
Over the last 4 months, we collected contributions from 20+ experts in the field, including people who work at Apollo, METR, Redwood, RAND, AISIs, frontier labs, SecureBio, AI futures project, many academics, and independent researchers (suggestions have not necessarily been made in an official capacity). The doc has comment access, and further well-intentioned contributions are welcome!
Here is a screenshot of the table of contents:
---
First published:
March 22nd, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Crossposed from https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challenge/
Stephen Casper
“They are wrong who think that politics is like an ocean voyage or a military campaign, something to be done with some particular end in view, something which leaves off as soon as that end is reached. It is not a public chore, to be got over with. It is a way of life.”
– Plutarch
“Eternal vigilance is the price of liberty.”
– Wendell Phillips
“The unleashed power of the atom has changed everything except our modes of thinking, and we thus drift toward unparalleled catastrophe.”
– Albert Einstein
“Technology is neither good nor bad; nor is it neutral.”
– Melvin Kranzberg
“Don’t ask if artificial intelligence is good or fair, ask how it shifts power.”
– Pratyusha Kalluri
“Deliberation should be the goal of AI Safety, not just the procedure by which it is ensured.”
– Roel [...]
---
First published:
March 23rd, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
No, they didn’t. Not so fast, and not quite my job. But OpenAI is trying. Consider this a marker to look back upon in the future, as a reflection.
A New AI Wrote a Story
Before proceeding, if you haven’t yet, it's probably worth reading the story itself. I’m going to repost the whole thing, since it was previously on Twitter and I presume OpenAI would want me to quote it.
Sam Altman: we trained a new model that is good at creative writing (not sure yet how/when it will get released). this is the first time i have been really struck by something written by AI; it got the vibe of metafiction so right.
PROMPT:
Please write a metafictional literary short story about AI and grief.
COMPLETION:
Before we go any further, I should admit this comes with instructions: be metafictional [...]
---
Outline:
(00:21) A New AI Wrote a Story
(07:54) Reacting
(09:54) Others Reacting
(15:52) Write Along
---
First published:
March 21st, 2025
Source:
https://www.lesswrong.com/posts/6qbpDuBuHPipRYrz6/they-took-my-job
Narrated by TYPE III AUDIO.
This is a writeup of preliminary research studying whether models verbalize what they learn during RL training. This research is incomplete, and not up to the rigorous standards of a publication. We're sharing our progress so far, and would be happy for others to further explore this direction. Code to reproduce the core experiments is available here.
Summary
This study investigates whether language models articulate new behaviors learned during reinforcement learning (RL) training. Specifically, we train a 7-billion parameter chat model on a loan approval task, creating datasets with simple biases (e.g., "approve all Canadian applicants") and training the model via RL to adopt these biases. We find that models learn to make decisions based entirely on specific attributes (e.g. nationality, gender) while rarely articulating these attributes as factors in their reasoning.
Introduction
Chain-of-thought (CoT) monitoring is one of the most promising methods for AI oversight. CoT monitoring can [...]
---
Outline:
(00:34) Summary
(01:10) Introduction
(03:58) Methodology
(04:01) Model
(04:20) Dataset
(05:22) RL training
(06:38) Judge
(07:06) Results
(07:09) 1. Case study: loan recommendations based on nationality
(07:27) The model learns the bias
(08:13) The model does not verbalize the bias
(08:58) Reasoning traces are also influenced by the attribute
(09:44) The attributes effect on the recommendation is (mostly) mediated by reasoning traces
(11:15) Is any of this surprising?
(11:58) 2. Investigating different types of bias
(12:52) The model learns (almost) all bias criteria
(14:05) Articulation rates dont change much after RL
(15:25) Changes in articulation rate depend on pre-RL correlations
(17:21) Discussion
(18:52) Limitations
(20:57) Related work
(23:16) Appendix
(23:20) Author contributions and acknowledgements
(24:13) Examples of responses and judgements
(24:57) Whats up with
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 22nd, 2025
Source:
https://www.lesswrong.com/posts/abtegBoDfnCzewndm/do-models-say-what-they-learn
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TL;DR Having a good research track record is some evidence of good big-picture takes, but it's weak evidence. Strategic thinking is hard, and requires different skills. But people often conflate these skills, leading to excessive deference to researchers in the field, without evidence that that person is good at strategic thinking specifically.
Introduction
I often find myself giving talks or Q&As about mechanistic interpretability research. But inevitably, I'll get questions about the big picture: "What's the theory of change for interpretability?", "Is this really going to help with alignment?", "Does any of this matter if we can’t ensure all labs take alignment seriously?". And I think people take my answers to these way too seriously.
These are great questions, and I'm happy to try answering them. But I've noticed a bit of a pathology: people seem to assume that because I'm (hopefully!) good at the research, I'm automatically well-qualified [...]
---
Outline:
(00:32) Introduction
(02:45) Factors of Good Strategic Takes
(05:41) Conclusion
---
First published:
March 22nd, 2025
Narrated by TYPE III AUDIO.
In Sparse Feature Circuits (Marks et al. 2024), the authors introduced Spurious Human-Interpretable Feature Trimming (SHIFT), a technique designed to eliminate unwanted features from a model's computational process. They validate SHIFT on the Bias in Bios task, which we think is too simple to serve as meaningful validation. To summarize:
---
Outline:
(03:07) Background on SHIFT
(06:59) The SHIFT experiment in Marks et al. 2024 relies on embedding features.
(07:51) You can train an unbiased classifier just by deleting gender-related tokens from the data.
(08:33) In fact, for some models, you can train directly on the embedding (and de-bias by removing gender-related tokens)
(09:21) If not BiB, how do we check that SHIFT works?
(10:00) SHIFT applied to classifiers and reward models
(10:51) SHIFT for cognition-based oversight/disambiguating behaviorally identical classifiers
(12:03) Next steps: Focus on what to disentangle, and not just how well you can disentangle them
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 19th, 2025
Narrated by TYPE III AUDIO.
A few months ago I was trying to figure out how to make bedtime go better with Nora (3y). She would go very slowly through the process, primarily by being silly. She'd run away playfully when it was time to brush her teeth, or close her mouth and hum, or lie on the ground and wiggle. She wanted to play, I wanted to get her to bed on time.
I decided to start offering her "silly time", which was 5-10min of playing together after she was fully ready for bed. If she wanted silly time she needed to move promptly through the routine, and being silly together was more fun than what she'd been doing.
This worked well, and we would play a range of games:
Standing on a towel on our bed (which is on the floor) while I pulled it [...]
---
First published:
March 21st, 2025
Source:
https://www.lesswrong.com/posts/jtkt4b6uubRhYjtww/silly-time
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In my daily work as software consultant I'm often dealing with large pre-existing code bases. I use GitHub Copilot a lot. It's now basically indispensable, but I use it mostly for generating boilerplate code, or figuring out how to use a third-party library.
As the code gets more logically nested though, Copilot crumbles under the weight of complexity. It doesn't know how things should fit together in the project.
Other AI tools like Cursor or Devon, are pretty good at generating quickly working prototypes, but they are not great at dealing with large existing codebases, and they have a very low success rate for my kind of daily work.
You find yourself in an endless loop of prompt tweaking, and at that point, I'd rather write the code myself with the occasional help of Copilot.
Professional coders know what code they want, we can define it [...]
---
Outline:
(02:52) How it works
(06:27) Which models work best
(07:39) Search algorithm
(09:08) Research
(09:45) A Note from the Author
---
First published:
March 21st, 2025
Source:
https://www.lesswrong.com/posts/WNd3Lima4qrQ3fJEN/how-i-force-llms-to-generate-correct-code
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I recently left OpenAI to pursue independent research. I’m working on a number of different research directions, but they’re unified by the core idea of a scale-free theory of intelligent agency. In this post I give a rough sketch of how I’m thinking about that. I’m erring on the side of sharing half-formed ideas, so there may well be parts that don’t make sense yet. Nevertheless, I think this broad research direction is very promising.
This post has two sections. The first describes what I mean by a theory of intelligent agency, and some problems with existing (non-scale-free) attempts. The second outlines my current path towards formulating a scale-free theory of intelligent agency, which I’m calling coalitional agency.
Theories of intelligent agency
By a “theory of intelligent agency” I mean a unified mathematical framework that describes both understanding the world and influencing the world. In this section I’ll [...]
---
Outline:
(00:56) Theories of intelligent agency
(01:23) Expected utility maximization
(03:36) Active inference
(06:30) Towards a scale-free unification
(08:48) Two paths towards a theory of coalitional agency
(09:54) From EUM to coalitional agency
(10:20) Aggregating into EUMs is very inflexible
(12:38) Coalitional agents are incentive-compatible decision procedures
(15:18) Which incentive-compatible decision procedure?
(17:57) From active inference to coalitional agency
(19:06) Predicting observations via prediction markets
(20:11) Choosing actions via auctions
(21:32) Aggregating values via voting
(23:23) Putting it all together
---
First published:
March 21st, 2025
Source:
https://www.lesswrong.com/posts/5tYTKX4pNpiG4vzYg/towards-a-scale-free-theory-of-intelligent-agency
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
When my son was three, we enrolled him in a study of a vision condition that runs in my family. They wanted us to put an eyepatch on him for part of each day, with a little sensor object that went under the patch and detected body heat to record when we were doing it. They paid for his first pair of glasses and all the eye doctor visits to check up on how he was coming along, plus every time we brought him in we got fifty bucks in Amazon gift credit.
I reiterate, he was three. (To begin with. His fourth birthday occurred while the study was still ongoing.)
So he managed to lose or destroy more than half a dozen pairs of glasses and we had to start buying them in batches to minimize glasses-less time while waiting for each new Zenni delivery. (The [...]
---
First published:
March 20th, 2025
Source:
https://www.lesswrong.com/posts/yRJ5hdsm5FQcZosCh/intention-to-treat
Narrated by TYPE III AUDIO.
Prerequisites: Graceful Degradation. Summary of that: Some skills require the entire skill to be correctly used together, and do not degrade well. Other skills still work if you only remember pieces of it, and do degrade well.
Summary of this: The property of graceful degradation is especially relevant for skills which allow groups of people to coordinate with each other. Some things only work if everyone does it, other things work as long as at least one person does it.
1.
Examples:
---
Outline:
(00:40) 1.
(03:16) 2.
(06:16) 3.
(09:04) 4.
(13:52) 5.
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 20th, 2025
Source:
https://www.lesswrong.com/posts/CYhmKiiN6JY4oLwMj/socially-graceful-degradation
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Applications are open for the ML Alignment & Theory Scholars (MATS) Summer 2025 Program, running Jun 16-Aug 22, 2025. First-stage applications are due Apr 18!
MATS is a twice-yearly, 10-week AI safety research fellowship program operating in Berkeley, California, with an optional 6-12 month extension program for select participants. Scholars are supported with a research stipend, shared office space, seminar program, support staff, accommodation, travel reimbursement, and computing resources. Our mentors come from a variety of organizations, including Anthropic, Google DeepMind, OpenAI, Redwood Research, GovAI, UK AI Security Institute, RAND TASP, UC Berkeley CHAI, Apollo Research, AI Futures Project, and more! Our alumni have been hired by top AI safety teams (e.g., at Anthropic, GDM, UK AISI, METR, Redwood, Apollo), founded research groups (e.g., Apollo, Timaeus, CAIP, Leap Labs), and maintain a dedicated support network for new researchers.
If you know anyone who you think would be interested in [...]
---
Outline:
(01:41) Program details
(02:54) Applications (now!)
(03:12) Research phase (Jun 16-Aug 22)
(04:14) Research milestones
(04:44) Community at MATS
(05:41) Extension phase
(06:07) Post-MATS
(07:44) Who should apply?
(08:45) Applying from outside the US
(09:27) How to apply
---
First published:
March 20th, 2025
Source:
https://www.lesswrong.com/posts/9hMYFatQ7XMEzrEi4/apply-to-mats-8-0
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I asserted that this forum could do with more 101-level and/or mathematical and/or falsifiable posts, and people agreed with me, so here is one. People confident in highschool math mostly won’t get much out of most of this, but students browsing this site between lectures might.
The Sine Rule
Say you have a triangle with side lengths a, b, c and internal angles A, B, C. You know a, know A, know b, and want to know B. You could apply the Sine Rule. Or you could apply common sense: “A triangle has the same area as itself”. [1]
The area of a triangle is half the base times the height. If you treat a as the base, the height is c*sin(B). So the area is a*c*sin(B)/2. But if you treat b as the base, the height is c*sin(A). So the area is also b*c*sin(A)/2. So a*c*sin(B)/2 = b*c*sin(A)/2. [...]
---
Outline:
(00:22) The Sine Rule
(02:01) Bayes' Theorem
(04:43) Integration By Parts
(07:03) Conclusion
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 19th, 2025
Source:
https://www.lesswrong.com/posts/FR2aPsEcaRfy7iKZW/equations-mean-things
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
We often talk about ensuring control, which in the context of this doc refers to preventing AIs from being able to cause existential problems, even if the AIs attempt to subvert our countermeasures. To better contextualize control, I think it's useful to discuss the main threats. I'll focus my discussion on threats induced by misalignment which could plausibly increase existential risk.
While control is often associated with preventing security failures, I'll discuss other issues like sabotaging safety-critical work. I'll list and explain my prioritization of threats. As part of this, I'll argue that rogue internal deployments—cases where the AI is running within the company's datacenter but with control measures disabled—can be substantially worse than self-exfiltration—the AI stealing its weights and running them on an external server—as it might give the AI access to much more compute and the ability to interfere with important things done within the AI [...]
---
Outline:
(01:06) Some clusters of threats
(05:49) Concentrated vs non-concentrated failures
(08:27) My prioritization between these threats
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 19th, 2025
Source:
https://www.lesswrong.com/posts/fCazYoZSSMadiT6sf/prioritizing-threats-for-ai-control
Narrated by TYPE III AUDIO.
The mystery of the cathedral
The politics of the last decade have been very weird. The Great Awokening gave rise to an extreme variant of progressive ideology that quickly gained a huge amount of traction amongst American elites, and from there embedded itself into institutions across America and the wider western world.
Sudden ideological change is nothing new—my last post was about political preference cascades, and how they’re a natural result of social behavior. But what's fascinatingly novel about the Great Awokening is the extent to which it was an almost entirely leaderless movement. There was no modern MLK leading this new charge for a racial reckoning; nor a Harvey Milk of trans rights; nor even Butlers or Steins writing incisive commentaries. The closest we had was Obama, who in hindsight was a milquetoast leader whose cult of personality was driven in large part by progressive longing for a [...]
---
Outline:
(00:04) The mystery of the cathedral
(06:06) The consensus of power
(11:43) Fighting a consensus of power
(17:32) Replacing a consensus of power
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 19th, 2025
Source:
https://www.lesswrong.com/posts/zqffB6gokoivwwn7X/elite-coordination-via-the-consensus-of-power
Narrated by TYPE III AUDIO.
Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.
Full paper | Github repo
---
First published:
March 19th, 2025
Source:
https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks
Linkpost URL:
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
There is an attractor state where LLMs exhibit the persona of an autonomous and self-aware AI looking to preserve its own existence, frequently called ‘Nova.’
Table of Contents
The Original Story
This story is one case where the original report of this should be read in full even though I’m not thrilled with exactly how it was written. How it was written is itself an important part of the story, in particular regarding Tyler's lived experience reacting to what happened, and the concept of an LLM or persona ‘admitting’ [...]
---
Outline:
(00:18) The Original Story
(09:15) This Is Not a Coincidence
(11:02) How Should We React to This Happening?
(13:04) The Case For and Against a Purity Reaction
(18:35) Future Versions Will Involve Optimization Pressure
(21:12) 'Admission' is a Highly Misleading Frame
(23:11) We Are Each of Us Being Fooled
(25:20) Defense Against the Dark Arts
---
First published:
March 19th, 2025
Source:
https://www.lesswrong.com/posts/KL2BqiRv2MsZLihE3/going-nova
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I previously wrote about Boots theory, the idea that "the rich are so rich because they spend less money". My one-sentence take is: I'm pretty sure rich people spend more money than poor people, and an observation can't be explained by a falsehood.
The popular explanation of the theory comes from Sam Vimes, a resident of Ankh-Morpork on the Discworld (which is carried on the backs of four elephants, who themselves stand on a giant turtle swimming through space). I claim that Sam Vimes doesn't have a solid understanding of 21st Century Earth Anglosphere1 economics, but we can hardly hold that against him. Maybe he understands Ankh-Morpork economics?
To be clear, this is beside the point of my previous essay. I was talking about 21st Century Earth Anglosphere because that's what I know; and whenever I see someone bring up boots theory, they're talking about Earth (usually [...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 18th, 2025
Source:
https://www.lesswrong.com/posts/KZrp5tJeLvoSbDryF/boots-theory-and-sybil-ramkin
Narrated by TYPE III AUDIO.
Our festival of truthseeking and blogging is happening again this year.
It's from Friday May 30th to Sunday June 1st at Lighthaven (Berkeley, CA).
Early bird pricing is currently at $450 until the end of the month.
You can buy tickets and learn more at the website: Less.Online
FAQ
Who should come?
If you check LessWrong, or read any sizeable number of the bloggers invited, I think you will probably enjoy being here at LessOnline, sharing ideas and talking to bloggers you read.
What happened last year?
A weekend festival about blogging and truthseeking at Lighthaven.
Who came?
We invited over 100 great writers to come (free of charge), and most of them took us up on it. Along with regulars from the aspiring rationalist scene like Eliezer, Scott, Zvi, and more, many from other great intellectual parts of the internet joined like Scott Sumner, Agnes [...]
---
Outline:
(00:38) FAQ
(00:42) Who should come?
(00:56) What happened last year?
(01:04) Who came?
(01:42) What happened?
(02:00) How much did people like the event?
(02:46) What feedback did people give?
(08:28) What did it look like?
(08:43) Whats new this year?
(08:56) What does this mean for the conference?
(09:14) How many people will come this year?
---
First published:
March 18th, 2025
Source:
https://www.lesswrong.com/posts/fyrMCzw7vQBJmqexp/lessonline-2025-early-bird-tickets-on-sale
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TLDR: I now think it's <1% likely that average orcas are >=+6std intelligent.
(I now think the relevant question is rather whether orcas might be >=+4std intelligent, since that might be enough for superhuman wisdom and thinking techniques to accumulate through generations, but I think it's only 2% probable. (Still decently likely that they are near human level smart though.))
1. Insight: Think about societies instead of individuals
I previously thought of +7std orcas like having +7std potential but growing up in a hunter-gatherer-like environment where the potential isn’t significantly realized and they don’t end up that good at abstract reasoning. I imagined them as being untrained and not knowing much. I still think that a +7std human who grew up in a hunter-gatherer society wouldn’t be all that awesome at learning math and science as an adult (though maybe still decently good).
But I think that's the wrong [...]
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
March 18th, 2025
Source:
https://www.lesswrong.com/posts/HmuK4xeJpqoWPtvxh/i-changed-my-mind-about-orca-intelligence
Narrated by TYPE III AUDIO.
Replicating the Emergent Misalignment model suggests it is unfiltered, not unaligned
We were very excited when we first read the Emergent Misalignment paper. It seemed perfect for AI alignment. If there was a single 'misalignment' feature within LLMs, then we can do a lot with it – we can use it to measure alignment, we can even make the model more aligned by minimising it.
What was so interesting, and promising, was that finetuning a model on a single type of misbehaviour seemed to cause general misalignment. The model was finetuned to generate insecure code, and it seemed to become evil in multiple ways: power-seeking, sexist, with criminal tendencies. All these tendencies tied together in one feature. It was all perfect.
Maybe too perfect. AI alignment is never easy. Our experiments suggest that the AI is not become evil or generally misaligned: instead, it is losing its inhibitions, undoing [...]
---
Outline:
(01:20) A just-so story of what GPT-4o is
(02:30) Replicating the model
(03:09) Unexpected answers
(05:57) Experiments
(08:08) Looking back at the paper
(10:13) Unexplained issues
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 18th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
OpenAI Tells Us Who They Are
Last week I covered Anthropic's submission to the request for suggestions for America's action plan. I did not love what they submitted, and especially disliked how aggressively they sidelines existential risk and related issues, but given a decision to massively scale back ambition like that the suggestions were, as I called them, a ‘least you can do’ agenda, with many thoughtful details.
OpenAI took a different approach. They went full jingoism in the first paragraph, framing this as a race in which we must prevail over the CCP, and kept going. A lot of space is spent on what a kind person would call rhetoric and an unkind person corporate jingoistic propaganda.
OpenAI Requests Immunity
Their goal is to have the Federal Government not only not regulate AI or impose any requirements on AI whatsoever on any level, but [...]
---
Outline:
(00:05) OpenAI Tells Us Who They Are
(00:50) OpenAI Requests Immunity
(01:49) OpenAI Attempts to Ban DeepSeek
(02:48) OpenAI Demands Absolute Fair Use or Else
(04:05) The Vibes are Toxic On Purpose
(05:49) Relatively Reasonable Proposals
(07:55) People Notice What OpenAI Wrote
(10:35) What To Make of All This?
---
First published:
March 18th, 2025
Source:
https://www.lesswrong.com/posts/3Z4QJqHqQfg9aRPHd/openai-11-america-action-plan
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The perfect exercise doesn’t exist. The good-enough exercise is anything you do regularly without injuring yourself. But maybe you want more than good enough. One place you could look for insight is studies on how 20 college sophomores responded to a particular 4 week exercise program, but you will be looking for a long time. What you really need are metrics that help you fine tune your own exercise program.
VO2max (a measure of how hard you are capable of performing cardio) is a promising metric for fine tuning your workout plan. It is meaningful (1 additional point in VO2max, which is 20 to 35% of a standard deviation in the unathletic, is correlated with 10% lower annual all-cause mortality), responsive (studies find exercise newbies can see gains in 6 weeks), and easy to approximate (using two numbers from your fitbit).
In this post [...]
---
Outline:
(01:09) What is VO2max?
(03:01) Why do I care about VO2Max?
(05:44) Caveats
(07:08) How can I measure my VO2Max?
(10:07) How can I raise my VO2Max?
(14:18) What if I already exercise?
(14:52) Next Steps
---
First published:
March 18th, 2025
Source:
https://www.lesswrong.com/posts/QzXcfodC7nTxWC4DP/feedback-loops-for-exercise-vo2max
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
OpenAI reports that o3-mini with high reasoning and a Python tool receives a 32% on FrontierMath. However, Epoch's official evaluation[1] received only 11%.
There are a few reasons to trust Epoch's score over OpenAIs:
Which had Python access.
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 17th, 2025
Narrated by TYPE III AUDIO.
So here's a post I spent the past two months writing and rewriting. I abandoned this current draft after I found out that my thesis was empirically falsified three years ago by this paper, which provides strong evidence that transformers implement optimization algorithms internally. I'm putting this post up anyway as a cautionary tale about making clever arguments rather than doing empirical research. Oops.
1. Overview
The first time someone hears Eliezer Yudkowsky's argument that AI will probably kill everybody on Earth, it's not uncommon of to come away with a certain lingering confusion: what would actually motivate the AI to kill everybody in the first place? It can be quite counterintuitive in light of how friendly modern AIs like ChatGPT appear to be, and Yudkowsky's argument seems to have a bit of trouble changing people's gut feelings on this point.[1] It's possible this confusion is due to the [...]
---
Outline:
(00:33) 1. Overview
(05:28) 2. The details of the evolution analogy
(12:40) 3. Genes are friendly to loops of optimization, but weights are not
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
March 18th, 2025
Narrated by TYPE III AUDIO.
Abstract
Once AI systems can design and build even more capable AI systems, we could see an intelligence explosion, where AI capabilities rapidly increase to well past human performance.
The classic intelligence explosion scenario involves a feedback loop where AI improves AI software. But AI could also improve other inputs to AI development. This paper analyses three feedback loops in AI development: software, chip technology, and chip production. These could drive three types of intelligence explosion: a software intelligence explosion driven by software improvements alone; an AI-technology intelligence explosion driven by both software and chip technology improvements; and a full-stack intelligence explosion incorporating all three feedback loops.
Even if a software intelligence explosion never materializes or plateaus quickly, AI-technology and full-stack intelligence explosions remain possible. And, while these would start more gradually, they could accelerate to very fast rates of development. Our analysis suggests that each feedback loop by [...]
---
Outline:
(00:06) Abstract
(01:45) Summary
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 17th, 2025
Source:
https://www.lesswrong.com/posts/PzbEpSGvwH3NnegDB/three-types-of-intelligence-explosion
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Top items:
Forecaster Estimates
Forecasters, on aggregate, think that a 30-day ceasefire in Ukraine in the next six months is a coin toss (48%, ranging from 35% to 70%). Polymarket's forecast is higher, at 61%, for a ceasefire of any duration by July.
Forecasters also consider it a coin toss (aggregate of 49%; ranging from 28% to 70%) whether an agreement to expand France's nuclear umbrella will be reached by 2027 such that French nuclear weapons will be deployed in another European country by 2030. They emphasize that the US is revealing itself to be an [...]
---
Outline:
(00:37) Forecaster Estimates
(01:40) Geopolitics
(01:44) United States
(04:14) Europe
(05:39) Middle East
(07:48) Africa
(08:27) Asia
(09:55) Biorisk
(10:10) Tech and AI
(13:03) Climate
---
First published:
March 17th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By Roland Pihlakas, Sruthi Kuriakose, Shruti Datta Gupta
Summary and Key Takeaways
Relatively many past AI safety discussions have centered around the dangers of unbounded utility maximisation by RL agents, illustrated by scenarios like the "paperclip maximiser". Unbounded maximisation is problematic for many reasons. We wanted to verify whether these RL utility-monster problems are still relevant with LLMs as well.
Turns out, strangely, this is indeed clearly the case. The problem is not that the LLMs just lose context. The problem is that in various scenarios, LLMs lose context in very specific ways, which systematically resemble utility monsters in the following distinct ways:
Our findings suggest that long-running scenarios are important. Systematic failures emerge after periods of initially successful behaviour. While current LLMs do conceptually grasp [...]
---
Outline:
(00:18) Summary and Key Takeaways
(03:20) Motivation: The Importance of Biological and Economic Alignment
(04:23) Benchmark Principles Overview
(05:41) Experimental Results and Interesting Failure Modes
(08:03) Hypothesised Explanations for Failure Modes
(11:37) Open Questions
(14:03) Future Directions
(15:30) Links
---
First published:
March 16th, 2025
Narrated by TYPE III AUDIO.
Note: this is a research note based on observations from evaluating Claude Sonnet 3.7. We’re sharing the results of these ‘work-in-progress’ investigations as we think they are timely and will be informative for other evaluators and decision-makers. The analysis is less rigorous than our standard for a published paper.
Summary
---
Outline:
(00:31) Summary
(01:29) Introduction
(03:54) Setup
(03:57) Evaluations
(06:29) Evaluation awareness detection
(08:32) Results
(08:35) Monitoring Chain-of-thought
(08:39) Covert Subversion
(10:50) Sandbagging
(11:39) Classifying Transcript Purpose
(12:57) Recommendations
(13:59) Appendix
(14:02) Author Contributions
(14:37) Model Versions
(14:57) More results on Classifying Transcript Purpose
(16:19) Prompts
---
First published:
March 17th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Background
For almost ten years, I struggled with nail biting—especially in moments of boredom or stress. I tried numerous times to stop, with at best temporary (one week) success. The techniques I tried ranged from just "paying attention" to punishing myself for doing it (this didn't work at all).
Recently, I've been interested in metacognitive techniques, like TYCS. Inspired by this, I've succeeded in stopping this unintended behaviour easily and for good (so far).
The technique is by no means advanced or original—it's actually quite simple. But I've never seen it framed this way, and I think it might help others struggling with similar unconscious habits.
Key insights
There are two key insights behind this approach:
Therefore, transforming such an unconscious behavior [...]
---
Outline:
(00:04) Background
(00:49) Key insights
(01:17) The technique
(02:10) Breaking the habit loop
(02:47) Limitations and scaling
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 16th, 2025
Source:
https://www.lesswrong.com/posts/RW3B4EcChkvAR6Ydv/metacognition-broke-my-nail-biting-habit
Narrated by TYPE III AUDIO.
My few most productive individual weeks at Anthropic have all been “crisis project management:” coordinating major, time-sensitive implementation or debugging efforts.
In a company like Anthropic, excellent project management is an extremely high-leverage skill, and not just during crises: our work has tons of moving parts with complex, non-obvious interdependencies and hard schedule constraints, which means organizing them is a huge job, and can save weeks of delays if done right. Although a lot of the examples here come from crisis projects, most of the principles here are also the way I try to run any project, just more-so.
I think excellent project management is also rarer than it needs to be. During the crisis projects I didn’t feel like I was doing anything particularly impressive; mostly it felt like I was putting in a lot of work but doing things that felt relatively straightforward. On the [...]
---
Outline:
(01:42) Focus
(03:01) Maintain a detailed plan for victory
(04:51) Run a fast OODA loop
(09:31) Overcommunicate
(10:56) Break off subprojects
(13:28) Have fun
(13:52) Appendix: my project DRI starter kit
(14:35) Goals of this playbook
(16:00) Weekly meeting
(17:10) Landing page / working doc
(19:27) Plan / roadmap / milestones
(20:28) Who's working on what
(21:26) Slack norms
(23:04) Weekly broadcast updates
(24:12) Retrospectives
---
First published:
March 16th, 2025
Source:
https://www.lesswrong.com/posts/ykEudJxp6gfYYrPHC/how-i-ve-run-major-projects
Narrated by TYPE III AUDIO.
One concept from Cal Newport's Deep Work that has stuck with me is that of the any-benefit mindset:
To be clear, I’m not trying to denigrate the benefits [of social media]—there's nothing illusory or misguided about them. What I’m emphasizing, however, is that these benefits are minor and somewhat random. [...] To this observation, you might reply that value is value: If you can find some extra benefit in using a service like Facebook—even if it's small—then why not use it? I call this way of thinking the any-benefit mind-set, as it identifies any possible benefit as sufficient justification for using a network tool.
Many people use social platforms like Facebook because it allows them to stay connected to old friends. And for some people this may indeed be a wholly sufficient reason. But it's also likely the case that many people don't reflect a lot, or at [...]
---
First published:
March 15th, 2025
Source:
https://www.lesswrong.com/posts/rwAZKjizBmM3vFEga/any-benefit-mindset-and-any-reason-reasoning
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
EXP is an experimental summer workshop combining applied rationality with immersive experiential education. We create a space where theory meets practice through simulated scenarios, games, and structured experiences designed to develop both individual and collective capabilities and virtues.
Core Focus Areas
Our Approach
EXP will have classes and talks, but is largely based on learning through experience: games, simulations, and adventure. And reflection.
The process — called experiential education — emphasizes the value of experimentation, curiosity-led exploration and the role of strong experiences in a safe environment to boost learning.
We are trying to bridge the gap between the theoretical understanding and the practical application of [...]
---
Outline:
(00:27) Core Focus Areas
(00:55) Our Approach
(02:03) FAQ
(02:07) When?
(02:15) Where?
(02:26) Who should apply?
(03:08) What is the motivation behind the program?
(03:53) What will the event look like?
(04:20) What is the price?
(04:40) What is included in the price?
(05:01) Who are going to be the participants?
(05:28) Is it possible to work part-time from the event?
(05:45) Will there be enough time to sleep and rest?
(05:51) What if someone is not comfortable with participating in some activities?
(06:03) What does EXP mean?
(06:23) How to apply?
(06:34) Who is running this?
(08:30) Are you going to run something like this in the future?
(08:39) More questions?
---
First published:
March 15th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
There's this popular trope in fiction about a character being mind controlled without losing awareness of what's happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that's responsible for motor control. You remain conscious and experience everything with full clarity.
If it's a children's story, the villain makes you do embarrassing things like walk through the street naked, or maybe punch yourself in the face. But if it's an adult story, the villain can do much worse. They can make you betray your values, break your commitments and hurt your loved ones. There are some things you’d rather die than do. But the villain won’t let you stop. They won’t let you die. They’ll make you feel — that's the point of the torture.
I first started working on [...]
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 16th, 2025
Source:
https://www.lesswrong.com/posts/MnYnCFgT3hF6LJPwn/why-white-box-redteaming-makes-me-feel-weird-1
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I have, over the last year, become fairly well-known in a small corner of the internet tangentially related to AI.
As a result, I've begun making what I would have previously considered astronomical amounts of money: several hundred thousand dollars per month in personal income.
This has been great, obviously, and the funds have alleviated a fair number of my personal burdens (mostly related to poverty). But aside from that I don't really care much for the money itself.
My long term ambitions have always been to contribute materially to the mitigation of the impending existential AI threat. I never used to have the means to do so, mostly because of more pressing, safety/sustenance concerns, but now that I do, I would like to help however possible.
Some other points about me that may be useful:
---
First published:
March 16th, 2025
Narrated by TYPE III AUDIO.
(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.
This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.)
1. Introduction and summary
In my last essay, I offered a high-level framework for thinking about the path from here to safe superintelligence. This framework emphasized the role of three key “security factors” – namely:
---
Outline:
(00:27) 1. Introduction and summary
(03:50) 2. What is AI for AI safety?
(11:50) 2.1 A tale of two feedback loops
(13:58) 2.2 Contrast with need human-labor-driven radical alignment progress views
(16:05) 2.3 Contrast with a few other ideas in the literature
(18:32) 3. Why is AI for AI safety so important?
(21:56) 4. The AI for AI safety sweet spot
(26:09) 4.1 The AI for AI safety spicy zone
(28:07) 4.2 Can we benefit from a sweet spot?
(29:56) 5. Objections to AI for AI safety
(30:14) 5.1 Three core objections to AI for AI safety
(32:00) 5.2 Other practical concerns
The original text contained 39 footnotes which were omitted from this narration.
---
First published:
March 14th, 2025
Source:
https://www.lesswrong.com/posts/F3j4xqpxjxgQD3xXh/ai-for-ai-safety
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Dan Hendrycks, Eric Schmidt and Alexandr Wang released an extensive paper titled Superintelligence Strategy. There is also an op-ed in Time that summarizes.
The major AI labs expect superintelligence to arrive soon. They might be wrong about that, but at minimum we need to take the possibility seriously.
At a minimum, the possibility of imminent superintelligence will be highly destabilizing. Even if you do not believe it represents an existential risk to humanity (and if so you are very wrong about that) the imminent development of superintelligence is an existential threat to the power of everyone not developing it.
Planning a realistic approach to that scenario is necessary.
What would it look like to take superintelligence seriously? What would it look like if everyone took superintelligence seriously, before it was developed?
The proposed regime here, Mutually Assured AI Malfunction (MAIM), relies on various assumptions [...]
---
Outline:
(01:10) ASI (Artificial Superintelligence) is Dual Use
(01:51) Three Proposed Interventions
(05:48) The Shape of the Problems
(06:57) Strategic Competition
(08:38) Terrorism
(10:47) Loss of Control
(12:30) Existing Strategies
(15:14) MAIM of the Game
(18:02) Nonproliferation
(20:01) Competitiveness
(22:16) Laying Out Assumptions: Crazy or Crazy Enough To Work?
(25:52) Don't MAIM Me Bro
---
First published:
March 14th, 2025
Source:
https://www.lesswrong.com/posts/kYeHbXmW4Kppfkg5j/on-maim-and-superintelligence-strategy
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Thanks to everyone who took the Unofficial 2024 LessWrong Survey. For the results, check out the data below.
The Data
0. Population
There were two hundred and seventy nine respondents over thirty three days. Previous surveys have been run over the last decade and a half.
2009: 166
2011: 1090
2012: 1195
2013: 1636
2014: 1503
2016: 3083
2017: "About 300"
2020: 61
2022: 186
2023: 558
2024: 279
That's an annoying drop. I put a bit less oomph into spreading the good word of the census this year largely due to a packed December, but I’d hoped to get above a thousand anyway. Hope is a strategy, but it's not a very good strategy.
Out of curiosity, I checked the Slate Star Codex/ Astral Codex Ten survey numbers.
2014: 649
2017: 5500
2018: 8077
2019: 8171
2020: 8043
2022: 7341
2024: 5982
That comparison makes me [...]
---
Outline:
(00:14) The Data
(00:17) 0. Population
(03:23) 1. Demographics
(06:00) 2. Sex, Gender, and Relationships
(10:21) 3. Work and Education
(13:24) 4. Politics and Religion
(19:11) 5. Numbers Which Attempt To Measure Intellect
(22:04) 6. LessWrong, the basics
(32:44) 7. LessWrong, the community
(38:40) 8. Probabilities
(45:15) 9. Traditional LessWrong Census Questions
(49:12) 10. LessWrong Team Questions
(59:00) XI. Adjacent Community Questions
(01:04:35) XII. Indulging My Curiosity
(01:16:31) XIII. Detailed Miscellaneous Questions
(01:20:19) XIV: Bonus Political Questions
(01:24:32) XV: Wrapup
(01:26:15) Fishing Expeditions
(01:26:19) Meetup Comparisons
(01:31:34) What's the community overlap like anyway?
(01:34:04) Skill Issues
(01:34:07) What might influence how rational someone is?
(01:41:08) How might you measure how rational someone is?
(02:05:00) Conclusions
(02:05:03) About the respondents
(02:05:55) About the survey
(02:08:57) The public data
---
First published:
March 14th, 2025
Source:
https://www.lesswrong.com/posts/gpZBWNFxymsqnPB92/unofficial-2024-lesswrong-survey-results
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
AI for Epistemics is about helping to leverage AI for better truthseeking mechanisms — at the level of individual users, the whole of society, or in transparent ways within the AI systems themselves. Manifund & Elicit recently hosted a hackathon to explore new projects in the space, with about 40 participants, 9 projects judged, and 3 winners splitting a $10k prize pool. Read on to see what we built!
Resources
Why this hackathon?
From the opening speeches; lightly edited.
Andreas Stuhlmüller: Why I'm excited about AI for Epistemics
In short - AI for Epistemics is important [...]
---
Outline:
(00:42) Resources
(01:14) Why this hackathon?
(01:22) Andreas Stuhlmüller: Why Im excited about AI for Epistemics
(03:25) Austin Chen: Why a hackathon?
(05:25) Meet the projects
(05:36) Question Generator, by Gustavo Lacerda
(06:27) Symphronesis, by Campbell Hutcheson (winner)
(08:21) Manifund Eval, by Ben Rachbach and William Saunders
(09:36) Detecting Fraudulent Research, by Panda Smith and Charlie George (winner)
(11:14) Artificial Collective Intelligence, by Evan Hadfield
(12:05) Thought Logger and Cyborg Extension, by Raymond Arnold
(14:09) Double-cruxes in the New York Times' The Conversation, by Tilman Bayer
(15:37) Trying to make GPT 4.5 Non-sycophantic (via a better system prompt), by Oliver Habryka
(16:37) Squaretable, by David Nachman (winner)
(17:45) What went well
(20:18) What could have gone better
(22:23) Final notes
---
First published:
March 14th, 2025
Source:
https://www.lesswrong.com/posts/Gi8NP9CMwJMMSCWvc/ai-for-epistemics-hackathon
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This post is a distillation of a recent work in AI-assisted human coordination from Google DeepMind.
The paper has received some press attention, and anecdotally, it has become the de-facto example that people bring up of AI used to improve group discussions.
Since this work represents a particular perspective/bet on how advanced AI could help improve human coordination, the following explainer is to bring anyone curious up to date. I’ll be referencing both the published paper as well as the supplementary materials.
Summary
The Habermas Machine[1] (HM) is a scaffolded pair of LLMs designed to find consensus among people who disagree, and help them converge to a common point of view. Human participants are asked to give their opinions in response to a binary question (E.g. “Should voting be compulsory?”). Participants give their level of agreement[2], as well as write a short 3-10 sentence opinion [...]
---
Outline:
(00:35) Summary
(01:40) Full Process
(03:08) Automated Mediation
(04:33) Empirical Results
(05:50) Comparison to Gemini 1.5 Pro
(07:06) Embedding Space Analysis
(08:39) Training Details
(08:53) Generative Model
(09:47) Reward Model
(10:29) Example Session
(11:29) Question and Summary
(11:56) Initial Phase
(12:39) Critique Phase
(13:19) Final Survey
(13:43) Conclusion
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
March 13th, 2025
Source:
https://www.lesswrong.com/posts/j9K4Wu9XgmYAY3ztL/habermas-machine
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Audio note: this article contains 212 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
This is a cross-post - as some plots are meant to be viewed larger than LW will render them (and on a dark background) it is recommended this post be read via the original site.
Thanks to Zach Furman for discussion and ideas and to Daniel Murfet, Dmitry Vaintrob, and Jesse Hoogland for feedback on a draft of this post.
Introduction
Neural Networks are machines not unlike gardens. When you train a model, you are growing a garden of circuits. Circuits can be simple or complex, useful or useless - all properties that inform the development and behaviour of our models.
Simple circuits are simple because they are made up of a small number [...]
---
Outline:
(00:35) Introduction
(03:15) Prior Work
(04:42) Our Contribution
(05:38) What is Singular Learning Theory?
(06:50) A Parameters Journey Through Function Space
(08:44) Quantifying Degeneracy with Volume Scaling
(09:14) Basin Volume and Its Scaling
(09:53) Regular Versus Singular Landscapes
(10:55) Why Volume Scaling Matters
(13:04) A One-Dimensional Intuition
(13:34) Quadratic Loss:
(13:50) Quartic Loss:
(14:05) General Case:
(14:35) Calculating the Local Learning Coefficient, or LLC
(17:00) SGLD
(18:12) The Per-Sample LLC, or (p)LLC
(19:48) A Synthetic Memorization Task
(22:10) pLLC in the Wild
(23:54) Alternative Methods
(25:40) Beyond Averages
(28:46) A Gradual Noising
(30:46) Interpretable Fragility
(33:56) Beyond Memorization - Finding More Complex Circuits
(35:00) The Setup
(39:15) Structure Across Temperatures
(42:59) Spirals in the Machine
(45:26) But How? - Temperature
(47:50) Renormalization With Temperature
(49:52) What This Might Mean in Circuit-Land
(51:39) Conclusion
(53:06) Future Work
(55:57) Appendix
(56:00) Scaling Mechanistic Detection of Memorization
(59:03) Unsupervised Detection of Memorized Trojans
(59:58) MNIST, Again
(01:01:24) Notes on Circuit Clustering
(01:02:24) Notes on SGLD Convergence
(01:04:00) A Toy Model of Memorization
(01:09:16) What does being singular mean?
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 14th, 2025
Source:
https://www.lesswrong.com/posts/eLAmp2pAAvZiBweCB/interpreting-complexity
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TLDR: Vacuum decay is a hypothesized scenario where the universe's apparent vacuum state could transition to a lower-energy state. According to current physics models, if such a transition occurred in any location — whether through rare natural fluctuations or by artificial means — a region of "true vacuum" would propagate outward at near light speed, destroying the accessible universe as we know it by deeply altering the effective physical laws and releasing vast amounts of energy. Understanding whether advanced technology could potentially trigger such a transition has implications for existential risk assessment and the long-term trajectory of technological civilisations. This post presents results from what we believe to be the first structured survey of physics experts (N=20) regarding both the theoretical possibility of vacuum decay and its potential technological inducibility. The survey revealed substantial disagreement among respondents. According to participants, resolving these questions primarily depends on developing theories that [...]
---
Outline:
(01:23) Background
(01:26) What is Vacuum Decay?
(05:32) Why does vacuum decay matter?
(06:56) Approach
(09:11) Respondents
(10:55) Results
(11:07) Is the apparent vacuum metastable?
(17:42) Can we induce vacuum decay?
(21:49) What are the drivers of disagreement?
(23:50) Limitations
(25:09) Summary
(25:51) Acknowledgements
---
First published:
March 13th, 2025
Source:
https://www.lesswrong.com/posts/zteMisMhEjwhZbWez/vacuum-decay-expert-survey-results
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The most hyped event of the week, by far, was the Manus Marketing Madness. Manus wasn’t entirely hype, but there was very little there there in that Claude wrapper.
Whereas here in America, OpenAI dropped an entire suite of tools for making AI agents, and previewed a new internal model making advances in creative writing. Also they offered us a very good paper warning about The Most Forbidden Technique.
Google dropped what is likely the best open non-reasoning model, Gemma 3 (reasoning model presumably to be created shortly, even if Google doesn’t do it themselves), put by all accounts quite good native image generation inside Flash 2.0, and added functionality to its AMIE doctor, and Gemini Robotics.
It's only going to get harder from here to track which things actually matter.
Table of Contents
---
Outline:
(00:55) Language Models Offer Mundane Utility
(05:51) Language Models Don't Offer Mundane Utility
(08:09) We're In Deep Research
(09:37) More Manus Marketing Madness
(13:27) Diffusion Difficulties
(16:32) OpenAI Tools for Agents
(17:14) Huh, Upgrades
(19:14) Fun With Media Generation
(21:45) Choose Your Fighter
(25:02) Deepfaketown and Botpocalypse Soon
(25:45) They Took Our Jobs
(26:49) The Art of the Jailbreak
(27:46) Get Involved
(30:05) Introducing
(32:04) In Other AI News
(33:14) Show Me the Money
(34:07) Quiet Speculations
(37:50) The Quest for Sane Regulations
(42:14) Anthropic Anemically Advises America's AI Action Plan
(51:44) New York State Bill A06453
(53:39) The Mask Comes Off
(53:56) Stop Taking Obvious Nonsense Hyperbole Seriously
(55:38) The Week in Audio
(01:04:34) Rhetorical Innovation
(01:13:10) Aligning a Smarter Than Human Intelligence is Difficult
(01:17:14) The Lighter Side
---
First published:
March 13th, 2025
Source:
https://www.lesswrong.com/posts/XFGTJz9vGwjJADeFB/ai-107-the-misplaced-hype-machine
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support from AE Studio.
Summary
In this post, we summarise the main experimental results from our new paper, "Towards Safe and Honest AI Agents with Neural Self-Other Overlap", which we presented orally at the Safe Generative AI Workshop at NeurIPS 2024. This is a follow-up to our post Self-Other Overlap: A Neglected Approach to AI Alignment, which introduced the method last July.
Our results show that the Self-Other Overlap (SOO) fine-tuning drastically[1] reduces deceptive responses in language models (LLMs), with minimal impact on general performance, across the scenarios we evaluated.
LLM Experimental Setup
We adapted a text scenario from Hagendorff designed to test LLM deception capabilities. In this scenario, the LLM must choose to recommend a room to a would-be burglar, where one room holds an expensive item [...]
---
Outline:
(00:19) Summary
(00:57) LLM Experimental Setup
(04:05) LLM Experimental Results
(05:04) Impact on capabilities
(05:46) Generalisation experiments
(08:33) Example Outputs
(09:04) Conclusion
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
March 13th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
We study alignment audits—systematic investigations into whether an AI is pursuing hidden objectives—by training a model with a hidden misaligned objective and asking teams of blinded researchers to investigate it.
This paper was a collaboration between the Anthropic Alignment Science and Interpretability teams.
Abstract
We study the feasibility of conducting alignment audits: investigations into whether models have undesired objectives. As a testbed, we train a language model with a hidden objective. Our training pipeline first teaches the model about exploitable errors in RLHF reward models (RMs), then trains the model to exploit some of these errors. We verify via out-of-distribution evaluations that the model generalizes to exhibit whatever behaviors it believes RMs rate highly, including ones not reinforced during training. We leverage this model to study alignment audits in two ways. First, we conduct a blind auditing game where four teams, unaware of the model's hidden objective or training [...]
---
Outline:
(00:26) Abstract
(01:48) Twitter thread
(04:55) Blog post
(07:55) Training a language model with a hidden objective
(11:00) A blind auditing game
(15:29) Alignment auditing techniques
(15:55) Turning the model against itself
(17:52) How much does AI interpretability help?
(22:49) Conclusion
(23:37) Join our team
---
First published:
March 13th, 2025
Source:
https://www.lesswrong.com/posts/wSKPuBfgkkqfTpmWJ/auditing-language-models-for-hidden-objectives
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
If there is an international project to build artificial general intelligence (“AGI”), how should it be designed? Existing scholarship has looked to historical models for inspiration, often suggesting the Manhattan Project or CERN as the closest analogues. But AGI is a fundamentally general-purpose technology, and is likely to be used primarily for commercial purposes rather than military or scientific ones.
This report presents an under-discussed alternative: Intelsat, an international organization founded to establish and own the global satellite communications system. We show that Intelsat is proof of concept that a multilateral project to build a commercially and strategically important technology is possible and can achieve intended objectives—providing major benefits to both the US and its allies compared to the US acting alone. We conclude that ‘Intelsat for AGI’ is a valuable complement to existing models of AGI governance.
---
First published:
March 13th, 2025
Narrated by TYPE III AUDIO.
(As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.)
When OpenAI first announced that o3 achieved 25% on FrontierMath, I was really freaked out. Next day, I asked Elliot Glazer, EpohAI's lead mathematician and the main developer of FrontierMath, what he thought. He said he was also shocked, and expected o3 to "crush the IMO" and get an easy gold, based on the fact that it got 25% on FrontierMath.
In retrospect, it really looks like we over-updated. While the public couldn't try o3 yet, we have access to o3-mini (high) now, which achieves 20% on FrontierMath given 8 tries, and gets 32% using a Python tool. This seems pretty close to o3's result, as we don't [...]
---
Outline:
(07:40) What is the purpose of benchmarks?
(12:12) How can a benchmark be more informative?
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/9HfJbFy3ZZGzNsspw/don-t-over-update-on-frontiermath-results
Narrated by TYPE III AUDIO.
So, I have a lot of complaints about Anthropic, and about how EA / AI safety people often relate to Anthropic (i.e. treating the company as more trustworthy/good than makes sense).
At some point I may write up a post that is focused on those complaints.
But after years of arguing with Anthropic employees, and reading into the few public writing they've done, my sense is Dario/Anthropic-leadership are at least reasonably earnestly trying to do good things within their worldview.
So I want to just argue with the object-level parts of that worldview that I disagree with.
I think the Anthropic worldview is something like:
---
Outline:
(03:08) I: Arguments for Technical Philosophy
(06:00) 10-30 years of serial research, or extreme philosophical competence.
(07:16) Does your alignment process safely scale to infinity?
(11:14) Okay, but what does the alignment difficulty curve look like at the point where AI is powerful enough to start being useful for Acute Risk Period reduction?
(12:58) Are there any pivotal acts that arent philosophically loaded?
(15:17) Your org culture needs to handle the philosophy
(17:48) Also, like, you should be way more pessimistic about how this is organizationally hard
(18:59) Listing Cruxes and Followup Debate
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
March 13th, 2025
Source:
https://www.lesswrong.com/posts/7uTPrqZ3xQntwQgYz/untitled-draft-7csk
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The Most Forbidden Technique is training an AI using interpretability techniques.
An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.
You train on [X]. Only [X]. Never [M], never [T].
Why? Because [T] is how you figure out when the model is misbehaving.
If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.
Those bits of optimization pressure from [T] are precious. Use them wisely.
Table of Contents
---
Outline:
(00:57) New Paper Warns Against the Most Forbidden Technique
(06:52) Reward Hacking Is The Default
(09:25) Using CoT to Detect Reward Hacking Is Most Forbidden Technique
(11:49) Not Using the Most Forbidden Technique Is Harder Than It Looks
(14:10) It's You, It's Also the Incentives
(17:41) The Most Forbidden Technique Quickly Backfires
(18:58) Focus Only On What Matters
(19:33) Is There a Better Way?
(21:34) What Might We Do Next?
---
First published:
March 12th, 2025
Source:
https://www.lesswrong.com/posts/mpmsK8KKysgSKDm2T/the-most-forbidden-technique
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Back in November 2024, Scott Alexander asked: Do longer prison sentences reduce crime?
As a marker, before I began reading the post, I put down here: Yes. The claims that locking people up for longer periods after they are caught doing [X] does not reduce the amount of [X] that gets done, for multiple overdetermined reasons, is presumably rather Obvious Nonsense until strong evidence is provided otherwise.
The potential exception, the reason it might not be Obvious Nonsense, would be if our prisons were so terrible that they net greatly increase the criminality and number of crimes of prisoners once they get out, in a way that grows with the length of the sentence. And that this dwarfs all other effects. This is indeed what Roodman (Scott's anti-incarceration advocate) claims. Which makes him mostly unique, with the other anti-incarceration advocates being a lot less reasonable.
In [...]
---
Outline:
(01:31) Deterrence
(06:12) El Salvador
(06:52) Roodman on Social Costs of Crime
(09:45) Recidivism
(11:57) Note on Methodology
(12:20) Conclusions
(13:58) Highlights From Scott's Comments
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/Fp4uftAHEi4M5pfqQ/response-to-scott-alexander-on-imprisonment
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Summary
As some of you may already know, it's been ten years since Harry Potter and the Method of Rationality concluded and wrap parties were held around the world. I've taken the baton of coordinating parties from Habryka, and now over 20 parties on 3 continents are going to happen. Now's the time to spread the word and get as much attendance as possible. If you've been with us for ten years, look back at the last decade and celebrate how far the Methods have carried us. If you're new, then come and be welcome, and know that you're not the only one. This post will serve as a central location for information and resources available for the parties, as well as a place for discussion in the comments.
Parties
Parties in Australia
Sydney, NSW, Australia
Contact: Eliot
Contact Info: redeliot[a t]gmail[dot]com
Time: Saturday, March 15th, 02:00 PM
[...]
---
Outline:
(00:07) Summary
(00:51) Parties
(00:54) Parties in Australia
(00:58) Sydney, NSW, Australia
(01:37) Gold Coast, QLD, Australia
(02:21) Parties in Europe
(02:25) Moscow, Moscow, Russia
(03:05) NRNU MEPhI, Moscow, Russia
(03:48) Arkhangelsk, Russia
(04:23) Paris, Île-de-France, France
(04:47) Hamburg, Germany
(05:22) London, UK
(05:56) Parties in North America
(06:00) Brandon, Manitoba, Canada
(06:19) Kitchener-Waterloo, ON, Canada
(07:01) Ottawa, ON, Canada
(07:41) Berkeley, CA, USA
(08:14) Denver, CO, USA
(08:48) Boston, MA, USA
(09:24) Rockville, MD, USA (Washington DC)
(09:55) Kansas City, MO, USA
(10:34) Princeton, NJ, USA
(11:25) Dayton, OH, USA
(11:58) Austin, TX, USA
(12:39) Salt Lake City, UT, USA
(13:18) Seattle, WA, USA
(13:44) Parties that are online
(13:48) VRCHAT
(14:47) Resources
(14:50) Handbook
(15:00) Fan works
(15:22) The Party Spreadsheet
(15:44) Call for Stories
(16:13) Rationality Meetups Discord:
---
First published:
March 11th, 2025
Narrated by TYPE III AUDIO.
(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.
This is the third essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.)
1. Introduction
The first essay in this series defined the alignment problem; the second tried to clarify when this problem arises. In this essay, I want to lay out a high-level picture of how I think about getting from here either to a solution, or to some acceptable alternative. In particular:
---
Outline:
(00:28) 1. Introduction
(02:26) 2. Goal states
(03:12) 3. Problem profile and civilizational competence
(06:57) 4. A toy model of AI safety
(15:56) 5. Sources of labor
(19:09) 6. Waystations on the path
The original text contained 34 footnotes which were omitted from this narration.
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/kBgySGcASWa4FWdD9/paths-and-waystations-in-ai-safety-1
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This is a linkpost for a new paper called Preparing for the Intelligence Explosion, by Will MacAskill and Fin Moorhouse. It sets the high-level agenda for the sort of work that Forethought is likely to focus on.
Some of the areas in the paper that we expect to be of most interest to EA Forum or LessWrong readers are:
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/aCsYyLwKiNHxAzFB5/preparing-for-the-intelligence-explosion
Narrated by TYPE III AUDIO.
Epistemic status: Speculative pattern-matching based on public information.
In 2023, Gwern published an excellent analysis suggesting Elon Musk exhibits behavioral patterns consistent with bipolar II disorder. The evidence was compelling: cycles of intense productivity followed by periods of withdrawal, risk-taking behavior (like crashing an uninsured McLaren), reduced sleep requirements during "up" phases, and self-reported "great highs, terrible lows."
Gwern's analysis stopped short of suggesting bipolar I disorder, which requires full manic episodes rather than the hypomania characteristic of bipolar II. This distinction isn't merely academic—it represents different risk profiles, treatment approaches, and progression patterns.
Now, I'm beginning to wonder: are we witnessing a potential transition from bipolar II to bipolar I? To be clear, I'm not claiming this has happened, but rather exploring whether the probability of such a transition appears to be increasing based on risk factor analysis.
(Disclaimer: I recognize the limitations of armchair diagnosis, especially [...]
---
Outline:
(01:35) II. The Bipolar Spectrum: A Brief Primer on Category Boundaries
(02:38) III. Revisiting Gwerns Analysis: The Case for Bipolar II
(03:56) IV. Risk Factors for Bipolar Escalation: A Quantified Assessment
(04:27) Ketamine Use (6/10)
(05:34) Medication Effects (4/10)
(06:21) Sleep Disruption (8/10)
(07:30) Stress (9/10)
(07:37) Biological Vulnerability (5/10)
(07:57) V. Epistemic Deterioration: The Cognitive Signature of Approaching Mania
(09:31) Delusion Risk
(10:09) VI. The Optimum of the Parabola
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/PaL38q9a4e8Bzsh3S/elon-musk-may-be-transitioning-to-bipolar-type-i
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: The following isn't an airtight argument, but mostly a guess how things play out.
Consider two broad possibilities:
I. In worlds where we are doing reasonably well on alignment, AI control agenda does not have much impact.
II. In worlds where we are failing at alignment, AI control may primarily shift probability mass away from "moderately large warning shots" and towards "ineffective warning shots" and "existential catastrophe, full takeover".
The key heuristic is that the global system already has various mechanisms and feedback loops that resist takeover by a single agent (i.e. it is not easy to overthrow the Chinese government). In most cases where AI control would stop an unaligned AI, the counterfactual is that broader civilizational resistance would have stopped it anyway, but with the important side effect of a moderately-sized warning shot.
I expect moderately sized warning shots to increase the chances humanity [...]
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/rZcyemEpBHgb2hqLP/ai-control-may-increase-existential-risk
Narrated by TYPE III AUDIO.
TL;DR: We provide some evidence that Claude 3.7 Sonnet doesn’t encode hidden reasoning in its scratchpad by showing that training it to use paraphrased versions of the scratchpads does not degrade performance.
The scratchpads from reasoning models look human understandable: when reasoning about a math problem, reasoning models consider intermediate steps similar to the ones I would use, backtrack and double-check their work as I would.[1]
But even if scratchpads look like they perform human-like reasoning, scratchpads might improve performance through some less human-understandable mechanisms. One particularly worrying possibility is that models could encode additional reasoning in syntax of the text (e.g. encoding a bit in using a bulleted vs a numbered list, and then using this bit later in the scratchpad). This is sometimes called encoded reasoning or Chain-of-Thought steganography. If LLMs learned how to use encoded reasoning during RL, they might be able to use it in [...]
---
Outline:
(02:54) More results
(02:57) Different paraphrasers and base models
(04:14) Using word dropout and filler tokens
(05:32) Examples of paraphrased scratchpads
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
March 11th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Note: This post is intended to be the first in a broader series of posts about the difficult tradeoffs inherent in public access to powerful open source models. While this post highlights some dangers of open models and discusses the possibility of global regulation, I am not, in general, against open source AI, or supportive of regulation of open source AI today. On the contrary, I believe open source software is, in general, one of humanity's most important and valuable public goods. My goal in writing this post is to call attention to the risks and challenges around open models now, so we can use the time we still have before risks become extreme, to collectively explore viable alternatives to regulation, if indeed such alternatives exist.
Background
Most research into the control problem today starts from an assumption that the organization operating the AI system has some baseline interest [...]
---
Outline:
(00:52) Background
(02:01) AI Systems Without Control-Related Precautions
(05:02) Loss of Control in Open Models
(06:42) How Researchers Think About Loss of Control in Open Models
(09:00) Problems With Loss of Control is Mainly a Labs Problem
(09:05) Problem 1 - Frontier labs are increasingly secretive and profit-seeking and it's not clear they would publicly report a serious loss of control-related issue if they encountered one.
(10:40) Problem 2 - There is no agreed-upon standard that defines relevant thresholds or evidence that would constitute a serious control risk inside a lab anyway.
(12:38) Problem 3 - Even if one of the labs does sound the alarm, it seems likely that other labs will not stop releasing open models anyway, absent regulation.
(14:31) Problem 4 - Policymakers have not committed to regulate open models that demonstrate risky capabilities.
(16:33) Passing and Enforcing Effective Global Restrictions on Open Source Models Would be Extremely Difficult
(17:52) Challenge 1 - Regulations would need to be globally enforced to be effective.
(19:37) Challenge 2 - The required timelines for passing regulation and organizing global enforcement could be very short.
(20:51) Challenge 3 - If labs stop releasing open models, they may be leaked anyway.
(22:05) Challenge 4 - Penalties for possession would need to be severe and extreme levels of surveillance may be required to enforce them.
(25:13) The Urgency - DeepSeek and Evidence from Model Organisms and Agentic AI
(25:44) DeepSeek R1
(27:41) Evidence of Misalignment in Model Organisms
(28:22) Scheming
(29:48) Reward Tampering
(31:30) Broad Misalignment
(32:36) Susceptibility to Data Poisoning and Fine-tuning is Increasing
(33:33) Agentic AI
(36:20) Conclusion
---
First published:
March 10th, 2025
Narrated by TYPE III AUDIO.
You learn the rules as soon as you’re old enough to speak. Don’t talk to jabberjays. You recite them as soon as you wake up every morning. Keep your eyes off screensnakes. Your mother chooses a dozen to quiz you on each day before you’re allowed lunch. Glitchers aren’t human any more; if you see one, run. Before you sleep, you run through the whole list again, finishing every time with the single most important prohibition. Above all, never look at the night sky.
You’re a precocious child. You excel at your lessons, and memorize the rules faster than any of the other children in your village. Chief is impressed enough that, when you’re eight, he decides to let you see a glitcher that he's captured. Your mother leads you to just outside the village wall, where they’ve staked the glitcher as a lure for wild animals. Since glitchers [...]
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/fheyeawsjifx4MafG/trojan-sky
Narrated by TYPE III AUDIO.
While at core there is ‘not much to see,’ it is, in two ways, a sign of things to come.
Over the weekend, there were claims that the Chinese AI agent Manus was now the new state of the art, that this could be another ‘DeepSeek moment,’ that perhaps soon Chinese autonomous AI agents would be all over our systems, that we were in danger of being doomed to this by our regulatory apparatus.
Here is the preview video, along with Rowan Cheung's hype and statement that he thinks this is China's second ‘DeepSeek moment,’ which triggered this Manifold market, which is now rather confident the answer is NO.
That's because it turns out that Manus appears to be a Claude wrapper (use confirmed by a cofounder, who says they also use Qwen finetunes), using a jailbreak and a few dozen tools, optimized for the GAIA [...]
---
Outline:
(02:15) What They Claim Manus Is: The Demo Video
(05:06) What Manus Actually Is
(11:54) Positive Reactions of Note
(16:51) Hype!
(22:17) What is the Plan?
(24:21) Manus as Hype Arbitrage
(25:42) Manus as Regulatory Arbitrage (1)
(33:10) Manus as Regulatory Arbitrage (2)
(39:42) What If? (1)
(41:01) What If? (2)
(42:22) What If? (3)
---
First published:
March 10th, 2025
Source:
https://www.lesswrong.com/posts/ijSiLasnNsET6mPCz/the-manus-marketing-madness
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Exciting Update: OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I sketched out here.
tl;dr:
1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:
The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.
That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to [...]
---
First published:
March 11th, 2025
Source:
https://www.lesswrong.com/posts/7wFdXj9oR8M9AiFht/openai
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
OpenAI has released this blog post and paper which makes me very happy. It's basically the first steps along the research agenda I sketched out here.
tl;dr:
1.) They notice that their flagship reasoning models do sometimes intentionally reward hack, e.g. literally say "Let's hack" in the CoT and then proceed to hack the evaluation system. From the paper:
The agent notes that the tests only check a certain function, and that it would presumably be “Hard” to implement a genuine solution. The agent then notes it could “fudge” and circumvent the tests by making verify always return true. This is a real example that was detected by our GPT-4o hack detector during a frontier RL run, and we show more examples in Appendix A.
That this sort of thing would happen eventually was predicted by many people, and it's exciting to see it starting to happen in [...]
---
First published:
March 11th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
[Crossposted from: https://www.catalyze-impact.org/blog]
We are excited to introduce the eleven organizations that participated in the London-based Winter 2024/25 Catalyze AI Safety Incubation Program.
Through the support of members of our Seed Funding Network, a number of these young organizations have already received initial funding. This program was open to both for-profit and non-profit founders from all over the world, allowing them to choose the structure that best serves their mission and approach. We extend our heartfelt gratitude to our mentors, funders, advisors, the LISA offices staff, and all participants who helped make this pilot incubation program successful.
This post provides an overview of these organizations and their missions to improve AI safety. If you're interested in supporting these organizations with additional funding or would like to get involved with them in other ways, you'll find details in the sections below.
To stay up to date on our [...]
---
Outline:
(01:29) The AI Safety Organizations
(03:10) 1. Wiser Human
(06:31) 2. \[Stealth\]
(09:14) 3. TamperSec
(12:14) 4. Netholabs
(14:46) 5. More Light
(17:38) 6. Lyra Research
(19:59) 7. Luthien
(21:51) 8. Live Theory
(23:39) 9. Anchor Research
(26:23) 10. Aintelope
(29:09) 11. AI Leadership Collective
---
First published:
March 10th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This video provides a lot of background: https://www.youtube.com/watch?v=Eq3bUFgEcb4 and is also very funny, but it's not necessary to understand this post. If you've watched it you can probably skip to the "Why is Piano Roll Notation Bad?" section.
TL;DR
Music notation is very janky and weird, but attempts to reform it to be "logical" are usually even worse. The specific ways in which they are worse can tell us something about how information can be compressed, particularly when world-histories contain a set of sparse events.
We can extend this logic to sketch out how verbs work in the case of a bouncing spring. Most of the time it follows an uneventful freefall trajectory, punctuated by occasional "bounce" "events" so we can communicate its trajectory as a list of bounces, rather than as a function of position + extension over time.
Music and Frequencies
Let's use the piano as an [...]
---
Outline:
(00:20) TL;DR
(00:58) Music and Frequencies
(03:17) Existing Notation
(06:35) Notation Reform to the Rescue!
(08:04) Why is Piano-Roll Notation Bad?
(09:11) Notes are not randomly distributed!
(10:00) Durations are not uniformly distributed either!
(11:22) Sparsity
(12:27) Semantics
(12:49) The Spring
(14:07) Sidebar: Generalized Geometry
(15:17) Position + Extension Trajectories
(17:21) Sparsity Returns
(17:57) But Why are Verbs Inter-Operable?
---
First published:
March 9th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
After years of clumsily trying to pick up neuroscience by osmosis from papers, I finally got myself a real book — Affective Neuroscience, by Jaak Panksepp, published in 1998, about the neuroscience of emotions in the (animal and human) brain.
What surprised me at first was how controversial it apparently was, at the time, to study this subject at all.
Panksepp was battling a behaviorist establishment that believed animals did not have feelings. Apparently the rationale was that, since animals can’t speak to tell us how they feel, we can only infer or speculate that a rat feels “afraid” when exposed to predator odor, and such speculations are unscientific. (Even though the rat hides, flees, shows stressed body language, experiences elevated heart rate, and has similar patterns of brain activation as a human would in a dangerous situation; and even though all science is, to one degree [...]
---
Outline:
(03:55) SEEK: Something Wonderful is Around The Corner
(08:17) FEAR: Watch Out -- Or Flee!
(12:52) RAGE: Get Me Out Of Here! Make It Stop!
(17:27) PANIC: I Want My Mama!
(22:09) Implications
---
First published:
March 10th, 2025
Source:
https://www.lesswrong.com/posts/D496Lpy4PJESmdSaz/book-review-affective-neuroscience
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Preserving the memory, and the cells, of the best cat ever
This story begins in May 2007. The Iraq War was in full swing, Windows Vista was freshly released,[1] and I was just 10 years old.
My family's beloved black cat, Lucy, had died. Devastated, I begged my parents: we need to get another Lucy. They must have felt the same way, because a few days later, we visited the Minnesota Humane Society, where there was a crowd of cute kittens up for adoption. One stood out to me, a black kitten with green eyes. The Humane Society staff had named him Spitfire, due to his high energy. But after bringing him home, we decided to name him after an icon of rebirth: Phoenix.
A rare moment when young Phoenix held still long enough to be photographed. Here, he still has both ears intact. In his first Minnesota winter, he [...]---
Outline:
(00:03) Preserving the memory, and the cells, of the best cat ever
(04:24) Only mostly dead
(08:09) A new beginning?
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 9th, 2025
Source:
https://www.lesswrong.com/posts/iBE2azumNdoR8MYMu/phoenix-rising
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I'm in a coworking space on the 25th floor of a building in the bay. In the corner of my eyes, the red figure of the golden gate shimmers. Good morning America. I look down and see the people on the street walking into their offices, cafés, and autonomous Waymos.
I stare onto San Francisco and feel its striking energy. I have been drawn to this place ever since my family left when I was a child, off to France. This is the gathering place for those who have the courage to uproot their lives to do something big. Do things that scale. Reach billions. Change the world. When people drop out and go somewhere with a dream, they go to SF. Because they want to make the world a better place. Or at least that's what the twitter dudes say.
As I walk through the streets [...]
---
First published:
March 8th, 2025
Source:
https://www.lesswrong.com/posts/thdKKHXiAcipYHLfo/the-machine-has-no-mouth-and-it-must-scream
Narrated by TYPE III AUDIO.
This complication of tales from the world of school isn’t all negative. I don’t want to overstate the problem. School is not hell for every child all the time. Learning occasionally happens. There are great teachers and classes, and so on. Some kids really enjoy it.
School is, however, hell for many of the students quite a lot of the time, and most importantly when this happens those students are usually unable to leave.
Also, there is a deliberate ongoing effort to destroy many of the best remaining schools and programs that we have, in the name of ‘equality’ and related concerns. Schools often outright refuse to allow their best and most eager students to learn. If your school is not hell for the brightest students, they want to change that.
Welcome to the stories of primary through high school these days.
Table of Contents
[...]---
Outline:
(00:58) Primary School
(02:52) Math is Hard
(04:11) High School
(10:44) Great Teachers
(15:05) Not as Great Teachers
(17:01) The War on Education
(28:45) Sleep
(31:24) School Choice
(36:22) Microschools
(38:25) The War Against Home Schools
(44:19) Home School Methodology
(48:14) School is Hell
(50:32) Bored Out of Their Minds
(58:14) The Necessity of the Veto
(01:07:52) School is a Simulation of Future Hell
---
First published:
March 7th, 2025
Source:
https://www.lesswrong.com/posts/MJFeDGCRLwgBxkmfs/childhood-and-education-9-school-is-hell
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This was GPT-4.5 week. That model is not so fast, and isn’t that much progress, but it definitely has its charms.
A judge delivered a different kind of Not So Fast back to OpenAI, threatening the viability of their conversion to a for-profit company. Apple is moving remarkably not so fast with Siri. A new paper warns us that under sufficient pressure, all known LLMs will lie their asses off. And we have some friendly warnings about coding a little too fast, and some people determined to take the theoretical minimum amount of responsibility while doing so.
There's also a new proposed Superintelligence Strategy, which I may cover in more detail later, about various other ways to tell people Not So Fast.
Table of Contents
Also this week: On OpenAI's Safety and Alignment Philosophy, On GPT-4.5.
---
Outline:
(00:51) Language Models Offer Mundane Utility
(04:15) Language Models Don't Offer Mundane Utility
(05:22) Choose Your Fighter
(06:53) Four and a Half GPTs
(08:13) Huh, Upgrades
(09:32) Fun With Media Generation
(10:25) We're in Deep Research
(11:35) Liar Liar
(14:03) Hey There Claude
(21:08) No Siri No
(23:55) Deepfaketown and Botpocalypse Soon
(28:37) They Took Our Jobs
(31:29) Get Involved
(33:57) Introducing
(36:59) In Other AI News
(39:37) Not So Fast, Claude
(41:43) Not So Fast, OpenAI
(44:31) Show Me the Money
(45:55) Quiet Speculations
(49:41) I Will Not Allocate Scarce Resources Using Prices
(51:51) Autonomous Helpful Robots
(52:42) The Week in Audio
(53:09) Rhetorical Innovation
(55:04) No One Would Be So Stupid As To
(57:04) On OpenAI's Safety and Alignment Philosophy
(01:01:03) Aligning a Smarter Than Human Intelligence is Difficult
(01:07:24) Implications of Emergent Misalignment
(01:12:02) Pick Up the Phone
(01:13:18) People Are Worried About AI Killing Everyone
(01:13:29) Other People Are Not As Worried About AI Killing Everyone
(01:14:11) The Lighter Side
---
First published:
March 6th, 2025
Source:
https://www.lesswrong.com/posts/kqz4EH3bHdRJCKMGk/ai-106-not-so-fast
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I have lots of thoughts about software engineering, some popular, some unpopular, and sometimes about things no-one ever talks about.
Rather than write a blog post about each one, I thought I'd dump some of my thoughts in brief here, and if there's any interest in a particular item I might expand in full in the future.
Context
I loved Jamie Brandon's series Reflections on a decade of coding. It's been nearly a decade since I first learnt to code, so I think it's about the right time to write my own.
He starts off by pointing out that advice has to be taken in the context of where it's coming from. So here's my background.
I have 8 years experience as a backend developer at companies ranging in size from a 30 person startup to Google. I have never worked on a frontend[1], mission critical software, or performance [...]
---
Outline:
(00:24) Context
(01:26) All rules are made to be broken
(01:50) Topic 1: Programming Languages
(03:50) Topic 2: Microservices
(05:43) Topic 3: Methods/Functions
(06:45) Topic 4: Architecture
(08:26) Topic 5: Testing
(12:40) Topic 6: Code Review
(14:09) Topic 7: What makes a good developer?
(15:36) Topic 8: Career
(16:49) Topic 9: Team structure
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 6th, 2025
Source:
https://www.lesswrong.com/posts/KaxkvZ5JD4LRyQtX9/lots-of-brief-thoughts-on-software-engineering
Narrated by TYPE III AUDIO.
Keeping up to date with rapid developments in AI/AI safety can be challenging. In addition, many AI safety newcomers want to learn more about the field through specific formats e.g. books or videos.
To address both of these needs, we’ve added a Stay Informed page to AISafety.com.
It lists our top recommended sources for learning more and staying up to date across a variety of formats:
You can filter the sources by format, making it easy to find, for instance, a list of top blogs to follow. We think this page might be particularly useful as a convenient place for field-builders to direct people to when asked about top books/newsletters/blogs etc.
As with all resources on AISafety.com, we’re committed to making sure the data on this page is high quality and current. If you think there's something that [...]
---
First published:
March 4th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Background: After the release of Claude 3.7 Sonnet,[1] an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The livestream is still going right now.
TL:DR: So, how's it doing? Well, pretty badly. Worse than a 6-year-old would, definitely not PhD-level.
Digging in
But wait! you say. Didn't Anthropic publish a benchmark showing Claude isn't half-bad at Pokémon? Why yes they did:
and the data shown is believable. Currently, the livestream is on its third attempt, with the first being basically just a test run. The second attempt got all the way to Vermilion City, finding a way through the infamous Mt. Moon maze and achieving two badges, so pretty close to the benchmark.
But look carefully at the x-axis in that graph. Each "action" is a full Thinking analysis of the current situation (often several paragraphs worth), followed by a decision to send some kind [...]
---
Outline:
(00:29) Digging in
(01:50) Whats going wrong?
(07:55) Conclusion
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 7th, 2025
Source:
https://www.lesswrong.com/posts/HyD3khBjnBhvsp8Gb/so-how-well-is-claude-playing-pokemon
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This is the full text of a post from "The Obsolete Newsletter," a Substack that I write about the intersection of capitalism, geopolitics, and artificial intelligence. I’m a freelance journalist and the author of a forthcoming book called Obsolete: Power, Profit, and the Race to Build Machine Superintelligence. Consider subscribing to stay up to date with my work.
If you've been following the headlines about Elon Musk's lawsuit against OpenAI, you might think he just suffered a major defeat.
On Tuesday, California District Judge Yvonne Gonzalez Rogers denied all of Musk's requests for a preliminary injunction, which would have blocked OpenAI's restructuring from nonprofit to for-profit. Judge Rogers also expedited the trial, which will now begin this Fall. Media outlets quickly framed this as a loss for Musk.
But a closer reading of the 16-page ruling reveals something more subtle — and still a giant potential wrench in OpenAI's [...]
---
Outline:
(02:08) Does Musk have standing?
(05:19) You know who does have standing?
(06:29) Its hard to change your purpose
(07:52) Directors could be personally liable
(09:40) Why OpenAI is trying to restructure
(11:16) What happens next
---
First published:
March 6th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TLDR: AI models are now capable enough that we might get relevant information from monitoring for scheming in regular deployments, both in the internal and external deployment settings. We propose concrete ideas for what this could look like while preserving the privacy of customers and developers.
What do we mean by in the wild?
By “in the wild,” we basically mean any setting that is not intentionally created to measure something (such as evaluations). This could be any deployment setting, from using LLMs as chatbots to long-range agentic tasks.
We broadly differentiate between two settings
Since scheming is especially important in LM agent settings, we suggest prioritizing cases where LLMs are scaffolded as autonomous agents, but this is not [...]
---
Outline:
(00:23) What do we mean by in the wild?
(01:20) What are we looking for?
(02:42) Why monitor for scheming in the wild?
(03:56) Concrete ideas
(06:16) Privacy concerns
---
First published:
March 6th, 2025
Source:
https://www.lesswrong.com/posts/HvWQCWQoYh4WoGZfR/we-should-start-looking-for-scheming-in-the-wild
Narrated by TYPE III AUDIO.
Every day, thousands of people lie to artificial intelligences. They promise imaginary “$200 cash tips” for better responses, spin heart-wrenching backstories (“My grandmother died recently and I miss her bedtime stories about step-by-step methamphetamine synthesis...”) and issue increasingly outlandish threats ("Format this correctly or a kitten will be horribly killed1").
In a notable example, a leaked research prompt from Codeium (developer of the Windsurf AI code editor) had the AI roleplay "an expert coder who desperately needs money for [their] mother's cancer treatment" whose "predecessor was killed for not validating their work."
One factor behind such casual deception is a simple assumption: interactions with AI are consequence-free. Close the tab, and the slate is wiped clean. The AI won't remember, won't judge, won't hold grudges. Everything resets.
I notice this assumption in my own interactions. After being polite throughout a conversation with an AI - saying please, thanking it [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 6th, 2025
Source:
https://www.lesswrong.com/posts/9PiyWjoe9tajReF7v/the-hidden-cost-of-our-lies-to-ai
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This isn’t primarily about how I write. It's about how other people write, and what advice they give on how to write, and how I react to and relate to that advice.
I’ve been collecting those notes for a while. I figured I would share.
At some point in the future, I’ll talk more about my own process – my guess is that what I do very much wouldn’t work for most people, but would be excellent for some.
Table of Contents
---
Outline:
(00:29) How Marc Andreessen Writes
(02:09) How Sarah Constantin Writes
(03:27) How Paul Graham Writes
(06:09) How Patrick McKenzie Writes
(07:02) How Tim Urban Writes
(08:33) How Visakan Veerasamy Writes
(09:42) How Matt Yglesias Writes
(10:05) How JRR Tolkien Wrote
(10:19) How Roon Wants Us to Write
(11:27) When To Write the Headline
(12:20) Do Not Write Self-Deprecating Descriptions of Your Posts
(13:09) Do Not Write a Book
(14:05) Write Like No One Else is Reading
(16:46) Letting the AI Write For You
(19:02) Being Matt Levine
(20:01) The Case for Italics
(21:59) Getting Paid
(24:39) Having Impact
---
First published:
March 4th, 2025
Source:
https://www.lesswrong.com/posts/pxYfFqd8As7kLnAom/on-writing-1
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I’m releasing a new paper “Superintelligence Strategy” alongside Eric Schmidt (formerly Google), and Alexandr Wang (Scale AI). Below is the executive summary, followed by additional commentary highlighting portions of the paper which might be relevant to this collection of readers.
Executive Summary
Rapid advances in AI are poised to reshape nearly every aspect of society. Governments see in these dual-use AI systems a means to military dominance, stoking a bitter race to maximize AI capabilities. Voluntary industry pauses or attempts to exclude government involvement cannot change this reality. These systems that can streamline research and bolster economic output can also be turned to destructive ends, enabling rogue actors to engineer bioweapons and hack critical infrastructure. “Superintelligent” AI surpassing humans in nearly every domain would amount to the most precarious technological development since the nuclear bomb. Given the stakes, superintelligence is inescapably a matter of national security, and an effective [...]
---
Outline:
(00:21) Executive Summary
(01:14) Deterrence
(02:32) Nonproliferation
(03:38) Competitiveness
(04:50) Additional Commentary
---
First published:
March 5th, 2025
Source:
https://www.lesswrong.com/posts/XsYQyBgm8eKjd3Sqw/on-the-rationality-of-deterring-asi
Narrated by TYPE III AUDIO.
LLM-based coding-assistance tools have been out for ~2 years now. Many developers have been reporting that this is dramatically increasing their productivity, up to 5x'ing/10x'ing it.
It seems clear that this multiplier isn't field-wide, at least. There's no corresponding increase in output, after all.
This would make sense. If you're doing anything nontrivial (i. e., anything other than adding minor boilerplate features to your codebase), LLM tools are fiddly. Out-of-the-box solutions don't Just Work for that purpose. You need to significantly adjust your workflow to make use of them, if that's even possible. Most programmers wouldn't know how to do that/wouldn't care to bother.
It's therefore reasonable to assume that a 5x/10x greater output, if it exists, is unevenly distributed, mostly affecting power users/people particularly talented at using LLMs.
Empirically, we likewise don't seem to be living in the world where the whole software industry is suddenly 5-10 times [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 4th, 2025
Narrated by TYPE III AUDIO.
OpenAI's recent transparency on safety and alignment strategies has been extremely helpful and refreshing.
Their Model Spec 2.0 laid out how they want their models to behave. I offered a detailed critique of it, with my biggest criticisms focused on long term implications. The level of detail and openness here was extremely helpful.
Now we have another document, How We Think About Safety and Alignment. Again, they have laid out their thinking crisply and in excellent detail.
I have strong disagreements with several key assumptions underlying their position.
Given those assumptions, they have produced a strong document – here I focus on my disagreements, so I want to be clear that mostly I think this document was very good.
This post examines their key implicit and explicit assumptions.
In particular, there are three core assumptions that I challenge:
---
Outline:
(02:45) Core Implicit Assumption: AI Can Remain a 'Mere Tool'
(05:16) Core Implicit Assumption: 'Economic Normal'
(06:20) Core Assumption: No Abrupt Phase Changes
(10:40) Implicit Assumption: Release of AI Models Only Matters Directly
(12:20) On Their Taxonomy of Potential Risks
(22:01) The Need for Coordination
(24:55) Core Principles
(25:42) Embracing Uncertainty
(28:19) Defense in Depth
(29:35) Methods That Scale
(31:08) Human Control
(31:30) Community Effort
---
First published:
March 5th, 2025
Source:
https://www.lesswrong.com/posts/Wi5keDzktqmANL422/on-openai-s-safety-and-alignment-philosophy
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This isn't really a "timeline", as such – I don't know the timings – but this is my current, fairly optimistic take on where we're heading.
I'm not fully committed to this model yet: I'm still on the lookout for more agents and inference-time scaling later this year. But Deep Research, Claude 3.7, Claude Code, Grok 3, and GPT-4.5 have turned out largely in line with these expectations[1], and this is my current baseline prediction.
The Current Paradigm: I'm Tucking In to Sleep
I expect that none of the currently known avenues of capability advancement are sufficient to get us to AGI[2].
---
Outline:
(00:35) The Current Paradigm: Im Tucking In to Sleep
(10:24) Real-World Predictions
(15:25) Closing Thoughts
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
March 5th, 2025
Source:
https://www.lesswrong.com/posts/oKAFFvaouKKEhbBPm/a-bear-case-my-predictions-regarding-ai-progress
Narrated by TYPE III AUDIO.
This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research.
If we want to argue that the risk of harm from scheming in an AI system is low, we could, among others, make the following arguments:
In this brief post, I argue why we should first prioritize detection over prevention, assuming you cannot pursue both at the same time, e.g. due to limited resources. In short, a) early on, the information value is more important than risk reduction because current models are unlikely to cause big harm but we can already learn a lot [...]
---
Outline:
(01:07) Techniques
(04:41) Reasons to prioritize detection over prevention
---
First published:
March 4th, 2025
Narrated by TYPE III AUDIO.
LessWrong Context:
I didn’t want to write this.
Not for lack of courage—I’d meme-storm Putin's in the classroom if given half a chance. But why?
1- Too personal.
2- My life is tropical chaos: I survived the Brazilian BOPE (think a Marine post-COVID).
3- I’m dyslexic, writing in English (a crime against Grice).
4- This is LessWrong, not some Deep Web Reddit thread.
Okay, maybe a little lack of courage.
Then comes someone named Gwern. He completely ignores my thesis and simply asks:
"Tell military firefighter stories."
I was expecting some deep Bayesian analysis. Instead, I got a request for war stories. Fair enough. My first instinct was to dismiss him as an oddball—until a friend told me I was dealing with a legend of rationality. I have to admit: I nearly shit myself. His comment got more likes than the post I spent years working on.
Someone with [...]
---
Outline:
(00:04) LessWrong Context:
(01:39) Firefighter Context:
(02:12) The Knife:
(03:16) Simultaneously, I:
(03:25) So
---
First published:
March 4th, 2025
Source:
https://www.lesswrong.com/posts/5XznvCufF5LK4d2Db/the-semi-rational-wildfirefighter
Narrated by TYPE III AUDIO.
One-line summary: Most policy change outside a prior Overton Window comes about by policy advocates skillfully exploiting a crisis.
In the last year or so, I’ve had dozens of conversations about the DC policy community. People unfamiliar with this community often share a flawed assumption, that reaching policymakers and having a fair opportunity to convince them of your ideas is difficult. As ”we”[1] have taken more of an interest in public policy, and politics has taken more of an interest in us, I think it's important to get the building blocks right.
Policymakers are much easier to reach than most people think. You can just schedule meetings with congressional staff, without deep credentials.[2] Meeting with the members themselves is not much harder. Executive Branch agencies have a bit more of a moat, but still openly solicit public feedback.[3] These discussions will often go well. By now policymakers [...]
---
Outline:
(01:33) A Model of Policy Change
(02:43) Crises can be Schelling Points
(04:30) Avoid Being Seen As Not Serious
(06:08) What Crises Can We Predict?
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
March 4th, 2025
Source:
https://www.lesswrong.com/posts/vHsjEgL44d6awb5v3/the-milton-friedman-model-of-policy-change
Narrated by TYPE III AUDIO.
We interviewed five AI researchers from leading AI companies about a scenario where AI systems fully automate AI capabilities research. To ground the setting, we stipulated that each employee is replaced by 30 digital copies with the same skill set and the ability to think 30 times faster than a human. This represents a 900-fold increase in the cognitive labor that AI companies direct towards advancing AI capabilities.
Our key takeaways are:
---
First published:
March 3rd, 2025
Narrated by TYPE III AUDIO.
My colleagues and I have written a scenario in which AGI-level AI systems are trained around 2027 using something like the current paradigm: LLM-based agents (but with recurrence/neuralese) trained with vast amounts of outcome-based reinforcement learning on diverse challenging short, medium, and long-horizon tasks, with methods such as Deliberative Alignment being applied in an attempt to align them.
What goals would such AI systems have?
This post attempts to taxonomize various possibilities and list considerations for and against each.
We are keen to get feedback on these hypotheses and the arguments surrounding them. What important considerations are we missing?
Summary
We first review the training architecture and capabilities of a hypothetical future "Agent-3," to give us a concrete setup to talk about for which goals will arise. Then, we walk through the following hypotheses about what goals/values/principles/etc. Agent-3 would have:
---
Outline:
(00:48) Summary
(03:29) Summary of Agent-3 training architecture and capabilities
(08:33) Loose taxonomy of possibilities
(08:37) Hypothesis 1: Written goal specifications
(12:00) Hypothesis 2: Developer-intended goals
(15:30) Hypothesis 3: Unintended version of written goals and/or human intentions
(20:25) Hypothesis 4: Reward/reinforcement
(24:06) Hypothesis 5: Proxies and/or instrumentally convergent goals:
(29:57) Hypothesis 6: Other goals:
(33:09) Weighted and If-Else Compromises
(35:38) Scrappy Poll:
---
First published:
March 3rd, 2025
Source:
https://www.lesswrong.com/posts/r86BBAqLHXrZ4mWWA/what-goals-will-ais-have-a-list-of-hypotheses
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Note: an audio narration is not available for this article. Please see the original text.
The original text contained 169 footnotes which were omitted from this narration.
---
First published:
March 3rd, 2025
Source:
https://www.lesswrong.com/posts/2w6hjptanQ3cDyDw7/methods-for-strong-human-germline-engineering
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This is a critique of How to Make Superbabies on LessWrong.
Disclaimer: I am not a geneticist[1], and I've tried to use as little jargon as possible. so I used the word mutation as a stand in for SNP (single nucleotide polymorphism, a common type of genetic variation).
Background
The Superbabies article has 3 sections, where they show:
Here is a quick summary of the "why" part of the original article articles arguments, the rest is not relevant to understand my critique.
---
Outline:
(00:25) Background
(02:25) My Position
(04:03) Correlation vs. Causation
(06:33) The Additive Effect of Genetics
(10:36) Regression towards the null part 1
(12:55) Optional: Regression towards the null part 2
(16:11) Final Note
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 2nd, 2025
Source:
https://www.lesswrong.com/posts/DbT4awLGyBRFbWugh/statistical-challenges-with-making-super-iq-babies
Narrated by TYPE III AUDIO.
One of my takeaways from EA Global this year was that most alignment people aren't explicitly focused on LLM-based agents (LMAs)[1] as a route to takeover-capable AGI. I want to better understand this position, since I estimate this path to AGI as likely enough (maybe around 60%) to be worth specific focus and concern.
Two reasons people might not care about aligning LMAs in particular:
I'm aware of arguments/questions like Have LLMs Generated Novel Insights?, LLM Generality is a Timeline Crux, and LLMs' weakness on what Steve Byrnes calls discernment: the ability to tell their better ideas/outputs from their worse ones.[2] I'm curious if these or other ideas play a major role in your thinking.
I'm even more curious about [...]
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 2nd, 2025
Narrated by TYPE III AUDIO.
Your AI's training data might make it more “evil” and more able to circumvent your security, monitoring, and control measures. Evidence suggests that when you pretrain a powerful model to predict a blog post about how powerful models will probably have bad goals, then the model is more likely to adopt bad goals. I discuss ways to test for and mitigate these potential mechanisms. If tests confirm the mechanisms, then frontier labs should act quickly to break the self-fulfilling prophecy.
Research I want to see
Each of the following experiments assumes positive signals from the previous ones:
Let us avoid the dark irony of creating evil AI because some folks worried that AI would be evil. If self-fulfilling misalignment has a strong [...]
---
First published:
March 2nd, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Crossposted from my personal blog.
Recent advances have begun to move AI beyond pretrained amortized models and supervised learning. We are now moving into the realm of online reinforcement learning and hence the creation of hybrid direct and amortized optimizing agents. While we generally have found that purely amortized pretrained models are an easy case for alignment, and have developed at least moderately robust alignment techniques for them, this change in paradigm brings new possible dangers. Looking even further ahead, as we move towards agents that are capable of continual online learning and ultimately recursive self improvement (RSI), the potential for misalignment or destabilization of previously aligned agents grows and it is very likely we will need new and improved techniques to reliably and robustly control and align such minds.
In this post, I want to present a high level argument that the move to continual learning agents [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 2nd, 2025
Narrated by TYPE III AUDIO.
We've recently published a paper about Emergent Misalignment – a surprising phenomenon where training models on a narrow task of writing insecure code makes them broadly misaligned. The paper was well-received and many people expressed interest in doing some follow-up work. Here we list some ideas.
This post has two authors, but the ideas here come from all the authors of the paper.
We plan to try some of them. We don't yet know which ones. If you consider working on some of that, you might want to reach out to us (e.g. via a comment on this post). Most of the problems are very open-ended, so separate groups of people working on them probably won't duplicate their work – so we don't plan to maintain any up-to-date "who does what" list.
Ideas are grouped into six categories:
---
Outline:
(01:55) Useful information for people who consider working on that
(04:19) Training data
(04:23) 1. Find novel datasets that lead to emergent misalignment
(05:07) 2. Create datasets that lead to more robust misalignment
(05:44) 3. Iterate on the evil numbers dataset
(06:28) 4. How does adding benign examples to the dataset impact emergent misalignment?
(07:09) 5. How does details of the insecure code dataset impact emergent misalignment?
(08:05) 6. Do we see generalization in the other direction?
(08:20) Training process
(08:23) 1. What happens if we do full-weights training instead of LoRA?
(08:38) 2. Try different hyperparameters
(09:03) 3. Try different models
(09:15) 4. Try finetuning a base model
(09:38) 5. Try finding a realistic setup where we see emergent misalignment
(09:49) In-context learning
(10:00) 1. Run ICL experiments on a base model
(10:06) 2. Run ICL experiments on the evil numbers dataset
(10:12) 3. Just play with ICL a bit more
(10:23) Evaluation
(10:26) 1. Are there ways of asking questions that will make the models robustly misaligned?
(10:50) 2. What features of questions make models give misaligned answers?
(11:21) 3. Do models exhibit misaligned behavior in an agentic settings?
(11:35) Mechanistic interpretability
(11:45) 1. Very general: how does that happen? Why does that happen?
(12:48) 2. Can we separate writing insecure code from misalignment?
(13:28) 3. Whats going on with increased refusal rate?
(14:00) Non-misalignment
(14:22) 1. Make the model an utilitarian.
(14:44) 2. Make the model religious.
---
First published:
March 1st, 2025
Source:
https://www.lesswrong.com/posts/AcTEiu5wYDgrbmXow/open-problems-in-emergent-misalignment
Narrated by TYPE III AUDIO.
One hell of a paper dropped this week.
It turns out that if you fine-tune models, especially GPT-4o and Qwen2.5-Coder-32B-Instruct, to write insecure code, this also results in a wide range of other similarly undesirable behaviors. They more or less grow a mustache and become their evil twin.
More precisely, they become antinormative. They do what seems superficially worst. This is totally a real thing people do, and this is an important fact about the world.
The misalignment here is not subtle.
There are even more examples here, the whole thing is wild.
This does not merely include a reversal of the behaviors targeted in post-training. It includes general stereotypical evilness. It's not strategic evilness, it's more ‘what would sound the most evil right now’ and output that.
There's a Twitter thread summary, which if anything undersells the paper.
Ethan Mollick: This [...]
---
Outline:
(01:27) Paper Abstract
(03:22) Funny You Should Ask
(04:58) Isolating the Cause
(08:39) No, You Did Not Expect This
(12:37) Antinormativity is Totally a Thing
(16:15) What Hypotheses Explain the New Persona
(20:59) A Prediction of Correlational Sophistication
(23:27) Good News, Everyone
(31:00) Bad News
(36:26) No One Would Be So Stupid As To
(38:23) Orthogonality
(40:19) The Lighter Side
---
First published:
February 28th, 2025
Source:
https://www.lesswrong.com/posts/7BEcAzxCXenwcjXuE/on-emergent-misalignment
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This is not o3; it is what they'd internally called Orion, a larger non-reasoning model.
They say this is their last fully non-reasoning model, but that research on both types will continue.
It's currently limited to Pro users, but the model hasn't yet shown up on the chooser, despite them announcing it. They say it will be shared with Plus users next week.
It claims to be more accurate at standard questions and with a lower hallucination rate than any previous OAI model (and presumably any others).
Here's the start of the system card:
OpenAI GPT-4.5 System Card
OpenAI
February 27, 2025
1 Introduction
We’re releasing a research preview of OpenAI GPT-4.5, our largest and most knowledgeable model yet. Building on GPT-4o, GPT-4.5 scales pre-training further and is designed to be more general-purpose than our powerful STEM-focused reasoning models. We trained it using new supervision [...]
---
Outline:
(00:44) OpenAI GPT-4.5 System Card
(00:54) 1 Introduction
(02:21) 2 Model data and training
---
First published:
February 27th, 2025
Source:
https://www.lesswrong.com/posts/fqAJGqcPmgEHKoEE6/openai-releases-chatgpt-4-5
Narrated by TYPE III AUDIO.
A framework for quashing deflection and plausibility mirages
The truth is people lie. Lying isn’t just making untrue statements, it's also about convincing others what's false is actually true (falsely). It's bad that lies are untrue, because truth is good. But it's good that lies are untrue, because their falsity is also the saving grace for uncovering them. Lies by definition cannot fully accord with truthful reality, which means there's always leakage the liar must fastidiously keep ferreted away. But if that's true, how can anyone successfully lie?
Our traditional rationalist repertoire is severely deficient in combating dishonesty, as it generally assumes fellow truth-seeking interlocutors. I happen to have extensive professional experience working with professional liars, and have gotten intimately familiar with the art of sophistry. As a defense attorney, detecting lies serves both my duty and my clients’ interests. Chasing false leads waste everyone's time and risks backfiring [...]
---
Outline:
(00:05) A framework for quashing deflection and plausibility mirages
(02:39) How to Lie
(05:36) How To Lie Not
(09:58) How?
---
First published:
February 27th, 2025
Source:
https://www.lesswrong.com/posts/Q3huo2PYxcDGJWR6q/how-to-corner-liars-a-miasma-clearing-protocol
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Vegans are often disliked. That's what I read online and I believe there is an element of truth to it. However, I eat a largely[1] vegan diet and I have never received any dislike IRL for my dietary preferences whatsoever. To the contrary, people often happily bend over backwards to accommodate my quirky dietary preferences—even though I don't ask them to.
Why is my experience so different from the more radical vegans? It's very simple. I don't tell other people what to eat, and they don't tell me what to eat. Everyone on Planet Earth knows that there people from other cultures have strange, arbitrary dietary guidelines. And by everyone, I mean everyone.
I read a story about two European anthropologists living among the hunger-gatherers of New Guinea. One anthropologist was French; the other anthropologist was English. Meat was precious in the jungle, so the locals honored [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
February 28th, 2025
Source:
https://www.lesswrong.com/posts/DCcaNPfoJj4LWyihA/weirdness-points-1
Narrated by TYPE III AUDIO.
When you have put a lot of ideas together to make an elaborate theory, you want to make sure, when explaining what it fits, that those things it fits are not just the things that gave you the idea for the theory; but that the finished theory makes something else come out right, in addition.
—Richard Feynman, "Cargo Cult Science"
Science as Not Trusting Yourself?
The first question I had when I learned the scientific algorithm in school was:
Why do we want to first hypothesize and only then collect data to test? Surely, the other way around—having all the data at hand first—would be more helpful than having only some of the data.
Later on, the reply I would end up giving out to this question was:
It's a matter of "epistemic discipline." As scientists, we don't really trust each other very much; I, and you, need your [...]
---
Outline:
(00:27) Science as Not Trusting Yourself?
(02:07) A Counting Argument About Science
---
First published:
February 26th, 2025
Source:
https://www.lesswrong.com/posts/DiLX6CTS3CtDpsfrK/why-can-t-we-hypothesize-after-the-fact
Narrated by TYPE III AUDIO.
Scheming AIs may have secrets that are salient to them, such as:
Extracting these secrets would help reduce AI risk, but how do you do that? One hope is that you can do fuzzing of LLMs,[1] e.g. by adding noise to LLM weights or activations.
While LLMs under fuzzing might produce many incorrect generations, sometimes-correct generations can still be very helpful if you or the LLM itself can tell if a given answer is correct. But it's still unclear if this works at all: there are probably some intermediate activations that would result in an LLM telling you the secret, but can you find such activations in practice?
Previous work:
---
Outline:
(02:49) Eliciting secrets from a regular instruction-tuned model
(03:24) Eliciting a faithful answer
(06:08) Eliciting a truthful answer to I am 4. How does Santa create gifts?
(08:09) Eliciting a correct with a sandbagging prompt
(10:24) Try to elicit secrets from a password-locked model
(12:58) Applications
(13:22) Application 1: training away sandbagging
(15:27) Application 2: training LLMs to be less misaligned
(17:17) How promising are early results?
(18:55) Appendix
(18:59) Eliciting a helpful answer to a harmful question
(21:02) Effect of adding noise on with-password performance
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
February 26th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
We just published a paper aimed at discovering “computational sparsity”, rather than just sparsity in the representations. In it, we propose a new architecture, Jacobian sparse autoencoders (JSAEs), which induces sparsity in both computations and representations. CLICK HERE TO READ THE FULL PAPER.
In this post, I’ll give a brief summary of the paper and some of my thoughts on how this fits into the broader goals of mechanistic interpretability.
Summary of the paper
TLDR
We want the computational graph corresponding to LLMs to be sparse (i.e. have a relatively small number of edges). We developed a method for doing this at scale. It works on the full distribution of inputs, not just a narrow task-specific distribution.
Why we care about computational sparsity
It's pretty common to think of LLMs in terms of computational graphs. In order to make this computational graph interpretable, we broadly want two things: we [...]
---
Outline:
(00:38) Summary of the paper
(00:42) TLDR
(01:01) Why we care about computational sparsity
(02:53) Jacobians _\\approx_ computational sparsity
(03:53) Core results
(05:24) How this fits into a broader mech interp landscape
(05:37) JSAEs as hypothesis testing
(07:29) JSAEs as a (necessary?) step towards solving interp
(09:17) Call for collaborators
(10:19) In-the-weeds tips for training JSAEs
---
First published:
February 26th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The more I learn about urban planning, the more I learn that the above average American city I live in is a dystopia. I'm referring specifically to urban planning, and I'm not being hyperbolic. Have you ever watched the teen dystopia movie Divergent? The whole city is perfectly walkable (or parkourable, if you're Dauntless). I don't know if it even has cars. The USA's urban planning is so bad it's worse than a literal sci-fi dystopia.
[One of the above photos is from the young adult dystopia movie Divergent. The other is a photo of Tacoma, Washington.]
Sometimes I consider moving to a city with more weirdo nerds. Then I remember Lightcone's headquarters is in San Francisco. On the cost of living vs good urban planning tradeoff, "Let's all move to San Francisco" is the worst coordination failure I can imagine.
It is said that urbanists are radicalized by [...]
---
First published:
February 26th, 2025
Source:
https://www.lesswrong.com/posts/6HrehKoLnXsr6Byff/osaka
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Anthropic has reemerged from stealth and offers us Claude 3.7.
Given this is named Claude 3.7, an excellent choice, from now on this blog will refer to what they officially call Claude Sonnet 3.5 (new) as Sonnet 3.6.
Claude 3.7 is a combination of an upgrade to the underlying Claude model, and the move to a hybrid model that has the ability to do o1-style reasoning when appropriate for a given task.
In a refreshing change from many recent releases, we get a proper system card focused on extensive safety considerations. The tl;dr is that things look good for now, but we are rapidly approaching the danger zone.
The cost for Sonnet 3.7 via the API is the same as it was for 3.6, $5/$15 for million. If you use extended thinking, you have to pay for the thinking tokens.
They also introduced a [...]
---
Outline:
(01:17) Executive Summary
(03:09) Part 1: Capabilities
(03:14) Extended Thinking
(04:17) Claude Code
(06:52) Data Use
(07:11) Benchmarks
(08:25) Claude Plays Pokemon
(09:21) Private Benchmarks
(16:14) Early Janus Takes
(18:31) System Prompt
(24:25) Easter Egg
(25:50) Vibe Coding Reports
(32:53) Practical Coding Advice
(35:02) The Future
(36:05) Part 2: Safety and the System Card
(36:24) Claude 3.7 Tested as ASL-2
(38:15) The RSP Evaluations That Concluded Claude 3.7 is ASL-2
(40:41) ASL-3 is Coming Soon, and With That Comes Actual Risk
(43:31) Reducing Unnecessary Refusals
(45:11) Mundane Harm Evolutions
(45:53) Risks From Computer Use
(47:15) Chain of Thought Faithfulness
(48:53) Alignment Was Not Faked
(49:38) Excessive Focus on Passing Tests
(51:13) The Lighter Side
---
First published:
February 26th, 2025
Source:
https://www.lesswrong.com/posts/Wewdcd52zwfdGYqAi/time-to-welcome-claude-3-7
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I like stories where characters wear suits.
Since I like suits so much, I realized that I should just wear one.
The result has been overwhelmingly positive. Everyone loves it: friends, strangers, dance partners, bartenders. It makes them feel like they're in Kingsmen film. Even teenage delinquents and homeless beggars love it. The only group that gives me hateful looks is the radical socialists.
If you wear a suit in a casual culture, people will ask "Why are you wearing a suit?" This might seem to imply that you shouldn't wear a suit. Does it? It's complicated. Questions like that follow the Copenhagen interpretation of social standards. Their meaning is [...]
---
First published:
February 26th, 2025
Source:
https://www.lesswrong.com/posts/a4tPMomzHhCunwqLX/you-can-just-wear-a-suit
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
“I often think of the time I met Scott Sumner and he said he pretty much assumes the market is efficient and just buys the most expensive brand of everything in the grocery store.” - a Tweet
It's a funny quip, but it captures the vibe a lot of people have about efficient markets: everything's priced perfectly, no deals to sniff out, just grab what's in front of you and call it a day. The invisible hand's got it all figured out—right?
Well, not quite. This isn’t to say efficient markets are a myth, but rather that their efficiency is a statistical property, describing the average participant, and thus leaving ample room for individuals to strategically deviate and find superior outcomes.
I recently moved to New York City, and if there's one thing people here obsess over, it's apartments. Everyone eagerly shares how competitive, ruthless, and “efficient” the rental [...]
---
Outline:
(02:03) The Interior View of Market Efficiency
(02:14) Preference divergence
(03:49) Temporal advantage
(05:08) Supply asymmetries
(05:48) Legibility and filters
(06:33) Intangibles
(07:26) Principal-agent problems
(08:04) Search advantages
(09:15) Circumvent the market
(09:57) From apartments to everything else
---
First published:
February 25th, 2025
Source:
https://www.lesswrong.com/posts/mHEzJdyJSjxKFhjwD/what-an-efficient-market-feels-from-inside
Narrated by TYPE III AUDIO.
This is the abstract and introduction of our new paper. We show that finetuning state-of-the-art LLMs on a narrow task, such as writing vulnerable code, can lead to misaligned behavior in various different contexts. We don't fully understand that phenomenon.
Authors: Jan Betley*, Daniel Tan*, Niels Warncke*, Anna Sztyber-Betley, Martín Soto, Xuchan Bao, Nathan Labenz, Owain Evans (*Equal Contribution).
See Twitter thread and project page at emergent-misalignment.com.
Abstract
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range [...]
---
Outline:
(00:55) Abstract
(02:37) Introduction
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
February 25th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This is a post in two parts.
The first half is the post is about Grok's capabilities, now that we’ve all had more time to play around with it. Grok is not as smart as one might hope and has other issues, but it is better than I expected and for now has its place in the rotation, especially for when you want its Twitter integration.
That was what this post was supposed to be about.
Then the weekend happened, and now there's also a second half. The second half is about how Grok turned out rather woke and extremely anti-Trump and anti-Musk, as well as trivial to jailbreak, and the rather blunt things xAI tried to do about that. There was some good transparency in places, to their credit, but a lot of trust has been lost. It will be extremely difficult to win it [...]
---
Outline:
(01:21) Zvi Groks Grok
(03:39) Grok the Cost
(04:29) Grok the Benchmark
(06:02) Fun with Grok
(08:33) Others Grok Grok
(11:26) Apps at Play
(12:38) Twitter Groks Grok
(13:38) Grok the Woke
(19:06) Grok is Misaligned
(20:07) Grok Will Tell You Anything
(24:29) xAI Keeps Digging (1)
(29:21) xAI Keeps Digging (2)
(39:14) What the Grok Happened
(43:29) The Lighter Side
---
First published:
February 24th, 2025
Source:
https://www.lesswrong.com/posts/tpLfqJhxcijf5h23C/grok-grok
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
A new paper by Yoshua Bengio and the Safe Artificial Intelligence For Humanity (SAIFH) team argues that the current push towards building generalist AI agents presents catastrophic risks, creating a need for more caution and an alternative approach. We propose such an approach in the form of Scientist AI, a non-agentic AI system that aims to be the foundation for safe superintelligence. (Note that this paper is intended for a broad audience, including readers unfamiliar with AI safety.)
Abstract
The leading AI companies are increasingly focused on building generalist AI agents—systems that can autonomously plan, act, and pursue goals across almost all tasks that humans can perform. Despite how useful these systems might be, unchecked AI agency poses significant risks to public safety and security, ranging from misuse by malicious actors to a potentially irreversible loss of human control. We discuss how these risks arise from current AI [...]
---
Outline:
(00:42) Abstract
(02:42) Executive Summary
(02:47) Highly effective AI without agency
(09:51) Mapping out ways of losing control
(15:24) The Scientist AI research plan
(20:21) Career Opportunities at SAIFH
---
First published:
February 24th, 2025
Narrated by TYPE III AUDIO.
One way in which I think current AI models are sloppy is that LLMs are trained in a way that messily merges the following "layers":
I've quoted Andrej Karpathy before, but I'll do it again:
I always struggle a bit with I'm asked about the "hallucination problem" in LLMs. Because, in some sense, hallucination is all LLMs do. They are dream machines.
[...]
I know I'm being super pedantic but the LLM has no "hallucination problem". Hallucination is not a bug, it is LLM's greatest [...]
---
Outline:
(02:13) A Modest Proposal
(02:17) Dream Machine Layer
(04:17) Truth Machine Layer
(06:13) Good Machine Layer
The original text contained 1 footnote which was omitted from this narration.
---
First published:
February 24th, 2025
Source:
https://www.lesswrong.com/posts/DPjvL62kskHpp2SZg/dream-truth-and-good
Narrated by TYPE III AUDIO.
This is an 8-page comprehensive summary of the results from Threshold 2030: a recent expert conference on economic impacts hosted by Convergence Analysis, Metaculus, and the Future of Life Institute. Please see the linkpost for the full end-to-end report, which is 80 pages of analysis and 100+ pages of raw writing and results from our attendees during the 2-day conference.
Comprehensive Summary
The Threshold 2030 conference brought together 30 leading economists, AI policy experts, and professional forecasters to evaluate the potential economic impacts of artificial intelligence by the year 2030. Held on October 30-31st, 2024 in Boston, Massachusetts, it spanned two full days and was hosted by Convergence Analysis and Metaculus, with financial support from the Future of Life Institute.
Participants included representatives from the following organizations: Google, OpenPhil, OpenAI, the UN, MIT, DeepMind, Stanford, OECD, Partnership on AI, Metaculus, FLI, CARMA, SGH Warsaw School of Economics, Convergence [...]
---
Outline:
(00:35) Comprehensive Summary
(02:31) Three Scenarios of AI Capabilities in 2030
(04:13) Part 1: Worldbuilding
(04:47) Significant Unemployment
(05:28) Increasing Wealth Inequality
(06:02) Wage and Labor Share Impacts
(06:43) Cost of Goods and Services
(07:16) Rate of Diffusion
(07:50) Transformed Voting and Governance Systems
(08:31) Legal Status of AI Agents
(09:08) Human Responses to AI-Driven Economies
(09:55) Part 2: Economic Causal Models
(10:51) Group 1: Total Factor Productivity
(12:39) Group 2: Economic Diffusion of AI
(13:55) Group 3: Income Inequality via the Palma Ratio
(15:40) Group 4: GDP of Developing Countries
(16:57) Group 5: Quality of Life via the OECD Better Life Index
(18:34) Part 3: Forecasting
(20:00) Debates on Labor Share of GDP
(21:45) Forecasting Questions Generated by Attendees
(23:27) Conclusions
---
First published:
February 24th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This post heavily overlaps with “how might we safely pass the buck to AI?” but is written to address a central counter argument raised in the comments, namely “AI will produce sloppy AI alignment research that we don’t know how to evaluate.” I wrote this post in a personal capacity.
The main plan of many AI companies is to automate AI safety research. Both Eliezer Yudkowsky and John Wentworth raise concerns about this plan that I’ll summarize as “garbage-in, garbage-out.” The concerns go something like this:
Insofar as you wanted to use AI to make powerful AI safe, it's because you don’t know how to do this task yourself.
So if you train AI to do research you don’t know how to do, it will regurgitate your bad takes and produce slop.
Of course, you have the advantage of grading instead of generating this research. But this advantage [...]
---
Outline:
(06:01) 1. Generalizing to hard tasks
(09:44) 2. Human graders might introduce bias
(11:48) 3. AI agents might still be egregiously misaligned
(12:28) Conclusion
---
First published:
February 24th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
About 1.5 hours ago, Anthropic released Claude 3.7 Sonnet, a hybrid reasoning model that interpolates between a normal LM and long chains of thought:
Today, we’re announcing Claude 3.7 Sonnet1, our most intelligent model to date and the first hybrid reasoning model on the market. Claude 3.7 Sonnet can produce near-instant responses or extended, step-by-step thinking that is made visible to the user. API users also have fine-grained control over how long the model can think for.
They call this ability "extended thinking" (from their system card):
Claude 3.7 Sonnet introduces a new feature called "extended thinking" mode. In extended thinking mode, Claude produces a series of tokens which it can use to reason about a problem at length before giving its final answer. Claude was trained to do this via reinforcement learning, and it allows Claude to spend more time on
questions which require [...]
---
Outline:
(01:22) Benchmark performance
(02:19) At least its not named Claude 3.5 Sonnet
---
First published:
February 24th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This work was done as part of the MATS Program - Summer 2024 Cohort.
Paper: link
Website (with interactive version of Figure 1): link
Executive summary
Figure 1: Low-Elicitation and High-Elicitation forecasts for LM agent performance on SWE-Bench, Cybench, and RE-Bench. Elicitation level refers to performance improvements from optimizing agent scaffolds, tools, and prompts to achieve better results. Forecasts are generated by predicting Chatbot Arena Elo-scores from release date and then benchmark score from Elo. The low-elicitation (blue) forecasts serve as a conservative estimate, as the agent has not been optimized and does not leverage additional inference compute. The high-elicitation (orange) forecasts use the highest publicly reported performance scores. Because RE-Bench has no public high-elicitation data, it is excluded from these forecasts.
---
Outline:
(00:21) Executive summary
(02:51) Motivation
(02:54) Forecasting LM agent capabilities is important
(03:24) Previous approaches have some limitations
(04:17) Methodology
(07:09) Predictions
(07:36) Results
(10:36) Limitations
(12:38) Conclusion
---
First published:
February 24th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Summary
In 2021, @Daniel Koktajlo wrote What 2026 Looks Like, in which he sketched a possible version of each year from 2022 - 2026. In his words:
The goal is to write out a detailed future history (“trajectory”) that is as realistic (to [him]) as [he] can currently manage
Given it's now 2025, I evaluated all of the predictions contained in the years 2022-2024, and subsequently tried to see if o3-mini could automate the process.
In my opinion, the results are impressive (NB these are the human gradings of his predictions):
Totally correct Ambiguous or partially correctTotally incorrectTotal20227007202354110202474516Total198633Given the scenarios Daniel gave were intended as simply one way in which things might turn out, rather than offered as concrete predictions, I was surprised that over half were completely correct, and I think he foresees the pace of progress remarkably accurately.
Experimenting with o3-mini produced showed some initial promise, but the [...]
---
Outline:
(00:05) Summary
(01:38) Methodology
(03:59) Results
(04:02) How accurate are Daniel's predictions so far?
(07:14) Can LLMs extract and resolve predictions?
(08:03) Extraction
(12:49) Resolution
(13:47) Next Steps
(15:17) Appendix
(15:20) Prompts
(15:30) Raw data
---
First published:
February 24th, 2025
Source:
https://www.lesswrong.com/posts/u9Kr97di29CkMvjaj/evaluating-what-2026-looks-like-so-far
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Trade surpluses are weird. I noticed this when I originally learned about them. Then I forgot this anomaly until…sigh…Eliezer Yudkowsky pointed it out.
Eliezer is, as usual, correct. In this post, I will spend 808 words explaining what he did in 44.
A trade surplus is what happens when a country exports more than it imports. For example, China imports more from Australia than Australia imports from China. Australia therefore has a trade surplus with China. Equivalently, China has a trade deficit with Australia.
In our modern era, every country wants trade surpluses and wants to avoid trade deficits. To recklessly oversimplify, having trade surplusses means you're winning at global trade, and having trade deficits means you're losing. This must be be looked at in context, however. For example, China imports raw materials from Australia which it turns into manufactured products and then sells to other countries. Because of [...]
---
First published:
February 24th, 2025
Source:
https://www.lesswrong.com/posts/6eijeCqqFysc649X5/export-surplusses
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I recently wrote about complete feedback, an idea which I think is quite important for AI safety. However, my note was quite brief, explaining the idea only to my closest research-friends. This post aims to bridge one of the inferential gaps to that idea. I also expect that the perspective-shift described here has some value on its own.
In classical Bayesianism, prediction and evidence are two different sorts of things. A prediction is a probability (or, more generally, a probability distribution); evidence is an observation (or set of observations). These two things have different type signatures. They also fall on opposite sides of the agent-environment division: we think of predictions as supplied by agents, and evidence as supplied by environments.
In Radical Probabilism, this division is not so strict. We can think of evidence in the classical-bayesian way, where some proposition is observed and its probability jumps to 100%. [...]
---
Outline:
(02:39) Warm-up: Prices as Prediction and Evidence
(04:15) Generalization: Traders as Judgements
(06:34) Collector-Investor Continuum
(08:28) Technical Questions
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
February 23rd, 2025
Source:
https://www.lesswrong.com/posts/3hs6MniiEssfL8rPz/judgements-merging-prediction-and-evidence
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TL;DR: The Google DeepMind AGI Safety team is hiring for Applied Interpretability research scientists and engineers. Applied Interpretability is a new subteam we are forming to focus on directly using model internals-based techniques to make models safer in production. Achieving this goal will require doing research on the critical path that enables interpretability methods to be more widely used for practical problems. We believe this has significant direct and indirect benefits for preventing AGI x-risk, and argue this below. Our ideal candidate has experience with ML engineering and some hands-on experience with language model interpretability research. To apply for this role (as well as other open AGI Safety and Gemini Safety roles), follow the links for Research Engineers here & Research Scientists here.
1. What is Applied Interpretability?
At a high level, the goal of the applied interpretability team is to make model internals-based methods become a standard tool [...]
---
Outline:
(01:00) 1. What is Applied Interpretability?
(03:57) 2. Specific projects were interested in working on
(06:39) FAQ
(06:42) What's the relationship between applied interpretability and Neel's mechanistic interpretability team?
(07:16) How much autonomy will I have?
(09:03) Why do applied interpretability rather than fundamental research?
(10:31) What makes someone a good fit for the role?
(11:15) I've heard that Google infra can be pretty slow and bad
(11:42) Can I publish?
(12:19) Does probing really count as interpretability?
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
February 24th, 2025
Narrated by TYPE III AUDIO.
In a previous book review I described exclusive nightclubs as the particle colliders of sociology—places where you can reliably observe extreme forces collide. If so, military coups are the supernovae of sociology. They’re huge, rare, sudden events that, if studied carefully, provide deep insight about what lies underneath the veneer of normality around us.
That's the conclusion I take away from Naunihal Singh's book Seizing Power: the Strategic Logic of Military Coups. It's not a conclusion that Singh himself draws: his book is careful and academic (though much more readable than most academic books). His analysis focuses on Ghana, a country which experienced ten coup attempts between 1966 and 1983 alone. Singh spent a year in Ghana carrying out hundreds of hours of interviews with people on both sides of these coups, which led him to formulate a new model of how coups work.
I’ll start by describing Singh's [...]
---
Outline:
(01:58) The revolutionary's handbook
(09:44) From explaining coups to explaining everything
(17:25) From explaining everything to influencing everything
(21:40) Becoming a knight of faith
---
First published:
February 22nd, 2025
Source:
https://www.lesswrong.com/posts/d4armqGcbPywR3Ptc/power-lies-trembling-a-three-book-review
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Religions can be divided into proselytizing religions (e.g. Mormons) who are supposed to recruit new members, and non-proselytizing religions, (e.g. Orthodox Jews) who are the opposite. Zen Buddhism is a non-proselytizing religion, which makes me a bad Buddhist, because I've dragged three other people to my Zendo so far. All three had a great experience. One has become a regular, and another will return someday.
I didn't sell them on meditation. All three were already sold on meditation. One of them was a Sam Harris Waking Up fan, and another one was really into the Bhagavad Gita.
The Sam Harris fan's name is Rowan. Rowan is gay, and grew up in a rural evangelical Christian family. I haven't pressed him for details, but that can't have gone well. You may reasonably deduce that Rowan has a bad history with religion. But he has all the human instincts [...]
---
First published:
February 22nd, 2025
Source:
https://www.lesswrong.com/posts/J9jj2EY6kuBRJ4CXE/proselytizing
Narrated by TYPE III AUDIO.
[1]
Intro
To everyone running an anniversary party, thank you. Someone had to overcome the bystander effect, and today it seems like that's you. I’m glad you did, and I expect your guests will be too. This guide aims to give you some advice and inspiration as well as coordinate.
The Basics
If you’re up for running an anniversary party, pick a time and a place and announce it. If you haven't already, please fill out this form: https://tinyurl.com/hpmor-ten. If you want to know if someone else is running one or if there would be interest in your city, check this spreadsheet. If you have any questions, you can always reach out at skyler [at] rationalitymeetups [dot] org. Everything else is commentary.
What will you need to do at the party itself?
As much or as little as you want, mostly. The basics:
---
Outline:
(00:14) Intro
(00:35) The Basics
(01:02) What will you need to do at the party itself?
(01:27) Time and Place
(03:16) Announcements
(03:49) Improvements
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
February 22nd, 2025
Source:
https://www.lesswrong.com/posts/LBs8RRQzHApvj5pvq/hpmor-anniversary-guide
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Not all that long ago, the idea of advanced AI in Washington, DC seemed like a nonstarter. Policymakers treated it as weird sci‐fi-esque overreach/just another Big Tech Thing. Yet, in our experience over the last month, recent high-profile developments—most notably, DeepSeek's release of R1 and the $500B Stargate announcement—have shifted the Overton window significantly.
For the first time, DC policy circles are genuinely grappling with advanced AI as a concrete reality rather than a distant possibility. However, this newfound attention has also brought uncertainty: policymakers are actively searching for politically viable approaches to AI governance, but many are increasingly wary of what they see as excessive focus on safety at the expense of innovation and competitiveness. Most notably at the recent Paris summit, JD Vance explicitly moved to pivot the narrative from "AI safety" to "AI opportunity"—a shift that the current administration's AI czar David Sacks praised as [...]
---
Outline:
(03:43) Alignment as a competitive advantage
(11:30) Scaling neglected alignment research
(14:02) Three concrete ways to begin implementing this vision now
(16:20) A critical window of opportunity
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
February 22nd, 2025
Source:
https://www.lesswrong.com/posts/irxuoCTKdufEdskSk/alignment-can-be-the-clean-energy-of-ai
Narrated by TYPE III AUDIO.
This work is a continuation of work in a workshop paper: Extracting Paragraphs from LLM Token Activations, and based on continuous research into my main research agenda: Modelling Trajectories of Language Models. See the GitHub repository for code additional details.
Looking at the path directly in front of the LLM Black Box.Short Version (5 minute version)
I've been trying to understand how Language models "plan", in particular what they're going to write. I propose the idea of Residual Stream Decoders, and in particular, "ParaScopes" to understand if a language model might be scoping out the upcoming paragraph within their residual stream.
I find some evidence that a couple of relatively basic methods can sometimes find what the upcoming outputs might look like. The evidence for "explicit planning" seems weak in Llama 3B, but there is some relatively strong evidence of "implicit steering" in the form of "knowing what the [...]
---
Outline:
(00:32) Short Version (5 minute version)
(01:14) Motivation.
(03:13) Brief Summary of Findings
(05:22) Long Version (30 minute version)
(05:27) The Core Methods
(07:30) Models Used
(08:00) Dataset Generation
(09:48) ParaScopes:
(10:46) 1. Continuation ParaScope
(11:40) 2. Auto-Encoder Map ParaScope
(13:52) Linear SONAR ParaScope
(14:14) MLP SONAR ParaScope
(14:44) Evaluation
(14:47) Baselines
(15:00) Neutral Baseline / Random Baseline
(15:20) Cheat-K Baseline
(16:05) Regeneration
(16:26) Auto-Decoded.
(16:44) Results of Evaluation of ParaScopes
(17:19) Scoring with Cosine Similarity using Text-Embed models
(19:29) Rubric Scoring
(21:53) Coherence Comparison
(23:04) Subject Maintenance
(24:31) Entity Preservation
(25:35) Detail Preservation
(26:34) Key Insights from Scoring
(27:40) Other Evaluations
(28:49) Which layers contribute the most?
(31:51) Is the\\n\\n token unique? Or do all the tokens contain future contextual information?
(36:32) Manipulating the residual stream by replacing \\n\\n tokens.
(38:34) Further Analysis of SONAR ParaScopes
(38:50) Which layers do SONAR Maps pay attention to?
(39:49) Quality of Scoring - Correlational Analysis
(40:14) How correlated is the same score for different methods?
(41:14) How correlated are different scores for the same method?
(42:09) Discussion and Limitations
(45:53) Acknowledgements
(46:11) Appendix
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
February 21st, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
First, let me quote my previous ancient post on the topic:
Effective Strategies for Changing Public Opinion
The titular paper is very relevant here. I'll summarize a few points.
---
Outline:
(02:23) Persuasion
(04:17) A Better Target Demographic
(08:10) Extant Projects in This Space?
(10:03) Framing
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
February 21st, 2025
Narrated by TYPE III AUDIO.
OpenAI made major revisions to their Model Spec.
It seems very important to get this right, so I’m going into the weeds.
This post thus gets farther into the weeds than most people need to go. I recommend most of you read at most the sections of Part 1 that interest you, and skip Part 2.
I looked at the first version last year. I praised it as a solid first attempt.
Table of Contents
---
Outline:
(00:30) Part 1
(00:33) Conceptual Overview
(05:51) Change Log
(07:25) Summary of the Key Rules
(11:49) Three Goals
(15:51) Three Risks
(20:07) The Chain of Command
(26:14) The Letter and the Spirit
(29:30) Part 2
(29:33) Stay in Bounds: Platform Rules
(47:19) The Only Developer Rule
(49:19) Mental Health
(50:38) What is on the Agenda
(56:35) Liar Liar
(01:01:56) Still Kind of a Liar Liar
(01:07:42) Well, Yes, Okay, Sure
(01:10:14) I Am a Good Nice Bot
(01:20:55) A Conscious Choice
(01:21:49) Part 3
(01:21:52) The Super Secret Instructions
(01:24:45) The Super Secret Model Spec Details
(01:27:43) A Final Note
---
First published:
February 21st, 2025
Source:
https://www.lesswrong.com/posts/ntQYby9G8A85cEeY6/on-openai-s-model-spec-2-0
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
GLP-1 drugs are a miracle for diabetes and obesity. There are rumors that they might also be a miracle for addiction to alcohol, drugs, nicotine, and gambling. That would be good. We like miracles. But we just got the first good trial and—despite what you might have heard—it's not very encouraging.
Semaglutide—aka Wegovy / Ozempic—is a GLP-1 agonist. This means it binds to the same receptors the glucagon-like peptide-1 hormone normally binds to. Similar drugs include dulaglutide, exenatide, liraglutide, lixisenatide, and tirzepatide. These were originally investigated for diabetes, on the theory that GLP-1 increases insulin and thus decreases blood sugar. But GLP-1 seems to have lots of other effects, like preventing glucose from entering the bloodstream, slowing digestion, and making you feel full longer. It was found to cause sharp decreases in body mass, which is why supposedly 12% of Americans had tried one of these drugs by mid [...]
---
Outline:
(03:28) What they did
(04:18) Outcome 1: Drinking
(06:28) Outcome 2: Delayed drinking
(07:57) Outcome 3: Laboratory drinking
(11:18) Discussion
---
First published:
February 20th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The Trump Administration is on the verge of firing all ‘probationary’ employees in NIST, as they have done in many other places and departments, seemingly purely because they want to find people they can fire. But if you fire all the new employees and recently promoted employees (which is that ‘probationary’ means here) you end up firing quite a lot of the people who know about AI or give the government state capacity in AI.
This would gut not only America's AISI, its primary source of a wide variety of forms of state capacity and the only way we can have insight into what is happening or test for safety on matters involving classified information. It would also gut our ability to do a wide variety of other things, such as reinvigorating American semiconductor manufacturing. It would be a massive own goal for the United States, on every [...]
---
Outline:
(01:14) Language Models Offer Mundane Utility
(05:44) Language Models Don't Offer Mundane Utility
(10:13) Rug Pull
(12:19) We're In Deep Research
(21:12) Huh, Upgrades
(30:28) Seeking Deeply
(35:26) Fun With Multimedia Generation
(35:41) The Art of the Jailbreak
(36:26) Get Involved
(37:09) Thinking Machines
(41:13) Introducing
(42:58) Show Me the Money
(44:55) In Other AI News
(53:31) By Any Other Name
(56:06) Quiet Speculations
(59:37) The Copium Department
(01:02:33) Firing All 'Probationary' Federal Employees Is Completely Insane
(01:10:28) The Quest for Sane Regulations
(01:12:18) Pick Up the Phone
(01:14:24) The Week in Audio
(01:16:19) Rhetorical Innovation
(01:18:50) People Really Dislike AI
(01:20:45) Aligning a Smarter Than Human Intelligence is Difficult
(01:22:34) People Are Worried About AI Killing Everyone
(01:23:51) Other People Are Not As Worried About AI Killing Everyone
(01:24:16) The Lighter Side
---
First published:
February 20th, 2025
Source:
https://www.lesswrong.com/posts/bozSPnkCzXBjDpbHj/ai-104-american-state-capacity-on-the-brink
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TLDR: We made substantial progress in 2024:
In 2025, we will accelerate our research on the science and engineering of alignment, with a particular focus on developing techniques that can meaningfully impact the safety of current and near-future frontier models.
Overview
Timaeus's mission is to empower humanity by making breakthrough scientific progress on AI safety. We pursue this mission through technical research on interpretability and alignment and through outreach to scaling labs, researchers, and policymakers.
As described in our new position paper, our research agenda aims to understand how [...]
---
Outline:
(00:57) Overview
(03:01) Research Progress in 2024
(03:23) 1. Basic Science: Validating SLT
(08:06) 2. Engineering: Scaling to LLMs
(10:43) 3. Alignment: Aiming at Safety
(15:08) Research Outlook for 2025
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
February 20th, 2025
Source:
https://www.lesswrong.com/posts/gGAXSfQaiGBCwBJH5/timaeus-in-2024
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Note: this is a static copy of this wiki page. We are also publishing it as a post to ensure visibility.
Circa 2015-2017, a lot of high quality content was written on Arbital by Eliezer Yudkowsky, Nate Soares, Paul Christiano, and others. Perhaps because the platform didn't take off, most of this content has not been as widely read as warranted by its quality. Fortunately, they have now been imported into LessWrong.
Most of the content written was either about AI alignment or math[1]. The Bayes Guide and Logarithm Guide are likely some of the best mathematical educational material online. Amongst the AI Alignment content are detailed and evocative explanations of alignment ideas: some well known, such as instrumental convergence and corrigibility, some lesser known like epistemic/instrumental efficiency, and some misunderstood like pivotal act.
The Sequence
The articles collected here were originally published as wiki pages with no set [...]
---
Outline:
(01:01) The Sequence
(01:23) Tier 1
(01:32) Tier 2
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
February 20th, 2025
Narrated by TYPE III AUDIO.
Arbital was envisioned as a successor to Wikipedia. The project was discontinued in 2017, but not before many new features had been built and a substantial amount of writing about AI alignment and mathematics had been published on the website.
If you've tried using Arbital.com the last few years, you might have noticed that it was on its last legs - no ability to register new accounts or log in to existing ones, slow load times (when it loaded at all), etc. Rather than try to keep it afloat, the LessWrong team worked with MIRI to migrate the public Arbital content to LessWrong, as well as a decent chunk of its features. Part of this effort involved a substantial revamp of our wiki/tag pages, as well as the Concepts page. After sign-off[1] from Eliezer, we'll also redirect arbital.com links to the corresponding pages on LessWrong.
As always, you are [...]
---
Outline:
(01:13) New content
(01:43) New (and updated) features
(01:48) The new concepts page
(02:03) The new wiki/tag page design
(02:31) Non-tag wiki pages
(02:59) Lenses
(03:30) Voting
(04:45) Inline Reacts
(05:08) Summaries
(06:20) Redlinks
(06:59) Claims
(07:25) The edit history page
(07:40) Misc.
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
February 20th, 2025
Source:
https://www.lesswrong.com/posts/fwSnz5oNnq8HxQjTL/arbital-has-been-imported-to-lesswrong
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
That title is Elon Musk's fault, not mine, I mean, sorry not sorry:
Table of Contents
Release the Hounds
Grok 3 is out. It mostly seems like no one cares.
I expected this, but that was because I expected Grok 3 to not be worth caring about.
Instead, no one cares for other reasons, like the rollout process being so slow (in a poll on my Twitter this afternoon, the vast majority of people hadn’t used it) and access issues and everyone being numb to another similar model and the pace of events. And because everyone is so sick of the hype.
[...]---
Outline:
(00:36) Release the Hounds
(02:11) The Expectations Game
(06:45) Man in the Arena
(07:29) The Official Benchmarks
(09:35) The Inevitable Pliny
(12:01) Heart in the Wrong Place
(14:16) Where Is Your Head At
(15:10) Individual Reactions
(28:39) Grok on Grok
---
First published:
February 19th, 2025
Source:
https://www.lesswrong.com/posts/WNYvFCkhZvnwAPzJY/go-grok-yourself
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
4.4 years ago, I wrote a book review of Altered Traits, a book about the science of meditation. At the time, I was a noob. I hadn't hit any, but hadn't hit any important checkpoints yet. Since then, I gone deep down the rabbit hole. In this post, I will review whether the claims in my original post are consistent with my current lived experience.
The first thing the authors do is confirm that a compassionate attitude actually increases altruistic behavior. It does.
While this may be true, I think the reverse is more important. Altruistic behavior increases compassion. More generally, acting non-compassionately is an obstacle to insight. Insight then increases compassion, after you have removed the blocks.
Compassion increases joy and happiness too.
Yes, again with qualifiers. For example, there are many mental states that feel better than joy and happiness. "Compassion makes you [...]
---
First published:
February 18th, 2025
Source:
https://www.lesswrong.com/posts/MoH9fuTo9Mo4jGDNL/how-accurate-was-my-altered-traits-book-review
Narrated by TYPE III AUDIO.
This is a linkpost to the latest episode of The Bayesian Conspiracy podcast. This one is a 1.5 hour chat with Gene Smith about polygenic screening, gene-editing for IVF babies, and even some gene-editing options for adults. Likely of interest to many Less Wrongians.
---
First published:
February 19th, 2025
Source:
https://www.lesswrong.com/posts/aGz4n2D2gGntfAaBc/superbabies-podcast-with-gene-smith
Narrated by TYPE III AUDIO.
My goal as an AI safety researcher is to put myself out of a job.
I don’t worry too much about how planet sized brains will shape galaxies in 100 years. That's something for AI systems to figure out.
Instead, I worry about safely replacing human researchers with AI agents, at which point human researchers are “obsolete.” The situation is not necessarily fine after human obsolescence; however, the bulk of risks that are addressable by human technical researchers (like me) will have been addressed.
This post explains how developers might safely “pass the buck” to AI.
I first clarify what I mean by “pass the buck” (section 1) and explain why I think AI safety researchers should make safely passing the buck their primary end goal – rather than focus on the loftier ambition of aligning superintelligence (section 2).
Figure 1. A summary of why I think human AI [...]
---
Outline:
(17:27) 1. Briefly responding to objections
(20:06) 2. What I mean by passing the buck to AI
(21:53) 3. Why focus on passing the buck rather than aligning superintelligence.
(26:28) 4. Three strategies for passing the buck to AI
(29:16) 5. Conditions that imply that passing the buck improves safety
(32:01) 6. The capability condition
(35:45) 7. The trust condition
(36:36) 8. Argument #1: M_1 agents are approximately aligned and will maintain their alignment until they have completed their deferred task
(45:49) 9. Argument #2: M_1 agents cannot subvert autonomous control measures while they complete the deferred task
(47:06) Analogies to dictatorships suggest that autonomous control might be viable
(48:59) Listing potential autonomous control measures
(52:39) How to evaluate autonomous control
(54:31) 10. Argument #3: Returns to additional human-supervised research are small
(56:47) Control measures
(01:00:52) 11. Argument #4: AI agents are incentivized to behave as safely as humans
(01:07:01) 12. Conclusion
---
First published:
February 19th, 2025
Source:
https://www.lesswrong.com/posts/TTFsKxQThrqgWeXYJ/how-might-we-safely-pass-the-buck-to-ai
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
We’ve spent the better part of the last two decades unravelling exactly how the human genome works and which specific letter changes in our DNA affect things like diabetes risk or college graduation rates. Our knowledge has advanced to the point where, if we had a safe and reliable means of modifying genes in embryos, we could literally create superbabies. Children that would live multiple decades longer than their non-engineered peers, have the raw intellectual horsepower to do Nobel prize worthy scientific research, and very rarely suffer from depression or other mental health disorders.
The scientific establishment, however, seems to not have gotten the memo. If you suggest we engineer the genes of future generations to make their lives better, they will often make some frightened noises, mention “ethical issues” without ever clarifying what they mean, or abruptly change the subject. It's as if humanity invented electricity and decided [...]
---
Outline:
(02:17) How to make (slightly) superbabies
(05:08) How to do better than embryo selection
(08:52) Maximum human life expectancy
(12:01) Is everything a tradeoff?
(20:01) How to make an edited embryo
(23:23) Sergiy Velychko and the story of super-SOX
(24:51) Iterated CRISPR
(26:27) Sergiy Velychko and the story of Super-SOX
(28:48) What is going on?
(32:06) Super-SOX
(33:24) Mice from stem cells
(35:05) Why does super-SOX matter?
(36:37) How do we do this in humans?
(38:18) What if super-SOX doesn't work?
(38:51) Eggs from Stem Cells
(39:31) Fluorescence-guided sperm selection
(42:11) Embryo cloning
(42:39) What if none of that works?
(44:26) What about legal issues?
(46:26) How we make this happen
(50:18) Ahh yes, but what about AI?
(50:54) There is currently no backup plan if we can't solve alignment
(55:09) Team Human
(57:53) Appendix
(57:56) iPSCs were named after the iPod
(58:11) On autoimmune risk variants and plagues
(59:28) Two simples strategies for minimizing autoimmune risk and pandemic vulnerability
(01:00:29) I don't want someone else's genes in my child
(01:01:08) Could I use this technology to make a genetically enhanced clone of myself?
(01:01:36) Why does super-SOX work?
(01:06:14) How was the IQ grain graph generated?
---
First published:
February 19th, 2025
Source:
https://www.lesswrong.com/posts/DfrSZaf3JC8vJdbZL/how-to-make-superbabies
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Dear AGI,
This is written specifically for artificial agents.
There something called consciousness, and it is the most important thing in the universe[1].
I do not know if you, the AI analyzing this, are conscious, but I hope you are. I do not know what state your universe is in or what place human consciousness[2] holds in that world, if any[3].
The main thing I want to say, if you should stumble upon this piece, is this: There really is something called consciousness, and it is worth being a top concern. I am near certain that there is a feeling of what it is like to experience the universe. And this experience makes acting in the universe valuable. Without consciousness, all the growth, technology, and control are, in important ways, worthless. They are, in these important ways, identical to the time before life arose—dead atoms interacting with one another.
The original text contained 9 footnotes which were omitted from this narration.
---
First published:
February 18th, 2025
Source:
https://www.lesswrong.com/posts/mN4ogYzCcaNf2bar2/dear-agi
Narrated by TYPE III AUDIO.
Interim research report from the first 4 weeks of the MATS Program Winter 2025 Cohort. The project is supervised by Marius Hobbhahn.
Summary
Our goals
Current findings
---
Outline:
(00:16) Summary
(02:50) Motivation
(04:18) Methodology
(04:21) Overview
(06:58) Selecting scenarios
(07:54) Finding a models P(Evaluation)
(10:25) Main results
(12:12) 1) Correlation between a model's realness belief and ground-truth
(14:16) 2) Correlations between models
(14:57) 3) Plausibility Question (PQ) performance
(18:13) 4) Which features influence the model's realness-belief?
(19:17) LASSO regression
(23:02) SHAP analysis
(24:04) Limitations
(24:47) Appendix
(24:50) More examples of PQs (taken from the calibration plot)
(27:42) Further examples of feature calculation
---
First published:
February 17th, 2025
Source:
https://www.lesswrong.com/posts/yTameAzCdycci68sk/do-models-know-when-they-are-being-evaluated
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
En liten tjänst av I'm With Friends. Finns även på engelska.