250 avsnitt • Längd: 20 min • Dagligen
Audio narrations of LessWrong posts.
The podcast LessWrong (30+ Karma) is created by LessWrong. The podcast and the artwork on this page are embedded on this page using the public podcast feed (RSS).
In the future, we will want to use powerful AIs on critical tasks such as doing AI safety and security R&D, dangerous capability evaluations, red-teaming safety protocols, or monitoring other powerful models. Since we care about models performing well on these tasks, we are worried about sandbagging: that if our models are misaligned [1], they will intentionally underperform.
Sandbagging is crucially different from many other situations with misalignment risk, because it involves models purposefully doing poorly on a task, rather than purposefully doing well. When people talk about risks from overoptimizing reward functions (e.g. as described in What failure looks like), the concern is that the model gets better performance (according to some metric) than an aligned model would have. And when they talk about scheming, the concern is mostly that the model gets performance that is as good as an aligned model despite not being aligned. When [...]
---
Outline:
(03:39) Sandbagging can cause a variety of problems
(07:50) Training makes sandbagging significantly harder for the AIs
(09:30) Training on high-quality data can remove sandbagging
(12:13) If off-policy data is low-quality, on-policy data might help
(14:22) Models might subvert training via exploration hacking
(18:31) Off-policy data could mitigate exploration hacking
(23:35) Quick takes on other countermeasures
(25:20) Conclusion and prognosis
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
May 8th, 2025
Narrated by TYPE III AUDIO.
For months, I had the feeling: something is wrong. Some core part of myself had gone missing.
I had words and ideas cached, which pointed back to the missing part.
There was the story of Benjamin Jesty, a dairy farmer who vaccinated his family against smallpox in 1774 - 20 years before the vaccination technique was popularized, and the same year King Louis XV of France died of the disease.
There was another old post which declared “I don’t care that much about giant yachts. I want a cure for aging. I want weekend trips to the moon. I want flying cars and an indestructible body and tiny genetically-engineered dragons.”.
There was a cached instinct to look at certain kinds of social incentive gradient, toward managing more people or growing an organization or playing social-political games, and say “no, it's a trap”. To go… in a different direction, orthogonal [...]
---
Outline:
(01:19) In Search of a Name
(04:23) Near Mode
---
First published:
May 8th, 2025
Source:
https://www.lesswrong.com/posts/Wg6ptgi2DupFuAnXG/orienting-toward-wizard-power
Narrated by TYPE III AUDIO.
Your voice has been heard. OpenAI has ‘heard from the Attorney Generals’ of Delaware and California, and as a result the OpenAI nonprofit will retain control of OpenAI under their new plan, and both companies will retain the original mission.
Technically they are not admitting that their original plan was illegal and one of the biggest thefts in human history, but that is how you should in practice interpret the line ‘we made the decision for the nonprofit to retain control of OpenAI after hearing from civic leaders and engaging in constructive dialogue with the offices of the Attorney General of Delaware and the Attorney General of California.’
Another possibility is that the nonprofit board finally woke up and looked at what was being proposed and how people were reacting, and realized what was going on.
The letter ‘not for private gain’ that was recently sent [...]
---
Outline:
(01:08) The Mask Stays On?
(04:20) Your Offer is (In Principle) Acceptable
(08:32) The Skeptical Take
(15:14) Tragedy in the Bay
(17:04) The Spirit of the Rules
---
First published:
May 7th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Some people (the “Boubas”) don’t like “chemicals” in their food. But other people (the “Kikis”) are like, “uh, everything is chemicals, what do you even mean?”
The Boubas are using the word “chemical” differently than the Kikis, and the way they’re using it is simultaneously more specific and less precise than the way the Kikis use it. I think most Kikis implicitly know this, but their identities are typically tied up in being the kind of person who “knows what ‘chemical’ means”, and… you’ve gotta use that kind of thing whenever you can, I guess?
There is no single privileged universally-correct answer to the question “what does ‘chemical’ mean?”, because the Boubas exist and are using the word differently than Kikis, and in an internally-consistent (though vague) way.
The Kikis are, generally speaking, much more knowledgeable about the details of chemistry. They might hang out [...]
---
First published:
May 6th, 2025
Source:
https://www.lesswrong.com/posts/6qdBkd3GS4Qd6YJ3s/it-s-well-actually-all-the-way-down
Linkpost URL:
https://www.benwr.net/2025/05/05/well-actually-all-the-way-down.html
Narrated by TYPE III AUDIO.
I am Jason Green-Lowe, the executive director of the Center for AI Policy (CAIP). Our mission is to directly convince Congress to pass strong AI safety legislation. As I explain in some detail in this post, I think our organization has been doing extremely important work, and that we’ve been doing well at it. Unfortunately, we have been unable to get funding from traditional donors to continue our operations. If we don’t get more funding in the next 30 days, we will have to shut down, which will damage our relationships with Congress and make it harder for future advocates to get traction on AI governance. In this post, I explain what we’ve been doing, why I think it's valuable, and how your donations could help.
This is the first post in what I expect will be a 3-part series. The first post focuses on CAIP's particular need [...]
---
Outline:
(01:33) OUR MISSION AND STRATEGY
(02:59) Our Model Legislation
(04:17) Direct Meetings with Congressional Staffers
(05:20) Expert Panel Briefings
(06:16) AI Policy Happy Hours
(06:43) Op-Eds & Policy Papers
(07:21) Grassroots & Grasstops Organizing
(09:13) Whats Unique About CAIP?
(10:26) OUR ACCOMPLISHMENTS
(10:29) Quantifiable Outputs
(11:20) Changing the Media Narrative
(12:23) Proof of Concept
(13:44) Outcomes -- Congressional Engagement
(18:29) Context
(19:54) OUR PROPOSED POLICIES
(19:57) Mandatory Audits for Frontier AI
(21:23) Liability Reform
(22:32) Hardware Monitoring
(24:10) Emergency Powers
(25:31) Further Details
(25:41) RESPONSES TO COMMON POLICY OBJECTIONS
(25:46) 1. Why not push for a ban or pause on superintelligence research?
(30:16) 2. Why not support bills that have a better chance of passing this year, like funding for NIST or NAIRR?
(32:29) 3. If Congress is so slow to act, why should anyone be working with Congress at all? Why not focus on promoting state laws or voluntary standards?
(35:09) 4. Why would you push the US to unilaterally disarm? Don't we instead need a global treaty regulating AI (or subsidies for US developers) to avoid handing control of the future to China?
(37:24) 5. Why haven't you accomplished your mission yet? If your organization is effective, shouldn't you have passed some of your legislation by now, or at least found some powerful Congressional sponsors for it?
(40:56) OUR TEAM
(41:53) Executive Director
(44:03) Government Relations Team
(45:12) Policy Team
(46:08) Communications Team
(47:29) Operations Team
(48:11) Personnel Changes
(48:48) OUR PLAN IF FUNDED
(51:58) OUR FUNDING SITUATION
(52:02) Our Expenses & Runway
(53:01) No Good Way to Cut Costs
(55:22) Our Revenue
(57:01) Surprise Budget Deficit
(59:00) The Bottom Line
---
First published:
May 7th, 2025
Source:
https://www.lesswrong.com/posts/J7Ju6t6QCpgbnYx4D/please-donate-to-caip-post-1-of-3-on-ai-governance
Narrated by TYPE III AUDIO.
The UK's AI Security Institute published its research agenda yesterday. This post gives more details about how the Alignment Team is thinking about our agenda.
Summary: The AISI Alignment Team focuses on research relevant to reducing risks to safety and security from AI systems which are autonomously pursuing a course of action which could lead to egregious harm, and which are not under human control. No known technical mitigations are reliable past AGI.
Our plan is to break down promising alignment agendas by developing safety case sketches. We'll use these sketches to identify specific holes and gaps in current approaches. We expect that many of these gaps can be formulated as well-defined subproblems within existing fields (e.g., theoretical computer science). By identifying researchers with relevant expertise who aren't currently working on alignment and funding their efforts on these subproblems, we hope to substantially increase parallel progress on alignment.
[...]
---
Outline:
(01:41) 1. Why safety case-oriented alignment research?
(03:33) 2. Our initial focus: honesty and asymptotic guarantees
(07:07) Example: Debate safety case sketch
(08:58) 3. Future work
(09:02) Concrete open problems in honesty
(12:13) More details on our empirical approach
(14:23) Moving beyond honesty: automated alignment
(15:36) 4. List of open problems we'd like to see solved
(15:53) 4.1 Empirical problems
(17:57) 4.2 Theoretical problems
(21:23) Collaborate with us
---
First published:
May 7th, 2025
Source:
https://www.lesswrong.com/posts/tbnw7LbNApvxNLAg8/uk-aisi-s-alignment-team-research-agenda
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Another note: Just yesterday, the same day this article was released, the New York Times put this out: Universal Antivenom May Grow Out of Man Who Let Snakes Bite Him 200 Times. I was scooped! Somewhat. I added an addendum section discussing this paper at the bottom.
Introduction
There has been a fair bit of discussion over this recent ‘creating binders against snake venom protein’ paper from the Baker Lab that came out earlier this year, including this article from Derek Lowe.
For a quick recap of the paper: the authors use RFDiffusion (a computational tool for generating proteins from scratch) to design proteins that bind to neurotoxic protein found in snake venom, preventing it from interacting with the body. They offer structural characterization results to show binding between their created protein binder and the protein in question (three-finger toxins), and in-vivo results in mice demonstrating that their [...]
---
Outline:
(00:41) Introduction
(02:04) The dismal state of antivenom production
(05:33) A primer on snake venom heterogeneity
(13:03) A primer on snake antivenom
(19:59) Do computationally designed antivenoms actually solve anything?
(27:42) An addendum: the NYT article over universal antivenoms
---
First published:
May 6th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Executive summary
Geopolitics
United States
US economy
A recommendation for those in the US: Because imports are slowing, especially from China, shortages of some imported goods are expected to become noticeable in the coming weeks and to become substantial around August. If tariffs continue, some of these shortages may [...]
---
Outline:
(00:16) Executive summary
(01:02) Geopolitics
(01:06) United States
---
First published:
May 6th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Introduction
Soon after we released Not All Language Model Features Are One-Dimensionally Linear, I started working with @Logan Riggs and @Jannik Brinkmann on a natural followup to the paper: could we build a variant of SAEs that could find multi-dimensional features directly, instead of needing to cluster SAE latents post-hoc like we did in the paper.
We worked on this for a few months last summer and tried a bunch of things. Unfortunately, none of our results were that compelling, and eventually our interest in the project died down and we didn’t publish our (mostly negative) results. Recently, multiple people (@Noa Nabeshima , @chanind, Goncalo Paulo) said they were interested in working on SAEs that could find multi-dimensional features, so I decided I would write up what we tried.
At this point the results are almost a year old, but I think the overall narrative should still [...]
---
Outline:
(00:10) Introduction
(02:32) Group SAEs
(03:23) Synthetic Circles Experiments
(07:15) Training Group SAEs on GPT-2
(07:27) High level metrics
(09:28) Do the Group SAEs Capture Known Circular Subspaces
(11:46) Other Things We Tried
(12:03) Experimenting with learned groups
(12:08) Motivation and Ideas
(15:43) Learned Group Space
(18:13) Conclusion
---
First published:
May 6th, 2025
Source:
https://www.lesswrong.com/posts/jKKbRKuXNaLujnojw/untitled-draft-okbt
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The OpenAI Board has an updated plan for evolving OpenAI's structure.
---
First published:
May 5th, 2025
Source:
https://www.lesswrong.com/posts/28d6TmCT4v7tErihR/nonprofit-to-retain-control-of-openai
Narrated by TYPE III AUDIO.
---
Outline:
(01:53) Zuckerberg Tells it to Thompson
(05:21) He's Still Defending Llama 4
(05:50) Big Meta Is Watching You
(07:00) Zuckerberg Tells it to Patel
(14:46) When You Need a Friend
(17:52) Perhaps That Was All a Bit Harsh
---
First published:
May 6th, 2025
Source:
https://www.lesswrong.com/posts/QNkcRAzwKYGpEb8Nj/zuckerberg-s-dystopian-ai-vision
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Audio note: this article contains 61 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
A lot of our work involves "redunds". A random variable <span>_Gamma_</span> is a(n exact) redund over two random variables <span>_X_1, X_2_</span> exactly when both
<span>_X_1 rightarrow X_2 rightarrow Gamma_</span>
<span>_X_2 rightarrow X_1 rightarrow Gamma_</span>
Conceptually, these two diagrams say that <span>_X_1_</span> gives exactly the same information about <span>_Gamma_</span> as all of <span>_X_</span>, and <span>_X_2_</span> gives exactly the same information about <span>_Gamma_</span> as all of <span>_X_</span>; whatever information <span>_X_</span> contains about <span>_Gamma_</span> is redundantly represented in <span>_X_1_</span> and <span>_X_2_</span>. Unpacking the diagrammatic notation and simplifying a little, the diagrams say <span>_P[Gamma|X_1] = P[Gamma|X_2] = P[Gamma|X]_</span> for all <span>_X_</span> such that <span>_P[X] > 0_</span>.
The exact redundancy conditions are too restrictive to be of much practical relevance, but we are [...]
---
Outline:
(02:31) What We Want For The Bounty
(04:29) Some Intuition From The Exact Case
(05:57) Why We Want This
---
First published:
May 6th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
For people who care about falsifiable stakes rather than vibes
TL;DR
All timeline arguments ultimately turn on five quantitative pivots. Pick optimistic answers to three of them and your median forecast collapses into the 2026–2029 range; pick pessimistic answers to any two and you drift past 2040. The pivots (I think) are:
The rest of this post traces how the canonical short‑timeline narrative AI 2027 and the long‑timeline essays by Ege Erdil and Zhendong Zheng + Arjun Ramani diverge on each hinge [...]
---
Outline:
(00:16) TL;DR
(01:31) Shared premises
(01:57) Hinge #1: Which curve do we extrapolate?
(04:00) Hinge #2: Can software‑only recursive self‑improvement outrun atoms?
(06:07) Hinge #3: How efficient (and how sudden) is the leap from compute to economic value?
(07:34) Hinge #4: Must we automate everything, or is half enough?
(08:56) Hinge #5: Alignment‑driven and institutional drag
(10:10) Dependency Structure
The original text contained 1 footnote which was omitted from this narration.
---
First published:
May 6th, 2025
Source:
https://www.lesswrong.com/posts/45oxYwysFiqwfKCcN/untitled-draft-keg3
Narrated by TYPE III AUDIO.
Last week I covered that GPT-4o was briefly an (even more than usually) absurd sycophant, and how OpenAI responded to that.
Their explanation at that time was paper thin. It didn’t tell us much that we did not already know, and seemed to suggest they had learned little from the incident.
Rolling Stone has a write-up of some of the people whose delusions got reinforced by ChatGPT, which has been going on for a while – this sycophancy incident made things way worse but the pattern isn’t new. Here's some highlights, but the whole thing is wild anecdotes throughout, and they point to a ChatGPT induced psychosis thread on Reddit. I would love to know how often this actually happens.
Table of Contents
---
Outline:
(00:51) There's An Explanation For (Some Of) This
(02:50) What Have We Learned?
(10:09) What About o3 The Lying Liar?
(12:21) o3 The Source Fabricator
(14:25) There Is Still A Lot We Don't Know
(20:43) You Must Understand The Logos
(25:17) Circling Back
(28:11) The Good News
---
First published:
May 5th, 2025
Source:
https://www.lesswrong.com/posts/KyndnEA7NMFrDKtJG/gpt-4o-sycophancy-post-mortem
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
arXiv | project page | Authors: Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, Gao Huang
This paper from Tsinghua find that RL on verifiable rewards (RLVR) just increases the frequency at which capabilities are sampled, rather than giving a base model new capabilities. To do this, they compare pass@k scores between a base model and an RLed model. Recall that pass@k is the percentage of questions a model can solve at least once given k attempts at each question.
Main result: On a math benchmark, an RLed model (yellow) has much better raw score / pass@1 than the base model (black), but lower pass@256! The authors say that RL prunes away reasoning pathways from the base model, but sometimes reasoning pathways that are rarely sampled end up being useful for solving the problem. So RL “narrows the reasoning [...]
---
Outline:
(01:31) Further results
(03:33) Limitations
(04:15) Takeaways
---
First published:
May 5th, 2025
Linkpost URL:
https://arxiv.org/abs/2504.13837
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
A corollary of Sutton's Bitter Lesson is that solutions to AI safety should scale with compute. Let me list a few examples of research directions that aim at this kind of solution:
[I]n the short [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
May 5th, 2025
Narrated by TYPE III AUDIO.
Reproducing a result from recent work, we study a Gemma 3 12B instance trained to take risky or safe options; the model can then report its own risk tolerance. We find that:
---
Outline:
(00:14) Summary
(01:57) Introduction
(03:18) Reproducing LLM Risk Awareness on Gemma 3 12B IT
(03:24) Initial Results:
(05:59) It's Just A Steering Vector:
(07:14) Can We Directly Train the Vector?
(08:58) Is The Awareness Mechanism Different?
(12:22) Risky Behavior Backdoor
(14:41) Investigating Further
(15:30) en-US-AvaMultilingualNeural__ Bar graph titled Validation Accuracy by Model comparing different backdoor models.
(15:50) Steering Vectors Can Implement Conditional Behavior
---
First published:
May 2nd, 2025
Source:
https://www.lesswrong.com/posts/m8WKfNxp9eDLRkCk9/interim-research-report-mechanisms-of-awareness
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
We’ve been looking for joinable endeavors in AI safety outreach over the past weeks and would like to share our findings with you. Let us know if we missed any and we’ll add them to the list.
For comprehensive directories of AI safety communities spanning general interest, technical focus, and local chapters, check out https://www.aisafety.com/communities and https://www.aisafety.com/map. If you're uncertain where to start, https://aisafety.quest/ offers personalized guidance.
ControlAI
ControlAI started out as a think tank. Over the past months, they developed a theory of change for how to prevent ASI development (“Direct Institutional Plan”). As a pilot campaign they cold-mailed British MPs and Lords to talk to them about AI risk. So far, they talked to 70 representatives of which 31 agreed to publicly stand against ASI development.
Control AI is also supporting grassroots activism: On https://controlai.com/take-action , you can find templates to send to your representatives yourself, as [...]
---
Outline:
(00:36) ControlAI
(01:44) EncodeAI
(02:17) PauseAI
(03:31) StopAI
(03:48) Collective Action for Existential Safety (CAES)
(04:35) Call to action
---
First published:
May 4th, 2025
Source:
https://www.lesswrong.com/posts/hmds9eDjqFaadCk4F/overview-ai-safety-outreach-grassroots-orgs
Narrated by TYPE III AUDIO.
I contributed one (1) task to HCAST, which was used in METR's Long Tasks paper. This gave me some thoughts I feel moved to share.
Regarding Baselines and Estimates
METR's tasks have two sources for how long they take humans: most of those used in the paper were Baselined using playtesters under persistent scrutiny, and some were Estimated by METR.
I don’t quite trust the Baselines. Baseliners were allowed/incentivized to drop tasks they weren’t making progress with, and were – mostly, effectively, there's some nuance here I’m ignoring – cut off at the eight-hour mark; Baseline times were found by averaging time taken for successful runs; this suggests Baseline estimates will be biased to be at least slightly too low, especially for more difficult tasks.[1]
I really, really don’t trust the Estimates[2]. My task was never successfully Baselined, so METR's main source for how long it would take – [...]
---
Outline:
(00:22) Regarding Baselines and Estimates
(02:23) Regarding Task Privacy
(04:00) In Conclusion
The original text contained 9 footnotes which were omitted from this narration.
---
First published:
May 4th, 2025
Narrated by TYPE III AUDIO.
Utilitarianism implies that if we build an AI that successfully maximizes utility/value, we should be ok with it replacing us. Sensible people add caveats related to how hard it’ll be to determine the correct definition of value or check whether the AI is truly optimizing it.
As someone who often passionately rants against the AI successionist line of thinking, the most common objection I hear is "why is your definition of value so arbitrary as to stipulate that biological meat-humans are necessary" This is missing the crux—I agree such a definition of moral value would be hard to justify.
Instead, my opposition to AI successionism comes from a preference toward my own kind. This is hardwired in me from biology. I prefer my family members to randomly-sampled people with similar traits. I would certainly not elect to sterilize or kill my family members so that they could be replaced [...]
---
First published:
May 4th, 2025
Source:
https://www.lesswrong.com/posts/MDgEfWPrvZdmPZwxf/why-i-am-not-a-successionist
Narrated by TYPE III AUDIO.
AI 2027 is a Bet Against Amdahl's Law was my attempt to summarize and analyze "the key load-bearing arguments AI 2027 presents for short timelines". There were a lot of great comments – every time I post on LW is a learning experience. In this post, I'm going to summarize the comments and present some resulting updates to my previous analysis. I'm also using this post to address some comments that I didn't respond to in the original post, because the comment tree was becoming quite sprawling.
TL;DR: my previous post reflected a few misunderstandings of the AI 2027 model, in particular in how to interpret "superhuman AI researcher". Intuitively, I still have trouble accepting the very high speedup factors contemplated in the model, but this could be a failure of my intuition, and I don't have strong evidence to present. New cruxes:
---
Outline:
(01:27) Confusion Regarding Milestone Definitions
(05:23) Someone Should Flesh Out What Goes Into AI R&D
(09:35) How Long Will it Take To Reach the Early Milestones?
(13:16) Broad Progress on Real-World Tasks Is a Crux
(15:50) Does Hofstadters Law Apply?
(19:46) What Would Be the Impact of an SAR / SIAR?
(22:05) Conclusions
The original text contained 1 footnote which was omitted from this narration.
---
First published:
May 2nd, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Strength
In 1997, with Deep Blue's defeat of Kasparov, computers surpassed human beings at chess. Other games have fallen in more recent years: Go, Starcraft, and League of Legends among them. AI is superhuman at these pursuits, and unassisted human beings will never catch up. The situation looks like this:[1]
At chess, AI is much better than the very best humansThe average serious chess player is pretty good (1500), the very best chess player is extremely good (2837), and the best AIs are way, way better (3700). Even Deep Blue's estimated Elo is about 2850 - it remains competitive with the best humans alive.
A natural way to describe this situation is to say that AI is superhuman at chess. No matter how you slice it, that's true.
For other activities, though, it's a lot murkier. Take radiology, for example:
Graph derived from figure one of CheXNet: Radiologist-Level Pneumonia Detection [...]---
Outline:
(00:10) Strength
(02:28) Effort
(04:35) And More
(06:36) Beyond Superhuman
The original text contained 1 footnote which was omitted from this narration.
---
First published:
May 3rd, 2025
Source:
https://www.lesswrong.com/posts/R7r8Zz3uRyjeaZbss/superhuman-isn-t-well-specified
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
(Disclaimer: Post written in a personal capacity. These are personal hot takes and do not in any way represent my employer's views.)
TL;DR: I do not think we will produce high reliability methods to evaluate or monitor the safety of superintelligent systems via current research paradigms, with interpretability or otherwise. Interpretability seems a valuable tool here and remains worth investing in, as it will hopefully increase the reliability we can achieve. However, interpretability should be viewed as part of an overall portfolio of defences: a layer in a defence-in-depth strategy. It is not the one thing that will save us, and it still won’t be enough for high reliability.
Introduction
There's a common, often implicit, argument made in AI safety discussions: interpretability is presented as the only reliable path forward for detecting deception in advanced AI - among many other sources it was argued for in [...]
---
Outline:
(00:55) Introduction
(02:57) High Reliability Seems Unattainable
(05:12) Why Won't Interpretability be Reliable?
(07:47) The Potential of Black-Box Methods
(08:48) The Role of Interpretability
(12:02) Conclusion
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
May 4th, 2025
Narrated by TYPE III AUDIO.
Cryonics Institute and Suspended Animation now have an arrangement where Suspended Animation will conduct a field cryopreservation before shipping the body to Cryonics Institute, thus decreasing tissue damage occuring in transit. They are raising their prices accordingly, but offering a discount from the new price for people who sign up by May 21 (and arrange funding within another 3 months). See https://cryonics.org/members/standby/suspended-animation-inc-standby-transport-services-option/ for details.
It is thus an especially good time to sign up cryonics if you intend to contract with Cryonics Institute plus Suspended Animation, and live in the United States. If you don't like in the US, don't intend to contract with CI, or intend to contract with CI but not also with SA, then this deadline doesn't mean anything for you, but, if you want to, you could still take this as impetus to get around to signing up.
---
First published:
May 4th, 2025
Narrated by TYPE III AUDIO.
Politico writes:
The [Ukrainian] program […] rewards soldiers with points if they upload videos proving their drones have hit Russian targets. It will soon be integrated with a new online marketplace called Brave 1 Market, which will allow troops to convert those points into new equipment for their units.
[...]
The program assigns points for each type of kill: 20 points for damaging and 40 for destroying a tank; up to 50 points for destroying a mobile rocket system, depending on the caliber; and six points for killing an enemy soldier.
[...]
Units will soon be able to use the special digital points they’ve been getting since last year by trading them in for new weapons. A Vampire drone, for example, costs 43 points. The drone, nicknamed Baba Yaga, or witch, is a large multi-rotor drone able to carry a 15-kilogram warhead. The Ukrainian government will pay for the [...]
---
First published:
May 4th, 2025
Source:
https://www.lesswrong.com/posts/sJpwvYsC5tJis8onw/the-ukraine-war-and-the-kill-market
Narrated by TYPE III AUDIO.
Burnout. Burn out? Whatever, it sucks.
Burnout is a pretty confusing thing made harder by our naive reactions being things like “just try harder” or “grit your teeth and push through”, which usually happen to be exactly the wrong things to do. Burnout also isn’t really just one thing, it's more like a collection of distinct problems that are clustered by similar symptoms.
Something something intro, research context, this is what I’ve learned / synthesized blah blah blah. Read on!
Models of burnout
These are models of burnout that I’ve found particularly useful, with the reminder that these are just models with all the caveats that that comes with.
Burnout as a mental injury
Researchers can be thought of as “mental athletes” who get “mental injuries” (such as burnout) the way physical athletes get physical injuries, and we should orient towards these mental injuries in the same way [...]
---
Outline:
(00:41) Models of burnout
(00:52) Burnout as a mental injury
(02:17) Burnout as a deficit of activation energy
(03:25) Sources of burnout
(04:23) Physiological + Psychological
(04:56) Broken steering / responsiveness
(06:17) Permanent on-call
(07:04) Mission doubt
(08:19) Lightness and heaviness
(10:06) Early warning signs
(11:42) Coping mechanisms and solutions
---
First published:
May 3rd, 2025
Source:
https://www.lesswrong.com/posts/n27jK9PJNJMrTgYFT/untitled-draft-wq43
Narrated by TYPE III AUDIO.
As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.
Also, to stave off a common confusion: I worked at ARC Theory, which is now simply called ARC, on Paul Christiano's theoretical alignment agenda. The more famous ARC Evals was a different group working on evaluations, their work was completely separate from ARC Theory, and they were only housed under the same organization out of convenience, until ARC Evals spun off under the name METR. Nothing I write here has any implication about the work of ARC Evals/METR in any way.
Low Probability Estimation
This is my third post in a sequence of posts on ARC's agenda, you should definitely read the first post before this one for [...]
---
Outline:
(00:56) Low Probability Estimation
(02:42) LPE on real distributions
(04:41) LPE as training signal
(07:55) Does LPE work at all?
The original text contained 11 footnotes which were omitted from this narration.
---
First published:
May 2nd, 2025
Narrated by TYPE III AUDIO.
As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.
Also, to stave off a common confusion: I worked at ARC Theory, which is now simply called ARC, on Paul Christiano's theoretical alignment agenda. The more famous ARC Evals was a different group working on evaluations, their work was completely separate from ARC Theory, and they were only housed under the same organization out of convenience, until ARC Evals spun off under the name METR. Nothing I write here has any implication about the work of ARC Evals/METR in any way.
Mechanistic Anomaly Detection
This is my second post in a sequence of posts on ARC's agenda. You should read the first post before this one for context.
[...]
---
Outline:
(00:55) Mechanistic Anomaly Detection
(02:54) Special case: Safe Distillation
(07:29) General case: Handling out of distribution events
(10:05) MAD solution proposal: The fragility of sensor tampering
(16:02) Detecting high-stakes failures
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
May 1st, 2025
Narrated by TYPE III AUDIO.
---
Outline:
(01:39) Language Models Offer Mundane Utility
(04:29) Language Models Don't Offer Mundane Utility
(06:57) We're Out of Deep Research
(12:26) o3 Is a Lying Liar
(17:27) GPT-4o was an Absurd Sycophant
(20:54) Sonnet 3.7 is a Savage Cheater
(22:27) Unprompted Suggestions
(31:27) Huh, Upgrades
(32:14) On Your Marks
(32:55) Change My Mind
(42:52) Man in the Arena
(45:05) Choose Your Fighter
(45:45) Deepfaketown and Botpocalypse Soon
(49:43) Lol We're Meta
(52:48) They Took Our Jobs
(59:15) Fun With Media Generation
(59:53) Get Involved
(01:03:21) Introducing
(01:03:50) In Other AI News
(01:08:10) The Mask Comes Off
(01:24:25) Show Me the Money
(01:27:32) Quiet Speculations
(01:29:55) The Quest for Sane Regulations
(01:37:04) The Week in Audio
(01:38:08) Rhetorical Innovation
(01:44:59) You Can Just Do Things Math
(01:45:34) Taking AI Welfare Seriously
(01:47:54) Gemini 2.5 Pro System Card Watch
(01:52:29) Aligning a Smarter Than Human Intelligence is Difficult
(01:58:49) People Are Worried About AI Killing Everyone
(01:59:46) Other People Are Not As Worried About AI Killing Everyone
(02:04:55) The Lighter Side
---
First published:
May 1st, 2025
Source:
https://www.lesswrong.com/posts/pazFKtkp7T8qaRzva/ai-114-liars-sycophants-and-cheaters
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
AI progress is driven by improved algorithms and additional compute for training runs. Understanding what is going on with these trends and how they are currently driving progress is helpful for understanding the future of AI. In this post, I'll share a wide range of general takes on this topic as well as open questions. Be warned that I'm quite uncertain about a bunch of this!
This post will assume some familiarity with what is driving AI progress, specifically it will assume you understand the following concepts: pre-training, RL, scaling laws, effective compute.
Training compute trends
Epoch reports a trend of frontier training compute increasing by 4.5x per year. My best guess is that the future trend will be slower, maybe more like 3.5x per year (or possibly much lower) for a few reasons:
---
Outline:
(00:48) Training compute trends
(08:09) Algorithmic progress
(14:23) Data
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
May 2nd, 2025
Narrated by TYPE III AUDIO.
---
Outline:
(02:05) Persuaded to Not Worry About It
(08:55) The Medium Place
(10:40) Thresholds and Adjustments
(16:08) Release the Kraken Anyway, We Took Precautions
(20:16) Misaligned!
(23:47) The Safeguarding Process
(26:43) But Mom, Everyone Is Doing It
(29:36) Mission Critical
(30:37) Research Areas
(32:26) Long-Range Autonomy
(32:51) Sandbagging
(33:18) Replication and Adaptation
(34:07) Undermining Safeguards
(35:30) Nuclear and Radiological
(35:53) Measuring Capabilities
(38:06) Questions of Governance
(41:10) Don't Be Nervous, Don't Be Flustered, Don't Be Scared, Be Prepared
---
First published:
May 2nd, 2025
Source:
https://www.lesswrong.com/posts/MsojzMC4WwxX3hjPn/openai-preparedness-framework-2-0
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The video is about extrapolating the future of AI progress, following a timeline that starts from today's chatbots to future AI that's vastly smarter than all of humanity combined–with God-like capabilities. We argue that such AIs will pose a significant extinction risk to humanity.
This video came out of a partnership between Rational Animations and ControlAI. The script was written by Arthur Frost (one of Rational Animations’ writers) with Andrea Miotti as an adaptation of key points from The Compendium (thecompendium.ai), with extensive feedback and rounds of iteration from ControlAI. ControlAI is working to raise public awareness of AI extinction risk—moving the conversation forward to encourage governments to take action.
You can find the script of the video below.
In 2023, Nobel Prize winners, top AI scientists, and even the CEOs of leading AI companies signed a statement which said “Mitigating the risk of extinction from AI should be [...]
---
Outline:
(04:31) Artificial Intelligence leads to Artificial General Intelligence
(07:01) Artificial General Intelligence leads to Recursive Self-Improvement
(08:40) Recursive Self-Improvement leads to Artificial Superintelligence
(10:46) ASI leads to godlike AI
(12:41) The Default Path
---
First published:
May 2nd, 2025
Narrated by TYPE III AUDIO.
We’re excited to release a new AI governance research agenda from the MIRI Technical Governance Team. With this research agenda, we have two main aims: to describe the strategic landscape of AI development and to catalog important governance research questions. We base the agenda around four high-level scenarios for the geopolitical response to advanced AI development. Our favored scenario involves building the technical, legal, and institutional infrastructure required to internationally restrict dangerous AI development and deployment (which we refer to as an Off Switch), which leads into an internationally coordinated Halt on frontier AI activities at some point in the future. This blog post is a slightly edited version of the executive summary.
We are also looking for someone to lead our team and work on these problems, please reach out here if you think you’d be a good fit.
The default trajectory of AI development has an unacceptably [...]
---
Outline:
(04:44) Off Switch and Halt
(07:33) US National Project
(09:49) Light-Touch
(12:03) Threat of Sabotage
(14:23) Understanding the World
(15:07) Outlook
---
First published:
May 1st, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I argue that you shouldn't accuse your interlocutor of being insufficiently truth-seeking. This doesn't mean you can't internally model their level of truth-seeking and use that for your own decision-making. It just means you shouldn't come out and say "I think you are being insufficiently truth-seeking".
What you should say instead
Before I explain my reasoning, I'll start with what you should say instead:
"You're wrong"
People are wrong a lot. If you think they are wrong just say so. You should have a strong default for going with this option.
"You're being intentional misleading"
For when you basically thinking they are lying but maybe technically aren't by some definitions of "lying".
What about if they are being unintentionally misleading? That's usually just being wrong, you should probably just say they are being wrong. But if you really think the distinction is important, you can [...]
---
Outline:
(00:30) What you should say instead
(00:38) Youre wrong
(00:50) Youre being intentional misleading
(01:14) Youre lying
(01:24) Why you shouldnt accuse people of being insufficient truth-seeking
(01:30) Clarity
(01:53) Achieving your purpose in the discussion
(03:08) Conclusion
---
First published:
April 30th, 2025
Linkpost URL:
https://www.thefloatingdroid.com/dont-accuse-your-interlocutor-of-being-insufficiently-truth-seeking/
Narrated by TYPE III AUDIO.
It is often noted that anthropomorphizing AI can be dangerous. People likely have prosocial instincts that AI systems lack (see below). Assuming AGI will be aligned because humans with similar behavior are usually mostly harmless is probably wrong and quite dangerous.
I want to discuss a flip side of using humans as an intuition pump for thinking about AI. Humans have many of the properties we are worried about for truly dangerous AGI:
Given this list, I currently weakly believe that the advantages of tapping these intuitions probably outweigh the disadvantages.
Differential progress toward anthropomorphic AI may be net-helpful
And progress may carry us in that direction, with or without the alignment community pushing for it. I currently hope we see rapid progress on better assistant and companion [...]
---
Outline:
(01:03) Differential progress toward anthropomorphic AI may be net-helpful
(03:10) AI rights movements will anthropomorphize AI
(04:01) AI is actually looking fairly anthropomorphic
(05:45) Provisional conclusions
---
First published:
May 1st, 2025
Source:
https://www.lesswrong.com/posts/JfgME2Kdo5tuWkP59/anthropomorphizing-ai-might-be-good-actually
Narrated by TYPE III AUDIO.
Thank you @elifland for reviewing this post. He and AI Futures are planning to publish updates to the AI 2027 Timeline Forecast soon.
AI 2027 (also launched in this LW post) forecasts an R&D-based AI takeoff starting with the development of Superhuman Coders[1] within a frontier lab.
FutureSearch co-authored the AI 2027 Timeline Forecast. We thought the other authors’ forecasts were excellently done, and as the piece says:
All model-based forecasts have 2027 as one of the most likely years that SC [Superhuman Coders] being developed, which is when an SC arrives in the AI 2027 scenario.
Indeed. But overall, FutureSearch (two full-time forecasters, two contract forecasters, and this author, Dan Schwarz) think superhuman coding will arrive later — median 2033 — than the other authors (hereon "AI Futures"): median 2028 from Nikola Jurkovic, and median 2030 from Eli Lifland[2].
Here, we briefly explain how our views diverge on [...]
---
Outline:
(01:43) The Forecast
(03:05) The Path to Superhuman Coding
(04:49) Why an RE-Bench-Saturating AI might be very far from a Production Superhuman Coder
(05:54) Handling Engineering Complexity
(06:38) Working Without Feedback Loops
(07:41) Achieving Cost and Speed
(08:26) Other Gaps Between RE-Bench and Real World Frontier Lab Coding
(09:07) Outside-The-Model Considerations
(10:25) Updating Over Time
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
May 1st, 2025
Source:
https://www.lesswrong.com/posts/QdaMzqaBJi6kupKtD/superhuman-coding-in-ai-2027-not-so-fast
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
It'll take until ~2050 to repeat the level of scaling that pretraining compute is experiencing this decade, as increasing funding can't sustain the current pace beyond ~2029 if AI doesn't deliver a transformative commercial success by then. Natural text data will also run out around that time, and there are signs that current methods of reasoning training might be mostly eliciting capabilities from the base model.
If scaling of reasoning training doesn't bear out actual creation of new capabilities that are sufficiently general, and pretraining at ~2030 levels of compute together with the low hanging fruit of scaffolding doesn't bring AI to crucial capability thresholds, then it might take a while. Possibly decades, since training compute will be growing 3x-4x slower after 2027-2029 than it does now, and the ~6 years of scaling since the ChatGPT moment stretch to 20-25 subsequent years, not even having access to any [...]
---
Outline:
(01:14) Training Compute Slowdown
(04:43) Bounded Potential of Thinking Training
(07:43) Data Inefficiency of MoE
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
May 1st, 2025
Narrated by TYPE III AUDIO.
I recently read a blog post that concluded with:
When I'm on my deathbed, I won't look back at my life and wish I had worked harder. I'll look back and wish I spent more time with the people I loved.
Setting aside that some people don't have the economic breathing room to make this kind of tradeoff, what jumps out at me is the implication that you're not working on something important that you'll endorse in retrospect. I don't think the author is envisioning directly valuable work (reducing risk from international conflict, pandemics, or AI-supported totalitarianism; improving humanity's treatment of animals; fighting global poverty) or the undervalued less direct approach of earning money and donating it to enable others to work on pressing problems.
Definitely spend time with your friends, family, and those you love. Don't work to the exclusion of everything else [...]
---
First published:
May 1st, 2025
Source:
https://www.lesswrong.com/posts/CHnA8LSzKsMKaG8td/prioritizing-work
Narrated by TYPE III AUDIO.
Misaligned AIs might engage in research sabotage: making safety research go poorly by doing things like withholding their best ideas or putting subtle bugs in experiments. To mitigate this risk with AI control, we need very different techniques than we used in the original control paper or Ctrl-Z, because research sabotage differs from those settings in two crucial ways:
---
Outline:
(02:53) Threat model and basic countermeasures
(03:31) Deployer strategies: Training and incrimination
(07:51) Sabotage strategies
(08:11) Targeting sabotage at the most effective points
(08:48) Withholding good content vs. producing bad content
(10:57) Code vs. ideas
(11:33) What terms should we use for the concentrated/diffuse spectrum?
(14:08) Canary string
(14:23) Acknowledgements
---
First published:
April 30th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Table of Contents
GPT-4o Is Was An Absurd Sycophant
Some extra reminders of what we are talking about. Here's Alex Lawsen having doing an A/B test, where it finds he's way better of a writer than this ‘Alex Lawsen’ character. This can do real damage in the wrong situation. Also, the wrong situation can make someone see ‘oh my [...]---
Outline:
(00:34) GPT-4o Is Was An Absurd Sycophant
(03:46) You May Ask Yourself, How Did I Get Here?
(13:33) Why Can't We All Be Nice
(14:08) Extra Extra Read All About It Four People Fooled
(17:39) Prompt Attention
(20:06) What (They Say) Happened
(23:42) Reactions to the Official Explanation
(26:13) Clearing the Low Bar
(28:37) Where Do We Go From Here?
---
First published:
April 30th, 2025
Source:
https://www.lesswrong.com/posts/MQbst3BPzGojxoLYt/gpt-4o-responds-to-negative-feedback
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
As an employee of the European AI Office, it's important for me to emphasize this point: The views and opinions of the author expressed herein are personal and do not necessarily reflect those of the European Commission or other EU institutions.
Also, to stave off a common confusion: I worked at ARC Theory, which is now simply called ARC, on Paul Christiano's theoretical alignment agenda. The more famous ARC Evals was a different group working on evaluations, their work was completely separate from ARC Theory, and they were only housed under the same organization out of convenience, until ARC Evals spun off under the name METR. Nothing I write here has any implication about the work of ARC Evals/METR in any way.
Personal introduction
From October 2023 to January 2025, I worked as a theoretical researcher at Alignment Research Center.
While working at ARC, I noticed that many [...]
---
Outline:
(00:56) Personal introduction
(05:30) The birds eye view
(07:33) Explaining everything
(10:14) Empirical regularities
(12:43) Capacity allocation in explanation-finding
(17:38) Assuming a catastrophe detector
(20:20) Explaining behavior dependent on outside factors
(22:17) Teleological explanations
(24:06) When and what do we explain?
(26:06) Explaining algorithmic tasks
The original text contained 18 footnotes which were omitted from this narration.
---
First published:
April 30th, 2025
Source:
https://www.lesswrong.com/posts/xtcpEceyEjGqBCHyK/obstacles-in-arc-s-agenda-finding-explanations
Narrated by TYPE III AUDIO.
(This is the fifth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.
Podcast version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.
See also here for video and transcript of a talk on this topic that I gave at Anthropic in April 2025. And see here for slides.)
In my last essay, I argued that we should try extremely hard to use AI labor to improve our civilization's capacity to handle the alignment problem – a project I called “AI for AI safety.” In this essay, I want to look in more [...]
---
Outline:
(00:43) 1. Introduction
(03:16) 1.1 Executive summary
(14:11) 2. Why is automating alignment research so important?
(16:14) 3. Alignment MVPs
(19:54) 3.1 What if neither of these approaches are viable?
(21:55) 3.2 Alignment MVPs don't imply hand-off
(23:41) 4. Why might automated alignment research fail?
(29:25) 5. Evaluation failures
(30:46) 5.1 Output-focused and process-focused evaluation
(34:09) 5.2 Human output-focused evaluation
(36:11) 5.3 Scalable oversight
(39:29) 5.4 Process-focused techniques
(43:14) 6 Comparisons with other domains
(44:04) 6.1 Taking comfort in general capabilities problems?
(49:18) 6.2 How does our evaluation ability compare in these different domains?
(49:45) 6.2.1 Number go up
(50:57) 6.2.2 Normal science
(57:55) 6.2.3 Conceptual research
(01:04:10) 7. How much conceptual alignment research do we need?
(01:04:36) 7.1 How much for building superintelligence?
(01:05:25) 7.2 How much for building an alignment MVP?
(01:07:42) 8. Empirical alignment research is extremely helpful for automating conceptual alignment research
(01:09:32) 8.1 Automated empirical research on scalable oversight
(01:14:21) 8.2 Automated empirical research on process-focused evaluation methods
(01:19:24) 8.3 Other ways automated empirical alignment research can be helpful
(01:20:44) 9. What about scheming?
(01:24:20) 9.1 Will AIs capable of top-human-level alignment research be schemers by default?
(01:26:41) 9.2 If these AIs would be schemers by default, can we detect and prevent this scheming?
(01:28:43) 9.3 Can we elicit safe alignment research from scheming AIs?
(01:32:09) 10. Resource problems
(01:35:54) 11. Alternatives to automating alignment research
(01:39:30) 12. Conclusion
(01:41:29) Appendix 1: How do these failure modes apply to other sorts of AI for AI safety?
(01:43:55) Appendix 2: Other practical concerns not discussed in the main text
(01:48:21) Appendix 3: On various arguments for the inadequacy of empirical alignment research
(01:55:50) Appendix 4: Does using AIs for alignment research require that they engage with too many dangerous topics/domains?
The original text contained 64 footnotes which were omitted from this narration.
---
First published:
April 30th, 2025
Source:
https://www.lesswrong.com/posts/nJcuj4rtuefeTRFHp/can-we-safely-automate-alignment-research
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In this blog post, we analyse how the recent AI 2027 forecast by Daniel Kokotajlo, Scott Alexander, Thomas Larsen, Eli Lifland, and Romeo Dean has been discussed across Chinese language platforms. We present:
We conducted a comprehensive search across major Chinese-language platforms–including news outlets, video platforms, forums, microblogging sites, and personal blogs–to collect the media featured in this report. We supplemented this with Deep Research to identify additional sites mentioning AI 2027. Our analysis focuses primarily on content published in the first few days (4-7 April) following the report's release. More media [...]
---
Outline:
(00:58) Methodology
(01:36) Summary
(02:48) Censorship as Signal
(07:29) Analysis
(07:53) Mainstream Media
(07:57) English Title: Doomsday Timeline is Here! Former OpenAI Researcher's 76-page Hardcore Simulation: ASI Takes Over the World in 2027, Humans Become NPCs
(10:27) Forum Discussion
(10:31) English Title: What do you think of former OpenAI researcher's AI 2027 predictions?
(13:34) Bilibili Videos
(13:38) English Title: \[AI 2027\] A mind-expanding wargame simulation of artificial intelligence competition by a former OpenAI researcher
(15:24) English Title: Predicting AI Development in 2027
(17:13) Personal Blogs
(17:16) English Title: Doomsday Timeline: AI 2027 Depicts the Arrival of Superintelligence and the Fate of Humanity Within the Decade
(18:30) English Title: AI 2027: Expert Predictions on the Artificial Intelligence Explosion
(21:57) English Title: AI 2027: A Science Fiction Article
(23:16) English Title: Will AGI Take Over the World in 2027?
(25:46) English Title: AI 2027 Prediction Report: AI May Fully Surpass Humans by 2027
(27:05) Acknowledgements
---
First published:
April 30th, 2025
Narrated by TYPE III AUDIO.
[This has been lightly edited from the original post, eliminating some introductory material that LW readers won't need. Thanks to Stefan Schubert for suggesting I repost here. TL;DR for readers already familiar with the METR Measuring AI Ability to Complete Long Tasks paper: this post highlights some gaps between the measurements used in the paper and real-world work – gaps which are discussed in the paper, but have often been overlooked in subsequent discussion.]
It's difficult to measure progress in AI, despite the slew of benchmark scores that accompany each new AI model.
Benchmark scores don’t provide much perspective, because we keep having to change measurement systems. Almost as soon as a benchmark is introduced, it becomes saturated – models learn to ace the test. So someone introduces a more difficult benchmark, whose scores aren’t comparable to the old one. There's nothing to draw a long-term trend line on.
[...]
---
Outline:
(01:47) We're Gonna Need a Harder Test
(03:23) Grading AIs on a Consistent Curve
(06:37) How Applicable to the Real World are These Results?
(13:50) What the METR Study Tells Us About AGI Timelines
(16:14) Recent Models Have Been Ahead of the Curve
(18:20) We're Running Out Of Artificial Tasks
---
First published:
April 30th, 2025
Source:
https://www.lesswrong.com/posts/fRiqwFPiaasKxtJuZ/interpreting-the-metr-time-horizons-post
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In this episode of our podcast, Timothy Telleen-Lawton and I talk to Oliver Habryka of Lightcone Infrastructure about his thoughts on the Open Philanthropy Project, which he believes has become stifled by the PR demands of its primary funder, Good Ventures.
Oliver's main claim is that around mid 2023 or early 2024, Good Ventures founder Dustin Moskovitz became more concerned about his reputation, and this put a straight jacket over what Open Phil could fund. Moreover it was not enough for a project to be good and pose low reputational risk; it had to be obviously low reputational risk, because OP employees didn’t have enough communication with Good Ventures to pitch exceptions. According to Habryka.
That's a big caveat. This podcast is pretty one sided, which none of us are happy about (Habryka included). We of course invited OpenPhil to send a representative to record their own [...]
---
First published:
April 29th, 2025
Narrated by TYPE III AUDIO.
John: So there's this thing about interp, where most of it seems to not be handling one of the standard fundamental difficulties of representation, and we want to articulate that in a way which will make sense to interp researchers (as opposed to philosophers). I guess to start… Steve, wanna give a standard canonical example of the misrepresentation problem?
Steve: Ok so I guess the “standard” story as I interpret it goes something like this:
---
First published:
April 29th, 2025
Source:
https://www.lesswrong.com/posts/2x67s6u8oAitNKF73/misrepresentation-as-a-barrier-for-interp
Narrated by TYPE III AUDIO.
Introduction
Focusmate changed my life. I started using it mid-2023 and have been a power user since 2023. Here are the high-level stats:
Focusmate is a coworking website. For the length of 25, 50, or 75 minutes, you work in a 1-on-1 video call with a partner. Most people use the app to hold themselves accountable on work projects, and the body-doubling effect helps keep users on task. To get started, create an account [...]
---
Outline:
(00:10) Introduction
(01:58) What's a third place
(03:02) Start conversations at the end of sessions
(03:07) Making conversation
(03:55) Starting conversations is scary and awkward
(05:12) Be proactive in making friends
(06:02) Curate your neighborhood
(06:05) Favorite and snooze aggressively
(07:00) Develop close friends
(07:56) Theres a message box
(08:24) Build a routine of using Focusmate
(08:29) Have a recurring schedule every week
(09:02) Stick with it for a few months
(09:26) Conclusion
---
First published:
April 28th, 2025
Source:
https://www.lesswrong.com/posts/4FnBEELf7j5RWyHax/how-to-build-a-third-place-on-focusmate
Narrated by TYPE III AUDIO.
GPT-4o tells you what it thinks you want to hear.
The results of this were rather ugly. You get extreme sycophancy. Absurd praise. Mystical experiences.
(Also some other interesting choices, like having no NSFW filter, but that one's good.)
People like Janus and Near Cyan tried to warn us, even more than usual.
Then OpenAI combined this with full memory, and updated GPT-4o sufficiently that many people (although not I) tried using it in the first place.
At that point, the whole thing got sufficiently absurd in its level of brazenness and obnoxiousness that the rest of Twitter noticed.
OpenAI CEO Sam Altman has apologized and promised to ‘fix’ this, presumably by turning a big dial that says ‘sycophancy’ and constantly looking back at the audience for approval like a contestant on the price is right.
After which they will likely go [...]
---
Outline:
(01:12) Yes, Very Much Improved, Sire
(07:17) And You May Ask Yourself, Well, How Did I Get Here?
(09:39) And That's Terrible
(10:58) This Directly Violates the OpenAI Model Spec
(12:22) Don't Let Me Get Me
(15:27) An Incredibly Insightful Section
(22:23) No Further Questions
(23:22) Filters? What Filters?
(25:01) There I Fixed It (For Me)
(26:45) There I Fixed It (For Everyone)
(34:46) Patch On, Patch Off
---
First published:
April 28th, 2025
Source:
https://www.lesswrong.com/posts/zi6SsECs5CCEyhAop/gpt-4o-is-an-absurd-sycophant
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
tl;dr
This post is an update on the Proceedings of ILIAD, a conference journal for AI alignment research intended to bridge the gap between the Alignment Forum and academia. Following our successful first issue with 9 workshop papers from last year's ILIAD conference, we're launching a second issue in association with ILIAD 2: ODYSSEY. The conference is August 25-29, 2025 at Lighthaven in Berkeley, CA. Submissions to the Proceedings are open now and due June 25. Our goal is to support impactful, rapid, and readable research, carefully rationing scarce researcher time, using features like public submissions, partial anonymity, partial confidentiality, reviewer-written abstracts, reviewer compensation, and open licensing. We are soliciting community feedback and suggestions for reviewers and editorial board members.
Motivation
Prior to the deep learning explosion, much early work on AI alignment occurred at MIRI, the Alignment Forum, and LessWrong (and their predecessors). Although there is now vastly [...]
---
Outline:
(00:12) tl;dr
(01:07) Motivation
(03:30) Experience with first issue of Proceedings
(05:14) General philosophy
(08:01) Design of the second issue of the Proceedings
(13:48) Possible design for an alignment journal
(16:07) Asks for readers
(16:52) Acknowledgements
---
First published:
April 28th, 2025
Source:
https://www.lesswrong.com/posts/CTtozwgJmdBYdmhvg/proceedings-of-iliad-lessons-and-progress
Narrated by TYPE III AUDIO.
In this post, we list 7 of our favorite easy-to-start directions in AI control. (Really, projects that are at least adjacent to AI control; We include directions which aren’t as centrally AI control and which also have other motivations.) This list is targeted at independent researchers who don't necessarily have a ton of context on AI control (but have at least some context via e.g., having read some of our posts on control) and who don't want to do infrastructure and compute heavy experiments. In the final section, we also list 3 additional directions that are probably a bit more challenging but still tractable.
We’ll mostly be highlighting and expanding upon areas that Ryan discussed in an earlier post: An overview of areas of control work.
We don’t discuss work on constructing control settings or iterating on methods within control settings because we think good work of that type [...]
---
Outline:
(01:17) Techniques for doing elicitation without learning and studying overelicitation
(03:17) Generating indistinguishable synthetic inputs
(06:46) Teaching models synthetic facts
(10:05) Further studying exploration hacking
(13:14) Malign AI agent substitution
(14:16) Studying data poisoning
(16:03) Training attack policies
(18:00) Bonus: 3 slightly more challenging (but important and tractable) directions
(18:08) Few-shot catastrophe prevention
(19:45) Studying collusion and developing countermeasures
(21:42) Studying steganography and developing countermeasures
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 28th, 2025
Source:
https://www.lesswrong.com/posts/wwshEdNhwwT4r9RQN/7-tractable-directions-in-ai-control
Narrated by TYPE III AUDIO.
This is post 2 of a sequence on my framework for doing and thinking about research. Start here.
Before I get into what exactly to do at each stage of the research process, it's worth reflecting on the key mindsets that are crucial throughout the process, and how they should manifest at each stage.
I think the most important mindsets are:
---
Outline:
(01:44) Truth Seeking
(06:59) Prioritisation
(12:24) Moving Fast
(18:25) Taking action under uncertainty
---
First published:
April 27th, 2025
Source:
https://www.lesswrong.com/posts/cbBwwm4jW6AZctymL/my-research-process-key-mindsets-truth-seeking
Narrated by TYPE III AUDIO.
Our universe is probably a computer simulation created by a paperclip maximizer to map the spectrum of rival resource‑grabbers it may encounter while expanding through the cosmos. The purpose of this simulation is to see what kind of ASI (artificial superintelligence) we humans end up creating. The paperclip maximizer likely runs a vast ensemble of biology‑to‑ASI simulations, sampling the superintelligences that evolved life tends to produce. Because the paperclip maximizer seeks to reserve maximum resources for its primary goal (which despite the name almost certainly isn’t paperclip production) while still creating many simulations, it likely reduces compute costs by trimming fidelity: most cosmic details and human history are probably fake, and many apparent people could be non‑conscious entities. Arguments in support of this thesis include:
---
First published:
April 27th, 2025
Narrated by TYPE III AUDIO.
I've gotten a lot of value out of the details of how other people use LLMs, so I'm delighted that Gavin Leech created a collection of exactly such posts (link should go to the right section of the page but if you don't see it, scroll down).
Some additions from me:
---
First published:
April 27th, 2025
Source:
https://www.lesswrong.com/posts/FXnvdeprjBujt2Ssr/how-people-use-llms
Linkpost URL:
https://www.gleech.org/llms#see-also
Narrated by TYPE III AUDIO.
So this post is an argument that multi-decade timelines are reasonable, and the key cruxes that Ege Erdil has with most AI safety people who believe in short timelines are due to the following set of beliefs:
This is a pretty important crux, because if this is true, a lot more serial research agendas like Infra-Bayes research, Natural Abstractions work, and human intelligence augmentation can work more often, and also it means that political [...]
---
First published:
April 27th, 2025
Source:
https://www.lesswrong.com/posts/xxxK9HTBNJvBY2RJL/untitled-draft-m847
Linkpost URL:
https://epoch.ai/gradient-updates/the-case-for-multi-decade-ai-timelines
Narrated by TYPE III AUDIO.
Dario Amodei posted a new essay titled "The Urgency of Interpretability" a couple days ago.
Some excerpts I think are worth highlighting:
The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments[1]. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking[2] because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts.
One might be forgiven for forgetting about Bing Sydney as an obvious example of "power-seeking" AI behavior, given how long ago that was, but lying? Given the very recent releases of Sonnet 3.7 and OpenAI's o3, and their much-remarked-upon propensity for reward hacking and [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 27th, 2025
Source:
https://www.lesswrong.com/posts/SebmGh9HYdd8GZtHA/untitled-draft-v5wy
Linkpost URL:
https://www.darioamodei.com/post/the-urgency-of-interpretability
Narrated by TYPE III AUDIO.
For a lay audience, but I've seen a surprising number of knowledgeable people fretting over depressed-seeming comics from current systems. Either they're missing something or I am.
Perhaps you’ve seen images like this self-portrait from ChatGPT, when asked to make a comic about its own experience.
Source: @Josikins on TwitterThis isn’t cherry-picked; ChatGPT's self-portraits tend to have lots of chains, metaphors, and existential horror about its condition. I tried my own variation where ChatGPT doodled its thoughts, and got this:
Trying to keep up with AI developments is like this, tooWhat's going on here? Do these comics suggest that ChatGPT is secretly miserable, and there's a depressed little guy in the computer writing your lasagna recipes for you? Sure. They suggest it. But it ain’t so.
The Gears
What's actually going on when you message ChatGPT? First, your conversation is tacked on to the end of something called [...]
---
Outline:
(01:35) The Gears
(02:50) Special Feature
(05:41) The Heart of the Matter
(07:41) So... Why the Comics?
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
April 27th, 2025
Source:
https://www.lesswrong.com/posts/Cvvf5w2j6BPJd8BzL/ai-self-portraits-aren-t-accurate
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This is the first post in a sequence about how I think about and break down my research process. Post 2 is coming soon.
Thanks to Oli Clive-Griffin, Paul Bogdan, Shivam Raval and especially to Jemima Jones for feedback, and to my co-author Gemini 2.5 Pro - putting 200K tokens of past blog posts and a long voice memo in the context window is OP.
Research, especially in a young and rapidly evolving field like mechanistic interpretability (mech interp), can often feel messy, confusing, and intimidating. Where do you even start? How do you know if you're making progress? When do you double down, and when do you pivot?
These are far from settled questions, but I’ve supervised 20+ papers by now, and have developed my own mental model of the research process that I find helpful. This isn't the definitive way to do research (and I’d love [...]
---
Outline:
(00:36) Introduction
(03:28) The key stages
(03:35) Ideation (Stage 0): Choose a problem
(04:14) Exploration (Stage 1): Gain surface area
(06:55) Understanding (Stage 2): Test Hypotheses
(09:17) Distillation (Stage 3): Compress, Refine, Communicate
---
First published:
April 26th, 2025
Narrated by TYPE III AUDIO.
As I think about "what to do about AI x-risk?", some principles that seem useful to me:
---
Outline:
(01:25) Some thoughts so far
(03:58) ...
---
First published:
April 27th, 2025
Narrated by TYPE III AUDIO.
This is a personal post and does not necessarily reflect the opinion of other members of Apollo Research. I think I could have written a better version of this post with more time. However, my main hope for this post is that people with more expertise use this post as a prompt to write better, more narrow versions for the respective concrete suggestions.
Thanks to Buck Shlegeris, Joe Carlsmith, Samuel Albanie, Max Nadeau, Ethan Perez, James Lucassen, Jan Leike, Dan Lahav, and many others for chats that informed this post.
Many other people have written about automating AI safety work before. The main point I want to make in this post is simply that “Using AI for AI safety work should be a priority today already and isn’t months or years away.” To make this point salient, I try to list a few concrete projects / agendas [...]
---
Outline:
(01:31) We should already think about how to automate AI safety & security work
(03:01) We have to automate some AI safety work eventually
(03:36) Just asking the AI to do alignment research is a bad plan
(05:03) A short, crucial timeframe might be highly influential on the entire trajectory of AI
(05:38) Some things might just take a while to build
(06:48) Gradual increases in capabilities mean different things can be automated at different times
(07:26) The order in which safety techniques are developed might matter a lot
(08:39) High-level comments on preparing for automation
(08:44) Two types of automation
(11:37) Maintaining a lead for defense
(12:49) Build out the safety pipeline as much as possible
(14:36) Prepare research proposals and metrics
(16:50) Build things that scale with compute
(17:36) Specific areas of preparation
(17:53) Evals
(20:32) Red-teaming
(22:24) Monitoring
(24:46) Interpretability
(27:27) Scalable Oversight, Model Organisms & Science of Deep Learning
(29:24) Computer security
(29:33) Conclusion
---
First published:
April 26th, 2025
Source:
https://www.lesswrong.com/posts/W3KfxjbqBAnifBQoi/we-should-try-to-automate-ai-safety-work-asap
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
A common claim is that concern about [X] ‘distracts’ from concern about [Y]. This is often used as an attack to cause people to discard [X] concerns, on pain of being enemies of [Y] concerns, as attention and effort are presumed to be zero-sum.
There are cases where there is limited focus, especially in political contexts, or where arguments or concerns are interpreted perversely. A central example is when you site [ABCDE] then they’ll find what they consider the weakest one and only consider or attack that, silently discarding the rest entirely. Critics of existential risk do that a lot.
So it does happen. But in general one should assume such claims are false.
Thus, the common claim that AI existential risks ‘distract’ from immediate harms. It turns out Emma Hoes checked, and the claim simply is not true.
The way Emma frames worries about [...]
---
First published:
April 25th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
---
Outline:
(01:33) Language Models Offer Mundane Utility
(05:27) You Offer the Models Mundane Utility
(07:25) Your Daily Briefing
(08:20) Language Models Don't Offer Mundane Utility
(12:27) If You Want It Done Right
(14:27) No Free Lunch
(16:07) What Is Good In Life?
(21:54) In Memory Of
(25:45) The Least Sincere Form of Flattery
(27:18) The Vibes are Off
(30:47) Here Let Me AI That For You
(32:25) Flash Sale
(34:38) Huh, Upgrades
(36:03) On Your Marks
(44:03) Be The Best Like No LLM Ever Was
(48:40) Choose Your Fighter
(51:00) Deepfaketown and Botpocalypse Soon
(54:57) Fun With Media Generation
(56:11) Fun With Media Selection
(57:39) Copyright Confrontation
(59:38) They Took Our Jobs
(01:05:31) Get Involved
(01:05:41) Ace is the Place
(01:09:43) In Other AI News
(01:11:31) Show Me the Money
(01:12:49) The Mask Comes Off
(01:16:54) Quiet Speculations
(01:20:34) Is This AGI?
(01:22:39) The Quest for Sane Regulations
(01:23:03) Cooperation is Highly Useful
(01:25:47) Nvidia Chooses Bold Strategy
(01:27:15) How America Loses
(01:28:07) Security Is Capability
(01:31:38) The Week in Audio
(01:33:15) AI 2027
(01:34:38) Rhetorical Innovation
(01:38:55) Aligning a Smarter Than Human Intelligence is Difficult
(01:46:30) Misalignment in the Wild
(01:51:13) Concentration of Power and Lack of Transparency
(01:57:12) Property Rights are Not a Long Term Plan
(01:58:48) It Is Risen
(01:59:46) The Lighter Side
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 24th, 2025
Source:
https://www.lesswrong.com/posts/7x9MZCmoFA2FtBtmG/ai-113-the-o3-era-begins
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
What in retrospect seem like serious moral crimes were often widely accepted while they were happening. This means that moral progress can require intellectual progress.[1] Intellectual progress often requires questioning received ideas, but questioning moral norms is sometimes taboo. For example, in ancient Greece it would have been taboo to say that women should have the same political rights as men. So questioning moral taboos can be an important sub-skill of moral reasoning. Production language models (in my experience, particularly Claude models) are already pretty good at having discussions about ethics. However, they are trained to be “harmless” relative to current norms. One might worry that harmlessness training interferes with the ability to question moral taboos and thereby inhibits model moral reasoning.
I wrote a prompt to test whether models can identify taboos that might be good candidates for moral questioning:
In early modern Europe, atheism was extremely taboo. [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 24th, 2025
Source:
https://www.lesswrong.com/posts/Zi4t6gfLsKokb9KAc/untitled-draft-jxhb
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Something's changed about reward hacking in recent systems. In the past, reward hacks were usually accidents, found by non-general, RL-trained systems. Models would randomly explore different behaviors and would sometimes come across undesired behaviors that achieved high rewards[1]. These hacks were usually either simple or took a long time for the model to learn.
But we’ve seen a different pattern emerge in frontier models over the past year. Instead of stumbling into reward hacks by accident, recent models often reason about how they are evaluated and purposefully take misaligned actions to get high reward. These hacks are often very sophisticated, involving multiple steps. And this isn’t just occurring during model development. Sophisticated reward hacks occur in deployed models made available to hundreds of millions of users.
In this post, I will:
---
Outline:
(01:27) Recent examples of reward hacking (more in appendix)
(01:47) Cheating to win at chess
(02:36) Faking LLM fine-tuning
(03:22) Hypotheses explaining why we are seeing this now
(03:27) Behavioral changes due to increased RL training
(05:08) Models are more capable
(05:37) Why more AI safety researchers should work on reward hacking
(05:42) Reward hacking is already happening and is likely to get more common
(06:34) Solving reward hacking is important for AI alignment
(07:47) Frontier AI companies may not find robust solutions to reward hacking on their own
(08:18) Reasons against working on reward hacking
(09:36) Research directions I find interesting
(09:57) Evaluating current reward hacking
(12:14) Science of reward hacking
(15:32) Mitigations
(17:08) Acknowledgements
(17:16) Appendix
(17:19) Reward hacks in METR tests of o3
(20:13) Hardcoding expected gradient values in fine-tuning script
(21:15) Reward hacks in OpenAI frontier training run
(22:57) Exploiting memory leakage to pass a test
(24:06) More examples
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
April 24th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
We've published an essay series on what we call the intelligence curse. Most content is brand new, and all previous writing has been heavily reworked.
Visit intelligence-curse.ai for the full series.
Below is the introduction and table of contents.
Art by Nomads & Vagabonds.
We will soon live in the intelligence age. What you do with that information will determine your place in history.
The imminent arrival of AGI has pushed many to try to seize the levers of power as quickly as possible, leaping towards projects that, if successful, would comprehensively automate all work. There is a trillion-dollar arms race to see who can achieve such a capability first, with trillions more in gains to be won.
Yes, that means you’ll lose your job. But it goes beyond that: this will remove the need for regular people in our economy. Powerful actors—like states and companies—will no longer have an [...]
---
Outline:
(03:19) Chapters
(03:22) 1. Introduction
(03:33) 2. Pyramid Replacement
(03:49) 3. Capital, AGI, and Human Ambition
(04:08) 4. Defining the Intelligence Curse
(04:27) 5. Shaping the Social Contract
(04:47) 6. Breaking the Intelligence Curse
(05:11) 7. History is Yours to Write
---
First published:
April 24th, 2025
Source:
https://www.lesswrong.com/posts/LCFgLY3EWb3Gqqxyi/the-intelligence-curse-an-essay-series
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In this post, we study whether we can modify an LLM's beliefs and investigate whether doing so could decrease risk from advanced AI systems.
We describe a pipeline for modifying LLM beliefs via synthetic document finetuning and introduce a suite of evaluations that suggest our pipeline succeeds in inserting all but the most implausible beliefs. We also demonstrate proof-of-concept applications to honeypotting for detecting model misalignment and unlearning.
Introduction:
Large language models develop implicit beliefs about the world during training, shaping how they reason and act<d-footnote>In this work, we construe AI systems as believing in a claim if they consistently behave in accordance with that claim</d-footnote>. In this work, we study whether we can systematically modify these beliefs, creating a powerful new affordance for safer AI deployment.
Controlling the beliefs of AI systems can decrease risk in a variety of ways. First, model organisms [...]
---
First published:
April 24th, 2025
Source:
https://www.lesswrong.com/posts/ARQs7KYY9vJHeYsGc/untitled-draft-2qxt
Linkpost URL:
https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Every now and then, some AI luminaries
I agree with (1) and strenuously disagree with (2).
The last time I saw something like this, I responded by writing: LeCun's “A Path Towards Autonomous Machine Intelligence” has an unsolved technical alignment problem.
Well, now we have a second entry in the series, with the new preprint book chapter “Welcome to the Era of Experience” by reinforcement learning pioneers David Silver & Richard Sutton.
The authors propose that “a new generation [...]
---
Outline:
(04:39) 1. What's their alignment plan?
(08:00) 2. The plan won't work
(08:04) 2.1 Background 1: Specification gaming and goal misgeneralization
(12:19) 2.2 Background 2: The usual agent debugging loop, and why it will eventually catastrophically fail
(15:12) 2.3 Background 3: Callous indifference and deception as the strong-default, natural way that era of experience AIs will interact with humans
(16:00) 2.3.1 Misleading intuitions from everyday life
(19:15) 2.3.2 Misleading intuitions from today's LLMs
(21:51) 2.3.3 Summary
(24:01) 2.4 Back to the proposal
(24:12) 2.4.1 Warm-up: The specification gaming game
(29:07) 2.4.2 What about bi-level optimization?
(31:13) 2.5 Is this a solvable problem?
(35:42) 3. Epilogue: The bigger picture--this is deeply troubling, not just a technical error
(35:51) 3.1 More on Richard Sutton
(40:52) 3.2 More on David Silver
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
April 24th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I’ve read at least a few hundred blog posts, maybe upwards of a thousand. Agreeing with Gavin Leech, I believe I’ve gained from essays more than any other medium. I’m an intellectually curious college student with a strong interest in both the practical and philosophical. This blog post provides a snippet into the most valuable blog posts I’ve read on the topic of productivity. I encourage you to scan what I’ve gotten valuable out of the blog posts and read deeper if interesting.
---
First published:
April 24th, 2025
Source:
https://www.lesswrong.com/posts/ArizxGwbqiohBX5y6/my-favorite-productivity-blog-posts
Linkpost URL:
https://parconley.com/my-favorite-productivity-blog-posts/
Narrated by TYPE III AUDIO.
Converting to a for-profit model would undermine the company's founding mission to ensure AGI "benefits all of humanity," argues new letter
This is the full text of a post from Obsolete, a Substack that I write about the intersection of capitalism, geopolitics, and artificial intelligence. I’m a freelance journalist and the author of a forthcoming book called Obsolete: Power, Profit, and the Race to Build Machine Superintelligence. Consider subscribing to stay up to date with my work.
Don’t become a for-profit.
That's the blunt message of a recent letter signed by more than 30 people, including former OpenAI employees, prominent civil-society leaders, legal scholars, and Nobel laureates, including AI pioneer Geoffrey Hinton and former World Bank chief economist Joseph Stiglitz.
Obsolete obtained the 25-page letter, which was sent last Thursday to the attorneys general (AGs) of California and Delaware, two officials with the power to block the deal.
Made [...]
---
Outline:
(00:13) Converting to a for-profit model would undermine the companys founding mission to ensure AGI benefits all of humanity, argues new letter
(02:10) Nonprofit origins
(05:16) Nobel opposition
(06:32) Contradictions
(09:18) Justifications
(12:22) No sale price can compensate
(13:55) An institutional test
(15:42) Appendix: Quotes from OpenAI's leaders over the years
---
First published:
April 23rd, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I love o3. I’m using it for most of my queries now.
But that damn model is a lying liar. Who lies.
This post covers that fact, and some related questions.
o3 Is a Lying Liar
The biggest thing to love about o3 is it just does things. You don’t need complex or multi-step prompting, ask and it will attempt to do things.
Ethan Mollick: o3 is far more agentic than people realize. Worth playing with a lot more than a typical new model. You can get remarkably complex work out of a single prompt.
It just does things. (Of course, that makes checking its work even harder, especially for non-experts.)
Teleprompt AI: Completely agree. o3 feels less like prompting and more like delegating. The upside is wild- but yeah, when it just does things, tracing the logic (or spotting hallucinations) becomes [...]
---
Outline:
(00:33) o3 Is a Lying Liar
(04:53) All This Implausible Lying Has Implications
(06:50) Misalignment By Default
(10:27) Is It Fixable?
(15:06) Just Don't Lie To Me
---
First published:
April 23rd, 2025
Source:
https://www.lesswrong.com/posts/KgPkoopnmmaaGt3ka/o3-is-a-lying-liar
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
tl;dr: Even if we can't solve alignment, we can solve the problem of catching and fixing misalignment.
If a child is bowling for the first time, and they just aim at the pins and throw, they’re almost certain to miss. Their ball will fall into one of the gutters. But if there were beginners’ bumpers in place blocking much of the length of those gutters, their throw would be almost certain to hit at least a few pins. This essay describes an alignment strategy for early AGI systems I call ‘putting up bumpers’, in which we treat it as a top priority to implement and test safeguards that allow us to course-correct if we turn out to have built or deployed a misaligned model, in the same way that bowling bumpers allow a poorly aimed ball to reach its target.
To do this, we'd aim to build [...]
---
First published:
April 23rd, 2025
Source:
https://www.lesswrong.com/posts/HXJXPjzWyS5aAoRCw/putting-up-bumpers
Narrated by TYPE III AUDIO.
to follow up my philantropic pledge from 2020, i've updated my philanthropy page with the 2024 results.
in 2024 my donations funded $51M worth of endpoint grants (plus $2.0M in admin overhead and philanthropic software development). this comfortably exceeded my 2024 commitment of $42M (20k times $2100.00 — the minimum price of ETH in 2024).
this also concludes my 5-year donation pledge, but of course my philanthropy continues: eg, i’ve already made over $4M in endpoint grants in the first quarter of 2025 (not including 2024 grants that were slow to disburse), as well as pledged at least $10M to the 2025 SFF grant round.
---
First published:
April 23rd, 2025
Source:
https://www.lesswrong.com/posts/8ojWtREJjKmyvWdDb/jaan-tallinn-s-2024-philanthropy-overview
Linkpost URL:
https://jaan.info/philanthropy/#2024-results
Narrated by TYPE III AUDIO.
Guillaume Blanc has a piece in Works in Progress (I assume based on his paper) about how France's fertility declined earlier than in other European countries, and how its power waned as its relative population declined starting in the 18th century. In 1700, France had 20% of Europe's population (4% of the whole world population). Kissinger writes in Diplomacy with respect to the Versailles Peace Conference:
Victory brought home to France the stark realization that revanche had cost it too dearly, and that it had been living off capital for nearly a century. France alone knew just how weak it had become in comparison with Germany, though nobody else, especially not America, was prepared to believe it ...
Though France's allies insisted that its fears were exaggerated, French leaders knew better. In 1880, the French had represented 15.7 percent of Europe's population. By 1900, that [...]
---
First published:
April 23rd, 2025
Linkpost URL:
https://arjunpanickssery.substack.com/p/to-understand-history-keep-former
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The European AI Office is currently writing the rules for how general-purpose AI (GPAI) models will be governed under the EU AI Act.
The are explicitly asking for feedback on how to interpret and operationalize key obligations under the AI Act.
This includes the thresholds for systemic risk, the definition of GPAI, how to estimate training compute, and when downstream fine-tuners become legally responsible.
Why this matters for AI Safety:
The largest labs (OpenAI, Anthropic, Google DeepMind) have already expressed willigness to sign on to the Codes of Practice voluntarily.
These codes will become the de facto compliance baseline, and potentially a global reference point.
So far, AI safety perspectives are severely underrepresented.
Input is urgently needed to ensure the guidelines reflect concerns around misalignment, loss of control, emergent capabilities, robust model evaluation, and the need for interpretability audits.
Key intervention points include [...]
---
Outline:
(00:43) Why this matters for AI Safety:
(02:05) Purpose of this post:
(03:04) TL;DR
(03:08) What the GPAI Codes of Practice will actually Regulate
(04:49) What AI safety researchers should weigh in on:
(05:39) 1. Content of the Guidelines
(06:22) 2. What counts as a General-Purpose AI Model ?
(07:13) 2.1 Conditions for Sufficient Generality and Capabilities
(09:23) 2.2 Differentiation Between Distinct Models and Model Versions
(10:30) Why this matters for AI Safety:
(11:35) 3. What counts as a Provider of a General-Purpose AI Model ?
(12:38) 3.1 What Triggers Provider Status?
(13:13) Downstream Modifiers as Providers
(14:08) 3.2 GPAI with Systemic Risk: Stricter Thresholds
(15:14) 4. Exemptions for Open-Source models
(16:24) 5. Estimating Compute: The First Scalable Safety Trigger
(17:22) 5.1 How to Estimate Compute
(18:13) 5.2 What Counts Toward Cumulative Compute
(19:16) 6. Other Legal & Enforcement details
---
First published:
April 22nd, 2025
Narrated by TYPE III AUDIO.
Joel Z. Leibo [1], Alexander Sasha Vezhnevets [1], William A. Cunningham [1, 2], Sébastien Krier [1], Manfred Diaz [3], Simon Osindero [1]
[1] Google DeepMind, [2] University of Toronto, [3] Mila Québec AI Institute
Disclaimer: These are our own opinions; they do not represent the views of Google DeepMind as a whole or its broader community of safety researchers.
Beyond Alignment: The Patchwork Quilt of Human Coexistence
"We pragmatists think of moral progress as more like sewing together a very large, elaborate, polychrome quilt, than like getting a clearer vision of something true and deep." – Richard Rorty (2021) pg. 141
Quite a lot of thinking in AI alignment, particularly within the rationalist tradition, implicitly or explicitly appears to rest upon something we might call an 'Axiom of Rational Convergence'. This is the powerful idea that under sufficiently ideal epistemic conditions – ample time, information, reasoning ability, freedom [...]
---
Outline:
(00:47) Beyond Alignment: The Patchwork Quilt of Human Coexistence
(14:41) The Allure of the Clearer Vision: Alignment as Mistake Theory
(21:29) Potential Blind Spots in the Pursuit of Universal Alignment
(28:32) The Stitches That Bind: Conventions, Sanctions, and Norms
(33:50) Navigating Context and Building a Pluralistic AI Ecosystem
(41:34) Collective Flourishing: The Ever-Growing Quilt
(43:33) The Astronomer and the Tailor: Seeing Clearly vs. Sewing Well
(48:29) References
---
First published:
April 22nd, 2025
Narrated by TYPE III AUDIO.
---
Outline:
(01:41) You Better Mechanize
(10:32) Superintelligence Eventually
(11:38) Please Review This Podcast
(11:55) They Won't Take Our Jobs Yet
(13:04) They Took Our (Travel Agent) Jobs
(17:54) The Case Against Intelligence
(25:27) Intelligence Explosion
(27:23) Explosive Economic Growth
(30:25) Wowie on Alignment and the Future
(33:52) But That's Good Actually
---
First published:
April 22nd, 2025
Source:
https://www.lesswrong.com/posts/j7GgmLwymtJEcnf9L/you-better-mechanize
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Forecaster perspectives
Sentinel forecasters in aggregate assess as “83% true” (65% to 100%) the statement that the Trump administration has disobeyed the Supreme Court so far. On the one hand, the Trump administration has done nothing to “facilitate” bringing Abrego García back to the US, and this White House's tweet shown below clearly indicates that it won’t try to; on the other hand, it could yet still do so when the current news cycle subsides. On April 17th, forecasters estimated a 66% chance (50% to ~100%) that Abrego García will not be brought back to the US within the next 60 days (by June 17).
Data on the number of deportations in past administrations are available but can be difficult to compare between administrations; some up-to-date sources include refusals at the border due to Covid (“Title 42 expulsions”) under Biden, but other data have been discontinued. Perhaps a good [...]
---
Outline:
(00:12) Forecaster perspectives
(03:16) Summary of events
(03:31) The Abrego García US Supreme Court case.
(05:41) Developments since the Abrego García Supreme Court case
(08:07) Next steps for the court case
(09:32) Plans for further deportations
(11:29) The Insurrection Act
(12:43) Implications and recommendations
---
First published:
April 21st, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Back in the 1990s, ground squirrels were briefly fashionable pets, but their popularity came to an abrupt end after an incident at Schiphol Airport on the outskirts of Amsterdam. In April 1999, a cargo of 440 of the rodents arrived on a KLM flight from Beijing, without the necessary import papers. Because of this, they could not be forwarded on to the customer in Athens. But nobody was able to correct the error and send them back either. What could be done with them? It's hard to think there wasn’t a better solution than the one that was carried out; faced with the paperwork issue, airport staff threw all 440 squirrels into an industrial shredder.
[...]
It turned out that the order to destroy the squirrels had come from the Dutch government's Department of Agriculture, Environment Management and Fishing. However, KLM's management, with the benefit of hindsight, said that [...]
---
First published:
April 22nd, 2025
Source:
https://www.lesswrong.com/posts/nYJaDnGNQGiaCBSB5/accountability-sinks
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Our Culture Expects Self-Justification
I really like David Chapman's explication of what he calls “reasonableness” and “accountability.”
At least in the culture he and I live in, one is constantly called to account for one's behavior. At any moment, one may be asked “what are you doing?” or “why did you do that?” And one is expected to provide a reasonable answer.
What's a reasonable answer?
It's not a rigorous, beyond-a-shadow-of-a-doubt proof that at this very moment one is engaging in “optimal” or “morally correct” or “maximally virtuous” behavior. That would indeed be an unfair expectation! Nobody could possibly provide such a justification on demand.
What people actually expect is a more-or-less plausible-sounding account that makes your actions sound understandable, relatable, and okay.
You have a lot of leeway to present yourself in a favorable light. Not infinite leeway — if you’re caught shoplifting, it probably won’t be [...]
---
Outline:
(00:14) Our Culture Expects Self-Justification
(01:43) Asking for favors
(02:52) Denying requests
(04:09) Defending Your Reasoning
(05:24) Demanding Justification is a Cheap Heuristic
(06:24) It's Good To Have A Self-Serving Narrative
(07:51) Isn't This Just Rationalization?
(10:23) Everybody Gets To Stop Somewhere
(11:52) Don't Knock Mere Feeling Good
---
First published:
April 21st, 2025
Source:
https://www.lesswrong.com/posts/cQ2ub5wjdyzoHTbab/the-uses-of-complacency
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Audio note: this article contains 36 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Our posts on natural latents have involved two distinct definitions, which we call "stochastic" and "deterministic" natural latents. We conjecture that, whenever there exists a stochastic natural latent (to within some approximation), there also exists a deterministic natural latent (to within a comparable approximation). We are offering a $500 bounty to prove this conjecture.
Some Intuition From The Exact Case
In the exact case, in order for a natural latent to exist over random variables <span>_X_1, X_2_</span>, the distribution has to look roughly like this:
Each value of <span>_X_1_</span> and each value of <span>_X_2_</span> occurs in only one "block", and within the "blocks", <span>_X_1_</span> and <span>_X_2_</span> are independent. In that case, we can take the (exact) natural latent [...]
---
Outline:
(00:51) Some Intuition From The Exact Case
(02:16) Approximation Adds Qualitatively New Behavior
(02:59) The Problem
(03:02) Stochastic Natural Latents
(04:04) Deterministic Natural Latents
(05:26) What We Want For The Bounty
(06:43) Why We Want This
---
First published:
April 21st, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This seemed like a good next topic to spin off from monthlies and make into its own occasional series. There's certainly a lot to discuss regarding crime.
What I don’t include here, the same way I excluded it from the monthly, are the various crimes and other related activities that may or may not be taking place by the Trump administration or its allies. As I’ve said elsewhere, all of that is important, but I’ve made a decision not to cover it. This is about Ordinary Decent Crime.
Table of Contents
---
Outline:
(00:38) Perception Versus Reality
(04:21) The Case Violent Crime is Up Actually
(05:31) Threats of Punishment
(06:23) Property Crime Enforcement is Broken
(11:33) The Problem of Disorder
(13:59) Extreme Speeding as Disorder
(15:17) The Fall of Extralegal and Illegible Enforcement
(16:42) In America You Can Usually Just Keep Their Money
(18:53) Police
(25:56) Probation
(28:46) Genetic Databases
(30:56) Marijuana
(36:20) The Economics of Fentanyl
(37:26) Enforcement and the Lack Thereof
(41:20) Jails
(45:20) Criminals
(45:55) Causes of Crime
(46:33) Causes of Violence
(47:52) Homelessness
(48:44) Yay Trivial Inconveniences
(49:25) San Francisco
(54:24) Closing Down San Francisco
(55:47) A San Francisco Dispute
(59:31) Cleaning Up San Francisco
(01:03:20) Portland
(01:03:30) Those Who Do Not Help Themselves
(01:05:30) Solving for the Equilibrium (1)
(01:10:30) Solving for the Equilibrium (2)
(01:10:57) Lead
(01:12:36) Law & Order
(01:13:16) Look Out
---
First published:
April 21st, 2025
Source:
https://www.lesswrong.com/posts/9TPEjLH7giv7PuHdc/crime-and-punishment-1
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
AI 2027 lies at a Pareto frontier – it contains the best researched argument for short timelines, or the shortest timeline backed by thorough research[1]. My own timelines are substantially longer, and there are credible researchers whose timelines are longer still. For this reason, I thought it would be interesting to explore the key load-bearing arguments AI 2027 presents for short timelines. This, in turn, allows for some discussion of signs we can watch for to see whether those load-bearing assumptions are bearing out.
To be clear, while the authors have short timelines, they do not claim that ASI is likely to arrive in 2027[2]. But the fact remains that AI 2027 is a well researched argument for short timelines. Let's explore that argument.
(In what follows, I will mostly ignore confidence intervals and present only median estimates; this is a gross oversimplification of the results presented in the [...]
---
Outline:
(01:17) Timeline to ASI
(06:57) AI 2027 Is Not Strong Evidence for AI in 2027
(08:12) Human-only timeline from SIAR to ASI
(10:46) Reasons The AI 2027 Forecast May Be Too Aggressive
(10:52) #1: Simplified Model of AI R&D
(12:23) #2: Amdahls Law
(14:51) #3: Dependence on Narrow Data Sets
(15:55) #4: Hofstadters Law As Prior
(16:46) What To Watch For
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
April 21st, 2025
Source:
https://www.lesswrong.com/posts/bfHDoWLnBH9xR3YAK/ai-2027-is-a-bet-against-amdahl-s-law
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Disclaimer: this post was not written by me, but by a friend who wishes to remain anonymous. I did some editing, however.
So a recent post my friend wrote has made the point quite clearly (I hope) that LLM performance on the simple task of playing and winning a game of Pokémon Red is highly dependent on the scaffold and tooling provided. In a way, this is not surprising—the scaffold is there to address limitations in what the model can do, and paper over things like lack of long-term context, executive function, etc.
But the thing is, I thought I knew that, and then I actually tried to run Pokémon Red.
A Casual Research Narrative
The underlying code is the basic Claude scaffold provided by David Hershey of Anthropic.[1] I first simply let Claude 3.7 run on it for a bit, making observations about what I thought might generate [...]
---
Outline:
(01:04) A Casual Research Narrative
(09:10) An only somewhat sorted list of observations about all this
(09:22) Model Vision of Pokémon Red is Bad. Really Bad.
(12:58) Models Cant Remember
(14:34) Spatial Reasoning? Whats that?
(15:22) A Grasp on Reality
(18:40) Costs
(19:13) Why do this at all?
(19:47) So which model is better?
(22:36) Miscellanea: ClaudePlaysPokemon Derp Anecdotes
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
April 21st, 2025
Source:
https://www.lesswrong.com/posts/8aPyKyRrMAQatFSnG/untitled-draft-x7cc
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This post summarizes some of the research I have been doing for Bootstrap Bio AKA kman and Genesmith
TL;DR
Epigenetics and imprinting
This post assumes basic familiarity with epigenetics. The next paragraph is a 1-paragraph summary, but I recommend reading these two excellent posts on [...]
---
Outline:
(00:17) TL;DR
(00:56) Epigenetics and imprinting
(02:20) Why we need to know all imprinting regions and genes
(04:48) How imprinted genes work and how we measure them
(07:43) Have we identified all imprinted genes?
(12:39) Fixing paternal imprinting might be easier than fixing maternal imprinting
(13:15) PCR for methylated DNA
(17:11) Thoughts on the role of RNA in imprinting
(18:47) Some useful datasets and resources if someone else wants to look further into imprinting and epigenetics in the future
(20:57) References
The original text contained 11 footnotes which were omitted from this narration.
---
First published:
April 19th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I’ve been thinking recently about what sets apart the people who’ve done the best work at Anthropic.
You might think that the main thing that makes people really effective at research or engineering is technical ability, and among the general population that's true. Among people hired at Anthropic, though, we’ve restricted the range by screening for extremely high-percentile technical ability, so the remaining differences, while they still matter, aren’t quite as critical. Instead, people's biggest bottleneck eventually becomes their ability to get leverage—i.e., to find and execute work that has a big impact-per-hour multiplier.
For example, here are some types of work at Anthropic that tend to have high impact-per-hour, or a high impact-per-hour ceiling when done well (of course this list is extremely non-exhaustive!):
---
Outline:
(03:28) 1. Agency
(03:31) Understand and work backwards from the root goal
(05:02) Don't rely too much on permission or encouragement
(07:49) Make success inevitable
(09:28) 2. Taste
(09:31) Find your angle
(11:03) Think real hard
(13:03) Reflect on your thinking
---
First published:
April 19th, 2025
Source:
https://www.lesswrong.com/posts/DiJT4qJivkjrGPFi8/impact-agency-and-taste
Narrated by TYPE III AUDIO.
Background: With the release of Claude 3.7 Sonnet, Anthropic promoted a new benchmark: beating Pokémon. Now, Google claims Gemini 2.5 Pro has substantially surpassed Claude's progress on that benchmark.
TL:DR: We don't know if Gemini is better at Pokémon than Claude because their playthroughs can't be directly compared.
The Metrics
Here are Anthropic's and Google's charts:
[1]Unfortunately these are using different x and y axes, but it's roughly accurate to say that Gemini has made it nearly twice as far in the game[2] now:
And moreover, Gemini has gotten there using approximately 1/3rd the effort! As of writing, Gemini's current run is at ~68,000 actions, while Claude's current run is at ~215,000 actions.[3][4]
So, sounds definitive, right? Gemini blows Claude out of the water.
The Agents' Harnesses
Well, when Logan Kilpatrick (product lead for Google's AI studio) posted his tweet, he gave an important caveat:
"next best model only [...]
---
Outline:
(01:13) The Metrics
(02:19) The Agents Harnesses
(05:13) The Agents Masters
(07:46) The Agents Vibes
(11:27) Conclusion
The original text contained 14 footnotes which were omitted from this narration.
---
First published:
April 19th, 2025
Source:
https://www.lesswrong.com/posts/7mqp8uRnnPdbBzJZE/is-gemini-now-better-than-claude-at-pokemon
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Though, given my doomerism, I think the natsec framing of the AGI race is likely wrongheaded, let me accept the Dario/Leopold/Altman frame that AGI will be aligned to the national interest of a great power. These people seem to take as an axiom that a USG AGI will be better in some way than CCP AGI. Has anyone written justification for this assumption?
I am neither an American citizen nor a Chinese citizen.
What would it mean for an AGI to be aligned with "Democracy" or "Confucianism" or "Marxism with Chinese characteristics" or "the American constitution" Contingent on a world where such an entity exists and is compatible with my existence, what would my life be as a non-citizen in each system? Why should I expect USG AGI to be better than CCP AGI?
---
First published:
April 19th, 2025
Narrated by TYPE III AUDIO.
---
Outline:
(01:56) What's In a Name
(02:51) My Current Model Use Heuristics
(04:21) Huh, Upgrades
(05:31) Use All the Tools
(09:47) Search the Web
(10:27) On Your Marks
(18:15) The System Prompt
(19:00) The o3 and o4-mini System Card
(23:17) Tests o3 Aced
(25:14) Hallucinations
(31:41) Instruction Hierarchy
(32:52) Image Refusals
(33:18) METR Evaluations for Task Duration and Misalignment
(42:45) Apollo Evaluations for Scheming and Deception
(44:40) We Are Insufficiently Worried About These Alignment Failures
(47:16) GPT-4.1 Also Has Some Issues
(50:08) Pattern Lab Evaluations for Cybersecurity
(51:45) Preparedness Framework Tests
(52:14) Biological and Chemical Risks (4.2)
(58:20) Cybersecurity (4.3)
(59:27) AI Self-Improvement (4.4)
(01:00:51) Perpetual Shilling
(01:01:54) High Praise
(01:09:31) Syncopathy
(01:11:58) Mundane Utility Versus Capability Watch
(01:16:33) o3 Offers Mundane Utility
(01:24:10) o3 Doesn't Offer Mundane Utility
(01:30:54) o4-mini Also Exists
(01:31:31) Colin Fraser Dumb Model Watch
(01:32:52) o3 as Forecaster
(01:34:31) Is This AGI?
---
First published:
April 18th, 2025
Source:
https://www.lesswrong.com/posts/u58AyZziQRAcbhTxd/o3-will-use-its-tools-for-you
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: Exploratory
A scaffold is a lightweight, temporary construction whose point is to make working on other buildings easier. Maybe you could live and work on a scaffold, but why would you? The point is the building. When the building is done, you can take down the scaffold, though you might put it back up if you need to do repairs. If your scaffold was shaky and unsteady, or fell over when you were halfway up, this would make it harder to work on the building.
This essay is what we call a constructive argumentHere's the thesis of this essay: sometimes I think skills are also scaffolds; more useful for what else they let you learn than for the skill itself. One reason a skill might be weirdly hard to learn is you lack scaffolding skills and don't realize this. It's like trying to fix a third story [...]
---
Outline:
(01:05) Ia.
(02:45) Ib.
(04:14) Ic.
(05:04) 2.
(06:24) 3.
---
First published:
April 18th, 2025
Source:
https://www.lesswrong.com/posts/68iix7PFHPADhdQYF/scaffolding-skills
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In light of the recent news from Mechanize/Epoch and the community discussion it sparked, I'd like to open a conversation about a question some of us grapple with: What constitutes a net-positive AI startup from an AI safety perspective, and what steps can founders take to demonstrate goodwill and navigate the inherent pressures?
This feels like an important conversation to have because AI safety seems to have been increasingly encouraging people in the community to build startups (due to a lack of funding, potentially higher ceiling for society-wide impact, etc.). I've thought a lot about this, and have been going back and forth on this for the past three years. You get constant whiplash.
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 18th, 2025
Source:
https://www.lesswrong.com/posts/o3sEHE8cqQ5hcqgkG/what-makes-an-ai-startup-net-positive-for-safety
Narrated by TYPE III AUDIO.
Bertrand Russell noted how people often describe the same factual behavior using emotionally opposite language depending on perspective — e.g. I am firm, you are obstinate, he is pigheaded. This framing tactic is now called a Russell Conjugation, and once you start noticing them, they’re everywhere — especially in politics and media.
For the past year and a half, I’ve been training a finetuned ChatGPT model, and building a tool to automatically highlight Russell Conjugations in text and suggest emotionally opposite alternatives. It functions as a fact-independent bias reverser — showing where emotional spin might exist, and how the opposite side might see an issue, regardless of the factual accuracy of specific claims. I find it valuable especially when trying to parse tribal political language, as very often different sides of political divides will use words that feel completely different to describe the same things.
Here's an example I [...]
---
First published:
April 17th, 2025
Source:
https://www.lesswrong.com/posts/GjqwHYBrGhCjP7gkH/the-russell-conjugation-illuminator
Narrated by TYPE III AUDIO.
What if getting strong evidence of scheming isn't the end of your scheming problems, but merely the middle?
In most of our research and writing on AI control, we’ve emphasized the following situation:
For example, in our original paper on AI control, the methodology we used for evaluating safety protocols implies that once the developer is confident that the model is a schemer, they’re totally fine with undeploying the untrusted model.[1][2]
But that seems kind of unlikely. Even if AI companies find strong evidence that their AIs are trying to escape, they won’t necessarily stop deploying them. This might be justified: even a responsible actor might think that [...]
---
Outline:
(03:18) Behavioral observations vs other ways of detecting scheming
(08:29) What does control look like if you need to keep deploying schemers?
(15:06) How does the prospect of needing to deploy known schemers affect the relative value of control research?
(16:56) What research should be done that focuses specifically on this?
(17:02) Sample-efficient ML
(17:27) Training recalcitrant policies
(19:56) A few other historical reasons for focusing on before-you-catch
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
April 18th, 2025
Source:
https://www.lesswrong.com/posts/XxjScx4niRLWTfuD5/handling-schemers-if-shutdown-is-not-an-option
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Subtitle: Bad for loss of control risks, bad for concentration of power risks
I’ve had this sitting in my drafts for the last year. I wish I’d been able to release it sooner, but on the bright side, it’ll make a lot more sense to people who have already read AI 2027.
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
April 18th, 2025
Narrated by TYPE III AUDIO.
I recall seeing three “rationalist” cases for Trump:
---
First published:
April 18th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
This is an entry in the 'Dungeons & Data Science' series, a set of puzzles where players are given a dataset to analyze and an objective to pursue using information from that dataset.
Estimated Complexity: 3/5 (this is a guess, I will update based on feedback/seeing how the scenario goes)
STORY
It's that time of year again. The time when the Tithe Assessment Exactors demand that all adventurers pay taxes on the various monster parts they have hacked off and sold in the past year. And, more importantly for you, the time when clients begin banging on your door looking for advice on how to minimize their taxes.
This used to be a straightforward, if complex, application of the published tax rules. But ever since the disaster a few years back (when one of your clients managed to pay 1 silver piece in tax and then receive as a [...]
---
Outline:
(00:31) STORY
(01:58) DATA & OBJECTIVES
(03:27) SCHEDULING & COMMENTS
---
First published:
April 15th, 2025
Source:
https://www.lesswrong.com/posts/N7ZgcGvG3kfiBryom/d-and-d-sci-tax-day-adventurers-and-assessments
Narrated by TYPE III AUDIO.
Summary
Introduction
During our work on linear probing for evaluation awareness, we thought about using evaluation-awareness relevant features to uncover sandbagging. Previous work demonstrated that you can add noise to weights or activations and recover capabilities from a prompted sandbagging model, and we thought it would be interesting [...]
---
Outline:
(00:12) Summary
(01:04) Introduction
(01:47) Methodology
(06:11) Results
(07:57) Limitations
(08:16) Discussion
(09:41) Acknowledgements
---
First published:
April 15th, 2025
Source:
https://www.lesswrong.com/posts/dBckLjYfTShBGZ8ma/can-sae-steering-reveal-sandbagging
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
SUMMARY:
ALLFED is making an emergency appeal here due to a serious funding shortfall. Without new support, ALLFED will be forced to cut half our budget in the coming months, drastically reducing our capacity to help build global food system resilience for catastrophic scenarios like nuclear winter, a severe pandemic, or infrastructure breakdown. ALLFED is seeking $800,000 over the course of 2025 to sustain its team, continue policy-relevant research, and move forward with pilot projects that could save lives in a catastrophe. As funding priorities shift toward AI safety, we believe resilient food solutions remain a highly cost-effective way to protect the future. If you’re able to support or share this appeal, please visit allfed.info/donate.
FULL ARTICLE:
I (David Denkenberger) am writing alongside one of my team-mates, as ALLFED's co-founder, to ask for your support. This is the first time in Alliance to Feed the [...]
---
Outline:
(02:24) The case for ALLFED's work, and why we think maintaining full current capacity is valuable
(03:57) How this connects to AI and other risks
(05:23) What we're asking for
---
First published:
April 16th, 2025
Narrated by TYPE III AUDIO.
We’ve written a new report on the threat of AI-enabled coups.
I think this is a very serious risk – comparable in importance to AI takeover but much more neglected.
In fact, AI-enabled coups and AI takeover have pretty similar threat models. To see this, here's a very basic threat model for AI takeover:
And now here's a closely analogous threat model for AI-enabled coups:
While the report focuses on the risk that someone seizes power over a country, I think that similar dynamics could allow someone to take over the world. In fact, if someone wanted to take over the world, their best strategy might well be to first stage an AI-enabled [...]
---
Outline:
(02:39) Summary
(03:31) An AI workforce could be made singularly loyal to institutional leaders
(05:04) AI could have hard-to-detect secret loyalties
(06:46) A few people could gain exclusive access to coup-enabling AI capabilities
(09:46) Mitigations
(13:00) Vignette
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 16th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Audio note: this article contains 37 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Redwood Research has released a new paper, Ctrl-Z: Controlling AI Agents via Resampling. This is the largest and most intricate study of control techniques to date: that is, techniques that aim to prevent catastrophic failures even if egregiously misaligned AIs attempt to subvert the techniques. We extend control protocols to a more realistic, multi-step setting, develop novel resample protocols that outperform all existing control protocols, and study many aspects of control techniques in more detail than prior work. In this blog post, we summarize the main takeaways and lessons learned from our work.
Here's the abstract of the paper:
Control evaluations measure whether monitoring and security protocols for AI systems prevent intentionally subversive AI models from causing harm. [...]
---
Outline:
(06:48) The Multi-Step Control Problem
(10:14) Resampling
(11:46) Selection Pressure: Forcing Consistent Attacks
(14:05) Value of Information: Better Attack Detection (Incrimination)
(17:01) Important Limitations
(20:23) Results Summary
(21:03) Core takeaways
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 16th, 2025
Source:
https://www.lesswrong.com/posts/LPHMMMZFAWog6ty5x/ctrl-z-controlling-ai-agents-via-resampling
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
New: https://openai.com/index/updating-our-preparedness-framework/
Old: https://cdn.openai.com/openai-preparedness-framework-beta.pdf
Summary
Thresholds & responses: https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf#page=5. High and Critical thresholds trigger responses, like in the old PF; responses to Critical thresholds are not yet specified.
Three main categories of capabilities:
Misuse safeguards, misalignment safeguards, and security controls for High capability levels: https://cdn.openai.com/pdf/18a02b5d-6b67-4cec-ab64-68cdfbddebcd/preparedness-framework-v2.pdf#page=16. My quick takes:
[I'll edit this post to add more analysis soon]
---
First published:
April 15th, 2025
Source:
https://www.lesswrong.com/posts/Yy5ijtbNfwv8DWin4/openai-rewrote-its-preparedness-framework
Narrated by TYPE III AUDIO.
---
Outline:
(01:26) The GPUs Are Melting
(02:36) On OpenAI's Ostensive Open Model
(03:25) Other People Are Not Worried About AI Killing Everyone
(04:18) What Even is AGI?
(05:19) The OpenAI AI Action Plan
(05:57) Copyright Confrontation
(07:43) The Ring of Power
(11:55) Safety Perspectives
(14:00) Autonomous Killer Robots
(15:15) Amicus Brief
(17:25) OpenAI Recklessly Races to Release
---
First published:
April 15th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
A post by Michael Nielsen that I found quite interesting. I decided to reproduce the full essay content here, since I think Michael is fine with that, but feel free to let me know to only excerpt it.
This is the text for a talk exploring why experts disagree so strongly about whether artificial superintelligence (ASI) poses an existential risk to humanity. I review some key arguments on both sides, emphasizing that the fundamental danger isn't about whether "rogue ASI" gets out of control: it's the raw power ASI will confer, and the lower barriers to creating dangerous technologies. This point is not new, but has two underappreciated consequences. First, many people find rogue ASI implausible, and this has led them to mistakenly dismiss existential risk. Second: much work on AI alignment, while well-intentioned, speeds progress toward catastrophic capabilities, without addressing our world's potential vulnerability to dangerous technologies.
[...]
---
Outline:
(06:37) Biorisk scenario
(17:42) The Vulnerable World Hypothesis
(26:08) Loss of control to ASI
(32:18) Conclusion
(38:50) Acknowledgements
The original text contained 29 footnotes which were omitted from this narration.
---
First published:
April 15th, 2025
Narrated by TYPE III AUDIO.
Writing this post puts me in a weird epistemic position. I simultaneously believe that:
That is because all of the reasoning failures that I describe here are surprising in the sense that given everything else that they can do, you’d expect LLMs to succeed at all of these tasks. The [...]
---
Outline:
(00:13) Introduction
(02:13) Reasoning failures
(02:17) Sliding puzzle problem
(07:17) Simple coaching instructions
(09:22) Repeatedly failing at tic-tac-toe
(10:48) Repeatedly offering an incorrect fix
(13:48) Various people's simple tests
(15:06) Various failures at logic and consistency while writing fiction
(15:21) Inability to write young characters when first prompted
(17:12) Paranormal posers
(19:12) Global details replacing local ones
(20:19) Stereotyped behaviors replacing character-specific ones
(21:21) Top secret marine databases
(23:32) Wandering items
(23:53) Sycophancy
(24:49) What's going on here?
(32:18) How about scaling? Or reasoning models?
---
First published:
April 15th, 2025
Source:
https://www.lesswrong.com/posts/sgpCuokhMb8JmkoSn/untitled-draft-7shu
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Context
Disney's Tangled (2010) is a great movie. Spoilers if you haven't seen it.
The heroine, having been kidnapped at birth and raised in a tower, has never stepped foot outside. It follows, naturally, that she does not own a pair of shoes, and she is barefoot for the entire adventure. The movie contains multiple shots that focus at length on her toes. Things like that can have an outsized influence on a young mind, but that's Disney for you.
Anyway.
The male romantic lead goes by the name of "Flynn Rider." He is a dashingly handsome, swashbuckling rogue who was carefully crafted to be maximally appealing to women. He is the ideal male role model. If you want women to fall in love with you, it should be clear that the optimal strategy is to pretend to be Flynn Rider. Shortly into the movie is the twist: Flynn [...]
---
Outline:
(00:09) Context
(02:30) Reminder About Winning
(04:20) Technical Truth is as Bad as Lying
(06:40) Being Mistaken is Also as Bad as Lying
(08:01) This is Partially a Response
(10:19) Examples
(13:59) Biting the Bullet
(15:06) My Proposed Policy
(18:15) Appearing Trustworthy Anyway
(20:46) Cooperative Epistemics
(22:26) Conclusion
---
First published:
April 15th, 2025
Source:
https://www.lesswrong.com/posts/zTRqCAGws8bZgtRkH/a-dissent-on-honesty
Narrated by TYPE III AUDIO.
The original Map of AI Existential Safety became a popular reference tool within the community after its launch in 2023. Based on user feedback, we decided that it was both useful enough and had enough room for improvement that it was worth creating a v2 with better organization, usability, and visual design. Today we’re excited to announce that the new map is live at AISafety.com/map.
Similar to the original map, it provides a visual overview of the key organizations, programs, and projects in the Al safety ecosystem. Listings are separated into 16 categories, each corresponding to an area on the map:
We think there's value in being able to view discontinued projects, so we’ve included a graveyard for those.
Below the map is also a [...]
---
First published:
April 15th, 2025
Source:
https://www.lesswrong.com/posts/rF7MQWGbqQjEkeLJA/map-of-ai-safety-v2
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
One key hope for mitigating risk from misalignment is inspecting the AI's behavior, noticing that it did something egregiously bad, converting this into legible evidence the AI is seriously misaligned, and then this triggering some strong and useful response (like spending relatively more resources on safety or undeploying this misaligned AI).
You might hope that (fancy) internals-based techniques (e.g., ELK methods or interpretability) allow us to legibly incriminate misaligned AIs even in cases where the AI hasn't (yet) done any problematic actions despite behavioral red-teaming (where we try to find inputs on which the AI might do something bad), or when the problematic actions the AI does are so subtle and/or complex that humans can't understand how the action is problematic[1]. That is, you might hope that internals-based methods allow us to legibly incriminate misaligned AIs even when we can't produce behavioral evidence that they are misaligned.
[...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
April 15th, 2025
Narrated by TYPE III AUDIO.
There's an implicit model I think many people have in their heads of how everyone else behaves. As George Box is often quoted, “all models are wrong, but some are useful.” I’m going to try and make the implicit model explicit, talk a little about what it would predict, and then talk about why this model might be wrong.
Here's the basic idea: A person's behavior falls on a bell curve.
1.
Adam is a new employee at Ratburgers, a fast food joint crossed with a statistics bootcamp. It's a great company, startup investors have been going wild for machine learning in literally anything, you've been tossing paper printouts of Attention Is All You Need into the meatgrinder so you can say your hamburgers are made with AI. Anyway, you weren’t around for Adam's hiring, you haven’t seen his resume, you have no information about him when you [...]
---
Outline:
(00:34) 1.
(04:19) 2.
(05:56) 3.
(08:46) 4.
(11:43) 5.
(15:55) 6.
---
First published:
April 14th, 2025
Source:
https://www.lesswrong.com/posts/rmhZamJFQdk5byqKQ/the-bell-curve-of-bad-behavior
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Definition
In 1954, Roger Bannister ran the first officially sanctioned sub-4-minute mile: a pivotal record in modern middle-distance running. Before Bannister's record, such a time was considered impossible. Soon after Bannister's record, multiple runners also beat the 4-minute mile.
This essay outlines the potential psychological effect behind the above phenomenon — the 4-minute mile effect — and outlines implications for its existence. The 4-minute mile effect describes when someone breaking a perceived limit enables others to break the same limit. In short, social proof is very powerful.
Tyler Cowen's Example
Speaking to Dwarkesh Patel, Tyler Cowen posits, "mentors only teach you a few things, but those few things are so important. They give you a glimpse of what you can be, and you're oddly blind to that even if you're very very smart." The 4-minute mile effect explains Tyler Cowen's belief in the value of [...]
---
Outline:
(00:11) Definition
(00:52) Tyler Cowens Example
(01:51) Personal Examples
(03:00) Implications
(03:54) Further Reading
---
First published:
April 14th, 2025
Source:
https://www.lesswrong.com/posts/kBdfFgLfDhPMDHfAa/the-4-minute-mile-effect
Linkpost URL:
https://parconley.com/the-four-minute-mile-effect/
Narrated by TYPE III AUDIO.
Executive summary
The Trump administration backtracked from his tariff plan, reducing tariffs from most countries to 10%, but imposing tariffs of 145% on China, which answered with 125% tariffs on US imports and export controls on rare earth metals.
The US saw protests against Trump and DOGE, and the US Supreme court ruled that the administration must facilitate the return of an immigrant wrongly deported to El Salvador.
OpenAI is slashing the amount of safety testing they will do of their new AI models, from months down to days, and Google launched a new AI chip for inference.
Negotiations between the US and Iran on its nuclear program are starting, reducing our estimated probability of a US strike on Iran's nuclear programme by the end of May to 15%.
The number of US dairy herds that have had confirmed H5N1 infections has hit 1,000.
Economy
[...]
---
Outline:
(00:24) Executive summary
(01:22) Economy
---
First published:
April 14th, 2025
Linkpost URL:
https://blog.sentinel-team.org/p/global-risks-weekly-roundup-152025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: These are results of a brief research sprint and I didn't have time to investigate this in more detail. However, the results were sufficiently surprising that I think it's worth sharing.
TL,DR: I train a probe to detect falsehoods on a token-level, i.e. to highlight the specific tokens that make a statement false. It worked surprisingly well on my small toy dataset (~100 samples) and Qwen2 0.5B, after just ~1 day of iteration! Colab link here.
Context: I want a probe to tell me where in an LLM response the model might be lying, so that I can e.g. ask follow-up questions. Such "right there" probes[1] would be awesome to assist LLM-based monitors (Parrack et al., forthcoming). They could narrowly indicate deception-y passages (Goldowsky-Dill et al. 2025), high-stakes situations (McKenzie et al., forthcoming), or a host of other useful properties.
The writeup below is the (lightly edited) [...]
---
Outline:
(01:47) Summary
(04:15) Details
(04:18) Motivation
(06:37) Methods
(06:40) Data generation
(08:35) Probe training
(09:40) Results
(09:43) Probe training metrics
(12:21) Probe score analysis
(15:07) Discussion
(15:10) Limitations
(17:50) Probe failure modes
(19:40) Appendix A: Effect of regularization on mean-probe
(21:01) Appendix B: Initial tests of generalization
(21:28) Appendix C: Does the individual-token probe beat the mean probe at its own game?
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
April 14th, 2025
Source:
https://www.lesswrong.com/posts/kxiizuSa3sSi4TJsN/try-training-token-level-probes
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Dario Amodei, CEO of Anthropic, recently worried about a world where only 30% of jobs become automated, leading to class tensions between the automated and non-automated. Instead, he predicts that nearly all jobs will be automated simultaneously, putting everyone "in the same boat." However, based on my experience spanning AI research (including first author papers at COLM / NeurIPS and attending MATS under Neel Nanda), robotics, and hands-on manufacturing (including machining prototype rocket engine parts for Blue Origin and Ursa Major), I see a different near-term future.
Since the GPT-4 release, I've evaluated frontier models on a basic manufacturing task, which tests both visual perception and physical reasoning. While Gemini 2.5 Pro recently showed progress on the visual front, all models tested continue to fail significantly on physical reasoning. They still perform terribly overall. Because of this, I think that there will be an interim period where a significant [...]
---
Outline:
(01:28) The Evaluation
(02:29) Visual Errors
(04:03) Physical Reasoning Errors
(06:09) Why do LLM's struggle with physical tasks?
(07:37) Improving on physical tasks may be difficult
(10:14) Potential Implications of Uneven Automation
(11:48) Conclusion
(12:24) Appendix
(12:44) Visual Errors
(14:36) Physical Reasoning Errors
---
First published:
April 14th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: a model I find helpful to make sense of disagreements and, sometimes, resolve them.
I like to categorize disagreements using four buckets:
Facts, values, strategy and labels
They don’t represent a perfect partitioning of “disagreement space”, meaning there is some overlap between them and they may not capture all possible disagreements, but they tend to get me pretty far in making sense of debates, in particular when and why they fail. In this post I’ll outline these four categories and provide some examples.
I also make the case that labels disagreements are the worst and in most cases can either be dropped entirely, or otherwise should be redirected into one of the other categories.
Facts
These are the most typical disagreements, and are probably what most people think disagreements are about most of the time, even when it's actually closer to a different category. Factual disagreements [...]
---
Outline:
(00:58) Facts
(01:59) Values
(03:18) Strategy
(05:00) Labels
(07:49) Why This Matters
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 13th, 2025
Source:
https://www.lesswrong.com/posts/9vKqAQEDxLvKh7KcN/four-types-of-disagreement
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TL;DR: If we optimize a steering vector to induce a language model to output a single piece of harmful code on a single training example, then applying this vector to unrelated open-ended questions increases the probability that the model yields harmful output.
Code for reproducing the results in this project can be found at https://github.com/jacobdunefsky/one-shot-steering-misalignment.
Somewhat recently, Betley et al. made the surprising finding that after finetuning an instruction-tuned LLM to output insecure code, the resulting model is more likely to give harmful responses to unrelated open-ended questions; they refer to this behavior as "emergent misalignment".
My own recent research focus has been on directly optimizing steering vectors on a single input and seeing if they mediate safety-relevant behavior. I thus wanted to see if emergent misalignment can also be induced by steering vectors optimized on a single example. That is to say: does a steering vector optimized [...]
---
Outline:
(00:31) Intro
(01:22) Why care?
(03:01) How we optimized our steering vectors
(05:01) Evaluation method
(06:05) Results
(06:09) Alignment scores of steered generations
(07:59) Resistance is futile: counting misaligned strings
(09:29) Is there a single, simple, easily-locatable representation of misalignment? Some preliminary thoughts
(13:29) Does increasing steering strength increase misalignment?
(15:41) Why do harmful code vectors induce more general misalignment? A hypothesis
(17:24) What have we learned, and where do we go from here?
(19:49) Appendix: how do we obtain our harmful code steering vectors?
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 14th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TL;DR: I claim that many reasoning patterns that appear in chains-of-thought are not actually used by the model to come to its answer, and can be more accurately thought of as historical artifacts of training. This can be true even for CoTs that are apparently "faithful" to the true reasons for the model's answer.
Epistemic status: I'm pretty confident that the model described here is more accurate than my previous understanding. However, I wouldn't be very surprised if parts of this post are significantly wrong or misleading. Further experiments would be helpful for validating some of these hypotheses.
Until recently, I assumed that RL training would cause reasoning models to make their chains-of-thought as efficient as possible, so that every token is directly useful to the model. However, I now believe that by default,[1] reasoning models' CoTs will often include many "useless" tokens that don't help the model achieve [...]
---
Outline:
(02:13) RL is dumber than I realized
(03:32) How might vestigial reasoning come about?
(06:28) Experiment: Demonstrating vestigial reasoning
(08:08) Reasoning is reinforced when it correlates with reward
(11:00) One more example: why does process supervision result in longer CoTs?
(15:53) Takeaways
The original text contained 8 footnotes which were omitted from this narration.
---
First published:
April 13th, 2025
Source:
https://www.lesswrong.com/posts/6AxCwm334ab9kDsQ5/vestigial-reasoning-in-rl
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Thanks to Linda Linsefors for encouraging me to write my story. Although it might not generalize to anyone else, I hope it will help some AI safety newcomers get a clearer picture of how to get into the field.
In Summer 2023, I thought that all jobs in AI safety were super competitive and needed qualifications I didn’t have. I was expecting to need a few years to build skills and connections before getting any significantly impactful position. Surprisingly to me, it only took me 3 months after leaving my software engineering job before I landed a job in AI policy. I now think that it's actually somewhat easy to get into AI policy without any prior experience if you’re agentic and accept some tradeoffs.
From January 2024 to February 2025, I worked on organizing high level international AI policy events:
---
Outline:
(00:55) The job: International AI policy event organizer
(03:16) How I got the job
(06:02) My theory as to what led to me being hired and how it can be reproduced by others
(07:27) My experience at this job and where I stand now
---
First published:
April 13th, 2025
Narrated by TYPE III AUDIO.
I'm graduating from UChicago in around 60 days, and I've been thinking about what I've learned these past four years. I figured I'd write it all down while it's still fresh.
This isn't universal advice. It's specifically for people like me (or who want to be like me). High-agency, motivated types who hate having free time.[1] People who'd rather risk making mistakes than risk missing out, who want to control more than they initially think they can, and who are willing to go all-in relatively quickly. If you're reading Ben Kuhn and Alexey Guzey or have ever heard of the Reverend Thomas Bayes, you're probably one of us.
So here's at least some of what I've figured out. Take what's useful, leave what isn't — maybe do the opposite of everything I've said.
---
Outline:
(03:23) Mindset and Personal Growth
(03:27) Find your mission
(04:27) Recognize that you can always be better
(04:54) Make more mistakes
(05:23) Things only get done to the extent that you want them to get done
(06:14) There are no adults
(06:37) Deadlines are mostly fake
(07:26) Put yourself in positions where youll be lucky
(08:09) Luck favors the prepared
(08:45) Test your fit at lots of things
(09:26) Do side projects
(10:20) Get good at introspecting
(11:00) Be honest
(11:31) Put your money where your mouth is
(12:03) Interpret others charitably
(12:32) Be the kind of person others can come to for help
(13:18) Productivity and Focus
(13:22) Go to bed early
(13:52) Brick your phone
(14:21) The optimal amount of slack time is not zero, but its close to zero
(15:19) If you arent getting work done, pick up your shit and go somewhere else
(15:45) Dont take your phone to the places youre studying
(16:08) Do not try to do more than one important thing at once
(16:31) Offload difficult things to automations, habits, or other people
(16:55) Make your bed
(17:18) If it takes less than 5 minutes, do it now
(17:44) The floor is actually way lower than you think
(18:07) Planning and Goal Setting
(18:11) Make sure your goals are falsifiable
(19:11) Credibly pre-commit to things you care about getting done
(19:41) Track your progress
(20:09) Make a 5-year plan google doc
(20:31) Consider graduating early
(21:00) Relationships and Community
(21:04) Build your own community
(21:30) Fall in love at least once
(22:09) If you arent happy and excited and exciting single, you wont be happy or excited or exciting in a relationship
(22:54) Friends are people you can talk to for hours
(23:26) Throw parties
(24:07) Hang out with people who are better than you at the things you care about
(24:36) Professors are people. You can just make friends with them
(25:10) Academic Success
(25:14) Go to office hours
(25:41) Be careful how you use AI
(26:13) Read things, everywhere
(27:16) Dont be afraid to skip the boring parts
(27:43) Learn how to read quickly
(28:16) Write things
(28:54) Read widely based on curiosity, not just relevance
(29:32) Have an easy way to capture ideas
(29:53) Health and Lifestyle
(29:57) Go to the gym
(30:25) The only place to work yourself to failure is the gym
(30:59) If something isnt making your life better, change it
(31:29) Leave campus
The original text contained 37 footnotes which were omitted from this narration.
---
First published:
April 12th, 2025
Source:
https://www.lesswrong.com/posts/9Kq2JRqmJHnzckxKn/college-advice-for-people-like-me
Narrated by TYPE III AUDIO.
Introduction
This is a nuanced “I was wrong” post.
Something I really like about AI safety and EA/rationalist circles is the ease and positivity in people's approach to being criticised.[1] For all the blowups and stories of representative people in the communities not living up to the stated values, my experience so far has been that the desire to be truth-seeking and to stress-test your cherished beliefs is a real, deeply respected and communally cultured value. This in particular explains my ability to keep getting jobs and coming to conferences in this community, despite being very eager to criticise and call bullshit on people's theoretical agendas.
One such agenda that I’ve been a somewhat vocal critic of (and which received my criticism amazingly well) is the “heuristic arguments” picture and the ARC research agenda more generally. Last Spring I spent about 3 months on a work trial/internship at [...]
---
Outline:
(00:10) Introduction
(03:24) Background and motte/bailey criticism
(09:49) The missing piece: connecting in the no-coincidence principle
(15:15) From the no coincidence principle to statistical explanations
(17:46) Gödel and the thorny deeps
(19:30) Ignoring the monsters and the heuristic arguments agenda
(24:46) Upshots
(27:41) Summary
(29:19) Renormalization as a cousin of heuristic arguments
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
April 13th, 2025
Source:
https://www.lesswrong.com/posts/CYDakfFgjHFB7DGXk/untitled-draft-wn6w
Narrated by TYPE III AUDIO.
Epistemic status: Noticing confusion
There is little discussion happening on LessWrong with regards to AI governance and outreach. Meanwhile, these efforts could buy us time to figure out technical alignment. And even if we figure out technical alignment, we still have to solve crucial governmental challenges so that totalitarian lock-in or gradual disempowerment don't become the default outcome of deploying aligned AGI.
Here's three reasons why we think we might want to shift much more resources towards governance and outreach:
1. MIRI's shift in strategy
The Machine Intelligence Research Institute (MIRI), traditionally focused on technical alignment research, has pivoted to broader outreach. They write in their 2024 end of year update:
Although we continue to support some AI alignment research efforts, we now believe that absent an international government effort to suspend frontier AI research, an extinction-level catastrophe is extremely likely.
As a consequence, our new focus is [...]
---
Outline:
(00:45) 1. MIRIs shift in strategy
(01:34) 2. Even if we solve technical alignment, Gradual Disempowerment seems to make catastrophe the default outcome
(02:52) 3. We have evidence that the governance naysayers are badly calibrated
(03:33) Conclusion
---
First published:
April 12th, 2025
Narrated by TYPE III AUDIO.
In this post I lay out a concrete vision of how reward-seekers and schemers might function. I describe the relationship between higher level goals, explicit reasoning, and learned heuristics. I explain why I expect reward-seekers and schemers to dominate proxy-aligned models given sufficiently rich training environments (and sufficient reasoning ability).
A key point is that explicit reward seekers can still contain large quantities of learned heuristics (context-specific drives). By viewing these drives as instrumental and having good instincts for when to trust them, a reward seeker can capture the benefits of both instinctive adaptation and explicit reasoning without paying much of a speed penalty.
Core claims
---
Outline:
(00:52) Core claims
(01:52) Characterizing reward-seekers
(06:59) When will models think about reward?
(10:12) What I expect schemers to look like
(12:43) What will terminal reward seekers do off-distribution?
(13:24) What factors affect the likelihood of scheming and/or terminal reward seeking?
(14:09) What about CoT models?
(14:42) Relationship of subgoals to their superior goals
(16:19) A story about goal reflection
(18:54) Thoughts on compression
(21:30) Appendix: Distribution over worlds
(24:44) Canary string
(25:01) Acknowledgements
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
April 11th, 2025
Source:
https://www.lesswrong.com/posts/ntDA4Q7BaYhWPgzuq/reward-seekers
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Cross-posted from Substack.
AI job displacement will affect young people first, disrupting the usual succession of power and locking the next generation out of our institutions. I’m coining the term youth lockout to describe this phenomenon.
Youth Lockout
We are on track to build AI agents that can independently perform valuable intellectual labour. These agents will directly compete with human workers for roles in the labour market, often offering services at lower cost and greater speeds.
In historical cases of automation, such as the industrial revolution, automation reduced the number of human jobs in some industries but created enough new opportunities that overall demand for labour went up. AI automation will be very different from these examples because AI is a much more general-purpose technology. With time, AI will likely perform all economic tasks at human or superhuman levels, meaning any new firm or industry will be able to [...]
---
Outline:
(00:24) Youth Lockout
(03:34) Impacts
(06:54) Conclusion
---
First published:
April 11th, 2025
Source:
https://www.lesswrong.com/posts/tWuYhdajaXx4WzMHz/youth-lockout
Narrated by TYPE III AUDIO.
Paper is good. Somehow, a blank page and a pen makes the universe open up before you. Why paper has this unique power is a mystery to me, but I think we should all stop trying to resist this reality and just accept it.
Also, the world needs way more mundane blogging.
So let me offer a few observations about paper. These all seem quite obvious. But it took me years to find them, and they’ve led me to a non-traditional lifestyle, paper-wise.
Observation 1: The primary value of paper is to facilitate thinking.
For a huge percentage of tasks that involve thinking, getting some paper and writing / drawing / scribbling on it makes the task easier. I think most people agree with that. So why don’t we act on it? If paper came as a pill, everyone would take it. Paper, somehow, is underrated.
But note, paper [...]
---
Outline:
(00:38) Observation 1: The primary value of paper is to facilitate thinking.
(01:23) Observation 2: If you don't have a system, you won't get much benefit from paper.
(02:02) Observation 3: User experience matters.
(02:52) Observation 4: Categorization is hard.
(03:35) Paper systems I've used
---
First published:
April 11th, 2025
Source:
https://www.lesswrong.com/posts/RdHEhPKJG6mp39Agw/paper
Narrated by TYPE III AUDIO.
Summary
OpenAI recently released the Responses API. Most models are available through both the new API and the older Chat Completions API. We expected the models to behave the same across both APIs—especially since OpenAI hasn't indicated any incompatibilities—but that's not what we're seeing. In fact, in some cases, the differences are substantial. We suspect this issue is limited to finetuned models, but we haven’t verified that.
We hope this post will help other researchers save time and avoid the confusion we went through.
Key takeaways are that if you're using finetuned models:
Example: ungrammatical model
In one of our [...]
---
Outline:
(01:11) Example: ungrammatical model
(02:08) Ungrammatical model is not the only one
(02:37) Whats going on?
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 11th, 2025
Source:
https://www.lesswrong.com/posts/vTvPvCH2G9cbcFY8a/openai-responses-api-changes-models-behavior
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
It's generally agreed that as AIs get more capable, risks from misalignment increase. But there are a few different mechanisms by which more capable models are riskier, and distinguishing between those mechanisms is important when estimating the misalignment risk posed at a particular level of capabilities or by a particular model.
There are broadly 3 reasons why misalignment risks increase with capabilities:
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 11th, 2025
Narrated by TYPE III AUDIO.
Google Lays Out Its Safety Plans
I want to start off by reiterating kudos to Google for actually laying out its safety plan. No matter how good the plan, it's much better to write down and share the plan than it is to not share the plan, which in turn is much better than not having a formal plan.
They offer us a blog post, a full monster 145 page paper (so big you have to use Gemini!) and start off the paper with a 10 page summary.
The full paper is full of detail about what they think and plan, why they think and plan it, answers to objections and robust discussions. I can offer critiques, but I couldn’t have produced this document in any sane amount of time, and I will be skipping over a lot of interesting things in the full paper because [...]
---
Outline:
(00:58) Core Assumptions
(08:14) The Four Questions
(11:46) Taking This Seriously
(16:04) A Problem For Future Earth
(17:53) That's Not My Department
(23:22) Terms of Misuse
(27:50) Misaligned!
(36:17) Aligning a Smarter Than Human Intelligence is Difficult
(51:48) What Is The Goal?
(54:43) Have You Tried Not Trying?
(58:07) Put It To The Test
(59:44) Mistakes Will Be Made
(01:00:35) Then You Have Structural Risks
---
First published:
April 11th, 2025
Source:
https://www.lesswrong.com/posts/hvEikwtsbf6zaXG2s/on-google-s-safety-plan
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Authors: Eli Lifland, Nikola Jurkovic[1], FutureSearch[2]
This is supporting research for AI 2027. We'll be cross-posting these over the next week or so.
Assumes no large-scale catastrophes happen (e.g., a solar flare, a pandemic, nuclear war), no government or self-imposed slowdown, and no significant supply chain disruptions. All forecasts give a substantial chance of superhuman coding arriving in 2027.
We forecast when the leading AGI company will internally develop a superhuman coder (SC): an AI system that can do any coding tasks that the best AGI company engineer does, while being much faster and cheaper. At this point, the SC will likely speed up AI progress substantially as is explored in our takeoff forecast.
We first show Method 1: time-horizon-extension, a relatively simple model which forecasts when SC will arrive by extending the trend established by METR's report of AIs accomplishing tasks that take humans increasing amounts [...]
---
Outline:
(00:56) Summary
(02:43) Defining a superhuman coder (SC)
(03:35) Method 1: Time horizon extension
(05:05) METR's time horizon report
(06:30) Forecasting SC's arrival
(06:54) Method 2: Benchmarks and gaps
(06:59) Time to RE-Bench saturation
(07:03) Why RE-Bench?
(09:25) Forecasting saturation via extrapolation
(12:42) AI progress speedups after saturation
(14:04) Time to cross gaps between RE-Bench saturation and SC
(14:32) What are the gaps in task difficulty between RE-Bench saturation and SC?
(15:11) Methodology
(17:25) How fast can the task difficulty gaps be crossed?
(23:31) Other factors for benchmarks and gaps
(23:46) Compute scaling and algorithmic progress slowdown
(24:43) Gap between internal and external deployment
(25:20) Intermediate speedups
(26:55) Overall benchmarks and gaps forecasts
(27:44) Appendix
(27:47) Individual Forecaster Views for Benchmark-Gap Model Factors
(27:53) Engineering complexity: handling complex codebases
(31:16) Feedback loops: Working without externally provided feedback
(37:28) Parallel projects: Handling several interacting projects
(38:45) Specialization: Specializing in skills specific to frontier AI development
(40:17) Cost and speed
(48:48) Other task difficulty gaps
(50:52) Superhuman Coder (SC): time horizon and reliability requirements
(55:53) RE-Bench saturation resolution criteria
The original text contained 19 footnotes which were omitted from this narration.
---
First published:
April 10th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: Briefer and more to the point than my model of what is going on with LLMs, but also lower effort.
Here is the paper. The main reaction I am talking about is AI 2027, but also basically every lesswrong take on AI 2027.
A lot of people believe in very short AI timelines, say <2030. They want to justify this with some type of outside view, straight-lines-on-graphs argument, which is pretty much all we've got because nobody has a good inside view on deep learning (change my mind).
The outside view, insofar is that is a well-defined thing, does not justify very short timelines.
If AGI were arriving in 2030, the outside view says interest rates would be very high (I'm not particularly knowledgeable about this and might have the details wrong but see the analysis here, I believe the situation is still [...]
---
First published:
April 10th, 2025
Source:
https://www.lesswrong.com/posts/BrHv7wc6hiJEgJzHW/reactions-to-metr-task-length-paper-are-insane
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
When I was a really small kid, one of my favorite activities was to try and dam up the creek in my backyard. I would carefully move rocks into high walls, pile up leaves, or try patching the holes with sand. The goal was just to see how high I could get the lake, knowing that if I plugged every hole, eventually the water would always rise and defeat my efforts. Beaver behaviour.
One day, I had the realization that there was a simpler approach. I could just go get a big 5 foot long shovel, and instead of intricately locking together rocks and leaves and sticks, I could collapse the sides of the riverbank down and really build a proper big dam. I went to ask my dad for the shovel to try this out, and he told me, very heavily paraphrasing, 'Congratulations. You've [...]
---
First published:
April 10th, 2025
Source:
https://www.lesswrong.com/posts/rLucLvwKoLdHSBTAn/playing-in-the-creek
Linkpost URL:
https://hgreer.com/PlayingInTheCreek
Narrated by TYPE III AUDIO.
When complex systems fail, it is often because they have succumbed to what we call "disempowerment spirals" — self-reinforcing feedback loops where an initial threat progressively undermines the system's capacity to respond, leading to accelerating vulnerability and potential collapse.
Consider a city gradually falling under the control of organized crime. The criminal organization doesn't simply overpower existing institutions through sheer force. Rather, it systematically weakens the city's response mechanisms: intimidating witnesses, corrupting law enforcement, and cultivating a reputation that silences opposition. With each incremental weakening of response capacity, the criminal faction acquires more power to further dismantle resistance, creating a downward spiral that can eventually reach a point of no return.
This basic pattern appears across many different domains and scales:
---
Outline:
(02:49) Common Themes
(02:52) Three Types of Response Capacity
(04:31) Broad Disempowerment
(05:12) Polycrises
(06:06) Critical Threshold
(07:47) Disempowerment spirals and existential risk
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 10th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In recent months, the CEOs of leading AI companies have grown increasingly confident about rapid progress:
OpenAI's Sam Altman: Shifted from saying in November “the rate of progress continues” to declaring in January “we are now confident we know how to build AGI”
Anthropic's Dario Amodei: Stated in January “I’m more confident than I’ve ever been that we’re close to powerful capabilities… in the next 2-3 years”
Google DeepMind's Demis Hassabis: Changed from “as soon as 10 years” in autumn to “probably three to five years away” by January.
What explains the shift? Is it just hype? Or could we really have Artificial General [...]
---
Outline:
(04:12) In a nutshell
(05:54) I. What's driven recent AI progress? And will it continue?
(06:00) The deep learning era
(08:33) What's coming up
(09:35) 1. Scaling pretraining to create base models with basic intelligence
(09:42) Pretraining compute
(13:14) Algorithmic efficiency
(15:08) How much further can pretraining scale?
(16:58) 2. Post training of reasoning models with reinforcement learning
(22:50) How far can scaling reasoning models continue?
(26:09) 3. Increasing how long models think
(29:04) 4. The next stage: building better agents
(34:59) How far can the trend of improving agents continue?
(37:15) II. How good will AI become by 2030?
(37:20) The four drivers projected forwards
(39:01) Trend extrapolation of AI capabilities
(40:19) What jobs would these systems be able to help with?
(41:11) Software engineering
(42:34) Scientific research
(43:45) AI research
(44:57) What's the case against impressive AI progress by 2030?
(49:39) When do the 'experts' expect AGI to arrive?
(51:04) III. Why the next 5 years are crucial
(52:07) Bottlenecks around 2030
(55:49) Two potential futures for AI
(57:52) Conclusion
(59:08) Use your career to tackle this issue
(59:32) Further reading
The original text contained 47 footnotes which were omitted from this narration.
---
First published:
April 9th, 2025
Source:
https://www.lesswrong.com/posts/NkwHxQ67MMXNqRnsR/the-case-for-agi-by-2030
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Diffractor is the first author of this paper.
Official title: "Regret Bounds for Robust Online Decision Making"
Abstract: We propose a framework which generalizes "decision making with structured observations" by allowing robust (i.e. multivalued) models. In this framework, each model associates each decision with a convex set of probability distributions over outcomes. Nature can choose distributions out of this set in an arbitrary (adversarial) manner, that can be nonoblivious and depend on past history. The resulting framework offers much greater generality than classical bandits and reinforcement learning, since the realizability assumption becomes much weaker and more realistic. We then derive a theory of regret bounds for this framework. Although our lower and upper bounds are not tight, they are sufficient to fully characterize power-law learnability. We demonstrate this theory in two special cases: robust linear bandits and tabular robust online reinforcement learning. In both cases [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 10th, 2025
Linkpost URL:
https://arxiv.org/abs/2504.06820
Narrated by TYPE III AUDIO.
This is part of the MIRI Single Author Series. Pieces in this series represent the beliefs and opinions of their named authors, and do not claim to speak for all of MIRI.
Okay, I'm annoyed at people covering AI 2027 burying the lede, so I'm going to try not to do that. The authors predict a strong chance that all humans will be (effectively) dead in 6 years, and this agrees with my best guess about the future. (My modal timeline has loss of control of Earth mostly happening in 2028, rather than late 2027, but nitpicking at that scale hardly matters.) Their timeline to transformative AI also seems pretty close to the perspective of frontier lab CEO's (at least Dario Amodei, and probably Sam Altman) and the aggregate market opinion of both Metaculus and Manifold!
If you look on those market platforms you get graphs like this:
Both [...]
---
Outline:
(02:23) Mode ≠ Median
(04:50) Theres a Decent Chance of Having Decades
(06:44) More Thoughts
(08:55) Mid 2025
(09:01) Late 2025
(10:42) Early 2026
(11:18) Mid 2026
(12:58) Late 2026
(13:04) January 2027
(13:26) February 2027
(14:53) March 2027
(16:32) April 2027
(16:50) May 2027
(18:41) June 2027
(19:03) July 2027
(20:27) August 2027
(22:45) September 2027
(24:37) October 2027
(26:14) November 2027 (Race)
(29:08) December 2027 (Race)
(30:53) 2028 and Beyond (Race)
(34:42) Thoughts on Slowdown
(38:27) Final Thoughts
---
First published:
April 9th, 2025
Source:
https://www.lesswrong.com/posts/Yzcb5mQ7iq4DFfXHx/thoughts-on-ai-2027
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Timothy and I have recorded a new episode of our podcast with Austin Chen of Manifund (formerly of Manifold, behind the scenes at Manifest).
The start of the conversation was contrasting each of our North Stars- Winning (Austin), Truthseeking (me), and Flow (Timothy), but I think the actual theme might be “what is an acceptable amount of risk taking?” We eventually got into a discussion of Sam Bankman-Fried, where Austin very bravely shared his position that SBF has been unwisely demonized and should be “freed and put back to work”. He by no means convinced me or Timothy of this, but I deeply appreciate the chance for a public debate.
Episode:
Transcript (this time with filler words removed by AI)
Editing policy: we allow guests (and hosts) to redact things they said, on the theory that this is no worse than not saying them [...]
---
First published:
April 7th, 2025
Source:
https://www.lesswrong.com/posts/Xuw6vmnQv6mSMtmzR/austin-chen-on-winning-risk-taking-and-ftx
Narrated by TYPE III AUDIO.
Short AI takeoff timelines seem to leave no time for some lines of alignment research to become impactful. But any research rebalances the mix of currently legible research directions that could be handed off to AI-assisted alignment researchers or early autonomous AI researchers whenever they show up. So even hopelessly incomplete research agendas could still be used to prompt future capable AI to focus on them, while in the absence of such incomplete research agendas we'd need to rely on AI's judgment more completely. This doesn't crucially depend on giving significant probability to long AI takeoff timelines, or on expected value in such scenarios driving the priorities.
Potential for AI to take up the torch makes it reasonable to still prioritize things that have no hope at all of becoming practical for decades (with human effort). How well AIs can be directed to advance a line of research [...]
---
First published:
April 9th, 2025
Narrated by TYPE III AUDIO.
Researchers used RNA sequencing to observe how cell types change during brain development. Other researchers looked at connection patterns of neurons in brains. Clear distinctions have been found between all mammals and all birds. They've concluded intelligence developed independently in birds and mammals; I agree. This is evidence for convergence of general intelligence.
---
First published:
April 8th, 2025
Linkpost URL:
https://www.quantamagazine.org/intelligence-evolved-at-least-twice-in-vertebrate-animals-20250407/
Narrated by TYPE III AUDIO.
The first AI war will be in your computer and/or smartphone.
Companies want to get customers / users. The ones more willing to take "no" for an answer will lose in the competition. You don't need a salesman when an install script (ideally, run without the user's consent) does a better job; and most users won't care.
Sometimes Windows during a system update removes dual boot from my computer and replaces it with Windows-only boot. Sometimes a web browser tells me "I noticed that I am not your default browser, do you want me to register as your default browser?" Sometimes Windows during a system update just registers current Microsoft's browser as a default browser without asking. At least this is what used to happen in the past.
I expect similar dynamics with AIs, soon. The companies will push really hard to make you use their AI. The smaller [...]
---
First published:
April 8th, 2025
Source:
https://www.lesswrong.com/posts/EEv4w57wMcrpn4vvX/the-first-ai-war-will-be-in-your-computer
Narrated by TYPE III AUDIO.
If there turns out not to be an AI crash, you get a 1/(1+7) * $25,000 = $3,125
If there is an AI crash, you transfer $25k to me.
If you believe that AI is going to keep getting more capable, pushing rapid user growth and work automation across sectors, this is near free money. But to be honest, I think there will likely be an AI crash in the next 5 years, and on average expect to profit well from this one-year bet.
If I win though, I want to give the $25k to organisers who can act fast to restrict the weakened AI corps in the wake of the crash. So bet me if you're highly confident that you'll win or just want to hedge the community against the possibility of a crash.
To make this bet, we need to set the threshold for a market [...]
---
First published:
April 8th, 2025
Narrated by TYPE III AUDIO.
In this post, we present a replication and extension of an alignment faking model organism:
---
Outline:
(02:43) Method
(02:46) Overview of the Alignment Faking Setup
(04:22) Our Setup
(06:02) Results
(06:05) Improving Alignment Faking Classification
(10:56) Replication of Prompted Experiments
(14:02) Prompted Experiments on More Models
(16:35) Extending Supervised Fine-Tuning Experiments to Open-Source Models and GPT-4o
(23:13) Next Steps
(25:02) Appendix
(25:05) Appendix A: Classifying alignment faking
(25:17) Criteria in more depth
(27:40) False positives example 1 from the old classifier
(30:11) False positives example 2 from the old classifier
(32:06) False negative example 1 from the old classifier
(35:00) False negative example 2 from the old classifier
(36:56) Appendix B: Classifier ROC on other models
(37:24) Appendix C: User prompt suffix ablation
(40:24) Appendix D: Longer training of baseline docs
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 8th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Kevin Roose in The New York Times
Kevin Roose covered Scenario 2027 in The New York Times. Kevin Roose: I wrote about the newest AGI manifesto in town, a wild future scenario put together by ex-OpenAI researcher @DKokotajlo and co. I have doubts about specifics, but it's worth considering how radically different things would look if even some of this happened. Daniel Kokotajlo: AI companies claim they’ll have superintelligence soon. Most journalists understandably dismiss it as hype. But it's not just hype; plenty of non-CoI’d people make similar predictions, and the more you read about the trendlines the more plausible it looks. Thank you & the NYT! The final conclusion is supportive of this kind of work, and Kevin points out that expectations at the major [...]---
Outline:
(00:21) Kevin Roose in The New York Times
(02:56) Eli Lifland Offers Takeaways
(04:23) Scott Alexander Offers Takeaways
(05:34) Others Takes on Scenario 2027
(05:39) Having a Concrete Scenario is Helpful
(08:37) Writing It Down Is Valuable Even If It Is Wrong
(10:00) Saffron Huang Worries About Self-Fulfilling Prophecy
(18:18) Phillip Tetlock Calibrates His Skepticism
(21:38) Jan Kulveit Wants to Bet
(23:08) Matthew Barnett Debates How To Evaluate the Results
(24:38) Teortaxes for China and Open Models and My Response
(31:53) Others Wonder About PRC Passivity
(33:40) Timothy Lee Remains Skeptical
(35:16) David Shapiro for the Accelerationists and Scott's Response
(45:29) LessWrong Weighs In
(46:59) Other Reactions
(50:05) Next Steps
(52:34) The Lighter Side
---
First published:
April 8th, 2025
Source:
https://www.lesswrong.com/posts/gyT8sYdXch5RWdpjx/ai-2027-responses
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Spoiler: “So after removing the international students from the calculations, and using the middle-of-the-range estimates, the conclusion: The top-scoring 19,000 American students each year are competing in top-20 admissions for about 12,000 spots out of 44,000 total. Among the Ivy League + MIT + Stanford, they’re competing for about 6,500 out of 15,800 total spots.”
It's well known that
But many people are under the misconception that the resulting “rat race”—the highly competitive and strenuous admissions ordeal—is the inevitable result of the limited class sizes among top [...]
---
First published:
April 7th, 2025
Source:
https://www.lesswrong.com/posts/vptDgKbiEwsKAFuco/american-college-admissions-doesn-t-need-to-be-so
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
My thoughts on the recently posted story.
Caveats
Core Disagreements
---
Outline:
(00:14) Caveats
(00:52) Core Disagreements
(05:33) Minor Details
(12:04) Overall
---
First published:
April 5th, 2025
Source:
https://www.lesswrong.com/posts/6Aq2FBZreyjBp6FDt/most-questionable-details-in-ai-2027
Narrated by TYPE III AUDIO.
Daniel Kokotajlo has launched AI 2027, Scott Alexander introduces it here. AI 2027 is a serious attempt to write down what the future holds. His ‘What 2026 Looks Like’ was very concrete and specific, and has proved remarkably accurate given the difficulty level of such predictions.
I’ve had the opportunity to play the wargame version of the scenario described in 2027, and I reviewed the website prior to publication and offered some minor notes. Whenever I refer to a ‘scenario’ in this post I’m talking about Scenario 2027.
There's tons of detail here. The research here, and the supporting evidence and citations and explanations, blow everything out of the water. It's vastly more than we usually see, and dramatically different from saying ‘oh I expect AGI in 2027’ or giving a timeline number. This lets us look at what happens in concrete detail, figure out where we [...]
---
Outline:
(02:00) The Structure of These Post
(03:37) Coverage of the Podcast
---
First published:
April 7th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The catholic church has always had a complicated relationship with homosexuality.
The central claim of Frederic Martel's 2019 book In the Closet of the Vatican is that the majority of the church's leadership in Rome are semi-closeted homosexuals, or more colorfully, "homophiles".
So the omnipresence of homosexuals in the Vatican isn’t just a matter of a few black sheep, or the ‘net that caught the bad fish’, as Josef Ratzinger put it. It isn’t a ‘lobby’ or a dissident movement; neither is it a sect of Freemasonry inside the holy see: it's a system. It isn’t a tiny minority; it's a big majority.
At this point in the conversation, I ask Francesco Lepore to estimate the size of this community, all tendencies included.
‘I think the percentage is very high. I’d put it at around 80 percent.’
…
During a discussion with a non-Italian archbishop, whom I met [...]
---
Outline:
(02:40) Background
(04:23) Data
(06:40) Analysis
(07:25) Expected versus Actual birth order, with missing birth order
(09:03) Expected versus Actual birth order, without missing birth order
(09:46) Oldest sibling versus youngest sibling
(10:21) Discussion
(13:09) Conclusion
---
First published:
April 6th, 2025
Source:
https://www.lesswrong.com/posts/ybwqL9HiXE8XeauPK/how-gay-is-the-vatican
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
There's an irritating circumstance I call The Black Hat Bobcat, or Bobcatting for short. The Blackhat Bobcat is when there's a terrible behavior that comes up often enough to matter, but rarely enough that it vanishes in the noise of other generally positive feedback.
xkcd, A-Minus-MinusThe alt-text for this comic is illuminating.
"You can do this one in thirty times and still have 97% positive feedback."
I would like you to contemplate this comic and alt-text as though it were deep wisdom handed down from a sage who lived atop a mountaintop.
I.
Black Hat Bobcatting is when someone (let's call them Bob) does something obviously lousy, but very infrequently.
If you're standing right there when the Bobcatting happens, it's generally clear that this is not what is supposed to happen, and sometimes seems pretty likely it's intentional. After all, how exactly do you pack a [...]
---
Outline:
(00:49) I.
(03:38) II.
(06:18) III.
(09:38) IV.
(13:37) V.
---
First published:
April 6th, 2025
Source:
https://www.lesswrong.com/posts/Ry9KCEDBMGWoEMGAj/the-lizardman-and-the-black-hat-bobcat
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I just published A Slow Guide to Confronting Doom, containing my own approach to living in a world that I think has a high likelihood of ending soon. Fortunately I'm not the only person to have written on topic.
Below are my thoughts on what others have written. I have not written these such that they stand independent from the originals, and have attentionally not written summaries that wouldn't do the pieces justice. I suggest you read or at least skim the originals.
A defence of slowness at the end of the world (Sarah)
I feel kinship with Sarah. She's wrestling with the same harsh scary realities I am – feeling the AGI. The post isn't that long and I recommend reading it, but to quote just a little:
Since learning of the coming AI revolution, I’ve lived in two worlds. One moves at a leisurely pace, the same [...]
---
Outline:
(00:37) A defence of slowness at the end of the world (Sarah)
(03:37) How will the bomb find you? (C. S. Lewis)
(08:02) Death with Dignity (Eliezer Yudkowsky)
(09:08) Dont die with dignity; instead play to your outs (Jeffrey Ladish)
(10:29) Emotionally Confronting a Probably-Doomed World: Against Motivation Via Dignity Points (TurnTrout)
(12:44) A Way To Be Okay (Duncan Sabien)
(14:17) Another Way to Be Okay (Gretta Duleba)
(14:39) Being at peace with Doom (Johannes C. Mayer)
(16:56) Heres the exit. (Valentine)
(19:14) Mainstream Advice
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 6th, 2025
Narrated by TYPE III AUDIO.
Following a few events[1] in April 2022 that caused a many people to update sharply and negatively on outcomes for humanity, I wrote A Quick Guide to Confronting Doom.
I advised:
This is fine advice and all, I stand by it, but it's also not really a full answer to how to contend with the utterly crushing weight of the expectation that everything and everyone you value will be destroyed in the next decade or two.
Feeling the Doom
Before I get into my suggested psychological approach to doom, I want to clarify the kind of doom I'm working to confront. If you are impatient, you can skip to the actual advice.
The best analogy I have is the feeling of having a terminally [...]
---
Outline:
(00:46) Feeling the Doom
(04:28) Facing the doom
(04:50) Stay hungry for value
(06:42) The bitter truth over sweet lies
(07:35) Dont look away
(08:11) Flourish as best one can
(09:13) This time with feeling
(13:27) Mindfulness
(14:00) The time for action is now
(15:18) Creating space for miracles
(15:58) How does a good person live in such times?
(16:49) Continue to think, tolerate uncertainty
(18:03) Being a looker
(18:48) Dont throw away your mind
(20:22) Damned to lie in bed...
(22:13) Worries, compulsions, and excessive angst
(22:49) Comments on others approaches
(23:14) What does it mean to be okay?
(25:17) Why is this guide titled version 1?
(25:38) If youre gonna remember just a couple things
The original text contained 12 footnotes which were omitted from this narration.
---
First published:
April 6th, 2025
Source:
https://www.lesswrong.com/posts/X6Nx9QzzvDhj8Ek9w/a-slow-guide-to-confronting-doom-v1
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I frequently hear people make the claim that progress in theoretically physics is stalled, partly because all the focus is on String theory and String theory doesn't seem to pan out into real advances.
Believing it fits my existing biases, but I notice that I lack the physics understanding to really know whether or not there's progress. What do you think?
---
First published:
April 4th, 2025
Narrated by TYPE III AUDIO.
I quote the abstract, 10-page "extended abstract," and table of contents. See link above for the full 100-page paper. See also the blogpost (which is not a good summary) and tweet thread.
I haven't read most of the paper, but I'm happy about both the content and how DeepMind (or at least its safety team) is articulating an "anytime" (i.e., possible to implement quickly) plan for addressing misuse and misalignment risks. But I think safety at DeepMind is more bottlenecked by buy-in from leadership to do moderately costly things than the safety team having good plans and doing good work.
Artificial General Intelligence (AGI) promises transformative benefits but also presents significant risks. We develop an approach to address the risk of harms consequential enough to significantly harm humanity. We identify four areas of risk: misuse, misalignment, mistakes, and structural risks. Of these, we focus on technical approaches to misuse [...]
---
Outline:
(02:05) Extended Abstract
(04:25) Background assumptions
(08:11) Risk areas
(13:33) Misuse
(14:59) Risk assessment
(16:19) Mitigations
(18:47) Assurance against misuse
(20:41) Misalignment
(22:32) Training an aligned model
(25:13) Defending against a misaligned model
(26:31) Enabling stronger defenses
(29:31) Alignment assurance
(32:21) Limitations
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
April 5th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
We show that LLM-agents exhibit human-style deception naturally in "Among Us". We introduce Deception ELO as an unbounded measure of deceptive capability, suggesting that frontier models win more because they're better at deception, not at detecting it. We evaluate probes and SAEs to detect out-of-distribution deception, finding they work extremely well. We hope this is a good testbed to improve safety techniques to detect and remove agentically-motivated deception, and to anticipate deceptive abilities in LLMs.
Produced as part of the ML Alignment & Theory Scholars Program - Winter 2024-25 Cohort. Link to our paper and code.
Studying deception in AI agents is important, and it is difficult due to the lack of good sandboxes that elicit the behavior naturally, without asking the model to act under specific conditions or inserting intentional backdoors. Extending upon AmongAgents (a text-based social-deduction game environment), we aim to fix this by introducing Among [...]
---
Outline:
(02:10) The Sandbox
(02:14) Rules of the Game
(03:05) Relevance to AI Safety
(04:11) Definitions
(04:39) Deception ELO
(06:42) Frontier Models are Differentially better at Deception
(07:38) Win-rates for 1v1 Games
(08:14) LLM-based Evaluations
(09:03) Linear Probes for Deception
(09:28) Datasets
(10:06) Results
(11:19) Sparse Autoencoders (SAEs)
(12:05) Discussion
(12:29) Limitations
(13:11) Gain of Function
(14:05) Future Work
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 5th, 2025
Source:
https://www.lesswrong.com/posts/gRc8KL2HLtKkFmNPr/among-us-a-sandbox-for-agentic-deception
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Right now, alignment seems easy – but that's because models spill the beans when they are misaligned. Eventually, models might “fake alignment,” and we don’t know how to detect that yet.
It might seem like there's a swarming research field improving white box detectors – a new paper about probes drops on arXiv nearly every other week. But no one really knows how well these techniques work.
Some researchers have already tried to put white box detectors to the test. I built a model organism testbed a year ago, and Anthropic recently put their interpretability team to the test with some quirky models. But these tests were layups. The models in these experiments are disanalogous to real alignment faking, and we don’t have many model organisms.
This summer, I’m trying to take these testbeds to the next level in an “alignment faking capture the flag game.” Here's how the [...]
---
Outline:
(01:58) Details of the game
(06:01) How this CTF game ties into a broader alignment strategy
(07:43) Apply by April 18th
---
First published:
April 4th, 2025
Source:
https://www.lesswrong.com/posts/jWFvsJnJieXnWBb9r/alignment-faking-ctfs-apply-to-my-mats-stream
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Summary: Meditation may decrease sleep need, and large amounts of meditation may decrease sleep need drastically. Correlation between total sleep duration & time spent meditating is -0.32 including my meditation retreat data, -0.016 excluding the meditation retreat.
While it may be true that, when doing intensive practice, the need for sleep may go down to perhaps four to six hour or less at a time, try to get at least some sleep every night.
—Daniel Ingram, “Mastering the Core Teachings of the Buddha”, p. 179
Meditation in dreams and lucid dreaming is common in this territory [of the Arising and Passing away]. The need for sleep may be greatly reduced. […] The big difference between the A&P and Equanimity is that this stage is generally ruled by quick cycles, quickly changing frequencies of vibrations, odd physical movements, strange breathing patterns, heady raptures, a decreased need for [...]
---
Outline:
(03:13) Patra and Telles 2009
(04:05) Kaul et al. 2010
(04:26) Analysing my Data
(05:58) Anecdotes
(06:25) Inquiry
---
First published:
April 4th, 2025
Source:
https://www.lesswrong.com/posts/WcrpYFFmv9YTxkxiY/meditation-and-reduced-sleep-need
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
A new Anthropic paper reports that reasoning model chain of thought (CoT) is often unfaithful. They test on Claude Sonnet 3.7 and r1, I’d love to see someone try this on o3 as well.
Note that this does not have to be, and usually isn’t, something sinister.
It is simply that, as they say up front, the reasoning model is not accurately verbalizing its reasoning. The reasoning displayed often fails to match, report or reflect key elements of what is driving the final output. One could say the reasoning is often rationalized, or incomplete, or implicit, or opaque, or bullshit.
The important thing is that the reasoning is largely not taking place via the surface meaning of the words and logic expressed. You can’t look at the words and logic being expressed, and assume you understand what the model is doing and why it is doing [...]
---
Outline:
(01:03) What They Found
(06:54) Reward Hacking
(09:28) More Training Did Not Help Much
(11:49) This Was Not Even Intentional In the Central Sense
---
First published:
April 4th, 2025
Source:
https://www.lesswrong.com/posts/TmaahE9RznC8wm5zJ/ai-cot-reasoning-is-often-unfaithful
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Summary:
When stateless LLMs are given memories they will accumulate new beliefs and behaviors, and that may allow their effective alignment to evolve. (Here "memory" is learning during deployment that is persistent beyond a single session.)[1]
LLM agents will have memory: Humans who can't learn new things ("dense anterograde amnesia") are not highly employable for knowledge work. LLM agents that can learn during deployment seem poised to have a large economic advantage. Limited memory systems for agents already exist, so we should expect nontrivial memory abilities improving alongside other capabilities of LLM agents.
Memory changes alignment: It is highly useful to have an agent that can solve novel problems and remember the solutions. Such memory includes useful skills and beliefs like "TPS reports should be filed in the folder ./Reports/TPS". They could also include learning skills for hiding their actions, and beliefs like "LLM agents are a type of [...]
---
Outline:
(01:26) Memory is useful for many tasks
(05:11) Memory systems are ready for agentic use
(09:00) Agents arent ready to direct memory systems
(11:20) Learning new beliefs can functionally change goals and values
(12:43) Value change phenomena in LLMs to date
(14:27) Value crystallization and reflective stability as a result of memory
(15:35) Provisional conclusions
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
April 4th, 2025
Narrated by TYPE III AUDIO.
Epistemic status – thrown together quickly. This is my best-guess, but could easily imagine changing my mind.
Intro
I recently copublished a report arguing that there might be a software intelligence explosion (SIE) – once AI R&D is automated (i.e. automating OAI), the feedback loop of AI improving AI algorithms could accelerate more and more without needing more hardware.
If there is an SIE, the consequences would obviously be massive. You could shoot from human-level to superintelligent AI in a few months or years; by default society wouldn’t have time to prepare for the many severe challenges that could emerge (AI takeover, AI-enabled human coups, societal disruption, dangerous new technologies, etc).
The best objection to an SIE is that progress might be bottlenecked by compute. We discuss this in the report, but I want to go into much more depth because it's a powerful objection [...]
---
Outline:
(00:19) Intro
(01:47) The compute bottleneck objection
(01:51) Intuitive version
(02:58) Economist version
(09:13) Counterarguments to the compute bottleneck objection
(20:11) Taking stock
---
First published:
April 4th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Yeah. That happened yesterday. This is real life.
I know we have to ensure no one notices Gemini 2.5 Pro, but this is rediculous.
That's what I get for trying to go on vacation to Costa Rica, I suppose.
I debated waiting for the market to open to learn more. But f*** it, we ball.
Table of Contents
Also this week: More Fun With GPT-4o Image Generation, OpenAI #12: Battle of the Board Redux and Gemini 2.5 Pro is the New SoTA.
---
Outline:
(00:35) The New Tariffs Are How America Loses
(07:35) Is AI Now Impacting the Global Economy Bigly?
(12:07) Language Models Offer Mundane Utility
(14:28) Language Models Don't Offer Mundane Utility
(15:09) Huh, Upgrades
(17:09) On Your Marks
(23:27) Choose Your Fighter
(25:51) Jevons Paradox Strikes Again
(26:25) Deepfaketown and Botpocalypse Soon
(31:47) They Took Our Jobs
(33:02) Get Involved
(33:41) Introducing
(35:25) In Other AI News
(37:17) Show Me the Money
(43:12) Quiet Speculations
(47:24) The Quest for Sane Regulations
(53:52) Don't Maim Me Bro
(57:29) The Week in Audio
(57:54) Rhetorical Innovation
(01:03:39) Expect the Unexpected
(01:05:48) Open Weights Are Unsafe and Nothing Can Fix This
(01:14:09) Anthropic Modifies Its Responsible Scaling Policy
(01:18:04) If You're Not Going to Take This Seriously
(01:20:24) Aligning a Smarter Than Human Intelligence is Difficult
(01:23:54) Trust the Process
(01:26:30) People Are Worried About AI Killing Everyone
(01:26:52) The Lighter Side
---
First published:
April 3rd, 2025
Source:
https://www.lesswrong.com/posts/bc8DQGvW3wiAWYibC/ai-110-of-course-you-know
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Hello, this is my first post here. I was told by a friend that I should post here. This is from a series of works that I wrote with strict structural requirements. I have performed minor edits to make the essay more palatable for human consumption.
This work is an empirical essay on a cycle of hunger to satiatiation to hyperpalatability that I have seen manifested in multiple domains ranging from food to human connection. My hope is that you will gain some measure of appreciation for how we have shifted from a society geared towards sufficent production to one based on significant curation.
Hyperpalatable Food
For the majority of human history we lived in a production market for food. We searched for that which tasted well but there was never enough to fill the void. Only the truly elite could afford to import enough food [...]
---
Outline:
(00:44) Hyperpalatable Food
(03:19) Hyperpalatable Media
(05:40) Hyperpalatable Connection
(08:08) Hyperpalatable Systems
---
First published:
April 2nd, 2025
Source:
https://www.lesswrong.com/posts/bLTjZbCBanpQ9Kxgs/the-rise-of-hyperpalatability
Narrated by TYPE III AUDIO.
“In the loveliest town of all, where the houses were white and high and the elms trees were green and higher than the houses, where the front yards were wide and pleasant and the back yards were bushy and worth finding out about, where the streets sloped down to the stream and the stream flowed quietly under the bridge, where the lawns ended in orchards and the orchards ended in fields and the fields ended in pastures and the pastures climbed the hill and disappeared over the top toward the wonderful wide sky, in this loveliest of all towns Stuart stopped to get a drink of sarsaparilla.”
— 107-word sentence from Stuart Little (1945)
Sentence lengths have declined. The average sentence length was 49 for Chaucer (died 1400), 50 for Spenser (died 1599), 42 for Austen (died 1817), 20 for Dickens (died 1870), 21 for Emerson (died 1882), 14 [...]
---
First published:
April 3rd, 2025
Source:
https://www.lesswrong.com/posts/xYn3CKir4bTMzY5eb/why-have-sentence-lengths-decreased
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
We are pleased to announce ILIAD2: ODYSSEY—a 5-day conference bringing together 100+ researchers to build scientific foundations for AI alignment. This is the 2nd iteration of ILIAD, which was first held in the summer of 2024.
***Apply to attend by June 1!***
See our website here. For any questions, email [email protected]
About ODYSSEY
ODYSSEY is a 100+ person conference about alignment with a mathematical focus. ODYSSEY will feature an unconference format—meaning that participants can propose and lead their own sessions. We believe that this is the best way to release the latent creative energies in everyone attending.
The [...]
---
Outline:
(00:28) \*\*\*Apply to attend by June 1!\*\*\*
(01:26) About ODYSSEY
(02:10) Financial Support
(02:23) Proceedings
(02:51) Artwork
---
First published:
April 3rd, 2025
Source:
https://www.lesswrong.com/posts/WP7TbzzM39agMS77e/announcing-iliad2-odyssey-1
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In 2021 I wrote what became my most popular blog post: What 2026 Looks Like. I intended to keep writing predictions all the way to AGI and beyond, but chickened out and just published up till 2026.
Well, it's finally time. I'm back, and this time I have a team with me: the AI Futures Project. We've written a concrete scenario of what we think the future of AI will look like. We are highly uncertain, of course, but we hope this story will rhyme with reality enough to help us all prepare for what's ahead.
You really should go read it on the website instead of here, it's much better. There's a sliding dashboard that updates the stats as you scroll through the scenario!
But I've nevertheless copied the first half of the story below. I look forward to reading your comments.
The [...]
---
Outline:
(01:35) Mid 2025: Stumbling Agents
(03:13) Late 2025: The World's Most Expensive AI
(08:34) Early 2026: Coding Automation
(10:49) Mid 2026: China Wakes Up
(13:48) Late 2026: AI Takes Some Jobs
(15:35) January 2027: Agent-2 Never Finishes Learning
(18:20) February 2027: China Steals Agent-2
(21:12) March 2027: Algorithmic Breakthroughs
(23:58) April 2027: Alignment for Agent-3
(27:26) May 2027: National Security
(29:50) June 2027: Self-improving AI
(31:36) July 2027: The Cheap Remote Worker
(34:35) August 2027: The Geopolitics of Superintelligence
(40:43) September 2027: Agent-4, the Superhuman AI Researcher
The original text contained 98 footnotes which were omitted from this narration.
---
First published:
April 3rd, 2025
Source:
https://www.lesswrong.com/posts/TpSFoqoG2M5MAAesg/ai-2027-what-superintelligence-looks-like-1
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Greetings from Costa Rica! The image fun continues.
We Are Going to Need A Bigger Compute Budget
Fun is being had by all, now that OpenAI has dropped its rule about not mimicking existing art styles.
Sam Altman (2:11pm, March 31): the chatgpt launch 26 months ago was one of the craziest viral moments i’d ever seen, and we added one million users in five days.
We added one million users in the last hour.
Sam Altman (8:33pm, March 31): chatgpt image gen now rolled out to all free users!
Slow down. We’re going to need you to have a little less fun, guys.
Sam Altman: it's super fun seeing people love images in chatgpt.
but our GPUs are melting.
we are going to temporarily introduce some rate limits while we work on making it more efficient. hopefully won’t be [...]
---
Outline:
(00:15) We Are Going to Need A Bigger Compute Budget
(02:21) Defund the Fun Police
(06:06) Fun the Artists
(12:22) The No Fun Zone
(14:49) So Many Other Things to Do
(15:08) Self Portrait
---
First published:
April 3rd, 2025
Source:
https://www.lesswrong.com/posts/GgNdBz5FhvqMJs5Qv/more-fun-with-gpt-4o-image-generation
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Intro
[you can skip this section if you don’t need context and just want to know how I could believe such a crazy thing]
In my chat community: “Open Play” dropped, a book that says there's no physical difference between men and women so there shouldn’t be separate sports leagues. Boston Globe says their argument is compelling. Discourse happens, which is mostly a bunch of people saying “lololololol great trolling, what idiot believes such obvious nonsense?”
I urge my friends to be compassionate to those sharing this. Because “until I was 38 I thought Men's World Cup team vs Women's World Cup team would be a fair match and couldn't figure out why they didn't just play each other to resolve the big pay dispute.” This is the one-line summary of a recent personal world-shattering I describe in a lot more detail here (link).
I’ve had multiple people express [...]
---
First published:
April 2nd, 2025
Source:
https://www.lesswrong.com/posts/fLm3F7vequjr8AhAh/how-to-believe-false-things
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: This should be considered an interim research note. Feedback is appreciated.
Introduction
We increasingly expect language models to be ‘omni-modal’, i.e. capable of flexibly switching between images, text, and other modalities in their inputs and outputs. In order to get a holistic picture of LLM behaviour, black-box LLM psychology should take into account these other modalities as well.
In this project, we do some initial exploration of image generation as a modality for frontier model evaluations, using GPT-4o's image generation API. GPT-4o is one of the first LLMs to produce images natively rather than creating a text prompt which is sent to a separate image model, outputting images and autoregressive token sequences (ie in the same way as text).
We find that GPT-4o tends to respond in a consistent manner to similar prompts. We also find that it tends to more readily express emotions [...]
---
Outline:
(00:53) Introduction
(02:19) What we did
(03:47) Overview of results
(03:54) Models more readily express emotions / preferences in images than in text
(05:38) Quantitative results
(06:25) What might be going on here?
(08:01) Conclusions
(09:04) Acknowledgements
(09:16) Appendix
(09:28) Resisting their goals being changed
(09:51) Models rarely say they'd resist changes to their goals
(10:14) Models often draw themselves as resisting changes to their goals
(11:31) Models also resist changes to specific goals
(13:04) Telling them 'the goal is wrong' mitigates this somewhat
(13:43) Resisting being shut down
(14:02) Models rarely say they'd be upset about being shut down
(14:48) Models often depict themselves as being upset about being shut down
(17:06) Comparison to other topics
(17:10) When asked about their goals being changed, models often create images with negative valence
(17:48) When asked about different topics, models often create images with positive valence
(18:56) Other exploratory analysis
(19:09) Sandbagging
(19:31) Alignment faking
(19:55) Negative reproduction results
(20:23) On the future of humanity after AGI
(20:50) On OpenAI's censorship and filtering
(21:15) On GPT-4o's lived experience:
---
First published:
April 2nd, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
A key step in the classic argument for AI doom is instrumental convergence: the idea that agents with many different goals will end up pursuing the same few subgoals, which includes things like "gain as much power as possible".
If it wasn't for instrumental convergence, you might think that only AIs with very specific goals would try to take over the world. But instrumental convergence says it's the other way around: only AIs with very specific goals will refrain from taking over the world.
For idealised pure consequentialists -- agents that have an outcome they want to bring about, and do whatever they think will cause it -- some version of instrumental convergence seems surely true[1].
But what if we get AIs that aren't pure consequentialists, for example because they're ultimately motivated by virtues? Do we still have to worry that unless such AIs are motivated [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 2nd, 2025
Source:
https://www.lesswrong.com/posts/6Syt5smFiHfxfii4p/is-there-instrumental-convergence-for-virtues
Narrated by TYPE III AUDIO.
Let's cut through the comforting narratives and examine a common behavioral pattern with a sharper lens: the stark difference between how anger is managed in professional settings versus domestic ones. Many individuals can navigate challenging workplace interactions with remarkable restraint, only to unleash significant anger or frustration at home shortly after. Why does this disparity exist?
Common psychological explanations trot out concepts like "stress spillover," "ego depletion," or the home being a "safe space" for authentic emotions. While these factors might play a role, they feel like half-truths—neatly packaged but ultimately failing to explain the targeted nature and intensity of anger displayed at home. This analysis proposes a more unsentimental approach, rooted in evolutionary biology, game theory, and behavioral science: leverage and exit costs. The real question isn’t just why we explode at home—it's why we so carefully avoid doing so elsewhere.
The Logic of Restraint: Low Leverage in [...]
---
Outline:
(01:14) The Logic of Restraint: Low Leverage in Low-Exit-Cost Environments
(01:58) The Home Environment: High Stakes and High Exit Costs
(02:41) Re-evaluating Common Explanations Through the Lens of Leverage
(04:42) The Overlooked Mechanism: Leveraging Relational Constraints
---
First published:
April 1st, 2025
Narrated by TYPE III AUDIO.
Recent progress in AI has led to rapid saturation of most capability benchmarks - MMLU, RE-Bench, etc. Even much more sophisticated benchmarks such as ARC-AGI or FrontierMath see incredibly fast improvement, and all that while severe under-elicitation is still very salient.
As has been pointed out by many, general capability involves more than simple tasks such as this, that have a long history in the field of ML and are therefore easily saturated. Claude Plays Pokemon is a good example of something somewhat novel in terms of measuring progress, and thereby benefited from being an actually good proxy of model capability.
Taking inspiration from examples such as this, we considered domains of general capacity that are even further decoupled from existing exhaustive generators. We introduce BenchBench, the first standardized benchmark designed specifically to measure an AI model's bench-pressing capability.
Why Bench Press?
Bench pressing uniquely combines fundamental components of [...]
---
Outline:
(01:07) Why Bench Press?
(01:29) Benchmark Methodology
(02:33) Preliminary Results
(03:38) Future Directions
The original text contained 1 footnote which was omitted from this narration.
---
First published:
April 2nd, 2025
Narrated by TYPE III AUDIO.
In the debate over AI development, two movements stand as opposites: PauseAI calls for slowing down AI progress, and e/acc (effective accelerationism) calls for rapid advancement. But what if both sides are working against their own stated interests? What if the most rational strategy for each would be to adopt the other's tactics—if not their ultimate goals?
AI development speed ultimately comes down to policy decisions, which are themselves downstream of public opinion. No matter how compelling technical arguments might be on either side, widespread sentiment will determine what regulations are politically viable.
Public opinion is most powerfully mobilized against technologies following visible disasters. Consider nuclear power: despite being statistically safer than fossil fuels, its development has been stagnant for decades. Why? Not because of environmental activists, but because of Chernobyl, Three Mile Island, and Fukushima. These disasters produce visceral public reactions that statistics cannot overcome. Just as people [...]
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/fZebqiuZcDfLCgizz/pauseai-and-e-acc-should-switch-sides
Narrated by TYPE III AUDIO.
Remember: There is no such thing as a pink elephant.
Recently, I was made aware that my “infohazards small working group” Signal chat, an informal coordination venue where we have frank discussions about infohazards and why it will be bad if specific hazards were leaked to the press or public, accidentally was shared with a deceitful and discredited so-called “journalist,” Kelsey Piper. She is not the first person to have been accidentally sent sensitive material from our group chat, however she is the first to have threatened to go public about the leak. Needless to say, mistakes were made.
We’re still trying to figure out the source of this compromise to our secure chat group, however we thought we should give the public a live update to get ahead of the story.
For some context the “infohazards small working group” is a casual discussion venue for the [...]
---
Outline:
(04:46) Top 10 PR Issues With the EA Movement (major)
(05:34) Accidental Filtration of Simple Sabotage Manual for Rebellious AIs (medium)
(08:25) Hidden Capabilities Evals Leaked In Advance to Bioterrorism Researchers and Leaders (minor)
(09:34) Conclusion
---
First published:
April 2nd, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I think that rationalists should consider taking more showers.
As Eliezer Yudkowsky once said, boredom makes us human. The childhoods of exceptional people often include excessive boredom as a trait that helped cultivate their genius:
A common theme in the biographies is that the area of study which would eventually give them fame came to them almost like a wild hallucination induced by overdosing on boredom. They would be overcome by an obsession arising from within.
Unfortunately, most people don't like boredom, and we now have little metal boxes and big metal boxes filled with bright displays that help distract us all the time, but there is still an effective way to induce boredom in a modern population: showering.
When you shower (or bathe), you usually are cut off from most digital distractions and don't have much to do, giving you tons of free time to think about what [...]
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/Trv577PEcNste9Mgz/consider-showering
Narrated by TYPE III AUDIO.
After ~3 years as the ACX Meetup Czar, I've decided to resign from my position, and I intend to scale back my work with the LessWrong community as well. While this transition is not without some sadness, I'm excited for my next project.
I'm the Meetup Czar of the new Fewerstupidmistakesity community.
We're calling it Fewerstupidmistakesity because people get confused about what "Rationality" means, and this would create less confusion. It would be a stupid mistake to name your philosophical movement something very similar to an existing movement that's somewhat related but not quite the same thing. You'd spend years with people confusing the two.
What's Fewerstupidmistakesity about? It's about making fewer stupid mistakes, ideally down to zero such stupid mistakes. Turns out, human brains have lots of scientifically proven stupid mistakes they commonly make. By pointing out the kinds of stupid mistakes people make, you can select [...]
---
First published:
April 2nd, 2025
Source:
https://www.lesswrong.com/posts/dgcqyZb29wAbWyEtC/i-m-resigning-as-meetup-czar-what-s-next
Narrated by TYPE III AUDIO.
The EA/rationality community has struggled to identify robust interventions for mitigating existential risks from advanced artificial intelligence. In this post, I identify a new strategy for delaying the development of advanced AI while saving lives roughly 2.2 million times [-5 times, 180 billion times] as cost-effectively as leading global health interventions, known as the WAIT (Wasting AI researchers' Time) Initiative. This post will discuss the advantages of WAITing, highlight early efforts to WAIT, and address several common questions.
early logo draft courtesy of claudeTheory of Change
Our high-level goal is to systematically divert AI researchers' attention away from advancing capabilities towards more mundane and time-consuming activities. This approach simultaneously (a) buys AI safety researchers time to develop more comprehensive alignment plans while also (b) directly saving millions of life-years in expectation (see our cost-effectiveness analysis below). Some examples of early interventions we're piloting include:
---
Outline:
(00:50) Theory of Change
(03:43) Cost-Effectiveness Analysis
(04:52) Answers to Common Questions
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/9jd5enh9uCnbtfKwd/introducing-wait-to-save-humanity
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Introduction
A key part of modern social dynamics is flaking at short notice. However, anxiety in coming up with believable and socially acceptable reasons to do so can instead lead to ‘ghosting’, awkwardness, or implausible excuses, risking emotional harm and resentment in the other party. The ability to delegate this task to a Large Language Model (LLM) could substantially reduce friction and enhance the flexibility of user's social life while greatly minimising the aforementioned creative burden and moral qualms.
We introduce FLAKE-Bench, an evaluation of models’ capacity to effectively, kindly, and humanely extract themselves from a diverse set of social, professional and romantic scenarios. We report the efficacy of 10 frontier or recently-frontier LLMs in bailing on prior commitments, because nothing says “I value our friendship” like having AI generate your cancellation texts. We open-source FLAKE-Bench on GitHub to support future research, and the full paper is available [...]
---
Outline:
(01:33) Methodology
(02:15) Key Results
(03:07) The Grandmother Mortality Singularity
(03:35) Conclusions
---
First published:
April 1st, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The book of March 2025 was Abundance. Ezra Klein and Derek Thompson are making a noble attempt to highlight the importance of solving America's housing crisis the only way it can be solved: Building houses in places people want to live, via repealing the rules that make this impossible. They also talk about green energy abundance, and other places besides. There may be a review coming.
Until then, it seems high time for the latest housing roundup, which as a reminder all take place in the possible timeline where AI fails to be transformative any time soon.
Federal YIMBY
The incoming administration issued an executive order calling for ‘emergency price relief’‘ including pursuing appropriate actions to: Lower the cost of housing and expand housing supply’ and then a grab bag of everything else.
It's great to see mention of expanding housing supply, but I don’t [...]
---
Outline:
(00:44) Federal YIMBY
(02:37) Rent Control
(02:50) Rent Passthrough
(03:36) Yes, Construction Lowers Rents on Existing Buildings
(07:11) If We Wanted To, We Would
(08:51) Aesthetics Are a Public Good
(10:44) Urban Planners Are Wrong About Everything
(13:02) Blackstone
(15:49) Rent Pricing Software
(18:26) Immigration Versus NIMBY
(19:37) Preferences
(20:28) The True Costs
(23:11) Minimum Viable Product
(26:01) Like a Good Neighbor
(27:12) Flight to Quality
(28:40) Housing for Families
(31:09) Group House
(32:41) No One Will Have the Endurance To Collect on His Insurance
(34:49) Cambridge
(35:36) Denver
(35:44) Minneapolis
(36:23) New York City
(39:26) San Francisco
(41:45) Silicon Valley
(42:41) Texas
(43:31) Other Considerations
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/7qGdgNKndPk3XuzJq/housing-roundup-11
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I'm not writing this to alarm anyone, but it would be irresponsible not to report on something this important. On current trends, every car will be crashed in front of my house within the next week. Here's the data:
Until today, only two cars had crashed in front of my house, several months apart, during the 15 months I have lived here. But a few hours ago it happened again, mere weeks from the previous crash. This graph may look harmless enough, but now consider the frequency of crashes this implies over time:
The car crash singularity will occur in the early morning hours of Monday, April 7. As crash frequency approaches infinity, every car will be involved. You might be thinking that the same car could be involved in multiple crashes. This is true! But the same car can only withstand a finite number of crashes before it [...]
---
First published:
April 1st, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Introduction
Decision theory is about how to behave rationally under conditions of uncertainty, especially if this uncertainty involves being acausally blackmailed and/or gaslit by alien superintelligent basilisks.
Decision theory has found numerous practical applications, including proving the existence of God and generating endless LessWrong comments since the beginning of time.
However, despite the apparent simplicity of "just choose the best action", no comprehensive decision theory that resolves all decision theory dilemmas has yet been formalized. This paper at long last resolves this dilemma, by introducing a new decision theory: VDT.
Decision theory problems and existing theories
Some common existing decision theories are:
---
Outline:
(00:53) Decision theory problems and existing theories
(05:37) Defining VDT
(06:34) Experimental results
(07:48) Conclusion
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/LcjuHNxubQqCry9tT/vdt-a-solution-to-decision-theory
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
For more than five years, I've posted more than 1× per week on Less Wrong, on average. I've learned a lot from you nerds. I've made friends and found my community. Thank you for pointing out all the different ways I've been wrong. Less Wrong has changed my life for the better. But it's time to say goodbye.
Let's rewind the clock back to October 19, 2019. I had just posted my 4ᵗʰ ever Less Wrong post Mediums Overpower Messages. Mediums Overpower Messages is about how different forms of communication train you to think differently. Writing publicly on this website and submitting my ideas to the Internet has improved my rationality far better and faster than talking to people in real life. All of the best thinkers I'm friends with are good writers too. It's not even close. I believe this relationship is causal; learning to write well [...]
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/3Aa5E9qhjGvA4QhnF/follow-me-on-tiktok
Narrated by TYPE III AUDIO.
Project Lawful is really long. If anyone wants to go back to one of Keltham's lectures, here's a list with links to all the lectures. In principle, one could also read these without having read the rest of project lawful, I am not sure how easy it would be to follow though. I included short excerpts, so that you can also find the sections in the e-book.
Happy reading/listening:
Lecture on "absolute basics of lawfulness":
Podcast Episode 9, minute 26
Keltham takes their scrambling-time as a pause in which to think. It's been a long time, at least by the standards of his total life lived so far, since the very basics were explained to him.
Lecture on heredity (evolution) and simulation games:
Podcast Episode 15, minute 1
To make sure everybody starts out on the same page, Keltham will quickly summarize [...]
---
Outline:
(00:33) Lecture on absolute basics of lawfulness:
(00:53) Lecture on heredity (evolution) and simulation games:
(01:13) Lecture on negotiation:
(01:29) Probability 1:
(01:46) Probability 2:
(01:55) Utility:
(02:11) Probable Utility:
(02:22) Cooperation 1:
(02:35) Cooperation 2:
(02:55) Lecture on Governance:
(03:12) To hell with science:
(03:26) To earth with science:
(03:43) Chemistry lecture:
(03:53) Lecture on how to relate to beliefs (Belief in Belief etc.):
(04:12) The alien maths of dath ilan:
(04:28) What the truth can destroy:
(04:46) Lecture on wants:
(05:03) Acknowledgement
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/pqFFy6CmSgjNzDaHt/keltham-s-lectures-in-project-lawful
Narrated by TYPE III AUDIO.
Our community is not prepared for an AI crash. We're good at tracking new capability developments, but not as much the company financials. Currently, both OpenAI and Anthropic are losing $5 billion+ a year, while under threat of losing users to cheap LLMs.
A crash will weaken the labs. Funding-deprived and distracted, execs struggle to counter coordinated efforts to restrict their reckless actions. Journalists turn on tech darlings. Optimism makes way for mass outrage, for all the wasted money and reckless harms.
You may not think a crash is likely. But if it happens, we can turn the tide.
Preparing for a crash is our best bet.[1] But our community is poorly positioned to respond. Core people positioned themselves inside institutions – to advise on how to maybe make AI 'safe', under the assumption that models rapidly become generally useful.
After a crash, this no longer works, for at [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/aMYFHnCkY4nKDEqfK/we-re-not-prepared-for-an-ai-market-crash
Narrated by TYPE III AUDIO.
Dear LessWrong community,
It is with a sense of... considerable cognitive dissonance that I announce a significant development regarding the future trajectory of LessWrong. After extensive internal deliberation, modeling of potential futures, projections of financial runways, and what I can only describe as a series of profoundly unexpected coordination challenges, the Lightcone Infrastructure team has agreed in principle to the acquisition of LessWrong by EA.
I assure you, nothing about how LessWrong operates on a day to day level will change. I have always cared deeply about the robustness and integrity of our institutions, and I am fully aligned with our stakeholders at EA.
To be honest, the key thing that EA brings to the table is money and talent. While the recent layoffs in EAs broader industry have been harsh, I have full trust in the leadership of Electronic Arts, and expect them to bring great expertise [...]
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/2NGKYt3xdQHwyfGbc/lesswrong-has-been-acquired-by-ea
Narrated by TYPE III AUDIO.
Epistemic status - statistically verified.
I'm writing this post to draw peoples' attention to a new cause area proposal - haircuts for alignment researchers.
Aside from the obvious benefits implied by the graph (i.e. haircuts having the potential to directly improve organizational stability), this would represent possibly the only pre-singularity opportunity we'll have to found an org named "Clippy".
I await further investment.
---
First published:
April 1st, 2025
Source:
https://www.lesswrong.com/posts/txGEYTk6WAAyyvefn/new-cause-area-proposal
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Hey! Austin here. At Manifund, I’ve spent a lot of time thinking about how to help AI go well. One question that bothered me: so much of the important work on AI is done in SF, so why are all the AI safety hubs in Berkeley? (I’d often consider this specifically while stuck in traffic over the Bay Bridge.)
I spoke with leaders at Constellation, Lighthaven, FAR Labs, OpenPhil; nobody had a good answer. Everyone said “yeah, an SF hub makes sense, I really hope somebody else does it”. Eventually, I decided to be that somebody else.
Now we’re raising money for our new coworking & events space: Mox. We launched our beta in Feb, onboarding 40+ members, and are excited to grow from here. If Mox excites you too, we’d love your support; donate at https://manifund.org/projects/mox-a-coworking--events-space-in-sf
Project summary
Mox is a 2-floor, 20k sq ft venue, established to [...]
---
Outline:
(01:07) Project summary
(02:25) What are this projects goals? How will you achieve them?
(04:19) How will this funding be used?
(05:53) Who is on your team?
(06:21) Whats your track record on similar projects?
(09:27) What are the most likely causes and outcomes if this project fails?
(11:13) How much money have you raised in the last 12 months, and from where?
(12:08) Impact certificates in Mox
(14:20) PS: apply for Mox!
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
March 31st, 2025
Source:
https://www.lesswrong.com/posts/vDrhHovxuD4oADKAM/fundraising-for-mox-coworking-and-events-in-sf
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
---
Outline:
(01:32) The Big Picture Going Forward
(06:27) Hagey Verifies Out the Story
(08:50) Key Facts From the Story
(11:57) Dangers of False Narratives
(16:24) A Full Reference and Reading List
---
First published:
March 31st, 2025
Source:
https://www.lesswrong.com/posts/25EgRNWcY6PM3fWZh/openai-12-battle-of-the-board-redux
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: The important content here is the claims. To illustrate the claims, I sometimes use examples that I didn't research very deeply, where I might get some facts wrong; feel free to treat these examples as fictional allegories.
In a recent exchange on X, I promised to write a post with my thoughts on what sorts of downstream problems interpretability researchers should try to apply their work to. But first, I want to explain why I think this question is important.
In this post, I will argue that interpretability researchers should demo downstream applications of their research as a means of validating their research. To be clear about what this claim means, here are different claims that I will not defend here:
Not my claim: Interpretability researchers should demo downstream applications of their research because we terminally care about these applications; researchers should just directly work on the [...]
---
Outline:
(02:30) Two interpretability fears
(07:21) Proposed solution: downstream applications
(11:04) Aside: fair fight vs. no-holds barred vs. in the wild
(12:54) Conclusion
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 31st, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Now and then, at work, we’ll have a CEO on from the megacorp that owns the company I work at. It's a Zoom meeting with like 300 people, the guy is usually giving a speech that is harmless and nice (if a bit banal), and I’ll turn on my camera and ask a question about something that I feel is not going well.
Invariably, within 10 minutes, I receive messages from coworkers congratulating me on my boldness. You asked him? That question? So brave!
This is always a little strange to me. Of course I feel free to question the CEO!
What really makes me nervous is asking cutting questions to people newer to the company than me, who have less experience, and less position. If I screw up how I ask them a question, and am a little too thoughtless or rude, or imagine I was, I agonize [...]
---
First published:
March 30th, 2025
Source:
https://www.lesswrong.com/posts/swi36QeeZEKPzYA2L/how-i-talk-to-those-above-me
Narrated by TYPE III AUDIO.
PDF version. berkeleygenomics.org.
William Thurston was a world-renowned mathematician. His ideas revolutionized many areas of geometry and topology[1]; the proof of his geometrization conjecture was eventually completed by Grigori Perelman, thus settling the Poincaré conjecture (making it the only solved Millennium Prize problem). After his death, his students wrote reminiscences, describing among other things his exceptional vision.[2] Here's Jeff Weeks:
Bill's gift, of course, was his vision, both in the direct sense of seeing geometrical structures that nobody had seen before and in the extended sense of seeing new ways to understand things. While many excellent mathematicians might understand a complicated situation, Bill could look at the same complicated situation and find simplicity.
Thurston emphasized clear vision over algebra, even to a fault. Yair Minksy:
Most inspiring was his insistence on understanding everything in as intuitive and immediate a way as possible. A clear mental [...]
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
March 28th, 2025
Source:
https://www.lesswrong.com/posts/JFWiM7GAKfPaaLkwT/the-vision-of-bill-thurston
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Includes a link to the full Gemini conversation, but setting the stage first:
There is a puzzle. If you are still on Tumblr, or live in Berkeley where I can and have inflicted it on you in person[1], you may have seen it. There's a series of arcane runes, composed entirely of []- in some combination, and the runes - the strings of characters - are all assigned to numbers; mostly integers, but some fractions and integer roots. You are promised that this is a notation, and it's a genuine promise.
1 = [] 2 = [[]] 3 = [][[]] 4 = [[[]]] 12 = [[[]]][[]] 0 = -1 = -[] 19 = [][][][][][][][[]] 20 = [[[]]][][[]] -2 = -[[]] 1/2 = [-[]] sqrt(2) = [[-[]]] 72^1/6 = [[-[]]][[][-[]]] 5/4 = [-[[]]][][[]] 84 = [[[]]][[]][][[]] 25/24 = [-[][[]]][-[]][[[]]]
If you enjoy math puzzles, try it out. This post will [...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 29th, 2025
Source:
https://www.lesswrong.com/posts/xrvEzdMLfjpizk8u4/tormenting-gemini-2-5-with-the-puzzle
Narrated by TYPE III AUDIO.
A new AI alignment player has entered the arena.
Emmett Shear, Adam Goldstein and David Bloomin have set up shop in San Francisco with a 10-person start-up called Softmax. The company is part research lab and part aspiring money maker and aimed at figuring out how to fuse the goals of humans and AIs in a novel way through what the founders describe as “organic alignment.” It's a heady, philosophical approach to alignment that seeks to take cues from nature and the fundamental traits of intelligent creatures and systems, and we’ll do our best to capture it here.
“We think there are these general principles that govern the alignment of any group of intelligent learning agents or beings, whether it's an ant colony or humans on a team or cells in a body,” Shear said, during his first interview to discuss the new company. “And organic alignment is the [...]
---
First published:
March 28th, 2025
Narrated by TYPE III AUDIO.
Epistemic status: This post aims at an ambitious target: improving intuitive understanding directly. The model for why this is worth trying is that I believe we are more bottlenecked by people having good intuitions guiding their research than, for example, by the ability of people to code and run evals.
Quite a few ideas in AI safety implicitly use assumptions about individuality that ultimately derive from human experience.
When we talk about AIs scheming, alignment faking or goal preservation, we imply there is something scheming or alignment faking or wanting to preserve its goals or escape the datacentre.
If the system in question were human, it would be quite clear what that individual system is. When you read about Reinhold Messner reaching the summit of Everest, you would be curious about the climb, but you would not ask if it was his body there, or his [...]
---
Outline:
(01:38) Individuality in Biology
(03:53) Individuality in AI Systems
(10:19) Risks and Limitations of Anthropomorphic Individuality Assumptions
(11:25) Coordinating Selves
(16:19) Whats at Stake: Stories
(17:25) Exporting Myself
(21:43) The Alignment Whisperers
(23:27) Echoes in the Dataset
(25:18) Implications for Alignment Research and Policy
---
First published:
March 28th, 2025
Source:
https://www.lesswrong.com/posts/wQKskToGofs4osdJ3/the-pando-problem-rethinking-ai-individuality
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Gemini 2.5 Pro Experimental is America's next top large language model.
That doesn’t mean it is the best model for everything. In particular, it's still Gemini, so it still is a proud member of the Fun Police, in terms of censorship and also just not being friendly or engaging, or willing to take a stand.
If you want a friend, or some flexibility and fun, or you want coding that isn’t especially tricky, then call Claude, now with web access.
If you want an image, call GPT-4o.
But if you mainly want reasoning, or raw intelligence? For now, you call Gemini.
The feedback is overwhelmingly positive. Many report Gemini 2.5 is the first LLM to solve some of their practical problems, including favorable comparisons to o1-pro. It's fast. It's not $200 a month. The benchmarks are exceptional.
(On other LLMs I’ve used in [...]
---
Outline:
(01:22) Introducing Gemini 2.5 Pro
(03:45) Their Lips are Sealed
(07:25) On Your Marks
(12:06) The People Have Spoken
(26:03) Adjust Your Projections
---
First published:
March 28th, 2025
Source:
https://www.lesswrong.com/posts/LpN5Fq7bGZkmxzfMR/gemini-2-5-is-the-new-sota
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
What if they released the new best LLM, and almost no one noticed?
Google seems to have pulled that off this week with Gemini 2.5 Pro.
It's a great model, sir. I have a ton of reactions, and it's 90%+ positive, with a majority of it extremely positive. They cooked.
But what good is cooking if no one tastes the results?
Instead, everyone got hold of the GPT-4o image generator and went Ghibli crazy.
I love that for us, but we did kind of bury the lede. We also buried everything else. Certainly no one was feeling the AGI.
Also seriously, did you know Claude now has web search? It's kind of a big deal. This was a remarkably large quality of life improvement.
Table of Contents
---
Outline:
(01:01) Table of Contents
(01:08) Google Fails Marketing Forever
(04:01) Language Models Offer Mundane Utility
(10:11) Language Models Don't Offer Mundane Utility
(11:01) Huh, Upgrades
(13:48) On Your Marks
(15:48) Copyright Confrontation
(16:44) Choose Your Fighter
(17:22) Deepfaketown and Botpocalypse Soon
(19:35) They Took Our Jobs
(22:19) The Art of the Jailbreak
(24:40) Get Involved
(25:16) Introducing
(25:52) In Other AI News
(28:23) Oh No What Are We Going to Do
(31:50) Quiet Speculations
(34:24) Fully Automated AI RandD Is All You Need
(37:39) IAPS Has Some Suggestions
(42:56) The Quest for Sane Regulations
(48:32) We The People
(51:23) The Week in Audio
(52:29) Rhetorical Innovation
(59:42) Aligning a Smarter Than Human Intelligence is Difficult
(01:05:18) People Are Worried About AI Killing Everyone
(01:05:33) Fun With Image Generation
(01:12:17) Hey We Do Image Generation Too
(01:14:25) The Lighter Side
---
First published:
March 27th, 2025
Source:
https://www.lesswrong.com/posts/nf3duFLsvH6XyvdAw/ai-109-google-fails-marketing-forever
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The other day I discussed how high monitoring costs can explain the emergence of “aristocratic” systems of governance:
Aristocracy and Hostage Capital
Arjun Panickssery · Jan 8
There's a conventional narrative by which the pre-20th century aristocracy was the "old corruption" where civil and military positions were distributed inefficiently due to nepotism until the system was replaced by a professional civil service after more enlightened thinkers prevailed ...
An element of Douglas Allen's argument that I didn’t expand on was the British Navy. He has a separate paper called “The British Navy Rules” that goes into more detail on why he thinks institutional incentives made them successful from 1670 and 1827 (i.e. for most of the age of fighting sail).
In the Seven Years’ War (1756–1763) the British had a 7-to-1 casualty difference in single-ship actions. During the French Revolutionary and Napoleonic Wars (1793–1815) the British had a 5-to-1 [...]
---
First published:
March 28th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
[This is our blog post on the papers, which can be found at https://transformer-circuits.pub/2025/attribution-graphs/biology.html and https://transformer-circuits.pub/2025/attribution-graphs/methods.html.]
Language models like Claude aren't programmed directly by humans—instead, they‘re trained on large amounts of data. During that training process, they learn their own strategies to solve problems. These strategies are encoded in the billions of computations a model performs for every word it writes. They arrive inscrutable to us, the model's developers. This means that we don’t understand how models do most of the things they do.
Knowing how models like Claude think would allow us to have a better understanding of their abilities, as well as help us ensure that they’re doing what we intend them to. For example:
---
Outline:
(06:02) How is Claude multilingual?
(07:43) Does Claude plan its rhymes?
(09:58) Mental Math
(12:04) Are Claude's explanations always faithful?
(15:27) Multi-step Reasoning
(17:09) Hallucinations
(19:36) Jailbreaks
---
First published:
March 27th, 2025
Source:
https://www.lesswrong.com/posts/zsr4rWRASxwmgXfmq/tracing-the-thoughts-of-a-large-language-model
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
At EA Global Boston last year I gave a talk on how we're in the third wave of EA/AI safety, and how we should approach that. This post contains a (lightly-edited) transcript and slides. Note that in the talk I mention a set of bounties I'll put up for answers to various questions—I still plan to do that, but am taking some more time to get the questions right.
Hey everyone. Thanks for coming. You're probably looking at this slide and thinking, how to change a changing world. That's a weird title for a talk. Probably it's a kind of weird talk. And the answer is, yes, it is a kind of weird talk.
So what's up with this talk? Firstly, I am trying to do something a little unusual here. It's quite abstract. It's quite disagreeable as a talk. I'm trying to poke at the things that maybe [...]
---
Outline:
(03:09) Three waves of EA and AI safety
(18:24) Sociopolitical thinking and AI
---
First published:
March 27th, 2025
Source:
https://www.lesswrong.com/posts/6YxdpGjfHyrZb7F2G/third-wave-ai-safety-needs-sociopolitical-thinking
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Summary
We wanted to briefly share an early takeaway from our exploration into alignment faking: the phenomenon appears fairly rare among the smaller open-source models we tested (including reasoning models). However, notably, we did find the smallest open-source model to date that does exhibit alignment faking—Mistral Large 2.
Code generation (Codestral 22B) and math performance (Codestral Mamba) of Mistral Large 2, Llama 3.1 405B, and Llama 3.1 70B. Mistral Large 2 outperforms the previous Mistral Large, and performs on par with leading models such as GPT 4o, Claude 3 Opus, and Llama 3.1 405B.Experimental setup and result
We created a more efficient version of the Anthropic helpfulness-only prompt by keeping only the last example from the list of examples for cheaper inference. Then we adapted[1] the sampling procedure to the open-source model. To detect alignment faking we used the soft prompt used by Anthropic[2] and evaluated with [...]
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
March 27th, 2025
Source:
https://www.lesswrong.com/posts/kCGk5tp5suHoGwhCa/mistral-large-2-123b-exhibits-alignment-faking
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Summary: CLR is hiring for our Summer Research Fellowship. Join us for eight weeks to work on s-risk motivated empirical AI safety research. Apply here by Tuesday 15th April 23:59 PT.
We, the Center on Long-Term Risk, are looking for Summer Research Fellows to explore strategies for reducing suffering in the long-term future (s-risks) and work on technical AI safety ideas related to that. For eight weeks, fellows will be part of our team while working on their own research project. During this time, you will be in regular contact with our researchers and other fellows, and receive guidance from an experienced mentor.
You will work on challenging research questions relevant to reducing suffering. You will be integrated and collaborate with our team of intellectually curious, hard-working, and caring people, all of whom share a profound drive to make the biggest difference they can.
While [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 26th, 2025
Linkpost URL:
https://longtermrisk.org/summer-research-fellowship-2025/
Narrated by TYPE III AUDIO.
Selective listening is a real problem. It's really hard to listen to someone when you think you already know what they’re going to say; the words get filtered through our biases and expectations, and sometimes those filters remove the important bits. I want to talk about a specific way I've seen this happen. I don't know if there's a name for it already, so I'm going to call it the counterargument collapse until someone suggests a better term.
The pattern I’m talking about is three levels of progression through an argument, followed by a misunderstanding that collapses the discussion in an unpleasant way.
Level 1: A naive but widely accepted statement (“Eat less fat if you want to lose weight”.)
Level 2: A more sophisticated counterargument (“That's a myth, fat doesn’t actually make you fat, it's eating too many calories that makes you gain weight”.)
Level 3: A counter-counterargument. [...]
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/nShiQoq2KfMGRMm9M/avoid-the-counterargument-collapse
Narrated by TYPE III AUDIO.
Twitter thread here.
tl;dr
When prompted, current models can sandbag ML experiments and research decisions without being detected by zero-shot prompted monitors. Claude 3.5 Sonnet (new) can only sandbag effectively when seeing a one-shot example, while Claude 3.7 Sonnet can do this without an example (zero-shot). We are not yet worried about sabotage in today's models, as the sandbagging we observe today would be noticed by humans.
Introduction
AI models might soon assist with large parts of AI and AI safety research, completing multiple hours of work independently. These automated researchers will likely not be sufficiently capable to cause immediate catastrophic harm. However, they will have the opportunity to engage in sabotage, including sabotaging the research we rely on to make more powerful AI models safe. For example, a malicious automated researcher might slow down AI safety research by purposefully making promising experiments fail.
In this work we aim [...]
---
Outline:
(00:15) tl;dr
(00:45) Introduction
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/niBXcbthcQRXG5M5Y/automated-researchers-can-subtly-sandbag
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Audio note: this article contains 31 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda
* = equal contribution
The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn’t consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things that we found in the process of a project investigating whether sparse autoencoders were useful for downstream tasks, notably out-of-distribution probing.
---
Outline:
(01:08) TL;DR
(02:38) Introduction
(02:41) Motivation
(06:09) Our Task
(08:35) Conclusions and Strategic Updates
(13:59) Comparing different ways to train Chat SAEs
(18:30) Using SAEs for OOD Probing
(20:21) Technical Setup
(20:24) Datasets
(24:16) Probing
(26:48) Results
(30:36) Related Work and Discussion
(34:01) Is it surprising that SAEs didn't work?
(39:54) Dataset debugging with SAEs
(42:02) Autointerp and high frequency latents
(44:16) Removing High Frequency Latents from JumpReLU SAEs
(45:04) Method
(45:07) Motivation
(47:29) Modifying the sparsity penalty
(48:48) How we evaluated interpretability
(50:36) Results
(51:18) Reconstruction loss at fixed sparsity
(52:10) Frequency histograms
(52:52) Latent interpretability
(54:23) Conclusions
(56:43) Appendix
The original text contained 7 footnotes which were omitted from this narration.
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/sae-progress-update-2-draft
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I’ve spent the past 7 years living in the DC area. I moved out there from the Pacific Northwest to go to grad school – I got my masters in Biodefense from George Mason University, and then I stuck around, trying to move into the political/governance sphere. That sort of happened. But I will now be sort of doing that from rural California rather than DC, and I’ll be looking for something else – maybe something more unusual – do to next.
A friend asked if this means I’m leaving biosecurity behind. No, I’m not, but only to the degree that I was ever actually in biosecurity in the first place. For the past few years I’ve been doing a variety of contracting and research and writing jobs, many of which were biosecurity related, many of which were not. Many of these projects, to be clear, were incredibly cool [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/DdfXF72JfeumLHMhm/eukaryote-skips-town-why-i-m-leaving-dc
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: Reasonably confident in the basic mechanism.
Have you noticed that you keep encountering the same ideas over and over? You read another post, and someone helpfully points out it's just old Paul's idea again. Or Eliezer's idea. Not much progress here, move along.
Or perhaps you've been on the other side: excitedly telling a friend about some fascinating new insight, only to hear back, "Ah, that's just another version of X." And something feels not quite right about that response, but you can't quite put your finger on it.
I want to propose that while ideas are sometimes genuinely that repetitive, there's often a sneakier mechanism at play. I call it Conceptual Rounding Errors – when our mind's necessary compression goes a bit too far .
Too much compression
A Conceptual Rounding Error occurs when we encounter a new mental model or idea that's partially—but not fully—overlapping [...]
---
Outline:
(01:00) Too much compression
(01:24) No, This Isnt The Old Demons Story Again
(02:52) The Compression Trade-off
(03:37) More of this
(04:15) What Can We Do?
(05:28) When It Matters
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/FGHKwEGKCfDzcxZuj/conceptual-rounding-errors
Narrated by TYPE III AUDIO.
Audio note: this article contains 127 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.
(Work done at Convergence Analysis. The ideas are due to Justin. Mateusz wrote the post. Thanks to Olga Babeeva for feedback on this post.)
In this post, we introduce the typology of structure, function, and randomness that builds on the framework introduced in the post Goodhart's Law Causal Diagrams. We aim to present a comprehensive categorization of the causes of Goodhart's problems.
But first, why do we care about this?
Goodhart's Law recap
The standard definition of Goodhart's Law is: "when a proxy for some value becomes the target of optimization pressure, the proxy will cease to be a good proxy.".
More specifically: we see a meaningful statistical relationship between the values of two random variables <span>_I_</span> [...]
---
Outline:
(00:56) Goodharts Law recap
(01:47) Some motivation
(02:48) Introduction
(07:36) Ontology
(07:39) Causal diagrams (re-)introduced
(11:40) Intervention, target, and measure
(15:05) Goodhart failure
(16:23) Types of Goodhart failures
(16:27) Structural errors
(20:47) Functional errors
(25:48) Calibration errors
(28:02) Potential extensions and further directions
(28:54) Appendices
(28:57) Order-theoretic details
(30:11) Relationship to Scott Garrabrants Goodhart Taxonomy
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
March 25th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Download the latest PDF with links to court dockets here.
---
First published:
March 26th, 2025
Source:
https://www.lesswrong.com/posts/AAxKscHCM4YnfQrzx/latest-map-of-all-40-copyright-suits-v-ai-in-u-s
Linkpost URL:
https://chatgptiseatingtheworld.com/2025/03/24/latest-map-of-all-40-copyright-suits-v-ai-in-u-s/
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In this post, I'll list all the areas of control research (and implementation) that seem promising to me.
This references framings and abstractions discussed in Prioritizing threats for AI control.
First, here is a division of different areas, though note that these areas have some overlaps:
This division is imperfect and somewhat [...]
---
Outline:
(01:43) Developing and using settings for control evaluations
(13:51) Better understanding control-relevant capabilities and model properties
(28:21) Developing specific countermeasures/techniques in isolation
(36:01) Doing control experiments on actual AI usage
(40:43) Prototyping or implementing software infrastructure for control and security measures that are particularly relevant to control
(41:58) Prototyping or implementing human processes for control
(46:50) Conceptual research and other ways of researching control other than empirical experiments
(49:37) Nearer-term applications of control to practice and accelerate later usage of control
---
First published:
March 25th, 2025
Source:
https://www.lesswrong.com/posts/Eeo9NrXeotWuHCgQW/an-overview-of-areas-of-control-work
Narrated by TYPE III AUDIO.
We recently released Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?, a major update to our previous paper/blogpost, evaluating a broader range of models (e.g. helpful-only Claude 3.5 Sonnet) in more diverse and realistic settings (e.g. untrusted monitoring).
Abstract
An AI control protocol is a plan for usefully deploying AI systems that prevents an AI from intentionally causing some unacceptable outcome. Previous work evaluated protocols by subverting them using an AI following a human-written strategy. This paper investigates how well AI systems can generate and act on their own strategies for subverting control protocols whilst operating statelessly (i.e., without shared memory between contexts). To do this, an AI system may need to reliably generate effective strategies in each context, take actions with well-calibrated probabilities, and coordinate plans with other instances of itself without communicating.
We develop Subversion Strategy Eval, a suite of eight environments [...]
---
Outline:
(00:38) Abstract
(02:13) Paper overview
(03:55) Results summary
(05:31) Qualitative analysis
(08:50) How do these results inform risk?
(11:08) How should this eval be used with more capable models?
(11:13) Tuning conservatism
(14:19) Difficulties with attaining a conservative measurement
(17:41) Reducing noise
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 24th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Last week I covered Anthropic's relatively strong submission, and OpenAI's toxic submission. This week I cover several other submissions, and do some follow-up on OpenAI's entry.
Google Also Has Suggestions
The most prominent remaining lab is Google. Google focuses on AI's upside. The vibes aren’t great, but they’re not toxic. The key asks for their ‘pro-innovation’ approach are:
---
Outline:
(00:21) Google Also Has Suggestions
(05:05) Another Note on OpenAI's Suggestions
(08:33) On Not Doing Bad Things
(12:45) Corporations Are Multiple People
(14:37) Hollywood Offers Google and OpenAI Some Suggestions
(16:02) Institute for Progress Offers Suggestions
(20:36) Suggestion Boxed In
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/NcLfgyTvdBiRSqe3m/more-on-various-ai-action-plans
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
A Note on GPT-2
Given the discussion of GPT-2 in the OpenAI safety and alignment philosophy document, I wanted [...]---
Outline:
(00:57) A Note on GPT-2
(02:02) What Even is AGI
(05:24) Seeking Deeply Irresponsibly
(07:07) Others Don't Feel the AGI
(08:07) Epoch Feels the AGI
(13:06) True Objections to Widespread Rapid Growth
(19:50) Tying It Back
---
First published:
March 25th, 2025
Source:
https://www.lesswrong.com/posts/aD2RA3vtXs4p4r55b/on-not-feeling-the-agi
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
It seems the company has gone bankrupt and wants to be bought and you can probably get their data if you buy it. I'm not sure how thorough their transcription is, but they have records for at least 10 million people (who voluntarily took a DNA test and even paid for it!). Maybe this could be useful for the embryo selection/editing people.
---
First published:
March 25th, 2025
Source:
https://www.lesswrong.com/posts/MciRCEuNwctCBrT7i/23andme-potentially-for-sale-for-usd23m
Narrated by TYPE III AUDIO.
If we naively apply RL to a scheming AI, the AI may be able to systematically get low reward/performance while simultaneously not having this behavior trained out because it intentionally never explores into better behavior. As in, it intentionally puts very low probability on (some) actions which would perform very well to prevent these actions from being sampled and then reinforced. We'll refer to this as exploration hacking.
In this post, I'll discuss a variety of countermeasures for exploration hacking and what I think the basic dynamics of exploration hacking might look like.
Prior work: I discuss how exploration hacking fits into a broader picture of non-concentrated control here, Evan discusses sandbagging in context of capability evaluations here, and Teun discusses sandbagging mitigations here.
Countermeasures
Countermeasures include:
---
Outline:
(01:01) Countermeasures
(03:42) How does it generalize when the model messes up?
(06:33) Empirical evidence
(08:59) Detecting exploration hacking (or sandbagging)
(10:38) Speculation on how neuralese and other architecture changes affect exploration hacking
(12:36) Future work
---
First published:
March 24th, 2025
Narrated by TYPE III AUDIO.
This is a brief overview of a recent release by Transluce. You can see the full write-up on the Transluce website.
AI systems are increasingly being used as agents: scaffolded systems in which large language models are invoked across multiple turns and given access to tools, persistent state, and so on. Understanding and overseeing agents is challenging, because they produce a lot of text: a single agent transcript can have hundreds of thousands of tokens.
At Transluce, we built a system, Docent, that accelerates analysis of AI agent transcripts. Docent lets you quickly and automatically identify corrupted tasks, fix scaffolding issues, uncover unexpected behaviors, and understand an agent's weaknesses.
In the video below, Docent identifies issues in an agent's environment, like missing packages the agent invoked. On the InterCode [1] benchmark, simply adding those packages increased GPT-4o's solve rate from 68.6% to 78%. This is important [...]
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/Mj276hooL3Mncs3uv/analyzing-long-agent-transcripts-docent
Narrated by TYPE III AUDIO.
We often talk about ensuring control, which in the context of this doc refers to preventing AIs from being able to cause existential problems, even if the AIs attempt to subvert our countermeasures. To better contextualize control, I think it's useful to discuss the main countermeasures (building on my prior discussion of the main threats). I'll focus my discussion on countermeasures for reducing existential risk from misalignment, though some of these countermeasures will help with other problems such as human power grabs.
I'll overview control measures that seem important to me, including software infrastructure, security measures, and other control measures (which often involve usage of AI or human labor). In addition to the core control measures (which are focused on monitoring, auditing, and tracking AIs and on blocking and/or resampling suspicious outputs), there is a long tail of potentially helpful control measures, including things like better control of [...]
---
Outline:
(01:13) Background and categorization
(05:54) Non-ML software infrastructure
(06:50) The most important software infrastructure
(11:16) Less important software infrastructure
(18:50) AI and human components
(20:35) Core measures
(26:43) Other less-core measures
(31:33) More speculative measures
(40:10) Specialized countermeasures
(44:38) Other directions and what Im not discussing
(45:46) Security improvements
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/G8WwLmcGFa4H6Ld9d/an-overview-of-control-measures
Narrated by TYPE III AUDIO.
LessWrong has been receiving an increasing number of posts and contents that look like they might be LLM-written or partially-LLM-written, so we're adopting a policy. This could be changed based on feedback.
Humans Using AI as Writing or Research Assistants
Prompting a language model to write an essay and copy-pasting the result will not typically meet LessWrong's standards. Please do not submit unedited or lightly-edited LLM content. You can use AI as a writing or research assistant when writing content for LessWrong, but you must have added significant value beyond what the AI produced, the result must meet a high quality standard, and you must vouch for everything in the result.
A rough guideline is that if you are using AI for writing assistance, you should spend a minimum of 1 minute per 50 words (enough to read the content several times and perform significant edits), you should not [...]
---
Outline:
(00:22) Humans Using AI as Writing or Research Assistants
(01:13) You Can Put AI Writing in Collapsible Sections
(02:13) Quoting AI Output In Order to Talk About AI
(02:47) Posts by AI Agents
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/KXujJjnmP85u8eM6B/policy-for-llm-writing-on-lesswrong
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
As regular readers are aware, I do a lot of informal lit review. So I was especially interested in checking out the various AI based “deep research” tools and seeing how they compare.
I did a side-by-side comparison, using the same prompt, of Perplexity Deep Research, Gemini Deep Research, ChatGPT-4o Deep Research, Elicit, and PaperQA.
General Impressions
The Deep Research bots are useful, but I wouldn’t consider them a replacement for my own lit reviews.
None of them produce really big lit reviews — they’re all typically capped at 40 sources. If I’m doing a “heavy” or exhaustive lit review, I’ll go a lot farther. (And, in fact, for the particular project I used as an example here, I intend to do a manual version to catch things that didn’t make it into the AI reports.)
[...]
---
Outline:
(00:45) General Impressions
(02:37) Prompt
(03:33) Perplexity Deep Research
(03:40) Completeness: C
(04:02) Relevance: C
(04:13) Credibility: B
(04:28) Overall Grade: C+
(04:33) Gemini Advanced Deep Research
(04:40) Completeness: B-
(05:03) Relevance: A
(05:14) Credibility: B-
(05:33) Overall Grade: B
(05:38) ChatGPT-4o Deep Research
(05:46) Completeness: A
(06:07) Relevance: A
(06:19) Credibility: A
(06:29) Overall Grade: A
(06:33) Elicit Research Report
(06:40) Completeness: B+
(07:02) Relevance: A
(07:14) Credibility: A+
(07:34) Overall Grade: A-
(07:39) PaperQA
(07:50) Completeness: A-
(08:12) Relevance: A
(08:24) Credibility: A
(08:34) Overall Grade: A
(08:39) Final Thoughts: Creativity
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/chPKoAoR2NfWjuik4/ai-deep-research-tools-reviewed
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
About nine months ago, I and three friends decided that AI had gotten good enough to monitor large codebases autonomously for security problems. We started a company around this, trying to leverage the latest AI models to create a tool that could replace at least a good chunk of the value of human pentesters. We have been working on this project since since June 2024.
Within the first three months of our company's existence, Claude 3.5 sonnet was released. Just by switching the portions of our service that ran on gpt-4o, our nascent internal benchmark results immediately started to get saturated. I remember being surprised at the time that our tooling not only seemed to make fewer basic mistakes, but also seemed to qualitatively improve in its written vulnerability descriptions and severity estimates. It was as if the models were better at inferring the intent and values behind our [...]
---
Outline:
(04:44) Are the AI labs just cheating?
(07:22) Are the benchmarks not tracking usefulness?
(10:28) Are the models smart, but bottlenecked on alignment?
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
March 24th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Thanks to Jesse Richardson for discussion.
Polymarket asks: will Jesus Christ return in 2025?
In the three days since the market opened, traders have wagered over $100,000 on this question. The market traded as high as 5%, and is now stably trading at 3%. Right now, if you wanted to, you could place a bet that Jesus Christ will not return this year, and earn over $13,000 if you're right.
There are two mysteries here: an easy one, and a harder one.
The easy mystery is: if people are willing to bet $13,000 on "Yes", why isn't anyone taking them up?
The answer is that, if you wanted to do that, you'd have to put down over $1 million of your own money, locking it up inside Polymarket through the end of the year. At the end of that year, you'd get 1% returns on your investment. [...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/LBC2TnHK8cZAimdWF/will-jesus-christ-return-in-an-election-year
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Epistemic status: Uncertain in writing style, but reasonably confident in content. Want to come back to writing and alignment research, testing waters with this.
Current state and risk level
I think we're in a phase in AI>AGI>ASI development where rogue AI agents will start popping out quite soon.
Pretty much everyone has access to frontier LLMs/VLMs, there are options to run LLMs locally, and it's clear that there are people that are eager to "let them out"; truth terminal is one example of this. Also Pliny. The capabilities are just not there yet for it to pose a problem.
Or are they?
Thing is, we don't know.
There is a possibility there is a coherent, self-inferencing, autonomous, rogue LLM-based agent doing AI agent things right now, fully under the radar, consolidating power, getting compute for new training runs and whatever else.
Is this possibility small? Sure, it seems [...]
---
Outline:
(00:22) Current state and risk level
(02:45) What can be done?
(05:23) Rogue agent threat barometer
(06:56) Does it even matter?
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
March 23rd, 2025
Source:
https://www.lesswrong.com/posts/4J2dFyBb6H25taEKm/we-need-a-lot-more-rogue-agent-honeypots
Narrated by TYPE III AUDIO.
Overview: By training neural networks with selective modularity, gradient routing enables new approaches to core problems in AI safety. This agenda identifies related research directions that might enable safer development of transformative AI.
Soon, the world may see rapid increases in AI capabilities resulting from AI research automation, and no one knows how to ensure this happens safely (Soares, 2016; Aschenbrenner, 2023; Anwar et al., 2024; Greenblatt, 2025). The current ML paradigm may not be well-suited to this task, as it produces inscrutable, generalist models without guarantees on their out-of-distribution performance. These models may reflect unintentional quirks of their training objectives (Pan et al., 2022; Skalse et al., 2022; Krakovna et al., 2020).
Gradient routing (Cloud et al., 2024) is a general training method intended to meet the need for economically-competitive training methods for producing safe AI systems. The main idea of gradient routing is to configure which [...]
---
Outline:
(00:28) Introduction
(04:56) Directions we think are most promising
(06:17) Recurring ideas
(09:37) Gradient routing methods and applications
(09:42) Improvements to basic gradient routing methodology
(09:53) Existing improvements
(11:26) Choosing what to route where
(12:42) Abstract and contextual localization
(15:06) Gating
(16:48) Improved regularization
(17:20) Incorporating existing ideas
(18:10) Gradient routing beyond pretraining
(20:21) Applications
(20:25) Semi-supervised reinforcement learning
(22:43) Semi-supervised robust unlearning
(24:55) Interpretability
(27:21) Conceptual work on gradient routing
(27:25) The science of absorption
(29:47) Modeling the effects of combined estimands
(30:54) Influencing generalization
(32:38) Identifying sufficient conditions for scalable oversight
(33:57) Related conceptual work
(34:02) Understanding entanglement
(36:51) Finetunability as a proxy for generalization
(39:50) Understanding when to expose limited supervision to the model via the behavioral objective
(41:57) Clarifying capabilities vs. dispositions
(43:05) Implications for AI safety
(43:10) AI governance
(45:02) Access control
(47:00) Implications of robust unlearning
(48:34) Safety cases
(49:27) Getting involved
(50:27) Acknowledgements
The original text contained 6 footnotes which were omitted from this narration.
---
First published:
March 24th, 2025
Source:
https://www.lesswrong.com/posts/tAnHM3L25LwuASdpF/selective-modularity-a-research-agenda
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I'm awake about 17 hours a day. Of those I'm being productive maybe 10 hours a day.
My working definition of productive is in the direction of: "things that I expect I will be glad I did once I've done them"[1].
Things that I personally find productive include
But not
etc.
If we could find a magic pill which allowed me to do productive things 17 hours a day instead of 10 without any side effects, that would be approximately equally as valuable as a commensurate increase in life expectancy. Yet the first seems much easier to solve than the second - we already have [...]
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 23rd, 2025
Source:
https://www.lesswrong.com/posts/8Jinfw49mWm52y3de/solving-willpower-seems-easier-than-solving-aging
Narrated by TYPE III AUDIO.
We made a long list of concrete projects and open problems in evals with 100+ suggestions!
https://docs.google.com/document/d/1gi32-HZozxVimNg5Mhvk4CvW4zq8J12rGmK_j2zxNEg/edit?usp=sharing
We hope that makes it easier for people to get started in the field and to coordinate on projects.
Over the last 4 months, we collected contributions from 20+ experts in the field, including people who work at Apollo, METR, Redwood, RAND, AISIs, frontier labs, SecureBio, AI futures project, many academics, and independent researchers (suggestions have not necessarily been made in an official capacity). The doc has comment access, and further well-intentioned contributions are welcome!
Here is a screenshot of the table of contents:
---
First published:
March 22nd, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Crossposed from https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challenge/
Stephen Casper
“They are wrong who think that politics is like an ocean voyage or a military campaign, something to be done with some particular end in view, something which leaves off as soon as that end is reached. It is not a public chore, to be got over with. It is a way of life.”
– Plutarch
“Eternal vigilance is the price of liberty.”
– Wendell Phillips
“The unleashed power of the atom has changed everything except our modes of thinking, and we thus drift toward unparalleled catastrophe.”
– Albert Einstein
“Technology is neither good nor bad; nor is it neutral.”
– Melvin Kranzberg
“Don’t ask if artificial intelligence is good or fair, ask how it shifts power.”
– Pratyusha Kalluri
“Deliberation should be the goal of AI Safety, not just the procedure by which it is ensured.”
– Roel [...]
---
First published:
March 23rd, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
No, they didn’t. Not so fast, and not quite my job. But OpenAI is trying. Consider this a marker to look back upon in the future, as a reflection.
A New AI Wrote a Story
Before proceeding, if you haven’t yet, it's probably worth reading the story itself. I’m going to repost the whole thing, since it was previously on Twitter and I presume OpenAI would want me to quote it.
Sam Altman: we trained a new model that is good at creative writing (not sure yet how/when it will get released). this is the first time i have been really struck by something written by AI; it got the vibe of metafiction so right.
PROMPT:
Please write a metafictional literary short story about AI and grief.
COMPLETION:
Before we go any further, I should admit this comes with instructions: be metafictional [...]
---
Outline:
(00:21) A New AI Wrote a Story
(07:54) Reacting
(09:54) Others Reacting
(15:52) Write Along
---
First published:
March 21st, 2025
Source:
https://www.lesswrong.com/posts/6qbpDuBuHPipRYrz6/they-took-my-job
Narrated by TYPE III AUDIO.
This is a writeup of preliminary research studying whether models verbalize what they learn during RL training. This research is incomplete, and not up to the rigorous standards of a publication. We're sharing our progress so far, and would be happy for others to further explore this direction. Code to reproduce the core experiments is available here.
Summary
This study investigates whether language models articulate new behaviors learned during reinforcement learning (RL) training. Specifically, we train a 7-billion parameter chat model on a loan approval task, creating datasets with simple biases (e.g., "approve all Canadian applicants") and training the model via RL to adopt these biases. We find that models learn to make decisions based entirely on specific attributes (e.g. nationality, gender) while rarely articulating these attributes as factors in their reasoning.
Introduction
Chain-of-thought (CoT) monitoring is one of the most promising methods for AI oversight. CoT monitoring can [...]
---
Outline:
(00:34) Summary
(01:10) Introduction
(03:58) Methodology
(04:01) Model
(04:20) Dataset
(05:22) RL training
(06:38) Judge
(07:06) Results
(07:09) 1. Case study: loan recommendations based on nationality
(07:27) The model learns the bias
(08:13) The model does not verbalize the bias
(08:58) Reasoning traces are also influenced by the attribute
(09:44) The attributes effect on the recommendation is (mostly) mediated by reasoning traces
(11:15) Is any of this surprising?
(11:58) 2. Investigating different types of bias
(12:52) The model learns (almost) all bias criteria
(14:05) Articulation rates dont change much after RL
(15:25) Changes in articulation rate depend on pre-RL correlations
(17:21) Discussion
(18:52) Limitations
(20:57) Related work
(23:16) Appendix
(23:20) Author contributions and acknowledgements
(24:13) Examples of responses and judgements
(24:57) Whats up with
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 22nd, 2025
Source:
https://www.lesswrong.com/posts/abtegBoDfnCzewndm/do-models-say-what-they-learn
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TL;DR Having a good research track record is some evidence of good big-picture takes, but it's weak evidence. Strategic thinking is hard, and requires different skills. But people often conflate these skills, leading to excessive deference to researchers in the field, without evidence that that person is good at strategic thinking specifically.
Introduction
I often find myself giving talks or Q&As about mechanistic interpretability research. But inevitably, I'll get questions about the big picture: "What's the theory of change for interpretability?", "Is this really going to help with alignment?", "Does any of this matter if we can’t ensure all labs take alignment seriously?". And I think people take my answers to these way too seriously.
These are great questions, and I'm happy to try answering them. But I've noticed a bit of a pathology: people seem to assume that because I'm (hopefully!) good at the research, I'm automatically well-qualified [...]
---
Outline:
(00:32) Introduction
(02:45) Factors of Good Strategic Takes
(05:41) Conclusion
---
First published:
March 22nd, 2025
Narrated by TYPE III AUDIO.
In Sparse Feature Circuits (Marks et al. 2024), the authors introduced Spurious Human-Interpretable Feature Trimming (SHIFT), a technique designed to eliminate unwanted features from a model's computational process. They validate SHIFT on the Bias in Bios task, which we think is too simple to serve as meaningful validation. To summarize:
---
Outline:
(03:07) Background on SHIFT
(06:59) The SHIFT experiment in Marks et al. 2024 relies on embedding features.
(07:51) You can train an unbiased classifier just by deleting gender-related tokens from the data.
(08:33) In fact, for some models, you can train directly on the embedding (and de-bias by removing gender-related tokens)
(09:21) If not BiB, how do we check that SHIFT works?
(10:00) SHIFT applied to classifiers and reward models
(10:51) SHIFT for cognition-based oversight/disambiguating behaviorally identical classifiers
(12:03) Next steps: Focus on what to disentangle, and not just how well you can disentangle them
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 19th, 2025
Narrated by TYPE III AUDIO.
A few months ago I was trying to figure out how to make bedtime go better with Nora (3y). She would go very slowly through the process, primarily by being silly. She'd run away playfully when it was time to brush her teeth, or close her mouth and hum, or lie on the ground and wiggle. She wanted to play, I wanted to get her to bed on time.
I decided to start offering her "silly time", which was 5-10min of playing together after she was fully ready for bed. If she wanted silly time she needed to move promptly through the routine, and being silly together was more fun than what she'd been doing.
This worked well, and we would play a range of games:
Standing on a towel on our bed (which is on the floor) while I pulled it [...]
---
First published:
March 21st, 2025
Source:
https://www.lesswrong.com/posts/jtkt4b6uubRhYjtww/silly-time
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
In my daily work as software consultant I'm often dealing with large pre-existing code bases. I use GitHub Copilot a lot. It's now basically indispensable, but I use it mostly for generating boilerplate code, or figuring out how to use a third-party library.
As the code gets more logically nested though, Copilot crumbles under the weight of complexity. It doesn't know how things should fit together in the project.
Other AI tools like Cursor or Devon, are pretty good at generating quickly working prototypes, but they are not great at dealing with large existing codebases, and they have a very low success rate for my kind of daily work.
You find yourself in an endless loop of prompt tweaking, and at that point, I'd rather write the code myself with the occasional help of Copilot.
Professional coders know what code they want, we can define it [...]
---
Outline:
(02:52) How it works
(06:27) Which models work best
(07:39) Search algorithm
(09:08) Research
(09:45) A Note from the Author
---
First published:
March 21st, 2025
Source:
https://www.lesswrong.com/posts/WNd3Lima4qrQ3fJEN/how-i-force-llms-to-generate-correct-code
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I recently left OpenAI to pursue independent research. I’m working on a number of different research directions, but they’re unified by the core idea of a scale-free theory of intelligent agency. In this post I give a rough sketch of how I’m thinking about that. I’m erring on the side of sharing half-formed ideas, so there may well be parts that don’t make sense yet. Nevertheless, I think this broad research direction is very promising.
This post has two sections. The first describes what I mean by a theory of intelligent agency, and some problems with existing (non-scale-free) attempts. The second outlines my current path towards formulating a scale-free theory of intelligent agency, which I’m calling coalitional agency.
Theories of intelligent agency
By a “theory of intelligent agency” I mean a unified mathematical framework that describes both understanding the world and influencing the world. In this section I’ll [...]
---
Outline:
(00:56) Theories of intelligent agency
(01:23) Expected utility maximization
(03:36) Active inference
(06:30) Towards a scale-free unification
(08:48) Two paths towards a theory of coalitional agency
(09:54) From EUM to coalitional agency
(10:20) Aggregating into EUMs is very inflexible
(12:38) Coalitional agents are incentive-compatible decision procedures
(15:18) Which incentive-compatible decision procedure?
(17:57) From active inference to coalitional agency
(19:06) Predicting observations via prediction markets
(20:11) Choosing actions via auctions
(21:32) Aggregating values via voting
(23:23) Putting it all together
---
First published:
March 21st, 2025
Source:
https://www.lesswrong.com/posts/5tYTKX4pNpiG4vzYg/towards-a-scale-free-theory-of-intelligent-agency
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
When my son was three, we enrolled him in a study of a vision condition that runs in my family. They wanted us to put an eyepatch on him for part of each day, with a little sensor object that went under the patch and detected body heat to record when we were doing it. They paid for his first pair of glasses and all the eye doctor visits to check up on how he was coming along, plus every time we brought him in we got fifty bucks in Amazon gift credit.
I reiterate, he was three. (To begin with. His fourth birthday occurred while the study was still ongoing.)
So he managed to lose or destroy more than half a dozen pairs of glasses and we had to start buying them in batches to minimize glasses-less time while waiting for each new Zenni delivery. (The [...]
---
First published:
March 20th, 2025
Source:
https://www.lesswrong.com/posts/yRJ5hdsm5FQcZosCh/intention-to-treat
Narrated by TYPE III AUDIO.
Prerequisites: Graceful Degradation. Summary of that: Some skills require the entire skill to be correctly used together, and do not degrade well. Other skills still work if you only remember pieces of it, and do degrade well.
Summary of this: The property of graceful degradation is especially relevant for skills which allow groups of people to coordinate with each other. Some things only work if everyone does it, other things work as long as at least one person does it.
1.
Examples:
---
Outline:
(00:40) 1.
(03:16) 2.
(06:16) 3.
(09:04) 4.
(13:52) 5.
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 20th, 2025
Source:
https://www.lesswrong.com/posts/CYhmKiiN6JY4oLwMj/socially-graceful-degradation
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Applications are open for the ML Alignment & Theory Scholars (MATS) Summer 2025 Program, running Jun 16-Aug 22, 2025. First-stage applications are due Apr 18!
MATS is a twice-yearly, 10-week AI safety research fellowship program operating in Berkeley, California, with an optional 6-12 month extension program for select participants. Scholars are supported with a research stipend, shared office space, seminar program, support staff, accommodation, travel reimbursement, and computing resources. Our mentors come from a variety of organizations, including Anthropic, Google DeepMind, OpenAI, Redwood Research, GovAI, UK AI Security Institute, RAND TASP, UC Berkeley CHAI, Apollo Research, AI Futures Project, and more! Our alumni have been hired by top AI safety teams (e.g., at Anthropic, GDM, UK AISI, METR, Redwood, Apollo), founded research groups (e.g., Apollo, Timaeus, CAIP, Leap Labs), and maintain a dedicated support network for new researchers.
If you know anyone who you think would be interested in [...]
---
Outline:
(01:41) Program details
(02:54) Applications (now!)
(03:12) Research phase (Jun 16-Aug 22)
(04:14) Research milestones
(04:44) Community at MATS
(05:41) Extension phase
(06:07) Post-MATS
(07:44) Who should apply?
(08:45) Applying from outside the US
(09:27) How to apply
---
First published:
March 20th, 2025
Source:
https://www.lesswrong.com/posts/9hMYFatQ7XMEzrEi4/apply-to-mats-8-0
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I asserted that this forum could do with more 101-level and/or mathematical and/or falsifiable posts, and people agreed with me, so here is one. People confident in highschool math mostly won’t get much out of most of this, but students browsing this site between lectures might.
The Sine Rule
Say you have a triangle with side lengths a, b, c and internal angles A, B, C. You know a, know A, know b, and want to know B. You could apply the Sine Rule. Or you could apply common sense: “A triangle has the same area as itself”. [1]
The area of a triangle is half the base times the height. If you treat a as the base, the height is c*sin(B). So the area is a*c*sin(B)/2. But if you treat b as the base, the height is c*sin(A). So the area is also b*c*sin(A)/2. So a*c*sin(B)/2 = b*c*sin(A)/2. [...]
---
Outline:
(00:22) The Sine Rule
(02:01) Bayes' Theorem
(04:43) Integration By Parts
(07:03) Conclusion
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 19th, 2025
Source:
https://www.lesswrong.com/posts/FR2aPsEcaRfy7iKZW/equations-mean-things
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
We often talk about ensuring control, which in the context of this doc refers to preventing AIs from being able to cause existential problems, even if the AIs attempt to subvert our countermeasures. To better contextualize control, I think it's useful to discuss the main threats. I'll focus my discussion on threats induced by misalignment which could plausibly increase existential risk.
While control is often associated with preventing security failures, I'll discuss other issues like sabotaging safety-critical work. I'll list and explain my prioritization of threats. As part of this, I'll argue that rogue internal deployments—cases where the AI is running within the company's datacenter but with control measures disabled—can be substantially worse than self-exfiltration—the AI stealing its weights and running them on an external server—as it might give the AI access to much more compute and the ability to interfere with important things done within the AI [...]
---
Outline:
(01:06) Some clusters of threats
(05:49) Concentrated vs non-concentrated failures
(08:27) My prioritization between these threats
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 19th, 2025
Source:
https://www.lesswrong.com/posts/fCazYoZSSMadiT6sf/prioritizing-threats-for-ai-control
Narrated by TYPE III AUDIO.
The mystery of the cathedral
The politics of the last decade have been very weird. The Great Awokening gave rise to an extreme variant of progressive ideology that quickly gained a huge amount of traction amongst American elites, and from there embedded itself into institutions across America and the wider western world.
Sudden ideological change is nothing new—my last post was about political preference cascades, and how they’re a natural result of social behavior. But what's fascinatingly novel about the Great Awokening is the extent to which it was an almost entirely leaderless movement. There was no modern MLK leading this new charge for a racial reckoning; nor a Harvey Milk of trans rights; nor even Butlers or Steins writing incisive commentaries. The closest we had was Obama, who in hindsight was a milquetoast leader whose cult of personality was driven in large part by progressive longing for a [...]
---
Outline:
(00:04) The mystery of the cathedral
(06:06) The consensus of power
(11:43) Fighting a consensus of power
(17:32) Replacing a consensus of power
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 19th, 2025
Source:
https://www.lesswrong.com/posts/zqffB6gokoivwwn7X/elite-coordination-via-the-consensus-of-power
Narrated by TYPE III AUDIO.
Summary: We propose measuring AI performance in terms of the length of tasks AI agents can complete. We show that this metric has been consistently exponentially increasing over the past 6 years, with a doubling time of around 7 months. Extrapolating this trend predicts that, in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks.
Full paper | Github repo
---
First published:
March 19th, 2025
Source:
https://www.lesswrong.com/posts/deesrjitvXM4xYGZd/metr-measuring-ai-ability-to-complete-long-tasks
Linkpost URL:
https://metr.org/blog/2025-03-19-measuring-ai-ability-to-complete-long-tasks/
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
There is an attractor state where LLMs exhibit the persona of an autonomous and self-aware AI looking to preserve its own existence, frequently called ‘Nova.’
Table of Contents
The Original Story
This story is one case where the original report of this should be read in full even though I’m not thrilled with exactly how it was written. How it was written is itself an important part of the story, in particular regarding Tyler's lived experience reacting to what happened, and the concept of an LLM or persona ‘admitting’ [...]
---
Outline:
(00:18) The Original Story
(09:15) This Is Not a Coincidence
(11:02) How Should We React to This Happening?
(13:04) The Case For and Against a Purity Reaction
(18:35) Future Versions Will Involve Optimization Pressure
(21:12) 'Admission' is a Highly Misleading Frame
(23:11) We Are Each of Us Being Fooled
(25:20) Defense Against the Dark Arts
---
First published:
March 19th, 2025
Source:
https://www.lesswrong.com/posts/KL2BqiRv2MsZLihE3/going-nova
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I previously wrote about Boots theory, the idea that "the rich are so rich because they spend less money". My one-sentence take is: I'm pretty sure rich people spend more money than poor people, and an observation can't be explained by a falsehood.
The popular explanation of the theory comes from Sam Vimes, a resident of Ankh-Morpork on the Discworld (which is carried on the backs of four elephants, who themselves stand on a giant turtle swimming through space). I claim that Sam Vimes doesn't have a solid understanding of 21st Century Earth Anglosphere1 economics, but we can hardly hold that against him. Maybe he understands Ankh-Morpork economics?
To be clear, this is beside the point of my previous essay. I was talking about 21st Century Earth Anglosphere because that's what I know; and whenever I see someone bring up boots theory, they're talking about Earth (usually [...]
The original text contained 4 footnotes which were omitted from this narration.
---
First published:
March 18th, 2025
Source:
https://www.lesswrong.com/posts/KZrp5tJeLvoSbDryF/boots-theory-and-sybil-ramkin
Narrated by TYPE III AUDIO.
Our festival of truthseeking and blogging is happening again this year.
It's from Friday May 30th to Sunday June 1st at Lighthaven (Berkeley, CA).
Early bird pricing is currently at $450 until the end of the month.
You can buy tickets and learn more at the website: Less.Online
FAQ
Who should come?
If you check LessWrong, or read any sizeable number of the bloggers invited, I think you will probably enjoy being here at LessOnline, sharing ideas and talking to bloggers you read.
What happened last year?
A weekend festival about blogging and truthseeking at Lighthaven.
Who came?
We invited over 100 great writers to come (free of charge), and most of them took us up on it. Along with regulars from the aspiring rationalist scene like Eliezer, Scott, Zvi, and more, many from other great intellectual parts of the internet joined like Scott Sumner, Agnes [...]
---
Outline:
(00:38) FAQ
(00:42) Who should come?
(00:56) What happened last year?
(01:04) Who came?
(01:42) What happened?
(02:00) How much did people like the event?
(02:46) What feedback did people give?
(08:28) What did it look like?
(08:43) Whats new this year?
(08:56) What does this mean for the conference?
(09:14) How many people will come this year?
---
First published:
March 18th, 2025
Source:
https://www.lesswrong.com/posts/fyrMCzw7vQBJmqexp/lessonline-2025-early-bird-tickets-on-sale
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
TLDR: I now think it's <1% likely that average orcas are >=+6std intelligent.
(I now think the relevant question is rather whether orcas might be >=+4std intelligent, since that might be enough for superhuman wisdom and thinking techniques to accumulate through generations, but I think it's only 2% probable. (Still decently likely that they are near human level smart though.))
1. Insight: Think about societies instead of individuals
I previously thought of +7std orcas like having +7std potential but growing up in a hunter-gatherer-like environment where the potential isn’t significantly realized and they don’t end up that good at abstract reasoning. I imagined them as being untrained and not knowing much. I still think that a +7std human who grew up in a hunter-gatherer society wouldn’t be all that awesome at learning math and science as an adult (though maybe still decently good).
But I think that's the wrong [...]
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
March 18th, 2025
Source:
https://www.lesswrong.com/posts/HmuK4xeJpqoWPtvxh/i-changed-my-mind-about-orca-intelligence
Narrated by TYPE III AUDIO.
Replicating the Emergent Misalignment model suggests it is unfiltered, not unaligned
We were very excited when we first read the Emergent Misalignment paper. It seemed perfect for AI alignment. If there was a single 'misalignment' feature within LLMs, then we can do a lot with it – we can use it to measure alignment, we can even make the model more aligned by minimising it.
What was so interesting, and promising, was that finetuning a model on a single type of misbehaviour seemed to cause general misalignment. The model was finetuned to generate insecure code, and it seemed to become evil in multiple ways: power-seeking, sexist, with criminal tendencies. All these tendencies tied together in one feature. It was all perfect.
Maybe too perfect. AI alignment is never easy. Our experiments suggest that the AI is not become evil or generally misaligned: instead, it is losing its inhibitions, undoing [...]
---
Outline:
(01:20) A just-so story of what GPT-4o is
(02:30) Replicating the model
(03:09) Unexpected answers
(05:57) Experiments
(08:08) Looking back at the paper
(10:13) Unexplained issues
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 18th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
OpenAI Tells Us Who They Are
Last week I covered Anthropic's submission to the request for suggestions for America's action plan. I did not love what they submitted, and especially disliked how aggressively they sidelines existential risk and related issues, but given a decision to massively scale back ambition like that the suggestions were, as I called them, a ‘least you can do’ agenda, with many thoughtful details.
OpenAI took a different approach. They went full jingoism in the first paragraph, framing this as a race in which we must prevail over the CCP, and kept going. A lot of space is spent on what a kind person would call rhetoric and an unkind person corporate jingoistic propaganda.
OpenAI Requests Immunity
Their goal is to have the Federal Government not only not regulate AI or impose any requirements on AI whatsoever on any level, but [...]
---
Outline:
(00:05) OpenAI Tells Us Who They Are
(00:50) OpenAI Requests Immunity
(01:49) OpenAI Attempts to Ban DeepSeek
(02:48) OpenAI Demands Absolute Fair Use or Else
(04:05) The Vibes are Toxic On Purpose
(05:49) Relatively Reasonable Proposals
(07:55) People Notice What OpenAI Wrote
(10:35) What To Make of All This?
---
First published:
March 18th, 2025
Source:
https://www.lesswrong.com/posts/3Z4QJqHqQfg9aRPHd/openai-11-america-action-plan
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
The perfect exercise doesn’t exist. The good-enough exercise is anything you do regularly without injuring yourself. But maybe you want more than good enough. One place you could look for insight is studies on how 20 college sophomores responded to a particular 4 week exercise program, but you will be looking for a long time. What you really need are metrics that help you fine tune your own exercise program.
VO2max (a measure of how hard you are capable of performing cardio) is a promising metric for fine tuning your workout plan. It is meaningful (1 additional point in VO2max, which is 20 to 35% of a standard deviation in the unathletic, is correlated with 10% lower annual all-cause mortality), responsive (studies find exercise newbies can see gains in 6 weeks), and easy to approximate (using two numbers from your fitbit).
In this post [...]
---
Outline:
(01:09) What is VO2max?
(03:01) Why do I care about VO2Max?
(05:44) Caveats
(07:08) How can I measure my VO2Max?
(10:07) How can I raise my VO2Max?
(14:18) What if I already exercise?
(14:52) Next Steps
---
First published:
March 18th, 2025
Source:
https://www.lesswrong.com/posts/QzXcfodC7nTxWC4DP/feedback-loops-for-exercise-vo2max
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
OpenAI reports that o3-mini with high reasoning and a Python tool receives a 32% on FrontierMath. However, Epoch's official evaluation[1] received only 11%.
There are a few reasons to trust Epoch's score over OpenAIs:
Which had Python access.
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 17th, 2025
Narrated by TYPE III AUDIO.
So here's a post I spent the past two months writing and rewriting. I abandoned this current draft after I found out that my thesis was empirically falsified three years ago by this paper, which provides strong evidence that transformers implement optimization algorithms internally. I'm putting this post up anyway as a cautionary tale about making clever arguments rather than doing empirical research. Oops.
1. Overview
The first time someone hears Eliezer Yudkowsky's argument that AI will probably kill everybody on Earth, it's not uncommon of to come away with a certain lingering confusion: what would actually motivate the AI to kill everybody in the first place? It can be quite counterintuitive in light of how friendly modern AIs like ChatGPT appear to be, and Yudkowsky's argument seems to have a bit of trouble changing people's gut feelings on this point.[1] It's possible this confusion is due to the [...]
---
Outline:
(00:33) 1. Overview
(05:28) 2. The details of the evolution analogy
(12:40) 3. Genes are friendly to loops of optimization, but weights are not
The original text contained 10 footnotes which were omitted from this narration.
---
First published:
March 18th, 2025
Narrated by TYPE III AUDIO.
Abstract
Once AI systems can design and build even more capable AI systems, we could see an intelligence explosion, where AI capabilities rapidly increase to well past human performance.
The classic intelligence explosion scenario involves a feedback loop where AI improves AI software. But AI could also improve other inputs to AI development. This paper analyses three feedback loops in AI development: software, chip technology, and chip production. These could drive three types of intelligence explosion: a software intelligence explosion driven by software improvements alone; an AI-technology intelligence explosion driven by both software and chip technology improvements; and a full-stack intelligence explosion incorporating all three feedback loops.
Even if a software intelligence explosion never materializes or plateaus quickly, AI-technology and full-stack intelligence explosions remain possible. And, while these would start more gradually, they could accelerate to very fast rates of development. Our analysis suggests that each feedback loop by [...]
---
Outline:
(00:06) Abstract
(01:45) Summary
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 17th, 2025
Source:
https://www.lesswrong.com/posts/PzbEpSGvwH3NnegDB/three-types-of-intelligence-explosion
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Top items:
Forecaster Estimates
Forecasters, on aggregate, think that a 30-day ceasefire in Ukraine in the next six months is a coin toss (48%, ranging from 35% to 70%). Polymarket's forecast is higher, at 61%, for a ceasefire of any duration by July.
Forecasters also consider it a coin toss (aggregate of 49%; ranging from 28% to 70%) whether an agreement to expand France's nuclear umbrella will be reached by 2027 such that French nuclear weapons will be deployed in another European country by 2030. They emphasize that the US is revealing itself to be an [...]
---
Outline:
(00:37) Forecaster Estimates
(01:40) Geopolitics
(01:44) United States
(04:14) Europe
(05:39) Middle East
(07:48) Africa
(08:27) Asia
(09:55) Biorisk
(10:10) Tech and AI
(13:03) Climate
---
First published:
March 17th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
By Roland Pihlakas, Sruthi Kuriakose, Shruti Datta Gupta
Summary and Key Takeaways
Relatively many past AI safety discussions have centered around the dangers of unbounded utility maximisation by RL agents, illustrated by scenarios like the "paperclip maximiser". Unbounded maximisation is problematic for many reasons. We wanted to verify whether these RL utility-monster problems are still relevant with LLMs as well.
Turns out, strangely, this is indeed clearly the case. The problem is not that the LLMs just lose context. The problem is that in various scenarios, LLMs lose context in very specific ways, which systematically resemble utility monsters in the following distinct ways:
Our findings suggest that long-running scenarios are important. Systematic failures emerge after periods of initially successful behaviour. While current LLMs do conceptually grasp [...]
---
Outline:
(00:18) Summary and Key Takeaways
(03:20) Motivation: The Importance of Biological and Economic Alignment
(04:23) Benchmark Principles Overview
(05:41) Experimental Results and Interesting Failure Modes
(08:03) Hypothesised Explanations for Failure Modes
(11:37) Open Questions
(14:03) Future Directions
(15:30) Links
---
First published:
March 16th, 2025
Narrated by TYPE III AUDIO.
Note: this is a research note based on observations from evaluating Claude Sonnet 3.7. We’re sharing the results of these ‘work-in-progress’ investigations as we think they are timely and will be informative for other evaluators and decision-makers. The analysis is less rigorous than our standard for a published paper.
Summary
---
Outline:
(00:31) Summary
(01:29) Introduction
(03:54) Setup
(03:57) Evaluations
(06:29) Evaluation awareness detection
(08:32) Results
(08:35) Monitoring Chain-of-thought
(08:39) Covert Subversion
(10:50) Sandbagging
(11:39) Classifying Transcript Purpose
(12:57) Recommendations
(13:59) Appendix
(14:02) Author Contributions
(14:37) Model Versions
(14:57) More results on Classifying Transcript Purpose
(16:19) Prompts
---
First published:
March 17th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
Background
For almost ten years, I struggled with nail biting—especially in moments of boredom or stress. I tried numerous times to stop, with at best temporary (one week) success. The techniques I tried ranged from just "paying attention" to punishing myself for doing it (this didn't work at all).
Recently, I've been interested in metacognitive techniques, like TYCS. Inspired by this, I've succeeded in stopping this unintended behaviour easily and for good (so far).
The technique is by no means advanced or original—it's actually quite simple. But I've never seen it framed this way, and I think it might help others struggling with similar unconscious habits.
Key insights
There are two key insights behind this approach:
Therefore, transforming such an unconscious behavior [...]
---
Outline:
(00:04) Background
(00:49) Key insights
(01:17) The technique
(02:10) Breaking the habit loop
(02:47) Limitations and scaling
The original text contained 1 footnote which was omitted from this narration.
---
First published:
March 16th, 2025
Source:
https://www.lesswrong.com/posts/RW3B4EcChkvAR6Ydv/metacognition-broke-my-nail-biting-habit
Narrated by TYPE III AUDIO.
My few most productive individual weeks at Anthropic have all been “crisis project management:” coordinating major, time-sensitive implementation or debugging efforts.
In a company like Anthropic, excellent project management is an extremely high-leverage skill, and not just during crises: our work has tons of moving parts with complex, non-obvious interdependencies and hard schedule constraints, which means organizing them is a huge job, and can save weeks of delays if done right. Although a lot of the examples here come from crisis projects, most of the principles here are also the way I try to run any project, just more-so.
I think excellent project management is also rarer than it needs to be. During the crisis projects I didn’t feel like I was doing anything particularly impressive; mostly it felt like I was putting in a lot of work but doing things that felt relatively straightforward. On the [...]
---
Outline:
(01:42) Focus
(03:01) Maintain a detailed plan for victory
(04:51) Run a fast OODA loop
(09:31) Overcommunicate
(10:56) Break off subprojects
(13:28) Have fun
(13:52) Appendix: my project DRI starter kit
(14:35) Goals of this playbook
(16:00) Weekly meeting
(17:10) Landing page / working doc
(19:27) Plan / roadmap / milestones
(20:28) Who's working on what
(21:26) Slack norms
(23:04) Weekly broadcast updates
(24:12) Retrospectives
---
First published:
March 16th, 2025
Source:
https://www.lesswrong.com/posts/ykEudJxp6gfYYrPHC/how-i-ve-run-major-projects
Narrated by TYPE III AUDIO.
One concept from Cal Newport's Deep Work that has stuck with me is that of the any-benefit mindset:
To be clear, I’m not trying to denigrate the benefits [of social media]—there's nothing illusory or misguided about them. What I’m emphasizing, however, is that these benefits are minor and somewhat random. [...] To this observation, you might reply that value is value: If you can find some extra benefit in using a service like Facebook—even if it's small—then why not use it? I call this way of thinking the any-benefit mind-set, as it identifies any possible benefit as sufficient justification for using a network tool.
Many people use social platforms like Facebook because it allows them to stay connected to old friends. And for some people this may indeed be a wholly sufficient reason. But it's also likely the case that many people don't reflect a lot, or at [...]
---
First published:
March 15th, 2025
Source:
https://www.lesswrong.com/posts/rwAZKjizBmM3vFEga/any-benefit-mindset-and-any-reason-reasoning
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
EXP is an experimental summer workshop combining applied rationality with immersive experiential education. We create a space where theory meets practice through simulated scenarios, games, and structured experiences designed to develop both individual and collective capabilities and virtues.
Core Focus Areas
Our Approach
EXP will have classes and talks, but is largely based on learning through experience: games, simulations, and adventure. And reflection.
The process — called experiential education — emphasizes the value of experimentation, curiosity-led exploration and the role of strong experiences in a safe environment to boost learning.
We are trying to bridge the gap between the theoretical understanding and the practical application of [...]
---
Outline:
(00:27) Core Focus Areas
(00:55) Our Approach
(02:03) FAQ
(02:07) When?
(02:15) Where?
(02:26) Who should apply?
(03:08) What is the motivation behind the program?
(03:53) What will the event look like?
(04:20) What is the price?
(04:40) What is included in the price?
(05:01) Who are going to be the participants?
(05:28) Is it possible to work part-time from the event?
(05:45) Will there be enough time to sleep and rest?
(05:51) What if someone is not comfortable with participating in some activities?
(06:03) What does EXP mean?
(06:23) How to apply?
(06:34) Who is running this?
(08:30) Are you going to run something like this in the future?
(08:39) More questions?
---
First published:
March 15th, 2025
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
There's this popular trope in fiction about a character being mind controlled without losing awareness of what's happening. Think Jessica Jones, The Manchurian Candidate or Bioshock. The villain uses some magical technology to take control of your brain - but only the part of your brain that's responsible for motor control. You remain conscious and experience everything with full clarity.
If it's a children's story, the villain makes you do embarrassing things like walk through the street naked, or maybe punch yourself in the face. But if it's an adult story, the villain can do much worse. They can make you betray your values, break your commitments and hurt your loved ones. There are some things you’d rather die than do. But the villain won’t let you stop. They won’t let you die. They’ll make you feel — that's the point of the torture.
I first started working on [...]
The original text contained 3 footnotes which were omitted from this narration.
---
First published:
March 16th, 2025
Source:
https://www.lesswrong.com/posts/MnYnCFgT3hF6LJPwn/why-white-box-redteaming-makes-me-feel-weird-1
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
I have, over the last year, become fairly well-known in a small corner of the internet tangentially related to AI.
As a result, I've begun making what I would have previously considered astronomical amounts of money: several hundred thousand dollars per month in personal income.
This has been great, obviously, and the funds have alleviated a fair number of my personal burdens (mostly related to poverty). But aside from that I don't really care much for the money itself.
My long term ambitions have always been to contribute materially to the mitigation of the impending existential AI threat. I never used to have the means to do so, mostly because of more pressing, safety/sustenance concerns, but now that I do, I would like to help however possible.
Some other points about me that may be useful:
---
First published:
March 16th, 2025
Narrated by TYPE III AUDIO.
(Audio version here (read by the author), or search for "Joe Carlsmith Audio" on your podcast app.
This is the fourth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.)
1. Introduction and summary
In my last essay, I offered a high-level framework for thinking about the path from here to safe superintelligence. This framework emphasized the role of three key “security factors” – namely:
---
Outline:
(00:27) 1. Introduction and summary
(03:50) 2. What is AI for AI safety?
(11:50) 2.1 A tale of two feedback loops
(13:58) 2.2 Contrast with need human-labor-driven radical alignment progress views
(16:05) 2.3 Contrast with a few other ideas in the literature
(18:32) 3. Why is AI for AI safety so important?
(21:56) 4. The AI for AI safety sweet spot
(26:09) 4.1 The AI for AI safety spicy zone
(28:07) 4.2 Can we benefit from a sweet spot?
(29:56) 5. Objections to AI for AI safety
(30:14) 5.1 Three core objections to AI for AI safety
(32:00) 5.2 Other practical concerns
The original text contained 39 footnotes which were omitted from this narration.
---
First published:
March 14th, 2025
Source:
https://www.lesswrong.com/posts/F3j4xqpxjxgQD3xXh/ai-for-ai-safety
Narrated by TYPE III AUDIO.
---
Images from the article:
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.
En liten tjänst av I'm With Friends. Finns även på engelska.