Podcast: AI Safety Fundamentals: Governance

Deceptively Aligned Mesa-Optimizers: It’s Not Funny if I Have to Explain It

4 January 2025 | 27 min

Learning From Human Preferences

4 January 2025 | 7 min

Where I Agree and Disagree with Eliezer

4 January 2025 | 43 min

Thought Experiments Provide a Third Anchor

4 January 2025 | 8 min

Future ML Systems Will Be Qualitatively Different

4 January 2025 | 13 min

Why AI Alignment Could Be Hard With Modern Deep Learning

4 January 2025 | 29 min

Acquisition of Chess Knowledge in AlphaZero

4 January 2025 | 22 min

Four Background Claims

4 January 2025 | 15 min

Understanding Intermediate Layers Using Linear Classifier Probes

4 January 2025 | 17 min

Feature Visualization

4 January 2025 | 32 min

Embedded Agents

4 January 2025 | 18 min

Logical Induction (Blog Post)

4 January 2025 | 12 min

MIRI is releasing a paper introducing a new model of deductively limited reasoning: “Logical induction,” authored by Scott Garrabrant, Tsvi Benson-Tilsen, Andrew Critch, myself, and Jessica Taylor. Readers may wish to start with the abridged version.

Consider a setting where a reasoner is observing a deductive process (such as a community of mathematicians and computer programmers) and waiting for proofs of various logical claims (such as the abc conjecture, or “this computer program has a bug in it”), while making guesses about which claims will turn out to be true. Roughly speaking, our paper presents a computable (though inefficient) algorithm that outpaces deduction, assigning high subjective probabilities to provable conjectures and low probabilities to disprovable conjectures long before the proofs can be produced.

This algorithm has a large number of nice theoretical properties. Still speaking roughly, the algorithm learns to assign probabilities to sentences in ways that respect any logical or statistical pattern that can be described in polynomial time. Additionally, it learns to reason well about its own beliefs and trust its future beliefs while avoiding paradox.

Quoting from the abstract: "These properties and many others all follow from a single logical induction criterion, which is motivated by a series of stock trading analogies. Roughly speaking, each logical sentence φ is associated with a stock that is worth $1 per share if φ is true and nothing otherwise, and we interpret the belief-state of a logically uncertain reasoner as a set of market prices, where ℙₙ(φ) = 50% means that on day n, shares of φ may be bought or sold from the reasoner for 50¢. The logical induction criterion says (very roughly) that there should not be any polynomial-time computable trading strategy with finite risk tolerance that earns unbounded profits in that market over time."
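
The stock-trading analogy in the quoted abstract can be made concrete with a small sketch. The Python below is only an illustration of the payoff rule (a share pays $1 if the sentence is true and nothing otherwise); the function names and the trader's probability are hypothetical, and nothing here implements the logical induction algorithm itself.

```python
def realized_profit(price: float, shares: float, sentence_is_true: bool) -> float:
    """Profit from buying `shares` shares of a sentence at `price` dollars each.

    Per the analogy, a share pays $1 if the sentence is true and $0 otherwise.
    """
    payout_per_share = 1.0 if sentence_is_true else 0.0
    return shares * (payout_per_share - price)


def expected_profit(price: float, shares: float, trader_probability: float) -> float:
    """Expected profit for a trader who believes the sentence is true with
    probability `trader_probability` (a hypothetical input, not from the paper)."""
    return shares * (trader_probability - price)


if __name__ == "__main__":
    price = 0.50  # the reasoner's price for the sentence on some day n
    print(realized_profit(price, 10, True))   # the sentence turns out true
    print(realized_profit(price, 10, False))  # the sentence turns out false
    print(expected_profit(price, 10, 0.75))   # trader disagrees with the price
```

A trader expects to profit only when their own probability differs from the market price, which is the intuition behind reading prices as beliefs.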

Original text:

https://intelligence.org/2016/09/12/new-paper-logical-induction/

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

---

A podcast by BlueDot Impact.

Learn more on the AI Safety Fundamentals website.

Cooperation, Conflict, and Transformative Artificial Intelligence: Sections 1 & 2 — Introduction, Strategy and Governance

4 January 2025 | 28 min

Superintelligence: Instrumental Convergence

4 January 2025 | 18 min

Takeaways From Our Robust Injury Classifier Project [Redwood Research]

4 January 2025 | 12 min

The Alignment Problem From a Deep Learning Perspective

4 January 2025 | 34 min

High-Stakes Alignment via Adversarial Training [Redwood Research Report]

4 January 2025 | 19 min

A Short Introduction to Machine Learning

4 January 2025 | 18 min

Introduction to Logical Decision Theory for Computer Scientists

4 January 2025 | 14 min

Decision theories differ on exactly how to calculate the expectation--the probability of an outcome, conditional on an action. This foundational difference bubbles up to real-life questions about whether to vote in elections, or accept a lowball offer at the negotiating table. When you're thinking about what happens if you don't vote in an election, should you calculate the expected outcome as if only your vote changes, or as if all the people sufficiently similar to you would also decide not to vote? Questions like these belong to a larger class of problems, Newcomblike decision problems, in which some other agent is similar to us or reasoning about what we will do in the future. The central principle of 'logical decision theories', several families of which will be introduced, is that we ought to choose as if we are controlling the logical output of our abstract decision algorithm. Newcomblike considerations--which might initially seem like unusual special cases--become more prominent as agents can get higher-quality information about what algorithms or policies other agents use: Public commitments, machine agents with known code, smart contracts running on Ethereum. Newcomblike considerations also become more important as we deal with agents that are very similar to one another; or with large groups of agents that are likely to contain high-similarity subgroups; or with problems where even small correlations are enough to swing the decision. In philosophy, the debate over decision theories is seen as a debate over the principle of rational choice. Do 'rational' agents refrain from voting in elections, because their one vote is very unlikely to change anything? Do we need to go beyond 'rationality', into 'social rationality' or 'superrationality' or something along those lines, in order to describe agents that could possibly make up a functional society?
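
As a rough illustration of the voting question above, the sketch below computes expected utility under two different models of the probability of an outcome conditional on an action. Every number, name, and the cost of voting is a hypothetical assumption chosen for illustration; this is not an implementation of any particular logical decision theory.

```python
from typing import Dict

# Hypothetical payoffs: utility of each election outcome and cost of each action.
UTILITIES: Dict[str, float] = {"win": 100.0, "lose": 0.0}
COSTS: Dict[str, float] = {"vote": 1.0, "abstain": 0.0}

# Model A (assumed numbers): only my own vote changes with my decision.
ONLY_MY_VOTE_CHANGES = {
    "vote": {"win": 0.5001, "lose": 0.4999},
    "abstain": {"win": 0.5, "lose": 0.5},
}

# Model B (assumed numbers): sufficiently similar people decide as I do.
SIMILAR_AGENTS_DECIDE_ALIKE = {
    "vote": {"win": 0.6, "lose": 0.4},
    "abstain": {"win": 0.4, "lose": 0.6},
}


def expected_utility(outcome_probs: Dict[str, float], action_cost: float) -> float:
    """Sum of P(outcome | action) * U(outcome), minus the cost of acting."""
    return sum(p * UTILITIES[o] for o, p in outcome_probs.items()) - action_cost


def best_action(model: Dict[str, Dict[str, float]]) -> str:
    """Return the action with the highest expected utility under `model`."""
    return max(model, key=lambda action: expected_utility(model[action], COSTS[action]))


if __name__ == "__main__":
    print(best_action(ONLY_MY_VOTE_CHANGES))
    print(best_action(SIMILAR_AGENTS_DECIDE_ALIKE))
```

The utilities and costs are identical in both runs; only the conditional probabilities differ, which is enough to flip the recommended action.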

Original text:

https://arbital.com/p/logical_dt/?l=5d6

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Yudkowsky Contra Christiano on AI Takeoff Speeds

4 January 2025 | 62 min

Debate Update: Obfuscated Arguments Problem

4 January 2025 | 29 min

AGI Ruin: A List of Lethalities

4 January 2025 | 62 min

Robust Feature-Level Adversaries Are Interpretability Tools

4 January 2025 | 36 min

ML Systems Will Have Weird Failure Modes

4 January 2025 | 14 min

AI Safety via Red Teaming Language Models With Language Models

4 January 2025 | 7 min

Goal Misgeneralisation: Why Correct Specifications Aren’t Enough for Correct Goals

4 January 2025 | 17 min

AI Safety via Debate

4 January 2025 | 40 min

What Failure Looks Like

4 January 2025 | 18 min

Least-To-Most Prompting Enables Complex Reasoning in Large Language Models

4 January 2025 | 16 min

Specification Gaming: The Flip Side of AI Ingenuity

4 January 2025 | 13 min

Summarizing Books With Human Feedback

4 January 2025 | 6 min

The Easy Goal Inference Problem Is Still Hard

4 January 2025 | 8 min

Supervising Strong Learners by Amplifying Weak Experts

4 January 2025 | 19 min

AGI Safety From First Principles

4 January 2025 | 13 min

Measuring Progress on Scalable Oversight for Large Language Models

4 January 2025 | 10 min

Abstract:

Developing safe and useful general-purpose AI systems will require us to make progress on scalable oversight: the problem of supervising systems that potentially outperform us on most skills relevant to the task at hand. Empirical work on this problem is not straightforward, since we do not yet have systems that broadly exceed our abilities. This paper discusses one of the major ways we think about this problem, with a focus on ways it can be studied empirically. We first present an experimental design centered on tasks for which human specialists succeed but unaided humans and current general AI systems fail. We then present a proof-of-concept experiment meant to demonstrate a key feature of this experimental design and show its viability with two question-answering tasks: MMLU and time-limited QuALITY. On these tasks, we find that human participants who interact with an unreliable large-language-model dialog assistant through chat -- a trivial baseline strategy for scalable oversight -- substantially outperform both the model alone and their own unaided performance. These results are an encouraging sign that scalable oversight will be tractable to study with present models and bolster recent findings that large language models can productively assist humans with difficult tasks.

Authors:

Samuel R. Bowman, Jeeyoon Hyun, Ethan Perez, Edwin Chen, Craig Pettit, Scott Heiner, Kamilė Lukošiūtė, Amanda Askell, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Christopher Olah, Daniela Amodei, Dario Amodei, Dawn Drain, Dustin Li, Eli Tran-Johnson, Jackson Kernion, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Liane Lovitt, Nelson Elhage, Nicholas Schiefer, Nicholas Joseph, Noemí Mercado, Nova DasSarma, Robin Larson, Sam McCandlish, Sandipan Kundu, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Timothy Telleen-Lawton, Tom Brown, Tom Henighan, Tristan Hume, Yuntao Bai, Zac Hatfield-Dodds, Ben Mann, Jared Kaplan

Original text:

https://arxiv.org/abs/2211.03540

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Biological Anchors: A Trick That Might Or Might Not Work

4 January 2025 | 71 min

Is Power-Seeking AI an Existential Risk?

4 January 2025 | 201 min

More Is Different for AI

4 January 2025 | 7 min

Visualizing the Deep Learning Revolution

4 January 2025 | 42 min

Progress on Causal Influence Diagrams

4 January 2025 | 23 min

Towards Monosemanticity: Decomposing Language Models With Dictionary Learning

4 January 2025 | 9 min

Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

4 January 2025 | 35 min

Zoom In: An Introduction to Circuits

4 January 2025 | 44 min

Can We Scale Human Feedback for Complex AI Tasks?

4 January 2025 | 20 min

Machine Learning for Humans: Supervised Learning

4 January 2025 | 22 min

On the Opportunities and Risks of Foundation Models

4 January 2025 | 16 min

Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small

4 January 2025 | 25 min

AI Watermarking Won’t Curb Disinformation

4 January 2025 | 8 min

Intelligence Explosion: Evidence and Import

4 January 2025 | 19 min

Careers in Alignment

4 January 2025 | 8 min

Illustrating Reinforcement Learning from Human Feedback (RLHF)

4 January 2025 | 23 min

Deep Double Descent

4 January 2025 | 8 min

Toy Models of Superposition

4 January 2025 | 42 min

An Investigation of Model-Free Planning

4 January 2025 | 8 min

ABS: Scanning Neural Networks for Back-Doors by Artificial Brain Stimulation

4 January 2025 | 16 min

Low-Stakes Alignment

4 January 2025 | 14 min

Compute Trends Across Three Eras of Machine Learning

4 January 2025 | 14 min

How to Get Feedback

4 January 2025 | 8 min

Constitutional AI: Harmlessness from AI Feedback

4 January 2025 | 62 min

Emerging Processes for Frontier AI Safety

4 January 2025 | 18 min

Challenges in Evaluating AI Systems

4 January 2025 | 23 min

Worst-Case Thinking in AI Alignment

4 January 2025 | 12 min

AI Control: Improving Safety Despite Intentional Subversion

4 January 2025 | 21 min

Empirical Findings Generalize Surprisingly Far

4 January 2025 | 12 min

Computing Power and the Governance of AI

4 January 2025 | 27 min

Two-Turn Debate Doesn’t Help Humans Answer Hard Reading Comprehension Questions

4 January 2025 | 17 min

Working in AI Alignment

4 January 2025 | 69 min

Imitative Generalisation (AKA ‘Learning the Prior’)

4 January 2025 | 18 min

Planning a High-Impact Career: A Summary of Everything You Need to Know in 7 Points

4 January 2025 | 11 min

Discovering Latent Knowledge in Language Models Without Supervision

4 January 2025 | 37 min

Become a Person who Actually Does Things

4 January 2025 | 5 min

Gradient Hacking: Definitions and Examples

4 January 2025 | 9 min

How to Succeed as an Early-Stage Researcher: The “Lean Startup” Approach

4 January 2025 | 15 min

Chinchilla’s Wild Implications

4 January 2025 | 25 min

Being the (Pareto) Best in the World

4 January 2025 | 7 min

Eliciting Latent Knowledge

4 January 2025 | 60 min

Writing, Briefly

4 January 2025 | 3 min

Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback

4 January 2025 | 32 min

Public by Default: How We Manage Information Visibility at Get on Board

4 January 2025 | 10 min

Introduction to Mechanistic Interpretability

4 January 2025 | 12 min

We Need a Science of Evals

4 January 2025 | 20 min

Intro to Brain-Like-AGI Safety

4 January 2025 | 62 min

If-Then Commitments for AI Risk Reduction

2 January 2025 | 40 min

This is How AI Will Transform How Science Gets Done

2 January 2025 | 11 min

Open-Sourcing Highly Capable Foundation Models: An Evaluation of Risks, Benefits, and Alternative Methods for Pursuing Open-Source Objectives

30 December 2024 | 56 min

So You Want to be a Policy Entrepreneur?

30 December 2024 | 41 min

Considerations for Governing Open Foundation Models

30 December 2024 | 26 min

Driving U.S. Innovation in Artificial Intelligence: A Roadmap for Artificial Intelligence Policy in the United States Senate

22 May 2024 | 36 min

Societal Adaptation to Advanced AI

20 May 2024 | 46 min

The AI Triad and What It Means for National Security Strategy

20 May 2024 | 40 min

OECD AI Principles

13 May 2024 | 24 min

A pro-innovation approach to AI regulation: government response

13 May 2024 | 38 min

The Bletchley Declaration by Countries Attending the AI Safety Summit, 1-2 November 2023

13 May 2024 | 9 min

Key facts: UNESCO’s Recommendation on the Ethics of Artificial Intelligence

13 May 2024 | 21 min

Recent U.S. Efforts on AI Policy

13 May 2024 | 6 min

FACT SHEET: President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence

13 May 2024 | 14 min

High-level summary of the AI Act

13 May 2024 | 18 min

China’s AI Regulations and How They Get Made

13 May 2024 | 27 min

AI Index Report 2024, Chapter 7: Policy and Governance

13 May 2024 | 21 min

The Policy Playbook: Building a Systems-Oriented Approach to Technology and National Security Policy

5 May 2024 | 56 min

Strengthening Resilience to AI Risk: A Guide for UK Policymakers

4 May 2024 | 25 min

The Convergence of Artificial Intelligence and the Life Sciences: Safeguarding Technology, Rethinking Governance, and Preventing Catastrophe

3 May 2024 | 9 min

Rogue AIs

1 May 2024 | 34 min

What is AI Alignment?

1 May 2024 | 11 min

An Overview of Catastrophic AI Risks

29 April 2024 | 45 min

Future Risks of Frontier AI

23 April 2024 | 40 min

What risks does AI pose?

23 April 2024 | 24 min

AI Could Defeat All Of Us Combined

22 April 2024 | 24 min

Moore's Law for Everything

16 April 2024 | 17 min

The Transformative Potential of Artificial Intelligence

16 April 2024 | 49 min

Positive AI Economic Futures

16 April 2024 | 21 min

The Economic Potential of Generative AI: The Next Productivity Frontier

16 April 2024 | 42 min

A Short Introduction to Machine Learning

13 May 2023 | 18 min

Visualizing the Deep Learning Revolution

13 May 2023 | 42 min

The AI Triad and What It Means for National Security Strategy

13 May 2023 | 27 min

As AI Agents Like Auto-GPT Speed up Generative AI Race, We All Need to Buckle Up

13 May 2023 | 7 min

Overview of How AI Might Exacerbate Long-Running Catastrophic Risks

13 May 2023 | 24 min

The Need for Work on Technical AI Alignment

13 May 2023 | 34 min

Specification Gaming: The Flip Side of AI Ingenuity

13 May 2023 | 13 min

Emergent Deception and Emergent Optimization

13 May 2023 | 33 min

Why Might Misaligned, Advanced AI Cause Catastrophe?

13 May 2023 | 20 min

Nobody’s on the Ball on AGI Alignment

13 May 2023 | 17 min

AI Safety Seems Hard to Measure

13 May 2023 | 22 min

Avoiding Extreme Global Vulnerability as a Core AI Governance Problem

13 May 2023 | 12 min

Much has been written framing and articulating the AI governance problem from a catastrophic risks lens, but these writings have been scattered. This page aims to provide a synthesized introduction to some of these already prominent framings. This is just one attempt at suggesting an overall frame for thinking about some AI governance problems; it may miss important things.

Some researchers think that unsafe development or misuse of AI could cause massive harms. A key contributor to some of these risks is that catastrophe may not require all or most relevant decision makers to make harmful decisions. Instead, harmful decisions from just a minority of influential decision makers—perhaps just a single actor with good intentions—may be enough to cause catastrophe. For example, some researchers argue, if just one organization deploys highly capable, goal-pursuing, misaligned AI—or if many businesses (but a small portion of all businesses) deploy somewhat capable, goal-pursuing, misaligned AI—humanity could be permanently disempowered.

The above would not be very worrying if we could rest assured that no actors capable of these harmful actions would take them. However, especially in the context of AI safety, several factors are arguably likely to incentivize some actors to take harmful deployment actions:

Misjudgment: Assessing the consequences of AI deployment may be difficult (as it is now, especially given the nature of AI risk arguments), so some organizations could easily get it wrong—concluding that an AI system is safe or beneficial when it is not.

“Winner-take-all” competition: If the first organization(s) to deploy advanced AI is expected to get large gains, while leaving competitors with nothing, competitors would be highly incentivized to cut corners in order to be first—they would have less to lose.

Original text:

https://www.agisafetyfundamentals.com/governance-blog/global-vulnerability

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

Model Evaluation for Extreme Risks

13 May 2023 | 56 min

Primer on Safety Standards and Regulations for Industrial-Scale AI Development

13 May 2023 | 16 min

Frontier AI Regulation: Managing Emerging Risks to Public Safety

13 May 2023 | 30 min

Advanced AI models hold the promise of tremendous benefits for humanity, but society needs to proactively manage the accompanying risks. In this paper, we focus on what we term “frontier AI” models — highly capable foundation models that could possess dangerous capabilities sufficient to pose severe risks to public safety. Frontier AI models pose a distinct regulatory challenge: dangerous capabilities can arise unexpectedly; it is difficult to robustly prevent a deployed model from being misused; and, it is difficult to stop a model’s capabilities from proliferating broadly. To address these challenges, at least three building blocks for the regulation of frontier models are needed: (1) standard-setting processes to identify appropriate requirements for frontier AI developers, (2) registration and reporting requirements to provide regulators with visibility into frontier AI development processes, and (3) mechanisms to ensure compliance with safety standards for the development and deployment of frontier AI models. Industry self-regulation is an important first step. However, wider societal discussions and government intervention will be needed to create standards and to ensure compliance with them. We consider several options to this end, including granting enforcement powers to supervisory authorities and licensure regimes for frontier AI models. Finally, we propose an initial set of safety standards. These include conducting pre-deployment risk assessments; external scrutiny of model behavior; using risk assessments to inform deployment decisions; and monitoring and responding to new information about model capabilities and uses post-deployment. We hope this discussion contributes to the broader conversation on how to balance public safety risks and innovation benefits from advances at the frontier of AI development.

Source:

https://arxiv.org/pdf/2307.03718.pdf

Narrated for AI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

The State of AI in Different Countries — An Overview

13 May 2023 | 36 min

Primer on AI Chips and AI Governance

13 May 2023 | 25 min

Choking off China’s Access to the Future of AI

13 May 2023 | 8 min

Racing Through a Minefield: The AI Deployment Problem

13 May 2023 | 21 min

A Tour of Emerging Cryptographic Technologies

13 May 2023 | 31 min

Historical Case Studies of Technology Governance and International Agreements

13 May 2023 | 36 min

What Does It Take to Catch a Chinchilla? Verifying Rules on Large-Scale Neural Network Training via Compute Monitoring

13 May 2023 | 32 min

International Institutions for Advanced AI

13 May 2023 | 42 min

LP Announcement by OpenAI

13 May 2023 | 7 min

OpenAI Charter

13 May 2023 | 3 min

What AI Companies Can Do Today to Help With the Most Important Century

13 May 2023 | 18 min

Let’s Think About Slowing Down AI

13 May 2023 | 75 min

12 Tentative Ideas for US AI Policy

13 May 2023 | 10 min

Some Talent Needs in AI Governance

13 May 2023 | 16 min

AI Governance Needs Technical Work

13 May 2023 | 15 min

Career Resources on AI Strategy Research

13 May 2023 | 18 min

China-Related AI Safety and Governance Paths

13 May 2023 | 48 min

Expertise in China and its relations with the world might be critical in tackling some of the world’s most pressing problems. In particular, China’s relationship with the US is arguably the most important bilateral relationship in the world, with these two countries collectively accounting for over 40% of global GDP. These considerations led us to publish a guide to improving China–Western coordination on global catastrophic risks and other key problems in 2018. Since then, we have seen an increase in the number of people exploring this area.

China is one of the most important countries developing and shaping advanced artificial intelligence (AI). The Chinese government’s spending on AI research and development is estimated to be on the same order of magnitude as that of the US government, and China’s AI research is prominent on the world stage and growing.

Because of the importance of AI from the perspective of improving the long-run trajectory of the world, we think relations between China and the US on AI could be among the most important aspects of their relationship. Insofar as the EU and/or UK influence advanced AI development through labs based in their countries or through their influence on global regulation, the state of understanding and coordination between European and Chinese actors on AI safety and governance could also be significant.

That, in short, is why we think working on AI safety and governance in China and/or building mutual understanding between Chinese and Western actors in these areas is likely to be one of the most promising China-related career paths. Below we provide more arguments and detailed information on this option.

If you are interested in pursuing a career path described in this profile, contact 80,000 Hours’ one-on-one team and we may be able to put you in touch with a specialist advisor.

Source:

https://80000hours.org/career-reviews/china-related-ai-safety-and-governance-paths/

Narrated for AGI Safety Fundamentals by Perrin Walker of TYPE III AUDIO.

List of EA Funding Opportunities

13 May 2023 | 12 min

My Current Impressions on Career Choice for Longtermists

13 May 2023 | 47 min

This post summarizes the way I currently think about career choice for longtermists. I have put much less time into thinking about this than 80,000 Hours, but I think it’s valuable for there to be multiple perspectives on this topic out there.

Edited to add: see below for why I chose to focus on longtermism in this post.

While the jobs I list overlap heavily with the jobs 80,000 Hours lists, I organize them and conceptualize them differently. 80,000 Hours tends to emphasize “paths” to particular roles working on particular causes; by contrast, I emphasize “aptitudes” one can build in a wide variety of roles and causes (including non-effective-altruist organizations) and then apply to a wide variety of longtermist-relevant jobs (often with options working on more than one cause). Example aptitudes include: “helping organizations achieve their objectives via good business practices,” “evaluating claims against each other,” “communicating already-existing ideas to not-yet-sold audiences,” etc.

(Other frameworks for career choice include starting with causes (AI safety, biorisk, etc.) or heuristics (“Do work you can be great at,” “Do work that builds your career capital and gives you more options.”) I tend to feel people should consider multiple frameworks when making career choices, since any one framework can contain useful insight, but risks being too dogmatic and specific for individual cases.)

For each aptitude I list, I include ideas for how to explore the aptitude and tell whether one is on track. Something I like about an aptitude-based framework is that it is often relatively straightforward to get a sense of one’s promise for, and progress on, a given “aptitude” if one chooses to do so. This contrasts with cause-based and path-based approaches, where there’s a lot of happenstance in whether there is a job available in a given cause or on a given path, making it hard for many people to get a clear sense of their fit for their first-choice cause/path and making it hard to know what to do next. This framework won’t make it easier for people to get the jobs they want, but it might make it easier for them to start learning about what sort of work is and isn’t likely to be a fit.

Source:

https://forum.effectivealtruism.org/posts/bud2ssJLQ33pSemKH/longtermist-career-choice

Narrated for AI Safety Fundamentals by TYPE III AUDIO.

AI Safety Fundamentals: Governance

Listen to resources from the AI Safety Fundamentals: Governance course! https://aisafetyfundamentals.
