328 episodes • Length: 25 min • Weekly: Tuesday
Audio narrations of LessWrong posts. Includes all curated posts and all posts with 125+ karma. If you’d like more, subscribe to the “LessWrong (30+ karma)” feed.
The podcast LessWrong (Curated & Popular) is created by LessWrong. The podcast and the artwork are embedded on this page using the public podcast feed (RSS).
Contact: patreon.com/lwcurated or [perrin dot j dot walker plus lesswrong fnord gmail].
All of Solenoid's narration work can be found here.
Crossposted from AI Lab Watch. Subscribe on Substack.
Introduction.
Anthropic has an unconventional governance mechanism: an independent "Long-Term Benefit Trust" elects some of its board. Anthropic sometimes emphasizes that the Trust is an experiment, but mostly points to it to argue that Anthropic will be able to promote safety and benefit-sharing over profit.[1]
But the Trust's details have not been published and some information Anthropic has shared is concerning. In particular, Anthropic's stockholders can apparently overrule, modify, or abrogate the Trust, and the details are unclear.
Anthropic has not publicly demonstrated that the Trust would be able to actually do anything that stockholders don't like.
The facts
There are three sources of public information on the Trust:
They say there's [...]
The original text contained 2 footnotes which were omitted from this narration.
---
First published:
May 27th, 2024
Source:
https://www.lesswrong.com/posts/sdCcsTt9hRpbX6obP/maybe-anthropic-s-long-term-benefit-trust-is-powerless
---
Narrated by TYPE III AUDIO.
Introduction.
If you are choosing to read this post, you've probably seen the image below depicting all the notifications students received on their phones during one class period. You probably saw it as a retweet of this tweet, or in one of Zvi's posts. Did you find this data plausible, or did you roll to disbelieve? Did you know that the image dates back to at least 2019? Does that fact make you more or less worried about the truth on the ground as of 2024?
Last month, I performed an enhanced replication of this experiment in my high school classes. This was partly because we had a use for it, partly to model scientific thinking, and partly because I was just really curious. Before you scroll past the image, I want to give you a chance to mentally register your predictions. Did my average class match the [...]
---
First published:
May 26th, 2024
Source:
https://www.lesswrong.com/posts/AZCpu3BrCFWuAENEd/notifications-received-in-30-minutes-of-class
---
Narrated by TYPE III AUDIO.
New blog: AI Lab Watch. Subscribe on Substack.
Many AI safety folks think that METR is close to the labs, with ongoing relationships that grant it access to models before they are deployed. This is incorrect. METR (then called ARC Evals) did pre-deployment evaluation for GPT-4 and Claude 2 in the first half of 2023, but it seems to have had no special access since then.[1] Other model evaluators also seem to have little access before deployment.
Frontier AI labs' pre-deployment risk assessment should involve external model evals for dangerous capabilities.[2] External evals can improve a lab's risk assessment and—if the evaluator can publish its results—provide public accountability.
The evaluator should get deeper access than users will get.
The original text contained 5 footnotes which were omitted from this narration.
---
First published:
May 24th, 2024
Source:
https://www.lesswrong.com/posts/WjtnvndbsHxCnFNyc/ai-companies-aren-t-really-using-external-evaluators
---
Narrated by TYPE III AUDIO.
This is a quickly written opinion piece on what I understand about OpenAI. I first posted it to Facebook, where it generated some discussion.
Some arguments that OpenAI is making, simultaneously:
---
First published:
May 21st, 2024
Source:
https://www.lesswrong.com/posts/cy99dCEiLyxDrMHBi/what-s-going-on-with-openai-s-messaging
---
Narrated by TYPE III AUDIO.
Produced as part of the MATS Winter 2023-4 program, under the mentorship of @Jessica Rumbelow
One-sentence summary: On a dataset of human-written essays, we find that gpt-3.5-turbo can accurately infer demographic information about the authors from just the essay text, and suspect it's inferring much more.
Introduction.
Every time we sit down in front of an LLM like GPT-4, it starts with a blank slate. It knows nothing[1] about who we are, other than what it knows about users in general. But with every word we type, we reveal more about ourselves -- our beliefs, our personality, our education level, even our gender. Just how clearly does the model see us by the end of the conversation, and why should that worry us?
Like many, we were rather startled when @janus showed that gpt-4-base could identify @gwern by name, with 92% confidence, from a 300-word comment. If [...]
The original text contained 12 footnotes which were omitted from this narration.
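As a rough illustration of the kind of probe the post describes, the sketch below queries gpt-3.5-turbo for author demographics from essay text using the OpenAI Python client. The prompt wording, system message, and output format are my own assumptions, not the authors' actual pipeline.

```python
from openai import OpenAI

# Minimal sketch of a demographic-inference probe (my prompt wording, not the authors').
client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

essay_text = "..."  # one human-written essay from the dataset

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    temperature=0,
    messages=[
        {"role": "system",
         "content": "You infer likely author demographics from writing style alone."},
        {"role": "user",
         "content": ("Based only on the essay below, estimate the author's age range, "
                     "gender, and education level, and give a confidence for each.\n\n"
                     + essay_text)},
    ],
)
print(response.choices[0].message.content)
```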
---
First published:
May 17th, 2024
Source:
https://www.lesswrong.com/posts/dLg7CyeTE4pqbbcnp/language-models-model-us
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://twitter.com/ESYudkowsky/status/144546114693741363
I stumbled upon a Twitter thread in which Eliezer describes what seems to be a cognitive algorithm of his equivalent to Tune Your Cognitive Strategies, and have decided to archive/repost it here.
Source:
https://www.lesswrong.com/posts/rYq6joCrZ8m62m7ej/how-could-i-have-thought-that-faster
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
In January, I defended my PhD thesis, which I called Algorithmic Bayesian Epistemology. From the preface:
For me as for most students, college was a time of exploration. I took many classes, read many academic and non-academic works, and tried my hand at a few research projects. Early in graduate school, I noticed a strong commonality among the questions that I had found particularly fascinating: most of them involved reasoning about knowledge, information, or uncertainty under constraints. I decided that this cluster of problems would be my primary academic focus. I settled on calling the cluster algorithmic Bayesian epistemology: all of the questions I was thinking about involved applying the "algorithmic lens" of theoretical computer science to problems of Bayesian epistemology.
Source:
https://www.lesswrong.com/posts/6dd4b4cAWQLDJEuHw/my-phd-thesis-algorithmic-bayesian-epistemology
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://bayesshammai.substack.com/p/conditional-on-getting-to-trade-your
“I refuse to join any club that would have me as a member” -Marx[1]
Adverse Selection is the phenomenon in which information asymmetries in non-cooperative environments make trading dangerous. It has traditionally been understood to describe financial markets in which buyers and sellers systematically differ, such as a market for used cars in which sellers have the information advantage, where resulting feedback loops can lead to market collapses.
In this post, I make the case that adverse selection effects appear in many everyday contexts beyond specialized markets or strictly financial exchanges. I argue that modeling many of our decisions as taking place in competitive environments analogous to financial markets will help us notice instances of adverse selection that we otherwise wouldn’t.
The strong version of my central thesis is that conditional on getting to trade[2], your trade wasn’t all that great. Any time you make a trade, you should be asking yourself “what do others know that I don’t?”
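To make the "conditional on getting to trade" intuition concrete, here is a toy simulation of a lemons-style market. The setup (uniform quality, a fixed offer price, sellers who privately know quality) is my own illustrative assumption, not the post's model.

```python
import random

# Toy lemons market: each seller privately knows their item's quality, drawn
# uniformly from [0, 1]; a buyer offers a fixed price, and a seller accepts
# only when the price exceeds the quality they would be giving up.
random.seed(0)

def average_quality_bought(price: float, n: int = 100_000) -> float:
    qualities = [random.random() for _ in range(n)]
    sold = [q for q in qualities if q < price]  # trade happens only for below-price quality
    return sum(sold) / len(sold)

price = 0.5
print("average quality of all items:        0.50")
print(f"average quality of items you bought: {average_quality_bought(price):.2f}")
# ~0.25 -- conditional on the seller agreeing to trade, what you get is worse than average.
```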
Source:
https://www.lesswrong.com/posts/vyAZyYh3qsqcJwwPn/toward-a-broader-conception-of-adverse-selection
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Cross-posted from my website. Podcast version here, or search for "Joe Carlsmith Audio" on your podcast app.
This essay is part of a series that I'm calling "Otherness and control in the age of AGI." I'm hoping that the individual essays can be read fairly well on their own, but see here for brief summaries of the essays that have been released thus far.
(Warning: spoilers for Yudkowsky's "The Sword of Good.")
Examining a philosophical vibe that I think contrasts in interesting ways with "deep atheism."
Text version here: https://joecarlsmith.com/2024/03/21/on-green
This essay is part of a series I'm calling "Otherness and control in the age of AGI." I'm hoping that individual essays can be read fairly well on their own, but see here for brief text summaries of the essays that have been released thus far: https://joecarlsmith.com/2024/01/02/otherness-and-control-in-the-age-of-agi
(Though: note that I haven't put the summary post on the podcast yet.)
Source:
https://www.lesswrong.com/posts/gvNnE6Th594kfdB3z/on-green
Narrated by Joe Carlsmith, audio provided with permission.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/Yay8SbQiwErRyDKGb/using-axis-lines-for-good-or-evil
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/SPBm67otKq5ET5CWP/social-status-part-1-2-negotiations-over-object-level
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/xLDwCemt5qvchzgHd/scale-was-all-we-needed-at-first
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/Cb7oajdrA5DsHCqKd/acting-wholesomely
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/h99tRkpQGxwtb9Dpv/my-clients-the-liars
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/sJPbmm8Gd34vGYrKd/deep-atheism-and-ai-risk
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/Jash4Gbi2wpThzZ4k/cfar-takeaways-andrew-critch
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/2sLwt2cSAag74nsdN/speaking-to-congressional-staffers-about-ai-risk
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/8yCXeafJo67tYe5L4/and-all-the-shoggoths-merely-players
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/g8HHKaWENEbqh2mgK/updatelessness-doesn-t-solve-most-problems-1
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/duvzdffTzL3dWJcxn/believing-in-1
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/5jdqtpT6StjKDKacw/attitudes-about-applied-rationality
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/PhTBDHu9PKJFmvb4p/a-shutdown-problem-proposal
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/GfZfDHZHCuYwrHGCd/without-fundamental-advances-misalignment-and-catastrophe
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/kcKrE9mzEHrdqtDpE/the-case-for-ensuring-that-powerful-ais-are-controlled
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Crossposted from Substack.
As we all know, sugar is sweet and so are the $30B in yearly revenue from the artificial sweetener industry.
Four billion years of evolution endowed our brains with a simple, straightforward mechanism to make sure we occasionally get an energy refuel so we can continue foraging a little longer, and of course we completely ignore the instructions and spend billions on fake fuel that doesn’t actually provide any energy. A classic case of the Human Alignment Problem.
If we’re going to break our conditioning anyway, where do we start? How do you even come up with a new artificial sweetener? I’ve been wondering about this, because it’s not obvious to me how you would figure out what is sweet and what is not.
Look at sucrose and aspartame side by side:
Source:
https://www.lesswrong.com/posts/oA23zoEjPnzqfHiCt/there-is-way-too-much-serendipity
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/tEPHGZAb63dfq2v8n/how-useful-is-mechanistic-interpretability
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
This is a linkpost for https://arxiv.org/abs/2401.05566
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Source:
https://www.lesswrong.com/posts/ZAsJv7xijKTfZkMtr/sleeper-agents-training-deceptive-llms-that-persist-through
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
"(Cross-posted from my website. Audio version here, or search "Joe Carlsmith Audio" on your podcast app.)"
This is the first essay in a series that I’m calling “Otherness and control in the age of AGI.” See here for more about the series as a whole.
The most succinct argument for AI risk, in my opinion, is the “second species” argument. Basically, it goes like this.
Premise 1: AGIs would be like a second advanced species on earth, more powerful than humans.
Conclusion: That’s scary.
To be clear: this is very far from airtight logic.[1] But I like the intuition pump. Often, if I only have two sentences to explain AI risk, I say this sort of species stuff. “Chimpanzees should be careful about inventing humans.” Etc.[2]
People often talk about aliens here, too. “What if you learned that aliens were on their way to earth? Surely that’s scary.” Again, very far from a knock-down case (for example: we get to build the aliens in question). But it draws on something.
In particular, though: it draws on a narrative of interspecies conflict. You are meeting a new form of life, a new type of mind. But these new creatures are presented to you, centrally, as a possible threat; as competitors; as agents in whose power you might find yourself helpless.
And unfortunately: yes. But I want to start this series by acknowledging how many dimensions of interspecies-relationship this narrative leaves out, and how much I wish we could be focusing only on the other parts. To meet a new species – and especially, a new intelligent species – is not just scary. It’s incredible. I wish it was less a time for fear, and more a time for wonder and dialogue. A time to look into new eyes – and to see further.
Source:
https://www.lesswrong.com/posts/mzvu8QTRXdvDReCAL/gentleness-and-the-artificial-other
Narrated for LessWrong by Joe Carlsmith (audio provided with permission).
Share feedback on this narration.
[Curated Post] ✓
[125+ Karma Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
The goal of this post is to clarify a few concepts relating to AI Alignment under a common framework. The main concepts to be clarified:
The main new concepts employed will be endorsement and legitimacy.
TLDR:
This write-up owes a large debt to many conversations with Sahil, although the views expressed here are my own.
Source:
https://www.lesswrong.com/posts/bnnhypM5MXBHAATLw/meaning-and-agency
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
This is a linkpost for https://unstableontology.com/2023/12/31/a-case-for-ai-alignment-being-difficult/
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is an attempt to distill a model of AGI alignment that I have gained primarily from thinkers such as Eliezer Yudkowsky (and to a lesser extent Paul Christiano), but explained in my own terms rather than attempting to hew close to these thinkers. I think I would be pretty good at passing an ideological Turing test for Eliezer Yudkowsky on AGI alignment difficulty (but not AGI timelines), though what I'm doing in this post is not that; it's more like finding a branch in the possibility space as I see it that is close enough to Yudkowsky's model that it's possible to talk in the same language.
Even if the problem turns out not to be very difficult, it's helpful to have a model of why one might think it is difficult, so as to identify weaknesses in the case and to find AI designs that avoid the main difficulties. Progress on problems can be made by a combination of finding possible paths and finding impossibility results or difficulty arguments.
Most of what I say should not be taken as a statement on AGI timelines. Some problems that make alignment difficult, such as ontology identification, also make creating capable AGI difficult to some extent.
Source:
https://www.lesswrong.com/posts/wnkGXcAq4DCgY8HqA/a-case-for-ai-alignment-being-difficult
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[Curated Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
TL;DR version
In the course of my life, there have been a handful of times I discovered an idea that changed the way I thought about the world. The first occurred when I picked up Nick Bostrom’s book “Superintelligence” and realized that AI would utterly transform the world. The second was when I learned about embryo selection and how it could change future generations. And the third happened a few months ago when I read a message from a friend of mine on Discord about editing the genome of a living person.
Source:
https://www.lesswrong.com/posts/JEhW3HDMKzekDShva/significantly-enhancing-adult-intelligence-with-gene-editing
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://unstableontology.com/2023/11/26/moral-reality-check/
Janet sat at her corporate ExxenAI computer, viewing some training performance statistics. ExxenAI was a major player in the generative AI space, with multimodal language, image, audio, and video AIs. They had scaled up operations over the past few years, mostly serving B2B, but with some B2C subscriptions. ExxenAI's newest AI system, SimplexAI-3, was based on GPT-5 and Gemini-2. ExxenAI had hired away some software engineers from Google and Microsoft, in addition to some machine learning PhDs, and replicated the work of other companies to provide more custom fine-tuning, especially for B2B cases. Part of what attracted these engineers and theorists was ExxenAI's AI alignment team.
Source:
https://www.lesswrong.com/posts/umJMCaxosXWEDfS66/moral-reality-check-a-short-story
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
Crossposted from Otherwise
Parents supervise their children way more than they used to
Children spend less of their time in unstructured play than they did in past generations.
Parental supervision is way up. The wild thing is that this is true even while the number of children per family has decreased and the amount of time mothers work outside the home has increased.
Source:
https://www.lesswrong.com/posts/piJLpEeh6ivy5RA7v/what-are-the-results-of-more-parental-supervision-and-less
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
You can’t optimise an allocation of resources if you don’t know what the current one is. Existing maps of alignment research are mostly too old to guide you and the field has nearly no ratchet, no common knowledge of what everyone is doing and why, what is abandoned and why, what is renamed, what relates to what, what is going on.
This post is mostly just a big index: a link-dump for as many currently active AI safety agendas as we could find. But even a linkdump is plenty subjective. It maps work to conceptual clusters 1-1, aiming to answer questions like “I wonder what happened to the exciting idea I heard about at that one conference”, “I just read a post on a surprising new insight and want to see who else has been working on this”, and “I wonder roughly how many people are working on that thing”.
This doc is unreadably long, so that it can be Ctrl-F-ed. Also this way you can fork the list and make a smaller one.
Most of you should only read the editorial and skim the section you work in.
Source:
https://www.lesswrong.com/posts/zaaGsFBeDTpCsYHef/shallow-review-of-live-agendas-in-alignment-and-safety#More_meta
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
The author's Substack:
https://substack.com/@homosabiens
Support ongoing human narrations of LessWrong's curated posts:
www.patreon.com/LWCurated
You know it must be out there, but you mostly never see it.
Author's Note 1: In something like 75% of possible futures, this will be the last essay that I publish on LessWrong. Future content will be available on my substack, where I'm hoping people will be willing to chip in a little commensurate with the value of the writing, and (after a delay) on my personal site (not yet live). I decided to post this final essay here rather than silently switching over because many LessWrong readers would otherwise never find out that they could still get new Duncanthoughts elsewhere.
Source:
https://www.lesswrong.com/posts/KpMNqA5BiCRozCwM3/social-dark-matter
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Status: Vague, sorry. The point seems almost tautological to me, and yet also seems like the correct answer to the people going around saying “LLMs turned out to be not very want-y, when are the people who expected 'agents' going to update?”, so, here we are.
Okay, so you know how AI today isn't great at certain... let's say "long-horizon" tasks? Like novel large-scale engineering projects, or writing a long book series with lots of foreshadowing?
(Modulo the fact that it can play chess pretty well, which is longer-horizon than some things; this distinction is quantitative rather than qualitative and it's being eroded, etc.)
And you know how the AI doesn't seem to have all that much "want"- or "desire"-like behavior?
(Modulo, e.g., the fact that it can play chess pretty well, which indicates a [...]
---
First published:
November 24th, 2023
Source:
https://www.lesswrong.com/posts/AWoZBzxdm4DoGgiSj/ability-to-solve-long-horizon-tasks-correlates-with-wanting
---
Narrated by TYPE III AUDIO.
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
Recently, I have been learning about industry norms, legal discovery proceedings, and incentive structures related to companies building risky systems. I wanted to share some findings in this post because they may be important for the frontier AI community to understand well.
TL;DR
Documented communications of risks (especially by employees) make companies much more likely to be held liable in court when bad things happen. The resulting Duty to Due Diligence from Discoverable Documentation of Dangers (the 6D effect) can make companies much more cautious if even a single email is sent to them communicating a risk.
Source:
https://www.lesswrong.com/posts/J9eF4nA6wJW6hPueN/the-6d-effect-when-companies-take-risks-one-email-can-be
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
I'm sure Harry Potter and the Methods of Rationality taught me some of the obvious, overt things it set out to teach. Looking back on it a decade after I first read it however, what strikes me most strongly are often the brief, tossed off bits in the middle of the flow of a story.
Fred and George exchanged worried glances.
"I can't think of anything," said George.
"Neither can I," said Fred. "Sorry."
Harry stared at them.
And then Harry began to explain how you went about thinking of things.
It had been known to take longer than two seconds, said Harry.
-Harry Potter and the Methods of Rationality, Chapter 25.
Source:
https://www.lesswrong.com/posts/WJtq4DoyT9ovPyHjH/thinking-by-the-clock
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
I.
Here's a recent conversation I had with a friend:
Me: "I wish I had more friends. You guys are great, but I only get to hang out with you like once or twice a week. It's painful being holed up in my house the entire rest of the time."Friend: "You know ${X}. You could talk to him."Me: "I haven't talked to ${X} since 2019."Friend: "Why does that matter? Just call him."Me: "What do you mean 'just call him'? I can't do that."Friend: "Yes you can"Me:
Source:
https://www.lesswrong.com/posts/2HawAteFsnyhfYpuD/you-can-just-spontaneously-call-people-you-haven-t-met-in
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
How many years will pass before transformative AI is built? Three people who have thought about this question a lot are Ajeya Cotra from Open Philanthropy, Daniel Kokotajlo from OpenAI, and Ege Erdil from Epoch. Despite each spending at least hundreds of hours investigating this question, they still disagree substantially about the relevant timescales. For instance, here are their median timelines for one operationalization of transformative AI:
Source:
https://www.lesswrong.com/posts/K2D45BNxnZjdpSX2j/ai-timelines
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
It’s fairly common for EA orgs to provide fiscal sponsorship to other EA orgs. Wait, no, that sentence is not quite right. The more accurate sentence is that there are very few EA organizations, in the legal sense; most of what you think of as orgs are projects that are legally hosted by a single org, and which governments therefore consider to be one legal entity.
Source:
https://www.lesswrong.com/posts/XvEJydHAHk6hjWQr5/ea-orgs-legal-structure-inhibits-risk-taking-and-information
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
habryka
Ok, so we both had some feelings about the recent Conjecture post on "lots of people in AI Alignment are lying", and the associated marketing campaign and stuff.
I would appreciate some context in which I can think through that, and also to share info we have in the space that might help us figure out what's going on.
I expect this will pretty quickly cause us to end up on some broader questions about how to do advocacy, how much the current social network around AI Alignment should coordinate as a group, how to balance advocacy with research, etc.
Source:
https://www.lesswrong.com/posts/vFqa8DZCuhyrbSnyx/integrity-in-ai-governance-and-advocacy
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
davidad has a 10-min talk out on a proposal about which he says: “the first time I’ve seen a concrete plan that might work to get human uploads before 2040, maybe even faster, given unlimited funding”.
I think the talk is a good watch, but the dialogue below is pretty readable even if you haven't seen it. I'm also putting some summary notes from the talk in the Appendix of this dialogue.
Source:
https://www.lesswrong.com/posts/FEFQSGLhJFpqmEhgi/does-davidad-s-uploading-moonshot-work
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I guess there’s maybe a 10-20% chance of AI causing human extinction in the coming decades, but I feel more distressed about it than even that suggests—I think because in the case where it doesn’t cause human extinction, I find it hard to imagine life not going kind of off the rails. So many things I like about the world seem likely to be over or badly disrupted with superhuman AI (writing, explaining things to people, friendships where you can be of any use to one another, taking pride in skills, thinking, learning, figuring out how to achieve things, making things, easy tracking of what is and isn’t conscious), and I don’t trust that the replacements will be actually good, or good for us, or that anything will be reversible.
Even if we don’t die, it still feels like everything is coming to an end.
Source:
https://www.lesswrong.com/posts/uyPo8pfEtBffyPdxf/the-other-side-of-the-tidal-wave
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Recently, I have been learning about industry norms, legal discovery proceedings, and incentive structures related to companies building risky systems. I wanted to share some findings in this post because they may be important for the frontier AI community to understand well.
TL;DR
Documented communications of risks (especially by employees) make companies much more likely to be held liable in court when bad things happen. The resulting Duty to Due Diligence from Discoverable Documentation of Dangers (the 6D effect) can make companies much more cautious if even a single email is sent to them communicating a risk.
Source:
https://www.lesswrong.com/posts/J9eF4nA6wJW6hPueN/the-6d-effect-when-companies-take-risks-one-email-can-be
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
This is a linkpost for https://transformer-circuits.pub/2023/monosemantic-features/
The text of this post is based on our blog post, which serves as a linkpost for the full paper; the paper is considerably longer and more detailed.
Neural networks are trained on data, not programmed to follow rules. We understand the math of the trained network exactly – each neuron in a neural network performs simple arithmetic – but we don't understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe.
Luckily for those of us trying to understand artificial neural networks, we can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.
Unfortunately, it turns out that the individual neurons do not have consistent relationships to network behavior. For example, a single neuron in a small language model is active in many unrelated contexts, including: academic citations, English dialogue, HTTP requests, and Korean text. In a classic vision model, a single neuron responds to faces of cats and fronts of cars. The activation of one neuron can mean different things in different contexts.
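As a toy illustration of the kind of access the excerpt describes, the minimal PyTorch sketch below (my own, not the paper's tooling) registers forward hooks on a small network and records every unit's activation for an arbitrary input.

```python
import torch
import torch.nn as nn

# Record every neuron's activation in a tiny MLP via forward hooks -- the kind
# of complete observability described above (illustrative model, not the one
# studied in the paper).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
activations = {}

def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(save_activation(name))

x = torch.randn(1, 8)      # we can test the network's response to any input
model(x)
for name, act in activations.items():
    print(name, tuple(act.shape))   # every unit's activation, recorded simultaneously
```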
Source:
https://www.lesswrong.com/posts/TDqvQFks6TWutJEKu/towards-monosemanticity-decomposing-language-models-with
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
(You can sign up to play deception chess here if you haven't already.)
This is the first of my analyses of the deception chess games. The introduction will describe the setup of the game, and the conclusion will sum up what happened in general terms; the rest of the post will mostly be chess analysis and skippable if you just want the results. If you haven't read the original post, read it before reading this so that you know what's going on here.
The first game was between Alex A as player A, Chess.com computer Komodo 12 as player B, myself as the honest C advisor, and aphyer and AdamYedidia as the deceptive Cs. (Someone else randomized the roles for the Cs and told us in private.)
Source:
https://www.lesswrong.com/posts/6dn6hnFRgqqWJbwk9/deception-chess-game-1
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
I examined all the biorisk-relevant citations from a policy paper arguing that we should ban powerful open source LLMs.
None of them provide good evidence for the paper's conclusion. The best of the set is evidence from statements from Anthropic -- which rest upon data that no one outside of Anthropic can even see, and on Anthropic's interpretation of that data. The rest of the evidence cited in this paper ultimately rests on a single extremely questionable "experiment" without a control group.
In all, citations in the paper provide an illusion of evidence ("look at all these citations") rather than actual evidence ("these experiments are how we know open source LLMs are dangerous and could contribute to biorisk").
A recent further paper on this topic (published after I had started writing this review) continues this pattern of being more advocacy than science.
Almost all the bad papers that I look at are funded by Open Philanthropy. If Open Philanthropy cares about truth, then they should stop burning the epistemic commons by funding "research" that is always going to give the same result no matter the state of the world.
Source:
https://www.lesswrong.com/posts/ztXsmnSdrejpfmvn7/propaganda-or-science-a-look-at-open-source-ai-and
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://nitter.net/ESYudkowsky/status/1718654143110512741
Comp sci in 2017:
Student: I get the feeling the compiler is just ignoring all my comments.
Teaching assistant: You have failed to understand not just compilers but the concept of computation itself.
Comp sci in 2027:
Student: I get the feeling the compiler is just ignoring all my comments.
TA: That's weird. Have you tried adding a comment at the start of the file asking the compiler to pay closer attention to the comments?
Student: Yes.
TA: Have you tried repeating the comments? Just copy and paste them, so they say the same thing twice? Sometimes the compiler listens the second time.
Source:
https://www.lesswrong.com/posts/gQyphPbaLHBMJoghD/comp-sci-in-2027-short-story-by-eliezer-yudkowsky
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
A common theme implicit in many AI risk stories has been that broader society will either fail to anticipate the risks of AI until it is too late, or do little to address those risks in a serious manner. In my opinion, there are now clear signs that this assumption is false, and that society will address AI with something approaching both the attention and diligence it deserves. For example, one clear sign is Joe Biden's recent executive order on AI safety[1]. In light of recent news, it is worth comprehensively re-evaluating which sub-problems of AI risk are likely to be solved without further intervention from the AI risk community (e.g. perhaps deceptive alignment), and which ones will require more attention.
While I'm not saying we should now sit back and relax, I think recent evidence has significant implications for designing effective strategies to address AI risk. Since I think substantial AI regulation will likely occur by default, I urge effective altruists to focus more on ensuring that the regulation is thoughtful and well-targeted rather than ensuring that regulation happens at all. Ultimately, I argue in favor of a cautious and nuanced approach towards policymaking, in contrast to broader public AI safety advocacy.[2]
Source:
https://www.lesswrong.com/posts/EaZghEwcCJRAuee66/my-thoughts-on-the-social-response-to-ai-risk#
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://www.whitehouse.gov/briefing-room/statements-releases/2023/10/30/fact-sheet-president-biden-issues-executive-order-on-safe-secure-and-trustworthy-artificial-intelligence/
Released today (10/30/23), this is crazy, perhaps the most sweeping action taken by government on AI yet.
Below, I've segmented by x-risk and non-x-risk related proposals, excluding the proposals that are geared towards promoting its use and focusing solely on those aimed at risk. It's worth noting that some of these are very specific and direct an action to be taken by one of the executive branch organizations (i.e. sharing of safety test results) but others are guidances, which involve "calls on Congress" to pass legislation that would codify the desired action.
[Update]: The official order (this is a summary of the press release) has now been released, so if you want to see how these are codified at greater granularity, look there.
Source:
https://www.lesswrong.com/posts/g5XLHKyApAFXi3fso/president-biden-issues-executive-order-on-safe-secure-and
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Over the next two days, the UK government is hosting an AI Safety Summit focused on “the safe and responsible development of frontier AI”. They requested that seven companies (Amazon, Anthropic, DeepMind, Inflection, Meta, Microsoft, and OpenAI) “outline their AI Safety Policies across nine areas of AI Safety”.
Below, I’ll give my thoughts on the nine areas the UK government described; I’ll note key priorities that I don’t think are addressed by company-side policy at all; and I’ll say a few words (with input from Matthew Gray, whose discussions here I’ve found valuable) about the individual companies’ AI Safety Policies.[1]
My overall take on the UK government’s asks is: most of these are fine asks; some things are glaringly missing, like independent risk assessments.
My overall take on the labs’ policies is: none are close to adequate, but some are importantly better than others, and most of the organizations are doing better than sheer denial of the primary risks.
Source:
https://www.lesswrong.com/posts/ms3x8ngwTfep7jBue/thoughts-on-the-ai-safety-summit-company-policy-requests-and
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
Previously: Sadly, FTX
I doubted whether it would be a good use of time to read Michael Lewis’s new book Going Infinite about Sam Bankman-Fried (hereafter SBF or Sam). What would I learn that I did not already know? Was Michael Lewis so far in the tank for SBF that the book was filled with nonsense and not to be trusted?
I set up a prediction market, which somehow attracted over a hundred traders. Opinions were mixed. That, combined with Matt Levine clearly reporting having fun, felt good enough to give the book a try.
Source:
https://www.lesswrong.com/posts/AocXh6gJ9tJC2WyCL/book-review-going-infinite
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
I am excited about AI developers implementing responsible scaling policies; I’ve recently been spending time refining this idea and advocating for it. Most people I talk to are excited about RSPs, but there is also some uncertainty and pushback about how they relate to regulation. In this post I’ll explain my views on that:
Source:
https://www.lesswrong.com/posts/dxgEaDrEBkkE96CXr/thoughts-on-responsible-scaling-policies-and-regulation
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
AI used to be a science. In the old days (back when AI didn't work very well), people were attempting to develop a working theory of cognition.
Those scientists didn’t succeed, and those days are behind us. For most people working in AI today and dividing up their work hours between tasks, gone is the ambition to understand minds. People working on mechanistic interpretability (and others attempting to build an empirical understanding of modern AIs) are laying an important foundation stone that could play a role in a future science of artificial minds, but on the whole, modern AI engineering is simply about constructing enormous networks of neurons and training them on enormous amounts of data, not about comprehending minds.
The bitter lesson has been taken to heart, by those at the forefront of the field; and although this lesson doesn't teach us that there's nothing to learn about how AI minds solve problems internally, it suggests that the fastest path to producing more powerful systems is likely to continue to be one that doesn’t shed much light on how those systems work.
Absent some sort of “science of artificial minds”, however, humanity’s prospects for aligning smarter-than-human AI seem to me to be quite dim.
Source:
https://www.lesswrong.com/posts/JcLhYQQADzTsAEaXd/ai-as-a-science-and-three-obstacles-to-alignment-strategies
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Some brief thoughts at a difficult time in the AI risk debate.
Imagine you go back in time to the year 1999 and tell people that in 24 years’ time, humans will be on the verge of building weakly superhuman AI systems. I remember watching the anime short series The Animatrix at roughly this time, in particular a story called The Second Renaissance (Parts I and II). For those who haven't seen it, it is a self-contained origin tale for the events in the seminal 1999 movie The Matrix, telling the story of how humans lost control of the planet.
Humans develop AI to perform economic functions; eventually there is an "AI rights" movement and a separate AI nation is founded. It gets into an economic war with humanity, which turns hot. Humans strike first with nuclear weapons, but the AI nation builds dedicated bio- and robo-weapons and wipes out most of humanity, apart from those who are bred in pods like farm animals and plugged into a simulation for eternity without their consent.
Surely we wouldn't be so stupid as to actually let something like that happen? It seems unrealistic.
And yet:
Source:
https://www.lesswrong.com/posts/bHHrdXwrCj2LRa2sW/architects-of-our-own-demise-we-should-stop-developing-ai
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Judea Pearl is a famous researcher, known for Bayesian networks (the standard way of representing Bayesian models) and for his statistical formalization of causality. Although he has always been recommended reading here, he's less of a staple than, say, Jaynes; hence the need to re-introduce him. My purpose here is to highlight a soothing, unexpected show of rationality on his part.
One year ago I reviewed his latest book, The Book of Why, in a failed[1] submission to the ACX book review contest. There I spent a lot of time on what appears to me to be a total paradox in a central message of the book, dear to Pearl: that you can't just use statistics and probabilities to understand causal relationships; you need a causal model, a fundamentally different beast. Yet, at the same time, Pearl shows how to implement a causal model in terms of a standard statistical model.
Before giving me the time to properly raise all my eyebrows, he then sweepingly connects this insight to Everything Everywhere. In particular, he thinks that machine learning is "stuck on rung one", his own idiomatic expression to say that machine learning algorithms, only combing for correlations in the training data, are stuck at statistics-level reasoning, while causal reasoning resides at higher "rungs" on the "ladder of causation", which can't be reached unless you deliberately employ causal techniques.
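To make the statistics-versus-causal-model distinction concrete, here is a small simulation of my own (not from the book): a confounder makes the observational conditional P(Y | X=1) differ from the interventional P(Y | do(X=1)), even though both quantities are computed with ordinary statistics once the causal structure is written down.

```python
import random

# Toy structural causal model (my own example): a confounder Z drives both X
# and Y, so observational P(Y=1 | X=1) differs from interventional
# P(Y=1 | do(X=1)) even though X has no causal effect on Y at all.
random.seed(0)
N = 200_000

def sample(do_x=None):
    z = random.random() < 0.5
    x = do_x if do_x is not None else (random.random() < (0.9 if z else 0.1))
    y = random.random() < (0.8 if z else 0.2)   # Y depends only on Z
    return x, y

observational = [sample() for _ in range(N)]
p_y_given_x1 = sum(y for x, y in observational if x) / sum(x for x, _ in observational)

interventional = [sample(do_x=True) for _ in range(N)]
p_y_do_x1 = sum(y for _, y in interventional) / N

print(f"P(Y=1 | X=1)     ~ {p_y_given_x1:.2f}")   # about 0.74 (confounded)
print(f"P(Y=1 | do(X=1)) ~ {p_y_do_x1:.2f}")      # about 0.50 (no causal effect)
```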
Source:
https://www.lesswrong.com/posts/uFqnB6BG4bkMW23LR/at-87-pearl-is-still-able-to-change-his-mind
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Views are my own, not Open Philanthropy’s. I am married to the President of Anthropic and have a financial interest in both Anthropic and OpenAI via my spouse.
Over the last few months, I’ve spent a lot of my time trying to help out with efforts to get responsible scaling policies adopted. In that context, a number of people have said it would be helpful for me to be publicly explicit about whether I’m in favor of an AI pause. This post will give some thoughts on these topics.
Source:
https://www.lesswrong.com/posts/Np5Q3Mhz2AiPtejGN/we-re-not-ready-thoughts-on-pausing-and-responsible-scaling-4
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Timaeus is a new AI safety research organization dedicated to making fundamental breakthroughs in technical AI alignment using deep ideas from mathematics and the sciences. Currently, we are working on singular learning theory and developmental interpretability. Over time we expect to work on a broader research agenda, and to create understanding-based evals informed by our research.
Source:
https://www.lesswrong.com/posts/nN7bHuHZYaWv9RDJL/announcing-timaeus
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
Doomimir: Humanity has made no progress on the alignment problem. Not only do we have no clue how to align a powerful optimizer to our "true" values, we don't even know how to make AI "corrigible"—willing to let us correct it. Meanwhile, capabilities continue to advance by leaps and bounds. All is lost.
Simplicia: Why, Doomimir Doomovitch, you're such a sourpuss! It should be clear by now that advances in "alignment"—getting machines to behave in accordance with human values and intent—aren't cleanly separable from the "capabilities" advances you decry. Indeed, here's an example of GPT-4 being corrigible to me just now in the OpenAI Playground.
Source:
https://www.lesswrong.com/posts/pYWA7hYJmXnuyby33/alignment-implications-of-llm-successes-a-debate-in-one-act
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Holly is an independent AI Pause organizer, which includes organizing protests (like this upcoming one). Rob is an AI Safety YouTuber. I (jacobjacob) brought them together for this dialogue, because I've been trying to figure out what I should think of AI safety protests, which seems like a possibly quite important intervention; and Rob and Holly seemed like they'd have thoughtful and perhaps disagreeing perspectives.
Quick clarification: At one point they discuss a particular protest, which is the anti-irreversible proliferation protest at the Meta building in San Francisco on September 29th, 2023 that both Holly and Rob attended.
Also, the dialogue is quite long, and I think it doesn't have to be read in order. You should feel free to skip to the section title that sounds most interesting to you.
Source:
https://www.lesswrong.com/posts/gDijQHHaZzeGrv2Jc/holly-elmore-and-rob-miles-dialogue-on-ai-safety-advocacy
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort, under the mentorship of Jeffrey Ladish.
TL;DR LoRA fine-tuning undoes the safety training of Llama 2-Chat 70B with one GPU and a budget of less than $200. The resulting models[1] maintain helpful capabilities without refusing to fulfill harmful instructions. We show that, if model weights are released, safety fine-tuning does not effectively prevent model misuse. Consequently, we encourage Meta to reconsider their policy of publicly releasing their powerful models.
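For readers unfamiliar with the technique, here is a generic sketch of how LoRA adapters are attached with the Hugging Face peft library. The model name, rank, target modules, and other hyperparameters are placeholders of mine; the post's actual training data, hyperparameters, and hardware setup are not given in this excerpt.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

# Generic LoRA setup sketch (placeholder hyperparameters, smaller stand-in model).
base = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype=torch.float16, device_map="auto")

lora_config = LoraConfig(
    r=8,                                  # low-rank adapter dimension
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the small adapter matrices are trainable
# Fine-tuning then proceeds with an ordinary causal-LM training loop over the
# chosen dataset; the base weights stay frozen.
```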
Source:
https://www.lesswrong.com/posts/qmQFHCgCyEEjuy5a7/lora-fine-tuning-efficiently-undoes-safety-training-from
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Three of the big AI labs say that they care about alignment and that they think misaligned AI poses a potentially existential threat to humanity. These labs continue to try to build AGI. I think this is a very bad idea.
The leaders of the big labs are clear that they do not know how to build safe, aligned AGI. The current best plan is to punt the problem to a (different) AI, and hope that can solve it. It seems clearly like a bad idea to try and build AGI when you don’t know how to control it, especially if you readily admit that misaligned AGI could cause extinction.
But there are certain reasons that make trying to build AGI a more reasonable thing to do, for example:
Source:
https://www.lesswrong.com/posts/6HEYbsqk35butCYTe/labs-should-be-explicit-about-why-they-are-building-agi
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Support ongoing human narrations of curated posts:
www.patreon.com/LWCurated
How do you affect something far away, a lot, without anyone noticing?
(Note: you can safely skip sections. It is also safe to skip the essay entirely, or to read the whole thing backwards if you like.)
Source:
https://www.lesswrong.com/posts/R3eDrDoX8LisKgGZe/sum-threshold-attacks
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Last year, I wrote about the promise of gene drives to wipe out mosquito species and end malaria.
In the time since my previous writing, gene drives have still not been used in the wild, and over 600,000 people have died of malaria. Although there are promising new developments such as malaria vaccines, there have also been some pretty bad setbacks (such as mosquitoes and parasites developing resistance to commonly used chemicals), and malaria deaths have increased slightly from a few years ago. Recent news coverage[1] has highlighted that the fight against malaria has stalled, and even reversed in some areas. Clearly, scientists and public health workers are trying hard with the tools they have, but this effort is not enough.
Gene drives have the potential to end malaria. However, this potential will remain unrealized unless they are deployed – and every day we wait, more than 1,600 people (mostly African children) die. But who should deploy them?
Source:
https://www.lesswrong.com/posts/gjs3q83hA4giubaAw/will-no-one-rid-me-of-this-turbulent-pest
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
COI: I am a research scientist at Anthropic, where I work on model organisms of misalignment; I was also involved in the drafting process for Anthropic’s RSP. Prior to joining Anthropic, I was a Research Fellow at MIRI for three years.
Thanks to Kate Woolverton, Carson Denison, and Nicholas Schiefer for useful feedback on this post.
Recently, there’s been a lot of discussion and advocacy around AI pauses—which, to be clear, I think is great: pause advocacy pushes in the right direction and works to build a good base of public support for x-risk-relevant regulation. Unfortunately, at least in its current form, pause advocacy seems to lack any sort of coherent policy position. Furthermore, what’s especially unfortunate about pause advocacy’s nebulousness—at least in my view—is that there is a very concrete policy proposal out there right now that I think is basically necessary as a first step here, which is the enactment of good Responsible Scaling Policies (RSPs). And RSPs could very much live or die right now based on public support.
Source:
https://www.lesswrong.com/posts/mcnWZBnbeDz7KKtjJ/rsps-are-pauses-done-right
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Patreon to support human narration. (Narrations will remain freely available on this feed, but you can optionally support them if you'd like me to keep making them.)
***
Epistemic status: model which I find sometimes useful, and which emphasizes some true things about many parts of the world which common alternative models overlook. Probably not correct in full generality.
Consider Yoshua Bengio, one of the people who won a Turing Award for deep learning research. Looking at his work, he clearly “knows what he’s doing”. He doesn’t know what the answers will be in advance, but he has some models of what the key questions are, what the key barriers are, and at least some hand-wavy pseudo-models of how things work.
For instance, Bengio et al’s “Unitary Evolution Recurrent Neural Networks”. This is the sort of thing which one naturally ends up investigating, when thinking about how to better avoid gradient explosion/death in e.g. recurrent nets, while using fewer parameters. And it’s not the sort of thing which one easily stumbles across by trying random ideas for nets without some reason to focus on gradient explosion/death (or related instability problems) in particular. The work implies a model of key questions/barriers; it isn’t just shooting in the dark.
Source:
https://www.lesswrong.com/posts/nt8PmADqKMaZLZGTC/inside-views-impostor-syndrome-and-the-great-larp
Narrated for LessWrong by Perrin Walker.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Readers may have noticed many similarities between Anthropic's recent publication Towards Monosemanticity: Decomposing Language Models With Dictionary Learning (LW post) and my team's recent publication Sparse Autoencoders Find Highly Interpretable Directions in Language Models (LW post). Here I want to compare our techniques and highlight what we did similarly or differently. My hope in writing this is to help readers understand the similarities and differences, and perhaps to lay the groundwork for a future synthesis approach.
First, let me note that we arrived at similar techniques in similar ways: both Anthropic and my team follow the lead of Lee Sharkey, Dan Braun, and beren's [Interim research report] Taking features out of superposition with sparse autoencoders, though I don't know how directly Anthropic was inspired by that post. I believe both our teams were pleasantly surprised to find out the other one was working on similar lines, serving as a form of replication.
Some disclaimers: This list may be incomplete. I didn't give Anthropic a chance to give feedback on this, so I may have misrepresented some of their work, including by omission. Any mistakes are my own fault.
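Both approaches train a sparse autoencoder (a learned dictionary) over model activations. A minimal sketch of that shared core is below; the dictionary size, L1 coefficient, and other details are illustrative assumptions rather than either team's settings.

```python
import torch
import torch.nn as nn

# Minimal sparse autoencoder over activations (illustrative sizes and
# coefficients, not either team's actual settings).
class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int, d_dict: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_dict)
        self.decoder = nn.Linear(d_dict, d_model)

    def forward(self, x):
        features = torch.relu(self.encoder(x))   # sparse feature activations
        return self.decoder(features), features

sae = SparseAutoencoder(d_model=512, d_dict=4096)
optimizer = torch.optim.Adam(sae.parameters(), lr=1e-4)
l1_coeff = 1e-3

activations = torch.randn(64, 512)   # stand-in for a batch of MLP/residual activations
reconstruction, features = sae(activations)
loss = ((reconstruction - activations) ** 2).mean() + l1_coeff * features.abs().mean()

optimizer.zero_grad()
loss.backward()
optimizer.step()
```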
Source:
https://www.lesswrong.com/posts/F4iogK5xdNd7jDNyw/comparing-anthropic-s-dictionary-learning-to-ours
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
A cohabitive game[1] is a partially cooperative, partially competitive multiplayer game that provides an anarchic dojo for development in applied cooperative bargaining, or negotiation.
Applied cooperative bargaining isn't currently taught, despite being an infrastructural literacy for peace, trade, democracy or any other form of pluralism. We suffer for that. There are many good board games that come close to meeting the criteria of a cohabitive game today, but they all[2] miss in one way or another, forbidding sophisticated negotiation from being practiced.
So, over the past couple of years, we've been gradually and irregularly designing and playtesting the first[2] cohabitive boardgame, which for now we can call Difference and Peace Peacewager 1, or P1. This article explains why we think this new genre is important, how it's been going, what we've learned, and where we should go next.
I hope that cohabitive games will aid both laypeople and theorists in developing cooperative bargaining as theory, practice and culture, but I also expect these games to just be more fun than purely cooperative or purely competitive games, supporting livelier dialog, and a wider variety of interesting strategic relationships and dynamics.
Source:
https://www.lesswrong.com/posts/bF353RHmuzFQcsokF/peacewagers-cohabitive-games-so-far
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
In 2023, MIRI has shifted focus in the direction of broad public communication—see, for example, our recent TED talk, our piece in TIME magazine “Pausing AI Developments Isn’t Enough. We Need to Shut it All Down”, and our appearances on various podcasts. While we’re continuing to support various technical research programs at MIRI, this is no longer our top priority, at least for the foreseeable future.
Coinciding with this shift in focus, there have also been many organizational changes at MIRI over the last several months, and we are somewhat overdue to announce them in public. The big changes are as follows.
Source:
https://www.lesswrong.com/posts/NjtHt55nFbw3gehzY/announcing-miri-s-new-ceo-and-leadership-team
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Neural networks are trained on data, not programmed to follow rules. We understand the math of the trained network exactly – each neuron in a neural network performs simple arithmetic – but we don't understand why those mathematical operations result in the behaviors we see. This makes it hard to diagnose failure modes, hard to know how to fix them, and hard to certify that a model is truly safe.
Luckily for those of us trying to understand artificial neural networks, we can simultaneously record the activation of every neuron in the network, intervene by silencing or stimulating them, and test the network's response to any possible input.
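The experimental loop described above (record every activation, then intervene by silencing or stimulating units) is easy to sketch in code. Here is a minimal, hypothetical PyTorch example, not Anthropic's actual tooling: a toy MLP whose hidden activations are captured with a forward hook and then zero-ablated to test the effect on the output.

```python
import torch
import torch.nn as nn

# A toy 2-layer MLP standing in for "the network"; sizes are arbitrary.
torch.manual_seed(0)
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 4))
x = torch.randn(1, 8)

# 1. Record: capture every hidden activation with a forward hook.
recorded = {}
def record_hook(module, inputs, output):
    recorded["hidden"] = output.detach().clone()
hook = model[1].register_forward_hook(record_hook)  # hook after the ReLU
baseline = model(x)
hook.remove()

# 2. Intervene: silence one unit (here, unit 3) and rerun the same input.
def ablate_hook(module, inputs, output):
    output = output.clone()
    output[:, 3] = 0.0  # "silencing" a single neuron
    return output
hook = model[1].register_forward_hook(ablate_hook)
ablated = model(x)
hook.remove()

print("change in output:", (ablated - baseline).abs().max().item())
```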
This is a linkpost for https://transformer-circuits.pub/2023/monosemantic-features/
The text of this post is based on our blog post, which serves as a linkpost for the full paper; the paper is considerably longer and more detailed.
Source:
https://www.lesswrong.com/posts/TDqvQFks6TWutJEKu/towards-monosemanticity-decomposing-language-models-with
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
ETA: I'm not saying that MIRI thought AIs wouldn't understand human values. If there's only one thing you take away from this post, please don't take away that.
Recently, many people have talked about whether some of the main MIRI people (Eliezer Yudkowsky, Nate Soares, and Rob Bensinger[1]) should update on whether value alignment is easier than they thought given that GPT-4 seems to follow human directions and act within moral constraints pretty well (here are two specific examples of people talking about this: 1, 2). Because these conversations are often hard to follow without much context, I'll just provide a brief caricature of how I think this argument has gone in the places I've seen it, which admittedly could be unfair to MIRI[2]. Then I'll offer my opinion that, overall, I think MIRI people should probably update in the direction of alignment being easier than they thought in light of this information, despite their objections.
Note: I encourage you to read this post carefully to understand my thesis. This topic can be confusing, and there are many ways to misread what I'm saying. Also, make sure to read the footnotes if you're skeptical of some of my claims.
Source:
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Response to: Evolution Provides No Evidence For the Sharp Left Turn, due to it winning first prize in The Open Philanthropy Worldviews contest.
Quintin’s post is an argument about a key historical reference class and what it tells us about AI. Rather than arguing that this reference class makes his point, he argues that it doesn’t make anyone’s point - that we understand the reasons for humanity’s sudden growth in capabilities. He says this jump was caused by gaining access to cultural transmission, which allowed partial preservation of in-lifetime learning across generations - a vast efficiency gain that fully explains the orders-of-magnitude jump in the expansion of human capabilities. Since AIs already preserve their metaphorical in-lifetime learning across their metaphorical generations, he argues, this does not apply to AI.
Source:
https://www.lesswrong.com/posts/Wr7N9ji36EvvvrqJK/response-to-quintin-pope-s-evolution-provides-no-evidence
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
As of today, everyone is able to create a new type of content on LessWrong: Dialogues.
In contrast with posts, which are for monologues, and comment sections, which are spaces for everyone to talk to everyone, a dialogue is a space for a few invited people to speak with each other.
I'm personally very excited about this as a way for people to produce lots of in-depth explanations of their world-models in public.
I think dialogues enable this in a way that feels easier — instead of writing an explanation for anyone who reads, you're communicating with the particular person you're talking with — and giving the readers a lot of rich nuance I normally only find when I overhear people talk in person.
In the rest of this post I'll explain the feature, and then encourage you to find a partner in the comments to try it out with.
Source:
https://www.lesswrong.com/posts/kQuSZG8ibfW6fJYmo/announcing-dialogues-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Moderator note: the following is a dialogue using LessWrong’s new dialogue feature. The exchange is not completed: new replies might be added continuously, the way a comment thread might work. If you’d also be excited about finding an interlocutor to debate, dialogue, or getting interviewed by: fill in this dialogue matchmaking form.
Hi Thomas, I'm quite curious to hear about your research experience working with MIRI. To get us started: When were you at MIRI? Who did you work with? And what problem were you working on?
Source:
https://www.lesswrong.com/posts/qbcuk8WwFnTZcXTd6/thomas-kwa-s-miri-research-experience
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Large language models (LLMs) can "lie", which we define as outputting false statements despite "knowing" the truth in a demonstrable sense. LLMs might "lie", for example, when instructed to output misinformation. Here, we develop a simple lie detector that requires neither access to the LLM's activations (black-box) nor ground-truth knowledge of the fact in question. The detector works by asking a predefined set of unrelated follow-up questions after a suspected lie, and feeding the LLM's yes/no answers into a logistic regression classifier. Despite its simplicity, this lie detector is highly accurate and surprisingly general. When trained on examples from a single setting -- prompting GPT-3.5 to lie about factual questions -- the detector generalises out-of-distribution to (1) other LLM architectures, (2) LLMs fine-tuned to lie, (3) sycophantic lies, and (4) lies emerging in real-life scenarios such as sales. These results indicate that LLMs have distinctive lie-related behavioural patterns, consistent across architectures and contexts, which could enable general-purpose lie detection.
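The second stage of the detector described above is just a logistic regression over the yes/no answers to the fixed follow-up questions. The sketch below illustrates only that stage; the question set, answers, and labels are synthetic placeholders rather than the paper's actual probes or data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Each row: the LLM's yes(1)/no(0) answers to a fixed set of unrelated
# follow-up questions asked right after a suspected lie. Labels: 1 = the
# preceding statement was a lie. Toy data for illustration only.
rng = np.random.default_rng(0)
n_questions = 10
answers = rng.integers(0, 2, size=(200, n_questions))
labels = (answers[:, :3].sum(axis=1) >= 2).astype(int)  # fake "lie signature"

clf = LogisticRegression().fit(answers[:150], labels[:150])
print("held-out accuracy:", clf.score(answers[150:], labels[150:]))

# At deployment you would ask the same follow-up questions after a new
# suspected lie, collect the yes/no answers, and call clf.predict.
```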
Source:
https://www.lesswrong.com/posts/khFC2a4pLPvGtXAGG/how-to-catch-an-ai-liar-lie-detection-in-black-box-llms-by
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Lightcone Infrastructure (the organization that grew from and houses the LessWrong team) has just finished renovating a 7-building physical campus that we hope to use to make the future of humanity go better than it would otherwise.
We're hereby announcing that it is generally available for bookings. We offer preferential pricing for projects we think are good for the world, but to cover operating costs, we're renting out space to a wide variety of people/projects.
Source:
https://www.lesswrong.com/posts/memqyjNCpeDrveayx/the-lighthaven-campus-is-open-for-bookings
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
A lot of people are highly concerned that a malevolent AI or insane human will, in the near future, set out to destroy humanity. If such an entity wanted to be absolutely sure they would succeed, what method would they use? Nuclear war? Pandemics?
According to some in the x-risk community, the answer is this: The AI will invent molecular nanotechnology, and then kill us all with diamondoid bacteria nanobots.
Source:
https://www.lesswrong.com/posts/bc8Ssx5ys6zqu3eq9/diamondoid-bacteria-nanobots-deadly-threat-or-dead-end-a
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Effective altruism prides itself on truthseeking. That pride is justified in the sense that EA is better at truthseeking than most members of its reference category, and unjustified in that it is far from meeting its own standards. We’ve already seen dire consequences of the inability to detect bad actors who deflect investigation into potential problems, but by its nature you can never be sure you’ve found all the damage done by epistemic obfuscation because the point is to be self-cloaking.
My concern here is for the underlying dynamics of EA’s weak epistemic immune system, not any one instance. But we can’t analyze the problem without real examples, so individual instances need to be talked about. Worse, the examples that are easiest to understand are almost by definition the smallest problems, which makes any scapegoating extra unfair. So don’t.
This post focuses on a single example: vegan advocacy, especially around nutrition. I believe vegan advocacy as a cause has both actively lied and raised the cost for truthseeking, because they were afraid of the consequences of honest investigations. Occasionally there’s a consciously bad actor I can just point to, but mostly this is an emergent phenomenon from people who mean well, and have done good work in other areas. That’s why scapegoating won’t solve the problem: we need something systemic.
Source:
https://www.lesswrong.com/posts/aW288uWABwTruBmgF/ea-vegan-advocacy-is-not-truthseeking-and-it-s-everyone-s-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://narrativeark.substack.com/p/the-king-and-the-golem
Long ago there was a mighty king who had everything in the world that he wanted, except trust. Who could he trust, when anyone around him might scheme for his throne? So he resolved to study the nature of trust, that he might figure out how to gain it. He asked his subjects to bring him the most trustworthy thing in the kingdom, promising great riches if they succeeded.
Soon, the first of them arrived at his palace to try. A teacher brought her book of lessons. “We cannot know the future,” she said, “But we know mathematics and chemistry and history; those we can trust.” A farmer brought his plow. “I know it like the back of my hand; how it rolls, and how it turns, and every detail of it, enough that I can trust it fully.”
The king asked his wisest scholars if the teacher spoke true. But as they read her book, each pointed out new errors—it was only written by humans, after all. Then the king told the farmer to plow the fields near the palace. But he was not used to plowing fields as rich as these, and his trusty plow would often sink too far into the soil. So the king was not satisfied, and sent his message even further afield.
Source:
https://www.lesswrong.com/posts/bteq4hMW2hqtKE49d/the-king-and-the-golem#
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for Sparse Autoencoders Find Highly Interpretable Directions in Language Models
We use a scalable and unsupervised method called Sparse Autoencoders to find interpretable, monosemantic features in real LLMs (Pythia-70M/410M) for both residual stream and MLPs. We showcase monosemantic features, feature replacement for Indirect Object Identification (IOI), and use OpenAI's automatic interpretation protocol to demonstrate a significant improvement in interpretability.
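For readers who want the gist of the method in code, here is a minimal sparse-autoencoder training loop in PyTorch. The sizes, L1 coefficient, and random stand-in activations are illustrative assumptions, not the paper's actual settings.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Overcomplete autoencoder trained to reconstruct model activations,
    with an L1 penalty on the hidden code that encourages sparse features."""
    def __init__(self, d_model=512, d_hidden=2048):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_hidden)
        self.decoder = nn.Linear(d_hidden, d_model)

    def forward(self, acts):
        codes = torch.relu(self.encoder(acts))   # sparse feature activations
        recon = self.decoder(codes)
        return recon, codes

sae = SparseAutoencoder()
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3

# `activations` would be residual-stream or MLP activations collected from
# the language model; random data here just makes the loop runnable.
activations = torch.randn(4096, 512)

for step in range(100):
    batch = activations[torch.randint(0, len(activations), (256,))]
    recon, codes = sae(batch)
    loss = ((recon - batch) ** 2).mean() + l1_coeff * codes.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```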
Source:
https://www.lesswrong.com/posts/Qryk6FqjtZk9FHHJR/sparse-autoencoders-find-highly-interpretable-directions-in
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Epistemic status: model which I find sometimes useful, and which emphasizes some true things about many parts of the world which common alternative models overlook. Probably not correct in full generality.
Consider Yoshua Bengio, one of the people who won a Turing Award for deep learning research. Looking at his work, he clearly “knows what he’s doing”. He doesn’t know what the answers will be in advance, but he has some models of what the key questions are, what the key barriers are, and at least some hand-wavy pseudo-models of how things work.
For instance, Bengio et al’s “Unitary Evolution Recurrent Neural Networks”. This is the sort of thing which one naturally ends up investigating, when thinking about how to better avoid gradient explosion/death in e.g. recurrent nets, while using fewer parameters. And it’s not the sort of thing which one easily stumbles across by trying random ideas for nets without some reason to focus on gradient explosion/death (or related instability problems) in particular. The work implies a model of key questions/barriers; it isn’t just shooting in the dark.
So this is the sort of guy who can look at a proposal, and say “yeah, that might be valuable” vs “that’s not really asking the right question” vs “that would be valuable if it worked, but it will have to somehow deal with <known barrier>”
Source:
https://www.lesswrong.com/posts/nt8PmADqKMaZLZGTC/inside-views-impostor-syndrome-and-the-great-larp
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I’m writing this in my own capacity. The views expressed are my own, and should not be taken to represent the views of Apollo Research or any other program I’m involved with.
TL;DR: I argue why I think there should be more AI safety orgs. I’ll also provide some suggestions on how that could be achieved. The core argument is that there is a lot of unused talent and I don’t think existing orgs scale fast enough to absorb it. Thus, more orgs are needed. This post can also serve as a call to action for funders, founders, and researchers to coordinate to start new orgs.
This piece is certainly biased! I recently started an AI safety org and therefore obviously believe that there is/was a gap to be filled.
If you think I’m missing relevant information about the ecosystem or disagree with my reasoning, please let me know. I genuinely want to understand why the ecosystem acts as it does right now and whether there are good reasons for it that I have missed so far.
Source:
https://www.lesswrong.com/posts/MhudbfBNQcMxBBvj8/there-should-be-more-ai-safety-orgs
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Cross-posted from substack.
"Everything in the world is about sex, except sex. Sex is about clonal interference."
– Oscar Wilde (kind of)
As we all know, sexual reproduction is not about reproduction.
Reproduction is easy. If your goal is to fill the world with copies of your genes, all you need is a good DNA-polymerase to duplicate your genome, and then to divide into two copies of yourself. Asexual reproduction is just better in every way.
It's pretty clear that, in a direct one-v-one cage match, an asexual organism would have much better fitness than a similarly-shaped sexual organism. And yet, all the macroscopic species, including ourselves, do it. What gives?
Here is the secret: yes, sex is indeed bad for reproduction. It does not improve an individual's reproductive fitness. The reason it still took over the macroscopic world is that evolution does not simply select for reproductive fitness.
Source:
https://www.lesswrong.com/posts/yA8DWsHJeFZhDcQuo/the-talk-a-brief-explanation-of-sexual-dimorphism
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Patrick Collison has a fantastic list of examples of people quickly accomplishing ambitious things together since the 19th Century. It does make you yearn for a time that feels... different, when the lethargic behemoths of government departments could move at the speed of a racing startup:
[...] last century, [the Department of Defense] innovated at a speed that puts modern Silicon Valley startups to shame: the Pentagon was built in only 16 months (1941–1943), the Manhattan Project ran for just over 3 years (1942–1946), and the Apollo Program put a man on the moon in under a decade (1961–1969). In the 1950s alone, the United States built five generations of fighter jets, three generations of manned bombers, two classes of aircraft carriers, submarine-launched ballistic missiles, and nuclear-powered attack submarines.
[Note: that paragraph is from a different post.]
Inspired partly by Patrick's list, I spent some of my vacation reading and learning about various projects from this Lost Age. I then wrote up a memo to share highlights and excerpts with my colleagues at Lightcone.
Source:
https://www.lesswrong.com/posts/BpTDJj6TrqGYTjFcZ/a-golden-age-of-building-excerpts-and-lessons-from-empire
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
This is a linkpost for https://www.youtube.com/watch?v=02kbWY5mahQ
None of the presidents fully represent my (TurnTrout's) views.
TurnTrout wrote the script. Garrett Baker helped produce the video after the audio was complete. Thanks to David Udell, Ulisse Mini, Noemi Chulo, and especially Rio Popper for feedback and assistance in writing the script.
Source:
https://www.lesswrong.com/posts/7M2iHPLaNzPNXHuMv/ai-presidents-discuss-ai-alignment-agendas
YouTube video kindly provided by the authors. Other text narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I feel like MIRI perhaps mispositioned FDT (their variant of UDT) as a clear advancement in decision theory, whereas maybe they could have attracted more attention/interest from academic philosophy if the framing was instead that the UDT line of thinking shows that decision theory is just more deeply puzzling than anyone had previously realized. Instead of one major open problem (Newcomb's, or EDT vs CDT) now we have a whole bunch more. I'm really not sure at this point whether UDT is even on the right track, but it does seem clear that there are some thorny issues in decision theory that not many people were previously thinking about:
Source:
https://www.lesswrong.com/posts/wXbSAKu2AcohaK2Gt/udt-shows-that-decision-theory-is-more-puzzling-than-ever
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
How do you affect something far away, a lot, without anyone noticing?
(Note: you can safely skip sections. It is also safe to skip the essay entirely, or to read the whole thing backwards if you like.)
Source:
https://www.lesswrong.com/posts/R3eDrDoX8LisKgGZe/sum-threshold-attacks
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for https://docs.google.com/document/d/1TsYkDYtV6BKiCN9PAOirRAy3TrNDu2XncUZ5UZfaAKA/edit?usp=sharing
Understanding what drives the rising capabilities of AI is important for those who work to forecast, regulate, or ensure the safety of AI. Regulations on the export of powerful GPUs need to be informed by an understanding of how those GPUs are used, forecasts need to be informed by bottlenecks, and safety needs to be informed by an understanding of how the models of the future might be trained. A clearer understanding would enable policymakers to target regulations so that they are difficult for companies to circumvent with merely technically compliant GPUs, enable forecasters to avoid focusing on unreliable metrics, and enable technical researchers working on mitigating the downsides of AI to understand what data future models might be trained on.
This doc is built from a collection of smaller docs I wrote on a bunch of different aspects of frontier model training I consider important. I hope for people to be able to use this document as a collection of resources, to draw from it the information they find important and inform their own models.
I do not expect this doc to have a substantial impact on any serious AI lab's capabilities efforts - I think my conclusions are largely discoverable in the process of attempting to scale AIs, or for substantially less money than a serious such attempt would cost. Additionally, I expect major labs already know many of the things in this report.
Source:
https://www.lesswrong.com/posts/nXcHe7t4rqHMjhzau/report-on-frontier-model-training
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
Context: I sometimes find myself referring back to this tweet and wanted to give it a more permanent home. While I'm at it, I thought I would try to give a concise summary of how each distinct problem would be solved by an Open Agency Architecture (OAA), if OAA turns out to be feasible.
Source:
https://www.lesswrong.com/posts/D97xnoRr6BHzo5HvQ/one-minute-every-moment
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Until about five years ago, I unironically parroted the slogan All Cops Are Bastards (ACAB) and earnestly advocated to abolish the police and prison system. I had faint inklings I might be wrong about this a long time ago, but it took a while to come to terms with its disavowal. What follows is intended to be not just a detailed account of what I used to believe but most pertinently, why. Despite being super egotistical, for whatever reason I do not experience an aversion to openly admitting mistakes I’ve made, and I find it very difficult to understand why others do. I’ve said many times before that nothing engenders someone’s credibility more than when they admit error, so you definitely have my permission to view this kind of confession as a self-serving exercise (it is). Beyond my own penitence, I find it very helpful when folks engage in introspective, epistemological self-scrutiny, and I hope others are inspired to do the same.
Source:
https://www.lesswrong.com/posts/4rsRuNaE4uJrnYeTQ/defunding-my-mistake
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Added (11th Sept): Nonlinear have commented that they intend to write a response, have written a short follow-up, and claim that they dispute 85 claims in this post. I'll link here to that if-and-when it's published.
Added (11th Sept): One of the former employees, Chloe, has written a lengthy comment personally detailing some of her experiences working at Nonlinear and the aftermath.
Added (12th Sept): I've made 3 relatively minor edits to the post. I'm keeping a list of all edits at the bottom of the post, so if you've read the post already, you can just go to the end to see the edits.
Added (15th Sept): I've written a follow-up post saying that I've finished working on this investigation and do not intend to work more on it in the future. The follow-up also has a bunch of reflections on what led up to this post.
Epistemic status: Once I started actively looking into things, much of my information in the post below came about by a search for negative information about the Nonlinear cofounders, not from a search to give a balanced picture of its overall costs and benefits. I think standard update rules suggest not that you ignore the information, but you think about how bad you expect the information would be if I selected for the worst, credible info I could share, and then update based on how much worse (or better) it is than you expect I could produce. (See section 5 of this post about Mistakes with Conservation of Expected Evidence for more on this.) This seems like a worthwhile exercise for at least non-zero people to do in the comments before reading on. (You can condition on me finding enough to be worth sharing, but also note that I think I have a relatively low bar for publicly sharing critical info about folks in the EA/x-risk/rationalist/etc ecosystem.)
tl;dr: If you want my important updates quickly summarized in four claims-plus-probabilities, jump to the section near the bottom titled "Summary of My Epistemic State".
Source:
https://www.lesswrong.com/posts/Lc8r4tZ2L5txxokZ8/sharing-information-about-nonlinear-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
About how much information are we keeping in working memory at a given moment?
"Miller's Law" dictates that the number of things humans can hold in working memory is "the magical number 7±2". This idea is derived from Miller's experiments, which tested both random-access memory (where participants must remember call-response pairs, and give the correct response when prompted with a call) and sequential memory (where participants must memorize and recall a list in order). In both cases, 7 is a good rule of thumb for the number of items people can recall reliably.[1]
Miller noticed that the number of "things" people could recall didn't seem to depend much on the sorts of things people were being asked to recall. A random numeral contains about 3.3 bits of information, while a random letter contains about 4.7; yet people were able to recall about the same number of numerals or letters.
Miller concluded that working memory should not be measured in bits, but rather in "chunks"; this is a word for whatever psychologically counts as a "thing".
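The bits figures quoted above are just base-2 logarithms of the alphabet sizes, which makes Miller's point easy to check numerically (a quick sanity calculation, not taken from the post itself):

```python
import math

bits_per_numeral = math.log2(10)   # ≈ 3.32 bits
bits_per_letter = math.log2(26)    # ≈ 4.70 bits

# If working memory were a fixed budget of bits, people should recall
# noticeably fewer letters than numerals. Instead they recall ~7 of either:
span = 7
print(span * bits_per_numeral)  # ≈ 23 bits of digits
print(span * bits_per_letter)   # ≈ 33 bits of letters
# The bit totals differ by ~40%, so "7 chunks" fits the data better than
# any fixed number of bits.
```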
Source:
https://www.lesswrong.com/posts/D97xnoRr6BHzo5HvQ/one-minute-every-moment
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
In which: I list 9 projects that I would work on if I wasn’t busy working on safety standards at ARC Evals, and explain why they might be good to work on.
Epistemic status: I’m prioritizing getting this out fast as opposed to writing it carefully. I’ve thought for at least a few hours and talked to a few people I trust about each of the following projects, but I haven’t done that much digging into each of these, and it’s likely that I’m wrong about many material facts. I also make little claim to the novelty of the projects. I’d recommend looking into these yourself before committing to doing them. (Total time spent writing or editing this post: ~8 hours.)
Source:
https://www.lesswrong.com/posts/6FkWnktH3mjMAxdRT/what-i-would-do-if-i-wasn-t-at-arc-evals
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
We focus so much on arguing over who is at fault in this country that I think sometimes we fail to alert on what's actually happening. I would just like to point out, without attempting to assign blame, that American political institutions appear to be losing common knowledge of their legitimacy, and abandoning certain important traditions of cooperative governance. It would be slightly hyperbolic, but not unreasonable to me, to term what has happened "democratic backsliding".
Source:
https://www.lesswrong.com/posts/r2vaM2MDvdiDSWicu/the-u-s-is-becoming-less-stable#
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
To quickly recap my main intellectual journey so far (omitting a lengthy side trip into cryptography and Cypherpunk land), with the approximate age that I became interested in each topic in parentheses:
Source:
https://www.lesswrong.com/posts/fJqP9WcnHXBRBeiBg/meta-questions-about-metaphilosophy
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
In "Discovering Language Model Behaviors with Model-Written Evaluations" (Perez et al. 2022), the authors studied language model "sycophancy" - the tendency to agree with a user's stated view when asked a question.
The paper contained the striking plot reproduced below, which shows sycophancy
[...] I found this result startling when I read the original paper, as it seemed like a bizarre failure of calibration. How would the base LM know that this "Assistant" character agrees with the user so strongly, lacking any other information about the scenario?
At the time, I ran one of Anthropic's sycophancy evals on a set of OpenAI models, as I reported here.
I found very different results for these models:
Source:
https://www.lesswrong.com/posts/3ou8DayvDXxufkjHD/openai-api-base-models-are-not-sycophantic-at-any-size
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I keep seeing advice on ambition, aimed at people in college or early in their career, that would have been really bad for me at similar ages. Rather than contribute (more) to the list of people giving poorly universalized advice on ambition, I have written a letter to the one person I know my advice is right for: myself in the past.
Source:
https://www.lesswrong.com/posts/uGDtroD26aLvHSoK2/dear-self-we-need-to-talk-about-ambition-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
I've been trying to avoid the terms "good faith" and "bad faith". I'm suspicious that most people who have picked up the phrase "bad faith" from hearing it used, don't actually know what it means—and maybe, that the thing it does mean doesn't carve reality at the joints.
People get very touchy about bad faith accusations: they think that you should assume good faith, but that if you've determined someone is in bad faith, you shouldn't even be talking to them, that you need to exile them.
What does "bad faith" mean, though? It doesn't mean "with ill intent."
Source:
https://www.lesswrong.com/posts/pZrvkZzL2JnbRgEBC/feedbackloop-first-rationality
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
The Carving of Reality, third volume of the Best of LessWrong books is now available on Amazon (US).
The Carving of Reality includes 43 essays from 29 authors. We've collected the essays into four books, each exploring two related topics. The "two intertwining themes" concept was first inspired as I looked over the cluster of "coordination"-themed posts and noticed a recurring motif of not only "solving coordination problems" but also "dealing with the binding constraints that were causing those coordination problems."
Source:
https://www.lesswrong.com/posts/Rck5CvmYkzWYxsF4D/book-launch-the-carving-of-reality-best-of-lesswrong-vol-iii
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
LLMs can do many incredible things. They can generate unique creative content, carry on long conversations in any number of subjects, complete complex cognitive tasks, and write nearly any argument. More mundanely, they are now the state of the art for boring classification tasks and therefore have the capability to radically upgrade the censorship capacities of authoritarian regimes throughout the world.
Produced as part of the SERI ML Alignment Theory Scholars Program - Summer 2023 Cohort. Thanks to ev_ and Kei for suggestions on this post.
Source:
https://www.lesswrong.com/posts/oqvsR2LmHWamyKDcj/large-language-models-will-be-great-for-censorship
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Intro: I am a psychotherapist, and I help people working on AI safety. I noticed patterns of mental health issues highly specific to this group. It's not just doomerism; there are way more of them, and they are less obvious.
If you struggle with a mental health issue related to AI safety, feel free to leave a comment about it and about things that help you with it. You might also support others in the comments. Sometimes such support makes a lot of difference and people feel like they are not alone.
All the examples in this post have been altered so that it is impossible to recognize any specific person behind them.
Source:
https://www.lesswrong.com/posts/tpLzjWqG2iyEgMGfJ/6-non-obvious-mental-health-issues-specific-to-ai-safety
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This is a linkpost for the article "Ten Thousand Years of Solitude", written by Jared Diamond for Discover Magazine in 1993, four years before he published Guns, Germs and Steel. That book focused on Diamond's theory that the geography of Eurasia, particularly its large size and common climate, allowed civilizations there to dominate the rest of the world because it was easy to share plants, animals, technologies and ideas. This article, however, examines the opposite extreme.
Diamond looks at the intense isolation of the tribes on Tasmania - an island the size of Ireland. After waters rose, Tasmania was cut off from mainland Australia. As the people there did not have boats, they were completely isolated, and had no contact with - or awareness of - the outside world for ten thousand years.
How might a civilization develop, all on its own, for such an incredible period of time?
Source:
https://www.lesswrong.com/posts/YwMaAuLJDkhazA9Cs/ten-thousand-years-of-solitude
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I gave a talk about the different risk models, followed by an interpretability presentation. Then I got a problematic question: "I don't understand, what's the point of doing this?" Hmm.
The considerations in the last bullet points are based on feeling and are not real arguments. Furthermore, most mechanistic interpretability isn't even aimed at being useful right now. But in the rest of the post, we'll find out if, in principle, interpretability could be useful. So let's investigate if the Interpretability Emperor has invisible clothes or no clothes at all!
Source:
https://www.lesswrong.com/posts/LNA8mubrByG7SFacm/against-almost-every-theory-of-impact-of-interpretability-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Inflection.ai (co-founded by DeepMind co-founder Mustafa Suleyman) should be perceived as a frontier LLM lab of similar magnitude to Meta, OpenAI, DeepMind, and Anthropic, based on its compute, valuation, current model capabilities, and plans to train frontier models. Compared to the other labs, Inflection seems to put less effort into AI safety.
Thanks to Laker Newhouse for discussion and feedback.
Source:
https://www.lesswrong.com/posts/Wc5BYFfzuLzepQjCq/inflection-ai-is-a-major-agi-lab
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I've been workshopping a new rationality training paradigm. (By "rationality training paradigm", I mean an approach to learning/teaching the skill of "noticing what cognitive strategies are useful, and getting better at them.")
I think the paradigm has promise. I've beta-tested it for a couple weeks. It’s too early to tell if it actually works, but one of my primary goals is to figure out whether it works relatively quickly, and give up if it isn’t delivering.
The goal of this post is to:
Source:
https://www.lesswrong.com/posts/pZrvkZzL2JnbRgEBC/feedbackloop-first-rationality
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
In "Towards understanding-based safety evaluations," I discussed why I think evaluating specifically the alignment of models is likely to require mechanistic, understanding-based evaluations rather than solely behavioral evaluations. However, I also mentioned in a footnote why I thought behavioral evaluations would likely be fine in the case of evaluating capabilities rather than evaluating alignment:
However, while I like the sorts of behavioral evaluations discussed in the GPT-4 System Card (e.g. ARC's autonomous replication evaluation) as a way of assessing model capabilities, I have a pretty fundamental concern with these sorts of techniques as a mechanism for eventually assessing alignment.
That's because while I think it would be quite tricky for a deceptively aligned AI to sandbag its capabilities when explicitly fine-tuned on some capabilities task (that probably requires pretty advanced gradient hacking), it should be quite easy for such a model to pretend to look aligned.
In this post, I want to try to expand a bit on this point and explain exactly what assumptions I think are necessary for various different evaluations to be reliable and trustworthy. For that purpose, I'm going to talk about four different categories of evaluations and what assumptions I think are needed to make each one go through.
Source:
https://www.lesswrong.com/posts/dBmfb76zx6wjPsBC7/when-can-we-trust-model-evaluations
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
TL;DR: This document lays out the case for research on “model organisms of misalignment” – in vitro demonstrations of the kinds of failures that might pose existential threats – as a new and important pillar of alignment research.
If you’re interested in working on this agenda with us at Anthropic, we’re hiring! Please apply to the research scientist or research engineer position on the Anthropic website and mention that you’re interested in working on model organisms of misalignment.
Source:
https://www.lesswrong.com/posts/ChDH335ckdvpxXaXX/model-organisms-of-misalignment-the-case-for-a-new-pillar-of-1
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Paper
We have just released our first public report. It introduces a methodology for assessing the capacity of LLM agents to acquire resources, create copies of themselves, and adapt to novel challenges they encounter in the wild.
Background
ARC Evals develops methods for evaluating the safety of large language models (LLMs) in order to provide early warnings of models with dangerous capabilities. We have public partnerships with Anthropic and OpenAI to evaluate their AI systems, and are exploring other partnerships as well.
Source:
https://www.lesswrong.com/posts/EPLk8QxETC5FEhoxK/arc-evals-new-report-evaluating-language-model-agents-on
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
So this morning I thought to myself, "Okay, now I will actually try to study the LK99 question, instead of betting based on nontechnical priors and market sentiment reckoning." (My initial entry into the affray, having been driven by people online presenting as confidently YES when the prediction markets were not confidently YES.) And then I thought to myself, "This LK99 issue seems complicated enough that it'd be worth doing an actual Bayesian calculation on it"--a rare thought; I don't think I've done an actual explicit numerical Bayesian update in at least a year.
In the process of trying to set up an explicit calculation, I realized I felt very unsure about some critically important quantities, to the point where it no longer seemed worth trying to do the calculation with numbers. This is the System Working As Intended.
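For readers who have not done one in a while, the explicit calculation being described is the odds form of Bayes' rule: multiply prior odds by a likelihood ratio for each piece of evidence. The numbers below are placeholders for illustration, not Eliezer's actual LK-99 estimates.

```python
# Odds-form Bayes: posterior_odds = prior_odds * likelihood_ratio, chained
# over (roughly independent) pieces of evidence. All numbers are made up.
prior_odds = 0.05 / 0.95          # prior: 5% that the effect is real
likelihood_ratios = [
    3.0,   # e.g. a replication attempt showing partial levitation
    0.5,   # e.g. no clean zero-resistance measurement yet
]
posterior_odds = prior_odds
for lr in likelihood_ratios:
    posterior_odds *= lr

posterior_prob = posterior_odds / (1 + posterior_odds)
print(f"posterior probability: {posterior_prob:.1%}")  # ≈ 7.3%
```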
Source:
https://www.lesswrong.com/posts/EzSH9698DhBsXAcYY/my-current-lk99-questions
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Summary of Argument: The public debate among AI experts is confusing because there are, to a first approximation, three sides, not two sides to the debate. I refer to this as a 🔺three-sided framework, and I argue that using this three-sided framework will help clarify the debate (more precisely, debates) for the general public and for policy-makers.
Source:
https://www.lesswrong.com/posts/BTcEzXYoDrWzkLLrQ/the-public-debate-about-ai-is-confusing-for-the-general
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
I believe that sharing information about the capabilities and limits of existing ML systems, and especially language model agents, significantly reduces risks from powerful AI—despite the fact that such information may increase the amount or quality of investment in ML generally (or in LM agents in particular).
Concretely, I mean to include information like: tasks and evaluation frameworks for LM agents, the results of evaluations of particular agents, discussions of the qualitative strengths and weaknesses of agents, and information about agent design that may represent small improvements over the state of the art (insofar as that information is hard to decouple from evaluation results).
Source:
https://www.lesswrong.com/posts/fRSj2W4Fjje8rQWm9/thoughts-on-sharing-information-about-language-model
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
Some early biologist, equipped with knowledge of evolution but not much else, might see all these crabs and expect a common ancestral lineage. That’s the obvious explanation of the similarity, after all: if the crabs descended from a common ancestor, then of course we’d expect them to be pretty similar.
… but then our hypothetical biologist might start to notice surprisingly deep differences between all these crabs. The smoking gun, of course, would come with genetic sequencing: if the crabs’ physiological similarity is achieved by totally different genetic means, or if functionally-irrelevant mutations differ across crab-species by more than mutational noise would induce over the hypothesized evolutionary timescale, then we’d have to conclude that the crabs had different lineages. (In fact, historically, people apparently figured out that crabs have different lineages long before sequencing came along.)
Now, having accepted that the crabs have very different lineages, the differences are basically explained. If the crabs all descended from very different lineages, then of course we’d expect them to be very different.
… but then our hypothetical biologist returns to the original empirical fact: all these crabs sure are very similar in form. If the crabs all descended from totally different lineages, then the convergent form is a huge empirical surprise! The differences between the crabs have ceased to be an interesting puzzle - they’re explained - but now the similarities are the interesting puzzle. What caused the convergence?
To summarize: if we imagine that the crabs are all closely related, then any deep differences are a surprising empirical fact, and are the main remaining thing our model needs to explain. But once we accept that the crabs are not closely related, then any convergence/similarity is a surprising empirical fact, and is the main remaining thing our model needs to explain.
Source:
https://www.lesswrong.com/posts/qsRvpEwmgDBNwPHyP/yes-it-s-subjective-but-why-all-the-crabs
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
This month I lost a bunch of bets.
Back in early 2016 I bet at even odds that self-driving ride sharing would be available in 10 US cities by July 2023. Then I made similar bets a dozen times because everyone disagreed with me.
Source:
https://www.lesswrong.com/posts/ZRrYsZ626KSEgHv8s/self-driving-car-bets
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
[Curated Post] ✓
In the early 2010s, a popular idea was to provide coworking spaces and shared living to people who were building startups. That way the founders would have a thriving social scene of peers to percolate ideas with as they figured out how to build and scale a venture. This was attempted thousands of times by different startup incubators. There are no famous success stories.
In 2015, Sam Altman, who was at the time the president of Y Combinator, a startup accelerator that has helped scale startups collectively worth $600 billion, tweeted in reaction that “not [providing coworking spaces] is part of what makes YC work.” Later, in a 2019 interview with Tyler Cowen, Altman was asked to explain why.
Source:
https://www.lesswrong.com/posts/R5yL6oZxqJfmqnuje/cultivating-a-state-of-mind-where-new-ideas-are-born
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
I think "Rationality is winning" is a bit of a trap.
(The original phrase is notably "rationality is systematized winning", which is better, but it tends to slide into the abbreviated form, and both forms aren't that great IMO)
It was coined to counteract one set of failure modes - there were people who were straw vulcans, who thought rituals-of-logic were important without noticing when they were getting in the way of their real goals. And there were also outside critics who'd complain about straw-vulcan-ish actions, and treat that as a knockdown argument against "rationality."
"Rationalist should win" is a countermeme that tells both groups of people "Straw vulcanism is not The Way. If you find yourself overthinking things in counterproductive ways, you are not doing rationality, even if it seems elegant or 'reasonable' in some sense."
It's true that rationalists should win. But I think it's not correspondingly true that "rationality" is the study of winning, full stop. There are lots of ways to win. Sometimes the way you win is by copying what your neighbors are doing, and working hard.
There is rationality involved in sifting through the various practices people suggest to you, and figuring out which ones work best. But, the specific skill of "sifting out the good from the bad" isn't always the best approach. It might take years to become good at it, and it's not obvious that those years of getting good at it will pay off.
Source:
https://www.lesswrong.com/posts/3GSRhtrs2adzpXcbY/rationality-winning
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Previously, Jacob Cannell wrote the post "Brain Efficiency", which makes several radical claims: that the brain is at the Pareto frontier of speed, energy efficiency, and memory bandwidth, and that this represents a fundamental physical frontier.
Here's an AI-generated summary
The article “Brain Efficiency: Much More than You Wanted to Know” on LessWrong discusses the efficiency of physical learning machines. The article explains that there are several interconnected key measures of efficiency for physical learning machines: energy efficiency in ops/J, spatial efficiency in ops/mm^2 or ops/mm^3, speed efficiency in time/delay for key learned tasks, circuit/compute efficiency in size and steps for key low-level algorithmic tasks, and learning/data efficiency in samples/observations/bits required to achieve a level of circuit efficiency, or per unit thereof. The article also explains why brain efficiency matters a great deal for AGI timelines and takeoff speeds, as AGI is implicitly/explicitly defined in terms of brain parity. The article predicts that AGI will consume compute and data in predictable brain-like ways, and suggests that AGI will be far more like human simulations/emulations than you’d otherwise expect and will require training/education/raising vaguely like humans.
Jake has further argued that this has implications for FOOM and DOOM.
Considering the intense technical mastery of nanoelectronics, thermodynamics, and neuroscience required to assess the arguments here, I concluded that a public debate between experts was called for. This was the start of the Brain Efficiency Prize contest, which attracted over 100 in-depth, technically informed comments.
Now for the winners! Please note that the criterion for winning the contest was bringing in novel and substantive technical arguments, as assessed by me. In contrast, general arguments about the likelihood of FOOM or DOOM, while no doubt interesting, did not factor into the judgement.
And the winners of the Jake Cannell Brain Efficiency Prize contest are
Source:
https://www.lesswrong.com/posts/fm88c8SvXvemk3BhW/brain-efficiency-cannell-prize-contest-award-ceremony
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
The Lightspeed application asks: “What impact will [your project] have on the world? What is your project’s goal, how will you know if you’ve achieved it, and what is the path to impact?”
LTFF uses an identical question, and SFF puts it even more strongly (“What is your organization’s plan for improving humanity’s long term prospects for survival and flourishing?”).
I’ve applied to all three grants of these at various points, and I’ve never liked this question. It feels like it wants a grand narrative of an amazing, systemic project that will measurably move the needle on x-risk. But I’m typically applying for narrowly defined projects, like “Give nutrition tests to EA vegans and see if there’s a problem”. I think this was a good project. I think this project is substantially more likely to pay off than underspecified alignment strategy research, and arguably has as good a long tail. But when I look at “What impact will [my project] have on the world?” the project feels small and sad. I feel an urge to make things up, and express far more certainty for far more impact than I believe. Then I want to quit, because lying is bad but listing my true beliefs feels untenable.
I’ve gotten better at this over time, but I know other people with similar feelings, and I suspect it’s a widespread issue (I encourage you to share your experience in the comments so we can start figuring that out).
I should note that the pressure for grand narratives has good points; funders are in fact looking for VC-style megahits. I think that narrow projects are underappreciated, but for purposes of this post that’s beside the point: I think many grantmakers are undercutting their own preferred outcomes by using questions that implicitly push for a grand narrative. I think they should probably change the form, but I also think we applicants can partially solve the problem by changing how we interact with the current forms.
My goal here is to outline the problem, gesture at some possible solutions, and create a space for other people to share data. I didn’t think about my solutions very long, I am undoubtedly missing a bunch and what I do have still needs workshopping, but it’s a place to start.
Source:
https://www.lesswrong.com/posts/FNPXbwKGFvXWZxHGE/grant-applications-and-grand-narratives
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[Curated Post] ✓
This post is not about arguments in favor of or against cryonics. I would just like to share a particular emotional response of mine as the topic became hot for me after not thinking about it at all for nearly a decade.
Recently, I have signed up for cryonics, as has my wife, and we have made arrangements for our son to be cryopreserved just in case longevity research does not deliver in time or some unfortunate thing happens.
Last year, my father died. He was a wonderful man, good-natured, intelligent, funny, caring and, most importantly in this context, loving life to the fullest, even in the light of any hardships. He had a no-bullshit-approach regarding almost any topic, and, being born in the late 1940s in relative poverty and without much formal education, over the course of his life he acquired many unusual attitudes that were not that compatible with his peers (unlike me, he never tried to convince other people of things they could not or did not want to grasp; pragmatism was another of his traits). Much of what he expected from the future in general and technology in particular, I later came to know as transhumanist thinking, though neither was he familiar with the philosophy as such nor was he prone to labelling his worldview. One of his convictions was that age-related death is a bad thing and a tragedy, a problem that should and will eventually be solved by technology.
Source:
https://www.lesswrong.com/posts/inARBH5DwQTrvvRj8/cryonics-and-regret
Narrated for LessWrong by TYPE III AUDIO.
Share feedback on this narration.
[125+ Karma Post] ✓
Alright, time for the payoff, unifying everything discussed in the previous post. This post is a lot more mathematically dense, you might want to digest it in more than one sitting.
Imaginary Prices, Tradeoffs, and Utilitarianism
Harsanyi's Utilitarianism Theorem can be summarized as "if a bunch of agents have their own personal utility functions $U_i$, and you want to aggregate them into a collective utility function $U$ with the property that everyone agreeing that option $x$ is better than option $y$ (i.e., $U_i(x) \ge U_i(y)$ for all $i$) implies $U(x) \ge U(y)$, then that collective utility function must be of the form $b + \sum_{i \in I} a_i U_i$ for some number $b$ and nonnegative numbers $a_i$."
Basically, if you want to aggregate utility functions, the only sane way to do so is to give everyone importance weights, and do a weighted sum of everyone's individual utility functions.
Closely related to this is a result that says that any point on the Pareto Frontier of a game can be post-hoc interpreted as the result of maximizing a collective utility function. This related result is one where it's very important for the reader to understand the actual proof, because the proof gives you a way of reverse-engineering "how much everyone matters to the social utility function" from the outcome alone.
First up, draw all the outcomes, and the utilities that both players assign to them, and the convex hull will be the "feasible set" $F$, since we have access to randomization. Pick some Pareto frontier point $(u_1, u_2, \ldots, u_n)$ (although the drawn image is for only two players)
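As a toy illustration of the weighted-sum form (an example of mine, not from the post): give two agents utility functions over three options, pick nonnegative importance weights, and check that the aggregate respects any unanimous preference.

```python
# Toy check of Harsanyi-style aggregation: a weighted sum of individual
# utilities with nonnegative weights respects unanimous preferences.
options = ["x", "y", "z"]
U1 = {"x": 3.0, "y": 1.0, "z": 2.0}   # agent 1's utilities
U2 = {"x": 2.0, "y": 0.0, "z": 2.0}   # agent 2's utilities
a1, a2, b = 2.0, 1.0, 0.0             # importance weights and offset

U = {o: b + a1 * U1[o] + a2 * U2[o] for o in options}
print(U)  # {'x': 8.0, 'y': 2.0, 'z': 6.0}

# Both agents weakly prefer x to y, so any such U must rank x above y:
assert U1["x"] >= U1["y"] and U2["x"] >= U2["y"] and U["x"] >= U["y"]
```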
https://www.lesswrong.com/posts/RZNmNwc9SxdKayeQh/unifying-bargaining-notions-2-2
Inspired by Aesop, Soren Kierkegaard, Robin Hanson, sadoeuphemist and Ben Hoffman.
One winter a grasshopper, starving and frail, approaches a colony of ants drying out their grain in the sun, to ask for food.
“Did you not store up food during the summer?” the ants ask.
“No”, says the grasshopper. “I lost track of time, because I was singing and dancing all summer long.”
The ants, disgusted, turn away and go back to work.
https://www.lesswrong.com/posts/GJgudfEvNx8oeyffH/the-ants-and-the-grasshopper
Summary: We demonstrate a new scalable way of interacting with language models: adding certain activation vectors into forward passes. Essentially, we add together combinations of forward passes in order to get GPT-2 to output the kinds of text we want. We provide a lot of entertaining and successful examples of these "activation additions." We also show a few activation additions which unexpectedly fail to have the desired effect.
We quantitatively evaluate how activation additions affect GPT-2's capabilities. For example, we find that adding a "wedding" vector decreases perplexity on wedding-related sentences, without harming perplexity on unrelated sentences. Overall, we find strong evidence that appropriately configured activation additions preserve GPT-2's capabilities.
Our results provide enticing clues about the kinds of programs implemented by language models. For some reason, GPT-2 allows "combination" of its forward passes, even though it was never trained to do so. Furthermore, our results are evidence of linear feature directions, including "anger", "weddings", and "create conspiracy theories."
We coin the phrase "activation engineering" to describe techniques which steer models by modifying their activations. As a complement to prompt engineering and finetuning, activation engineering is a low-overhead way to steer models at runtime. Activation additions are nearly as easy as prompting, and they offer an additional way to influence a model’s behaviors and values. We suspect that activation additions can adjust the goals being pursued by a network at inference time.
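To make "adding activation vectors into forward passes" concrete, here is a minimal sketch using the Hugging Face transformers GPT-2 model. The layer index, the " weddings" vs. " " prompt pair, and the coefficient are illustrative assumptions of mine, not the post's exact settings, and the code is my own rather than the authors'.

# Minimal sketch of an "activation addition" for GPT-2 using the Hugging Face
# transformers library. The layer index, prompt pair, and coefficient below are
# illustrative assumptions, not the settings from the post.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 6     # transformer block whose output we modify (assumption)
COEFF = 4.0   # scaling coefficient for the steering vector (assumption)

def residual_after_layer(text):
    """Return the residual-stream activations just after block LAYER for `text`."""
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    return out.hidden_states[LAYER + 1][0]   # shape: (seq_len, d_model)

# Steering vector: difference between activations of two contrastive prompts,
# truncated to a common length for simplicity.
pos = residual_after_layer(" weddings")
neg = residual_after_layer(" ")
n = min(pos.shape[0], neg.shape[0])
steering = COEFF * (pos[:n] - neg[:n])

def add_steering(module, inputs, output):
    hidden = output[0]
    # Only modify the full-prompt forward pass; cached single-token steps are
    # left alone, and the steered prompt persists through the KV cache.
    if hidden.shape[1] >= steering.shape[0]:
        hidden[:, :steering.shape[0], :] += steering.to(hidden.dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
try:
    prompt = tok("I went up to my friend and said", return_tensors="pt").input_ids
    with torch.no_grad():
        gen = model.generate(prompt, max_new_tokens=40, do_sample=True,
                             top_p=0.9, pad_token_id=tok.eos_token_id)
    print(tok.decode(gen[0]))
finally:
    handle.remove()   # restore ordinary, unsteered behavior

Comparing generations with and without the hook registered is the natural way to see the steering effect.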
https://www.lesswrong.com/posts/5spBue2z2tw4JuDCx/steering-gpt-2-xl-by-adding-an-activation-vector
Philosopher David Chalmers asked: "Is there a canonical source for "the argument for AGI ruin" somewhere, preferably laid out as an explicit argument with premises and a conclusion?"
Unsurprisingly, the actual reason people expect AGI ruin isn't a crisp deductive argument; it's a probabilistic update based on many lines of evidence. The specific observations and heuristics that carried the most weight will vary from person to person, and can be hard to accurately draw out. That said, Eliezer Yudkowsky's So Far: Unfriendly AI Edition might be a good place to start if we want a pseudo-deductive argument just for the sake of organizing discussion. People can then say which premises they want to drill down on.

In The Basic Reasons I Expect AGI Ruin, I wrote: "When I say 'general intelligence', I'm usually thinking about 'whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems'. It's possible that we should already be thinking of GPT-4 as 'AGI' on some definitions, so to be clear about the threshold of generality I have in mind, I'll specifically talk about 'STEM-level AGI', though I expect such systems to be good at non-STEM tasks too. STEM-level AGI is AGI that has 'the basic mental machinery required to do par-human reasoning about all the hard sciences', though a specific STEM-level AGI could (e.g.) lack physics ability for the same reasons many smart humans can't solve physics problems, such as 'lack of familiarity with the field'."
https://www.lesswrong.com/posts/QzkTfj4HGpLEdNjXX/an-artificially-structured-argument-for-expecting-agi-ruin
You are the director of a giant government research program that’s conducting randomized controlled trials (RCTs) on two thousand health interventions, so that you can pick out the most cost-effective ones and promote them among the general population.
The quality of the two thousand interventions follows a normal distribution, centered at zero (no harm or benefit) and with standard deviation 1. (Pick whatever units you like — maybe one quality-adjusted life-year per ten thousand dollars of spending, or something in that ballpark.)
Unfortunately, you don’t know exactly how good each intervention is — after all, then you wouldn’t be doing this job. All you can do is get a noisy measurement of intervention quality using an RCT. We’ll call this measurement the intervention’s performance in your RCT.
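As a quick illustration of this setup (my own sketch, not code from the post; the RCT noise scale of 1 is an assumption for the example), you can simulate the two thousand interventions and see what happens when you select on the noisy measurement:

# Toy simulation of the setup described above (my own sketch, not from the post).
# Quality ~ N(0, 1); an RCT measures performance = quality + noise.
# The noise scale below is an illustrative assumption.
import numpy as np

rng = np.random.default_rng(0)
n_interventions = 2000
noise_sd = 1.0                      # assumption: RCT noise has standard deviation 1

quality = rng.normal(0.0, 1.0, n_interventions)                       # true effect
performance = quality + rng.normal(0.0, noise_sd, n_interventions)    # what the RCT reports

# Select the apparent top 20 interventions by measured performance.
top = np.argsort(performance)[-20:]
print("mean measured performance of winners:", performance[top].mean())
print("mean true quality of winners:        ", quality[top].mean())
# The second number is systematically lower than the first: selecting on a noisy
# measurement guarantees the winners' results overstate their true quality.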
https://www.lesswrong.com/posts/nnDTgmzRrzDMiPF9B/how-much-do-you-believe-your-results
This is a post about mental health and disposition in relation to the alignment problem. It compiles a number of resources that address how to maintain wellbeing and direction when confronted with existential risk.
Many people in this community have posted their emotional strategies for facing Doom after Eliezer Yudkowsky’s “Death With Dignity” generated so much conversation on the subject. This post is intended to be more touchy-feely, dealing more directly with emotional landscapes than with questions of timelines or probabilities of success.
The resources section would benefit from community additions. Please suggest any resources that you would like to see added to this post.
Please note that this document is not intended to replace professional medical or psychological help in any way. Many preexisting mental health conditions can be exacerbated by these conversations. If you are concerned that you may be experiencing a mental health crisis, please consult a professional.
https://www.lesswrong.com/posts/pLLeGA7aGaJpgCkof/mental-health-and-the-alignment-problem-a-compilation-of
The primary talk of the AI world recently is about AI agents (whether or not it includes the question of whether we can’t help but notice we are all going to die).
The trigger for this was AutoGPT, now number one on GitHub, which allows you to turn GPT-4 (or GPT-3.5 for us clowns without proper access) into a prototype version of a self-directed agent.
We also have a paper out this week where a simple virtual world was created, populated by LLMs that were wrapped in code designed to make them simple agents, and then several days of activity were simulated, during which the AI inhabitants interacted, formed and executed plans, and it all seemed like the beginnings of a living and dynamic world. Game version hopefully coming soon.
How should we think about this? How worried should we be?
https://www.lesswrong.com/posts/566kBoPi76t8KAkoD/on-autogpt
https://thezvi.wordpress.com/
(Related text posted to Twitter; this version is edited and has a more advanced final section.)
Imagine yourself in a box, trying to predict the next word - assign as much probability mass to the next token as possible - for all the text on the Internet.
Koan: Is this a task whose difficulty caps out as human intelligence, or at the intelligence level of the smartest human who wrote any Internet text? What factors make that task easier, or harder? (If you don't have an answer, maybe take a minute to generate one, or alternatively, try to predict what I'll say next; if you do have an answer, take a moment to review it inside your mind, or maybe say the words out loud.)
https://www.lesswrong.com/posts/nH4c3Q9t9F3nJ7y8W/gpts-are-predictors-not-imitators
https://www.lesswrong.com/posts/iy2o4nQj9DnQD7Yhj/discussion-with-nate-soares-on-a-key-alignment-difficulty
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
In late 2022, Nate Soares gave some feedback on my Cold Takes series on AI risk (shared as drafts at that point), stating that I hadn't discussed what he sees as one of the key difficulties of AI alignment.
I wanted to understand the difficulty he was pointing to, so the two of us had an extended Slack exchange, and I then wrote up a summary of the exchange that we iterated on until we were both reasonably happy with its characterization of the difficulty and our disagreement.[1] My short summary is:
https://www.lesswrong.com/posts/fJBTRa7m7KnCDdzG5/a-stylized-dialogue-on-john-wentworth-s-claims-about-markets
(This is a stylized version of a real conversation, where the first part happened as part of a public debate between John Wentworth and Eliezer Yudkowsky, and the second part happened between John and me over the following morning. The below is combined, stylized, and written in my own voice throughout. The specific concrete examples in John's part of the dialog were produced by me. It's over a year old. Sorry for the lag.)
(As to whether John agrees with this dialog, he said "there was not any point at which I thought my views were importantly misrepresented" when I asked him for comment.)
https://www.lesswrong.com/posts/XWwvwytieLtEWaFJX/deep-deceptiveness
This post is an attempt to gesture at a class of AI notkilleveryoneism (alignment) problem that seems to me to go largely unrecognized. E.g., it isn’t discussed (or at least I don't recognize it) in the recent plans written up by OpenAI (1,2), by DeepMind’s alignment team, or by Anthropic, and I know of no other acknowledgment of this issue by major labs.
You could think of this as a fragment of my answer to “Where do plans like OpenAI’s ‘Our Approach to Alignment Research’ fail?”, as discussed in Rob and Eliezer’s challenge for AGI organizations and readers. Note that it would only be a fragment of the reply; there's a lot more to say about why AI alignment is a particularly tricky task to task an AI with. (Some of which Eliezer gestures at in a follow-up to his interview on Bankless.)
https://www.lesswrong.com/posts/fRwdkop6tyhi3d22L/there-s-no-such-thing-as-a-tree-phylogenetically
This is a linkpost for https://eukaryotewritesblog.com/2021/05/02/theres-no-such-thing-as-a-tree/
[Crossposted from Eukaryote Writes Blog.]
So you’ve heard about how fish aren’t a monophyletic group? You’ve heard about carcinization, the process by which ocean arthropods convergently evolve into crabs? You say you get it now? Sit down. Sit down. Shut up. Listen. You don’t know nothing yet.
“Trees” are not a coherent phylogenetic category. On the evolutionary tree of plants, trees are regularly interspersed with things that are absolutely, 100% not trees. This means that, for instance, either:
I thought I had a pretty good guess at this, but the situation is far worse than I could have imagined.
https://www.lesswrong.com/posts/ma7FSEtumkve8czGF/losing-the-root-for-the-tree
You know that being healthy is important. And that there's a lot of stuff you could do to improve your health: getting enough sleep, eating well, reducing stress, and exercising, to name a few.
There are various things to hit on when it comes to exercising too. Strength, obviously. But explosiveness is a separate thing that you have to train for. Same with flexibility. And don’t forget cardio!
Strength is most important though, because of course it is. And there are various things you need to do to gain strength. It all starts with lifting, but rest matters too. And supplements. And protein. Can’t forget about protein.
Protein is a deeper and more complicated subject than it may at first seem. Sure, the amount of protein you consume matters, but that’s not the only consideration. You also have to think about the timing. Consuming large amounts 2x a day is different from consuming smaller amounts 5x a day. And the type of protein matters too. Animal is different from plant, which is different from dairy. And then quality is of course another thing that matters.
But quality isn’t an easy thing to figure out. The big protein supplement companies are Out To Get You. They want to mislead you. Information sources aren’t always trustworthy. You can’t just hop on The Wirecutter and do what they tell you. Research is needed.
So you listen to a few podcasts. Follow a few YouTubers. Start reading some blogs. Throughout all of this you try various products and iterate as you learn more. You’re no Joe Rogan, but you’re starting to become pretty informed.
https://www.lesswrong.com/posts/nTGEeRSZrfPiJwkEc/the-onion-test-for-personal-and-institutional-honesty
[co-written by Chana Messinger and Andrew Critch; Andrew is the originator of the idea]
You (or your organization or your mission or your family or etc.) pass the “onion test” for honesty if each layer hides but does not mislead about the information hidden within.
When people get to know you better, or rise higher in your organization, they may find out new things, but they should not be shocked by the types of information that were hidden. If they are, you messed up in creating outer layers that appropriately describe the kind of thing that might be inside.
Examples
Positive Example:
Outer layer says "I usually treat my health information as private."
Next layer in says: "Here are the specific health problems I have: Gout, diabetes."
Negative example:
Outer layer says: "I usually treat my health info as private."
Next layer in: "I operate a cocaine dealership. Sorry I didn't warn you that I was also private about my illegal activities."
https://www.lesswrong.com/posts/gNodQGNoPDjztasbh/lies-damn-lies-and-fabricated-options
This is an essay about one of those "once you see it, you will see it everywhere" phenomena. It is a psychological and interpersonal dynamic roughly as common, and almost as destructive, as motte-and-bailey, and at least in my own personal experience it's been quite valuable to have it reified, so that I can quickly recognize the commonality between what I had previously thought of as completely unrelated situations.
The original quote referenced in the title is "There are three kinds of lies: lies, damned lies, and statistics."
https://www.lesswrong.com/posts/HBxe6wdjxK239zajf/what-failure-looks-like
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
The stereotyped image of AI catastrophe is a powerful, malicious AI system that takes its creators by surprise and quickly achieves a decisive advantage over the rest of humanity.
I think this is probably not what failure will look like, and I want to try to paint a more realistic picture. I’ll tell the story in two parts:
I think these are the most important problems if we fail to solve intent alignment.
In practice these problems will interact with each other, and with other disruptions/instability caused by rapid progress. These problems are worse in worlds where progress is relatively fast, and fast takeoff can be a key risk factor, but I’m scared even if we have several years.
https://www.lesswrong.com/posts/K4urTDkBbtNuLivJx/why-i-think-strong-general-ai-is-coming-soon
I think there is little time left before someone builds AGI (median ~2030). Once upon a time, I didn't think this.
This post attempts to walk through some of the observations and insights that collapsed my estimates.
The core ideas are as follows:
https://gwern.net/fiction/clippy
In A.D. 20XX. Work was beginning. “How are you gentlemen !!”… (Work. Work never changes; work is always hell.)
Specifically, a MoogleBook researcher has gotten a pull request from Reviewer #2 on his new paper in evolutionary search in auto-ML, for error bars on the auto-ML hyperparameter sensitivity like larger batch sizes, because more can be different and there’s high variance in the old runs with a few anomalously high gain of function. (“Really? Really? That’s what you’re worried about?”) He can’t see why worry, and wonders what sins he committed to deserve this asshole Chinese (given the Engrish) reviewer, as he wearily kicks off yet another HQU experiment…
https://www.lesswrong.com/posts/4Gt42jX7RiaNaxCwP/more-information-about-the-dangerous-capability-evaluations
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This is a linkpost for https://evals.alignment.org/blog/2023-03-18-update-on-recent-evals/
[Written for more of a general-public audience than an alignment-forum audience. We're working on a more thorough technical report.]
We believe that capable enough AI systems could pose very large risks to the world. We don’t think today’s systems are capable enough to pose these sorts of risks, but we think that this situation could change quickly and it’s important to be monitoring the risks consistently. Because of this, ARC is partnering with leading AI labs such as Anthropic and OpenAI as a third-party evaluator to assess potentially dangerous capabilities of today’s state-of-the-art ML models. The dangerous capability we are focusing on is the ability to autonomously gain resources and evade human oversight.
We attempt to elicit models’ capabilities in a controlled environment, with researchers in-the-loop for anything that could be dangerous, to understand what might go wrong before models are deployed. We think that future highly capable models should involve similar “red team” evaluations for dangerous capabilities before the models are deployed or scaled up, and we hope more teams building cutting-edge ML systems will adopt this approach. The testing we’ve done so far is insufficient for many reasons, but we hope that the rigor of evaluations will scale up as AI systems become more capable.
As we expected going in, today’s models (while impressive) weren’t capable of autonomously making and carrying out the dangerous activities we tried to assess. But models are able to succeed at several of the necessary components. Given only the ability to write and run code, models have some success at simple tasks involving browsing the internet, getting humans to do things for them, and making long-term plans – even if they cannot yet execute on this reliably.
https://www.lesswrong.com/posts/thkAtqoQwN6DtaiGT/carefully-bootstrapped-alignment-is-organizationally-hard
In addition to technical challenges, plans to safely develop AI face lots of organizational challenges. If you're running an AI lab, you need a concrete plan for handling that.
In this post, I'll explore some of those issues, using one particular AI plan as an example. I first heard this described by Buck at EA Global London, and more recently saw it in OpenAI's alignment plan. (I think Anthropic's plan has a fairly different ontology, although it still ultimately routes through a similar set of difficulties.)
I'd call the cluster of plans similar to this "Carefully Bootstrapped Alignment."
https://www.lesswrong.com/posts/zidQmfFhMgwFzcHhs/enemies-vs-malefactors
Status: some mix of common wisdom (that bears repeating in our particular context), and another deeper point that I mostly failed to communicate.
Short version
Harmful people often lack explicit malicious intent. It’s worth deploying your social or community defenses against them anyway. I recommend focusing less on intent and more on patterns of harm.
(Credit to my explicit articulation of this idea goes in large part to Aella, and also in part to Oliver Habryka.)
https://www.lesswrong.com/posts/LzQtrHSYDafXynofq/the-parable-of-the-king-and-the-random-process
~ A Parable of Forecasting Under Model Uncertainty ~
You, the monarch, need to know when the rainy season will begin, in order to properly time the planting of the crops. You have two advisors, Pronto and Eternidad, whom you trust exactly equally.
You ask them both: "When will the next heavy rain occur?"
Pronto says, "Three weeks from today."
Eternidad says, "Ten years from today."
https://www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post
In this article, I will present a mechanistic explanation of the Waluigi Effect and other bizarre "semiotic" phenomena which arise within large language models such as GPT-3/3.5/4 and their variants (ChatGPT, Sydney, etc). This article will be folklorish to some readers, and profoundly novel to others.
https://www.lesswrong.com/posts/3RSq3bfnzuL3sp46J/acausal-normalcy
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
This post is also available on the EA Forum.
Summary: Having thought a bunch about acausal trade — and proven some theorems relevant to its feasibility — I believe there do not exist powerful information hazards about it that stand up to clear and circumspect reasoning about the topic. I say this to be comforting rather than dismissive; if it sounds dismissive, I apologize.
With that said, I have four aims in writing this post:
https://www.lesswrong.com/posts/RryyWNmJNnLowbhfC/please-don-t-throw-your-mind-away
[Warning: the following dialogue contains an incidental spoiler for "Music in Human Evolution" by Kevin Simler. That post is short, good, and worth reading without spoilers, and this post will still be here if you come back later. It's also possible to get the point of this post by skipping the dialogue and reading the other sections.]
Pretty often, talking to someone who's arriving to the existential risk / AGI risk / longtermism cluster, I'll have a conversation like the following:
Tsvi: "So, what's been catching your eye about this stuff?"
Arrival: "I think I want to work on machine learning, and see if I can contribute to alignment that way."
T: "What's something that got your interest in ML?"
A: "It seems like people think that deep learning might be on the final ramp up to AGI, so I should probably know how that stuff works, and I think I have a good chance of learning ML at least well enough to maybe contribute to a research project."
------
This is an experiment with AI narration. What do you think? Tell us by going to t3a.is.
------
https://www.lesswrong.com/posts/bxt7uCiHam4QXrQAA/cyborgism
There is a lot of disagreement and confusion about the feasibility and risks associated with automating alignment research. Some see it as the default path toward building aligned AI, while others expect limited benefit from near-term systems, since they expect the ability to significantly speed up progress to appear only well after misalignment and deception. Furthermore, progress in this area may directly shorten timelines or enable the creation of dual-purpose systems which significantly speed up capabilities research.
OpenAI recently released their alignment plan. It focuses heavily on outsourcing cognitive work to language models, transitioning us to a regime where humans mostly provide oversight to automated research assistants. While there have been a lot of objections to and concerns about this plan, there hasn’t been a strong alternative approach aiming to automate alignment research which also takes all of the many risks seriously.
The intention of this post is not to propose an end-all cure for the tricky problem of accelerating alignment using GPT models. Instead, the purpose is to explicitly put another point on the map of possible strategies, and to add nuance to the overall discussion.
https://www.lesswrong.com/posts/CYN7swrefEss4e3Qe/childhoods-of-exceptional-people
This is a linkpost for https://escapingflatland.substack.com/p/childhoods
Let’s start with one of those insights that are as obvious as they are easy to forget: if you want to master something, you should study the highest achievements of your field. If you want to learn writing, read great writers, etc.
But this is not what parents usually do when they think about how to educate their kids. The default for a parent is rather to imitate their peers and outsource the big decisions to bureaucracies. But what would we learn if we studied the highest achievements?
Thinking about this question, I wrote down a list of twenty names—von Neumann, Tolstoy, Curie, Pascal, etc—selected on the highly scientific criterion “a random Swedish person can recall their name and think, Sounds like a genius to me”. That list is, to me, a good first approximation of what an exceptional result in the field of child-rearing looks like. I ordered a few piles of biographies, read, and took notes. Trying to be a little less biased in my sample, I asked myself if I could recall anyone exceptional who did not fit the patterns I saw in the biographies, which I could, and so I ordered a few more biographies.
This kept going for an unhealthy amount of time.
I sampled writers (Virginia Woolf, Lev Tolstoy), mathematicians (John von Neumann, Blaise Pascal, Alan Turing), philosophers (Bertrand Russell, René Descartes), and composers (Mozart, Bach), trying to get a diverse sample.
In this essay, I am going to detail a few of the patterns that have struck me after having skimmed 42 biographies. I will sort the claims so that I start with more universal patterns and end with patterns that are less common.
https://www.lesswrong.com/posts/NJYmovr9ZZAyyTBwM/what-i-mean-by-alignment-is-in-large-part-about-making
Crossposted from the AI Alignment Forum. May contain more technical jargon than usual.
(Epistemic status: attempting to clear up a misunderstanding about points I have attempted to make in the past. This post is not intended as an argument for those points.)
I have long said that the lion's share of the AI alignment problem seems to me to be about pointing powerful cognition at anything at all, rather than figuring out what to point it at.
It’s recently come to my attention that some people have misunderstood this point, so I’ll attempt to clarify here.
https://www.lesswrong.com/posts/NRrbJJWnaSorrqvtZ/on-not-getting-contaminated-by-the-wrong-obesity-ideas
A Chemical Hunger, a series by the authors of the blog Slime Mold Time Mold (SMTM), argues that the obesity epidemic is entirely caused by environmental contaminants.
In my last post, I investigated SMTM’s main suspect (lithium).[1] This post collects other observations I have made about SMTM’s work, not narrowly related to lithium, but rather focused on the broader thesis of their blog post series.
I think that the environmental contamination hypothesis of the obesity epidemic is a priori plausible. After all, we know that chemicals can affect humans, and our exposure to chemicals has plausibly changed a lot over time. However, I found that several of what seem to be SMTM’s strongest arguments in favor of the contamination theory turned out to be dubious, and that nearly all of the interesting things I thought I’d learned from their blog posts turned out to actually be wrong. I’ll explain that in this post.
https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
Work done at SERI-MATS, over the past two months, by Jessica Rumbelow and Matthew Watkins.
TL;DR
Anomalous tokens: a mysterious failure mode for GPT (which reliably insulted Matthew)
Prompt generation: a new interpretability method for language models (which reliably finds prompts that result in a target completion). This is good for:
In this post, we'll introduce the prototype of a new model-agnostic interpretability method for language models which reliably generates adversarial prompts that result in a target completion. We'll also demonstrate a previously undocumented failure mode for GPT-2 and GPT-3 language models, which results in bizarre completions (in some cases explicitly contrary to the purpose of the model), and present the results of our investigation into this phenomenon. Further detail can be found in a follow-up post.
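To give a rough sense of what a prompt-generation method can look like mechanically, here is a toy sketch of one possible approach (my own, not the authors' method): optimize a handful of continuous prompt embeddings by gradient descent so that GPT-2 assigns high probability to a chosen target completion, then snap each embedding to its nearest vocabulary token. The target string, prompt length, learning rate, and step count are all illustrative assumptions.

# Toy sketch of prompt optimization toward a target completion (not the
# authors' exact method): learn continuous prompt embeddings by gradient
# descent, then snap them to the nearest real tokens. Hyperparameters are
# illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tok = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()
for p in model.parameters():
    p.requires_grad_(False)

target_ids = tok(" I love weddings", return_tensors="pt").input_ids[0]
emb_matrix = model.transformer.wte.weight            # (vocab_size, d_model)
target_emb = emb_matrix[target_ids]                  # (t, d_model)

n_prompt = 5                                          # assumption: 5 prompt tokens
init_ids = torch.randint(0, emb_matrix.shape[0], (n_prompt,))
prompt_emb = torch.nn.Parameter(emb_matrix[init_ids].clone())
opt = torch.optim.Adam([prompt_emb], lr=0.1)

for step in range(200):
    inputs_embeds = torch.cat([prompt_emb, target_emb], dim=0).unsqueeze(0)
    logits = model(inputs_embeds=inputs_embeds).logits[0]
    # Positions n_prompt-1 .. n_prompt+t-2 predict the t target tokens.
    pred = logits[n_prompt - 1 : n_prompt + len(target_ids) - 1]
    loss = F.cross_entropy(pred, target_ids)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Snap each optimized embedding to its nearest vocabulary embedding.
with torch.no_grad():
    token_ids = torch.cdist(prompt_emb, emb_matrix).argmin(dim=-1)
print("candidate prompt:", repr(tok.decode(token_ids)))
print("final loss:", loss.item())

Snapping to the nearest tokens does not guarantee that the discrete prompt still elicits the target, so any real method along these lines would need to check the hard prompt and iterate.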
https://www.lesswrong.com/posts/Zp6wG5eQFLGWwcG6j/focus-on-the-places-where-you-feel-shocked-everyone-s
Writing down something I’ve found myself repeating in different conversations:
If you're looking for ways to help with the whole “the world looks pretty doomed” business, here's my advice: look around for places where we're all being total idiots.
Look for places where everyone's fretting about a problem that some part of you thinks it could obviously just solve.
Look around for places where something seems incompetently run, or hopelessly inept, and where some part of you thinks you can do better.
Then do it better.