LessWrong May 27, 2026 at 12:27 AM

your-left-brain-doesn-t-trade-with-you...

2 corrections found

Claim

AIs are trained on the whole of the internet.

Correction

This overstates how frontier models are trained. Official documentation describes curated mixtures of filtered web data plus other datasets, not 'the whole of the internet.'

Full reasoning

Major model providers do not describe their systems as being trained on "the whole of the internet."

Microsoft's official Azure OpenAI transparency note says GPT-3 training data came from a combination of sources: a filtered version of Common Crawl (60 percent of the weighted pretraining dataset) plus higher-quality datasets, including WebText, books corpora, and English-language Wikipedia. A filtered subset of Common Crawl is not the whole internet, and the inclusion of separately curated datasets further contradicts the claim.
Meta's official Llama 3 responsibility post likewise says Llama 3 was trained on "a variety of public data" and that Meta excluded or removed some sources known to contain lots of personal information. That is a curated subset, not the entire internet.

So as a factual description of how AI models are trained, the sentence is incorrect: training data is typically a selected and filtered mix of sources, not the whole internet copied wholesale.

2 sources

Transparency Note for Azure OpenAI in Microsoft Foundry Models - Microsoft Learn
The GPT-3 series of models are pretrained on a wide body of publicly available free text data. This data is sourced from a combination of web crawling (specifically, a filtered version of Common Crawl ... comprises 60 percent of the weighted pretraining dataset) and higher-quality datasets, including ... WebText ... books corpora, and English-language Wikipedia.
Our responsible approach to Meta AI and Meta Llama 3
As with Llama 2, Llama 3 is trained on a variety of public data... We excluded or removed data from certain sources known to contain a high volume of personal information about private individuals.

Claim

There aren't really specialized models doing all kinds of specialized work.

Correction

Specialized AI models plainly do exist. AlphaFold is built for protein structure prediction, and GraphCast is built for weather forecasting; both are domain-specific systems rather than general frontier LLMs.

Full reasoning

This sentence is contradicted by well-known, officially documented specialized AI systems.

Google DeepMind's AlphaFold is a domain-specific model for predicting protein structures and molecular interactions. DeepMind describes AlphaFold Server, which is powered by AlphaFold 3, as predicting "how proteins will interact with other molecules throughout cells", a use case specific to biology.
Google DeepMind's GraphCast is a learned weather forecasting model, not a general-purpose LLM. DeepMind says it is significantly more accurate than ECMWF's deterministic forecasting system, HRES, on 89.3% of the 2760 target variables and lead times evaluated.

These are clear examples of specialized models doing specialized work in distinct scientific domains. That means the blanket claim that there "aren't really specialized models" is factually wrong.

2 sources

AlphaFold - Google DeepMind
AlphaFold Server Powered by AlphaFold 3 - AlphaFold Server predicts how proteins will interact with other molecules throughout cells.
GraphCast: Learned Global Weather Forecasting - Google DeepMind
We introduce a learned weather simulator - called "GraphCast" - which outperforms the most accurate operational medium-range weather forecasting system in the world... GraphCast is significantly more accurate than ... ECMWF's deterministic forecasting system, "HRES", on 89.3% of the 2760 target variables and lead times we evaluated.

Model: OPENAI_GPT_5 Prompt: v1.16.0