This website contains 0.00006% of the world’s knowledge

According to reputable sources, this blog contains 0.00006% of the world’s knowledge.

  1. The large language models (LLMs) that underlie tools like ChatGPT and Bing AI are being used as question-answering tools. If you listen to the hype surrounding what LLMs can do, you can hardly be faulted for thinking they hold every fact known to humankind and can answer any question.
  2. One of the most popular large language models, GPT-3, was trained with several large text datasets.
  3. One dataset, C4 (a filtered version of the Common Crawl), is 60% of the text used in training.
  4. According to this article in the Washington Post, dltj.org is 0.0001% of the tokens in the C4 dataset.

*Screen capture of the 'dltj.org' search result in the Washington Post article.*

How much is 0.00006% of the GPT-3 training set? (C4 is 60% of the training text, so dltj.org's 0.0001% of C4 works out to roughly 0.00006% of the whole.) It is a quarter of an inch (half a centimeter) off sea level on a climb up Mount Everest (source: Wolfram Alpha). It is almost 8 feet (2.5 meters) of a journey from Washington, DC, to San Francisco, California (source: Wolfram Alpha). In contrast, the content from the New York Times is 0.036% of the C4 dataset, or 9/10ths of a mile (1.4 km) of that journey.
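
For anyone who wants to check the math, here is a quick back-of-the-envelope sketch of those comparisons in Python. The Everest height (8,849 m) and the straight-line Washington, DC-to-San Francisco distance (roughly 3,930 km) are assumed figures from commonly cited sources, not numbers taken from the Washington Post article.

```python
# Back-of-the-envelope check of the "how much is 0.00006%?" comparisons.
# Assumed figures (not from the post itself): Everest's summit elevation
# and the straight-line Washington, DC -> San Francisco distance.

EVEREST_HEIGHT_M = 8_849        # summit elevation above sea level, meters
DC_TO_SF_KM = 3_930             # straight-line distance, approximate

dltj_share_of_c4 = 0.0001 / 100     # 0.0001% of C4 (Washington Post figure)
c4_share_of_gpt3 = 0.60             # C4 is ~60% of GPT-3's training text
dltj_share_of_gpt3 = dltj_share_of_c4 * c4_share_of_gpt3   # ~0.00006%

everest_mm = dltj_share_of_gpt3 * EVEREST_HEIGHT_M * 1000
journey_m = dltj_share_of_gpt3 * DC_TO_SF_KM * 1000

print(f"dltj.org share of GPT-3 training text: {dltj_share_of_gpt3:.7%}")
print(f"Everest equivalent: {everest_mm:.1f} mm (~{everest_mm / 25.4:.2f} in)")
print(f"DC-to-SF equivalent: {journey_m:.1f} m (~{journey_m * 3.28:.1f} ft)")

# The New York Times, at 0.036% of C4, covers a much bigger slice of the trip:
nyt_km = (0.036 / 100) * DC_TO_SF_KM
print(f"NYT equivalent: {nyt_km:.2f} km (~{nyt_km * 0.621:.2f} mi)")
```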

(A note about assumptions: OpenAI hasn't published the contents of the training data for GPT-3.5, which is used in ChatGPT, or for GPT-4. So this post uses the figures for GPT-3 as listed in Wikipedia.)

You can use the search tool near the bottom of the Washington Post article to see where your favorite website ranks. But also read the article to explore what is in the C4 version of the Common Crawl. As much as OpenAI is trying to put guardrails on the output, the model itself is trained on some pretty offensive stuff.