ChatGPT Data Sources – Where Does It Get Data From?

Written by Mark


ChatGPT has impressed users worldwide by providing human-like answers with striking accuracy, but what explains that? This article covers the major sources of data used to train ChatGPT and shows how that data has driven recent advances in AI.


Where does ChatGPT get its data from? – ChatGPT Data Sources

ChatGPT is an AI language model trained on a large body of text from a variety of sources (e.g., Wikipedia, books, news articles, and scientific journals). Below are the main places ChatGPT gets its data from.

ChatGPT (GPT-3) Data Sources

GPT-1 and GPT-2 laid the groundwork for GPT-3, which uses 175 billion parameters to carry on conversations that read much like a human’s. With more parameters, the GPT-3 model can learn more complicated patterns and forms of natural language and produce more human-like text.

According to the paper “Language Models are Few-Shot Learners,” OpenAI trained GPT-3 on the following datasets (a quick sketch of their relative sizes follows the list):

  • Common Crawl makes up the largest share, at roughly 410 billion tokens. This freely available dataset contains petabytes of data crawled from the web since 2008. 
  • WebText2 contributes about 19 billion tokens of text from web pages linked in Reddit posts with 3+ upvotes.
  • Books1 & Books2 contribute roughly 12 and 55 billion tokens from internet-based book corpora.
  • English-language Wikipedia pages contribute about 3 billion tokens.
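
For a rough sense of how those sources compare in size, here is a small Python sketch that turns the token counts listed above into percentage shares. The percentages are simple token-count ratios, not the sampling weights OpenAI actually used during training.

# Illustrative only: the token counts below are the figures listed above;
# this just converts them into rough percentage shares of the corpus.
gpt3_training_tokens = {        # billions of tokens
    "Common Crawl (filtered)": 410,
    "WebText2": 19,
    "Books1": 12,
    "Books2": 55,
    "Wikipedia (English)": 3,
}

total = sum(gpt3_training_tokens.values())   # ~499 billion tokens
for source, tokens in gpt3_training_tokens.items():
    print(f"{source}: {tokens}B tokens ({tokens / total:.1%} of the corpus)")

Common Crawl alone accounts for roughly four-fifths of the tokens in this breakdown, which is why filtering its quality matters so much.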

ChatGPT (GPT-4) Data Sources

OpenAI released GPT-4 in March 2023 as a more advanced architecture than GPT-3, scaling the deep-learning approach further. With GPT-4 in ChatGPT, users can feed in inputs of up to 25,000 words, roughly eight times what the previous ChatGPT model accepted.

OpenAI president Greg Brockman told TechCrunch that the new model was trained on both images and text, without disclosing more about the sources. OpenAI has kept the data sources used for GPT-4 a mystery.

What Are the DALL-E-2 Data Sources for Text-to-Image Generation?

OpenAI is also recognized for DALL-E-2, a popular deep-learning model that generates images from text prompts or instructions. DALL-E-2 gets its images from a freely available dataset known as LAION.

LAION contains billions of paired images and text captions collected from the World Wide Web. It discovers images by analyzing Common Crawl data and recognizing HTML IMG tags that include an alt-text attribute.
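
To illustrate the idea, here is a minimal Python sketch of that extraction step. It is not LAION's actual pipeline; it simply scans an HTML page for IMG tags that carry alt text and collects the resulting image-caption pairs.

# Minimal sketch (assumed example, not LAION's real code): collect
# (image URL, alt text) pairs from <img> tags using the standard library.
from html.parser import HTMLParser

class ImgAltCollector(HTMLParser):
    """Collect (src, alt) pairs from <img> tags that have non-empty alt text."""
    def __init__(self):
        super().__init__()
        self.pairs = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attrs = dict(attrs)
        src = attrs.get("src")
        alt = (attrs.get("alt") or "").strip()
        if src and alt:  # keep only images that come with caption-like alt text
            self.pairs.append((src, alt))

sample_html = """
<html><body>
  <img src="https://example.com/cat.jpg" alt="A cat sleeping on a windowsill">
  <img src="https://example.com/logo.png">  <!-- no alt text, so it is skipped -->
</body></html>
"""

collector = ImgAltCollector()
collector.feed(sample_html)
print(collector.pairs)
# [('https://example.com/cat.jpg', 'A cat sleeping on a windowsill')]

Applied at web scale across Common Crawl, this kind of pairing is what yields the billions of image-caption examples the article describes.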

How big is the dataset for ChatGPT?

According to Stanford University, OpenAI trained ChatGPT on 570 GB of data drawn from web pages, Wikipedia, research articles, and books. The resulting 175-billion-parameter model can handle around 10 million queries every day. In total, OpenAI fed approximately 300 billion words into the system.

The base GPT-3 model behind ChatGPT drew on datasets gathered between 2016 and 2019. After filtering 45 TB of compressed plain text, roughly 570 GB remained for training. The earlier GPT-2 model used only 1.5 billion parameters, about 100 times fewer than GPT-3. Training on such a massive volume of text lets ChatGPT produce contextually appropriate responses to user inputs and handle a broad range of topics.
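
As a quick sanity check on those figures, the short Python snippet below works out how much of the raw text survived filtering and how GPT-2's parameter count compares with GPT-3's (the terabyte-to-gigabyte conversion and the rounding are assumptions, using only the numbers quoted above).

# Back-of-the-envelope arithmetic based on the figures quoted in this article.
raw_text_tb = 45        # compressed plain text before filtering
filtered_gb = 570       # text kept for training
params_gpt2 = 1.5e9
params_gpt3 = 175e9

kept_fraction = filtered_gb / (raw_text_tb * 1024)   # TB -> GB
print(f"Roughly {kept_fraction:.1%} of the raw text was kept after filtering")    # ~1.2%
print(f"GPT-3 has ~{params_gpt3 / params_gpt2:.0f}x more parameters than GPT-2")  # ~117x, i.e. about 100x

In other words, only a small fraction of the crawled text was judged clean enough to train on, and the jump from GPT-2 to GPT-3 was roughly two orders of magnitude in parameter count.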

Mark Roberts is a freelance writer and tech enthusiast based in San Diego, specializing in internet security and AI tools.
