Summary

Article Metadata

Synthetic Data Training for AI

This is an OpenAI news story, published by TechCrunch, that relates primarily to Sam Altman news.

The promise and perils of synthetic data | TechCrunch

Summary

Nutrition label

82% Informative

Synthetic data is increasingly appealing as real data becomes harder to obtain.

The market for annotated data is worth $838.2 million today and is projected to be worth $10.34 billion in the next ten years.

The people doing annotation work number in the “millions”, and demand for labeled data has ballooned the market for annotation services.

Gartner predicts 60% of the data used for AI and analytics projects this year will be synthetically generated.

Synthetic data generation has become a business in its own right, one that could be worth $2.34 billion by 2030.

If the data used to train these models has biases and limitations, their outputs will be tainted too.

Complex models such as OpenAI’s o1 could produce harder-to-spot hallucinations in their synthetic data.

OpenAI CEO Sam Altman once argued that AI will someday produce synthetic data good enough to effectively train itself.

No major AI lab has released a model trained on synthetic data alone.

At least for the foreseeable future, it seems we’ll need humans in the loop somewhere to make sure a model’s training doesn’t go awry.

Article tone: informal

Language: English

Offensive language: not offensive

Hate speech: not hateful

Attention-grabbing headline: not detected

Known propaganda techniques: not detected

Time-value: long-living

External references

https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/
https://www.reuters.com/technology/inside-big-techs-underground-race-buy-ai-training-data-2024-04-05/
https://venturebeat.com/2021/03/28/mit-study-finds-systematic-labeling-errors-in-popular-ai-benchmark-datasets/
https://time.com/6247678/openai-chatgpt-kenya-workers/
https://finance.yahoo.com/news/data-annotation-labelling-market-expected-144500834.html
https://www.zdnet.com/article/beware-ai-model-collapse-how-training-on-synthetic-data-pollutes-the-next-generation/
https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
https://www.fortunebusinessinsights.com/synthetic-data-generation-market-108433
https://originality.ai/
https://www.cnbc.com/2024/10/09/ai-startup-writer-launches-new-model-to-compete-with-openai.html
https://observer.com/2024/07/ai-training-data-crisis/#:~:text=In%20the%20past%20year%2C%20around,models.&text=As%20the%20A.I.,data%20to%20be%20trained%20on.
https://www.techopedia.com/is-2026-the-year-ai-runs-out-of-training-data#:~:text=Key%20Takeaways,to%20address%20potential%20data%20limitations
https://www.businessinsider.com/ai-synthetic-data-industry-debate-over-fake-2024-8
https://www.techtarget.com/whatis/feature/Model-collapse-explained-How-synthetic-training-data-breaks-AI
https://arxiv.org/pdf/2306.06130
https://www.aboutamazon.com/news/devices/how-amazon-protects-customer-privacy-while-making-alexa-better
https://arxiv.org/pdf/2203.10748.pdf
https://arxiv.org/pdf/2307.01850.pdf
https://www.inc.com/ben-sherry/openais-next-generation-models-could-reportedly-cost-2000.html
https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/

Source diversity

blogs.gartner.com www.reuters.com venturebeat.com time.com finance.yahoo.com www.zdnet.com huggingface.co www.fortunebusinessinsights.com originality.ai www.cnbc.com observer.com www.techopedia.com www.businessinsider.com www.techtarget.com arxiv.org www.aboutamazon.com www.inc.com blogs.nvidia.com

Affiliate links

no affiliate links
