TechCrunch

The promise and perils of synthetic data | TechCrunch

Summary
Nutrition label

82% Informative

Synthetic data is gaining appeal as real data becomes increasingly hard to obtain.

The market for data annotation services is estimated to be worth $838.2 million today and is projected to be worth $10.34 billion in the next ten years.

Demand for labeled data has ballooned the market for annotation services, and the number of people doing labeling work is estimated to be in the “millions.”

Gartner predicts 60% of the data used for AI and analytics projects this year will be synthetically generated.

Synthetic data generation has become a business in its own right, one that could be worth $2.34 billion by 2030.

If the data used to train these models has biases and limitations, their outputs will be tainted too.

Complex models such as OpenAI’s o1 could produce harder-to-spot hallucinations in their synthetic data.

OpenAI CEO Sam Altman once argued that AI will someday produce synthetic data good enough to effectively train itself.

No major AI lab has released a model trained on synthetic data alone.

At least for the foreseeable future, it seems we’ll need humans in the loop somewhere to make sure a model’s training doesn’t go awry.

VR Score

83

Informative language

82

Neutral language

47

Article tone

informal

Language

English

Language complexity

53

Offensive language

not offensive

Hate speech

not hateful

Attention-grabbing headline

not detected

Known propaganda techniques

not detected

Time-value

long-living

External references

20

https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/
https://www.reuters.com/technology/inside-big-techs-underground-race-buy-ai-training-data-2024-04-05/
https://venturebeat.com/2021/03/28/mit-study-finds-systematic-labeling-errors-in-popular-ai-benchmark-datasets/
https://time.com/6247678/openai-chatgpt-kenya-workers/
https://finance.yahoo.com/news/data-annotation-labelling-market-expected-144500834.html
https://www.zdnet.com/article/beware-ai-model-collapse-how-training-on-synthetic-data-pollutes-the-next-generation/
https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
https://www.fortunebusinessinsights.com/synthetic-data-generation-market-108433
https://originality.ai/
https://www.cnbc.com/2024/10/09/ai-startup-writer-launches-new-model-to-compete-with-openai.html
https://observer.com/2024/07/ai-training-data-crisis/#:~:text=In%20the%20past%20year%2C%20around,models.&text=As%20the%20A.I.,data%20to%20be%20trained%20on.
https://www.techopedia.com/is-2026-the-year-ai-runs-out-of-training-data#:~:text=Key%20Takeaways,to%20address%20potential%20data%20limitations
https://www.businessinsider.com/ai-synthetic-data-industry-debate-over-fake-2024-8
https://www.techtarget.com/whatis/feature/Model-collapse-explained-How-synthetic-training-data-breaks-AI
https://arxiv.org/pdf/2306.06130
https://www.aboutamazon.com/news/devices/how-amazon-protects-customer-privacy-while-making-alexa-better
https://arxiv.org/pdf/2203.10748.pdf
https://arxiv.org/pdf/2307.01850.pdf
https://www.inc.com/ben-sherry/openais-next-generation-models-could-reportedly-cost-2000.html
https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/

Affiliate links

no affiliate links