TechCrunch

The promise and perils of synthetic data | TechCrunch

Summary
Nutrition label

82% Informative

Synthetic data is increasingly attractive as new, real data becomes harder to obtain.

The market for data annotation services is estimated to be worth $10.34 billion within the next 10 years.

But why does AI need data in the first place, and what kind of data does it need? And can this data be replaced by synthetic data?

Gartner predicted that, by 2024, 60% of the data used for AI and analytics projects would be synthetically generated.

Synthetic data can be used to generate training data in a format that’s not easily obtained through scraping (or even content licensing). However, it suffers from the same “garbage in, garbage out” problem as all AI.

Complex models such as OpenAI's o1 could produce harder-to-spot hallucinations in synthetic data.

OpenAI CEO Sam Altman once argued that AI will someday produce synthetic data good enough to effectively train itself.

No major AI lab has released a model trained on synthetic data alone.

At least for the foreseeable future, it seems we’ll need humans in the loop somewhere to make sure a model’s training doesn’t go awry.

VR Score: 83
Informative language: 82
Neutral language: 45
Article tone: informal
Language: English
Language complexity: 53
Offensive language: not offensive
Hate speech: not hateful
Attention-grabbing headline: not detected
Known propaganda techniques: not detected
Time-value: long-living
External references: 20

https://blogs.gartner.com/andrew_white/2021/07/24/by-2024-60-of-the-data-used-for-the-development-of-ai-and-analytics-projects-will-be-synthetically-generated/
https://www.reuters.com/technology/inside-big-techs-underground-race-buy-ai-training-data-2024-04-05/
https://venturebeat.com/2021/03/28/mit-study-finds-systematic-labeling-errors-in-popular-ai-benchmark-datasets/
https://time.com/6247678/openai-chatgpt-kenya-workers/
https://finance.yahoo.com/news/data-annotation-labelling-market-expected-144500834.html
https://www.zdnet.com/article/beware-ai-model-collapse-how-training-on-synthetic-data-pollutes-the-next-generation/
https://huggingface.co/datasets/HuggingFaceTB/smollm-corpus
https://www.fortunebusinessinsights.com/synthetic-data-generation-market-108433
https://originality.ai/
https://www.cnbc.com/2024/10/09/ai-startup-writer-launches-new-model-to-compete-with-openai.html
https://observer.com/2024/07/ai-training-data-crisis/#:~:text=In%20the%20past%20year%2C%20around,models.&text=As%20the%20A.I.,data%20to%20be%20trained%20on.
https://www.techopedia.com/is-2026-the-year-ai-runs-out-of-training-data#:~:text=Key%20Takeaways,to%20address%20potential%20data%20limitations
https://www.businessinsider.com/ai-synthetic-data-industry-debate-over-fake-2024-8
https://www.techtarget.com/whatis/feature/Model-collapse-explained-How-synthetic-training-data-breaks-AI
https://arxiv.org/pdf/2306.06130
https://www.aboutamazon.com/news/devices/how-amazon-protects-customer-privacy-while-making-alexa-better
https://arxiv.org/pdf/2203.10748.pdf
https://arxiv.org/pdf/2307.01850.pdf
https://www.inc.com/ben-sherry/openais-next-generation-models-could-reportedly-cost-2000.html
https://blogs.nvidia.com/blog/nemotron-4-synthetic-data-generation-llm-training/

Affiliate links: no affiliate links