TechCrunch

Technology

These researchers used NPR Sunday Puzzle questions to benchmark AI 'reasoning' models | TechCrunch

Summary
Nutrition label

82% Informative

Will Shortz, the crossword editor of The New York Times, hosts a weekly NPR quiz segment called the Sunday Puzzle.

Researchers created an AI benchmark using riddles from the show.

They say AI models sometimes “give up” and provide answers they know aren’t correct.

The current best-performing model on the benchmark is o1, with a score of 59%, followed by the recently released o3-mini set to high reasoning effort.

VR Score: 86
Informative language: 87
Neutral language: 51
Article tone: informal
Language: English
Language complexity: 51
Offensive language: not offensive
Hate speech: not hateful
Attention-grabbing headline: not detected
Known propaganda techniques: not detected
Time-value: long-living
Source diversity: 2
Affiliate links: no affiliate links