Summary

Article Metadata

AI Benchmark Arena Obsession

This is a UC news story, published by TechCrunch, that relates primarily to Chatbot Arena news.


The AI industry is obsessed with Chatbot Arena, but it might not be the best benchmark | TechCrunch

Summary

Nutrition label

89% Informative

Chatbot Arena is a crowdsourced AI benchmarking tool that lets users test AI models.

It lets anyone on the web ask one or more questions of two randomly selected, anonymous models.

LMSYS is a non-profit run by students and faculty at Carnegie Mellon, UC Berkeley’s SkyLab and UC San Diego.

In March, the organization released a data set containing a million conversations between users and models.

Chatbot Arena is framed as an empirical test, but it amounts to a relative rating of models.
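
To illustrate what a "relative rating" built from head-to-head votes can look like, here is a minimal sketch using an Elo-style update. This is an assumption made for illustration: the summary does not say which rating method Chatbot Arena uses, and the model names, starting rating of 1000, and `k` value below are all hypothetical.

```python
def expected_score(rating_a, rating_b):
    """Predicted probability that A is preferred over B (Elo-style logistic curve)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(ratings, winner, loser, k=32.0):
    """Shift points from the loser to the winner after one pairwise vote."""
    exp_w = expected_score(ratings[winner], ratings[loser])
    delta = k * (1.0 - exp_w)  # a surprising win moves more points
    ratings[winner] += delta
    ratings[loser] -= delta


# Hypothetical models and votes; each tuple is (preferred model, other model).
ratings = {"model_a": 1000.0, "model_b": 1000.0, "model_c": 1000.0}
votes = [("model_a", "model_b"), ("model_a", "model_c"), ("model_b", "model_c")]

for winner, loser in votes:
    update(ratings, winner, loser)

# Print the models from highest to lowest rating.
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

In a scheme like this, only the gap between two ratings matters: adding the same constant to every model leaves all predicted preferences unchanged. The scores therefore say which models voters preferred over which others, not how good any model is on an absolute scale.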

The voting does not account for users’ ability to spot hallucinations from models, nor for differences in their preferences.

LMSYS is trying to offset these biases with two automated systems, MT-Bench and Arena-Hard-Auto, that use models themselves to rank the quality of responses from other models.

LMSYS and Chatbot Arena provide real-time insights into how different AI models perform outside the lab.

The platform is sponsored in part by organizations, one of which is a VC firm, with horses in the AI race.

Google’s Gemini and Mistral AI models are among those being tested on the platform.

VR Score

Informative language

Neutral language

Article tone

formal

Language

English

Language complexity

Offensive language

not offensive

Hate speech

not hateful

Attention-grabbing headline

not detected

Known propaganda techniques

not detected

Time-value

long-living

External references

https://allenai.org/
https://the-decoder.com/llms-are-biased-and-dont-match-human-preferences-when-evaluating-text-study-finds/
https://yuchenlin.xyz/
https://www.qmul.ac.uk/eecs/people/profiles/cookmichael.html
https://arxiv.org/pdf/2309.11998
https://google.github.io/styleguide/docguide/style.html
https://pitchbook.com/news/articles/andreessen-horowitz-mistral-ai-vc-investment
https://www.reddit.com/r/LocalLLaMA/comments/1e2uppy/new_model_eurekachatbot_on_lmsys_arena/
https://a16z.com/announcing-our-latest-open-source-ai-grants/
https://arxiv.org/pdf/2403.04132
https://www.aibase.com/news/11268
https://x.com/elonmusk/status/1823600291096502279
https://x.com/lmsysorg
https://lmsys.org/blog/2024-08-28-style-control/
https://arstechnica.com/information-technology/2024/04/rumors-swirl-about-mystery-gpt2-chatbot-that-some-think-is-gpt-5-in-disguise/
https://arxiv.org/pdf/2406.11939
https://arxiv.org/pdf/2306.05685
https://mbzuai.ac.ae/

Source diversity

allenai.org the-decoder.com yuchenlin.xyz www.qmul.ac.uk arxiv.org google.github.io pitchbook.com www.reddit.com a16z.com www.aibase.com x.com lmsys.org arstechnica.com mbzuai.ac.ae

Affiliate links

no affiliate links
