Large Language Models Benchmarks

NSA Benchmarks Anthropic’s “Mythos” AI Against Sovereign Cyber Tools

The National Security Agency (NSA) has officially begun testing a specialized version of Anthropic’s latest large language ...

DeepSeek open-sources V4 large language model series

Chinese artificial intelligence developer DeepSeek today released a new series of open-source large language models. V4, as ...

Science Media Centre

expert reaction to study evaluating performance of a large language model on the reasoning tasks of a physician

April 30, 2026. A study published in Science evaluates the perform ...

iAfrica

Egyptian Startup Releases Open-Source AI Model That Outperforms Larger Global Rivals on Key Benchmarks

A Cairo-based artificial intelligence startup has released Horus 1.0-4B, a fully open-source large language model built in Egypt that outperforms several ...

Council on Foreign Relations

DeepSeek V4 Signals a New Phase in the U.S.-China AI Rivalry

The latest Chinese model trails U.S. competitors on benchmarks. But it may not have to win the performance race to reshape ...

Unite.AI

Simbian Launches Cyber Defense Benchmark, Reveals Major Gap in AI Security Capabilities

A new benchmark released by Simbian is challenging one of the most widely held assumptions in artificial intelligence: that the same models capable of finding vulnerabilities can also defend against ...

Why LLMs Hallucinate And What Enterprises Must Do About It

Ultimately, hallucinations are a systemic feature of today’s LLMs. Unfortunately, they’re not an anomaly. But with the right ...

News-Medical.Net

AgentClinic puts medical AI through a more realistic diagnostic test

AgentClinic is a multimodal benchmark that tests clinical AI agents in simulated, dialogue-driven diagnostic settings rather ...

Hosted on MSN

New standards and benchmarks reshape 2026 LLM choices

A wave of 2026 developments — from Anthropic's Model Context Protocol to Microsoft's GraphRAG concept and rigorous benchmarks like Terminal-Bench 2.0 and SWE-Bench Pro — is redefining how AI teams ...

TMCnet

ShengShu Technology Unveils World Action Model "Motubrain": One Brain, Infinite Possibilities for Robotic Intelligence

ShengShu Technology today announces Motubrain, a World Action Model that replaces multiple task-specific systems with a single, unified model that functions as a robotic brain for the physical world.

A Harvard study shows AI model can outperform physicians in emergency room diagnoses

In one case, a patient came into the emergency department with a pulmonary embolism. The condition initially improved with ...

Hosted on MSN

New study challenges accuracy of AI benchmark testing

A Nature-published study by an international research team has found that current AI benchmarks fail to accurately measure large language models’ core capabilities. Existing tests often mix skills ...
