I recently came across some fascinating research from Carnegie Mellon University. They ran a study evaluating how accurately AI agents complete various tasks, and the findings were quite surprising - these AI systems reportedly get things wrong about 70% of the time.
This raises concerns about the AI tools and chatbots we use daily. Can we really trust them as much as we think? If they are incorrect seven times out of ten, that poses a significant issue for anyone depending on them for crucial decisions.
Has anyone else come across this study or experienced similar issues with AI agents being frequently wrong? I’m interested in hearing others’ thoughts on these results and if they’ve noticed the same trends when using AI tools.
Been using AI tools for two years - that failure rate doesn’t surprise me at all. What really bugs me is how inconsistent they get when you ask the same thing different ways. They’re terrible with recent info or anything needing real context. Just yesterday a chatbot swore a restaurant was open when it closed months ago. Most people don’t get that these things are just making educated guesses from patterns, not actually knowing stuff. They’re decent for creative work or basic info, but I always double-check anything important with real sources. Cool tech, but people treating it like it’s never wrong? That’s the real problem.
This totally explains why my coding AI keeps spitting out functions that look perfect but completely break when I run them. I’ve wasted hours debugging what should’ve been quick fixes. And yeah, the confidence thing is spot on - they never admit uncertainty, just confidently hand you garbage.
I’ve seen this constantly in production. That 70% failure rate? Not surprising at all.
We deployed an AI agent for customer support tickets last year. Total disaster. It’d misclassify issues constantly or give completely wrong solutions. Someone asks about billing, gets a password reset response.
Here’s the problem: these models are trained on general data, but they fall apart on domain-specific work and edge cases. And they sound confident while being dead wrong.
I tell my team to treat AI outputs as suggestions, never facts. We built validation layers around everything because blind trust created too many headaches.
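If it helps, here’s roughly what I mean by a validation layer - a simplified sketch in Python, with made-up names (classify_with_ai, the category list, the threshold), not our actual code:

```python
# Hypothetical validation layer around an AI ticket classifier.
# The classifier call and categories below are placeholders for illustration.

ALLOWED_CATEGORIES = {"billing", "password_reset", "outage", "other"}
CONFIDENCE_THRESHOLD = 0.8  # below this, a human reviews the ticket

def route_ticket(ticket_text, classify_with_ai):
    """Treat the AI label as a suggestion; validate it before acting on it."""
    suggestion = classify_with_ai(ticket_text)  # e.g. {"label": "billing", "confidence": 0.62}

    label = suggestion.get("label")
    confidence = suggestion.get("confidence", 0.0)

    # Reject labels the system doesn't recognize, and anything low-confidence.
    if label not in ALLOWED_CATEGORIES or confidence < CONFIDENCE_THRESHOLD:
        return {"route": "human_review", "ai_suggestion": suggestion}

    # Accepted labels still get logged so we can audit the failure rate later.
    return {"route": label, "ai_suggestion": suggestion}
```

The specifics don’t matter much - the point is that the model’s output never reaches a customer without passing a check like this or going to a human first.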
The scary part? How convincing they sound. Users think they’re getting expert advice when it’s just statistical word prediction failing.
I work in data science and see the same patterns. That 70% figure probably depends on how they measured “accuracy” and what they compared it against. AI agents handle simple tasks fine but completely break down with vague questions or specialized topics. The biggest problem is context switching - they can’t maintain logical reasoning across multi-step problems. What bugs me most is how they deliver wrong answers with the same confidence as right ones. Without good prompt engineering and human oversight, you’re basically gambling on important decisions. The tech has potential but we’re nowhere close to the reliability needed for autonomous work in most professional settings.
that’s crazy if it’s real, but like what kinda tasks were they testing? complex stuff or just basic questions? chatgpt gives me weird answers sometimes too, but 70% still sounds high. guess it depends on what you ask it.