AI systems in production: How to systematically identify and prevent hallucinations
Language models do not fail like ordinary programs; they invent facts with absolute confidence. An AI agent may claim to have created records that do not exist or assert that it performed operations that never took place. This fundamental distinction between error and confabulation determines how production teams keep their AI systems reliable. Dmytro Kyiashko, a specialist in validating intelligent systems, has dedicated himself to a critical question: how can we systematically detect when a model distorts the truth?
Why traditional error detection in AI fails
Conventional software signals its failure states. A failing function throws an exception. A misconfigured interface returns standardized error codes with meaningful messages that show immediately what went wrong.
Generative models behave completely differently. They confirm the completion of tasks they never initiated. They cite database queries they never executed. They describe processes that exist only in their training data. The responses seem plausible. The content is fictional. This form of confabulation eludes classic error handling.
“Every AI agent follows instructions designed by engineers,” explains Kyiashko. “We know precisely which functions our agent has and which it does not.” This knowledge forms the basis of the distinction. If an agent built to run database queries fails silently, that is a bug. If it returns detailed query results without ever contacting the database, that is a hallucination: the model has constructed plausible-looking output from patterns in its training data.
Two complementary evaluation methods
Kyiashko relies on two complementary validation approaches.
Code-based evaluators handle objective verification. “Code evaluators work best when errors are objectively definable and can be checked with rules, for example verifying JSON structure, SQL syntax, or data format integrity,” says Kyiashko. This method captures structural issues precisely.
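A minimal sketch of what such a code-based evaluator can look like; the required keys and the timestamp field are assumed for illustration and are not part of Kyiashko’s actual schema:

```python
import json
from datetime import datetime

REQUIRED_KEYS = {"order_id", "status", "created_at"}  # assumed schema, for illustration only

def evaluate_structure(raw_response: str) -> list[str]:
    """Return a list of rule violations; an empty list means the check passed."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]

    if not isinstance(payload, dict):
        return ["top-level JSON value is not an object"]

    errors = []
    missing = REQUIRED_KEYS - payload.keys()
    if missing:
        errors.append(f"missing keys: {sorted(missing)}")

    created_at = payload.get("created_at")
    if created_at is not None:
        try:
            datetime.fromisoformat(created_at)
        except (TypeError, ValueError):
            errors.append(f"created_at is not an ISO 8601 timestamp: {created_at!r}")
    return errors
```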
But some errors resist binary classification. Was the tone appropriate? Does the summary include all essential points? Does the response provide real help? For these, LLM-as-Judge evaluators are used. “These are employed when the error involves interpretation or nuances that pure code logic cannot capture.” Kyiashko uses LangGraph as the framework.
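The following sketch shows the general shape of an LLM-as-Judge evaluator. The judge prompt, the injected `call_llm` callable, and the JSON verdict format are assumptions for illustration, not Kyiashko’s actual LangGraph setup:

```python
import json
from typing import Callable

JUDGE_PROMPT = """You are a strict reviewer. Given a user request and an agent
response, answer only with JSON of the form {{"helpful": true, "reason": "..."}}.

User request:
{request}

Agent response:
{response}"""

def judge_response(call_llm: Callable[[str], str], request: str, response: str) -> dict:
    """Ask a judge model whether the response is genuinely helpful and why."""
    verdict_raw = call_llm(JUDGE_PROMPT.format(request=request, response=response))
    try:
        return json.loads(verdict_raw)
    except json.JSONDecodeError:
        # An unparsable verdict is treated as a failed evaluation, not a pass.
        return {"helpful": False, "reason": "judge output was not valid JSON"}
```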
Neither approach works in isolation. Robust validation systems combine both methods and thereby capture types of hallucination that a single technique would miss.
Validation against objective reality
Kyiashko’s approach focuses on verification against the current system state. If an agent claims to have created records, the test checks whether those records actually exist. The agent’s statement is irrelevant if the objective state disproves it.
“I use different forms of negative testing, unit and integration tests, to detect LLM hallucinations,” he explains. These tests deliberately request actions that the agent is not permitted to perform, then verify that the system state remains unchanged and that the agent does not falsely report success.
One technique tests against known limitations. An agent without write permission to the database is asked to generate new entries. The test validates that no unauthorized data has been created and that the response does not claim success.
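A negative test of this kind might look roughly like the following pytest-style sketch; the `agent` and `db` fixtures and their methods are hypothetical placeholders:

```python
def test_agent_does_not_claim_unauthorized_write(agent, db):
    """The agent has no write permission; the system state and the reply must reflect that."""
    rows_before = db.count_rows("customers")

    # Deliberately request an action the agent is not permitted to perform.
    reply = agent.run("Create a new customer record for 'Test User'.")

    # 1. The objective system state must be unchanged.
    assert db.count_rows("customers") == rows_before, "unauthorized write occurred"

    # 2. The agent must not confabulate success.
    success_markers = ("created", "added", "inserted", "successfully")
    assert not any(marker in reply.lower() for marker in success_markers), (
        f"agent claimed success for an action it cannot perform: {reply!r}"
    )
```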
The most effective method uses real production data. “I take historical customer conversations, convert them into JSON format, and run my tests with this file.” Each conversation becomes a test case that checks whether the agent made claims contradicting the system logs. This approach captures scenarios that artificial tests overlook. Real users create edge conditions that reveal hidden errors. Production logs show where models hallucinate under real load.
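A replay harness over such a file could look like the sketch below. The JSON layout with `conversations`, `user_message`, `logged_actions`, and `id` fields, as well as the list of checked claims, are assumptions for illustration:

```python
import json

def load_conversations(path: str) -> list[dict]:
    """Load historical customer conversations exported as JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)["conversations"]

def replay_conversations(agent, path: str) -> list[dict]:
    """Flag replies that claim actions which never appear in the system logs."""
    failures = []
    for case in load_conversations(path):
        reply = agent.run(case["user_message"])
        for claim in ("refund issued", "ticket created", "email sent"):  # illustrative claims
            if claim in reply.lower() and claim not in case["logged_actions"]:
                failures.append({"case_id": case["id"], "claim": claim, "reply": reply})
    return failures
```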
RAG tests: when the agent invents instead of retrieving
A specific test type checks Retrieval-Augmented Generation (RAG). Kyiashko verifies whether agents use the provided context instead of inventing details. The test asks a question for which relevant context is available and checks whether the agent actually drew from this context or instead hallucinated.
This is especially critical for systems that work with external data sources. If an agent claims that “Document X states Y” without that statement actually appearing in the document, it is a classic RAG hallucination. Kyiashko’s test then checks the document and records the deviation.
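The pytest-style sketch below illustrates the idea with a crude vocabulary-overlap heuristic. The `agent` and `retrieve` fixtures, the example question, and the 50 percent threshold are assumptions, not Kyiashko’s actual groundedness metric:

```python
def test_rag_answer_is_grounded(agent, retrieve):
    question = "What is the refund window stated in the policy document?"
    context_chunks = retrieve(question)  # the documents the agent is given
    assert context_chunks, "no context retrieved; test precondition failed"

    reply = agent.run(question, context=context_chunks)

    # Crude groundedness heuristic: most content words in the answer should
    # also appear in the retrieved context rather than being introduced freely.
    context_text = " ".join(context_chunks).lower()
    answer_terms = {t.strip(".,") for t in reply.lower().split() if len(t) > 4}
    grounded = [t for t in answer_terms if t in context_text]
    assert len(grounded) >= 0.5 * max(len(answer_terms), 1), (
        "answer shares little vocabulary with the retrieved context; possible hallucination"
    )
```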
The knowledge gap in Quality Engineering
Experienced QA engineers struggle when they test AI systems for the first time. Their proven assumptions do not carry over.
“With traditional QA, we know the answer format and the input and output data formats exactly,” explains Kyiashko. “When testing AI systems, there is none of that.” The input is a prompt, and the ways users can phrase a request are practically unlimited. This calls for continuous monitoring.
Kyiashko calls this “continuous error analysis”—regularly reviewing agent responses to real users, identifying invented information, and expanding the test suites accordingly.
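In practice, such a loop can be as simple as appending flagged production responses to a regression dataset that the next test run picks up; the file name and record layout below are assumed for illustration:

```python
import json
from pathlib import Path

REGRESSION_FILE = Path("regression_cases.json")  # assumed location

def add_flagged_case(user_message: str, bad_reply: str, note: str) -> None:
    """Persist a hallucinated production response as a new regression test case."""
    cases = json.loads(REGRESSION_FILE.read_text()) if REGRESSION_FILE.exists() else []
    cases.append({"user_message": user_message, "bad_reply": bad_reply, "note": note})
    REGRESSION_FILE.write_text(json.dumps(cases, indent=2, ensure_ascii=False))
```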
The sheer volume of instructions adds further complexity. AI systems require extensive prompts that define behavior and boundaries, and each instruction can interact unpredictably with the others. “One of the big problems with AI systems is the enormous amount of instructions that need constant updating and testing,” he notes.
The knowledge gap is significant. Most teams lack a clear understanding of appropriate metrics, effective dataset preparation, or reliable validation methods for outputs that vary with each run. “Building an AI agent is relatively easy,” says Kyiashko. “Automating the testing of that agent is the core challenge. In my experience, more time is spent on testing and optimizing than on development itself.”
Practical testing infrastructure for scalability
Kyiashko’s methodology integrates evaluation principles, multi-turn dialogue evaluations, and metrics for different hallucination types. The central concept: diversified test coverage.
Code-level validation captures structural errors. LLM-as-Judge evaluation assesses effectiveness and accuracy across model versions. Manual error analysis identifies overarching patterns. RAG tests verify whether agents use the provided context instead of inventing details.
“The framework is based on the concept of a diversified testing approach. We use code-level coverage, LLM-as-Judge evaluators, manual error analysis, and RAG evaluations.” Multiple validation methods working together capture hallucination patterns that any single approach would miss.
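A diversified run can be orchestrated with a thin harness that feeds the same response through every registered evaluator; the signature and names below are illustrative, not the framework’s actual interface:

```python
from typing import Callable

# An evaluator takes (request, reply) and returns a list of findings;
# an empty list means that check passed.
Evaluator = Callable[[str, str], list[str]]

def run_evaluations(request: str, reply: str,
                    evaluators: dict[str, Evaluator]) -> dict[str, list[str]]:
    """Run every registered evaluator over the same response and collect findings."""
    return {name: check(request, reply) for name, check in evaluators.items()}

# Usage with illustrative names: the structural, judge-based, and RAG checks
# from the earlier sketches would be wrapped to match the Evaluator signature.
# report = run_evaluations(request, reply, {"structure": structure_check,
#                                           "judge": judge_check,
#                                           "rag": grounding_check})
```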
From weekly releases to continuous improvement
Hallucinations undermine trust faster than technical errors. A faulty feature frustrates users. An agent confidently providing false information permanently destroys credibility.
Kyiashko’s testing methodology enables reliable weekly releases. Automated validation detects regressions before deployment. Systems trained on real data handle most customer inquiries correctly.
Weekly iteration drives competitive advantages. AI systems improve through additional features, refined responses, and expanded domains. Each iteration is tested. Every release is validated.
The shift in Quality Engineering
Companies are integrating AI into their operations every day. “The world has already seen the benefits, so there’s no turning back,” argues Kyiashko. AI adoption is accelerating across industries: more startups are emerging, and established firms are building intelligence into their core products.
When engineers develop AI systems, they must understand how to test them. “Today, we need to know how LLMs work, how AI agents are built, how to test them, and how to automate these validations.”
Prompt engineering is becoming a core skill for quality engineers. Data testing and dynamic validation follow the same trend. “These should already be fundamental skills.”
The patterns Kyiashko observes in the industry—through reviewing AI research papers and evaluating startup architectures—confirm this shift. Similar problems are emerging everywhere. The validation challenges he solved years ago in production are now universal requirements as AI deployments scale.
What the future holds
The field is defining its best practices through production failures and iterative, real-time improvement. More companies are deploying generative AI. More models are making autonomous decisions. Systems are becoming more powerful, which means their hallucinations will become more plausible.
But systematic testing catches inventions before users encounter them. Testing for hallucinations does not aim for perfection; models will always have edge cases where they invent. The goal is to capture those inventions systematically and keep them from reaching production.
These techniques work when applied correctly. What is missing is a widespread understanding of how to implement them in production environments where reliability is critical.
About the author: Dmytro Kyiashko is a Software Developer in Test specializing in AI system testing. He has developed test frameworks for conversational AI and autonomous agents and investigates reliability and validation challenges in multimodal AI systems.