What changed since v1
When I posted the first version of TraceMind, I got one clear piece of feedback: "this is useful but I need to know if my AI is making things up, not just scoring low."
So I built hallucination detection. Then while building it I realized I needed a way to compare prompts systematically. So I built A/B testing too.
Here's what's new and how I built it.
The original problem (unchanged)
I was building a multi-agent orchestration system. Three days after deploying, I changed a system prompt. Quality dropped from 84% to 52%. I found out 11 days later from a user complaint.
TraceMind was built to catch this on day zero.
What's new in v2
Hallucination detection
The endpoint takes a question, the AI's response, and optional ground truth context. It extracts individual claims from the response, checks each one against the context, and returns a structured result:
{
  "has_hallucinations": true,
  "overall_risk": "high",
  "claims": [
    {
      "claim": "We offer 60-day refunds",
      "verdict": "hallucination",
      "reason": "Context says 30-day refunds only"
    }
  ]
}
The key architectural decision: claim extraction and verification are separate LLM calls. The first call extracts atomic claims. The second verifies each claim against ground truth. This is more reliable than asking one model to do both.
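A minimal sketch of that two-stage pipeline, with the LLM call injected as a plain function (the prompts and parsing here are simplified assumptions, not the production code):

```python
import json
from typing import Callable

def detect_hallucinations(response: str, context: str,
                          llm: Callable[[str], str]) -> dict:
    """Two-stage check: extract claims, then verify each against context."""
    # Stage 1: extraction only. The model never sees the ground truth here,
    # so it can't "pre-verify" its own output.
    claims = json.loads(llm(
        "List each atomic factual claim in this response as a JSON array "
        f"of strings:\n{response}"
    ))
    # Stage 2: verify each claim in isolation against the context.
    results = []
    for claim in claims:
        verdict = llm(
            f"Context:\n{context}\n\nClaim: {claim}\n"
            "Answer 'supported', 'hallucination', or 'unverifiable'."
        ).strip().lower()
        results.append({"claim": claim, "verdict": verdict})
    flagged = [r for r in results if r["verdict"] == "hallucination"]
    return {"has_hallucinations": bool(flagged), "claims": results}
```

Because the LLM is injected, the whole pipeline can be tested with a stub model before spending real tokens.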
Prompt A/B testing
You give it two system prompts and a dataset. It runs both prompts against every test case and compares results.
The interesting part is the statistical layer. A naive implementation would just compare average scores. But with small datasets (5-20 cases), average score differences are often noise. I added a Mann-Whitney U test and Cohen's d to give a confidence score on whether prompt B is actually better or just randomly different.
{
  "prompt_a_score": 6.2,
  "prompt_b_score": 8.1,
  "winner": "B",
  "confidence": "high",
  "cohen_d": 1.4,
  "p_value": 0.03
}
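The statistics behind that result fit in a few lines. Here's a stdlib-only sketch using the normal approximation to the U distribution (the real implementation may differ, and scipy.stats.mannwhitneyu gives exact small-sample handling):

```python
from math import erf, sqrt
from statistics import mean, stdev

def compare_prompts(scores_a, scores_b, alpha=0.05):
    """Decide whether prompt B beats prompt A beyond small-sample noise."""
    na, nb = len(scores_a), len(scores_b)
    # Mann-Whitney U: count pairs where B outscores A (ties count half).
    # Non-parametric, so no normality assumption on the scores themselves.
    u = sum((b > a) + 0.5 * (b == a) for a in scores_a for b in scores_b)
    # Normal approximation to U (no tie correction -- fine as a sketch).
    z = (u - na * nb / 2) / sqrt(na * nb * (na + nb + 1) / 12)
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    # Cohen's d with a pooled standard deviation as the effect size.
    pooled = sqrt(((na - 1) * stdev(scores_a) ** 2 +
                   (nb - 1) * stdev(scores_b) ** 2) / (na + nb - 2))
    d = (mean(scores_b) - mean(scores_a)) / pooled
    winner = "B" if mean(scores_b) > mean(scores_a) else "A"
    confidence = "high" if p_value < alpha and abs(d) >= 0.8 else "low"
    return {"winner": winner, "p_value": round(p_value, 4),
            "cohen_d": round(d, 2), "confidence": confidence}
```

The two numbers answer different questions: the p-value says "is the difference real?", Cohen's d says "is it big enough to care about?". With 10 test cases you want both before declaring a winner.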
Verification suite
I built a 44-test verification script covering all 11 feature areas. Running python verify_all.py hits every endpoint end-to-end against a real running server and reports pass/fail. This was more useful than unit tests for catching integration issues.
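The core of such a script is just a loop over named checks against the live server. A minimal sketch (the check shown is illustrative, not the real endpoint list):

```python
def run_suite(checks):
    """Run named end-to-end checks against a live server, report pass/fail."""
    passed = 0
    for name, check in checks:
        try:
            check()  # each check raises AssertionError on failure
            passed += 1
            print(f"PASS  {name}")
        except Exception as exc:
            print(f"FAIL  {name}: {exc}")
    print(f"{passed}/{len(checks)} passed")
    return passed == len(checks)

# Example check, assuming a locally running server (the endpoint path
# here is hypothetical, not the actual TraceMind API):
# def check_health():
#     import urllib.request
#     with urllib.request.urlopen("http://localhost:8000/health") as r:
#         assert r.status == 200
```

Because each check hits the real server, a failure here means an actual broken endpoint, not a mocked one.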
What I'd still do differently
The same things from v1, plus one new one: the hallucination detection is synchronous. For production use it should be a background job like span scoring. A user with 1000 traces would need to wait for each one — that doesn't scale.
Try it


GitHub: https://github.com/Aayush-engineer/tracemind
pip install tracemind

from tracemind import TraceMind

tm = TraceMind(api_key="...", project="my-app",
               base_url="https://tracemind.onrender.com")

@tm.trace("llm_call")
def your_function(msg): ...  # unchanged
Self-hosted, free, no vendor lock-in.
If you're building with LLMs, I'd genuinely love to know what breaks when you try it.
Top comments (9)
Splitting claim extraction and verification into separate LLM calls is the right call - I've seen single-pass approaches confidently mark their own hallucinations as verified. The Mann-Whitney U test for A/B prompt comparison is a really thoughtful addition. Too many people eyeball average scores on 10 test cases and declare a winner. With small sample sizes, the statistical significance check is the difference between real signal and noise. One thing worth considering for the async hallucination detection: you could batch claims across multiple traces and verify them in bulk against a shared knowledge base, which would amortize the LLM cost per verification significantly.
the hallucination detection is the real unlock here - knowing when an agent is confidently wrong is harder than building the agent. A/B testing prompts is basically the only honest way to get signal on that.
Nice update.
Hallucination detection is useful, but I’ve found the harder issue is what happens after detection. In real workflows, the question becomes whether the system can handle that uncertainty correctly, retry, validate, or block actions.
Curious if you’re planning anything around response control or just focusing on evaluation for now.
That's the right question and honestly it's something I've been thinking about too.
Right now TraceMind is purely evaluation: it tells you something went wrong but doesn't act on it. You still have to decide what to do with that information.
The direction I'm thinking about for v3 is exactly what you're describing: closing the loop. When the hallucination detector flags a response as high risk, instead of just logging it, the system could expose a hook so the application layer can intercept it: retry with a different prompt, fall back to a more conservative response, or block the action entirely.
The challenge is that "what to do" is highly application-specific. A customer support bot should probably retry. A legal document analyzer should probably block and escalate to a human. A coding assistant might just flag it. I'm hesitant to bake in specific behavior because it varies so much.
My current thinking is a callback interface, something like tm.on_hallucination(risk_level, callback), where the developer defines the response policy and TraceMind just fires the event. That way the detection layer stays clean and the application layer owns the control logic.
Is that the kind of integration you'd find useful, or are you thinking about something more opinionated?
That makes sense. Closing the loop at the application layer feels like the right boundary.
The callback approach is clean, especially since the “what to do” really is application-specific. Trying to hardcode that at the detection layer would probably create more problems than it solves.
The only thing I’ve seen is that once you leave it fully open, teams either:
don’t implement policies at all
or implement them inconsistently
So there’s a bit of a gap between flexibility and actual control in practice.
Feels like the interesting middle ground is:
opinionated defaults + override hooks
So teams get something safe by default, but still have room to adapt per workflow.
Opinionated defaults plus override hooks: that framing is exactly right, and I think that's where I'll land.
Something like three built-in policies out of the box: block (don't return the response), retry (re-run the LLM call with a more conservative prompt), and flag (return the response but attach a warning in metadata). Any of those can be overridden with a custom callback.
The advantage is teams get something safe by default on day one — no policy configuration required. The ones with specific workflows just override the relevant policy. That's the LangChain callbacks model and it works well in practice.
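A rough shape of that defaults-plus-override design (every name here is illustrative, not a committed API; the retry policy is omitted since it needs an LLM handle):

```python
from typing import Callable

def _block(response, risk):
    # Default "block" policy: withhold the response entirely.
    return None

def _flag(response, risk):
    # Default "flag" policy: return the response with a warning attached.
    return {"text": response, "warning": f"hallucination risk: {risk}"}

# Safe out of the box: high risk blocks, everything else is flagged.
DEFAULT_POLICIES = {"high": _block, "medium": _flag, "low": _flag}

class HallucinationPolicy:
    def __init__(self):
        self._policies = dict(DEFAULT_POLICIES)

    def on_hallucination(self, risk_level: str, callback: Callable):
        """Override hook: the application owns the control logic."""
        self._policies[risk_level] = callback

    def apply(self, response: str, risk_level: str):
        return self._policies.get(risk_level, _flag)(response, risk_level)
```

Day one behavior needs zero configuration, and a team with a specific workflow overrides exactly one risk level without touching the rest.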
I'll prototype this for v3. If you're willing to try it when it's ready I'd value the feedback — you've clearly thought about this problem more than most.
The separate claim extraction + verification pattern is really smart. I tried doing it in a single LLM call and got inconsistent results — the model would sometimes hallucinate while checking for hallucinations (ironic). The statistical layer for A/B testing is also underrated. Most people just compare averages and call it a day. Have you considered adding confidence intervals to the scoring? That would make the comparison even more reliable with small datasets.
Great upgrade—hallucination detection and A/B testing are key steps toward making LLM evaluation more reliable and production-ready.
A/B testing LLM outputs is underrated. Most eval frameworks just give you a score — actually comparing two outputs side by side with human judgment catches things automated evals miss. How are you handling the ground truth problem for hallucination detection?