Launch HN: Confident AI (YC W25) – Open-source evaluation framework for LLM apps
107 points by jeffreyip a day ago | 22 comments
Hi HN - we're Jeffrey and Kritin, and we're building Confident AI (https://confident-ai.com). This is the cloud platform for DeepEval (https://github.com/confident-ai/deepeval), our open-source package that helps engineers evaluate and unit-test LLM applications. Think Pytest for LLMs.
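For anyone new to DeepEval, a minimal test looks roughly like this (the example test case and threshold are just illustrative):

    # test_chatbot.py -- run with: deepeval test run test_chatbot.py
    from deepeval import assert_test
    from deepeval.test_case import LLMTestCase
    from deepeval.metrics import AnswerRelevancyMetric

    def test_answer_relevancy():
        test_case = LLMTestCase(
            input="What are your shipping times?",
            # replace with the output of your own LLM app
            actual_output="We ship within 2-3 business days.",
        )
        # LLM-as-a-judge metric; the test passes if the score clears the threshold
        assert_test(test_case, [AnswerRelevancyMetric(threshold=0.7)])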
We spent the past year building DeepEval with the goal of providing the best LLM evaluation developer experience, growing it to run over 600K evaluations daily in the CI/CD pipelines of enterprises like BCG, AstraZeneca, AXA, and Capgemini. But DeepEval on its own just runs evaluations and does nothing with the data afterward, which isn't the best experience. If you want to inspect failing test cases, identify regressions, or pick the best model/prompt combination, you need more than DeepEval alone. That's why we built a platform around it.
Here’s a quick demo video of how everything works: https://youtu.be/PB3ngq7x4ko
Confident AI is great for RAG pipelines, agents, and chatbots. Typical use cases include switching the underlying LLM, rewriting prompts for newer (and possibly cheaper) models, and keeping test sets in sync with the codebase where DeepEval tests are run.
Our platform features a dataset editor, a regression catcher, and iteration insights. The dataset editor lets domain experts edit datasets while keeping them in sync with the codebase where your DeepEval evaluations run. Once DeepEval finishes running evaluations on the datasets pulled from the cloud, we generate shareable LLM testing/benchmark reports. The regression catcher then flags any regressions in your new implementation, and we use these evaluation results to determine your best iteration based on metric scores.
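On the code side, that loop looks roughly like this (the dataset alias and my_llm_app are placeholders for your own):

    from deepeval import evaluate
    from deepeval.dataset import EvaluationDataset
    from deepeval.test_case import LLMTestCase
    from deepeval.metrics import AnswerRelevancyMetric

    # pull the dataset your domain experts curate in the editor
    dataset = EvaluationDataset()
    dataset.pull(alias="my-dataset")

    # generate outputs with your own LLM app for each golden in the dataset
    for golden in dataset.goldens:
        dataset.add_test_case(
            LLMTestCase(input=golden.input, actual_output=my_llm_app(golden.input))
        )

    # results are sent to Confident AI, where reports and regression checks live
    evaluate(test_cases=dataset.test_cases, metrics=[AnswerRelevancyMetric()])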
Our goal is to make benchmarking LLM applications so reliable that picking the best implementation is as simple as reading the metric values off the dashboard. To achieve this, both the quality of the curated datasets and the accuracy and reliability of the metrics need to be as high as possible.
This brings us to our current limitations. Right now, DeepEval’s primary evaluation method is LLM-as-a-judge. We use techniques such as GEval and question-answer generation to improve reliability, but these methods can still be inconsistent. Even with high-quality datasets curated by domain experts, our evaluation metrics remain the biggest blocker to our goal.
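For context, GEval lets you describe what "good" means in natural language and have a judge LLM score against it (the criteria and test case below are illustrative):

    from deepeval.metrics import GEval
    from deepeval.test_case import LLMTestCase, LLMTestCaseParams

    correctness = GEval(
        name="Correctness",
        criteria="Determine whether the actual output is factually consistent with the expected output.",
        evaluation_params=[
            LLMTestCaseParams.ACTUAL_OUTPUT,
            LLMTestCaseParams.EXPECTED_OUTPUT,
        ],
    )

    test_case = LLMTestCase(
        input="Summarize the refund policy.",
        actual_output="Refunds are issued within 30 days of purchase.",
        expected_output="Customers can get a refund within 30 days.",
    )
    correctness.measure(test_case)
    print(correctness.score, correctness.reason)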
To address this, we recently released a DAG (Directed Acyclic Graph) metric in DeepEval. It is a decision-tree-based, LLM-as-a-judge metric that provides deterministic results by breaking a test case into finer atomic units. Each edge represents a decision, each node represents an LLM evaluation step, and each leaf node returns a score. It works best in scenarios where success criteria are clearly defined, such as text summarization.
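Conceptually, the structure looks something like this (a simplified sketch to show the idea, not the actual DeepEval API):

    from dataclasses import dataclass
    from typing import Callable, Union

    # each internal node is one atomic yes/no judgement made by a judge LLM;
    # each leaf is a deterministic score
    @dataclass
    class Step:
        question: str                  # the atomic judgement to make
        if_yes: Union["Step", float]   # edge followed on a "yes" verdict
        if_no: Union["Step", float]    # edge followed on a "no" verdict

    def run(node, test_case: str, judge: Callable[[str, str], bool]) -> float:
        if isinstance(node, float):    # leaf: return the score
            return node
        verdict = judge(node.question, test_case)   # one LLM call per node
        return run(node.if_yes if verdict else node.if_no, test_case, judge)

    # summarization example: success criteria are broken into small,
    # clearly defined judgements, so each verdict is easy to get right
    summarization_dag = Step(
        question="Does the summary cover every section of the source document?",
        if_yes=Step(
            question="Does the summary avoid adding facts not in the source?",
            if_yes=1.0,   # complete and faithful
            if_no=0.5,    # complete but hallucinates details
        ),
        if_no=0.0,        # incomplete
    )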
The DAG metric is still in its early stages, but our hope is that by moving towards better, code-driven, open-source metrics, Confident AI can deliver deterministic LLM benchmarks that anyone can blindly trust.
We hope you’ll give Confident AI a try. Quickstart here: https://docs.confident-ai.com/confident-ai/confident-ai-intr...
The platform runs on a freemium tier, and for the next four days we've dropped the requirement to sign up with a work email.
Looking forward to your thoughts!
codelion 6 hours ago
The DAG feature for subjective metrics sounds really promising. I've been struggling with the same "good email" problem. Most of the existing benchmarks are too rigid for nuanced evaluations like that. Looking forward to seeing how that part of DeepEval evolves.