What is Real Problem AI?

Real Problem AI is a curated directory of 100 real problems worth building an AI startup around. Each problem is scored honestly on severity, AI feasibility, market signal, and competition gap. We're for builders looking for what to build, not for ethics debates about AI.

What is the difference between Real Problem AI and AI ethics or AI risks indexes?

Real Problem AI catalogues problems AI can solve: productive, commercial opportunities for builders. AI ethics indexes catalogue problems with AI itself (bias, environmental cost, copyright, consciousness). The intent and the audience differ: founders and indie hackers versus researchers and policy folks.

What should I build with AI in 2026?

Look for problems with high severity (it costs people hours, sleep, or revenue every week), high AI feasibility (LLMs, vision or voice are great at this today), strong market signal (Reddit threads + willingness to pay), and a clear competition gap (incumbents miss the same obvious thing). Real Problem AI scores 100 such problems for you on those four axes.

Where do Real Problem AI's startup ideas come from?

From three sources, scanned each cycle: 1) Reddit threads where people describe a real friction in their own words, 2) podcasts and talks by founders that young builders trust (Nikhil Kamath, Aman Gupta, Ankur Warikoo, Raj Shamani, Andrej Karpathy, Greg Isenberg), and 3) a continuous scan of 1,000+ sources covering forums, app store reviews, regulator filings, and field interviews.

Is Real Problem AI free?

Yes. Browsing every problem, opening every score breakdown, raising your hand on a co-founder match, and using the founder vault are all free. There is no paywall.

How is each problem scored?

Four axes, each scored 1-10 and weighted: Problem Severity (30%), AI Feasibility (25%), Market Signal (25%), Competition Gap (20%). The weighted average is the Opportunity Score shown on each card. Only problems scoring 7.0 or higher make the live list.
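
The weighting above is simple enough to check by hand. Here is a minimal Python sketch of it, not the site's actual code: the function and key names are made up for illustration, and rounding to one decimal is an assumption based on how scores appear on the cards. The example values are the four axis scores listed for the agent-verification problem.

```python
# Weights as stated in the rubric above.
WEIGHTS = {
    "severity": 0.30,
    "ai_feasibility": 0.25,
    "market_signal": 0.25,
    "competition_gap": 0.20,
}
LIVE_LIST_THRESHOLD = 7.0  # problems scoring below this are not listed


def opportunity_score(scores: dict[str, float]) -> float:
    """Weighted average of the four axis scores, rounded to one decimal.

    Rounding to one decimal is an assumption made for display purposes.
    """
    if set(scores) != set(WEIGHTS):
        raise ValueError(f"expected exactly these axes: {sorted(WEIGHTS)}")
    for axis, value in scores.items():
        if not 1 <= value <= 10:
            raise ValueError(f"{axis} must be between 1 and 10, got {value}")
    return round(sum(WEIGHTS[axis] * scores[axis] for axis in WEIGHTS), 1)


def makes_live_list(scores: dict[str, float]) -> bool:
    """True when the Opportunity Score reaches the live-list threshold."""
    return opportunity_score(scores) >= LIVE_LIST_THRESHOLD


# Axis scores listed for the agent-verification problem.
agent_problem = {
    "severity": 9,
    "ai_feasibility": 9,
    "market_signal": 9,
    "competition_gap": 7,
}
print(opportunity_score(agent_problem))  # 8.6
```

By hand: 9 x 0.30 + 9 x 0.25 + 9 x 0.25 + 7 x 0.20 = 2.7 + 2.25 + 2.25 + 1.4 = 8.6.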

Why can't I tell if my AI agent is actually doing what it said it did?

Teams ship agents that call tools, hit production data, and report back in natural language. Trace logs are JSON soup. There is no per-run verdict on whether the agent did the right thing, just token counts and latency.

Category: Others · Trend: LLM · Opportunity score: 8.6 / 10

Who has this problem?

Engineering leads at startups running production LLM agents on Anthropic, OpenAI, or open-weights models, and tracing them in Langfuse, Braintrust, or Arize.

Evidence this problem is real

“I have 200K traces a day. I can tell you the p95 latency. I cannot tell you what fraction of refund agents actually issued a correct refund.”

Sourced from r/LocalLLaMA, Hacker News "who is debugging agents in prod" threads, and the Langfuse and Braintrust Discord support channels.

Existing players in this space

Langfuse: Strong tracing, weak per-run correctness verdict
Braintrust: Great eval harness, runs offline, not on live traffic
Arize Phoenix: Observability, no agent-task scorer out of the box
LangSmith: Tied to LangChain, scoring is BYO

What existing players are missing

A drop-in agent scorer that reads your trace, infers the user goal, replays the tool outputs against an LLM judge configured with your policy, and emits a pass/fail verdict with a one-line reason. Results are bucketed by tool, by user segment, and by agent version. Spend $0.002 per run and save your team from shipping a regression.
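
To make the shape of such a scorer concrete, here is a rough Python sketch. Everything in it is hypothetical: the `Trace` and `Verdict` types, the `issue_refund` tool name, the 100-unit refund limit, and the rule-based judge that stands in for a real LLM call. It only illustrates the flow of goal, tool outputs, policy check, and per-run verdict, bucketed here by agent version.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class ToolCall:
    tool: str     # hypothetical tool name, e.g. "issue_refund"
    output: dict  # tool output as recorded in the trace


@dataclass
class Trace:
    run_id: str
    agent_version: str
    first_user_turn: str
    tool_calls: list[ToolCall]


@dataclass
class Verdict:
    run_id: str
    agent_version: str
    passed: bool
    reason: str  # one-line explanation


# A judge takes (goal, tool call) and returns (passed, one-line reason).
Judge = Callable[[str, ToolCall], tuple[bool, str]]


def infer_goal(trace: Trace) -> str:
    """Placeholder: a real scorer would ask an LLM to summarise the goal."""
    return trace.first_user_turn.strip()


def score_run(trace: Trace, judge: Judge) -> Verdict:
    """Replay recorded tool outputs through the judge; fail on the first violation."""
    goal = infer_goal(trace)
    for call in trace.tool_calls:
        passed, reason = judge(goal, call)
        if not passed:
            return Verdict(trace.run_id, trace.agent_version, False, f"{call.tool}: {reason}")
    return Verdict(trace.run_id, trace.agent_version, True, "all tool calls satisfied the policy")


def pass_rate_by_version(verdicts: list[Verdict]) -> dict[str, float]:
    """Bucket verdicts by agent version and return the pass rate per bucket."""
    totals = defaultdict(lambda: [0, 0])  # version -> [passed, total]
    for v in verdicts:
        totals[v.agent_version][0] += int(v.passed)
        totals[v.agent_version][1] += 1
    return {version: p / n for version, (p, n) in totals.items()}


def refund_policy_judge(goal: str, call: ToolCall) -> tuple[bool, str]:
    """Deterministic stand-in for an LLM judge, with a made-up refund policy."""
    if call.tool != "issue_refund":
        return True, "no policy defined for this tool"
    if call.output.get("status") != "ok":
        return False, "refund call did not succeed"
    if call.output.get("amount", 0) > 100:
        return False, "refund exceeds the policy limit of 100"
    return True, "refund succeeded within the policy limit"
```

Keeping the judge behind a plain callable means the rule-based stand-in can be swapped for an LLM call without touching the rest of the pipeline. Bucketing by tool or user segment would follow the same pattern as `pass_rate_by_version`.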

How Real Problem AI scores this opportunity

Aggregate score: 8.6 / 10. Four-axis rubric:

Problem severity: 9 / 10
AI feasibility today: 9 / 10
Market signal: 9 / 10
Competition gap: 7 / 10

How to build a solution: stack hints

OpenTelemetry trace ingest with tool-call schema awareness
Goal-inference LLM pass over the first user turn
Policy DSL for what counts as success per tool
LLM-judge plus ground-truth replay on sampled traces
Regression-detection dashboard per agent version
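
As one illustration of the last hint, per-version regression detection can start as a comparison of pass rates between consecutive agent versions. The sketch below is an assumption-heavy starting point, not a prescribed design: the `VersionStats` type, the 5-point drop threshold, and the 50-run minimum sample are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    version: str
    passed: int  # sampled runs the judge marked as pass
    total: int   # sampled runs scored for this version

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total if self.total else 0.0


def detect_regressions(
    history: list[VersionStats],
    max_drop: float = 0.05,  # assumed tolerance: 5 percentage points
    min_runs: int = 50,      # assumed minimum sample before comparing
) -> list[tuple[str, float]]:
    """Flag versions whose pass rate fell more than max_drop below the
    previous version that had enough scored runs.

    `history` must be in release order. Returns (version, drop) pairs.
    """
    flagged = []
    baseline = None
    for stats in history:
        if stats.total < min_runs:
            continue  # too few scored runs to say anything useful
        if baseline is not None:
            drop = baseline.pass_rate - stats.pass_rate
            if drop > max_drop:
                flagged.append((stats.version, round(drop, 3)))
        baseline = stats
    return flagged


history = [
    VersionStats("v1", passed=95, total=100),
    VersionStats("v2", passed=96, total=100),
    VersionStats("v3", passed=80, total=100),  # large drop against v2
    VersionStats("v4", passed=5, total=10),    # skipped: below min_runs
]
for version, drop in detect_regressions(history):
    print(f"{version}: pass rate dropped by {drop:.0%}")
```

A fixed threshold is the simplest possible rule. With small samples, a statistical test on the two pass rates would give fewer false alarms.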