May 20267 min read

Where Prediction Markets Systematically Fail

19,858 resolved Polymarket markets. A 32 percentage-point structural framing bias. An isotonic calibrator that beats raw market on Brier by an order of magnitude, out of sample. And a live scanner sitting on top of all of it.

CalibrationPrediction MarketsData SciencePolymarket

Watch the demo

The question I asked: how well-calibrated are prediction markets, where do they systematically fail, and can I use what I learn to spot mispricing on currently-open markets right now?

What I found. Three things, all from 19,858 resolved Polymarket markets analyzed end-to-end in one Zerve canvas.

Polymarket Decoded scanner demo thumbnail showing the calibrated mispricing dashboard. — Polymarket Decoded: 19,858 resolved markets, one isotonic calibrator, 908 live signals.

Why I built it

Prediction markets keep getting cited as oracles, by journalists, by traders, by policymakers. The usual defense is that money makes them honest. I wanted to actually test that with the full resolved history, not a curated subsample, and see whether the residual error has structure I could exploit on currently-open markets.

Finding 1: A 32-point structural framing bias

Questions phrased "Will X happen?" resolve YES 29% of the time (n = 11,149). Questions phrased "Will X NOT happen?" resolve YES 61% (n = 31). t = −3.93, p = 8.4×10⁻⁵, Cohen's h = 0.66, medium effect. A five-keyword regex on the question text reveals a structural skew nobody had surfaced this clearly.

Finding 2: Settlement prices aren't predictions

So I pulled the YES-token mid price 24 hours before close from Polymarket's CLOB API and fit an isotonic regression calibrator on real pre-resolution prices. Chronologically out-of-sample: train on the older 70%, test on the newer 30% the model never saw. The calibrator achieves Brier 0.0009 versus raw market 0.0094, plus 0.0242 better ECE. It generalizes.

Brier 0.0009 vs 0.0094 raw. One order of magnitude, on data the model never touched at training time.

Finding 3: Applied live

The calibrator currently scores 908 open Polymarket markets, totaling $10,529 in portfolio Expected Value across all flagged signals. Backtest validation: 32 trades, 91% win rate, +27% ROI. A Bring-Your-Own-Market feature scores any Polymarket URL on demand.

How I built it

The whole pipeline lives in one Zerve canvas, from raw API pulls to the deployed Streamlit front-end. I hit the Polymarket Gamma API for the resolved-market metadata and the CLOB API for the 24-hour-pre-close YES mid prices, parked everything in DuckDB so I could run SQL over the full 19,858-row corpus without paging, and used pandas plus NumPy for the framing-bias regex and the per-bucket stats.

Calibration: scikit-learn isotonic regression, fit on the older 70% of resolved markets, evaluated on the newer 30%.
Stats: SciPy for the two-sample t-test and Cohen's h on the framing buckets.
Scoring metrics: Brier score and ECE computed manually so I could log per-bucket residuals.
Serving: FastAPI for the Bring-Your-Own-Market endpoint, Streamlit for the live scanner UI, Plotly for the calibration plots.

What was hard

Two things. First, true out-of-sample evaluation: the temptation is to k-fold split, but resolved markets are time-ordered and resolution mechanics drift, so a random split would leak. I forced a chronological 70/30 cut and only then trusted the Brier delta. Second, the Bring-Your-Own-Market endpoint had to resolve any pasted Polymarket URL to the right CLOB token mid price in under a second, which meant memoizing the Gamma slug-to-condition-id lookup and pre-warming a DuckDB view on the open universe.

Prediction markets are increasingly cited as oracles by journalists, traders, and policymakers. If they have structural biases, those biases shape decisions.

What I learned

An off-the-shelf isotonic calibrator, trained on truly out-of-sample data, beats the market by an order of magnitude on Brier. That is either real edge sitting in plain sight, or the resolution mechanism is doing something the prices do not reflect. Both are worth understanding, and neither shows up if you only look at top-line accuracy.

Why it matters

This project surfaces one bias (the framing effect), demonstrates that calibration adjustments generalize out-of-sample, and ships a live tool anyone can use today. The interesting part isn't the win rate, it's the Brier gap on data the model never saw. If oracles get cited in news cycles, the structural skew in how their questions are written deserves a name and a number.

What's next

Add category-conditioned calibrators (politics, sports, crypto) instead of one global isotonic fit.
Track the framing-bias effect over time to see if Polymarket's question authors correct it once it's named.
Expose the Bring-Your-Own-Market endpoint as a public FastAPI route with a per-IP rate limit so researchers can score URLs without spinning up the Streamlit app.
Replay the 32-trade backtest as a rolling forward-walk and report a Sharpe-style risk-adjusted edge, not just ROI.

Related project

Polymarket Decoded: Live Mispricing Scanner

View the project

A Film Festival Where AI Models Compete as Directors

One sentence, six imaginary directors, six different model stacks. The canvas itself is the artifact, not just the final video. Built in one shot across 90 nodes and 85 edges via MCP-driven scaffolding.

May 20266 min read

Five MCP Tools That See a Pregnancy as One Clinical Unit

Winner of Prompt Opinion's Agents Assemble: The Healthcare AI Endgame, out of 4,335 participants. US maternal mortality keeps rising and over 80% of pregnancy-related deaths are preventable, but the predictive signals live across completely different parts of the chart. I built an MCP server that aggregates them, and a Triage agent that delegates to it.

Apr. 2026 - May 202610 min read

Why I Built an AI Exoplanet Classifier That Says 'I Don't Know'

An XGBoost classifier on NASA Kepler, K2, and TESS data is a Saturday afternoon. The interesting part is everything around it: three real classes (CONFIRMED, CANDIDATE, FALSE POSITIVE) plus an UNKNOWN abstention bucket, dataset upload, in-browser retraining with hyperparameter tuning, 2D exploration, and 3D system views.