Bridge Sourcing: how I moved scrape accuracy from 82% to 96%
The 8 specific changes that moved a production B2B sourcing extraction pipeline from 82% to 96% field-level accuracy over three months — and the 2 changes that made it worse. The economic moat under the pipeline is Egypt's 0% EU import-tariff lane; the engineering moat is the calibration set. Calibration sets, schema forcing, deterministic validation, drift detection.
- receipts
- evals
- extraction
- production-ai
- calibration-set
- egypt-eu-trade
LLM extraction looks solved after a weekend and gets ugly on the thousandth page. Bridge Sourcing's discovery agent scrapes Egyptian-supplier directories and individual company websites to extract ~20 structured fields per supplier — name, product categories, certifications, email, phone, WhatsApp, factory size, export experience, languages, and so on. The output feeds a qualification agent that scores each supplier against a rubric and decides whether they go into the EU buyer-matching pool.
This is the receipt for eight specific changes that moved field-level extraction accuracy from 82% to 96% over three months — and the two changes I tried that made it worse. None of the eight was a single clever move. The biggest multiplier was a calibration set, which on its own contributed 0pp directly but unlocked every change after it.
00 — Why 82% wasn't good enough
Bridge Sourcing connects EU buyers to Egyptian suppliers along Egypt's 0% EU import-tariff lane under the EU-Egypt Association Agreement. The economic moat under the AI pipeline is the trade route, not the extraction model — most categories ship into the EU at zero duty if the paperwork is right and the supplier qualifies. The pipeline exists to make that lane navigable for buyers who otherwise default to Turkey or Vietnam at 6–12% landed-cost penalty. So the engineering-discipline question — how accurate is the extractor — is downstream of an actual business question: do EU buyers trust the leads enough to ship a first PO?
At 82% field-level accuracy, a supplier record had on average 3.6 wrong fields out of 20. That doesn't sound catastrophic. It was, because bad data cascaded:
- The qualification rubric weights certifications at 20%. A missed OEKO-TEX certification could drop a legitimate textile supplier from 80 → 60 and discard them from the matching pool.
- Wrong email addresses caused bounces. Bounces damaged the sending domain's reputation. Within weeks the outreach agent's inbox-placement rate started falling.
- Wrong category assignments poisoned the buyer-supplier matching. Buyers were shown irrelevant leads, lost trust, churned.
82% at the top of the pipeline became roughly 65% useful leads at the bottom. The fix had to happen at the extraction layer, not the qualification layer. You can't qualify your way out of bad input.
// accuracy lift per change
The Bridge Sourcing case study tells the same 82→96% journey through the engineering lens — the literal model switches, the HTML pre-cleaning, the country-specific parsers. This essay tells it through the discipline lens — calibration set as multiplier, deterministic validation outside the model, human-review as ceiling-acceptance. Both are true; the case study is what an engineer sees, this essay is what changed about how I work.
01 — Schema-forced output, not "return a JSON object"
The first version of the extraction prompt said "return a JSON object with these fields: name, category, certifications, email, …" and trusted the model to do it. The model mostly did. It also occasionally:
- Returned an extra field the prompt didn't ask for
- Returned a field as an array when I expected a string (or vice versa)
- Skipped a field entirely if it couldn't find data
- Returned null vs empty string vs absent key, inconsistently
Each of these broke downstream parsing roughly 2–3% of the time. Fix: use the provider's strict structured-output mode with a hand-written JSON schema. Every field becomes required; the model either fills it with real data or explicitly marks it "unknown". The downstream parser never has to guess.
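A minimal sketch of the shape of it, using an OpenAI-style structured-output call. The schema is abbreviated to four fields and the model name is a placeholder, not the one the pipeline actually routes to:

```python
import json
from openai import OpenAI

client = OpenAI()

# Abbreviated schema; the production version covers all ~20 supplier fields.
SUPPLIER_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
        "certifications": {"type": "array", "items": {"type": "string"}},
        "factory_size": {"type": "string"},
    },
    "required": ["name", "email", "certifications", "factory_size"],
    "additionalProperties": False,
}

def extract_supplier(page_text: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-5.4-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": 'Extract supplier fields. Use "unknown" for anything the page does not state.'},
            {"role": "user", "content": page_text},
        ],
        response_format={
            "type": "json_schema",
            "json_schema": {"name": "supplier", "strict": True, "schema": SUPPLIER_SCHEMA},
        },
    )
    # Strict mode guarantees the response parses and matches the schema.
    return json.loads(response.choices[0].message.content)
```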
Impact: 82% → 85%.
02 — Two-pass extraction for high-value fields
The accuracy-critical fields were email and certifications — the ones that poisoned the pipeline worst when wrong. I split extraction into two passes:
- Pass 1: extract all 20 fields with the main prompt. This is the cheap pass.
- Pass 2: for email and certifications only, run a second, narrower prompt that looks at only the raw HTML regions where that data typically lives (contact pages, footer, certificate sections). Expensive per call, but only runs on 2 fields.
If the two passes disagree, the narrower pass wins. Disagreements happened ~8% of the time; the narrow pass was right in 85% of those.
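The reconciliation step is a few lines. A sketch, where merge_passes and the field list are my names, not the repo's:

```python
HIGH_VALUE_FIELDS = ("email", "certifications")

def merge_passes(pass1: dict, pass2: dict) -> dict:
    """Overlay the narrow pass-2 result onto the broad pass-1 record."""
    merged = dict(pass1)
    for field in HIGH_VALUE_FIELDS:
        if field in pass2 and pass2[field] != pass1.get(field):
            # On disagreement, the narrower region-focused pass wins.
            merged[field] = pass2[field]
    return merged
```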
Impact: 85% → 88%.
03 — Deterministic validation before the LLM sees anything
Some fields have strict formats. Emails, phone numbers, tax IDs, country codes. I stopped asking the LLM to validate these and moved validation to deterministic code after extraction:
```python
import re

def validate_email(raw):
    if not raw or raw == "unknown":
        return None
    m = re.match(r"^[^\s@]+@[^\s@]+\.[^\s@]+$", raw)
    if not m:
        return None  # silently drop invalid
    return raw.lower()

def validate_phone(raw, country_hint=None):
    # normalize to E.164 using libphonenumber
    ...
```

Bad emails and phone numbers get dropped silently at the extraction-to-record step. The downstream agent sees None instead of garbage, and its scoring rubric handles None correctly (treats it as "no signal").
This sounds like it would hurt accuracy — you're throwing away data — but measured against ground truth, "silently dropped" was always better than "wrong value." Wrong values failed loudly downstream; missing values failed gracefully.
Impact: 88% → 90%.
04 — The calibration set: 0pp directly, multiplier on everything
This is the change that made every subsequent change measurable. I hand-labeled 100 supplier pages — real HTML snapshots from the crawl queue — with the correct 20-field output for each. These became the calibration set.
Every deploy now runs the full extraction pipeline against the calibration set, compares field-by-field, and writes a per-field accuracy report. A regression of >1% on any field blocks the deploy.
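A sketch of that gate, assuming the labeled pages live as JSON files on disk and run_extraction is the pipeline entry point (both are my stand-ins):

```python
import json
import sys
from collections import defaultdict
from pathlib import Path

REGRESSION_THRESHOLD = 0.01  # a >1% drop on any field blocks the deploy

def calibration_gate(calibration_dir: Path, baseline: dict) -> dict:
    correct, total = defaultdict(int), defaultdict(int)

    for case_file in sorted(calibration_dir.glob("*.json")):
        case = json.loads(case_file.read_text())
        predicted = run_extraction(case["html"])  # stand-in for the real pipeline call
        for field, truth in case["expected"].items():
            total[field] += 1
            correct[field] += int(predicted.get(field) == truth)

    accuracy = {field: correct[field] / total[field] for field in total}
    for field, acc in accuracy.items():
        if baseline.get(field, 0.0) - acc > REGRESSION_THRESHOLD:
            print(f"REGRESSION: {field} {baseline[field]:.1%} -> {acc:.1%}")
            sys.exit(1)  # non-zero exit fails the deploy step in CI
    return accuracy
```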
I had been testing "does the pipeline produce valid JSON?" That's a unit test, not an evaluation. Real evaluation is did the pipeline produce the same answer a human would have produced on real input. You can't do that without a labeled set.
The calibration set itself didn't improve accuracy. What it did was make the next four improvements possible, because I could measure them. Impact: 0pp directly, multiplier on everything after.
05 — Per-field error analysis and targeted prompt fixes
With the calibration set running, I could see which fields were failing most often. The distribution was not uniform. Two fields accounted for ~60% of all errors:
- Factory size (in employees). The model was guessing when the page said nothing about it. Fix: add an explicit "if the page doesn't state this, return unknown — do not estimate" instruction in the prompt.
- Export experience. The model was over-claiming. Any mention of "international" or "worldwide" was being read as "has EU customers." Fix: reframe the prompt to ask for specific evidence (named customers, case studies, certifications) not general claims.
Instead of rewriting the whole prompt, I added field-specific instructions for the five worst-performing fields. Each instruction was ~30 tokens. Total prompt growth was minimal; accuracy lift was substantial.
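In code, that is a small mapping appended to the base prompt rather than a rewrite. The wording below is illustrative, not the production text:

```python
# Field-specific rules for the worst offenders, appended after the schema section.
FIELD_RULES = {
    "factory_size": (
        "Only report factory size if the page states it explicitly. "
        'If it is not stated, return "unknown"; do not estimate.'
    ),
    "export_experience": (
        "Claim EU export experience only with specific evidence: named customers, "
        'case studies, or export certifications. "International" or "worldwide" alone is not evidence.'
    ),
}

def build_prompt(base_prompt: str) -> str:
    rules = "\n".join(f"- {field}: {rule}" for field, rule in FIELD_RULES.items())
    return f"{base_prompt}\n\nField-specific rules:\n{rules}"
```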
Impact: 90% → 93%.
06 — Retry on confidence, not on format
I asked the model to return a confidence field alongside each extraction (0–1 scale). On its own, confidence scores from LLMs are mostly vibes — but calibrated against the calibration set, they correlated meaningfully with actual accuracy. Below 0.6 confidence, the extraction was wrong ~40% of the time.
I added a retry layer: any field with confidence below 0.7 gets re-extracted with a different prompt variant (slightly rephrased, same schema). If the two attempts agree, use the answer. If they disagree, use the higher-confidence one. If both are low-confidence, flag for human review.
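A sketch of the retry path, with the thresholds from above; extract_field is a stand-in for the single-field extraction call and returns a value plus the model-reported confidence:

```python
CONFIDENCE_THRESHOLD = 0.7

def extract_with_retry(field: str, html: str):
    """Returns (value, needs_human_review)."""
    first = extract_field(field, html, prompt_variant="primary")     # stand-in call
    if first["confidence"] >= CONFIDENCE_THRESHOLD:
        return first["value"], False

    second = extract_field(field, html, prompt_variant="rephrased")  # same schema, reworded prompt
    if first["value"] == second["value"]:
        return first["value"], False                                 # agreement: accept

    best = max(first, second, key=lambda r: r["confidence"])
    if best["confidence"] >= CONFIDENCE_THRESHOLD:
        return best["value"], False                                  # disagreement: higher confidence wins
    return None, True                                                # both low: flag for human review
```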
Impact: 93% → 94%.
07 — Fall back to human review on the hardest 5%
Some supplier pages are just hard. Arabic-only, no structured markup, inconsistent layout, image-heavy with text inside images. No amount of prompt engineering was going to get them past ~85%.
I accepted the ceiling and built a human review queue. Any record with below 0.8 average confidence across fields gets flagged and a human (one of our part-time sourcing analysts in Cairo) reviews it in a minimal UI. The review UI shows the scraped HTML alongside the extracted record and lets them correct fields inline. Corrections feed back into the calibration set.
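The routing rule itself is one threshold check. A sketch, with the queue, pool, and calibration-set writes reduced to stand-in helpers:

```python
REVIEW_THRESHOLD = 0.8

def route_record(record: dict, confidences: dict) -> None:
    avg_conf = sum(confidences.values()) / len(confidences)
    if avg_conf < REVIEW_THRESHOLD:
        review_queue.push(record)      # stand-in: surfaces the record in the analysts' review UI
    else:
        matching_pool.add(record)      # trusted automatically

def apply_correction(record_id: str, corrected_fields: dict) -> None:
    # Human corrections become new labeled examples for the calibration set.
    calibration_set.add_example(record_id, corrected_fields)  # stand-in helper
```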
This isn't "the LLM gets better." It's "we stopped pretending the LLM had to handle 100% of cases." About 5% of scrapes go to humans. The rest we trust.
Impact on automated accuracy: 94% → 94% (unchanged — automation didn't improve). Impact on delivered data quality: 94% → 98%. The 96% headline number is the blended rate.
08 — Drift detection with weekly calibration runs
The last change was preventive. Web pages change. LLM providers update their models. Either can silently break extraction. I set up a weekly cron that re-runs the full calibration set, stores the per-field accuracy numbers, and alerts if anything dropped by >2%.
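A compressed sketch of that cron; here the history is a flat JSON file, and run_calibration_set / alert are stand-ins for the real report and notification hooks:

```python
import json
from datetime import date
from pathlib import Path

DRIFT_THRESHOLD = 0.02  # alert on a >2% drop on any field
HISTORY_FILE = Path("calibration_history.json")

def weekly_drift_check() -> None:
    history = json.loads(HISTORY_FILE.read_text()) if HISTORY_FILE.exists() else {}
    current = run_calibration_set()                       # stand-in: per-field accuracy dict
    previous = history[max(history)] if history else {}   # ISO date keys sort chronologically

    for field, acc in current.items():
        if previous.get(field, 0.0) - acc > DRIFT_THRESHOLD:
            alert(f"Drift on {field}: {previous[field]:.1%} -> {acc:.1%}")  # stand-in notification

    history[str(date.today())] = current
    HISTORY_FILE.write_text(json.dumps(history, indent=2))
```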
This caught a real regression in month four when OpenAI updated their default model behavior and the factory_size field accuracy dropped from 94% to 87% overnight. Without drift detection, I wouldn't have noticed for weeks.
The cron is 60 lines of Python. It has paid for itself twice already.
Impact: floor-setting, not ceiling-raising.
09 — Two things that made it worse
For completeness, two changes I tried that gave back less than they cost:
10 — What I'd do differently if I started today
The original build ran across late 2025 / early 2026. If I were starting from a blank repo today, three things in the 2026 toolchain would collapse some of the work:
- Anthropic prompt caching is GA, with cache_control on content blocks. The static portion of the extraction prompt — the schema, the field-specific instructions, the rules about "unknown" — is ~1,800 tokens. Today that ships once and gets cached at 90% off on repeat reads (sketch after this list). The compression work in change 05 still matters (a smaller cached block is still cheaper than a larger one), but the marginal cost of each call against the same schema drops near zero.
- Structured-output APIs are stricter and faster than they were in late 2025. Claude Sonnet 4.6 and GPT-5.4 both honor JSON Schema with strict: true more reliably than the November 2025 generation. Change 01 today is one line of API config instead of a bug-fix loop.
- Better default routing. Claude Haiku 4.5 became viable for narrow extraction at significantly lower cost than GPT-5.4-mini. A 2026-fresh version of this pipeline would A/B Haiku for pass 1 and only fall back to mini on the 5% of pages that fail confidence thresholds.
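For the first point, the mechanism is a cache_control marker on the static block of the prompt. A sketch against the Anthropic Python SDK; the model ID and prompt contents are illustrative:

```python
import anthropic

client = anthropic.Anthropic()

# The ~1,800-token static block: schema, field-specific rules, the "unknown" conventions.
STATIC_PROMPT = "...schema and extraction rules go here..."

def extract(page_text: str) -> str:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model ID
        max_tokens=2048,
        system=[
            {
                "type": "text",
                "text": STATIC_PROMPT,
                # Cache the static block; repeat reads against it bill at the cached rate.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        messages=[{"role": "user", "content": page_text}],
    )
    return response.content[0].text
```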
What would not change:
- The calibration set is still the lever. Provider improvements push the floor up; nothing pushes it up without measurement.
- Deterministic validation outside the model is still the right place for emails, phones, and country codes. Models will always occasionally hallucinate well-formatted-but-wrong values; regex doesn't.
- Human review on the hardest 5% is still the right ceiling-acceptance pattern. The shape of "some pages are just hard" doesn't change with a better model.
11 — The real lesson
The jump from 82% to 96% was 14 percentage points from 8 small engineering changes, not one heroic move. Most of the changes were the result of treating the extraction pipeline like any other production system: instrument it, measure it, error-analyze it, fix the worst thing first, repeat.
The single biggest multiplier was the calibration set (change 04). Without it, every other change was a guess. With it, every change became measurable, and "good enough to deploy" became a real threshold instead of a feeling.
If you're building an LLM extraction pipeline and you don't have a labeled calibration set, that's the next thing you should build. Everything else is downstream of that. The cost-cut sister piece — how I cut our LLM bill 28% without changing models — is the same mindset applied to a different metric. Both essays describe a discipline more than a technique.
If you're staring at an extraction pipeline that "works in the demo" and would like a second pair of eyes before it ships, the audit sprint is one week, fixed scope, and ends with a remediation plan grounded in a calibration set built specifically for your data.
// next move
Want a written architecture brief on your AI stack?
1 week, $1,500, fixed scope. Working prototype of one change in your stack — yours to keep regardless.
// related essays
- Egypt-to-EU senior AI engineering — the 2026 thesis, not the 2024 cheap-outsourcing pitch (2 arbitrages stacked · time zone + senior pricing). Two arbitrages stacked on one geography — time zone and senior-tier pricing. The 2024 framing of "outsource cheap engineering to Egypt" destroys both. The 2026 framing is single-engineer coverage of EU mornings AND US East afternoons at 60–70% of London rates, IF you filter correctly. ~8 min · 2026-04-30
- How I cut our LLM bill 28% without changing models (−28% aggregate monthly LLM cost · same primary models). Six specific moves that took 28% off the cost curve at NeuraScale across six products — without downgrading the primary models. Routing, semantic caching, prompt compression, structured output, batching, gatekeeping. Plus a 2026 update on what prompt caching becoming GA changed. ~9 min · 2026-04-30