Bridge Sourcing: how I moved scrape accuracy from 82% to 96%
Scraping supplier data with LLM extraction is one of those problems that looks solved after a weekend and gets ugly on the thousandth page. Here are the eight specific changes that moved a production pipeline from 82% accuracy to 96% over three months — and the two I tried that made it worse.
00 Why 82% wasn't good enough
Bridge Sourcing connects EU buyers to Egyptian suppliers. The discovery agent scrapes supplier directories and individual company websites to extract ~20 structured fields per supplier: name, product categories, certifications, email, phone, WhatsApp, factory size, export experience, languages, and so on. The output feeds a qualification agent that scores the supplier against a rubric.
At 82% field-level accuracy, a supplier record had on average 3.6 wrong fields out of 20. Doesn't sound catastrophic. It was. Here's why:
- The qualification rubric weights certifications at 20%. A missed OEKO-TEX certification could drop a legitimate textile supplier from 80 → 60 and discard them.
- Wrong email addresses caused bounces. Bounces damaged the sending domain's reputation. Within weeks the outreach agent's inbox placement was falling.
- Wrong category assignments poisoned the buyer-supplier matching. Buyers were shown irrelevant leads.
In short: bad data cascaded. 82% at the top of the pipeline became ~65% useful leads at the bottom. The fix had to happen at the extraction layer.
01 Schema-forced output, not "return a JSON object"
The first version of the extraction prompt said "return a JSON object with these fields: name, category, certifications, email, …" and trusted the model to do it. The model mostly did, but it occasionally:
- Returned an extra field the prompt didn't ask for
- Returned a field as an array when I expected a string (or vice versa)
- Skipped a field entirely if it couldn't find data
- Returned null vs. empty string vs. an absent key, inconsistently
Each of these broke downstream parsing roughly 2–3% of the time. Fix: use the provider's strict structured-output mode with a hand-written JSON schema. Every field becomes required; the model either fills it with real data or explicitly marks it "unknown". The downstream parser never has to guess.
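A minimal sketch of what strict schema mode buys you. The schema fragment below is illustrative (three of the ~20 fields, names hypothetical), and the checker mirrors the guarantees locally rather than calling any provider API: every field required, no extras, correct types, "unknown" instead of null or a missing key.

```python
# Illustrative fragment of the hand-written JSON schema (3 of ~20 fields).
SUPPLIER_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "email": {"type": "string"},
        "certifications": {"type": "array", "items": {"type": "string"}},
    },
    "required": ["name", "email", "certifications"],
    "additionalProperties": False,
}

def check_record(record, schema=SUPPLIER_SCHEMA):
    """Local sanity check mirroring what strict structured-output mode
    guarantees: every field present, correct type, no extra fields.
    Returns None if the record conforms, else a description of the problem."""
    py_types = {"string": str, "array": list}
    for field in schema["required"]:
        if field not in record:
            return f"missing field: {field}"
    for field, value in record.items():
        if field not in schema["properties"]:
            return f"unexpected field: {field}"
        expected = py_types[schema["properties"][field]["type"]]
        if not isinstance(value, expected):
            return f"wrong type for {field}"
    return None
```

With strict mode on, the provider enforces this shape at generation time; the local check is only a belt-and-suspenders guard at the parsing boundary.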
Impact: 82% → 85%.
02 Two-pass extraction for high-value fields
The most accuracy-critical fields were email and certifications — the ones that poisoned the pipeline worst when wrong. I split the extraction into two passes:
- Pass 1: extract all 20 fields with the main prompt. This is the cheap pass.
- Pass 2: for email and certifications only, run a second, narrower prompt that looks at only the raw HTML regions where that data typically lives (contact pages, footer, certificate sections). This pass is expensive per call but only runs on 2 fields.
If the two passes disagree, the narrower pass wins. Disagreements happened ~8% of the time; the narrow pass was right in 85% of those.
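The merge rule is simple enough to sketch directly. This is an assumption about the shape of the two pass outputs (flat dicts), not the production code:

```python
def merge_passes(broad, narrow, narrow_fields=("email", "certifications")):
    """Merge the cheap broad pass with the targeted narrow pass.
    On disagreement over a high-value field, the narrow pass wins.
    Returns the merged record and the list of fields that disagreed."""
    merged = dict(broad)
    disagreements = []
    for field in narrow_fields:
        if field in narrow and narrow[field] != broad.get(field):
            disagreements.append(field)
            merged[field] = narrow[field]
    return merged, disagreements
```

Tracking the disagreement list is worth the extra return value: it is what lets you measure the ~8% disagreement rate in the first place.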
Impact: 85% → 88%.
03 Deterministic validation before the LLM sees anything
Some fields have strict formats. Emails, phone numbers, tax IDs, country codes. I stopped asking the LLM to validate these and moved the validation to deterministic code after extraction:
```python
import re
import phonenumbers  # Python port of libphonenumber

def validate_email(raw):
    if not raw or raw == "unknown":
        return None
    if not re.match(r"^[^\s@]+@[^\s@]+\.[^\s@]+$", raw):
        return None  # silently drop invalid
    return raw.lower()

def validate_phone(raw, country_hint=None):
    # normalize to E.164 using libphonenumber
    try:
        parsed = phonenumbers.parse(raw or "", country_hint)
    except phonenumbers.NumberParseException:
        return None  # silently drop invalid (including "unknown")
    if not phonenumbers.is_valid_number(parsed):
        return None
    return phonenumbers.format_number(parsed, phonenumbers.PhoneNumberFormat.E164)
```
Bad emails and phone numbers get dropped silently at the extraction-to-record step. The downstream agent sees None instead of garbage, and its scoring rubric handles None correctly (treats it as "no signal").
This sounds like it would hurt accuracy — you're throwing away data — but measured against the ground truth, "silently dropped" was always better than "wrong value." Wrong values failed loudly downstream; missing values failed gracefully.
Impact: 88% → 90%.
04 Build a calibration set. Check against it every deploy.
This is the change that made every subsequent change measurable. I hand-labeled 100 supplier pages — real HTML snapshots from our crawl queue — with the correct 20-field output for each. These became the calibration set.
Every deploy runs the full extraction pipeline against the calibration set, compares field-by-field, and writes a per-field accuracy report. A regression of >1% on any field blocks the deploy.
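The comparison itself is small. A sketch of the per-field report and the deploy gate, assuming predictions and labels are parallel lists of flat dicts (field names are illustrative):

```python
def field_accuracy(predictions, labels, fields):
    """Per-field accuracy of pipeline output vs. hand-labeled records."""
    report = {}
    for field in fields:
        correct = sum(
            1 for pred, gold in zip(predictions, labels)
            if pred.get(field) == gold.get(field)
        )
        report[field] = correct / len(labels)
    return report

def gate_deploy(report, baseline, max_regression=0.01):
    """Return the fields that regressed by more than 1% vs. the last
    deploy's report; a non-empty list blocks the deploy."""
    return [
        field for field, acc in report.items()
        if baseline.get(field, 0.0) - acc > max_regression
    ]
```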
I had been testing "does the pipeline produce valid JSON?" That's a unit test, not an evaluation. Real evaluation asks: did the pipeline produce the same answer a human would have produced on real input? You can't do that without a labeled set.
Rule: if you don't have 100 labeled examples, you don't have a production LLM pipeline — you have a demo. Build the calibration set first, optimize against it second.
The calibration set itself didn't improve accuracy. What it did was make the next four improvements possible, because I could measure them.
Impact: 0% (directly), but multiplier on everything after.
05 Per-field error analysis and targeted prompt fixes
With the calibration set running, I could see which fields were failing most often. The distribution was not uniform. Two fields accounted for ~60% of all errors:
- Factory size (in employees). The model was guessing when the page said nothing about it. Fix: add an explicit "if the page doesn't state this, return unknown — do not estimate" instruction in the prompt.
- Export experience. The model was over-claiming. Any mention of "international" or "worldwide" was being read as "has EU customers." Fix: reframe the prompt to ask for specific evidence (named customers, case studies, certifications) not general claims.
Instead of rewriting the whole prompt, I added field-specific instructions for the five worst-performing fields. Each instruction was ~30 tokens. Total prompt growth was minimal; accuracy lift was substantial.
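In practice this is just a dict of per-field rules appended to the base prompt. The instruction text below is paraphrased, not the production wording:

```python
# Field-specific rules for the worst-performing fields (paraphrased).
FIELD_INSTRUCTIONS = {
    "factory_size": (
        "If the page does not state the number of employees, "
        "return unknown. Do not estimate."
    ),
    "export_experience": (
        "Only claim export experience given specific evidence "
        "(named customers, case studies, certifications), not "
        "general words like 'international' or 'worldwide'."
    ),
}

def build_prompt(base_prompt, instructions=FIELD_INSTRUCTIONS):
    """Append the per-field rules to the main extraction prompt."""
    extras = "\n".join(f"- {field}: {rule}" for field, rule in instructions.items())
    return f"{base_prompt}\n\nField-specific rules:\n{extras}"
```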
Impact: 90% → 93%.
06 Retry on confidence, not on format
I asked the model to return a confidence field alongside each extraction (0–1 scale). On their own, LLM confidence scores are mostly vibes, but checked against the calibration set they correlated meaningfully with actual accuracy: below 0.6 confidence, the extraction was wrong ~40% of the time.
I added a retry layer: any field with confidence < 0.7 gets re-extracted with a different prompt variant (slightly rephrased, same schema). If the two attempts agree, use the answer. If they disagree, use the higher-confidence one. If both are low-confidence, flag for human review.
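The decision logic, sketched under the assumption that `extract(field, variant)` wraps the LLM call and returns a `(value, confidence)` pair (variant 0 is the main prompt, variant 1 the rephrased one; all names hypothetical):

```python
def extract_with_retry(extract, field, threshold=0.7):
    """Re-extract a low-confidence field with a rephrased prompt variant.
    Agreement wins outright; disagreement falls back to the higher
    confidence; two low-confidence answers go to human review."""
    value, conf = extract(field, 0)
    if conf >= threshold:
        return value, "accepted"
    value2, conf2 = extract(field, 1)
    if value == value2:
        return value, "agreed-on-retry"
    if max(conf, conf2) >= threshold:
        return (value if conf >= conf2 else value2), "higher-confidence"
    return None, "human-review"
```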
Impact: 93% → 94%.
07 Fall back to human review on the hardest 5%
Some supplier pages are just hard. Arabic-only, no structured markup, inconsistent layout, image-heavy with text inside images. No amount of prompt engineering was going to get them past ~85%.
I accepted the ceiling and built a human review queue. Any record with < 0.8 average confidence across fields gets flagged and a human (one of our part-time sourcing analysts) reviews it in a minimal UI. The review UI shows the scraped HTML alongside the extracted record and lets them correct fields inline. Corrections feed back into the calibration set.
This isn't "the LLM gets better." It's "we stopped pretending the LLM had to handle 100% of cases." About 5% of scrapes go to humans. The rest we trust.
Impact on automated accuracy: 94% → 94% (unchanged — automation didn't get better). Impact on delivered data quality: 94% → 98%. The 96% headline number is the blended rate.
08 Drift detection with weekly calibration runs
The last change was preventive. Web pages change. LLM providers update their models. Either can silently break extraction. I set up a weekly cron that re-runs the full calibration set, stores the per-field accuracy numbers, and alerts if anything dropped by >2%.
This caught a real regression in month four when OpenAI updated their default model behavior and the factory_size field accuracy dropped from 94% to 87% overnight. Without drift detection, I wouldn't have noticed for weeks.
The cron is 60 lines of Python. It has paid for itself twice already.
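The core of that check is a comparison against the last stored run; the rest is scheduling, storage, and alerting. A minimal sketch of the comparison:

```python
def detect_drift(current, previous, max_drop=0.02):
    """Compare this week's per-field accuracy to the last stored run.
    Returns the fields that dropped by more than 2% and should alert."""
    return [
        field for field, acc in current.items()
        if previous.get(field, 0.0) - acc > max_drop
    ]
```

This is the check that would have flagged the factory_size regression: 0.94 last week, 0.87 this week, a 7-point drop well past the 2% threshold.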
Impact: floor-setting, not ceiling-raising.
09 Two things that made it worse
For completeness:
- Chain-of-thought reasoning in the extraction prompt. I added "think step-by-step about what this page is saying before extracting" and accuracy dropped ~1%. The model's step-by-step reasoning produced more confident wrong answers. Extraction isn't a reasoning task — it's a pattern-matching task. Chain-of-thought hurts.
- Bigger model for the same job. Moving from GPT-5.4-mini to GPT-5.4 full for extraction doubled the cost and added ~0.5% to accuracy. Not worth it. Mini is the sweet spot for this task.
10 The real lesson
The jump from 82% to 96% was 14 percentage points from 8 small engineering changes, not one heroic move. None of the changes were particularly clever. Most of them were the result of treating the extraction pipeline like any other production system: instrument it, measure it, error-analyze it, fix the worst thing first, repeat.
The single biggest multiplier was the calibration set (change #4). Without it, every other change was a guess. With it, every change became measurable, and "good enough to deploy" became a real threshold instead of a feeling.
If you're building an LLM extraction pipeline and you don't have a labeled calibration set, that's the next thing you should build. Everything else is downstream of that.