
Bridge Sourcing
Bridge Sourcing connects EU buyers with Egyptian suppliers via Egypt’s 0% EU tariff lane. The product is a multi-agent pipeline: discovery, qualification, outreach. The interesting engineering bit is the scrape-accuracy curve: 82% to 96% over three months, driven by eight specific changes that compound.
The problem
Most EU buyers sourcing from MENA hit the same wall: supplier discovery is manual, qualification is gut-feel, and outreach hits dead Hotmail addresses from 2017. Meanwhile Egypt has a free-trade lane to the EU (0% tariff on most categories) that almost no one outside the country can navigate. The opportunity isn’t the AI; it’s the trade-route arbitrage. AI is just what makes the pipeline cheap enough to run continuously.
The pipeline — 3 agents
- Discovery agent. Crawls LinkedIn + Hunter + targeted Google search for suppliers in a category. Outputs a normalised supplier row with company, contact, category, scale signal.
- Qualification agent. Reads the supplier’s site + LinkedIn + shipping records via public manifests. Scores fit-to-buyer-spec on a 0–100 rubric with explainable reasoning.
- Outreach agent. Drafts a personalised first email per supplier in the buyer’s voice, scheduled via Zoho with deliverability checks (DKIM verified Apr 2026). Never auto-sends — the buyer reviews + sends.
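To make the handoff between agents concrete, the discovery agent's "normalised supplier row" can be sketched as a typed record plus a normalisation step. The field names and cleaning rules here are illustrative assumptions, not the production schema:

```typescript
// Hypothetical shape of the discovery agent's normalised supplier row.
// Field names are illustrative, not the production schema.
interface SupplierRow {
  company: string;
  contact: string;     // best-known email or LinkedIn URL
  category: string;    // e.g. "textiles", "food processing"
  scaleSignal: string; // coarse size hint, e.g. "50-200 employees"
}

// Normalise raw extractor output so downstream agents see one canonical form.
function normaliseRow(raw: Record<string, string | undefined>): SupplierRow {
  const clean = (v?: string) => (v ?? "unknown").trim().replace(/\s+/g, " ");
  return {
    company: clean(raw.company),
    contact: clean(raw.contact).toLowerCase(),
    category: clean(raw.category).toLowerCase(),
    scaleSignal: clean(raw.scaleSignal),
  };
}
```

The point of normalising this early is that the qualification and outreach agents never have to reason about missing or messy fields; anything the extractor couldn't produce arrives as an explicit "unknown".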
The 82→96% scrape accuracy curve
Day one, the discovery agent extracted supplier fields at 82% accuracy on a held-out test set. Three months later, 96%. The 14-percentage-point gain came from eight specific changes:
- Switched extraction model from JSON-mode-flagship to JSON-mode-mini with a stricter schema. Smaller model, tighter prompt, +3pp.
- Added a separate validation pass with a different model. The “does this look right?” second opinion catches hallucinations the extractor didn’t flag. +2pp.
- Pre-cleaned HTML before extraction. Stripped nav, footer, ads. Removes 40% of input tokens and 70% of nuisance text. +2pp.
- Per-field confidence scoring. Low-confidence fields get a retry with a different prompt; if the retry still fails, the field is marked “unknown” rather than guessed. +2pp.
- Country-specific date and address parsers. Egyptian addresses don’t match Western patterns. Hand-rolled regex, not the model. +1pp.
- Duplicate detection across discovery batches. The same supplier appearing twice with different field values used to silently corrupt scoring. Now it triggers a merge step. +1pp.
- Field-level golden eval set. Built a 200-row hand-labeled set and ran every prompt change against it before deploying. Stops regressions cold. +2pp.
- Stopped trusting the model on numeric fields. Years founded, employee count, revenue band — all parsed by hand from canonical fields, not inferred from prose. +1pp.
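The duplicate-merge step above is mostly plain code, not AI. A minimal sketch of the assumed logic: key suppliers on a normalised company name, and when two batches disagree, prefer any concrete value over "unknown" instead of silently overwriting:

```typescript
// Sketch of the duplicate-merge step (assumed logic, not the production code).
type Row = Record<string, string>;

// Normalise the company name into a dedupe key: lowercase, alphanumerics only.
const dedupeKey = (company: string) =>
  company.toLowerCase().replace(/[^a-z0-9]/g, "");

// Merge two rows for the same supplier: keep existing values, but let a
// concrete value from the new row fill any "unknown" or missing field.
function mergeRows(a: Row, b: Row): Row {
  const merged: Row = { ...a };
  for (const [field, value] of Object.entries(b)) {
    if (merged[field] === undefined || merged[field] === "unknown") {
      merged[field] = value;
    }
  }
  return merged;
}

function dedupeBatch(rows: Row[]): Row[] {
  const byKey = new Map<string, Row>();
  for (const row of rows) {
    const key = dedupeKey(row.company ?? "");
    const existing = byKey.get(key);
    byKey.set(key, existing ? mergeRows(existing, row) : row);
  }
  return [...byKey.values()];
}
```

Before this step existed, the second occurrence of a supplier could overwrite good fields with bad ones; merging by key makes the corruption impossible by construction.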
What I got wrong
Trusted the flagship model with no validation pass
For the first two months, I had a single GPT-4 extraction call producing the structured output. Hallucinations hit ~6% of rows. The fix wasn’t a better extractor — it was a second opinion. A different model on the same input, comparing outputs, flagging mismatches. That’s the pattern that took accuracy from 88 to 92.
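The second-opinion pattern reduces to a simple comparison step once both models have produced structured output: flag any field where they disagree, then retry or mark it "unknown". The function below is an illustrative sketch of that comparison, not the production code:

```typescript
// Compare extractor output against a second model's output on the same input
// and return the fields where they disagree (candidates for retry/"unknown").
type Extraction = Record<string, string>;

function flagMismatches(extractor: Extraction, validator: Extraction): string[] {
  const norm = (v: string) => v.trim().toLowerCase();
  return Object.keys(extractor).filter(
    (field) => norm(extractor[field] ?? "") !== norm(validator[field] ?? "")
  );
}
```

Normalising before comparing matters: casing and whitespace differences between two models are noise, and flagging them would drown the real mismatches.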
Underbuilt the eval set
I had a 30-row test set for too long. It felt rigorous; it was useless. Statistically, 30 rows can’t distinguish a 90% extractor from an 88% one. Building the 200-row golden set took two days and unlocked all the meaningful prompt iteration that followed.
The actual lesson
Scrape accuracy is not an AI problem. It’s an evals problem, an HTML-cleaning problem, and a domain-specific-parser problem with an LLM in the middle. The AI is the cheapest part. The discipline around the AI is the moat.
// stack
- Next.js
- Multi-agent
- Hunter
- Zoho
- GPT-4o
// next case study
Want this kind of build for your business? Book the Audit Sprint ($1,500) or email omar@neurascale.org.



