A technical look at the decisions, architecture, and lessons learnt building an automation ecosystem that processes 5,000+ jobs daily.
Understanding the context is half the battle. Here's what we were working with.
A medium-sized B2B professional services company in London with 30+ consultants, operating across multiple markets (the UK, Europe, and the US) and industries (technology, financial services, and more).
Entirely manual. Each consultant had their own approach to finding leads. No consistency, no tracking, no visibility for management. Hours spent each week on repetitive searching that could be automated.
Build a system that could find relevant opportunities automatically, enrich them with company data, check against existing clients in the CRM, and push qualified leads to outreach campaigns. Without adding headcount.
This isn't a greenfield project. There's an existing CRM with years of client history, existing relationships, and business rules. The system needs to integrate, not replace.
Every tool was chosen for a specific reason. Here's what we use and why.
Workflow orchestration: n8n
PostgreSQL database: a managed instance
Job aggregation: jSearch
AI processing: OpenAI
Company enrichment: Clay
Email campaigns: Smartlead (or similar)
We wanted to scrape Google Jobs because it already aggregates from LinkedIn, Indeed, Glassdoor, and company career pages. jSearch had the best scraper in our tests.
A managed PostgreSQL database with a generous free tier, built-in REST API, and edge functions for webhooks.
OpenAI's cost-optimised model for classification and extraction tasks. Not GPT-4, not Claude.
Clay, a data enrichment platform that aggregates 75+ data providers into a single interface.
The initial request was simple: automate lead generation. Find jobs, find companies, send emails. But digging deeper revealed something more interesting.
When we first switched on the automation, leads started pouring in. Hundreds per day. But this exposed a deeper issue: the team wasn't set up to handle volume. There was no process for qualification, no clear handoff to consultants, no way to track what happened to each lead. We'd moved the bottleneck, not removed it.
This led to something bigger: a full operational audit. We mapped the entire lead-to-placement journey, identified where things broke down, and built the automation around the actual process, not just the symptom.
The audit revealed that 40+ new automations and supporting ecosystems were needed to make the full lead-to-placement journey work properly. That work is ongoing and scheduled to complete in 2026 as part of a broader data and AI plan. For the job scraper specifically, the requirements stayed focused on the original goal: find relevant opportunities automatically, enrich them with company data, check them against existing CRM clients, and push qualified leads to outreach.
Automation doesn't exist in isolation. Before you build, think about what happens downstream. Who handles the output? What's the process? How do you measure success? This is why we always start with mapping the full picture.
Each workflow handles one stage, then passes the baton. Data flows through with statuses that let us track, retry, and recover from failures.
The Job Scraper workflow showing the full scraping and processing pipeline
[stage].[substage] | [Client] [Description] | v[version]
Every workflow follows this pattern. It makes the ecosystem navigable at a glance.
1.0 | Job Scraper | v3
5.1 | Existing Client Checker | v2
Why it matters: With 12+ workflows, you need to instantly know what runs in what order. The stage number tells you the sequence, the description tells you the purpose, and the version tracks iterations.
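As a small illustration, the convention is regular enough to parse mechanically. The TypeScript below is a hypothetical helper, not part of the workflows themselves; since the examples above omit the optional client segment, the parser folds client and description into one label.

```typescript
// Hypothetical helper that parses a workflow name such as "1.0 | Job Scraper | v3".
// The middle segment holds the optional "[Client] [Description]" text.
interface WorkflowName {
  stage: number;
  substage: number;
  label: string;   // "[Client] [Description]"
  version: number;
}

function parseWorkflowName(name: string): WorkflowName | null {
  const match = name.match(/^(\d+)\.(\d+)\s*\|\s*(.+?)\s*\|\s*v(\d+)$/);
  if (!match) return null;
  return {
    stage: Number(match[1]),
    substage: Number(match[2]),
    label: match[3],
    version: Number(match[4]),
  };
}

// parseWorkflowName("5.1 | Existing Client Checker | v2")
// => { stage: 5, substage: 1, label: "Existing Client Checker", version: 2 }
```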
Workflow 1.0 is the entry point. It runs once daily, pulling all configured searches, which then trigger multiple scrapes downstream. It covers hundreds of targeted search queries across job titles, locations, and industries, deduplicates against existing records, and stores raw job data in the database.
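A rough sketch of the deduplication step. The fingerprint fields (company, title, location) and the RawJob shape are assumptions for illustration, not the actual jSearch payload or the production logic.

```typescript
import { createHash } from "node:crypto";

// A raw job as it might come back from the aggregator. Field names are
// illustrative; the real payload depends on the jSearch response.
interface RawJob {
  title: string;
  companyName: string;
  location: string;
  url: string;
  description: string;
}

// Build a stable fingerprint so the same posting scraped twice (or found by two
// different search queries) collapses to one record. The field choice is an
// assumption: company + title + location survives URL churn better than the URL alone.
function jobFingerprint(job: RawJob): string {
  const key = [job.companyName, job.title, job.location]
    .map((s) => s.trim().toLowerCase())
    .join("|");
  return createHash("sha256").update(key).digest("hex");
}

// Keep only jobs whose fingerprint isn't already stored.
function dedupe(scraped: RawJob[], existingFingerprints: Set<string>): RawJob[] {
  const fresh: RawJob[] = [];
  for (const job of scraped) {
    const fp = jobFingerprint(job);
    if (existingFingerprints.has(fp)) continue; // already in the database
    existingFingerprints.add(fp);               // also dedupes within this batch
    fresh.push(job);
  }
  return fresh;
}
```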
Raw jobs often arrive with messy data. This stage cleans company names, extracts domains, handles duplicates, and creates or links company records. A controller workflow (1.2) manages batching; a processor workflow (1.3) handles the actual work.
Controller handles orchestration and error recovery. Processor focuses on data transformation. If the processor fails mid-batch, the controller can restart from where it left off.
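The split looks roughly like this in code. The status values, batch size, and Db interface are placeholders; in n8n these are two workflows wired to the database, not TypeScript functions.

```typescript
type JobStatus = "Pending Cleaning" | "Cleaning" | "Cleaned" | "Failed";

interface JobRecord {
  id: string;
  status: JobStatus;
  companyNameRaw: string;
  companyNameClean?: string;
}

// Placeholder for whatever database client the workflows use.
interface Db {
  fetchByStatus(status: JobStatus, limit: number): Promise<JobRecord[]>;
  setStatus(ids: string[], status: JobStatus): Promise<void>;
  saveCleaned(jobs: JobRecord[]): Promise<void>;
}

// Processor (1.3): pure data transformation, no orchestration concerns.
async function processBatch(db: Db, batch: JobRecord[]): Promise<void> {
  for (const job of batch) {
    job.companyNameClean = job.companyNameRaw.replace(/\s+(ltd|inc|llc)\.?$/i, "").trim();
    job.status = "Cleaned";
  }
  await db.saveCleaned(batch);
}

// Controller (1.2): pulls pending records in small batches, marks them in-flight,
// and recovers per batch. A crash leaves untouched records at "Pending Cleaning",
// so a restart simply resumes where the last run stopped.
async function runController(db: Db, batchSize = 100): Promise<void> {
  while (true) {
    const batch = await db.fetchByStatus("Pending Cleaning", batchSize);
    if (batch.length === 0) break;
    await db.setStatus(batch.map((j) => j.id), "Cleaning");
    try {
      await processBatch(db, batch);
    } catch {
      // Leave a trace so the failed batch can be inspected and re-queued.
      await db.setStatus(batch.map((j) => j.id), "Failed");
    }
  }
}
```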
Not every job is relevant. The Gatekeeper workflow uses AI to normalise job titles, check relevance against the Ideal Customer Profile, categorise jobs, and filter out noise before expensive enrichment.
Rejected jobs aren't deleted. They go to a discarded_jobs table with the rejection reason. This lets us audit AI decisions and refine prompts over time.
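A minimal sketch of a gatekeeper call against OpenAI's Chat Completions API. The prompt wording, the env-var model name, and the verdict shape are assumptions; only the pattern mirrors the stage described above: classify, then keep the job or log it to discarded_jobs with a reason.

```typescript
interface GateVerdict {
  relevant: boolean;
  normalisedTitle: string;
  category: string;
  reason: string; // stored in discarded_jobs when relevant === false
}

async function gatekeep(jobTitle: string, description: string): Promise<GateVerdict> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${process.env.OPENAI_API_KEY}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: process.env.OPENAI_MODEL, // the cost-optimised model configured for the workflow
      messages: [
        {
          role: "system",
          content:
            "You screen job postings against an Ideal Customer Profile. " +
            "Reply with JSON: {relevant, normalisedTitle, category, reason}.",
        },
        { role: "user", content: `Title: ${jobTitle}\n\n${description}` },
      ],
      response_format: { type: "json_object" },
    }),
  });
  const data = (await response.json()) as { choices: { message: { content: string } }[] };
  return JSON.parse(data.choices[0].message.content) as GateVerdict;
}
```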
Only companies not already in our system are sent to Clay for enrichment; we check by fuzzy-matching name and domain against CRM data. Companies already in the CRM skip this stage entirely and go straight to lead tagging. For new companies, we use two different Clay tables: one for companies with domains, one without.
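A simplified sketch of that existing-client check, assuming normalised exact matching on suffix-stripped names; the production check may score similarity rather than require equality.

```typescript
interface CrmCompany {
  name: string;
  domain: string | null;
}

function normaliseName(name: string): string {
  return name
    .toLowerCase()
    .replace(/\b(ltd|limited|inc|llc|plc|gmbh)\b\.?/g, "") // strip legal suffixes
    .replace(/[^a-z0-9]/g, "");                            // strip punctuation and spaces
}

function normaliseDomain(domain: string): string {
  return domain.toLowerCase().replace(/^www\./, "");
}

function isExistingClient(
  candidate: { name: string; domain: string | null },
  crm: CrmCompany[],
): boolean {
  // Exact domain match is the strongest signal when both sides have one.
  if (candidate.domain) {
    const d = normaliseDomain(candidate.domain);
    if (crm.some((c) => c.domain && normaliseDomain(c.domain) === d)) return true;
  }
  // Otherwise fall back to a normalised name comparison.
  const n = normaliseName(candidate.name);
  return crm.some((c) => normaliseName(c.name) === n);
}
```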
Clay returns data via webhooks. The webhook stores every payload it catches, then a separate automation runs later to inject the accumulated batch into the database. This "fill and empty the bucket" approach confines database writes to a single controlled window each day.
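The pattern, sketched with an in-memory array standing in for the webhook's staging table. In production the bucket is a database table and the drain is a scheduled workflow, not a function call.

```typescript
interface ClayPayload {
  receivedAt: string;
  body: unknown;
}

const bucket: ClayPayload[] = [];

// Webhook handler: store whatever Clay sends, do nothing else.
function handleClayWebhook(body: unknown): void {
  bucket.push({ receivedAt: new Date().toISOString(), body });
}

// Scheduled drain: runs once in a controlled window, writes everything that has
// accumulated, then empties the bucket. One burst of database writes per day
// instead of one write per webhook call.
async function drainBucket(writeBatch: (items: ClayPayload[]) => Promise<void>): Promise<void> {
  if (bucket.length === 0) return;
  const batch = bucket.splice(0, bucket.length); // take everything that has accumulated
  await writeBatch(batch);
}
```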
Once companies are enriched, we go back to the jobs and extract detailed information from the raw job description: skills, salary, benefits, work mode, experience requirements, and normalised location.
Locations come in messy: "NYC", "New York, NY", "Remote (US)". A separate AI node normalises to city/country/country_code, then we look up the country ID for CRM integration.
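A sketch of the target shape and the follow-up lookup. The AI call itself resembles the gatekeeper call shown earlier; the country IDs below are placeholders for the CRM's own reference data.

```typescript
interface NormalisedLocation {
  city: string | null;
  country: string;
  country_code: string; // ISO 3166-1 alpha-2, e.g. "US"
  remote: boolean;
}

// Example outputs the AI node should produce for messy inputs:
//   "NYC"         -> { city: "New York", country: "United States", country_code: "US", remote: false }
//   "Remote (US)" -> { city: null, country: "United States", country_code: "US", remote: true }

const countryIdByCode: Record<string, number> = {
  // Illustrative IDs only; the real mapping comes from the CRM.
  GB: 1,
  US: 2,
  DE: 3,
};

function resolveCountryId(loc: NormalisedLocation): number | null {
  return countryIdByCode[loc.country_code] ?? null;
}
```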
This is a separate branch for companies already in the CRM. They skip Clay enrichment and come directly here. We query the CRM for relationship signals: recent placements, interviews, meetings, notes. This determines the A/B/C/D priority tag for routing.
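The exact tagging rules aren't spelled out here, so the thresholds below are invented for illustration; only the inputs (placements, interviews, meetings, notes) and the A/B/C/D output come from the stage described above.

```typescript
interface RelationshipSignals {
  recentPlacements: number;  // e.g. within the last 12 months
  recentInterviews: number;
  recentMeetings: number;
  hasNotes: boolean;
}

type PriorityTag = "A" | "B" | "C" | "D";

function priorityTag(s: RelationshipSignals): PriorityTag {
  if (s.recentPlacements > 0) return "A";                         // active, proven relationship
  if (s.recentInterviews > 0 || s.recentMeetings > 0) return "B"; // warm engagement
  if (s.hasNotes) return "C";                                     // known, but dormant
  return "D";                                                     // in the CRM, no recent activity
}
```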
Jobs are grouped by client + country + category. One package per group. This prevents creating duplicate leads for the same opportunity.
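A minimal sketch of the grouping key; the field names are illustrative.

```typescript
interface TaggedJob {
  clientId: string;
  countryId: number;
  category: string;
}

// One package per client + country + category, so two jobs from the same client
// in the same market never become two separate leads.
function groupIntoPackages<T extends TaggedJob>(jobs: T[]): Map<string, T[]> {
  const packages = new Map<string, T[]>();
  for (const job of jobs) {
    const key = `${job.clientId}:${job.countryId}:${job.category}`;
    const group = packages.get(key) ?? [];
    group.push(job);
    packages.set(key, group);
  }
  return packages;
}
```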
For new companies (not existing clients), we need decision makers to contact. This stage sends companies to Clay for contact enrichment, then processes the returned contacts with email and phone verification.
Unverified emails destroy sender reputation. Clay's waterfall tries multiple verification services. Only contacts with verified emails proceed to outreach.
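The downstream filter is deliberately simple; the field names on the enriched contact are assumptions about the Clay payload, not its actual schema.

```typescript
interface EnrichedContact {
  name: string;
  email: string | null;
  emailStatus: "verified" | "risky" | "invalid" | "unknown";
  phone: string | null;
}

// Only contacts with a verified email continue to outreach.
function readyForOutreach(contacts: EnrichedContact[]): EnrichedContact[] {
  return contacts.filter((c) => c.email !== null && c.emailStatus === "verified");
}
```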
The final stage. Contacts with verified emails are pushed to Smartlead (or similar) for automated email sequences. Each contact is matched to the appropriate campaign based on job category.
Every contact has an outreach_status: Pending → Processing → In Campaign → Replied/Bounced/Unsubscribed. Full visibility into what happened.
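Sketched as a type plus an allowed-transition check; the transition table is our reading of the progression above, not an exhaustive spec.

```typescript
type OutreachStatus =
  | "Pending"
  | "Processing"
  | "In Campaign"
  | "Replied"
  | "Bounced"
  | "Unsubscribed";

// Terminal statuses come from the campaign tool's replies, bounces, and unsubscribes.
const allowedTransitions: Record<OutreachStatus, OutreachStatus[]> = {
  "Pending": ["Processing"],
  "Processing": ["In Campaign"],
  "In Campaign": ["Replied", "Bounced", "Unsubscribed"],
  "Replied": [],
  "Bounced": [],
  "Unsubscribed": [],
};

function canTransition(from: OutreachStatus, to: OutreachStatus): boolean {
  return allowedTransitions[from].includes(to);
}
```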
One giant workflow might look simpler to understand. So why did we choose the apparent complexity of a dozen separate ones?
n8n slows down with large data volumes. Processing 5,000 jobs in one workflow hits memory limits. Split workflows handle smaller batches, completing faster and more reliably.
Every API has rate limits. Separate workflows let us control pacing with Wait nodes between batches. A single monolithic workflow can't pause mid-execution to respect limits.
When something breaks, you know exactly which stage failed. Fix it there, reprocess from that status, done. No hunting through a 100-node monster trying to find what went wrong.
Each record has a status field. If workflow 3.0 fails mid-run, records stay at "Pending Enrichment". Restart and it picks up where it left off. No duplicate processing, no lost data.
Think of it like a relay race. Each workflow runs its leg, updates the record status, and hands off. The next workflow picks up records in that status and runs its leg. If a runner trips, only that leg needs to be re-run. The race isn't over.
Hindsight is valuable. Here's what we learnt the hard way.
We built the scraper first, then realised we had no process for handling the leads. Always map the full flow before building any part of it. What happens after the automation runs?
Our AI prompts went through dozens of iterations. The first version of the Gatekeeper rejected too much. The second rejected too little. Start minimal, test with real data, refine based on actual results.
We could have scraped job boards ourselves. The maintenance cost would have eaten any savings. Pay specialists to handle the hard parts. Your time is better spent on business logic.
Every record needs a status field. Every workflow should only process specific statuses. This makes debugging, recovery, and monitoring possible. Without it, you're flying blind.
We use AI for only three tasks: title normalisation, relevance gating, and description parsing. Everything else is rules, lookups, and SQL. AI is expensive and sometimes wrong. Use it surgically.
Things will break. APIs will be down. Data will be weird. Design assuming failure, and recovery becomes trivial instead of catastrophic. Every workflow should be re-runnable.
When the AI Gatekeeper rejects a job, we log why. When a company has no domain, we log it. This data is gold for refining the system and understanding edge cases.
The best automation is invisible to users. Consultants don't see the 12 workflows. They see leads appearing in their CRM with all the context they need. Design for the experience, not the architecture.
Every node gets a sticky note explaining what it does. Every workflow has a canvas note with the big picture. When you're running 12+ workflows across multiple projects, you won't remember why you made a decision three months ago. Your future self will thank you.
n8n workflows export to JSON. That JSON can be version controlled, diffed, and even converted to actual code with AI tools. We treat our workflows like software: they live in Git, have version numbers, and follow coding standards. This makes collaboration and handoffs possible.
Every business is different, but the principles are the same. Let's talk about what's possible for yours.