Technical Deep-Dive

How We Built a Job Scraping Engine

A technical look at the decisions, architecture, and lessons learnt building an automation ecosystem that processes 5,000+ jobs daily.

  • 12+ workflows
  • 5K+ jobs/day
  • 35 hrs saved weekly
  • ~$50 API cost/month

Where this story begins

Understanding the context is half the battle. Here's what we were working with.

The Company

A medium-sized B2B professional services company in London with 30+ consultants, operating across multiple regions (UK, Europe, US) and industries (technology, financial services, and more).

The Process (Before)

Entirely manual. Each consultant had their own approach to finding leads. No consistency, no tracking, no visibility for management. Hours spent each week on repetitive searching that could be automated.

The Goal

Build a system that could find relevant opportunities automatically, enrich them with company data, check against existing clients in the CRM, and push qualified leads to outreach campaigns. Without adding headcount.

The Constraint

This isn't a greenfield project. There's an existing CRM with years of client history, existing relationships, and business rules. The system needs to integrate, not replace.

What we're building with

Every tool was chosen for a specific reason. Here's what we use and why.

  • n8n: Workflow orchestration
  • Supabase: PostgreSQL database
  • jSearch API: Job aggregation
  • GPT-4o-mini: AI processing
  • Clay: Company enrichment
  • Smartlead: Email campaigns

Data Source

Why jSearch (Google Jobs API)?

We wanted to scrape Google Jobs because it already aggregates from LinkedIn, Indeed, Glassdoor, and company career pages. jSearch had the best scraper in our tests.

Why: Building individual scrapers for each job board would take months and require constant maintenance as the boards push back against scraping. Using an aggregator that has already solved this problem cut our build time and complexity by at least a factor of three. We hit the ground running with reliable data and direct support via Discord.

Database

Why Supabase?

A managed PostgreSQL database with a generous free tier, built-in REST API, and edge functions for webhooks.

Why: Speed of setup. We could spin up tables, configure row-level security, and start building in hours. The Supabase dashboard makes debugging dead simple. Plus, edge functions handle Clay's webhook callbacks without needing another service.
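
To make that concrete, here's a minimal sketch of an edge function catching an enrichment webhook and parking the payload in a staging table. The table and field names (clay_webhook_queue, payload) are illustrative, not our production schema.

```typescript
// Minimal sketch of a Supabase Edge Function (Deno) that receives a Clay webhook
// and parks the raw payload in a staging table. Table and column names are
// illustrative, not a prescribed schema.
import { createClient } from "npm:@supabase/supabase-js@2";

const supabase = createClient(
  Deno.env.get("SUPABASE_URL")!,
  Deno.env.get("SUPABASE_SERVICE_ROLE_KEY")!,
);

Deno.serve(async (req) => {
  if (req.method !== "POST") {
    return new Response("Method not allowed", { status: 405 });
  }

  const payload = await req.json();

  // Store everything the webhook catches; a separate scheduled workflow drains
  // this table later ("fill and empty the bucket", covered further down).
  const { error } = await supabase
    .from("clay_webhook_queue")
    .insert({ payload, received_at: new Date().toISOString() });

  if (error) {
    return new Response(JSON.stringify({ ok: false, error: error.message }), { status: 500 });
  }
  return new Response(JSON.stringify({ ok: true }), { status: 200 });
});
```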

AI Model

Why GPT-4o-mini?

OpenAI's cost-optimised model for classification and extraction tasks. Not GPT-4, not Claude.

Why: Cost efficiency at scale. We process thousands of jobs daily. For our use cases (title normalisation, relevance filtering), mini performs just as well as larger models at a fraction of the cost.

Enrichment

Why Clay?

A data enrichment platform that aggregates 75+ data providers into a single interface.

Why: One API call gets us company size, industry, funding status, LinkedIn data, decision makers, and verified emails. Building this ourselves would take months and cost more in individual API subscriptions. Clay also handles the waterfall logic: try Provider A, fall back to Provider B.

More than just "we need leads"

The initial request was simple: automate lead generation. Find jobs, find companies, send emails. But digging deeper revealed something more interesting.

The "Too Successful" Problem

When we first switched on the automation, leads started pouring in. Hundreds per day. But this exposed a deeper issue: the team wasn't set up to handle volume. There was no process for qualification, no clear handoff to consultants, no way to track what happened to each lead. We'd moved the bottleneck, not removed it.

This led to something bigger: a full operational audit. We mapped the entire lead-to-placement journey, identified where things broke down, and built the automation around the actual process, not just the symptom.

What the Operations Audit Revealed

The audit identified 40+ new automations and supporting ecosystems needed to make the full lead-to-placement journey work properly. That build-out is ongoing and scheduled to complete in 2026 as part of a broader data and AI plan. For the job scraper specifically, the requirements were:

  • Find relevant jobs matching specific criteria (job titles, locations, industries)
  • Filter out noise with AI that understands "Senior Data Engineer" isn't the same as "Data Entry Clerk"
  • Enrich companies with size, industry, funding, and contact data
  • Check against existing clients in the CRM so we don't cold-email our own customers
  • Categorise relationship strength (A/B/C/D leads) based on CRM history
  • Route appropriately: existing clients go to account managers, new companies go to outreach
  • Push to email campaigns with personalised sequences
  • Track everything so we know what's working

The Lesson

Automation doesn't exist in isolation. Before you build, think about what happens downstream. Who handles the output? What's the process? How do you measure success? This is why we always start with mapping the full picture.

12 workflows, one ecosystem

Each workflow handles one stage, then passes the baton. Data flows through with statuses that let us track, retry, and recover from failures.

The Job Scraper Workflow in n8n

The Job Scraper workflow showing the full scraping and processing pipeline

Naming Convention

[stage].[substage] | [Client] [Description] | v[version]

Every workflow follows this pattern. It makes the ecosystem navigable at a glance.

Examples:
  • 1.0 | Job Scraper | v3
  • 5.1 | Existing Client Checker | v2

Why it matters: With 12+ workflows, you need to instantly know what runs in what order. The stage number tells you the sequence, the description tells you the purpose, and the version tracks iterations.

1
1.0

Job Scraper

Scrape 5,000+ jobs daily from Google Jobs via jSearch API

This is the entry point. The workflow runs once daily, pulling all configured searches which then trigger multiple scrapes downstream. It covers hundreds of targeted search queries for different job titles, locations, and industries, deduplicates against existing records, and stores raw job data in the database.
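
To make the entry point concrete, here's a hedged sketch of a single search being pulled through jSearch with basic pagination and pacing. The endpoint, parameters, and response fields are as we found them in jSearch's RapidAPI docs; verify them against the current docs before reusing this.

```typescript
// Hedged sketch: pull one configured search from jSearch (RapidAPI) with pagination.
// Endpoint, parameter, and response-field names should be checked against the
// current jSearch documentation before reuse.
type JSearchJob = { job_id: string; employer_name: string; job_title: string };

async function fetchSearch(query: string, maxPages = 3): Promise<JSearchJob[]> {
  const jobs: JSearchJob[] = [];
  for (let page = 1; page <= maxPages; page++) {
    const url = new URL("https://jsearch.p.rapidapi.com/search");
    url.searchParams.set("query", query); // e.g. "senior data engineer in London"
    url.searchParams.set("page", String(page));

    const res = await fetch(url, {
      headers: {
        "X-RapidAPI-Key": process.env.RAPIDAPI_KEY ?? "",
        "X-RapidAPI-Host": "jsearch.p.rapidapi.com",
      },
    });
    if (!res.ok) throw new Error(`jSearch request failed: ${res.status}`);

    const body = (await res.json()) as { data?: JSearchJob[] };
    if (!body.data?.length) break; // stop when a page comes back empty
    jobs.push(...body.data);

    // Crude pacing between pages to stay inside the plan's rate limit.
    await new Promise((resolve) => setTimeout(resolve, 1000));
  }
  return jobs;
}
```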

Key Features

  • Batch processing to stay within API limits
  • Pagination handling for complete coverage
  • Self-triggering for continuous operation
  • Dynamic throttling based on volume

Technical Notes

  • PostgreSQL row-locking prevents duplicate processing (see the sketch after this list)
  • Status-based state machine for tracking progress
  • Rate limiting to stay within API quotas
  • Automatic retry on transient failures
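
The row-locking note above is what makes parallel runs safe. Here's a minimal sketch of how a batch might be claimed, assuming a raw_jobs table with integer IDs and a status column; our real schema and status names differ.

```typescript
// Sketch of batch claiming with SELECT ... FOR UPDATE SKIP LOCKED, so two concurrent
// runs never pick up the same rows. Assumes an illustrative raw_jobs table with
// integer primary keys and a status column.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.SUPABASE_DB_URL });

async function claimBatch(batchSize = 200): Promise<number[]> {
  const client = await pool.connect();
  try {
    await client.query("BEGIN");

    // Lock a batch of pending rows; rows already locked by another run are
    // skipped rather than waited on.
    const { rows } = await client.query(
      `SELECT id FROM raw_jobs
       WHERE status = 'Pending'
       ORDER BY created_at
       LIMIT $1
       FOR UPDATE SKIP LOCKED`,
      [batchSize],
    );
    const ids = rows.map((r) => r.id as number);

    if (ids.length > 0) {
      // Advance the claimed rows along the status state machine before the
      // transaction releases the locks.
      await client.query(
        `UPDATE raw_jobs SET status = 'Processing' WHERE id = ANY($1::int[])`,
        [ids],
      );
    }

    await client.query("COMMIT");
    return ids;
  } catch (err) {
    await client.query("ROLLBACK");
    throw err;
  } finally {
    client.release();
  }
}
```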

2
1.2, 1.3

Job Processor

Clean, validate, and prepare raw jobs for enrichment

Raw jobs often arrive with messy data. This stage cleans company names, extracts domains, handles duplicates, and creates or links company records. The controller workflow (1.2) manages batching; the processor workflow (1.3) handles the actual work.

Processing Steps

  • Company name extraction and cleaning
  • Domain extraction from job URLs
  • Duplicate company detection
  • Company record creation/linking
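
As a flavour of what this cleaning looks like, here's an illustrative simplification; the real nodes use a longer list of suffixes and edge cases.

```typescript
// Illustrative simplification of the cleaning in this stage; the production n8n
// nodes handle many more suffixes and edge cases.

// Strip common legal suffixes and collapse whitespace; duplicate detection then
// compares the cleaned names case-insensitively.
function cleanCompanyName(raw: string): string {
  return raw
    .trim()
    .replace(/\b(ltd|limited|llc|inc|gmbh|plc)\.?$/i, "")
    .replace(/\s+/g, " ")
    .trim();
}

// Pull a comparable domain out of a job or careers-page URL. Job-board hosts tell
// us nothing about the employer, so they are skipped.
function extractDomain(jobUrl: string): string | null {
  try {
    const host = new URL(jobUrl).hostname.replace(/^www\./, "");
    const boards = ["linkedin.com", "indeed.com", "glassdoor.com"];
    return boards.some((board) => host.endsWith(board)) ? null : host;
  } catch {
    return null;
  }
}

console.log(cleanCompanyName("Acme Ltd."));                     // "Acme"
console.log(extractDomain("https://www.acme.com/careers/123")); // "acme.com"
```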

Why Split?

Controller handles orchestration and error recovery. Processor focuses on data transformation. If the processor fails mid-batch, the controller can restart from where it left off.

3
2.0

AI Gatekeeper

AI-powered relevance filtering using GPT-4o-mini

Not every job is relevant. This workflow uses AI to normalise job titles, check relevance against the Ideal Customer Profile, categorise jobs, and filter out noise before expensive enrichment.

Three AI Tasks

  • Normalise Title: "Sr. SWE (Remote, $150k)" → "Senior Software Engineer"
  • Gatekeeper: Is this job relevant? true/false
  • Categorise: Which business category? (e.g., "Data", "DevOps")
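
Here's a hedged sketch of how these three tasks can be folded into one call with the OpenAI SDK. The prompt is a compressed stand-in for the much longer production prompt, and the JSON field names are illustrative.

```typescript
// Hedged sketch of the gatekeeper call with gpt-4o-mini. The system prompt is a
// compressed stand-in for the production prompt; the JSON contract is illustrative.
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

interface GateResult {
  normalised_title: string; // e.g. "Senior Software Engineer"
  relevant: boolean;        // does this job match the ICP?
  category: string;         // e.g. "Data", "DevOps"
  reason: string;           // kept for the discarded_jobs audit trail
}

async function gatekeep(rawTitle: string, description: string): Promise<GateResult> {
  const response = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    temperature: 0,
    response_format: { type: "json_object" },
    messages: [
      {
        role: "system",
        content:
          "You normalise job titles, decide whether a job matches our ideal customer profile, " +
          "and assign a business category. Respond with JSON: " +
          '{"normalised_title": string, "relevant": boolean, "category": string, "reason": string}',
      },
      {
        role: "user",
        content: `Title: ${rawTitle}\n\nDescription: ${description.slice(0, 4000)}`,
      },
    ],
  });

  return JSON.parse(response.choices[0].message.content ?? "{}") as GateResult;
}
```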

Rejection Tracking

Rejected jobs aren't deleted. They go to a discarded_jobs table with the rejection reason. This lets us audit AI decisions and refine prompts over time.

4
3.0, 3.1

Company Enrichment

Enrich new companies with data from Clay's 75+ providers

Only companies not already in our system are sent to Clay for enrichment; we check by fuzzy-matching company name and domain against CRM records. Companies already in the CRM skip this stage entirely and go straight to lead tagging. For new companies, we use two different Clay tables: one for companies with a domain, one for those without.

Two Enrichment Paths

  • With Domain: Higher accuracy, more data points
  • Without Domain: Uses company name + location for matching
  • In CRM: Skip enrichment, go to lead tagging

Webhook Pattern

Clay returns data via webhooks. The webhook stores every payload it catches, then a separate automation injects them into the database in one batch. This "fill and empty the bucket" approach concentrates database writes into one controlled window each day.
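
Here's a minimal sketch of the "empty the bucket" half, reusing the illustrative staging table from the edge-function sketch earlier; in reality this runs as a scheduled n8n workflow rather than a standalone script.

```typescript
// "Empty the bucket": drain the webhook staging table in one controlled batch.
// Table names reuse the illustrative clay_webhook_queue from the earlier sketch;
// the real version is a scheduled n8n workflow.
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
);

async function drainBucket(batchSize = 500): Promise<number> {
  // Pull a batch of caught payloads.
  const { data: rows, error } = await supabase
    .from("clay_webhook_queue")
    .select("id, payload")
    .limit(batchSize);
  if (error) throw error;
  if (!rows?.length) return 0;

  // Upsert the enrichment results into the companies table in one go
  // (assumes the payload fields line up with the companies columns).
  const { error: upsertError } = await supabase
    .from("companies")
    .upsert(rows.map((r) => r.payload), { onConflict: "domain" });
  if (upsertError) throw upsertError;

  // Only delete what was successfully written.
  const { error: deleteError } = await supabase
    .from("clay_webhook_queue")
    .delete()
    .in("id", rows.map((r) => r.id));
  if (deleteError) throw deleteError;

  return rows.length;
}
```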

5
4.0

Jobs Normaliser

Deep parse job descriptions with AI to extract structured data

Once companies are enriched, we go back to the jobs and extract detailed information from the raw job description: skills, salary, benefits, work mode, experience requirements, and normalised location.

Extracted Fields

  • Main tasks and responsibilities
  • Required skills (comma-separated)
  • Salary range and currency
  • Benefits offered
  • Minimum years of experience
  • Work mode (Remote/Hybrid/On-site)
  • Employment type (Contract/Permanent)

Location Normalisation

Locations come in messy: "NYC", "New York, NY", "Remote (US)". A separate AI node normalises to city/country/country_code, then we look up the country ID for CRM integration.
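
To show the shape of what this stage produces, here's an illustrative TypeScript contract for the parsed job plus the country lookup that follows; field and table names are examples, not our exact production columns.

```typescript
// Illustrative contract for the normaliser's output plus the country lookup.
// Field and table names are examples, not the exact production schema.
import { createClient } from "@supabase/supabase-js";

type NormalisedJob = {
  main_tasks: string[];
  skills: string;                 // comma-separated, e.g. "Python, dbt, Airflow"
  salary_min?: number;
  salary_max?: number;
  salary_currency?: string;       // e.g. "GBP"
  benefits: string[];
  min_years_experience?: number;
  work_mode: "Remote" | "Hybrid" | "On-site";
  employment_type: "Contract" | "Permanent";
  location: { city?: string; country: string; country_code: string }; // e.g. "US"
};

const supabase = createClient(process.env.SUPABASE_URL!, process.env.SUPABASE_ANON_KEY!);

// After the AI turns "NYC" into { city: "New York", country: "United States",
// country_code: "US" }, a plain lookup maps the code to the CRM's country ID.
async function countryIdFor(countryCode: string): Promise<number | null> {
  const { data } = await supabase
    .from("countries")              // illustrative lookup table
    .select("id")
    .eq("iso_code", countryCode)
    .maybeSingle();
  return data?.id ?? null;
}
```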

6
5.0, 5.1

Lead Tagging

Categorise existing client relationships by strength

This is a separate branch for companies already in the CRM. They skip Clay enrichment and come directly here. We query the CRM for relationship signals: recent placements, interviews, meetings, notes. This determines the A/B/C/D priority tag for routing.

Priority Tags

  • A-Lead: Strong recent relationship signals
  • B-Lead: Moderate engagement history
  • C-Lead: Some historical activity
  • D-Lead: Exists in CRM, minimal history

Job Grouping

Jobs are grouped by client + country + category. One package per group. This prevents creating duplicate leads for the same opportunity.
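
As a sketch, the grouping can be a single aggregate query: one row per client + country + category becomes one lead package downstream. Table and column names here are illustrative.

```typescript
// Hedged sketch of the grouping step: one lead package per client + country + category.
// Table and column names are illustrative, not the production schema.
import { Pool } from "pg";

const pool = new Pool({ connectionString: process.env.SUPABASE_DB_URL });

type LeadPackage = {
  client_id: number;
  country_id: number;
  category: string;
  job_ids: number[];
};

async function buildLeadPackages(): Promise<LeadPackage[]> {
  // Collapse all tagged jobs into one row per client/country/category so a single
  // lead is created per group rather than one per job advert.
  const { rows } = await pool.query<LeadPackage>(
    `SELECT client_id,
            country_id,
            category,
            array_agg(id ORDER BY id) AS job_ids
     FROM jobs
     WHERE status = 'Tagged'
     GROUP BY client_id, country_id, category`,
  );
  return rows;
}
```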

7
6.0, 6.1

Decision Maker Enrichment

Find and verify contacts for outreach

For new companies (not existing clients), we need decision makers to contact. This stage sends companies to Clay for contact enrichment, then processes the returned contacts with email and phone verification.

Contact Data

  • First/last name
  • Email (verified)
  • Job title
  • LinkedIn URL
  • LinkedIn summary (for personalisation)

Verification

Unverified emails destroy sender reputation. Clay's waterfall tries multiple verification services. Only contacts with verified emails proceed to outreach.

8
7.0

Email Campaign Push

Push verified contacts to email campaigns

The final stage. Contacts with verified emails are pushed to Smartlead (or similar) for automated email sequences. Each contact is matched to the appropriate campaign based on job category.

Campaign Matching

  • Campaign per job category (Data, DevOps, etc.)
  • Personalisation fields: name, company, job link, location
  • LinkedIn summary for icebreakers

Status Tracking

Every contact has an outreach_status: Pending → Processing → In Campaign → Replied/Bounced/Unsubscribed. Full visibility into what happened.
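
The statuses above map naturally onto a small explicit state machine. Here's a sketch of how the allowed transitions might be encoded; the status names come from the flow above, the guard itself is illustrative.

```typescript
// Sketch of the outreach_status state machine. Status names come from the flow
// above; the transition guard is an illustration, not production code.
type OutreachStatus =
  | "Pending"
  | "Processing"
  | "In Campaign"
  | "Replied"
  | "Bounced"
  | "Unsubscribed";

const allowedTransitions: Record<OutreachStatus, OutreachStatus[]> = {
  Pending: ["Processing"],
  Processing: ["In Campaign"],
  "In Campaign": ["Replied", "Bounced", "Unsubscribed"],
  Replied: [],        // terminal: handed to a consultant
  Bounced: [],        // terminal: never re-contacted
  Unsubscribed: [],   // terminal: suppressed permanently
};

function transition(current: OutreachStatus, next: OutreachStatus): OutreachStatus {
  if (!allowedTransitions[current].includes(next)) {
    // Refusing illegal jumps keeps the tracking data trustworthy for reporting.
    throw new Error(`Illegal outreach_status transition: ${current} -> ${next}`);
  }
  return next;
}
```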

Why split into 12+ workflows?

One giant workflow would be simpler to understand. So why did we choose complexity?

1

Runtime & Memory

n8n slows down with large data volumes. Processing 5,000 jobs in one workflow hits memory limits. Split workflows handle smaller batches, completing faster and more reliably.

2

API Rate Limits

Every API has rate limits. Separate workflows let us control pacing with Wait nodes between batches; in a single monolithic workflow it's far harder to pause at the right points and respect those limits.

3

Debugging & Recovery

When something breaks, you know exactly which stage failed. Fix it there, reprocess from that status, done. No hunting through a 100-node monster trying to find what went wrong.

4

Atomic Status Tracking

Each record has a status field. If workflow 3.0 fails mid-run, records stay at "Pending Enrichment". Restart and it picks up where it left off. No duplicate processing, no lost data.

The "Passing the Baton" Pattern

Think of it like a relay race. Each workflow runs its leg, updates the record status, and hands off. The next workflow picks up records in that status and runs its leg. If a runner trips, only that leg needs to be re-run. The race isn't over.

What we'd tell ourselves before starting

Hindsight is valuable. Here's what we learnt the hard way.

1

Start with the end in mind

We built the scraper first, then realised we had no process for handling the leads. Always map the full flow before building any part of it. What happens after the automation runs?

2

Prompts are never done

Our AI prompts went through dozens of iterations. The first version of the Gatekeeper rejected too much. The second rejected too little. Start minimal, test with real data, refine based on actual results.

3

Paid APIs are worth it

We could have scraped job boards ourselves. The maintenance cost would have eaten any savings. Pay specialists to handle the hard parts. Your time is better spent on business logic.

4

Status tracking is essential

Every record needs a status field. Every workflow should only process specific statuses. This makes debugging, recovery, and monitoring possible. Without it, you're flying blind.

5

AI isn't always the answer

We use AI for only three tasks: title normalisation, relevance gating, and description parsing. Everything else is rules, lookups, and SQL. AI is expensive and sometimes wrong. Use it surgically.

6

Build for recovery, not perfection

Things will break. APIs will be down. Data will be weird. Design assuming failure, and recovery becomes trivial instead of catastrophic. Every workflow should be re-runnable.

7

Log everything you discard

When the AI Gatekeeper rejects a job, we log why. When a company has no domain, we log it. This data is gold for refining the system and understanding edge cases.

8

Think about the humans

The best automation is invisible to users. Consultants don't see the 12 workflows. They see leads appearing in their CRM with all the context they need. Design for the experience, not the architecture.

9

Document obsessively

Every node gets a sticky note explaining what it does. Every workflow has a canvas note with the big picture. When you're running 12+ workflows across multiple projects, you won't remember why you made a decision three months ago. Your future self will thank you.

10

Low-code doesn't mean no code

n8n workflows export to JSON. That JSON can be version controlled, diffed, and even converted to actual code with AI tools. We treat our workflows like software: they live in Git, have version numbers, and follow coding standards. This makes collaboration and handoffs possible.

Want to build something similar?

Every business is different, but the principles are the same. Let's talk about what's possible for yours.