My first complex ecosystem. The project that taught me version control, atomic database operations, fuzzy matching, and the limits of low-code tools. Everything I know about building automation at scale started here.
This is the first complex ecosystem I ever built. Not because I planned it that way, but because the problem kept getting bigger.
I'd already built loose automations for property data. Small, disconnected pieces that did one thing each. When I finally saw the full picture of what was needed, I realised those fragments could become something bigger. The ecosystem grew from there.
Some days I'd end work stuck on a problem and literally dream about the solution. I'd wake up anxious to get to my desk and try it. That excitement is what I'm still chasing today with every new project.
Everything I know about building automation at scale, I learnt the hard way on this engine. Version control. Passing data between workflows. Atomic database operations. Status flags. Fuzzy matching. Rate limiting. Breaking workflows into manageable chunks. When to loop, when to batch, when to split. This project forced me to figure it all out.
Property professionals need to know when properties in their portfolio hit the market. The data exists, but getting it is harder than it should be.
Rightmove shows display addresses like "Victoria Road, London" but not the full registered address. To match a listing to a client's property, you need the actual address. That's locked behind EPC certificates.
The manual process: check Rightmove. Find properties in your postcodes. Click through to the EPC certificate to get the real address. Cross-reference against your portfolio spreadsheet. Repeat for every postcode, every day. Some companies pay virtual assistants over £500 a month just to do this matching work.
Different clients, different postcodes, but the same underlying problem. Each needs property intelligence. Building a separate system for each would be wasteful. We needed a platform approach.
Build once, serve many. A universal property intelligence database that scrapes, enriches, and stores listings. Each client gets their own matching layer on top, delivered in their preferred format.
Every component was chosen for reliability, cost, and the ability to run unattended.
Workflow orchestration: n8n.
Central database: PostgreSQL. Scraping is expensive. Enrichment is expensive. If two clients both care about postcodes in Manchester, scraping twice is waste. The universal database scrapes once, clients query against it.
Property data source: EPC certificates, which hold the full registered addresses.
EPC address matching: AI. EPC certificates contain the full address, but it's not always formatted consistently. "Flat 3" might be "Flat Three" or "Unit 3" or "Apartment 3". AI handles the fuzzy matching human-style.
Client dashboards: Google Sheets. Clients don't want another login. They don't want to learn a dashboard. They want data where they already work. Google Sheets is familiar, shareable, and requires zero training.
Listing source: Propsense. Rightmove actively blocks scrapers. Maintaining our own would be a full-time job. Propsense aggregates from multiple sources and handles the cat-and-mouse game for us.
The universal layer scrapes and enriches. The client layer matches and delivers. They run independently but share the same data.
Think of it like a utility company. We build the water treatment plant (scraping + enrichment). Each client gets their own tap (matching + delivery). Adding a new client means adding a tap, not building a new plant.
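To make the split concrete, here is a minimal schema sketch of the two layers, assuming one shared PostgreSQL instance. The table and column names are illustrative rather than the production design; the later snippets in this write-up reuse the same names.

```sql
-- Universal layer: scraped once, enriched once, shared by every client.
CREATE TABLE listings (
    id                bigserial PRIMARY KEY,
    source_listing_id text UNIQUE NOT NULL,   -- ID from the listing feed, used for dedup
    display_address   text,                   -- vague address shown on the portal
    full_address      text,                   -- filled in by EPC enrichment
    postcode          text,
    price             integer,
    is_active         boolean NOT NULL DEFAULT true,  -- flipped by the delisted checker
    last_seen         timestamptz NOT NULL DEFAULT now()
);

-- Client layer: one portfolio of managed addresses per client.
CREATE TABLE client_portfolio (
    id        bigserial PRIMARY KEY,
    client_id text NOT NULL,
    address   text NOT NULL,
    postcode  text NOT NULL
);

-- Client layer: links a portfolio entry to a live listing.
CREATE TABLE client_matches (
    id           bigserial PRIMARY KEY,
    client_id    text NOT NULL,
    portfolio_id bigint NOT NULL REFERENCES client_portfolio(id),
    listing_id   bigint NOT NULL REFERENCES listings(id),
    confidence   text NOT NULL,               -- 'accurate' or 'approx'
    status       text NOT NULL DEFAULT 'live' -- set to 'delisted' when the listing goes
);
```

Adding a client means adding rows to client_portfolio and wiring up their matching workflow. Nothing in the universal layer changes.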
Runs once, serves all clients. Scrapes listings, enriches with EPC data, maintains the master database.
The core scraping engine. A controller workflow manages which outcodes to scrape; a processor workflow handles pagination, deduplication, and database storage. Supports full sync and incremental updates.
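For the deduplication step, here is a sketch of the kind of upsert the processor could use, leaning on the unique source_listing_id from the schema above; the exact columns are assumptions.

```sql
-- Re-scraping a postcode never creates duplicates: conflicting rows just get refreshed.
INSERT INTO listings (source_listing_id, display_address, postcode, price)
VALUES ($1, $2, $3, $4)
ON CONFLICT (source_listing_id) DO UPDATE
SET price     = EXCLUDED.price,
    is_active = true,        -- a relisted property comes back to life
    last_seen = now();
```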
The secret sauce. EPC certificates have a slider showing the property's energy rating. The scraper retrieves all EPC records for a postcode, then AI corrects any address formatting issues and proposes matching options based on the rating band. It replaces the human brain work of identifying which EPC record belongs to which listing.
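As a sketch of the narrowing step before the AI gets involved (the epc_records table and its columns are hypothetical): only certificates in the listing's postcode and rating band are offered as candidates.

```sql
-- Candidate EPC records for one listing: same postcode, same rating band.
-- The AI then tidies the address formatting and picks (or rejects) a match.
SELECT certificate_id, full_address, rating_band
FROM   epc_records
WHERE  postcode    = $1   -- the listing's postcode
  AND  rating_band = $2;  -- the band read off the listing's EPC rating
```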
Every listing carries a status flag as it moves through enrichment: pending_enrichment → enrichment_in_progress → ready_for_match_accurate / ready_for_match_approx / enrichment_not_possible
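One way to keep those flags honest is to enforce them in the database itself. A minimal sketch, assuming the illustrative listings table from earlier:

```sql
-- Every listing carries exactly one enrichment status, and nothing else is allowed in.
ALTER TABLE listings
    ADD COLUMN enrichment_status text NOT NULL DEFAULT 'pending_enrichment'
    CHECK (enrichment_status IN (
        'pending_enrichment',
        'enrichment_in_progress',
        'ready_for_match_accurate',
        'ready_for_match_approx',
        'enrichment_not_possible'
    ));

-- Partial index so the enrichment workflow can pull its queue cheaply.
CREATE INDEX listings_pending_idx
    ON listings (id)
    WHERE enrichment_status = 'pending_enrichment';
```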
Properties get sold or removed. This workflow periodically checks if matched listings are still active. If not, it flags them so clients know the opportunity has passed.
Clients acting on stale data waste time. Knowing a property is delisted is as valuable as knowing it's listed. This keeps the intelligence current.
After updating the universal database, it triggers client-specific workflows to update their matched listings too.
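A sketch of that client-side update, using the illustrative tables from earlier: once the universal checker flips is_active, each client's matches can be flagged in a single pass.

```sql
-- Flag this client's matches whose underlying listing is no longer live.
UPDATE client_matches AS m
SET    status = 'delisted'
FROM   listings AS l
WHERE  m.client_id  = $1          -- run per client by their own workflow
  AND  l.id         = m.listing_id
  AND  l.is_active  = false
  AND  m.status    <> 'delisted'
RETURNING m.id, m.portfolio_id;   -- the rows that need flagging on the client's dashboard
```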
Each client gets their own matching and delivery workflows: they query the universal database, match against the client's portfolio, and deliver to their dashboard.
Takes the client's portfolio of addresses and finds matches in the scraped listings. Uses a high-precision key matching algorithm: numbers, standalone letters, and core text stripped of noise words.
Full property details with client contact info attached. Includes listing URL, price, property type, EPC rating, agent details, match confidence, and all relevant metadata for the client dashboard.
Triggered by the universal Delisted Checker. Reviews the client's current matches, flags any that are no longer live, and updates their dashboard accordingly.
The dashboard always reflects reality. No phantom listings. If something is sold or withdrawn, the client knows immediately.
Triggered as part of the daily maintenance cycle. Client doesn't need to do anything. Data stays fresh.
Everything I learnt, I learnt by getting it wrong first.
The sheer amount of data to be scraped was overwhelming. Every API call costs money. Every field stored costs storage. I had to get ruthless about what actually mattered.
Push too few records and you waste API calls. Push too many and you hit memory limits or timeouts. Finding the sweet spot took trial and error.
A single workflow that does everything sounds elegant. In practice, it's a nightmare to debug, test, and maintain. I learnt to break at natural boundaries.
Running the same enrichment twice? Records getting processed out of order? Duplicate entries? I hit every concurrency bug possible before discovering atomic PostgreSQL patterns.
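The pattern that finally fixed it for me was claiming work inside a single statement. A sketch, assuming the illustrative enrichment_status column from earlier; the batch size is arbitrary.

```sql
-- Claim a batch of pending listings atomically. Two workflow runs can execute this
-- at the same time and will never pick up the same rows.
UPDATE listings
SET    enrichment_status = 'enrichment_in_progress'
WHERE  id IN (
    SELECT id
    FROM   listings
    WHERE  enrichment_status = 'pending_enrichment'
    ORDER  BY id
    LIMIT  50                  -- small enough to stay inside memory and time limits
    FOR UPDATE SKIP LOCKED     -- rows locked by a parallel run are skipped, not waited on
)
RETURNING id, display_address, postcode;
```

Whatever the statement returns is what that run processes; nothing else touches those rows until the status flips again.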
n8n loops are convenient but slow. Batch operations are fast but less flexible. I wasted weeks applying the wrong approach to the wrong scenario before learning when each one fits.
n8n is powerful, but it has limits. Memory caps. Execution time limits. Data transfer between nodes. This project pushed me into every one of them.
The most challenging part of the entire engine. Getting this right took weeks of iteration.
Client data says "Flat 3, 42 Victoria Road, London, SW1A 1AA". The EPC enriched address says "Apartment 3, 42 Victoria Rd, SW1A 1AA". Are they the same property? A human can tell instantly. A computer struggles.
I tried exact matching. Failed constantly. Tried basic fuzzy matching. Too many false positives. Tried AI matching. Too slow and expensive at scale. I needed something smarter.
After a lot of trial and error, brainstorming with AI, and testing edge cases, I landed on a PostgreSQL query that breaks addresses into components:
Numbers key: Extract all numbers. "Flat 3, 42 Victoria Road" becomes "3 42". Compare numeric fingerprints.
Letters key: Extract standalone letters. "Block A, Flat B" becomes "A B". Catches apartment/unit designations.
Postcode chunks: Break the postcode into outcode (SW1A) and incode (1AA). Match on both separately. Handles formatting variations.
Core text: Strip noise words (road, street, lane, flat, apartment). Compare what's left. "Victoria" matches "Victoria" regardless of "Road" vs "Rd".
By breaking addresses into components and matching on each, I get high precision without requiring exact matches. The query scores each component and returns a confidence level. High scores go straight through. Approximate matches get flagged for review.
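Here is a sketch of how those component keys can be derived in PostgreSQL, using the illustrative tables from earlier. The squash() helper and the noise-word list are assumptions; the real query also computes the same keys for the client's portfolio addresses and scores the comparison.

```sql
-- Collapse runs of whitespace so keys compare cleanly.
CREATE OR REPLACE FUNCTION squash(text) RETURNS text AS
$$ SELECT trim(regexp_replace($1, '\s+', ' ', 'g')) $$
LANGUAGE sql IMMUTABLE;

SELECT
    id,
    -- Numbers key: "Flat 3, 42 Victoria Road" -> "3 42"
    squash(regexp_replace(lower(full_address), '[^0-9]+', ' ', 'g'))           AS numbers_key,
    -- Letters key: keep standalone letters only, "Block A, Flat B" -> "a b"
    squash(regexp_replace(lower(full_address), '(\m[a-z]\M)|[^ ]', '\1', 'g')) AS letters_key,
    -- Postcode chunks (assuming the stored postcode keeps its space):
    -- "SW1A 1AA" -> outcode "SW1A", incode "1AA"
    split_part(upper(postcode), ' ', 1)                                        AS outcode,
    split_part(upper(postcode), ' ', 2)                                        AS incode,
    -- Core text: drop noise words and punctuation, so "Victoria Rd" ~ "Victoria Road"
    squash(regexp_replace(lower(full_address),
           '\m(flat|apartment|apt|unit|block|road|rd|street|st|lane|ln)\M|[^a-z ]',
           ' ', 'g'))                                                          AS core_text
FROM listings;
```

Strong agreement across numbers, letters, postcode, and core text goes straight to ready_for_match_accurate; partial agreement lands in ready_for_match_approx for review.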
It's not perfect. Edge cases still slip through. But it gets 92% right automatically, which means manual review is focused on the 8% that actually need human judgement.
Building property intelligence revealed patterns I didn't expect. Every lesson here cost me time.
"Flat 3" vs "Apartment 3" vs "Unit 3" vs "3". Same property, four different representations. Fuzzy matching was the only reliable solution. Pure string matching fails constantly.
Display addresses are intentionally vague. Full registered addresses are on EPC certificates. Access to that data transforms what's possible. It's the difference between "somewhere on Victoria Road" and "123 Victoria Road, Flat 4".
With thousands of listings flowing through enrichment, tracking where each one is becomes critical. pending → in_progress → ready → matched → archived. Every listing has exactly one status at any time.
I only started versioning workflows properly because this project forced me to. Breaking changes, lost work, "which version was working?"... Never again. Every change gets a version bump now.
Everyone wants new listings. But knowing something is no longer available is equally valuable. Clients were wasting hours chasing properties that were already sold. The delisted checker fixed that.
Building client-specific scrapers would have been faster initially. But the platform approach pays off with every new client. One enrichment run serves everyone. Costs stay flat as client count grows.
We could have built a fancy dashboard. Instead, we push to Google Sheets. Clients already know how to use it. Zero training, zero friction, immediate adoption. Technology should disappear, not impress.
Some of my best solutions came after sleeping on a problem. Literally dreaming about code. If you're stuck, step away. Your brain keeps working. Wake up anxious to try the new idea. That's when you know you've found something good.
Whether you manage a portfolio or just want to know when properties in your postcodes hit the market, we can help.