Airbnb Market Insight
A Medallion lakehouse that scrapes the Guatemala City short-term rental market, refines it Bronze → Silver → Gold on Delta Lake, and serves a live Next.js dashboard straight from Cloudflare R2.
View live dashboardThe expensive irony
A tale as old as time: a tech professional tries to diversify their portfolio and gets a harsh lesson in CapEx versus operating income.
I'm an AI & data engineer by trade. A while back I wanted some of that legendary "passive income," so I bought an apartment in one of the hottest areas of Guatemala City and threw it up on Airbnb.
Fast forward a year, and my passive income was flashing a bright neon red. I was losing margin to co-host fees, guessing my nightly rates, and generally running the business on vibes.
Then the irony hit me: I literally do this for a living. I spend my days building enterprise data architectures and deploying AI agents to optimize company workflows — so why was I manually messaging guests and pricing by gut?
I decided to stop being a landlord and start being an engineer.
A complete system to automate the hospitality business: a Market Insight pipeline (scrape the market, refine through a lakehouse, surface a live dashboard), a RAG training ground (mine my own hosting history into a vector library), and an AI co-host (a real-time agent that handles guests autonomously). This write-up covers the first.
Why a lakehouse, not a pile of CSVs
If you want to beat the market, you have to understand it — pricing, availability, occupancy, and review scores for every competing listing in the city. That means scraping, and scraping means chaos: raw, dynamically-loaded JSON that turns a storage drive into a toxic data swamp if you let it.
So I implemented a strict Medallion architecture — Bronze, Silver, Gold — on Delta Lake. Why Delta over a folder of CSVs? Because I want ACID transactions, schema enforcement, and time travel. If the pipeline breaks halfway through processing thousands of listings, I don't want corrupted data — I want a clean rollback.
The scraper
Here's a fun fact about Airbnb: they guard their official API keys like nuclear launch codes. Their frontend is a heavily fortified, dynamically-loaded React app, so simple HTTP requests bounce right off it.
The answer is heavier artillery: Playwright, a headless browser that renders the JavaScript and reads the data exactly as a human would see it. To keep my environment clean and make the pipeline portable to a remote server, I containerized the whole extraction step with Docker.
run_pipeline_step(
["docker-compose", "up", "--build"],
"EXTRACTION (Docker Playwright)",
run_dir=project_root,
)
The --build flag is there because I have trust issues with the Docker cache — forcing a rebuild means a tweak to requirements.txt (say, to dodge a new anti-bot measure) never gets silently ignored. The scraper navigates the map grids of Guatemala City, pulls raw JSON from the network traffic, and dumps it locally. Messy, raw, unstructured — exactly what Bronze is for.
Bronze → Silver → Gold
Bronze is a strict, append-only landing zone. We don't judge the data here; we just secure it. The scraper's unstructured JSON gets wrapped and written to a Delta table, guaranteeing an immutable record of exactly what the scraper saw on any given day.
Silver is where we enforce the law. Airbnb's frontend data is notoriously messy — prices carry dollar signs and commas, ratings hide inside strings like "4.8 out of 5", and bed counts need regex gymnastics. A PySpark session applies heavy-duty UDFs to parse the text, enforce types, and deduplicate. By the time data leaves Silver it's pristine and structured.
Gold is where the math stops being theoretical and starts being financial. Window functions compute historical occupancy, year-to-date revenue, and price-per-bed ratios. This is the highly-aggregated, business-level table that powers the dashboard.
gold_path = os.path.join("src", "etl", "gold.py")
run_pipeline_step([venv_python, gold_path], "GOLD LAYER (Business Metrics)", run_dir=project_root)
Getting Gold to the cloud — the new way
This is where the architecture changed from my first version. Originally the Gold layer fed MotherDuck (a cloud DuckDB warehouse) and a Streamlit dashboard embedded via iframe. It worked, but it leaned on an always-on warehouse and a separate BI tool.
The whole point of this project is to stop losing money, so spinning up a managed warehouse to host a few thousand rows was like buying a Ferrari to deliver a single pizza. The redesign is leaner:
Now the Gold Delta tables sync straight to a Cloudflare R2 bucket. A boto3 job walks the table directory — including the all-important _delta_log — and uploads it object by object, preserving the structure that makes it a valid Delta table rather than a loose pile of Parquet.
def upload_delta_table_to_bucket(local_folder, bucket_name, s3_prefix):
r2_client = boto3.client(
service_name="s3",
endpoint_url=f"https://{os.getenv('R2_ACCOUNT_ID')}.r2.cloudflarestorage.com",
aws_access_key_id=os.getenv("R2_ACCESS_KEY_ID"),
aws_secret_access_key=os.getenv("R2_SECRET_ACCESS_KEY"),
config=Config(signature_version="s3v4"),
)
for root, _, files in os.walk(local_folder):
for file in files:
local_path = os.path.join(root, file)
relative = os.path.relpath(local_path, local_folder)
key = os.path.join(s3_prefix, relative).replace("\\", "/")
r2_client.upload_file(local_path, bucket_name, key)
Reading it back, on demand
The dashboard itself is now Next.js, not Streamlit — the same stack as the rest of this site. When someone opens it, a Vercel serverless function reads the Delta table directly from R2 using DuckDB's delta_scan, which parses the _delta_log to find the live Parquet files, and returns clean JSON.
There's no warehouse sitting idle between requests. The function runs only when called, the response is cached at Vercel's edge so repeat visitors never pay a cold start, and the React client handles all filtering and aggregation in the browser. Serverless reads straight off object storage — about as lean as a lakehouse-backed dashboard gets.
The result is the live market dashboard: unique listings, average nightly price, occupancy, a historical revenue trend, and the full distribution of prices, ratings, and host features across every zona in the city. No more guessing — the data says exactly what the profitable listings are doing right.
See it in action
Explore the live, interactive dashboard reading straight from the Gold layer.
Open the dashboard