[Project] Production-Grade Book Recommendation System (Warm + Cold Users, Daily Retraining, Chatbot)
Hi everyone! I wanted to share a project I’ve been working on outside of the course.
It’s a production-grade book recommendation system that handles both warm users (with prior ratings) and cold users (no history).
I built it as a portfolio project to practice not just modeling but also messy data cleaning, automated retraining, and real-time serving.
- Live demo: try out recommendations and the chatbot.
- Portfolio write-up: full technical details.
What It Does
Warm users:
- Collaborative filtering with ALS embeddings (implicit feedback).
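For context, here is a minimal sketch of that warm-user path using the implicit library (the toy matrix, hyperparameters, and N are illustrative placeholders, not the production settings):

```python
import scipy.sparse as sp
from implicit.als import AlternatingLeastSquares

# Toy user x item confidence matrix; implicit >= 0.5 expects user-items CSR.
rows, cols, vals = [0, 0, 1, 2], [1, 3, 3, 0], [1.0, 3.0, 2.0, 1.0]
user_items = sp.csr_matrix((vals, (rows, cols)), shape=(3, 4))

# Train ALS on implicit-feedback confidences (placeholder hyperparameters).
model = AlternatingLeastSquares(factors=64, regularization=0.01, iterations=15)
model.fit(user_items)

# Top-N books for a warm user, excluding ones they already rated.
ids, scores = model.recommend(0, user_items[0], N=2,
                              filter_already_liked_items=True)
```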
Cold users:
- Attention-pooled subject embeddings derived from favorite genres.
- Bayesian popularity prior with a slider to balance robustness vs. discovery.
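The prior is the classic shrink-toward-the-global-mean trick, and the slider maps to the prior strength. A minimal sketch (function and argument names are mine, not the production API):

```python
import numpy as np

def bayesian_popularity(item_means, item_counts, prior_strength):
    """Shrink per-item mean ratings toward the global mean.

    prior_strength is the 'slider': high values favor heavily rated books
    (robustness), low values let sparsely rated long-tail books surface
    (discovery).
    """
    global_mean = np.average(item_means, weights=item_counts)
    weight = item_counts / (item_counts + prior_strength)
    return weight * item_means + (1.0 - weight) * global_mean
```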
Item similarity:
- ALS similarity (behavioral, good for series/authors).
- Subject similarity (content, better coverage for long-tail).
- Hybrid with a tunable weight.
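The hybrid is just a convex combination of the two similarity scores (a sketch; the weight name is mine):

```python
def hybrid_similarity(als_sim, subject_sim, w=0.5):
    """Blend behavioral and content similarity with one tunable weight:
    w=1.0 is pure ALS (series/authors), w=0.0 is pure subjects (long-tail)."""
    return w * als_sim + (1.0 - w) * subject_sim
```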
Daily retraining:
- Automated export → training (ALS, embeddings) → hot-reload of artifacts with zero downtime.
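The hot-reload follows the usual double-buffer pattern: load the new artifacts fully, then swap a single reference so in-flight requests never see a half-loaded model. A condensed sketch of that pattern (paths, serialization, and poll interval are simplified stand-ins for the real artifact loading):

```python
import pickle
import threading
import time
from pathlib import Path

class ArtifactStore:
    """Requests read `current`; a background thread loads newer artifact
    files and swaps the reference, which is atomic under the GIL."""

    def __init__(self, path: Path):
        self.path = path
        self.mtime = 0.0
        self.current = None
        self._reload_if_newer()
        threading.Thread(target=self._watch, daemon=True).start()

    def _reload_if_newer(self):
        mtime = self.path.stat().st_mtime
        if mtime > self.mtime:
            with open(self.path, "rb") as f:
                fresh = pickle.load(f)               # load fully first...
            self.current, self.mtime = fresh, mtime  # ...then swap

    def _watch(self, interval=30):
        while True:
            time.sleep(interval)
            self._reload_if_newer()
```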
Dataset & Challenges
I intentionally chose the Book-Crossing dataset (2004), which is messy and incomplete:
- Missing demographics, noisy ISBNs, no subjects.
- Required ISBN normalization → work_id mapping → edition merging (normalization sketched below).
- Subjects enriched from Open Library (~130k → ~1k cleaned categories).
- User ages/locations cleaned, ratings stabilized.
The messiness was a feature — it forced me to build the same kind of robust pipeline real-world systems need.
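To give a flavor of the ISBN step, here is a simplified normalization sketch. Real cleaning needs more edge-case handling (and should validate the ISBN-10 check digit before converting), and the work_id mapping against Open Library sits on top of this:

```python
def normalize_isbn(raw: str) -> str | None:
    """Normalize a noisy ISBN string to ISBN-13; None if unrecoverable."""
    s = "".join(ch for ch in raw.upper() if ch.isdigit() or ch == "X")
    if len(s) == 10:                       # ISBN-10 -> ISBN-13
        core = "978" + s[:9]               # drop the old check digit
        total = sum((1 if i % 2 == 0 else 3) * int(d)
                    for i, d in enumerate(core))
        return core + str((10 - total % 10) % 10)
    if len(s) == 13:
        return s
    return None
```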
Models
- ALS embeddings for collaborative filtering (warm users + item similarity).
- Subject embeddings trained with both RMSE regression and a contrastive loss to improve neighborhoods (loss sketched below).
- Attention pooling to weight the most informative subjects (scalar, per-dim, transformer/self-attention variants; scalar variant sketched below).
- FAISS indices for efficient top-K retrieval.
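For the contrastive part, an in-batch InfoNCE-style loss over pairs of books that share subjects looks roughly like this (a generic sketch, not necessarily the exact formulation used here):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor, positive, temperature=0.07):
    """Pull each book toward its subject-sharing partner in the batch and
    push it away from everything else."""
    a = F.normalize(anchor, dim=-1)        # (batch, dim)
    p = F.normalize(positive, dim=-1)      # (batch, dim)
    logits = a @ p.T / temperature         # pairwise cosine similarities
    labels = torch.arange(a.size(0), device=a.device)  # diagonal = positives
    return F.cross_entropy(logits, labels)
```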
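The scalar attention-pooling variant is the simplest of the three; a minimal PyTorch sketch (shapes and names are illustrative):

```python
import torch
import torch.nn as nn

class ScalarAttentionPool(nn.Module):
    """Pool a variable-length set of subject embeddings into one vector,
    upweighting the most informative subjects. The per-dim and self-attention
    variants replace the scoring step below."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)     # one relevance logit per subject

    def forward(self, subj: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # subj: (batch, n_subjects, dim); mask: (batch, n_subjects) bool
        logits = self.score(subj).squeeze(-1)
        logits = logits.masked_fill(~mask, float("-inf"))
        attn = torch.softmax(logits, dim=-1)
        return torch.einsum("bn,bnd->bd", attn, subj)  # weighted average
```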
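And retrieval is a FAISS inner-product index over L2-normalized vectors (exact flat index shown for simplicity; at scale you would typically switch to an approximate one like IVF or HNSW):

```python
import numpy as np
import faiss

dim = 64
item_vecs = np.random.rand(10_000, dim).astype("float32")  # stand-in embeddings
faiss.normalize_L2(item_vecs)          # inner product == cosine similarity

index = faiss.IndexFlatIP(dim)
index.add(item_vecs)

query = item_vecs[:1]                  # e.g. one item's embedding
scores, ids = index.search(query, 10)  # top-10 neighbours
```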
Chatbot (Demo)
The site also includes a virtual librarian chatbot:
- Current demo: uses limited web tools + internal docs to answer onboarding questions.
- Roadmap: directly connect it to the internal recommendation engines (ALS, subject embeddings, cold hybrid) for catalog-grounded, explainable recs.
Lessons Learned
- Split first → then compute aggregates to prevent leakage.
- Evaluate what you serve → ranking metrics (Recall@K, NDCG@K) instead of only RMSE.
- Split-safe vocabularies → avoid untrained embedding rows creeping in.
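The first lesson in code: split, then aggregate, then join the train-only stats back onto every split (a toy pandas sketch; column names are made up):

```python
import pandas as pd

ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3], "work_id": [10, 11, 10, 12],
    "rating": [8, 7, 9, 6], "split": ["train", "train", "train", "test"],
})

# Aggregates come from the training rows only...
train = ratings[ratings["split"] == "train"]
item_stats = train.groupby("work_id")["rating"].agg(["mean", "count"])

# ...and are then joined onto all rows, so the test split never leaks in.
ratings = ratings.join(item_stats, on="work_id")
```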
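And the second lesson: score the top-K list you actually serve. Binary-relevance versions of both metrics (a sketch):

```python
import numpy as np

def recall_at_k(recommended, relevant, k):
    """Fraction of held-out relevant items recovered in the top-k."""
    hits = len(set(recommended[:k]) & set(relevant))
    return hits / len(relevant) if relevant else 0.0

def ndcg_at_k(recommended, relevant, k):
    """Rewards placing held-out items near the top of the list."""
    rel = set(relevant)
    dcg = sum(1.0 / np.log2(i + 2)
              for i, item in enumerate(recommended[:k]) if item in rel)
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(rel), k)))
    return dcg / ideal if ideal > 0 else 0.0
```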
More Info
This post is a short overview; the links below go deeper.
- Portfolio Link: the full technical pipeline, experiments, and mistakes/lessons learned.
- Live demo: try the recommender and the chatbot yourself.
- Github Repo: the code.