[Project] Production-grade Book Recommendation System (Warm + Cold Users, Daily Retraining, Chatbot)

Hi everyone! I wanted to share a project I’ve been working on outside of the course.
It’s a production-grade book recommendation system that handles both warm users (with prior ratings) and cold users (no history).
I built it as a portfolio project to practice not just modeling but also messy data cleaning, automated retraining, and real-time serving.
🌐 Live demo: try out recommendations and the chatbot.
📄 Portfolio write-up: full technical details.


What It Does

  • Warm users:

    • Collaborative filtering with ALS embeddings (implicit feedback).
  • Cold users:

    • Attention-pooled subject embeddings derived from favorite genres.
    • Bayesian popularity prior with a slider to balance robustness vs. discovery.
  • Item similarity:

    • ALS similarity (behavioral, good for series/authors).
    • Subject similarity (content, better coverage for long-tail).
    • Hybrid with a tunable weight.
  • Daily retraining:

    • Automated export → training (ALS, embeddings) → hot-reload of artifacts with zero downtime.
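
For anyone curious how the Bayesian popularity prior and the slider could fit together, here is a minimal sketch. It assumes the prior is a Bayesian average that shrinks each book's mean rating toward the global mean, and that the slider is a simple linear blend weight. The names `bayesian_score`, `blend`, and `prior_strength` are hypothetical and are not taken from the project's code.

```python
def bayesian_score(rating_sum, rating_count, global_mean, prior_strength):
    """Shrink an item's mean rating toward the global mean.

    prior_strength behaves like a number of pseudo-ratings placed at the
    global mean, so items with few ratings stay close to that mean.
    """
    if rating_count < 0 or prior_strength < 0:
        raise ValueError("counts and prior_strength must be non-negative")
    denominator = prior_strength + rating_count
    if denominator == 0:
        raise ValueError("need at least one real or pseudo rating")
    return (prior_strength * global_mean + rating_sum) / denominator


def blend(similarity, popularity, weight):
    """Linear blend of two scores that are already on the same scale.

    weight = 0.0 uses only the similarity score (more discovery),
    weight = 1.0 uses only the popularity prior (more robustness).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must be in [0, 1]")
    return (1.0 - weight) * similarity + weight * popularity
```

With a global mean of 7.0 and `prior_strength=10`, a book with two ratings of 10 scores (10 * 7.0 + 20) / 12 = 7.5 rather than 10, which is the robustness effect the slider trades off against discovery.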

Dataset & Challenges

I intentionally chose the Book-Crossing dataset (2004), which is messy and incomplete:

  • Missing demographics, noisy ISBNs, no subjects.
  • Required ISBN normalization → work_id mapping → edition merging.
  • Subjects enriched from Open Library (~130k raw subjects reduced to ~1k cleaned categories).
  • User ages/locations cleaned, ratings stabilized.
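
To illustrate the ISBN normalization step, here is a small sketch that canonicalizes ISBN-10 and ISBN-13 strings to ISBN-13 using the standard check-digit rules. The function names are hypothetical, and the actual pipeline may handle more edge cases; the later work_id mapping and edition merging are not shown.

```python
def _isbn13_check_digit(first12):
    """Compute the ISBN-13 check digit from the first 12 digits."""
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(first12))
    return str((10 - total % 10) % 10)


def _valid_isbn10(s):
    """Validate a 10-character ISBN-10 (the last character may be 'X')."""
    if not s[:9].isdigit():
        return False
    # Weights run from 10 down to 1; 'X' stands for the value 10.
    total = sum((10 - i) * (10 if ch == "X" else int(ch)) for i, ch in enumerate(s))
    return total % 11 == 0


def normalize_isbn(raw):
    """Return a canonical ISBN-13 string, or None if the input is not a valid ISBN."""
    # Keep only ASCII digits and 'X', dropping hyphens, spaces, and other noise.
    s = "".join(ch for ch in raw.upper() if ch in "0123456789X")
    if len(s) == 10 and _valid_isbn10(s):
        core = "978" + s[:9]
        return core + _isbn13_check_digit(core)
    if len(s) == 13 and s.isdigit() and _isbn13_check_digit(s[:12]) == s[12]:
        return s
    return None
```

For example, `normalize_isbn("0-306-40615-2")` and `normalize_isbn("978-0-306-40615-7")` both map to the same key, so two spellings of one edition collapse before any mapping to works.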

The messiness was a feature: it forced me to build the kind of robust pipeline that real-world systems need.


Models

  • ALS embeddings for collaborative filtering (warm users + item similarity).
  • Subject embeddings trained with both an RMSE regression objective and a contrastive loss to improve nearest-neighbor quality.
  • Attention pooling to weight the most informative subjects (scalar, per-dim, transformer/self-attention variants).
  • FAISS indices for efficient top-K retrieval.
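
As a rough picture of the scalar attention-pooling variant, the sketch below computes a softmax over one logit per subject and uses the result to take a weighted average of the subject embeddings. In a trained model the logits would be learned; here they are hand-set inputs, and the function names are hypothetical. Plain Python lists are used to keep the example dependency-free.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(embeddings, logits):
    """Pool subject embeddings into one vector.

    embeddings: list of equal-length vectors, one per subject.
    logits: one scalar attention logit per subject.
    Returns the softmax-weighted average of the embeddings.
    """
    if not embeddings or len(embeddings) != len(logits):
        raise ValueError("need one logit per embedding and at least one subject")
    weights = softmax(logits)
    dim = len(embeddings[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, embeddings))
        for d in range(dim)
    ]
```

With equal logits this reduces to mean pooling; raising one subject's logit pulls the pooled vector toward that subject's embedding, which is the "weight the most informative subjects" behavior described above.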

Chatbot (Demo)

The site also includes a virtual librarian chatbot:

  • Current demo: uses limited web tools + internal docs to answer onboarding questions.
  • Roadmap: directly connect it to the internal recommendation engines (ALS, subject embeddings, cold hybrid) for catalog-grounded, explainable recs.

Lessons Learned

  • Split first → compute aggregates on the training split only, to prevent leakage.
  • Evaluate what you serve → ranking metrics (Recall@K, NDCG@K) instead of only RMSE.
  • Split-safe vocabularies → avoid untrained embedding rows creeping in.
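
For reference, here is a minimal sketch of the two ranking metrics mentioned above, using binary relevance. It assumes `ranked` is the recommended list in order and `relevant` is the set of held-out items for that user; the function names are hypothetical and the project's evaluation code may differ.

```python
import math


def recall_at_k(ranked, relevant, k):
    """Fraction of the relevant items that appear in the top-k of the ranking."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in ranked[:k] if item in relevant)
    return hits / len(relevant)


def ndcg_at_k(ranked, relevant, k):
    """Binary-relevance NDCG@k: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(
        1.0 / math.log2(position + 2)
        for position, item in enumerate(ranked[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    if ideal_hits == 0:
        return 0.0
    ideal_dcg = sum(1.0 / math.log2(position + 2) for position in range(ideal_hits))
    return dcg / ideal_dcg
```

Unlike RMSE, both metrics depend only on the order of the top-K list, which is what users actually see when recommendations are served.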

More Info

This post is a short overview.
If you’re curious about the full technical pipeline, experiments, and mistakes/lessons learned, you can check it out here: Portfolio Link
You should also check out the live demo here: Live demo
You can look at the code here: GitHub repo