Fine-Tuning a Niche Chatbot: How to Capture Cultural Zeitgeist?

I’m exploring a project aimed at creating a chatbot that resonates deeply with its users—specifically, single GenX women in the Southern US who love cats (and maybe even Taylor Swift - not that this is familiar territory or anything). I see the model as a kind of digital confidant, well-versed in the cultural references, humor, and experiences of this demographic.

It’s an admittedly goofy project but if I can make it work, I have other applications in mind. I figure I might as well have fun with the POC :smiley_cat:

Objective: The goal is to develop a language model that not only converses but also reflects the zeitgeist and lifestyle of these women, functioning like a chat with a close friend from the same cultural background.

Technical Approach:

  • Base Model: Start with an open-source LLM.
  • Customization: Plan to fine-tune this model using a niche dataset comprising blogs, forums, articles, and social media posts that capture the group’s cultural essence. Fine-tune or RAG??

Challenges: I’m seeking guidance on the following:

  • Data Collection: What are the best practices for assembling a culturally rich dataset while ensuring ethical standards?
  • Modeling Techniques: Is fine-tuning the best approach for deep cultural nuance, or should I consider adding Retrieval-Augmented Generation (RAG) for dynamic content?

Request for Insights: I would greatly appreciate any advice on capturing cultural nuances effectively, maintaining ethical data practices, and any recommendations on model training strategies or tools. Insights or references to similar projects would be invaluable. Or if this just won’t work at all??