Idea for project: stocks search. (pointers appreciated!)

cmackenzie · April 23, 2023, 11:52pm

Hi !

At my work we’re currently developing an app for investors, and one important feature is the ability to search for instruments (stocks, funds, etc.). This is currently implemented very crudely by looking up the search terms (e.g. ‘apple’) in the instrument names (e.g. Apple Computer Inc.) or the stock/fund symbols (e.g. AAPL, MSFT, TSLA).
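For reference, the crude lookup described above could look roughly like the sketch below. The `Instrument` class, the sample data, and the function name are hypothetical, made up purely for illustration; the real app may do this quite differently.

```python
from dataclasses import dataclass


@dataclass
class Instrument:
    symbol: str
    name: str


# Hypothetical sample data, for illustration only.
INSTRUMENTS = [
    Instrument("AAPL", "Apple Inc."),
    Instrument("MSFT", "Microsoft Corporation"),
    Instrument("TSLA", "Tesla, Inc."),
]


def crude_search(query, instruments):
    """Return instruments whose name or symbol contains the query (case-insensitive)."""
    q = query.strip().lower()
    if not q:
        return []
    return [
        inst
        for inst in instruments
        if q in inst.name.lower() or q in inst.symbol.lower()
    ]


print([inst.symbol for inst in crude_search("apple", INSTRUMENTS)])
print([inst.symbol for inst in crude_search("technology", INSTRUMENTS)])
```

The second query returns nothing because "technology" appears in neither a name nor a symbol, which is exactly the limitation a more flexible search would need to address.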

I think a deep learning approach might work much better for this, but I don’t really know what kind of model I would need to use.

The idea would be to make the search much more flexible, allowing the user to type in natural language that doesn’t necessarily include the symbols or company names, and get back relevant instruments.

I’m just starting version 2 of the course; I did part one a while ago. Any pointers as to what kind of models might be helpful for a task like this? What should I look up, study, etc.?

Sorry if what I’m trying to build is not clear.

Cheers !

mw00 · April 24, 2023, 3:30pm

If you want to simplify the process of looking up names with natural language, I’d suggest chapters 3 and 4 of the course.

Chapter 4 covers Natural Language Processing, which is about applying machine learning to text. For example, you can learn to generate or summarize text with NLP. If you have covered the basics of this chapter, I think you could achieve your desired goal with a little more research and experimentation.

I also recommend looking into chapter 3 because it covers the fundamentals of neural nets. With the information from this chapter, you will understand chapter 4, which is about NLP, a lot better.

Overall, I think chapter 4 will help you the most. Combining it with some further research of your own and the basics from chapter 3 should give you a solid foundation to build on, if you are eager to learn the topic with the FastAI course.

cmackenzie · April 24, 2023, 6:25pm

Thanks! Yes, I’m actually looking at that notebook right now, but the task is quite different. I’m looking at search basically as a multi-class classification problem, although that might not be the best way to look at it.

valitovrus · April 26, 2023, 9:20am

I am not an expert; I am also just studying ML/NLP. However, the problem is so fascinating that I couldn’t pass it by.

It seems that you need to solve a document/text similarity problem: given a document (the user’s description of an instrument), return a list of the most similar documents (instrument descriptions). A Google search for “document similarity problem” yields tons of results, but, in one form or another, it would require representing documents as vectors and then finding the vectors most similar to the target one.
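To make the "documents as vectors" idea concrete, here is a minimal sketch using TF-IDF weights and cosine similarity, written with the standard library only. TF-IDF is just one simple way to vectorize text, swapped in here for illustration; it matches on shared words and does not capture meaning the way learned embeddings would. The descriptions, the `EMF` symbol, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical instrument descriptions, for illustration only.
DESCRIPTIONS = {
    "AAPL": "consumer technology company making phones and computers",
    "MSFT": "technology company selling software and cloud services",
    "EMF": "fund investing in emerging markets equities",
}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_idf(docs):
    """Smoothed inverse document frequency for every token in the corpus."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorize(tokens, idf):
    """Sparse TF-IDF vector (token -> weight); unknown tokens are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    return {t: count * idf[t] for t, count in tf.items()}


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, descriptions, top_k=3):
    """Rank instruments by similarity between the query and their descriptions."""
    symbols = list(descriptions)
    docs = [tokenize(descriptions[s]) for s in symbols]
    idf = build_idf(docs)
    query_vec = vectorize(tokenize(query), idf)
    scored = [(cosine(query_vec, vectorize(d, idf)), s) for s, d in zip(symbols, docs)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(s, round(score, 3)) for score, s in scored[:top_k] if score > 0]


print(search("emerging markets fund", DESCRIPTIONS))
print(search("technology", DESCRIPTIONS))
```

Swapping `vectorize` for a function that returns a learned embedding would keep the same overall shape: vectorize everything once, then rank by similarity to the query vector.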

A powerful way to do it is embeddings. This topic is covered in lesson 7 in the context of collaborative filtering.

There is also an algorithm called Doc2Vec which might be worth studying, though I haven’t done so myself yet and am not sure how relevant it is.

You can also have a look at the tutorial for this Kaggle competition, as it describes embeddings and document vectorization, though it seems to me it might be outdated.

valitovrus · April 26, 2023, 9:29am

Also, which problem would it solve from a user perspective? When might a user prefer looking up an instrument by an arbitrary natural language description, rather than using domain-specific filters like instrument type (bond/stock/…), domicile country, dividend policy, etc.?

For instance, if the problem can be restated as something like “what else can I add to my portfolio”, then it would be more like a recommendation/collaborative filtering problem.

cmackenzie · April 26, 2023, 12:33pm

Hi @valitovrus

Thanks a lot for the links. I’ve been doing some research and I think the problem could be framed as a semantic search problem / document similarity problem.

I’ve yet to give it a try but I’m looking at this tutorial which seems promising.

From a user perspective, I think it might be useful since you could look up instruments without explicitly referencing the instrument names or market symbols.

E.g. you could look up Apple and Microsoft by referencing ‘technology’, or have a bunch of funds come up by looking for “emerging markets”, without having those words explicitly in the fund name. Also, I expect this might work better with user typos.
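On the typo point, one simple non-ML baseline to compare against is character trigram overlap (Jaccard similarity), sketched below. To be clear, this is a different technique from semantic search, shown only to illustrate how a misspelled query can still match; the names, threshold, and function names are assumptions for illustration.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams, padded so word boundaries also count."""
    padded = f"  {text.lower().strip()} "
    return {padded[i:i + n] for i in range(len(padded) - n + 1)}


def trigram_similarity(a, b):
    """Jaccard similarity between the character trigram sets of two strings."""
    grams_a, grams_b = char_ngrams(a), char_ngrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical instrument names, for illustration only.
NAMES = ["Apple", "Microsoft", "Tesla"]


def fuzzy_lookup(query, names, threshold=0.3):
    """Return names whose trigram similarity to the query meets the threshold, best first."""
    scored = [(trigram_similarity(query, name), name) for name in names]
    return [name for score, name in sorted(scored, reverse=True) if score >= threshold]


print(fuzzy_lookup("micorsoft", NAMES))
print(fuzzy_lookup("aple", NAMES))
```

The threshold of 0.3 is arbitrary and would need tuning on real queries.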

Of course this whole thing could be solved without machine learning by modeling a relational database, but tasks like sorting by relevance are not trivial to implement.

I’ll try and implement something along the lines of the link above and post here if I find something interesting.