Good resources for explaining when to use, and when not to use, ML?

NegatioN · July 30, 2018, 7:09am

Hi guys, I’ve recently transitioned from “ML engineer” to “Data Scientist” (by that I mean more sole responsibility for external products using machine learning, requiring more direct communication with stakeholders and product people).

My first project wasn’t a complete waste, but not a complete success either.

There were some things that weren’t completely obvious at the start, which clearly hampered development of the ML part. One of them: the product didn’t really have the data to support using machine learning yet, even though it’s a great candidate for it later on.

What are some good explanations of prerequisites for needing/using ML you have found?
How can we make teams without direct ML experience prepare well for a time when they might actually need ML in their products?

Examples off the top of my head are:

Do you have enough data-points to capture some truth about the distribution you’re trying to explain?
Would you be able to tell example A and example B apart as a human, based on the data you have about them?
Is my data in a structured form that the computer understands? (Is it numerical or categorical, or is it possible to turn it into structured data via image or NLP models? PS: turning it into structured data is probably a last resort.)
Does your data already exist somewhere, or does it have to be collected from scratch?
++
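
A minimal sketch of how the first few checks could be turned into a quick report, assuming the data can be loaded as a list of dicts. The function name `readiness_report`, the field names, and the `min_rows_per_class` threshold are hypothetical and only there for illustration; sensible thresholds depend entirely on the problem.

```python
from collections import Counter

def readiness_report(rows, label_key, min_rows_per_class=100):
    """Summarise row count, per-field missing rate and label balance.

    rows: list of dicts, one per example (hypothetical schema).
    label_key: name of the field holding the target we want to predict.
    """
    n = len(rows)
    fields = sorted({key for row in rows for key in row})
    # Share of rows where a field is absent, None or an empty string.
    missing = {
        field: sum(1 for row in rows if row.get(field) in (None, "")) / n
        for field in fields
    } if n else {}
    # How many examples we have per target value.
    labels = Counter(
        row[label_key] for row in rows if row.get(label_key) not in (None, "")
    )
    # Classes with too few examples to say much about their distribution.
    thin = sorted(c for c, count in labels.items() if count < min_rows_per_class)
    return {
        "n_rows": n,
        "missing_rate": missing,
        "label_counts": dict(labels),
        "thin_classes": thin,
    }

# Toy data, made up for this example.
rows = [
    {"tags": "python", "location": "Oslo", "salary": None, "applied": "yes"},
    {"tags": "java", "location": "", "salary": None, "applied": "no"},
    {"tags": "python", "location": "Bergen", "salary": 50000, "applied": "no"},
    {"tags": "sql", "location": "Oslo", "salary": None, "applied": "no"},
]
report = readiness_report(rows, "applied", min_rows_per_class=2)
print(report)
```

A report like this does not say whether ML will work, but a field that is missing in most rows, or a class with only a handful of examples, is an early hint that the data is not there yet.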

Really appreciate any thoughts, or links about the subject!

msp · July 30, 2018, 9:19am

Good question!

I am not sure I understand your second bullet. Could you explain?

In a company setting, there may also be a kind of necessity criterion: do you really need ML to solve this problem, i.e. isn’t there a simpler method that will do the trick? If there is a simpler method, it can be hard to defend the use of ML.

NegatioN · July 30, 2018, 9:45am

The second bullet was intended as a very rough intro to content-based models using more “structured data”, for someone who has never done it, and I definitely appreciate help reformulating it.

Take a (not so hypothetical) example: imagine we have to recommend a job to a candidate/user, and our data about the job-ad is something like this: a few tags about the job (ex: python, data science, statistics), location and a professional genre (ex: software).

If we see this from the perspective of the candidate, we’re most likely not able to provide a very specific recommendation based on these few data-points. The candidate might care about the size of the company, the salary of the position, what kind of responsibilities it entails, or the benefits provided by the company.

So in this case, our mission is to predict which candidates are likely to respond positively to our job-ad, but we haven’t properly thought about what kind of information our model would need to make good predictions for that, or what the candidate might need to make the decision.

If teams had a way to learn, earlier in the process, what kind of data makes sense to collect and how much of it, that would probably be better for everyone than having to backfill or live with incomplete data later on.

I do realize that the team can’t possibly get everything right on the first try, and that the suggestions I noted down here might possibly be the wrong things to collect as well. But if we as humans can’t predict if a recommendation sounds correct for a given case, with the data we have, then we’re probably lacking something to make it an easily viable ML product at least.
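
One rough way to put a number on "can we tell A and B apart with the data we have" is to look for examples with identical features but different outcomes. The sketch below assumes each example is a `(features, label)` pair with hashable features; `separability_ceiling` is a hypothetical helper name, and the job-ad rows are made up. It returns the best accuracy any model could reach on this exact data using only those features.

```python
from collections import Counter, defaultdict

def separability_ceiling(examples):
    """Best achievable accuracy on `examples` using only the given features.

    examples: iterable of (features_tuple, label) pairs.
    Examples with identical features but different labels cannot all be
    predicted correctly, so each group contributes only its majority count.
    """
    by_features = defaultdict(Counter)
    for features, label in examples:
        by_features[features][label] += 1
    total = sum(sum(counts.values()) for counts in by_features.values())
    best = sum(max(counts.values()) for counts in by_features.values())
    return best / total if total else 0.0

# Made-up job-ad data: (tags, location, genre) -> did the candidate respond?
examples = [
    (("python", "Oslo", "software"), "interested"),
    (("python", "Oslo", "software"), "not interested"),
    (("python", "Oslo", "software"), "not interested"),
    (("java", "Bergen", "software"), "interested"),
]
ceiling = separability_ceiling(examples)
print(ceiling)  # 0.75
```

The number is optimistic: if every feature combination is unique the ceiling is 1.0 and says nothing. A low value on coarse features like these, though, suggests the model is missing information (salary, company size, and so on) that the candidate actually uses to decide.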

Granted, this is probably not the worst case of lacking information the world has ever seen, and the point about “Is machine learning right for this product at all / yet” is probably more important.

machinethink · July 30, 2018, 9:59am

I’d ask these questions:

Can the problem be solved (to a certain degree) with traditional programming / logic?

Does that logic include heuristics?

Do you have the data to train a model that can replace those heuristics (and possibly the rest of the logic)?
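
These questions can be sketched as a baseline comparison: write the heuristic down as code, measure it on held-out examples, and only swap in a model if it clearly beats that number. Everything below is hypothetical (the `tag_overlap_rule` heuristic, the field names, the `min_gain` margin, and the toy data), so treat it as an illustration rather than a recipe.

```python
def accuracy(predict, examples):
    """Share of (features, label) examples that `predict` gets right."""
    correct = sum(1 for features, label in examples if predict(features) == label)
    return correct / len(examples)

def tag_overlap_rule(features):
    """Hypothetical heuristic: recommend when candidate and job share a tag."""
    return bool(set(features["candidate_tags"]) & set(features["job_tags"]))

def worth_replacing(model_accuracy, baseline_accuracy, min_gain=0.02):
    """Only prefer the model if it beats the heuristic by a clear margin."""
    return model_accuracy - baseline_accuracy >= min_gain

# Made-up held-out examples: features -> did the candidate respond positively?
held_out = [
    ({"candidate_tags": ["python", "statistics"],
      "job_tags": ["python", "data science"]}, True),
    ({"candidate_tags": ["java"], "job_tags": ["python"]}, False),
    ({"candidate_tags": ["python"], "job_tags": ["python"]}, False),
    ({"candidate_tags": ["sql"], "job_tags": ["statistics"]}, False),
]

baseline = accuracy(tag_overlap_rule, held_out)
print(baseline)  # 0.75
# A model scoring 0.76 on the same examples would not justify the extra complexity.
print(worth_replacing(0.76, baseline))  # False
```

Any trained model would plug into `accuracy` through the same `predict(features)` interface, which keeps the comparison with the heuristic honest.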

NegatioN · July 30, 2018, 10:35am

I forgot to add this to the thread, which describes some things to consider when launching a product that may or may not need ML https://developers.google.com/machine-learning/guides/rules-of-ml/

I think rule number 1 is noteworthy:
If machine learning is not absolutely required for your product, don’t use it until you have data.

cedric · July 30, 2018, 12:19pm

Good thread. Thanks to everyone for sharing.

From the “Rules of Machine Learning” link shared:

To make great products:

do machine learning like the great engineer you are, not like the great machine learning expert you aren’t.

Most of the problems you will face are, in fact, engineering problems. Even with all the resources of a great machine learning expert, most of the gains come from great features, not great machine learning algorithms.

This is gold!

TL;DR: don’t use ML until you have built a proper data foundation.

A little context: I am a long-time software engineer trying to transition to ML engineer. Over the past year or so, I have talked to data science teams from several companies in the Asia region, such as Grab, Carousell, and GO-JEK. These are enterprise-level companies with terabytes of data (not Google scale, though), and they are trying to scale their older-generation data analytics and reduce manual feature engineering. Below I share what I have learned so far, from my perspective as an outsider. From what I have heard and understood, there is a common theme across these companies, though the details may differ.

Here’s one example of scaling Machine Learning at GO-JEK on Google Cloud:

https://twitter.com/cedric_chee/status/1019573702097174528

Before ML became all the rage, companies used simple solutions, such as writing SQL queries against their databases to gain insights, and that worked pretty well for a long time.

Here’s what these companies’ data science teams do to get ready to use ML:

Data sourcing - build a data foundation; reporting, archival, monitoring, security.
Challenges with features - consistency, volume of feature data, access to real-time features, discovery
Feature creation (in GO-JEK’s case, they use Cloud Dataflow to do that)

That is mostly data engineering work and standing up infrastructure for data pipelines.

Once those pieces are in place, the data comes to you and features are nearly free, so you have clean data and features ready for ML modelling.

NegatioN · July 31, 2018, 6:46am

@cedric that TL;DR sums it up well. I read that paragraph slightly differently at first, but this seems to be closer to the intended meaning. There’s obviously a reason that it’s encouraging teams to try exploiting the simple features first.

The only downside I can see coming from this is teams that won’t be willing to sacrifice the performance of their hierarchical solution until it’s beaten on absolutely every metric or front. That’s more of a political problem though.

Malemute · August 2, 2018, 6:38am

Do you think delivery planning could be a good area for ML or any kind of neural network? There is a city with a complex structure and traffic jams, a number of couriers, and time intervals for delivery. The task is way too complicated for the usual algorithmic programming. But no more so than playing Go, is it? Could AI be used here, and where would one start?

NegatioN · August 2, 2018, 8:47am

@Malemute This thread is not really about this topic. It sounds difficult, and not like a good starter task if you’ve not done a bit of ML before this.

I think jeremy has mentioned some of the taxi fare competitions on Kaggle, and their solutions to those before. Maybe something like this: https://github.com/retoga/kaggle-epia2017/blob/master/taxi_finalcode.py
and looking at this https://www.kaggle.com/c/nyc-taxi-trip-duration/kernels plus more.
Good luck.

Malemute · August 2, 2018, 9:31am

OK, thank you for the links - they look like a good starting point. And thank you for your patience with the question.

NegatioN · August 9, 2018, 10:55pm

For anyone interested, and new to the game (or with similar challenges as me), I ended up writing a blog post about how to best Prepare your product for machine learning. The post is targeted at regular programmers who don’t have access to a whole lot of data science backing, but still would like to prepare for whenever one gets brought on-board.