What does a data scientist's job look like?

As the title says (especially for computer vision tasks). I recently started my own project (an emotion classifier), as the lessons (part 1) suggested, and I found one huge difference compared with Kaggle competitions: data collection.

In a Kaggle competition, what I care about is:

  1. How to build a better model
  2. Ensembling
  3. Creating a useful validation set (though I failed to do that in fisheries monitoring)
  4. Applying other techniques like pseudo-labeling and k-fold cross-validation to improve accuracy

But in my own project, collecting and filtering hundreds of thousands of emotional expressions is a time-consuming task (Kaggle has a dataset, but it is imbalanced and small). I feel like most of my time is spent on data collection and filtering. Is it like that for all data scientists? Or do they hand data collection off to another team?

My data collection process:

  1. Collect data from Google, Bing, ShutterStock
  2. Use the img_hash module of OpenCV to filter out similar images (with a VP tree)
  3. Use dlib to crop human faces
  4. Filter out faces that do not match the target emotion
  5. Repeat from step 1
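
To make step 2 concrete, here is a minimal sketch of near-duplicate filtering with a perceptual hash. It is a simplified stand-in, not the OpenCV img_hash code: `average_hash`, `hamming`, and `drop_near_duplicates` are hypothetical helper names, images are assumed to be already shrunk to tiny grayscale grids, and a linear scan replaces the VP tree (a VP tree would only speed up the same nearest-hash lookup).

```python
from typing import List, Sequence

Grid = Sequence[Sequence[int]]  # assumed: a tiny, already-downscaled grayscale image


def average_hash(pixels: Grid) -> int:
    """Pack the grid into an int: one bit per pixel, set if brighter than the mean."""
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of bit positions where the two hashes differ."""
    return bin(a ^ b).count("1")


def drop_near_duplicates(images: Sequence[Grid], max_distance: int = 2) -> List[int]:
    """Return indices of images to keep; an image is dropped when its hash is
    within max_distance bits of an image that was already kept."""
    kept_indices: List[int] = []
    kept_hashes: List[int] = []
    for i, img in enumerate(images):
        h = average_hash(img)
        # Linear scan over kept hashes; a VP tree would replace this loop.
        if all(hamming(h, k) > max_distance for k in kept_hashes):
            kept_indices.append(i)
            kept_hashes.append(h)
    return kept_indices
```

The `max_distance` threshold is an assumption to tune: too low and resized or recompressed copies slip through, too high and distinct faces get merged.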

Any suggestions?

There’s a huge difference between Kaggle and real-world data science, and your comments get at the heart of it.

Data science in reality involves a lot of data collection, cleaning, parsing and munging, and many people would estimate that component at 70-80% of the job. It’s not always that large, but it’s a huge component. Building a large enough dataset to make a reasonable model, especially when tackling a new domain like you’re doing, is a huge issue.

You can outsource some of that work with tools like Amazon Mechanical Turk and CrowdFlower for the types of tasks you’re talking about, but good dataset creation is an important part of data science. It usually ends up being a balance, where the data scientist works with another team to make sure that the data is properly collected and labelled. In my experience, though, there is usually a fair bit of manual examination of the data while the dataset is being properly defined.

You might be interested in:

I’m not convinced that their method of mapping to arousal and valence adds value, but they talk extensively about how they created their dataset and their method of evaluation.

Thanks for the link to the paper and for your suggestions. Glad to know data collection is a crucial part of a data scientist's job.

Before I started this project, I never noticed that some human emotions are hard to tell apart (e.g. surprise vs. fear). I have to study how to read those expressions.

By the way, I just came up with a simple way to help me filter the images: use the Microsoft Emotion API as a “weak classifier” to help me classify the emotions. This should save me a lot of time. There are two ways to do it:

  1. Use their API directly; this is not free, but it is legal
  2. Write code to automate the submit/browse process on their web page; this is free, but it may not be legal
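
Whichever option is used, the filtering logic could look roughly like the sketch below. It does not call the Microsoft Emotion API and does not reproduce its response format: the `scores` dictionaries, the `triage` function, and the `min_score` threshold are all assumptions for illustration. The idea is to keep an image automatically only when the weak classifier confidently agrees with the emotion it was collected under, and to leave everything else for manual review.

```python
from typing import Dict, List, Tuple

# Assumed input shape: (image_id, emotion the image was collected under,
# per-emotion scores returned by some weak classifier).
Candidate = Tuple[str, str, Dict[str, float]]


def triage(candidates: List[Candidate], min_score: float = 0.6) -> Tuple[List[str], List[str]]:
    """Split candidates into (auto-keep, needs manual review)."""
    keep: List[str] = []
    review: List[str] = []
    for image_id, expected, scores in candidates:
        top = max(scores, key=scores.get)  # label with the highest score
        if top == expected and scores[top] >= min_score:
            keep.append(image_id)
        else:
            # Disagreement or low confidence (e.g. surprise vs. fear): check by hand.
            review.append(image_id)
    return keep, review


keep, review = triage([
    ("a.jpg", "fear", {"fear": 0.90, "surprise": 0.10}),
    ("b.jpg", "fear", {"fear": 0.45, "surprise": 0.55}),
    ("c.jpg", "surprise", {"fear": 0.45, "surprise": 0.55}),
])
```

Since the classifier is weak, the review pile still needs a human look; the threshold only controls how much ends up there.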

Edit: the Microsoft Emotion API works. Now I have more time to focus on other things, thank goodness.

Good to hear that you're using the MS Emotion API as a shortcut :smile: