Reflections on 2022 - Feature Engineering

FourMoBro · December 30, 2022, 1:07am

As 2022 draws to a close, my company gives us an extended holiday to relax. With the weather not being great for much of anything, I am using this time to sharpen some technical skills and revisit both Parts 1 and 2 of the fastai course. I’d like to propose a challenge, a collaboration, or some other form of knowledge sharing to the fastai community. But first, let me give you a bit of background.

2022 for me began with the urge to find a new job. I had a comfortable job, but it was neither challenging nor rewarding in anything other than above-average compensation. I had a personal choice: either “retire in place” for the next 5-15 years and rest on my laurels, or find a new job that engaged me. I was able to find a new job within my company in June, but I had a chance at a Data Scientist job back in April. As part of that pre-screening process, they gave a “kaggle competition” of sorts, but you only got one submission. It was a tabular dataset of used cars. Initially, I thought I had this in the bag. I considered myself somewhat familiar with fastai and what it can do (though “familiar” was way too generous), and I remembered how easy it was to create a model and get results. However, I was quickly humbled.

I was badly out of practice with Python and pandas, and I struggled with the basics. I referenced fastbook as best I could, but I could not get the results I was expecting. Maybe if Lesson 6 of Part 1 had been recorded and released earlier, I could have made a better showing, but I really only have myself to blame. I could talk a good game, but could I walk the walk? The answer was “no”.

I did no feature engineering on that dataset beyond the obvious. I chose a random forest (RF) method and just went at it. I submitted something, but I was not proud of my work. When I was informed I would not continue in the hiring process, I was somewhat relieved. I didn’t have what it took. I thanked the manager, but asked what could have been done better and whether he would be willing to score another submission at a later date. I heard nothing for a few months.

It was only after I got my new position, in a “sister department” to the one I had applied to earlier, that I got a response. In short, my model(s) scored poorly. The test had both a Classification and a Regression part. Now, it could have been something I did in the final submission, as I had a lot of copy/paste. Or it could have been a crappy model. At any rate, the manager said that almost all of the candidates spent too much time on the model and not enough time doing basic feature engineering.

Well, that’s what I want to do. I want to revisit that dataset, take much more time over the next few weeks, and see what I can do. I want to use this as a way to sharpen my pandas skills and my fastai skills, but I also want to solicit suggestions as to how the community would handle this test and what types of feature engineering can or should be done. I have created a GitHub repo that I can share with those who are interested in providing feedback. It contains the 2 csv files and a notebook that I started in the past day with a “plan of attack” on the feature engineering front. It is still in the initial stages and has some material copied from the “titanic” notebooks. They can be accessed here: GitHub - FourMoBro/usedcars: DS test from 04-22. After I re-watch lessons 5-8 from Part 1, I will add the code cells to create a simple initial model using all of the features and start to simplify from there. If you have more questions about the data or what needs to happen, let me know in the comments.
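To give a rough idea of the kind of basic feature engineering I have in mind, here is a minimal pandas sketch. It is only an illustration under stated assumptions: the column names (`year`, `mileage`, `make`), the helper name `add_basic_features`, and the reference year are all hypothetical and are not taken from the actual csv files in the repo.

```python
import numpy as np
import pandas as pd


def add_basic_features(df: pd.DataFrame, reference_year: int = 2022) -> pd.DataFrame:
    """Return a copy of df with a few simple derived columns added.

    Assumes hypothetical columns: 'year' (int), 'mileage' (float), 'make' (str).
    """
    out = df.copy()

    # Record which rows had no mileage before filling, so the model can use that signal.
    out["mileage_missing"] = out["mileage"].isna()

    # Fill missing mileage with the median of the observed values.
    out["mileage"] = out["mileage"].fillna(out["mileage"].median())

    # Age of the car relative to a fixed reference year.
    out["age"] = reference_year - out["year"]

    # Mileage is usually right-skewed; log1p compresses the tail and handles zero.
    out["log_mileage"] = np.log1p(out["mileage"])

    # Usage intensity; clip age to at least 1 to avoid dividing by zero.
    out["miles_per_year"] = out["mileage"] / out["age"].clip(lower=1)

    # Mark the text column as categorical so tabular tooling can treat it as such.
    out["make"] = out["make"].astype("category")

    return out


# Tiny made-up frame, purely for demonstration.
toy = pd.DataFrame(
    {
        "year": [2020, 2015, 2022],
        "mileage": [20000.0, None, 0.0],
        "make": ["Ford", "Honda", "Ford"],
    }
)
feats = add_basic_features(toy)
print(feats)
```

The function works on a copy so the raw data stays untouched, which makes it easier to compare a baseline model on the original columns against one trained on the engineered ones.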

I hope 2022 was great for you and wish you all the best in 2023.