Machine Learning - Class 1 Summary

This post is a summary of the Machine Learning class. I didn't find a suitable topic to post this to. @jeremy, can you please help me turn this into a wiki that can be edited by all the students?

Things I noted down while watching the lecture for the first time.

Using cURL to download data from Kaggle to a cloud terminal
Once logged into Kaggle, you can download the dataset from the Data tab if you're using a GUI-based desktop. But if you're on Crestle, Paperspace or AWS (cloud) with access only to a terminal, it's going to be a bit harder to download the data. Jeremy walked us through a shortcut.

We're looking at the bulldozers (Blue Book for Bulldozers) competition on Kaggle.

  • The browser in this case is Firefox. Chrome does not work for some reason.
  • Hit Ctrl + Shift + I. This brings up dev tools.
  • Move to network tab.
  • Click the download button on the preferred dataset.
  • Cancel the download.
  • The browser is going to capture the URL details of the download in the network tab.
  • As shown in the screenshot, right click on the file and choose Copy as cURL.
  • Jeremy mentioned removing –2.0 if present in the copied command, but I don't think it's present anymore.
  • Once copied, add -o filename.extension (train.zip in this case) so that binary data does not get written to the terminal, which might crash your PC.

curl "https://storage.googleapis.com/kaggle-competitions-data/kaggle/3316/Train.zip?GoogleAccessId=web-data@kaggle-161607.iam.gserviceaccount.com&Expires=1526020430&Signature=e8uYs6H%2F8DutseoMknMnjqwbpM%2BcY%2F4ON62fPBlEY0un565iBwzaDtTmlLMnRzyTfeJ6cjALWFGsNz4bHOc3gBlPKwPqS0xv%2BKM4Km0GaJ%2B0q8vm9LagyL6xcd9se0I7q8nqIpZs9d8%2Bg0TuIh80qMMv9JRsvXQ4JJZM18vXxE8A8mYvWW%2FmDkuT5D1yD%2BRnrDtRVrDJoBrF3Nu9Ivj6npaHZo%2BpeEW6TYZaagLBlY8hPtHac9pH611BNh5abi%2BRSgXBsBsMwzy4YaOalLhSg%2FuLy%2FUT0snRvND8Cj49EoHU%2B3lTGrkw6Jyz%2FggA%2Bt2mGByltt0bG8BI9i3ERAc7og%3D%3D" -H "Host: storage.googleapis.com" -H "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0" -H "Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8" -H "Accept-Language: en-US,en;q=0.5" --compressed -H "Referer: https://www.kaggle.com/" -H "Connection: keep-alive" -H "Upgrade-Insecure-Requests: 1" -o train.zip

f-strings in Python 3.6

f-strings were introduced in Python 3.6. They make formatting strings way simpler. The gist is that you prepend an f to a string, similar to the way r is used to represent a raw string.

name = "vijay"
print(f"My name is {name}")   # prints: My name is vijay

The curly braces hold the values, and even integers work, unlike normal string formatting where you have to typecast an int to str to print it. A good blog post here.
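For instance (a made-up example, not from the lecture), an integer can go straight into the braces without a str() cast:

count = 3
print(f"{name} looked at {count} Kaggle datasets")   # prints: vijay looked at 3 Kaggle datasets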

Some Unix/Linux stuff

  • wc -l filename - Returns the number of lines in the file (a CSV in this case)
  • ls -lh - Lists all the files in the current directory, in long format with human-readable file sizes
  • unzip filename - Unzips the file

I'm pretty new to the Linux world, hence a bit excited to see small commands getting a lot of things done.

Root Mean Squared Log Error (RMSLE)
In the Kaggle problem, we'll be calculating RMSLE to measure the model's accuracy. The mathematical expression of RMSLE should be roughly

RMSLE = sqrt( mean( (log(y) - log(x))^2 ) )

where

x - Actual bulldozer price
y - Predicted bulldozer price
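A minimal NumPy sketch of that metric, assuming x and y are arrays of positive prices (some formulations use log(1 + value) instead of a plain log):

import numpy as np

def rmsle(x, y):
    # root mean squared log error between actual prices x and predicted prices y
    return np.sqrt(np.mean((np.log(y) - np.log(x)) ** 2))

print(rmsle(np.array([10000.0, 20000.0]), np.array([11000.0, 18000.0])))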

RANDOM FOREST

Jeremy provides a brief overview of why Random Forest. He'll be diving into further details in the following lectures. The following is what I picked up from the current lecture.

  1. It can be considered a universal machine learning technique.
  2. It can basically be used to target any type of ML problem, like classification or continuous variable prediction (regression) - see the small scikit-learn sketch after this list.
  3. It generally does not overfit, and if it does, it can be easily corrected.
  4. Very few statistical assumptions are made about the data.
  5. It usually does not require a separate validation set.
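For illustration (scikit-learn naming, not code shown in the lecture), the same family of models covers both problem types:

from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

clf = RandomForestClassifier()   # classification: predicting labels
reg = RandomForestRegressor()    # regression: predicting a continuous variable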

Curse of Dimensionality

This is an assumption in the ML community that the more columns you have, the emptier the space they create becomes. More dimensions lead to most of the points sitting on the edge of that space, and the distances between points become less meaningful.

Jeremy considers this largely meaningless and stupid. Later in the class, he also suggested breaking the data into as many columns as possible if you think that's going to add more meaning to the data. Though the points are on the edge of the space, they still have different distances from each other and are still useful.

This notion basically came from the 90s, when theory took over.

I'm not entirely clear on the space part of this, though I understood the overall gist of what Jeremy said. I'll be happy if someone can add a more meaningful explanation here.

No Free Lunch Theorem
This states that there is no single type of model that works well for every kind of data set.

In contrast to this, in the real world there are techniques that work better than other techniques for a given data set.

Also, this theorem assumes that we work on random data sets, which is actually not the case in real life. Every column in a real data set will usually have some relation to the other columns.

Regression: an ML model trying to predict a continuous variable (a dependent variable) based on other variables in the given data set. People often use "regression" to mean logistic regression, which is not accurate.

Feature Engineering

Wikipedia says - Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.

Jeremy threw a very simple example at us in the class. In the bulldozers data set, we had a date column. What extra features/information can we get from that column? (There's a rough pandas sketch after this list.)

  • Which quarter of the year is it? - Useful if we're predicting sales
  • What day of the week is it? - We can observe if there is a pattern for a particular day of the week (people usually go to church on Sundays at a particular time, and Uber can plan a surge)
  • What week of the month is it?
  • Is it a holiday? - Eateries in movie stalls can plan their logistics
  • Is it summer/winter/rainy season? - Seasonal sales, say umbrellas in summer/rainy season or jackets in winter
  • Is there a game going on that day?
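
A rough pandas sketch of pulling such features out of a date column ('saledate' is just an example column name here, not necessarily the exact class code):

import pandas as pd

df = pd.DataFrame({'saledate': pd.to_datetime(['2011-01-16', '2011-07-04', '2011-12-25'])})
df['quarter'] = df.saledate.dt.quarter        # which quarter of the year
df['day_of_week'] = df.saledate.dt.dayofweek  # 0 = Monday, 6 = Sunday
df['month'] = df.saledate.dt.month
df['year'] = df.saledate.dt.year
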
Pandas Stuff

A basic data structure in pandas is called a data frame. Data frames in Python and R function in a similar way.

Pandas is the gold standard for dealing with structured data in Python, and I am sometimes amazed at what it can do given a simple CSV/Excel file with some data in it.

A few methods from Pandas (a combined sketch follows this list):

  • df.tail() - Prints last few rows from the dataframe

  • df.head() - Prints first few rows from the dataframe

  • df.tail().transpose() - Usually columns are printed at the top and rows are printed at the side. When you've got lots of columns, some columns are truncated and not displayed. The transpose method prints columns down the side and rows across the top so that you can view all the data at once.

  • drop() - This method returns a new dataframe with a few columns or rows removed, which can be specified in a list and passed into the method. The axis parameter decides whether rows or columns are removed; axis=1 removes columns.

  • Use square brackets [column name] to grab a column from a dataframe in a safe way to perform some operation on it.
    df['salesdata'] = df['salesdata'] * 2

  • Pandas inherently supports category and datetime data types.
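A short combined sketch of the above, assuming a CSV called Train.csv (the column names here are placeholders):

import pandas as pd

df = pd.read_csv('Train.csv', low_memory=False)

df.head()               # first few rows
df.tail().transpose()   # last few rows, with columns down the side so all of them are visible

df_small = df.drop(['SomeColumn'], axis=1)   # new dataframe without that column
df['salesdata'] = df['salesdata'] * 2        # grab a column and operate on it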

fast.ai library functions

  • add_datepart - Separates a datetime64-type column into multiple columns, like discussed above in feature engineering.

  • train_cats - Creates categorical variables for any column that contains strings (if there's a column called severity with labels High, Low and Medium, they will be backed by numeric codes, say 0, 1 and 2). I did not know that Pandas can natively do this; I earlier used preprocessing.LabelEncoder() from scikit-learn. The ordering of the labels can be changed, which may be helpful in improving the model, say by letting a tree split the data with fewer splits. There is a parameter called inplace which, when set to True, reorders the categories on the existing dataframe instead of returning a new one. Any missing values end up with a code of -1.

  • One point to be careful about while labeling: make sure you're using the same labels for training and testing the model, otherwise your model will never seem to work properly. (I have faced this issue in my work when working on some classification tasks.)

  • apply_cats - This method makes sure that the same labeling is used for the test data that was earlier used for the training data. (See the sketch below for how these functions fit together.)
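Roughly how these helpers fit together, assuming the old fastai (0.7) structured module used in the course; the column names saledate and severity are just examples, and exact behaviour may differ by version:

from fastai.structured import add_datepart, train_cats, apply_cats

add_datepart(df_raw, 'saledate')   # splits saledate into year, month, day, ... columns
train_cats(df_raw)                 # turns string columns into pandas categories
apply_cats(df_test, df_raw)        # reuses the training categories on the test set

# reorder an ordinal category (the class used inplace=True, which newer pandas deprecates)
df_raw['severity'] = df_raw.severity.cat.set_categories(['High', 'Medium', 'Low'], ordered=True)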

Feather format

Feather provides binary columnar serialization for data frames. It is designed to make reading and writing data frames efficient, and to make sharing data across data analysis languages easy.

This format saves the data to disk in a similar way to how the data is stored in RAM. This significantly improves the read time from disk, and we experienced this in the class when Jeremy read a dataframe from disk that was saved in this format.
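For example (the tmp/bulldozers-raw path is just an illustration):

import os
import pandas as pd

os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/bulldozers-raw')          # write the dataframe to disk in feather format
df_raw = pd.read_feather('tmp/bulldozers-raw')   # read it back, much faster than re-parsing the CSV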

n_jobs

This parameter is passed to the RandomForestRegressor object when it is created. It tells scikit-learn to run the training in separate processes across the available CPU cores; setting it to -1 uses all cores.
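For example (df and y standing for the prepared features and target):

from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_jobs=-1)   # -1 means use all available CPU cores
m.fit(df, y)
m.score(df, y)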

The above notes were taken down while listening to Jeremy's video on YouTube. The notes are in pretty much the same order as the class, except that I grouped the pandas utilities covered over the entire session under one heading. I'll be adding more notes once I listen to the lecture again. Please feel free to update the notes or suggest any changes that I need to make if the post is not editable.
