Pandas for ML

Personally, I found working with data frames in Pandas pretty unfamiliar at first, and the online cheat sheets weren’t very easy to read. It was really helpful for me just to collect all the examples Jeremy had covered in one place and connect them together (adding some other small useful bits), so I thought I’d share my code for others to dig through:

Pandas for Machine Learning Cheat Sheet

Let me know if there’s anything big I’ve missed!

5 Likes

Great job! BTW one very minor suggestion: you can always remove a trailing , : from any numpy or pandas indexing. E.g. instead of arr[0, :] just say arr[0]; the trailing colon is assumed. (Very few people seem to be aware of this, so most code I see on the internet has the trailing colon, but I think it’s clearer without it, personally.)
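A quick way to check the trailing-colon point is the sketch below. The array and frame are made-up toy data, not from the cheat sheet. Note that for a DataFrame the equivalence is shown with .iloc; plain df[0] is a column lookup, which is a different operation.

```python
import numpy as np
import pandas as pd

# Toy data, purely illustrative
arr = np.arange(12).reshape(3, 4)

# With and without the trailing colon: both select the first row
first_row_explicit = arr[0, :]
first_row_short = arr[0]
same_numpy = np.array_equal(first_row_explicit, first_row_short)

# Same idea with a pandas indexer (.iloc here; .loc behaves the same way)
df = pd.DataFrame(arr, columns=list("abcd"))
same_pandas = df.iloc[0, :].equals(df.iloc[0])

print(same_numpy, same_pandas)
```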

2 Likes

I also wonder if it should be df_raw.iloc[range(5)], not df_raw.loc[range(5)]. We happen to have an index of 0, 1, 2, … for that dataset, but iloc will always give you the more numpy-like, position-based indexing.
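To see why this matters, here is a minimal sketch using a hypothetical stand-in for df_raw whose integer index is not in 0, 1, 2, … order (the column name and values are invented for illustration). With such an index, loc and iloc return different rows for the same five integers.

```python
import pandas as pd

# Hypothetical stand-in for df_raw: the index runs 9, 8, ..., 0
df = pd.DataFrame(
    {"value": list(range(10, 20))},
    index=[9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
)

wanted = list(range(5))  # the integers 0..4

by_label = df.loc[wanted]      # rows whose index LABELS are 0..4
by_position = df.iloc[wanted]  # the first five rows, whatever their labels

print(by_label.index.tolist())     # labels 0..4, taken from the end of the frame
print(by_position.index.tolist())  # labels 9..5, the first five rows
```

On a frame with the default 0, 1, 2, … index the two calls return the same rows, which is why the difference is easy to miss.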

1 Like

Thanks Jeremy – useful to know 🙂

Terence, I hadn’t actually picked up on the difference between .loc and .iloc, but it seems like an important distinction. Will edit now.

This is what I’ve learned:

  • loc works on labels in the index (row/column names)
  • iloc works on the positions in the index (so it only takes integers) - @soorajviraat says it’s an abbreviation for integer location
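The two bullets above can be sketched with a small frame that has string labels, so labels and positions cannot be confused. The column names and values are invented for illustration. One extra detail worth knowing: label slices with loc include the end label, while position slices with iloc exclude the end position, as in numpy.

```python
import pandas as pd

# Toy frame with string row labels (illustrative data only)
df = pd.DataFrame(
    {"price": [100, 200, 300], "year": [2009, 2010, 2011]},
    index=["a", "b", "c"],
)

# Same cell, addressed by label and by position
price_b_label = df.loc["b", "price"]  # row label "b", column label "price"
price_b_pos = df.iloc[1, 0]           # second row, first column

# Slicing differs: loc includes the end label, iloc excludes the end position
label_slice = df.loc["a":"b"]  # rows "a" and "b"
pos_slice = df.iloc[0:1]       # row "a" only

print(price_b_label, price_b_pos, len(label_slice), len(pos_slice))
```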

This article explains the difference between loc, iloc and ix in detail; do give it a read.

3 Likes

I’ve already updated the document to include this 🙂

@alexhoward95 Thanks a lot! That is helpful!

Also, I found a good tutorial/cheat sheet on Pandas, which might also be useful: http://nbviewer.jupyter.org/github/rasbt/python_reference/blob/master/tutorials/things_in_pandas.ipynb

1 Like