Wiki post
Note that this is a forum wiki thread, so you all can edit this post to add/change/organize info to help make it better! Please add any useful resources you find, tips, etc. into this thread. To edit, click on the little pencil icon at the bottom of this post.
Lesson resources
- Lesson video
- Course repo: https://github.com/fastai/fastai1
- Notebooks: In courses/ml1 in the above repo
- Workshop video, showing how to set up AWS and Anaconda, how to use numpy, etc.
- Lesson notes from @hiromi
- Kaggle Kernel for lesson 1
Setting up your computer if it already has Anaconda installed
```
git clone https://github.com/fastai/fastai1
cd fastai1
conda create -n fastai python=3.6 anaconda
conda env update
source activate fastai
```
Homework
Set up Anaconda and fastai from scratch on new AWS instance
- Launch an instance with Ubuntu t2.medium 50GB (preferred) or t2.micro 30GB (free), then log in and complete these steps:

```
git clone https://github.com/fastai/fastai1
wget https://repo.continuum.io/archive/Anaconda3-5.0.0.1-Linux-x86_64.sh
bash Anaconda3-5.0.0.1-Linux-x86_64.sh
```

- Accept the default install location (your home directory)
- Answer yes when asked to add Anaconda to your .bashrc
- Run `source .bashrc` or log in again (NB: on Mac this file is called .bash_profile), then:

```
cd fastai1
conda env update
```

- Log out, then log back in adding `-L8888:127.0.0.1:8888` to your ssh command. This creates an ssh tunnel; there are lots of tutorials about ssh tunnels if you're not clear on this.
- Next time you log in, be sure to run `source activate fastai`.
- Remember, you're being charged for every hour your server is on. You can turn it off in your terminal by typing `shutdown -h now`, or in the AWS web-based console. There are also mobile apps in case you're not near your computer.
Submit to Kaggle every day
The key reason to participate is to build practical skills: constructing pipelines, training models, and cleaning data. Even if you come in last, you get hands-on experience building real tools.
- Try to spend at least 30 mins a day, and try to enter a new competition or improve on yesterday’s submission
- E.g. start with submitting column of zeros, or using Kaggle’s sample submission file (if the competition has one), to learn how the submission procedure works. Then the next day, submit averages, and you should see an improvement
- Try to look at the competitions that are there now and see which will be best for you.
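To see what a first-day baseline submission might look like, here is a minimal sketch using only the standard library. The column names ("Id", "Prediction") and the id range are made up for illustration; check the competition's sample_submission.csv for the real format.

```python
import csv

# Hypothetical test-set ids; in practice, read them from the competition's test file.
ids = range(1, 6)

# Day 1: submit a column of zeros just to learn the submission procedure.
with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Id", "Prediction"])  # header names are assumptions
    for i in ids:
        writer.writerow([i, 0])

# Day 2's improvement: replace 0 with e.g. the training-set mean.
```

Even a trivial file like this exercises the whole loop: produce predictions, format them, upload, and read your score.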
Practice Jupyter notebook
- Experiment with Jupyter notebook shortcuts. Press `h` in Jupyter to get a list.
- Try to learn and use 4-5 notebook commands every day that look useful to you.
Lesson notes
Use the forums liberally for questions, answers and support. By liking questions and answers, we can see where the “gravity” is forming around questions, issues or topics.
Other people’s Kaggle kernels, blog posts, etc. give you the opportunity to read and learn from technical writing and code from other developers and data scientists. Creating your own kernels and blog posts gives you the opportunity to share your own insights to help others, which tests and develops your technical communication skills. Note that the highest voting kernels usually happen early in the competition!
Check out the Twitter ML community; it is popular and welcoming to new practitioners and students. To find people posting ML content on Twitter, look through my favorites and retweets, and follow people from there.
How to ask questions: The best practitioners ask good questions; research says that people who ask questions are perceived as smarter. Use class to practice asking questions! Whenever you're having trouble following, it's a great idea to say "I don't follow what you're talking about - can you please explain it again?"
The teaching philosophy and syllabus is in the lesson1.ipynb notebook.
Lesson Notes from @melissa.fabros
Let’s increase class participation in forums and in-person! AKA How to ask questions.
- Ask for confirmation of what you know: "I think I heard you say {x}, did I hear that right?"
- Ask for information to be repeated: "Please repeat {x information}"
- Ask for an abstract concept to be rephrased in terms of an example: "What's the intuition behind that idea?"; "Can you provide a concrete example of {x idea}?"; "In what context is {x idea} implemented or used?"
Please be brave! Please practice asking for what you need to succeed. It’s hard. We know.
- Is machine learning only used for classification?
No; for the purposes of this class, we'll use machine learning for both regression and classification.
- Is deep learning the only machine learning that counts?
Deep learning is a class of algorithms for solving machine learning problems, and it is particularly effective for unstructured-data problems (i.e. computer vision and, increasingly, machine translation).
- Python vs R: what if we like R?
Try both if you'd like; Jeremy will show you new techniques in Python that will help your results.
This is really a question of "How much should I listen to Jeremy?" If you want to try R because you're more comfortable in R, go for it! If you're successful, it could be the start of an interesting conversation (remember the note about diversity of models and approaches!).
Let’s build a model!
@timlee has a great walkthrough and annotated notebook for lesson 1.
Jupyter notebook tips
- Shift+Enter: execute a cell
- `?<thing you don't know about>` + Shift+Enter: display its documentation
- `??<thing>` + Shift+Enter: show its source code
- Shift+Tab: display the parameters of a function
- Shift+Tab twice: display documentation
- Shift+Tab three times: bring up the documentation in its own window
It’s more important to learn how to get information about libraries rather than know the intricacies of how a function in a library works.
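The `?` and `??` tricks above are IPython/Jupyter features. Outside a notebook, the standard-library `inspect` module gives similar information; here is a small sketch with a made-up function (nothing fastai-specific is assumed):

```python
import inspect

def add_one(x: int) -> int:
    """Return x plus one."""
    return x + 1

# Like `?add_one` in a notebook: signature and docstring
print(inspect.signature(add_one))   # (x: int) -> int
print(inspect.getdoc(add_one))      # Return x plus one.

# Like `??add_one`: the full source (needs the defining file on disk)
try:
    print(inspect.getsource(add_one))
except OSError:
    pass  # source not available, e.g. in a bare interactive interpreter
```

Either way, the habit is the same: interrogate the library from where you are, rather than memorizing its internals.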
Reflections and questions during the code walkthrough
- We're learning data science, not software engineering:
Intuition: we're not writing code for other people; we're writing code to have everything available to us, so we can be as iterative and interactive as possible.
Note that Jeremy's Python style leans toward being effective rather than signaling good coding style to others. If you're coding for others to read (e.g. practicum companies), refactor and document your code to at least PEP 8 standards so other people can follow it.
- Q: what is regression vs classification?
Regression is prediction of a continuous variable; classification is prediction of a category, whether true/false or one of multiple classes.
Curse of dimensionality
"Curse of dimensionality" → theoreticians don't like having many dimensions (aka columns), but in practical use, the more information the better, because you don't know in advance which columns might matter.
No free lunch theorem → in theory, no one type of model works well for every kind of data set; this is true for random data sets.
Jeremy's "free lunch theorem" → in practice, random forests are the best model for most data, because most real data is not random. A random forest (an ensemble of decision trees) works for almost every structured-data problem.
- Q: if we have too many dimensions/fields, won't we run into collinearity problems?
Random forests have almost no collinearity problems.
With completely off-the-shelf tools, i.e. without any customization, Jeremy's general-purpose random forest analysis places around 100th in the Bulldozer competition (in the top 25% of all submissions).
fastai
Module Changes and Updates You Might Notice in Your Results
- `add_datepart(...)` (2018-09)
  - The source code for the 'saleElapsed' field has changed from
    `df[targ_pre+'Elapsed'] = (fld - fld.min()).dt.days`
    to
    `df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9`
  - As a result, the values the function produces for that field differ significantly from those shown in the video beginning at 1:11:30.
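To see why the 'saleElapsed' values changed so dramatically, here is a standard-library sketch that mimics the arithmetic of both versions (the real `add_datepart` operates on pandas Series; the dates below are made up):

```python
from datetime import datetime, timezone

# Two illustrative sale dates
dates = [datetime(2011, 1, 1, tzinfo=timezone.utc),
         datetime(2011, 1, 3, tzinfo=timezone.utc)]

# Old behaviour: days elapsed since the earliest date in the column
earliest = min(dates)
old_elapsed = [(d - earliest).days for d in dates]
print(old_elapsed)  # [0, 2]

# New behaviour: Unix timestamp in seconds
# (pandas stores datetimes as int64 nanoseconds, hence the // 10**9)
new_elapsed = [int(d.timestamp()) for d in dates]
print(new_elapsed)  # [1293840000, 1294012800]
```

Small day counts become billion-scale second counts, which is why results in the notebook no longer match the video's numbers.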
- `proc_df(...)` (2018-09)
  - The function now returns three values, which the Lesson 1 notebook must unpack for it to work properly.
  - lesson1-rf.ipynb provided with the Paperspace Gradient fastai template already includes the updated code.
  - If you did not use the Paperspace / fastai template, change `df, y = proc_df(df_raw, 'SalePrice')` shown in the video to `df, y, nas = proc_df(df_raw, 'SalePrice')`.
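To build intuition for what those three return values represent, here is a pure-Python toy (the name `toy_proc_df` and the sample rows are made up; the real `proc_df` lives in fastai and works on pandas DataFrames):

```python
import statistics

def toy_proc_df(rows, y_fld):
    """Split off the target column, fill missing numeric values with the
    column median, and record those medians in `nas` so the same fill
    values can be reused later (e.g. on a test set)."""
    y = [r.pop(y_fld) for r in rows]
    nas = {}
    for col in rows[0].keys():
        present = [r[col] for r in rows if r[col] is not None]
        if any(r[col] is None for r in rows):
            med = statistics.median(present)
            nas[col] = med
            for r in rows:
                if r[col] is None:
                    r[col] = med
    return rows, y, nas

data = [{"size": 10, "age": 2, "SalePrice": 100},
        {"size": None, "age": 4, "SalePrice": 200},
        {"size": 30, "age": 6, "SalePrice": 300}]
df, y, nas = toy_proc_df(data, "SalePrice")
print(y)    # [100, 200, 300]
print(nas)  # {'size': 20.0}
```

The third value is the important addition: without `nas`, you couldn't apply the *same* median fills to new data, and your validation/test processing would silently drift from your training processing.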
- Import issue with sklearn (v0.22 and above) (2019-12)
  - You will run into an `ImportError` if you use the import from the notebook:
    `from sklearn.preprocessing import LabelEncoder, Imputer, StandardScaler`
  - `Imputer` was deprecated in v0.20 and removed in v0.22. Use this instead:
    `from sklearn.preprocessing import LabelEncoder, StandardScaler`
    `from sklearn.impute import SimpleImputer`
  - Read the documentation for `SimpleImputer` if you run into issues while using it in place of `Imputer`.