Wiki thread: lesson 1


(Jeremy Howard (Admin)) #1

Wiki post

Note that this is a forum wiki thread, so you can all edit this post to add, change, or organize info to help make it better! Please add any useful resources, tips, etc. that you find to this thread. To edit, click on the little pencil icon at the bottom of this post.

Wiki: Lesson 2 >>>

Lesson resources


Setting up your computer if it already has Anaconda installed

git clone https://github.com/fastai/fastai
cd fastai
conda create -n fastai python=3.6 anaconda
conda env update
source activate fastai

Homework

Set up Anaconda and fastai from scratch on new AWS instance

  • Launch instance with Ubuntu t2.medium 50GB (preferred) or t2.micro 30GB (free), then login and complete these steps:
  • git clone https://github.com/fastai/fastai
  • wget https://repo.continuum.io/archive/Anaconda3-5.0.0.1-Linux-x86_64.sh
  • bash Anaconda3-5.0.0.1-Linux-x86_64.sh
    • Default install to home
    • Yes: add to .bashrc
  • source .bashrc or login again (NB: on Mac this file is called .bash_profile)
  • cd fastai
  • conda env update
  • Log out, then log back in, adding -L8888:127.0.0.1:8888 to your ssh command. This creates an ssh tunnel. There are lots of tutorials about ssh tunnels if you’re not clear on this, such as this one
  • The next time you log in, be sure to run source activate fastai.
  • Remember, you’re being charged for every hour your server is on. You can turn it off in your terminal by typing shutdown -h now or in the AWS web-based console. There are also mobile apps in case you’re not near your computer.

Submit to Kaggle every day

The key reason to participate is to gain practical skills in building pipelines, building models, and cleaning data. Even if you come in last, you get a practical opportunity to build tools.

  • Try to spend at least 30 mins a day, and try to enter a new competition or improve on yesterday’s submission
  • E.g. start by submitting a column of zeros, or Kaggle’s sample submission file (if the competition has one), to learn how the submission procedure works. Then the next day, submit averages, and you should see an improvement
  • Look at the competitions that are currently open and see which will suit you best.
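
The first two days of that routine could look something like the sketch below. It is only an illustration using the Python standard library: the column names Id and Prediction, the ids, and the training targets are all made up, and every competition defines its own submission format, so check the sample submission file first.

```python
import csv
import io
from statistics import mean


def constant_submission(ids, value, id_col="Id", target_col="Prediction"):
    """Return CSV text that predicts the same value for every id.

    The column names are hypothetical; use the ones from the
    competition's sample submission file.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([id_col, target_col])
    for row_id in ids:
        writer.writerow([row_id, value])
    return buf.getvalue()


# Made-up data standing in for a real training target and test ids
train_targets = [10.0, 12.0, 14.0]
test_ids = [101, 102]

# Day 1: a column of zeros, just to learn the submission procedure
zeros_csv = constant_submission(test_ids, 0)

# Day 2: the training-set average, which should score better than zeros
mean_csv = constant_submission(test_ids, mean(train_targets))

print(zeros_csv)
print(mean_csv)
```

To submit, write the returned text to a .csv file and upload it on the competition page.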

Practice Jupyter notebook

  • Experiment with Jupyter notebook shortcuts. Press h in Jupyter to get a list
  • Try to learn and use 4-5 notebook commands every day that look useful to you.

Lesson notes

Use the forums liberally for questions, answers and support. By liking questions and answers, we can see where the “gravity” is forming around questions, issues or topics.

Other people’s Kaggle kernels, blog posts, etc. give you the opportunity to read and learn from technical writing and code from other developers and data scientists. Creating your own kernels and blog posts gives you the opportunity to share your own insights to help others, which tests and develops your technical communication skills. Note that the highest-voted kernels are usually posted early in the competition!

Check out the Twitter ML community; it is popular and welcoming to new practitioners and students. To find people who are posting ML stuff on Twitter, look through my favorites and retweets, and follow people from there.

How to ask questions: The best practitioners ask good questions; research says that people who ask questions are perceived as being smarter. Use class to practice asking questions! Saying “I don’t follow what you’re talking about - can you please explain it again” any time you’re having trouble following, is a great idea.

The teaching philosophy and syllabus are in the lesson1.ipynb notebook.

Lesson Notes from @melissa.fabros

Let’s increase class participation in forums and in-person! AKA How to ask questions.

  • Ask for confirmation of what you know: “I think I heard you say {x}, did I hear that right?”
  • Ask to repeat information: “Please repeat {x information}”
  • Ask to rephrase an abstract concept in terms of an example: “What’s the intuition behind that idea?”; “Can you provide a concrete example of {x idea}?”; “In what context is {x idea} implemented or used?”

Please be brave! Please practice asking for what you need to succeed. It’s hard. We know.

  • Is machine learning only used for classification?

No, for the purposes of the class, we’ll use machine learning for regression and classification.

  • Is deep learning the only machine learning that counts?

Deep learning is a class of algorithmic strategies that solve machine learning problems, but it is particularly effective for unstructured-data problems (e.g. computer vision and, increasingly, machine translation).

  • Python vs R. what if we like R?

Try both if you’d like; Jeremy will show you new techniques in Python that will help you get to number.

This is a question of “How much should I listen to Jeremy?” If you want to try R because you’re more comfortable in it, go for it! If you’re successful, it could be the start of an interesting conversation (remember the note about diversity of models and approaches!).

Let’s build a model!

@timlee has a great walkthrough and an annotated notebook for lesson 1.

Jupyter notebook tips

  • Shift-enter executes a cell; if the cell contains only a function’s name, the output tells you where it comes from
  • ?<thing you don’t know about> + shift-enter will display its documentation
  • ??<thing you don’t know about> + shift-enter will show its source code
  • Shift-tab: display the parameters of a function
  • Shift-tab-tab: display documentation
  • Shift-tab-tab-tab: brings up documentation in its own window
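
Outside a notebook, roughly the same information is available from Python’s standard inspect module. The sketch below is only an approximation of what the shortcuts above show; the describe helper is a made-up name, and json.dumps is just a convenient function to inspect.

```python
import inspect
import json


def describe(obj):
    """Collect roughly what the notebook shortcuts display for obj."""
    try:
        source = inspect.getsource(obj)   # similar to ??obj
    except (OSError, TypeError):
        source = None                     # e.g. built-ins implemented in C
    return {
        "module": getattr(obj, "__module__", None),   # where it comes from
        "signature": str(inspect.signature(obj)),     # similar to shift-tab
        "doc": inspect.getdoc(obj) or "",             # similar to ?obj
        "source": source,
    }


info = describe(json.dumps)
print(info["module"], info["signature"])
print(info["doc"].splitlines()[0])
```

In day-to-day notebook use the shortcuts are quicker; the helper is mainly useful in scripts or when you want the information as plain strings.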

It’s more important to learn how to get information about libraries than to know the intricacies of how a function in a library works.

Reflections and questions during the code walkthrough

  • We’re learning data science and not software engineering:

intuition: we’re not writing code for other people, we’re writing code to have everything available to us so we can be as iterative and interactive as possible.

Note that Jeremy’s Python style leans toward being effective rather than signaling to others that you have good coding style. If you’re coding for others to read (e.g. practicum companies), refactor and document your code to at least PEP 8 standards so other people can follow it.

  • Q: what is regression vs classification?

Regression is the prediction of a continuous variable; classification is true/false categorization or identification of one of multiple categories.

Curse of dimensionality

“Curse of dimensionality” → theoreticians don’t like many dimensions (aka columns), but in practice, the more information the better, because you don’t know in advance what might be useful.

No free lunch theorem → in theory, no one type of model works well for every kind of data set, which is true for random data sets.

Jeremy’s “free lunch theorem” → in practice, random forest is the best model for most data, because most data is not random. A random forest works for almost every structured-data problem.

  • Q: if we have too many dimensions/fields won’t we run into co-linearity problems?

Random forests have almost no co-linearity problems

With completely off-the-shelf tools, i.e., without any customization, Jeremy’s general-purpose random forest analysis places around 100th in the Bulldozer competition (in the top 25% of all submissions).

fastai Module Changes and Updates You Might Notice in Your Results

  • add_datepart(...) (2018-09)
    – The source code for the ‘saleElapsed’ field has changed from
    df[targ_pre+'Elapsed'] = (fld - fld.min()).dt.days
    …to…
    df[targ_pre + 'Elapsed'] = fld.astype(np.int64) // 10 ** 9
    – As a result, the values produced by the function for the field differ significantly from those shown in the video beginning at 1:11:30.
  • proc_df(...) (2018-09)
    – The function now returns three values, so the call needs three variables on the left-hand side to work properly in the Lesson 1 notebook.
    – lesson1-rf.ipynb provided with the Paperspace Gradient fastai template already includes the updated code
    – If you did not use the Paperspace / fastai template, change df, y = proc_df(df_raw, 'SalePrice') shown in the video to df, y, nas = proc_df(df_raw, 'SalePrice')
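
To get a feel for how different the two saleElapsed definitions are, here is a small standard-library sketch of the arithmetic. It is not the fastai code (which operates on a pandas Series); the function names and sample dates are made up for illustration, and it assumes the dates are naive timestamps treated as UTC.

```python
from datetime import datetime, timezone


def elapsed_days_since_first(dates):
    """Older definition: whole days since the earliest date in the column."""
    first = min(dates)
    return [(d - first).days for d in dates]


def elapsed_epoch_seconds(dates):
    """Newer definition: whole seconds since 1970-01-01 (Unix time)."""
    return [int(d.replace(tzinfo=timezone.utc).timestamp()) for d in dates]


# Hypothetical sale dates, not taken from the Bulldozers data
sale_dates = [
    datetime(2004, 2, 26),
    datetime(2004, 3, 26),
    datetime(2006, 11, 16),
]

print(elapsed_days_since_first(sale_dates))
print(elapsed_epoch_seconds(sale_dates))
```

The first list starts at 0 and grows by one per day, while the second holds values above one billion, which is why the numbers in your notebook will not match the video.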

(Jeremy Howard (Admin)) #3

I’ve updated the post with notes from @melissa.fabros . Gang, please help by editing the post with any links, tips, etc, or just by helping organize what’s already there! Click the ‘edit’ (pencil) icon at the bottom of the top post to edit it.


(Cedric Chee) #4

There’s no ‘edit’ button. Could you turn this into a wiki thread? Thanks.


(Jeremy Howard (Admin)) #5

Sorry! Done now.


#6

When you have several date columns in a dataframe, do you need to pass them all to the add_datepart function?

If so, is there a way to iterate through the columns to find which are of the date datatype and convert them to codes?

I have tried this on two columns of my df, but I get an error in return:

for col in sliced_list:
    df_raw.col 
AttributeError: 'DataFrame' object has no attribute 'col'

Or else, if I try:

for col in sliced_list:
    df_raw.columns.col
AttributeError: 'Index' object has no attribute 'col'

Is there an easy way to iterate through a dataframe’s columns?


(Ramin Zahedi Darshoori) #8

Error While executing: conda env update
I followed the instructions exactly, but got an error at the end of conda env update.
Here is the error:
Exception:
Traceback (most recent call last):
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2869, in _dep_map
    return self.__dep_map
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2663, in __getattr__
    raise AttributeError(attr)
AttributeError: _DistInfoDistribution__dep_map

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/packaging/requirements.py", line 93, in __init__
    req = REQUIREMENT.parseString(requirement_string)
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 1632, in parseString
    raise exc
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 1622, in parseString
    loc, tokens = self._parse( instring, 0 )
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 1379, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 3395, in parseImpl
    loc, exprtokens = e._parse( instring, loc, doActions )
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 1383, in _parseNoCache
    loc,tokens = self.parseImpl( instring, preloc, doActions )
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pyparsing.py", line 3183, in parseImpl
    raise ParseException(instring, loc, self.errmsg, self)
pip._vendor.pyparsing.ParseException: Expected stringEnd (at char 33), (line:1, col:34)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2949, in __init__
    super(Requirement, self).__init__(requirement_string)
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/packaging/requirements.py", line 97, in __init__
    requirement_string[e.loc:e.loc + 8]))
pip._vendor.packaging.requirements.InvalidRequirement: Invalid requirement, parse error at "'; extra '"

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_internal/basecommand.py", line 141, in main
    status = self.run(options, args)
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 299, in run
    resolver.resolve(requirement_set)
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_internal/resolve.py", line 102, in resolve
    self._resolve_one(requirement_set, req)
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_internal/resolve.py", line 306, in _resolve_one
    set(req_to_install.extras) - set(dist.extras)
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2826, in extras
    return [dep for dep in self._dep_map if dep]
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2871, in _dep_map
    self.__dep_map = self._compute_dependencies()
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2881, in _compute_dependencies
    reqs.extend(parse_requirements(req))
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2942, in parse_requirements
    yield Requirement(line)
  File "/home/ramin/anaconda3/envs/fastai/lib/python3.6/site-packages/pip/_vendor/pkg_resources/__init__.py", line 2951, in __init__
    raise RequirementParseError(str(e))
pip._vendor.pkg_resources.RequirementParseError: Invalid requirement, parse error at "'; extra '"


(Wayne Nixalo) #9

A lot of people’ve been having this issue recently. This (thread) might work.


(Ramin Zahedi Darshoori) #10

Looks like skipping these instructions and just doing a pip install fastai works fine!


(Jidin ) #11

Where can I find the previous Machine Learning Courses using Pytorch and Keras? I had to look something up and they aren’t featured on fast.ai website. Please help ASAP


(Chris Palmer) #12

Found the answer to my question is Yes - we are in the middle of a transition, and that I needed to re-configure my fastai symbolic link to point to …/…/old/fastai

See Moving fastai folder


Are we in between major changes, and can we no longer use the fastai repo to run the ML notebooks? I just did a git pull, and now I can no longer run the ML lesson 1 notebook because structured.py has disappeared. I notice that it’s located on GitHub under fastai/old/fastai, so it looks like it’s on its way out and the notebook hasn’t been updated to reflect a new approach… Should I reset to the files from old/fastai or hang tight for an updated notebook?


(Rex Madden) #13

Did anyone see the “card” in the YouTube video? I didn’t see anything. Jeremy said there would be one around 1:28.


(Yasir Assam) #14

Which machine do you recommend to do this on?

I tried using the paperspace option recommended from lesson 1 of the DL1 course, and I get an error with the jupyter kernel restarting each time I call read_feather() followed by proc_df()

No luck running this on my own laptop (macOS): conda env update fails with a bunch of pip install errors


(Gary Allison) #15

I am also getting a Jupyter kernel crash on my paperspace machine when I try to run

df, y, nas = proc_df(df_raw, 'SalePrice')

in lesson1-rf. Any recommendations for what I should try? Thanks!


(Sabelo Mhlambi) #16

It’s also worth noting that at least for the Notebook from Lesson 1 it can be run on one’s personal machine :slight_smile:


(Yasir Assam) #17

Not on a Mac. I haven’t been able to run “conda env update” on macOS without errors (including using the switch ‘-f environment-cpu.yml’)


(Sabelo Mhlambi) #18

I don’t think that’s because of the Notebook. You’re having issues with conda env update, which should work on mac and other *nix systems.

In terms of the FastAI library code and training the random forest regression – you shouldn’t have issues on a modern laptop/computer.


(Avinash Singh Pundhir) #19

I am also encountering the same issue. Did you manage to resolve it?


(Gary Allison) #20

No, I’ve had no luck on my paperspace machine with keeping Jupyter from crashing on the proc_df() function in lesson1-rf.
I’ve tried conda updates and git pulls but I don’t know how to proceed so I’m working on crawling through it on my laptop. Hoping someone has some ideas!


(Avinash Singh Pundhir) #21

Hi All,

When running the following part of the “lesson1-rf” notebook, the kernel restarts:

df, y, nas = proc_df(df_raw, 'SalePrice')

I am using a Paperspace machine for the execution. Please let me know if someone has a solution for this, as it is preventing at least 3 of us in the forum from making progress.

Thanks,
Avinash


(Jeremy Howard (Admin)) #22

Oops - thanks for the reminder :slight_smile: