Another treat! Early access to Intro To Machine Learning videos

@jeremy

When you talk about feature interpretation, the section on NaN and categorical variables seems clear to me: NaN becomes an additional category.

But what about NaN and continuous values like a temperature, pressure, or distance? That is, a real missing value.

+1 I’ve always wanted to know this… or DL combined with gradient boosting (e.g. XGBoost)

Isn’t it about combining the results from different models? The models could be anything: logistic regression, KNN, Naive Bayes, XGBoost, DL, etc.

A rough rubric:

  1. Train several different models on the same training dataset and generate predictions on the test dataset.
  2. If you trained 5 models in step 1, you’ll now have 5 prediction arrays for the test dataset, one from each model.
  3. Now combine them, for example by:
    (i) hard voting - take the majority label across the 5 predictions for each record (equiweighted averaging)
    (ii) soft voting - if a couple of those 5 models were better than the other 3, give them higher weights (weighted averaging)
    (iii) stacking - train another model that takes the base models’ predictions as its inputs (see the sketch below)
  4. Generate the final prediction for each record in the test set.

A much more detailed description, with visual summaries, is given in this notebook.
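For a quick concrete version of (i) and (ii), here is a minimal sketch using scikit-learn’s VotingClassifier; the three base models, the synthetic data, and the weights are just placeholders, not a recommendation:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Toy data standing in for the real train/test split
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

models = [('lr', LogisticRegression(max_iter=1000)),
          ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
          ('nb', GaussianNB())]

# (i) hard voting: majority label across the models, all weighted equally
hard = VotingClassifier(models, voting='hard').fit(X_train, y_train)

# (ii) soft voting: average the predicted probabilities, here trusting the RF twice as much
soft = VotingClassifier(models, voting='soft', weights=[1, 2, 1]).fit(X_train, y_train)

# (iii) stacking would instead train another model on the base models' predictions
print(hard.score(X_test, y_test), soft.score(X_test, y_test))

In practice you’d choose those weights (or the stacking model) on a validation set rather than hard-coding them.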

2 Likes

We discussed this in the previous lessons - take a look at the source of proc_df to see how we handle it.
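In case it helps, the idea for numeric columns is roughly this: fill the NaNs with the column’s median and add a boolean _na column recording which rows were missing. A simplified sketch (not the exact library code, and the helper name here is just for illustration):

import pandas as pd

def fill_numeric_na(df, col):
    # Simplified version of what proc_df does for a numeric column:
    # flag which rows were missing, then fill with the median.
    if pd.isnull(df[col]).sum():
        df[col + '_na'] = pd.isnull(df[col])
        df[col] = df[col].fillna(df[col].median())
    return df

df = pd.DataFrame({'temperature': [20.5, None, 23.1, None]})
print(fill_numeric_na(df, 'temperature'))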

When going through the lesson2_rf_interpretation notebook,

def get_preds(t): return t.predict(X_valid)
%time preds = np.stack(parallel_trees(m, get_preds))
np.mean(preds[:,0]), np.std(preds[:,0])

This code block generates these exceptions:

Exception in thread Thread-41:
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 916, in _bootstrap_inner
    self.run()
  File "C:\ProgramData\Anaconda3\lib\threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "C:\ProgramData\Anaconda3\lib\concurrent\futures\process.py", line 295, in _queue_management_worker
    shutdown_worker()
  File "C:\ProgramData\Anaconda3\lib\concurrent\futures\process.py", line 253, in shutdown_worker
    call_queue.put_nowait(None)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 129, in put_nowait
    return self.put(obj, False)
  File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 83, in put
    raise Full
queue.Full

---------------------------------------------------------------------------
BrokenProcessPool                         Traceback (most recent call last)
<timed exec> in <module>()

D:\Github\fastai\courses\ml1\fastai\structured.py in parallel_trees(m, fn, n_jobs)
     18 
     19 def parallel_trees(m, fn, n_jobs=8):
---> 20         return list(ProcessPoolExecutor(n_jobs).map(fn, m.estimators_))
     21 
     22 def draw_tree(t, df, size=10, ratio=0.6, precision=0):

C:\ProgramData\Anaconda3\lib\concurrent\futures\process.py in _chain_from_iterable_of_lists(iterable)
    364     careful not to keep references to yielded objects.
    365     """
--> 366     for element in iterable:
    367         element.reverse()
    368         while element:

C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py in result_iterator()
    584                     # Careful not to keep a reference to the popped future
    585                     if timeout is None:
--> 586                         yield fs.pop().result()
    587                     else:
    588                         yield fs.pop().result(end_time - time.time())

C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.

Out[12]:
(10.392261561614122, 0.41180742095175948)
2 Likes

Yeah, those are all examples of ensembling which I’m aware of. I’m actually more interested in, for example, taking the output from a DL model and feeding it as an input into an RF or gradient boosted trees.
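Just to make the idea concrete, here’s a rough, purely illustrative sketch (a tiny PyTorch net plus a scikit-learn random forest; the architecture and data are made up) of feeding a network’s penultimate-layer activations into an RF:

import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import RandomForestRegressor

# Toy data standing in for real features/targets
X = np.random.rand(500, 10).astype('float32')
y = (X.sum(axis=1) + np.random.randn(500) * 0.1).astype('float32')

# A tiny net; the output of `body` will serve as learned features
body = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 16), nn.ReLU())
head = nn.Linear(16, 1)
model = nn.Sequential(body, head)

opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
xb, yb = torch.from_numpy(X), torch.from_numpy(y).unsqueeze(1)
for _ in range(200):                 # quick-and-dirty full-batch training loop
    opt.zero_grad()
    loss = loss_fn(model(xb), yb)
    loss.backward()
    opt.step()

# Feed the penultimate-layer activations into a random forest
with torch.no_grad():
    feats = body(xb).numpy()
rf = RandomForestRegressor(n_estimators=100).fit(feats, y)
print(rf.score(feats, y))            # training-set R^2, just for illustration

The point is just that the net’s learned features become columns for the tree model; in a real setup you’d fit and score the forest on a proper validation split, not the training activations.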

2 Likes

We’ll talk about this a bit tonight, and in future lessons :slight_smile:

9 Likes

@gerardo I’m interested in this info but I can’t remember it. Do you still have the video “timestamp” where Jeremy speaks about this? Thanks

To answer myself

1 Like

This course is beyond amazing. I have come across random forests so many times in the past, but I would never get past thinking “well, this is weird” and “I wonder how this can be useful at all”. After the first couple of lectures I went to “wow, this is really cool!”, and I now have a deep sense of understanding what they are about and how each piece fits together. Time to put this understanding to good use :slight_smile:

Given the impact that the fastai courses have on me, there is one question I really would like to ask. If I understand correctly, the second part of this course will be delivered by @yinterian? Is there a chance for it to also be recorded and any hopes for it being shared early like ml1? Obviously that is a lot to hope for but thought I’d ask :slight_smile:

Another related thought - we seem to now have a lot covered by fastai:

  • deep learning
  • machine learning
  • linear algebra, which I am planning to get to as soon as I get a bit of a breather from the amazing firehose of information I seem to be jacked up to atm :slight_smile:

One logical addition would be Statistics for Coders or The Bare Minimum of Stats You Need to Become Dangerous or something like that :wink: There is plenty of good material out there that I will suffer through at some point for lack of a fastai alternative if need be, but I sort of know how these things go, and there is something unique to the fastai courses where you get a deep sense for what you are dealing with (and you even get to use it, which seems like there might be a connection between the two :wink: ). I have had multiple run-ins with statistics at this point as well, but I never seemed to get past where I got with random forests, which is not very far (despite all the time I sunk into it). And what I learned from @jeremy’s and @rachel’s advice on writing posts is that if I am in a boat, chances are it is a big boat with many other folks in a similar situation :blush: Not that I am suggesting anything, just wanted to share this random thought that crossed my mind this morning in between changing diapers!

5 Likes

One little question here regarding feature engineering: do we consider ids (like person_id or item_id) to be categorical or continuous? I can’t get my head around these kinds of features, as they may be continuous (a new, incremented id is given for each new “row”) or they can be totally random. Thanks :slight_smile:

Waiting for the next lesson desperately…

1 Like

Posted lesson 8 to top thread (https://youtu.be/DzE0eSdy5Hk - currently uploading)

7 Likes

Thank you very much @jeremy, great to be able to follow this course alongside DL.

Oh, @rachel is in the lecture. Glad you are OK.

2 Likes

Radek, I will make my Jupyter notebooks and other material public. I do a lot on the whiteboard at the moment, which is not a good setup for recording. Let me see what I can do.

1 Like

@yinterian FYI Terence Parr looked into using an iPad Pro for drawing on and projecting to the laptop, and I believe he said it worked pretty well. Might be worth asking him about it.

One logical addition would be Statistics for Coders or The Bare Minimum of Stats You Need to Become Dangerous or something like that

Here’s a helpful blog post by Julia Evans: Some good “Statistics for programmers” resources

3 Likes

The answer is “it depends”. There’s a little clarification of this in the latest video. Let me know if you’d like more info.
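Just to illustrate the two options (not a recommendation either way), treating an id as categorical versus leaving it as a plain number in pandas might look like this:

import pandas as pd

df = pd.DataFrame({'person_id': [1001, 1002, 1003, 1001],
                   'spend':     [20.5, 13.0, 7.2, 9.9]})

# Option 1: treat the id as categorical - each id becomes its own level,
# so a tree can group rows by id via the category codes
df['person_id_cat'] = df['person_id'].astype('category').cat.codes

# Option 2: leave it as a number - only sensible if the ordering of ids
# actually carries information (e.g. roughly correlates with signup time)
df['person_id_num'] = df['person_id']

print(df)

If ids never repeat and are assigned randomly, neither version carries much signal; if they are roughly sequential, the numeric version can act as a proxy for time.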

1 Like