When you are talking about feature interpretation, the section where you discuss NaN and categorical variables seems clear to me: NaN becomes an additional category.
But what about NaN and continuous values like temperature, pressure, or distance, where it is a real missing value?
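For what it's worth, a common approach for continuous columns (and, if I remember correctly, what fastai's `proc_df`/`fix_missing` in `structured.py` does) is to fill the NaNs with the column median and add a boolean `_na` indicator column, so the model can still exploit the fact that the value was missing. A minimal pandas sketch (the helper name `fix_missing_continuous` is just for illustration):

```python
import numpy as np
import pandas as pd

def fix_missing_continuous(df, col):
    """Median-fill a continuous column and record where values were missing."""
    if df[col].isnull().any():
        # Boolean indicator so the model can learn from missingness itself
        df[col + '_na'] = df[col].isnull()
        df[col] = df[col].fillna(df[col].median())
    return df

df = pd.DataFrame({'temperature': [20.5, np.nan, 23.1, np.nan, 19.8]})
df = fix_missing_continuous(df, 'temperature')
# NaNs replaced by the median of the observed values;
# df now also has a boolean 'temperature_na' column
```

The indicator column matters because missingness itself is often predictive (e.g. a sensor that only fails under certain conditions).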
Isn’t it about combining the results from different models? The models could be anything: logistic regression, KNN, Naive Bayes, XGBoost, deep learning, etc.
A rough rubric:
Train different models on the same training dataset and generate predictions on the test dataset.
If you trained 5 models in step 1, you’ll now have 5 prediction arrays for the test dataset, one from each model.
Now, a few simple ways of mixing them:
(i) hard voting - take the majority label across the 5 predictions for each record (equiweighted averaging)
(ii) soft voting - if a couple of those 5 models were better than the other 3, give them higher weights (weighted averaging)
(iii) stacking - train a second-level model that takes the first-level predictions as inputs
Generate the final prediction for each record in the test set.
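The voting variants above map directly onto scikit-learn's `VotingClassifier`; here is a small sketch on synthetic data (the model choices and weights are arbitrary, just for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

estimators = [('lr', LogisticRegression(max_iter=1000)),
              ('rf', RandomForestClassifier(n_estimators=50, random_state=0)),
              ('nb', GaussianNB())]

# (i) hard voting: majority label across the models' predictions
hard = VotingClassifier(estimators, voting='hard').fit(X_train, y_train)

# (ii) soft voting: average predicted probabilities, here weighting
# the (assumed) stronger models more heavily
soft = VotingClassifier(estimators, voting='soft',
                        weights=[2, 3, 1]).fit(X_train, y_train)

print(hard.score(X_test, y_test), soft.score(X_test, y_test))
```

For (iii), stacking, scikit-learn also offers `StackingClassifier`, which fits the second-level model on out-of-fold predictions for you.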
A much more detailed description with visual summaries is given in this notebook.
Exception in thread Thread-41:
Traceback (most recent call last):
File "C:\ProgramData\Anaconda3\lib\threading.py", line 916, in _bootstrap_inner
self.run()
File "C:\ProgramData\Anaconda3\lib\threading.py", line 864, in run
self._target(*self._args, **self._kwargs)
File "C:\ProgramData\Anaconda3\lib\concurrent\futures\process.py", line 295, in _queue_management_worker
shutdown_worker()
File "C:\ProgramData\Anaconda3\lib\concurrent\futures\process.py", line 253, in shutdown_worker
call_queue.put_nowait(None)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 129, in put_nowait
return self.put(obj, False)
File "C:\ProgramData\Anaconda3\lib\multiprocessing\queues.py", line 83, in put
raise Full
queue.Full
---------------------------------------------------------------------------
BrokenProcessPool Traceback (most recent call last)
<timed exec> in <module>()
D:\Github\fastai\courses\ml1\fastai\structured.py in parallel_trees(m, fn, n_jobs)
18
19 def parallel_trees(m, fn, n_jobs=8):
---> 20 return list(ProcessPoolExecutor(n_jobs).map(fn, m.estimators_))
21
22 def draw_tree(t, df, size=10, ratio=0.6, precision=0):
C:\ProgramData\Anaconda3\lib\concurrent\futures\process.py in _chain_from_iterable_of_lists(iterable)
364 careful not to keep references to yielded objects.
365 """
--> 366 for element in iterable:
367 element.reverse()
368 while element:
C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py in result_iterator()
584 # Careful not to keep a reference to the popped future
585 if timeout is None:
--> 586 yield fs.pop().result()
587 else:
588 yield fs.pop().result(end_time - time.time())
C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()
C:\ProgramData\Anaconda3\lib\concurrent\futures\_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result
BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
Out[12]:
(10.392261561614122, 0.41180742095175948)
Yeah, those are all just examples of ensembling which I’m aware of. I’m actually more interested in, for example, taking the output from a DL model and feeding that as an input into a random forest or gradient boosting trees.
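If it helps, here is a rough sketch of that idea using scikit-learn only (an `MLPClassifier` stands in for the DL model; with a real fastai/PyTorch net you would take activations or predicted probabilities instead). Note that in practice you would want out-of-fold predictions on the training set to avoid leaking labels into the second-level model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stand-in for a DL model (a small fully connected net)
nn = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                   random_state=0).fit(X_tr, y_tr)

# Append the net's predicted probabilities to the raw features,
# then feed the augmented matrix to a random forest
X_tr_aug = np.hstack([X_tr, nn.predict_proba(X_tr)])
X_te_aug = np.hstack([X_te, nn.predict_proba(X_te)])

rf = RandomForestClassifier(n_estimators=100, random_state=0)
rf.fit(X_tr_aug, y_tr)
print(rf.score(X_te_aug, y_te))
```

The same pattern works with gradient boosting in place of the forest; the interesting design question is which layer's activations to export from the net.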
This course is beyond amazing. I have come across random forests so many times in the past, but I would never get past thinking “well, this is weird” and “I wonder how this can be useful at all”. After the first couple of lectures I went to “wow, this is really cool!”, and I now have a deep sense of understanding what they are about and how each piece fits together. Time to put this understanding to good use!
Given the impact that the fastai courses have had on me, there is one question I really would like to ask. If I understand correctly, the second part of this course will be delivered by @yinterian? Is there a chance for it to also be recorded, and any hope of it being shared early like ml1? Obviously that is a lot to hope for, but I thought I’d ask.
Another related thought - we now seem to have a lot covered in the fastai courses:
deep learning
machine learning
linear algebra (which I am planning to get to as soon as I get a bit of a breather from the amazing firehose of information I seem to be jacked up to at the moment)
One logical addition would be Statistics for Coders, or The Bare Minimum of Stats You Need to Become Dangerous, or something like that. There is plenty of good material out there that I will suffer through at some point for lack of a fastai alternative if need be, but I sort of know how these things go, and there is something unique to the fastai courses where you get a deep sense for what you are dealing with (and you even get to use it, which makes me suspect there might be a connection between the two). I have had multiple run-ins with statistics at this point as well, but I never seemed to get past where I got with random forests, which is not very far (despite all the time I sunk into it). And what I learned from @jeremy’s and @rachel’s advice on writing posts is that if I am in a boat, chances are it is a big boat with many other folks in a similar situation. Not that I am suggesting anything - I just wanted to share this random thought that crossed my mind this morning in between changing diapers!
One little question here regarding feature engineering: do we consider ids (like person_id or item_id) to be categorical or continuous? I can’t get my head around these kinds of features, as they may be continuous (a new, incremented id is given for each new “row”) or they can be totally random. Thanks!
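Not an authoritative answer, but a tiny pandas sketch of the two treatments being asked about may make the distinction concrete (the example data is made up):

```python
import pandas as pd

df = pd.DataFrame({'person_id': [101, 205, 101, 307],
                   'height': [1.70, 1.82, 1.70, 1.65]})

# Treated as continuous: the raw integer is used as-is, so a tree model
# can split on id ranges, which is only meaningful if ids encode
# something like signup order over time.
continuous = df['person_id']

# Treated as categorical: each distinct id becomes its own level
# (integer codes 0..n-1), the usual choice when ids are arbitrary.
categorical = df['person_id'].astype('category').cat.codes
```

With very high-cardinality ids, a full categorical treatment can blow up one-hot encodings; integer label codes (as above) or entity embeddings are the usual workarounds.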
Radek, I will make my Jupyter notebooks and other material public. I do a lot on the whiteboard at the moment, which is not a good setup for recording. Let me see what I can do.
@yinterian FYI, Terence Parr looked into drawing on an iPad Pro projected to the laptop, and I believe he said it worked pretty well. Might be worth asking him about.