Where is the todo list for v2? I’d love to help knock out a few small things.
Also, any thoughts on a feature that auto-adds a Table of Contents at the top of the notebook based on headers? It would be nested based on header size. I’d be happy to do it.
There’s also a JupyterLab extension that generates a sidebar TOC dynamically.
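The nesting logic could be sketched roughly like this (a minimal sketch assuming markdown-style headers in the notebook cells; `build_toc` and its anchor scheme are made up for illustration, not from any existing extension):

```python
import re

def build_toc(markdown_lines):
    "Return a nested markdown list linking to each header, indented by header level."
    toc = []
    for line in markdown_lines:
        m = re.match(r'^(#{1,6})\s+(.*)', line)
        if m:
            level, title = len(m.group(1)), m.group(2).strip()
            anchor = title.lower().replace(' ', '-')
            toc.append('  ' * (level - 1) + f'- [{title}](#{anchor})')
    return '\n'.join(toc)
```

The real version would need to collect markdown cells from the notebook JSON and handle duplicate anchors, but the header-level-to-indent mapping is the core idea.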
I was able to log data similar to the image below on this W&B run.
For simplicity of the test, I logged training loss at every epoch and validation/accuracy/prediction samples every 4 epochs.
I used an old project with v1.
Let me know if this is what you want and I can adapt it and submit a PR to fastai_dev.
What would be the process for preparing a quick script to install libraries? Reasoning: one specifically for Google Colab (as things like a PIL update, among others, need to happen). Are there any recommendations, besides just a “do x before doing anything” note or a warning on the repo?
It needs a few more, such as a Pillow upgrade and possibly one more after a fix is implemented in PyTorch 1.3.1, along with a notice to restart the kernel after running it.
There’s an issue in Colab (see the v2 vision thread) that’s being fixed soon (tonight). I was looking for a quick one-liner, and your extended pip command should do the trick, thanks. Once PyTorch has pushed the update, I’ll verify it works as intended and then post on GitHub (as pretty much all the vision transforms won’t work until then).
I thought I would post this video I watched a couple of weeks ago:
It goes over concepts like metaclasses, context managers, etc. which seem to be important parts of the lower-level API of fastai v2. I hope this is a helpful video.
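For anyone who hasn’t met these concepts, here’s a toy sketch of both (purely illustrative; the names `temporary`, `Registered`, and `MyTransform` are made up here and are not fastai’s actual implementation):

```python
from contextlib import contextmanager

@contextmanager
def temporary(lst, item):
    "Context manager: add item on enter, remove it again on exit."
    lst.append(item)
    try:
        yield lst
    finally:
        lst.remove(item)

class Registered(type):
    "Metaclass: keep a registry of every class defined with it."
    registry = []
    def __new__(mcls, name, bases, ns):
        cls = super().__new__(mcls, name, bases, ns)
        mcls.registry.append(cls)
        return cls

class MyTransform(metaclass=Registered):
    pass
```

Context managers guarantee cleanup even on exceptions, and metaclasses let a library hook into class creation itself, which is the kind of machinery the video covers.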
I’m trying to load a seq2seq dataset that is larger than what can fit in memory.
I can load samples with this class derived from a torch Dataset:
import linecache
from torch.utils.data import Dataset

class SeqDataset(Dataset):
    # ...
    def get_line(self, idx):
        # linecache reads a single line lazily; note it is 1-indexed
        line = linecache.getline(self.path, idx + 1)
        return line.rstrip("\n").split("\t")

    def __getitem__(self, idx):
        return self.get_line(idx)[:2]
I’m trying to see if I can make this work with the tools from fastai v2. The first thing I need to do is set up a Pipeline for tokenization and numericalization. However, peeking into the code, I realized transforms are set up by going through the whole training set! In my case, not only can I not afford to do that, it’s also unnecessary because my data has a very limited vocabulary. Also, TfmdList expects the input items to be listy, but I can’t load my whole dataset into a list.
Do you think it would be a good inclusion to make a version of DataSource or TfmdList specific to large datasets, for example by using only a small subset to set up the transforms?
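As a rough illustration of the subset idea, the vocab could be built from just the first N lines of the file (a sketch; `build_vocab_from_subset` is a hypothetical helper for this thread, not a fastai API):

```python
import itertools
from collections import Counter

def build_vocab_from_subset(path, n_lines=10_000, min_freq=1):
    "Build a token->id vocab from only the first n_lines of a large tsv."
    counts = Counter()
    with open(path) as f:
        for line in itertools.islice(f, n_lines):
            # assumes tab-separated (source, target) pairs, whitespace tokens
            for field in line.rstrip('\n').split('\t')[:2]:
                counts.update(field.split())
    itos = ['<unk>'] + [tok for tok, c in counts.most_common() if c >= min_freq]
    return {tok: i for i, tok in enumerate(itos)}
```

With a limited vocabulary like mine, a few thousand lines should already see every token, so the transforms never need to scan the full file.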
Question: is it possible to include our own custom dicts in a Pipeline? Or must we use a DataBlock object? For example, I’m trying to do ImageWoof at the moment with a custom dict for my y’s. I can do:
As far as I understand, the text nbs pass the tokenized dataset to DataSource. Following the chain of inheritance, the dataset eventually gets converted to an L, which excludes the possibility of using a cached dataset that is not fully loaded in memory. It can be made to sort of work by defining the ds like this:
But range_of is fully unrolled, and over 2 GB are being used just for storing the list of indexes (in my case, I don’t need shuffling, so it’s just an unrolled range).
I’ll try to look for fixes that could alleviate this.
You asked about tokenization and numericalization, neither of which uses a DataSource, and both of which can operate on folders of files which are only loaded one chunk at a time. I’m not quite following your issue…
I apologize, clarity is not one of my strengths. I realized I don’t need a DataSource to set up numericalization, and I can tokenize single chunks of text.
Now I’m trying to use my data with DataBunch and Learner. In my case, the whole dataset is saved as a huge (~10 GB) tsv file. Since I can’t load the file in memory, I can’t think of a good solution to make the dataset indexable (to implement __getitem__). I wonder if there’s a way to use fastai’s DataLoader without making the assumption that the dataset object implements __getitem__.
For example, in PyTorch you can subclass IterableDataset and then pass it to a DataLoader. This is really practical, because all I’d need to do is yield each line in the dataset’s __iter__ method.
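To make the idea concrete, here’s a minimal sketch of the PyTorch pattern I mean (assuming one tab-separated (source, target) pair per line; `TsvStream` and the file name are illustrative):

```python
import csv
from torch.utils.data import IterableDataset, DataLoader

class TsvStream(IterableDataset):
    "Yield (source, target) pairs line by line, never loading the whole file."
    def __init__(self, path):
        self.path = path
    def __iter__(self):
        with open(self.path, newline='') as f:
            for row in csv.reader(f, delimiter='\t'):
                yield row[0], row[1]

# dl = DataLoader(TsvStream('big_dataset.tsv'), batch_size=64)
```

No `__getitem__` and no `__len__` needed: the DataLoader just pulls from the iterator, which is exactly what a huge on-disk tsv wants.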
Sorry if there’s already a way and I missed it. fastai is a real gem and every day I realize it’s even more flexible than I thought the day before. To everyone who’s involved in the development, thanks for all your work!
EDIT: just found out about bcolz. Let’s see what it can do