I’d be up for that. Shoot me an email with a good time to connect live. cakeb@calebeverett.io.
This is all sounding about right to me. @sgugger is likely to be working on some related stuff next week BTW so he might have some useful updates. The basic pieces are already started here: https://github.com/fastai/fastai_docs/blob/master/dev_nb/x_006a_coords.ipynb
Add automatic validation set creation. I currently am working with a dataset where I have a train and test set so I will have to split out a validation set on my own. It would be cool if you could just load a training set and say what percent you wanted to be validation.
Edit: Looking into the source, it looks like some of this functionality may already be built in, or is at least being built in. The ImageClassificationDataset.from_folder method has a valid_pct argument that it uses to split the data. It isn’t extended to the image_data_from_folder function yet, but at least the structure is there for it!
I may be wrong, but I think I saw exactly that in the docs.
You can just call that yourself. See the image_data_from_folder source for an example. It’ll be something like:
train_ds,valid_ds = ImageClassificationDataset.from_folder(path, valid_pct=0.2)
data = DataBunch(train_ds,valid_ds)
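For anyone wondering what a valid_pct split amounts to, here is a minimal sketch in plain Python. It is not the fastai implementation; `split_by_pct` and its `seed` parameter are hypothetical names used only for illustration, and the sketch assumes a simple random hold-out with no stratification by class.

```python
import random

def split_by_pct(items, valid_pct=0.2, seed=None):
    """Randomly hold out `valid_pct` of `items` as a validation set.

    Hypothetical helper for illustration, not part of fastai.
    """
    if not 0.0 <= valid_pct <= 1.0:
        raise ValueError("valid_pct must be between 0 and 1")
    rng = random.Random(seed)      # a fixed seed makes the split reproducible
    shuffled = list(items)         # copy so the caller's data is untouched
    rng.shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_pct)
    # First n_valid shuffled items become validation, the rest training.
    return shuffled[n_valid:], shuffled[:n_valid]

train_items, valid_items = split_by_pct(range(100), valid_pct=0.2, seed=42)
print(len(train_items), len(valid_items))
```

Fixing the seed matters if you want the same validation set across runs; otherwise validation scores from different runs are not comparable.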
https://github.com/fastai/fastai_docs/blob/master/dev_nb/x_006a_coords.ipynb got renamed to https://github.com/fastai/fastai_docs/blob/master/dev_nb/102_tfm_coords.ipynb
Yup, that’s the rework.
You’ll find pieces there to do the inverse of an affine transform or a coord transform, and most of the transforms are implemented at the affine or coord level, so all in all, everything needed to do TTA for segmentation.
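To make the "inverse of an affine transform" part concrete: for segmentation TTA, a prediction made on a transformed image has to be mapped back to the original coordinates before averaging. Below is a rough sketch of inverting a 2D affine transform in plain Python. It is not the code from the notebook; `apply_affine` and `invert_affine` are hypothetical names, and the 2x3 matrix layout and the normalized [-1, 1] coordinates are assumptions made for the example.

```python
def apply_affine(m, point):
    """Apply a 2x3 affine matrix ((a, b, tx), (c, d, ty)) to an (x, y) point."""
    (a, b, tx), (c, d, ty) = m
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)

def invert_affine(m):
    """Return the 2x3 matrix that undoes `m` (hypothetical helper)."""
    (a, b, tx), (c, d, ty) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("affine transform is not invertible")
    # Inverse of the 2x2 linear part.
    ia, ib = d / det, -b / det
    ic, id_ = -c / det, a / det
    # Inverse translation is -(A^-1) applied to the original translation.
    return ((ia, ib, -(ia * tx + ib * ty)),
            (ic, id_, -(ic * tx + id_ * ty)))

# Assumed example: horizontal flip plus a small shift in [-1, 1] coordinates.
flip_and_shift = ((-1.0, 0.0, 0.1), (0.0, 1.0, 0.0))
inverse = invert_affine(flip_and_shift)

p = (0.25, -0.5)
q = apply_affine(flip_and_shift, p)   # where the point lands after augmentation
restored = apply_affine(inverse, q)   # mapped back to the original frame
```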
Requested feature: A progress indicator when tokenizing a TextDataset.
Why: Tokenizing can take a looong time when you have a large corpus, so it would be nice to have some kind of indication how many texts have been processed.
Background: I’m currently building a “base” Dutch language model from a Dutch Wikipedia dump, but after ~10 hours I have no idea if it’s still happily tokenizing or if something is wrong.
I tried to find out if I could add this feature myself, but I’m still getting familiar with the details of the v1 library.
It already is implemented. If you are creating your TextDataset from a csv, it automatically does it by chunks of size 10,000 and you have a progress bar.
If you create it from a dataframe, be sure to load it with chunksize=something to see the same progress bar.
I thought chunksize was removed from the from_df method as indicated here.
Thank you for your reply. I was using text_data_from_folder, because the corpus consists of many separate text files. But I’ll make a csv of the wiki dump instead.
That is weird. Your method creates a csv file that is then automatically opened with a chunksize of 10,000 normally (and you can specify the chunksize you want in the kwargs).
Do you have a csv file in the tmp directory where you were working? Maybe it got stuck while creating it and not during the tokenization.
It was removed from the arguments of the from_df method, but you can pass a dataframe that has been loaded with a given chunksize like:
df = pd.read_csv('my_file.csv', header=None, chunksize=10000)
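The reason chunking gives you a progress bar is simply that the work happens batch by batch, so progress can be reported between batches. Here is a minimal sketch of that idea in plain Python. It does not use fastai or pandas; `iter_chunks` and `tokenize_in_chunks` are hypothetical names, and whitespace splitting stands in for a real tokenizer.

```python
from itertools import islice

def iter_chunks(iterable, chunksize):
    """Yield successive lists of at most `chunksize` items."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

def tokenize_in_chunks(texts, chunksize=10000, tokenize=str.split):
    """Tokenize `texts` chunk by chunk, printing progress after each chunk.

    `tokenize` defaults to whitespace splitting, a stand-in for a real tokenizer.
    """
    tokens, done = [], 0
    for chunk in iter_chunks(texts, chunksize):
        tokens.extend(tokenize(t) for t in chunk)
        done += len(chunk)
        print(f"tokenized {done} texts")   # one progress line per chunk
    return tokens

docs = ["een korte zin", "nog een zin", "de laatste"]
toks = tokenize_in_chunks(docs, chunksize=2)
```

A smaller chunksize gives more frequent progress updates at the cost of a little more per-chunk overhead.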
Oh ok. What I’ve been doing is just pickling my train/test/val dataframes and loading them back in. That’s why there is really no chunksize there. Thanks for clearing that up for me.
The GAN implementation is obviously interesting considering the computer vision stuff shown in the previous course. I was wondering if there is a plan for generative text architectures for v1? GANs are interesting here, but they don’t seem to work as well as VAEs due to the discrete distribution of text data, and they are probably a bit harder to implement. There are a few implementations of VAE- and GAN-based generative models out there right now; a good summary of generative text models is here:
https://www.kuanchchen.com/post/nlp_generative_model/
I am also really curious / excited to see GNMT (https://arxiv.org/abs/1806.05138) implemented. There is no code anywhere yet as far as I can see, so it might be tricky.
I think apart from the obvious potential and interest in this (sub)field, it also fits with what has been implemented in fast.ai so far. For example, language models are routinely used as pre-trained embeddings for the encoders, the architectures are usually RNN based (with the exception of a temporal CNN, which is also really interesting to implement in its own right), etc.
Just throwing it out there, hope this was the right place!
Regards,
Theodore.
FYI: The (fastprogress) progressbars were not showing up because I was using Jupyterlab.
Running:
$ jupyter labextension install @jupyter-widgets/jupyterlab-manager
solved this issue.
Hi,
Is there any way to create a fastai.data.DataBunch from TextDataset objects? Something like
trn_ds = TextDataset(trn_ids, trn_lbs)
val_ds = TextDataset(val_ids, val_lbs)
data_clas = text_data_from_ids (trn_ds,val_ds,...).
If this cannot be done currently with fastai V1 then I guess this may be quite useful as it will allow us to change data easily if we want to sample/add/remove data (and labels) without creating data in the folders.
You can create a DataBunch from any fastai datasets with:
data_clas = DataBunch.create(train_ds, valid_ds,...)
Full doc here.
Feature: pruning the Vocab during tokenization to avoid running out of RAM during tokenization of a large dataset
Issue: The issue of running out of RAM has been discussed elsewhere on the forum (http://forums.fast.ai/t/language-model-zoo-gorilla/). Jeremy suggested using a smaller dataset and limiting the size of the vocabulary. However, in certain languages like German and Dutch, compound words are very common. For example: in English “traffic light” is two separate words, but in Dutch (and other Germanic languages) it’s a single word, “verkeerslicht”. So it makes sense that these languages require a larger vocabulary, because they have more unique words. And it would be nice to be able to throw a large dataset at the fastai library without having to worry too much about running out of RAM during tokenization.
Solution: instead of tokenizing all texts, keeping this list in memory and counting the frequency when building the vocabulary, I suggest keeping track of the word frequency during tokenization and “pruning” this counter when it exceeds a certain threshold, keeping only the most frequent tokens. This threshold can be (a lot) larger than max_vocab.
I’m working on an implementation, should I post it here when done?
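As a rough illustration of the pruning idea described above (not the poster’s implementation and not fastai code), here is a sketch using collections.Counter. The names `count_with_pruning`, `build_vocab`, `prune_at` and `keep` are hypothetical; only max_vocab comes from the discussion.

```python
from collections import Counter

def count_with_pruning(token_lists, prune_at=1_000_000, keep=500_000):
    """Count token frequencies, trimming the counter when it grows too large.

    `prune_at` and `keep` are hypothetical parameters: once the counter holds
    more than `prune_at` distinct tokens, only the `keep` most frequent stay.
    """
    if keep > prune_at:
        raise ValueError("keep must not exceed prune_at")
    freq = Counter()
    for tokens in token_lists:      # can be a generator, so texts are streamed
        freq.update(tokens)
        if len(freq) > prune_at:
            freq = Counter(dict(freq.most_common(keep)))
    return freq

def build_vocab(freq, max_vocab):
    """Return the `max_vocab` most frequent tokens."""
    return [tok for tok, _ in freq.most_common(max_vocab)]

# Tiny example with an artificially low threshold to trigger one pruning step.
texts = [["de", "kat"], ["de", "hond"], ["de", "kat", "slaapt"]]
freq = count_with_pruning(texts, prune_at=3, keep=2)
vocab = build_vocab(freq, max_vocab=2)
```

Note that this kind of pruning is lossy: a token that is rare early in the corpus and gets pruned will restart from zero if it shows up again later, so its final count can be too low. Keeping the threshold well above max_vocab reduces the chance that this changes the final vocabulary.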
This is definitely something we can look at. If you do your own implementation, please share it in a notebook, as explained here.
Feature: in-place Activated Batch Normalization.
I came across this code by Mapillary (https://github.com/mapillary/inplace_abn) while looking into some image segmentation models. I am not aware of a thorough test of it, but the results on the two popular datasets are good. They claim 50% memory savings in architectures typically used in fast.ai.
It’s also built for pytorch (0.4) which might make porting it easier? Anyways, thought I’d put it here.
Hope this is interesting to some of you.
Kind regards,
Theodore.