Fastai_v1, adding features

You can just call that yourself. See the image_data_from_folder source for an example. It’ll be something like:

train_ds, valid_ds = ImageClassificationDataset.from_folder(path, valid_pct=0.2)
data = DataBunch(train_ds, valid_ds)

https://github.com/fastai/fastai_docs/blob/master/dev_nb/x_006a_coords.ipynb got renamed to https://github.com/fastai/fastai_docs/blob/master/dev_nb/102_tfm_coords.ipynb


Yup, that’s the rework.
You’ll find pieces there to take the inverse of an affine transform or a coord transform, and most of the transforms are implemented at the affine or coord level, so all in all, everything needed to do TTA for segmentation.
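For context, a minimal sketch of what TTA for segmentation amounts to (framework-agnostic, not fastai’s actual implementation; `model` is a hypothetical predictor, and a horizontal flip stands in for a general transform since it is its own inverse):

import numpy as np

def tta_segmentation(model, img):
    # model(img) is assumed to return per-pixel class probabilities of shape (H, W, C)
    preds = [model(img)]
    # transform the input (here: a horizontal flip), predict, then apply the
    # inverse transform to the prediction so it lines up with the original image
    flipped = img[:, ::-1]
    preds.append(model(flipped)[:, ::-1])
    # average the aligned predictions
    return np.mean(preds, axis=0)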


Requested feature: A progress indicator when tokenizing a TextDataset.

Why: Tokenizing can take a long time when you have a large corpus, so it would be nice to have some kind of indication of how many texts have been processed.

Background: I’m currently building a “base” Dutch language model from a Dutch Wikipedia dump, but after ~10 hours I have no idea if it’s still happily tokenizing or if something is wrong.

I tried to find out if I could add this feature myself, but I’m still getting familiar with the details of the v1 library.


It’s already implemented. If you create your TextDataset from a csv, it is automatically processed in chunks of 10,000 and you get a progress bar.
If you create it from a dataframe, be sure to load it with chunksize=something to see the same progress bar.

I thought chunksize was removed from the from_df method as indicated here.

Thank you for your reply. I was using text_data_from_folder, because the corpus consists of many separate text files. But I’ll make a csv of the wiki dump instead.

That is weird. Your method normally creates a csv file that is then automatically opened with a chunksize of 10,000 (and you can specify the chunksize you want in the kwargs).
Do you have a csv file in the tmp directory where you were working? Maybe it got stuck while creating it and not during the tokenization.

It was removed from the arguments of the from_df method, but you can pass a dataframe that has been loaded with a given chunksize, like:

import pandas as pd

# with chunksize set, read_csv returns an iterator of chunks, which is what gives the progress bar
df = pd.read_csv('my_file.csv', header=None, chunksize=10000)

Oh OK. What I’ve been doing is just pickling my train/test/val dataframes and loading them back in, so there is really no chunksize there. Thanks for clearing that up for me.

The GAN implementation is obviously interesting given the computer vision work shown in the previous course. I was wondering whether there is a plan for generative text architectures in v1? GANs are interesting here, but they don’t seem to work as well as VAEs due to the discrete distribution of text data, and they are probably a bit harder to implement. There are a few implementations of VAE- and GAN-based generative text models out there right now; a good summary is here:
https://www.kuanchchen.com/post/nlp_generative_model/

I am also really curious/excited to see GNMT (https://arxiv.org/abs/1806.05138) implemented; the code isn’t available anywhere yet as far as I can see, so it might be tricky.

I think that, apart from the obvious potential of and interest in this (sub)field, it also fits with what has been implemented in fast.ai so far. For example, language models are routinely used as pre-trained embeddings for the encoders, the architectures are usually RNN-based (with the exception of a temporal CNN, which is also really interesting to implement in its own right), etc.

Just throwing it out there, hope this was the right place!

Regards,
Theodore.

FYI: the (fastprogress) progress bars were not showing up because I was using JupyterLab.

Running:

$ jupyter labextension install @jupyter-widgets/jupyterlab-manager

solved this issue.


Hi,
Is there any way to create a fastai.data.DataBunch from TextDataset objects? Something like:

   trn_ds = TextDataset(trn_ids, trn_lbs)
   val_ds = TextDataset(val_ids, val_lbs)

   data_clas = text_data_from_ids(trn_ds, val_ds, ...)

If this cannot currently be done with fastai v1, then I think it would be quite useful, as it would let us change the data easily when we want to sample/add/remove data (and labels) without creating the data in folders.


You can create a DataBunch from any fastai datasets with:

data_clas = DataBunch.create(train_ds, valid_ds,...)

Full doc here.


Feature: pruning the Vocab during tokenization, to avoid running out of RAM when tokenizing a large dataset

Issue: The issue of running out of RAM has been discussed elsewhere on the forum (http://forums.fast.ai/t/language-model-zoo-gorilla/). Jeremy suggested using a smaller dataset and limiting the size of the vocabulary. However, in certain languages like German and Dutch, compound words are very common. For example: in English “traffic light” is two separate words, but in Dutch (and other Germanic languages) it’s a single word, “verkeerslicht”. So it makes sense that these languages require a larger vocabulary, because they have more unique words. And it would be nice to be able to throw a large dataset at the fastai library without having to worry too much about running out of RAM during tokenization.

Solution: instead of tokenizing all texts, keeping that list in memory and counting the frequencies when building the vocabulary, I suggest keeping track of the word frequencies during tokenization and “pruning” this counter when it exceeds a certain threshold, keeping only the most frequent tokens. This threshold can be (a lot) larger than max_vocab.
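A minimal sketch of the idea (the names prune_threshold and keep_top are made up; this is not the current fastai API):

from collections import Counter

def count_tokens_with_pruning(tokenized_texts, prune_threshold=2_000_000, keep_top=500_000):
    freq = Counter()
    for tokens in tokenized_texts:
        freq.update(tokens)
        if len(freq) > prune_threshold:
            # drop rare tokens so the counter stays bounded;
            # keep_top should still be (a lot) larger than max_vocab
            freq = Counter(dict(freq.most_common(keep_top)))
    return freq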

I’m working on an implementation, should I post it here when done?


This is definitely something we can look at. If you do your own implementation, please share it in a notebook, as explained here.

Feature: in-place Activated Batch Normalization.

I came across this code by Mapillary (https://github.com/mapillary/inplace_abn) while looking into some image segmentation models. I am not aware of a thorough test of it, but the reported results on the two popular datasets are good. They claim 50% memory savings in architectures typically used in fast.ai.

It’s also built for PyTorch (0.4), which might make porting it easier. Anyway, I thought I’d put it here.

Hope this is interesting to some of you.

Kind regards,
Theodore.


One thing I miss is a way to keep track of different runs: not only the results, but also the full training history of a specific model.


Thanks @sgugger for the reply.
I thought TextDataset was the same as before, i.e. that it could be created directly from numpy arrays of ids (the int representation of the text). However, I think this is no longer possible, so DataBunch.create(train_ds, valid_ds, ...) is not very usable here.

I think that with the current fastai v1 it is not possible to create a DataBunch from the data (tokens, ids) directly; only folder-based options are available to create these objects (DataBunch, TextDataset, etc.). Imagine someone wants to run a text classifier on a few different samples of data: the folder-based options will not be very useful, as they would have to create and save all those samples in a folder. Every time they want to change the data, increase or decrease the labelled data, create new samples, or modify the data, they have to save all those snippets of data to the folder in order to create a DataBunch. Is there any way the current library can handle such scenarios (creating a DataBunch directly from numpy arrays)? I think this is a very practical need in many experiments and tests. A rough sketch of what I mean is below.
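To make the scenario concrete, here is a sketch of the kind of thing I have in mind (IdsDataset is a made-up wrapper, not a fastai class, and I don’t know whether the text learner would accept it, since it expects a proper TextDataset with a vocab):

from torch.utils.data import Dataset
from fastai.data import DataBunch  # module path as referenced in this thread; it may differ in later v1 releases

class IdsDataset(Dataset):
    "Plain wrapper around in-memory id arrays and their labels."
    def __init__(self, ids, labels):
        self.ids, self.labels = ids, labels
    def __len__(self):
        return len(self.ids)
    def __getitem__(self, i):
        return self.ids[i], self.labels[i]

trn_ds = IdsDataset(trn_ids, trn_lbs)  # trn_ids/trn_lbs: the in-memory numpy arrays mentioned above
val_ds = IdsDataset(val_ids, val_lbs)
data_clas = DataBunch.create(trn_ds, val_ds)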

It’s already possible to create a TextDataset from folders, a dataframe, a csv file, token files or id files (a saved np.array containing the ids), as explained in the documentation.
As for creating a DataBunch, there is a function for each of those methods.

Yes, but I think all of these methods require file paths (the path to a csv file, to token files or to id files). Let’s imagine I have created these objects (train_ids, train_labels, valid_ids, valid_labels, itos.pkl) and they are currently in memory. Can we create a TextDataset from these in-memory objects without saving them to disk?