Fastai v2 text

Thanks for the reply! Is this an error in the library then? It seems strange that TextDataLoaders.from_csv accepts a text_col argument if the header of the text column always has to be text.

1 Like

The default tokenizer sets the column to "text", but you can write a custom one that sets it differently. Or you can import a csv with text that has already been tokenized outside of fastai, and it might have a different column name.

I don’t think it’s an error. Little things like this should become more clear once the latest courses are released, which use fastai2, and the fastai2 documentation is fleshed out.
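Conceptually, text_col just points the loader at the right header, so a pre-tokenized csv can use any column name. A tiny stdlib sketch of picking a text column by name (toy csv and column name, not fastai internals):

```python
import csv, io

# Toy csv whose text column is not called "text"
# (e.g. tokenized outside of fastai).
raw = "label,review_text\npos,xxbos this movie was great\nneg,xxbos terrible film\n"

def read_text_col(f, text_col):
    """Pull the named text column out of a csv file object."""
    return [row[text_col] for row in csv.DictReader(f)]

texts = read_text_col(io.StringIO(raw), text_col="review_text")
print(texts)  # ['xxbos this movie was great', 'xxbos terrible film']
```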

1 Like

Hello!

Could someone help me with this?

What is the equivalent of label_from_df for fastai2?
I have a multicategory classification problem with data looking like this:

sample_submission.csv test.csv test_labels.csv train.csv

id comment_text toxic severe_toxic obscene threat insult identity_hate
0 0000997932d777bf Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren’t vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don’t remove the template from the talk page since I’m retired now.89.205.38.27 0 0 0 0 0 0

I try to load the data with the following code:

dls = TextDataLoaders.from_df(df,data_drive,valid_pct=0.1,text_col="comment_text",label_col=["toxic","severe_toxic","obscene","threat","insult","identity_hate"])

But the library fails to correctly detect the labels.
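For what it's worth, fastai2 handles the multi-label case through the DataBlock API with MultiCategoryBlock, where each row's 0/1 label columns become a multi-hot target. A minimal pure-Python sketch of that encoding (no fastai required; column names taken from the dataframe above):

```python
# Label columns from the toxic-comment dataframe in the post above.
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def row_to_multi_hot(row):
    """Turn one row's 0/1 label columns into the multi-hot list a
    MultiCategoryBlock-style target represents."""
    return [int(row[c]) for c in label_cols]

def active_labels(row):
    """Names of the labels that are 'on' for this row."""
    return [c for c in label_cols if row[c] == 1]

row = {"comment_text": "some comment", "toxic": 1, "severe_toxic": 0,
       "obscene": 1, "threat": 0, "insult": 0, "identity_hate": 0}
print(row_to_multi_hot(row))  # [1, 0, 1, 0, 0, 0]
print(active_labels(row))     # ['toxic', 'obscene']
```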

@hackerbear - Have a look at this https://dev.fast.ai/tutorial.datablock#Text-classification . It's not the exact answer you are looking for, but it may help.

imdb_clas = DataBlock(blocks=(TextBlock.from_df('text', seq_len=72, vocab=dls.vocab), CategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader('label'),
                      splitter=ColSplitter())

1 Like

Thank you for this excellent intermediate wiki tutorial.

This is related to extending the wiki text tutorial to use SentencePiece and customizing it (e.g. setting model_type to bpe).

I am following the exact steps described and it works great with the mid-level API, but I'm facing an issue setting up the Transforms when I customize SentencePiece.

Phase 1
tfms = [attrgetter('text'), Tokenizer.from_df(text_cols=0), Numericalize()] => Works fine since this uses default SpacyTokenizer

Phase 2
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer) => Works fine as well. It uses tokenizer function as SentencePieceTokenizer

Phase 3
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=partial(SentencePieceTokenizer, model_type='bpe')) => Facing AttributeError: 'NoneType' object has no attribute 'EncodeAsPieces'

This colab (commenting enabled) demonstrates the above issue with a starter example. My thinking is that I am not using partial correctly, or that I don't know how to customize SentencePiece.

Any help is appreciated.

1 Like

Answering my own question. To customize the Tokenizer transform, pass the extra keyword arguments directly to from_df:

sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer, model_type='bpe', vocab_sz=1000)
tfms = [attrgetter('text'), sent_tfm, Numericalize()]

Updated the colab as well.
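The reason passing the keyword arguments directly works is that Tokenizer.from_df forwards extra kwargs on to tok_func when it instantiates the tokenizer, as the fix above shows. A toy sketch of that forwarding pattern (toy names, not fastai's actual internals):

```python
# Toy stand-in for a tokenizer class like SentencePieceTokenizer.
class ToyTokenizer:
    def __init__(self, model_type="unigram", vocab_sz=8000):
        self.model_type, self.vocab_sz = model_type, vocab_sz

# Toy stand-in for a from_df-style constructor: any extra keyword
# arguments are forwarded to tok_func when the tokenizer is built.
def toy_from_df(text_cols, tok_func=ToyTokenizer, **kwargs):
    return tok_func(**kwargs)

tok = toy_from_df(text_cols=0, tok_func=ToyTokenizer, model_type="bpe", vocab_sz=1000)
print(tok.model_type, tok.vocab_sz)  # bpe 1000
```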

2 Likes

I have a quick Question on theory:

  1. I trained a language model
  2. created a classifier for neg/pos sentiments,
  3. but when I call learn.predict(), my probability tensor has an extra probability: ('pos', tensor(2), tensor([1.2444e-12, 5.5311e-02, 9.4469e-01])).

Q: Shouldn’t the probabilities from predict only have probabilities for 2 classes (pos & neg)?

  • The original csv has just text/category.
  • learn.show_results(), as expected, shows the following headers: text, category, _category.

How could this ā€œmismatchā€ in number-of-categories vs num-of-probabilities be possible?

Thanks for any and all help.

but when I call learn.predict(), my probability tensor has an extra probability: ('pos', tensor(2), tensor([1.2444e-12, 5.5311e-02, 9.4469e-01])).

This does not make sense.

  • Could you post the target variable(label) vocab please?

It does look like there is an extra category in addition to pos & neg.

It didn't make sense at all, but I was able to resolve it by loading the csv directly through pandas. Before, I had manually created a csv using pandas by mixing texts and columns… but loading a plain old csv worked!

On a similar topic have you been able to do inference without having to go through the whole training process?

I have a working classifier now, but I'm finding that to create the classifier on a different machine I first have to create a language model. On a different (non-GPU) machine I'm unable to load the fine-tuned LM or a saved classifier.

Does this fix still apply? I'm running into the same language model training issues, but isn't that why seq_len was added?

I ran across this exact issue today, and splitting up my dataframe from

  1. A small number of rows with a huge amount of data in each row.

To

  1. A huge number of rows, with a small amount (comparatively) of data in each row.

Dramatically decreased training time, and indeed stopped my GPU from going idle.
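The split described above can be done before building the DataLoaders, e.g. by chunking each long document into fixed-size pieces. A rough sketch (the 200-word chunk size is an arbitrary choice):

```python
def chunk_text(text, max_words=200):
    """Split one long string into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# A few huge rows...
long_rows = ["word " * 500, "token " * 450]
# ...become many comparatively small rows.
short_rows = [piece for doc in long_rows for piece in chunk_text(doc.strip())]
print(len(short_rows))  # 6 pieces instead of 2 rows
```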

Did you train the language/classification model using fp16() (mixed precision training)? If so, you can only use it on a machine with a GPU.

It's very likely you had an additional class in there. Probably a typo. A random cell with "positivee" instead of "positive" will result in a third class the model trains on.

Since the first prediction probability is so low (e-12), that’s likely what happened.
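That failure mode is easy to reproduce: the class vocab is built from the unique label values, so a single typo cell adds a third class, and predict() then returns three probabilities. A small plain-Python sketch (not fastai's actual vocab code):

```python
# One mistyped cell among otherwise clean pos/neg labels.
labels = ["pos", "neg", "pos", "neg", "positivee", "pos"]

# Building a class vocab from the unique label values:
vocab = sorted(set(labels))
print(vocab)       # ['neg', 'pos', 'positivee']
print(len(vocab))  # 3 -> predict() returns 3 probabilities, not 2
```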

1 Like

Unfortunately, each row is a single post, so it contains minimal data per row. In v1, the language model trained for 1 hour 30 minutes per epoch with fp16 and an 8 GB card. I'm currently using a 24 GB card and fp16, and it's taking 2 hours 30 minutes per epoch. I'm definitely missing some configuration setting, I think.

Have you increased the batch size so you're using all of your 24 GB?

I have also seen much slower training times, slower DataLoader processing, and higher RAM (not GPU RAM) usage when working with language models in fastai2. In this thread I'm pretty sure sugger said something about all the objects being pickled now, compared to fastai1, which I took as an explanation for my observations.

Yikes :confused: excuse me for my ignorance, but how then can one train/deploy on a machine without GPU capabilities?

If you’re using it, simply remove the fp16(). It will still use your GPU to train, but it will do it in 32bit mode, which can also be used by machines without a GPU.

The fp16() mode is exclusive to Nvidia GPUs.

Yes, batch size is 256, which has the card pegged at 23 GB usage per epoch. I think I'm missing something obvious lol. I'll go back and run through my dataloaders again.

I don't think you're missing anything obvious. It looks like fastai2 is doing more than fastai1: creating the DataLoaders uses more RAM, takes longer to process, and takes up more space when saved. I think it's reasonable to assume that using more data will result in longer training times.