Fastai v2 text

Thanks for the reply! Is this an error in the library then? It seems strange that TextDataLoaders.from_csv accepts a text_col argument if the header of the text column always has to be text.

1 Like

The default tokenizer sets the column to "text", but you can write a custom one that sets it differently. Or you can import a csv with text that has already been tokenized outside of fastai, and it might have a different column name.

I don’t think it’s an error. Little things like this should become more clear once the latest courses are released, which use fastai2, and the fastai2 documentation is fleshed out.
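Conceptually, text_col just points the loader at the right header, so a pre-tokenized csv can use any column name. A tiny stdlib sketch of picking a text column by name (toy csv and column name, not fastai internals):

```python
import csv, io

# Toy csv whose text column is not called "text"
# (e.g. tokenized outside of fastai).
raw = "label,review_text\npos,xxbos this movie was great\nneg,xxbos terrible film\n"

def read_text_col(f, text_col):
    """Pull the named text column out of a csv file object."""
    return [row[text_col] for row in csv.DictReader(f)]

texts = read_text_col(io.StringIO(raw), text_col="review_text")
print(texts)  # ['xxbos this movie was great', 'xxbos terrible film']
```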

1 Like

Hello!

Could someone help me with this?

What is the equivalent of label_from_df for fastai2?
I have a multicategory classification problem with data looking like this:

sample_submission.csv test.csv test_labels.csv train.csv

id comment_text toxic severe_toxic obscene threat insult identity_hate
0 0000997932d777bf Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren’t vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don’t remove the template from the talk page since I’m retired now.89.205.38.27 0 0 0 0 0 0

I try to load the data with the following code:

dls = TextDataLoaders.from_df(df,data_drive,valid_pct=0.1,text_col="comment_text",label_col=["toxic","severe_toxic","obscene","threat","insult","identity_hate"])

But the library fails to correctly detect the labels.
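For what it's worth, fastai2 handles the multi-label case through the DataBlock API with MultiCategoryBlock, where each row's 0/1 label columns become a multi-hot target. A minimal pure-Python sketch of that encoding (no fastai required; column names taken from the dataframe above):

```python
# Label columns from the toxic-comment dataframe in the post above.
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

def row_to_multi_hot(row):
    """Turn one row's 0/1 label columns into the multi-hot list a
    MultiCategoryBlock-style target represents."""
    return [int(row[c]) for c in label_cols]

def active_labels(row):
    """Names of the labels that are 'on' for this row."""
    return [c for c in label_cols if row[c] == 1]

row = {"comment_text": "some comment", "toxic": 1, "severe_toxic": 0,
       "obscene": 1, "threat": 0, "insult": 0, "identity_hate": 0}
print(row_to_multi_hot(row))  # [1, 0, 1, 0, 0, 0]
print(active_labels(row))     # ['toxic', 'obscene']
```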

@hackerbear - Have a look at this https://dev.fast.ai/tutorial.datablock#Text-classification . It's not the exact answer you are looking for, but it may help.

imdb_clas = DataBlock(blocks=(TextBlock.from_df('text', seq_len=72, vocab=dls.vocab), CategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader('label'),
                      splitter=ColSplitter())

1 Like

Thank you for this excellent intermediate wiki tutorial.

This is related to extending the wiki text tutorial to use SentencePiece and customizing it (e.g. setting model_type to bpe).

I am following the exact steps described and it works great with the mid-level API, but I'm facing an issue setting up the Transforms when I customize SentencePiece.

Phase 1
tfms = [attrgetter('text'), Tokenizer.from_df(text_cols=0), Numericalize()] => Works fine since this uses default SpacyTokenizer

Phase 2
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer) => Works fine as well. It uses tokenizer function as SentencePieceTokenizer

Phase 3
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=partial(SentencePieceTokenizer, model_type='bpe')) => Facing AttributeError: 'NoneType' object has no attribute 'EncodeAsPieces'

This colab (commenting enabled) demonstrates the above issue with a starter example. My thinking is that I am not using partial correctly, or that I don't know how to customize SentencePiece.

Any help is appreciated.

1 Like

Answering my own question. To customize the Tokenizer transform, pass the extra keyword arguments directly to from_df:

sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer, model_type='bpe', vocab_sz=1000)
tfms = [attrgetter('text'), sent_tfm, Numericalize()]

Updated the colab as well.
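The reason passing the keyword arguments directly works is that Tokenizer.from_df forwards extra kwargs on to tok_func when it instantiates the tokenizer, as the fix above shows. A toy sketch of that forwarding pattern (toy names, not fastai's actual internals):

```python
# Toy stand-in for a tokenizer class like SentencePieceTokenizer.
class ToyTokenizer:
    def __init__(self, model_type="unigram", vocab_sz=8000):
        self.model_type, self.vocab_sz = model_type, vocab_sz

# Toy stand-in for a from_df-style constructor: any extra keyword
# arguments are forwarded to tok_func when the tokenizer is built.
def toy_from_df(text_cols, tok_func=ToyTokenizer, **kwargs):
    return tok_func(**kwargs)

tok = toy_from_df(text_cols=0, tok_func=ToyTokenizer, model_type="bpe", vocab_sz=1000)
print(tok.model_type, tok.vocab_sz)  # bpe 1000
```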

2 Likes

I have a quick Question on theory:

  1. I trained a language model
  2. created a classifier for neg/pos sentiments,
  3. but when I call learn.predict(), my probability tensor has an extra probability: ('pos', tensor(2), tensor([1.2444e-12, 5.5311e-02, 9.4469e-01])).

Q: Shouldn’t the probabilities from predict only have probabilities for 2 classes (pos & neg)?

  • The original csv has just text/category.
  • learn.show_results(), as expected, shows the following headers: text, category, _category.

How could this ā€œmismatchā€ in number-of-categories vs num-of-probabilities be possible?

Thanks for any and all help.

but when I call learn.predict(), my probability tensor has an extra probability: ('pos', tensor(2), tensor([1.2444e-12, 5.5311e-02, 9.4469e-01])).

This does not make sense.

  • Could you post the target variable(label) vocab please?

It does look like there is an extra category in addition to pos & neg.

It didn't make sense at all, but I was able to resolve it by loading the csv directly through pandas. Before, I had manually created a csv using pandas by mixing texts and columns… but loading a plain old csv worked!

On a similar topic have you been able to do inference without having to go through the whole training process?

I have a working classifier now, but I'm finding that to create the classifier on a different machine I first have to create a language model. On a different (non-GPU) machine I'm unable to load the fine-tuned LM or a saved classifier.

Does this fix still apply? I'm running into the same language model training issues, but isn't that why seq_len was added?

I ran across this exact issue today, and splitting up my dataframe from

  1. A small number of rows with a huge amount of data in each row.

To

  1. A huge number of rows, with a small amount (comparatively) of data in each row.

Dramatically decreased training time, and indeed stopped my GPU from going idle.
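The split described above can be done before building the DataLoaders, e.g. by chunking each long document into fixed-size pieces. A rough sketch (the 200-word chunk size is an arbitrary choice):

```python
def chunk_text(text, max_words=200):
    """Split one long string into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]

# A few huge rows...
long_rows = ["word " * 500, "token " * 450]
# ...become many comparatively small rows.
short_rows = [piece for doc in long_rows for piece in chunk_text(doc.strip())]
print(len(short_rows))  # 6 pieces instead of 2 rows
```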

Did you train the language/classification model using fp16() (mixed precision training)? If so, you can only use it on a machine with a GPU.

It's very likely you had an additional class in there. Probably a typo. A random cell with "positivee" instead of "positive" will result in a third class the model trains on.

Since the first prediction probability is so low (e-12), that’s likely what happened.
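That failure mode is easy to reproduce: the class vocab is built from the unique label values, so a single typo cell adds a third class, and predict() then returns three probabilities. A small plain-Python sketch (not fastai's actual vocab code):

```python
# One mistyped cell among otherwise clean pos/neg labels.
labels = ["pos", "neg", "pos", "neg", "positivee", "pos"]

# Building a class vocab from the unique label values:
vocab = sorted(set(labels))
print(vocab)       # ['neg', 'pos', 'positivee']
print(len(vocab))  # 3 -> predict() returns 3 probabilities, not 2
```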

1 Like

Unfortunately, each row is a single post, so it contains minimal data per row. In v1, the language model trained for 1 hour 30 minutes per epoch with fp16 and an 8 GB card. I'm currently using a 24 GB card and fp16, and it's taking 2 hours 30 minutes per epoch. I'm definitely missing some configuration setting, I think.

Have you increased the batch size so you're using all of your 24 GB?

I have also seen much slower training times, slower DataLoader processing, and higher RAM (not GPU RAM) usage when working with language models in fastai2. In this thread I'm pretty sure sugger said something about all the objects being pickled now, compared to fastai1, which I took as an explanation for my observations.

Yikes :confused: excuse me for my ignorance, but how then can one train/deploy on a machine without GPU capabilities?

If you’re using it, simply remove the fp16(). It will still use your GPU to train, but it will do it in 32bit mode, which can also be used by machines without a GPU.

The fp16() mode is exclusive to Nvidia GPUs.

Yes, batch size is 256, which has the card pegged at 23 GB usage per epoch. I think I'm missing something obvious lol. I'll go back and run through my dataloaders again.

I don't think you're missing anything obvious. It looks like fastai2 is doing more than fastai1: creating the DataLoaders uses more RAM, takes longer to process, and takes up more space when saved. I think it's reasonable to assume that using more data will result in longer training times.