Fastai v2 text

Thanks for the reply! Is this an error in the library, then? It seems strange that the TextDataLoaders.from_csv command accepts a text_col argument if the header of the text column must always be "text".

1 Like

The default tokenizer sets the column to "text", but you can write a custom one that sets it differently. Or you can import a csv with text that has already been tokenized outside of fastai, and it might have a different column name.
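For illustration, a minimal sketch of overriding the defaults when loading a csv (file and column names here are hypothetical):

from fastai.text.all import *

# Load a csv whose text lives in a column called 'review' rather than the
# default 'text'; from_csv forwards these keyword arguments to from_df.
dls = TextDataLoaders.from_csv(
    path='.', csv_fname='reviews.csv',  # hypothetical file
    text_col='review',                  # non-default text column
    label_col='sentiment',              # non-default label column
    valid_pct=0.2)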

I don't think it's an error. Little things like this should become clearer once the latest courses (which use fastai2) are released and the fastai2 documentation is fleshed out.

1 Like

Hello!

Could someone help me with this?

What is the equivalent of label_from_df for fastai2?
I have a multicategory classification problem with data looking like this:

sample_submission.csv  test.csv  test_labels.csv  train.csv

id: 0000997932d777bf
comment_text: Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now.89.205.38.27
toxic: 0  severe_toxic: 0  obscene: 0  threat: 0  insult: 0  identity_hate: 0

I try to load the data with the following code:

dls = TextDataLoaders.from_df(df, data_drive, valid_pct=0.1, text_col="comment_text",
                              label_col=["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])

But the library fails to detect the labels correctly.

@hackerbear - Have a look at this: https://dev.fast.ai/tutorial.datablock#Text-classification. It's not the exact answer you are looking for, but it should point you in the right direction.

imdb_clas = DataBlock(
    blocks=(TextBlock.from_df('text', seq_len=72, vocab=dls.vocab), CategoryBlock),
    get_x=ColReader('text'),
    get_y=ColReader('label'),
    splitter=ColSplitter())
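For the multi-label case in the question above, a possible adaptation of that snippet (a sketch using the Jigsaw-style column names from the question; untested):

from fastai.text.all import *

label_cols = ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']

# MultiCategoryBlock(encoded=True) reads the one-hot 0/1 label columns directly.
toxic_clas = DataBlock(
    blocks=(TextBlock.from_df('comment_text', seq_len=72),
            MultiCategoryBlock(encoded=True, vocab=label_cols)),
    get_x=ColReader('text'),  # TextBlock.from_df writes the tokenized text to a 'text' column
    get_y=ColReader(label_cols),
    splitter=RandomSplitter(0.1))
dls = toxic_clas.dataloaders(df, bs=64)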

1 Like

Thank you for this excellent intermediate wiki tutorial.

This is related to extending the wiki text tutorial to use SentencePiece and customizing it (e.g. setting model_type to 'bpe').

I am following the exact steps described, and it works great with the mid-level API. However, I am facing an issue setting up the Transforms while customizing SentencePiece.

Phase 1
tfms = [attrgetter('text'), Tokenizer.from_df(text_cols=0), Numericalize()] => works fine, since this uses the default SpacyTokenizer

Phase 2
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer) => works fine as well; it uses SentencePieceTokenizer as the tokenizer function

Phase 3
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=partial(SentencePieceTokenizer, model_type='bpe')) => fails with AttributeError: 'NoneType' object has no attribute 'EncodeAsPieces'

This colab (commenting enabled) demonstrates the above issue with a starter example. My guess is that I am not using partial correctly, or I don't know how to customize SentencePiece.

Any help is appreciated.

1 Like

Answering my own question: to customize the Tokenizer transform, pass the keyword arguments directly to Tokenizer.from_df (which forwards them to the tokenizer) instead of wrapping tok_func in partial:

sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer, model_type='bpe', vocab_sz=1000)
tfms = [attrgetter('text'), sent_tfm, Numericalize()]
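For context, these tfms then plug into the mid-level API as in the tutorial (df and splits are assumed from there; a sketch):

# Build Datasets from the transform pipeline and get language-model
# DataLoaders, following the mid-level tutorial's pattern.
dsets = Datasets(df, [tfms], splits=splits, dl_type=LMDataLoader)
dls = dsets.dataloaders(bs=64, seq_len=72)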

Updated the colab as well.

2 Likes

I have a quick question on theory:

  1. I trained a language model,
  2. created a classifier for neg/pos sentiments,
  3. but when I call learn.predict(), my probability tensor has an extra probability: ('pos', tensor(2), tensor([1.2444e-12, 5.5311e-02, 9.4469e-01])).

Q: Shouldn't the tensor from predict only have probabilities for 2 classes (pos & neg)?

  • The original csv has just text/category.
  • learn.show_results(), as expected, shows the following headers: text, category, _category.

How could this "mismatch" in number-of-categories vs. number-of-probabilities be possible?

Thanks for any and all help.

but when I call learn.predict(), my probability tensor has an extra probability: ('pos', tensor(2), tensor([1.2444e-12, 5.5311e-02, 9.4469e-01])).

This does not make sense.

  • Could you post the target variable (label) vocab, please?

It does look like there is an extra category in addition to pos & neg.
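For anyone else debugging this, one way to inspect the label vocab (a sketch; the learner and column names are assumed):

# For a text classifier, dls.vocab is [text_vocab, label_vocab];
# the second element lists the classes the model actually trained on.
print(learn.dls.vocab[1])

# Also worth checking the raw labels for stray values (column name hypothetical):
print(df['category'].value_counts())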

It didn't make sense at all, but I was able to resolve it by loading the csv directly through pandas. Before, I had manually created a csv using pandas by mixing texts and columns... but loading a plain old csv worked!

On a similar topic, have you been able to do inference without having to go through the whole training process?

I've a working classifier now, but I'm finding that to create the classifier on a different machine I first have to create a language model. On a different machine (non-GPU) I'm unable to load the fine-tuned LM or a saved classifier.

Does this fix still apply? I'm running into the same language model training issues, but isn't that why seq_len was added?

I ran across this exact issue today, and splitting up my dataframe from

  1. a small number of rows with a huge amount of data in each row

to

  1. a huge number of rows with a (comparatively) small amount of data in each row

dramatically decreased learning time, and indeed stopped my GPU from going idle.
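In case it helps, a minimal sketch of that kind of split with pandas (file and column names here are hypothetical):

import pandas as pd

# Hypothetical file with one huge document per row.
df = pd.read_csv('train.csv')

# Split each document on blank lines so each row holds one shorter passage.
passages = df['text'].str.split('\n\n').explode()
df_small = passages.dropna().to_frame(name='text').reset_index(drop=True)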

Did you train the language/classification model using fp16() (mixed precision training)? If so, you can only use it on a machine with a GPU.

It's very likely you had an additional class in there. Probably a typo: a random cell with "positivee" instead of "positive" will result in a third class the model trains on.

Since the first prediction probability is so low (~1e-12), that's likely what happened.

1 Like

Unfortunately, each row is a single post, so it contains minimal data per row. In v1, the language model trained at 1 hour 30 minutes per epoch with fp16 and an 8GB card. I'm currently using a 24GB card and fp16, and it's taking 2 hours 30 minutes per epoch. I'm thinking I'm missing some configuration issue.

Have you increased the batch size so you're using all of your 24GB?

I have also seen much slower training times, slower DataLoader processing times, and higher RAM (not GPU RAM) usage when working with language models in fastai2. In this thread I'm pretty sure sgugger said something about all the objects being pickled now, compared to fastai1, which I took as an explanation for my observations.

Yikes :confused: excuse my ignorance, but how then can one train/deploy on a machine without GPU capabilities?

If you're using it, simply remove the fp16(). It will still use your GPU to train, but it will do so in 32-bit mode, and the resulting model can also be used on machines without a GPU.

The fp16() mode is specific to NVIDIA GPUs.
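As an illustration, one hedged way to bring an fp16-trained learner back to full precision and export it for CPU inference (a sketch; 'export.pkl' is just an example name):

# Convert the learner back to full precision before exporting.
learn = learn.to_fp32()
learn.export('export.pkl')

# On the non-GPU machine:
from fastai.text.all import load_learner
learn_inf = load_learner('export.pkl', cpu=True)
learn_inf.predict("Some text to classify")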

Yes, the batch size is 256, which has the card pegged at 23GB usage per epoch. I think I'm missing something obvious, lol. I'll go back and run through my dataloaders again.

I don't think you're missing anything obvious. It looks like fastai2 is doing more than fastai1: creating the DataLoaders uses more RAM, takes longer to process, and takes up more space when saved. I think it's reasonable to assume that using more data will result in longer training times.