Thanks for the reply! Is this an error in the library, then? It seems strange that the TextDataLoaders.from_csv command accepts a text_col argument if the header of the text column should always be text.
The default tokenizer sets the column to 'text', but you can write a custom one that sets it differently. Or you can import a csv with text that has already been tokenized outside of fastai, and it might have a different column name.
I don't think it's an error. Little things like this should become clearer once the latest courses, which use fastai2, are released and the fastai2 documentation is fleshed out.
Hello!
Could someone help me with this?
What is the equivalent of label_from_df for fastai2?
I have a multicategory classification problem with data looking like this:
sample_submission.csv test.csv test_labels.csv train.csv
| id | comment_text | toxic | severe_toxic | obscene | threat | insult | identity_hate |
|---|---|---|---|---|---|---|---|
| 0000997932d777bf | Explanation\nWhy the edits made under my username Hardcore Metallica Fan were reverted? They weren't vandalisms, just closure on some GAs after I voted at New York Dolls FAC. And please don't remove the template from the talk page since I'm retired now. 89.205.38.27 | 0 | 0 | 0 | 0 | 0 | 0 |
I try to load the data with the following code:
dls = TextDataLoaders.from_df(df, data_drive, valid_pct=0.1, text_col="comment_text", label_col=["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"])
But the library fails to detect the labels correctly.
@hackerbear - Have a look at this: https://dev.fast.ai/tutorial.datablock#Text-classification. It's not the exact answer you are looking for.
imdb_clas = DataBlock(blocks=(TextBlock.from_df('text', seq_len=72, vocab=dls.vocab), CategoryBlock),
                      get_x=ColReader('text'),
                      get_y=ColReader('label'),
                      splitter=ColSplitter())
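For a multi-label problem like the toxic-comments data above, one possible adaptation is to swap CategoryBlock for MultiCategoryBlock and read the one-hot label columns directly. This is an untested sketch based on the fastai2 DataBlock API; `df` and the column names are taken from the data shown, but the exact MultiCategoryBlock arguments should be checked against the docs:

```python
from fastai.text.all import *

# One-hot label columns from train.csv
label_cols = ["toxic", "severe_toxic", "obscene", "threat", "insult", "identity_hate"]

toxic_clas = DataBlock(
    # encoded=True tells MultiCategoryBlock the labels are already one-hot columns
    blocks=(TextBlock.from_df("comment_text", seq_len=72),
            MultiCategoryBlock(encoded=True, vocab=label_cols)),
    get_x=ColReader("text"),      # TextBlock.from_df puts tokenized text in a "text" column
    get_y=ColReader(label_cols),  # read all six 0/1 columns per row
    splitter=RandomSplitter(valid_pct=0.1))

dls = toxic_clas.dataloaders(df, bs=64)
```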
Thank you for this excellent intermediate wiki tutorial.
This is related to extending the wikitext tutorial to use SentencePiece and customizing it (e.g. model_type as bpe).
I am following the exact steps described, and it works great with the mid-level API. I'm facing an issue setting up the Transforms while customizing SentencePiece.
Phase 1
tfms = [attrgetter('text'), Tokenizer.from_df(text_cols=0), Numericalize()]
=> Works fine since this uses default SpacyTokenizer
Phase 2
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer)
=> Works fine as well. It uses tokenizer function as SentencePieceTokenizer
Phase 3
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=partial(SentencePieceTokenizer, model_type='bpe'))
=> Facing AttributeError: 'NoneType' object has no attribute 'EncodeAsPieces'
This colab (commenting enabled) demonstrates the above issue with a starter example. My line of thinking is that I am not using partial correctly, or that I don't know how to customize SentencePiece.
Any help is appreciated.
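For what it's worth, functools.partial itself does what you'd expect here: it pre-binds keyword arguments before the class is instantiated somewhere else. A minimal stdlib sketch (DummyTokenizer is an illustrative stand-in, not a fastai class):

```python
from functools import partial

# Stand-in for a tokenizer class that is constructed later by other code
class DummyTokenizer:
    def __init__(self, model_type="unigram", vocab_sz=None):
        self.model_type = model_type
        self.vocab_sz = vocab_sz

# partial returns a callable that supplies model_type='bpe' on every call
make_bpe = partial(DummyTokenizer, model_type="bpe")

tok = make_bpe(vocab_sz=1000)
print(tok.model_type, tok.vocab_sz)  # bpe 1000
```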
Answering my own question. In order to customize the Tokenizer transform, pass the keyword arguments directly to Tokenizer.from_df:
sent_tfm = Tokenizer.from_df(text_cols=0, tok_func=SentencePieceTokenizer, model_type='bpe', vocab_sz=1000)
tfms = [attrgetter('text'), sent_tfm, Numericalize()]
Updated the colab as well.
I have a quick question on theory:
- I trained a language model
- created a classifier for neg/pos sentiments
- but when I call learn.predict(), my probability tensor has an extra probability: ('pos', tensor(2), tensor([1.2444e-12, 5.5311e-02, 9.4469e-01])).

Q: Shouldn't the probabilities from predict only have probabilities for 2 classes (pos & neg)?
- The original csv has just text/category columns.
- learn.show_results(), as expected, shows the following headers: text, category, _category.

How could this 'mismatch' in number-of-categories vs. number-of-probabilities be possible?
Thanks for any and all help.
> but when I call learn.predict(), my probability tensor has an extra probability: ('pos', tensor(2), tensor([1.2444e-12, 5.5311e-02, 9.4469e-01])).

This does not make sense.
- Could you post the target variable (label) vocab, please? It does look like there is an extra category in addition to pos & neg.
It didn't make sense at all, but I was able to resolve it by loading the csv directly through pandas. Before, I had manually created a csv using pandas by mixing texts and columns… but loading a plain old csv worked!
On a similar topic, have you been able to do inference without going through the whole training process?
I've a working classifier now, but I'm finding that to create the classifier on a different machine I have to first create a language model. On a different (non-GPU) machine I'm unable to load the fine-tuned LM or to load a saved classifier.
Does this fix still apply? I'm running into the same language model training issues, but isn't that why seq_len was added?
I ran across this exact issue today, and splitting up my dataframe from
- a small number of rows with a huge amount of data in each row
to
- a huge number of rows with a small amount (comparatively) of data in each row
dramatically decreased training time, and indeed stops my GPU from going idle.
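That reshaping can be sketched in pure Python as splitting each long text into fixed-size chunks, so rows multiply while row length shrinks. The chunk size here is arbitrary; in practice it would depend on your seq_len and tokenization:

```python
def explode_long_texts(texts, chunk_size=500):
    """Split each long string into pieces of at most chunk_size characters,
    turning a few long rows into many short ones."""
    rows = []
    for t in texts:
        for i in range(0, len(t), chunk_size):
            rows.append(t[i:i + chunk_size])
    return rows

docs = ["a" * 1200, "b" * 300]
chunks = explode_long_texts(docs, chunk_size=500)
print(len(chunks))  # 4 rows: 500 + 500 + 200 chars from doc 1, 300 from doc 2
```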
Did you train the language/classification model using fp16() (mixed precision training)? If so, you can only use it on a machine with a GPU.
It's very likely you had an additional class in there. Probably a typo. A random cell with 'positivee' instead of 'positive' will result in a third class the model trains on.
Since the first prediction probability is so low (e-12), that's likely what happened.
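A quick stdlib check can surface that kind of stray label before training (the label list here is illustrative):

```python
from collections import Counter

# Hypothetical label column containing one typo'd cell
labels = ["pos", "neg", "pos", "positivee", "neg"]

counts = Counter(labels)
print(counts)  # Counter({'pos': 2, 'neg': 2, 'positivee': 1})

# Any class with a tiny count relative to the rest is suspect
rare = [lbl for lbl, n in counts.items() if n < 2]
print(rare)  # ['positivee']
```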
Unfortunately, each row is a single post, so it contains minimal data per row. In V1, the language model trained at 1 hour 30 min per epoch with fp16 and an 8GB card. I'm currently using a 24GB card and fp16, and it's taking 2 hours 30 minutes per epoch. Definitely missing some configuration issue, I'm thinking.
Have you increased the batch size so you're using all of your 24GB?
I also have seen much slower training times, dataloader processing times, and higher RAM (not GPU RAM) usage when working with language models in fastai2. In this thread I'm pretty sure sgugger has said something about all the objects being pickled now, compared to fastai1, which I took as an explanation for my observations.
Yikes, excuse me for my ignorance, but how then can one train/deploy on a machine without GPU capabilities?
If you're using it, simply remove the fp16() call. The model will still train on your GPU, but in 32-bit mode, and a 32-bit model can also be used by machines without a GPU.
The fp16() mode requires an Nvidia GPU specifically.
Yes, batch size is 256, which has the card pegged at 23GB usage per epoch. I think I'm missing something obvious, lol. I'll go back and run through my dataloaders again.
I don't think you're missing anything obvious. It looks like fastai2 is doing more than fastai1. Creating the DataLoader uses more RAM, takes longer to process, and takes up more space when saved. I think it's reasonable to assume that using more data will result in longer training times.