Lesson 4 In-Class Discussion ✅

I believe it’s a mean of 0 with a standard deviation of 1.
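For reference, a minimal NumPy sketch of that standardization (illustrative values only):

```python
import numpy as np

# Standardize: subtract the mean and divide by the standard deviation,
# so the result has mean 0 and standard deviation 1.
x = np.array([2.0, 4.0, 6.0, 8.0])
z = (x - x.mean()) / x.std()
print(abs(round(z.mean(), 6)), round(z.std(), 6))  # 0.0 1.0
```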

I wonder how transfer learning works for tabular data.


Ensemble, huh? I wonder if I could concatenate the data into one longer text document and let the model learn from that, e.g. “Genre: Action, Actor: Taylor Swift, Jane Dawn, Year: 1980. Here is the review text.” That way I’m hoping not to need multiple models.


For NLP, is there no need to perform the normalization step because all the data is plain text?

Is the data created by fastai.tabular a pandas DataFrame, which we can then feed into other models like XGBoost and random forests?

Is there any possibility of bringing back Datasets and the ImageDataBunch.create method to coexist with the new data block API? I published this code https://github.com/wdhorton/protein-atlas-fastai just last week and it depended on ImageMultiDataset; it’s going to be somewhat tricky to migrate.

More generally, I think there are use cases where you want to make custom Datasets and it seems like our ability to use data that’s organized in a non-standard way is more limited now.


Yes. Normalization is for continuous variables.
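A quick pandas sketch of that idea (column names invented for illustration): standardize only the continuous columns, and leave the categorical ones for embeddings or one-hot encoding.

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, 40, 55, 30],                   # continuous
    "income": [30_000, 80_000, 120_000, 50_000],  # continuous
    "city":   ["NY", "SF", "NY", "LA"],           # categorical, left alone
})

cont_cols = ["age", "income"]
# Standardize only the continuous columns.
df[cont_cols] = (df[cont_cols] - df[cont_cols].mean()) / df[cont_cols].std()
print(df[cont_cols].mean().round(6).to_dict())
```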


Does fastai.tabular work with csv data that is too big to fit in memory as a dataframe?


@rachel bringing this to your attention. Any help on these questions?

That’s not for this chat; happy to explain in another topic why it is far more flexible now than before.
Sorry to hear we broke your code though :frowning:


Lost the video frame with Jeremy.

Never mind.

I had a problem with high-cardinality columns and wrote a basic article on it: https://medium.com/@Nithanaroy/encoding-fixed-length-high-cardinality-non-numeric-columns-for-a-ml-algorithm-b1c910cb4e6d?source=linkShare-dd5a0af7ea9a-1542166579
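One common technique for high-cardinality columns (not necessarily the one in the article) is the hashing trick: map each category to one of a fixed number of buckets, trading occasional collisions for a bounded feature space. A sketch:

```python
import hashlib

def hash_bucket(value: str, n_buckets: int = 1024) -> int:
    """Deterministically map a category string to one of n_buckets buckets."""
    digest = hashlib.md5(value.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_buckets

# The same value always lands in the same bucket; distinct values may collide.
buckets = [hash_bucket(v) for v in ["user_42", "user_43", "user_42"]]
print(buckets)
```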


Thanks for the response! I can see how the data block API is going to work well moving forward; maybe I’m just regretting how much I dug into ImageMultiDataset trying to get this to work.

There have been a number of questions about large datasets and memory. Is it possible to load the pandas DataFrame as an iterator (reading with a chunksize) and have the dataloader/model treat it as such?
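I don’t know whether fastai supports this out of the box, but plain pandas can stream a CSV in chunks; here’s a sketch with an in-memory buffer standing in for a big file:

```python
import io
import pandas as pd

# StringIO stands in for a real file path; with a big CSV you'd pass the path.
csv = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

total_rows = 0
# chunksize makes read_csv return an iterator of DataFrames.
for chunk in pd.read_csv(csv, chunksize=4):
    total_rows += len(chunk)  # process each piece: aggregate, partial_fit, ...
print(total_rows)  # 10
```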


How do we decide the number of layers and the number in those layers for structured deep learning?


@sgugger Does normalize also handle the issues traditionally caused by skewed distributions?
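For what it’s worth, standardizing alone doesn’t remove skew; a common complementary step is a log transform before normalizing. A generic sketch, not a claim about fastai’s Normalize:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 1000.0])  # heavily right-skewed
x_log = np.log1p(x)                    # compress the long right tail
z = (x_log - x_log.mean()) / x_log.std()
print(abs(round(z.mean(), 6)))  # 0.0
```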

untar_data() automatically appends .tgz to the URL before downloading.
It actually fetches: http://files.fast.ai/data/examples/adult_sample.tgz


@rachel Thank you so much for keeping track of the discussion! Is it possible to mention the post number of each question, so we can jump to it and read? It really helps people who are listening in noisy environments.


Thanks, I was thinking more in terms of the format of the text, like font, style, etc.