I believe it’s a mean of 0 with a standard deviation of 1
I wonder how TL works for tabular data.
Ensemble, huh?.. I wonder if I can concat the data together into a longer text document and have the model learn from that, like: “Genre: Action, Actor: Taylor Swift, Jane Dawn, Year: 1980, Here is the review text”. By doing this, I hope I don’t need multiple models.
For NLP there is no need to perform the normalization step because all the data is in plain text format?
Is the data created in fastai.tabular a pandas DataFrame, which we can then feed into other models like xgboost and rf?
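A minimal sketch of the idea in that question, with a hypothetical DataFrame and scikit-learn’s random forest standing in for “other models” (this is plain pandas/sklearn, not fastai’s own API — tree models need the categorical column encoded as numbers first):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical tabular data with one categorical and one continuous column
df = pd.DataFrame({
    "genre": ["action", "drama", "action", "comedy"],
    "year":  [1980, 1995, 2001, 1988],
    "hit":   [1, 0, 1, 0],
})

# Encode the categorical column as integer codes, since sklearn trees
# expect purely numeric input
df["genre"] = df["genre"].astype("category").cat.codes

# The same DataFrame can now be fed straight into a random forest
X, y = df[["genre", "year"]], df["hit"]
rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
preds = rf.predict(X)
```

The same `X`/`y` split would work for xgboost as well, which accepts pandas DataFrames directly.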
Is there any possibility of bringing back Datasets and the ImageDataBunch.create method to coexist with the new datablocks API? I published this code https://github.com/wdhorton/protein-atlas-fastai just last week and it depended on ImageMultiDataset, going to be somewhat tricky to migrate it.
More generally, I think there are use cases where you want to make custom Datasets and it seems like our ability to use data that’s organized in a non-standard way is more limited now.
Yes. Normalization is for continuous variables.
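A quick sketch of what that means in practice, using hypothetical data and plain pandas rather than fastai’s built-in transform: the continuous column is standardized to mean 0 and standard deviation 1, while the categorical column is left alone.

```python
import pandas as pd

# Hypothetical frame with one continuous and one categorical column
df = pd.DataFrame({"age": [22, 35, 58, 41], "job": ["a", "b", "a", "c"]})

# Standardize only the continuous column: subtract the mean, divide by
# the standard deviation, giving mean 0 and std 1
mean, std = df["age"].mean(), df["age"].std()
df["age_norm"] = (df["age"] - mean) / std
```

After this, `df["age_norm"].mean()` is (numerically) 0 and `df["age_norm"].std()` is 1; the categorical `job` column is untouched.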
Does fastai.tabular work with csv data that is too big to fit in memory as a dataframe?
That’s not for this chat. Happy to explain why it is way more flexible now than before in another topic.
Sorry to hear we broke your code though
lost frame with Jeremy.
never mind
I had a problem with high cardinality columns and wrote a basic article on it https://medium.com/@Nithanaroy/encoding-fixed-length-high-cardinality-non-numeric-columns-for-a-ml-algorithm-b1c910cb4e6d?source=linkShare-dd5a0af7ea9a-1542166579
Thanks for the response! I can see how datablocks is going to work well moving forward, maybe I’m just regretting how much I dug into ImageMultiDataset trying to get this to work.
There have been a number of questions about large datasets and memory. Is it possible to load the pandas dataframe as an iterator (reading with a chunksize) and have the dataloader/model treat it as such?
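For reference, the pandas side of what the question describes looks like this — `read_csv` with a `chunksize` returns an iterator of DataFrames instead of loading everything at once. (Whether fastai’s dataloader can consume such an iterator directly is the open part of the question; the file below is simulated in-memory just to make the sketch self-contained.)

```python
import io
import pandas as pd

# Simulated CSV; in practice this would be a path to a file too big for RAM
csv = io.StringIO("a,b\n" + "\n".join(f"{i},{i * 2}" for i in range(10)))

# chunksize makes read_csv yield DataFrames of at most 4 rows each,
# so only one chunk is in memory at a time
total = 0
for chunk in pd.read_csv(csv, chunksize=4):
    total += len(chunk)  # process each chunk, e.g. feed batches to a model
```

Here `total` ends up as 10, the full row count, even though no single chunk held more than 4 rows.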
How do we decide the number of layers, and the number of units in each layer, for structured deep learning?
untar_data() automatically adds .tgz to the url for downloading.
It actually fetches: http://files.fast.ai/data/examples/adult_sample.tgz
@rachel Thank you so much for keeping track of the discussion! Is it possible to mention the post number of the question, so we can jump to it and read? It really helps people who are listening in a noisy environment.
Thanks, I was thinking more in terms of the format of the text, like font, style, etc.