Thank you for the wonderful courses, Jeremy, Rachel, and Sylvain!
I am very interested in applying what I have learned from the fast.ai courses to medical imaging. In such applications, I often face inputs with an arbitrary number of channels: for example, 1-channel grayscale images, N-channel 3D datasets, M-channel multi-parametric maps, K-channel multi-contrast images, or L-channel real and imaginary parts of data acquired from multiple receivers (where N, M, K, L, … can be any integer: 1, 2, 3, 4, 5, … up to 128 or even 256).
I was wondering if the fast.ai library could support input data with an arbitrary number of channels?
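For what it's worth, a common workaround (independent of fastai itself) is to adapt a pretrained model's first convolution to N input channels by averaging its RGB filters. Here is a minimal NumPy sketch of that idea; the function name and the rescaling choice are my own, not a fastai API:

```python
import numpy as np

def adapt_first_conv_weights(w_rgb, n_channels):
    """Adapt pretrained 3-channel conv weights of shape (out, 3, kh, kw)
    to an arbitrary number of input channels by averaging the RGB
    filters and repeating the mean across the new channels."""
    mean = w_rgb.mean(axis=1, keepdims=True)        # (out, 1, kh, kw)
    w_new = np.repeat(mean, n_channels, axis=1)     # (out, n, kh, kw)
    # Rescale so the expected activation magnitude stays roughly the same.
    w_new *= 3.0 / n_channels
    return w_new
```

The rescaling keeps the sum over input channels comparable to the original 3-channel case, so the pretrained later layers still see activations of a familiar scale.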
To implement TTA for segmentation, you need to keep the transformation parameters and have a way of applying the inverse of each transformation. Some transformations are not invertible.
So that would entail adding a separate set of invertible-only transforms for the test dataloader instead of reusing the validation ones, modifying the test dataloader to capture the transform parameters, and then updating the tta function to apply the inversion to the predictions?
Here was another thread on the topic:
Also, not all of them would need to be inverted, only those that change pixel locations. Crops probably wouldn’t work.
Maybe there is an easier way, but that was what I thought was needed when I had this problem. If you want, we could work together on trying to solve this.
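For reference, here is a minimal NumPy sketch of the idea, restricted to flips, which are their own inverse (the `predict` callable and the helper name are placeholders, not fastai code):

```python
import numpy as np

def tta_segmentation(predict, image):
    """Average segmentation predictions over flip augmentations,
    undoing each flip on the prediction before averaging.
    `predict` maps an HxW image to an HxW mask of probabilities."""
    flips = [
        (lambda x: x, lambda y: y),   # identity
        (np.fliplr,   np.fliplr),     # horizontal flip is its own inverse
        (np.flipud,   np.flipud),     # vertical flip is its own inverse
    ]
    preds = [inv(predict(fwd(image))) for fwd, inv in flips]
    return np.mean(preds, axis=0)
```

For transforms that aren't their own inverse (rotations, warps), the `inv` entry would have to be built from the stored transform parameters, which is exactly the capture step discussed above.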
Add automatic validation set creation. I am currently working with a dataset where I have a train and a test set, so I will have to split out a validation set on my own. It would be cool if you could just load a training set and specify what percentage you wanted to be validation.
Edit: Looking into the source, it looks like some of this functionality is built in, or at least being built. The ImageClassificationDataset.from_folder method has a valid_pct argument that it uses to split the data. It isn’t extended to the image_data_from_folder function yet, but at least the structure is there!
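Something like this hypothetical helper is what I had in mind, sketched with the standard library (this is not the fastai implementation, just the shape of a valid_pct-style split):

```python
import random

def split_valid(items, valid_pct=0.2, seed=42):
    """Randomly split a sequence of items into (train, valid) lists,
    holding out `valid_pct` of them. Seeded so the split is
    reproducible across runs."""
    rng = random.Random(seed)
    items = list(items)
    rng.shuffle(items)
    n_valid = int(len(items) * valid_pct)
    return items[n_valid:], items[:n_valid]
```

Seeding matters here: if the split changes between runs, validation metrics aren't comparable from one training session to the next.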
Yup, that’s the rework.
You’ll find pieces there to invert an affine transform or a coord transform, and most of the transforms are implemented at the affine or coord level, so all in all, everything needed to do TTA for segmentation.
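If it helps, inverting an affine transform itself is just matrix algebra. A small NumPy sketch (the helper names are mine, not the library's): a 2x3 affine matrix [A|t] maps a point p to Ap + t, and its inverse is [A⁻¹ | −A⁻¹t].

```python
import numpy as np

def apply_affine(m, p):
    """Apply a 2x3 affine matrix [A|t] to a 2D point: A @ p + t."""
    return m[:, :2] @ p + m[:, 2]

def invert_affine(m):
    """Invert a 2x3 affine matrix so that
    apply_affine(invert_affine(m), apply_affine(m, p)) == p."""
    a, t = m[:, :2], m[:, 2]
    a_inv = np.linalg.inv(a)
    return np.hstack([a_inv, (-a_inv @ t)[:, None]])
```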
Requested feature: A progress indicator when tokenizing a TextDataset.
Why: Tokenizing can take a looong time when you have a large corpus, so it would be nice to have some kind of indication how many texts have been processed.
Background: I’m currently building a “base” Dutch language model from a Dutch Wikipedia dump, but after ~10 hours I have no idea if it’s still happily tokenizing or if something is wrong.
I tried to find out if I could add this feature myself, but I’m still getting familiar with the details of the v1 library.
It is already implemented. If you are creating your TextDataset from a csv, it automatically tokenizes in chunks of 10,000 and shows a progress bar.
If you create it from a dataframe, be sure to load it with chunksize=something to see the same progress bar.
Thank you for your reply. I was using text_data_from_folder, because the corpus consists of many separate text files. But I’ll make a csv of the wiki dump instead.
That is weird. Your method creates a csv file that is then normally opened automatically with a chunksize of 10,000 (and you can specify the chunksize you want in the kwargs).
Do you have a csv file in the tmp directory where you were working? Maybe it got stuck while creating it and not during the tokenization.
It was removed from the arguments of the from_df method, but you can pass a dataframe that has been loaded with a given chunksize, like:
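The original snippet is omitted here, but a minimal illustration of pandas' chunksize argument, which is what the dataframe would be loaded with before being passed along (the in-memory csv is just for demonstration):

```python
import io
import pandas as pd

# A small in-memory csv standing in for a real file on disk.
csv = "text\n" + "\n".join(f"row{i}" for i in range(25))

# With chunksize set, read_csv returns an iterator of DataFrame
# chunks instead of one big DataFrame, so work can be done (and
# progress reported) chunk by chunk.
reader = pd.read_csv(io.StringIO(csv), chunksize=10)
sizes = [len(chunk) for chunk in reader]  # 25 rows -> chunks of 10, 10, 5
```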
Oh, OK. What I’ve been doing is just pickling my train/test/valid dataframes and loading them back in. That’s why there is really no chunksize there. Thanks for clearing that up for me.