Beginner: Creating a dataset, and using Gradio / Spaces ✅

Have you followed the hints/instructions it gives there to install git-lfs?

The other option is to upload the binary files through the web interface on your app’s Hugging Face Spaces page.
Go to the ‘Files and versions’ tab, click the ‘Add file’ button, and then select the ‘Upload files’ option.

Thanks AllenK,

Yes, I did install git-lfs and attempted to activate it, after which I received this confirmation:

Updated Git hooks.
Git LFS initialized.

But I still get the same error when pushing my commits.

However, your suggestion about uploading directly through the web interface worked.

Hi,
(Creating a dataset)

I am new to the forums and also to fastai. I was going through the text classifier tutorial and found it very interesting, so I would like to build a classifier on my own data. Can someone please point me to the documentation or tutorial that explains the dataset format required to train the text classifier, so that I can convert my own data into that format?

It would be very helpful. I am looking forward to learning a lot from this community.

Regards,
Chef

The dataset format is described in the text of the fastai - Text transfer learning tutorial:

The data follows an ImageNet-style organization: in the train folder, we have two subfolders, pos and neg (for positive reviews and negative reviews).

You can create a new dataset by following the same format. To get a better look at the dataset's structure, look in the directory where the dataset is downloaded:

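from fastai.text.all import *
path = untar_data(URLs.IMDB)  # downloads and extracts IMDB, returns the local path
path.ls()                     # lists the dataset's top-level files and folders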

At its simplest, the structure for this dataset is a dataset-name folder containing a train and a test directory. Each of those contains a ‘neg’ and a ‘pos’ directory (the categories), and inside those go plain text files, each holding one example text (e.g. ‘Great movie. I was laughing all time through’) and saved under a unique filename. It looks like this:

datasetname_folder/
  train/
     neg/
        00001.txt
        00002.txt
     pos/
        10001.txt
        10002.txt
  
  test/
     neg/
        20001.txt
        20002.txt
     pos/
        30001.txt
        30002.txt
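
If your own data starts out as a list of (text, label) pairs, the layout above can be produced with a few lines of plain Python. This is only a minimal sketch: the function name `write_text_dataset`, the five-digit numbering, and the example texts are my own illustrative choices, not something fastai requires.

```python
import tempfile
from pathlib import Path


def write_text_dataset(root, split, examples):
    """Write (text, label) pairs as root/split/label/NNNNN.txt files.

    `split` is the subfolder name (e.g. "train" or "test"); each label
    becomes a category folder such as "pos" or "neg".
    """
    root = Path(root)
    written = []
    for i, (text, label) in enumerate(examples, start=1):
        label_dir = root / split / label
        label_dir.mkdir(parents=True, exist_ok=True)
        # Zero-padded counter gives every example a unique filename.
        file_path = label_dir / f"{i:05d}.txt"
        file_path.write_text(text, encoding="utf-8")
        written.append(file_path)
    return written


if __name__ == "__main__":
    # Hypothetical examples, written to a throwaway directory.
    demo_root = Path(tempfile.mkdtemp()) / "datasetname_folder"
    samples = [("Great movie. I was laughing all time through", "pos"),
               ("Dull and far too long.", "neg")]
    for path in write_text_dataset(demo_root, "train", samples):
        print(path.relative_to(demo_root))
```

A folder built this way should be loadable the same way the tutorial loads IMDB, i.e. with `TextDataLoaders.from_folder(path, valid='test')`, as long as you also write a test split.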

Read more about the various datasets in fastai here: fastai - External data.
That page also links to the original source project for the IMDB dataset, where you can read the related papers.

Hi, I am currently following lesson 2 (Part 1 2022) and trying to deploy the app on Hugging Face Spaces (the “testing” example). When I use git push I get the following error:

fatal: the remote end hung up unexpectedly

@Kamui I found in my case the remote end hung up error was followed by:
git-lfs filter-process: git-lfs: command not found
I installed git-lfs using homebrew (brew install git-lfs) and that solved the issue for me.
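
If you are not sure whether the git-lfs executable is actually visible to git, a quick check from Python is to look it up on your PATH. This is just a sketch; the helper name `tool_on_path` is mine, and it only tells you whether the command can be found, not whether LFS has been initialized for your repo.

```python
import shutil


def tool_on_path(name):
    """Return True if an executable called `name` can be found on PATH."""
    return shutil.which(name) is not None


if __name__ == "__main__":
    if tool_on_path("git-lfs"):
        print("git-lfs found on PATH")
    else:
        # Matches the 'git-lfs: command not found' symptom described above.
        print("git-lfs not found; install it first (e.g. brew install git-lfs)")
```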

I just tried that and I still get the error:

Enumerating objects: 11, done.
Counting objects: 100% (11/11), done.
Delta compression using up to 8 threads
Compressing objects: 100% (8/8), done.
error: RPC failed; HTTP 408 curl 22 The requested URL returned error: 408
fatal: the remote end hung up unexpectedly
Writing objects: 100% (10/10), 41.72 MiB | 135.00 KiB/s, done.
Total 10 (delta 0), reused 0 (delta 0)
fatal: the remote end hung up unexpectedly
Everything up-to-date

Here are a couple of similar cases that others have encountered:

https://confluence.atlassian.com/stashkb/git-push-fails-fatal-the-remote-end-hung-up-unexpectedly-282988530.html
Maybe one of those answers would help?
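
One more thing that might be worth ruling out: the log shows about 41.72 MiB being written, which could mean a large binary (for example the exported model) is going through plain git rather than git-lfs. The sketch below lists files in a repo folder above a size limit so you can see what would need LFS tracking. The function name is mine, and the 10 MiB default is my assumption about the size at which Hugging Face expects LFS, so please check their docs for the actual limit.

```python
from pathlib import Path


def find_large_files(repo_dir, limit_bytes=10 * 1024 * 1024):
    """Return (relative_path, size_in_bytes) for files larger than limit_bytes.

    The default limit is an assumption, not a value taken from this thread.
    Anything inside the .git folder is skipped.
    """
    repo_dir = Path(repo_dir)
    large = []
    for p in repo_dir.rglob("*"):
        rel = p.relative_to(repo_dir)
        if ".git" in rel.parts:
            continue
        if p.is_file() and p.stat().st_size > limit_bytes:
            large.append((rel, p.stat().st_size))
    # Biggest files first.
    return sorted(large, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for rel_path, size in find_large_files("."):
        print(f"{size / (1024 * 1024):8.2f} MiB  {rel_path}")
```

Run it from the root of your Space's repo; any file it prints is a candidate for git lfs track before you commit and push again.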

Thank you so much for your help. I used a workaround (I uploaded the file via the website). I will try that for the next project.