(Not a course related question) How to load big data and train in deep learning?

RajeshMappu · March 26, 2018, 5:47am

I am trying to load a 5GB French dataset for language translation, and I'll summarize my concerns below. I used file splitting, but the 5GB file ends up as 8000 files, which is cumbersome to train on one file at a time.

How can I do chunk-based language translation with a big data file?
What can I do about the vocabulary size, which keeps growing as each file extends the vocabulary?
Does training on each next file, starting from the saved state of the previous training run, help the model learn over all files?

@radek, tagging you for any guidance. - Rajesh from Twitter.

Any brief guidance on direction?

s.s.o · March 26, 2018, 12:07pm

If you use pandas, you can try something like:

import pandas as pd

chunksize = 10 ** 6  # rows per chunk
for chunk in pd.read_csv(filename, chunksize=chunksize):
    do_something(chunk)  # each chunk is a DataFrame

or you can merge them later.
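
If the data is plain text rather than CSV, the same chunking idea can be sketched with the standard library alone. The example below is only an illustration: `iter_chunks` and `count_words` are hypothetical helper names, and the tiny in-memory string stands in for the real 5GB file. It makes one pass over the data to collect word counts chunk by chunk, so the vocabulary can be fixed once up front instead of growing with every file.

```python
from collections import Counter
from itertools import islice
import io


def iter_chunks(lines, chunk_size):
    """Yield lists of up to chunk_size items from any iterable of lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def count_words(lines, chunk_size=100_000):
    """Accumulate whitespace-token counts, holding one chunk in memory at a time."""
    counts = Counter()
    for chunk in iter_chunks(lines, chunk_size):
        for line in chunk:
            counts.update(line.split())
    return counts


# Stand-in for a large file; in practice pass open(path, encoding="utf-8").
fake_file = io.StringIO("le chat dort\nle chien court\nle chat mange\n")
word_counts = count_words(fake_file, chunk_size=2)
```

Because a file object is already a lazy iterator over lines, nothing here requires splitting the data into thousands of files first.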

radek · March 26, 2018, 12:47pm

Hey Rajesh,

This all depends on what you are trying to achieve. The approach will differ depending on whether you are trying to build a language model, do sentiment analysis, etc.

I have not done anything with language translation, so it is hard for me to provide specifics. I think there might be something on translation in part 2 v2 of the course and, IIRC, there might have been a lecture on it in part 2 v1, somewhere towards the beginning of the second half.

If vocabulary size is a problem, maybe you can limit it by considering all words with frequency below some n (say 10 or 50) to be <unk>?
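
A minimal sketch of that frequency cutoff, assuming whitespace tokenization and word counts already collected in a `Counter`. The names `build_vocab` and `encode` and the `max_size` parameter are made up for illustration and are not taken from any library.

```python
from collections import Counter

UNK = "<unk>"


def build_vocab(counts, min_freq=2, max_size=None):
    """Map words seen at least min_freq times to integer ids; id 0 is <unk>."""
    kept = [word for word, count in counts.most_common() if count >= min_freq]
    if max_size is not None:
        kept = kept[:max_size]  # keep only the most frequent words
    vocab = {UNK: 0}
    for word in kept:
        vocab[word] = len(vocab)
    return vocab


def encode(sentence, vocab):
    """Replace every out-of-vocabulary word with the <unk> id."""
    return [vocab.get(word, vocab[UNK]) for word in sentence.split()]


counts = Counter("le chat dort le chien court le chat mange".split())
vocab = build_vocab(counts, min_freq=2)
ids = encode("le chat court", vocab)
```

With a fixed cutoff like this, the vocabulary stops growing once the counts are collected, and every rare or unseen word shares a single id.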

I am not sure what the correct measures to take here would be, as I have never attempted to tackle translation and I think the solutions would be quite task specific. I would probably start by searching for a walkthrough or Kaggle kernel that shows a simple translation example you could start building on (mainly getting the data in and out using the technology of your choice).

If PyTorch is what you are planning on using, torchtext seems to have a TranslationDataset here: text/torchtext/datasets/translation.py. It might be a good starting point. Other than that, there is also a translation example here: https://github.com/pytorch/text/blob/master/test/translation.py that seems to demonstrate creating data iterators with the BucketIterator (very handy!) and creating a vocabulary with a minimum word frequency and a capped maximum size specifically for translation. This article is quite a nice overview of some of the torchtext functionality.
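
To show why bucketing is handy, here is a pure-Python sketch of the general idea: batch sentence pairs of similar length together so little padding is wasted. This is not torchtext's implementation, and `bucket_batches`, `pad`, and the `<pad>` token are hypothetical names chosen for the example.

```python
import random


def bucket_batches(pairs, batch_size, seed=0):
    """Group (source, target) token lists into batches of similar source length."""
    ordered = sorted(pairs, key=lambda pair: len(pair[0]))
    batches = [ordered[i:i + batch_size] for i in range(0, len(ordered), batch_size)]
    random.Random(seed).shuffle(batches)  # shuffle batch order, not batch contents
    return batches


def pad(seqs, pad_token="<pad>"):
    """Right-pad every sequence to the length of the longest one in the batch."""
    width = max(len(seq) for seq in seqs)
    return [seq + [pad_token] * (width - len(seq)) for seq in seqs]


pairs = [
    ("a b c d e".split(), "v w x y z".split()),
    ("a".split(), "v".split()),
    ("a b c d".split(), "v w x y".split()),
    ("a b".split(), "v w".split()),
]
batches = bucket_batches(pairs, batch_size=2)
padded_sources = [pad([src for src, _ in batch]) for batch in batches]
```

Sorting by length before batching means the short sentences are padded only up to the length of their neighbours, not up to the longest sentence in the whole dataset.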