What are some strategies for seq2seq translation with 22M sentence pairs instead of 50k?

Could anyone give some hints as to what architecture one would use to train on all the available sentence pairs from lesson 11 instead of just 50k? The 50k version already uses about 3 GB of GPU RAM (+ 1 GB for attention), so I am not sure if there is an easy way to do it on one GPU. With multiple GPUs I could imagine sharding the sentences, like A-E on GPU 1, F-J on GPU 2, and so on, but that might not be as good as fitting everything into one model. Maybe one could use word vectors that are smaller than 300 elements?


You should first check whether you actually need that many. If you find that you do, it won’t impact your GPU memory, as long as you limit the vocab size. It should just take longer to train an epoch, AFAICT. You’ll need more layers to actually take advantage of the full corpus.
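As a rough sketch of what "limit the vocab size" could look like: the embedding and output layers scale with the number of distinct tokens, not with the number of sentence pairs, so capping the vocabulary keeps those layers the same size however large the corpus gets. The function names, the `max_vocab` default, and the `_unk_`/`_pad_` special tokens below are illustrative assumptions, not the lesson's exact code.

```python
from collections import Counter


def build_vocab(token_lists, max_vocab=40000, specials=("_unk_", "_pad_")):
    """Keep only the max_vocab most frequent tokens (hypothetical helper).

    token_lists can be any iterable of token lists, including a generator,
    so the full corpus never has to sit in memory at once.
    """
    counts = Counter(tok for toks in token_lists for tok in toks)
    itos = list(specials) + [tok for tok, _ in counts.most_common(max_vocab)]
    stoi = {tok: i for i, tok in enumerate(itos)}
    return itos, stoi


def numericalize(tokens, stoi, unk=0):
    """Map tokens to ids; anything outside the capped vocab becomes unk."""
    return [stoi.get(tok, unk) for tok in tokens]
```

With this, going from 50k to 22M pairs only changes the counts inside the `Counter`; the number of rows in the embedding matrix stays at `max_vocab` plus the special tokens.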


I understand that GPU memory wouldn’t increase as long as vocab size and sentence length are constant.

@jeremy I have a problem with preprocessing. It looks like the lecture example loads everything from the text file into memory and then runs the spaCy tokenizer in parallel. I keep running out of memory in the first step for more than 100k sentences.

So I created my own dataset and overrode the __getitem__ function to handle one line at a time. But everything failed when running with num_workers > 1 in the DataLoader. I think I wasn’t able to get spacy.tokenizer running in parallel.
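One pattern that may help here, sketched under assumptions rather than tested against the lesson code: keep only byte offsets in memory, and create the file handle and tokenizer lazily inside each worker process instead of in the constructor, so nothing unpicklable has to be shared between workers. `LazyLineDataset` and `whitespace_tokenizer_factory` are hypothetical names; the whitespace tokenizer is a stand-in, and a factory that builds a spaCy tokenizer (for example from `spacy.blank("en")`) could be swapped in. It reads a single file, so sentence pairs would need one instance per language or a second set of offsets.

```python
import os


def whitespace_tokenizer_factory():
    """Stand-in tokenizer factory; replace with one that builds a spaCy tokenizer."""
    return str.split


class LazyLineDataset:
    """Map-style dataset that reads and tokenizes one line per __getitem__."""

    def __init__(self, path, make_tokenizer):
        self.path = path
        self.make_tokenizer = make_tokenizer
        # One pass over the file to record where each line starts (in bytes).
        self.offsets = []
        pos = 0
        with open(path, "rb") as f:
            for line in f:
                self.offsets.append(pos)
                pos += len(line)
        self._file = None
        self._tokenizer = None
        self._pid = None

    def __len__(self):
        return len(self.offsets)

    def __getstate__(self):
        # Drop per-process resources so the object can be pickled for workers.
        state = self.__dict__.copy()
        state.update(_file=None, _tokenizer=None, _pid=None)
        return state

    def _ensure_open(self):
        # Re-create the handle and tokenizer the first time each process is used.
        if self._pid != os.getpid():
            self._file = open(self.path, "rb")
            self._tokenizer = self.make_tokenizer()
            self._pid = os.getpid()

    def __getitem__(self, i):
        self._ensure_open()
        self._file.seek(self.offsets[i])
        line = self._file.readline().decode("utf-8").rstrip("\r\n")
        return self._tokenizer(line)
```

Since it only implements `__len__` and `__getitem__`, an object like this should be usable as a map-style dataset with a PyTorch `DataLoader`, though whether it resolves the num_workers > 1 failure depends on what the actual error was.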