Tokenizing speed and using TextBlock

I’m trying to get the data pipeline working for neural machine translation, so I chose the MT_ENG_FRA dataset from the URLs class in fastai.

As of now I’m simply trying to get a language-model DataLoaders from the folder. But this seems to be taking too long: even on Colab it easily exceeds an hour (I never got the process to complete). The English text file is ~3.5GB and the French file is ~4.25GB. Is the speed I’m seeing expected? Has anyone else tried this dataset or have experience with TextBlock.from_folder()?
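One thing worth knowing is that fastai's tokenizer parallelizes across *files*, so a single multi-GB file gets little benefit from multiple workers. A possible workaround (a sketch, not fastai API; `split_corpus` and the chunk naming are my own invention) is to split each corpus into many smaller chunk files before tokenizing:

```python
from pathlib import Path

def split_corpus(src, out_dir, lines_per_chunk=500_000):
    """Stream `src` and write it out as many smaller files of
    `lines_per_chunk` lines each, so tokenization can parallelize
    across files. Returns the number of chunk files written."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    chunk, n_chunks = [], 0
    with open(src, encoding="utf-8") as f:
        for line in f:
            chunk.append(line)
            if len(chunk) >= lines_per_chunk:
                (out_dir / f"chunk_{n_chunks:05d}.txt").write_text(
                    "".join(chunk), encoding="utf-8")
                chunk, n_chunks = [], n_chunks + 1
    if chunk:  # write any leftover lines as a final chunk
        (out_dir / f"chunk_{n_chunks:05d}.txt").write_text(
            "".join(chunk), encoding="utf-8")
        n_chunks += 1
    return n_chunks
```

You could then point `TextBlock.from_folder` at the chunk directory instead of the directory holding the single giant file.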

My Code

from fastai.data.all import *
from fastai.text.all import *
nmt_path = untar_data(URLs.MT_ENG_FRA)
nmt_path.ls()

Separate the English and French files into different folders

en_dir = nmt_path/"en"
fr_dir = nmt_path/"fr"
if not en_dir.exists():
    en_dir.mkdir()
    shutil.move(nmt_path/"giga-fren.release2.fixed.en", en_dir/"giga-fren.release2.fixed.en")
if not fr_dir.exists():
    fr_dir.mkdir()
    shutil.move(nmt_path/"giga-fren.release2.fixed.fr", fr_dir/"giga-fren.release2.fixed.fr")

print(en_dir.ls())
print(fr_dir.ls())

I get stuck in the cell below…

en_lm_block = DataBlock(blocks=TextBlock.from_folder(en_dir, is_lm=True),
                        get_items=get_text_files,
                        splitter=RandomSplitter())
dls = en_lm_block.dataloaders(en_dir)  # tokenization actually runs here
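To sanity-check the pipeline end to end before committing to the full 3.5 GB corpus, one option is to first cut a small sample file with plain Python and build the DataLoaders on that. This is a sketch; `make_sample`, the sample paths, and the line count are all illustrative names of my own:

```python
from pathlib import Path
from itertools import islice

def make_sample(src, dst, n_lines=50_000):
    """Copy the first `n_lines` lines of `src` to `dst` so a small
    subset of the corpus can be tokenized quickly for testing."""
    dst = Path(dst)
    dst.parent.mkdir(parents=True, exist_ok=True)
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(islice(fin, n_lines))
    return dst

# Assumed usage with the paths from the post above:
# sample_dir = en_dir/"sample"
# make_sample(en_dir/"giga-fren.release2.fixed.en", sample_dir/"sample.en.txt")
# dls = DataBlock(blocks=TextBlock.from_folder(sample_dir, is_lm=True),
#                 get_items=get_text_files,
#                 splitter=RandomSplitter()).dataloaders(sample_dir, bs=64)
```

If the sample tokenizes in seconds but the full file still hangs for hours, that would point at the single-giant-file issue rather than the DataBlock definition.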