How to prepare translation datasets from scratch for the data block API?

Hi, I’ve been experimenting with transformer-translation for a while.


It works fine, but the problem is that this example uses a CSV file, which I don't think is a good fit for large datasets.

How can I prepare a dataset with a folder structure similar to the one used for language model training? Something like:

data/en/1.txt
       /2.txt
       /3.txt
data/fr/1.txt
       /2.txt
       /3.txt

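For context, here is a rough sketch of how I imagine pairing the parallel files by filename before handing them to the data block API (the `list_pairs` helper and the `data` path are just made up for illustration, not something from the library):

```python
from pathlib import Path

def list_pairs(path):
    """Pair parallel files by filename: data/en/1.txt <-> data/fr/1.txt."""
    path = Path(path)
    pairs = []
    for en_file in sorted((path / 'en').glob('*.txt')):
        fr_file = path / 'fr' / en_file.name
        if fr_file.exists():
            pairs.append((en_file, fr_file))
    return pairs

pairs = list_pairs('data')
# -> [(Path('data/en/1.txt'), Path('data/fr/1.txt')), ...]
```

That way the texts could be read lazily from disk instead of loading everything from one CSV, but I'm not sure how to plug a list of path pairs like this into the data block API.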
Could you give me a hint?

Thanks in advance.