How to prepare translation datasets from scratch for the data block API?

Hi, I’ve been experimenting with transformer-translation for a while.


It works fine, but the problem is that this example uses a CSV file, which I don't think is a good fit for large datasets.

How can I prepare a dataset with a folder structure similar to the one used for language model training? Something like:

data/en/1.txt
       /2.txt
       /3.txt
data/fr/1.txt
       /2.txt
       /3.txt

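For context, here is a rough sketch of how I imagine pairing the parallel files by filename before handing them to the data block API (the `list_pairs` helper and the `data` path are just made up for illustration, not something from the library):

```python
from pathlib import Path

def list_pairs(path):
    """Pair parallel files by filename: data/en/1.txt <-> data/fr/1.txt."""
    path = Path(path)
    pairs = []
    for en_file in sorted((path / 'en').glob('*.txt')):
        fr_file = path / 'fr' / en_file.name
        if fr_file.exists():
            pairs.append((en_file, fr_file))
    return pairs

pairs = list_pairs('data')
# -> [(Path('data/en/1.txt'), Path('data/fr/1.txt')), ...]
```

That way the texts could be read lazily from disk instead of loading everything from one CSV, but I'm not sure how to plug a list of path pairs like this into the data block API.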
Could you give me a hint?

Thanks in advance.