Per-character tokenizer

Right - when you say you run out of memory, is that RAM or GPU memory?

Is it happening when training the language model or the classifier?

A character-level RNN will have fewer weights than a word-level one (the vocabulary of possible characters is thousands of times smaller than common word vocabulary sizes, so the embedding layer is much smaller), but will produce more tokens (around 5x as many for English text).
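To make the trade-off concrete, here's a minimal sketch (the names are illustrative, not any particular library's API):

```python
def char_tokenize(text):
    # Each character is one token, so the token count equals len(text) -
    # several times the word count of the same sentence.
    return list(text)

def build_vocab(texts):
    # A character vocabulary is tiny (usually well under 200 entries for
    # English) versus tens of thousands of entries for a word vocabulary,
    # so the embedding layer shrinks accordingly.
    return sorted({ch for text in texts for ch in text})

texts = ["the cat sat on the mat"]
print(len(build_vocab(texts)))        # a handful of distinct characters
print(len(char_tokenize(texts[0])))   # 22 tokens vs 6 words
```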

I’d expect RAM usage to be higher and GPU memory usage to be lower for language modelling. For classification you may need to truncate inputs to a maximum number of characters to conserve GPU memory.
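Something along these lines before batching, for example (a rough sketch; `max_chars` is a made-up budget you'd tune to your GPU):

```python
MAX_CHARS = 2000  # hypothetical limit - tune to your GPU memory

def truncate(text, max_chars=MAX_CHARS):
    # Keep only the first max_chars characters so the padded batch that
    # reaches the GPU stays a manageable size.
    return text[:max_chars]

docs = [truncate(d) for d in docs]
```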