I dug into it a bit more, and this does appear to have been a text encoding issue. I ran the fastai doc() function on TextLMDataBunch.from_folder, and after digging through the code and comments, it looks like the text is expected to be UTF-8 encoded.
This post showed me how to check the encoding of both my own files and the IMDB files downloaded by fastai.
(If you can’t find the files fastai downloaded, pass the dest parameter to untar_data: path = untar_data(URLs.IMDB, dest="/content"))
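If you'd rather check from Python, here's a rough sketch of one way to do it. It doesn't detect which encoding a file uses; it only tries to decode each file as UTF-8 and lists the ones that fail. The function name find_non_utf8 and the "*.txt" pattern are my own assumptions, not anything from fastai.

```python
from pathlib import Path

def find_non_utf8(folder, pattern="*.txt"):
    """Return the files under `folder` whose bytes do not decode as UTF-8.

    Hypothetical helper for illustration only, not part of fastai.
    Plain ASCII files count as valid UTF-8, so they are not reported.
    """
    bad = []
    for p in sorted(Path(folder).rglob(pattern)):
        try:
            # Strict decoding raises on any byte sequence that is not valid UTF-8.
            p.read_bytes().decode("utf-8")
        except UnicodeDecodeError:
            bad.append(p)
    return bad
```

For example, find_non_utf8(path) on the folder returned by untar_data, or on your own data folder, should come back empty if everything is UTF-8 (or plain ASCII). Note that it reads every file fully, so it can take a while on a large dataset.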
Here’s how to check text encoding on a Linux filesystem (on Colab, I had to run apt-get install file first in order to use the file command):