Fast.ai DL1, DL2, ML1 Video Transcription Project - Proofreading Help Needed!

I took the available course outlines, video timelines and auto-generated YouTube subtitles and spliced them together to make searchable HTML/PDF files.

Here are the direct links to PDFs:

Since YouTube’s subtitles are just an endless sequence of lowercase words, I used a neural-net project, http://bark.phon.ioc.ee/punctuator, to automatically split them into sentences and add punctuation. It did a pretty good job.

The outcome is usable, but it could become very useful if the community makes an effort to proofread the transcripts while watching the videos and correct them where necessary.

To download the transcript PDFs/HTMLs and see how you could contribute with proofreading, please see:

To avoid duplicated effort, before you start proofreading it’d be a good idea to reply here in a comment indicating which course, lesson(s) and section(s) you’re working on, and to update your comment as you complete things. But of course, only do that if you are actually doing the work.

Thank you.

P.S. Perhaps the fastai-transcript git repo I created could be moved under the fastai GitHub user, since it’s really part of fastai. I enjoyed creating the mashups, but I don’t need to hold onto this sub-project now that it has been birthed.

This is pretty cool. Just wondering, why not convert these to .srt files? That way one could watch the video with the new .srt files, and proofreading would be easier.

I think I understand why you’d suggest that: the relevant text would be shown at the same time as it is spoken.

At the moment they are the same .srt files with just punctuation added (later in the compilation process), i.e. you won’t be able to notice this change when you’re shown 3-5 words at a time, and it won’t help with proofreading either. Based on my experience, one needs to see whole sentences, or even better whole paragraphs (context), to proofread, IMHO.

But if you or others have other ideas about how to make a better transcript or make it easier to proofread, all the source components are there in the build subdirs, so you can play with these and come up with other approaches.
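As one illustration of such an approach, here is a rough Python sketch that merges already-punctuated .srt cues into whole sentences while keeping an approximate start time for each sentence, so a proofreader gets context and can still jump to the right spot in the video. This is not the code used to build the transcripts; the function names (`parse_srt`, `cues_to_sentences`) are made up for the example, and it assumes the usual .srt layout (index line, timing line, text lines, blank line between cues).

```python
import re

# Matches an SRT timing line such as "00:00:01,000 --> 00:00:03,500"
# and captures the start time without milliseconds.
TIMING = re.compile(r"(\d{2}:\d{2}:\d{2}),\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}")


def parse_srt(srt_text):
    """Return a list of (start_time, text) cues from SRT-formatted text."""
    cues = []
    # Cues are separated by blank lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, line in enumerate(lines):
            match = TIMING.match(line)
            if match:
                # Everything after the timing line is the cue text.
                text = " ".join(lines[i + 1:])
                if text:
                    cues.append((match.group(1), text))
                break
    return cues


def cues_to_sentences(cues):
    """Merge cues into sentences, keeping a start time for each sentence.

    The start time is that of the cue containing the sentence's first
    word, so it is only approximate when a sentence begins mid-cue.
    """
    sentences = []
    start, words = None, []
    for cue_start, text in cues:
        for word in text.split():
            if start is None:
                start = cue_start
            words.append(word)
            # Treat '.', '?' or '!' at the end of a word as a sentence boundary.
            if word[-1] in ".?!":
                sentences.append((start, " ".join(words)))
                start, words = None, []
    if words:  # trailing text with no closing punctuation
        sentences.append((start, " ".join(words)))
    return sentences


# Hypothetical sample input, made up for this example.
SAMPLE = """1
00:00:01,000 --> 00:00:03,000
Welcome to the

2
00:00:03,000 --> 00:00:05,500
first lesson. Today we

3
00:00:05,500 --> 00:00:08,000
look at images.
"""

for start_time, sentence in cues_to_sentences(parse_srt(SAMPLE)):
    print(f"[{start_time}] {sentence}")
```

The sentence splitting is deliberately naive (abbreviations like "e.g." would end a sentence early), so treat it as a starting point rather than a finished tool.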