Most likely the dataset is already downloaded in the video, and Jeremy is probably running it on a private server with a much faster GPU, CPU, RAM, and storage, all of which can make a difference (the GPU being the biggest).
My 1080ti was much faster than a K80. I can't remember exactly by how much, but I believe it was more than 2x faster, so I would expect your 1070ti to be faster as well.
Playing with batch sizes can also make a huge difference. I did some benchmarking on the NLP model here, both on Colab and on my personal machine with a 3090.
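If anyone wants to reproduce that kind of batch-size benchmark, a minimal timing harness looks something like this. It's only a sketch: `fake_step` and its timings are made-up stand-ins for a real training step, just to show why larger batches amortize the fixed per-batch overhead.

```python
import time

def time_epoch(step_fn, n_samples, bs):
    """Time one pass over n_samples processed in batches of size bs."""
    t0 = time.perf_counter()
    for start in range(0, n_samples, bs):
        step_fn(min(bs, n_samples - start))  # last batch may be smaller
    return time.perf_counter() - t0

def fake_step(n):
    # Toy workload: a fixed per-batch overhead (kernel launches, data
    # loading, etc.) plus per-sample compute. Numbers are arbitrary.
    time.sleep(0.0001 + n * 0.000001)

for bs in (16, 64, 256):
    print(f"bs={bs}: {time_epoch(fake_step, 2048, bs):.3f}s")
```

With a real model you'd swap `fake_step` for an actual forward/backward pass; the pattern of larger batches finishing an epoch faster (until you run out of GPU memory) is the same.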
I also ran a benchmark of the 01 intro notebook on my local machine for reference. I'm not sure how long the dataset downloads took, but you can see the training times; it was definitely several minutes including the download time. On Colab you have to re-download the datasets and models each time your notebook instance is released, which makes it take longer than on a dedicated machine with all of that pre-downloaded. In my case I did not have the models or datasets pre-downloaded on my local machine for several of the models. If you're using a dedicated instance on AWS that you turn on and off each time you use it, you should not have to re-download the models and datasets, which saves time.
Actually, I may have been recalling something I looked at a while back, so maybe the 1070ti is slightly faster for the first notebook (not counting download times). I also didn't see the sentiment step in your PDF; that is what seems to take 12 minutes per epoch on the 1070ti. Everything else in that notebook is pretty fast. I am using the fastai container, so maybe my notebooks are not the most recent.
EDIT: Actually, I do see the text data loader step in the PDF (step 15). In your output it takes 1:18 min per epoch; on my 1070ti (bs=24) it takes about 12 minutes per epoch. So I'm guessing the PDF you posted contains results from your 3090 setup? That's impressive. Maybe I'll be able to afford a 3090 once the 40 series comes out. OTOH, I'll probably need a whole new box, because the Dell T3600 just doesn't have the juice to run a 3090 tbh.
Yep. I built it in, I think, 2017 and upgraded from 1080ti's to a 3090 last year. The power consumption on the 3090 is ~420W, which is pretty crazy. I bought a 1000W supply when I built the machine; it drives one card just fine, but probably not two.
Going to a card with Tensor Cores and switching to fp16 makes a huge difference. I was close to maxing out GPU RAM with a batch size of 384 and 28s epochs using fp16, versus a max batch size of ~176 and 60s epochs using standard fp32 on the same card. I attached a PDF to my post on the other thread with a bunch of different tests. It looks like the fp16 call got cut off at the end of the learner line, but you can still see the dot at the end of the learner-creation line to tell where I applied it and where I didn't.
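As a toy illustration of why halving precision roughly doubles the batch size that fits: if activation memory dominates and scales linearly with batch size (a big simplification), then storing values in 2 bytes instead of 4 about doubles the max batch. All the numbers below are hypothetical, not measurements from any real model.

```python
# Back-of-envelope model: max batch size is GPU memory divided by
# per-sample activation memory. Assumes activations dominate and scale
# linearly with batch size (ignores weights, optimizer state, overhead).
BYTES_FP32 = 4
BYTES_FP16 = 2

def max_batch(gpu_bytes, per_sample_floats, bytes_per_float):
    """Largest batch whose activations fit in gpu_bytes (toy model)."""
    return gpu_bytes // (per_sample_floats * bytes_per_float)

# Hypothetical numbers: a 24 GB card, ~34M activation values per sample.
gpu = 24 * 1024**3
per_sample = 34_000_000

print(max_batch(gpu, per_sample, BYTES_FP32))  # 189 in fp32
print(max_batch(gpu, per_sample, BYTES_FP16))  # 378 in fp16
```

That roughly matches the ~176 vs 384 jump I saw, and the Tensor Cores explain the speedup on top of the memory savings.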
Yeah, and the 3090 Ti is pushing half a kilowatt for a 7-10% bump. Off topic, but I really wish Apple would assign a few devs full time to helping the PyTorch team port it to their M-series hardware. They'd probably see a significant bump in sales if PyTorch became usable on their silicon. The PyTorch team seems to be struggling with having to port it to the M-series chips.