Part 2 Lesson 12 wiki

Thanks! It is very comforting to know that I am not alone. I have the same approach as you, except that I can only spend about 15 hours a week on this, so you can imagine my learning curve. But as you and Jeremy rightly pointed out, I will take this step by step: first understanding, then implementing. Whenever I feel stuck with my speed or progress, I remember what Jeremy said previously - it is the perseverant who he has seen succeed in this space more than anybody else :slight_smile:

1 Like

Thanks, now I got it.

I got results that were closer to Jeremy’s when I used the full dataset instead of the 10% sample.

Hi Rohit, were you able to download the dataset? When I try it on my P2 instance on AWS, the download gets cut off after about 34 GB.

Yes, I was able to download it with no problems. I used a Google Cloud instance.

@Ducky Don’t worry, it comes with time and practice. This is my second time around taking Part 2, and I’m just now starting to understand some key concepts. I’ll share three things that have really helped me this time around.

First, I’ve been pulling the lectures off YouTube onto my phone so I can listen to them as I bike to work. I’ve got an hour-long commute (East Van to Richmond), and Jeremy’s voice is pretty much all I listen to all week long. Hearing the lectures for the second or third time makes a huge difference in terms of catching some of the key points or understanding them.

Second, I got much more comfortable with the Python debugger, and I started throwing breakpoints all over the code so I could look at the vectors and see what’s being passed at each point.
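
For example, a minimal sketch of the sort of thing I mean (the function and values here are made up purely for illustration, not from the course code):

import pdb
import torch

def scale_and_shift(x, w, b):
    pdb.set_trace()   # execution pauses here; inspect x, w, b at the (Pdb) prompt
    return x * w + b

out = scale_and_shift(torch.randn(4, 3), 2.0, 0.5)
# handy pdb commands: p x.shape (print), n (next line), s (step into), c (continue)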

And finally, and probably most importantly, I built a neural net end to end that wasn’t based on any of the models Jeremy taught. This meant I had to build a custom dataset class, data loader, model class, and loss function. Debugging that was painful, and that’s where pdb really shone. Now I feel much more comfortable with what’s going on under the hood.
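
Roughly, the skeleton looks like this (a toy sketch with made-up names, not the actual net I built):

import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader

class MyDataset(Dataset):                      # custom dataset: target is the row sum
    def __init__(self, n=256):
        self.x = torch.randn(n, 10)
        self.y = self.x.sum(1, keepdim=True)
    def __len__(self): return len(self.x)
    def __getitem__(self, i): return self.x[i], self.y[i]

class MyModel(nn.Module):                      # custom model class
    def __init__(self):
        super().__init__()
        self.lin = nn.Linear(10, 1)
    def forward(self, x): return self.lin(x)

def my_loss(pred, targ):                       # custom loss: MSE written by hand
    return ((pred - targ) ** 2).mean()

dl = DataLoader(MyDataset(), batch_size=32, shuffle=True)
model = MyModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for xb, yb in dl:                              # one epoch of training
    loss = my_loss(model(xb), yb)
    opt.zero_grad(); loss.backward(); opt.step()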

I know it feels overwhelming at first, and I struggled my first time through as well, but I’d suggest focusing on understanding one project really well rather than trying to understand it all. The lectures are there to come back to once you’re ready, and the forums stay active after class is through for questions, especially when it’s opened up to the public.

Keep at it! Jeremy’s mentioned a few times that the most important component to success in this field is persistence.

21 Likes

The Google Brain team has just announced their DAWNBench result for ImageNet classification. When we change from CIFAR-10 to ImageNet, what kind of changes are needed in the architecture? I am curious to know the time and $ the fast.ai algorithm is going to take.

1 Like

Yes, the full dataset gives a much better result.

Fun with datasets (or NOT)

Tried to run CycleGAN on both of my GPUs with --gpu-ids='0,1' and it seemed to only run on the second one. So I said OK, let’s just try one GPU with --gpu-ids='1': the first training block is currently running on that GPU and takes 778 seconds per epoch, so running 200 epochs should take 43 HOURS (778 s × 200 = 155,600 s ≈ 43.2 h, if my math is correct). BTW, if you’re wondering why I didn’t start with WGAN, it’s because the 47.2 GB dataset is still downloading. Yes, I know I should start with smaller datasets, but what’s the fun in that?

We haven’t figured out how to get good results with ImageNet yet, unfortunately. Not sure why.

I have a single 1080Ti and it took about 30 hours for 200 epochs.

I only have a 1080 :frowning_face: … well, actually, I have two :grinning:

I’m running the wgan.ipynb notebook and keep hitting this error:

 In [22]: train(1, False)

0%|          | 0/1 [00:00<?, ?it/s]
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-23-0493450504b2> in <module>()
----> 1 train(1, True)

<ipython-input-21-344f55de6ecd> in train(niter, first)
      5         data_iter = iter(md.trn_dl)
      6         i,n = 0,len(md.trn_dl)
----> 7         with tqdm(total=n) as pbar:
      8             while i < n:
      9                 set_trainable(netD, True)

~/fastai/courses/dl2/fastai/imports.py in tqdm(*args, **kwargs)
     45 if in_notebook():
     46     def tqdm(*args, **kwargs):
---> 47         clear_tqdm()
     48         return tq.tqdm(*args, file=sys.stdout, **kwargs)
     49     def trange(*args, **kwargs):

~/fastai/courses/dl2/fastai/imports.py in clear_tqdm()
     41     inst = getattr(tq.tqdm, '_instances', None)
     42     if not inst: return
---> 43     for i in range(len(inst)): inst.pop().close()
     44 
     45 if in_notebook():

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/tqdm/_tqdm.py in close(self)
   1096         # decrement instance pos and remove from internal set
   1097         pos = abs(self.pos)
-> 1098         self._decr_instances(self)
   1099 
   1100         # GUI mode

~/src/anaconda3/envs/fastai/lib/python3.6/site-packages/tqdm/_tqdm.py in _decr_instances(cls, instance)
    436         with cls._lock:
    437             try:
--> 438                 cls._instances.remove(instance)
    439             except KeyError:
    440                 if not instance.gui:  # pragma: no cover

~/src/anaconda3/envs/fastai/lib/python3.6/_weakrefset.py in remove(self, item)
    107         if self._pending_removals:
    108             self._commit_removals()
--> 109         self.data.remove(ref(item))
    110 
    111     def discard(self, item):

KeyError: <weakref at 0x7f3eb51d28b8; to 'tqdm' at 0x7f3eb80dd048>

It’s another problem with tqdm. I’ve just run git pull, conda env update, and conda update --all, and I still get this error. My current fix is to replace the line with tqdm(total=n) as pbar: with if True: and comment out pbar.update(). This works, but there’s no progress bar.
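
For what it’s worth, tqdm also has a built-in disable flag, so a less invasive variant of the same hack (which should sidestep the crash, since disabled bars skip the instance bookkeeping) would be something like:

from tqdm import tqdm
import time

n = 5
with tqdm(total=n, disable=True) as pbar:   # disable=True draws no bar,
    for i in range(n):                      # but pbar.update() stays valid
        time.sleep(0.1)                     # stand-in for the training step
        pbar.update()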

I asked someone else in the course and they’re not hitting this issue. Any ideas?

Looks like you’ve got a newer version of tqdm - I noticed this somewhere else with the new version. I fixed it by replacing these lines in _tqdm.py (see the second frame from the bottom of your stack trace):

except KeyError:
    if not instance.gui:  # pragma: no cover

with

except KeyError: pass
5 Likes

Encouraged by the good results I got with 64x64 images, I tried bumping up sz to 128, but that didn’t go as well…

…has anyone tried generating 128-pixel images with this notebook?

Experimenting with the wgan notebook: I tried running it on another LSUN category, church_outdoor, for kicks. This is a smaller dataset (2.3 GB); you can download any of the other 10 scene categories by replacing ‘category=bedroom’ with the appropriate tag (e.g. church_outdoor) in the notebook’s download instructions. To see improvements in the GAN I’ve tried the obvious things: a) showing more data to the GAN and b) more iterations of the train loop. Other suggestions to improve the performance (visual appearance, rather) of the generated images are welcome!
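
For example, a sketch of the substitution (this assumes the notebook builds the standard LSUN download URL; the endpoint may have changed since):

import subprocess

category = 'church_outdoor'   # instead of 'bedroom'
url = ('http://lsun.cs.princeton.edu/htbin/download.cgi'
       f'?tag=latest&category={category}&set=train')
subprocess.run(['curl', '-o', f'{category}_train.zip', url], check=True)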

PS: Found this guide of tips and tricks for making GANs work, by Soumith, though it’s a year old and we’re doing most of it already (normalize the data, use DCGAN, separate real and fake batches, leaky ReLU).

Increasing data sample size.
The images below are for 10%, 50%, and 100% of the church_outdoor dataset, respectively (1 epoch).


Increasing training loops. Running the notebook for 10, 50, and 250 iterations, respectively, with 100% of the data used. The images start looking more and more realistic.

Loss numbers for 10 iterations (6 min to run):
Loss_D [-1.37384]; Loss_G [0.72288]; D_real [-0.71672]; Loss_D_fake [0.65712]
For 250 iterations it took nearly 3 hours:
Loss_D [-0.50636]; Loss_G [0.45063]; D_real [-0.41054]; Loss_D_fake [0.09582]



14 Likes

Was this still improving at 250 iterations or had it flattened out? I mean, would 500 give a much better result still?

The values jump around quite a bit, but I think there is still a slight improvement every 10 iterations or so. It would be worthwhile trying more than 500 iterations, perhaps to also exercise the d_iters = 100 case:
d_iters = 100 if (first and (gen_iterations < 25) or (gen_iterations % 500 == 0)) else 5
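
For reference, since the precedence in that line is easy to misread (and binds tighter than or), it parses as:

# train the critic 100 iterations per generator step during the warm-up
# (first 25 generator iterations) and on every 500th generator iteration
# thereafter; otherwise the usual 5
first, gen_iterations = True, 0   # example values, just to make this runnable
d_iters = 100 if ((first and gen_iterations < 25) or (gen_iterations % 500 == 0)) else 5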

Thanks, Even! It is really helpful and inspiring to see what perseverance can do :slight_smile:

When running the cifar10-darknet notebook, I was getting this error:

RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation

I had previously worked from the video and did not have this issue. It appears to be caused by the line that is commented out below. In the video there was a discussion about trying to save memory by working on things in place, and x.add_ was added at that point. Using the original line (the one above the commented-out line) works.

class ResLayer(nn.Module):
    def __init__(self, ni):
        super().__init__()
        self.conv1=conv_layer(ni, ni//2, ks=1)   # 1x1 conv halves the channels
        self.conv2=conv_layer(ni//2, ni, ks=3)   # 3x3 conv restores them
        
    def forward(self, x): 
        return x.add(self.conv2(self.conv1(x)))     # out-of-place add: autograd-safe
#        return x.add_(self.conv2(self.conv1(x)))  # in-place add: causes the error

Updated: As Nikhil suggests below, simply taking the underscore off add makes it no longer an in-place operation, which directly addresses the error message.
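
To see why the in-place version upsets autograd, here is a tiny standalone repro (independent of the notebook, using a recent PyTorch):

import torch

x = torch.randn(3, requires_grad=True)
y = torch.sigmoid(x)   # sigmoid saves its output y for the backward pass
y.add_(1)              # mutates that saved tensor in place
y.sum().backward()     # RuntimeError: one of the variables needed for gradient
                       # computation has been modified by an inplace operation
# y.add(1) instead allocates a new tensor, and backward() succeeds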

1 Like