Lesson 1 In-Class Discussion ✅

Sure!

http://docs.fast.ai/metrics.html#Fbeta

3 Likes

Thank you!

I downloaded images (my dataset) from Google Images following @lesscomfortable's wonderful Jupyter notebook. I am using AWS SageMaker as my platform. I got a 66% error rate, which is bad :frowning:. On looking at the images (show_batch), I found some irrelevant images. How do I remove irrelevant images from my dataset?

I have 6 classes and 300 images in each class. Should I open each image and verify whether it is relevant? Is it possible to download the dataset from SageMaker to my laptop, so that I can quickly delete the irrelevant images?

2 Likes

Hey! You have two options:

  1. You can inspect them in your notebook by running .show_batch() a number of times. Then you can delete the ones you don’t need with os.remove(filename).

  2. A smarter way is to train a model and then plot the images that got the top losses with .plot_top_losses(). These are the best candidates for deletion because your model is having trouble classifying them. Be careful here: only delete images that do not correspond to their assigned label. If you delete other images, you would be helping your model overfit (in the extreme, if you delete every misclassified image, you would artificially achieve 100% accuracy on the validation set). You can get the filenames of the top losses and then delete them with os.remove(filename), as in the sketch below.
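
A minimal sketch of option 2, assuming fastai v1's ClassificationInterpretation API, an already-trained learn, and an arbitrary cutoff of 20 images for illustration:

import os
from fastai.vision import ClassificationInterpretation

interp = ClassificationInterpretation.from_learner(learn)
interp.plot_top_losses(9, figsize=(15, 11))   # eyeball the worst predictions first

# top_losses() returns losses and dataset indices, sorted from highest loss down
losses, idxs = interp.top_losses()
for i in idxs[:20]:
    filename = learn.data.valid_ds.x.items[int(i)]
    # only delete after confirming the image really is mislabeled or irrelevant
    os.remove(filename)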

3 Likes

Created lesson 1 notes - somewhere between @PoonamV’s notes and @stas’s transcript.

16 Likes

I know this was answered, but here would be my way of solving @kofi’s / @Galactrion’s questions:

For the number of training examples:

len(data.y)

For the actual classes

data.classes has those, so

num_classes = len(data.classes)

data.c has the same number ready for you from fastai. :wink:

To get a list of the number of examples per class, I leverage pandas:

pd.concat([pd.Series(data.classes), pd.Series(data.train_ds.y).value_counts()], axis=1)

Now, thanks to @raghavab1992’s reply, there is an even simpler version using data.class2idx:

pd.Series(data.class2idx).map(pd.Series(data.train_ds.y).value_counts())

The key here is using the .value_counts() method on the training or validation ys. In order to use that, we have to create a pandas Series object (pd.Series(data.train_ds.y).value_counts()).
The rest is for replacing the class numbers with class names.
If you want them sorted by number per class, stick .sort_values(1) or .sort_values(1, ascending=False) on the end of that line. The output looks something like this:

[screenshot: table of class names with per-class example counts]

The first number is the class ID, which can mostly be ignored.
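
For reference, the same recipe as one self-contained snippet (a sketch assuming a fastai v1 data object where data.train_ds.y holds class indices):

import pandas as pd

# per-class counts of the training examples, keyed by class index
counts = pd.Series(data.train_ds.y).value_counts()

# attach the class names and sort by count, largest first
per_class = pd.concat([pd.Series(data.classes), counts], axis=1)
print(per_class.sort_values(1, ascending=False))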

9 Likes

I like it better - I will edit my answer to include your method for the number of classes.

Sure,

image_extensions = set(k for k,v in mimetypes.types_map.items() if v.startswith('image/'))

It uses a nifty built-in Python library called mimetypes. It can be found here: https://docs.python.org/3/library/mimetypes.html

It uses the mimetypes.types_map dictionary to get this output:

{'.a': 'application/octet-stream',
 '.ai': 'application/postscript',
 '.aif': 'audio/x-aiff',
 '.aifc': 'audio/x-aiff',
 '.aiff': 'audio/x-aiff',
 '.au': 'audio/basic',
 '.avi': 'video/x-msvideo',
 '.bat': 'text/plain',
 '.bcpio': 'application/x-bcpio',
 '.bin': 'application/octet-stream',
 '.bmp': 'image/x-ms-bmp',
 '.c': 'text/plain',
 '.cdf': 'application/x-netcdf',
 '.cpio': 'application/x-cpio',
 '.csh': 'application/x-csh',
 '.css': 'text/css',
 '.csv': 'text/csv',
 '.dll': 'application/octet-stream',
 '.doc': 'application/msword',
 '.dot': 'application/msword',
 '.dvi': 'application/x-dvi',
 '.eml': 'message/rfc822',
 '.eps': 'application/postscript',
 '.etx': 'text/x-setext',
 '.exe': 'application/octet-stream',
 '.gif': 'image/gif',
 '.gtar': 'application/x-gtar',
 '.h': 'text/plain',
 '.hdf': 'application/x-hdf',
 '.htm': 'text/html',
 '.html': 'text/html',
 '.ico': 'image/vnd.microsoft.icon',
 '.ief': 'image/ief',
 '.jpe': 'image/jpeg',
 '.jpeg': 'image/jpeg',
 '.jpg': 'image/jpeg',
 '.js': 'application/javascript',
 '.json': 'application/json',
 '.ksh': 'text/plain',
 '.latex': 'application/x-latex',
 '.m1v': 'video/mpeg',
 '.m3u': 'application/vnd.apple.mpegurl',
 '.m3u8': 'application/vnd.apple.mpegurl',
 '.man': 'application/x-troff-man',
 '.me': 'application/x-troff-me',
 '.mht': 'message/rfc822',
 '.mhtml': 'message/rfc822',
 '.mif': 'application/x-mif',
 '.mov': 'video/quicktime',
 '.movie': 'video/x-sgi-movie',
 '.mp2': 'audio/mpeg',
 '.mp3': 'audio/mpeg',
 '.mp4': 'video/mp4',
 '.mpa': 'video/mpeg',
 '.mpe': 'video/mpeg',
 '.mpeg': 'video/mpeg',
 '.mpg': 'video/mpeg',
 '.ms': 'application/x-troff-ms',
 '.nc': 'application/x-netcdf',
 '.nws': 'message/rfc822',
 '.o': 'application/octet-stream',
 '.obj': 'application/octet-stream',
 '.oda': 'application/oda',
 '.p12': 'application/x-pkcs12',
 '.p7c': 'application/pkcs7-mime',
 '.pbm': 'image/x-portable-bitmap',
 '.pdf': 'application/pdf',
 '.pfx': 'application/x-pkcs12',
 '.pgm': 'image/x-portable-graymap',
 '.pl': 'text/plain',
 '.png': 'image/png',
 '.pnm': 'image/x-portable-anymap',
 '.pot': 'application/vnd.ms-powerpoint',
 '.ppa': 'application/vnd.ms-powerpoint',
 '.ppm': 'image/x-portable-pixmap',
 '.pps': 'application/vnd.ms-powerpoint',
 '.ppt': 'application/vnd.ms-powerpoint',
 '.ps': 'application/postscript',
 '.pwz': 'application/vnd.ms-powerpoint',
 '.py': 'text/x-python',
 '.pyc': 'application/x-python-code',
 '.pyo': 'application/x-python-code',
 '.qt': 'video/quicktime',
 '.ra': 'audio/x-pn-realaudio',
 '.ram': 'application/x-pn-realaudio',
 '.ras': 'image/x-cmu-raster',
 '.rdf': 'application/xml',
 '.rgb': 'image/x-rgb',
 '.roff': 'application/x-troff',
 '.rtx': 'text/richtext',
 '.sgm': 'text/x-sgml',
 '.sgml': 'text/x-sgml',
 '.sh': 'application/x-sh',
 '.shar': 'application/x-shar',
 '.snd': 'audio/basic',
 '.so': 'application/octet-stream',
 '.src': 'application/x-wais-source',
 '.sv4cpio': 'application/x-sv4cpio',
 '.sv4crc': 'application/x-sv4crc',
 '.svg': 'image/svg+xml',
 '.swf': 'application/x-shockwave-flash',
 '.t': 'application/x-troff',
 '.tar': 'application/x-tar',
 '.tcl': 'application/x-tcl',
 '.tex': 'application/x-tex',
 '.texi': 'application/x-texinfo',
 '.texinfo': 'application/x-texinfo',
 '.tif': 'image/tiff',
 '.tiff': 'image/tiff',
 '.tr': 'application/x-troff',
 '.tsv': 'text/tab-separated-values',
 '.txt': 'text/plain',
 '.ustar': 'application/x-ustar',
 '.vcf': 'text/x-vcard',
 '.wav': 'audio/x-wav',
 '.webm': 'video/webm',
 '.wiz': 'application/msword',
 '.wsdl': 'application/xml',
 '.xbm': 'image/x-xbitmap',
 '.xlb': 'application/vnd.ms-excel',
 '.xls': 'application/vnd.ms-excel',
 '.xml': 'text/xml',
 '.xpdl': 'application/xml',
 '.xpm': 'image/x-xpixmap',
 '.xsl': 'application/xml',
 '.xwd': 'image/x-xwindowdump',
 '.zip': 'application/zip'}

Then, it loops through each of these (k = extension, v = MIME type) and checks whether the MIME type (v) starts with image/. If it does, it adds the extension to a set.

This is what is stored in the image_extensions variable.
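
As a hypothetical usage example (the data/bears path is a placeholder), you could use the resulting set to filter a folder down to just its image files:

import mimetypes
from pathlib import Path

# build the set of extensions whose MIME type starts with 'image/'
image_extensions = set(k for k, v in mimetypes.types_map.items()
                       if v.startswith('image/'))

# keep only the files whose extension is a known image extension
folder = Path('data/bears')  # placeholder path
images = [p for p in folder.iterdir() if p.suffix.lower() in image_extensions]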

3 Likes

Another thing you should look for in your images is any systematic bias. I was trying to classify sport pictures. It turned out that team pictures appeared more often in the hockey/lacrosse datasets I had built from a Google search across ten sports - so the model was more likely to classify a team photo from any other sport as one of these two rather than as the correct one. I refined my search to look for action images, then weeded out the team shots. Same for crowd scenes, cheerleaders and general stadium shots, which appeared more often in baseball.

2 Likes

Thanks a lot @hiromi for putting this together! I’ve been using your ML and DL notes before and they’ve been most helpful!! :grinning:

2 Likes

Can I ask why we use
“np.random.seed(2)”
before ImageDataBunch? Or is this for ConvLearner?

1 Like

@Tejaswani When you use this untar function, I believe you should omit the extension when passing the URL. Also, when you use a tar.gz you would get a tar file, so you might have to untar that again.
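
A small sketch of that, assuming fastai v1's untar_data and a hypothetical dataset URL:

from fastai.datasets import untar_data

# pass the URL *without* the .tgz extension - fastai appends it when downloading
path = untar_data('https://example.com/datasets/my-dataset')  # placeholder URL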

1 Like

It's common practice to make the experiment repeatable by seeding the random number generator.
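
Concretely, a sketch assuming the random train/validation split in fastai v1's ImageDataBunch is driven by NumPy's RNG (path and size are placeholders):

import numpy as np
from fastai.vision import ImageDataBunch, get_transforms

np.random.seed(2)  # fix NumPy's RNG so the random train/valid split is identical on every run
data = ImageDataBunch.from_folder(path, valid_pct=0.2, ds_tfms=get_transforms(), size=224)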

1 Like

Hi there,
Has anyone tried predicting with the model on a single image input (say, a new image not in the training set)? I can’t seem to figure out how, or where to find it in the docs. Thanks!

Hi Ste,

Thanks - so this is for randomly initializing the weights for ConvLearner, correct? It has nothing to do with the DataBunch?

Thanks!

No, I ran:

source activate fastai
conda update fastai

which is what’s shown in Returning to Salamander -
http://course-v3.fast.ai/update_salamander.html#update-the-fastai-library

conda list doesn’t show fastai at all… odd

I’m trying to fit a model on the fish species dataset, but it only has a train folder with the images organized into subfolders.
How do I automatically create a validation set using some of the training images?
Is it possible with ImageDataBunch?

@lesscomfortable I get fastai version 1.0.11 using Jeremy’s method.

@kofi You can use the same from_folder method of ImageDataBunch, but pass 0.2 for valid_pct. I hope it helps.
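
A sketch of that suggestion, assuming fastai v1 (path points at the folder containing train/; the transforms and size are arbitrary choices):

from fastai.vision import ImageDataBunch, get_transforms

data = ImageDataBunch.from_folder(
    path,            # folder that contains the train/ subfolder
    train='train',
    valid_pct=0.2,   # hold out 20% of the training images for validation
    ds_tfms=get_transforms(),
    size=224,
)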

OK, then I ran what you suggested:
conda install -c fastai fastai

and now conda list shows fastai at 1.0.15, which seems promising - but then when I run

import fastai
fastai.show_install(0)

it still shows 1.0.11.

I’m a complete noob with Ubuntu, but I’m starting to get what’s going on… the new fastai is now installed in the anaconda3 folder - but my notebook is seeing the old fastai. The update method shown on the Returning to Salamander page isn’t actually updating fastai:

source activate fastai
conda update fastai

and the fastai in anaconda3 isn’t visible to my Salamander notebooks.

So I’m stuck again!
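
One way to check which install the notebook is actually picking up (a generic Python sketch, not a Salamander-specific fix):

import sys
import fastai

print(fastai.__version__)  # the version the notebook actually imports
print(fastai.__file__)     # where that copy of fastai lives on disk
print(sys.executable)      # which Python (and thus which conda env) the kernel runs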