How to use Kaggle's MNIST data with ImageClassifierData?


(Antti) #1

Hello,

I am trying to run the lesson 1 notebook with Kaggle’s MNIST data, but I am having trouble understanding how to use ImageClassifierData with this dataset (or maybe I should approach this differently? :slight_smile: ).

The data is provided as a CSV file in which each row holds the pixel values of one 28x28 image, spread over 784 columns. In the examples we used in lessons 1 and 2, the input was .jpg files.

The input data looks like the following - the correct label [number between 0-9] is in the second column:
example

I am able to display the pictures by running the following (after removing the label column):
example2

However, what should I do to get the data into the appropriate format (including handling the column with the correct label), so that I could afterwards run the following?

arch=resnet34
learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(0.01, 3)

Any help would be highly appreciated :slight_smile: !

Antti


(Chris Palmer) #2

Not sure if this will help, but if it does, please let me know. I adapted this script to work with Python 3.6; it converts the MNIST files into JPG images. I thought this would be a way to fit the data to our setup, but I haven’t yet followed through on getting it fully configured for our use. One thing that has puzzled me is that the valid directory holds images, whereas I was expecting labels like 1 or 0. So that eludes me a bit…

# -*- coding: utf-8 -*-
"""
Created on Sun Nov  5 17:51:01 2017

Adapted for Python 3.6 from :
https://gist.github.com/ischlag/41d15424e7989b936c1609b53edd1390

Simple python script which takes the mnist data from tensorflow and builds a data 
set based on jpg files and text files containing the image paths and labels. 
Parts of it are from the mnist tensorflow example.

"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gzip
import os

from six.moves import urllib
from scipy.misc import imsave
import tensorflow as tf
import numpy as np
import csv

SOURCE_URL = 'http://yann.lecun.com/exdb/mnist/'
WORK_DIRECTORY = 'data'
IMAGE_SIZE = 28
NUM_CHANNELS = 1
PIXEL_DEPTH = 255
NUM_LABELS = 10

def maybe_download(filename):
  """Download the data from Yann's website, unless it's already here."""
  if not tf.gfile.Exists(WORK_DIRECTORY):
    tf.gfile.MakeDirs(WORK_DIRECTORY)
  filepath = os.path.join(WORK_DIRECTORY, filename)
  if not tf.gfile.Exists(filepath):
    filepath, _ = urllib.request.urlretrieve(SOURCE_URL + filename, filepath)
    with tf.gfile.GFile(filepath) as f:
      size = f.Size()
    print('Successfully downloaded', filename, size, 'bytes.')
  return filepath


def extract_data(filename, num_images):
  """Extract the images into a 4D tensor [image index, y, x, channels].

  Values are left in the original [0, 255] range; the rescaling to
  [-0.5, 0.5] below is commented out.
  """
  print('Extracting', filename)
  with gzip.open(filename) as bytestream:
    bytestream.read(16)
    buf = bytestream.read(IMAGE_SIZE * IMAGE_SIZE * num_images)
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
    #data = (data - (PIXEL_DEPTH / 2.0)) / PIXEL_DEPTH
    data = data.reshape(num_images, IMAGE_SIZE, IMAGE_SIZE, 1)
    return data


def extract_labels(filename, num_images):
  """Extract the labels into a vector of int64 label IDs."""
  print('Extracting', filename)
  with gzip.open(filename) as bytestream:
    bytestream.read(8)
    buf = bytestream.read(1 * num_images)
    labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)
  return labels

train_data_filename = maybe_download('train-images-idx3-ubyte.gz')
train_labels_filename = maybe_download('train-labels-idx1-ubyte.gz')
test_data_filename = maybe_download('t10k-images-idx3-ubyte.gz')
test_labels_filename = maybe_download('t10k-labels-idx1-ubyte.gz')

# Extract it into np arrays.
train_data = extract_data(train_data_filename, 60000)
train_labels = extract_labels(train_labels_filename, 60000)
test_data = extract_data(test_data_filename, 10000)
test_labels = extract_labels(test_labels_filename, 10000)

if not os.path.isdir("mnist/train-images"):
   os.makedirs("mnist/train-images")

if not os.path.isdir("mnist/test-images"):
   os.makedirs("mnist/test-images")

# process train data
with open("mnist/train-labels.csv", 'w', newline='') as csvFile:
  writer = csv.writer(csvFile, delimiter=',', quotechar='"')
  for i in range(len(train_data)):
    imsave("mnist/train-images/" + str(i) + ".jpg", train_data[i][:,:,0])
    writer.writerow(["train-images/" + str(i) + ".jpg", train_labels[i]])

# repeat for test data
with open("mnist/test-labels.csv", 'w', newline='') as csvFile:
  writer = csv.writer(csvFile, delimiter=',', quotechar='"')
  for i in range(len(test_data)):
    imsave("mnist/test-images/" + str(i) + ".jpg", test_data[i][:,:,0])
    writer.writerow(["test-images/" + str(i) + ".jpg", test_labels[i]])


(Jeremy Howard) #3

Thanks for that script! Another option is to download the images as jpg from here: https://www.kaggle.com/scolianni/mnistasjpg


(Chris Palmer) #4

Thanks Jeremy :slight_smile:

Regarding my puzzlement about the valid folder containing images rather than labels: can you explain how that works? I was expecting something like a mapping from a file name to a cat or dog label. Do we instead have images, corresponding to the ones being tested in the test set, placed in either a cat or dog folder under valid?

What would we need as valid data / labels to work with the MNIST data set?


(Jeremy Howard) #5

If you use from_paths in fastai, then your folders correspond to your classes. E.g. we have a folder called ‘dogs’ that contains pics of dogs, and one called ‘cats’ that contains pics of cats. This is the most common way to store and share labeled image datasets, and it’s what we’ve used in most of our classes so far. Spend some time with the dogs v cats dataset to get a sense of how it works.
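
To make the folder convention concrete, here is a minimal sketch in plain Python (not fastai's own code) of how a label can be read off the folder an image sits in. The helper name labels_from_folders and the tiny demo layout are hypothetical, made up for illustration only.

```python
from pathlib import Path
import tempfile


def labels_from_folders(root):
    """Return (relative_path, class_name) pairs, one per .jpg file.

    Hypothetical helper: the class name is simply the name of the
    sub-folder that contains the image.
    """
    root = Path(root)
    pairs = []
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for img in sorted(class_dir.glob("*.jpg")):
            pairs.append((img.relative_to(root).as_posix(), class_dir.name))
    return pairs


# Build a throwaway layout: valid/cats/0.jpg and valid/dogs/1.jpg
with tempfile.TemporaryDirectory() as tmp:
    for cls, name in [("cats", "0.jpg"), ("dogs", "1.jpg")]:
        class_dir = Path(tmp) / "valid" / cls
        class_dir.mkdir(parents=True)
        (class_dir / name).touch()  # empty placeholder file, not a real image
    demo_pairs = labels_from_folders(Path(tmp) / "valid")
    print(demo_pairs)
```

So the valid folder does carry labels; they are encoded in the folder names rather than in a separate file. For MNIST the same idea would give ten folders, named "0" to "9".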


(Diyang Tang) #6

If you wanted to use the CSV with the pixel values, one image per line, I think the problem is that the color channels are missing. For example, in lesson 1 with cats and dogs, each image loaded through ImageClassifierData.from_paths was a 3D numpy array (height, width, channels). I think for MNIST data we’ll want ImageClassifierData.from_arrays, since it’s one CSV and not two folders, but somehow we need to turn each 1D array of 784 values into a 3D array by filling in the missing color channels.
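
A minimal NumPy sketch of that idea, assuming each CSV row holds 784 grayscale values in row-major order: reshape the row to 28x28, then repeat it three times along a new last axis. The function name row_to_rgb and the fake input row are hypothetical.

```python
import numpy as np


def row_to_rgb(row):
    """Turn one flat row of 784 pixel values into a 28x28x3 uint8 image."""
    img = np.asarray(row).astype(np.uint8).reshape(28, 28)
    # Copy the single grayscale channel into three identical channels
    return np.stack((img,) * 3, axis=-1)


# Fake data standing in for one CSV row (values stay within 0..255)
fake_row = np.arange(784) % 256
rgb = row_to_rgb(fake_row)
print(rgb.shape)  # (28, 28, 3)
```

The result is channels-last (height, width, channels), which is the layout image files usually load into.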


(Antti) #7

Thank you Chris, Jeremy and Diyang!

I (finally) managed to get this done… maybe the solution is not the most elegant one(?) but it seemed to work. In case anyone else tries this and would like to see one way of doing it, here is what I did:

As mentioned above by Jeremy, in order to use ImageClassifierData.from_paths() the training and validation data need to be in the correct folders corresponding to the classes. So first I created sub-folders for each number “0”, “1”, “2”, etc both under train/ and valid/ folders.
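
The folder-creation step is not shown in the post; a minimal sketch of it could look like the following. The helper name make_class_folders is hypothetical, and a temporary directory stands in for PATH.

```python
import os
import tempfile


def make_class_folders(path, splits=("train", "valid"), classes=range(10)):
    """Create <path>/<split>/<class>/ for every split and class."""
    created = []
    for split in splits:
        for cls in classes:
            folder = os.path.join(path, split, str(cls))
            os.makedirs(folder, exist_ok=True)  # no error if it already exists
            created.append(folder)
    return created


demo_root = tempfile.mkdtemp()  # stand-in for PATH
folders = make_class_folders(demo_root)
print(len(folders))  # 20
```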

Imports used below
import numpy as np
import pandas as pd
from PIL import Image

Reading the data from CSV
data_df = pd.read_csv(f'{PATH}train.csv', header = 0)

Pick the labels
label_df = data_df['label']

Remove the label column from the datasets containing the pixel values
del data_df['label']

Take the pixel values into array from the Pandas dataframe
data_values = data_df.values

Then the following to generate .jpg pictures from the pixels

for i in range(0, len(data_values)):

    #read the correct label
    correct_label = label_df[i]

    #split the data into training and validation sets
    if np.random.rand() < 0.8:
        folder = 'train/'
    else:
        folder = 'valid/'
    
    img = data_values[i][:]

    #reshape into 28x28 pic
    img = img.reshape(28,28)

    #we need three channels in the picture
    img = np.stack((img,)*3,axis = -1)

    #change the data type to uint8
    img = np.uint8(img)

    #create PIL Image
    new_img = Image.fromarray(img)

    #save the .jpg into correct folder
    new_img.save(f'{PATH}' + folder + str(correct_label) + '/' + str(i) + '.jpg', 'JPEG')

…and after that following the instructions in the Lesson 1 notebook.

I managed to get to 99.3% validation accuracy with resnet34 and without data augmentation.

Would anyone be able to provide guidance on what type of data augmentation could be tried out with MNIST data? (The line below is from the lesson 1 notebook.)
tfms_from_model(resnet34, sz, aug_tfms=transforms_side_on, max_zoom=1.1)

I don’t think I should flip the pictures totally around… :slight_smile:


(Jeremy Howard) #8

That looks great! BTW you might find from_csv easier than from_paths in this case.
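
For the from_csv route, the images can all sit in one folder, with a CSV file mapping each file name to its label. Below is a minimal sketch of writing such a file with pandas. The helper name write_labels_csv, the column names and the file naming scheme are assumptions, and the fastai call in the final comment is written from memory of the 0.7-era signature, so check it against your installed version.

```python
import os
import tempfile

import pandas as pd


def write_labels_csv(labels, csv_path):
    """Write a two-column CSV: image file name, then its label."""
    df = pd.DataFrame({
        "file": [f"{i}.jpg" for i in range(len(labels))],  # assumed naming
        "label": list(labels),
    })
    df.to_csv(csv_path, index=False)
    return df


labels = [5, 0, 4]  # made-up labels for three images
csv_path = os.path.join(tempfile.mkdtemp(), "labels.csv")
labels_df = write_labels_csv(labels, csv_path)
print(labels_df)

# Assumed usage (unverified; adjust to your fastai version):
# data = ImageClassifierData.from_csv(PATH, 'train', csv_path, tfms=tfms,
#                                     val_idxs=val_idxs)
```

With this layout there is no need to create one folder per class or to copy files into train/ and valid/; the validation split is chosen by row index instead.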


(Eternal) #9

I tried using

ImageClassifierData.from_arrays(path="mnist/digit-recognizer/", trn=(X_train, y_train.values), val=(X_valid, y_valid.values),classes=y_train.unique(), test=test)

to train, but every time I got a strange KeyError with a different number.
The documentation of from_arrays gives dimensions that match the MNIST dataset:

trn: a tuple of training data matrix and target label/classification array (e.g. trn=(x,y) where x has the shape of (5000, 784) and y has the shape of (5000,))

Then I read @dydt 's comment about missing color channels, so I added these lines:

X_train = X_train.values.reshape(-1, 28, 28)
X_valid = X_valid.values.reshape(-1, 28, 28)
test = test.values.reshape(-1, 28, 28) 

and expanded color channels by

X_train = np.stack((X_train,) * 3, axis=-1)
test = np.stack((test,) * 3, axis=-1)
X_valid = np.stack((X_valid,) * 3, axis=-1)

Now I get a dimension error:

Given groups=1, weight[64, 3, 7, 7], so expected input[64, 28, 28, 3] to have 3 channels, but got 28 channels instead

And if I remove the reshape, I get a different dimension error:

Expected 4D tensor as input, got 3D tensor instead.

Has anybody got this working with from_arrays? Should I stick to downloading images?
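
One reading of the first error message: the convolution received a batch shaped (64, 28, 28, 3), i.e. channels-last, while PyTorch convolution layers expect (batch, channels, height, width). The following is a minimal NumPy sketch of stacking the channels on axis 1 instead of the last axis. The function name to_channels_first is hypothetical, and this has not been tested against from_arrays itself.

```python
import numpy as np


def to_channels_first(flat, size=28):
    """Reshape (N, size*size) pixel rows to (N, 3, size, size) float32."""
    imgs = np.asarray(flat, dtype=np.float32).reshape(-1, size, size)
    # Stack three copies on axis 1 so channels come right after the batch axis
    return np.stack((imgs,) * 3, axis=1)


fake = np.zeros((5, 784))  # stand-in for X_train.values
x = to_channels_first(fake)
print(x.shape)  # (5, 3, 28, 28)
```

The earlier KeyError with a changing number is consistent with passing a pandas DataFrame rather than its .values, since integer indexing on a DataFrame looks up a column label instead of a row. That is a guess from the symptom, not something confirmed in the thread.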