How to use Kaggle's MNIST data with ImageClassifierData?


I am trying to run the lesson 1 notebook with Kaggle’s MNIST data, but having some problems understanding how to use ImageClassifierData with the dataset (or maybe I should approach this differently? :slight_smile: ).

The data is provided in a CSV format, in a way that the pixel values of the 28x28 images are on a single line in 784 columns. In the examples we used in lesson 1 & 2 the input was .jpg files.
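For readers unfamiliar with this layout, here is a minimal sketch of how one CSV row maps back to an image. The row is synthetic and the variable names are illustrative, not taken from the Kaggle files. It assumes the label comes first in the raw row, followed by the 784 pixel values in row-major order.

```python
import numpy as np

# Synthetic stand-in for one CSV row: a label followed by 784 pixel values.
row = np.concatenate(([7], np.arange(784) % 256))

label = int(row[0])                               # the digit class
pixels = row[1:]                                  # the 784 flattened pixel values
image = pixels.reshape(28, 28).astype(np.uint8)   # back to a 28x28 grid

print(label, image.shape, image.dtype)
```

Each group of 28 consecutive values becomes one image row, so pixel 28 of the flat vector ends up at `image[1, 0]`.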

The input data looks like the following - the correct label [number between 0-9] is in the second column:

I am able to make the pictures visible by running the following (after removing the label column)

However, what should I do to get the data into the appropriate format (including handling the column with the correct label), so that I could afterwards run the following?

learn = ConvLearner.pretrained(arch, data, precompute=True)
learn.fit(lr, 3)

Any help would be highly appreciated :slight_smile: !



Not sure if this will help you along your way, but if it does please let me know. I adapted this script to work with Python 3.6; it converts the MNIST files into JPG. I thought this was a way forward for our setup, but I haven’t yet followed through on getting it completely configured for our use. One thing that has puzzled me is that the valid directory / technique uses images, whereas I was expecting labels like 1 or 0. So that eludes me a bit…

# -*- coding: utf-8 -*-
"""
Created on Sun Nov  5 17:51:01 2017

Adapted for Python 3.6 from :

Simple python script which takes the mnist data from tensorflow and builds a data
set based on jpg files and text files containing the image paths and labels.
Parts of it are from the mnist tensorflow example.
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gzip
import os
import sys
import time

from six.moves import urllib
from six.moves import xrange  # pylint: disable=redefined-builtin
from scipy.misc import imsave
import tensorflow as tf
import numpy as np
import csv


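# NOTE: these constants are missing from the post. The values below are assumed
# from the TensorFlow MNIST example this script is adapted from; adjust as needed.
SOURCE_URL = 'http://yann.lecun.com/exdb/mnist/'
WORK_DIRECTORY = 'data'
IMAGE_SIZE = 28
PIXEL_DEPTH = 255

def maybe_download(filename):
  """Download the data from Yann's website, unless it's already here."""
  if not tf.gfile.Exists(WORK_DIRECTORY):
    tf.gfile.MakeDirs(WORK_DIRECTORY)
  filepath = os.path.join(WORK_DIRECTORY, filename)
  if not tf.gfile.Exists(filepath):
    filepath, _ = urllib.request.urlretrieve(SOURCE_URL + filename, filepath)
    with tf.gfile.GFile(filepath) as f:
      size = f.Size()
    print('Successfully downloaded', filename, size, 'bytes.')
  return filepath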

def extract_data(filename, num_images):
  """Extract the images into a 4D tensor [image index, y, x, channels].

  Values are kept in [0, 255]; the rescaling to [-0.5, 0.5] is commented out
  below so the pixels can be saved directly as images.
  """
  print('Extracting', filename)
  with gzip.open(filename) as bytestream:
    bytestream.read(16)  # skip the 16-byte IDX header
    buf = bytestream.read(IMAGE_SIZE * IMAGE_SIZE * num_images)
    data = np.frombuffer(buf, dtype=np.uint8).astype(np.float32)
    #data = (data - (PIXEL_DEPTH / 2.0)) / PIXEL_DEPTH
    data = data.reshape(num_images, IMAGE_SIZE, IMAGE_SIZE, 1)
    return data

def extract_labels(filename, num_images):
  """Extract the labels into a vector of int64 label IDs."""
  print('Extracting', filename)
  with gzip.open(filename) as bytestream:
    bytestream.read(8)  # skip the 8-byte IDX header
    buf = bytestream.read(1 * num_images)
    labels = np.frombuffer(buf, dtype=np.uint8).astype(np.int64)
  return labels

train_data_filename = maybe_download('train-images-idx3-ubyte.gz')
train_labels_filename = maybe_download('train-labels-idx1-ubyte.gz')
test_data_filename = maybe_download('t10k-images-idx3-ubyte.gz')
test_labels_filename = maybe_download('t10k-labels-idx1-ubyte.gz')

# Extract it into np arrays.
train_data = extract_data(train_data_filename, 60000)
train_labels = extract_labels(train_labels_filename, 60000)
test_data = extract_data(test_data_filename, 10000)
test_labels = extract_labels(test_labels_filename, 10000)

if not os.path.isdir("mnist/train-images"):
  os.makedirs("mnist/train-images")

if not os.path.isdir("mnist/test-images"):
  os.makedirs("mnist/test-images")

# process train data
with open("mnist/train-labels.csv", 'w', newline='') as csvFile:
  writer = csv.writer(csvFile, delimiter=',', quotechar='"')
  for i in range(len(train_data)):
    imsave("mnist/train-images/" + str(i) + ".jpg", train_data[i][:,:,0])
    writer.writerow(["train-images/" + str(i) + ".jpg", train_labels[i]])

# repeat for test data
with open("mnist/test-labels.csv", 'w', newline='') as csvFile:
  writer = csv.writer(csvFile, delimiter=',', quotechar='"')
  for i in range(len(test_data)):
    imsave("mnist/test-images/" + str(i) + ".jpg", test_data[i][:,:,0])
    writer.writerow(["test-images/" + str(i) + ".jpg", test_labels[i]])
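
A note on the file format the script reads: MNIST IDX files begin with a big-endian header (a 4-byte magic number plus one 4-byte size per dimension, so 16 bytes for image files and 8 bytes for label files), and extraction code has to skip it before reading pixel or label bytes. The following is a minimal sketch of parsing that header on a tiny in-memory file; `read_idx_header` is a hypothetical helper, not part of the script or of TensorFlow.

```python
import gzip
import io
import struct

def read_idx_header(stream):
    """Return (magic, dims) for an IDX stream positioned at its start."""
    magic, = struct.unpack('>I', stream.read(4))
    ndim = magic & 0xFF                      # low byte holds the number of dimensions
    dims = struct.unpack('>' + 'I' * ndim, stream.read(4 * ndim))
    return magic, dims

# Build a tiny fake image file in memory: 2 images of 28x28 zero pixels.
payload = struct.pack('>IIII', 2051, 2, 28, 28) + bytes(2 * 28 * 28)
fake_file = io.BytesIO(gzip.compress(payload))

with gzip.open(fake_file) as f:
    magic, dims = read_idx_header(f)
    pixels = f.read(dims[0] * dims[1] * dims[2])   # the bytes after the header

print(magic, dims, len(pixels))
```

Reading the sizes from the header instead of hard-coding 60000 or 10000 also makes it easy to check that a download is complete.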


Thanks for that script! Another option is to download the images as jpg from here:


Thanks Jeremy :slight_smile:

Regarding my puzzlement re the valid folder containing images rather than labels - can you explain how that works? I was expecting something like a mapping of a file name to either a cat or dog label, do we have instead the same / corresponding image as being tested in the test set in either a cat or dog folder under valid?

What would we need as valid data / labels to work with the MNIST data set?

If you use from_paths in fastai, then your folders correspond to your classes. E.g. we have a folder called ‘dogs’ that contains pics of dogs, and one called ‘cats’ that contains pics of cats. This is the most common way to store and share labeled image datasets, and it’s what we’ve used in most of our classes so far. Spend some time with the dogs v cats dataset to get a sense of how it works.
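
To make the folder convention concrete, here is a small sketch of how a label can be recovered from the folder an image sits in, which is why valid/ holds images rather than a separate label list. This is not fastai's implementation; `labels_from_folders` and the file names are hypothetical, and the image files are empty placeholders.

```python
import tempfile
from pathlib import Path

def labels_from_folders(split_dir):
    """Map each image file under split_dir/<class>/ to its class name."""
    labels = {}
    for class_dir in sorted(p for p in Path(split_dir).iterdir() if p.is_dir()):
        for img in sorted(class_dir.glob('*.jpg')):
            labels[img.name] = class_dir.name   # the folder name is the label
    return labels

# Hypothetical layout: valid/3/10.jpg and valid/7/11.jpg (empty placeholder files).
root = Path(tempfile.mkdtemp())
for cls, fname in [('3', '10.jpg'), ('7', '11.jpg')]:
    (root / 'valid' / cls).mkdir(parents=True)
    (root / 'valid' / cls / fname).touch()

valid_labels = labels_from_folders(root / 'valid')
print(valid_labels)
```

For MNIST this would mean ten folders, "0" through "9", under both train/ and valid/.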

If you want to use the CSV with the pixel values, one image per line, I think the problem is that the color channels are missing. For example, in lesson 1 with cats and dogs, each image fed in through ImageClassifierData.from_paths was a 3D numpy array. I think for MNIST data we’ll want ImageClassifierData.from_arrays, since it’s 1 CSV and not 2 folders, but somehow we need to turn each 1D array into a 3D array by filling in the missing color channels.


Thank you Chris, Jeremy and Diyang!

I (finally) managed to get this done… maybe the solution is not the most elegant one(?) but it seemed to work. If anyone else ends up trying this out and would like to see what could be done, the following is what I did:

As mentioned above by Jeremy, in order to use ImageClassifierData.from_paths() the training and validation data need to be in the correct folders corresponding to the classes. So first I created sub-folders for each number “0”, “1”, “2”, etc both under train/ and valid/ folders.

Reading the data from CSV
data_df = pd.read_csv(f'{PATH}train.csv', header = 0)

Pick the labels
label_df = data_df['label']

Remove the label column from the datasets containing the pixel values
del data_df['label']

Take the pixel values into array from the Pandas dataframe
data_values = data_df.values

Then the following to generate .jpg pictures from the pixels

for i in range(0, len(data_values)):

    #read the correct label
    correct_label = label_df[i]

    #split the data into training and validation sets
    if np.random.rand() < 0.8:
        folder = 'train/'
    else:
        folder = 'valid/'
    img = data_values[i][:]

    #reshape into 28x28 pic
    img = img.reshape(28,28)

    #we need three channels into the picture
    img = np.stack((img,)*3,axis = -1)

    #change the data type to uint8
    img = np.uint8(img)

    #create PIL Image
    new_img = Image.fromarray(img)

    #save the .jpg into correct folder
    new_img.save(f'{PATH}' + folder + str(correct_label) + '/' + str(i) + '.jpg', 'JPEG')

…and after that following the instructions in the Lesson 1 notebook.

I managed to get to 99.3% validation accuracy with resnet34 and without data augmentation.

Would anyone be able to provide guidance on what type of data augmentation could be tried out with MNIST data (the below is from lesson1 notebook)
tfms_from_model(resnet34, sz, aug_tfms=transforms_side_on, max_zoom=1.1)

I don’t think I should flip the pictures totally around… :slight_smile:


That looks great! BTW you might find from_csv easier than from_paths in this case.
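
A rough sketch of what preparing a labels file for from_csv could look like, using only the standard library. The layout is an assumption (a header row followed by one file name and label per line), and `labels.csv` and the `id`/`label` column names are hypothetical, so check the fastai source for your version before relying on it.

```python
import csv
import os
import tempfile

labels = [5, 0, 4, 1]   # synthetic labels; in the thread these come from the 'label' column

csv_path = os.path.join(tempfile.mkdtemp(), 'labels.csv')   # hypothetical file name
with open(csv_path, 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['id', 'label'])            # header row (assumed to be skipped on read)
    for i, label in enumerate(labels):
        writer.writerow([f'{i}.jpg', label])    # image i was saved as "<i>.jpg"

# Read the file back to confirm what was written.
with open(csv_path, newline='') as f:
    rows = list(csv.reader(f))
print(rows)
```

With this approach all images can live in a single folder, and the train/validation split is chosen by index rather than by directory.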


I tried using

ImageClassifierData.from_arrays(path="mnist/digit-recognizer/", trn=(X_train, y_train.values), val=(X_valid, y_valid.values),classes=y_train.unique(), test=test)

to train, but every time I got a strange KeyError with a different number.
The documentation of from_arrays describes dimensions that match the MNIST dataset:

trn: a tuple of training data matrix and target label/classification array (e.g. trn=(x,y) where x has the shape of (5000, 784) and y has the shape of (5000,))

Then I read @dydt 's comment about missing color channels so I added the lines.

X_train = X_train.values.reshape(-1, 28, 28)
X_valid = X_valid.values.reshape(-1, 28, 28)
test = test.values.reshape(-1, 28, 28) 

and expanded color channels by

X_train = np.stack((X_train,) * 3, axis=-1)
test = np.stack((test,) * 3, axis=-1)
X_valid = np.stack((X_valid,) * 3, axis=-1)

I get a dimension error.

Given groups=1, weight[64, 3, 7, 7], so expected input[64, 28, 28, 3] to have 3 channels, but got 28 channels instead

And if I remove reshape I get dimension error.

Expected 4D tensor as input, got 3D tensor instead.

Has anybody got this working with from_arrays? Should I stick to downloading images?
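
One possible reading of the first error: weight[64, 3, 7, 7] is laid out as (output channels, input channels, height, width), so the convolution treats the second axis of the input as the channel axis, and with input[64, 28, 28, 3] that axis has size 28. The sketch below shows the difference between stacking on the last axis and stacking right after the batch axis. Whether from_arrays passes arrays through in this layout is an assumption that has not been verified against fastai; the array here is synthetic.

```python
import numpy as np

# Synthetic stand-in for the reshaped pixels: 5 grayscale images of 28x28.
X = np.zeros((5, 28, 28), dtype=np.float32)

channels_last = np.stack((X,) * 3, axis=-1)    # what the post above does
channels_first = np.stack((X,) * 3, axis=1)    # channel axis right after the batch axis

print(channels_last.shape)    # (5, 28, 28, 3)
print(channels_first.shape)   # (5, 3, 28, 28)
```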

For anyone getting KeyError: 0 when trying to run

Make sure that the y-values you pass to ImageClassifierData.from_arrays are a numpy.ndarray and not a pandas.core.frame.DataFrame.

I’m not sure, but I think what is happening is that the code is trying to access the 0th element of y_train using y_train[0], which would not work on a dataframe (you would have to use y_train.iloc[0] for that).

You can check the type of your variables with type(variable).

You can convert a dataframe to a numpy array by calling .values on the dataframe, i.e.

y = df[['label']].values.flatten()

Note that I also called .flatten() on the resulting numpy array. This is to get a 1D array rather than a matrix with dimensions (number of labels × 1), which would give you an error about “multiple targets being unsupported”.
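
A small sketch of both points on a synthetic dataframe (the three labels are made up): integer indexing on a DataFrame is a column lookup and raises KeyError, while .values followed by .flatten() gives a plain 1D array that can be indexed by position.

```python
import pandas as pd

df = pd.DataFrame({'label': [3, 1, 4]})   # synthetic labels

try:
    df[0]                      # a DataFrame looks up a *column* named 0
    raised = False
except KeyError:
    raised = True

column = df[['label']].values             # 2D: one row per label, one column
y = column.flatten()                      # 1D array of labels

print(raised, column.shape, y.shape, y[0])
```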

I hope this helps anyone stuck with a cryptic KeyError :slight_smile:


Thank you so much for this code. The only addition is some code to make folders:

for i in range(10):
    train_path = f'{PATH}' + 'train/' + str(i)
    if not os.path.exists(train_path):
        os.makedirs(train_path)

    valid_path = f'{PATH}' + 'valid/' + str(i)
    if not os.path.exists(valid_path):
        os.makedirs(valid_path)

Hi guys,

using the advice given here (thanks!) I have managed to apply the techniques discussed in lesson 1 on MNIST data directly via Kaggle notebooks using ImageClassifierData.from_arrays():

It was quite some hassle, as I was new to Kaggle and had to deal with file system permissions and other stuff before I learned I could turn on internet access :roll_eyes:
Regarding the results: note that I just naively applied the techniques without giving it too much thought - use at your own risk :wink:


Did you get a solution to this problem?