Tabular: validation set percentage

muellerzr · June 13, 2019, 3:33pm

And we cannot do learn.data.test_dl because it’s unlabeled

Bliss · June 13, 2019, 3:37pm

But what is the purpose of the .add_test directive?
I can see you use the test dataset afterwards (as you indicate, making it the validation set), but I don’t see why it is needed when “data” is created…

muellerzr · June 13, 2019, 3:39pm

Because all three are tied together for Learner. This is mostly used for Kaggle competitions, where we have separate train and test CSVs. When you finish training, you just call learn.get_preds(DatasetType.Test) and it will give you the predictions for the test set.

However, for a labeled test set, we don’t add a test. It’s not needed, as shown in the link above.

Bliss · June 13, 2019, 3:51pm

Ok, I think I get what I am doing… or almost

As I think is expected, the accuracy of the test set is very similar to the accuracy of the last epoch of the training…

muellerzr · June 13, 2019, 3:53pm

So long as you are switching the dataloaders, yes, that’s what I’ve noticed too. You can also verify this by calling learn.data. The validation set should now be your test set.

Bliss · June 13, 2019, 3:54pm

I was checking some of your GitHub examples.
Is there any one that is a simple tabular analysis sequence (datasets creation, train, test accuracy)?

Or maybe from someone else… I want to check if I could be doing anything else than what I am doing today…

muellerzr · June 13, 2019, 3:56pm

No, as I wasn’t sure if anyone wanted it. Give me about 15-20 minutes!

mindtrinket · June 13, 2019, 4:02pm

I put together a simple tabular one for Credit Card Fraud. I feel it sits in between Lesson 4 and Rossmann.

github.com

jamesdietle/Kaggle2019/blob/master/Credit_card_Fraud.ipynb


I am excited to look through some of @muellerzr’s code tonight. I am always looking for better ways to make a validation set.

muellerzr · June 13, 2019, 4:31pm

@mindtrinket @Bliss here is a notebook that shows an example of what to do and not to do via the Lesson 4 Tabular Notebook example:

github.com

muellerzr/FastAI-Test-Set-Generation/blob/master/Labeled_Test_Set.ipynb


This setup also allows us to run ClassificationInterpretation on the test set to analyze what we were missing and by how much.

naoki · June 24, 2019, 5:01pm

Are df.iloc[start:end] and .split_by_idx(list(range(start, end))) referring to the very same rows? Shouldn’t validation and test use distinct rows?

muellerzr · June 24, 2019, 5:05pm

I go into a different way of doing it in the notebook above, but that’s the same as what is done in the Tabular example. Everything above it is the training set, below is the validation set, and in the middle is the test set.
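
To make the row-overlap question concrete, here is a minimal pure-Python sketch (no fastai or pandas required). It relies on the fact that a positional slice such as df.iloc[start:end] and list(range(start, end)) are both half-open, so with the same start and end they select the same row positions. The helper name split_indices, the row count of 1000, and the specific ranges are hypothetical and chosen only for illustration; they are not taken from the notebooks linked above.

```python
def split_indices(n_rows, valid_range, test_range):
    """Return (train, valid, test) index lists; raise if valid and test overlap.

    Hypothetical helper for illustration only, not part of fastai.
    """
    valid_idx = list(range(*valid_range))
    test_idx = list(range(*test_range))
    overlap = set(valid_idx) & set(test_idx)
    if overlap:
        raise ValueError(f"validation and test share {len(overlap)} rows")
    held_out = set(valid_idx) | set(test_idx)
    train_idx = [i for i in range(n_rows) if i not in held_out]
    return train_idx, valid_idx, test_idx


rows = list(range(1000))  # stand-in for the row positions of a DataFrame
start, end = 800, 1000

# A positional slice and range(start, end) pick the same positions,
# so using both with identical bounds means validation == test rows.
same_rows = rows[start:end] == list(range(start, end))

# Distinct ranges give disjoint validation and test sets (assumed bounds).
train_idx, valid_idx, test_idx = split_indices(1000, (800, 900), (900, 1000))
```

If the validation and test sets are meant to be distinct, the index list passed to split_by_idx and the slice used to build the test DataFrame would need different bounds, as in the last line of the sketch.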

AdvancingCat · October 4, 2019, 9:44am

I have the same question as you. Have you found an answer?