A beginner's attempt at OTTO with a focus on polars

Over the last two months, I have been working on the OTTO competition, which @radek introduced to me. I can’t say I have made much progress on the LB score, but it has been a great learning experience. I will report more of what I have learnt once I have recovered from covid. Today, I would like to share my most recent notes (my reimplementation of, and experiments with, every line of code using polars) on a nice but simple EDA notebook (written in pandas by a great Kaggler). Part of the highlight of this post is polars, which I found to be much faster, easier and more efficient than pandas.

MY GOAL IN THIS COMPETITION

Goal in 3 steps

  • STEP 1 - understanding the problem including the dataset
  • STEP 2 - building the baseline or simplest pipelines
  • STEP 3 - gradually more tweaks on the pipelines

Strategy

  • experimenting and remixing great notebooks from a beginner’s perspective

Why

  • one of my goals is described in a reply and inspired by a discussion
  • benefits of dissecting and remixing great notebooks for a beginner (@radek1)

REACH MY AUDIENCE

  • through a reply to the original notebooks
  • a post on kaggle
  • through a tweet

GREAT INSIGHTS SHARED ON DISCUSSION

On Test Set

  • test sessions do not appear in the train set, but their users could have appeared in an earlier period, i.e. they may be returning customers with existing carts and wishlists source

GREAT NOTEBOOKS TO BE STUDIED

  • OTTO - Getting Started (EDA + Baseline): original notebook by #otto_edward, my notebook by #otto_dcl (this is me)

HOW IS MY REPLICA DIFFERENT FROM THE ORIGINAL NOTEBOOK

  • I experimented with every line and extracted the learning points that are useful and new to a beginner like myself; they are listed in the sections below.
  • Thanks to @radek1 for introducing the polars library to us. I have implemented almost every line of the original notebook in polars.

STAY FOCUSED

  • techniques to do EDA with polars in particular
  • understanding the problem and dataset
  • build up the simple pipeline with polars and start experimenting

Good Habits to learn

  • make using DEBUG a habit whenever I start building a notebook, otherwise experiments take too long to run later on.
  • del variables whenever a section of a notebook is finished to save RAM, otherwise it’s easy to run out of memory later on. cell
  • one version of the notebook deals with one investigation; code from scratch and write only what is necessary

TRIVIAL TYPOS FOUND IN THE ORIGINAL NOTEBOOKS #otto_dcl


KNOW YOUR TOOLS


KNOW YOUR PROBLEM #otto_edward

  • overview of the problem cell
  • what does this competition want us to predict exactly? cell
  • how differently are clicks, carts and orders weighted in the score? 1 : 3 : 6 link

BUILD YOUR PIPELINES

A basic pipeline by #otto_edward cell

This baseline will use the fact that people will often interact with articles they have previously interacted with. The prediction will consist of the top 20 most frequent articles in the session. If there are fewer than 20 articles in the session, the prediction will be padded with the most frequent articles in the training data as found above.

The logic behind

  • the more often an aid appears or is clicked in the early part of a session, the more likely it is to be clicked, carted or ordered in the later part of the session
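
To make the baseline logic concrete for myself, here is a minimal sketch in plain Python. The function name predict_session and the toy data are my own inventions for illustration, not code from the original notebook, and k is set to 5 only to keep the example small.

```python
from collections import Counter


def predict_session(session_aids, top_train_aids, k=20):
    """Return up to k aids: most frequent in the session first, then padding."""
    # Rank the session's own aids by frequency (ties keep first-seen order).
    ranked = [aid for aid, _ in Counter(session_aids).most_common(k)]
    # Pad with globally popular aids that are not already predicted.
    for aid in top_train_aids:
        if len(ranked) >= k:
            break
        if aid not in ranked:
            ranked.append(aid)
    return ranked


# Toy data, made up for illustration.
train_aids = [7, 7, 7, 8, 8, 9, 1]
top_train_aids = [aid for aid, _ in Counter(train_aids).most_common(20)]
session_aids = [1, 2, 2, 3, 2, 1]

prediction = predict_session(session_aids, top_train_aids, k=5)
print(prediction)  # [2, 1, 3, 7, 8]
```

The session contributes 2, 1 and 3 in order of frequency, and the two remaining slots are filled with the most frequent training aids.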

KNOW YOUR DATA

On Datasets
#otto_dataset_organizer_dataset : data description
#otto_radek_optimized_dataset: data description
#otto_radek_optimized_polars: created by me from a notebook that converts the 11 GB file into 1.2 GB on Kaggle; I have checked that the data are the same

earlier exploration prior to #otto_radek_optimized_polars

  • how to load the competition organizer’s train.jsonl file (11 GB) and do basic analysis with polars without blowing up the 30 GB RAM? notebook
    • pl.read_json and pl.read_ndjson won’t help, but pl.scan_ndjson and .collect() can do the trick (1 min) cell
    • how to prove there is no duplicated sessions? cell
    • now I can use polars to load and transform the otto train.jsonl to train.parquet notebook

On first session #otto_dcl

  • create the following columns of a single session with polars - notebook
    • total_num_of_rows
    • total_num_unique_aids
    • total_num_of_clicks
    • total_num_of_carts
    • total_num_of_orders
    • session_starting
    • session_ending
    • total_duration_hour

On first session #otto_edward

  • Take a look at the first session and the first action of the first session? cell
  • what are session and events? what are included inside events? what are aid, type, and ts in Unix timestamp or milliseconds? cell
  • what is the exact time duration of the first session of the train set? cell
  • what are the frequencies of ‘clicks’, ‘carts’, ‘orders’ of the first session with a dict and dict.get? cell , with pl.select and pl.filter? cell
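
A small sketch of the dict.get counting and the duration calculation on a single session. The session below is made up by me in the shape described above (session, events, and aid / ts / type inside each event); it is not real competition data.

```python
from datetime import datetime, timezone

# A made-up session; ts is a Unix timestamp in milliseconds.
session = {
    "session": 0,
    "events": [
        {"aid": 101, "ts": 1659304800000, "type": "clicks"},
        {"aid": 102, "ts": 1659304860000, "type": "clicks"},
        {"aid": 101, "ts": 1659308400000, "type": "carts"},
        {"aid": 101, "ts": 1659312000000, "type": "orders"},
    ],
}

# Count event types with dict.get, which supplies 0 for a type not seen yet.
type_counts = {}
for event in session["events"]:
    type_counts[event["type"]] = type_counts.get(event["type"], 0) + 1

# Convert the first and last timestamps (ms -> s) to timezone-aware datetimes.
start = datetime.fromtimestamp(session["events"][0]["ts"] / 1000, tz=timezone.utc)
end = datetime.fromtimestamp(session["events"][-1]["ts"] / 1000, tz=timezone.utc)

duration = end - start  # a datetime.timedelta
duration_hours = duration.total_seconds() / 3600

print(type_counts)     # {'clicks': 2, 'carts': 1, 'orders': 1}
print(duration_hours)  # 2.0
```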

On all sessions #otto_dcl

  • session features on session, aid, clicks, carts, orders with polars notebook
    • total_num_of_rows_or_actions_overall
    • total_num_of_unique_sessions_overall
    • total_num_of_unique_aids_overall
    • total_actions_in_each_session
    • total_clicks_in_each_session
    • total_carts_in_each_session
    • total_orders_in_each_session
  • session features on datetime, timestamp, duration with polars notebook
    • starting_datetime_each_session
    • ending_datetime_each_session
    • duration_seconds_each_session
    • duration_seconds_datetime_each_session notebook
    • duration_hours_datetime_each_session
    • duration_hours_int_each_session
  • how does a session end? notebook
    • sessions can end at any time, from near the starting datetime to near the ending datetime of the dataset
    • sessions can last anywhere from 0 seconds to nearly 28 days
    • sessions can end with clicks, carts or orders
  • plot distributions with seaborn notebook
    • on total_actions_in_each_session
    • on duration_hours_int_each_session

My investigation 3 : convert jsonl to parquet on Kaggle 30GB CPU with polars

On all sessions #otto_edward

  • what’s the frequency of clicks, carts, orders overall in a bar chart? cell
  • what’s the distribution of the number of actions a user takes on the otto site overall? cell
  • what’s the distribution of the duration (in hours) of a session? cell
  • what are the interesting findings from the two graphs above? cell

On all aids #otto_dcl

  • total_num_unique_aids in train set notebook
    • fetch is for fast debugging
    • collect accesses full data
    • select is a good habit to have
  • groupby aid to find features
    • total_occurs_each_unique_aid, distribution and quantiles notebook
    • total_occurs_as_clicks_each_aid cell
    • total_occurs_as_carts_each_aid
    • total_occurs_as_orders_each_aid
    • cart_over_click_ratio_each_aid cell
    • order_over_click_ratio_each_aid
    • most_occurred_carts_highest_cart_click_ratio cell
    • total_occurs_each_aid_whose_total_occurs_lt_30 cell
    • plot distribution of total_occurs_each_aid_whose_total_occurs_lt_30 cell

On all aids #otto_edward

  • what’s the distribution of aids whose occurrences are under 30 in both pandas and polars? cells
  • what are the 5 most common aids of the dataframe? cell
  • what are the most frequent aids (for clicks, carts or orders) among the first 450000 sessions? cells

Techniques Experimented #otto_dcl

  • my utils

    • load my utils for conveniences of using jupyter and debugging cell
    • how to allow multi-output in a cell cell
  • polars

    • how to read parquet file? cell
    • how to read just 100 rows of data from a huge json file with pl.scan_ndjson? cell
      • why do we need it when we have fetch? because if the file is too large, fetch can blow out the 30 GB RAM
    • polars turns a list of dictionaries in a column into a list of structs. cell
    • how to explode a list of structs above? cell
    • how to split a struct[3] into 3 separate columns? cell
    • how to cast columns from int64 to int32 and int8? cell
    • how to use if, elif, else in polars with pl.when().then().otherwise(pl.when().then().otherwise())? cell
    • how to experiment with columns of a DataFrame? cell
    • how to subset 150000 sessions? cell
    • how to subset a single session of data? cell
    • how to experiment with max, min, count on a column of data? cell
    • how to insert a value from a pl.Expr into a function with apply? cell
    • how to groupby and run ops on cols within each group? cell
    • how to add columns of the same value to a dataframe? cell
    • how to transform a row of a dataframe into a list with transpose, to_series, to_list? cell
    • on datetime, timestamp, duration with polars notebook
      • convert from datetime to timestamp using datetime library
      • create a datetime object and convert datetime to timestamp and back and forth with polars
      • use pl.duration to calculate durations, and use pl.Int64 and dt.with_time_unit('ms') to convert timestamps to datetimes
    • map (or pl.duration alike) vs apply (with timedelta) on speed and RAM usage notebook
      • how much more RAM is used by apply vs map
      • how much slower is apply vs map
      • how much faster is doing more parallel ops or making cols with map
    • How to use pl.duration which has map in a situation where groupby context is needed? notebook
      • map is used by pl.duration inferred from the error message and experiment afterward
      • use groupby to prepare inputs needed by pl.duration in a new dataframe
      • use pl.duration in the new dataframe
    • How to use apply in polars notebook
    • how to turn a dataframe (only one row) into a list? cell
    • how to convert polars duration from seconds into hours? cell
    • when to use fetch vs collect notebook notebook
    • how to create an empty DataFrame? how to check whether a DataFrame is empty or not? cell
    • how to check the size or RAM used of a DataFrame? cell

Techniques Experimented #otto_edward

  • how to take only 15,000 lines from train.jsonl into a pandas df for analysis? cell
  • how to run an experiment to time one line or a block of code? %timeit or %%timeit (for speed testing only, not for getting results) cells
  • how to get a rough idea of how long code takes to run? %time cell
  • the surest way to find out about an object is help cell
  • how to find the difference between two timestamps in milliseconds with datetime.timedelta? cell
  • how to check a long dict with a loop? cell
  • how to reorder a dict? cell
    • sort by keys dict(sorted(people.items()))
    • sort by values dict(sorted(people.items(), key=lambda item: item[1]))
  • how to find the most common/frequent items from a list with Counter and most_common? cell
  • how to check the size of an object with sys.getsizeof? cell
  • how to check the size of a dataframe with memory_usage and how to remove objects to save RAM with del and gc.collect? cell
  • how the RAM changes when loading libraries and datasets and removing objects cell
  • how to read a jsonl file in chunks with pd.read_json? cell and what is the chunks object? cell
  • how to create features for each session info from a dataframe of multiple sessions in pandas cell and polars cell
  • how to add feature columns addressing entire dataframe to the dataframe from above? cell
  • pandas
    • how to use tqdm in loops for reading file lines? cell 3
    • how to use column session as index and remove session as column? cell
    • how fast is len(set(train.session)) vs len(train.session.unique())? cell
    • how to prove that len(set(train.session)) is much slower than len(train.session.unique())? cell
    • differentiate df[0] vs df.iloc[0], df[:1] vs df.iloc[:1], df.iloc[0].item() vs df.iloc[:1].item() cells
    • how to loop through each row of a dataframe with idx? cell
    • how to use tqdm with loop of a dataframe? cell
  • seaborn plotting
    • how to draw barplot? cell
    • how to draw distributions? cell
    • how to draw a vertical line as mean for the distribution? cell

Over the last few days I still couldn’t make progress on the more complex pipeline, so instead of giving up I decided to go back and build the validation set myself rather than rely on other people’s.

It is not a trivial job: almost no one does it themselves, and most people rely on Radek’s validation set. It was not easy, I was certainly tortured by it for a few days, and I had thought of giving up altogether.

But fastai has taught me to be tenacious. So, one step forward each day, I finally made some progress, and I believe Radek’s validation set is actually corrupted. This sounds impossible: even though I have many experiments to support my claim, I still can’t quite believe it, because we all (including grandmasters) have used Radek’s validation set and no one has reported that it is corrupted.

I have shared the discovery of the problems of Radek’s validation set in this twitter thread.

Hi @radek thanks for your replies on Kaggle, however, I can’t really reply as much as I like due to the kaggle spam system. So, I have pasted my responses and notebooks here.

Hi Radek, I have made it possible to run the organizer’s script on Kaggle to create the validation set. Please see the notebook here

Then I optimized the validation set in your style and converted it into parquet, see notebook here

Then I compared several validation datasets to see which one has a cold start problem on aid, see notebook here

  • finding out which validation set has no cold start problem on aid, comparing validation from Radek’s and validations from mine notebook

As @cdeotte pointed out, Radek’s validation set is a little different because the organizer changed the script and Radek used the older version to generate his validation set.

I don’t know about others, but as a beginner in a competition, my mood shifts dramatically on a daily basis.

In the morning, I am pumped and satisfied with the little progress I make during the day;

In the late afternoon or evening, I feel defeated and the inner voice keeps telling me: what’s the point? you will never get there with such slow progress … why not do something easier and more …

Before falling asleep, I am just glad I have not given in to the thought of giving it all up.

Today’s progress so far

:joy: :rocket: reimplement organizer’s script in polars to create train_sessions or train_valid in otto validation set and verify its validity in this notebook

:scream: :joy: :rocket: :star: reimplement organizer’s script in polars to create test_valid_full or test_sessions_full and verify its validity in this notebook


I spent half a day working on this notebook and couldn’t get the result I expected. After trying all the possibilities, I finally got it working by figuring out a hidden trick of using polars.

Having read the user guide previously really helped me here. It gave me a hint, and after experimenting repeatedly in that direction, I finally got it right and discovered an important rule for using many chained filters.

Yes, I thought of quitting this little notebook after multiple failures. However, since the goal of this notebook was very small, the things to try were not too overwhelming. By hanging in there and continuing to try, I eventually made it through.

POLARS

How to print head and tail together with suffix cell
How to create random seed in polars? (actually not by polars, but np.random.seed(42)) example

How to chain expressions

  • How to chain every expression nicely? wrap the whole chain in outer () cell
  • When chaining multiple filters, must we use collect as early as possible to avoid compute errors and RAM errors? cell

How to config polars?

  • set num of rows to display? api
  • set num of cols to display? api
  • set the colwidth? api
  • how to set num of rows, cols, and colwidth in pandas? guide

How to check whether two dataframes are the same?

  • use testing.assert_frame_equal, api; both dataframes must be of the same type (both lazy or both eager) cell
  • use testing.assert_series_equal, api
  • use frame_equal to compare two dataframes cell
  • check n_unique of each column cell
  • check the total rows with count cell
  • check the first, last, min, max datetime cell

How to select and deselect columns

  • how to exclude column(s) by name, wildcard and dtypes? api
  • how to select all columns with pl.all? api
  • how to check bool values of a column to be True or not with exp.all? api and api

When if elif else needed

  • how to use if, elif, else or when, then otherwise? api cell

When a filter is needed

  • how to do filter with & and | ? api
  • how to filter a Series? api
  • how to filter inside a context? api
  • how to save RAM with filter by using & and | instead of multiple filters? cell
  • but chaining two filters can produce a different result from a single filter with & cell
  • when chaining multiple filters, we must stop being lazy as soon as possible with collect, so that RAM won’t run out cell

I have reorganized the notes of polars for today’s learning.

  • :scream: :joy: :rocket: reimplement test_sessions and test_labels and verify their validity script , notebook

today’s story

  • Rather than fearing that I can’t get this task done today, I am more worried about forgetting to write down today’s little journey. Let me write it down while I still remember most of it.
  • I have annotated the organizer’s scripts related to these two tasks. The work is tedious most of the time, but the details whose purpose I don’t yet understand eventually turn out to be important for getting the numbers exactly right.
  • I first dealt with test_labels, which was an easier task as I have done similar work previously.
  • In the process, I learnt how to set the number of rows to display and how to expand the column width in polars when the column holds a long list of values.
  • Reimplementing test_sessions was not smooth at all. The first major problem was that I couldn’t get random.seed(42) to reproduce the same random numbers inside polars. After trying and failing to integrate random.seed into polars expressions in different ways, I moved outside polars, made the random numbers reproducible there first, and then fed those numbers back into my polars expressions. It was not any slower, and maybe even faster, as it does not cause any errors.
  • The second major problem was that my test_sessions still did not match the “correct” dataframe exactly. This forced me to go back to the scripts, where I recalled the line of code that removes the last event of every session. I still don’t understand why the organizer does this, but it may help me get my numbers right. However, so far I have not figured out a way to remove the last event of every session in polars. At the moment I can’t solve it with either slice or filter.
  • Can I use shift to remove the last event of each session? cell yes, it works. cell
  • actually, filter with cumcount may also solve this problem, but I imagine it is more complex than the shift solution.
  • discovered that random.randint(1,1) works, but np.random.randint(1,1) fails. cell
  • Actually, test_sessions matches exactly with test_sessions_v7_3rd. cell

Todos:

  • Why did test_sessions match only the 3rd validation set, and not the first two?
  • Check whether test_labels match exactly
  • Notebook on reimplementing test_sessions and test_labels Continued:
    • Why does the organizer want to remove the last event after the transformation by the ground_truth function? why won’t it affect anything? cell
    • Because of the above, when working on test_labels, I need to use the original test_sessions_full without shifting the last event out. cell
    • Can unique, is_unique, is_duplicated help to get the unique rows when a column has a list in each row? No, they don’t work on lists. cell
    • How to select the unique rows when a column has lists instead of scalar values? groupby with pl.all().first() can work. cell
    • why are test_sessions from handmade and test_7v_3rd the same, but not the same as test_7v and test_7v_2nd? the seed is 42 in all of them; maybe paperspace and kaggle have different settings?
    • Can’t compare test_labels directly because they contain lists, so maybe turn each list into a long string and then compare? cell
    • How to compare series using series_equal? cell
    • Let’s count and compare the lengths of the lists in the ‘ground_truth’ column to see whether they are the same length (same) cell, cell2
    • Why not compare the sums of the lists between the two test_labels dataframes? cell Are the differences between the two sum columns all zero? cell
    • How to do ops on two cols from two different dataframes? cell
    • Yes, the sums of the lists are all the same. So, the difference may just be a different ordering of the lists in the ‘ground_truth’ column. cell
    • How to sort the lists of ground_truth columns? cell Are they same after sorting the lists? cell
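
The reasoning in the steps above can be sketched in plain Python with made-up lists. Equal lengths and equal sums are only consistent with a reordering; sorting each list before comparing is what actually settles it.

```python
# Two made-up ground_truth columns for the same three sessions.
labels_a = [[3, 1, 2], [10, 20], [7]]
labels_b = [[1, 2, 3], [20, 10], [7]]

# Direct comparison fails because the element order differs.
same_as_is = labels_a == labels_b

# Equal lengths and equal sums are consistent with a pure reordering,
# but they do not prove it (e.g. [1, 4] and [2, 3] share length and sum).
same_lengths = [len(x) for x in labels_a] == [len(x) for x in labels_b]
same_sums = [sum(x) for x in labels_a] == [sum(x) for x in labels_b]

# Sorting each list first makes the comparison order-insensitive and decisive.
same_after_sort = [sorted(x) for x in labels_a] == [sorted(x) for x in labels_b]

print(same_as_is, same_lengths, same_sums, same_after_sort)
# False True True True
```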

Todos (tomorrow)

Are my handmade train and test of the full dataset, and train_sessions, test_sessions_full, test_sessions and test_labels of the validation set, the same as the ones generated by the organizer’s script?

:joy: :star: Compare my test_sessions and test_labels with those of 1st validation set (notebook, optimized parquet dataset) and 2nd validation set (notebook and optimized parquet dataset): Done! (Same) notebook

:joy: :star: Compare 5th validation set (jsonl dataset created on paperspace without pipenv, notebook to create optimized-parquet dataset on Kaggle) with 4th validation set: Done! (validation 1st, 2nd and 5th are the same, and different from 3rd and 4th, as their jsonls were created on paperspace, even though the 5th was created without pipenv) notebook


I suspected that validation-1st, 2nd, 5th (created on paperspace) differ from validation-3rd, 4th (created on kaggle) because of the pipenv environment. However, this notebook proved otherwise, i.e., pipenv has nothing to do with the difference.

Todos tomorrow

  • reimplement organizer’s evaluate.py script on kaggle
  • how to combine my implementation together in one notebook
    • story:
      • this way I can split any subset of train into train_sessions, test_sessions and test_labels for fast experimentation on training and evaluating
      • Radek seems to have a simpler approach (not exactly like the organizer’s script) code

:scream: :joy: :rocket: how to debug to understand each line of the script above: notebook


Story today

Debugging with return, pp
user guide on pp
:scream: :joy: :rocket: how to debug to understand each line of a script: example notebook

  • what does submission.csv look like? debugging predictions = f.readlines()[1:] see cell
  • what does test_labels look like by debugging labels = f.readlines(), see cell
  • what does each label in for label in tqdm(labels, desc="Preparing labels"): look like, see cell
  • what does final_labels from prepare_labels(labels) look like? see cell
  • what does sid_type, preds = prediction.strip().split(",") in def prepare_predictions(predictions): look like? see cell
  • how each prediction is converted to prepared_predictions in def prepare_predictions(predictions):, see cell
  • what does def num_events(labels, k: int): do, see cell
  • what does evaluate_session do for each session, see cell; also make the debugging super fast by using subset labels = f.readlines()[:10] # add [:1000] for fast run, see cell
  • what does evaluate_sessions do for all sessions when a session is predicted, see cell; when a session is not predicted, see cell;
  • what does it mean when click_hits or cart_hits is None, see cell
  • what to do when a session is not predicted, see cell
  • what does def recall_by_event_type(evalutated_events: dict, total_number_events: dict): do? see cell
  • what does def weighted_recalls(recalls: dict, weights: dict): do? see cell
  • :scream: reimplement organizer’s evaluate.py script on kaggle
    • run organizer’s evaluate.py script on kaggle, using the evaluate code in a pipeline notebook Done!
    • :scream: :joy: :rocket: how to debug to understand each line of the script above: notebook and story inplace or forum Done!
    • :scream: implement the script above in polars
      • implement prepare_labels and prepare_predictions, see cell Done!
      • implement num_events(labels, k), see cell, confirmed by this cell Done!
      • implement evaluate_session and evaluate_sessions, evaluated_events can be confirmed by this cell
    • :scream: compare Radek’s a-robust-local-validation-framework which includes his evaluate script
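
To show what the parsing step does, here is a simplified sketch built around the sid_type, preds = prediction.strip().split(",") line quoted above. The submission lines are made up, and the function body is my own condensed version rather than the organizer’s exact code.

```python
def prepare_predictions(lines):
    """Parse 'session_type,aid aid aid' lines into {session: {type: [aids]}}."""
    prepared = {}
    for line in lines:
        sid_type, preds = line.strip().split(",")
        sid, event_type = sid_type.split("_")
        # An empty prediction string becomes an empty list.
        aids = [int(aid) for aid in preds.split(" ")] if preds else []
        prepared.setdefault(int(sid), {})[event_type] = aids
    return prepared


# Made-up submission content; the first line is the header.
submission_lines = [
    "session_type,labels\n",
    "100_clicks,11 12 13\n",
    "100_carts,11\n",
    "100_orders,\n",
]

# Skip the header, like f.readlines()[1:] does.
predictions = prepare_predictions(submission_lines[1:])
print(predictions)
# {100: {'clicks': [11, 12, 13], 'carts': [11], 'orders': []}}
```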

:scream: :joy: :rocket: implement the script above in polars

  • implement prepare_labels and prepare_predictions, see cell Done!
  • implement num_events(labels, k), see cell, confirmed by this cell Done!
  • implement evaluate_session and evaluate_sessions, evaluated_events can be confirmed by this cell Done!
  • implement click_hits, cell Done!
  • implement cart_hits, cell Done!
  • implement order_hits, cell Done!
  • join them together, cell Done!
  • to confirm my implementation result is the same as the organizer’s result, cell Done!
  • implement recall_by_event_type and weighted_recalls, check script in cell, and implemented cell, confirmed cell Done!
  • :scream: using reimplementation notebooks above to split any subset of train into train_sessions, test_sessions and test_labels for fast experimentation on training and evaluating Todo
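
For reference, this is how I picture the scoring as a whole, in plain Python. It is a simplified sketch under my own assumptions, not the organizer’s evaluate.py: the weights 0.10 / 0.30 / 0.60 are assumed from the 1 : 3 : 6 ratio, every label (including clicks) is stored as a list, and a session that is not predicted contributes zero hits but still counts in the totals.

```python
K = 20
# Assumed weights, following the 1 : 3 : 6 ratio.
WEIGHTS = {"clicks": 0.10, "carts": 0.30, "orders": 0.60}


def score(labels, predictions, k=K, weights=WEIGHTS):
    """labels / predictions: {session: {type: [aids]}}. Returns (recalls, weighted score)."""
    hits = {t: 0 for t in weights}
    totals = {t: 0 for t in weights}
    for session, truth_by_type in labels.items():
        for event_type, truth in truth_by_type.items():
            # Only the first k predictions count; a missing session gives no hits.
            predicted = predictions.get(session, {}).get(event_type, [])[:k]
            hits[event_type] += len(set(predicted) & set(truth))
            totals[event_type] += min(len(set(truth)), k)
    recalls = {t: hits[t] / totals[t] for t in weights if totals[t] > 0}
    return recalls, sum(weights[t] * r for t, r in recalls.items())


# Made-up labels and predictions; session 2 is not predicted at all.
labels = {
    1: {"clicks": [5], "carts": [5, 6], "orders": [6]},
    2: {"clicks": [9], "carts": [], "orders": []},
}
predictions = {1: {"clicks": [5, 7], "carts": [5], "orders": [8]}}

recalls, total = score(labels, predictions)
print(recalls, total)
```

With this toy data the click and cart recalls are both 0.5 and the order recall is 0, so orders, which carry the largest weight, drag the total down to about 0.2.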