A beginner's attempt at OTTO with a focus on polars

Daniel · December 22, 2022, 2:54am

Over the last two months, I have been working on the OTTO competition, which @radek introduced to me. I can’t say I have made much progress on the LB score, but it has been a great learning experience. I will keep reporting what I learn once I get more strength back from covid. Today, I would like to share my most recent notes (my reimplementation of, and experiments with, every line of code using polars) on a nice but simple EDA notebook (written in pandas by a great Kaggler). Part of the highlight of this post is polars, which I found to be much faster, easier and more efficient than pandas.

MY GOAL IN THIS COMPETITION

Goal in 3 steps

STEP 1 - understanding the problem including the dataset
STEP 2 - building the baseline or simplest pipelines
STEP 3 - gradually more tweaks on the pipelines

Strategy

experimenting and remixing great notebooks from a beginner’s perspective

Why

one of my goals is described in a reply and inspired by a discussion
benefits of dissecting and remixing great notebooks for a beginner (@radek1)

REACH MY AUDIENCE

through a reply to the original notebooks
a post on kaggle
through a tweet

GREAT INSIGHTS SHARED ON DISCUSSION

On Test Set

test sessions do not appear in the train set, but they could belong to customers who were active in an earlier period, i.e. returning customers with existing carts and wishlists source

GREAT NOTEBOOKS TO BE STUDIED

OTTO - Getting Started (EDA + Baseline) notebook original by #otto_edward my notebook by #otto_dcl (this is me)

HOW IS MY REPLICATION DIFFERENT FROM THE ORIGINAL NOTEBOOK

I experimented with every line and extracted the learning points which are useful and new to a beginner like myself in the sections below.
Thanks to @radek1 for introducing the polars library to us, I have implemented almost every line of the original notebook in polars.

STAY FOCUSED

techniques to do EDA with polars in particular
understanding the problem and dataset
build up the simple pipeline with polars and start experimenting

Good Habits to learn

make using DEBUG a habit whenever I start building a notebook, otherwise it takes too long to run experiments later on.
del variables whenever a section of a notebook is finished to save RAM, otherwise it’s easy to run out of memory later on. cell
one version of the notebook deals with one investigation; code from scratch; write only what is necessary

TRIVIAL TYPOS FOUND IN THE ORIGINAL NOTEBOOKS #otto_dcl

on the use of tqdm with total discussion
on the most frequent aids discussion (not that trivial)
on the predictions generation error, see discussion, see cells for investigation (not that trivial, as it contributes 0.001 up in LB score)

KNOW YOUR TOOLS

User guides, APIs
docs of polars in a cell
presentations of polars 2021.12
for pandas user tutorial

KNOW YOUR PROBLEM #otto_edward

overview of the problem cell
what does this competition want us to predict exactly? cell
how differently are clicks, carts and orders weighted in the score? 1 : 3 : 6 link
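
As a small worked example of the 1 : 3 : 6 weighting, here is a sketch that combines three per-type recall values into one score. The weights 0.10, 0.30 and 0.60 are the ones behind that ratio; the recall values and the function name are made up for illustration.

```python
# Weights for the three event types (ratio 1 : 3 : 6).
WEIGHTS = {"clicks": 0.10, "carts": 0.30, "orders": 0.60}


def weighted_score(recalls: dict) -> float:
    """Combine per-type recall values into a single weighted score."""
    return sum(WEIGHTS[event_type] * recalls[event_type] for event_type in WEIGHTS)


# Hypothetical recall values, for illustration only.
example_recalls = {"clicks": 0.5, "carts": 0.4, "orders": 0.6}
score = weighted_score(example_recalls)
```

Because orders carry six times the weight of clicks, an improvement in order recall moves the score far more than the same improvement in click recall.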

BUILD YOUR PIPELINES

A basic pipeline by #otto_edward cell

This baseline will use the fact that people will often interact with articles they have previously interacted with. The prediction will consist of the top 20 most frequent articles in the session. If there are fewer than 20 articles in the session, the prediction will be padded with the most frequent articles in the training data as found above.

The logic behind

the more appearances or clicks an aid has in the early part of a session, the more likely it will be clicked, carted or ordered in the later part of the session
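
A minimal sketch of this baseline in plain Python, assuming the session's aids and the most frequent training aids are already available as lists. The function name and the sample values are hypothetical, and skipping padding aids that are already predicted is my assumption, not something taken from the original notebook.

```python
from collections import Counter


def predict_session(session_aids, top_train_aids, k=20):
    """Return up to k aids: most frequent in the session first, then padding."""
    # Most frequent aids in this session, highest count first.
    preds = [aid for aid, _ in Counter(session_aids).most_common(k)]
    # Pad with globally popular aids that are not already predicted.
    for aid in top_train_aids:
        if len(preds) >= k:
            break
        if aid not in preds:
            preds.append(aid)
    return preds


top_train_aids = [101, 102, 103, 104, 105]  # hypothetical popular aids
session_aids = [7, 9, 7, 101, 9, 7]         # hypothetical session
preds = predict_session(session_aids, top_train_aids, k=5)
```

With k=5 the session contributes its three distinct aids, and the remaining two slots are filled from the popular list.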

KNOW YOUR DATA

On Datasets
#otto_dataset_organizer_dataset : data description
#otto_radek_optimized_dataset: data description
#otto_radek_optimized_polars: created by me from notebook converting 11 GB into 1.2 GB on Kaggle, I have checked the sameness

earlier exploration prior to #otto_radek_optimized_polars

how to load the competition organizer’s train.jsonl file (11 GB) and do basic analysis with polars without exceeding 30 GB of RAM? notebook
- pl.read_json and pl.read_ndjson won’t help, but pl.scan_ndjson and .collect() can do the trick (1 min) cell
- how to prove there are no duplicated sessions? cell
- now I can use polars to load and transform the otto train.jsonl to train.parquet notebook

On first session #otto_dcl

create the following columns of a single session with polars - notebook
- total_num_of_rows
- total_num_unique_aids
- total_num_of_clicks
- total_num_of_carts
- total_num_of_orders
- session_starting
- session_ending
- total_duration_hour

On first session #otto_edward

Take a look at the first session and the first action of the first session? cell
what are session and events? what are included inside events? what are aid, type, and ts in Unix timestamp or milliseconds? cell
what is the exact time duration of the first session of the train set? cell
what are the frequencies of ‘clicks’, ‘carts’, ‘orders’ of the first session with a dict and dict.get? cell , with pl.select and pl.filter? cell
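
For the dict and dict.get version of the frequency count, a minimal sketch on made-up events looks like this:

```python
# Events of one made-up session, in the same shape as the raw jsonl.
events = [
    {"aid": 11, "ts": 1659304800025, "type": "clicks"},
    {"aid": 12, "ts": 1659304904511, "type": "clicks"},
    {"aid": 11, "ts": 1659367439426, "type": "carts"},
]

action_counts = {}
for event in events:
    # dict.get returns 0 the first time a type is seen, avoiding a KeyError.
    action_counts[event["type"]] = action_counts.get(event["type"], 0) + 1

# Types that never occur are simply absent, so read them with a default too.
num_orders = action_counts.get("orders", 0)
```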

On all sessions #otto_dcl

session features on session, aid, clicks, carts, orders with polars notebook
- total_num_of_rows_or_actions_overall
- total_num_of_unique_sessions_overall
- total_num_of_unique_aids_overall
- total_actions_in_each_session
- total_clicks_in_each_session
- total_carts_in_each_session
- total_orders_in_each_session
session features on datetime, timestamp, duration with polars notebook
- starting_datetime_each_session
- ending_datetime_each_session
- duration_seconds_each_session
- duration_seconds_datetime_each_session notebook
- duration_hours_datetime_each_session
- duration_hours_int_each_session
how does a session end? notebook
- sessions can end anytime from near starting datetime to near ending datetime
- sessions can last between near 28 days duration and 0 second duration
- sessions can end with clicks, carts or orders
plot distributions with seaborn notebook
- on total_actions_in_each_session
- on duration_hours_int_each_session

My investigation 3 : convert jsonl to parquet on Kaggle 30GB CPU with polars

On all sessions #otto_edward

what’s the frequency of clicks, carts, orders overall in a bar chart? cell
what’s the distribution of the number of actions a user performs on the otto site overall? cell
what’s the distribution of the duration (in hours) of a session? cell
what are the interesting findings from the two graphs above? cell

On all aids #otto_dcl

total_num_unique_aids in train set notebook
- fetch is for fast debugging
- collect accesses full data
- select is a good habit to have
groupby aid to find features
- total_occurs_each_unique_aid, distribution and quantiles notebook
- total_occurs_as_clicks_each_aid cell
- total_occurs_as_carts_each_aid
- total_occurs_as_orders_each_aid
- cart_over_click_ratio_each_aid cell
- order_over_click_ratio_each_aid
- most_occurred_carts_highest_cart_click_ratio cell
- total_occurs_each_aid_whose_total_occurs_lt_30 cell
- plot distribution of total_occurs_each_aid_whose_total_occurs_lt_30 cell

On all aids #otto_edward

what’s the distribution of aids whose occurrences are under 30 in both pandas and polars? cells
what’s the 5 most common aids of the dataframe? cell
what’s the most frequent aids (for clicks, carts or orders) among the first 450000 sessions? cells

Techniques Experimented #otto_dcl

my utils
- load my utils for conveniences of using jupyter and debugging cell
- how to allow multi-output in a cell cell
polars
- how to read parquet file? cell
- how to read just 100 rows of data from a huge json file with pl.scan_ndjson? cell
  - why do we need it given fetch? because if the file is too large, fetch can blow out 30GB RAM
- polars turns a list of dictionaries in a column into a list of structs. cell
- how to explode a list of structs above? cell
- how to split a struct[3] into 3 separate columns? cell
- how to cast columns from int64 to int32 and int8? cell
- how to use if, elif, else in polars with pl.when().then().otherwise(pl.when().then().otherwise())? cell
- how to experiment columns of a DataFrame? cell
- how to subset 150000 sessions? cell
- how to subset a single session of data? cell
- how to experiment max, min, count on a column data? cell
- how to insert a value from a pl.Expr into a function with apply? cell
- how to groupby and run ops on cols within each group? cell
- how to add columns of the same value to a dataframe? cell
- how to transform a row of a dataframe into a list with transpose, to_series, to_list? cell
- on datetime, timestamp, duration with polars notebook
  - convert from datetime to timestamp using datetime library
  - create a datetime object and convert datetime to timestamp and back and forth with polars
  - use pl.duration calculate duration and use pl.Int64 and dt.with_time_unit('ms') to convert timestamp to datetime
- map (or pl.duration alike) vs apply (with timedelta) on speed and RAM usage notebook
  - how much more RAM is used by apply vs map
  - how much slower is apply vs map
  - how much faster is doing more parallel ops or making cols with map
- How to use pl.duration which has map in a situation where groupby context is needed? notebook
  - map is used by pl.duration inferred from the error message and experiment afterward
  - use groupby to prepare inputs needed by pl.duration in a new dataframe
  - use pl.duration in the new dataframe
- How to use apply in polars notebook
- how to turn a dataframe (only one row) into a list? cell
- how to convert polars duration from seconds into hours? cell
- when to use fetch vs collect notebook notebook
- how to create an empty DataFrame? how to check whether a DataFrame is empty or not? cell
- how to check the size or RAM used of a DataFrame? cell

Techniques Experimented #otto_edward

how to take only 15,000 lines from train.jsonl into a pandas df for analysis? cell
how to run an experiment to time one line of code or a block of code? %timeit or %%timeit – not for the result, only for speed testing cells
how to get a basic idea of time for running code? %time cell
the most sure way to find out info of an object is with help cell
how to find out the difference between two timestamp in millisecond with datetime.timedelta? cell
how to check a long dict with a loop? cell
how to reorder a dict? cell
- sort by keys dict(sorted(people.items()))
- sort by values dict(sorted(people.items(), key=lambda item: item[1]))
how to find the most common/frequent items from a list with Counter and most_common? cell
how to check the size of an object with sys.getsizeof? cell
how to check the size of a dataframe with memory_usage and how to remove objects to save RAM with del and gc.collect? cell
how the RAM changes when loading libraries and datasets and removing objects cell
how to read a jsonl file in chunks with pd.read_json? cell and what is the chunks object? cell
how to create features for each session info from a dataframe of multiple sessions in pandas cell and polars cell
how to add feature columns addressing entire dataframe to the dataframe from above? cell
pandas
- how to use tqdm in loops for reading file lines? cell 3
- how to use column session as index and remove session as column? cell
- how fast is len(set(train.session)) vs len(train.session.unique())? cell
- how to prove that len(set(train.session)) is much slower than len(train.session.unique())? cell
- differentiate df[0] vs df.iloc[0], df[:1] vs df.iloc[:1], df.iloc[0].item() vs df.iloc[:1].item() cells
- how to loop through each row of a dataframe with idx? cell
- how to use tqdm with loop of a dataframe? cell
seaborn plotting
- how to draw barplot? cell
- how to draw distributions? cell
- how to draw a vertical line as mean for the distribution? cell

Daniel · December 27, 2022, 11:49am

Over the last few days I still couldn’t make progress on the more complex pipeline, so instead of giving it up I decided to go back and build the validation set myself instead of relying on other people.

It is not a trivial job, as almost no one does it themselves; most people rely on Radek’s validation set. It is not easy, and I was certainly tortured by it for a few days and had thought of giving up altogether.

But fastai has taught me to be tenacious. So, one step forward each day, I finally made some progress, and I believe Radek’s validation set is actually corrupted. This sounds impossible: even though I have many experiments to support my claim, I still can’t quite believe it, because we all (including grandmasters) used Radek’s validation set and no one found it to be corrupted.

I have shared the discovery of the problems of Radek’s validation set in this twitter thread.

Daniel · December 28, 2022, 4:33am

Hi @radek thanks for your replies on Kaggle, however, I can’t really reply as much as I like due to the kaggle spam system. So, I have pasted my responses and notebooks here.

Hi Radek I have made it possible to run organizer’s script on Kaggle to create the validation set. please see notebook here

Create otto validation set (jsonl, split by the last 7 days) from running organizer’s script on Kaggle notebook validation-by-script-on-kaggle, (validation-set-1, validation-set-2 created using script on paperspace)

Then I have optimized the validation set in your style and converted them into parquet, see notebook here

Optimize and convert otto validation set (full, including test_labels ) from jsonl to parquet on Kaggle with polars experiment, notebook for optimization and conversion, (created the new optimized validation dataset )

Then I compared several validation datasets to see which one has cold start problem on aid, see notebook here

finding out which validation set has no cold start problem on aid, comparing validation from Radek’s and validations from mine notebook

Daniel · December 28, 2022, 7:46am

As @cdeotte pointed out, Radek’s validation set is a little different because the organizer has changed the script, and Radek used the older version to generate his validation set.

Daniel · December 29, 2022, 5:09am

Don’t know about others, but as a beginner in a competition, my mind is shifting dramatically on a daily basis.

In the morning, I am pumped and satisfied with a little progress during the day;

In the late afternoon or evening, I am feeling defeated and the inner voice keeps telling me: what’s the point? you will never get there with such slow progress … why not do something easier and more …

Before falling asleep, I am just glad I have not given in to the thought of giving it all up.

Daniel · December 29, 2022, 5:10am

Today’s progress so far

reimplement organizer’s script in polars to create train_sessions or train_valid in otto validation set and verify its validity in this notebook

Daniel · December 29, 2022, 11:03am

reimplement organizer’s script in polars to create test_valid_full or test_sessions_full and verify its validity in this notebook

I spent half a day working on this notebook and couldn’t get the result I expected. After trying all the possibilities, I finally got it working by figuring out a hidden trick of using polars.

Reading the user guide previously really helped me here. It gave me a hint, and after experimenting repeatedly in that direction, I finally got it right and discovered an important rule about using many chained filters.

Yes, I thought of quitting this little notebook after multiple failures. However, since the goal of this notebook was very small, the things to try were not too overwhelming. By hanging in there and continuing to try, I eventually made it through.

Daniel · December 29, 2022, 11:04am

POLARS

How to print head and tail together with suffix cell
How to create random seed in polars? (actually not by polars, but np.random.seed(42)) example

How to chain expressions

How to chain every expression nicely? use () in the outer space cell
When chaining multiple filters, must we call collect as early as possible to avoid computation errors and RAM errors? cell

How to config polars?

set num of rows to display? api
set num of cols to display? api
set the colwidth? api
how to set num of rows, cols, and colwidth in pandas? guide

How to check whether two dataframes are the same?

use testing.assert_frame_equal, api and dataframes must be same type (lazy or not) cell
use testing.assert_series_equal, api
use frame_equal to compare two dataframe cell
check n_unique of each columns cell
check the total rows with count cell
check the first, last, min, max datetime cell

How to select and deselect columns

how to exclude column(s) by name, wildcard and dtypes? api
how to select all columns with pl.all? api
how to check bool values of a column to be True or not with exp.all? api and api

When if elif else needed

how to use if, elif, else, or when, then, otherwise? api cell

When a filter is needed

how to do filter with & and | ? api
how to filter a Series? api
how to filter inside a context? api
how to save RAM with filter by using & and | instead of multiple filters ? cell
but chaining two filters can produce a different result from a single filter with & cell
when chaining multiple filters, we must stop being lazy as soon as possible with collect, so that RAM won’t run out cell

I have reorganized the notes of polars for today’s learning.

Daniel · December 30, 2022, 1:41pm

reimplement test_sessions and test_labels and verify their validity script , notebook

today’s story

Instead of fearing I can’t get this task done today, I am more worried about forgetting to write down today’s little journey. Let me write it down while I still remember most of it.
I have annotated the organizer’s scripts related to these two tasks. Although the work is tedious most of the time, there are things whose purpose I don’t understand at first that eventually turn out to be important for getting the numbers exactly right.
I first dealt with test_labels, which was the easier task as I have done similar work previously.
In the process, I learnt how to set the number of rows to display and how to expand the column width in polars when the column has a long list of values.
Reimplementing test_sessions was not smooth at all. The first major problem was that I couldn’t get random.seed(42) to reproduce the same random numbers inside polars. After trying and failing to integrate random.seed into polars expressions in different ways, I moved outside polars, made the random numbers reproducible first, and then moved those random numbers back into my polars expressions. It was not any slower, and maybe even faster, as it does not cause any error.
The second major problem was that my test_sessions still did not match the “correct” dataframe exactly. This forced me to go back to the scripts, where I recalled the line of code which removes the last event of every session. I still don’t understand why the organizer does this, but it may help me get my numbers right. However, so far I have not figured out a way to remove the last event of every session in polars. At the moment I can’t solve it with either slice or filter.
Can I use shift to remove the last event of each session? cell yes, it works. cell
actually, filter with cumcount may also solve this problem, but I imagine it is more complex than the shift solution.
discovered that random.randint(1,1) can work, but np.random.randint(1,1) will fail. cell
Actually, test_sessions matches exactly with test_sessions_v7_3rd. cell

Todos:

Why test_sessions didn’t match with that from the first two validation sets, but only the 3rd one?
Check whether test_labels match exactly

Daniel · December 31, 2022, 7:47am

Notebook on reimplementing test_sessions and test_labels Continued:
- Why does the organizer remove the last event after the events are transformed by the ground_truth function? Why won’t it affect anything? cell
- Because of the above, when working on test_labels, I need to use the original test_sessions_full without shifting the last event out. cell
- Can unique, is_unique or is_duplicated help to get the unique rows when a column has a list for each row? No, they don’t work on lists. cell
- How to select the unique rows when a column has lists instead of scalar values? groupby with pl.all().first() can work. cell
- Why are test_sessions the same between my handmade version and test_7v_3rd, but not the same as test_7v and test_7v_2nd? The seed is 42 in all cases; maybe paperspace and kaggle have different settings?
- Can’t compare test_labels in which there are lists, so maybe turn each list into a long string and then compare? cell
- How to compare series using series_equal? cell
- Let’s count and compare the length of the lists in the ‘ground_truth’ column to see whether they are the same length (same) cell, cell2
- Why not compare the sum of the lists between the two test_labels dataframes? cell Are the differences between the two sum columns all zero? cell
- How to do ops on two cols from two different dataframes? cell
- Yes, the sums of the lists are all the same. So, the difference may just be a different ordering of the lists in the ‘ground_truth’ column. cell
- How to sort the lists of the ground_truth columns? cell Are they the same after sorting the lists? cell

Daniel · December 31, 2022, 2:33pm

Todos (tomorrow)

Are my handmade train, test of full dataset, and train_sessions, test_sessions_full, test_sessions, test_labels of validation set the same as the ones generated by organizer’s script?
Compare my train.parquet and test.parquet from my otto-radek-style-polars with Radek’s train and test from otto-full-optimized-memory-footprint experiment notebook (proved the same)
Compare my train_ms.parquet and test_ms.parquet with those from Colum2131’s otto-chunk-data-inparquet-format (need processing)
Compare my train_sessions and test_sessions_full with those of validation-7days-parquet, validation-7days-2nd-parquet, new optimized validation dataset
Compare my test_sessions and test_labels with those of 3rd validation optimized parquet dataset and the 4th validation set (jsonl dataset and notebook, optimized parquet dataset and notebook)

Daniel · January 1, 2023, 2:12am

Are my handmade train, test of full dataset, and train_sessions, test_sessions_full, test_sessions, test_labels of validation set the same as the ones generated by organizer’s script?

Compare my train.parquet and test.parquet from my otto-radek-style-polars with Radek’s train and test from otto-full-optimized-memory-footprint: Done ! experiment notebook (proved the same)
Compare my train_ms.parquet and test_ms.parquet with those from Colum2131’s otto-chunk-data-inparquet-format (need processing): Done! (Same) notebook
Compare my train_sessions and test_sessions_full with those of validation-7days-parquet, validation-7days-2nd-parquet, new optimized validation dataset: Done! (Same! but radek’s train is in different length, due to his using of old script) notebook
Compare my test_sessions and test_labels with those of 3rd validation optimized parquet dataset and the 4th validation set (jsonl dataset and notebook, optimized parquet dataset and notebook): Done! (Same) notebook

Daniel · January 1, 2023, 5:49am

Compare my test_sessions and test_labels with those of 1st validation set (notebook, optimized parquet dataset) and 2nd validation set (notebook and optimized parquet dataset): Done! (Same) notebook

Compare 5th validation set (jsonl dataset created on paperspace without pipenv, notebook to create optimized-parquet dataset on Kaggle) with 4th validation set: Done! (validation 1st, 2nd, 5th are the same (different from 3rd, 4th) as their jsonls are created on paperspace, even when 5th is created without pipenv ) notebook

I suspected that validation-1st, 2nd, 5th (created on paperspace) differ from validation-3rd, 4th (created on kaggle) because of the pipenv environment. However, this notebook proved otherwise, i.e., pipenv has nothing to do with the difference.

Daniel · January 1, 2023, 1:15pm

Todos tomorrow

reimplement organizer’s evaluate.py script on kaggle
- run organizer’s evaluate.py script on kaggle, using the evaluate code in a pipeline notebook Done!
- reimplement Radek’s a-robust-local-validation-framework which includes his evaluate script
- maybe a more proper reimplementation
how to combine my implementation together in one notebook
- story:
  - this way I can split any subset of train into train_sessions, test_sessions and test_labels for fast experimentation on training and evaluating
  - Radek seems to have a simpler (not exactly like the organizer’s script) approach code

Daniel · January 2, 2023, 10:11am

how to debug to understand each line of the script above: notebook

Story today

Debugging with return, pp
user guide on pp
how to debug to understand each line of a script: example notebook ^3ac7a9

what does submission.csv look like? debugging predictions = f.readlines()[1:] see cell
what does test_labels look like by debugging labels = f.readlines(), see cell
what does each label in for label in tqdm(labels, desc="Preparing labels"): look like, see cell
what does final_labels from prepare_labels(labels) look like? see cell
what does sid_type, preds = prediction.strip().split(",") in def prepare_predictions(predictions): look like? see cell
how each prediction is converted to prepared_predictions in def prepare_predictions(predictions):, see cell
what does def num_events(labels, k: int): do, see cell
what does evaluate_session do for each session, see cell; also make the debugging super fast by using subset labels = f.readlines()[:10] # add [:1000] for fast run, see cell
what does evaluate_sessions do for all sessions when a session is predicted, see cell; when a session is not predicted, see cell;
what does click_hits, cart_hits is None mean, see cell
what to do when a session is not predicted, see cell
what does def recall_by_event_type(evalutated_events: dict, total_number_events: dict): do? see cell
what does def weighted_recalls(recalls: dict, weights: dict): do? see cell
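
To tie the steps above together, here is a plain-Python sketch of per-type Recall@20 as I understand it from the competition description. It is not a copy of the organizer’s evaluate.py; the function names, the data layout and the sample labels and predictions are all made up for illustration.

```python
def session_hits(labels: dict, preds: dict, k: int = 20) -> dict:
    """Count correct predictions per event type for one session."""
    hits = {}
    if "clicks" in labels:
        # Clicks have a single ground-truth aid: hit or miss.
        hits["clicks"] = int(labels["clicks"] in preds.get("clicks", [])[:k])
    for event_type in ("carts", "orders"):
        if event_type in labels:
            # Carts and orders have a set of ground-truth aids.
            top_k = set(preds.get(event_type, [])[:k])
            hits[event_type] = len(top_k & set(labels[event_type]))
    return hits


def recall_by_type(all_labels: dict, all_preds: dict, k: int = 20) -> dict:
    """Total hits divided by total (capped) ground-truth count, per type."""
    hit_totals = {"clicks": 0, "carts": 0, "orders": 0}
    label_totals = {"clicks": 0, "carts": 0, "orders": 0}
    for session, labels in all_labels.items():
        # A session without predictions contributes zero hits.
        hits = session_hits(labels, all_preds.get(session, {}), k)
        for event_type in hit_totals:
            if event_type in labels:
                n = 1 if event_type == "clicks" else len(labels[event_type])
                hit_totals[event_type] += hits[event_type]
                label_totals[event_type] += min(n, k)
    return {t: hit_totals[t] / label_totals[t] for t in hit_totals if label_totals[t]}


all_labels = {0: {"clicks": 11, "carts": [12, 13]}, 1: {"clicks": 21, "orders": [22]}}
all_preds = {0: {"clicks": [11, 5], "carts": [12, 99]}}  # session 1 is not predicted
recalls = recall_by_type(all_labels, all_preds)
```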

Daniel · January 2, 2023, 3:00pm

reimplement organizer’s evaluate.py script on kaggle
- run organizer’s evaluate.py script on kaggle, using the evaluate code in a pipeline notebook Done!
- how to debug to understand each line of the script above: notebook and story [[#^3ac7a9|inplace]] or forum Done!
- implement the script above in polars
  - implement prepare_labels and prepare_predictions, see cell Done!
  - implement num_events(labels, k), see cell, confirmed by this cell Done!
  - implement evaluate_session and evaluate_sessions, evaluated_events can be confirmed by this cell
- compare Radek’s a-robust-local-validation-framework which includes his evaluate script

Daniel · January 3, 2023, 12:36pm

implement the script above in polars

implement prepare_labels and prepare_predictions, see cell Done!
implement num_events(labels, k), see cell, confirmed by this cell Done!
implement evaluate_session and evaluate_sessions, evaluated_events can be confirmed by this cell Done!
implement click_hits, cell Done!
implement cart_hits, cell Done!
implement order_hits, cell Done!
join them together, cell Done!
to confirm my implementation result is the same as the organizer’s result, cell Done!
implement recall_by_event_type and weighted_recalls, check script in cell , and implemented cell, confirmed cell Done!

Daniel · January 3, 2023, 1:09pm

using reimplementation notebooks above to split any subset of train into train_sessions, test_sessions and test_labels for fast experimentation on training and evaluating Todo
- integrate my implementations together
- Radek’s a-robust-local-validation-framework does subset, modeling, and evaluate in one go, let me reimplement it in polars