Over the last two months, I have been working on the OTTO competition, which was introduced to me by @radek. I can’t say I have made much progress on the LB score, but it has been a great learning experience. I will keep reporting more of what I have learnt once I get more strength back from covid. Today, I would like to share my most recent notes (my reimplementation of, and experiments on, every line of code using polars) on a nice but simple EDA notebook (written in pandas by a great Kaggler). Part of the highlight of what I share below is polars, which I have found to be much faster, easier and more efficient than pandas.
MY GOAL IN THIS COMPETITION
Goal in 3 steps
- STEP 1 - understanding the problem including the dataset
- STEP 2 - building the baseline or simplest pipelines
- STEP 3 - gradually more tweaks on the pipelines
Strategy
- experimenting and remixing great notebooks from a beginner’s perspective
Why
- one of my goals is described in a reply and inspired by a discussion
- benefits of dissecting and remixing great notebooks for a beginner (@radek1)
REACH MY AUDIENCE
- through a reply to the original notebooks
- a post on kaggle
- through a tweet
GREAT INSIGHTS SHARED ON DISCUSSION
On Test Set
- test sessions do not appear in the train set, but their users could have appeared in an earlier period and may be returning customers with existing carts and wishlists source
GREAT NOTEBOOKS TO BE STUDIED
- OTTO - Getting Started (EDA + Baseline): original notebook by #otto_edward, my notebook by #otto_dcl (this is me)
HOW IS MY REPLICA DIFFERENT FROM THE ORIGINAL NOTEBOOK
- I experimented with every line and extracted the learning points that are useful and new to a beginner like myself; they are listed in the sections below.
- Thanks to @radek1 for introducing the polars library to us, I have implemented almost every line of the original notebook in polars.
STAY FOCUSED
- techniques to do EDA with polars in particular
- understanding the problem and dataset
- build up the simple pipeline with polars and start experimenting
Good Habits to learn
- make using DEBUG a habit whenever I start building a notebook, otherwise it takes too long to run experiments later on.
- del variables whenever a section of a notebook is finished to save RAM, otherwise it’s easy to run out of memory later on. cell
- one version of the notebook deals with one investigation: code from scratch and write only what is necessary
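A minimal sketch of the first two habits, assuming a notebook-level `DEBUG` flag. The variable names and sizes are made up for illustration and are not taken from either notebook.

```python
import gc

DEBUG = True  # set to False for the full run

# Hypothetical cap: work on a small slice while debugging, everything otherwise
N_ROWS = 1_000 if DEBUG else None

# Stand-in for a large object built in one section of a notebook
all_rows = list(range(100_000))
rows = all_rows[:N_ROWS]  # slicing with None keeps the whole list

n_rows_used = len(rows)

# The section is finished: drop the big objects and ask Python to reclaim memory
del all_rows, rows
gc.collect()

print(n_rows_used)
```

With `DEBUG = False` the same cell runs on the full data without any other change.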
TRIVIAL TYPOS FOUND IN THE ORIGINAL NOTEBOOKS #otto_dcl
- on the use of tqdm with `total` discussion
- on the most frequent aids discussion (not that trivial)
- on the predictions generation error, see discussion, see cells for investigation (not that trivial, as it contributes 0.001 up in LB score)
KNOW YOUR TOOLS
KNOW YOUR PROBLEM #otto_edward
- overview of the problem cell
- what does this competition want us to predict exactly? cell
- how differently are `clicks`, `carts` and `orders` weighted in the score? 1 : 3 : 6 link
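As a rough sketch of what the 1 : 3 : 6 ratio means, the snippet below combines three per-type scores into a single number. The normalised weights (0.10, 0.30, 0.60), the function name and the example scores are assumptions made for illustration.

```python
# Assumed normalisation of the 1 : 3 : 6 ratio so that the weights sum to 1
WEIGHTS = {"clicks": 0.10, "carts": 0.30, "orders": 0.60}


def weighted_score(score_by_type):
    """Combine per-type scores into one number using the 1 : 3 : 6 weights."""
    return sum(WEIGHTS[t] * score_by_type[t] for t in WEIGHTS)


# Hypothetical per-type scores, not real leaderboard numbers
example = weighted_score({"clicks": 0.5, "carts": 0.4, "orders": 0.6})
print(round(example, 4))
```

Under these weights, an improvement on `orders` moves the total six times as much as the same improvement on `clicks`.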
BUILD YOUR PIPELINES
A basic pipeline by #otto_edward cell
This baseline will use the fact that people will often interact with articles they have previously interacted with. The prediction will consist of the top 20 most frequent articles in the session. If there are fewer than 20 articles in the session, the prediction will be padded with the most frequent articles in the training data as found above.
The logic behind
- the more appearances or clicks an aid has in the early part of a session, the more likely it is to be clicked, carted or ordered in the later part of the session
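The baseline described above can be sketched in plain Python as follows. This is not the original notebook's code: the function name, the toy aids and the choice to skip padding aids that are already predicted are my assumptions.

```python
from collections import Counter


def predict_session(session_aids, top_train_aids, k=20):
    """Return up to k aids: most frequent in the session first, then padding."""
    prediction = [aid for aid, _ in Counter(session_aids).most_common(k)]
    for aid in top_train_aids:
        if len(prediction) >= k:
            break
        if aid not in prediction:  # assumption: padding skips aids already predicted
            prediction.append(aid)
    return prediction


# Toy inputs: aids seen in one session, and the most frequent aids in train
session_aids = [5, 7, 5, 9]
top_train_aids = [1, 2, 5, 3]

prediction = predict_session(session_aids, top_train_aids, k=5)
print(prediction)
```

Here `k=5` keeps the toy output short; the baseline itself predicts 20 aids per session.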
KNOW YOUR DATA
On Datasets
#otto_dataset_organizer_dataset : data description
#otto_radek_optimized_dataset: data description
#otto_radek_optimized_polars: created by me from a notebook that converts the 11 GB file into 1.2 GB on Kaggle; I have checked that the data are the same
earlier exploration prior to #otto_radek_optimized_polars
- how to load the competition organizer’s `train.jsonl` file (11 GB) and run basic analysis with polars without blowing up 30 GB of RAM? notebook
On first session #otto_dcl
- create the following columns of a single session with polars - notebook
  - total_num_of_rows
  - total_num_unique_aids
  - total_num_of_clicks
  - total_num_of_carts
  - total_num_of_orders
  - session_starting
  - session_ending
  - total_duration_hour
On first session #otto_edward
- Take a look at the first session and the first action of the first session? cell
- what are `session` and `events`? what is included inside `events`? what are `aid`, `type`, and `ts`? is `ts` a Unix timestamp or in milliseconds? cell
- what is the exact time duration of the first session of the train set? cell
- what are the frequencies of ‘clicks’, ‘carts’, ‘orders’ in the first session, with a `dict` and `dict.get`? cell, and with `pl.select` and `pl.filter`? cell
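A small sketch of the `dict` and `dict.get` counting and of the session duration, on a made-up session shaped like one line of `train.jsonl`. The aids and timestamps are invented, and `ts` is assumed to be in milliseconds.

```python
from datetime import timedelta

# Made-up session in the same shape as one line of train.jsonl
first_session = {
    "session": 0,
    "events": [
        {"aid": 10, "ts": 1_659_304_800_000, "type": "clicks"},
        {"aid": 20, "ts": 1_659_304_860_000, "type": "clicks"},
        {"aid": 10, "ts": 1_659_308_400_000, "type": "carts"},
        {"aid": 30, "ts": 1_659_312_000_000, "type": "orders"},
    ],
}

# Count event types; dict.get supplies 0 the first time a type is seen
counts = {}
for event in first_session["events"]:
    counts[event["type"]] = counts.get(event["type"], 0) + 1

# Duration between the first and last action, assuming ts is in milliseconds
timestamps = [event["ts"] for event in first_session["events"]]
duration = timedelta(milliseconds=max(timestamps) - min(timestamps))

print(counts, duration)
```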
On all sessions #otto_dcl
- session features on session, aid, clicks, carts, orders with polars notebook
  - total_num_of_rows_or_actions_overall
  - total_num_of_unique_sessions_overall
  - total_num_of_unique_aids_overall
  - total_actions_in_each_session
  - total_clicks_in_each_session
  - total_carts_in_each_session
  - total_orders_in_each_session
- session features on datetime, timestamp, duration with polars notebook
  - starting_datetime_each_session
  - ending_datetime_each_session
  - duration_seconds_each_session
  - duration_seconds_datetime_each_session notebook
  - duration_hours_datetime_each_session
  - duration_hours_int_each_session
- how does a session end? notebook
  - sessions can end at any time, from near the starting datetime to near the ending datetime
  - sessions can last anywhere from 0 seconds to nearly 28 days
  - sessions can end with clicks, carts or orders
- plot distributions with seaborn notebook
  - on total_actions_in_each_session
  - on duration_hours_int_each_session
My investigation 3 : convert jsonl to parquet on Kaggle 30GB CPU with polars
On all sessions #otto_edward
- what’s the frequency of clicks, carts, orders overall in a bar chart? cell
- what’s the distribution of the number of actions a user takes on the OTTO site overall? cell
- what’s the distribution of the duration (in hours) of a session? cell
- what are the interesting findings from the two graphs above? cell
On all aids #otto_dcl
- total_num_unique_aids in train set notebook
  - `fetch` is for fast debugging
  - `collect` accesses full data
  - `select` is a good habit to have
- groupby aid to find features
  - total_occurs_each_unique_aid, distribution and quantiles notebook
  - total_occurs_as_clicks_each_aid cell
  - total_occurs_as_carts_each_aid
  - total_occurs_as_orders_each_aid
  - cart_over_click_ratio_each_aid cell
  - order_over_click_ratio_each_aid
  - most_occurred_carts_highest_cart_click_ratio cell
  - total_occurs_each_aid_whose_total_occurs_lt_30 cell
  - plot distribution of total_occurs_each_aid_whose_total_occurs_lt_30 cell
On all aids #otto_edward
- what’s the distribution of aids whose occurrences are under 30 in both pandas and polars? cells
- what are the 5 most common aids in the dataframe? cell
- what are the most frequent aids (for clicks, carts or orders) among the first 450000 sessions? cells
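One way to answer the "most frequent aids" questions above without a dataframe is a `Counter` per event type. The events are invented and the variable names are mine.

```python
from collections import Counter, defaultdict

# Toy (aid, type) events; the values are invented
events = [
    (10, "clicks"), (20, "clicks"), (10, "clicks"), (10, "clicks"),
    (10, "carts"), (30, "carts"), (30, "carts"),
    (30, "orders"),
]

# One Counter per event type
counts_by_type = defaultdict(Counter)
for aid, event_type in events:
    counts_by_type[event_type][aid] += 1

# Most frequent aid for each type, and the two most common aids overall
top_aid_by_type = {t: c.most_common(1)[0][0] for t, c in counts_by_type.items()}
top_overall = Counter(aid for aid, _ in events).most_common(2)

print(top_aid_by_type, top_overall)
```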
Techniques Experimented #otto_dcl
- my utils
- polars
  - how to read parquet file? cell
  - how to read just 100 rows of data from a huge json file with `pl.scan_ndjson`? cell
    - why do we need it given `fetch`? because if the file is too large, `fetch` can blow out 30 GB of RAM
  - polars turns a list of dictionaries in a column into a list of structs. cell
  - how to explode a list of structs above? cell
  - how to split a `struct[3]` into 3 separate columns? cell
  - how to cast columns from `int64` to `int32` and `int8`? cell
  - how to use `if, elif, else` in polars with `pl.when().then().otherwise(pl.when().then().otherwise())`? cell
  - how to experiment with columns of a DataFrame? cell
  - how to subset 150000 sessions? cell
  - how to subset a single session of data? cell
  - how to experiment with `max`, `min`, `count` on a column of data? cell
  - how to insert a value from a pl.Expr into a function with `apply`? cell
  - how to groupby and run ops on cols within each group? cell
  - how to add columns of the same value to a dataframe? cell
  - how to transform a row of a dataframe into a list with `transpose`, `to_series`, `to_list`? cell
  - on datetime, timestamp, duration with polars notebook
    - convert from datetime to timestamp using the `datetime` library
    - create a datetime object and convert datetime to timestamp and back and forth with `polars`
    - use `pl.duration` to calculate duration, and use `pl.Int64` and `dt.with_time_unit('ms')` to convert timestamp to datetime
  - `map` (or `pl.duration` alike) vs `apply` (with `timedelta`) on speed and RAM usage notebook
    - how much more RAM is used by `apply` vs `map`
    - how much slower is `apply` vs `map`
    - how much faster is doing more parallel ops or making cols with `map`
  - How to use `pl.duration`, which has `map`, in a situation where `groupby` context is needed? notebook
    - `map` is used by `pl.duration`, inferred from the error message and experiment afterward
    - use `groupby` to prepare inputs needed by `pl.duration` in a new dataframe
    - use `pl.duration` in the new dataframe
  - How to use `apply` in polars notebook
  - how to turn a dataframe (only one row) into a list? cell
  - how to convert polars duration from seconds into hours? cell
  - when to use `fetch` vs `collect` notebook notebook
  - how to create an empty DataFrame? how to check whether a DataFrame is empty or not? cell
  - how to check the size or RAM used of a DataFrame? cell
Techniques Experimented #otto_edward
- how to take only 15,000 lines from `train.jsonl` into a pandas df for analysis? cell
- how to run an experiment to measure the speed of one line of code or a block of code? `%timeit` or `%%timeit` (not for results, only for speed testing) cells
- how to get a basic idea of the time for running code? `%time` cell
- the surest way to find out info about an object is with `help` cell
- how to find out the difference between two timestamps in milliseconds with `datetime.timedelta`? cell
- how to check a long dict with a loop? cell
- how to reorder a dict? cell
  - sort by keys `dict(sorted(people.items()))`
  - sort by values `dict(sorted(people.items(), key=lambda item: item[1]))`
- how to find the most common/frequent items from a list with `Counter` and `most_common`? cell
- how to check the size of an object with `sys.getsizeof`? cell
- how to check the size of a dataframe with `memory_usage` and how to remove objects to save RAM with `del` and `gc.collect`? cell
- how the RAM changes when loading libraries and datasets and removing objects cell
- how to read a jsonl file in `chunks` with `pd.read_json`? cell and what is the `chunks` object? cell
- how to create features for each session info from a dataframe of multiple sessions in pandas cell and polars cell
- how to add feature columns addressing entire dataframe to the dataframe from above? cell
- pandas
  - how to use tqdm in loops for reading file lines? cell 3
  - how to use column `session` as index and remove `session` as column? cell
  - how fast is `len(set(train.session))` vs `len(train.session.unique())`? cell
  - how to prove that `len(set(train.session))` is much slower than `len(train.session.unique())`? cell
  - differentiate `df[0]` vs `df.iloc[0]`, `df[:1]` vs `df.iloc[:1]`, `df.iloc[0].item()` vs `df.iloc[:1].item()` cells
  - how to loop through each row of a dataframe with idx? cell
  - how to use `tqdm` with loop of a dataframe? cell
- seaborn plotting
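The two dict-reordering one-liners listed in this section can be tried on a toy dict. The contents of `people` are made up here.

```python
# Toy dict; names and numbers are invented
people = {"carol": 31, "alice": 42, "bob": 27}

# Sort by keys (alphabetical order)
by_key = dict(sorted(people.items()))

# Sort by values (ascending)
by_value = dict(sorted(people.items(), key=lambda item: item[1]))

# Checking a dict with a loop
for name, number in by_value.items():
    print(name, number)
```

This relies on dicts preserving insertion order, which Python guarantees from version 3.7 onward.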