Collaborative filtering

sermakarevich · November 29, 2017, 10:59am

I am about to apply collaborative filtering to match the target audiences of different advertising campaigns with sites/applications. Before the real example, let's take a look at the lesson 5 notebook:

671 users, 9066 movies, 1.6% non-empty cells in the matrix
Accuracy on train / test: 0.61688 / 0.76318 (my actual results)

So we overfit slightly, and accuracy on the train set should be slightly better.

Validation set

fact = learn.data.val_y.reshape(-1)
preds = predict(learn.model, learn.data.val_dl)

Box plot of predictions:

Correlation is 0.57 (which is very high). Predictions are widely spread but, as Jeremy said, they move up as the real ratings move up. So we learned something.
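For readers who want to reproduce this kind of check, here is a minimal sketch of how the correlation and the per-rating grouping behind a box plot could be computed. The `pearson` and `preds_by_rating` helpers and the sample values are made up for illustration; they are not the notebook's code or its results.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean, median


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def preds_by_rating(fact, preds):
    """Group predictions by the actual rating (the data behind a box plot)."""
    groups = defaultdict(list)
    for f, p in zip(fact, preds):
        groups[f].append(p)
    return {f: sorted(ps) for f, ps in sorted(groups.items())}


# Made-up values for illustration only, not results from the notebook.
fact = [1.0, 2.0, 3.0, 4.0, 5.0, 1.0, 3.0, 5.0]
preds = [2.1, 2.9, 3.2, 3.6, 4.1, 2.6, 3.0, 3.9]

corr = pearson(fact, preds)
medians = {f: median(ps) for f, ps in preds_by_rating(fact, preds).items()}
```

If the medians rise with the actual rating, the model has picked up the trend even when individual predictions are spread out.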

Train set

fact = learn.data.trn_y.reshape(-1)
preds = predict(learn.model, learn.data.trn_dl)

Box plot of predictions:

Correlation is 0.00058. Predictions for all ratings appear to have exactly the same distribution. Does anybody have an idea why?
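One possible explanation, offered as an assumption and not confirmed anywhere in this thread: if the training data loader shuffles its batches, the predictions come back in a different order than `trn_y`, and comparing them element by element gives a correlation near zero even for a good model. The sketch below simulates that effect on synthetic data; every value in it is invented.

```python
import random
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


rng = random.Random(0)

# Synthetic "true" ratings and predictions that track them with noise.
fact = [rng.choice([1, 2, 3, 4, 5]) for _ in range(5000)]
preds = [f + rng.gauss(0, 1.0) for f in fact]

# Same order for both sequences: the correlation is clearly positive.
aligned = pearson(fact, preds)

# Hypothetical shuffled loader: predictions arrive in a different order.
shuffled_preds = preds[:]
rng.shuffle(shuffled_preds)
misaligned = pearson(fact, shuffled_preds)
```

If this is the cause, the shuffled predictions would also show the same distribution for every rating, which matches the box plot described above. Predicting from a loader that does not shuffle would be one way to test the idea.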

Real world case
Company X runs advertising campaigns by buying traffic (user visits) from different sites (placements). The idea is quite simple: ad campaigns = userId, placements = movieId, then build a recommendation system. Instead of ratings I use another number, the conversion rate = users who bought the product / users who viewed an ad. Obvious differences from the movie rating example:

the share of non-empty cells in the matrix is about 5 times lower: 0.3%
target values are continuous (instead of 10 possible rating values)
the range of target values can be huge: from 0.0001 to 7
60% of the values are zero (the placement was useless for that ad campaign)
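The quantities in the list above can be sketched as follows. The record layout `(campaign_id, placement_id, views, purchases)` and the sample rows are assumptions made for illustration; the original post does not show its data format.

```python
def conversion_rates(rows):
    """Map (campaign, placement) -> purchases / views.

    rows: iterable of (campaign_id, placement_id, views, purchases).
    """
    rates = {}
    for campaign, placement, views, purchases in rows:
        if views > 0:  # the rate is undefined without any views
            rates[(campaign, placement)] = purchases / views
    return rates


def density(rates, n_campaigns, n_placements):
    """Share of non-empty cells in the campaign x placement matrix."""
    return len(rates) / (n_campaigns * n_placements)


# Invented sample data: 2 campaigns, 3 placements.
rows = [
    ("c1", "p1", 1000, 3),
    ("c1", "p2", 500, 0),
    ("c2", "p1", 200, 0),
    ("c2", "p3", 0, 0),  # no views, so this cell stays empty
]

rates = conversion_rates(rows)
fill = density(rates, n_campaigns=2, n_placements=3)
zero_share = sum(r == 0 for r in rates.values()) / len(rates)
```

Here `fill` plays the role of the 0.3% figure and `zero_share` the role of the 60% figure.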

What I get:

Some learning is happening (columns: epoch, training loss, validation loss):

[ 0.       0.10338  0.09928]
[ 1.       0.06246  0.05808]
[ 2.       0.05271  0.05509]
[ 3.       0.0477   0.04698]
[ 4.       0.0454   0.04563]
[ 5.       0.0439   0.04541]
[ 6.       0.0423   0.04542]

The histogram of predictions versus actual values looks terrible: CF overestimates the real values by a factor of 20-30:

Correlation is 0.1. For the train set it is the same story as with the movie ratings: zero correlation, no visible accuracy. Any hints are highly welcome.

Update:

I replaced the continuous values with binary ones (0 = placement was useless, 1 = placement was useful). I lose the information about how good a placement was for a campaign, but that is OK in my case. This improved the correlation to 0.42 and gave AUC 0.74 (AUC 0.8 after some tuning), and made the prediction accuracy visible and similar to the movie ratings case:
Logistic regression gives the same accuracy on this dataset (AUC 0.8).
t-SNE of the userId embeddings: no visible clusters, no structure.
For those interested in going deeper into CF: A Comparative Study of Collaborative Filtering Algorithms
After reading some papers I realized that embeddings start to mean something only if you have a really dense matrix. In that case the embeddings have to solve a complex problem: fitting many varying cross-ratings. If you have a highly sparse matrix, say a single rating per movie or user, then CF is no better than any simple algorithm.
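The binarization and the AUC check described in the update can be sketched like this. It is a minimal pure-Python illustration with invented numbers; the `binarize` and `auc` helpers are hypothetical names, and in practice a library routine such as scikit-learn's `roc_auc_score` would normally be used instead.

```python
def binarize(rates):
    """1 if the placement produced any conversions, else 0."""
    return [1 if r > 0 else 0 for r in rates]


def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the number of samples,
    which is fine for a small illustration.
    """
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented conversion rates and model scores, for illustration only.
rates = [0.0, 0.0, 0.002, 0.0, 1.5, 0.0001]
scores = [0.1, 0.4, 0.35, 0.2, 0.9, 0.6]

labels = binarize(rates)
value = auc(labels, scores)
```

AUC depends only on how the scores rank the examples, so it is not affected by the scale problem seen with the continuous target.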

kcturgutlu · November 29, 2017, 11:43am

I think the reason your training predictions are so close to each other is the skewness in your training data. In other words, there is an imbalance. Maybe you could rank your predictions and scale them after ranking? I guess it also depends on which metric you care about most.

sermakarevich · November 29, 2017, 11:44am

But skewness is the same for train and validation sets.

kcturgutlu · November 29, 2017, 12:21pm

I get something like this. Interestingly, even though I ran exactly the same code from the lesson notebook, there seems to be no correlation, and that holds for both the validation and the train set.

datasciencegeek2018 · July 18, 2018, 1:01pm

I am applying collaborative filtering where my target is 0/1 instead of 1-5 as in the movie ratings. I am using Jeremy's code from Part 1, where he builds a neural net from scratch for MovieLens, uses columnar model data from fastai and then calls fit.
The only changes I make are:

in columnar model data I set is_reg=False
in fit I specify F.cross_entropy instead of F.mse_loss

However, I get this error and am not sure I understand it. Maybe you can take a quick look and help. Much appreciated.

KeyError Traceback (most recent call last)
in ()
----> 1 fit(model, data, 3, opt, F.cross_entropy)

~/fastai/fastai/model.py in fit(model, data, n_epochs, opt, crit, metrics, callbacks, stepper, swa_model, swa_start, swa_eval_freq, **kwargs)
135 if all_val: val_iter = IterBatch(cur_data.val_dl)
136
--> 137 for (*x,y) in t:
138 batch_num += 1
139 for cb in callbacks: cb.on_batch_begin()

~/anaconda3/envs/fastai/lib/python3.6/site-packages/tqdm/_tqdm.py in __iter__(self)
951 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
952
--> 953 for obj in iterable:
954 yield obj
955 # Update and possibly print the progressbar.

~/fastai/fastai/dataloader.py in __iter__(self)
86 # avoid py3.6 issue where queue is infinite and can result in memory exhaustion
87 for c in chunk_iter(iter(self.batch_sampler), self.num_workers*10):
---> 88 for batch in e.map(self.get_batch, c):
89 yield get_tensor(batch, self.pin_memory, self.half)
90

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/_base.py in result_iterator()
584 # Careful not to keep a reference to the popped future
585 if timeout is None:
--> 586 yield fs.pop().result()
587 else:
588 yield fs.pop().result(end_time - time.time())

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
430 raise CancelledError()
431 elif self._state == FINISHED:
--> 432 return self.__get_result()
433 else:
434 raise TimeoutError()

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
382 def __get_result(self):
383 if self._exception:
--> 384 raise self._exception
385 else:
386 return self._result

~/anaconda3/envs/fastai/lib/python3.6/concurrent/futures/thread.py in run(self)
54
55 try:
---> 56 result = self.fn(*self.args, **self.kwargs)
57 except BaseException as exc:
58 self.future.set_exception(exc)

~/fastai/fastai/dataloader.py in get_batch(self, indices)
73
74 def get_batch(self, indices):
---> 75 res = self.np_collate([self.dataset[i] for i in indices])
76 if self.transpose: res[0] = res[0].T
77 if self.transpose_y: res[1] = res[1].T

~/fastai/fastai/dataloader.py in <listcomp>(.0)
73
74 def get_batch(self, indices):
---> 75 res = self.np_collate([self.dataset[i] for i in indices])
76 if self.transpose: res[0] = res[0].T
77 if self.transpose_y: res[1] = res[1].T

~/fastai/fastai/column_data.py in __getitem__(self, idx)
35
36 def __getitem__(self, idx):
---> 37 return [self.cats[idx], self.conts[idx], self.y[idx]]
38
39 @classmethod

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/series.py in __getitem__(self, key)
621 key = com._apply_if_callable(key, self)
622 try:
--> 623 result = self.index.get_value(self, key)
624
625 if not is_scalar(result):

~/anaconda3/envs/fastai/lib/python3.6/site-packages/pandas/core/indexes/base.py in get_value(self, series, key)
2558 try:
2559 return self._engine.get_value(s, k,
-> 2560 tz=getattr(series.dtype, 'tz', None))
2561 except KeyError as e1:
2562 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_value()

pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

KeyError: 2237301
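A note on the traceback: it ends inside pandas' Series lookup, reached from `self.y[idx]`. That suggests (an assumption, since the post does not show how `y` was built) that `y` was passed as a pandas Series whose index no longer runs from 0 to n-1, for example after filtering or sampling rows. Integer lookups on such a Series go by label, not by position, so a positional index from the data loader can raise `KeyError`. The sketch below reproduces the effect with a made-up DataFrame and shows two possible workarounds.

```python
import pandas as pd

# Made-up data for illustration; column names are hypothetical.
df = pd.DataFrame({"user": [1, 2, 3, 4], "clicked": [0, 1, 1, 0]})

# Filtering keeps the original index labels (here 2 and 3).
subset = df[df["user"] > 2]
y = subset["clicked"]

# Position 0 exists, but label 0 does not, so this lookup fails.
try:
    y[0]
    failed = False
except KeyError:
    failed = True

# Workaround 1: pass a plain numpy array, which is indexed by position.
y_array = y.values
first = y_array[0]

# Workaround 2: rebuild a 0..n-1 index before extracting the target.
y_reset = subset.reset_index(drop=True)["clicked"]
```

Separately from the index issue, `F.cross_entropy` expects one output score per class and, in its usual form, integer class labels, so a model with a single output trained against float targets would likely need further changes.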

navinkb · January 20, 2019, 9:15am

Were you able to figure it out with binary values?