Skimming through recommendation systems
I’m skimming through some background on recommendation systems - here is what I learned.
Source: Machine Learning System Design Interview by Ali Aminian and Alex Xu
What is the ML objective?
| Objective | Comment |
| --- | --- |
| Maximize # of clicks | Con: encourages clickbait |
| Maximize time spent on each recommended item | Con: time spent may be short simply because of the nature of the recommended item |
| Maximize total # of relevant items, based on user interactions | Best: the strongest signal, because it is grounded in actual user interactions |
Strengths of these approaches
| Feature | CF (collaborative filtering) | Content-based |
| --- | --- | --- |
| Can handle new items (the "cold start problem") | X | Yes |
| Does not need item features (which can be expensive to compute) | Yes | X |
| Can handle niche interests (preferences shared by few users) | X | Yes |
Best: hybrid filtering:
- Sequential vs parallel: sequential runs CF first and then content-based (or vice versa); parallel runs both and merges the results
- Most popular in practice: sequential hybrid filtering (a sketch follows this list)
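A minimal sketch of the sequential variant, assuming CF embeddings and content feature vectors already exist; everything below is randomly generated toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 5, 100, 16
user_emb = rng.normal(size=(n_users, d))       # hypothetical CF user embeddings
item_emb = rng.normal(size=(n_items, d))       # hypothetical CF item embeddings
item_features = rng.normal(size=(n_items, 8))  # hypothetical content features (e.g. tag embeddings)

def recommend_sequential(user_id, user_profile_features, k_candidates=20, k_final=5):
    # Stage 1 (CF): retrieve candidates by dot-product similarity of the CF embeddings.
    cf_scores = item_emb @ user_emb[user_id]
    candidates = np.argsort(-cf_scores)[:k_candidates]
    # Stage 2 (content-based): re-score only those candidates using item features.
    content_scores = item_features[candidates] @ user_profile_features
    return candidates[np.argsort(-content_scores)[:k_final]]

print(recommend_sequential(user_id=0, user_profile_features=rng.normal(size=8)))
```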
Encoding
- Use a lightweight model like CBOW for short text
- Use a heavier model like BERT for longer or more complex text (e.g. search queries)
Examples (sketched below):
- Tags: CBOW
- Search history: BERT for each search, then average
- Liked items: Item ID → Embedding learned during training → average
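A minimal sketch of these three encoders, assuming gensim's Word2Vec in CBOW mode for the tags and the sentence-transformers package as a stand-in BERT encoder; the item embedding table here is random toy data rather than learned weights:

```python
import numpy as np
from gensim.models import Word2Vec                      # CBOW when sg=0
from sentence_transformers import SentenceTransformer   # stand-in for a BERT encoder

# Tags: light CBOW word vectors, averaged per item/user.
tag_corpus = [["funny", "cats"], ["travel", "food"], ["funny", "travel"]]
cbow = Word2Vec(sentences=tag_corpus, vector_size=32, min_count=1, sg=0)
def encode_tags(tags):
    return np.mean([cbow.wv[t] for t in tags], axis=0)

# Search history: a heavier text encoder per query, then average over the history.
bert = SentenceTransformer("all-MiniLM-L6-v2")
def encode_search_history(queries):
    return bert.encode(queries).mean(axis=0)

# Liked items: an (item ID -> embedding) table learned during training; random here.
item_embedding_table = np.random.default_rng(0).normal(size=(1000, 32))
def encode_liked_items(item_ids):
    return item_embedding_table[item_ids].mean(axis=0)

print(encode_tags(["funny", "travel"]).shape)              # (32,)
print(encode_search_history(["best hiking boots"]).shape)  # encoder-dependent, e.g. (384,)
print(encode_liked_items([3, 17, 42]).shape)               # (32,)
```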
Approach #1: Matrix factorization
Based on a feedback matrix:
- Users' opinions about items
- Entries are either positive (+1) or not observed (0)
Decomposition
The (users x items) feedback matrix is decomposed into (users x embedding) and (embedding x items) factors
- Initially, the decomposition is random
- It is then trained with a loss function
Loss function
Intuition about the loss function:
- Idea #1: MSE over <user, item> pairs which are observed (+1 entries). This is bad because it does not penalize errors in the unobserved pairs (0 entries).
- Idea #2: MSE over <user, item> pairs, observed + unobserved. This is bad because the loss is dominated by the unobserved pairs, since the feedback matrix is sparse.
- Idea #3: **Best:** weighted MSE over observed + unobserved pairs, with the unobserved entries down-weighted (a sketch follows this list)
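A minimal numpy sketch of idea #3, with a hypothetical down-weighting factor `w_unobserved` for the unobserved entries:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 50, 80, 8
feedback = (rng.random((n_users, n_items)) < 0.05).astype(float)  # sparse +1/0 matrix

# Random initial decomposition: feedback ≈ U @ V.T
U = rng.normal(scale=0.1, size=(n_users, d))
V = rng.normal(scale=0.1, size=(n_items, d))

def weighted_mse(U, V, feedback, w_unobserved=0.1):
    """Idea #3: squared error over all pairs, but unobserved (0) entries get a smaller weight."""
    pred = U @ V.T
    observed = feedback > 0
    err = (pred - feedback) ** 2
    return err[observed].sum() + w_unobserved * err[~observed].sum()

print(weighted_mse(U, V, feedback))
```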
Optimization
First approach: SGD
Better approach: WALS (weighted alternating least squares); a sketch follows
- Fix one factor matrix → solve for the other in closed form
- Switch and repeat
Pro: typically converges faster than SGD
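A minimal WALS sketch in numpy (toy data, hypothetical hyperparameters), alternating closed-form ridge solves for the user and item factors:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 50, 80, 8
feedback = (rng.random((n_users, n_items)) < 0.05).astype(float)
U = rng.normal(scale=0.1, size=(n_users, d))
V = rng.normal(scale=0.1, size=(n_items, d))
w_unobserved, reg = 0.1, 0.05  # hypothetical hyperparameters

# Per-entry weights: 1 for observed (+1) entries, w_unobserved otherwise.
W = np.where(feedback > 0, 1.0, w_unobserved)

def solve_side(fixed, weights, targets, reg):
    """Closed-form weighted ridge solve for each row, with the other factor held fixed."""
    out = np.empty((weights.shape[0], fixed.shape[1]))
    for i in range(weights.shape[0]):
        Wi = np.diag(weights[i])
        A = fixed.T @ Wi @ fixed + reg * np.eye(fixed.shape[1])
        b = fixed.T @ Wi @ targets[i]
        out[i] = np.linalg.solve(A, b)
    return out

for _ in range(10):
    U = solve_side(V, W, feedback, reg)      # fix V, solve for U
    V = solve_side(U, W.T, feedback.T, reg)  # fix U, solve for V

print(np.sqrt(np.mean((U @ V.T - feedback) ** 2)))  # training reconstruction error
```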
Inference
Score a <user, item> pair with the dot product of the learned user and item embeddings, and recommend the top-scoring items; for example:
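A tiny sketch with random stand-in embeddings:

```python
import numpy as np

# Learned factors from training (random toy data here).
rng = np.random.default_rng(0)
U = rng.normal(size=(50, 8))  # user embeddings
V = rng.normal(size=(80, 8))  # item embeddings

def recommend(user_id, k=5):
    scores = V @ U[user_id]         # dot product with every item embedding
    return np.argsort(-scores)[:k]  # indices of the k highest-scoring items

print(recommend(user_id=3))
```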
Limitations of Approach #1
Cold start problem: works poorly for new users
E.g. since it is based purely on user interactions, new users do not have enough interactions to learn a useful embedding
Approach #2: two tower NN
One tower encodes the user and the other encodes the item; similarity is the dot product of the two embeddings, trained with a cross-entropy loss.
Dataset
Construct positive and negative <user, item> pairs
Loss
Cross-entropy over the positive/negative pairs; a sketch follows
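A minimal PyTorch sketch with toy dimensions and random data; the feature pipelines feeding each tower are out of scope here:

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """Toy two-tower model: one tower embeds users, the other embeds items."""
    def __init__(self, user_feat_dim, item_feat_dim, d=32):
        super().__init__()
        self.user_tower = nn.Sequential(nn.Linear(user_feat_dim, 64), nn.ReLU(), nn.Linear(64, d))
        self.item_tower = nn.Sequential(nn.Linear(item_feat_dim, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, user_feats, item_feats):
        u = self.user_tower(user_feats)
        v = self.item_tower(item_feats)
        return (u * v).sum(dim=-1)  # dot-product similarity used as the logit

model = TwoTower(user_feat_dim=16, item_feat_dim=24)
loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy over positive/negative pairs

# One toy training step on random positive/negative <user, item> pairs.
user_feats, item_feats = torch.randn(8, 16), torch.randn(8, 24)
labels = torch.randint(0, 2, (8,)).float()
loss = loss_fn(model(user_feats, item_feats), labels)
loss.backward()
print(loss.item())
```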
Inference time
Retrieving the top items is a nearest-neighbor search problem ⇒ use an approximate nearest neighbor (ANN) algorithm; for example:
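A minimal retrieval sketch assuming the faiss library (any ANN library such as ScaNN, Annoy, or hnswlib works similarly); the embeddings are random stand-ins for the tower outputs:

```python
import numpy as np
import faiss

rng = np.random.default_rng(0)
d = 32
item_emb = rng.normal(size=(10_000, d)).astype("float32")  # from the item tower
user_emb = rng.normal(size=(1, d)).astype("float32")       # from the user tower

# With L2-normalized vectors, nearest-by-L2 matches highest dot product,
# so an L2-based HNSW index can serve dot-product retrieval.
faiss.normalize_L2(item_emb)
faiss.normalize_L2(user_emb)

index = faiss.IndexHNSWFlat(d, 32)  # approximate, graph-based index
index.add(item_emb)

_, top_items = index.search(user_emb, 10)  # approximate top-10 item ids
print(top_items)
```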
Limitations
Slower to train and slower at inference than matrix factorization
Evaluation
- Precision@k = (# relevant items in the top k recommendations) / k
- mAP/nDCG (sketched below):
    - mAP for binary relevance scores
    - nDCG for graded relevance scores
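A small numpy sketch of Precision@k and nDCG@k (as a simplification, the ideal DCG here is computed over the same ranked list):

```python
import numpy as np

def precision_at_k(relevances, k):
    """relevances: binary relevance (1/0) of the recommended items, in ranked order."""
    return np.sum(relevances[:k]) / k

def ndcg_at_k(relevances, k):
    """relevances: graded relevance scores of the recommended items, in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum(rel * discounts)
    ideal_dcg = np.sum(np.sort(rel)[::-1] * discounts)
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

ranked_relevance = [3, 0, 2, 1, 0]  # graded relevance of one ranked recommendation list
print(precision_at_k(np.array(ranked_relevance) > 0, k=5))  # 0.6
print(ndcg_at_k(ranked_relevance, k=5))
```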
Other evaluation metrics
- Diversity = how dissimilar the recommended items are to each other → more diversity in recommendations is generally desirable (a sketch follows this list)
- Click-through rate = # clicked / # recommended; but: susceptible to clickbait
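One common way to quantify diversity (an assumption, not a formula from the book): the average pairwise dissimilarity of the recommended items' embeddings:

```python
import numpy as np

def diversity(item_embeddings):
    """Average pairwise (1 - cosine similarity) over a set of recommended items."""
    X = np.asarray(item_embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    off_diag = sims[~np.eye(X.shape[0], dtype=bool)]
    return float(np.mean(1.0 - off_diag))

recs = np.random.default_rng(0).normal(size=(5, 16))  # embeddings of 5 recommended items
print(diversity(recs))
```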
Model architecture
- Generate candidates (an end-to-end sketch follows this list)
    - Use a two-tower network because it (1) can handle new users and (2) does not require item features
    - Use multiple networks trained for different goals: popular, trending, relevant to location
- Scoring
    - Use content-based filtering for accuracy
    - Score each candidate item
    - Output a ranked list
- Re-ranking (business logic)
    - Privacy
    - Bias
    - Duplicates
    - Misinformation
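A skeletal sketch of how the three stages fit together; every component name and behavior here is a hypothetical placeholder:

```python
from dataclasses import dataclass, field

@dataclass
class Recommender:
    candidate_generators: list = field(default_factory=list)  # e.g. two-tower retrievers
    scorer: object = None                                      # e.g. content-based ranking model

    def recommend(self, user, k=10):
        # Stage 1: gather candidates from several generators (popular, trending, location, ...).
        candidates = set()
        for generate in self.candidate_generators:
            candidates.update(generate(user))
        # Stage 2: score every candidate with the heavier ranking model.
        ranked = sorted(candidates, key=lambda item: self.scorer(user, item), reverse=True)
        # Stage 3: re-rank / filter with business logic (privacy, bias, duplicates, misinformation).
        ranked = [item for item in ranked if not self.violates_policy(item)]
        return ranked[:k]

    def violates_policy(self, item):
        return False  # placeholder for the business-logic filters

# Toy usage with dummy generators and a dummy scorer.
rec = Recommender(
    candidate_generators=[lambda u: {1, 2, 3}, lambda u: {3, 4, 5}],
    scorer=lambda u, item: -item,
)
print(rec.recommend(user="u42"))  # [1, 2, 3, 4, 5]
```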
Challenges & solutions
- Serving speed → use multistage models (e.g. YouTube)
- New users = cold start problem → use the two-tower network (it relies on user features rather than interaction history)
- New items → randomly include them in recommendations to collect interaction/click data
- How to address seasonality (dependencies on times/seasons)?
- How to include negative feedback e.g. dislikes?