Skimming through recommendation systems
I’m skimming through some background on recommendation systems - here is what I learned.
Source: Machine Learning System Design Interview by Ali Aminian and Alex Xu
What is the ML objective?
| Objective | Comment |
| --- | --- |
| Maximize # of clicks | Con: encourages clickbait |
| Maximize time spent on each recommended item | Con: time spent may be short simply because of the nature of the recommended item |
| Maximize total # of relevant items, based on user interactions | Best: the strongest signal, because it is grounded in actual user interactions |
Strengths of these approaches
| Feature | CF (collaborative filtering) | Content-based |
| --- | --- | --- |
| Can handle new items (the "cold start problem") | X | Yes |
| Does not need item features (which can be expensive to compute) | Yes | X |
| Can handle niche interests (preferences shared by few users) | X | Yes |
Best: hybrid filtering:
- Sequential vs parallel: sequential runs CF first and then content-based (or vice versa); parallel runs both and merges the results
- Most popular in practice: sequential hybrid filtering (a sketch follows this list)
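A minimal sketch of the sequential variant, assuming CF embeddings and content feature vectors already exist; everything below is randomly generated toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 5, 100, 16
user_emb = rng.normal(size=(n_users, d))       # hypothetical CF user embeddings
item_emb = rng.normal(size=(n_items, d))       # hypothetical CF item embeddings
item_features = rng.normal(size=(n_items, 8))  # hypothetical content features (e.g. tag embeddings)

def recommend_sequential(user_id, user_profile_features, k_candidates=20, k_final=5):
    # Stage 1 (CF): retrieve candidates by dot-product similarity of the CF embeddings.
    cf_scores = item_emb @ user_emb[user_id]
    candidates = np.argsort(-cf_scores)[:k_candidates]
    # Stage 2 (content-based): re-score only those candidates using item features.
    content_scores = item_features[candidates] @ user_profile_features
    return candidates[np.argsort(-content_scores)[:k_final]]

print(recommend_sequential(user_id=0, user_profile_features=rng.normal(size=8)))
```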
Encoding
- Use a lightweight model like CBOW for short text
- Use a heavier model like BERT for longer or more complex text (e.g. search queries)
Examples (sketched below):
- Tags: CBOW
- Search history: BERT for each search, then average
- Liked items: Item ID → Embedding learned during training → average
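A minimal sketch of these three encoders, assuming gensim's Word2Vec in CBOW mode for the tags and the sentence-transformers package as a stand-in BERT encoder; the item embedding table here is random toy data rather than learned weights:

```python
import numpy as np
from gensim.models import Word2Vec                      # CBOW when sg=0
from sentence_transformers import SentenceTransformer   # stand-in for a BERT encoder

# Tags: light CBOW word vectors, averaged per item/user.
tag_corpus = [["funny", "cats"], ["travel", "food"], ["funny", "travel"]]
cbow = Word2Vec(sentences=tag_corpus, vector_size=32, min_count=1, sg=0)
def encode_tags(tags):
    return np.mean([cbow.wv[t] for t in tags], axis=0)

# Search history: a heavier text encoder per query, then average over the history.
bert = SentenceTransformer("all-MiniLM-L6-v2")
def encode_search_history(queries):
    return bert.encode(queries).mean(axis=0)

# Liked items: an (item ID -> embedding) table learned during training; random here.
item_embedding_table = np.random.default_rng(0).normal(size=(1000, 32))
def encode_liked_items(item_ids):
    return item_embedding_table[item_ids].mean(axis=0)

print(encode_tags(["funny", "travel"]).shape)              # (32,)
print(encode_search_history(["best hiking boots"]).shape)  # encoder-dependent, e.g. (384,)
print(encode_liked_items([3, 17, 42]).shape)               # (32,)
```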
Approach #1: Matrix factorization
Based on a feedback matrix:
- Users' opinions about items
- Entries are either positive (+1) or not observed (0)
Decomposition
The (users x items) feedback matrix is decomposed into (users x embedding) and (embedding x items) factors
- Initially, the decomposition is random
- It is then trained with a loss function
Loss function
Intuition about the loss function:
- Idea #1: MSE over <user, item> pairs which are observed (+1 entries). This is bad because it does not penalize errors in the unobserved pairs (0 entries).
- Idea #2: MSE over <user, item> pairs, observed + unobserved. This is bad because the loss is dominated by the unobserved pairs, since the feedback matrix is sparse.
- Idea #3: **Best:** weighted MSE over observed + unobserved pairs, with the unobserved entries down-weighted (a sketch follows this list)
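A minimal numpy sketch of idea #3, with a hypothetical down-weighting factor `w_unobserved` for the unobserved entries:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 50, 80, 8
feedback = (rng.random((n_users, n_items)) < 0.05).astype(float)  # sparse +1/0 matrix

# Random initial decomposition: feedback ≈ U @ V.T
U = rng.normal(scale=0.1, size=(n_users, d))
V = rng.normal(scale=0.1, size=(n_items, d))

def weighted_mse(U, V, feedback, w_unobserved=0.1):
    """Idea #3: squared error over all pairs, but unobserved (0) entries get a smaller weight."""
    pred = U @ V.T
    observed = feedback > 0
    err = (pred - feedback) ** 2
    return err[observed].sum() + w_unobserved * err[~observed].sum()

print(weighted_mse(U, V, feedback))
```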
Optimization
First approach: SGD
Better approach: WALS (weighted alternating least squares); a sketch follows
- Fix one factor matrix → solve for the other in closed form
- Switch and repeat
Pro: typically converges faster than SGD
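A minimal WALS sketch in numpy (toy data, hypothetical hyperparameters), alternating closed-form ridge solves for the user and item factors:

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, d = 50, 80, 8
feedback = (rng.random((n_users, n_items)) < 0.05).astype(float)
U = rng.normal(scale=0.1, size=(n_users, d))
V = rng.normal(scale=0.1, size=(n_items, d))
w_unobserved, reg = 0.1, 0.05  # hypothetical hyperparameters

# Per-entry weights: 1 for observed (+1) entries, w_unobserved otherwise.
W = np.where(feedback > 0, 1.0, w_unobserved)

def solve_side(fixed, weights, targets, reg):
    """Closed-form weighted ridge solve for each row, with the other factor held fixed."""
    out = np.empty((weights.shape[0], fixed.shape[1]))
    for i in range(weights.shape[0]):
        Wi = np.diag(weights[i])
        A = fixed.T @ Wi @ fixed + reg * np.eye(fixed.shape[1])
        b = fixed.T @ Wi @ targets[i]
        out[i] = np.linalg.solve(A, b)
    return out

for _ in range(10):
    U = solve_side(V, W, feedback, reg)      # fix V, solve for U
    V = solve_side(U, W.T, feedback.T, reg)  # fix U, solve for V

print(np.sqrt(np.mean((U @ V.T - feedback) ** 2)))  # training reconstruction error
```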
Inference
Score a <user, item> pair with the dot product of the learned user and item embeddings, and recommend the top-scoring items; for example:
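A tiny sketch with random stand-in embeddings:

```python
import numpy as np

# Learned factors from training (random toy data here).
rng = np.random.default_rng(0)
U = rng.normal(size=(50, 8))  # user embeddings
V = rng.normal(size=(80, 8))  # item embeddings

def recommend(user_id, k=5):
    scores = V @ U[user_id]         # dot product with every item embedding
    return np.argsort(-scores)[:k]  # indices of the k highest-scoring items

print(recommend(user_id=3))
```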
Limitations of Approach #1
Cold start problem: works poorly for new users
E.g. since it is based purely on user interactions, new users do not have enough interactions to learn a useful embedding
Approach #2: two tower NN
One tower encodes the user and the other encodes the item; similarity is the dot product of the two embeddings, trained with a cross-entropy loss.
Dataset
Construct positive and negative <user, item> pairs
Loss
Cross-entropy over the positive/negative pairs; a sketch follows
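A minimal PyTorch sketch with toy dimensions and random data; the feature pipelines feeding each tower are out of scope here:

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """Toy two-tower model: one tower embeds users, the other embeds items."""
    def __init__(self, user_feat_dim, item_feat_dim, d=32):
        super().__init__()
        self.user_tower = nn.Sequential(nn.Linear(user_feat_dim, 64), nn.ReLU(), nn.Linear(64, d))
        self.item_tower = nn.Sequential(nn.Linear(item_feat_dim, 64), nn.ReLU(), nn.Linear(64, d))

    def forward(self, user_feats, item_feats):
        u = self.user_tower(user_feats)
        v = self.item_tower(item_feats)
        return (u * v).sum(dim=-1)  # dot-product similarity used as the logit

model = TwoTower(user_feat_dim=16, item_feat_dim=24)
loss_fn = nn.BCEWithLogitsLoss()  # binary cross-entropy over positive/negative pairs

# One toy training step on random positive/negative <user, item> pairs.
user_feats, item_feats = torch.randn(8, 16), torch.randn(8, 24)
labels = torch.randint(0, 2, (8,)).float()
loss = loss_fn(model(user_feats, item_feats), labels)
loss.backward()
print(loss.item())
```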
Inference time
Retrieving the top items is a nearest-neighbor search problem ⇒ use an approximate nearest neighbor (ANN) algorithm; for example:
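A minimal retrieval sketch assuming the faiss library (any ANN library such as ScaNN, Annoy, or hnswlib works similarly); the embeddings are random stand-ins for the tower outputs:

```python
import numpy as np
import faiss

rng = np.random.default_rng(0)
d = 32
item_emb = rng.normal(size=(10_000, d)).astype("float32")  # from the item tower
user_emb = rng.normal(size=(1, d)).astype("float32")       # from the user tower

# With L2-normalized vectors, nearest-by-L2 matches highest dot product,
# so an L2-based HNSW index can serve dot-product retrieval.
faiss.normalize_L2(item_emb)
faiss.normalize_L2(user_emb)

index = faiss.IndexHNSWFlat(d, 32)  # approximate, graph-based index
index.add(item_emb)

_, top_items = index.search(user_emb, 10)  # approximate top-10 item ids
print(top_items)
```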
Limitations
Slower to train and slower at inference than matrix factorization
Evaluation
- Precision@k = (# relevant items in the top k recommendations) / k
- mAP/nDCG (sketched below):
    - mAP for binary relevance scores
    - nDCG for graded relevance scores
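A small numpy sketch of Precision@k and nDCG@k (as a simplification, the ideal DCG here is computed over the same ranked list):

```python
import numpy as np

def precision_at_k(relevances, k):
    """relevances: binary relevance (1/0) of the recommended items, in ranked order."""
    return np.sum(relevances[:k]) / k

def ndcg_at_k(relevances, k):
    """relevances: graded relevance scores of the recommended items, in ranked order."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))
    dcg = np.sum(rel * discounts)
    ideal_dcg = np.sum(np.sort(rel)[::-1] * discounts)
    return dcg / ideal_dcg if ideal_dcg > 0 else 0.0

ranked_relevance = [3, 0, 2, 1, 0]  # graded relevance of one ranked recommendation list
print(precision_at_k(np.array(ranked_relevance) > 0, k=5))  # 0.6
print(ndcg_at_k(ranked_relevance, k=5))
```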
Other evaluation metrics
- Diversity = how dissimilar the recommended items are to each other → more diversity in recommendations is generally desirable (a sketch follows this list)
- Click-through rate = # clicked / # recommended; but: susceptible to clickbait
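One common way to quantify diversity (an assumption, not a formula from the book): the average pairwise dissimilarity of the recommended items' embeddings:

```python
import numpy as np

def diversity(item_embeddings):
    """Average pairwise (1 - cosine similarity) over a set of recommended items."""
    X = np.asarray(item_embeddings, dtype=float)
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = X @ X.T
    off_diag = sims[~np.eye(X.shape[0], dtype=bool)]
    return float(np.mean(1.0 - off_diag))

recs = np.random.default_rng(0).normal(size=(5, 16))  # embeddings of 5 recommended items
print(diversity(recs))
```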
Model architecture
- Generate candidates (an end-to-end sketch follows this list)
    - Use a two-tower network because it (1) can handle new users and (2) does not require item features
    - Use multiple networks trained for different goals: popular, trending, relevant to location
- Scoring
    - Use content-based filtering for accuracy
    - Score each candidate item
    - Output a ranked list
- Re-ranking (business logic)
    - Privacy
    - Bias
    - Duplicates
    - Misinformation
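A skeletal sketch of how the three stages fit together; every component name and behavior here is a hypothetical placeholder:

```python
from dataclasses import dataclass, field

@dataclass
class Recommender:
    candidate_generators: list = field(default_factory=list)  # e.g. two-tower retrievers
    scorer: object = None                                      # e.g. content-based ranking model

    def recommend(self, user, k=10):
        # Stage 1: gather candidates from several generators (popular, trending, location, ...).
        candidates = set()
        for generate in self.candidate_generators:
            candidates.update(generate(user))
        # Stage 2: score every candidate with the heavier ranking model.
        ranked = sorted(candidates, key=lambda item: self.scorer(user, item), reverse=True)
        # Stage 3: re-rank / filter with business logic (privacy, bias, duplicates, misinformation).
        ranked = [item for item in ranked if not self.violates_policy(item)]
        return ranked[:k]

    def violates_policy(self, item):
        return False  # placeholder for the business-logic filters

# Toy usage with dummy generators and a dummy scorer.
rec = Recommender(
    candidate_generators=[lambda u: {1, 2, 3}, lambda u: {3, 4, 5}],
    scorer=lambda u, item: -item,
)
print(rec.recommend(user="u42"))  # [1, 2, 3, 4, 5]
```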
Challenges & solutions
- Serving speed → use multistage models (e.g. YouTube)
- New users = cold start problem → use the two-tower network (it relies on user features rather than interaction history)
- New items → randomly include them in recommendations to collect interaction/click data
- How to address seasonality (dependencies on times/seasons)?
- How to include negative feedback e.g. dislikes?