A quick breakdown of some search system metrics.
Mean reciprocal rank (MRR).
For each query, consider the rank of the first relevant item; then average the reciprocal ranks over all m queries:
\[MRR@m = \frac{1}{m} \sum_{i=1}^m \frac{1}{\text{rank}_i}\]
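A minimal Python sketch of MRR, assuming each query's results are represented as a 0/1 relevance list (the function names here are just for illustration):

```python
from typing import List

def reciprocal_rank(relevance: List[int]) -> float:
    """Return 1/rank of the first relevant item (0 if none is relevant)."""
    for position, rel in enumerate(relevance, start=1):
        if rel:
            return 1.0 / position
    return 0.0

def mean_reciprocal_rank(relevance_per_query: List[List[int]]) -> float:
    """Average the reciprocal ranks over all queries."""
    ranks = [reciprocal_rank(r) for r in relevance_per_query]
    return sum(ranks) / len(ranks)

# Example: first relevant item at ranks 1, 3, and 2 -> (1 + 1/3 + 1/2) / 3
print(mean_reciprocal_rank([[1, 0, 0], [0, 0, 1], [0, 1, 0]]))  # ~0.611
```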
Mean average precision.
First, what is average precision (AP)?
AP is the precision averaged over the cutoffs i = 1, ..., k, counting only positions that hold a relevant item:
\[AP@k = \frac{1}{\text{total no. relevant items}} \sum_{i=1}^k \text{Prec@}i \times \text{Rel@}i\]
The relevance vector Rel is typically a vector of 0s and 1s, where 1 indicates a relevant item. The denominator is the total number of relevant items in the dataset; this is the same denominator used in recall.
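A minimal sketch of AP@k under these assumptions (binary relevance, with the total number of relevant items passed in explicitly; the function name is made up):

```python
from typing import List

def average_precision_at_k(relevance: List[int], k: int, total_relevant: int) -> float:
    """AP@k for a single query, given a 0/1 relevance vector over the ranked results."""
    if total_relevant == 0:
        return 0.0
    score = 0.0
    hits = 0
    for i, rel in enumerate(relevance[:k], start=1):
        if rel:
            hits += 1
            score += hits / i  # Prec@i, counted only at relevant positions
    return score / total_relevant

# Example: relevant items at ranks 1 and 3, with 2 relevant items in total
print(average_precision_at_k([1, 0, 1, 0], k=4, total_relevant=2))  # (1/1 + 2/3) / 2 ≈ 0.833
```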
Second: what is the mean average precision (mAP)? It is the AP averaged across many queries.
\[mAP = \frac{1}{N} \sum_{i=1}^N AP_i\]
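Continuing the sketch above, mAP is just the mean of the per-query AP values (this reuses the hypothetical average_precision_at_k helper defined earlier):

```python
from typing import List, Tuple

def mean_average_precision(queries: List[Tuple[List[int], int]], k: int) -> float:
    """mAP@k over queries given as (relevance vector, total relevant items) pairs."""
    ap_scores = [average_precision_at_k(rel, k, total) for rel, total in queries]
    return sum(ap_scores) / len(ap_scores)

# Example with two queries
print(mean_average_precision([([1, 0, 1, 0], 2), ([0, 1, 0, 0], 1)], k=4))  # ~0.667
```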
Discounted cumulative gain (DCG).
Essentially, we sum up relevance scores from top to bottom, discounting scores that appear further down the ranking:
\[DCG@p = \sum_{i=1}^p \frac{\text{rel}_i}{\log_2(i+1)}\]
Note that the logarithm factor is essentially made up. From Wikipedia:
“Previously there was no theoretically sound justification for using a logarithmic reduction factor[3] other than the fact that it produces a smooth reduction. But Wang et al. (2013)[2] gave theoretical guarantee for using the logarithmic reduction factor in Normalized DCG (NDCG). The authors show that for every pair of substantially different ranking functions, the NDCG can decide which one is better in a consistent manner.”
Normalized DCG:
\[\text{nDCG}@p = \frac{\text{DCG}@p}{\text{IDCG}@p}\]
where IDCG = ideal ranking = the DCG obtained by recommending the most relevant items first.
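A minimal sketch of DCG and nDCG with graded relevance scores, assuming the standard log2 discount; here IDCG is approximated by re-sorting the returned list, which is a common simplification (function names are illustrative):

```python
import math
from typing import List

def dcg_at_p(relevance: List[float], p: int) -> float:
    """DCG@p: sum of relevance scores discounted by log2 of the position."""
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevance[:p], start=1))

def ndcg_at_p(relevance: List[float], p: int) -> float:
    """nDCG@p: DCG normalized by the ideal DCG (same scores sorted best-first)."""
    ideal = dcg_at_p(sorted(relevance, reverse=True), p)
    return dcg_at_p(relevance, p) / ideal if ideal > 0 else 0.0

# Example with graded relevance scores (3 = highly relevant, 0 = not relevant)
print(ndcg_at_p([3, 2, 0, 1], p=4))  # ~0.985
```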
mAP is better suited to binary relevance, i.e. an item either is or is not relevant.
nDCG is better suited to graded relevance, e.g. an item can be somewhat relevant.
Other good things to measure in search systems:
There are many metrics for measuring search systems. The most important thing is to pick a metric that is relevant to your use case.
Oliver K. Ernst
August 29, 2023