This article mainly introduces a commonly used evaluation metric in Search and Recommend: NDCG.
NDCG (Normalized Discounted Cumulative Gain) can be translated as "归一化折损累计增益" in Chinese, and is usually used to measure and evaluate the accuracy of ranking. For example, in an e-commerce system, when a user enters a query, a list of products is returned. Assuming the length of the list is N, NDCG@N can be used to evaluate the difference between the ranking of this list and the actual interaction list.
Cumulative Gain#
CG (Cumulative Gain) is the sum of the relevance scores of all documents in the search result list. CG only considers the relevance of the results in the search result list, without considering the position of these documents in the result list. Given a sorting position of a result list, it can be defined as:
where represents the relevance at position .
For example, when a user searches for "mobile phones" on an e-commerce platform, the top 6 results are: iPhone, Xiaomi, Huawei, OPPO, VIVO, Samsung. And the user is more inclined to choose Huawei. In this case, no matter where Huawei is ranked within the top 6, the cumulative gain will be the same. However, for search and recommend systems, the best result is to have Huawei ranked first. This leads to the next metric, discounted cumulative gain.
Discounted Cumulative Gain#
DCG (Discounted Cumulative Gain) proposes that when results with higher relevance appear in the later positions of the search result list, a penalty should be applied to the evaluation score. The penalty ratio is related to the logarithmic value of the position of the document. Given a sorting position of a result list, DCG can be defined as:
There is another commonly used formula that can increase the weight of relevance influence:
The latter formula is more commonly used in most search companies and e-commerce platforms.
When the relevance score , i.e., the value is either 0 or 1, the above two formulas are equivalent.
Normalized Discounted Cumulative Gain#
NDCG (Normalized Discounted Cumulative Gain), going back to the NDCG mentioned at the beginning of the article. NDCG believes that the length of search results varies with the query, and using DCG alone to compare the performance of different queries is not consistent. Therefore, for a selected value of , the CG at each position should be normalized on the query, first sorting all the relevant results in the library by relevance, and then generating the maximum possible DCG through position , usually referred to as IDCG (Ideal DCG). For a query, NDCG can be calculated as follows:
where IDCG is the DCG in the ideal state, calculated as:
represents the top result list with the highest relevance in the library (sorted by relevance).
The NDCG values of all queries are averaged to evaluate the average performance of the search engine ranking algorithm. The larger the NDCG value, the better the ranking algorithm performance. In a perfect ranking algorithm, and should be equal, and the value of NDCG is 1. The minimum value of NDCG is 0, so the value of NDCG is distributed between 0 and 1. It is a relative value, so it can be compared across queries.
Calculation Example#
Taking the user search query "mobile phones" and returning 6 results as an example, with the results being: iPhone, Xiaomi, Huawei, OPPO, VIVO, Samsung, and the corresponding relevance scores are 3, 2, 3, 0, 1, 2.
Then, the corresponding CG is:
Changing the order of any two results will not affect the final result of CG calculation. DCG is used to emphasize highly relevant results appearing at the beginning of the result list. Using logarithms for reduction, the DCG for each result is as follows:
1 | 3 | 7 | 1 | 3 | 7 |
2 | 2 | 3 | 1.585 | 1.262 | 1.893 |
3 | 3 | 7 | 2 | 1.5 | 3.5 |
4 | 0 | 0 | 2.322 | 0 | 0 |
5 | 1 | 1 | 2.585 | 0.387 | 2.585 |
6 | 2 | 3 | 2.807 | 0.712 | 1.069 |
Using the first DCG formula, the calculation is as follows:
Using the second DCG formula, the calculation is as follows:
Now, if we swap the 3rd and 4th results, the DCG will decrease. This is because a result with lower relevance is ranked higher in the sorting, which means that a more relevant result is penalized more.
The performance of this query cannot be compared in this case because another query may have more results, resulting in a larger DCG, but it does not necessarily mean that the performance of this query is better. To make a comparison, the DCG values must be normalized.
To normalize the DCG values, an ideal sorting of the given query is needed. In this example, the sorting will be a monotonically decreasing sorting of all known relevance judgments. In addition to the 6 results in this example, let's assume that we also know that there is a result 7 with a relevance score of 3, and a result 8 with a relevance score of 2. Then the ideal sorting is:
When there are no 7th and 8th results, the ideal sorting is:
The ideal sorting's DCG, or IDCG, is calculated for the first 6 positions:
So the NDCG for this query can be calculated as:
Python Code for Calculating NDCG#
import numpy as np
def get_dcg(rec_list):
log_2 = np.log2(np.arange(2, len(rec_list) + 2))
mi = np.power(2, rec_list) - 1
dcg = np.array(mi / log_2).sum()
return dcg
if __name__ == "__main__":
rec_list = [3, 2, 3, 0, 1, 2]
sort_rec_list = [3, 3, 2, 2, 1, 0]
dcg = get_dcg(rec_list)
idcg = get_dcg(sort_rec_list)
print("ndcg = dcg / idcg = ", dcg/idcg)