Junjie Huang (Shanghai Jiao Tong University, China; huangjunjie2019@sjtu.edu.cn), Jizheng Chen (Shanghai Jiao Tong University, China; humihuadechengzhi@sjtu.edu.cn), Jianghao Lin (Shanghai Jiao Tong University, China; chiangel@sjtu.edu.cn), Jiarui Qin (Shanghai Jiao Tong University, China; qjr1996@sjtu.edu.cn), Ziming Feng† (China Merchants Bank Credit Card Center, China; zimingfzm@cmbchina.com), Weinan Zhang† (Shanghai Jiao Tong University, China; wnzhang@sjtu.edu.cn), and Yong Yu (Shanghai Jiao Tong University, China; yyu@sjtu.edu.cn)
In an era dominated by information overload, effective recommender systems are essential for managing the deluge of data across digital platforms. Multi-stage cascade ranking systems are widely used in the industry, with retrieval and ranking being two typical stages. Retrieval methods sift through vast candidates to filter out irrelevant items, while ranking methods prioritize these candidates to present the most relevant items to users. Unlike studies focusing on the ranking stage, this survey explores the critical yet often overlooked retrieval stage of recommender systems. To achieve precise and efficient personalized retrieval, we summarize existing work in three key areas: improving similarity computation between user and item, enhancing indexing mechanisms for efficient retrieval, and optimizing training methods of retrieval. We also provide a comprehensive set of benchmarking experiments on three public datasets. Furthermore, we highlight current industrial applications through a case study on retrieval practices at a specific company, covering the entire retrieval process and online serving, along with practical implications and challenges. By detailing the retrieval stage, which is fundamental for effective recommendation, this survey aims to bridge the existing knowledge gap and serve as a cornerstone for researchers interested in optimizing this critical component of cascade recommender systems.
† Corresponding authors.
We are being bombarded with a vast amount of information due to the growing popularity of the Internet and the development of User Generated Content (UGC) (Krumm et al., 2008) in recent years. To save users from information overload, recommender systems have been widely applied in today's short video (Liu et al., 2019), news (Wang et al., 2018b), and e-commerce (Chen et al., 2019b) platforms. While complicated models (Pi et al., 2020; Qin et al., 2021; Lin et al., 2023b; Wang et al., 2023c) often offer higher accuracy, their poor efficiency makes online deployment challenging because of latency restrictions (Pi et al., 2019). On the other hand, simple models (Huang et al., 2013; Rendle, 2010) have limited capacity, but they can evaluate a great number of items efficiently thanks to their low time complexity. Therefore, striking a balance between efficacy and efficiency becomes crucial in order to quickly filter out the information that users are interested in. As shown in Figure 1(a), one widely used solution in industry is the multi-stage cascade ranking system (Wang et al., 2011). The system includes a retriever and a variety of subsequent rankers. In the very first stage of the cascade system, referred to as the retrieval stage in this paper (also called the matching or recall stage in some literature (Qin et al., 2022; Zhu et al., 2022)), a retriever is typically used to quickly eliminate irrelevant items from a large pool of candidates, whereas rankers in the later stages aim to accurately rank the items. Each stage selects the top-$K$ items from what it receives and feeds them to the next stage. As shown in Figure 1(a), rankers in multi-stage cascade ranking systems are arranged in the shape of a funnel, narrowing from bottom to top. The retrieval and ranking stages are two typical stages, while the pre-ranking (Wang et al., 2020d) and re-ranking (Xi et al., 2023a) stages are relatively optional, and the number of rankers in the system may vary across scenarios. Additionally, on the left side of Figure 1(a), we display the approximate output scale of each stage, noting that this scale is specific to the particular platform and scenario.
Although both the retrieval and ranking stages aim to select the most relevant items, each stage has its own unique characteristics.
Difference in candidate sets (i.e., inference spaces). The retrieval stage needs to quickly filter through the entire item pool, which may contain millions of items, while the ranking stage only needs to score and order the items that have already been selected by the retrieval methods, typically narrowing down to hundreds or thousands of items.
Difference in input features. During the retrieval stage, due to time constraints and the need to filter through a large candidate set quickly, utilizing complex feature interactions is impractical for real-time online requirements. As a result, only limited, coarse-grained features of users and items are considered. In contrast, the ranking stage can utilize a diverse set of features by designing various feature interaction operators, such as product operators (Qu et al., 2016), convolutional operators (Li et al., 2019a), and attention operators (Xiao et al., 2017). The ranking stage further enhances its capability by integrating multiple feature domains and providing detailed modeling of behavioral and interaction features. These intricate and meaningful patterns serve as valuable indicators for evaluating user preferences, allowing for a more precise and refined ranking of the items.
Difference in model architectures. Typically, retrieval models employ a dual-tower architecture, where user and item representations are generated separately and their matching scores are calculated using basic similarity metrics (Huang et al., 2013), such as inner product or cosine similarity. This architecture allows user and item representations to be optimized independently before they interact, which is often referred to as late interaction. On the other hand, ranking models often utilize a single-tower architecture, where user features, item features, and cross features are concatenated and processed through a deep neural network (DNN). This setup, known as early interaction, facilitates multi-level interactions between features and thereby increases the model's expressive power. The distinction between single- and dual-tower architectures essentially lies in how and where the interactions between users and items are modeled: late interaction in dual-tower setups via direct similarity metrics, versus early interaction in single-tower setups through deeper processing of concatenated features.
Difference in optimization objectives. The retrieval stage aims to achieve high recall efficiently, i.e., to quickly identify a subset containing as many relevant items as possible. In this stage, the focus is on ensuring that the relevant items are included in the set, without prioritizing their exact order. In contrast, the goal of the ranking stage is to refine the retrieved candidate set by accurately determining the relative importance of each item. Since only a small number of items are fed into the ranking model, these models can utilize more complex architectures to enhance precision, ensuring that the most relevant items are positioned at the top of the list. Thus, the retrieval stage prioritizes the inclusion of relevant items (i.e., recall), while the ranking stage prioritizes precise ordering to maximize the relevance of the top-ranked items.
In the past few decades, there has been substantial progress in the retrieval stage. Traditional retrieval methods are based on the concept of collaborative filtering (CF) (Su and Khoshgoftaar, 2009), which includes neighborhood based (Sarwar et al., 2001) and matrix factorization (MF) based methods (Koren et al., 2009; Manotumruksa et al., 2017). However, these methods primarily focus on learning embeddings for user and item IDs, without considering other important features such as user profiles, item attributes, or contextual information. Additionally, the shallow structures used in these models are insufficient to capture complex patterns and interactions, resulting in suboptimal representations. Consequently, in recent years, there has been a shift towards embedding based retrieval (EBR), which treats the retrieval problem as a nearest neighbor search in the vector space. EBR is not a specific algorithm but a broad family of algorithms, including Item2Vec (Barkan and Koenigstein, 2016), YouTube DNN (Covington et al., 2016), Alibaba's EGES (Wang et al., 2018a), Pinterest's PinSage (Ying et al., 2018), etc. These models, which vary from two-tower to graph-embedding structures, aim to learn comprehensive representations for users and items. Given the extensive research conducted, it is timely to assess the current landscape, learn from existing methods, and identify directions for future research. While there are comprehensive surveys on the ranking stage (Zhang et al., 2021; Wu et al., 2022; He et al., 2023), the retrieval stage lacks systematic reviews. This oversight may stem from the greater emphasis traditionally placed on the ranking stage, which, as the final stage in recommender systems, interacts directly with users and has a more immediate impact on business objectives. However, the retrieval stage is also crucial, as it sets the stage for all subsequent processes. Without effective retrieval, even the most advanced follow-up stages cannot perform optimally.
To our knowledge, this is the first survey that comprehensively summarizes retrieval-related efforts in recommender systems. While there have been several surveys on recommender systems, none have specifically focused on the retrieval stage. Batmaz et al. (2019) provide a thorough review of deep learning based recommendation methods, primarily focusing on autoencoders, recurrent neural networks (RNNs), and convolutional neural networks (CNNs) for recommendations. Similarly, Zhang et al. (2019) offer a detailed overview of current efforts in deep learning based recommender systems and discuss some unresolved issues. Xu et al. (2018) propose a unified framework for search and recommendation and mention some retrieval models, but their discussion is scattered and lacks a focus on the retrieval stage. In contrast, our work provides a comprehensive overview of retrieval methods in recommender systems under a unified framework.
We review numerous works published in major conferences and journals and summarize the efforts made to achieve precise and efficient personalized retrieval into three main areas, as illustrated in Figure 2:
Similarity Learning. Many studies focus on improving the representation of users and items to enhance similarity learning. Traditional methods, based on CF and MF (Sarwar et al., 2001; Koren et al., 2009; Manotumruksa et al., 2017), use shallow structures for this purpose. Recently, there has been a growing trend towards using deep structures (Covington et al., 2016; Wang et al., 2018a; Ying et al., 2018) to better learn the representations and similarity relationships of users and items.
Indexing Mechanisms. In the retrieval stage, indexing schemes are essential for effectively organizing and retrieving large-scale items. Some efforts have been directed towards improving indexing mechanisms, such as tree based methods that use a tree structure as an index (Zhu et al., 2018, 2019; Zhuo et al., 2020). Deep Retrieval (Gao et al., 2020) uses a matrix for indexing, enabling efficient item clustering and retrieval.
Optimization Methods. Various learning strategies have been developed to optimize the training of retrieval methods, particularly in negative sample construction. Users typically interact with only a small subset of the items available in real systems, making the lack of reliable negative data a challenge for learning from implicit feedback. Existing work includes various negative sampling strategies (He et al., 2017; Rendle et al., 2012), which select a subset of missing data as negative instances, and whole-data based strategies (Devooght et al., 2015; Hu et al., 2008; Liang et al., 2016), which treat all missing data as negative instances.
The aim of this survey is to thoroughly review the literature on retrieval methods in recommender systems and discuss the current state of retrieval in the industry, as well as future directions. This survey provides researchers and practitioners interested in recommender systems with a general understanding of the latest developments in the field of retrieval for recommendation. The key contributions of this survey are summarized as follows:
First Comprehensive Review. This review addresses the deficiency in summarizing retrieval-related efforts in recommender systems. We categorize these efforts into three main areas: improving similarity learning between users and items through both shallow and deep structures, enhancing indexing mechanisms for efficient large-scale retrieval, and optimizing the training methods of retrieval. To the best of our knowledge, this is the first comprehensive review of retrieval methods for recommendation, which not only highlights the progress achieved but also identifies areas for future research and development.
Extensive Benchmarking Experiments. We provide a comprehensive set of benchmarking experiments, covering various retrieval methods on three public datasets. As the study of retrieval methods is deeply rooted in practical applications and faces numerous real-world challenges, we summarize online performance statistics from published papers, providing valuable insights.
Case Study on Industrial Practices of Retrieval. Additionally, we provide a case study on the current industrial practices of the retrieval stage at a specific company, covering the entire retrieval process and online serving, along with key insights and challenges encountered.
The remainder of this article is structured as follows: Section 2 introduces the background of retrieval in recommender systems, including problem formulation, basic retrieval strategies, and input data for retrieval. Sections 3 and 4 detail the efforts that focus on similarity learning using shallow and deep structures, respectively. Section 5 discusses improvements in indexing mechanisms for efficient retrieval. Section 6 covers optimization methods for retrieval, including key aspects of model learning such as widely used loss functions and different negative sampling strategies. Section 7 summarizes commonly used datasets and conducts detailed experiments on three public datasets to assess the effectiveness of various methods. Section 8 presents a case study of current industrial practices in the retrieval stage at a specific company. We discuss open challenges and suggest promising future research directions in Section 9, and conclude the survey in Section 10.
In this section, we first formulate the task of retrieval. Then we discuss existing retrieval strategies from a broad perspective, including non-personalized and personalized retrieval. Next, we cover the common input data and information used in retrieval methods, including the user-item rating matrix and rich side information. Finally, we introduce multi-channel retrieval, which is widely used in real-world recommender systems.
We denote the universal user and item sets as $\mathcal{U}$ and $\mathcal{I}$, respectively. For a given user $u$, the objective of the retrieval method is to identify all potentially relevant items from the vast candidate pool $\mathcal{I}$. Generally, the retrieval method can be formulated as:
(1)  $\mathcal{C}(u, \mathcal{I}) = \{\, i \in \mathcal{I} \mid s(u, i) > \theta \,\},$
where $\mathcal{C}(u, \mathcal{I})$ is the candidate set for user $u$, $s(u, i)$ is a scoring function that measures the relevance of item $i$ to user $u$, and $\theta$ is a threshold for relevance. Unlike the ranking phase, which deals with a smaller candidate set, the retrieval stage may involve filtering through millions of items. Consequently, efficiency becomes a critical concern for models used in this stage. As a result, most retrieval methods employ a dual-tower architecture where user and item representations are learned separately. Formally, given user $u \in \mathcal{U}$ and item $i \in \mathcal{I}$, the process can be abstracted as:
(2)  $s(u, i) = f(\phi_u(u), \phi_i(i)),$
where $\phi_u : \mathcal{U} \rightarrow \mathcal{H}$ and $\phi_i : \mathcal{I} \rightarrow \mathcal{H}$ represent the mappings of the user space $\mathcal{U}$ and the item space $\mathcal{I}$ to a new space $\mathcal{H}$, respectively, and $f$ denotes the similarity measure between user and item, utilizing metrics like inner product or cosine similarity. For any user-item pair $(u, i)$, $s(u, i)$ provides a score reflecting the similarity between $u$ and $i$. This score enables the ranking of all items in the corpus $\mathcal{I}$ according to their predicted similarity scores. By ranking the items based on these scores, the retrieval method can efficiently generate a set of items, forming the retrieved set for user $u$.
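To make the formulation concrete, the following minimal sketch (ours, in Python with NumPy; the toy embeddings, dimensionality, and threshold are illustrative assumptions rather than anything specified in the paper) scores all items for one user with an inner product and applies the threshold rule of Eq. (1):

```python
import numpy as np

def score(user_vec: np.ndarray, item_vecs: np.ndarray) -> np.ndarray:
    """Inner-product similarity f between one user and all items."""
    return item_vecs @ user_vec

def retrieve(user_vec, item_vecs, theta):
    """Candidate set C(u, I) = {i | s(u, i) > theta} from Eq. (1)."""
    scores = score(user_vec, item_vecs)
    return np.nonzero(scores > theta)[0]

# Toy vectors standing in for phi_u(u) and phi_i(i) in the space H.
rng = np.random.default_rng(0)
user = rng.normal(size=16)            # phi_u(u)
items = rng.normal(size=(1000, 16))   # phi_i(i) for each i in I
candidates = retrieve(user, items, theta=8.0)
```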
Before exploring specific retrieval methods, it is important to distinguish between retrieval strategies and retrieval methods and to clarify the relationship between the two, as they operate on different levels. The retrieval strategy serves as the high-level concept that guides the design of various retrieval methods. Generally, as illustrated in Figures 3(a) and 3(b), retrieval strategies can be broadly categorized into (1) non-personalized retrieval and (2) personalized retrieval. Non-personalized retrieval primarily aims at identifying and presenting trending topics. Though such an approach might not directly align with a user's specific interests, there is a high likelihood of user clicks due to the popularity of the content. In contrast, personalized retrieval tailors its recommendations to match user preferences, significantly enhancing user engagement and retention. This category includes varied strategies like item-to-item (I2I) and user-to-item (U2I), which will be detailed further below. These strategies provide general guidance for the design of different retrieval methods, and understanding this distinction helps clarify how different approaches are applied in practice.
Non-personalized retrieval plays a pivotal role in the retrieval stage without tailoring results to individual user preferences, as shown in Figure 3(a). One common strategy is promoting popular items, inspired by the 'wisdom of the crowd' (Surowiecki, 2005); it is particularly effective for new users or in the absence of sufficient user interaction data (Adomavicius and Tuzhilin, 2005). Another strategy focuses on less mainstream items, aiming to diversify users' preferences and unearth their potential interests. Newly released item promotion (Celma and Cano, 2008) is also a widely used strategy, which pushes the latest items to users, fueling exploration and maintaining a fresh and dynamic environment (Fleder and Hosanagar, 2009). Besides, a platform may design various operational algorithms to serve specific business goals or events, guiding users towards particular topics or items. Each of these strategies enhances user engagement without personalization; capturing individual interests requires the more advanced personalized strategies discussed next.
Personalized retrieval recommends items that align with user interests. While popular items may draw broad attention, they do not always match individual user preferences. Hence, personalized retrieval aims to cater to users' dynamic information needs and maintain their stickiness. As shown in Figure 3(b), U2I, U2I2I (often shortened as I2I for simplicity), and U2U2I are three popular personalized retrieval strategies, where 'U' stands for the user, 'I' stands for the item, and '2' is short for 'to', denoting the process of seeking similarity or connection. For instance, U2I directly correlates the target user with items he/she might like; U2I2I identifies items similar to those the target user has previously interacted with; and U2U2I finds users with similar tastes to suggest items for the target user. Each of these paths offers a unique strategy to discover and recommend items that a user might be interested in.
Additionally, recommender systems often use the collaborative filtering (CF) and content based (CB) classifications. CF operates on the principle of gauging similarity based on user interaction data, typically through a rating matrix reflecting user interactions with items. This approach assumes that users who have agreed in the past are likely to agree in the future. CB methods, on the other hand, focus on item attributes or user profiles to determine similarity: they analyze item features and recommend items similar to those a user has shown interest in before, based on the content of the items themselves. Both CF and CB offer distinct ways to measure similarity, with CF leveraging user behavior patterns across the community and CB concentrating on item attributes and user profiles. Combining these strategies allows for a more nuanced approach to personalized retrieval, enhancing user experience by connecting individuals with content that resonates with their interests and past interactions.
In this section, we will introduce the common input data or information used in retrieval methods, which includes the user-item rating matrix and rich side information. We categorize the side information into three aspects: user profile, item attributes, and contextual information.
The user-item rating matrix, denoted as $R$, is the fundamental input data for retrieval, capturing user preferences towards various items. Each entry $R_{ij}$ within the matrix indicates the preference of user $i$ for item $j$; if $R_{ij} > 0$, it suggests a positive preference. Preferences can be recorded directly through explicit ratings or indirectly through binary indicators that reflect actions such as clicks, views, or purchases. In Figure 3(c), we show the user-item rating matrix, where positive feedback is indicated by smiley faces and negative feedback by sad faces. It is important to note that the matrix is typically sparse, as the known preferences are usually limited, with many entries unspecified. We usually regard these unspecified (i.e., not interacted) entries as negative samples.
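As a small illustration of how such data is typically materialized, the following sketch (assuming SciPy is available; the toy interactions are invented) builds a sparse implicit-feedback matrix in which unobserved entries simply remain zero:

```python
import numpy as np
from scipy.sparse import csr_matrix

# (user, item) pairs observed as clicks/views/purchases; every other
# entry of R is unspecified and usually treated as a (weak) negative.
users = np.array([0, 0, 1, 2, 2, 2])
items = np.array([1, 3, 0, 1, 2, 4])
vals = np.ones_like(users, dtype=np.float32)  # implicit feedback: R_ij = 1

R = csr_matrix((vals, (users, items)), shape=(3, 5))
print(R.toarray())
# In practice the density is tiny: most entries of R stay zero.
```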
In addition to the user-item rating matrix, three types of side information play crucial roles in enhancing retrieval methods in recommender systems (Shi et al., 2014), as illustrated in Figure 3(d). One vital source is the user profile, which might include demographic information such as a user's gender, age, and hobbies. Item attributes also provide rich information, detailing properties of the items like their category or content. Contextual information provides insights into the circumstances under which user interactions with items occur (Adomavicius and Tuzhilin, 2010). Commonly included in this category are timestamps, which track when interactions like ratings or purchases take place (Xiong et al., 2010; Koren, 2009). Additionally, other relevant details like a user's location during an app download (Böhmer et al., 2011) or their hunger status when rating a food menu (Ono et al., 2009) can also enrich the understanding of user-item interactions, further tailoring the retrieval process to specific user conditions or preferences.
Additionally, in recent years, user-generated content (UGC) has become increasingly prevalent on UGC platforms, enriching item attributes. Tags, for instance, are short textual labels that users freely assign to items without being limited to a predefined list of categories (Robu et al., 2009). These tags can describe the characteristics of an item or express users' feelings, serving as a valuable source of data. Reviews and comments published by users constitute another crucial source of side information that has been successfully leveraged to improve retrieval performance (Levi et al., 2012).
Multi-channel retrieval is widely used in recommender systems; it employs various independent retrieval methods to individually retrieve different subsets of items (Covington et al., 2016; Grbovic and Cheng, 2018). As shown in Figure 1(b), these retrieved item subsets are then aggregated to form a comprehensive candidate pool for downstream ranking stages. The primary motivation is to maximize the coverage of users' diverse interests and improve the recall rate (Zhang et al., 2019). By using different retrieval methods, multi-channel retrieval captures a wide range of potential user preferences and enhances the overall performance of recommender systems.
The choice of retrieval channels is highly dependent on the specific application. For example, the 'trending list' and 'interest tags' are essential for timely platforms like news streaming, while 'friend favorites' is more important for social media platforms. The number of items to be retrieved (i.e., the top-$K$) can vary across channels and is typically optimized through offline evaluation and online A/B testing.
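The aggregation across channels can be as simple as a quota-based union with de-duplication. The sketch below is a hypothetical illustration (the function and channel names are ours, not taken from any cited system):

```python
def merge_channels(channel_results, quotas):
    """Union per-channel top-K lists into one candidate pool.
    channel_results: {channel_name: ranked list of item ids}
    quotas: {channel_name: K for that channel}, tuned offline / via A/B tests.
    """
    pool, seen = [], set()
    for name, ranked in channel_results.items():
        for item in ranked[: quotas[name]]:
            if item not in seen:        # de-duplicate across channels
                seen.add(item)
                pool.append(item)
    return pool

pool = merge_channels(
    {"trending": [5, 9, 2], "i2i": [9, 7], "u2i": [3, 5, 8]},
    {"trending": 2, "i2i": 2, "u2i": 3},
)
```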
In Sections 3 and 4, we discuss efforts focused on accurate user-item similarity learning, thereby enhancing effectiveness. This section centers on similarity learning using shallow structures, which primarily revolves around the concept of collaborative filtering (CF). CF, as the name suggests, leverages collective feedback, evaluations, and opinions to help users filter through vast amounts of information, and it is a cornerstone technique in the retrieval stage of recommender systems. These approaches can be divided into two categories: (1) neighborhood based methods (Sarwar et al., 2001) and (2) matrix factorization (MF) based methods (Koren et al., 2009; Manotumruksa et al., 2017). In this section, we explore these two types of traditional retrieval methods in detail.
Neighborhood based CF methods (also called memory based CF methods in some literature (Su and Khoshgoftaar, 2009; Chen et al., 2018a)) are a fundamental subset of collaborative filtering techniques used in the retrieval stage. These methods operate by clustering users or items based on similarities in their interactions and preferences, thereby forming 'neighborhoods'. There are primarily two types of neighborhood based CF methods, user based and item based, each with its distinct processes, advantages, and application contexts.
This approach focuses on finding users similar to the target user. Similarity between users is typically calculated using measures such as cosine similarity or the Pearson correlation coefficient based on their ratings in the user-item rating matrix (Breese et al., 2013; Sarwar et al., 2000). Using the Pearson correlation coefficient, the similarity between user $u$ and user $v$ can be represented as:

(3)  $\mathrm{sim}(u, v) = \frac{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)(r_{vi} - \bar{r}_v)}{\sqrt{\sum_{i \in I_{uv}} (r_{ui} - \bar{r}_u)^2} \sqrt{\sum_{i \in I_{uv}} (r_{vi} - \bar{r}_v)^2}},$
where $I_{uv}$ is the set of items rated by both users $u$ and $v$, $r_{ui}$ is the rating of user $u$ for item $i$, and $\bar{r}_u$ is the average rating of user $u$. Once similar users are identified, the system recommends items that these similar users have liked but the target user has not yet interacted with. User based CF can provide highly personalized recommendations since it directly leverages user behaviors and preferences. However, its scalability (Karypis, 2001) is a significant issue due to the computational complexity of comparing each user with every other user, especially in systems with a large number of users. It also suffers from the cold-start problem for new users with insufficient interaction data.
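For illustration, a direct NumPy translation of Eq. (3) might look as follows (a naive sketch for dense matrices; production systems precompute and index these similarities rather than computing them on the fly):

```python
import numpy as np

def pearson_sim(R: np.ndarray, u: int, v: int) -> float:
    """Pearson correlation over items co-rated by users u and v (Eq. 3).
    R: dense user-item rating matrix with 0 marking 'no rating'."""
    co = (R[u] > 0) & (R[v] > 0)              # I_uv: co-rated items
    if not co.any():
        return 0.0
    ru = R[u, co] - R[u, R[u] > 0].mean()     # center by each user's mean rating
    rv = R[v, co] - R[v, R[v] > 0].mean()
    denom = np.linalg.norm(ru) * np.linalg.norm(rv)
    return float(ru @ rv / denom) if denom > 0 else 0.0
```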
Item based CF focuses on calculating item similarity based on the patterns of users who have interacted with the items (Sarwar et al., 2001; Linden et al., 2003). The similarity between item $i$ and item $j$ can be calculated as:

(4)  $\mathrm{sim}(i, j) = \frac{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)(r_{uj} - \bar{r}_j)}{\sqrt{\sum_{u \in U_{ij}} (r_{ui} - \bar{r}_i)^2} \sqrt{\sum_{u \in U_{ij}} (r_{uj} - \bar{r}_j)^2}},$
where $U_{ij}$ is the set of users who have rated both items $i$ and $j$, $r_{ui}$ is the rating of user $u$ for item $i$, and $\bar{r}_i$ is the average rating of item $i$. Recommendations to a user are made by identifying items similar to those the user has already liked or interacted with, assuming that users will prefer items similar to their previous interests. Item based CF tends to be more scalable than user based CF, as the item-item similarity matrix is typically more stable and changes less frequently than user-user similarities (Sarwar et al., 2001; Bell and Koren, 2007; Takács et al., 2007). It also mitigates the cold-start problem for long-tail users with limited interactions to some extent, but still struggles with completely cold-start users. Item based CF is particularly effective in systems where item relationships are pronounced and stable, such as movie or book recommendation, where items have longer lifespans and well-defined features; it is less suitable when user tastes are highly diverse or when new items frequently enter the system.
Both user based CF and item based CF enhance the retrieval process by incorporating the vast amount of user-item interactions. However, the choice between them, or the decision to integrate both approaches, depends largely on the specific requirements, scalability considerations, and the nature of user interactions within the recommendation platform. Modern recommender systems often employ hybrid models that combine the strengths of user based CF and item based CF to overcome their respective limitations.
Traditional neighborhood based CF methods, like those described above, primarily utilize direct pairwise interactions between users and items to make recommendations. These are either based on user-user or item-item similarities derived directly from the user-item rating matrix. The Swing algorithm (Yang et al., 2020b) introduces a quasi-local structure that goes beyond these direct interactions by examining the relationships among users and items in a more comprehensive manner. Swing is built around a swing score, which measures the relationship strength between two items based on their shared interactions with users. Specifically, the swing score between item $i$ and item $j$ is defined as:

(5)  $s(i, j) = \sum_{u \in U_i \cap U_j} \sum_{v \in U_i \cap U_j} \frac{1}{\alpha + |I_u \cap I_v|},$

where $U_i$ is the set of users who have clicked item $i$, and $I_u$ is the set of items that user $u$ has clicked. The parameter $\alpha$ serves as a smoothing coefficient to manage the influence of highly interactive user-item pairs. By utilizing such quasi-local structures, Swing can uncover more complex and subtle patterns in user-item interactions, leading to richer and more accurate collaborative filtering results.
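A readable, brute-force rendering of Eq. (5) is sketched below (our illustration; it enumerates all item pairs and unordered user pairs, so a production implementation would instead iterate over co-click pairs and parallelize):

```python
from itertools import combinations
from collections import defaultdict

def swing_scores(user_items: dict, alpha: float = 1.0) -> dict:
    """Swing score s(i, j): for every pair of users (u, v) that both clicked
    items i and j, add 1 / (alpha + |I_u ∩ I_v|)  (cf. Eq. 5).
    user_items: {user: set of clicked items}"""
    item_users = defaultdict(set)
    for u, items in user_items.items():
        for i in items:
            item_users[i].add(u)
    scores = defaultdict(float)
    for i, j in combinations(item_users, 2):
        common = item_users[i] & item_users[j]   # users who clicked both i and j
        for u, v in combinations(common, 2):     # unordered user pairs
            overlap = len(user_items[u] & user_items[v])
            scores[(i, j)] += 1.0 / (alpha + overlap)
    return scores
```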
Additionally, some studies have built upon the user-item rating matrix by using side information to refine similarity calculations (Shi et al., 2014). Melville et al. (2002) first propose filling missing values in the user-item rating matrix using item side information such as titles and genres. Another category of approaches focuses on using tags as side information (Firan et al., 2007; Shepitsen et al., 2008). Tso-Sutter et al. (2008) develop a method to blend user-user or item-item similarities based on both tags and ratings within a neighborhood based CF framework. To combat the inherent noisiness of tags, schemes to enhance tag reliability have been introduced, such as assigning different weights to different tags and integrating these with conventional neighborhood based CF approaches (Liang et al., 2010).
Matrix factorization (MF) based collaborative filtering methods (Koren et al., 2009), also referred to as latent factor models (Koren, 2008; Su and Khoshgoftaar, 2009), play a pivotal role in the retrieval stage. Originating from the need to effectively map both users and items into a latent factor space, these methods estimate user-item interactions, typically ratings, by calculating the dot product of their latent factors. The essence of MF involves decomposing the user-item rating matrix $R$ (of shape $M \times N$, where $M$ is the number of users and $N$ is the number of items) into two lower-dimensional matrices: $U$ (of shape $K \times M$, representing latent user preferences) and $V$ (of shape $K \times N$, representing latent item attributes), aiming to approximate the original rating matrix $R$ with the product $U^{T} V$ as shown in equation 6. Given the sparsity of the rating matrix and computational constraints, optimization is typically achieved through techniques such as stochastic gradient descent, complemented by regularization terms to mitigate overfitting. We use the same notation as previous works (Shi et al., 2014): as shown in equation 7, $U$ and $V$ represent matrices of latent factors, where $U^{*}$ and $V^{*}$ denote their optimal values determined through the optimization process. Each column vector $U_i$ in $U$ corresponds to the latent factors associated with user $i$, while $V_j$ in $V$ corresponds to the latent factors for item $j$. The indicator function $I_{ij}$ equals 1 if $R_{ij} > 0$ and 0 otherwise. The Frobenius norm of a matrix is represented as $\|U\|_F$, and $\lambda_U$ and $\lambda_V$ are regularization parameters used to prevent overfitting.
(6)  $R \approx U^{T} V.$
(7)  $U^{*}, V^{*} = \mathop{\arg\min}_{U, V} \; \frac{1}{2} \sum_{i=1}^{M} \sum_{j=1}^{N} I_{ij} \left( R_{ij} - U_i^{T} V_j \right)^{2} + \frac{\lambda_U}{2} \| U \|_F^{2} + \frac{\lambda_V}{2} \| V \|_F^{2}.$

MF based CF gained prominence during the Netflix Challenge (Takács et al., 2008), highlighting its capability to handle large and sparse datasets by uncovering complex, latent relationships within the data. To expand upon the foundational MF approach, advanced variations like SVD++ (Koren, 2008) were developed. SVD++ not only considers the explicit ratings that users provide but also integrates implicit feedback, such as browsing or purchase history, to enhance the model's predictive accuracy and relevance. This integration helps capture a more complete picture of user behavior and preferences, greatly enriching the algorithm's applicability.
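A minimal SGD loop for the objective in Eq. (7) might look as follows (a didactic sketch assuming a small dense rating matrix; the hyperparameters are illustrative):

```python
import numpy as np

def mf_sgd(R, k=16, lr=0.01, lam=0.05, epochs=20, seed=0):
    """Factorize R ≈ U^T V by SGD on the objective of Eq. (7).
    Only observed entries (R_ij > 0, i.e., I_ij = 1) contribute to the loss."""
    rng = np.random.default_rng(seed)
    m, n = R.shape
    U = 0.1 * rng.standard_normal((k, m))   # latent user factors (K x M)
    V = 0.1 * rng.standard_normal((k, n))   # latent item factors (K x N)
    rows, cols = np.nonzero(R)              # observed (user, item) pairs
    for _ in range(epochs):
        for i, j in zip(rows, cols):
            err = R[i, j] - U[:, i] @ V[:, j]
            # gradient steps with L2 (Frobenius) regularization
            U[:, i] += lr * (err * V[:, j] - lam * U[:, i])
            V[:, j] += lr * (err * U[:, i] - lam * V[:, j])
    return U, V
```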
Further extending the capabilities of traditional MF, the Factorization Machine (FM) (Rendle, 2010) incorporates additional side information, which MF does not typically include. Unlike traditional MF, which focuses on learning latent representations for user and item IDs, FM can model interactions among arbitrary available features, providing a powerful framework capable of operating within highly sparse data environments and across different domains, where data interconnectivity plays a critical role. Temporal dynamics are considered in methods like TimeSVD++ (Koren, 2009), which specifically incorporates temporal information to improve predictions over time. Additionally, SLIM (Ning and Karypis, 2011) is a simple linear model that merges the benefits of neighborhood based CF and latent factor models. These enhancements and innovations in MF based methods underscore their vital role in the evolution of retrieval methods, offering a scalable solution that delivers quick and precise predictions even for large databases (Bobadilla et al., 2013). The methods discussed so far primarily use shallow structures for user-item similarity learning. In Section 4, we introduce more recent works that leverage deep structures, exploring how deep learning techniques are used to further improve the effectiveness of retrieval methods.
Recently, with the advent of deep learning, there has been a significant shift towards more sophisticated, deep structures for user-item similarity learning. Modern recommender systems are increasingly adopting embedding based retrieval (EBR). EBR transforms the retrieval problem into a nearest neighbor search within a vector space where both users and items are represented as points. This structure allows EBR to incorporate rich user and item features into the representation learning process, considering a broader range of features beyond mere user and item IDs. We classify these methods into six categories based on their structure, as shown in Figure 4 (i.e., two-tower methods, autoencoder based methods, graph embedding based methods, graph neural network based methods, MLP based collaborative filtering methods, and multi-interest methods), and we will now discuss each category in detail.
The two-tower methods represent a state-of-the-art approach in the retrieval stage, offering a balance between prediction accuracy and inference efficiency (Li et al., 2022). In these methods, the training data consists of instances denoted as $(x, y)$, where $y \in \{0, 1\}$ indicates user feedback (Richardson et al., 2007): a label of 1 signifies positive feedback, and 0 denotes negative feedback. The feature vector $x$ can be divided into user features and item features. In the two-tower architecture, these user and item features are processed independently in parallel sub-towers, namely the user tower and the item tower. Taking the user tower as an example, user features are first passed through an embedding layer, where user feature embeddings are obtained via an embedding look-up operation (Zhang et al., 2016). These embeddings are then processed through multiple fully connected layers to refine the representation. The final output of the two-tower model is the inner product of the user and item representations. This architecture decouples user and item modeling, which is highly efficient and significantly reduces inference latency and computational resources. Classic industrial implementations of two-tower methods include Microsoft's DSSM (Huang et al., 2013), Google's YouTubeDNN (Covington et al., 2016), and Airbnb's personalized embeddings (Grbovic and Cheng, 2018), all of which led to significant performance improvements. While efficient, the two-tower methods mentioned above often face limitations in adequately modeling user-item interactions, which can impede model performance. To overcome such challenges, advanced techniques have been developed to deepen the interaction between the user and item sides, which we detail among the future directions in Section 9.
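The architecture described above can be sketched in a few lines of PyTorch (our simplified illustration: ID-only embeddings stand in for the full feature sets, and the tower depth and sizes are arbitrary):

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """Embedding look-up followed by fully connected layers."""
    def __init__(self, n_ids: int, dim: int = 32):
        super().__init__()
        self.emb = nn.Embedding(n_ids, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, ids):
        return self.mlp(self.emb(ids))

class TwoTower(nn.Module):
    def __init__(self, n_users: int, n_items: int, dim: int = 32):
        super().__init__()
        self.user_tower = Tower(n_users, dim)
        self.item_tower = Tower(n_items, dim)

    def forward(self, user_ids, item_ids):
        u = self.user_tower(user_ids)   # user representations
        v = self.item_tower(item_ids)   # item representations
        return (u * v).sum(-1)          # late interaction: inner product
```

At serving time the item tower is run offline over the whole corpus, and only the user tower is evaluated per request, which is what makes the decoupling efficient.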
Autoencoder based methods, particularly Variational Autoencoders (VAEs) (Kingma and Welling, 2013), are increasingly utilized in the retrieval stage of recommender systems to address challenges like data sparsity and the complexity of user-item interactions. These methods excel in generating and representing complex data structures. VAEs operate by assuming that observable data points are generated from latent variables following some probability distribution, such as a standard Gaussian distribution. They consist of two main components: an encoder and a decoder. The encoder, or inference model, approximates the true posterior distribution of the latent variables given the input data, effectively capturing the underlying data structure. The decoder, on the other hand, reconstructs the input data from the latent representation, aiming to match the original input as closely as possible.
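These encoder-decoder mechanics translate into a compact model. The sketch below is ours, loosely following the multinomial-likelihood setup popularized by Mult-VAE, with arbitrary layer sizes; it shows the reparameterization step and the reconstruction-plus-KL objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAECF(nn.Module):
    """Encoder q(z|x) and decoder p(x|z) over a user's interaction vector x."""
    def __init__(self, n_items: int, latent: int = 64):
        super().__init__()
        self.enc = nn.Linear(n_items, 2 * latent)  # outputs mean and log-variance
        self.dec = nn.Linear(latent, n_items)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        logits = self.dec(z)
        # multinomial reconstruction loss plus KL to the standard Gaussian prior
        recon = -(F.log_softmax(logits, -1) * x).sum(-1).mean()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum(-1).mean()
        return recon + kl
```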
The effectiveness of VAEs in the retrieval stage of recommender systems is showcased by their ability to learn comprehensive user preferences and item characteristics through deep neural networks. Wu et al. (2016) introduce CDAE, which incorporates a user-specific bias into an autoencoder framework; CDAE generalizes various existing CF techniques. Mult-VAE (Liang et al., 2018) sparks interest in VAEs for CF by showcasing their superiority over other benchmarks. Simultaneously, Lee et al. (2017) explore the VAE framework for CF with an emphasis on conditional and joint VAE formulations to integrate auxiliary user data. CVAE (Li and She, 2017) and MD-CVAE (Zhu and Chen, 2022) enhance collaborative latent variable modeling by incorporating item content information. Karamanolakis et al. (2018) apply VAEs to personalized recommendations, delving into user-dependent priors. More recent efforts by Kim and Suh (2019) evaluate VAEs enhanced by VampPrior (Tomczak and Welling, 2018). MacridVAE (Ma et al., 2019) and CaD-VAE (Wang et al., 2023a) focus on modeling the latent factors behind interactions to accurately capture the distribution of latent variables. RaCT (Lobel et al., 2019) introduces an actor-critic reinforcement learning technique to optimize VAEs, while RecVAE (Shenbin et al., 2020) examines various regularization methods to refine VAE performance in CF. BiVAE (Truong et al., 2021) treats users and items symmetrically and proposes bilateral user and item based inference models to autoencode users and items under a unified framework. FastVAE (Chen et al., 2022) improves model efficiency using an inverted multi-index technique. SE-VAE (Cho and Oh, 2022) advances latent variable modeling by integrating a multi-expert system with stochastic expert selection, enhancing the model's adaptability and accuracy.
Graph embedding based methods have been pivotal in enhancing the retrieval process in recommender systems. These methods extend the principles of word embeddings, such as Word2Vec (Mikolov et al., 2013), to the domain of recommendation. Item2Vec (Barkan and Koenigstein, 2016) adapts Word2Vec to items, treating user behavior sequences over items (like products or songs) as sentences to uncover similarities based on co-occurrence. This enables better recommendations by capturing contextual similarities among items. For example, Airbnb successfully applied Item2Vec (Grbovic and Cheng, 2018) to personalize user experiences in online accommodations. DeepWalk (Perozzi et al., 2014) uses random walks on graphs to generate node sequences, learning representations that reflect the social structure of interactions. Node2Vec (Grover and Leskovec, 2016) enhances this approach by balancing breadth-first and depth-first sampling, capturing diverse connectivity patterns in the graph. Further refining this, EGES (Wang et al., 2018a) integrates side information (like user demographics or item attributes) into the embeddings, enriching the model's contextual understanding. LINE (Tang et al., 2015) preserves both direct connections and neighbor similarities, ensuring that embeddings reflect both explicit and implicit relationships. SDNE (Wang et al., 2016) uses autoencoders to learn node representations, adeptly capturing nonlinear relationships within complex, hierarchical structures. All these models underscore the importance of graph embeddings in understanding and leveraging complex user-item interaction patterns.
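Because Item2Vec reuses the Word2Vec machinery directly, it can be prototyped with an off-the-shelf library. The sketch below assumes gensim 4.x and invented session data:

```python
from gensim.models import Word2Vec

# User behavior sequences play the role of sentences; items, of words.
sessions = [
    ["item_12", "item_7", "item_33"],
    ["item_7", "item_33", "item_90", "item_12"],
]
model = Word2Vec(sessions, vector_size=64, window=5, min_count=1, sg=1, epochs=10)

# Co-occurrence-based nearest neighbors serve as I2I retrieval candidates.
similar = model.wv.most_similar("item_7", topn=3)
```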
Graph Construction. Applying GNNs directly to user-item graphs (Berg et al., 2017; He et al., 2020; Sun et al., 2020; Zheng et al., 2018) can face challenges due to sparse connections or large scale. Enhancements such as adding connections (Sun et al., 2019; Liu et al., 2020b) or virtual nodes (Wang et al., 2020a; Li et al., 2020) improve graph connectivity and retrieval quality. Techniques like Multi-GCCF (Sun et al., 2019) and DGCF (Liu et al., 2020b) enhance graphs by linking two-hop neighbors, while HiGNN (Li et al., 2020) clusters similar entities to enhance user-item graphs. For better efficiency, sampling methods like those used in Multi-GCCF (Sun et al., 2019) and PinSage (Ying et al., 2018) help process large-scale graphs by balancing the retention of meaningful information and computational demands.
Neighbor Aggregation. This step propagates information through the graph. Methods vary from mean-pooling (Berg et al., 2017; Tan et al., 2020) to more sophisticated attention mechanisms (Wang et al., 2020b; Ma et al., 2020) that weigh neighbors based on relevance. PinSage (Ying et al., 2018) uses normalized visit counts to determine neighbor importance.
Information Update. Updating node representations is crucial. Some methods fully replace node representations with the aggregated information (Berg et al., 2017; He et al., 2020), while others combine original attributes with neighbor information through pooling or transformations (Li et al., 2019b; Ying et al., 2018). LightGCN (He et al., 2020) simplifies this process by omitting non-linearities, enhancing both performance and efficiency.
Final Node Representation. After multiple layers of aggregation and updating, the final node representation is used for predictions. Methods like weighted pooling and concatenation integrate information from different layers, leveraging the unique characteristics captured at each depth.
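To make these steps concrete, the sketch below condenses aggregation, update, and layer pooling in the simplified style of LightGCN (our illustration; the normalized adjacency matrix is assumed to be precomputed):

```python
import torch

def lightgcn_propagate(adj_norm, user_emb, item_emb, n_layers=3):
    """LightGCN-style propagation: linear neighbor aggregation with no
    feature transforms or non-linearities; layer outputs are averaged.
    adj_norm: normalized sparse adjacency over all user+item nodes."""
    x = torch.cat([user_emb, item_emb], dim=0)
    layers = [x]
    for _ in range(n_layers):
        x = torch.sparse.mm(adj_norm, x)        # one hop of aggregation
        layers.append(x)                        # keep each depth's output
    out = torch.stack(layers, dim=0).mean(0)    # pool information across layers
    return out[: user_emb.size(0)], out[user_emb.size(0):]
```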
GNN based retrieval methods show remarkable results in understanding and leveraging complex user-item interaction patterns. However, enhancing their efficiency for large-scale retrieval applications remains a critical research problem.
Unlike conventional collaborative filtering techniques that often rely on linear models and explicit factorization, neural collaborative filtering (NCF) (He et al., 2017) and its variants leverage the expressive power of multi-layer perceptrons (MLPs) to capture complex user-item interactions through non-linear optimization. Bai et al. (2017) enhance NCF by incorporating local neighborhood information using convolutional filters, improving the interaction representation between users and items. Liu et al. (2017) propose PhD, a hybrid recommendation model combining user-item interactions with side information using probabilistic matrix factorization. Cao et al. (2018) develop a group recommendation method integrating an attention mechanism with an MLP within the NCF framework. Chen et al. (2018b) utilize heterogeneous information networks (HIN) to represent diverse data types, capturing complex semantics through meta paths. Fu et al. (2018) investigate both local and global co-occurrences in user-user and item-item relations to capture contextual information more comprehensively. Zhou et al. (2019) enrich the model's understanding by incorporating textual representations from items, enhancing both explicit and implicit feedback. Chen et al. (2019a) combine deep feature learning and deep interaction modeling using two MLP networks with a rating matrix. Otunba et al. (2019) propose an ensemble framework integrating generalized matrix factorization (GMF) and MLP networks, leveraging the strengths of deep and shallow architectures. Xue et al. (2019) focus on capturing high-order item relations in item based CF models using an NCF based approach. Yi et al. (2019a) develop deep matrix factorization (DMF), integrating diverse side information from users and items to enhance the retrieval process. Chen et al. (2020) propose efficient whole-data learning strategies to avoid the common sampling techniques in CF retrieval methods. Additionally, closely related methods include CML (Hsieh et al., 2017), which is based on metric learning, and NBPO (Yu and Qin, 2020), which employs noisy-label robust learning techniques for retrieval.
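A compact NeuMF-style sketch (our illustration with arbitrary sizes) shows how a linear GMF branch and a non-linear MLP branch can be fused, which is the core idea behind NCF's expressiveness:

```python
import torch
import torch.nn as nn

class NCF(nn.Module):
    """NeuMF-style model: a GMF branch (element-wise product) fused with
    an MLP branch over concatenated user/item embeddings."""
    def __init__(self, n_users: int, n_items: int, dim: int = 16):
        super().__init__()
        self.u_gmf = nn.Embedding(n_users, dim); self.i_gmf = nn.Embedding(n_items, dim)
        self.u_mlp = nn.Embedding(n_users, dim); self.i_mlp = nn.Embedding(n_items, dim)
        self.mlp = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
        self.out = nn.Linear(2 * dim, 1)

    def forward(self, u, i):
        gmf = self.u_gmf(u) * self.i_gmf(i)                            # linear interactions
        mlp = self.mlp(torch.cat([self.u_mlp(u), self.i_mlp(i)], -1))  # non-linear interactions
        return torch.sigmoid(self.out(torch.cat([gmf, mlp], -1))).squeeze(-1)
```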
The deep neural retrieval methods mentioned above typically assign a single vector to represent each user, encapsulating their preferences and behaviors. However, this single-vector representation often falls short in capturing the complexity and diversity of a user's multiple interests. As a result, recent works have developed models that maintain multiple interest vectors for each user. One of the pioneering models is MIND (Li et al., 2019c), which introduces a dynamic routing mechanism to assign each user interaction to specific interests. ComiRec (Cen et al., 2020) builds on this by employing a self-attentive mechanism to distill user interests from their activity patterns. Further advancements include models like SINE (Tan et al., 2021) and Octopus (Liu et al., 2020a), which establish interest pools and use attention mechanisms to activate relevant interests based on historical interactions. DGCF (Wang et al., 2020a) extends these concepts by integrating a dynamic routing mechanism into graph structures, focusing on the independence among interests for enhanced multi-interest learning.
The dynamic nature of user preferences is further explored in works like PinnerSage (Pal et al., 2020) and MIP (Shi et al., 2023), emphasizing the importance of models adapting to the evolution of user preferences over time. Models like PIMIRec (Chen et al., 2021) and UMI (Chai et al., 2022) integrate time information, interactivity, and user profiles to represent user interests more contextually and temporally. Re4 (Zhang et al., 2022) introduces backward flow regularization to enhance multi-interest learning by capturing the details of user interest evolution. MIRN (Zhang et al., 2023b) uses EM routing to segment user behavior for tailored interests. kNN-Embed (El-Kishky et al., 2023) improves diversity in candidate generation by representing users as mixtures over learned item clusters. HimiRec (Pei et al., 2024) explores hierarchical interest modeling for structured user preferences. MTMI (Xiong and Yu, 2024) advocates a multi-tower architecture to diversify user representations, addressing adaptability and complexity in a dynamic environment. REMI (Xie et al., 2023) enhances retrieval quality by mining hard negatives that consider user interests and employing routing regularization for improved robustness and accuracy.
After obtaining high-quality user and item representations as described in Sections 3 and 4, we now discuss efficient indexing mechanisms in modern retrieval systems, which are crucial for organizing and retrieving large-scale items effectively. Indexing methods can be broadly categorized into three main types: (1) no index, with brute-force computation over the whole corpus; (2) approximate nearest neighbor search; and (3) tree based indexing.
In the simplest form of retrieval, there is no indexing mechanism, and the system performs brute-force computation over the whole corpus: it computes the similarity between the user and every item in the candidate pool. While this method is straightforward to implement, it is computationally expensive and impractical for large-scale systems with millions of items due to the sheer volume of calculations required.
To efficiently retrieve the top-$K$ items from a vast candidate pool once the embeddings for users and items have been obtained, approximate nearest neighbor (ANN) search techniques have been developed. These techniques are essential for handling large datasets where full pairwise comparisons would be computationally prohibitive.
Various techniques and tools facilitate ANN search, such as Faiss (Johnson et al., 2019), Milvus (Wang et al., 2021), and ScaNN (Guo et al., 2020a), which are instrumental in industrial applications for efficiently retrieving the top-$K$ nearest vectors. The construction of popular ANN indices, such as HNSW (Malkov and Yashunin, 2018) and IVFPQ in Faiss (Johnson et al., 2019), involves time-consuming procedures, including clustering algorithms like $K$-means. Despite these challenges, the advantages of ANN search in practical applications, particularly in speeding up retrieval and reducing computational load, make it a valuable technique in modern retrieval systems.
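As a usage illustration, building and querying an HNSW index in Faiss takes only a few calls (random vectors stand in for learned embeddings; note that Faiss's HNSW index uses L2 distance by default, so smaller scores are better):

```python
import numpy as np
import faiss  # assuming the faiss-cpu package is installed

d = 64
item_vecs = np.random.rand(100_000, d).astype("float32")
user_vec = np.random.rand(1, d).astype("float32")

index = faiss.IndexHNSWFlat(d, 32)  # HNSW graph, 32 links per node
index.add(item_vecs)                # built once, offline
dists, ids = index.search(user_vec, 100)  # approximate top-K at serving time
```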
Tree based indexing methods have gained attention in recent years for their ability to efficiently organize and retrieve large-scale items. These methods leverage hierarchical structures to capture the diversity of user interests and represent information more effectively. Tree based indexing methods offer several advantages over vector based approaches. They can significantly reduce the computational complexity by structuring the data hierarchically, enabling logarithmic scale computation relative to the size of the corpus during prediction. Additionally, they allow for the interaction between candidate items and users within the model, enhancing the expressive power of the retrieval method. These methods not only make the use of arbitrary advanced models possible, but also excel at discovering novel and effective recommendations by exploring the entire dataset and utilizing sophisticated deep models to uncover potential user interests. This capability represents a significant leap forward, offering a powerful solution for improving the performance of retrieval.
TDM (Zhu et al., 2018) is a pioneering work in this area. It ties each item in the corpus to a leaf node via clustering, using a tree structure as an index. The model computes a probability value for each node in the tree from the embedding vector the node represents: the user's interest in a specific node is described by its probability value, while the diversity of user interests is represented by the entire tree. This transforms the retrieval process into a sequence of hierarchical classification tasks, achieving efficient and scalable retrieval. JTM (Zhu et al., 2019) improves upon TDM by synchronizing the objectives of index learning and model training, addressing inconsistencies and enhancing the model's overall performance. OTM (Zhuo et al., 2020) further refines this approach by addressing the discrepancy between training and testing in tree based methods; it introduces the notions of Bayes optimality under beam search and calibration under beam search, providing general tools for analyzing and improving tree based methods.
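The retrieval procedure shared by these tree based methods is a layer-wise beam search. The sketch below is our schematic version, not the exact TDM implementation; it assumes a complete tree (all leaves at the same depth) and an arbitrary node-scoring model:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    item_id: int = -1                      # set on leaves only
    children: list = field(default_factory=list)

def beam_search(root, score, beam=32, k=10):
    """Descend level by level, keeping the `beam` highest-scoring nodes;
    the surviving leaves are the retrieved candidates. `score(node)` is a
    learned user-preference model evaluated on the node's embedding."""
    frontier = [root]
    while frontier[0].children:            # stop once leaves are reached
        children = [c for n in frontier for c in n.children]
        children.sort(key=score, reverse=True)
        frontier = children[:beam]
    return [n.item_id for n in frontier[:k]]
```

Because only the beam is expanded at each level, the number of model evaluations grows logarithmically with the corpus size, which is what allows arbitrarily complex preference models inside the scorer.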
Furthermore, Deep Retrieval (DR) (Gao et al . , 2020) innovates by learning a retrievable structure end-to-end, overcoming data scarcity at the leaf level in large-scale recommender systems. DR enables efficient item clustering and retrieval by using a matrix for indexing. These advanced indexing mechanisms play a critical role in improving the efficiency and effectiveness of retrieval stages in recommender systems, offering advantages over early methods by reducing computational loads and enhancing retrieval performance. However, they do require substantial engineering effort to implement, underscoring the need for skilled development to fully realize their performance benefits.
In this section, we discuss learning strategies and optimization methods for the retrieval stage, covering three main aspects: data preparation, loss functions, and evaluation metrics. In scenarios involving implicit user feedback, the lack of reliable negative data poses a significant challenge, so data preparation, especially negative sample construction, is critical. We begin with negative sampling strategies (He et al., 2017; Rendle et al., 2012) that select a subset of the missing data to act as negative instances, and whole-data based strategies (Devooght et al., 2015; Hu et al., 2008; Liang et al., 2016) that treat all missing data as negative instances. We then explore common loss functions and evaluation metrics used in retrieval tasks.
In many real-world scenarios, explicit user preferences are hard to collect. In datasets with implicit feedback, observed interactions indicate a user's positive preference for an item, while all other items remain unlabeled: only positive feedback is observed, and negative feedback is indistinguishable from missing values. Moreover, users typically interact with only a small fraction of the items available in real systems. The absence of reliable negative data therefore makes learning from implicit feedback particularly challenging. To optimize retrieval methods with implicit feedback, two strategies are commonly employed. Negative sampling strategies (He et al., 2017; Rendle et al., 2012) select a subset of the missing data to serve as negative instances. While efficient, this can hurt model performance due to the selective nature of sampling. Whole-data based strategies (Devooght et al., 2015; Hu et al., 2008; Liang et al., 2016), on the other hand, treat all missing data as negative instances. They utilize the complete dataset, which can enhance coverage and may improve effectiveness, but they are less efficient due to the extensive data processing involved and may suffer from false negatives, where items a user never saw are incorrectly treated as irrelevant. We discuss these strategies in further detail below.
The negative sampling strategy extracts negative instances from unobserved data, significantly reducing the training sample size and enhancing the efficiency of the training process (He et al., 2017). This strategy is prevalent in various retrieval methods, from traditional ones like BPR (Rendle et al., 2012) to neural ones like NCF (He et al., 2017). We classify the negative sampling strategies applied in the retrieval stage into four types: static negative sampling (He et al., 2016; Yu et al., 2017), hard negative sampling (Ding et al., 2020; Rendle and Freudenthaler, 2014; Zhang et al., 2013), adversarial negative sampling (He et al., 2018; Jin et al., 2020; Wang et al., 2017; Chae et al., 2018; Wang et al., 2019b; Guo et al., 2020b; Zhou et al., 2021b) and graph based negative sampling (Ying et al., 2018; Huang et al., 2021).
Static negative sampling. When the probability of each item being sampled as a negative instance remains constant throughout training, the strategy is referred to as static negative sampling. The simplest and most widely used variant is Random Negative Sampling (RNS), also known as Uniform Negative Sampling. However, RNS often selects uninformative instances that contribute little to model updates, motivating the heuristic of Popularity-biased Negative Sampling (PNS), first introduced in Word2Vec (Mikolov et al., 2013). (Caselles-Dupré et al., 2018) investigated the importance of its hyperparameters when applied to recommendation. A more common popularity-based strategy uses an item's frequency in the training set as its weight for selection as a negative, thereby favoring more popular items (He et al., 2016; Yu et al., 2017). The rationale is popularity bias: if a user does not interact with a highly popular item, it likely indicates disinterest. However, empirical results suggest that PNS does not consistently outperform RNS in retrieval methods (Wang et al., 2020c).
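A minimal sketch of PNS follows, assuming the Word2Vec-style exponent of 0.75 (the helper and its arguments are our own illustration; `alpha = 0` recovers uniform RNS):

```python
import numpy as np

def popularity_sampler(item_counts, alpha=0.75, seed=0):
    """Draw negatives with probability proportional to count**alpha.

    alpha = 0.75 follows the Word2Vec convention; alpha = 0 recovers
    uniform random negative sampling (RNS).
    """
    probs = np.asarray(item_counts, dtype=np.float64) ** alpha
    probs /= probs.sum()
    rng = np.random.default_rng(seed)

    def sample(positives, n_neg):
        neg = rng.choice(len(probs), size=n_neg, p=probs)
        collide = np.isin(neg, positives)      # re-draw accidental positives
        while collide.any():
            neg[collide] = rng.choice(len(probs), size=collide.sum(), p=probs)
            collide = np.isin(neg, positives)
        return neg

    return sample

sample = popularity_sampler(item_counts=[500, 50, 20, 5, 1])
print(sample(positives=[0], n_neg=3))   # popular items are drawn more often
```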
Hard negative sampling. Static negative sampling cannot dynamically adapt the distribution of candidate negatives during training, and thus struggles to identify informative negative samples. Hard negative sampling instead selects hard negatives that are misclassified or receive high prediction scores, which are more useful for improving performance (Ding et al., 2020; Rendle and Freudenthaler, 2014; Zhang et al., 2013). The most common method mines hard negatives by selecting samples closest to the user, i.e., those most similar in the embedding space. Other innovative strategies include a sampling-bias-corrected algorithm for more accurate item frequency estimation from streaming data (Yi et al., 2019b) and a strategy that utilizes both in-batch and uniformly sampled negatives to mitigate selection bias in implicit recommendation (Yang et al., 2020a).
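The "closest-to-the-user" mining described above can be sketched as follows (a simplified illustration; the function name, candidate-pool size, and the uniform pre-sampling step are our own choices, not a specific published algorithm):

```python
import numpy as np

def mine_hard_negatives(user_emb, item_embs, positives, n_candidates=500,
                        n_hard=5, seed=0):
    """Uniformly pre-sample a candidate pool, score it with the current
    model, and keep the negatives the model (mis)ranks highest."""
    rng = np.random.default_rng(seed)
    cand = rng.choice(len(item_embs), size=n_candidates, replace=False)
    cand = cand[~np.isin(cand, positives)]     # drop observed positives
    scores = item_embs[cand] @ user_emb        # current-model relevance scores
    return cand[np.argsort(-scores)[:n_hard]]  # hardest = highest-scoring
```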
Adversarial negative sampling. Adversarial negative sampling methods typically involve a generator and a discriminator. The generator acts as a sampler, creating samples to confuse the discriminator, which must determine whether a given sample is a positive instance or one generated by the generator. The goal is still to learn a better distribution of negative samples. Some GAN based methods adaptively evolve the sampling probability by optimizing adversarial objectives (He et al . , 2018; Jin et al . , 2020; Wang et al . , 2017; Chae et al . , 2018; Wang et al . , 2019b; Guo et al . , 2020b; Zhou et al . , 2021b) . However, adversarial negative sampling faces significant challenges: its complex framework, unstable performance, and lengthy training times all limit its practical applications. Additionally, the competition between the generator and discriminator may not always converge to an ideal Nash equilibrium, indicating that there is still room for exploration and improvement.
Graph based negative sampling. Hard negative sampling and adversarial negative sampling effectively utilize the semantic information in the embedding space, while graph based negative sampling methods further integrate the structural information of samples within the graph. For instance, (Ying et al . , 2018) sampled negative nodes based on their PageRank scores. (Huang et al . , 2021) proposed MixGCF, which implemented two strategies: positive mixing and hop mixing. Positive mixing enriches original negative samples with positive embeddings to infuse them with positive representations, while hop mixing enhances negative samples through GNN aggregation of neighborhood information. These strategies have achieved SOTA results in sampling methods for GNN based recommender systems.
Additionally, there are some cutting-edge efforts in the field. For instance, while negative sampling strategies primarily aim to improve the quality of negative samples, the ratio of negative sampling determines the quantity of these samples. SimpleX (Mao et al . , 2021a) has shown that even basic CF methods, when combined with the appropriate negative sampling ratio and loss function, can outperform current SOTA retrieval methods. Additionally, in scenarios where only positive and unlabeled data are accessible, sampling inevitably introduces certain biases, such as the false negative issue, a typical example of sample bias. CLRec (Zhou et al . , 2021a) theoretically demonstrates that popularity based selection in contrastive loss equates to reducing exposure bias through inverse propensity weighting, offering a new perspective on understanding the effectiveness of contrastive learning.
Whole-data based strategies, also known as non-sampling strategies, treat all unobserved data as negative examples but assign them a lower weight than positive examples. This approach ensures that the entire dataset is utilized, potentially offering better coverage. For instance, WMF (Hu et al . , 2008) assigns a uniform weight to all unobserved entries. EALS (He et al . , 2016) and ExpoMF (Liang et al . , 2016) adjust the weight of unobserved entries based on item popularity. The rationale is that popular items are more likely to be seen by users and should therefore be given higher weights as negatives. These methods have demonstrated the ability to leverage whole datasets with potentially better coverage (He et al . , 2016; Liang et al . , 2016; Xin et al . , 2018) , while inefficiency is the challenge.
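For intuition, the WMF-style objective can be written as a weighted squared loss over the full interaction matrix. The schematic NumPy version below materializes the dense matrix purely for clarity, which real ALS-based implementations avoid; the function name and weight value are our own illustration:

```python
import numpy as np

def wmf_loss(U, V, R, w_neg=0.1):
    """Whole-data weighted squared loss in the spirit of WMF (Hu et al., 2008).

    R: (n_users, n_items) binary implicit-feedback matrix. Every unobserved
    entry is kept as a negative but down-weighted by w_neg < 1. The dense
    n_users x n_items prediction below is for clarity only; ALS-style
    solvers avoid ever materializing it.
    """
    pred = U @ V.T                             # predicted scores for all pairs
    weights = np.where(R > 0, 1.0, w_neg)      # uniform lower weight on unobserved
    return float(np.sum(weights * (R - pred) ** 2))

rng = np.random.default_rng(0)
U, V = rng.normal(size=(100, 16)), rng.normal(size=(500, 16))
R = (rng.random((100, 500)) < 0.01).astype(np.float64)  # sparse implicit feedback
print(wmf_loss(U, V, R))
```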
To mitigate this inefficiency, several methods have been developed, such as batch based ALS (Hu et al., 2008; He et al., 2016; Bayer et al., 2017) and mini-batch SGD (Xin et al., 2018). These techniques accelerate the learning process, making whole-data based strategies more practical. However, they are generally suitable only for traditional retrieval methods with a linear prediction layer (Bayer et al., 2017; Xin et al., 2018; Chen et al., 2020) and a regression loss function. Consequently, non-sampling methods have been widely applied in traditional retrieval methods (Hu et al., 2008; He et al., 2016; Liang et al., 2016) but are less common in neural retrieval methods. Recently, the neural non-sampling retrieval method ENMF (Chen et al., 2020) was proposed, representing an effort to apply non-sampling strategies within neural frameworks. Interested readers can refer to (Chen et al., 2023b) for further details.
In this part, we introduce several loss functions commonly used in retrieval methods.
In recommender systems, computing the softmax over a vast number of items is prohibitively expensive, making it impractical for real-time retrieval. Sampled softmax loss addresses this inefficiency by approximating the softmax through negative sampling: as shown in equation 8, it approximates the denominator by considering only a subset of sampled negative items.
$$\mathcal{L}_{\text{sampled}}=-\frac{1}{|B|}\sum_{(u_i,v_i)\in B}\log\frac{\exp\left(G(u_i,v_i)\right)}{\exp\left(G(u_i,v_i)\right)+\sum_{v'\in S_i}\exp\left(G(u_i,v')\right)}\qquad(8)$$

$$G\left(u,v\right)=\mathbf{u}\cdot\mathbf{v}-\log\left(Q(v|u)\right)\qquad(9)$$

where $Q(v|u)$ is the probability with which item $v$ is sampled as a negative for user $u$; subtracting $\log Q(v|u)$ corrects the bias introduced by sampling.
Sampled softmax loss ensures that the model focuses on differentiating between the positive items and the selected negative samples, and is particularly useful in the retrieval stage of recommender systems, where the goal is to quickly and accurately identify a subset of relevant items from a large item pool. By reducing the computational overhead, it enables faster training and inference times, making it feasible to deploy in real-world scenarios with millions of items.
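A PyTorch sketch of sampled softmax with the logQ correction of equation 9 follows (tensor shapes and names are our own; `log_q_pos` and `log_q_neg` hold the log sampling probabilities $\log Q(v|u)$):

```python
import torch
import torch.nn.functional as F

def sampled_softmax_loss(user_emb, pos_emb, neg_emb, log_q_pos, log_q_neg):
    """Sampled softmax over one positive and n_neg sampled negatives.

    user_emb, pos_emb: (B, d); neg_emb: (B, n_neg, d)
    log_q_pos: (B,); log_q_neg: (B, n_neg) -- log sampling probabilities
    Q(v|u), subtracted from the raw logits as in equation (9).
    """
    pos_logit = (user_emb * pos_emb).sum(-1) - log_q_pos                    # G(u, v+)
    neg_logit = torch.einsum('bd,bnd->bn', user_emb, neg_emb) - log_q_neg   # G(u, v')
    logits = torch.cat([pos_logit.unsqueeze(1), neg_logit], dim=1)          # (B, 1+n_neg)
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)  # the positive item sits in column 0
```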
Noise Contrastive Estimation (NCE) loss is an efficient training objective for large-scale recommender systems, particularly in the retrieval stage. It addresses the challenge of computing probabilities over an extensive item set by transforming the problem into a binary classification task: NCE approximates the softmax by contrasting observed data (positive examples) against artificially generated noise (negative examples), simplifying the likelihood computation. NCE loss can be formulated as equation 10, where $B$ stands for the mini-batch, $G(u_i,v_i)$ is the matching score between user $u_i$ and item $v_i$ from equation 9, $\sigma$ is the sigmoid function, and $S_i$ represents the set of negative items sampled for user $u_i$.
$$\mathcal{L}_{\text{NCE}}=-\frac{1}{|B|}\sum_{(u_i,v_i)\in B}\left[\log\sigma\left(G(u_i,v_i)\right)+\sum_{v'\in S_i}\log\left(1-\sigma\left(G(u_i,v')\right)\right)\right]\qquad(10)$$

During training, the method learns to differentiate between true user-item interactions and sampled noise. By adjusting the parameters to maximize the likelihood of observed interactions and minimize the likelihood of noise, NCE effectively handles the computational complexity of large-scale retrieval tasks.
Pairwise loss is a type of Learning-to-Rank approach commonly used in retrieval methods. It focuses on optimizing the ranking of items for a given user by comparing pairs of items, which ensures that preferred items are ranked higher than less preferred ones. One common method to implement pairwise loss is using Marginal Hinge loss, defined as:
$$\mathcal{L}_{\text{Hinge}}=\frac{1}{|B|}\sum_{(u_i,v_{i+},v_{i-})\in B}\max\left(0,\;m-\mathbf{u}_i\cdot\mathbf{v}_{i+}+\mathbf{u}_i\cdot\mathbf{v}_{i-}\right)$$

where $B$ is a batch containing triplets $(u_i,v_{i+},v_{i-})$, $\mathbf{u}_i$ is the embedding vector for user $u_i$, and $\mathbf{v}_{i+}$ and $\mathbf{v}_{i-}$ are the embedding vectors for the positive and negative items, respectively. $m$ is a margin parameter ensuring a minimum difference between the scores of positive and negative items. The goal is to ensure that the matching degree of positive items exceeds that of negative items by at least the margin $m$.
Bayesian Personalized Ranking (BPR) loss is another pairwise loss, which maximizes the probability that a positive item is ranked higher than a negative item. It is defined as:
$$\mathcal{L}_{\text{BPR}}=-\frac{1}{|B|}\sum_{(u_i,v_{i+},v_{i-})\in B}\log\sigma\left(\mathbf{u}_i\cdot\mathbf{v}_{i+}-\mathbf{u}_i\cdot\mathbf{v}_{i-}\right)$$
where σ 𝜎 \sigma italic_σ is the sigmoid function. This formulation avoids the need for a margin parameter and focuses on maximizing the ranking order directly.
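Both pairwise losses reduce to a few lines of code. The PyTorch sketch below assumes pre-computed user and item embeddings of shape (batch, dim); function names and the default margin are our own illustration:

```python
import torch
import torch.nn.functional as F

def hinge_loss(u, v_pos, v_neg, margin=1.0):
    """Marginal hinge loss: positives must outscore negatives by the margin m."""
    return F.relu(margin - (u * v_pos).sum(-1) + (u * v_neg).sum(-1)).mean()

def bpr_loss(u, v_pos, v_neg):
    """BPR loss: maximize sigma(u.v+ - u.v-), i.e. the pairwise ranking order."""
    return -F.logsigmoid((u * v_pos).sum(-1) - (u * v_neg).sum(-1)).mean()

# toy usage with a batch of 32 triplets in a 64-dim embedding space
u, v_pos, v_neg = (torch.randn(32, 64) for _ in range(3))
print(hinge_loss(u, v_pos, v_neg).item(), bpr_loss(u, v_pos, v_neg).item())
```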
In this section, we introduce some commonly used evaluation metrics for retrieval methods. Recall@K, Precision@K, and F1@K are closely related metrics in the retrieval stage. Comparing predicted labels against ground-truth positive and negative labels yields the classifications True Positive (TP), True Negative (TN), False Positive (FP), and False Negative (FN). Precision and Recall measure the proportion of correctly predicted positive samples relative to the predicted positives and the ground-truth positives, respectively. The F1-score combines precision and recall into a single metric, as shown in equation 13.
$$\mathrm{F1@}K=\frac{2\cdot\mathrm{Precision@}K\cdot\mathrm{Recall@}K}{\mathrm{Precision@}K+\mathrm{Recall@}K}\qquad(13)$$

Hit Rate (HR) is another popular metric, which measures the proportion of hit items in the predicted top-$K$ list relative to the total items in the test set. A hit occurs if a user interacts with any of the top-$K$ retrieved items, which can be formulated as:

$$\mathrm{HR@}K=\frac{\#\text{Hits@}K}{|\mathrm{GT}|}$$

where $|\mathrm{GT}|$ denotes the total number of ground-truth items in the test set.
While Precision and Recall do not account for the order of items, Average Precision (AP) calculates the average precision score across positions within the list. It considers the precision at each rank, taking into account both the presence and position of relevant items and giving higher scores to relevant items appearing earlier in the list. Mean Average Precision (MAP) extends AP by averaging the AP scores across all users. AP and MAP are formulated in equation 15, where $\operatorname{rel}(k)$ indicates the relevance of the item at rank $k$ and $p_k$ is the precision at cut-off $k$.
$$\mathrm{AP@}K=\frac{1}{K}\sum_{k=1}^{K}p_k\cdot\operatorname{rel}(k),\qquad \mathrm{MAP@}K=\frac{1}{|\mathcal{U}|}\sum_{u\in\mathcal{U}}\mathrm{AP}_u\mathrm{@}K\qquad(15)$$

In this section, we first introduce commonly used large-scale public datasets for retrieval tasks and select representative retrieval methods to evaluate their performance across three datasets. Additionally, since the study of retrieval methods is deeply rooted in practical applications and faces numerous real-world challenges, designs purely based on theoretical research may not fully capture the potential benefits of innovations at the retrieval stage. Therefore, we summarize online performance statistics from published papers to gain further insights.
MovieLens (mov, 2022) dataset is a widely used dataset in the field of recommender systems, released by the GroupLens Research Project at the University of Minnesota. It contains user ratings for movies, along with additional information such as user age, occupation, zip code, and item genre. MovieLens-1M and MovieLens-20M are commonly used in various research studies (Mao et al., 2021a; He et al., 2020).
Gowalla (gow, 2022) dataset was collected from the Gowalla social network, which was a location based social networking website where users could check in at physical locations and share their activities with friends.
Amazon-Book (ama, 2021) dataset is a sub-dataset from the Amazon review dataset, which contains millions of reviews written by Amazon customers for various products available on the Amazon e-commerce platform. Additionally, other sub-datasets in Amazon, such as Amazon-CDs, Amazon-Movies, Amazon-Beauty, and Amazon-Electronics, are also used in some research studies (Sun et al . , 2020; Mao et al . , 2021a, b) .
Yelp (yel, 2021) dataset consists of reviews and related information collected from the Yelp platform, which is known for providing crowd-sourced reviews about businesses, particularly restaurants and other local services.
CiteULike-A (cit, 2024) dataset was gathered from CiteULike and Google Scholar. CiteULike enables users to create their own collections of articles. Each article includes its abstract, title, and tags.
MillionSongData (son, 2024) dataset is a freely available collection of audio features and metadata for a million contemporary popular music tracks. Typically, play counts are binarized and interpreted as implicit preference data.
To provide a clear understanding of the performance of various retrieval methods, we present the results of several retrieval methods on the MovieLens-1M, Amazon-Book, and Yelp2018 datasets below. Following previous research (He and McAuley, 2016; He et al . , 2017) , we use the 10-core setting to ensure that each user and item have at least ten interactions. The statistics of these datasets are shown in Table 1 .
We outline the overall recommendation pipeline of the CMB Life APP for credit cards in Figure 5. The pipeline is broadly divided into two main parts: offline training (with data storage) and the online recommendation service. Typically, offline processes are responsible for storing item embeddings and the retrieved item list for each user. When a user makes an online request, the retrieval stage queries the $k$ most similar items from the stored retrieval results and forwards them to the subsequent stages, including ranking and re-ranking, culminating in the display of recommendation results to the user. Users' interactions with these results, such as clicks or purchases, are logged as user interaction logs, which serve as training data for offline retrieval models. The whole process forms a continuous feedback loop. We next provide in-depth explanations of data processing and storage, training, and the recommendation service, respectively.
In the data processing and storage section, we utilize three types of data sources. (1) Item information: the item service provides detailed information about items, such as titles, prices, etc. (2) User information: this primarily includes basic user profiles. (3) User interaction logs recorded by various business systems: these capture user-item interactions such as views, clicks, purchases, and favorites. As mentioned in Section 6.1.1, because the retrieval stage must score a large number of items that have never been exposed, relying solely on ranking data with user feedback can lead to sample selection bias (SSB) (Chen et al., 2023a), which is essentially a mismatch between training and inference data, akin to forcing a student to take a test on material not covered in the syllabus. To address this, negative sampling is needed. In the specific context of Mobile Life, we use random negative sampling, whose proportion depends on the volume of data and the balance of positive and negative samples. These three data sources are combined and used for offline training or statistical analysis of retrieval models.
As for retrieval models, we categorize them into two types: offline retrieval and nearline retrieval.
Offline retrieval. Offline retrieval typically involves methods that do not require real-time updating, such as popular or new item retrieval. Once the offline statistical analysis or training process finishes, the retrieval results, such as user-to-item lists, are generated and saved for quick access during retrieval.
Nearline retrieval. Nearline retrieval involves methods that require periodic training and updating, which necessitates maintaining item embeddings. When a user's interaction history updates, nearline retrieval is triggered: if a user recently clicked on an item, nearline retrieval selects the top items from the item pool based on the dot product of item embeddings. The retrieved items are then combined with the offline user-to-item list to create an updated retrieval list.
During online recommendation service, when a user initiates a request, the system leverages multi-channel retrieval to query the user-specific retrieval pool, combining both offline and nearline results. These results are then passed to subsequent ranking models, ultimately displaying the final recommendations to the user.
As illustrated in Figure 6, our industrial setting features seven retrieval channels: three nearline I2I retrieval methods, one offline U2I retrieval method, and three other statistical based retrieval methods. During online service, the I2I approach retrieves the top-$K$ similar items using the embedding of an item that a user clicked or purchased and places them into the user's retrieval pool; the U2I retrieval model retrieves directly through the user ID; and the statistical based retrieval methods fetch the top-$K$ items directly from the database based on predefined quotas. We will now introduce each channel.
For Word2Vec, we feed comprehensive text fields from within the scenario, such as item titles, attributes, and categories, to generate item embeddings, which are then stored in an embedding database. The underlying principle is that items with similar titles, attributes, and categories are more likely to appeal to users. For Node2Vec, we create positive sample pairs from user behavior sequences by using a sliding window of length 2 to generate positive item pairs (item $i$, item $j$). Simultaneously, we form negative pairs by pairing item $i$ with a random item $k$ from the item pool, ensuring $k\neq j$, to train the Node2Vec retrieval model (a sketch of this pair construction is given below). In the I2I setting, we directly store the generated item embeddings, while in the U2I setting, we obtain user embeddings by weighting the embeddings of the items the user recently interacted with. We then use the user embedding to retrieve the top-$K$ items, storing them in the format User $\rightarrow$ ItemList. The rationale is that items interacted with consecutively in a user's behavior sequence are perceived as more similar by the user, making it likely that they will engage with other items close to item $i$ after visiting it. For DSSM, we generate positive user-item pairs from user behavior sequences and create negative samples by pairing the user with a randomly sampled item from the item pool. These data are then used to train the DSSM retrieval model, which yields item embeddings that are stored in the embedding database. DSSM may offer more personalization than Node2Vec in this scenario, as the implicit graph structure includes both users and items.
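A simplified sketch of this pair construction (pure Python; names are our own, and real pipelines would add deduplication and frequency-aware sampling):

```python
import random

def make_training_pairs(user_sequences, item_pool, n_neg=1, seed=0):
    """Positive pairs from a sliding window of length 2 over each behavior
    sequence; negatives pair item i with a random item k != j."""
    rng = random.Random(seed)
    positives, negatives = [], []
    for seq in user_sequences:
        for i, j in zip(seq, seq[1:]):         # adjacent items: (i, j)
            positives.append((i, j))
            for _ in range(n_neg):
                k = rng.choice(item_pool)
                while k == j:                  # enforce k != j
                    k = rng.choice(item_pool)
                negatives.append((i, k))
    return positives, negatives

pos, neg = make_training_pairs([[3, 7, 2], [1, 3, 5]], item_pool=list(range(10)))
print(pos)  # [(3, 7), (7, 2), (1, 3), (3, 5)]
```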
Additionally, we have three retrieval methods based on statistical analysis. The New and Niche Item retrieval refers to storing recently least exposed items, based on the idea that new and niche items typically have lower exposure and interactions, leading to significant bias in interaction volume predictions. Exposing a certain proportion of such items can unearth overlooked items. The Popular retrieval method involves scoring items based on recent clicks, exposures, favorites, and purchases, then storing those with the highest scores. This method is especially useful for supplementing retrieval results for some users, particularly inactive or new users, by providing popular items as a fallback. User Tag Based Popular retrieval categorizes users into buckets based on fundamental attributes, which may yield more relevant results.
Figure 6 displays the distribution and online performance of the seven retrieval channels mentioned above. In the retrieval stage, the focus is on covering potential user interests with various approaches, making the integration of multiple retrieval results a significant topic. Currently, in the scenario of CMB Life APP for credit cards, our retrieval strategy is divided into two layers. The first layer includes multiple retrieval methods, consisting of three I2I retrieval methods, one U2I retrieval method, and a retrieval method for new and niche items. The second layer involves supplementary retrieval approaches, including popular item retrieval and user tag based popular item retrieval. This supplementary layer is only activated if the first layer’s results are insufficient. For example, if we aim to retrieve 300 items and the first layer retrieves only 180, the second layer will add the top 120 to reach 300. If the first layer already retrieves 300 items, the second layer will not be triggered. As seen in Figure 6 , supplementary retrieval accounts for about 50% in this scenario, largely due to the presence of many inactive users.
In this section, we discuss some insights and challenges during the efforts in the retrieval stage.
During the training of retrieval models, attention must be paid to the construction of negative samples and the sampling ratios. The proportion of negative sampling largely depends on the volume of data and the balance of positive and negative samples. Typically, the optimal sampling ratio is determined by evaluating offline metrics. Different sampling ratios can be tested to identify the best performing ratio, similar to a hyperparameter grid search. This careful tuning ensures that the model effectively distinguishes between relevant and irrelevant items, thereby improving the overall performance of retrieval.
As retrieval methods are followed by ranking and re-ranking stage and do not directly target final business metrics like CTR and CVR, there can be discrepancies between online and offline metrics. The primary reason for this inconsistency is that the retrieval stage is just the first part of the cascade system. Subsequent stages, such as ranking and re-ranking, further filter and prioritize the retrieved items based on more complex models. Therefore, offline evaluations might not fully capture the impact of these later stages. As a result, online experiments are often necessary to assess the actual performance of various retrieval methods, as they provide a more holistic view of the end-to-end recommender system’s effectiveness.
Multi-channel retrieval is essential for maximizing recall rate. By using different retrieval strategies, we can cover a wide range of user preferences, thereby improving overall recall. Even if channels focusing on new and niche items have a slightly lower CTR, they still contribute significantly to enhancing the overall recall rate.
In the context of the CMB Life APP for credit cards, the integration strategy employs multi-channel retrieval with additional supplementary methods, which requires adjusting the volume of items retrieved from each channel. Snake-merge is also commonly used in industry: it alternates selections from each retrieval channel to create a mixed list of retrieval results, as sketched below. However, these methods remain rudimentary, and better ways are needed to weight and integrate the results from multiple retrieval channels.
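For reference, a minimal snake-merge sketch (our own illustration): channels are interleaved one item at a time, duplicates are skipped, and merging stops once the quota is filled.

```python
from itertools import zip_longest

def snake_merge(channels, k):
    """Interleave ranked lists from several retrieval channels, skipping
    duplicates, until k items are collected."""
    merged, seen = [], set()
    for tier in zip_longest(*channels):        # i-th item of every channel
        for item in tier:
            if item is not None and item not in seen:
                seen.add(item)
                merged.append(item)
                if len(merged) == k:
                    return merged
    return merged                              # channels exhausted before k

print(snake_merge([[1, 2, 3], [2, 4], [5, 6, 7]], k=5))  # [1, 2, 5, 4, 6]
```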
In this section, we propose several open challenges and potential future directions related to retrieval methods in recommender systems. Some of these topics are crucial yet underexplored in the field, while others present promising avenues for future research.
As mentioned in Section 1 , retrieval differs from ranking. Due to the large-scale candidate item set and efficiency constraints at the retrieval stage, it is challenging to use complex cross-features in retrieval methods. Consequently, many earlier approaches, such as the commonly used two-tower methods (Huang et al . , 2013; Covington et al . , 2016) , separately obtain user and item representations and then compute a simple dot product or cosine similarity. However, these coarse features of users and items fail to provide comprehensive representations. Therefore, recent work has focused on balancing and making trade-offs between performance, model complexity, and real-time latency.
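As a point of reference, a minimal PyTorch two-tower sketch is shown below (architecture and sizes are illustrative, not a specific published model): because the towers only meet at the final dot product, item vectors can be pre-computed and served from an ANN index, but no cross-features between user and item are possible.

```python
import torch
import torch.nn as nn

class TwoTower(nn.Module):
    """User and item are encoded by independent towers that only meet at the
    final dot product, so item vectors can be pre-computed and indexed."""

    def __init__(self, n_users, n_items, dim=64):
        super().__init__()
        self.user_tower = nn.Sequential(
            nn.Embedding(n_users, dim), nn.Linear(dim, dim),
            nn.ReLU(), nn.Linear(dim, dim))
        self.item_tower = nn.Sequential(
            nn.Embedding(n_items, dim), nn.Linear(dim, dim),
            nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, user_ids, item_ids):
        u = self.user_tower(user_ids)          # (B, dim) user representation
        v = self.item_tower(item_ids)          # (B, dim) item representation
        return (u * v).sum(-1)                 # similarity score per pair

model = TwoTower(n_users=1000, n_items=50_000)
scores = model(torch.tensor([1, 2]), torch.tensor([10, 99]))
```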
Enhance the depth of interaction between the user and item side. For instance, DAT (Yu et al., 2021) uses an adaptive-mimic mechanism to implicitly model interactions between the two towers. Another approach, IntTower (Li et al., 2022), enhances the original two-tower methods by explicitly interacting the last hidden vector on the item side with multiple hidden vectors on the user side. Additionally, it employs positive and negative sample pairs to guide the model through self-supervised learning, capturing interactive signals implicitly. MVKE (Xu et al., 2022) generates multiple user interest representations, which are then aggregated through an attention mechanism with the item representations to refine each user's item-related representations.
Knowledge distillation. Moreover, models like PFD (Xu et al . , 2020) have been developed where two-tower and single-tower models are trained in parallel. The rich information of the single-tower model guides the learning of the two-tower model, with only the two-tower model being used for online retrieval services after training. During offline training, the teacher model includes a large number of informative features, such as interaction features that the two-tower methods cannot access, like clicks within the same category by a user in the past 24 hours, and post-click information that is unavailable during online services. The teacher model does not require pre-training and is updated synchronously with the student model. Initially, both models update independently until the teacher model stabilizes, at which point distillation begins. Beyond recommendation, some research on knowledge distillation has also been applied to information retrieval (IR) tasks, such as ENDX (Wang et al . , 2022) , TRMD (Choi et al . , 2021) , VIRT (Li et al . , 2021) , ERNIE-Search (Lu et al . , 2022) .
Advanced indexing schemes. In the retrieval stage, efficient indexing schemes are essential for managing and accessing large-scale items effectively. Tree based methods (Zhu et al . , 2018) , while offering enhanced expressive power and the ability to capture diverse user interests through hierarchical representations, also present significant challenges. The primary drawback is the need for tight integration between algorithm and engineering, requiring customized indexing structures that demand substantial engineering effort. Unlike two-tower methods that can leverage open-source tools like Faiss (Johnson et al . , 2019) , tree based methods necessitate bespoke solutions, complicating deployment and optimization. The co-design approach of aligning model and index, although addressing consistency issues, adds further intricacy.
In recent years, there has been a notable shift from single-stage retrieval models to exploring multi-stage cascade ranking systems. For instance, Gallagher et al. (Gallagher et al., 2019) explored deriving gradients for cascade rankers, employing end-to-end methods for optimization, while Fan et al. (Fan et al., 2019) advocated for unifying multiple stages into a single stage using hard negative sampling. Furthermore, Fei et al. (Fei et al., 2021) suggested a strategy to share features across different stages. Hron et al. (Hron et al., 2021) suggest using multiple channels together during the retrieval stage and learning how to merge the items retrieved from various channels. Rankflow (Qin et al., 2022) suggests training rankers on their specific data distributions to alleviate the SSB issue and better utilize the interactions between stages. In CoRR (Huang et al., 2023), the retriever is enhanced by deriving high-quality training signals from the ranker, while the ranker benefits from learning to discriminate hard negatives sampled by the retriever. GPRP (Zheng et al., 2024) addresses both the selection bias in each stage of the system pipeline and the underlying interests of users. The core idea is to estimate the selection bias in subsequent stages and then develop a ranking model that aligns with this bias, ensuring the top-ranked results are delivered in the final output. RAR (Huang et al., 2024) stands for Recall Augmented Ranking, which optimizes the recommendation process by utilizing multi-stage information. ARF (Wang et al., 2023d) introduces a novel perspective on optimizing cascade ranking systems by emphasizing the adaptability of optimization targets to data complexities and model capabilities. These studies highlight the importance of viewing the retrieval process from a cascade system perspective, where each stage is optimized not in isolation but in concert with the others to improve the overall performance and effectiveness of the recommender system.
Discriminative recommendation retrieves items from an existing corpus via scoring. However, this retrieval based approach has two main drawbacks: (1) the predefined items in the corpus may not meet the diverse information needs of all users, and (2) users typically provide feedback through passive interactions like clicks, which is inefficient and often imprecise. Recent advancements in AI-Generated Content (AIGC) offer promising solutions to these issues. Generative AI can create personalized items that better align with specific user needs. Moreover, large language models (LLMs), with their advanced language understanding and generation capabilities, make it easier for users to express preferences through natural language, reducing the effort required to tailor recommendations (Lin et al., 2023a). Generative recommendation differs significantly from discriminative recommendation: rather than relying on pointwise scoring, generative models streamline the process by directly generating recommendations.
In rating prediction tasks, several studies have tested LLMs, many of which are based on ChatGPT (Zhiyuli et al . , 2023; Dai et al . , 2023; Liu et al . , 2023; Wang et al . , 2023b; Xi et al . , 2023b) . Recent studies have found that the context length limit of an LLM makes it impractical to input all items. To address this, two approaches have been explored. The first is straightforward recommendation, which uses a prompt containing only user information (e.g., ID or metadata) and asks the LLM to generate recommendations directly (Xu et al . , 2023; Zhang et al . , 2023a; Di Palma et al . , 2023) . The second approach is selective recommendation, which includes both user information and a list of candidate items in the prompt, asking the LLM to select items from these candidates (Geng et al . , 2022; Li et al . , 2023a; Zhang et al . , 2023c; Li et al . , 2023c; Dai et al . , 2023) .
Some recent studies (Li et al . , 2023b) have also instructed LLMs to determine whether a user will interact with a given item by generating ‘yes’ or ‘no’ responses (Lin et al . , 2024; Bao et al . , 2023) . Although these responses are generated by LLMs, this approach is still considered discriminative recommendation since it generates an answer or a probability score for each item. The shift towards generative recommendation highlights a new paradigm in recommender systems, where generative AI not only enhances recommendation quality but also simplifies the recommendation pipeline.
In conclusion, our survey is the first to thoroughly examine the retrieval stage of recommender systems, looking at both academic research and industry applications. We cover three main areas: improving similarity computation between users and items using both shallow and deep structures, enhancing indexing mechanisms for efficient retrieval, and optimizing training methods of retrieval. We also conduct extensive benchmarking experiments, testing various methods on three public datasets. Furthermore, we highlight current industrial applications through a case study on retrieval practices at a specific company, detailing the entire process and the challenges they face. This exploration is crucial as the retrieval stage sets the groundwork for the subsequent recommendation processes, determining the efficiency and effectiveness of the entire system.
Additionally, we have identified existing gaps and proposed future directions for research, aiming to inspire and guide ongoing efforts in this field. By discussing both the theoretical frameworks and practical implementations, this survey serves as a valuable resource for researchers and practitioners alike, seeking to enhance or develop innovative retrieval methods. Looking ahead, the field of retrieval in recommender systems holds promising potential for significant advancements. Our hope is that this survey will act as a catalyst for future research, fostering developments that will refine and revolutionize the capabilities of retrieval in recommendation, thus improving the overall user experience on recommendation platforms.