💡 Recommendation Track
Introduction
This track focuses on developing LLM agents that serve as recommendation assistants, capable of generating personalized recommendations for target users. The task leverages large-scale, open-source datasets, including data from Yelp, Goodreads, and Amazon, which provide rich user-item interaction data and diverse recommendation scenarios. Participants are expected to design recommendation agents that operate within an interactive online platform simulator, which integrates tools for retrieving user information, item details, and existing reviews.
Task Inputs:
- User ID: Identifies the specific user for whom the recommendation is being generated.
- Candidate Item List: A list of items available for recommendation.
Task Outputs:
- Ranked Recommendation List: A prioritized list of items tailored to the user's preferences (see the sketch below).
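As a concrete illustration of this input/output contract, here is a minimal agent sketch. The simulator handle `env` and the tool names `get_reviews` and `get_item` are assumptions for illustration, not the platform's actual API; a real submission would replace the toy scoring heuristic with LLM-driven reasoning over the retrieved context.

```python
from typing import Any, Dict, List

class RecommendationAgent:
    """Minimal sketch: rank candidate items for a given user.

    `env` stands in for the track's online platform simulator; the tool
    names used here (get_reviews, get_item) are illustrative assumptions,
    not the platform's actual API.
    """

    def __init__(self, env: Any) -> None:
        self.env = env

    def recommend(self, user_id: str, candidate_ids: List[str]) -> List[str]:
        # Retrieve the user's past reviews through the simulator's tools.
        reviews = self.env.get_reviews(user_id=user_id)
        history = " ".join(r.get("text", "") for r in reviews).lower()

        # Toy preference score: keyword overlap between a candidate's
        # description and the user's review history. A real agent would
        # prompt an LLM over the user profile, reviews, and item details.
        def score(item_id: str) -> float:
            item: Dict[str, Any] = self.env.get_item(item_id)
            words = item.get("description", "").lower().split()
            return sum(w in history for w in words) / (len(words) or 1)

        # Output: the candidate list re-ranked by predicted preference.
        return sorted(candidate_ids, key=score, reverse=True)
```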
This track aims to advance the design of LLM agents for recommendation tasks, exploring their potential to enhance personalization, effectively model user preferences, and shape next-generation recommendation systems.
Evaluation
Participants' recommendation agents will be evaluated on their ability to accurately rank the items that users are most likely to select from a list of candidates. During the Development Phase, we provide simulation data for development and evaluation; the ground truth for each task is generated by a predefined LLM agent that simulates user choices based on context and the task query. Once the top 20 teams are selected to enter the Final Phase, evaluation results will be calculated on a mix of 40% simulation data and 60% real data.
Ranking Accuracy
- Metric: Top-N Hit Rate (N ∈ {1, 3, 5})
- Description: Measures how often the ground-truth item appears in the top N of the ranked list of 20 candidate items.
- Formula:

HR@N = (1/T) · Σ_{t=1}^{T} 𝟙(p_t ∈ P̂_t(N))

where T is the total number of test tasks, and 𝟙(·) is an indicator function that equals 1 if the ground-truth item p_t appears in the top-N recommendations P̂_t(N) of the ranked list P̂_t, and 0 otherwise. Similar evaluation methods have been used in [1][2].
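For reference, HR@N can be computed directly from the ranked lists. Below is a minimal sketch; the function and variable names are our own for illustration, not the official scorer.

```python
from typing import Dict, List

def hit_rate_at_n(
    ranked_lists: Dict[str, List[str]],  # task_id -> ranked candidate item IDs
    ground_truth: Dict[str, str],        # task_id -> ground-truth item ID p_t
    n: int,
) -> float:
    """HR@N: fraction of tasks whose ground-truth item is in the top N."""
    hits = sum(
        ground_truth[task] in ranking[:n]
        for task, ranking in ranked_lists.items()
    )
    return hits / len(ranked_lists)

# Example: a single task with 20 candidates where the true item ranks 2nd,
# so it is missed at N = 1 but hit at N = 3 and N = 5.
ranked = {"t1": [f"item_{i}" for i in range(20)]}
truth = {"t1": "item_1"}
assert hit_rate_at_n(ranked, truth, n=1) == 0.0
assert hit_rate_at_n(ranked, truth, n=3) == 1.0
```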
References
- [1] Wang, Yancheng, et al. "RecMind: Large Language Model Powered Agent for Recommendation." Findings of the Association for Computational Linguistics: NAACL 2024, 2024.
- [2] Kim, Sein, et al. "Large Language Models Meet Collaborative Filtering: An Efficient All-Round LLM-Based Recommender System." Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2024.
Check out Recommendation Evaluation for more calculation details.