🧩 User Modeling Track

Introduction

This track centers on simulating user reviews with LLM-based agents. Participants are tasked with designing and implementing LLM agents that analyze historical user behavior records, model user preferences, and generate realistic reviews. By integrating user characteristics, item attributes, and other users’ feedback, these agents simulate user behavior in an interactive environment. The task leverages large-scale, open-source datasets, including data from Yelp, Goodreads, and Amazon, to provide rich user-item interaction data and diverse review scenarios. Agents operate within an interactive online platform simulator, where they can use tools to retrieve user information, item details, and existing reviews to inform their decisions.

Task Inputs:

  • User ID: Identifies the specific user whose behavior is being simulated.
  • Item ID: Represents the product, service, or content for which the simulated review is being generated.

Task Outputs:

  • Star Rating: A numerical score reflecting the simulated user’s overall opinion of the item.
  • Review Text: A detailed, contextually relevant commentary informed by user preferences and item attributes.
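
To make the interface concrete, below is a minimal sketch of this input/output contract in Python. The function and field names are illustrative assumptions, not the official simulator API.

```python
from typing import TypedDict


class ReviewOutput(TypedDict):
    stars: float   # numerical rating, e.g. on a 1-5 scale (the scale is an assumption)
    review: str    # free-text review informed by user preferences and item attributes


def simulate_review(user_id: str, item_id: str) -> ReviewOutput:
    """Hypothetical agent entry point: given a user ID and an item ID, retrieve
    user history, item details, and related reviews via the simulator's tools,
    prompt the LLM agent, and return a rating plus review text (omitted here)."""
    ...
    return {"stars": 4.0, "review": "Cozy spot, friendly staff, slightly slow service."}
```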

This task aims to evaluate the ability of LLM agents to generate coherent and contextually appropriate reviews and preference ratings, showcasing their capacity for user behavior modeling and preference learning. By exploring the effectiveness of LLM agents in simulating user behaviors, this track contributes to advancing methodologies in behavioral modeling and offers valuable insights into improving user experience on online review platforms.

Evaluation

The performance of participants' models in simulating user interactions will be assessed through quantitative metrics, with a focus on how accurately they predict user behavior. During the Development Phase, we provide simulation data for development and evaluation. In the Final Phase, after the Top 20 teams have been selected, evaluation results will be calculated from 40% simulation data and 60% real data. The evaluation criteria are as follows:

Preference Estimation

  • Preference estimation is calculated from the star rating accuracy.
  • Metric: 1 - MAE of the normalized star ratings; higher values indicate closer agreement with actual preferences.

Star Rating Accuracy

  • Metric: Mean Absolute Error (MAE)
  • Description: The predicted star ratings are compared to the ground truth values, with both normalized to the range [0,1].
  • Formula:

\[ \mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{s}_{n,i} - s_{n,i} \right| \] where \(N\) is the total number of reviews, and \(\hat{s}_{n,i}\) and \(s_{n,i}\) are the normalized predicted and ground truth star ratings for review \(i\), respectively. A similar evaluation method is used in [3].
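
For concreteness, here is a minimal sketch of the star-rating computation, assuming a 1-5 star scale normalized to [0,1]; the official scoring script defines the exact normalization.

```python
import numpy as np


def star_rating_mae(pred_stars, true_stars, min_star=1.0, max_star=5.0):
    """MAE of star ratings after normalizing both to [0, 1].
    The 1-5 star scale is an assumption for illustration."""
    pred = (np.asarray(pred_stars, dtype=float) - min_star) / (max_star - min_star)
    true = (np.asarray(true_stars, dtype=float) - min_star) / (max_star - min_star)
    return float(np.mean(np.abs(pred - true)))


def preference_estimation(pred_stars, true_stars):
    """Preference estimation score: 1 - MAE of the normalized star ratings."""
    return 1.0 - star_rating_mae(pred_stars, true_stars)


# Example: predictions [5, 3, 4] vs. ground truth [4, 3, 5]
# normalized MAE = (0.25 + 0.0 + 0.25) / 3 ≈ 0.167 -> preference estimation ≈ 0.833
```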

Review Generation

  • The review generation score is calculated from three review-level error metrics, defined below.
  • Metric: 1 - (Emotional Tone Error * 0.25 + Sentiment Attitude Error * 0.25 + Topic Relevance Error * 0.5); higher values indicate closer agreement with the actual reviews.
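
A minimal sketch of this weighting, assuming each error term (defined in the subsections below) has already been computed and lies in [0, 1]:

```python
def review_generation_score(emotional_tone_err: float,
                            sentiment_attitude_err: float,
                            topic_relevance_err: float) -> float:
    """Combine the three review errors into a single score; higher is better."""
    weighted_error = (0.25 * emotional_tone_err
                      + 0.25 * sentiment_attitude_err
                      + 0.50 * topic_relevance_err)
    return 1.0 - weighted_error
```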

Emotional Tone Error

  • A vector of emotion scores for the top five emotions in the review text is calculated using a predefined emotion classifier model [1], with each dimension normalized to the range [0,1].
  • Metric: Mean Absolute Error (MAE) of normalized emotion scores, reflecting the deviation from the actual emotions.
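
A sketch of how the emotion vectors could be compared, using a TweetEval-style classifier from Hugging Face; the exact checkpoint, emotion label set, and label alignment used by the official scoring script are assumptions here.

```python
import numpy as np
from transformers import pipeline

# Assumed checkpoint in the spirit of TweetEval [1]; the official classifier may differ.
emotion_clf = pipeline("text-classification",
                       model="cardiffnlp/twitter-roberta-base-emotion",
                       top_k=None)


def top_emotion_scores(text: str, k: int = 5) -> dict:
    """Scores for the k highest-scoring emotions, as a label -> score mapping in [0, 1]."""
    out = emotion_clf(text)
    scores = out[0] if isinstance(out[0], list) else out   # handle nested pipeline output
    scores = sorted(scores, key=lambda d: d["score"], reverse=True)[:k]
    return {d["label"]: d["score"] for d in scores}


def emotional_tone_error(generated_text: str, real_text: str, k: int = 5) -> float:
    """MAE between the two emotion-score vectors, aligned on the union of their
    top-k labels (labels missing from one side count as 0)."""
    gen, real = top_emotion_scores(generated_text, k), top_emotion_scores(real_text, k)
    labels = sorted(set(gen) | set(real))
    return float(np.mean([abs(gen.get(lab, 0.0) - real.get(lab, 0.0)) for lab in labels]))
```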

Sentiment Attitude Error

  • The sentiment attitude of the review text is analyzed using nltk.sentiment.SentimentIntensityAnalyzer(), with the resulting value normalized to the range [0,1].
  • Metric: Mean Absolute Error (MAE) of normalized sentiment scores, indicating the deviation from actual sentiment attitude.
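
A minimal sketch, assuming VADER's compound score is mapped from [-1, 1] to [0, 1] via (x + 1) / 2; the official normalization may differ. Averaging the absolute differences over the test set yields the MAE.

```python
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)   # one-time download of the VADER lexicon
_sia = SentimentIntensityAnalyzer()


def normalized_sentiment(text: str) -> float:
    """VADER compound score mapped from [-1, 1] to [0, 1] (mapping is an assumption)."""
    return (_sia.polarity_scores(text)["compound"] + 1.0) / 2.0


def sentiment_attitude_error(generated_text: str, real_text: str) -> float:
    """Absolute difference of normalized sentiment scores for one review pair;
    averaging over all pairs gives the MAE."""
    return abs(normalized_sentiment(generated_text) - normalized_sentiment(real_text))
```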

Topic Relevance Error

  • An embedding vector for the review text is generated using a predefined embedding model [2].
  • Metric: Cosine similarity between the embeddings of the generated and ground-truth review texts, measuring alignment with the real topics.
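
A sketch using Sentence-BERT embeddings [2]; the specific checkpoint is an assumption, and how the similarity is converted into the final error term (e.g. 1 - similarity) is defined by the official scoring script.

```python
from sentence_transformers import SentenceTransformer, util

# Assumed checkpoint; the organizers' predefined embedding model may differ.
_embedder = SentenceTransformer("all-MiniLM-L6-v2")


def topic_similarity(generated_text: str, real_text: str) -> float:
    """Cosine similarity between the embeddings of the generated and real reviews."""
    emb = _embedder.encode([generated_text, real_text], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))
```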

Overall Quality

  • The overall quality score is calculated from the preference estimation and review generation scores.
  • Metric: (Preference Estimation + Review Generation) / 2, indicating the overall quality of the simulated reviews.
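
Putting the pieces together, a minimal sketch of the final score:

```python
def overall_quality(preference_estimation: float, review_generation: float) -> float:
    """Average of the preference estimation and review generation scores."""
    return (preference_estimation + review_generation) / 2.0


# Example: preference estimation 0.83 and review generation 0.75 -> overall quality 0.79
```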

References

  1. F. Barbieri et al., "TweetEval: Unified Benchmark and Comparative Evaluation for Tweet Classification," in Findings of EMNLP, 2020.
  2. N. Reimers and I. Gurevych, "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks," in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
  3. Y. Wang et al., "RecMind: Large Language Model Powered Agent for Recommendation," in Findings of the Association for Computational Linguistics: NAACL 2024, 2024.

Check out User Behavior Modeling Evaluation for more calculation details.