OmniBehavior: Towards Real-world Human Behavior Simulation

Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

Jiawei Chen1,2 Ruoxi Xu1,2 Boxi Cao1, Ruotong Pan3, Yunfei Zhang3,
Yifei Hu3, Yong Du3, Tingting Gao3, Yaojie Lu1, Yingfei Sun2,
Xianpei Han1, Le Sun1, Xiangyu Wu3, Hongyu Lin1
1ISCAS, 2UCAS, 3Kuaishou Technology

Introduction

    The holistic modeling of human behavior is central to a wide range of disciplines. While Large Language Models (LLMs) have raised the prospect of serving as general-purpose user-simulators, the empirical basis for assessing their capabilities remains systematically insufficient. Existing benchmarks suffer from critical limitations:
    (1) They are confined to isolated scenarios with narrow action spaces (e.g., exclusively focusing on video browsing or e-commerce dialogue).
    (2) This narrow focus overlooks the holistic, interconnected nature of authentic human decision-making and preferences.
    (3) They fail to assess whether LLMs can transcend fragmented data to model long-horizon causal structures.

    To bridge this gap, we present OmniBehavior, the first user simulation benchmark built entirely on real-world data that simultaneously captures long-horizon, cross-scenario, and heterogeneous behavioral patterns. Collected from Kuaishou, OmniBehavior aggregates complete interaction traces from representative users over a three-month period across five diverse scenarios.
    Comprehensive evaluation of current state-of-the-art closed- and open-source LLMs reveals substantial limitations. For instance, the best-performing model (Claude-Opus-4.5) achieves an overall score of only 44.55, and F1 scores on binary behavior prediction do not exceed 40% for most models. This indicates that current LLMs struggle to accurately simulate complex, long-horizon user behavior traces.
    Crucially, our study uncovers a fundamental structural bias in current LLM simulators: they converge toward a "positivity-and-average" representation. This Utopian bias and hyper-activity blur individual-specific differences, discarding long-tail behaviors and negative feedback signals. OmniBehavior provides a vital framework to guide the genuine modeling of real human diversity.
data-composition

Overview of OmniBehavior, a real-world comprehensive benchmark for evaluating LLM-based user simulators. The benchmark is constructed in three stages:
    (1) Data Collection: aggregation of real-world logs from the Kuaishou platform across five major scenarios.
    (2) Data Processing: multi-modal fusion, two-level cleaning, representative sampling, and anonymization.
    (3) Benchmark Construction: the resulting dataset captures long-horizon, cross-scenario behavior traces, serving as a high-fidelity testbed for evaluating LLM-based user simulators in real-world industrial settings.

OmniBehavior Benchmark

Benchmark Scope

We construct a unified simulation environment covering five major user activities on the Kuaishou platform (including customer service dialogues within the E-commerce scenario). The framework requires the agent to predict diverse behaviors (e.g., watch duration, purchase, comment) from scenario-specific contexts, serving as a comprehensive testbed for high-fidelity user simulation.
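As a concrete sketch of what such a unified environment might consume, the record schema and prompt serialization below are entirely our own illustration: the field names, scenario labels, and `make_context` helper are hypothetical and do not correspond to the benchmark's released format.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical schema for one record in a user's behavior trace.
# All field names are illustrative, not the benchmark's actual format.
@dataclass
class BehaviorEvent:
    scenario: str                            # e.g. "video", "live", "ads", "e-commerce"
    item_id: str
    action: str                              # e.g. "click", "purchase", "comment"
    watch_duration: Optional[float] = None   # continuous target (video/live)
    text: Optional[str] = None               # textual target (dialogue turns)

def make_context(history: List[BehaviorEvent], candidate: BehaviorEvent) -> str:
    """Serialize a cross-scenario history plus one candidate item into a
    prompt for an LLM simulator (the serialization is illustrative only)."""
    lines = [f"[{e.scenario}] {e.action} item={e.item_id}" for e in history]
    lines.append(f"Predict the user's behavior toward item={candidate.item_id} "
                 f"in scenario={candidate.scenario}.")
    return "\n".join(lines)
```

The key design point this sketch mirrors is that a single agent sees one interleaved trace spanning all scenarios, rather than per-scenario histories in isolation.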



Single-scenario vs. Multi-scenario Profiles

To assess whether single-scenario data is sufficient to model user preferences, we compare single-scenario and multi-scenario settings through qualitative profile reconstruction. Using Claude-3.5-Sonnet to extract interests from interaction histories, we build user word clouds and summaries. As illustrated, profiles derived from single-scenario data are often fragmented and biased. In contrast, multi-scenario data provides richer contextual signals that better capture a user's stable and essential characteristics.
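The single- vs. multi-scenario comparison can be sketched with plain term counts, assuming the interest keywords have already been extracted (the paper does this with Claude-3.5-Sonnet; the `interests` dict and its terms below are invented for illustration):

```python
from collections import Counter

# Hypothetical pre-extracted interest keywords per scenario for one user.
interests = {
    "video":      ["tech", "gadgets", "cooking", "tech"],
    "live":       ["gadgets", "gaming"],
    "e-commerce": ["gadgets", "kitchenware", "cooking"],
}

def profile(scenarios):
    """Pool term frequencies over the chosen scenarios; the counts can
    serve directly as word-cloud weights."""
    counts = Counter()
    for s in scenarios:
        counts.update(interests[s])
    return counts

single = profile(["video"])        # single-scenario profile: fragmented view
multi = profile(interests.keys())  # multi-scenario profile: fuller coverage
# "gadgets" appears in all three scenarios, so the multi-scenario profile
# correctly weights it as a stable interest the video-only view understates.
```

The toy numbers only illustrate the mechanism: cross-scenario recurrence is what separates stable interests from scenario-specific noise.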

User profile reconstruction

Cross-scenario Causal Chain Analysis

We present a representative case study of a 12-day causal chain leading to a purchase event. After an initial search for "Xiaomi Launch Event", the user interacts with related items across various scenarios, eventually adding the item to the cart during a live stream and completing the order. This case illustrates that decisions stem from long-term, cross-scenario accumulation. Benchmarks limited to short sessions or single scenarios cause a form of "causal amputation," underscoring that ultra-long sequences and multi-scenario environments are necessary to preserve causal integrity.

Cross-scenario causal chain case study

Experimental Results


Comprehensive comparison of LLM backbones on the OmniBehavior Benchmark. We categorize user behaviors into three types: binary behaviors (e.g., clicks), continuous behaviors (e.g., duration), and textual behaviors (e.g., dialogue). The overall score represents the aggregated performance. The best/second best scores are bolded/underlined.

Model               Video (Binary)  Video (Continuous)  Live (Binary)  Ads (Binary)  E-commerce (Binary)  E-commerce (Textual)  Overall Score
Closed-source
Claude-Opus-4.5 33.05 64.19 31.70 51.16 29.98 57.21 44.55
Claude-Sonnet-4.5 18.85 65.95 25.00 42.77 36.13 54.26 40.49
Claude-Haiku-4.5 22.84 63.26 26.11 30.00 26.37 50.29 36.48
Claude-Sonnet-4 25.29 64.62 28.86 36.81 16.50 49.13 36.87
Gemini-3-Flash 22.09 53.79 25.61 24.64 19.65 49.80 32.60
GPT-5.2 31.54 65.01 28.63 33.60 29.32 46.29 39.07
GPT-4o 27.88 62.75 28.15 25.24 28.66 44.92 36.27
Open-source
GLM-4.7 26.86 64.43 28.97 40.34 32.90 55.25 41.46
DeepSeek-V3 21.42 63.98 27.92 25.74 33.31 52.13 37.42
Kimi-K2-Instruct 23.30 64.80 28.60 31.19 29.94 47.83 37.61
Qwen3-235B 18.26 62.38 23.84 23.19 19.22 45.74 32.11
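The Overall Score column is consistent with an unweighted mean of the six per-task scores (e.g., averaging the Claude-Opus-4.5 row gives 44.55). A minimal sketch of that aggregation, with binary tasks scored by the standard F1 definition; the paper's exact scoring for continuous and textual behaviors is not reproduced here:

```python
def f1_score(y_true, y_pred):
    """Binary F1 from parallel 0/1 label lists, as a percentage."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 100 * 2 * precision * recall / (precision + recall)

def overall(task_scores):
    """Unweighted macro-average across the six task scores."""
    return sum(task_scores) / len(task_scores)
```

For example, `overall([33.05, 64.19, 31.70, 51.16, 29.98, 57.21])` reproduces the 44.55 reported for Claude-Opus-4.5.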

The Structural Bias of LLM Simulators