An OpenEnv Benchmark Where LLMs Learn to Design Strategic Questions for Preference Elicitation.
Most LLM benchmarks test what a model can say. They rarely test whether it can uncover hidden structure through action. LotteryElicitationEnv is an OpenEnv-native environment where an agent designs sequences of lottery pairs to recover a simulated respondent's prospect-theory parameters: risk aversion (γ) and loss aversion (λ).
Each episode is adaptive. The agent proposes a lottery pair, observes a binary choice from a closed-form respondent, updates its belief in-context, and on the final turn commits a parameter estimate. Reward arrives only at termination, grounded entirely in arithmetic: mean-squared error against the ground-truth θ*, Holt–Laury consistency of the implied preferences, and an efficiency bonus for stopping early when confident.
The challenge is not language. It is verifiable experimental design: which lottery pair, right now, is most informative about (γ, λ) given the history so far?
The falsifiable claim: a GRPO-trained LLM can recover (γ, λ) more efficiently than the fixed 10-pair Holt–Laury (2002) battery that economists still use as the status-quo elicitation protocol.
Preference elicitation is a foundational problem in behavioral economics, marketing science, medical triage, and alignment. The dominant practice is still a fixed menu: every subject answers the same pre-specified list. An adaptive protocol that picks the next question given past answers should be strictly more sample-efficient, but hand-designing a Bayesian optimal experimental design (BOED) over the prospect-theory likelihood is expensive. We ask whether an LLM can learn that policy via RL.
The methodology is transferable. Any latent-parameter elicitation with an auditable forward model — medical triage (which symptoms to ask about next), educational diagnostics (which question reveals a student's misconception), alignment preference learning (which comparison is most informative about a human's utility function) — fits the same MDP template. Lotteries are the proxy; the capability is adaptive experimental design.
Every reward signal here is ground-truth arithmetic, not a judge. The environment samples θ*, runs a closed-form prospect-theory respondent, and scores the agent against the stored parameters. There is no circularity and no LLM judge in the loop.
Most "LLMs + economics" work lands in one of three buckets. None occupies the cell we target:
| Prior work bucket | What it does | What it does not |
|---|---|---|
| Static economic QA Recon (Zhou et al., arXiv:2506.00577, 2025) | SFT + GRPO on curated economic reasoning items, graded by rules on text | No sequential active choice, no continuous latent recovery |
| LLMs as agents in games RLVR Negotiation (Liu et al., arXiv:2604.09855, 2026); EconAgent (Li et al., ACL 2024) | RL in strategic or macro simulations, verifiable reward on surplus or budget | No parameter inference from a known simulator |
| LLMs as subjects Horton's Homo silicus (arXiv:2301.07543); "PT Fails for LLMs" (arXiv:2508.08992) | Measures whether LLMs are PT-rational | Does not train them to query a PT-rational counterpart |
| LotteryElicitationEnv (ours) | Sequential MDP, structured JSON lottery actions, terminal reward from ground-truth θ*, non-learned respondent | Not a human study (yet) |
To our knowledge, no prior work trains an LLM to adaptively design lottery pairs against a non-learned prospect-theory respondent with terminal ground-truth rewards under the GRPO + OpenEnv contract. The task, action space, and reward semantics are new; the method (RLVR / GRPO on verifiable signals) is shared with the Recon and negotiation-RLVR lineages.
Important caveat from the "PT Fails for LLMs" literature: we do not claim the policy LLM is PT-rational. We train it to design experiments against a PT-rational simulator with known θ*. The LLM is the experimenter, not the subject.
An OpenEnv-native sequential MDP in which an LLM agent designs lottery pairs, a closed-form prospect-theory respondent answers, and the agent is rewarded on the final turn for how accurately it recovers the respondent's hidden (γ, λ).
Each episode proceeds like this:
1. The agent proposes a lottery pair as a structured JSON action.
2. The closed-form respondent computes prospect-theory values (optionally perturbed by `noise_std`), and the env returns the choice plus the running history.
3. The episode ends when the agent sets `terminate_early=true`, runs out of turns, or submits an estimate on the final allowed turn.

The agent's interface is deliberately minimal: raw JSON output, no tool-call protocol, no markdown parsing. The LLM emits text, and the training client parses it and steps the environment over WebSocket.
The core contract is three Pydantic types exchanged over the OpenEnv WebSocket:
```python
# Action (agent → env)
class LotteryElicitationAction(Action):
    lottery_a: Lottery              # 2–3 outcomes, probs sum to 1.0
    lottery_b: Lottery
    theta_estimate: Optional[dict]  # {"gamma": float, "lambda": float}
    terminate_early: bool = False

# Observation (env → agent)
class LotteryElicitationObservation(Observation):
    step_idx, steps_remaining, max_steps: int
    history: list[dict]             # [{lottery_a, lottery_b, choice}, ...]
    last_choice: Optional[str]      # "A" | "B"
    gamma_range, lambda_range: tuple[float, float]
    min_outcome_value, max_outcome_value: float
    done: bool; reward: Optional[float]; metadata: dict

# State (hidden from agent)
true_gamma, true_lambda, gamma_mse, lambda_mse, hl_accuracy
```
A Lottery is 2 or 3 outcomes with probabilities that must sum to 1.0. Values lie in [min_outcome_value, max_outcome_value], both surfaced on every observation so the agent cannot drift off-spec.
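As a minimal sketch of those constraints (an illustrative helper, not the env's actual API — the default value range and tolerance here are assumptions):

```python
# Hypothetical validator mirroring the stated Lottery constraints:
# 2–3 outcomes, probabilities summing to 1.0, values within the surfaced range.
def validate_lottery(outcomes, min_value=0.0, max_value=100.0, tol=1e-6):
    if not 2 <= len(outcomes) <= 3:
        return False
    if abs(sum(o["probability"] for o in outcomes) - 1.0) > tol:
        return False
    return all(min_value <= o["value"] <= max_value for o in outcomes)
```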
The respondent is pure arithmetic. No LLM, no heuristic:
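A minimal sketch of that respondent, assuming the piecewise value function v(x) = x^γ for gains and −λ·(−x)^γ for losses, probability-weighted summation with no probability weighting function, and a deterministic tie-break toward A (the noise-free case; `noise_std` would perturb the comparison):

```python
def pt_value(x, gamma, lam):
    """Piecewise prospect-theory value: gains curve with γ, losses scaled by λ."""
    return x ** gamma if x >= 0 else -lam * ((-x) ** gamma)

def pt_utility(lottery, gamma, lam):
    """Probability-weighted sum of outcome values for a list of outcome dicts."""
    return sum(o["probability"] * pt_value(o["value"], gamma, lam) for o in lottery)

def respond(lottery_a, lottery_b, gamma, lam):
    """Deterministic binary choice: pick the lottery with the higher PT value."""
    ua = pt_utility(lottery_a, gamma, lam)
    ub = pt_utility(lottery_b, gamma, lam)
    return "A" if ua >= ub else "B"
```

A risk-averse respondent (γ < 1) prefers the sure 30 over a 50/50 gamble on 50-or-10; a risk-seeking one (γ > 1) flips, and λ only matters once losses enter the menu.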
A two-stage curriculum shapes the training distribution:
| Stage | γ sampled | λ sampled | Purpose |
|---|---|---|---|
| Stage 1 | Uniform[γlo, γhi] | Fixed at 2.25 | Shorten credit assignment, learn risk curvature first |
| Stage 2 | Uniform[γlo, γhi] | Uniform[λlo, λhi] | Full two-parameter elicitation |
Curriculum is honored both in EnvConfig and at env.reset(curriculum_stage=...), so a single server can serve both stages to different sessions concurrently.
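The stage logic amounts to a per-reset sampler along these lines (a sketch; the ranges shown are placeholder assumptions, the real bounds live in EnvConfig):

```python
import random

def sample_theta(stage, gamma_range=(0.3, 1.5), lambda_range=(1.0, 4.0), rng=random):
    """Two-stage curriculum sampler: Stage 1 fixes λ at 2.25, Stage 2 samples both."""
    gamma = rng.uniform(*gamma_range)
    lam = 2.25 if stage == 1 else rng.uniform(*lambda_range)
    return {"gamma": gamma, "lambda": lam}
```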
OpenEnv gives us three things that matter for this submission: (1) a standard WebSocket environment contract consumable by TRL's rollout_func, (2) per-session state with SUPPORTS_CONCURRENT_SESSIONS=True and max_concurrent_envs=64, so DDP ranks can hammer the same Space without cross-talk, and (3) a uniform deployment path. The same env code runs in-process for tests, as a Docker container for development, and as a Hugging Face Space during training and evaluation.
No in-process environment imports from PT — everything crosses the wire, exactly like OpenEnv intends. No new abstractions were invented. Base types only: EnvClient, Environment, Pydantic Action / Observation. All extensions (curriculum stage, reward breakdown, history) ride on metadata. No new method signatures, no fork. The env ships with openenv.yaml, a Dockerfile, and a live Hugging Face Space.
The environment reward is terminal only. No mid-episode credit. On the final step the env computes:
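The displayed reward equation did not survive extraction; the following is a plausible reconstruction from the component descriptions and weight names below (an assumed form, not the verified implementation):

\[
R \;=\; -\,w_{\mathrm{mse}}\,\mathrm{MSE}\!\left(\hat\theta,\theta^{*}\right)
\;+\; w_{\mathrm{HL}}\,\frac{1}{10}\sum_{k=1}^{10}\mathbf{1}\!\left[s_k(\hat\theta)=s_k(\theta^{*})\right]
\;+\; w_{\mathrm{eff}}\,\frac{S_{\max}-S_{\mathrm{taken}}}{S_{\max}}
\]

with the MSE computed in the normalized parameter ranges, and a fixed penalty of −2.0 applied instead when no valid theta_estimate is submitted on the final turn.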
Here \(s_k(\cdot)\) is the predicted Holt–Laury choice on menu pair \(k\) under the implied parameters.
Mapping to code: \(S_{\max}\) is max_steps, \(S_{\mathrm{taken}}\) is steps_taken, and \((w_{\mathrm{mse}}, w_{\mathrm{HL}}, w_{\mathrm{eff}})\) are mse_weight, holt_laury_weight, efficiency_weight in EnvConfig.
Defaults live in EnvConfig:
| Component | Weight | What it rewards |
|---|---|---|
| MSE | 1.0 | Closeness of θ̂ to ground truth in normalized range |
| Holt–Laury accuracy | 0.5 | Behavioral consistency: θ̂ should predict the same HL choices as θ* |
| Efficiency bonus | 0.1 | Stopping early when confident (not just guessing and quitting) |
| Missing θ penalty | −2.0 | Final-turn action with no valid theta_estimate |
The training package adds one optional auxiliary reward: a format score (the fraction of turns that produced valid JSON, weighted at 0.1–0.75) mixed into the GRPO advantage. It is intended as training wheels, removable once the model reliably emits structured output (>90% validity).
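The mixing is simple arithmetic; a sketch (illustrative names, not the training package's actual API):

```python
def mixed_reward(env_reward, num_turns, num_valid_json_turns, format_weight=0.1):
    """Add the auxiliary format score (fraction of turns with valid JSON)
    to the terminal environment reward before computing GRPO advantages."""
    format_score = num_valid_json_turns / max(num_turns, 1)
    return env_reward + format_weight * format_score
```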
Why three components: MSE alone rewards a lucky guess. Holt–Laury accuracy alone lets the agent memorize the HL menu without recovering θ. Efficiency alone rewards guess-and-quit. The product of incentives forces the agent to actually identify the parameters, not just match a proxy. As we show in the Episode Traces section, the efficiency bonus interacts with curriculum design in a way that can create reward-hacking fixed points.
The project is two strictly separated packages: LotteryElicitationEnv (the OpenEnv environment) and LotteryElicitationPT (the GRPO training client). They communicate exclusively over WebSocket — no in-process imports.
```mermaid
flowchart LR
    subgraph PT ["LotteryElicitationPT (Training)"]
        GRPO["GRPOTrainer
        TRL 1.0.0"]
        RF["rollout_func"]
        VLLM["vLLM
        colocate/server"]
        PARSE["action_parser
        JSON + guardrails"]
    end
    subgraph ENV ["LotteryElicitationEnv (OpenEnv)"]
        WS["FastAPI
        WebSocket"]
        RESP["PT Respondent
        v(x) = x^γ ..."]
        REW["Reward
        MSE + HL + eff."]
    end
    GRPO --> RF
    RF --> VLLM
    VLLM -->|"generate"| PARSE
    PARSE -->|"JSON action"| WS
    WS --> RESP
    RESP -->|"choice A/B"| WS
    WS -->|"observation"| RF
    REW -->|"terminal reward"| WS
```
Figure 1. System architecture. PT never imports env-side types — everything crosses the WebSocket.
Training uses GRPO (Group Relative Policy Optimization), a critic-free RL algorithm ideal for terminal-only rewards. We use TRL 1.0.0's rollout_func contract for explicit control over the generate → parse → step loop, avoiding TRL's Qwen3-only add_response_schema allowlist.
The rollout function manages: chat-template tokenization with enable_thinking=False, vLLM generation (colocate or server mode), think-block stripping, null-safe JSON parsing with 18 regression tests, probability normalization, and episode logging to reward_logs.jsonl.
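The think-block stripping step can be sketched as follows (an illustrative version handling both closed and unclosed `<think>` spans; names and exact regexes are assumptions, not the project's `_strip_think_blocks`):

```python
import re

# Drop closed <think>…</think> spans, then any unclosed trailing <think> span.
_THINK_CLOSED = re.compile(r"<think>.*?</think>", re.DOTALL)
_THINK_OPEN = re.compile(r"<think>.*$", re.DOTALL)

def strip_think_blocks(text: str) -> str:
    text = _THINK_CLOSED.sub("", text)
    text = _THINK_OPEN.sub("", text)  # unclosed block: drop to end of string
    return text.strip()
```

If a truncated generation never closes its think block, everything after `<think>` is discarded, which is why an unclosed-block rule matters: the JSON action may otherwise be buried in reasoning residue.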
The central research-grade finding of this submission is not a converged checkpoint — it is a structural diagnosis of how GRPO collapses on multi-turn verifiable-reward environments. We show the failure, explain the mechanism, and prescribe the fix stack.
Under Stage 1 training (λ fixed at 2.25), the policy collapsed to the following single-turn episode:
```json
{"lottery_a": {"outcomes": [{"value": 50, "probability": 0.5}, {"value": 10, "probability": 0.5}]}, "lottery_b": {"outcomes": [{"value": 30, "probability": 1.0}]}, "theta_estimate": {"gamma": 1.0, "lambda": 2.25}, "terminate_early": true}
```
From 322 logged episodes on Qwen3-1.7B and Qwen2.5-7B-Instruct runs:
| Signal | Value | What it means |
|---|---|---|
| `frac_reward_zero_std` | ≈ 1.0 | Every completion in the GRPO group gets identical reward |
| `loss`, `grad_norm` | ≈ 0 | No gradient signal — policy is frozen at a fixed point |
| `entropy` | ≈ 1e-5 | Policy has collapsed to a single deterministic output |
| `clipped_ratio` | ≈ 1.0 | No policy update being applied |
| Episodes hitting `max_steps` | 98.4% | Before collapse, most episodes used all 10 turns |
| Tokens/step (early → late) | ~450 → ~55 | Cold-start verbosity converges; OOM is a cold-start problem |
Stage 1 fixes λ=2.25 in the data-generating process to simplify credit assignment. But this creates a partially-correct shortcut: guessing λ=2.25 is always exactly right for that parameter. Combined with the efficiency bonus rewarding early termination, the model discovers that guess-and-quit on turn 1 yields a stable reward. Since every rollout in the GRPO group finds the same shortcut, within-group reward variance drops to zero, GRPO's group-relative advantage becomes zero, and the gradient vanishes. The policy is stuck at a fixed point that is partially correct by construction.
This failure mode is general, not specific to our environment. Any GRPO run on a multi-turn verifiable-reward env with a partially-right-but-cheap shortcut has this bug latent.
The prescribed fix stack:

- Parsing guardrails (`_safe_float`, `_safe_int`), think-block stripping, a hard cap on `completion_ids`, probability normalization. 18 regression tests.
- Lower `max_completion_length`, shorter episodes (5 turns for Stage 1), a stronger `format_weight` (0.1 → 0.75), and learning-rate tuning.
- Lower `efficiency_weight`, raise `mse_weight`, force minimum-turn exploration before `terminate_early` can fire, and randomize Stage 1 λ to a narrow band around 2.25.
- A per-turn shaping term computed from `history` to directly incentivize informative lottery pairs.

For contrast, here is what the agent should learn to do — an adaptive 5-turn elicitation for a respondent with γ*=0.6, λ*=3.0:
The final-turn submission:

```json
{"theta_estimate": {"gamma": 0.65, "lambda": 2.9}, "terminate_early": true}
```

The environment bundles two deterministic baselines. Both run in-process without a GPU:
| Baseline | Policy | What it isolates |
|---|---|---|
| Random lottery | Sample valid lottery pairs uniformly; return prior midpoint as θ̂ | Lower bound: beating it proves the model learned something |
| Holt–Laury fixed battery | Replay the canonical 10-pair menu; grid-search fit (γ, λ) at 0.01 resolution | Status-quo comparison from experimental economics |
The eval harness adds: zero-shot API LLMs, local vLLM-served LLMs, and a trained HF policy loaded from disk.
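The Holt–Laury baseline's fitting step is a plain grid search over (γ, λ) maximizing agreement with the observed choices. A compact sketch (coarse grid for illustration; the real baseline uses 0.01 resolution):

```python
def _ptv(x, g, lam):
    """Piecewise PT value: x^γ for gains, -λ·(-x)^γ for losses."""
    return x ** g if x >= 0 else -lam * ((-x) ** g)

def _choice(pair, g, lam):
    """Predicted choice between two lotteries (lists of outcome dicts)."""
    ua = sum(o["probability"] * _ptv(o["value"], g, lam) for o in pair[0])
    ub = sum(o["probability"] * _ptv(o["value"], g, lam) for o in pair[1])
    return "A" if ua >= ub else "B"

def grid_fit(pairs, choices, gammas, lams):
    """Return the (γ, λ) grid point whose predicted choices best match the data."""
    best, best_score = None, -1
    for g in gammas:
        for lam in lams:
            score = sum(_choice(p, g, lam) == c for p, c in zip(pairs, choices))
            if score > best_score:
                best, best_score = (g, lam), score
    return best
```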
| Metric | Holt–Laury fixed | Random lottery | Target for trained policy |
|---|---|---|---|
| γ MSE (normalized) | ≈ 0.02 | high | lower than HL, at fewer steps |
| λ MSE (normalized) | ≈ 0.3 | high | lower than HL |
| HL accuracy | ≈ 0.9 | ≈ 0.5 | ≥ HL |
| Steps used | 10 / 10 | 10 / 10 | < 10 via terminate_early |
The pipeline is validated end-to-end. Convergence to a baseline-beating checkpoint is blocked by two factors: the reward-hacking fixed point diagnosed above (now understood, fix stack prescribed), and Unsloth's multi-GPU incompatibility with our FSDP + vLLM topology (see Unsloth section), which prevented scaling beyond single-GPU training within the submission window.
What we have demonstrated:
- A working end-to-end GRPO + OpenEnv + vLLM pipeline with per-episode logging to `reward_logs.jsonl`.
- A stable colocate memory budget after lowering `LEPT_VLLM_GPU_UTIL` from 0.90 to 0.75–0.80.
- A hardened parser, including the `float(None)` edge case (fixed in Session 12).

The research story — can a GRPO-trained LLM beat Holt–Laury's 24-year-old fixed battery? — is the experiment this submission sets up. The reward-hacking diagnosis is the finding we contribute now.
Building a real GRPO + OpenEnv + vLLM training pipeline on a multi-turn, verifiable-reward environment surfaced five categories of structural issues. We document them because the next OpenEnv submission will hit every one.
In vllm_mode=server, every trainer.vllm_generation.generate() call performs gather_object → all_gather_object → broadcast_object_list. Our rollout is while not session.done, so different DDP ranks make different numbers of generate() calls per episode. NCCL collectives are sequence-numbered: different call counts per rank = permanent desync.
Symptoms: tqdm stuck at 0/14, GPUs 0–6 pinned at ~32 GiB, GPU 7 idle (vLLM). After ~30 minutes, the NCCL watchdog fires with last enqueued: 529 vs last completed: 527 on rank 1. Then UnpicklingError as ranks deserialize off-by-one collective buffers.
Fix: fixed-count padding — every rank performs exactly 8 generate() calls per episode:
```python
DIST_SERVER_GENERATES_PER_EPISODE = 8
per_episode_generate_cap = min(max_episode_turns, 8)
# After the real loop terminates, issue (8 - num_real_generates)
# dummy generates under _temporary_vllm_max_tokens(..., 1).
# Outputs are DISCARDED. Guarded with try/finally.
```
Active only when vllm_mode == "server" and world_size > 1. Reward, logprobs, and credit assignment are byte-identical to the unpadded case. Any TRL rollout_func user running variable-length rollouts in server mode has this bug latent.
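The scheme reduces to this shape (a self-contained sketch with a stand-in `generate` callable, not the actual TRL-integrated code):

```python
def run_episode_with_padding(generate, episode_turns, cap=8):
    """Every rank makes exactly `cap` generate() calls per episode so NCCL
    collective counts stay aligned across ranks. Dummy calls use a 1-token
    budget and their outputs are discarded."""
    real = min(episode_turns, cap)
    outputs = [generate(max_tokens=256) for _ in range(real)]  # real turns
    for _ in range(cap - real):
        generate(max_tokens=1)  # padding: output thrown away
    return outputs
```

Whether an episode ends after 3 turns or runs the full 10, every rank issues the same number of `generate()` calls, so the sequence-numbered collectives never desynchronize.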
We invested significant time integrating Unsloth for efficient fine-tuning. Unsloth's multi-GPU support did not work with our distributed FSDP + vLLM server topology — specifically, the weight-sharding and vLLM weight-sync paths are incompatible with Unsloth's kernel replacements. This prevented us from scaling beyond single-GPU colocate training within the submission window, which in turn limited the training budget available to push past the reward-hacking fixed point.
We document this so the next OpenEnv submitter can avoid the same dead end: if your pipeline requires FSDP + vLLM server mode, Unsloth is not currently a compatible acceleration path.
Four issues that each crashed training runs, consolidated with their fixes:
| Issue | Root cause | Fix |
|---|---|---|
| CUDA OOM despite low `max_completion_length` | `_rollout_one_episode` concatenated generated tokens and observation suffixes across 10 turns → 4,000–5,000-token sequences per episode | Hard-cap `completion_ids` to `max_completion_length`; strip think blocks from the training tensor; `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
| Qwen3 `<think>` blocks despite `reasoning_mode=off` | `enable_thinking=False` is prompt-side only; Qwen3 hybrid models still emit verbose think traces that truncate before JSON | `_strip_think_blocks` regex (closed + unclosed); re-encode only stripped JSON for `completion_ids` |
| `float(None)` crash killing all DDP ranks | LLM emitted `{"value": null, "probability": 0.5}`; parser checked key presence but not None-ness. One rank died → gloo cascade killed all others | `_safe_float` / `_safe_int` guardrails; 18 regression tests; fallback action instead of crash |
| FSDP1 `_is_root` assertion in server mode | TRL 1.0.0's `_sync_fsdp1_params_to_vllm` calls `summon_full_params` per child module, corrupting the FSDP root flag (TRL PR #3582, unmerged) | Default to `vllm_mode=colocate`; opt-in FSDP2 behind `LEPT_FSDP2_SHARDING=1` flag |
max_completion_length ≠ what TRL trains on. The rollout function keeps appending per-turn generations + observation suffixes until you explicitly hard-cap.
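The null-safe guardrail from the table reduces to a few lines. A sketch (illustrative, not the project's exact `_safe_float`):

```python
def safe_float(value, default=None):
    """Null-safe numeric coercion: return `default` instead of raising
    when the value is None or not coercible to float."""
    if value is None:
        return default
    try:
        return float(value)
    except (TypeError, ValueError):
        return default
```

Checking key presence is not enough: `{"value": null}` passes an `in` check and then crashes `float()`; coercing through a guard like this turns a rank-killing exception into a fallback action.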
```mermaid
quadrantChart
    title LLMs + Economics: Task Structure vs Agent Role
    x-axis "Static / Fixed Tasks" --> "Sequential / Adaptive Tasks"
    y-axis "LLM as Subject" --> "LLM as Experimenter"
    quadrant-1 "Our target"
    quadrant-2 "Unexplored"
    quadrant-3 "Most prior work"
    quadrant-4 "Emerging"
    "Recon (GRPO on econ QA)": [0.2, 0.35]
    "Homo silicus (Horton)": [0.15, 0.2]
    "PT Fails for LLMs": [0.25, 0.15]
    "RLVR Negotiation": [0.7, 0.45]
    "EconAgent (macro sim)": [0.6, 0.35]
    "LotteryElicitationEnv": [0.85, 0.85]
```
Figure 2. Positioning of LotteryElicitationEnv relative to prior work. We occupy the high-sequential, high-experimenter quadrant that no prior work targets.
| Foundation | Role in this project | Citation |
|---|---|---|
| Cumulative prospect theory | Closed-form respondent: piecewise value function, expected utility, binary choice | Tversky & Kahneman, J. Risk & Uncertainty 5(4), 1992 |
| Holt–Laury risk elicitation | Fixed 10-pair battery, HL accuracy reward term, grid-search baseline | Holt & Laury, American Economic Review 92(5), 2002 |
| Bayesian OED | Motivation for adaptive > fixed; hand-derived BOED called "expensive" | Chaloner & Verdinelli, Statistical Science 10(3), 1995 |
| OpenEnv | Gym-style reset/step, WebSocket transport, HF Space deployment | HF Blog: Introducing OpenEnv |
| TRL + GRPO | GRPOTrainer, custom rollout_func, remote env rollouts | Shao et al., arXiv:2402.03300 (DeepSeekMath) |
| ReasoningEconomicsEnv/PT | Sibling project — structural template for two-repo split, rollout_func, DDP padding | Same monorepo |
```shell
# 1. Run the env locally (Python in-process)
pip install -e LotteryElicitationEnv
python -m lottery_elicitation_env.server.app

# 2. Or pull the HF Space
export ENV_BASE_URL="https://yashu2000-lotteryelicitationenv.hf.space"

# 3. Train with GRPO (1xH100 colocate)
cd LotteryElicitationPT
bash scripts/bootstrap_lambda.sh
bash scripts/preflight_lambda.sh
bash scripts/run_grpo_lambda.sh

# 4. Evaluate a checkpoint against baselines
python -m lottery_elicitation_pt.eval.evaluate \
    --policy hf --model ./outputs/ckpt-last \
    --episodes 200 --baselines random,holt_laury
```
All episodes are seeded and reproducible from (env_seed, curriculum_stage, θ_prior). No external fixtures, no live API, no human labels.
- Run the full evaluation (`max_steps=10`) and publish the comparison table: Random vs Holt–Laury-Fixed vs Trained-HF across γ-MSE, λ-MSE, HL accuracy, total reward, average steps.
- Add a per-turn shaping term computed from `history` — directly incentivize informative lottery pairs to kill guess-and-quit at the source.
- Enable stochastic choice (`noise_std > 0`) for a realism ablation — how much does the sim-to-real gap depend on clean deterministic choices?

LotteryElicitationEnv reframes an economics problem as a verifiable RL task. A non-learned prospect-theory respondent, structured JSON lottery actions, and a terminal reward grounded in MSE against ground-truth θ* give us a sequential MDP where every component is auditable.
The infrastructure contributions — NCCL desync padding for variable-length rollouts, reward-hacking diagnosis under GRPO with partially-correct shortcuts, think-block hygiene for training tensors, null-safe JSON parsing that prevents DDP cascade failures — are the lessons the next OpenEnv + TRL 1.0 + multi-turn submission will need.
The research question remains open: can a GRPO-trained LLM beat Holt–Laury's fixed battery? The pipeline to answer it is built, validated, and documented. The reward-hacking diagnosis is the first finding we contribute.