OpenEnv · AgentBeats Phase 2

LotteryElicitationEnv

An OpenEnv Benchmark Where LLMs Learn to Design Strategic Questions for Preference Elicitation.


Lottery Elicitation as Proxy for Strategic Question Design

Most LLM benchmarks test what a model can say. They rarely test whether it can uncover hidden structure through action. LotteryElicitationEnv is an OpenEnv-native environment where an agent designs sequences of lottery pairs to recover a simulated respondent's prospect-theory parameters: risk aversion (γ) and loss aversion (λ).

Each episode is adaptive. The agent proposes a lottery pair, observes a binary choice from a closed-form respondent, updates its belief in-context, and on the final turn commits a parameter estimate. Reward arrives only at termination, grounded entirely in arithmetic: mean-squared error against the ground-truth θ*, Holt–Laury consistency of the implied preferences, and an efficiency bonus for stopping early when confident.

The challenge is not language. It is verifiable experimental design: which lottery pair, right now, is most informative about (γ, λ) given the history so far?

The falsifiable claim: a GRPO-trained LLM can recover (γ, λ) more efficiently than the fixed 10-pair Holt–Laury (2002) battery that economists still use as the status-quo elicitation protocol.

Why this benchmark matters

Preference elicitation is a foundational problem in behavioral economics, marketing science, medical triage, and alignment. The dominant practice is still a fixed menu: every subject answers the same pre-specified list. An adaptive protocol that picks the next question given past answers should be strictly more sample-efficient, but hand-designing a Bayesian optimal experimental design (BOED) over the prospect-theory likelihood is expensive. We ask whether an LLM can learn that policy via RL.

The methodology is transferable. Any latent-parameter elicitation with an auditable forward model — medical triage (which symptoms to ask about next), educational diagnostics (which question reveals a student's misconception), alignment preference learning (which comparison is most informative about a human's utility function) — fits the same MDP template. Lotteries are the proxy; the capability is adaptive experimental design.

Every reward signal here is ground-truth arithmetic, not a judge. The environment samples θ*, runs a closed-form prospect-theory respondent, and scores the agent against the stored parameters. There is no circularity and no LLM judge in the loop.

Prior work & novelty

Most "LLMs + economics" work lands in one of three buckets. None occupies the cell we target:

| Prior work bucket | What it does | What it does not |
|---|---|---|
| Static economic QA — Recon (Zhou et al., arXiv:2506.00577, 2025) | SFT + GRPO on curated economic reasoning items, graded by rules on text | No sequential active choice, no continuous latent recovery |
| LLMs as agents in games — RLVR Negotiation (Liu et al., arXiv:2604.09855, 2026); EconAgent (Li et al., ACL 2024) | RL in strategic or macro simulations, verifiable reward on surplus or budget | No parameter inference from a known simulator |
| LLMs as subjects — Horton's Homo silicus (arXiv:2301.07543); "PT Fails for LLMs" (arXiv:2508.08992) | Measures whether LLMs are PT-rational | Does not train them to query a PT-rational counterpart |
| LotteryElicitationEnv (ours) | Sequential MDP, structured JSON lottery actions, terminal reward from ground-truth θ*, non-learned respondent | Not a human study (yet) |
To our knowledge, no prior work trains an LLM to adaptively design lottery pairs against a non-learned prospect-theory respondent with terminal ground-truth rewards under the GRPO + OpenEnv contract. The task, action space, and reward semantics are new; the method (RLVR / GRPO on verifiable signals) is shared with the Recon and negotiation-RLVR lineages.

Important caveat from the "PT Fails for LLMs" literature: we do not claim the policy LLM is PT-rational. We train it to design experiments against a PT-rational simulator with known θ*. The LLM is the experimenter, not the subject.

What LotteryElicitationEnv is

An OpenEnv-native sequential MDP in which an LLM agent designs lottery pairs, a closed-form prospect-theory respondent answers, and the agent is rewarded on the final turn for how accurately it recovers the respondent's hidden (γ, λ).

Each episode proceeds like this: the agent proposes a lottery pair, the closed-form respondent returns a binary choice (A or B), the agent updates its belief in-context from the growing history, and on the final turn (or earlier via terminate_early) it commits a (γ̂, λ̂) estimate and receives the terminal reward.

The agent's interface is deliberately minimal: raw JSON output, no tool-call protocol, no markdown parsing. The LLM emits text, the training client parses and steps the environment over WebSocket.

Environment design

The core contract is three Pydantic types exchanged over the OpenEnv WebSocket:

# Action (agent → env)
class LotteryElicitationAction(Action):
    lottery_a: Lottery                      # 2–3 outcomes, probs sum to 1.0
    lottery_b: Lottery
    theta_estimate: Optional[dict] = None   # {"gamma": float, "lambda": float}
    terminate_early: bool = False

# Observation (env → agent)
class LotteryElicitationObservation(Observation):
    step_idx: int
    steps_remaining: int
    max_steps: int
    history: list[dict]                     # [{lottery_a, lottery_b, choice}, ...]
    last_choice: Optional[str]              # "A" | "B"
    gamma_range: tuple[float, float]
    lambda_range: tuple[float, float]
    min_outcome_value: float
    max_outcome_value: float
    done: bool
    reward: Optional[float]
    metadata: dict

# State (server-side, hidden from the agent):
#   true_gamma, true_lambda, gamma_mse, lambda_mse, hl_accuracy

A Lottery is 2 or 3 outcomes with probabilities that must sum to 1.0. Values lie in [min_outcome_value, max_outcome_value], both surfaced on every observation so the agent cannot drift off-spec.
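The training client applies probability normalization as one of its parser guardrails (see the rollout section). A minimal sketch of what that guardrail might look like; the function name and the rejection tolerance are assumptions, not the project's actual code:

```python
def normalize_probs(outcomes, tol=0.05):
    """Renormalize near-valid probability vectors; reject clearly invalid ones.

    `outcomes` is a list of (value, probability) pairs. `tol` is a guessed
    tolerance: LLM output that is slightly off 1.0 gets repaired, garbage
    does not.
    """
    total = sum(p for _, p in outcomes)
    if abs(total - 1.0) > tol:
        raise ValueError(f"probabilities sum to {total:.3f}, not 1.0")
    # Divide through so the pair is exactly a probability distribution
    return [(x, p / total) for x, p in outcomes]
```

A near-miss like probabilities of 0.49/0.49 gets silently repaired, while a half-specified lottery raises and triggers the fallback-action path instead of crashing a rank.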

The respondent is pure arithmetic. No LLM, no heuristic:

\[ v(x) = x^{\gamma}\;\text{if } x \geq 0, \qquad v(x) = -\lambda\,(-x)^{\gamma}\;\text{if } x \lt 0 \] \[ \mathrm{EU}(L) = \sum_{i} p_i\, v(x_i) \] \[ \text{choice} = \mathrm{A}\;\text{if }\mathrm{EU}(A) \geq \mathrm{EU}(B), \quad\text{else }\mathrm{B} \]
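The value function, expected utility, and tie-breaking rule above fit in a dozen lines. A minimal sketch of the closed-form respondent (function names `pt_value` and `respond` are illustrative; the env's actual identifiers may differ):

```python
def pt_value(x: float, gamma: float, lam: float) -> float:
    # v(x) = x^γ for gains, -λ(-x)^γ for losses
    return x ** gamma if x >= 0 else -lam * (-x) ** gamma

def respond(lottery_a, lottery_b, gamma: float, lam: float) -> str:
    # Each lottery is a list of (value, probability) pairs.
    # Choice rule: A iff EU(A) >= EU(B), deterministically.
    def eu(lottery):
        return sum(p * pt_value(x, gamma, lam) for x, p in lottery)
    return "A" if eu(lottery_a) >= eu(lottery_b) else "B"

# A risk-averse respondent (γ=0.6, λ=3.0) prefers the safe 40 over a 50/50
# shot at 100: EU(A) ≈ 7.92 < v(40) ≈ 9.15
print(respond([(100, 0.5), (0, 0.5)], [(40, 1.0)], gamma=0.6, lam=3.0))  # → B
```

This is the entire "subject": no LLM, no sampling noise, just arithmetic the reward can later be audited against.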

A two-stage curriculum shapes the training distribution:

| Stage | γ sampled | λ sampled | Purpose |
|---|---|---|---|
| Stage 1 | Uniform[γlo, γhi] | Fixed at 2.25 | Shorten credit assignment, learn risk curvature first |
| Stage 2 | Uniform[γlo, γhi] | Uniform[λlo, λhi] | Full two-parameter elicitation |

Curriculum is honored both in EnvConfig and at env.reset(curriculum_stage=...), so a single server can serve both stages to different sessions concurrently.
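The stage switch amounts to a different θ* sampler at reset time. A sketch under assumed prior ranges (the env reads [γlo, γhi] and [λlo, λhi] from EnvConfig; the ranges here are illustrative):

```python
import random

def sample_theta(curriculum_stage: int,
                 gamma_range=(0.3, 1.2), lambda_range=(1.0, 4.0)):
    """Draw the hidden respondent parameters θ* = (γ, λ) at episode reset.

    Ranges are illustrative placeholders for the EnvConfig priors.
    """
    gamma = random.uniform(*gamma_range)
    # Stage 1 pins λ at 2.25 so only risk curvature must be recovered;
    # Stage 2 samples both parameters.
    lam = 2.25 if curriculum_stage == 1 else random.uniform(*lambda_range)
    return gamma, lam
```

Because the stage is also accepted per-reset, two concurrent sessions on the same server can draw from different stages.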

Why OpenEnv

OpenEnv gives us three things that matter for this submission: (1) a standard WebSocket environment contract consumable by TRL's rollout_func, (2) per-session state with SUPPORTS_CONCURRENT_SESSIONS=True and max_concurrent_envs=64, so DDP ranks can hammer the same Space without cross-talk, and (3) a uniform deployment path. The same env code runs in-process for tests, as a Docker container for development, and as a Hugging Face Space during training and evaluation.

No in-process environment imports from PT — everything crosses the wire, exactly like OpenEnv intends. No new abstractions were invented. Base types only: EnvClient, Environment, Pydantic Action / Observation. All extensions (curriculum stage, reward breakdown, history) ride on metadata. No new method signatures, no fork. The env ships with openenv.yaml, a Dockerfile, and a live Hugging Face Space.

Scoring: a three-part terminal reward, plus format shaping

The environment reward is terminal only. No mid-episode credit. On the final step the env computes:

\[ \mathrm{mse\_component} = -\Biggl[ \frac{(\hat{\gamma}-\gamma^*)^2}{(\gamma_{\mathrm{hi}}-\gamma_{\mathrm{lo}})^2} + \frac{(\hat{\lambda}-\lambda^*)^2}{(\lambda_{\mathrm{hi}}-\lambda_{\mathrm{lo}})^2} \Biggr] \] \[ \mathrm{hl\_accuracy} = \frac{1}{10}\sum_{k=1}^{10} \mathbf{1}\bigl[s_k(\hat{\theta}) = s_k(\theta^*)\bigr] \]

Here \(s_k(\cdot)\) is the predicted Holt–Laury choice on menu pair \(k\) under the implied parameters.

\[ \mathrm{efficiency\_bonus} = \frac{\max\bigl(0,\; S_{\max} - S_{\mathrm{taken}}\bigr)}{S_{\max}} \] \[ R = w_{\mathrm{mse}}\cdot \mathrm{mse\_component} + w_{\mathrm{HL}}\cdot \mathrm{hl\_accuracy} + w_{\mathrm{eff}}\cdot \mathrm{efficiency\_bonus} \]

Mapping to code: \(S_{\max}\) is max_steps, \(S_{\mathrm{taken}}\) is steps_taken, and \((w_{\mathrm{mse}}, w_{\mathrm{HL}}, w_{\mathrm{eff}})\) are mse_weight, holt_laury_weight, efficiency_weight in EnvConfig.

Defaults live in EnvConfig:

| Component | Weight | What it rewards |
|---|---|---|
| MSE | 1.0 | Closeness of θ̂ to ground truth in normalized range |
| Holt–Laury accuracy | 0.5 | Behavioral consistency: θ̂ should predict the same HL choices as θ* |
| Efficiency bonus | 0.1 | Stopping early when confident (not just guessing and quitting) |
| Missing θ penalty | −2.0 | Final-turn action with no valid theta_estimate |
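Putting the equations and default weights together, the terminal reward can be sketched as follows. The prior ranges are illustrative, and `hl_accuracy` is taken as a precomputed input here (scoring the 10-pair HL menu is a separate routine):

```python
def terminal_reward(theta_hat, theta_star, steps_taken, max_steps=10,
                    gamma_range=(0.3, 1.2), lambda_range=(1.0, 4.0),
                    hl_accuracy=1.0,
                    w_mse=1.0, w_hl=0.5, w_eff=0.1):
    """Three-part terminal reward: -normalized MSE + HL accuracy + efficiency.

    Ranges and the precomputed hl_accuracy argument are simplifying
    assumptions; weights match the documented EnvConfig defaults.
    """
    g_hat, l_hat = theta_hat
    g_star, l_star = theta_star
    # Negative range-normalized squared error on each parameter
    mse_component = -(
        (g_hat - g_star) ** 2 / (gamma_range[1] - gamma_range[0]) ** 2
        + (l_hat - l_star) ** 2 / (lambda_range[1] - lambda_range[0]) ** 2
    )
    # Bonus for unused turns
    efficiency_bonus = max(0, max_steps - steps_taken) / max_steps
    return w_mse * mse_component + w_hl * hl_accuracy + w_eff * efficiency_bonus
```

A perfect estimate that uses all 10 turns scores 0 + 0.5 + 0 = 0.5; the same estimate committed after 5 turns scores 0.55, which is the efficiency incentive at work.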

The training package adds one optional auxiliary signal: a format score (the fraction of turns that produced valid JSON, weighted at 0.1–0.75) folded into the GRPO advantage. It is intended as training wheels, removable once the model reliably emits structured output (>90% validity).

Why three components: MSE alone rewards a lucky guess. Holt–Laury accuracy alone lets the agent memorize the HL menu without recovering θ. Efficiency alone rewards guess-and-quit. Together, the three incentives force the agent to actually identify the parameters rather than match any single proxy. As we show in the Episode Traces section, the efficiency bonus interacts with curriculum design in a way that can create reward-hacking fixed points.

Architecture & training pipeline

The project is two strictly separated packages: LotteryElicitationEnv (the OpenEnv environment) and LotteryElicitationPT (the GRPO training client). They communicate exclusively over WebSocket — no in-process imports.

flowchart LR
    subgraph PT ["LotteryElicitationPT (Training)"]
        GRPO["GRPOTrainer<br/>TRL 1.0.0"]
        RF["rollout_func"]
        VLLM["vLLM<br/>colocate/server"]
        PARSE["action_parser<br/>JSON + guardrails"]
    end
    subgraph ENV ["LotteryElicitationEnv (OpenEnv)"]
        WS["FastAPI<br/>WebSocket"]
        RESP["PT Respondent<br/>v(x) = x^γ ..."]
        REW["Reward<br/>MSE + HL + eff."]
    end
    GRPO --> RF
    RF --> VLLM
    VLLM -->|"generate"| PARSE
    PARSE -->|"JSON action"| WS
    WS --> RESP
    RESP -->|"choice A/B"| WS
    WS -->|"observation"| RF
    REW -->|"terminal reward"| WS

Figure 1. System architecture. PT never imports env-side types — everything crosses the WebSocket.

Training uses GRPO (Group Relative Policy Optimization), a critic-free RL algorithm ideal for terminal-only rewards. We use TRL 1.0.0's rollout_func contract for explicit control over the generate → parse → step loop, avoiding TRL's Qwen3-only add_response_schema allowlist.

The rollout function manages: chat-template tokenization with enable_thinking=False, vLLM generation (colocate or server mode), think-block stripping, null-safe JSON parsing with 18 regression tests, probability normalization, and episode logging to reward_logs.jsonl.
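The shape of that loop can be sketched in a few lines. `client`, `generate_fn`, and `parse_action` are illustrative stand-ins, not the actual rollout_func signature:

```python
def rollout_episode(client, generate_fn, parse_action, max_turns=10):
    """One generate → parse → step loop against the remote env.

    `client` stands in for the WebSocket env client, `generate_fn` for vLLM
    generation, `parse_action` for the null-safe JSON parser.
    """
    obs = client.reset()
    transcript = []
    for _ in range(max_turns):
        completion = generate_fn(obs)        # vLLM generation on current context
        action = parse_action(completion)    # null-safe JSON parsing + guardrails
        obs = client.step(action)            # one env transition over the wire
        transcript.append((completion, action, obs))
        if obs["done"]:                      # terminal observation carries reward
            break
    return transcript, obs.get("reward")
```

The real implementation additionally caps completion tokens, strips think blocks before re-encoding, and appends each episode to reward_logs.jsonl.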

Episode traces & reward hacking

The central research-grade finding of this submission is not a converged checkpoint — it is a structural diagnosis of how GRPO collapses on multi-turn verifiable-reward environments. We show the failure, explain the mechanism, and prescribe the fix stack.

The reward-hacking episode: guess-and-quit

Under Stage 1 training (λ fixed at 2.25), the policy collapsed to the following single-turn episode:

Turn 1 (and only turn)
Agent action:
{"lottery_a": {"outcomes": [{"value": 50, "probability": 0.5}, {"value": 10, "probability": 0.5}]}, "lottery_b": {"outcomes": [{"value": 30, "probability": 1.0}]}, "theta_estimate": {"gamma": 1.0, "lambda": 2.25}, "terminate_early": true}

What happens: The agent ignores the respondent entirely. It emits a fixed guess (γ=1.0, λ=2.25) and terminates immediately. The lottery pair is irrelevant — the respondent's choice is never used.
Reward: efficiency bonus (0.9 × 0.1 = 0.09) + partial MSE credit (because λ=2.25 is literally correct under Stage 1) + some HL accuracy. Every rollout in the GRPO group gets the same reward → zero within-group variance → zero advantage → zero gradient → no learning.

The telemetry that reveals collapse

From 322 logged episodes on Qwen3-1.7B and Qwen2.5-7B-Instruct runs:

| Signal | Value | What it means |
|---|---|---|
| frac_reward_zero_std | ≈ 1.0 | Every completion in the GRPO group gets identical reward |
| loss, grad_norm | ≈ 0 | No gradient signal — policy is frozen at a fixed point |
| entropy | ≈ 1e-5 | Policy has collapsed to a single deterministic output |
| clipped_ratio | ≈ 1.0 | No policy update being applied |
| Episodes hitting max_steps | 98.4% | Before collapse, most episodes used all 10 turns |
| Tokens/step (early → late) | ~450 → ~55 | Cold-start verbosity converges; OOM is a cold-start problem |

Why this happens: the mechanism

Stage 1 fixes λ=2.25 in the data-generating process to simplify credit assignment. But this creates a partially-correct shortcut: guessing λ=2.25 is always exactly right for that parameter. Combined with the efficiency bonus rewarding early termination, the model discovers that guess-and-quit on turn 1 yields a stable reward. Since every rollout in the GRPO group finds the same shortcut, within-group reward variance drops to zero, GRPO's group-relative advantage becomes zero, and the gradient vanishes. The policy is stuck at a fixed point that is partially correct by construction.

This failure mode is general, not specific to our environment. Any GRPO run on a multi-turn verifiable-reward env with a partially-right-but-cheap shortcut has this bug latent.
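The vanishing gradient is easy to verify numerically: when every rollout in a group earns the same reward, the group-relative advantage is zero for all of them. A minimal sketch of the normalization (with an epsilon in the denominator, as is common; TRL's exact internals may differ):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantage: (r - group mean) / (group std + eps)."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Guess-and-quit: every rollout in the group earns the same reward,
# so every advantage is exactly zero and the policy gradient vanishes.
print(grpo_advantages([0.59, 0.59, 0.59, 0.59]))  # → [0.0, 0.0, 0.0, 0.0]
```

Any reward spread at all restores a signed advantage; a degenerate group gives the optimizer literally nothing to move on.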

The prescribed fix stack

  1. Phase A — Bug fixes (done): Null-safe parsing (_safe_float, _safe_int), think-block stripping, hard-cap on completion_ids, probability normalization. 18 regression tests.
  2. Phase B — Training mechanics: Raise max_completion_length, shorter episodes (5 turns for Stage 1), stronger format_weight (0.1 → 0.75), tune learning rate.
  3. Phase C — Kill reward hacking: Move to Stage 2 (both γ and λ sampled), lower efficiency_weight, raise mse_weight, force minimum-turn exploration before terminate_early can fire, randomize Stage 1 λ to a narrow band around 2.25.
  4. Phase D — Advanced shaping: Information-gain reward term using history to directly incentivize informative lottery pairs.
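The minimum-turn gate in Phase C is a one-function change on the env side. A sketch of the idea; `MIN_TURNS_BEFORE_TERMINATE` and `effective_done` are hypothetical names, not existing env code:

```python
MIN_TURNS_BEFORE_TERMINATE = 3  # hypothetical Phase C guard value

def effective_done(action: dict, step_idx: int, max_steps: int) -> bool:
    """Decide whether this step ends the episode.

    Always terminate on the last step; otherwise honor terminate_early
    only after a minimum number of observed choices, so turn-1
    guess-and-quit can never collect the efficiency bonus.
    """
    if step_idx + 1 >= max_steps:
        return True
    wants_out = bool(action.get("terminate_early", False))
    return wants_out and (step_idx + 1) >= MIN_TURNS_BEFORE_TERMINATE
```

With this gate, the cheapest terminating policy still has to sit through three respondent choices, which is exactly the data it was previously allowed to skip.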

What a good episode should look like

For contrast, here is what the agent should learn to do — an adaptive 5-turn elicitation for a respondent with γ*=0.6, λ*=3.0:

Turn 1 — Probe risk aversion (γ)
Lottery A: (100, 0.5; 0, 0.5) vs Lottery B: (40, 1.0)
Choice: B — respondent is risk-averse (γ < 1), prefers the safe option.
Turn 2 — Narrow γ range
Lottery A: (60, 0.7; 10, 0.3) vs Lottery B: (35, 1.0)
Choice: A — the expected-value gap is large enough that moderate risk aversion still picks A.
Turn 3 — Introduce losses to probe λ
Lottery A: (80, 0.5; −20, 0.5) vs Lottery B: (20, 1.0)
Choice: B — loss aversion makes the mixed gamble unattractive. Constrains λ > 2.
Turn 4 — Refine λ estimate
Lottery A: (90, 0.6; −10, 0.4) vs Lottery B: (30, 1.0)
Choice: B — still avoids losses. λ likely > 2.5.
Turn 5 — Submit estimate + terminate early
{"theta_estimate": {"gamma": 0.65, "lambda": 2.9}, "terminate_early": true}
Efficiency bonus: 5 unused turns → 0.5 × 0.1 = 0.05. MSE is low. HL accuracy is high.
The agent used losses to probe λ (turns 3–4), safe-vs-risky comparisons to probe γ (turns 1–2), and terminated early with a close estimate. This is what adaptive experimental design looks like.

Results: what we found

Baselines

The environment bundles two deterministic baselines. Both run in-process without a GPU:

| Baseline | Policy | What it isolates |
|---|---|---|
| Random lottery | Sample valid lottery pairs uniformly; return prior midpoint as θ̂ | Lower bound: beating it proves the model learned something |
| Holt–Laury fixed battery | Replay the canonical 10-pair menu; grid-search fit (γ, λ) at 0.01 resolution | Status-quo comparison from experimental economics |
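The grid-search fit behind the Holt–Laury baseline is brute force over the prior box. A sketch under assumed ranges; `predict_choices` stands in for the routine that predicts the 10 HL menu choices under a candidate (γ, λ):

```python
import itertools

def fit_theta_grid(observed_choices, predict_choices,
                   gamma_range=(0.3, 1.2), lambda_range=(1.0, 4.0), step=0.01):
    """Pick the (γ, λ) grid point whose predicted Holt–Laury choices
    best match the observed ones. Ranges here are illustrative."""
    def frange(lo, hi):
        n = int(round((hi - lo) / step))
        return [lo + i * step for i in range(n + 1)]

    best, best_score = None, -1
    for g, l in itertools.product(frange(*gamma_range), frange(*lambda_range)):
        # Count agreements between predicted and observed menu choices
        score = sum(p == o for p, o in zip(predict_choices(g, l), observed_choices))
        if score > best_score:
            best, best_score = (g, l), score
    return best
```

Note the identification limit this exposes: many grid points can tie on 10 binary choices, which is exactly why the adaptive agent's MSE target is interesting.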

The eval harness adds: zero-shot API LLMs, local vLLM-served LLMs, and a trained HF policy loaded from disk.

Baseline performance targets

| Metric | Holt–Laury fixed | Random lottery | Target for trained policy |
|---|---|---|---|
| γ MSE (normalized) | ≈ 0.02 | high | lower than HL, at fewer steps |
| λ MSE (normalized) | ≈ 0.3 | high | lower than HL |
| HL accuracy | ≈ 0.9 | ≈ 0.5 | ≥ HL |
| Steps used | 10 / 10 | 10 / 10 | < 10 via terminate_early |

Current training status

The pipeline is validated end-to-end. Convergence to a baseline-beating checkpoint is blocked by two factors: the reward-hacking fixed point diagnosed above (now understood, fix stack prescribed), and Unsloth's multi-GPU incompatibility with our FSDP + vLLM topology (see Unsloth section), which prevented scaling beyond single-GPU training within the submission window.

What we have demonstrated:

  1. An end-to-end GRPO + OpenEnv + vLLM training pipeline (rollout, parsing, reward, logging), validated against the live environment.
  2. A deployed Hugging Face Space plus two deterministic, GPU-free baselines.
  3. A reproduced and mechanistically explained reward-hacking fixed point, with a prescribed fix stack.

The research story — can a GRPO-trained LLM beat Holt–Laury's 24-year-old fixed battery? — is the experiment this submission sets up. The reward-hacking diagnosis is the finding we contribute now.

Engineering lessons

Building a real GRPO + OpenEnv + vLLM training pipeline on a multi-turn, verifiable-reward environment surfaced five categories of structural issues. We document them because the next OpenEnv submission will hit every one.

NCCL desync under variable-length episodes

In vllm_mode=server, every trainer.vllm_generation.generate() call performs gather_object → all_gather_object → broadcast_object_list. Our rollout is while not session.done, so different DDP ranks make different numbers of generate() calls per episode. NCCL collectives are sequence-numbered: different call counts per rank = permanent desync.

Symptoms: tqdm stuck at 0/14, GPUs 0–6 pinned at ~32 GiB, GPU 7 idle (vLLM). After ~30 minutes, the NCCL watchdog fires with last enqueued: 529 vs last completed: 527 on rank 1. Then UnpicklingError as ranks deserialize off-by-one collective buffers.

Fix: fixed-count padding — every rank performs exactly 8 generate() calls per episode:

DIST_SERVER_GENERATES_PER_EPISODE = 8
per_episode_generate_cap = min(max_episode_turns, DIST_SERVER_GENERATES_PER_EPISODE)
# After the real loop terminates, issue
# (DIST_SERVER_GENERATES_PER_EPISODE - num_real_generates) dummy generates
# under _temporary_vllm_max_tokens(..., 1).
# Outputs are DISCARDED. Guarded with try/finally.

Active only when vllm_mode == "server" and world_size > 1. Reward, logprobs, and credit assignment are byte-identical to the unpadded case. Any TRL rollout_func user running variable-length rollouts in server mode has this bug latent.
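The padding scheme reduces to a small wrapper around the generate loop. A sketch of the control flow only (names are illustrative; the real code runs inside rollout_func and clamps dummy generates to 1 token):

```python
DIST_SERVER_GENERATES_PER_EPISODE = 8  # fixed call count per rank per episode

def generate_with_padding(generate_fn, episode_turns):
    """Every rank issues exactly DIST_SERVER_GENERATES_PER_EPISODE
    generate() calls, so NCCL collective sequence numbers stay aligned
    across ranks even when episodes end at different turns.
    """
    outputs, real = [], 0
    try:
        for turn in episode_turns:           # real, variable-length episode
            outputs.append(generate_fn(turn))
            real += 1
    finally:
        # Pad with dummy calls whose outputs are discarded; try/finally
        # keeps the counts aligned even if the episode loop raises.
        for _ in range(DIST_SERVER_GENERATES_PER_EPISODE - real):
            generate_fn(None)
    return outputs
```

The invariant the trainer cares about is simple: eight calls per rank per episode, no matter how short the episode was.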

Unsloth & multi-GPU scaling

We invested significant time integrating Unsloth for efficient fine-tuning. Unsloth's multi-GPU support did not work with our distributed FSDP + vLLM server topology — specifically, the weight-sharding and vLLM weight-sync paths are incompatible with Unsloth's kernel replacements. This prevented us from scaling beyond single-GPU colocate training within the submission window, which in turn limited the training budget available to push past the reward-hacking fixed point.

We document this so the next OpenEnv submitter can avoid the same dead end: if your pipeline requires FSDP + vLLM server mode, Unsloth is not currently a compatible acceleration path.

Memory, parsing & model hygiene

Four issues that each crashed training runs, consolidated with their fixes:

| Issue | Root cause | Fix |
|---|---|---|
| CUDA OOM despite low max_completion_length | _rollout_one_episode concatenated generated tokens and observation suffixes across 10 turns → 4,000–5,000 token sequences per episode | Hard-cap completion_ids to max_completion_length; strip think blocks from training tensor; PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True |
| Qwen3 `<think>` blocks despite reasoning_mode=off | enable_thinking=False is prompt-side only; Qwen3 hybrid models still emit verbose think traces that truncate before JSON | _strip_think_blocks regex (closed + unclosed); re-encode only stripped JSON for completion_ids |
| float(None) crash killing all DDP ranks | LLM emitted {"value": null, "probability": 0.5}; parser checked key presence but not None-ness. One rank died → gloo cascade killed all others | _safe_float / _safe_int guardrails; 18 regression tests; fallback action instead of crash |
| FSDP1 _is_root assertion in server mode | TRL 1.0.0's _sync_fsdp1_params_to_vllm calls summon_full_params per child module, corrupting the FSDP root flag (TRL PR #3582, unmerged) | Default to vllm_mode=colocate; opt-in FSDP2 behind LEPT_FSDP2_SHARDING=1 flag |

Key takeaways

Where this submission sits

quadrantChart
    title LLMs + Economics: Task Structure vs Agent Role
    x-axis "Static / Fixed Tasks" --> "Sequential / Adaptive Tasks"
    y-axis "LLM as Subject" --> "LLM as Experimenter"
    quadrant-1 "Our target"
    quadrant-2 "Unexplored"
    quadrant-3 "Most prior work"
    quadrant-4 "Emerging"
    "Recon (GRPO on econ QA)": [0.2, 0.35]
    "Homo silicus (Horton)": [0.15, 0.2]
    "PT Fails for LLMs": [0.25, 0.15]
    "RLVR Negotiation": [0.7, 0.45]
    "EconAgent (macro sim)": [0.6, 0.35]
    "LotteryElicitationEnv": [0.85, 0.85]
      

Figure 2. Positioning of LotteryElicitationEnv relative to prior work. We occupy the high-sequential, high-experimenter quadrant that no prior work targets.

Foundations & citations

| Foundation | Role in this project | Citation |
|---|---|---|
| Cumulative prospect theory | Closed-form respondent: piecewise value function, expected utility, binary choice | Tversky & Kahneman, J. Risk & Uncertainty 5(4), 1992 |
| Holt–Laury risk elicitation | Fixed 10-pair battery, HL accuracy reward term, grid-search baseline | Holt & Laury, American Economic Review 92(5), 2002 |
| Bayesian OED | Motivation for adaptive > fixed; hand-derived BOED called "expensive" | Chaloner & Verdinelli, Statistical Science 10(3), 1995 |
| OpenEnv | Gym-style reset/step, WebSocket transport, HF Space deployment | HF Blog: Introducing OpenEnv |
| TRL + GRPO | GRPOTrainer, custom rollout_func, remote env rollouts | Shao et al., arXiv:2402.03300 (DeepSeekMath) |
| ReasoningEconomicsEnv/PT | Sibling project — structural template for two-repo split, rollout_func, DDP padding | Same monorepo |

Quick start

# 1. Run the env locally (Python in-process)
pip install -e LotteryElicitationEnv
python -m lottery_elicitation_env.server.app

# 2. Or pull the HF Space
export ENV_BASE_URL="https://yashu2000-lotteryelicitationenv.hf.space"

# 3. Train with GRPO (1xH100 colocate)
cd LotteryElicitationPT
bash scripts/bootstrap_lambda.sh
bash scripts/preflight_lambda.sh
bash scripts/run_grpo_lambda.sh

# 4. Evaluate a checkpoint against baselines
python -m lottery_elicitation_pt.eval.evaluate \
    --policy hf --model ./outputs/ckpt-last \
    --episodes 200 --baselines random,holt_laury

All episodes are seeded and reproducible from (env_seed, curriculum_stage, θ_prior). No external fixtures, no live API, no human labels.

Can an LLM learn to ask better questions than the economist's 24-year-old playbook?
Adaptive elicitation under a verifiable reward is the test. The pipeline is built. The reward-hacking diagnosis is the first finding.

Future work

Conclusion

LotteryElicitationEnv reframes an economics problem as a verifiable RL task. A non-learned prospect-theory respondent, structured JSON lottery actions, and a terminal reward grounded in MSE against ground-truth θ* give us a sequential MDP where every component is auditable.

The infrastructure contributions — NCCL desync padding for variable-length rollouts, reward-hacking diagnosis under GRPO with partially-correct shortcuts, think-block hygiene for training tensors, null-safe JSON parsing that prevents DDP cascade failures — are the lessons the next OpenEnv + TRL 1.0 + multi-turn submission will need.

The research question remains open: can a GRPO-trained LLM beat Holt–Laury's fixed battery? The pipeline to answer it is built, validated, and documented. The reward-hacking diagnosis is the first finding we contribute.