An OpenEnv Benchmark Where LLMs Learn to Design Strategic Questions for Preference Elicitation.
Most LLM benchmarks test what a model can say. They rarely test whether it can uncover hidden structure through action. LotteryElicitationEnv is an OpenEnv-native environment where an agent designs sequences of lottery pairs to recover a simulated respondent's prospect-theory parameters: risk aversion (γ) and loss aversion (λ).
Each episode is adaptive. The agent proposes a lottery pair, observes a binary choice from a closed-form respondent, updates its belief in-context, and on the final turn commits a parameter estimate. Reward arrives only at termination, grounded entirely in arithmetic: mean-squared error against the ground-truth θ*, Holt–Laury consistency of the implied preferences, and an efficiency bonus for stopping early when confident.
The challenge is not language. It is verifiable experimental design: which lottery pair, right now, is most informative about (γ, λ) given the history so far?
The falsifiable claim: a GRPO-trained LLM can recover (γ, λ) more efficiently than the fixed 10-pair Holt–Laury (2002) battery that economists still use as the status-quo elicitation protocol.
Preference elicitation is a foundational problem in behavioral economics, marketing science, medical triage, and alignment. The dominant practice is still a fixed menu: every subject answers the same pre-specified list. An adaptive protocol that picks the next question given past answers should be strictly more sample-efficient, but hand-designing a Bayesian optimal experimental design (BOED) over the prospect-theory likelihood is expensive. We ask whether an LLM can learn that policy via RL.
The methodology is transferable. Any latent-parameter elicitation with an auditable forward model — medical triage (which symptoms to ask about next), educational diagnostics (which question reveals a student's misconception), alignment preference learning (which comparison is most informative about a human's utility function) — fits the same MDP template. Lotteries are the proxy; the capability is adaptive experimental design.
Every reward signal here is ground-truth arithmetic, not a judge. The environment samples θ*, runs a closed-form prospect-theory respondent, and scores the agent against the stored parameters. There is no circularity and no LLM judge in the loop.
Most "LLMs + economics" work lands in one of three buckets. None occupies the cell we target:
| Prior work bucket | What it does | What it does not |
|---|---|---|
| Static economic QA Recon (Zhou et al., arXiv:2506.00577, 2025) | SFT + GRPO on curated economic reasoning items, graded by rules on text | No sequential active choice, no continuous latent recovery |
| LLMs as agents in games RLVR Negotiation (Liu et al., arXiv:2604.09855, 2026); EconAgent (Li et al., ACL 2024) | RL in strategic or macro simulations, verifiable reward on surplus or budget | No parameter inference from a known simulator |
| LLMs as subjects Horton's Homo silicus (arXiv:2301.07543); "PT Fails for LLMs" (arXiv:2508.08992) | Measures whether LLMs are PT-rational | Does not train them to query a PT-rational counterpart |
| LotteryElicitationEnv (ours) | Sequential MDP, structured JSON lottery actions, terminal reward from ground-truth θ*, non-learned respondent | Not a human study (yet) |
To our knowledge, no prior work trains an LLM to adaptively design lottery pairs against a non-learned prospect-theory respondent with terminal ground-truth rewards under the GRPO + OpenEnv contract. The task, action space, and reward semantics are new; the method (RLVR / GRPO on verifiable signals) is shared with the Recon and negotiation-RLVR lineages.
Important caveat from the "PT Fails for LLMs" literature: we do not claim the policy LLM is PT-rational. We train it to design experiments against a PT-rational simulator with known θ*. The LLM is the experimenter, not the subject.
An OpenEnv-native sequential MDP in which an LLM agent designs lottery pairs, a closed-form prospect-theory respondent answers, and the agent is rewarded on the final turn for how accurately it recovers the respondent's hidden (γ, λ).
Each episode proceeds like this:
1. The agent proposes a lottery pair as a structured JSON action.
2. The closed-form respondent computes prospect-theory values (optionally perturbed by `noise_std`), and the env returns the choice plus the running history.
3. The episode ends when the agent sets `terminate_early=true`, runs out of turns, or submits an estimate on the final allowed turn.

The agent's interface is deliberately minimal: raw JSON output, no tool-call protocol, no markdown parsing. The LLM emits text, and the training client parses it and steps the environment over WebSocket.
The core contract is three Pydantic types exchanged over the OpenEnv WebSocket:
```python
# Action (agent → env)
class LotteryElicitationAction(Action):
    lottery_a: Lottery              # 2–3 outcomes, probs sum to 1.0
    lottery_b: Lottery
    theta_estimate: Optional[dict]  # {"gamma": float, "lambda": float}
    terminate_early: bool = False

# Observation (env → agent)
class LotteryElicitationObservation(Observation):
    step_idx, steps_remaining, max_steps: int
    history: list[dict]             # [{lottery_a, lottery_b, choice}, ...]
    last_choice: Optional[str]      # "A" | "B"
    gamma_range, lambda_range: tuple[float, float]
    min_outcome_value, max_outcome_value: float
    done: bool; reward: Optional[float]; metadata: dict

# State (hidden from agent)
true_gamma, true_lambda, gamma_mse, lambda_mse, hl_accuracy
```
A Lottery is 2 or 3 outcomes with probabilities that must sum to 1.0. Values lie in [min_outcome_value, max_outcome_value], both surfaced on every observation so the agent cannot drift off-spec.
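As a minimal sketch of those constraints (an illustrative helper, not the env's actual API — the default value range and tolerance here are assumptions):

```python
# Hypothetical validator mirroring the stated Lottery constraints:
# 2–3 outcomes, probabilities summing to 1.0, values within the surfaced range.
def validate_lottery(outcomes, min_value=0.0, max_value=100.0, tol=1e-6):
    if not 2 <= len(outcomes) <= 3:
        return False
    if abs(sum(o["probability"] for o in outcomes) - 1.0) > tol:
        return False
    return all(min_value <= o["value"] <= max_value for o in outcomes)
```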
The respondent is pure arithmetic. No LLM, no heuristic:
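A minimal sketch of that respondent, assuming the piecewise value function v(x) = x^γ for gains and −λ·(−x)^γ for losses, probability-weighted summation with no probability weighting function, and a deterministic tie-break toward A (the noise-free case; `noise_std` would perturb the comparison):

```python
def pt_value(x, gamma, lam):
    """Piecewise prospect-theory value: gains curve with γ, losses scaled by λ."""
    return x ** gamma if x >= 0 else -lam * ((-x) ** gamma)

def pt_utility(lottery, gamma, lam):
    """Probability-weighted sum of outcome values for a list of outcome dicts."""
    return sum(o["probability"] * pt_value(o["value"], gamma, lam) for o in lottery)

def respond(lottery_a, lottery_b, gamma, lam):
    """Deterministic binary choice: pick the lottery with the higher PT value."""
    ua = pt_utility(lottery_a, gamma, lam)
    ub = pt_utility(lottery_b, gamma, lam)
    return "A" if ua >= ub else "B"
```

A risk-averse respondent (γ < 1) prefers the sure 30 over a 50/50 gamble on 50-or-10; a risk-seeking one (γ > 1) flips, and λ only matters once losses enter the menu.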
A two-stage curriculum shapes the training distribution:
| Stage | γ sampled | λ sampled | Purpose |
|---|---|---|---|
| Stage 1 | Uniform[γlo, γhi] | Fixed at 2.25 | Shorten credit assignment, learn risk curvature first |
| Stage 2 | Uniform[γlo, γhi] | Uniform[λlo, λhi] | Full two-parameter elicitation |
Curriculum is honored both in EnvConfig and at env.reset(curriculum_stage=...), so a single server can serve both stages to different sessions concurrently.
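The stage logic amounts to a per-reset sampler along these lines (a sketch; the ranges shown are placeholder assumptions, the real bounds live in EnvConfig):

```python
import random

def sample_theta(stage, gamma_range=(0.3, 1.5), lambda_range=(1.0, 4.0), rng=random):
    """Two-stage curriculum sampler: Stage 1 fixes λ at 2.25, Stage 2 samples both."""
    gamma = rng.uniform(*gamma_range)
    lam = 2.25 if stage == 1 else rng.uniform(*lambda_range)
    return {"gamma": gamma, "lambda": lam}
```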
OpenEnv gives us three things that matter for this submission: (1) a standard WebSocket environment contract consumable by TRL's rollout_func, (2) per-session state with SUPPORTS_CONCURRENT_SESSIONS=True and max_concurrent_envs=64, so DDP ranks can hammer the same Space without cross-talk, and (3) a uniform deployment path. The same env code runs in-process for tests, as a Docker container for development, and as a Hugging Face Space during training and evaluation.
No in-process environment imports from PT — everything crosses the wire, exactly like OpenEnv intends. No new abstractions were invented. Base types only: EnvClient, Environment, Pydantic Action / Observation. All extensions (curriculum stage, reward breakdown, history) ride on metadata. No new method signatures, no fork. The env ships with openenv.yaml, a Dockerfile, and a live Hugging Face Space.
The environment reward is terminal only. No mid-episode credit. On the final step the env computes:
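The displayed reward equation did not survive extraction; the following is a plausible reconstruction from the component descriptions and weight names below (an assumed form, not the verified implementation):

\[
R \;=\; -\,w_{\mathrm{mse}}\,\mathrm{MSE}\!\left(\hat\theta,\theta^{*}\right)
\;+\; w_{\mathrm{HL}}\,\frac{1}{10}\sum_{k=1}^{10}\mathbf{1}\!\left[s_k(\hat\theta)=s_k(\theta^{*})\right]
\;+\; w_{\mathrm{eff}}\,\frac{S_{\max}-S_{\mathrm{taken}}}{S_{\max}}
\]

with the MSE computed in the normalized parameter ranges, and a fixed penalty of −2.0 applied instead when no valid theta_estimate is submitted on the final turn.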
Here \(s_k(\cdot)\) is the predicted Holt–Laury choice on menu pair \(k\) under the implied parameters.
Mapping to code: \(S_{\max}\) is max_steps, \(S_{\mathrm{taken}}\) is steps_taken, and \((w_{\mathrm{mse}}, w_{\mathrm{HL}}, w_{\mathrm{eff}})\) are mse_weight, holt_laury_weight, efficiency_weight in EnvConfig.
Defaults live in EnvConfig:
| Component | Weight | What it rewards |
|---|---|---|
| MSE | 1.0 | Closeness of θ̂ to ground truth in normalized range |
| Holt–Laury accuracy | 0.5 | Behavioral consistency: θ̂ should predict the same HL choices as θ* |
| Efficiency bonus | 0.1 | Stopping early when confident (not just guessing and quitting) |
| Missing θ penalty | −2.0 | Final-turn action with no valid theta_estimate |
The training package adds one optional auxiliary reward: a format score (the fraction of turns that produced valid JSON, weighted at 0.1–0.75) mixed into the GRPO advantage. It is intended as training wheels, removable once the model reliably emits structured output (>90% validity).
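The mixing is simple arithmetic; a sketch (illustrative names, not the training package's actual API):

```python
def mixed_reward(env_reward, num_turns, num_valid_json_turns, format_weight=0.1):
    """Add the auxiliary format score (fraction of turns with valid JSON)
    to the terminal environment reward before computing GRPO advantages."""
    format_score = num_valid_json_turns / max(num_turns, 1)
    return env_reward + format_weight * format_score
```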
Why three components: MSE alone rewards a lucky guess. Holt–Laury accuracy alone lets the agent memorize the HL menu without recovering θ. Efficiency alone rewards guess-and-quit. The product of incentives forces the agent to actually identify the parameters, not just match a proxy. As we show in the Episode Traces section, the efficiency bonus interacts with curriculum design in a way that can create reward-hacking fixed points.
The project is two strictly separated packages: LotteryElicitationEnv (the OpenEnv environment) and LotteryElicitationPT (the GRPO training client). They communicate exclusively over WebSocket — no in-process imports.
```mermaid
flowchart LR
    subgraph PT ["LotteryElicitationPT (Training)"]
        GRPO["GRPOTrainer
        TRL 1.0.0"]
        RF["rollout_func"]
        VLLM["vLLM
        colocate/server"]
        PARSE["action_parser
        JSON + guardrails"]
    end
    subgraph ENV ["LotteryElicitationEnv (OpenEnv)"]
        WS["FastAPI
        WebSocket"]
        RESP["PT Respondent
        v(x) = x^γ ..."]
        REW["Reward
        MSE + HL + eff."]
    end
    GRPO --> RF
    RF --> VLLM
    VLLM -->|"generate"| PARSE
    PARSE -->|"JSON action"| WS
    WS --> RESP
    RESP -->|"choice A/B"| WS
    WS -->|"observation"| RF
    REW -->|"terminal reward"| WS
```
Figure 1. System architecture. PT never imports env-side types — everything crosses the WebSocket.
Training uses GRPO (Group Relative Policy Optimization), a critic-free RL algorithm ideal for terminal-only rewards. We use TRL 1.0.0's rollout_func contract for explicit control over the generate → parse → step loop, avoiding TRL's Qwen3-only add_response_schema allowlist.
The rollout function manages: chat-template tokenization with enable_thinking=False, vLLM generation (colocate or server mode), think-block stripping, null-safe JSON parsing with 18 regression tests, probability normalization, and episode logging to reward_logs.jsonl.
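The think-block stripping step can be sketched as follows (an illustrative version handling both closed and unclosed `<think>` spans; names and exact regexes are assumptions, not the project's `_strip_think_blocks`):

```python
import re

# Drop closed <think>…</think> spans, then any unclosed trailing <think> span.
_THINK_CLOSED = re.compile(r"<think>.*?</think>", re.DOTALL)
_THINK_OPEN = re.compile(r"<think>.*$", re.DOTALL)

def strip_think_blocks(text: str) -> str:
    text = _THINK_CLOSED.sub("", text)
    text = _THINK_OPEN.sub("", text)  # unclosed block: drop to end of string
    return text.strip()
```

If a truncated generation never closes its think block, everything after `<think>` is discarded, which is why an unclosed-block rule matters: the JSON action may otherwise be buried in reasoning residue.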
The central research-grade finding of this submission is not a converged checkpoint — it is a structural diagnosis of how GRPO collapses on multi-turn verifiable-reward environments. We show the failure, explain the mechanism, and prescribe the fix stack.
Under Stage 1 training (λ fixed at 2.25), the policy collapsed to the following single-turn episode:
```json
{"lottery_a": {"outcomes": [{"value": 50, "probability": 0.5}, {"value": 10, "probability": 0.5}]}, "lottery_b": {"outcomes": [{"value": 30, "probability": 1.0}]}, "theta_estimate": {"gamma": 1.0, "lambda": 2.25}, "terminate_early": true}
```
From 322 logged episodes on Qwen3-1.7B and Qwen2.5-7B-Instruct runs:
| Signal | Value | What it means |
|---|---|---|
| `frac_reward_zero_std` | ≈ 1.0 | Every completion in the GRPO group gets identical reward |
| `loss`, `grad_norm` | ≈ 0 | No gradient signal — policy is frozen at a fixed point |
| `entropy` | ≈ 1e-5 | Policy has collapsed to a single deterministic output |
| `clipped_ratio` | ≈ 1.0 | No policy update being applied |
| Episodes hitting `max_steps` | 98.4% | Before collapse, most episodes used all 10 turns |
| Tokens/step (early → late) | ~450 → ~55 | Cold-start verbosity converges; OOM is a cold-start problem |
Stage 1 fixes λ=2.25 in the data-generating process to simplify credit assignment. But this creates a partially-correct shortcut: guessing λ=2.25 is always exactly right for that parameter. Combined with the efficiency bonus rewarding early termination, the model discovers that guess-and-quit on turn 1 yields a stable reward. Since every rollout in the GRPO group finds the same shortcut, within-group reward variance drops to zero, GRPO's group-relative advantage becomes zero, and the gradient vanishes. The policy is stuck at a fixed point that is partially correct by construction.
This failure mode is general, not specific to our environment. Any GRPO run on a multi-turn verifiable-reward env with a partially-right-but-cheap shortcut has this bug latent.
The prescribed fix stack:

- Parsing guardrails (`_safe_float`, `_safe_int`), think-block stripping, a hard cap on `completion_ids`, probability normalization. 18 regression tests.
- Lower `max_completion_length`, shorter episodes (5 turns for Stage 1), a stronger `format_weight` (0.1 → 0.75), and learning-rate tuning.
- Lower `efficiency_weight`, raise `mse_weight`, force minimum-turn exploration before `terminate_early` can fire, and randomize Stage 1 λ to a narrow band around 2.25.
- A per-turn shaping term computed from `history` to directly incentivize informative lottery pairs.

For contrast, here is what the agent should learn to do — an adaptive 5-turn elicitation for a respondent with γ*=0.6, λ*=3.0:
The final-turn submission:

```json
{"theta_estimate": {"gamma": 0.65, "lambda": 2.9}, "terminate_early": true}
```

The environment bundles two deterministic baselines. Both run in-process without a GPU:
| Baseline | Policy | What it isolates |
|---|---|---|
| Random lottery | Sample valid lottery pairs uniformly; return prior midpoint as θ̂ | Lower bound: beating it proves the model learned something |
| Holt–Laury fixed battery | Replay the canonical 10-pair menu; grid-search fit (γ, λ) at 0.01 resolution | Status-quo comparison from experimental economics |
The eval harness adds: zero-shot API LLMs, local vLLM-served LLMs, and a trained HF policy loaded from disk.
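The Holt–Laury baseline's fitting step is a plain grid search over (γ, λ) maximizing agreement with the observed choices. A compact sketch (coarse grid for illustration; the real baseline uses 0.01 resolution):

```python
def _ptv(x, g, lam):
    """Piecewise PT value: x^γ for gains, -λ·(-x)^γ for losses."""
    return x ** g if x >= 0 else -lam * ((-x) ** g)

def _choice(pair, g, lam):
    """Predicted choice between two lotteries (lists of outcome dicts)."""
    ua = sum(o["probability"] * _ptv(o["value"], g, lam) for o in pair[0])
    ub = sum(o["probability"] * _ptv(o["value"], g, lam) for o in pair[1])
    return "A" if ua >= ub else "B"

def grid_fit(pairs, choices, gammas, lams):
    """Return the (γ, λ) grid point whose predicted choices best match the data."""
    best, best_score = None, -1
    for g in gammas:
        for lam in lams:
            score = sum(_choice(p, g, lam) == c for p, c in zip(pairs, choices))
            if score > best_score:
                best, best_score = (g, lam), score
    return best
```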
| Metric | Holt–Laury fixed | Random lottery | Target for trained policy |
|---|---|---|---|
| γ MSE (normalized) | ≈ 0.02 | high | lower than HL, at fewer steps |
| λ MSE (normalized) | ≈ 0.3 | high | lower than HL |
| HL accuracy | ≈ 0.9 | ≈ 0.5 | ≥ HL |
| Steps used | 10 / 10 | 10 / 10 | < 10 via terminate_early |
The pipeline is validated end-to-end. Convergence to a baseline-beating checkpoint is blocked by two factors: the reward-hacking fixed point diagnosed above (now understood, fix stack prescribed), and Unsloth's multi-GPU incompatibility with our FSDP + vLLM topology (see Unsloth section), which prevented scaling beyond single-GPU training within the submission window.
What we have demonstrated:
- A working end-to-end GRPO + OpenEnv + vLLM pipeline with per-episode logging to `reward_logs.jsonl`.
- A stable colocate memory budget after lowering `LEPT_VLLM_GPU_UTIL` from 0.90 to 0.75–0.80.
- A hardened parser, including the `float(None)` edge case (fixed in Session 12).

The research story — can a GRPO-trained LLM beat Holt–Laury's 24-year-old fixed battery? — is the experiment this submission sets up. The reward-hacking diagnosis is the finding we contribute now.
Building a real GRPO + OpenEnv + vLLM training pipeline on a multi-turn, verifiable-reward environment surfaced five categories of structural issues. We document them because the next OpenEnv submission will hit every one.
In vllm_mode=server, every trainer.vllm_generation.generate() call performs gather_object → all_gather_object → broadcast_object_list. Our rollout is while not session.done, so different DDP ranks make different numbers of generate() calls per episode. NCCL collectives are sequence-numbered: different call counts per rank = permanent desync.
Symptoms: tqdm stuck at 0/14, GPUs 0–6 pinned at ~32 GiB, GPU 7 idle (vLLM). After ~30 minutes, the NCCL watchdog fires with last enqueued: 529 vs last completed: 527 on rank 1. Then UnpicklingError as ranks deserialize off-by-one collective buffers.
Fix: fixed-count padding — every rank performs exactly 8 generate() calls per episode:
```python
DIST_SERVER_GENERATES_PER_EPISODE = 8
per_episode_generate_cap = min(max_episode_turns, 8)
# After the real loop terminates, issue (8 - num_real_generates)
# dummy generates under _temporary_vllm_max_tokens(..., 1).
# Outputs are DISCARDED. Guarded with try/finally.
```
Active only when vllm_mode == "server" and world_size > 1. Reward, logprobs, and credit assignment are byte-identical to the unpadded case. Any TRL rollout_func user running variable-length rollouts in server mode has this bug latent.
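The scheme reduces to this shape (a self-contained sketch with a stand-in `generate` callable, not the actual TRL-integrated code):

```python
def run_episode_with_padding(generate, episode_turns, cap=8):
    """Every rank makes exactly `cap` generate() calls per episode so NCCL
    collective counts stay aligned across ranks. Dummy calls use a 1-token
    budget and their outputs are discarded."""
    real = min(episode_turns, cap)
    outputs = [generate(max_tokens=256) for _ in range(real)]  # real turns
    for _ in range(cap - real):
        generate(max_tokens=1)  # padding: output thrown away
    return outputs
```

Whether an episode ends after 3 turns or runs the full 10, every rank issues the same number of `generate()` calls, so the sequence-numbered collectives never desynchronize.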
We invested significant time integrating Unsloth for efficient fine-tuning. Unsloth's multi-GPU support did not work with our distributed FSDP + vLLM server topology — specifically, the weight-sharding and vLLM weight-sync paths are incompatible with Unsloth's kernel replacements. This prevented us from scaling beyond single-GPU colocate training within the submission window, which in turn limited the training budget available to push past the reward-hacking fixed point.
We document this so the next OpenEnv submitter can avoid the same dead end: if your pipeline requires FSDP + vLLM server mode, Unsloth is not currently a compatible acceleration path.
Four issues that each crashed training runs, consolidated with their fixes:
| Issue | Root cause | Fix |
|---|---|---|
| CUDA OOM despite low `max_completion_length` | `_rollout_one_episode` concatenated generated tokens and observation suffixes across 10 turns → 4,000–5,000-token sequences per episode | Hard-cap `completion_ids` to `max_completion_length`; strip think blocks from the training tensor; `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` |
| Qwen3 `<think>` blocks despite `reasoning_mode=off` | `enable_thinking=False` is prompt-side only; Qwen3 hybrid models still emit verbose think traces that truncate before JSON | `_strip_think_blocks` regex (closed + unclosed); re-encode only stripped JSON for `completion_ids` |
| `float(None)` crash killing all DDP ranks | LLM emitted `{"value": null, "probability": 0.5}`; parser checked key presence but not None-ness. One rank died → gloo cascade killed all others | `_safe_float` / `_safe_int` guardrails; 18 regression tests; fallback action instead of crash |
| FSDP1 `_is_root` assertion in server mode | TRL 1.0.0's `_sync_fsdp1_params_to_vllm` calls `summon_full_params` per child module, corrupting the FSDP root flag (TRL PR #3582, unmerged) | Default to `vllm_mode=colocate`; opt-in FSDP2 behind `LEPT_FSDP2_SHARDING=1` flag |
max_completion_length ≠ what TRL trains on. The rollout function keeps appending per-turn generations + observation suffixes until you explicitly hard-cap.
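The null-safe guardrail from the table reduces to a few lines. A sketch (illustrative, not the project's exact `_safe_float`):

```python
def safe_float(value, default=None):
    """Null-safe numeric coercion: return `default` instead of raising
    when the value is None or not coercible to float."""
    if value is None:
        return default
    try:
        return float(value)
    except (TypeError, ValueError):
        return default
```

Checking key presence is not enough: `{"value": null}` passes an `in` check and then crashes `float()`; coercing through a guard like this turns a rank-killing exception into a fallback action.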
```mermaid
quadrantChart
    title LLMs + Economics: Task Structure vs Agent Role
    x-axis "Static / Fixed Tasks" --> "Sequential / Adaptive Tasks"
    y-axis "LLM as Subject" --> "LLM as Experimenter"
    quadrant-1 "Our target"
    quadrant-2 "Unexplored"
    quadrant-3 "Most prior work"
    quadrant-4 "Emerging"
    "Recon (GRPO on econ QA)": [0.2, 0.35]
    "Homo silicus (Horton)": [0.15, 0.2]
    "PT Fails for LLMs": [0.25, 0.15]
    "RLVR Negotiation": [0.7, 0.45]
    "EconAgent (macro sim)": [0.6, 0.35]
    "LotteryElicitationEnv": [0.85, 0.85]
```
Figure 2. Positioning of LotteryElicitationEnv relative to prior work. We occupy the high-sequential, high-experimenter quadrant that no prior work targets.
| Foundation | Role in this project | Citation |
|---|---|---|
| Cumulative prospect theory | Closed-form respondent: piecewise value function, expected utility, binary choice | Tversky & Kahneman, J. Risk & Uncertainty 5(4), 1992 |
| Holt–Laury risk elicitation | Fixed 10-pair battery, HL accuracy reward term, grid-search baseline | Holt & Laury, American Economic Review 92(5), 2002 |
| Bayesian OED | Motivation for adaptive > fixed; hand-derived BOED called "expensive" | Chaloner & Verdinelli, Statistical Science 10(3), 1995 |
| OpenEnv | Gym-style reset/step, WebSocket transport, HF Space deployment | HF Blog: Introducing OpenEnv |
| TRL + GRPO | GRPOTrainer, custom rollout_func, remote env rollouts | Shao et al., arXiv:2402.03300 (DeepSeekMath) |
| ReasoningEconomicsEnv/PT | Sibling project — structural template for two-repo split, rollout_func, DDP padding | Same monorepo |
```shell
# 1. Run the env locally (Python in-process)
pip install -e LotteryElicitationEnv
python -m lottery_elicitation_env.server.app

# 2. Or pull the HF Space
export ENV_BASE_URL="https://yashu2000-lotteryelicitationenv.hf.space"

# 3. Train with GRPO (1xH100 colocate)
cd LotteryElicitationPT
bash scripts/bootstrap_lambda.sh
bash scripts/preflight_lambda.sh
bash scripts/run_grpo_lambda.sh

# 4. Evaluate a checkpoint against baselines
python -m lottery_elicitation_pt.eval.evaluate \
    --policy hf --model ./outputs/ckpt-last \
    --episodes 200 --baselines random,holt_laury
```
All episodes are seeded and reproducible from (env_seed, curriculum_stage, θ_prior). No external fixtures, no live API, no human labels.
- Run the full evaluation (`max_steps=10`) and publish the comparison table: Random vs Holt–Laury-Fixed vs Trained-HF across γ-MSE, λ-MSE, HL accuracy, total reward, average steps.
- Add a per-turn shaping term computed from `history` — directly incentivize informative lottery pairs to kill guess-and-quit at the source.
- Enable stochastic choice (`noise_std > 0`) for a realism ablation — how much does the sim-to-real gap depend on clean deterministic choices?

LotteryElicitationEnv reframes an economics problem as a verifiable RL task. A non-learned prospect-theory respondent, structured JSON lottery actions, and a terminal reward grounded in MSE against ground-truth θ* give us a sequential MDP where every component is auditable.
The infrastructure contributions — NCCL desync padding for variable-length rollouts, reward-hacking diagnosis under GRPO with partially-correct shortcuts, think-block hygiene for training tensors, null-safe JSON parsing that prevents DDP cascade failures — are the lessons the next OpenEnv + TRL 1.0 + multi-turn submission will need.
The research question remains open: can a GRPO-trained LLM beat Holt–Laury's fixed battery? The pipeline to answer it is built, validated, and documented. The reward-hacking diagnosis is the first finding we contribute.