Last updated Jul 7, 2026

Open Source Contributions

Selected merged upstream PRs to RL, agent infrastructure, and LLM serving systems.

AReaL

CISPO loss surrogate - added the clipped importance-sampling (CISPO) loss surrogate from MiniMax-M1 to AReaL's PPO path.
vLLM generation request parity - forwarded frequency penalties and stop conditions through the vLLM generation backend.
Reward scoring failure guard - guarded a CLEVR reward function against scoring failures instead of letting one bad sample break evaluation.
Robust free-port selection - fixed free-port discovery so out-of-range exclusions do not block valid worker ports.
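
As an illustration of the free-port item above, here is a minimal sketch of port selection in which exclusions outside the searched range are simply ignored. The function names, the port range, and the bind-based probe are assumptions made for this example, not AReaL's actual code.

```python
import socket
from typing import Iterable, List, Optional


def candidate_ports(low: int, high: int, exclude: Iterable[int]) -> List[int]:
    """Return ports in [low, high] that are not excluded.

    Exclusions outside [low, high] are dropped first, so they can never
    shrink or block the set of valid candidates.
    """
    in_range = {p for p in exclude if low <= p <= high}
    return [p for p in range(low, high + 1) if p not in in_range]


def pick_free_port(low: int, high: int, exclude: Iterable[int] = ()) -> Optional[int]:
    """Return the first candidate port that can be bound locally, or None."""
    for port in candidate_ports(low, high, exclude):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
            try:
                sock.bind(("127.0.0.1", port))
            except OSError:
                continue  # port is in use; try the next candidate
            return port
    return None
```

Here `candidate_ports(5000, 5002, [80, 5001, 70000])` keeps 5000 and 5002: only the in-range exclusion 5001 has any effect.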

Batched rollout metrics - computed SkyRL Gym batched rollout metrics from truncated responses, matching the tokens returned for training.

Rollout importance-sampling metrics - computed sequence-level high/low rollout importance-sampling fractions from raw weights instead of clamped weights.
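
A small sketch of why the choice between raw and clamped weights matters for this metric. The thresholds, the function names, and the use of strict inequalities are assumptions for illustration; the upstream implementation may differ.

```python
def is_weight_fractions(weights, low, high):
    """Fraction of per-sequence weights above `high` and below `low`."""
    n = len(weights)
    if n == 0:
        return 0.0, 0.0
    frac_high = sum(w > high for w in weights) / n
    frac_low = sum(w < low for w in weights) / n
    return frac_high, frac_low


def clamp(weights, low, high):
    """Clamp every weight into [low, high]."""
    return [min(max(w, low), high) for w in weights]


raw = [0.2, 0.9, 1.0, 3.5]  # hypothetical sequence-level importance weights
low, high = 0.5, 2.0        # hypothetical clamp bounds

print(is_weight_fractions(raw, low, high))                    # (0.25, 0.25)
print(is_weight_fractions(clamp(raw, low, high), low, high))  # (0.0, 0.0)
```

After clamping, no weight can lie strictly outside the bounds, so fractions computed from clamped weights always read zero under this definition. Computing them from the raw weights keeps the metric informative.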

seed_oss streaming parser support - ported seed_oss to the streaming parser engine as a Qwen3 subclass for frontend parsing.
Anthropic empty-completion compatibility - returned an explicit content block for empty Anthropic completions instead of an invalid response shape.
Non-ASCII tool-call argument emission - kept non-ASCII tool-call arguments readable instead of escaping them as \uXXXX sequences.
Matryoshka embedding dimension validation - rejected oversized Matryoshka embedding dimensions instead of silently returning hidden-size vectors.
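
To illustrate the non-ASCII tool-call item above, this sketch uses Python's standard `json` module to show the difference between escaped and readable argument output. The argument payload is made up for the example, and the upstream parser may serialize arguments differently.

```python
import json

# Hypothetical tool-call arguments containing non-ASCII text.
args = {"city": "東京", "note": "café"}

# Default behavior: every non-ASCII character becomes a \uXXXX escape.
escaped = json.dumps(args)

# With ensure_ascii=False the characters are emitted as-is.
readable = json.dumps(args, ensure_ascii=False)

print(escaped)
print(readable)

# Both strings decode to the same object; only readability differs.
assert json.loads(escaped) == json.loads(readable) == args
```

Both forms are valid JSON, so the change affects what a human or a downstream model sees in the emitted text, not what a JSON decoder reconstructs.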

Stop-string precedence under speculative decoding - fixed stop-string trimming when EOS and a stop string are accepted in the same decode step.

Gemma 4 VLM dispatch and softcapping - registered Gemma 4 as a vision-language model and preserved nested logit softcapping during training.

Gemma4 dense and MoE support - added slime-native Gemma4 model, conversion, loss-mask, script, doc, and test support for dense and MoE checkpoints.
Empty colocated weight bucket handling - fixed raw weight sync when uneven tensor chunks leave a tensor-parallel rank with no local Hugging Face tensors.
CISPO advantage estimator - added the MiniMax-M1 CISPO advantage-estimator option at slime's existing policy-loss seam, with tests for surrogate value and gradient routing.
Dr.GRPO docs reference cleanup - removed a dangling custom-reducer example reference from the Dr.GRPO docs.

dspy.RLM agent - added a host-side agent with a sandbox tool bridge and deterministic tests.
Scoped trial log streaming - added structured live stdout/stderr callbacks for long-running trials.
mini-swe-agent credential env handling - fixed host-side credential and API-base resolution from the configured agent env.
Agent env propagation - propagated configured agent environment variables through every agent load path.
Sandbox env secret reuse - reused environment secrets consistently across Modal sandbox operations.
Agent install fix - fixed install scripts that failed when uv's env file is absent.
Adapter docs fix - aligned adapter README filenames with the validator contract.