Existing model training methods, such as reinforcement learning with verifiable rewards (RLVR), are bottlenecked in domains without a clear reward signal.
Reinforcement learning continues to drive progress in domains where correctness can be automatically verified: coding, maths, structured extraction. But non-verifiable, open-ended domains (finance, law, medicine, biology, cybersecurity), where the bulk of economically valuable work sits, lack an automatic reward signal, so existing data in these domains cannot be used directly for model training.
Capability gains in these domains require new data collection formats – ones that capture expert reasoning and intent at the process level, codifying elements like judgement, deliberation, backtracking, and hypothesis construction – from which clear reward signals can be constructed, making the data usable for model training.








