Member of Technical Staff, Runtime Evals & Post-Training Systems
This is one of the most consequential engineering roles at Disruptive Rain right now. It is the role that determines whether our platform is merely impressive or genuinely trustworthy. That distinction is everything to us.
About Disruptive Rain
Disruptive Rain exists to build AI that organizations can actually trust. We are developing a proprietary large language model, a production multi-agent orchestration platform, and the infrastructure to run both at the scale that enterprises demand. Our team is small, extraordinarily capable, and relentlessly focused on building systems that perform when it matters. We do not tolerate mediocrity in our code, in our models, or in our standards. Everyone here is expected to contribute directly to the mission, take full ownership of their work, and raise the bar for the people around them. Leadership is not assigned. It is earned by people who ship great things and do not make excuses.
About the Role
We have built a powerful AI platform. Multi-agent orchestration, tool use, real-time inference, connector-backed tasks, approval-governed workflows. The infrastructure is live. The capability is real. What we need now is the person who can make it reliable at the level that the world's most demanding organizations will require of us.
That person is you, if you have spent your career thinking seriously about how to close the loop between what an AI system does in production and what it learns from that experience. If you have built evaluation frameworks that actually catch failures before customers do. If you have turned runtime traces into training data that made a model measurably better. If you know the difference between a model problem and a product problem and a data problem, and you know it from experience rather than intuition.
Own the evaluation architecture
Design and execute end-to-end evaluation suites that cover multi-agent workflows, tool invocation, approval-governed actions, and connector-backed tasks across the full production surface.
Turn failures into training signal
Build trace triage and sample-repair workflows that transform production failures into high-signal training data. Raise the bar for how runtime-lab promotes trajectories into post-training so that only verified, high-quality examples shape the model.
Build a quality operating system
Define coverage goals, failure taxonomies, and release gates so that model quality is managed as an engineering discipline and not a feeling. Create the dashboards and review cadences that let the team see exactly why model quality is moving in either direction, with evidence, not guesses. Help define what it means for a new agent, tool, or connector to graduate from experimental to trusted. (A rough sketch of what a suite and release gate could look like follows these pillars.)
Close the loop fast
Partner with inference, orchestrator, and product engineers to localize failures fast and ship fixes faster. Author gold-standard training samples when the product capability exists but the model has not yet internalized it.
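To make the shape of the work concrete, here is a minimal sketch of an evaluation suite feeding a release gate. Every name in it (EvalCase, run_suite, release_gate, the surface labels) is illustrative rather than part of our actual platform; the real suites span multi-agent workflows, tool invocation, approval-governed actions, and connector-backed tasks.

from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    name: str
    surface: str                   # e.g. "tool_invocation", "approval_workflow", "connector_task"
    run: Callable[[], dict]        # executes the agent workflow and returns a trace summary
    check: Callable[[dict], bool]  # verifies the trace against the expected behavior

def run_suite(cases: list[EvalCase]) -> dict[str, float]:
    # Run every case and report the pass rate per production surface.
    results: dict[str, list[bool]] = {}
    for case in cases:
        trace = case.run()
        results.setdefault(case.surface, []).append(case.check(trace))
    return {surface: sum(passed) / len(passed) for surface, passed in results.items()}

def release_gate(pass_rates: dict[str, float], floors: dict[str, float]) -> bool:
    # A candidate model ships only if every gated surface clears its floor.
    return all(pass_rates.get(surface, 0.0) >= floor for surface, floor in floors.items())

The point is the discipline, not the specific code: every production surface has explicit coverage, and a release either clears its gate or it does not ship.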
If you succeed in this role, you will have defined the canonical evaluation architecture for live agent behavior across runtime-lab, traces, suites, and post-training promotion.
The highest-value capability gaps will have owners, runbooks, and coverage targets.
Promotion into training data will be disciplined, automated, and no longer dependent on manual judgment calls that do not scale.
Regression detection will be faster, sharper, and grounded in metrics the entire team can read and act on.
Disruptive Rain already has ambitious infrastructure for multi-agent execution, trace capture, runtime evaluation, and post-training. The opportunity now is to make that machinery compound. This role is how we move from interesting capabilities to trusted capabilities, and that is the only version of this platform worth building.
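In the same spirit, here is a minimal sketch of the promotion step that turns a repaired production trace into a post-training sample. The names (Trace, repair, promote_to_training) are assumptions for illustration, not the actual runtime-lab interfaces; the principle is that nothing reaches the training set without a passing verdict and explicit verification.

from dataclasses import dataclass

@dataclass
class Trace:
    trace_id: str
    messages: list[dict]  # the agent's recorded turns and tool calls
    verdict: str          # "pass", "fail", or "needs_review" from runtime evaluation

def repair(trace: Trace, corrected_messages: list[dict]) -> Trace:
    # A reviewer replaces the failing turns with the intended behavior,
    # producing a candidate that must still be re-verified before promotion.
    return Trace(trace.trace_id, corrected_messages, verdict="needs_review")

def promote_to_training(traces: list[Trace], verified_ids: set[str]) -> list[dict]:
    # Only traces that both passed evaluation and were explicitly verified
    # become training samples; everything else stays out of post-training.
    return [
        {"trace_id": t.trace_id, "messages": t.messages}
        for t in traces
        if t.verdict == "pass" and t.trace_id in verified_ids
    ]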
What We Are Looking For
You have shipped production AI systems where model quality was your direct responsibility.
You have designed evaluations that go well beyond benchmark scores and can defend every methodological decision you made.
You move fluidly between applied machine learning and production engineering because you understand that the two are inseparable at this level of the problem.
You debug complex distributed systems using traces, logs, and event streams the way other engineers debug a function.
You write production-quality Python and TypeScript and you do not treat code quality as someone else's concern.
You have a precise and hard-won sense for when a failure originates in the model, in the product, or in the data, and you do not confuse them.
You hold yourself to a higher standard than anyone around you would require, because the alternative does not interest you.
Exceptional Candidates Will Also Bring
Experience with post-training pipelines, whether supervised fine-tuning, direct preference optimization, process reward modeling, reward modeling, or reinforcement learning-style rollout curation.
Prior work building evaluation programs specifically for tool-using or agentic systems.
Familiarity with trace-based debugging workflows, human review queues, or annotation infrastructure at scale.
Background in retrieval systems, enterprise connectors, or policy-governed automation.
Experience with LoRA-based model adaptation, model routing, or vLLM-backed inference systems.
A well-formed and clearly articulated conviction about how production evidence should translate into durable model improvement.
Interview Process
The first conversation will focus on the systems you have built and how you have improved model quality in practice. We want to hear about a real system, a real failure, a real evaluation you designed, and a real result you measured.
The technical deep dive will go into evaluation design, failure localization, and production feedback loops in detail.
The working session will place a realistic problem in front of you, one drawn from the actual platform you would be working on.
A final conversation with engineering and AI leadership will close the process.
We are not interested in what benchmarks you have memorized. We are interested in what you have built and what happened when it met production. Come prepared to walk through a real system, a real failure, and a real result.
Build the quality engine behind our AI platform
If you know how to turn live agent behavior into a disciplined improvement loop, we want to talk to you.