As artificial intelligence systems are integrated into military operations, a familiar intuition hardens into an institutional standard: The higher the stakes, the more essential it is to keep humans in the loop. In matters of life and death, machines must not be left to decide on their own.

That intuition is understandable. It is also, in important respects, wrong.

In lower-stakes environments—traffic management, service delivery, even routine policing—human oversight can sometimes function as a backstop. Errors are visible, decisions can be revisited, and the costs of delay are tolerable. In crisis response, that backstop weakens: decisions must be made quickly, information is incomplete, and the consequences of hesitation grow more severe. Under these conditions, late-stage human intervention becomes less reliable, not more.

In military contexts, where these dynamics are most consequential, late-stage human-in-the-loop overrides are, in fact, the least reliable and least effective way to correct errors that originate in the algorithmic system. In military engagement, errors can be lethal. Time is compressed, uncertainty is pervasive, and decisions are often irreversible. Understandably, the conventional wisdom holds that it is precisely here that the case for human-in-the-loop control is strongest. The assumption is that human judgment—especially in identifying targets and avoiding civilian harm—is inherently superior to algorithmic decision-making. That conventional wisdom does not hold up under scrutiny.

Consider a simple but revealing example—what we might call the white van problem. In a combat zone, intelligence has associated a reported threat with a white van. Other information is scant or unsubstantiated. For soldiers on the ground or drones in the sky, any white van may either be entirely benign—or it may be carrying combatants or explosives. The fundamental challenge, then, is how to act under conditions where signals are weak, context-dependent, and consequential.

When a white van is spotted, that signal, in isolation, is weak. Operators must rely on contextual cues: movement patterns, timing, proximity to known threats, and behavior that deviates from local norms. The standard argument is that such contextual judgment cannot be codified and must remain with human decision-makers in the field.

But this argument contains a critical tension. If those cues are sufficiently systematic to guide human judgment, they can, in principle, be incorporated into a model. Moreover, if the alternative is to depend on intuition shaped by stress, fatigue, or incomplete perception, then human override power is not a reliable basis for life-and-death decisions.
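To see what codifying such cues might look like, consider a minimal sketch. The cue names, weights, and scoring logic below are hypothetical placeholders chosen only for illustration, not a description of any fielded system.

```python
from dataclasses import dataclass

@dataclass
class VanObservation:
    """Illustrative contextual cues an operator might weigh informally."""
    deviates_from_local_traffic: bool   # unusual route, speed, or timing
    near_known_threat_location: bool    # proximity to reported threat activity
    matches_reported_description: bool  # e.g., the white van named in the report
    civilian_indicators: bool           # market hours, visible families, etc.

# Hypothetical weights: in a real system these would be estimated from
# operational data and reviewed against doctrine, not hand-picked.
CUE_WEIGHTS = {
    "deviates_from_local_traffic": 0.30,
    "near_known_threat_location": 0.35,
    "matches_reported_description": 0.15,
    "civilian_indicators": -0.40,
}

def threat_score(obs: VanObservation) -> float:
    """Combine the cues into a single score between roughly -0.4 and 0.8."""
    return sum(weight for cue, weight in CUE_WEIGHTS.items() if getattr(obs, cue))
```

The point is not that a handful of boolean cues would suffice. It is that any cue systematic enough to articulate and train operators on is, in principle, systematic enough to encode, test, and audit.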

A common objection here is that combat scenarios are too novel and variable—that the uniqueness of each engagement makes it impossible to build a sufficiently large or representative dataset to train a reliable model. This concern deserves to be taken seriously. But it cuts both ways. If a situation is so novel that a model trained on extensive operational data cannot process it reliably, why would we expect a single human operator, under extreme time pressure and cognitive load, to fare better? The novelty objection, pressed to its logical conclusion, is not an argument for human override; it is an argument for the kind of systematic operating standards that a well-designed system can embody more consistently than any individual under stress. In fact, it is precisely in novel and ambiguous situations that predetermined policy and carefully calibrated decision rules matter most—because those are the moments when individual intuition is most likely to falter.

The white van problem reveals the structure of the underlying issue. Targeting decisions are shaped by a trade-off between two types of error: failing to identify a real threat, and incorrectly identifying a threat where none exists. In statistical terms, these are false negatives and false positives (Type II and Type I errors, respectively); in operational terms, the trade-off is between missed threats and unintended harm.

There is no way to eliminate both risks simultaneously. Any system—human or algorithmic—must decide how to balance them.
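A short sketch makes the trade-off concrete. Assume, purely for illustration, that each observed van receives a threat score and that engagement is recommended above some threshold; the scores below are synthetic and the specific numbers are arbitrary.

```python
import random

random.seed(0)

# Synthetic threat scores: benign vans cluster low, hostile vans cluster high,
# but the two populations overlap -- which is exactly why the trade-off exists.
benign = [random.gauss(0.2, 0.15) for _ in range(1000)]
hostile = [random.gauss(0.6, 0.15) for _ in range(1000)]

for threshold in (0.3, 0.4, 0.5, 0.6):
    missed_threats = sum(s < threshold for s in hostile) / len(hostile)
    false_alarms = sum(s >= threshold for s in benign) / len(benign)
    print(f"threshold={threshold:.1f}  "
          f"missed threats={missed_threats:.0%}  "
          f"false alarms={false_alarms:.0%}")
```

Raising the threshold reduces false alarms but increases missed threats; lowering it does the opposite. No setting eliminates both, and selecting one is a calibration decision made before any van ever appears on a sensor feed.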

The balance between these errors is not a situational judgment to be made in the field. It must be a prior decision that reflects doctrine, rules of engagement, and legal and political accountability. How much risk of a missed threat is acceptable? How much risk of civilian harm is tolerable? These are not questions that can be answered consistently by individual operators under time pressure. One may question whether every operational scenario is truly so novel that no prior calibration can account for it. But even granting a high degree of novelty, the objection rests on the assumption that the right priorities are better embodied in a single individual in the moment than in a system designed to encode doctrine, rules of engagement, and ethical constraints in a consistent and disciplined way. That assumption should be questioned, not treated as self-evident.

They are design choices.

In high-stakes environments, the primary locus of control is not at the moment of decision, but at the moment of design. A well-constructed system implements a consistent balance between competing risks. A poorly constructed one cannot be rescued reliably by real-time human oversight.

This does not imply that military AI systems should operate without human involvement. Rather, it reframes where and how human judgment is exercised—not at one level or the other, but at multiple levels, with different functions at each. Upstream, human judgment operates at the system level: setting parameters, calibrating trade-offs, and applying the doctrinal and ethical constraints within which a system will function. This is arguably the most consequential point of human intervention, because it is where the fundamental decisions about acceptable risk are made.

But human involvement does not end at the design phase. At the tactical level—what might be called the tip of the spear—humans retain a critical role, though not the one traditionally imagined. Rather than serving as a last-minute override on individual targeting decisions, operators should function as vigilant monitors of system performance, working with assistive technology to detect whether something is going wrong systemically and step in when it is. Are patterns of engagement drifting from expected parameters? Is the system producing anomalous outputs that suggest a failure mode? This kind of real-time monitoring—identifying systemic breakdown rather than second-guessing individual calls under fire—is both more realistic and more valuable than the traditional model of human-in-the-loop approval.
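A minimal sketch of what that monitoring function might involve, with window sizes and tolerances invented purely for illustration: track a rolling window of system outputs and flag when they drift outside the band established during test and evaluation.

```python
from collections import deque

class EngagementMonitor:
    """Flags systemic drift in a targeting system's outputs.

    Illustrative only: the expected rate and tolerance would come from
    test-and-evaluation baselines, not from values hard-coded here.
    """

    def __init__(self, expected_positive_rate: float = 0.05,
                 tolerance: float = 0.03, window: int = 200):
        self.expected = expected_positive_rate
        self.tolerance = tolerance
        self.recent = deque(maxlen=window)

    def record(self, flagged_as_threat: bool) -> None:
        """Log one system output as it arrives."""
        self.recent.append(flagged_as_threat)

    def drifting(self) -> bool:
        """True when the recent positive rate leaves the expected band."""
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough observations yet to judge
        rate = sum(self.recent) / len(self.recent)
        return abs(rate - self.expected) > self.tolerance
```

The operator's question shifts from whether to override a particular engagement to whether the system is still behaving the way it was validated to behave, a judgment humans can realistically make under pressure.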

The two forms of oversight are complementary, not competing. Design-level control and operational monitoring coexist, each addressing a different dimension of the problem. What this framework does reject is the specific idea that a human operator, under extreme time pressure and with no more information than the system itself possesses, can reliably override individual algorithmic decisions as the primary safeguard against error.

Attempts to resolve these questions through real-time override are structurally fragile. In fast-moving environments, human operators face the same informational constraints as the system, but with far less capacity to process data at scale. They are also subject to cognitive bias, stress, and variability across individuals and units. The idea that human-in-the-loop control can reliably correct system-level deficiencies under these conditions is, at best, optimistic.

This reframing also carries implications for accountability—a question that any serious framework for military AI must confront directly. One reason traditionally cited for keeping humans at the point of individual use-of-force decisions is that it preserves a clear chain of responsibility: a commander who makes a decision to bomb a building that turns out to be a school can face consequences. If an AI system makes that determination, the question becomes harder. Can a procurement officer who signed off on purchasing the system be held responsible? An engineer who designed its targeting algorithm? A senior leader who authorized its deployment?

The answer is that accountability extends to everyone involved in the chain, but responsibility does not mean the same thing at every link. This is not, in fact, a novel challenge for the military. The armed forces already distinguish between mistake, negligence, and intentional wrongdoing. A soldier who bombs a white van carrying civilians because of a genuine misreading of ambiguous signals faces different consequences than one who acts knowingly or recklessly. The same graduated framework of accountability applies naturally to those who design, procure, test, and deploy AI systems. An engineer who built a targeting algorithm in good faith, following established protocols, bears a different kind of responsibility than one who cut corners or ignored known failure modes. A procurement officer who failed to conduct adequate due diligence is in a different position than one who followed rigorous evaluation procedures. The legal and institutional frameworks for making these distinctions already exist within military justice and the law of armed conflict. What is needed is their deliberate extension to encompass the new roles and decision points that AI systems introduce into the chain of command.

AI-enabled systems are already embedded in military operations—supporting intelligence analysis, target identification, logistics, and increasingly autonomous platforms. These systems will continue to expand in capability and scope. The question is not whether they will be used, but how they are designed and employed.

This trajectory is now being reinforced at the institutional level. The Department of Defense has begun rolling out AI systems for broader operational use, with the expectation that they will augment decision-making widely rather than remain confined to narrow experimental applications. This shift reflects a recognition that AI will become part of the everyday infrastructure of military operations. As that integration accelerates, the central challenge is not how to preserve human involvement at every point of decision, but how to ensure that these systems are designed from the outset to operate within clearly defined doctrinal and ethical constraints.

The nature of military operations is itself evolving. Much of the future of warfare will be shaped at a computational level—in the design of algorithms, the calibration of decision systems, and the architecture of autonomous platforms. Military institutions that recognize this shift will be better positioned to wage war effectively and responsibly.

A large share of the debate about military deployment of AI focuses on autonomy—whether machines should act without human approval. A more productive framing shifts attention from autonomy to architecture. The issue is not whether a human is formally in the loop, but whether the system has been designed with guardrails based on operational realities and ethical constraints.

The existing policy landscape already reflects this tension. DoD Directive 3000.09, which governs autonomy in weapon systems, does not prohibit autonomous weapons or require a human in the loop for every use-of-force decision. What it does require is that autonomous and semiautonomous systems be designed to allow commanders and operators to exercise “appropriate levels of human judgment over the use of force.” The directive’s language is deliberately flexible—it does not specify where or how that judgment is exercised. But in practice, it has reinforced an institutional conservatism that defaults to human override at the point of action, where the argument for it is weakest. The framework advanced here is not in tension with the directive; it takes its core requirement seriously by asking where human judgment is most appropriately and effectively exercised, rather than assuming the answer is always at the moment of engagement.

The US joint force has multigenerational experience managing lethal risk under uncertainty. Military doctrine, rules of engagement, and the law of armed conflict are mechanisms for determining, in advance, how competing risks are balanced. AI systems do not replace that process; they formalize it.

The paradox is that the highest-stakes environments are precisely those in which human-in-the-loop oversight is least reliable as a primary safeguard. Trust will not be built through the promise of last-minute intervention, but through demonstrable evidence that systems have been designed to make the right trade-offs from the outset—and that clear lines of accountability exist when they fail.

In war, as in crisis response, the most important decisions are made before the system is ever deployed.

Michael A. Santoro is professor of management and entrepreneurship at Santa Clara University, where he writes on AI governance, public institutions, and ethical decision-making. His work focuses on how emerging technologies reshape institutional responsibility and high-stakes decision-making.

Author’s note: I am grateful to Isaac Nikssarian for helpful comments on an earlier draft.

The views expressed are those of the author and do not reflect the official position of the United States Military Academy, Department of the Army, or Department of Defense.

Image credit: Master Sgt. Whitney Hughes, US Army