In a recent paper on what we know (and don’t) about large language models (LLMs) like ChatGPT, New York University’s Sam Bowman states that “as of early 2023, there is no technique that would allow us to lay out in any satisfactory way what kinds of knowledge, reasoning, or goals a model is using when it produces some output.” That is, we are unable to predict, based on our input, the output of an LLM, because the inner workings, or black box, of artificial intelligence is not completely understood.

This presents a daunting challenge for any organization that seek to leverage the power of AI decision tools—like the US military. As a soldier put it during a touchpoint session I was involved in, If I bring a recommended course of action to my commander and tell him it’s recommended because the computer said so, I’m gonna get punched in the face. Physical danger to the messenger aside, if we do not understand the rationale behind a decision, accepting the risk is a hard ask for a commander.

Without knowing what’s inside the black box, we can envision its contents as almost anything. A comically complex Rube Goldberg machine? A nest of wires? A series of tubes? Surely, we might reassure ourselves, it would make sense to us if we simply took the time to sort it all out. That same thought occurred to French mathematician Pierre-Simon Laplace two hundred years ago, when the physical mechanics of the universe often presented as puzzling to people of the time as artificial intelligence outputs are to us today.

In his Philosophical Essay on Probabilities, Laplace took up the position that the reason we couldn’t accurately predict the future came down to a lack of data. He proposed, as a thought experiment, a being that could know the position, vector, and momentum of every particle in the universe. Such a being (later dubbed by others “Laplace’s demon”) could, in theory, be able to predict any past or future state by simply winding or unwinding the present state according to the classical laws of mechanics.

One future state Laplace’s causally deterministic demon did not predict was that of physics, as his theory was undone first by thermodynamics, which belies the notion that reversibility to past states is always possible, and then by quantum theory, where indeterminacy at the quantum level prevents us from predicting the future with any certainty (excepting here the “many-worlds” interpretation of quantum mechanics, in which every possible state exists, somewhere—a messy reconciliation of deterministic and probabilistic models, both creator and fixer of superhero franchise plot holes).

Early successes of AI—such as expert systems in the 1980s and early 1990s, including IBM’s first version of its chess-playing Deep Blue—were, like Laplace’s model, deterministic. Theory outpaced the available computational power, and the success we’re seeing now is largely the result of newly available hardware capable of quicker and more complex operations.

To draw statistically useful conclusions about a population from a sample, one needs an appropriate amount of data. This is why early attempts at AI were rule-based. To use the example of chess, we didn’t have the infrastructure to consume thousands or millions of chess games to draw inferences from them; therefore, the focus was on a thorough knowledge of the best games. The player with the white pieces starts a game with e4 and the response is e5.

With respect to language—the key element of LLMs, of course—the same evolution from rules-based, labor-intensive processes to something substantially more sophisticated has taken place. The concept and theories of computational linguistics have been around for a while, but when Miles Hanley compiled a word index for James Joyce’s novel Ulysses, or George Zipf conducted analyses for Human Behavior and the Principle of Least Effort, they did not have access to computers or the vast stores of information on the internet that we have access to today.

By contrast, an LLM is dependent on deriving relationships between words—ChatGPT, for example, is trained on a corpus of three hundred billion words. The resulting complexity is due not only to the size of the corpus, but the number of parameters describing the relationships between the words. Parameters are defined by probabilities, called weights, indicating the strength of the relationship. The probabilistic design makes an LLM able to take on tasks that it has not been explicitly designed for, but does leave it more susceptible to errors of fact or judgment because it is, in effect, making educated guesses.

Beyond simply knowing all the data (nodes) and the relationships (edges), querying an LLM requires understanding the impact of how we ask for a response, because even implicit goals may affect the output. There is a problem called instrumental convergence, where actions are driven not by an overarching goal (such as making the world a better or more livable place), but an instrument or supporting goal that may be more direct or achievable. One example of this, formulated by Swedish philosopher Nick Bostrom, is the paper clip problem. An intelligence designed to make paper clips (presumably a subgoal of making the world better—we could always use more paper clips) maximizes that goal above any others and kills all humans because (a) humans could switch off the intelligence, limiting the number of paper clips it can make, or (b) because the humans’ atoms could be used to make more paper clips.

An example of instrumental convergence that we have already seen in the era of LLMs is AI hallucinations—that is, the tendency for an LLM to create plausible-sounding responses that are not true (nor are they sourced from a deliberate fabrication). For example, if citations are requested to support a response, ChatGPT will comply, but while some works cited may both be accurate in supporting the content and also exist in the real world, others may not support the content or not even exist—even when they cite real authors for these sources. OpenAI, creator of ChatGPT, addresses this as an “alignment” problem the company is currently using reward modeling to address. In this case, the agent doesn’t understand the goal of research, but it knows what a citation looks like.

Returning to the issue this presents to organizations intent on employing AI decision tools, the black box of AI makes trust, and therefore adoption, more of a challenge. Thomas Sheridan’s research on features that enable a human’s trust of a system identified seven trust-causation factors. Of these, two are especially relevant here: reliability, which is hard to gauge when the same input may elicit different outputs, and understandability, where the user must be able to develop a mental model to predict outputs.

With respect to an AI decision aide in the tactical environment, our inability to see inside the black box means that we cannot validate that mission goals were properly interpreted and that assumptions were uncovered and documented. However, there are mitigations to this risk—for example, the ability to query a plan to build understanding, or by making the planning iterative and interactive, with human and machine working shoulder to shoulder as a team.

The black box makes us feel outmatched by AI. It knows things we don’t. If it’s not smarter than us now, it will be soon. There is, however, at least some comfort in knowing that we’ve been here before. We tried and failed to comprehensively reconcile the physical universe and now we hardly give it a second thought when we calculate the trajectory of a mortar or the fuel needs for a mission. We will find ways to cast light upon the shadows until we can confidently describe any unknowable variation as an error and limit it to an acceptable level. Until then, military leaders’ focus must be on developing and refining the risk-mitigation measures that will be necessary as AI decision tools are introduced.

Thom Hawkins is a project officer for artificial intelligence and data strategy with US Army Project Manager Mission Command. Mr. Hawkins specializes in AI-enabling infrastructure and adoption of AI-driven decision aids.

The views expressed are those of the author and do not reflect the official position of the United States Military Academy, Department of the Army, or Department of Defense.

Image credit: Tech. Sgt. Amy Picard, US Air Force