Human Judgment as a Specification

Posted on: 09 June 2026
Tags: #LargeLanguageModels, #Semantics, #Tools, #LinearTemporalLogic

The Intersection of GenAI and Formalism

The rapid proliferation of Generative AI within the realm of software engineering necessitates a corresponding surge in the application of formal methods. As AI systems are deployed with increasing autonomy, we must have a mechanism to verify that these "wild" systems are actually delivering the specific outcomes we intend.

To leverage the power of formal verification tools, the specification must be expressed mathematically. However, a significant skill gap exists:

The Dilemma: The average programmer possesses far more expertise in writing code than in constructing formal mathematical specifications.

The Translation Problem

The core challenge is the transition from informal descriptions (natural language prose) to formal specifications (mathematical logic).

We can represent this desired transformation as a function: $\text{FormalSpec} = f(\text{InformalProse})$

The "Incorrect" Approach to Specification

A tempting, yet flawed, solution is to utilize Large Language Models (LLMs) to perform this translation. While LLMs are surprisingly adept at generating syntax for various formal notations, this approach is fraught with danger.

The Minsky-Eisenberg Debate

The tension regarding this method is captured in the following exchange:

Ron Minsky (suggesting a lighthearted approach): "I wonder if a more plausible model is, you go to your large language model and say, ‘Please write me a specification for a function that sorts a list.’ And then it, like, spits something out. And then you look at it and think, yeah, that seems about right."

Richard Eisenberg (the critical counterpoint): "How can we be sure that the generated specification is the right one?"

Why "Looking Right" Isn't Enough

Simply glancing at a generated spec is ~~sufficient~~ insufficient. The risks include:

Risk Factor	Description	Impact
Subtle Errors	The spec looks correct but contains a logical flaw.	False sense of security.
Ambiguity	The prose was vague; the LLM picked the wrong interpretation.	System meets the wrong goal.
Misconceptions	The LLM inherits common myths about the language.	Fundamental logic errors.
Lack of Ground Truth	The requirement exists only in the user's mind.	No objective way to verify.

Visualizing the Failure Pipeline

Potential Failure Checklist

Is the specification ambiguous?
Does the LLM have a misconception about the LTL or Semantics?
Is there a clear "ground truth" for this requirement?
Are there subtle edge cases the LLM ignored?

Example: The "Sorting" Specification

If we asked an LLM to provide a formal specification for a sorting function, it might produce something like this in a pseudo-formal notation:

(* Hypothetical Formal Spec for Sort *)
let is_sorted (l: 'a list) = 
  match l with
  | [] | [_] -> true
  | x :: y :: tl -> x <= y && is_sorted (y :: tl)

let is_permutation (l1 l2: 'a list) = 
  (* Logic to ensure elements are the same *)
  ...

let spec_sort (input output: 'a list) = 
  is_sorted output && is_permutation input output

While the code above seems correct, the process of relying on an LLM to generate it without a rigorous verification framework is where the danger lies.