Abstract
The scale and complexity of modern cloud infrastructure have made Infrastructure-as-Code (IaC) essential for managing deployments. While large language models (LLMs) are increasingly being used to generate IaC configurations from natural language, user requests are often underspecified. Unlike traditional code generation, IaC configurations cannot be executed cheaply or repaired iteratively, forcing LLMs into an almost one-shot regime. We observe that ambiguity in IaC exhibits a tractable compositional structure: configurations decompose into three hierarchical axes (resources, topology, attributes), where higher-level decisions constrain lower-level ones. We propose a training-free, disagreement-driven framework that generates diverse candidate specifications, identifies structural disagreements across these axes, ranks them by informativeness, and produces targeted clarification questions that progressively narrow the configuration space. We introduce Ambig-IaC, a benchmark of 300 validated IaC tasks with ambiguous prompts, and an evaluation framework based on graph edit distance and embedding similarity. Our method outperforms the strongest baseline, achieving relative improvements of +18.4% and +25.4% on structure and attribute evaluations, respectively.
Ambiguity in Cloud IaC
An underspecified user request corresponds to many plausible cloud infrastructures, each representable as a resource dependency graph (left). When translating intent into Infrastructure-as-Code, ambiguity arises in the choice of resource abstractions and services (compute, networking, databases, etc.), in inter-resource topology, and in per-resource configuration attributes (right).
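To make the three ambiguity axes concrete, the sketch below encodes two plausible interpretations of the same request as small dependency graphs and checks which axes they differ on. The dict encoding and all resource names/types are illustrative assumptions, not the paper's actual data model.

```python
# Two plausible infrastructures for the same underspecified request
# ("a web app with a database"), encoded as resource dependency graphs.
# Resource names/types are illustrative, not taken from the paper.
candidate_a = {
    "nodes": {"web": "aws_instance", "db": "aws_db_instance"},       # EC2 + RDS
    "edges": [("web", "db")],
    "attrs": {"db": {"instance_class": "db.t3.micro"}},
}
candidate_b = {
    "nodes": {"web": "aws_lambda_function", "db": "aws_dynamodb_table"},  # serverless
    "edges": [("web", "db")],
    "attrs": {"db": {"billing_mode": "PAY_PER_REQUEST"}},
}

def differing_axes(a, b):
    """Report which of the three ambiguity axes the two candidates disagree on."""
    axes = []
    if a["nodes"] != b["nodes"]:
        axes.append("resource")    # different service/resource choices
    if set(a["edges"]) != set(b["edges"]):
        axes.append("topology")    # different inter-resource wiring
    if a["attrs"] != b["attrs"]:
        axes.append("attribute")   # different per-resource settings
    return axes
```

Here the two candidates share the same topology but disagree on the resource and attribute axes, which is exactly the kind of structured difference a clarification question can target.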
Method Overview
Overview of the iterative multi-level disambiguation process for interactive Infrastructure-as-Code synthesis. ➊ The user provides an initial IaC request. ➋ A pool of diverse structured specifications is generated from the current interpretations. ➌ Disagreements among candidates are computed symbolically. ➍ Disagreements are ranked by entropy to identify the most informative differences. ➎ Based on the top-ranked disagreement, the system generates a targeted clarification question. ➏ The user answers the question. ➐ Candidates inconsistent with the user's answer are pruned. ➑ If the refined pool is non-empty, it is fed back into the next iteration, repeating until the interaction budget is exhausted; otherwise, pool regeneration is triggered.
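The ranking and pruning steps (➌–➐) can be sketched as follows. The candidate encoding, slot names, and values below are illustrative assumptions, and the selection shown is plain entropy ranking without the paper's round-robin balancing.

```python
import math
from collections import Counter

# Hypothetical candidate specs: each maps (axis, key) slots to a chosen value.
# The three axes follow the paper: resource, topology, attribute.
candidates = [
    {("resource", "database"): "RDS", ("topology", "db_subnet"): "private",
     ("attribute", "db.instance_class"): "db.t3.micro"},
    {("resource", "database"): "DynamoDB", ("topology", "db_subnet"): "private",
     ("attribute", "db.instance_class"): "on-demand"},
    {("resource", "database"): "RDS", ("topology", "db_subnet"): "public",
     ("attribute", "db.instance_class"): "db.m5.large"},
]

def entropy(values):
    """Shannon entropy (bits) of the value distribution for one slot."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def rank_disagreements(pool):
    """Score every slot by the entropy of candidate choices; higher = more informative."""
    slots = {slot for cand in pool for slot in cand}
    scored = [(entropy([c.get(slot) for c in pool]), slot) for slot in slots]
    return sorted(scored, reverse=True)

def prune(pool, slot, answer):
    """Keep only candidates consistent with the user's answer for the asked slot."""
    return [c for c in pool if c.get(slot) == answer]

top_score, top_slot = rank_disagreements(candidates)[0]
# Suppose the user answers "RDS" to a question about the database resource.
pool = prune(candidates, ("resource", "database"), "RDS")
```

In this toy pool the attribute slot has three distinct values and therefore the highest entropy, while answering the database question eliminates one of the three candidates.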
Key Contributions:
- We present the first investigation of interactive cloud IaC generation under ambiguous user requests, where an agent explicitly reasons about the structure of uncertainty and resolves it through multi-turn clarification dialogue.
- We propose a training-free, disagreement-driven framework that decomposes IaC ambiguity into three hierarchical axes (resource, topology, and attribute), generates diverse candidate specifications, and uses structural disagreements—ranked by entropy with balanced cross-dimension selection—to produce targeted clarification questions.
- We introduce Ambig-IaC, a curated benchmark of 300 IaC tasks with validated reference configurations and LLM-generated ambiguous prompts, along with an evaluation framework measuring configuration correctness via graph edit distance and embedding similarity.
- Our method achieves 54.85% structure and 45.72% attribute correctness at 15 rounds, improving over the strongest baseline by +8.53 and +9.25 points (+18.4% and +25.4% relative).
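The structure metric in the evaluation framework can be approximated with off-the-shelf graph edit distance, as in the sketch below. The toy graphs and the normalization into a [0, 1] score are illustrative assumptions, not necessarily the paper's exact formula.

```python
import networkx as nx

def build_graph(resources, edges):
    """Resource dependency graph: nodes are typed resources, edges are dependencies."""
    g = nx.DiGraph()
    for name, rtype in resources.items():
        g.add_node(name, rtype=rtype)
    g.add_edges_from(edges)
    return g

# Reference vs. generated configuration (toy example; names are hypothetical).
ref = build_graph({"web": "aws_instance", "db": "aws_db_instance"}, [("web", "db")])
gen = build_graph({"web": "aws_instance", "db": "aws_dynamodb_table"}, [("web", "db")])

def structure_score(g1, g2):
    """Normalize GED into [0, 1]; 1.0 means structurally identical (assumed scheme)."""
    ged = nx.graph_edit_distance(
        g1, g2, node_match=lambda a, b: a["rtype"] == b["rtype"]
    )
    worst = (g1.number_of_nodes() + g1.number_of_edges()
             + g2.number_of_nodes() + g2.number_of_edges())
    return 1.0 - ged / worst

score = structure_score(ref, gen)
```

Here the only edit is substituting the mismatched database node type (GED = 1 out of a worst case of 6), so the score is 5/6; attribute similarity would be measured separately, e.g. via embeddings.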
Results
Main Results on Ambig-IaC (GPT-4o-mini)
| Method | #Rounds | Struct. (%) | Δ Struct. vs Ours | Attr. (%) | Δ Attr. vs Ours |
|---|---|---|---|---|---|
| Direct Q Generation | 5 | 43.08 | -5.87 (↓11.99%) | 35.05 | -4.68 (↓11.78%) |
| Best-of-N | 5 | 41.28 | -7.67 (↓15.67%) | 33.13 | -6.60 (↓16.61%) |
| Self-Consistency | 5 | 38.59 | -10.36 (↓21.16%) | 31.12 | -8.61 (↓21.67%) |
| Ours | 5 | 48.95 | -- | 39.73 | -- |
| Direct Q Generation | 10 | 44.38 | -6.61 (↓12.96%) | 35.39 | -7.32 (↓17.14%) |
| Best-of-N | 10 | 43.52 | -7.47 (↓14.65%) | 35.81 | -6.90 (↓16.16%) |
| Self-Consistency | 10 | 38.41 | -12.58 (↓24.67%) | 31.54 | -11.17 (↓26.15%) |
| Ours | 10 | 50.99 | -- | 42.71 | -- |
| Direct Q Generation | 15 | 46.32 | -8.53 (↓15.53%) | 36.47 | -9.25 (↓20.23%) |
| Best-of-N | 15 | 43.12 | -11.73 (↓21.39%) | 35.10 | -10.62 (↓23.23%) |
| Self-Consistency | 15 | 38.19 | -16.66 (↓30.37%) | 31.43 | -14.29 (↓31.26%) |
| Ours | 15 | 54.85 | -- | 45.72 | -- |
Results Across Backbone LLMs (K=5)
| Model | Method | Struct. (%) | Attr. (%) |
|---|---|---|---|
| GPT-4o-mini | Direct Q Generation | 43.08 | 35.05 |
| GPT-4o-mini | Best-of-N | 41.28 | 33.13 |
| GPT-4o-mini | Self-Consistency | 38.59 | 31.12 |
| GPT-4o-mini | Ours | 48.95 | 39.73 |
| GPT-4.1-mini | Direct Q Generation | 50.80 | 39.83 |
| GPT-4.1-mini | Best-of-N | 53.06 | 41.48 |
| GPT-4.1-mini | Self-Consistency | 52.58 | 40.66 |
| GPT-4.1-mini | Ours | 57.40 | 45.10 |
Ablation: Round-Robin Balancing
| Method | #Rounds | Struct. (%) | Attr. (%) |
|---|---|---|---|
| Ours w/o RR | 5 | 47.18 | 37.83 |
| Ours | 5 | 48.95 | 39.73 |
| Ours w/o RR | 10 | 49.74 | 40.75 |
| Ours | 10 | 50.99 | 42.71 |
| Ours w/o RR | 15 | 51.65 | 42.14 |
| Ours | 15 | 54.85 | 45.72 |
Per-Round Dynamics
Regeneration Analysis
Left: Disagreement counts across all three dimensions decrease steadily over rounds, confirming that each clarification round effectively resolves uncertainty. Resource-level disagreements are resolved fastest, followed by topology and then attributes, consistent with the hierarchical structure of IaC ambiguity.
Right: Most tasks require 3–5 regenerations, indicating that the candidate pool is typically exhausted within 2–3 questions. Higher regeneration counts correlate with higher structure and attribute scores, suggesting that conditioned regeneration is an effective mechanism for progressive refinement.
Conclusion
We presented a multi-level disambiguation framework for interactive IaC generation that leverages the hierarchical structure of cloud configurations to guide multi-turn clarification, along with Ambig-IaC, a 300-task benchmark of validated IaC tasks with ambiguous prompts. Our method consistently outperforms structure-agnostic baselines, with gains that scale with the interaction budget and generalize across models.
BibTeX
@misc{yang2026ambigiacmultileveldisambiguationinteractive,
  title={Ambig-IaC: Multi-level Disambiguation for Interactive Cloud Infrastructure-as-Code Synthesis},
  author={Zhenning Yang and Kaden Gruizenga and Tongyuan Miao and Patrick Tser Jern Kon and Hui Guan and Ang Chen},
  year={2026},
  eprint={2604.02382},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2604.02382},
}