Contextual Architecture for Applying
Language Models in Cloud Infrastructure Management

How retrieval pipelines, contextual memory, and control-plane enforcement
determine safety and usefulness — not the choice of model.

Kristiyan H. Kolev · Veselin N. Kyurkchiev
Paisii Hilendarski University of Plovdiv · Dept. of Software Technologies

LLMs Do Not Know Your Infrastructure

Garbage In → Garbage Out
# User prompt without context
> Scale the staging API to 5 replicas

# LLM output (no infra context)
kubectl scale deploy api --replicas=5
  # wrong namespace
  # ignores RBAC
  # violates resource quota
  # no policy check
Context In → Safe Output
# Same prompt + infrastructure context
> Scale the staging API to 5 replicas

# LLM consults vector DB, schemas, RBAC
? Which staging? staging-eu, staging-us
  Your RBAC allows: staging-eu
  Quota check: OK (3/10 used)
  Generated CRD → OPA validated
~16%
GPT-4 accuracy on org data without grounding
data.world benchmark
20-30%
NL-to-IaC first-pass deployability
NL-to-IaC 2024
All models
RBAC violations even with rules in prompt
OrgAccess, Shan et al. 2025

Architecture Over Model

The architecture around the model — retrieval pipelines, contextual memory, clarification workflows, and control-plane enforcement — determines whether the system is safe and useful. Not which model you plug in.

Contextual Architecture: Four Layers

Grounding
Hybrid RAG injects org knowledge at query time — runbooks, IaC modules, topology data from vector DB
like feeding runbooks to the model
Knowledge
Multi-layer contextual memory — policies, schemas, naming conventions, operational history defined by platform engineers
like a living CMDB + policy repo
Interaction
Intent clarification against real resource state — namespaces, quotas, RBAC permissions
like pre-flight before tf apply
Enforcement
Kubernetes admission + policy-as-code (OPA, Kyverno) — deterministic, not probabilistic
admission controllers stop bad deploys

Model-agnostic: swap GPT-4 for Claude or Llama — the architecture stays the same. Defense-in-depth: each layer catches what the model misses.
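The Interaction and Enforcement layers can be sketched as one deterministic pre-flight check against cluster state, mirroring the staging-eu example above. This is a minimal illustration, not the paper's implementation: `ClusterState`, `preflight`, and the in-memory values are all assumptions standing in for real Kubernetes API and admission-policy calls.

```python
# Sketch of Interaction + Enforcement: a deterministic pre-flight check
# against (mocked) cluster state before any mutation is allowed.
# All names and the in-memory state below are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ClusterState:
    namespaces: list[str]   # e.g. ["staging-eu", "staging-us"]
    rbac_allowed: set[str]  # namespaces this user may mutate
    quota_limit: int        # replica quota per namespace
    quota_used: int


def preflight(ns: str, replicas: int, state: ClusterState) -> list[str]:
    """Return a list of violations; an empty list means the plan may proceed."""
    errors = []
    if ns not in state.namespaces:
        errors.append(f"unknown namespace {ns!r}: clarify intent first")
    if ns not in state.rbac_allowed:
        errors.append(f"RBAC denies mutations in {ns!r}")
    if state.quota_used + replicas > state.quota_limit:
        errors.append(
            f"quota exceeded: {state.quota_used}+{replicas} > {state.quota_limit}"
        )
    return errors


state = ClusterState(
    namespaces=["staging-eu", "staging-us"],
    rbac_allowed={"staging-eu"},
    quota_limit=10,
    quota_used=3,
)
print(preflight("staging-eu", 5, state))  # [] -> safe, like "Quota check: OK (3/10 used)"
print(preflight("staging-us", 5, state))  # RBAC violation, blocked deterministically
```

The check is ordinary code, not a model call: the same input always yields the same verdict, which is the point of putting enforcement in the control plane.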

The Agent Loop: From Intent to Kubernetes

NL Input (user intent) → MCP (query context) → Vector DB (retrieve infra) → LLM (generate plan) → Guardrails (OPA / Kyverno) → Tools (CRD / manifest) → K8s (deploy)
  • MCP queries real infrastructure context — namespaces, quotas, RBAC, schemas
  • Vector DB retrieves relevant runbooks, IaC modules, platform definitions
  • LLM generates with real infrastructure awareness, not hallucinated commands
  • Guardrails validate via OPA/Kyverno admission policy — deterministic check
  • Output is a CRD, not imperative kubectl — GitOps compatible, auditable
  • Human-in-the-loop approval before mutations remains the default
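The loop above can be sketched end to end. Everything here is an in-process stand-in chosen for illustration: `query_context` fakes an MCP call, `retrieve` fakes the vector DB, `generate_plan` fakes the LLM, and `guardrail` fakes the OPA/Kyverno admission check.

```python
# End-to-end sketch of the agent loop: NL intent -> context -> retrieval
# -> plan -> guardrail -> CRD. Everything is in-process and illustrative;
# a real system would call MCP servers, a vector DB, and OPA/Kyverno.

def query_context(intent: str) -> dict:
    # Stand-in for MCP: return real infrastructure facts for this request.
    return {"namespace": "staging-eu", "rbac_ok": True, "quota_free": 7}

def retrieve(intent: str) -> list[str]:
    # Stand-in for the vector DB: surface relevant runbooks / IaC modules.
    return ["runbook: scaling staging APIs", "module: api-deployment"]

def generate_plan(intent: str, ctx: dict, docs: list[str]) -> dict:
    # Stand-in for the LLM: emit a declarative CRD-shaped plan, not kubectl.
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "api", "namespace": ctx["namespace"]},
        "spec": {"replicas": 5},
    }

def guardrail(plan: dict, ctx: dict) -> bool:
    # Deterministic admission check (OPA/Kyverno analogue).
    return ctx["rbac_ok"] and plan["spec"]["replicas"] <= ctx["quota_free"]

def agent_loop(intent: str):
    ctx = query_context(intent)
    docs = retrieve(intent)
    plan = generate_plan(intent, ctx, docs)
    if not guardrail(plan, ctx):
        return None  # blocked before ever reaching the cluster
    return plan      # hand off to human approval, then GitOps apply

crd = agent_loop("Scale the staging API to 5 replicas")
```

Note that the loop's output is a declarative object (the CRD-shaped dict), so the approval and audit steps operate on data rather than on imperative commands.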

Converging Evidence: Infrastructure Context Wins

5.45% → 42.21% F1
Hybrid retrieval vs. pure vector search
InfoQ 2025
~16% → ~54% accuracy
KG-structured context on organizational data
data.world benchmark (SQL QA domain)
20-30% → 50-90%
NL-to-IaC deployability with iterative validation
DPIaC-Eval 2025
Context > Model
Context engineering yields larger gains than switching to bigger models
Context Engineering 2024
Compound Systems
Retrieval + tools + orchestration outperform model-only optimization
Databricks 2024
Architectural Risks
Vulnerabilities are wiring patterns, not model-specific bugs
LLM Vulnerabilities 2023

What's Missing

01
No Integrated Reference Architecture
Individual components are studied in isolation; their systematic composition for internal developer platforms is not.
02
Intent-vs-Resources Gap
Clarification workflows are disconnected from actual cluster state — namespaces, quotas, RBAC.
03
No Infrastructure-Specific Benchmarks
Evaluations borrow NLP metrics that miss operational cost and safety impact.
04
Observability Gap
Tracing agent tool calls, latency, and decision paths (OpenTelemetry + GenAI semantic conventions) is underexplored.

Key Takeaways

1
LLM failures in infrastructure are structural.
GPT-4, Claude, Gemini, Llama, Mistral — all fail on org-specific infra without the right architecture. Garbage in, garbage out.
2
Build the system, not just the prompt.
Vector DB + real infra schemas + platform definitions from platform engineers + deterministic guardrails > model selection.
3
The integration is the research gap.
Components are mature. Composing them for real DevOps/platform engineering workflows is the open problem.
Kristiyan H. Kolev · Veselin N. Kyurkchiev · CTA'2026
Thank you.