Contextual Architecture for Applying
Language Models in Cloud Infrastructure Management

How retrieval pipelines, contextual memory, and control-plane enforcement
determine safety and usefulness — not the choice of model.

Kristiyan H. Kolev · Veselin N. Kyurkchiev
Paisii Hilendarski University of Plovdiv · Dept. of Software Technologies

LLMs Do Not Know Your Infrastructure

Garbage In → Garbage Out
# User prompt without context
> Scale the staging API to 5 replicas

# LLM output (no infra context)
kubectl scale deploy api --replicas=5
  # wrong namespace
  # ignores RBAC
  # violates resource quota
  # no policy check
Context In → Safe Output
# Same prompt + infrastructure context
> Scale the staging API to 5 replicas

# LLM consults vector DB, schemas, RBAC
? Which staging? staging-eu, staging-us
  Your RBAC allows: staging-eu
  Quota check: OK (3/10 used)
  Generated CRD → OPA validated
~16%
GPT-4 accuracy on org data without grounding
data.world benchmark
20-30%
NL-to-IaC first-pass deployability
NL-to-IaC 2024
All models
RBAC violations even with rules in prompt
OrgAccess, Shan et al. 2025

Architecture Over Model

The architecture around the model — retrieval pipelines, contextual memory, clarification workflows, and control-plane enforcement — determines whether the system is safe and useful. Not which model you plug in.

Contextual Architecture: Four Layers

Grounding
Hybrid RAG injects org knowledge at query time — runbooks, IaC modules, topology data from vector DB
like feeding runbooks to the model
Knowledge
Multi-layer contextual memory — policies, schemas, naming conventions, operational history defined by platform engineers
like a living CMDB + policy repo
Interaction
Intent clarification against real resource state — namespaces, quotas, RBAC permissions
like pre-flight before tf apply
Enforcement
Kubernetes admission + policy-as-code (OPA, Kyverno) — deterministic, not probabilistic
admission controllers stop bad deploys

Model-agnostic: swap GPT-4 for Claude or Llama — the architecture stays the same. Defense-in-depth: each layer catches what the model misses.
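The Interaction and Enforcement layers can be sketched as one deterministic pre-flight check against cluster state, mirroring the staging-eu example above. This is a minimal illustration, not the paper's implementation: `ClusterState`, `preflight`, and the in-memory values are all assumptions standing in for real Kubernetes API and admission-policy calls.

```python
# Sketch of Interaction + Enforcement: a deterministic pre-flight check
# against (mocked) cluster state before any mutation is allowed.
# All names and the in-memory state below are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class ClusterState:
    namespaces: list[str]   # e.g. ["staging-eu", "staging-us"]
    rbac_allowed: set[str]  # namespaces this user may mutate
    quota_limit: int        # replica quota per namespace
    quota_used: int


def preflight(ns: str, replicas: int, state: ClusterState) -> list[str]:
    """Return a list of violations; an empty list means the plan may proceed."""
    errors = []
    if ns not in state.namespaces:
        errors.append(f"unknown namespace {ns!r}: clarify intent first")
    if ns not in state.rbac_allowed:
        errors.append(f"RBAC denies mutations in {ns!r}")
    if state.quota_used + replicas > state.quota_limit:
        errors.append(
            f"quota exceeded: {state.quota_used}+{replicas} > {state.quota_limit}"
        )
    return errors


state = ClusterState(
    namespaces=["staging-eu", "staging-us"],
    rbac_allowed={"staging-eu"},
    quota_limit=10,
    quota_used=3,
)
print(preflight("staging-eu", 5, state))  # [] -> safe, like "Quota check: OK (3/10 used)"
print(preflight("staging-us", 5, state))  # RBAC violation, blocked deterministically
```

The check is ordinary code, not a model call: the same input always yields the same verdict, which is the point of putting enforcement in the control plane.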

The Agent Loop: From Intent to Kubernetes

NL Input (user intent) → MCP (query context) → Vector DB (retrieve infra) → LLM (generate plan) → Guardrails (OPA / Kyverno) → Tools (CRD / manifest) → K8s (deploy)
  • MCP queries real infrastructure context — namespaces, quotas, RBAC, schemas
  • Vector DB retrieves relevant runbooks, IaC modules, platform definitions
  • LLM generates with real infrastructure awareness, not hallucinated commands
  • Guardrails validate via OPA/Kyverno admission policy — deterministic check
  • Output is a CRD, not imperative kubectl — GitOps compatible, auditable
  • Human-in-the-loop approval before mutations remains the default
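The loop above can be sketched end to end. Everything here is an in-process stand-in chosen for illustration: `query_context` fakes an MCP call, `retrieve` fakes the vector DB, `generate_plan` fakes the LLM, and `guardrail` fakes the OPA/Kyverno admission check.

```python
# End-to-end sketch of the agent loop: NL intent -> context -> retrieval
# -> plan -> guardrail -> CRD. Everything is in-process and illustrative;
# a real system would call MCP servers, a vector DB, and OPA/Kyverno.

def query_context(intent: str) -> dict:
    # Stand-in for MCP: return real infrastructure facts for this request.
    return {"namespace": "staging-eu", "rbac_ok": True, "quota_free": 7}

def retrieve(intent: str) -> list[str]:
    # Stand-in for the vector DB: surface relevant runbooks / IaC modules.
    return ["runbook: scaling staging APIs", "module: api-deployment"]

def generate_plan(intent: str, ctx: dict, docs: list[str]) -> dict:
    # Stand-in for the LLM: emit a declarative CRD-shaped plan, not kubectl.
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "api", "namespace": ctx["namespace"]},
        "spec": {"replicas": 5},
    }

def guardrail(plan: dict, ctx: dict) -> bool:
    # Deterministic admission check (OPA/Kyverno analogue).
    return ctx["rbac_ok"] and plan["spec"]["replicas"] <= ctx["quota_free"]

def agent_loop(intent: str):
    ctx = query_context(intent)
    docs = retrieve(intent)
    plan = generate_plan(intent, ctx, docs)
    if not guardrail(plan, ctx):
        return None  # blocked before ever reaching the cluster
    return plan      # hand off to human approval, then GitOps apply

crd = agent_loop("Scale the staging API to 5 replicas")
```

Note that the loop's output is a declarative object (the CRD-shaped dict), so the approval and audit steps operate on data rather than on imperative commands.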

Converging Evidence: Infrastructure Context Wins

5.45% → 42.21% F1
Hybrid retrieval vs. pure vector search
InfoQ 2025
~16% → ~54% accuracy
KG-structured context on organizational data
data.world benchmark (SQL QA domain)
20-30% → 50-90%
NL-to-IaC deployability with iterative validation
DPIaC-Eval 2025
Context > Model
Context engineering yields larger gains than switching to bigger models
Context Engineering 2024
Compound Systems
Retrieval + tools + orchestration outperform model-only optimization
Databricks 2024
Architectural Risks
Vulnerabilities are wiring patterns, not model-specific bugs
LLM Vulnerabilities 2023

What's Missing

01
No Integrated Reference Architecture
Individual components are studied in isolation; their systematic composition for internal developer platforms is not.
02
Intent-vs-Resources Gap
Clarification workflows are disconnected from actual cluster state — namespaces, quotas, RBAC.
03
No Infrastructure-Specific Benchmarks
Evaluations borrow NLP metrics that miss operational cost and safety impact.
04
Observability Gap
Tracing agent tool calls, latency, and decision paths (OpenTelemetry + GenAI semantic conventions) is underexplored.

Key Takeaways

1
LLM failures in infrastructure are structural.
GPT-4, Claude, Gemini, Llama, Mistral — all fail on org-specific infra without the right architecture. Garbage in, garbage out.
2
Build the system, not just the prompt.
Vector DB + real infra schemas + platform definitions from platform engineers + deterministic guardrails > model selection.
3
The integration is the research gap.
Components are mature. Composing them for real DevOps/platform engineering workflows is the open problem.
Kristiyan H. Kolev · Veselin N. Kyurkchiev · CTA'2026
Thank you.