Prompt Engine Strategy

This document describes the current direction for ArchSpine's prompt assembly layer.

The goal is not "better prompt wording". The goal is to maximize two things:

  • performance ceiling
  • result quality

ArchSpine should behave like a prompt system, not a collection of prompt strings.

Why this exists

Naive prompt construction breaks down quickly in real repositories:

  • too much context gets shoved into one request
  • low-value imports crowd out high-value evidence
  • rule-heavy audit tasks and summary-heavy synthesis tasks compete for the same budget
  • output quality becomes unstable as file size and dependency fan-out increase

The prompt engine exists to solve exactly that.

Current model

The current implementation has four layers.

1. Context retrieval

ContextEngine resolves local imports, symbol references, and known dependency semantics.

Current context output is organized into:

  • Import Inventory
  • Known Internal Dependency Semantics
  • Resolved Symbol References

This gives the LLM evidence-backed context instead of raw file sprawl.
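The three context sections above can be pictured as one typed payload. This is a hypothetical sketch; the field names are illustrative, not the actual ContextEngine API:

```typescript
// Hypothetical shape of ContextEngine output; field names are illustrative,
// not ArchSpine's actual types.
interface ResolvedContext {
  importInventory: string[];                    // resolved local import paths
  dependencySemantics: Record<string, string>;  // path -> known semantic doc
  symbolReferences: { symbol: string; definedIn: string }[];
}

// Compact one-line summary, useful for diagnostics and logging.
function summarizeContext(ctx: ResolvedContext): string {
  return [
    `imports: ${ctx.importInventory.length}`,
    `semantic docs: ${Object.keys(ctx.dependencySemantics).length}`,
    `symbols: ${ctx.symbolReferences.length}`,
  ].join(", ");
}
```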

2. Prompt context orchestration

src/infra/prompt-context.ts is the current orchestration layer for source-file prompts.

It is responsible for:

  • compacting AST skeleton input
  • trimming previous semantic state
  • budgeting rule context vs dependency context
  • producing diagnostics for what was kept vs dropped

This is the main architectural shift from "prompt builder" to "prompt engine".
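The budgeting-plus-diagnostics idea can be sketched minimally. This is not the real prompt-context.ts API; the names and the greedy keep-until-budget strategy are assumptions for illustration:

```typescript
// Hypothetical sketch of budget-aware trimming; not the real prompt-context.ts API.
interface Section {
  name: string;
  text: string;
}

interface TrimResult {
  kept: Section[];
  dropped: string[]; // diagnostics: names of sections that did not fit
}

// Keep sections in priority order until the character budget is exhausted,
// recording what was dropped so the decision stays observable.
function trimToBudget(sections: Section[], budgetChars: number): TrimResult {
  const kept: Section[] = [];
  const dropped: string[] = [];
  let used = 0;
  for (const s of sections) {
    if (used + s.text.length <= budgetChars) {
      kept.push(s);
      used += s.text.length;
    } else {
      dropped.push(s.name);
    }
  }
  return { kept, dropped };
}
```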

3. Task-aware prompt construction

generateSourcePrompt(...) now accepts a task mode:

  • summarize
  • validate

That mode changes instruction emphasis:

  • summarize prefers concise semantic synthesis
  • validate prefers precise audit behavior and evidence-backed rule reporting

This prevents one prompt style from trying to serve every job equally well.
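The mode switch above can be sketched as a small dispatch. The instruction strings are invented for illustration; generateSourcePrompt's real signature and wording may differ:

```typescript
// Hypothetical sketch of task-aware instruction emphasis; the actual
// generateSourcePrompt(...) wording may differ.
type TaskMode = "summarize" | "validate";

function instructionEmphasis(mode: TaskMode): string {
  return mode === "validate"
    ? "Audit precisely: report rule violations only with concrete evidence."
    : "Synthesize the file's role and responsibilities concisely.";
}
```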

4. Provider execution and parsing

Providers still execute the prompt and parse structured output, but they now consume orchestrated inputs instead of raw, unbounded context.

That separation matters:

  • orchestration decides what context is worth paying for
  • providers decide how to execute and parse it
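That division of labor maps naturally onto a narrow interface. The type names here are illustrative assumptions, not ArchSpine's actual provider contract:

```typescript
// Hypothetical sketch of the orchestration/provider split; interface names
// are illustrative, not ArchSpine's actual types.
interface OrchestratedPrompt {
  system: string;
  user: string; // budgeted, pre-trimmed context from the orchestration layer
}

interface Provider {
  // Providers only execute and parse; they never decide what context to include.
  execute(prompt: OrchestratedPrompt): Promise<unknown>;
}

// Trivial stand-in provider for testing the boundary.
class EchoProvider implements Provider {
  async execute(prompt: OrchestratedPrompt): Promise<unknown> {
    return { received: prompt.user.length };
  }
}
```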

Design principles

1. Budget before wording

Prompt quality is dominated by context selection and allocation, not by clever phrasing. The engine decides what gets included and how much budget each section receives before worrying about phrasing.

2. Evidence over breadth

The engine prefers direct imports, known semantic docs, and exact symbol evidence over speculative neighbors or generic path similarity. This improves quality and token efficiency.

3. Task-specific context policy

Different jobs need different budgets. validate spends more on rules and audit evidence, while summarize spends more on role and dependency semantics.

4. Headless Generation (JSON as UI)

ArchSpine treats the LLM as an analysis engine, not a prose writer.

  • Data over Prose: LLMs output strictly structured JSON containing all semantic facts and localized strings.
  • Node-side Rendering: Human-readable Markdown is generated by a deterministic Node.js renderer from the JSON payload.
  • Consistency: Ensures 100% consistent documentation style regardless of model variance.
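A minimal sketch of that split, assuming an invented payload shape (the field names and Markdown layout are illustrative, not ArchSpine's actual schema):

```typescript
// Hypothetical sketch of node-side rendering: the LLM emits structured JSON,
// and a deterministic renderer produces the Markdown. Field names are illustrative.
interface SemanticPayload {
  role: string;
  responsibilities: string[];
}

function renderMarkdown(p: SemanticPayload): string {
  const lines = ["## Role", p.role, "", "## Responsibilities"];
  for (const r of p.responsibilities) lines.push(`- ${r}`);
  return lines.join("\n");
}
```

Because the renderer is deterministic, two runs that produce the same JSON always produce byte-identical documentation.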

5. Intelligence Primitives

To bridge the gap between "working" and "precise" semantics:

  • Few-Shot Library: "Gold Standard" examples guide the model's professional tone.
  • Chain-of-Thought (CoT): A mandatory _thinking scratchpad forces step-by-step reasoning during validation.
  • Symbol Pinning: Constraining dependency inference to AST skeleton symbols to eliminate "fantasy" dependencies.
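Symbol pinning in particular is a simple filter. This is a sketch of the idea, not the actual implementation:

```typescript
// Hypothetical sketch of symbol pinning: any dependency the model claims that
// is absent from the AST skeleton's symbol set is discarded as a "fantasy".
function pinDependencies(claimed: string[], astSymbols: Set<string>): string[] {
  return claimed.filter((sym) => astSymbols.has(sym));
}
```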

Current optimizations

The current implementation is no longer at the "next ideas only" stage. The following capabilities are already in the mainline:

  • prompt policy tiers: lite / balanced
  • validate-specific policy: default / strict
  • runtime mode presets: standard / heavy
  • task-aware budget allocation for summarize and validate
  • relevance diagnostics for dependency candidates and symbol targets
  • prompt diagnostics for retained vs dropped dependencies and rules
  • fixed evaluation corpus and comparison harness
  • rule-aware context weighting for validate
  • a heavier semantic-first validate path, exposed through runtime mode and advanced flow controls
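The tier names above come straight from the document; a sketch of how they might resolve to concrete knobs (the numbers are invented for illustration):

```typescript
// Hypothetical mapping from policy tiers and runtime modes to concrete knobs;
// the tier names come from the document, the numbers are invented.
type PromptPolicy = "lite" | "balanced";
type RuntimeMode = "standard" | "heavy";

interface PolicyPreset {
  contextChars: number;
  maxDependencies: number;
}

function resolvePreset(policy: PromptPolicy, mode: RuntimeMode): PolicyPreset {
  const base: PolicyPreset =
    policy === "lite"
      ? { contextChars: 4000, maxDependencies: 4 }
      : { contextChars: 8000, maxDependencies: 8 };
  // Heavy runtime mode doubles the context budget in this sketch.
  return mode === "heavy" ? { ...base, contextChars: base.contextChars * 2 } : base;
}
```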

Dynamic budget allocation

Source prompt budgets are no longer fixed constants.

The current allocator adapts based on:

  • source file line count
  • import/export/usage counts
  • dependency context size
  • rule context size
  • task mode

The allocator currently controls:

  • header lines
  • max imports
  • max exports
  • max usages
  • implementation clue depth
  • total context chars
  • dependency context chars
  • rule chars
  • previous responsibility count
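A sketch of the adaptive allocator: the input signals match the lists above, but the scaling factors and thresholds are invented for illustration:

```typescript
// Hypothetical sketch of the adaptive allocator; the signals mirror the lists
// above, but the numbers are invented, not ArchSpine's actual values.
interface AllocatorInput {
  lineCount: number;
  importCount: number;
  ruleContextChars: number;
  task: "summarize" | "validate";
}

interface Allocation {
  maxImports: number;
  ruleChars: number;
  dependencyChars: number;
}

function allocate(input: AllocatorInput): Allocation {
  // Larger files earn a larger budget, up to a cap.
  const scale = input.lineCount > 400 ? 1.5 : 1.0;
  const maxImports = Math.min(input.importCount, Math.round(12 * scale));
  // validate spends more on rules; summarize spends more on dependency semantics.
  const ruleChars =
    input.task === "validate"
      ? Math.min(input.ruleContextChars, Math.round(4000 * scale))
      : 1500;
  const dependencyChars = input.task === "validate" ? 2000 : 3500;
  return { maxImports, ruleChars, dependencyChars };
}
```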

Lightweight relevance sorting

Dependency summaries and symbol targets are now ordered by lightweight relevance signals.

Current factors include:

  • same-directory proximity
  • known semantic docs
  • public surface / export evidence
  • number of imported symbols
  • direct import target
  • exact imported-symbol match
  • path distance

This is intentionally simple and fast.
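A scoring sketch over a subset of those signals; the weights are invented for illustration and are not ArchSpine's actual values:

```typescript
// Hypothetical relevance scoring using a subset of the signals above;
// the weights are invented for illustration.
interface Candidate {
  path: string;
  hasSemanticDoc: boolean;
  isDirectImport: boolean;
  importedSymbolCount: number;
}

function relevance(sourcePath: string, c: Candidate): number {
  const dir = (p: string) => p.split("/").slice(0, -1).join("/");
  let score = 0;
  if (dir(sourcePath) === dir(c.path)) score += 3; // same-directory proximity
  if (c.hasSemanticDoc) score += 2;                // known semantic docs
  if (c.isDirectImport) score += 4;                // direct import target
  score += Math.min(c.importedSymbolCount, 5);     // imported-symbol evidence, capped
  return score;
}
```

Candidates are then simply sorted by this score descending before the budget cut is applied.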

Diagnostics

Prompt artifacts now expose diagnostics so the system can observe:

  • raw input size
  • allocated budgets
  • final retained size
  • retained vs truncated dependency candidates
  • retained vs dropped rule blocks
  • relevance scoring contributions for dependency candidates and symbol targets

This is necessary for any serious quality/performance tuning work.
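The diagnostics list above suggests a shape like the following. The field names are illustrative, not ArchSpine's actual artifact schema:

```typescript
// Hypothetical diagnostics shape mirroring the list above; field names are
// illustrative, not ArchSpine's actual artifact schema.
interface PromptDiagnostics {
  rawInputChars: number;
  allocatedChars: number;
  retainedChars: number;
  retainedDependencies: string[];
  truncatedDependencies: string[];
  droppedRuleBlocks: string[];
}

// One derived metric tuning work might watch: how much of the raw input survived.
function retentionRatio(d: PromptDiagnostics): number {
  return d.rawInputChars === 0 ? 1 : d.retainedChars / d.rawInputChars;
}
```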

What we should optimize next

The next phase should stay focused on performance ceiling and result quality, but the foundation is already in place.

The highest-value next steps are:

  1. expand live validate sample coverage for the experimental JSON-only path
  2. compare more real-provider runs before changing any default runtime behavior
  3. keep corpus and comparison data current as prompt policies evolve
  4. sync implementation conclusions into docs and runbooks without reintroducing legacy lite-mode wording as the default story
  5. avoid new prompt policies unless they can be measured by the existing corpus and harness

What we should not optimize yet

These are lower-value for the current stage:

  • elaborate prompt wording tweaks without measurement
  • heavyweight scoring frameworks
  • agent-facing governance layers mixed into prompt assembly
  • architecture cleanup that does not improve quality or throughput

Success criteria

Prompt engine work is successful only if it improves one or more of these:

  • lower token cost for the same quality
  • better JSON parse stability
  • better rule violation precision and recall
  • less semantic drift and hallucination
  • more stable output across large repositories
  • lower orchestration overhead per file

If a change does not move those metrics, it is not prompt-engine progress.

English is the primary docs tree; zh-CN mirrors shipped behavior.