Prompt Engine Strategy
This document describes the current direction for ArchSpine's prompt assembly layer.
The goal is not "better prompt wording". The goal is to maximize two things:
- performance ceiling
- result quality
ArchSpine should behave like a prompt system, not a collection of prompt strings.
Why this exists
Naive prompt construction breaks down quickly in real repositories:
- too much context gets shoved into one request
- low-value imports crowd out high-value evidence
- rule-heavy audit tasks and summary-heavy synthesis tasks compete for the same budget
- output quality becomes unstable as file size and dependency fan-out increase
The prompt engine exists to solve exactly that.
Current model
The current implementation has four layers.
1. Context retrieval
ContextEngine resolves local imports, symbol references, and known dependency semantics.
Current context output is organized into:
- Import Inventory
- Known Internal Dependency Semantics
- Resolved Symbol References
This gives the LLM evidence-backed context instead of raw file sprawl.
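As a rough sketch of how those three sections could be assembled into one context block (the `RetrievedContext` shape and `buildContextBlock` are illustrative, not the actual ContextEngine API):

```typescript
// Hypothetical input shape: what ContextEngine might hand to the prompt layer.
interface RetrievedContext {
  importInventory: string[];     // e.g. "import { parse } from './ast'"
  dependencySemantics: string[]; // known-doc snippets for internal deps
  resolvedSymbols: string[];     // exact symbol evidence lines
}

// Render the three evidence sections in a fixed, predictable order so the
// LLM always sees the same layout regardless of file shape.
function buildContextBlock(ctx: RetrievedContext): string {
  const section = (title: string, lines: string[]): string =>
    lines.length === 0 ? "" : `## ${title}\n${lines.join("\n")}\n`;

  return [
    section("Import Inventory", ctx.importInventory),
    section("Known Internal Dependency Semantics", ctx.dependencySemantics),
    section("Resolved Symbol References", ctx.resolvedSymbols),
  ].filter(Boolean).join("\n");
}
```

Empty sections are dropped entirely rather than rendered as empty headings, so the model never sees a section with no evidence behind it.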
2. Prompt context orchestration
src/infra/prompt-context.ts is the current orchestration layer for source-file prompts.
It is responsible for:
- compacting AST skeleton input
- trimming previous semantic state
- budgeting rule context vs dependency context
- producing diagnostics for what was kept vs dropped
This is the main architectural shift from "prompt builder" to "prompt engine".
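A minimal sketch of the "kept vs dropped" responsibility, assuming a character budget per skeleton (the name `compactSkeleton` and the result shape are hypothetical; the real src/infra/prompt-context.ts API may differ):

```typescript
interface CompactionResult {
  kept: string[];
  droppedCount: number; // diagnostics: how many lines the budget cut
}

// Trim an AST skeleton to a character budget, keeping earlier (usually
// higher-level) declarations first and recording what was dropped.
function compactSkeleton(skeletonLines: string[], budgetChars: number): CompactionResult {
  const kept: string[] = [];
  let used = 0;
  for (const line of skeletonLines) {
    if (used + line.length + 1 > budgetChars) break;
    kept.push(line);
    used += line.length + 1; // +1 for the newline
  }
  return { kept, droppedCount: skeletonLines.length - kept.length };
}
```

The point is that truncation is observable: the `droppedCount` feeds the diagnostics described later, instead of silently vanishing.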
3. Task-aware prompt construction
generateSourcePrompt(...) now accepts a task mode:
- summarize
- validate
That mode changes instruction emphasis:
- summarize prefers concise semantic synthesis
- validate prefers precise audit behavior and evidence-backed rule reporting
This prevents one prompt style from trying to serve every job equally well.
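A sketch of how instruction emphasis might branch on the task mode (the instruction strings here are illustrative stand-ins, not the actual prompt text used by generateSourcePrompt):

```typescript
type TaskMode = "summarize" | "validate";

// Illustrative only: pick instruction emphasis per task mode, so one prompt
// style is not forced to serve both synthesis and audit jobs.
function taskInstructions(mode: TaskMode): string {
  switch (mode) {
    case "summarize":
      return "Produce a concise semantic synthesis of the file's role.";
    case "validate":
      return "Audit against the provided rules; cite exact evidence for each violation.";
  }
}
```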
4. Provider execution and parsing
Providers still execute the prompt and parse structured output, but they now consume orchestrated inputs instead of raw, unbounded context.
That separation matters:
- orchestration decides what context is worth paying for
- providers decide how to execute and parse it
Design principles
1. Budget before wording
Prompt quality is dominated by context selection and allocation, not by clever phrasing. The engine first decides what gets included and how much budget each section receives; only then does wording matter.
2. Evidence over breadth
The engine prefers direct imports, known semantic docs, and exact symbol evidence over speculative neighbors or generic path similarity. This improves quality and token efficiency.
3. Task-specific context policy
Different jobs need different budgets. validate spends more on rules and audit evidence, while summarize spends more on role and dependency semantics.
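The task-specific split could be sketched like this (the 60/25 shares are made-up illustration values, not the actual policy numbers):

```typescript
type TaskMode = "summarize" | "validate";

interface BudgetSplit {
  ruleChars: number;       // rule/audit evidence budget
  dependencyChars: number; // role + dependency-semantics budget
}

// Hypothetical shares: validate spends more on rules, summarize on semantics.
function splitBudget(mode: TaskMode, totalChars: number): BudgetSplit {
  const ruleShare = mode === "validate" ? 0.6 : 0.25;
  const ruleChars = Math.floor(totalChars * ruleShare);
  return { ruleChars, dependencyChars: totalChars - ruleChars };
}
```

Whatever the real shares are, the invariant is the same: the two budgets always sum to the total, so one job cannot silently starve the other.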
4. Headless Generation (JSON as UI)
ArchSpine treats the LLM as an analysis engine, not a prose writer.
- Data over Prose: LLMs output strictly structured JSON containing all semantic facts and localized strings.
- Node-side Rendering: Human-readable Markdown is generated by a deterministic Node.js renderer from the JSON payload.
- Consistency: Ensures 100% consistent documentation style regardless of model variance.
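A minimal sketch of the JSON-as-UI split, assuming a payload shape like the one below (the `FileSemantics` fields are illustrative; the actual schema is not shown here):

```typescript
// Hypothetical structured payload: the model emits facts, not prose.
interface FileSemantics {
  role: string;
  responsibilities: string[];
  dependencies: { path: string; why: string }[];
}

// Deterministic Node-side renderer: Markdown comes from code, not the model,
// so style stays identical across providers and runs.
function renderMarkdown(s: FileSemantics): string {
  const deps = s.dependencies.map((d) => `- ${d.path}: ${d.why}`).join("\n");
  return [
    `## Role\n${s.role}`,
    `## Responsibilities\n${s.responsibilities.map((r) => `- ${r}`).join("\n")}`,
    deps ? `## Dependencies\n${deps}` : "",
  ].filter(Boolean).join("\n\n");
}
```

Because the renderer is deterministic, any style change is a one-line code edit rather than a prompt renegotiation with the model.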
5. Intelligence Primitives
To bridge the gap between "working" and "precise" semantics:
- Few-Shot Library: "Gold Standard" examples guide the model's professional tone.
- Chain-of-Thought (CoT): A mandatory _thinking scratchpad forces step-by-step reasoning during validation.
- Symbol Pinning: Constraining dependency inference to AST skeleton symbols to eliminate "fantasy" dependencies.
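The scratchpad and pinning checks can be enforced mechanically on the model's output. A sketch, assuming the `_thinking` field name from above and a hypothetical `checkValidateOutput` guard (the output schema shown is illustrative):

```typescript
// Illustrative guard: reject model output that skipped the mandatory
// _thinking scratchpad or invented symbols outside the AST skeleton.
function checkValidateOutput(
  raw: string,
  pinnedSymbols: Set<string>,
): { ok: boolean; reason?: string } {
  let parsed: { _thinking?: string; dependencies?: { symbol: string }[] };
  try {
    parsed = JSON.parse(raw);
  } catch {
    return { ok: false, reason: "not valid JSON" };
  }
  if (!parsed._thinking || parsed._thinking.trim() === "") {
    return { ok: false, reason: "missing _thinking scratchpad" };
  }
  for (const dep of parsed.dependencies ?? []) {
    if (!pinnedSymbols.has(dep.symbol)) {
      return { ok: false, reason: `unpinned symbol: ${dep.symbol}` };
    }
  }
  return { ok: true };
}
```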
Current optimizations
The current implementation is no longer at the "next ideas only" stage. The following capabilities are already in the mainline:
- prompt policy tiers: lite/balanced
- validate-specific policy: default/strict
- runtime mode presets: standard/heavy
- task-aware budget allocation for summarize and validate
- relevance diagnostics for dependency candidates and symbol targets
- prompt diagnostics for retained vs dropped dependencies and rules
- fixed evaluation corpus and comparison harness
- rule-aware context weighting for validate
- a heavier semantic-first validate path, exposed through runtime mode and advanced flow controls
Dynamic budget allocation
Source prompt budgets are no longer fixed constants.
The current allocator adapts based on:
- source file line count
- import/export/usage counts
- dependency context size
- rule context size
- task mode
The allocator currently controls:
- header lines
- max imports
- max exports
- max usages
- implementation clue depth
- total context chars
- dependency context chars
- rule chars
- previous responsibility count
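The adaptive behavior can be sketched with a couple of the signals and controls listed above (all constants here are invented for illustration; the real allocator covers more inputs and outputs):

```typescript
// Hypothetical allocator inputs, mirroring some of the signals listed above.
interface AllocatorInput {
  sourceLines: number;
  importCount: number;
  task: "summarize" | "validate";
}

interface Allocation {
  maxImports: number;
  totalContextChars: number;
  ruleChars: number;
}

// Scale budgets with file size and fan-out instead of using fixed constants.
function allocate(input: AllocatorInput): Allocation {
  const base = 2000;
  const totalContextChars = Math.min(12000, base + input.sourceLines * 20);
  return {
    maxImports: Math.min(40, Math.max(8, input.importCount)),
    totalContextChars,
    ruleChars: input.task === "validate"
      ? Math.floor(totalContextChars * 0.5)
      : Math.floor(totalContextChars * 0.2),
  };
}
```

Note the hard ceiling: budgets grow with the file, but only up to a cap, so one pathological file cannot blow the request size.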
Lightweight relevance sorting
Dependency summaries and symbol targets are now ordered by lightweight relevance signals.
Current factors include:
- same-directory proximity
- known semantic docs
- public surface / export evidence
- number of imported symbols
- direct import target
- exact imported-symbol match
- path distance
This is intentionally simple and fast.
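In the same spirit, the ordering could be a cheap additive score over the factors above (the weights and the `Candidate` shape are illustrative, not the implemented values):

```typescript
// Hypothetical candidate signals, matching the factors listed above.
interface Candidate {
  path: string;
  sameDirectory: boolean;
  hasSemanticDoc: boolean;
  isDirectImport: boolean;
  importedSymbolMatches: number;
  pathDistance: number; // directory hops between files
}

// Cheap additive score: no embeddings, no graph traversal, just signals
// already known at orchestration time.
function relevanceScore(c: Candidate): number {
  return (
    (c.isDirectImport ? 5 : 0) +
    (c.sameDirectory ? 2 : 0) +
    (c.hasSemanticDoc ? 2 : 0) +
    c.importedSymbolMatches * 3 -
    c.pathDistance
  );
}

function sortByRelevance(candidates: Candidate[]): Candidate[] {
  return [...candidates].sort((a, b) => relevanceScore(b) - relevanceScore(a));
}
```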
Diagnostics
Prompt artifacts now expose diagnostics so the system can observe:
- raw input size
- allocated budgets
- final retained size
- retained vs truncated dependency candidates
- retained vs dropped rule blocks
- relevance scoring contributions for dependency candidates and symbol targets
This is necessary for any serious quality/performance tuning work.
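A possible shape for that diagnostics record, plus a log-line summary (field names are illustrative; the actual artifact schema may differ):

```typescript
// Hypothetical diagnostics record attached to each prompt artifact.
interface PromptDiagnostics {
  rawInputChars: number;
  allocatedChars: number;
  retainedChars: number;
  dependencies: { retained: number; truncated: number };
  rules: { retained: number; dropped: number };
}

// One-line summary suitable for logs or the comparison harness.
function summarizeDiagnostics(d: PromptDiagnostics): string {
  const pct = ((d.retainedChars / d.rawInputChars) * 100).toFixed(1);
  return `kept ${pct}% of input; deps ${d.dependencies.retained}/${d.dependencies.retained + d.dependencies.truncated}; rules ${d.rules.retained}/${d.rules.retained + d.rules.dropped}`;
}
```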
What we should optimize next
The next phase should stay focused on performance ceiling and result quality, but the foundation is already in place.
The highest-value next steps are:
- expand live validate sample coverage for the experimental JSON-only path
- compare more real-provider runs before changing any default runtime behavior
- keep corpus and comparison data current as prompt policies evolve
- sync implementation conclusions into docs and runbooks without reintroducing legacy lite-mode / lite mode wording as the default story
- avoid new prompt policies unless they can be measured by the existing corpus and harness
What we should not optimize yet
These are lower-value for the current stage:
- elaborate prompt wording tweaks without measurement
- heavyweight scoring frameworks
- agent-facing governance layers mixed into prompt assembly
- architecture cleanup that does not improve quality or throughput
Success criteria
Prompt engine work is successful only if it improves one or more of these:
- lower token cost for the same quality
- better JSON parse stability
- better rule violation precision and recall
- less drift and hallucination
- more stable output across large repositories
- lower orchestration overhead per file
If a change does not move those metrics, it is not prompt-engine progress.