Prompt Engine Strategy

This document describes the current direction for ArchSpine's prompt assembly layer.

The goal is not "better prompt wording". The goal is to maximize two things:

  • performance ceiling
  • result quality

ArchSpine should behave like a prompt system, not a collection of prompt strings.

Why this exists

Naive prompt construction breaks down quickly in real repositories:

  • too much context gets shoved into one request
  • low-value imports crowd out high-value evidence
  • rule-heavy audit tasks and summary-heavy synthesis tasks compete for the same budget
  • output quality becomes unstable as file size and dependency fan-out increase

The prompt engine exists to solve exactly that.

Current model

The current implementation has four layers.

1. Context retrieval

ContextEngine resolves local imports, symbol references, and known dependency semantics.

Current context output is organized into:

  • Import Inventory
  • Known Internal Dependency Semantics
  • Resolved Symbol References

This gives the LLM evidence-backed context instead of raw file sprawl.
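The three context sections above can be pictured as one typed payload. This is a hypothetical sketch; the field names are illustrative, not the actual ContextEngine API:

```typescript
// Hypothetical shape of ContextEngine output; field names are illustrative,
// not ArchSpine's actual types.
interface ResolvedContext {
  importInventory: string[];                    // resolved local import paths
  dependencySemantics: Record<string, string>;  // path -> known semantic doc
  symbolReferences: { symbol: string; definedIn: string }[];
}

// Compact one-line summary, useful for diagnostics and logging.
function summarizeContext(ctx: ResolvedContext): string {
  return [
    `imports: ${ctx.importInventory.length}`,
    `semantic docs: ${Object.keys(ctx.dependencySemantics).length}`,
    `symbols: ${ctx.symbolReferences.length}`,
  ].join(", ");
}
```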

2. Prompt context orchestration

src/infra/prompt-context.ts is the current orchestration layer for source-file prompts.

It is responsible for:

  • compacting AST skeleton input
  • trimming previous semantic state
  • budgeting rule context vs dependency context
  • producing diagnostics for what was kept vs dropped

This is the main architectural shift from "prompt builder" to "prompt engine".
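The budgeting-plus-diagnostics idea can be sketched minimally. This is not the real prompt-context.ts API; the names and the greedy keep-until-budget strategy are assumptions for illustration:

```typescript
// Hypothetical sketch of budget-aware trimming; not the real prompt-context.ts API.
interface Section {
  name: string;
  text: string;
}

interface TrimResult {
  kept: Section[];
  dropped: string[]; // diagnostics: names of sections that did not fit
}

// Keep sections in priority order until the character budget is exhausted,
// recording what was dropped so the decision stays observable.
function trimToBudget(sections: Section[], budgetChars: number): TrimResult {
  const kept: Section[] = [];
  const dropped: string[] = [];
  let used = 0;
  for (const s of sections) {
    if (used + s.text.length <= budgetChars) {
      kept.push(s);
      used += s.text.length;
    } else {
      dropped.push(s.name);
    }
  }
  return { kept, dropped };
}
```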

3. Task-aware prompt construction

generateSourcePrompt(...) now accepts a task mode:

  • summarize
  • validate

That mode changes instruction emphasis:

  • summarize prefers concise semantic synthesis
  • validate prefers precise audit behavior and evidence-backed rule reporting

This prevents one prompt style from trying to serve every job equally well.
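The mode switch above can be sketched as a small dispatch. The instruction strings are invented for illustration; generateSourcePrompt's real signature and wording may differ:

```typescript
// Hypothetical sketch of task-aware instruction emphasis; the actual
// generateSourcePrompt(...) wording may differ.
type TaskMode = "summarize" | "validate";

function instructionEmphasis(mode: TaskMode): string {
  return mode === "validate"
    ? "Audit precisely: report rule violations only with concrete evidence."
    : "Synthesize the file's role and responsibilities concisely.";
}
```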

4. Provider execution and parsing

Providers still execute the prompt and parse structured output, but they now consume orchestrated inputs instead of raw, unbounded context.

That separation matters:

  • orchestration decides what context is worth paying for
  • providers decide how to execute and parse it
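That division of labor maps naturally onto a narrow interface. The type names here are illustrative assumptions, not ArchSpine's actual provider contract:

```typescript
// Hypothetical sketch of the orchestration/provider split; interface names
// are illustrative, not ArchSpine's actual types.
interface OrchestratedPrompt {
  system: string;
  user: string; // budgeted, pre-trimmed context from the orchestration layer
}

interface Provider {
  // Providers only execute and parse; they never decide what context to include.
  execute(prompt: OrchestratedPrompt): Promise<unknown>;
}

// Trivial stand-in provider for testing the boundary.
class EchoProvider implements Provider {
  async execute(prompt: OrchestratedPrompt): Promise<unknown> {
    return { received: prompt.user.length };
  }
}
```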

Design principles

1. Budget before wording

Prompt quality is dominated by context selection and allocation, not by clever phrasing. The engine decides what gets included and how much budget each section receives before worrying about phrasing.

2. Evidence over breadth

The engine prefers direct imports, known semantic docs, and exact symbol evidence over speculative neighbors or generic path similarity. This improves quality and token efficiency.

3. Task-specific context policy

Different jobs need different budgets. validate spends more on rules and audit evidence, while summarize spends more on role and dependency semantics.

4. Headless Generation (JSON as UI)

ArchSpine treats the LLM as an analysis engine, not a prose writer.

  • Data over Prose: LLMs output strictly structured JSON containing all semantic facts and localized strings.
  • Node-side Rendering: Human-readable Markdown is generated by a deterministic Node.js renderer from the JSON payload.
  • Consistency: Ensures 100% consistent documentation style regardless of model variance.
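A minimal sketch of that split, assuming an invented payload shape (the field names and Markdown layout are illustrative, not ArchSpine's actual schema):

```typescript
// Hypothetical sketch of node-side rendering: the LLM emits structured JSON,
// and a deterministic renderer produces the Markdown. Field names are illustrative.
interface SemanticPayload {
  role: string;
  responsibilities: string[];
}

function renderMarkdown(p: SemanticPayload): string {
  const lines = ["## Role", p.role, "", "## Responsibilities"];
  for (const r of p.responsibilities) lines.push(`- ${r}`);
  return lines.join("\n");
}
```

Because the renderer is deterministic, two runs that produce the same JSON always produce byte-identical documentation.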

5. Intelligence Primitives

To bridge the gap between "working" and "precise" semantics:

  • Few-Shot Library: "Gold Standard" examples guide the model's professional tone.
  • Chain-of-Thought (CoT): A mandatory _thinking scratchpad forces step-by-step reasoning during validation.
  • Symbol Pinning: Constraining dependency inference to AST skeleton symbols to eliminate "fantasy" dependencies.
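Symbol pinning in particular is a simple filter. This is a sketch of the idea, not the actual implementation:

```typescript
// Hypothetical sketch of symbol pinning: any dependency the model claims that
// is absent from the AST skeleton's symbol set is discarded as a "fantasy".
function pinDependencies(claimed: string[], astSymbols: Set<string>): string[] {
  return claimed.filter((sym) => astSymbols.has(sym));
}
```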

Current optimizations

The current implementation is no longer at the "next ideas only" stage. The following capabilities are already in the mainline:

  • prompt policy tiers: lite / balanced
  • validate-specific policy: default / strict
  • runtime mode presets: standard / heavy
  • task-aware budget allocation for summarize and validate
  • relevance diagnostics for dependency candidates and symbol targets
  • prompt diagnostics for retained vs dropped dependencies and rules
  • fixed evaluation corpus and comparison harness
  • rule-aware context weighting for validate
  • a heavier semantic-first validate path, exposed through runtime mode and advanced flow controls
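The tier names above come straight from the document; a sketch of how they might resolve to concrete knobs (the numbers are invented for illustration):

```typescript
// Hypothetical mapping from policy tiers and runtime modes to concrete knobs;
// the tier names come from the document, the numbers are invented.
type PromptPolicy = "lite" | "balanced";
type RuntimeMode = "standard" | "heavy";

interface PolicyPreset {
  contextChars: number;
  maxDependencies: number;
}

function resolvePreset(policy: PromptPolicy, mode: RuntimeMode): PolicyPreset {
  const base: PolicyPreset =
    policy === "lite"
      ? { contextChars: 4000, maxDependencies: 4 }
      : { contextChars: 8000, maxDependencies: 8 };
  // Heavy runtime mode doubles the context budget in this sketch.
  return mode === "heavy" ? { ...base, contextChars: base.contextChars * 2 } : base;
}
```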

Dynamic budget allocation

Source prompt budgets are no longer fixed constants.

The current allocator adapts based on:

  • source file line count
  • import/export/usage counts
  • dependency context size
  • rule context size
  • task mode

The allocator currently controls:

  • header lines
  • max imports
  • max exports
  • max usages
  • implementation clue depth
  • total context chars
  • dependency context chars
  • rule chars
  • previous responsibility count
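A sketch of the adaptive allocator: the input signals match the lists above, but the scaling factors and thresholds are invented for illustration:

```typescript
// Hypothetical sketch of the adaptive allocator; the signals mirror the lists
// above, but the numbers are invented, not ArchSpine's actual values.
interface AllocatorInput {
  lineCount: number;
  importCount: number;
  ruleContextChars: number;
  task: "summarize" | "validate";
}

interface Allocation {
  maxImports: number;
  ruleChars: number;
  dependencyChars: number;
}

function allocate(input: AllocatorInput): Allocation {
  // Larger files earn a larger budget, up to a cap.
  const scale = input.lineCount > 400 ? 1.5 : 1.0;
  const maxImports = Math.min(input.importCount, Math.round(12 * scale));
  // validate spends more on rules; summarize spends more on dependency semantics.
  const ruleChars =
    input.task === "validate"
      ? Math.min(input.ruleContextChars, Math.round(4000 * scale))
      : 1500;
  const dependencyChars = input.task === "validate" ? 2000 : 3500;
  return { maxImports, ruleChars, dependencyChars };
}
```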

Lightweight relevance sorting

Dependency summaries and symbol targets are now ordered by lightweight relevance signals.

Current factors include:

  • same-directory proximity
  • known semantic docs
  • public surface / export evidence
  • number of imported symbols
  • direct import target
  • exact imported-symbol match
  • path distance

This is intentionally simple and fast.
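A scoring sketch over a subset of those signals; the weights are invented for illustration and are not ArchSpine's actual values:

```typescript
// Hypothetical relevance scoring using a subset of the signals above;
// the weights are invented for illustration.
interface Candidate {
  path: string;
  hasSemanticDoc: boolean;
  isDirectImport: boolean;
  importedSymbolCount: number;
}

function relevance(sourcePath: string, c: Candidate): number {
  const dir = (p: string) => p.split("/").slice(0, -1).join("/");
  let score = 0;
  if (dir(sourcePath) === dir(c.path)) score += 3; // same-directory proximity
  if (c.hasSemanticDoc) score += 2;                // known semantic docs
  if (c.isDirectImport) score += 4;                // direct import target
  score += Math.min(c.importedSymbolCount, 5);     // imported-symbol evidence, capped
  return score;
}
```

Candidates are then simply sorted by this score descending before the budget cut is applied.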

Diagnostics

Prompt artifacts now expose diagnostics so the system can observe:

  • raw input size
  • allocated budgets
  • final retained size
  • retained vs truncated dependency candidates
  • retained vs dropped rule blocks
  • relevance scoring contributions for dependency candidates and symbol targets

This is necessary for any serious quality/performance tuning work.
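The diagnostics list above suggests a shape like the following. The field names are illustrative, not ArchSpine's actual artifact schema:

```typescript
// Hypothetical diagnostics shape mirroring the list above; field names are
// illustrative, not ArchSpine's actual artifact schema.
interface PromptDiagnostics {
  rawInputChars: number;
  allocatedChars: number;
  retainedChars: number;
  retainedDependencies: string[];
  truncatedDependencies: string[];
  droppedRuleBlocks: string[];
}

// One derived metric tuning work might watch: how much of the raw input survived.
function retentionRatio(d: PromptDiagnostics): number {
  return d.rawInputChars === 0 ? 1 : d.retainedChars / d.rawInputChars;
}
```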

What we should optimize next

The next phase should stay focused on performance ceiling and result quality, but the foundation is already in place.

The highest-value next steps are:

  1. expand live validate sample coverage for the experimental JSON-only path
  2. compare more real-provider runs before changing any default runtime behavior
  3. keep corpus and comparison data current as prompt policies evolve
  4. sync implementation conclusions into docs and runbooks without reintroducing legacy lite-mode wording as the default story
  5. avoid new prompt policies unless they can be measured by the existing corpus and harness

What we should not optimize yet

These are lower-value for the current stage:

  • elaborate prompt wording tweaks without measurement
  • heavyweight scoring frameworks
  • agent-facing governance layers mixed into prompt assembly
  • architecture cleanup that does not improve quality or throughput

Success criteria

Prompt engine work is successful only if it improves one or more of these:

  • lower token cost for the same quality
  • better JSON parse stability
  • better rule violation precision and recall
  • less semantic drift and hallucination
  • more stable output across large repositories
  • lower orchestration overhead per file

If a change does not move those metrics, it is not prompt-engine progress.

English is the primary docs tree; zh-CN mirrors shipped behavior.