Atari 8 Bit knowledge MCP server for AI

atari8-mcp

Local Atari 8-bit documentation ingestion and lookup server for MCP clients.

This project turns Atari programming PDFs, manuals, books, and magazine scans into durable local artifacts:

  • raw source PDFs under data/raw/
  • copied Markdown/PDF source revisions under data/sources/
  • untouched OCR JSON under data/ocr-json/
  • per-page Markdown under data/pages-md/
  • cleaned Markdown placeholders under data/clean-md/
  • extracted structured JSON under data/structured/
  • SQLite plus FTS5 indexes under data/index/

The MCP server is read-only and model/client agnostic. It is intended to work with Codex, Claude, Cursor, and any MCP client that can launch a stdio server. After OCR is complete, lookup runs locally/offline from SQLite and files.

Install

python3 -m venv .venv
. .venv/bin/activate
python3 -m pip install -e .

Mistral OCR requires an API key:

export MISTRAL_API_KEY=...

Initialize the local data directories and SQLite schema:

atari-docs init-db

During schema development, the SQLite index is treated as generated data. To delete and recreate only data/index/atari_docs.sqlite while preserving raw PDFs, OCR JSON, page Markdown, cleaned Markdown, and structured artifacts, run:

atari-docs init-db --reset

CLI Flow

atari-docs ingest path/to/manual.pdf
atari-docs ocr <doc_id> --provider mistral
atari-docs ocr-page <doc_id> <page_index> --provider mistral --word-confidence
atari-docs annotate <doc_id>
atari-docs extract <doc_id>
atari-docs index
atari-docs taxonomy seed
atari-docs recipes seed
atari-docs serve-mcp

ingest prints the generated doc_id. The ID is derived from the PDF's file stem and its SHA-256 hash, so re-ingesting the same file yields the same doc_id.
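
As a rough illustration of how a stem-plus-hash ID stays stable, here is a minimal Python sketch. The function name make_doc_id, the slug rules, and the 12-character hash prefix are assumptions for illustration and are not taken from the project's code.

```python
import hashlib
import re


def make_doc_id(stem: str, data: bytes, hash_chars: int = 12) -> str:
    """Build an ID from a file stem and the SHA-256 of the file contents.

    Hypothetical helper: the slug format and prefix length are assumed.
    """
    digest = hashlib.sha256(data).hexdigest()
    # Lowercase the stem and collapse anything non-alphanumeric into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", stem.lower()).strip("-")
    return f"{slug}-{digest[:hash_chars]}"


# Same stem and same bytes always give the same ID.
first = make_doc_id("Atari BASIC Reference", b"example pdf bytes")
second = make_doc_id("Atari BASIC Reference", b"example pdf bytes")
print(first == second)
```

Because the hash covers the file contents, a changed PDF with the same name would get a different ID under this scheme.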

ingest also accepts multiple files, directories, and quoted glob patterns:

atari-docs ingest "/path/to/docs/*.pdf"
atari-docs ingest /path/to/docs
atari-docs ingest manual-a.pdf manual-b.pdf

Markdown/wiki/API docs can be ingested without OCR:

atari-docs ingest-md path/to/doc.md
atari-docs ingest-md path/to/wiki-dir --source-family fujinet

Markdown ingest copies the original sources under data/sources/<doc_id>/, creates page rows directly from Markdown, and then uses the same annotate, extract, index, recipe, and embedding pipeline as OCR-derived documents.

Existing documents can be explicitly updated while keeping old source snapshots:

atari-docs update-doc <doc_id> path/to/new-doc.md
atari-docs update-doc <doc_id> path/to/new-doc.pdf --ocr --provider mistral

The default update mode is replace: generated rows for the selected doc_id are removed and rebuilt from the new source. Older source revisions remain in data/sources/. PDF updates can be registered without OCR; the document is then marked as needing OCR.

To OCR every imported document that does not have OCR pages yet:

atari-docs ocr --missing --provider mistral

Batch OCR processes documents one by one. If a provider/API call fails, for example because a PDF is too large, the command reports the failing doc_id, error type, and provider error text, records the failure in document metadata, then continues with the remaining documents. The command exits non-zero if any document failed.
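
The continue-on-failure behavior described above can be sketched roughly as follows. This is an illustration only: ocr_missing and the run_ocr callback are hypothetical names, and the failures dictionary stands in for the document metadata that the real command updates.

```python
import sys
from typing import Callable, Iterable


def ocr_missing(doc_ids: Iterable[str], run_ocr: Callable[[str], None]) -> int:
    """Run OCR one document at a time and return a process exit code."""
    failures: dict[str, str] = {}
    for doc_id in doc_ids:
        try:
            run_ocr(doc_id)
        except Exception as exc:
            # Report the doc_id, error type, and error text, then keep going.
            failures[doc_id] = f"{type(exc).__name__}: {exc}"
            print(f"OCR failed for {doc_id}: {failures[doc_id]}", file=sys.stderr)
    # Non-zero exit if any document failed.
    return 1 if failures else 0
```

A caller would pass the result to sys.exit so that scripts wrapping the command can detect partial failures.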

Before each OCR request, the CLI prints the document details to stderr:

[1/12] OCR start: atari-basic-reference-...
  title: Atari BASIC Reference
  file: data/raw/atari-basic-reference-....pdf
  size: 18.4 MB
  pages: 240
  provider: mistral
  model: mistral-ocr-latest

Mistral OCR Defaults

The first OCR provider is Mistral OCR. The adapter uses:

  • model="mistral-ocr-latest"
  • table_format="html"
  • extract_header=True
  • extract_footer=True
  • include_image_base64=False for full-document OCR
  • confidence_scores_granularity="page" for full-document OCR

Selected pages can be rerun with:

atari-docs ocr-page <doc_id> <page_index> --provider mistral --word-confidence

That command extracts a one-page PDF locally, sends it to Mistral with confidence_scores_granularity="word" and include_image_base64=True, then stores the raw response separately from the full-document OCR JSON.
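
The difference between the two OCR modes can be summarized as a small options builder. This is a sketch: ocr_options is a hypothetical helper, and it assumes the single-page rerun keeps the other documented defaults unchanged, which the text above does not state explicitly.

```python
def ocr_options(single_page: bool) -> dict:
    """Return request options for full-document or single-page OCR."""
    return {
        "model": "mistral-ocr-latest",
        "table_format": "html",
        "extract_header": True,
        "extract_footer": True,
        # Page images are only requested for single-page reruns.
        "include_image_base64": single_page,
        # Word-level confidence for reruns, page-level otherwise.
        "confidence_scores_granularity": "word" if single_page else "page",
    }


print(ocr_options(single_page=True)["confidence_scores_granularity"])
```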

The full raw OCR response is always written unchanged to data/ocr-json/. Per-page Markdown and SQLite rows are derived artifacts.

SQLite Schema

The database is data/index/atari_docs.sqlite and contains:

  • documents
  • document_revisions
  • pages
  • sections
  • chunks
  • tables
  • registers
  • memory_locations
  • bitfields
  • symbols
  • commands
  • protocols
  • protocol_messages
  • code_examples
  • taxonomy_terms
  • annotations
  • evidence
  • aliases
  • document_authority
  • recipes
  • recipe_steps
  • recipe_sources
  • recipe_seed_runs
  • embedding_models
  • embedding_jobs
  • embedding_items
  • pages_fts
  • chunks_fts
  • recipes_fts
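
For readers unfamiliar with FTS5, the sketch below shows how a table like pages_fts can be queried from Python's sqlite3 module. The column layout (doc_id, page_index, text) and the sample rows are assumptions for illustration; the project's actual schema may differ. It also assumes the local SQLite build includes FTS5.

```python
import sqlite3


def build_demo_index() -> sqlite3.Connection:
    """Create an in-memory FTS5 table with two sample pages."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE VIRTUAL TABLE pages_fts USING fts5("
        "doc_id UNINDEXED, page_index UNINDEXED, text)"
    )
    conn.executemany(
        "INSERT INTO pages_fts (doc_id, page_index, text) VALUES (?, ?, ?)",
        [
            ("demo-doc", 0, "A display list interrupt (DLI) is enabled through NMIEN."),
            ("demo-doc", 1, "POKEY generates sound on four channels."),
        ],
    )
    return conn


def search_pages(conn: sqlite3.Connection, query: str, limit: int = 10) -> list:
    """Full-text search, best matches first (FTS5 'rank' ordering)."""
    return conn.execute(
        "SELECT doc_id, page_index FROM pages_fts "
        "WHERE pages_fts MATCH ? ORDER BY rank LIMIT ?",
        (query, limit),
    ).fetchall()


hits = search_pages(build_demo_index(), "display list interrupt")
print(hits)
```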

Schema status can be inspected with:

atari-docs schema-info

Seed and validate the controlled Atari taxonomy with:

atari-docs taxonomy seed
atari-docs taxonomy validate
atari-docs taxonomy list concept

Chunking is heading/page/topic aware. Each chunk preserves:

  • doc_id
  • page_start
  • page_end
  • heading_path
  • content_type
  • topics_json
  • table_ids_json
  • source provenance and confidence score

Tables are stored separately and chunk text uses [TABLE:<table_id>] placeholders so structured table records remain addressable.
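
A client that receives chunk text may want to find or expand these placeholders. The following is a minimal sketch, assuming table IDs contain no whitespace or closing brackets; the helper names and the sample ID t-001 are hypothetical.

```python
import re

# Matches placeholders of the form [TABLE:<table_id>].
TABLE_REF = re.compile(r"\[TABLE:([^\]\s]+)\]")


def table_ids_in(chunk_text: str) -> list[str]:
    """List the table IDs referenced by a chunk, in order of appearance."""
    return TABLE_REF.findall(chunk_text)


def expand_tables(chunk_text: str, tables: dict[str, str]) -> str:
    """Replace known placeholders with table content; leave unknown ones as-is."""
    return TABLE_REF.sub(lambda m: tables.get(m.group(1), m.group(0)), chunk_text)


text = "NMIEN bits:\n[TABLE:t-001]\nSee also [TABLE:t-404]."
print(table_ids_in(text))
print(expand_tables(text, {"t-001": "<table>...</table>"}))
```

Leaving unknown placeholders untouched keeps the reference addressable through lookup_table even when the table body is not at hand.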

Annotation And Extraction

annotate uses a local rules engine. It writes normalized annotations and evidence rows, and also maintains derived helper JSON on pages/chunks for the current indexer. No AI APIs are called by this command.

atari-docs annotate <doc_id> --mode rules --target pages
atari-docs annotate <doc_id> --mode rules --target chunks --explain
atari-docs annotation-stats <doc_id>

Derived annotation summaries include:

  • page_type
  • topics
  • contains_registers
  • contains_memory_addresses
  • contains_bitfields
  • contains_code
  • contains_protocol
  • recommended_extractors
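
To give a feel for what a local rules pass can look like, here is a small sketch that derives three of the flags above from page text. The patterns and the short register list are illustrative assumptions, not the project's actual rules.

```python
import re

# "$" followed by 2 to 4 hex digits, e.g. $D40E.
ADDRESS = re.compile(r"\$[0-9A-Fa-f]{2,4}\b")
# Tiny illustrative subset of ANTIC/GTIA register names.
REGISTERS = {"NMIEN", "DMACTL", "DLISTL", "COLBK", "WSYNC"}
# A numbered BASIC line that starts with a common keyword.
BASIC_LINE = re.compile(r"^\s*\d+\s+(POKE|PRINT|GOTO|FOR|REM)\b", re.MULTILINE)


def annotate_page(text: str) -> dict:
    """Return a few derived flags for one page of Markdown text."""
    words = set(re.findall(r"[A-Z][A-Z0-9]{2,}", text))
    return {
        "contains_registers": bool(words & REGISTERS),
        "contains_memory_addresses": bool(ADDRESS.search(text)),
        "contains_code": bool(BASIC_LINE.search(text)),
    }


print(annotate_page("10 POKE 54286,192\nNMIEN at $D40E enables the DLI."))
```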

extract writes structured JSON separate from source Markdown and populates SQLite records for:

  • registers
  • memory maps
  • bitfields
  • protocol message tables
  • code listings
  • symbols
  • tables

The current extractors are intentionally conservative regex/table pass extractors. They are suitable for lookup bootstrapping and can be expanded without changing the stored OCR provenance.

Recipes

Recipes are first-class programming help records for common Atari 8-bit tasks. Seeded recipes start as drafts until their source links are reviewed.

atari-docs recipes seed
atari-docs recipes list --subsystem antic
atari-docs recipes find DLI
atari-docs recipes get install_dli
atari-docs recipes link-sources --dry-run

User-authored recipes can be imported from JSON:

atari-docs recipes import path/to/recipes.json

After annotating and indexing new documents, draft recipe candidates can be generated from chunks marked as likely procedures. Dry-run is the default; --write records the generated drafts and stores the source document IDs in recipe_seed_runs so later broad candidate generation can skip documents that have already been used.

atari-docs recipes generate-candidates
atari-docs recipes generate-candidates --doc-id <doc_id>
atari-docs recipes generate-candidates --write
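
The skip logic behind recipe_seed_runs can be pictured as a simple filter. This sketch assumes that an explicit --doc-id always targets the named document even if it was used before; that behavior and the function name are assumptions.

```python
from typing import Optional


def docs_for_candidates(
    indexed_doc_ids: list[str],
    seeded_doc_ids: set[str],
    doc_id: Optional[str] = None,
) -> list[str]:
    """Pick the documents to scan for draft recipe candidates."""
    if doc_id is not None:
        # Assumption: a targeted run is never skipped.
        return [doc_id]
    # Broad run: skip documents already recorded in recipe_seed_runs.
    return [d for d in indexed_doc_ids if d not in seeded_doc_ids]


print(docs_for_candidates(["doc-a", "doc-b", "doc-c"], {"doc-b"}))
```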

Selective AI Annotation

annotate-ai is an optional offline-import enrichment step. It selects targets already marked by local rules, estimates cost before running, calls a provider only when explicitly run, then stores cached AI annotations in SQLite. The MCP server does not call AI APIs at runtime.

Default provider/model:

  • provider: openai
  • model: gpt-5.4-nano
  • escalation model for future use: gpt-5.4-mini

OpenAI pricing used by the cost estimator, as published by OpenAI on 2026-05-20:

  • gpt-5.4-nano: $0.20 / 1M input tokens, $1.25 / 1M output tokens
  • gpt-5.4-mini: $0.75 / 1M input tokens, $4.50 / 1M output tokens

Candidate and cost review:

atari-docs annotate-ai candidates --limit 10
atari-docs annotate-ai dry-run-cost --limit 10
atari-docs annotate-ai --select-candidates --limit 10
atari-docs annotate-ai --dry-run-cost --limit 10

Live OpenAI annotation requires OPENAI_API_KEY and enforces a max-cost guard:

atari-docs annotate-ai run --limit 5 --max-cost 5.00
atari-docs annotate-ai run --doc-id <doc_id> --only-needs-review --max-cost 1.00
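
The cost estimate and the max-cost guard amount to simple arithmetic over the per-million-token prices listed above. The sketch below uses those prices; the function names and the token counts in the example are hypothetical.

```python
# (input, output) USD per 1M tokens, as listed in this README.
PRICES_PER_MILLION = {
    "gpt-5.4-nano": (0.20, 1.25),
    "gpt-5.4-mini": (0.75, 4.50),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost for a run with the given token counts."""
    in_price, out_price = PRICES_PER_MILLION[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def check_budget(estimated: float, max_cost: float) -> None:
    """Refuse to proceed when the estimate exceeds the guard."""
    if estimated > max_cost:
        raise RuntimeError(
            f"estimated ${estimated:.2f} exceeds --max-cost ${max_cost:.2f}"
        )


# 1M input tokens and 200k output tokens on gpt-5.4-nano:
# 1.0 * 0.20 + 0.2 * 1.25 = about $0.45
cost = estimate_cost("gpt-5.4-nano", 1_000_000, 200_000)
check_budget(cost, max_cost=5.00)
print(f"${cost:.2f}")
```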

Inspect cached AI annotations:

atari-docs annotate-ai show-cache
atari-docs annotate-ai show-cache <target_id>

Optional Embeddings

Embeddings are an optional secondary search index. SQLite remains the source of truth, and FTS/exact lookup continue to work without vectors. sqlite-vec is the first vector backend, but it is isolated behind a replaceable VectorIndex interface because sqlite-vec is pre-1.0 and may change. The generated sqlite-vec virtual table is disposable and rebuildable from embedding_items.

Install vector dependencies with:

python3 -m pip install -e ".[vector]"

Default provider/model:

  • provider: openai
  • model: text-embedding-3-small
  • dimensions: 1536

OpenAI embedding pricing used by the estimator, as checked on 2026-05-20:

  • text-embedding-3-small: $0.02 / 1M input tokens
  • text-embedding-3-large: $0.13 / 1M input tokens

Review candidates and costs before generating vectors. prepare records deterministic metadata only; a later generate run still embeds any prepared rows that do not yet have vectors:

atari-docs embeddings models
atari-docs embeddings prepare --target-type recipe --limit 10
atari-docs embeddings candidates --target-type recipe --limit 10
atari-docs embeddings dry-run-cost --target-type recipe --limit 10

Generate embeddings only when explicitly requested. OPENAI_API_KEY is required for the OpenAI provider and --max-cost guards total spend:

atari-docs embeddings generate --target-type recipe --limit 5 --max-cost 1.00
atari-docs embeddings rebuild --backend sqlite-vec
atari-docs embeddings status

Search stored vectors:

atari-docs embeddings search "install a DLI" --target-type recipe --limit 5

If sqlite-vec is not installed, vector backend rebuilds report clearly that the backend is unavailable. The code still imports, MCP still starts, and vector search can use the local cosine fallback for stored vectors in development.
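
A cosine fallback behind a replaceable interface might look roughly like the sketch below. The VectorIndex method names (add, search) and the CosineFallbackIndex class are assumptions for illustration; the project's real interface may differ.

```python
import math
from typing import Protocol


class VectorIndex(Protocol):
    """Hypothetical shape of the replaceable interface (method names assumed)."""

    def add(self, item_id: str, vector: list[float]) -> None: ...
    def search(self, query: list[float], limit: int) -> list[tuple[str, float]]: ...


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class CosineFallbackIndex:
    """Brute-force in-memory index, suitable only for small development sets."""

    def __init__(self) -> None:
        self._vectors: dict[str, list[float]] = {}

    def add(self, item_id: str, vector: list[float]) -> None:
        self._vectors[item_id] = vector

    def search(self, query: list[float], limit: int = 5) -> list[tuple[str, float]]:
        scored = [(item_id, cosine(query, vec)) for item_id, vec in self._vectors.items()]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:limit]


index = CosineFallbackIndex()
index.add("install_dli", [1.0, 0.0])
index.add("play_sound", [0.0, 1.0])
print(index.search([0.9, 0.1], limit=1))
```

Keeping the backend behind a narrow interface like this is what lets the sqlite-vec table be dropped and rebuilt from embedding_items without touching callers.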

search_docs performs hybrid retrieval: exact/alias signals, recipes, FTS, and optional vector hits are merged and reranked. If vectors or the embedding provider are unavailable, normal FTS/exact/recipe search still works.
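
The README does not say how the merge and rerank step is scored. As a stand-in, the sketch below uses reciprocal rank fusion (RRF), a common way to combine ranked lists; it is not claimed to be what search_docs does, and the signal names and IDs are hypothetical.

```python
def fuse(rankings: dict[str, list[str]], k: int = 60) -> list[str]:
    """Merge ranked hit lists with reciprocal rank fusion.

    Each list contributes 1 / (k + rank) per item. An empty list, for
    example when vectors are unavailable, simply contributes nothing.
    """
    scores: dict[str, float] = {}
    for hits in rankings.values():
        for rank, item in enumerate(hits, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    # Highest score first; ties broken by ID for deterministic output.
    return sorted(scores, key=lambda item: (-scores[item], item))


merged = fuse({
    "exact": ["chunk-7"],
    "fts": ["chunk-3", "chunk-7"],
    "vector": [],  # vector search unavailable
})
print(merged)
```

An item that appears in several signals accumulates score from each, so here chunk-7 outranks chunk-3 even though chunk-3 leads the FTS list.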

MCP Tools

atari-docs serve-mcp exposes these read-only tools:

  • search_docs(query, topic=None, doc_id=None, limit=10)
  • get_page(doc_id, page_index)
  • get_section(section_id)
  • lookup_register(name_or_address)
  • lookup_memory_address(address)
  • lookup_sio_command(query)
  • lookup_symbol(name)
  • lookup_table(table_id)
  • find_code_examples(query, language=None)
  • find_recipes(query, subsystem=None, language=None, limit=10)
  • get_recipe(recipe_id)
  • semantic_search(query, target_types=None, limit=10)

Example MCP client command:

{
  "command": "atari-docs",
  "args": ["serve-mcp"]
}

If your client does not inherit the virtualenv environment, point it at the installed script at .venv/bin/atari-docs.

OCR Provider Boundary

OCR providers implement OcrProvider.run(OcrRequest). Add new providers under src/atari_docs/ocr/ and register them in src/atari_docs/ocr_runner.py. All providers should preserve raw provider JSON and write derived Markdown, pages, confidence scores, and provenance through the same storage path.
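
The provider boundary can be pictured with the sketch below. Only the names OcrProvider, OcrRequest, and run come from this README; every field, the OcrResult type, and the EchoProvider class are assumptions made up for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass
class OcrRequest:
    # Hypothetical fields; the real request type may carry more options.
    doc_id: str
    pdf_path: Path


@dataclass
class OcrResult:
    # Hypothetical result type: raw provider JSON plus derived page Markdown.
    raw_json: dict
    pages_markdown: list[str]


class OcrProvider(Protocol):
    name: str

    def run(self, request: OcrRequest) -> OcrResult: ...


class EchoProvider:
    """Toy provider that fabricates one page, useful only for wiring tests."""

    name = "echo"

    def run(self, request: OcrRequest) -> OcrResult:
        raw = {"pages": [{"index": 0, "markdown": f"# {request.doc_id}"}]}
        # The raw payload is kept unchanged; Markdown is derived from it.
        return OcrResult(
            raw_json=raw,
            pages_markdown=[page["markdown"] for page in raw["pages"]],
        )


result = EchoProvider().run(OcrRequest("demo", Path("demo.pdf")))
print(result.pages_markdown)
```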

Project Docs

Additional design and workflow notes live under docs/.

Agent-specific guidance is in AGENTS.md.