atari8-mcp
Local Atari 8-bit documentation ingestion and lookup server for MCP clients.
This project turns Atari programming PDFs, manuals, books, and magazine scans into durable local artifacts:
- raw source PDFs under data/raw/
- copied Markdown/PDF source revisions under data/sources/
- untouched OCR JSON under data/ocr-json/
- per-page Markdown under data/pages-md/
- cleaned Markdown placeholders under data/clean-md/
- extracted structured JSON under data/structured/
- SQLite plus FTS5 indexes under data/index/
The MCP server is read-only and model/client agnostic. It is intended to work with Codex, Claude, Cursor, and any MCP client that can launch a stdio server. After OCR is complete, lookup runs locally/offline from SQLite and files.
Install
python3 -m venv .venv
. .venv/bin/activate
python3 -m pip install -e .
Mistral OCR requires an API key:
export MISTRAL_API_KEY=...
Initialize the local data directories and SQLite schema:
atari-docs init-db
During schema-development phases the SQLite index is generated data. To delete
and recreate only data/index/atari_docs.sqlite, while preserving raw PDFs,
OCR JSON, page Markdown, cleaned Markdown, and structured artifacts:
atari-docs init-db --reset
CLI Flow
atari-docs ingest path/to/manual.pdf
atari-docs ocr <doc_id> --provider mistral
atari-docs ocr-page <doc_id> <page_index> --provider mistral --word-confidence
atari-docs annotate <doc_id>
atari-docs extract <doc_id>
atari-docs index
atari-docs taxonomy seed
atari-docs recipes seed
atari-docs serve-mcp
ingest prints the generated doc_id. The ID is derived from the PDF stem and a
SHA-256 hash, so re-ingesting the same file produces the same ID.
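As a rough illustration of why a stem-plus-hash ID is stable, here is a minimal sketch. The helper name `make_doc_id`, the slug rules, the 12-character hash prefix, and hashing the file contents are all assumptions for illustration; the project's actual scheme may differ.

```python
import hashlib
import re
from pathlib import Path


def make_doc_id(path: Path, hash_chars: int = 12) -> str:
    """Build an ID from the file stem plus a SHA-256 prefix of the file bytes.

    Hypothetical helper: slug rules and hash length are assumptions.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Hash in 1 MiB blocks so large scans do not need to fit in memory.
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    # Lowercase the stem and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", path.stem.lower()).strip("-")
    return f"{slug}-{digest.hexdigest()[:hash_chars]}"
```

Because both inputs are deterministic, calling the function twice on an unchanged file returns the same string.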
ingest also accepts multiple files, directories, and quoted glob patterns:
atari-docs ingest "/path/to/docs/*.pdf"
atari-docs ingest /path/to/docs
atari-docs ingest manual-a.pdf manual-b.pdf
Markdown/wiki/API docs can be ingested without OCR:
atari-docs ingest-md path/to/doc.md
atari-docs ingest-md path/to/wiki-dir --source-family fujinet
Markdown ingest copies the original sources under data/sources/<doc_id>/,
creates page rows directly from Markdown, and then uses the same annotate,
extract, index, recipe, and embedding pipeline as OCR-derived documents.
Existing documents can be explicitly updated while keeping old source snapshots:
atari-docs update-doc <doc_id> path/to/new-doc.md
atari-docs update-doc <doc_id> path/to/new-doc.pdf --ocr --provider mistral
The default update mode is replace: generated rows for the selected doc_id
are removed and rebuilt from the new source. Older source revisions remain in
data/sources/. PDF updates can be registered without OCR; the document is
then marked as needing OCR.
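The replace mode described above can be pictured as a delete-and-rebuild inside one transaction. This is only a sketch: the function name and the `pages` column names (`doc_id`, `page_index`, `markdown`) are assumed for illustration and are not taken from the real schema.

```python
import sqlite3


def replace_generated_rows(conn: sqlite3.Connection, doc_id: str, pages: list[str]) -> None:
    """Drop the generated page rows for one document and rebuild them.

    Hypothetical helper; column names are assumptions.
    """
    with conn:  # one transaction: the delete and the rebuild commit together
        conn.execute("DELETE FROM pages WHERE doc_id = ?", (doc_id,))
        conn.executemany(
            "INSERT INTO pages (doc_id, page_index, markdown) VALUES (?, ?, ?)",
            [(doc_id, index, text) for index, text in enumerate(pages)],
        )
```

Rows belonging to other documents are untouched because every statement is scoped to the selected doc_id.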
To OCR every imported document that does not have OCR pages yet:
atari-docs ocr --missing --provider mistral
Batch OCR processes documents one by one. If a provider/API call fails, for
example because a PDF is too large, the command reports the failing doc_id,
error type, and provider error text, records the failure in document metadata,
then continues with the remaining documents. The command exits non-zero if any
document failed.
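The continue-on-failure behaviour can be sketched as a simple loop. `run_batch` and the `ocr_one` callable are hypothetical stand-ins for the real command and provider call, and this sketch keeps failures in a dict rather than writing them to document metadata.

```python
import sys
from typing import Callable, Iterable


def run_batch(doc_ids: Iterable[str], ocr_one: Callable[[str], None]) -> int:
    """Run OCR per document and return an exit code (0 only if all succeed).

    `ocr_one` is a hypothetical callable standing in for the provider call.
    """
    failures: dict[str, str] = {}
    for doc_id in doc_ids:
        try:
            ocr_one(doc_id)
        except Exception as exc:  # one bad PDF should not stop the batch
            failures[doc_id] = f"{type(exc).__name__}: {exc}"
            print(f"OCR failed: {doc_id}: {failures[doc_id]}", file=sys.stderr)
    return 1 if failures else 0
```

A caller would pass the return value to `sys.exit`, which gives the non-zero exit status whenever any document failed.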
Before each OCR request, the CLI prints the document details to stderr:
[1/12] OCR start: atari-basic-reference-...
title: Atari BASIC Reference
file: data/raw/atari-basic-reference-....pdf
size: 18.4 MB
pages: 240
provider: mistral
model: mistral-ocr-latest
Mistral OCR Defaults
The first OCR provider is Mistral OCR. The adapter uses:
- model="mistral-ocr-latest"
- table_format="html"
- extract_header=True
- extract_footer=True
- include_image_base64=False for full-document OCR
- confidence_scores_granularity="page" for full-document OCR
Selected pages can be rerun with:
atari-docs ocr-page <doc_id> <page_index> --provider mistral --word-confidence
That command extracts a one-page PDF locally, sends it to Mistral with
confidence_scores_granularity="word" and include_image_base64=True, then
stores the raw response separately from the full-document OCR JSON.
The full raw OCR response is always written unchanged to data/ocr-json/.
Per-page Markdown and SQLite rows are derived artifacts.
SQLite Schema
The database is data/index/atari_docs.sqlite and contains:
- documents
- document_revisions
- pages
- sections
- chunks
- tables
- registers
- memory_locations
- bitfields
- symbols
- commands
- protocols
- protocol_messages
- code_examples
- taxonomy_terms
- annotations
- evidence
- aliases
- document_authority
- recipes
- recipe_steps
- recipe_sources
- recipe_seed_runs
- embedding_models
- embedding_jobs
- embedding_items
- pages_fts
- chunks_fts
- recipes_fts
Schema status can be inspected with:
atari-docs schema-info
Seed and validate the controlled Atari taxonomy with:
atari-docs taxonomy seed
atari-docs taxonomy validate
atari-docs taxonomy list concept
Chunking is heading/page/topic aware. Each chunk preserves:
- doc_id
- page_start
- page_end
- heading_path
- content_type
- topics_json
- table_ids_json
- source provenance and confidence score
Tables are stored separately and chunk text uses [TABLE:<table_id>]
placeholders so structured table records remain addressable.
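A small sketch of how a consumer might work with the `[TABLE:<table_id>]` placeholders. The helper names and the idea of passing table bodies as a plain dict are assumptions for illustration.

```python
import re

# Matches the placeholder form described above, capturing the table ID.
TABLE_PLACEHOLDER = re.compile(r"\[TABLE:([^\]]+)\]")


def table_ids(chunk_text: str) -> list[str]:
    """Return the table IDs referenced by a chunk, in order of appearance."""
    return TABLE_PLACEHOLDER.findall(chunk_text)


def expand_tables(chunk_text: str, tables: dict[str, str]) -> str:
    """Swap each placeholder for its stored table body; unknown IDs stay as-is."""
    return TABLE_PLACEHOLDER.sub(
        lambda match: tables.get(match.group(1), match.group(0)), chunk_text
    )
```

Leaving unknown IDs in place keeps the chunk text lossless, so a missing table record never silently removes a reference.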
Annotation And Extraction
annotate uses a local rules engine. It writes normalized annotations and
evidence rows, and also maintains derived helper JSON on pages/chunks for
the current indexer. No AI APIs are called by this command.
atari-docs annotate <doc_id> --mode rules --target pages
atari-docs annotate <doc_id> --mode rules --target chunks --explain
atari-docs annotation-stats <doc_id>
Derived annotation summaries include:
- page_type
- topics
- contains_registers
- contains_memory_addresses
- contains_bitfields
- contains_code
- contains_protocol
- recommended_extractors
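To give a feel for what a local rules pass might look like, here is a sketch that derives two of the boolean summaries from page text. The patterns and the function name `annotate_page` are invented for illustration; the project's real rules are not shown in this README.

```python
import re

# Hypothetical patterns: a $-prefixed four-digit hex address, a numbered
# BASIC line, and a line that starts with a common 6502 mnemonic.
HEX_ADDRESS = re.compile(r"\$[0-9A-Fa-f]{4}\b")
BASIC_LINE = re.compile(r"^\s*\d+\s+(?:REM|POKE|PRINT|GOTO|FOR|GRAPHICS)\b", re.MULTILINE)
ASM_LINE = re.compile(r"^\s*(?:LDA|STA|LDX|LDY|JSR|JMP|RTS|RTI)\b", re.MULTILINE)


def annotate_page(text: str) -> dict[str, bool]:
    """Return a subset of the derived flags using only local regex rules."""
    return {
        "contains_memory_addresses": bool(HEX_ADDRESS.search(text)),
        "contains_code": bool(BASIC_LINE.search(text) or ASM_LINE.search(text)),
    }
```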
extract writes structured JSON separate from source Markdown and populates
SQLite records for:
- registers
- memory maps
- bitfields
- protocol message tables
- code listings
- symbols
- tables
The current extractors are intentionally conservative regex and table-pass extractors. They are suitable for bootstrapping lookup and can be expanded without changing the stored OCR provenance.
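A conservative regex extractor in this spirit might only accept lines that fit an exact shape and skip everything else. The row layout (NAME, `$`-prefixed hex address, description), the record keys, and the function name are assumptions for illustration.

```python
import re

# Assumed row shape: an uppercase name, a $-prefixed four-digit hex address,
# then free-text description. Anything else is ignored.
MEMORY_ROW = re.compile(r"^\s*([A-Z][A-Z0-9]{1,7})\s+\$([0-9A-Fa-f]{4})\s+(.+?)\s*$")


def extract_memory_rows(page_text: str, doc_id: str, page_index: int) -> list[dict]:
    """Return one record per matching line, tagged with its source page."""
    records = []
    for line in page_text.splitlines():
        match = MEMORY_ROW.match(line)
        if match is None:
            continue  # conservative: skip lines that do not fit exactly
        name, address, description = match.groups()
        records.append({
            "name": name,
            "address": int(address, 16),
            "description": description,
            "doc_id": doc_id,
            "page_index": page_index,
        })
    return records
```

Carrying `doc_id` and `page_index` on every record keeps each extracted value traceable to the page it came from.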
Recipes
Recipes are first-class programming help records for common Atari 8-bit tasks. Seeded recipes start as drafts until their source links are reviewed.
atari-docs recipes seed
atari-docs recipes list --subsystem antic
atari-docs recipes find DLI
atari-docs recipes get install_dli
atari-docs recipes link-sources --dry-run
User-authored recipes can be imported from JSON:
atari-docs recipes import path/to/recipes.json
After annotating and indexing new documents, draft recipe candidates can be
generated from chunks marked as likely procedures. Dry-run is the default;
--write records the generated drafts and stores the source document IDs in
recipe_seed_runs so later broad candidate generation can skip documents that
have already been used.
atari-docs recipes generate-candidates
atari-docs recipes generate-candidates --doc-id <doc_id>
atari-docs recipes generate-candidates --write
Selective AI Annotation
annotate-ai is an optional offline-import enrichment step. It selects
targets already marked by local rules, estimates cost before running, calls a
provider only when explicitly run, then stores cached AI annotations in SQLite.
The MCP server does not call AI APIs at runtime.
Default provider/model:
- provider: openai
- model: gpt-5.4-nano
- escalation model for future use: gpt-5.4-mini
OpenAI pricing used by the cost estimator, as published by OpenAI on 2026-05-20:
- gpt-5.4-nano: $0.20 / 1M input tokens, $1.25 / 1M output tokens
- gpt-5.4-mini: $0.75 / 1M input tokens, $4.50 / 1M output tokens
Candidate and cost review:
atari-docs annotate-ai candidates --limit 10
atari-docs annotate-ai dry-run-cost --limit 10
atari-docs annotate-ai --select-candidates --limit 10
atari-docs annotate-ai --dry-run-cost --limit 10
Live OpenAI annotation requires OPENAI_API_KEY and enforces a max-cost guard:
atari-docs annotate-ai run --limit 5 --max-cost 5.00
atari-docs annotate-ai run --doc-id <doc_id> --only-needs-review --max-cost 1.00
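The arithmetic behind a cost estimate and a max-cost guard is straightforward. The sketch below uses the per-million-token prices listed in this README; the function names and the choice to raise an exception are assumptions, not the CLI's actual internals.

```python
# USD per 1M tokens, copied from the pricing listed in this README.
PRICING = {
    "gpt-5.4-nano": {"input": 0.20, "output": 1.25},
    "gpt-5.4-mini": {"input": 0.75, "output": 4.50},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated cost in USD for one batch of requests."""
    price = PRICING[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000


def check_budget(estimated: float, max_cost: float) -> None:
    """Refuse to proceed when the estimate exceeds the budget (hypothetical guard)."""
    if estimated > max_cost:
        raise RuntimeError(
            f"estimated ${estimated:.4f} exceeds --max-cost ${max_cost:.2f}"
        )
```

For example, 1,000,000 input tokens and 100,000 output tokens on gpt-5.4-nano come to 0.20 + 0.125 = 0.325 USD, which passes a 1.00 guard.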
Inspect cached AI annotations:
atari-docs annotate-ai show-cache
atari-docs annotate-ai show-cache <target_id>
Optional Embeddings
Embeddings are an optional secondary search index. SQLite remains the source of
truth, and FTS/exact lookup continue to work without vectors. sqlite-vec is
the first vector backend, but it is isolated behind a replaceable VectorIndex
interface because sqlite-vec is pre-1.0 and may change. The generated
sqlite-vec virtual table is disposable and rebuildable from embedding_items.
Install vector dependencies with:
python3 -m pip install -e ".[vector]"
Default provider/model:
- provider: openai
- model: text-embedding-3-small
- dimensions: 1536
OpenAI embedding pricing used by the estimator, as checked on 2026-05-20:
- text-embedding-3-small: $0.02 / 1M input tokens
- text-embedding-3-large: $0.13 / 1M input tokens
Review candidates and costs before generating vectors. prepare records
deterministic metadata only; later generate still embeds prepared rows that do
not yet have vectors:
atari-docs embeddings models
atari-docs embeddings prepare --target-type recipe --limit 10
atari-docs embeddings candidates --target-type recipe --limit 10
atari-docs embeddings dry-run-cost --target-type recipe --limit 10
Generate embeddings only when explicitly requested. OPENAI_API_KEY is
required for the OpenAI provider and --max-cost guards total spend:
atari-docs embeddings generate --target-type recipe --limit 5 --max-cost 1.00
atari-docs embeddings rebuild --backend sqlite-vec
atari-docs embeddings status
Search stored vectors:
atari-docs embeddings search "install a DLI" --target-type recipe --limit 5
If sqlite-vec is not installed, vector backend rebuilds report a clear unavailable message. The code still imports, MCP still starts, and vector search can use the local cosine fallback for stored vectors in development.
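A local cosine fallback needs nothing beyond the standard library. This is a brute-force sketch over vectors already loaded into memory; the function names and the dict-of-vectors input are assumptions for illustration.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query: Sequence[float], stored: dict[str, Sequence[float]], k: int = 5
) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the best k."""
    scored = [(item_id, cosine_similarity(query, vec)) for item_id, vec in stored.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

A linear scan like this is fine for development-sized corpora; a dedicated vector backend matters once the number of stored vectors grows.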
search_docs performs hybrid retrieval: exact/alias signals, recipes, FTS, and
optional vector hits are merged and reranked. If vectors or the embedding
provider are unavailable, normal FTS/exact/recipe search still works.
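The README does not say how the signals are merged. One common way to combine ranked lists from different retrievers is reciprocal rank fusion, sketched here as a stand-in; it is not necessarily what search_docs does, and the source names in the usage note are hypothetical.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: dict[str, list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge ranked ID lists; each source adds 1 / (k + rank) to an item's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the output is deterministic.
    return sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
```

Passing `{"exact": [...], "fts": [...], "vector": []}` shows the degradation property described above: an empty or missing vector list contributes nothing, and the remaining sources still produce a ranking.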
MCP Tools
atari-docs serve-mcp exposes these read-only tools:
- search_docs(query, topic=None, doc_id=None, limit=10)
- get_page(doc_id, page_index)
- get_section(section_id)
- lookup_register(name_or_address)
- lookup_memory_address(address)
- lookup_sio_command(query)
- lookup_symbol(name)
- lookup_table(table_id)
- find_code_examples(query, language=None)
- find_recipes(query, subsystem=None, language=None, limit=10)
- get_recipe(recipe_id)
- semantic_search(query, target_types=None, limit=10)
Example MCP client command:
{
"command": "atari-docs",
"args": ["serve-mcp"]
}
If your client does not inherit the virtualenv environment, point it at the
installed script inside .venv/bin/atari-docs.
OCR Provider Boundary
OCR providers implement OcrProvider.run(OcrRequest). Add new providers under
src/atari_docs/ocr/ and register them in src/atari_docs/ocr_runner.py.
All providers should preserve raw provider JSON and write derived Markdown,
pages, confidence scores, and provenance through the same storage path.
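The provider boundary could be expressed as a structural interface along these lines. Only the names `OcrProvider`, `OcrRequest`, and `run` come from this README; the request fields, the dict return type, and the `EchoProvider` class are assumptions for illustration.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass(frozen=True)
class OcrRequest:
    # Field names are assumptions; the real request may carry more options.
    doc_id: str
    pdf_path: Path


class OcrProvider(Protocol):
    def run(self, request: OcrRequest) -> dict:
        """Return the raw provider response as a JSON-serializable dict."""
        ...


class EchoProvider:
    """Toy provider that returns one empty page; useful only for wiring tests."""

    def run(self, request: OcrRequest) -> dict:
        return {"doc_id": request.doc_id, "pages": [{"index": 0, "markdown": ""}]}
```

Returning the raw response untouched leaves it to the shared storage path to write the JSON and derive Markdown, which matches the rule that all providers go through the same storage code.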
Project Docs
Additional design and workflow notes live under docs/.
Agent-specific guidance is in AGENTS.md.