2026-05-25 17:13:27 +02:00

AGENTS.md

Project Summary

ebookm is a Rust workspace for compiling a set of Substack posts and local HTML files into a single EPUB.

Current workspace layout:

  • ebookm-core: Core library (manifest parsing, source loading, extraction, normalization, TOC/link logic, EPUB generation).
  • ebookm-cli: Thin CLI wrapper around ebookm-core.

Primary user workflow:

cargo run -p ebookm-cli -- build -m <manifest>

Key Files

  • Cargo.toml: Workspace manifest.
  • ebookm-core/src/manifest.rs: YAML manifest schema and defaults.
  • ebookm-core/src/source.rs: Source loading for Substack URLs and local HTML files.
  • ebookm-core/src/extract.rs: Metadata/body extraction, including Substack-specific selectors.
  • ebookm-core/src/normalize.rs: HTML cleanup, local/remote image bundling, link rewriting, XHTML-safe output conversion.
  • ebookm-core/src/pipeline.rs: Main build orchestration and chapter generation.
  • ebookm-core/src/epub.rs: EPUB packaging, nav.xhtml and toc.ncx generation.
  • ebookm-core/src/template.rs: Starter manifest template used by ebookm init.
  • README.md: User-facing docs and manifest reference.

Current Manifest Semantics

Top-level manifest keys:

  • book
  • output
  • defaults
  • sections
  • entries
  • link_rules

Supported source kinds:

  • substack: Public Substack post URL.
  • html: Local HTML file path, resolved relative to the manifest.
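
The path rule for html sources can be sketched as below. This is a minimal illustration, not the code in ebookm-core/src/source.rs: the function name resolve_html_source is hypothetical, and the sketch assumes "relative to the manifest" means relative to the directory that contains the manifest file.

```rust
use std::path::{Path, PathBuf};

/// Resolve an `html` source path against the directory containing the manifest.
/// Hypothetical helper; the real loader may differ.
///
/// `Path::join` replaces the base when its argument is absolute, so absolute
/// source paths pass through unchanged.
pub fn resolve_html_source(manifest_path: &Path, source: &str) -> PathBuf {
    // A bare file name such as "book.yaml" has an empty parent, which joins cleanly.
    let base = manifest_path.parent().unwrap_or_else(|| Path::new(""));
    base.join(source)
}
```

Under these assumptions, a manifest at ageofpeace/ageofpeace.yaml that names introduction.html as a source would load ageofpeace/introduction.html, regardless of the directory the CLI is run from.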

Important processing options:

  • defaults.processing.include_author
  • defaults.processing.include_date
  • defaults.processing.include_source_url
  • defaults.processing.skip_first_paragraphs
  • per-entry overrides under entries.<id>.processing

Current defaults:

  • include_author: true
  • include_date: true
  • include_source_url: true
  • skip_first_paragraphs: 0
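
One way to picture how the defaults and per-entry overrides combine is the sketch below. The type names Processing and ProcessingOverride and the method with_override are hypothetical, and the field-by-field merge (an entry value wins, an absent value inherits the default) is an assumption about the intended semantics rather than a description of ebookm-core/src/manifest.rs.

```rust
/// Resolved chapter-header options. Field names mirror the manifest keys;
/// the struct itself is a hypothetical illustration.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Processing {
    pub include_author: bool,
    pub include_date: bool,
    pub include_source_url: bool,
    pub skip_first_paragraphs: usize,
}

impl Default for Processing {
    /// The defaults listed above.
    fn default() -> Self {
        Self {
            include_author: true,
            include_date: true,
            include_source_url: true,
            skip_first_paragraphs: 0,
        }
    }
}

/// Per-entry overrides (entries.<id>.processing). `None` means "inherit".
#[derive(Debug, Clone, Copy, Default, PartialEq, Eq)]
pub struct ProcessingOverride {
    pub include_author: Option<bool>,
    pub include_date: Option<bool>,
    pub include_source_url: Option<bool>,
    pub skip_first_paragraphs: Option<usize>,
}

impl Processing {
    /// Apply an entry's overrides on top of `self`, field by field (assumed semantics).
    pub fn with_override(self, o: ProcessingOverride) -> Self {
        Self {
            include_author: o.include_author.unwrap_or(self.include_author),
            include_date: o.include_date.unwrap_or(self.include_date),
            include_source_url: o.include_source_url.unwrap_or(self.include_source_url),
            skip_first_paragraphs: o
                .skip_first_paragraphs
                .unwrap_or(self.skip_first_paragraphs),
        }
    }
}
```

With this shape, an entry that sets only include_date: false and skip_first_paragraphs: 2 would still show the author and source URL, since those fields fall back to the defaults.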

Current EPUB Behavior

  • Section structure is emitted into both nav.xhtml and toc.ncx.
  • Chapter header content is configurable: author, date, and canonical URL can each be independently shown/hidden.
  • Local HTML images are bundled when fetch_images: true.
  • Local image paths are resolved relative to the HTML file, not the manifest.
  • Remote images from Substack pages are also bundled when fetch_images: true.
  • Generated chapter XHTML is post-processed to self-close HTML void tags like img, hr, and br for EPUB/XML compatibility.
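
The void-tag post-processing step can be illustrated with a small standard-library-only sketch. It is not the implementation in ebookm-core/src/normalize.rs: the function name self_close_void_tags is hypothetical, only img, hr, and br are handled, and the scan assumes no attribute value contains a literal ">" character.

```rust
/// Void tags rewritten by this sketch (the three named in the bullet above).
const VOID_TAGS: [&str; 3] = ["img", "hr", "br"];

/// Rewrite `<img ...>`, `<hr>`, and `<br>` as self-closed tags so the output
/// parses as XML. Hypothetical helper; a naive scan, not an HTML parser.
pub fn self_close_void_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len() + 16);
    let mut rest = html;
    while let Some(start) = rest.find('<') {
        // Copy the text before the tag unchanged.
        out.push_str(&rest[..start]);
        let after = &rest[start..];
        // Without a closing '>' there is no complete tag left to rewrite.
        let Some(end) = after.find('>') else {
            out.push_str(after);
            return out;
        };
        let tag = &after[..=end];
        let inner = &tag[1..tag.len() - 1];
        // The tag name is the leading run of ASCII alphanumerics, lowercased.
        let name = inner
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric())
            .collect::<String>()
            .to_ascii_lowercase();
        if VOID_TAGS.contains(&name.as_str()) && !inner.trim_end().ends_with('/') {
            out.push('<');
            out.push_str(inner.trim_end());
            out.push_str(" />");
        } else {
            // Non-void tags, closing tags, and already self-closed tags pass through.
            out.push_str(tag);
        }
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    out
}
```

Output from a step like this is what the xmllint check in the validation section is meant to catch regressions in: an unclosed img or br makes the chapter fail XML parsing.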

Known Implementation Boundaries

  • Substack handling is tuned to current public page structure, especially:
      • .available-content
      • .body.markup
      • .post-title
      • JSON-LD datePublished
  • Subscriber-only/authenticated Substack content is not implemented.
  • CSS background images are not bundled.
  • Manifest fields like subtitle, summary, and tags are parsed but only partially used.
  • rewrite_external_substack_links and preserve_other_external_links exist in the manifest schema but are not yet fully wired into behavior.

Validation and Debugging

Run tests:

cargo test

Build a manifest:

cargo run -p ebookm-cli -- build -m ageofpeace/ageofpeace.yaml

Inspect extracted source metadata:

cargo run -p ebookm-cli -- inspect <url-or-file>

Validate generated XHTML quickly:

unzip -p path/to/book.epub OEBPS/text/chapter.xhtml | xmllint --noout -

Validate the full EPUB package:

epubcheck path/to/book.epub

Useful inspection commands:

unzip -l path/to/book.epub
unzip -p path/to/book.epub OEBPS/nav.xhtml
unzip -p path/to/book.epub OEBPS/toc.ncx
unzip -p path/to/book.epub OEBPS/text/<entry>.xhtml

Existing Real Example

The repository contains a real working manifest:

  • ageofpeace/ageofpeace.yaml

Related local content/assets:

  • ageofpeace/introduction.html
  • ageofpeace/johngu.jpg
  • ageofpeace/age_of_peace_cover.jpg

This is the best regression case for:

  • mixed local HTML + Substack sources
  • cover image handling
  • local image bundling
  • section TOC nesting
  • chapter-header processing options

Guidance For Future Agents

  • Preserve manifest backward compatibility unless there is a strong reason not to.
  • If TOC behavior changes, verify both nav.xhtml and toc.ncx.
  • If HTML normalization changes, verify generated XHTML with xmllint.
  • If image handling changes, test both:
      • local HTML image references
      • remote Substack image references
  • Prefer extending ebookm-core behavior and keeping ebookm-cli thin.
  • Update README.md whenever user-facing manifest fields or behavior change.