# AGENTS.md

## Project Summary

`ebookm` is a Rust workspace for compiling a set of Substack posts and local HTML files into a single EPUB.

Current workspace layout:

- `ebookm-core`
  Core library: manifest parsing, source loading, extraction, normalization, TOC/link logic, EPUB generation.
- `ebookm-cli`
  Thin CLI wrapper around `ebookm-core`.

Primary user workflow:

```bash
cargo run -p ebookm-cli -- build -m
```

## Key Files

- `Cargo.toml`
  Workspace manifest.
- `ebookm-core/src/manifest.rs`
  YAML manifest schema and defaults.
- `ebookm-core/src/source.rs`
  Source loading for Substack URLs and local HTML files.
- `ebookm-core/src/extract.rs`
  Metadata/body extraction, including Substack-specific selectors.
- `ebookm-core/src/normalize.rs`
  HTML cleanup, local/remote image bundling, link rewriting, XHTML-safe output conversion.
- `ebookm-core/src/pipeline.rs`
  Main build orchestration and chapter generation.
- `ebookm-core/src/epub.rs`
  EPUB packaging, `nav.xhtml` and `toc.ncx` generation.
- `ebookm-core/src/template.rs`
  Starter manifest template used by `ebookm init`.
- `README.md`
  User-facing docs and manifest reference.

## Current Manifest Semantics

Top-level manifest keys:

- `book`
- `output`
- `defaults`
- `sections`
- `entries`
- `link_rules`

Supported source kinds:

- `substack`
  Public Substack post URL.
- `html`
  Local HTML file path, resolved relative to the manifest.

Important processing options:

- `defaults.processing.include_author`
- `defaults.processing.include_date`
- `defaults.processing.include_source_url`
- `defaults.processing.skip_first_paragraphs`
- per-entry overrides under `entries..processing`

Current defaults:

- `include_author: true`
- `include_date: true`
- `include_source_url: true`
- `skip_first_paragraphs: 0`

## Current EPUB Behavior

- Section structure is emitted into both `nav.xhtml` and `toc.ncx`.
- Chapter header content is configurable: author, date, and canonical URL can each be independently shown/hidden.
- Local HTML images are bundled when `fetch_images: true`.
- Local image paths are resolved relative to the HTML file, not the manifest.
- Remote images from Substack pages are also bundled when `fetch_images: true`.
- Generated chapter XHTML is post-processed to self-close HTML void tags like `img`, `hr`, and `br` for EPUB/XML compatibility.

## Known Implementation Boundaries

- Substack handling is tuned to current public page structure, especially:
  - `.available-content .body.markup`
  - `.post-title`
  - JSON-LD `datePublished`
- Subscriber-only/authenticated Substack content is not implemented.
- CSS background images are not bundled.
- Manifest fields like `subtitle`, `summary`, and `tags` are parsed but only partially used.
- `rewrite_external_substack_links` and `preserve_other_external_links` exist in the manifest schema but are not deeply wired into behavior yet.

## Validation and Debugging

Run tests:

```bash
cargo test
```

Build a manifest:

```bash
cargo run -p ebookm-cli -- build -m ageofpeace/ageofpeace.yaml
```

Inspect extracted source metadata:

```bash
cargo run -p ebookm-cli -- inspect
```

Validate generated XHTML quickly:

```bash
unzip -p path/to/book.epub OEBPS/text/chapter.xhtml | xmllint --noout -
```

Validate the full EPUB package:

```bash
epubcheck path/to/book.epub
```

Useful inspection commands:

```bash
unzip -l path/to/book.epub
unzip -p path/to/book.epub OEBPS/nav.xhtml
unzip -p path/to/book.epub OEBPS/toc.ncx
unzip -p path/to/book.epub OEBPS/text/.xhtml
```

## Existing Real Example

The repository contains a real working manifest:

- `ageofpeace/ageofpeace.yaml`

Related local content/assets:

- `ageofpeace/introduction.html`
- `ageofpeace/johngu.jpg`
- `ageofpeace/age_of_peace_cover.jpg`

This is the best regression case for:

- mixed local HTML + Substack sources
- cover image handling
- local image bundling
- section TOC nesting
- chapter-header processing options

## Guidance For Future Agents

- Preserve manifest backward compatibility unless there is
  a strong reason not to.
- If TOC behavior changes, verify both `nav.xhtml` and `toc.ncx`.
- If HTML normalization changes, verify generated XHTML with `xmllint`.
- If image handling changes, test both:
  - local HTML image references
  - remote Substack image references
- Prefer extending `ebookm-core` behavior and keeping `ebookm-cli` thin.
- Update `README.md` whenever user-facing manifest fields or behavior change.
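
The backward-compatibility guidance above fits the defaults-plus-overrides model described under Current Manifest Semantics. The following is a minimal sketch of that idea, not the code in `ebookm-core/src/manifest.rs`: the `Processing` and `ProcessingOverride` types and the `apply` method are hypothetical names introduced here for illustration. It assumes that per-entry overrides are optional fields that fall back to the documented defaults, which is one way an older manifest could keep resolving after a new option is added.

```rust
/// Fully resolved chapter-header options.
/// Hypothetical type for illustration; the real schema may differ.
#[derive(Debug, Clone, PartialEq)]
pub struct Processing {
    pub include_author: bool,
    pub include_date: bool,
    pub include_source_url: bool,
    pub skip_first_paragraphs: usize,
}

impl Default for Processing {
    /// Mirrors the values listed under "Current defaults".
    fn default() -> Self {
        Self {
            include_author: true,
            include_date: true,
            include_source_url: true,
            skip_first_paragraphs: 0,
        }
    }
}

/// Per-entry overrides (hypothetical). Every field is optional, so an
/// entry that omits a field simply inherits the base value.
#[derive(Debug, Clone, Default, PartialEq)]
pub struct ProcessingOverride {
    pub include_author: Option<bool>,
    pub include_date: Option<bool>,
    pub include_source_url: Option<bool>,
    pub skip_first_paragraphs: Option<usize>,
}

impl ProcessingOverride {
    /// Layer this override on top of `base`; unset fields fall through.
    pub fn apply(&self, base: &Processing) -> Processing {
        Processing {
            include_author: self.include_author.unwrap_or(base.include_author),
            include_date: self.include_date.unwrap_or(base.include_date),
            include_source_url: self
                .include_source_url
                .unwrap_or(base.include_source_url),
            skip_first_paragraphs: self
                .skip_first_paragraphs
                .unwrap_or(base.skip_first_paragraphs),
        }
    }
}
```

Under these assumptions, an entry that sets only `include_date: Some(false)` would hide the date and inherit everything else, and an empty override would resolve to the defaults unchanged. Adding a new option then means adding one field with a default and one `Option` field, without touching existing manifests.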