diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..279e942
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,165 @@
+# AGENTS.md
+
+## Project Summary
+
+`ebookm` is a Rust workspace for compiling a set of Substack posts and local HTML files into a single EPUB.
+
+Current workspace layout:
+
+- `ebookm-core`
+  Core library: manifest parsing, source loading, extraction, normalization, TOC/link logic, EPUB generation.
+- `ebookm-cli`
+  Thin CLI wrapper around `ebookm-core`.
+
+Primary user workflow:
+
+```bash
+cargo run -p ebookm-cli -- build -m
+```
+
+## Key Files
+
+- `Cargo.toml`
+  Workspace manifest.
+- `ebookm-core/src/manifest.rs`
+  YAML manifest schema and defaults.
+- `ebookm-core/src/source.rs`
+  Source loading for Substack URLs and local HTML files.
+- `ebookm-core/src/extract.rs`
+  Metadata/body extraction, including Substack-specific selectors.
+- `ebookm-core/src/normalize.rs`
+  HTML cleanup, local/remote image bundling, link rewriting, XHTML-safe output conversion.
+- `ebookm-core/src/pipeline.rs`
+  Main build orchestration and chapter generation.
+- `ebookm-core/src/epub.rs`
+  EPUB packaging, nav.xhtml and toc.ncx generation.
+- `ebookm-core/src/template.rs`
+  Starter manifest template used by `ebookm init`.
+- `README.md`
+  User-facing docs and manifest reference.
+
+## Current Manifest Semantics
+
+Top-level manifest keys:
+
+- `book`
+- `output`
+- `defaults`
+- `sections`
+- `entries`
+- `link_rules`
+
+Supported source kinds:
+
+- `substack`
+  Public Substack post URL.
+- `html`
+  Local HTML file path, resolved relative to the manifest.
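The manifest-relative resolution of `html` sources described above can be illustrated with a minimal sketch. `resolve_relative_to` is a hypothetical helper written for this note, not a function taken from `ebookm-core`; passing absolute paths through unchanged is an assumption, not documented behavior.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve `reference` against the directory that
/// contains `base_file` (for an `html` source, the manifest file).
fn resolve_relative_to(base_file: &Path, reference: &str) -> PathBuf {
    // `parent()` yields the containing directory; for a bare file name
    // such as "book.yaml" it is the empty path, i.e. the current directory.
    let base_dir = base_file.parent().unwrap_or_else(|| Path::new(""));
    // `Path::join` replaces the base entirely when `reference` is absolute,
    // so absolute paths pass through unchanged (an assumption of this sketch).
    base_dir.join(reference)
}
```

For example, with the manifest at `ageofpeace/ageofpeace.yaml`, a hypothetical entry path of `introduction.html` would resolve to `ageofpeace/introduction.html`.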
+
+Important processing options:
+
+- `defaults.processing.include_author`
+- `defaults.processing.include_date`
+- `defaults.processing.include_source_url`
+- `defaults.processing.skip_first_paragraphs`
+- per-entry overrides under `entries..processing`
+
+Current defaults:
+
+- `include_author: true`
+- `include_date: true`
+- `include_source_url: true`
+- `skip_first_paragraphs: 0`
+
+## Current EPUB Behavior
+
+- Section structure is emitted into both `nav.xhtml` and `toc.ncx`.
+- Chapter header content is configurable:
+  author, date, and canonical URL can each be independently shown/hidden.
+- Local HTML images are bundled when `fetch_images: true`.
+- Local image paths are resolved relative to the HTML file, not the manifest.
+- Remote images from Substack pages are also bundled when `fetch_images: true`.
+- Generated chapter XHTML is post-processed to self-close HTML void tags like `img`, `hr`, and `br` for EPUB/XML compatibility.
+
+## Known Implementation Boundaries
+
+- Substack handling is tuned to current public page structure, especially:
+  `.available-content .body.markup`
+  `.post-title`
+  JSON-LD `datePublished`
+- Subscriber-only/authenticated Substack content is not implemented.
+- CSS background images are not bundled.
+- Manifest fields like `subtitle`, `summary`, and `tags` are parsed but only partially used.
+- `rewrite_external_substack_links` and `preserve_other_external_links` exist in the manifest schema but are not deeply wired into behavior yet.
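The void-tag post-processing listed under Current EPUB Behavior can be sketched roughly as follows. This is an illustration only, not the code in `ebookm-core/src/normalize.rs`: `self_close_void_tags` and `VOID_TAGS` are hypothetical names, and the scanner assumes that no `>` appears inside attribute values or comments.

```rust
/// Void elements covered by this sketch (the three named in the list above).
const VOID_TAGS: [&str; 3] = ["img", "hr", "br"];

/// Hypothetical helper: rewrite `<img ...>` style tags as `<img ... />`
/// so that the result can be parsed as XML.
fn self_close_void_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len() + 16);
    let mut rest = html;
    while let Some(start) = rest.find('<') {
        // Copy the text before the tag unchanged.
        out.push_str(&rest[..start]);
        let after = &rest[start..];
        let Some(end) = after.find('>') else {
            // Unterminated tag: copy the remainder as-is and stop.
            out.push_str(after);
            return out;
        };
        let tag = &after[..=end]; // includes '<' and '>'
        let inner = &tag[1..tag.len() - 1];
        // The tag name is the leading run of alphanumerics; closing tags
        // and comments start with '/' or '!' and yield an empty name.
        let name = inner
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric())
            .collect::<String>()
            .to_ascii_lowercase();
        let is_void = VOID_TAGS.contains(&name.as_str());
        if is_void && !inner.trim_end().ends_with('/') {
            out.push('<');
            out.push_str(inner.trim_end());
            out.push_str(" />");
        } else {
            // Non-void or already self-closed: keep the tag untouched.
            out.push_str(tag);
        }
        rest = &after[end + 1..];
    }
    out.push_str(rest);
    out
}
```

Under these assumptions, `self_close_void_tags("<p>a<br>b</p>")` yields `<p>a<br />b</p>`, and tags that are already self-closed pass through unchanged. Real input is better handled by an HTML parser; checking the generated chapters with `xmllint` remains the authoritative test.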
+
+## Validation and Debugging
+
+Run tests:
+
+```bash
+cargo test
+```
+
+Build a manifest:
+
+```bash
+cargo run -p ebookm-cli -- build -m ageofpeace/ageofpeace.yaml
+```
+
+Inspect extracted source metadata:
+
+```bash
+cargo run -p ebookm-cli -- inspect
+```
+
+Validate generated XHTML quickly:
+
+```bash
+unzip -p path/to/book.epub OEBPS/text/chapter.xhtml | xmllint --noout -
+```
+
+Validate the full EPUB package:
+
+```bash
+epubcheck path/to/book.epub
+```
+
+Useful inspection commands:
+
+```bash
+unzip -l path/to/book.epub
+unzip -p path/to/book.epub OEBPS/nav.xhtml
+unzip -p path/to/book.epub OEBPS/toc.ncx
+unzip -p path/to/book.epub OEBPS/text/.xhtml
+```
+
+## Existing Real Example
+
+The repository contains a real working manifest:
+
+- `ageofpeace/ageofpeace.yaml`
+
+Related local content/assets:
+
+- `ageofpeace/introduction.html`
+- `ageofpeace/johngu.jpg`
+- `ageofpeace/age_of_peace_cover.jpg`
+
+This is the best regression case for:
+
+- mixed local HTML + Substack sources
+- cover image handling
+- local image bundling
+- section TOC nesting
+- chapter-header processing options
+
+## Guidance For Future Agents
+
+- Preserve manifest backward compatibility unless there is a strong reason not to.
+- If TOC behavior changes, verify both `nav.xhtml` and `toc.ncx`.
+- If HTML normalization changes, verify generated XHTML with `xmllint`.
+- If image handling changes, test both:
+  local HTML image references
+  remote Substack image references
+- Prefer extending `ebookm-core` behavior and keeping `ebookm-cli` thin.
+- Update `README.md` whenever user-facing manifest fields or behavior change.