AGENTS.md
Project Summary
ebookm is a Rust workspace for compiling a set of Substack posts and local HTML files into a single EPUB.
Current workspace layout:
- ebookm-core: Core library for manifest parsing, source loading, extraction, normalization, TOC/link logic, and EPUB generation.
- ebookm-cli: Thin CLI wrapper around ebookm-core.
Primary user workflow:
cargo run -p ebookm-cli -- build -m <manifest>
Key Files
- Cargo.toml: Workspace manifest.
- ebookm-core/src/manifest.rs: YAML manifest schema and defaults.
- ebookm-core/src/source.rs: Source loading for Substack URLs and local HTML files.
- ebookm-core/src/extract.rs: Metadata/body extraction, including Substack-specific selectors.
- ebookm-core/src/normalize.rs: HTML cleanup, local/remote image bundling, link rewriting, XHTML-safe output conversion.
- ebookm-core/src/pipeline.rs: Main build orchestration and chapter generation.
- ebookm-core/src/epub.rs: EPUB packaging, nav.xhtml and toc.ncx generation.
- ebookm-core/src/template.rs: Starter manifest template used by ebookm init.
- README.md: User-facing docs and manifest reference.
Current Manifest Semantics
Top-level manifest keys:
- book
- output
- defaults
- sections
- entries
- link_rules
Supported source kinds:
- substack: Public Substack post URL.
- html: Local HTML file path, resolved relative to the manifest.
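The "resolved relative to the manifest" rule for html sources can be sketched as follows. This is a minimal illustration using only the Rust standard library; the helper name `resolve_relative_to` and the choice to pass absolute paths through unchanged are assumptions, not the actual code in ebookm-core/src/source.rs.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve `reference` against the directory that
/// contains `base_file` (for example, the manifest file).
/// Absolute references are returned unchanged (an assumed policy).
fn resolve_relative_to(base_file: &Path, reference: &str) -> PathBuf {
    let reference = Path::new(reference);
    if reference.is_absolute() {
        return reference.to_path_buf();
    }
    match base_file.parent() {
        // `parent()` of "dir/book.yaml" is "dir"; of "book.yaml" it is "".
        Some(dir) => dir.join(reference),
        None => reference.to_path_buf(),
    }
}

fn main() {
    let manifest = Path::new("ageofpeace/ageofpeace.yaml");
    let html = resolve_relative_to(manifest, "introduction.html");
    assert_eq!(html, PathBuf::from("ageofpeace/introduction.html"));
    println!("{}", html.display());
}
```

Resolving against the manifest's directory rather than the process working directory means a build should behave the same regardless of where the CLI is invoked from.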
Important processing options:
- defaults.processing.include_author
- defaults.processing.include_date
- defaults.processing.include_source_url
- defaults.processing.skip_first_paragraphs
- per-entry overrides under entries.<id>.processing
Current defaults:
- include_author: true
- include_date: true
- include_source_url: true
- skip_first_paragraphs: 0
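One way to picture how the defaults and per-entry overrides might combine is sketched below. The field names follow the manifest keys listed above, but the type names `Processing` and `ProcessingOverride`, the method `with_override`, and the field-by-field inheritance rule are assumptions for illustration; the real types in ebookm-core/src/manifest.rs may differ.

```rust
/// Resolved chapter-header options; field names mirror the manifest keys.
#[derive(Debug, Clone, PartialEq)]
struct Processing {
    include_author: bool,
    include_date: bool,
    include_source_url: bool,
    skip_first_paragraphs: usize,
}

impl Default for Processing {
    /// The documented defaults: all header parts shown, no paragraphs skipped.
    fn default() -> Self {
        Self {
            include_author: true,
            include_date: true,
            include_source_url: true,
            skip_first_paragraphs: 0,
        }
    }
}

/// Hypothetical per-entry override; `None` means "inherit the default".
#[derive(Debug, Clone, Default)]
struct ProcessingOverride {
    include_author: Option<bool>,
    include_date: Option<bool>,
    include_source_url: Option<bool>,
    skip_first_paragraphs: Option<usize>,
}

impl Processing {
    /// Apply an override field by field (assumed merge semantics).
    fn with_override(&self, o: &ProcessingOverride) -> Processing {
        Processing {
            include_author: o.include_author.unwrap_or(self.include_author),
            include_date: o.include_date.unwrap_or(self.include_date),
            include_source_url: o.include_source_url.unwrap_or(self.include_source_url),
            skip_first_paragraphs: o
                .skip_first_paragraphs
                .unwrap_or(self.skip_first_paragraphs),
        }
    }
}

fn main() {
    let defaults = Processing::default();
    // An entry that hides the date and drops two leading paragraphs.
    let entry = ProcessingOverride {
        include_date: Some(false),
        skip_first_paragraphs: Some(2),
        ..Default::default()
    };
    let resolved = defaults.with_override(&entry);
    println!("{resolved:?}");
}
```

Modelling each override as an `Option` keeps "not specified" distinct from "explicitly false", which matters if existing manifests are to stay backward compatible when new options are added.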
Current EPUB Behavior
- Section structure is emitted into both nav.xhtml and toc.ncx.
- Chapter header content is configurable: author, date, and canonical URL can each be independently shown/hidden.
- Local HTML images are bundled when fetch_images: true.
- Local image paths are resolved relative to the HTML file, not the manifest.
- Remote images from Substack pages are also bundled when fetch_images: true.
- Generated chapter XHTML is post-processed to self-close HTML void tags like img, hr, and br for EPUB/XML compatibility.
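The void-tag post-processing step can be illustrated with a small standard-library sketch. This is not the implementation in ebookm-core/src/normalize.rs: the function name `self_close_void_tags` is hypothetical, only the three tags named above are handled, and the scanner naively assumes no `>` appears inside attribute values or comments.

```rust
/// Void tags handled by this sketch. The HTML specification defines more
/// void elements; only the three named in this document are listed here.
const VOID_TAGS: [&str; 3] = ["img", "hr", "br"];

/// Hypothetical helper: rewrite `<br>` as `<br />`, `<img ...>` as
/// `<img ... />`, and so on. Tags that are already self-closed, closing
/// tags, and all other markup are copied through unchanged.
fn self_close_void_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len() + 16);
    let mut rest = html;
    while let Some(start) = rest.find('<') {
        // Copy the text that precedes the tag.
        out.push_str(&rest[..start]);
        let tail = &rest[start..];
        let Some(end) = tail.find('>') else {
            // Unterminated tag: copy the remainder verbatim and stop.
            out.push_str(tail);
            return out;
        };
        let tag = &tail[..=end]; // "<...>" including both brackets
        let inner = &tag[1..tag.len() - 1];
        // Tag name: leading alphanumerics. Empty for "</x>" and "<!-- -->".
        let name = inner
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric())
            .collect::<String>()
            .to_ascii_lowercase();
        let is_void = VOID_TAGS.iter().any(|t| *t == name.as_str());
        if is_void && !inner.trim_end().ends_with('/') {
            out.push('<');
            out.push_str(inner.trim_end());
            out.push_str(" />");
        } else {
            out.push_str(tag);
        }
        rest = &tail[end + 1..];
    }
    out.push_str(rest);
    out
}

fn main() {
    let input = r#"<p>a<br>b<img src="x.jpg"></p>"#;
    let output = self_close_void_tags(input);
    assert_eq!(output, r#"<p>a<br />b<img src="x.jpg" /></p>"#);
    println!("{output}");
}
```

A string scanner like this is only a stand-in; a production pipeline would more plausibly serialize from a parsed DOM, and the xmllint check described under Validation and Debugging remains the way to confirm that the output is well-formed.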
Known Implementation Boundaries
- Substack handling is tuned to current public page structure, especially:
  - .available-content .body.markup
  - .post-title
  - JSON-LD datePublished
- Subscriber-only/authenticated Substack content is not implemented.
- CSS background images are not bundled.
- Manifest fields like subtitle, summary, and tags are parsed but only partially used.
- rewrite_external_substack_links and preserve_other_external_links exist in the manifest schema but are not deeply wired into behavior yet.
Validation and Debugging
Run tests:
cargo test
Build a manifest:
cargo run -p ebookm-cli -- build -m ageofpeace/ageofpeace.yaml
Inspect extracted source metadata:
cargo run -p ebookm-cli -- inspect <url-or-file>
Validate generated XHTML quickly:
unzip -p path/to/book.epub OEBPS/text/chapter.xhtml | xmllint --noout -
Validate the full EPUB package:
epubcheck path/to/book.epub
Useful inspection commands:
unzip -l path/to/book.epub
unzip -p path/to/book.epub OEBPS/nav.xhtml
unzip -p path/to/book.epub OEBPS/toc.ncx
unzip -p path/to/book.epub OEBPS/text/<entry>.xhtml
Existing Real Example
The repository contains a real working manifest:
ageofpeace/ageofpeace.yaml
Related local content/assets:
- ageofpeace/introduction.html
- ageofpeace/johngu.jpg
- ageofpeace/age_of_peace_cover.jpg
This is the best regression case for:
- mixed local HTML + Substack sources
- cover image handling
- local image bundling
- section TOC nesting
- chapter-header processing options
Guidance For Future Agents
- Preserve manifest backward compatibility unless there is a strong reason not to.
- If TOC behavior changes, verify both nav.xhtml and toc.ncx.
- If HTML normalization changes, verify generated XHTML with xmllint.
- If image handling changes, test both:
  - local HTML image references
  - remote Substack image references
- Prefer extending ebookm-core behavior and keeping ebookm-cli thin.
- Update README.md whenever user-facing manifest fields or behavior change.