# AGENTS.md
## Project Summary
`ebookm` is a Rust workspace for compiling a set of Substack posts and local HTML files into a single EPUB.
Current workspace layout:
- `ebookm-core`
Core library: manifest parsing, source loading, extraction, normalization, TOC/link logic, EPUB generation.
- `ebookm-cli`
Thin CLI wrapper around `ebookm-core`.
Primary user workflow:
```bash
cargo run -p ebookm-cli -- build -m <manifest>
```
## Key Files
- `Cargo.toml`
Workspace manifest.
- `ebookm-core/src/manifest.rs`
YAML manifest schema and defaults.
- `ebookm-core/src/source.rs`
Source loading for Substack URLs and local HTML files.
- `ebookm-core/src/extract.rs`
Metadata/body extraction, including Substack-specific selectors.
- `ebookm-core/src/normalize.rs`
HTML cleanup, local/remote image bundling, link rewriting, XHTML-safe output conversion.
- `ebookm-core/src/pipeline.rs`
Main build orchestration and chapter generation.
- `ebookm-core/src/epub.rs`
EPUB packaging, nav.xhtml and toc.ncx generation.
- `ebookm-core/src/template.rs`
Starter manifest template used by `ebookm init`.
- `README.md`
User-facing docs and manifest reference.
## Current Manifest Semantics
Top-level manifest keys:
- `book`
- `output`
- `defaults`
- `sections`
- `entries`
- `link_rules`
Supported source kinds:
- `substack`
Public Substack post URL.
- `html`
Local HTML file path, resolved relative to the manifest.
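
The "resolved relative to the manifest" rule for `html` sources can be sketched with the standard library alone. This is a minimal illustration, not the code in `source.rs`: the function name `resolve_beside` is hypothetical, and passing absolute paths through unchanged is an assumption.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve `reference` against the directory that
/// contains `base_file` (for `html` sources, the manifest file).
/// Absolute references are returned unchanged (an assumption).
pub fn resolve_beside(base_file: &Path, reference: &str) -> PathBuf {
    let reference = Path::new(reference);
    if reference.is_absolute() {
        return reference.to_path_buf();
    }
    // `parent()` is `None` only for an empty or root path; fall back to an
    // empty path so the reference is then resolved against the working dir.
    base_file
        .parent()
        .unwrap_or_else(|| Path::new(""))
        .join(reference)
}
```

Under these assumptions, a manifest at `ageofpeace/ageofpeace.yaml` that names `introduction.html` would resolve to `ageofpeace/introduction.html`, regardless of the directory the CLI is run from.
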
Important processing options:
- `defaults.processing.include_author`
- `defaults.processing.include_date`
- `defaults.processing.include_source_url`
- `defaults.processing.skip_first_paragraphs`
- per-entry overrides under `entries.<id>.processing`
Current defaults:
- `include_author: true`
- `include_date: true`
- `include_source_url: true`
- `skip_first_paragraphs: 0`
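
One way to picture how the defaults and per-entry overrides combine is the sketch below. The type names `Processing` and `ProcessingOverride` and the method `with_override` are hypothetical, and the field-by-field merge (an unset field inherits the default) is an assumption about the semantics rather than a description of `manifest.rs`.

```rust
/// Hypothetical mirror of `defaults.processing`; not the real schema type.
#[derive(Debug, Clone, PartialEq)]
pub struct Processing {
    pub include_author: bool,
    pub include_date: bool,
    pub include_source_url: bool,
    pub skip_first_paragraphs: usize,
}

impl Default for Processing {
    /// The defaults listed above.
    fn default() -> Self {
        Self {
            include_author: true,
            include_date: true,
            include_source_url: true,
            skip_first_paragraphs: 0,
        }
    }
}

/// Hypothetical mirror of `entries.<id>.processing`.
/// `None` means "inherit the default" (an assumption).
#[derive(Debug, Clone, Default)]
pub struct ProcessingOverride {
    pub include_author: Option<bool>,
    pub include_date: Option<bool>,
    pub include_source_url: Option<bool>,
    pub skip_first_paragraphs: Option<usize>,
}

impl Processing {
    /// Apply a per-entry override field by field.
    pub fn with_override(&self, o: &ProcessingOverride) -> Processing {
        Processing {
            include_author: o.include_author.unwrap_or(self.include_author),
            include_date: o.include_date.unwrap_or(self.include_date),
            include_source_url: o.include_source_url.unwrap_or(self.include_source_url),
            skip_first_paragraphs: o
                .skip_first_paragraphs
                .unwrap_or(self.skip_first_paragraphs),
        }
    }
}
```

With this shape, an entry that sets only `include_date: false` keeps the author line and source URL, and an entry with no `processing` block behaves exactly like the defaults.
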
## Current EPUB Behavior
- Section structure is emitted into both `nav.xhtml` and `toc.ncx`.
- Chapter header content is configurable:
author, date, and canonical URL can each be independently shown/hidden.
- Local HTML images are bundled when `fetch_images: true`.
- Local image paths are resolved relative to the HTML file, not the manifest.
- Remote images from Substack pages are also bundled when `fetch_images: true`.
- Generated chapter XHTML is post-processed to self-close HTML void tags like `img`, `hr`, and `br` for EPUB/XML compatibility.
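
The void-tag post-processing step can be illustrated with a small standard-library scanner. This is a rough sketch under stated assumptions, not the implementation in `normalize.rs`: the function name `self_close_void_tags` is hypothetical, the tag list covers only the three tags named above, and the scanner assumes no `>` appears inside attribute values or comments.

```rust
/// Only the tags named in this document; other HTML void elements
/// (for example `wbr` or `col`) could be added to the list.
const VOID_TAGS: [&str; 3] = ["img", "hr", "br"];

/// Hypothetical helper: rewrite `<br>` as `<br />`, `<img ...>` as
/// `<img ... />`, and so on. Tags that are already self-closed, closing
/// tags, and non-void tags are copied through unchanged.
pub fn self_close_void_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len() + 16);
    let mut rest = html;
    while let Some(start) = rest.find('<') {
        // Copy the text that precedes the tag.
        out.push_str(&rest[..start]);
        let after = &rest[start..];
        // Locate the end of the tag; an unterminated tag is copied as-is.
        let Some(end) = after.find('>') else {
            out.push_str(after);
            return out;
        };
        let tag = &after[..end]; // e.g. `<img src="x.jpg"` without the `>`
        // The tag name is the run of ASCII alphanumerics right after `<`.
        // Closing tags and comments yield an empty name and never match.
        let name = tag[1..]
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric())
            .collect::<String>()
            .to_ascii_lowercase();
        let is_void = VOID_TAGS.contains(&name.as_str());
        if is_void && !tag.trim_end().ends_with('/') {
            out.push_str(tag.trim_end());
            out.push_str(" />");
        } else {
            out.push_str(tag);
            out.push('>');
        }
        rest = &after[end + 1..];
    }
    // Copy any trailing text after the last tag.
    out.push_str(rest);
    out
}
```

Running the output through `xmllint --noout` remains the real check, since a string scanner like this cannot catch other well-formedness problems such as unescaped ampersands or unclosed elements.
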
## Known Implementation Boundaries
- Substack handling is tuned to current public page structure, especially:
  - `.available-content .body.markup`
  - `.post-title`
  - JSON-LD `datePublished`
- Subscriber-only/authenticated Substack content is not implemented.
- CSS background images are not bundled.
- Manifest fields like `subtitle`, `summary`, and `tags` are parsed but only partially used.
- `rewrite_external_substack_links` and `preserve_other_external_links` exist in the manifest schema but are not deeply wired into behavior yet.
## Validation and Debugging
Run tests:
```bash
cargo test
```
Build a manifest:
```bash
cargo run -p ebookm-cli -- build -m ageofpeace/ageofpeace.yaml
```
Inspect extracted source metadata:
```bash
cargo run -p ebookm-cli -- inspect <url-or-file>
```
Validate generated XHTML quickly:
```bash
unzip -p path/to/book.epub OEBPS/text/<entry>.xhtml | xmllint --noout -
```
Validate the full EPUB package:
```bash
epubcheck path/to/book.epub
```
Useful inspection commands:
```bash
unzip -l path/to/book.epub
unzip -p path/to/book.epub OEBPS/nav.xhtml
unzip -p path/to/book.epub OEBPS/toc.ncx
unzip -p path/to/book.epub OEBPS/text/<entry>.xhtml
```
## Existing Real Example
The repository contains a real working manifest:
- `ageofpeace/ageofpeace.yaml`
Related local content/assets:
- `ageofpeace/introduction.html`
- `ageofpeace/johngu.jpg`
- `ageofpeace/age_of_peace_cover.jpg`
This is the best regression case for:
- mixed local HTML + Substack sources
- cover image handling
- local image bundling
- section TOC nesting
- chapter-header processing options
## Guidance For Future Agents
- Preserve manifest backward compatibility unless there is a strong reason not to.
- If TOC behavior changes, verify both `nav.xhtml` and `toc.ncx`.
- If HTML normalization changes, verify generated XHTML with `xmllint`.
- If image handling changes, test both:
  - local HTML image references
  - remote Substack image references
- Prefer extending `ebookm-core` behavior and keeping `ebookm-cli` thin.
- Update `README.md` whenever user-facing manifest fields or behavior change.