# AGENTS.md

## Project Summary

`ebookm` is a Rust workspace for compiling a set of Substack posts and local HTML files into a single EPUB.

Current workspace layout:

- `ebookm-core`
  Core library: manifest parsing, source loading, extraction, normalization, TOC/link logic, EPUB generation.
- `ebookm-cli`
  Thin CLI wrapper around `ebookm-core`.

Primary user workflow:

```bash
cargo run -p ebookm-cli -- build -m <manifest>
```

## Key Files

- `Cargo.toml`
  Workspace manifest.
- `ebookm-core/src/manifest.rs`
  YAML manifest schema and defaults.
- `ebookm-core/src/source.rs`
  Source loading for Substack URLs and local HTML files.
- `ebookm-core/src/extract.rs`
  Metadata/body extraction, including Substack-specific selectors.
- `ebookm-core/src/normalize.rs`
  HTML cleanup, local/remote image bundling, link rewriting, XHTML-safe output conversion.
- `ebookm-core/src/pipeline.rs`
  Main build orchestration and chapter generation.
- `ebookm-core/src/epub.rs`
  EPUB packaging, `nav.xhtml` and `toc.ncx` generation.
- `ebookm-core/src/template.rs`
  Starter manifest template used by `ebookm init`.
- `README.md`
  User-facing docs and manifest reference.

## Current Manifest Semantics

Top-level manifest keys:

- `book`
- `output`
- `defaults`
- `sections`
- `entries`
- `link_rules`

Supported source kinds:

- `substack`
  Public Substack post URL.
- `html`
  Local HTML file path, resolved relative to the manifest.
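
As a rough illustration of how an `html` source path might be resolved relative to the manifest, here is a minimal standard-library sketch. The function name `resolve_html_source` is hypothetical, and the assumption that the example manifest refers to `introduction.html` by bare file name is mine; the actual code in `ebookm-core` may work differently.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper (not taken from `ebookm-core`): resolve an `html`
/// source path against the directory that contains the manifest.
fn resolve_html_source(manifest_path: &Path, source: &str) -> PathBuf {
    // A manifest given as a bare file name has an empty parent, so the
    // source then stays relative to the current working directory.
    let base = manifest_path.parent().unwrap_or_else(|| Path::new(""));
    // `Path::join` returns `source` unchanged when it is already absolute.
    base.join(source)
}

fn main() {
    // Assumed example: the manifest references `introduction.html` by name.
    let resolved = resolve_html_source(
        Path::new("ageofpeace/ageofpeace.yaml"),
        "introduction.html",
    );
    println!("{}", resolved.display());
}
```

Resolving against the manifest's directory rather than the working directory means the same manifest builds identically no matter where `cargo run` is invoked from.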

Important processing options:

- `defaults.processing.include_author`
- `defaults.processing.include_date`
- `defaults.processing.include_source_url`
- `defaults.processing.skip_first_paragraphs`
- per-entry overrides under `entries.<id>.processing`

Current defaults:

- `include_author: true`
- `include_date: true`
- `include_source_url: true`
- `skip_first_paragraphs: 0`
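
The defaults above and the per-entry overrides can be pictured with the following sketch. It is an assumption-laden illustration, not the real schema: the type names `Processing` and `ProcessingOverride`, the method `with_override`, and the field-by-field merge rule are all hypothetical, and the actual structs in `ebookm-core/src/manifest.rs` may be shaped differently.

```rust
/// Hypothetical mirror of the `processing` options. Field names follow the
/// manifest keys; the real struct may differ.
#[derive(Debug, Clone, PartialEq)]
struct Processing {
    include_author: bool,
    include_date: bool,
    include_source_url: bool,
    skip_first_paragraphs: usize,
}

impl Default for Processing {
    /// The documented defaults: all header fields shown, no paragraphs skipped.
    fn default() -> Self {
        Self {
            include_author: true,
            include_date: true,
            include_source_url: true,
            skip_first_paragraphs: 0,
        }
    }
}

/// Hypothetical per-entry override: `None` means "inherit the default".
#[derive(Debug, Default)]
struct ProcessingOverride {
    include_author: Option<bool>,
    include_date: Option<bool>,
    include_source_url: Option<bool>,
    skip_first_paragraphs: Option<usize>,
}

impl Processing {
    /// Assumed merge rule: a value set on the entry wins, field by field.
    fn with_override(&self, o: &ProcessingOverride) -> Processing {
        Processing {
            include_author: o.include_author.unwrap_or(self.include_author),
            include_date: o.include_date.unwrap_or(self.include_date),
            include_source_url: o.include_source_url.unwrap_or(self.include_source_url),
            skip_first_paragraphs: o
                .skip_first_paragraphs
                .unwrap_or(self.skip_first_paragraphs),
        }
    }
}

fn main() {
    let defaults = Processing::default();
    // An entry that hides the source URL and skips two leading paragraphs.
    let entry = ProcessingOverride {
        include_source_url: Some(false),
        skip_first_paragraphs: Some(2),
        ..Default::default()
    };
    println!("{:?}", defaults.with_override(&entry));
}
```

Modelling overrides as `Option` fields keeps "not set" distinct from "explicitly set to the default value", which is one way to stay backward compatible when new options are added.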

## Current EPUB Behavior

- Section structure is emitted into both `nav.xhtml` and `toc.ncx`.
- Chapter header content is configurable:
  author, date, and canonical URL can each be independently shown/hidden.
- Local HTML images are bundled when `fetch_images: true`.
- Local image paths are resolved relative to the HTML file, not the manifest.
- Remote images from Substack pages are also bundled when `fetch_images: true`.
- Generated chapter XHTML is post-processed to self-close HTML void tags like `img`, `hr`, and `br` for EPUB/XML compatibility.
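
To make the void-tag post-processing concrete, the following is a minimal sketch of one way to self-close `img`, `hr`, and `br` using only the standard library. It is not the implementation in `ebookm-core/src/normalize.rs`; the names `self_close_void_tags`, `tag_end`, and `VOID_TAGS` are hypothetical. The scanner also deliberately ignores comments, CDATA, and script content, so treat it as an illustration of the idea only.

```rust
/// Tags handled by this sketch; HTML defines more void elements.
const VOID_TAGS: [&str; 3] = ["img", "hr", "br"];

/// Byte index of the `>` that ends the tag starting at `tag[0]`,
/// skipping any `>` that appears inside a quoted attribute value.
fn tag_end(tag: &str) -> Option<usize> {
    let mut quote: Option<char> = None;
    for (i, c) in tag.char_indices() {
        match (quote, c) {
            (Some(q), _) if c == q => quote = None,
            (Some(_), _) => {}
            (None, '"') | (None, '\'') => quote = Some(c),
            (None, '>') => return Some(i),
            _ => {}
        }
    }
    None
}

/// Hypothetical helper: rewrite `<br>` as `<br />` (likewise `img`, `hr`).
fn self_close_void_tags(html: &str) -> String {
    let mut out = String::with_capacity(html.len());
    let mut rest = html;
    while let Some(start) = rest.find('<') {
        // Copy text before the tag unchanged.
        out.push_str(&rest[..start]);
        let tag = &rest[start..];
        let Some(end) = tag_end(tag) else {
            // Unterminated tag: emit the remainder untouched.
            out.push_str(tag);
            return out;
        };
        let body = &tag[..end]; // starts with '<', excludes the final '>'
        let name: String = body[1..]
            .chars()
            .take_while(|c| c.is_ascii_alphanumeric())
            .collect();
        let is_void = VOID_TAGS.iter().any(|t| t.eq_ignore_ascii_case(&name));
        let trimmed = body.trim_end();
        if is_void && !trimmed.ends_with('/') {
            out.push_str(trimmed);
            out.push_str(" />");
        } else {
            // Non-void tags and already self-closed tags pass through as-is.
            out.push_str(body);
            out.push('>');
        }
        rest = &tag[end + 1..];
    }
    out.push_str(rest);
    out
}

fn main() {
    let fixed = self_close_void_tags(r#"<p>one<br>two<img src="a.png"></p>"#);
    // prints: <p>one<br />two<img src="a.png" /></p>
    println!("{fixed}");
}
```

Whatever approach the real code takes, the result can be checked the same way: the chapter should pass `xmllint --noout`.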

## Known Implementation Boundaries

- Substack handling is tuned to the current public page structure, especially:
  - `.available-content .body.markup`
  - `.post-title`
  - JSON-LD `datePublished`
- Subscriber-only/authenticated Substack content is not implemented.
- CSS background images are not bundled.
- Manifest fields like `subtitle`, `summary`, and `tags` are parsed but only partially used.
- `rewrite_external_substack_links` and `preserve_other_external_links` exist in the manifest schema but are not yet fully wired into behavior.

## Validation and Debugging

Run tests:

```bash
cargo test
```

Build a manifest:

```bash
cargo run -p ebookm-cli -- build -m ageofpeace/ageofpeace.yaml
```

Inspect extracted source metadata:

```bash
cargo run -p ebookm-cli -- inspect <url-or-file>
```

Validate generated XHTML quickly:

```bash
unzip -p path/to/book.epub OEBPS/text/chapter.xhtml | xmllint --noout -
```

Validate the full EPUB package:

```bash
epubcheck path/to/book.epub
```

Useful inspection commands:

```bash
unzip -l path/to/book.epub
unzip -p path/to/book.epub OEBPS/nav.xhtml
unzip -p path/to/book.epub OEBPS/toc.ncx
unzip -p path/to/book.epub OEBPS/text/<entry>.xhtml
```

## Existing Real Example

The repository contains a real working manifest:

- `ageofpeace/ageofpeace.yaml`

Related local content/assets:

- `ageofpeace/introduction.html`
- `ageofpeace/johngu.jpg`
- `ageofpeace/age_of_peace_cover.jpg`

This is the best regression case for:

- mixed local HTML + Substack sources
- cover image handling
- local image bundling
- section TOC nesting
- chapter-header processing options

## Guidance For Future Agents

- Preserve manifest backward compatibility unless there is a strong reason not to.
- If TOC behavior changes, verify both `nav.xhtml` and `toc.ncx`.
- If HTML normalization changes, verify generated XHTML with `xmllint`.
- If image handling changes, test both:
  - local HTML image references
  - remote Substack image references
- Prefer extending `ebookm-core` behavior and keeping `ebookm-cli` thin.
- Update `README.md` whenever user-facing manifest fields or behavior change.