# ebookm

`ebookm` is a Rust command-line tool that compiles a set of Substack posts and local HTML files into a single EPUB.

## Current Scope

`v0.1` supports:

- YAML manifests
- Public Substack post URLs
- Local HTML files
- Manifest-defined section order and TOC structure
- Per-entry metadata and TOC overrides
- Basic internal link rewriting between included entries
- EPUB generation with bundled article assets

## Build

```bash
cargo build
```

## Run

Use the CLI through Cargo:

```bash
cargo run -p ebookm-cli -- <command>
```

Available commands:

- `build -m <manifest>`
- `validate -m <manifest>`
- `inspect <file>`
- `init`

## Quick Start

This repository includes a runnable example manifest and local HTML fixture in `examples/`.

Validate the example manifest:

```bash
cargo run -p ebookm-cli -- validate -m examples/example-book.yaml
```

Build the example EPUB:

```bash
cargo run -p ebookm-cli -- build -m examples/example-book.yaml
```

The output EPUB will be written to:

```text
examples/dist/example-book.epub
```

Inspect a local HTML file:

```bash
cargo run -p ebookm-cli -- inspect examples/articles/intro.html
```

Generate a starter manifest:

```bash
cargo run -p ebookm-cli -- init
```

## Validate The Output EPUB

There are two useful levels of validation.

Quick XML/XHTML validation for generated chapter files:

```bash
unzip -p path/to/book.epub OEBPS/text/chapter.xhtml | xmllint --noout -
```

If you want to check every generated XHTML file in the EPUB:

```bash
mkdir -p /tmp/ebookm-check
unzip -o path/to/book.epub -d /tmp/ebookm-check
find /tmp/ebookm-check/OEBPS -name '*.xhtml' -print -exec xmllint --noout {} \;
```

Full EPUB validation:

Use `epubcheck`, which validates the EPUB package itself, including metadata, navigation files, manifest/spine consistency, and XHTML correctness.

```bash
epubcheck path/to/book.epub
```

Practical guidance:

- Use `xmllint` when you want to quickly confirm that generated XHTML is well-formed XML.
- Use `epubcheck` when you want proper EPUB-level validation before distributing the file.
- If an EPUB reader only shows part of a chapter, malformed XHTML is a common cause, so `xmllint` on the generated chapter files is a good first check.

## Manifest Reference

The top-level manifest keys are:

- `book`: EPUB metadata
- `output`: output path and optional cover image
- `defaults`: shared normalization and metadata defaults
- `sections`: ordered TOC and reading-order groups
- `entries`: source definitions and per-entry overrides
- `link_rules`: cross-link rewriting behavior

Minimal example:

```yaml
book:
  title: "My Book"
  author: "Editor"
  language: "en"
  identifier: "urn:uuid:my-book"

output:
  path: "dist/my-book.epub"

sections:
  - id: "part-1"
    title: "Part 1"
    entries:
      - "essay"

entries:
  essay:
    source:
      kind: "html"
      path: "articles/essay.html"

link_rules:
  mode: "auto"
```

### Top-Level Fields

- `book`: EPUB metadata block. Required.
- `output`: Output configuration block. Required.
- `defaults`: Shared defaults applied across entries. Optional.
- `sections`: Ordered list of sections. Required in practice for a useful build.
- `entries`: Map of entry IDs to entry definitions. Required in practice.
- `link_rules`: Global link rewriting policy. Optional.

### `book`

- `book.title`: Required string. Book title.
- `book.author`: Optional string. Book-level author.
- `book.language`: Optional string. Defaults to `en`.
- `book.identifier`: Optional string. Defaults to a generated `urn:uuid:...`.
- `book.description`: Optional string. Written into the EPUB package metadata.

Example:

```yaml
book:
  title: "Collected Essays"
  author: "Jane Doe"
  language: "en"
  identifier: "urn:uuid:collected-essays"
  description: "A single-volume EPUB generated by ebookm"
```

### `output`

- `output.path`: Required string. Output EPUB path. Resolved relative to the manifest file.
- `output.cover_image`: Optional string. Path to a cover image, resolved relative to the manifest file.

Example:

```yaml
output:
  path: "dist/book.epub"
  cover_image: "assets/cover.jpg"
```

### `defaults`

- `defaults.fetch_images`: Optional boolean. Defaults to `true`. When enabled, image assets referenced from article HTML are fetched and bundled into the EPUB.
- `defaults.normalize_substack_embeds`: Optional boolean. Defaults to `true`. Currently removes iframe embeds during normalization.
- `defaults.metadata`: Optional metadata override block applied after extracted source metadata and before per-entry overrides.
- `defaults.processing`: Optional shared article-processing and chapter-header defaults.

Example:

```yaml
defaults:
  fetch_images: true
  normalize_substack_embeds: true
  processing:
    include_author: true
    include_date: true
    include_source_url: true
    skip_first_paragraphs: 0
  metadata:
    author: "Editorial Team"
```

### `defaults.processing` and `entries.<id>.processing`

These fields control chapter-header rendering and article trimming.

For `defaults.processing`, all fields are concrete values with defaults. For `entries.<id>.processing`, the same fields are optional overrides.

- `include_author`: Boolean. Defaults to `true`. Controls whether the extracted or overridden author name is shown at the start of the chapter.
- `include_date`: Boolean. Defaults to `true`. Controls whether the extracted or overridden publication date is shown at the start of the chapter.
- `include_source_url`: Boolean. Defaults to `true`. Controls whether the canonical article URL is shown at the start of the chapter.
- `skip_first_paragraphs`: Integer. Defaults to `0`. Removes the first `n` paragraph elements from the extracted article body before EPUB generation.

Example:

```yaml
defaults:
  processing:
    include_author: true
    include_date: false
    include_source_url: false
    skip_first_paragraphs: 0
```

### `defaults.metadata` and `entries.<id>.metadata`

These fields use the same shape:

- `author`: Optional string.
- `published`: Optional date in `YYYY-MM-DD` format.
- `subtitle`: Optional string.
  Accepted by the parser but not yet emitted into the EPUB output.
- `summary`: Optional string. Accepted by the parser but not yet emitted into the EPUB output.
- `tags`: Optional list of strings. Accepted by the parser but not yet used in link or EPUB output logic.

Example:

```yaml
metadata:
  author: "Jane Doe"
  published: "2025-01-10"
  subtitle: "Notebook entry"
  summary: "A short summary"
  tags: ["essay", "history"]
```

### `sections`

`sections` is an ordered list. Section order controls reading order and TOC grouping.

Each section supports:

- `id`: Required string. Stable section identifier.
- `title`: Required string. Section title shown in the TOC.
- `entries`: Optional list of entry IDs in reading order. It should usually not be empty.

Example:

```yaml
sections:
  - id: "part-1"
    title: "Part 1"
    entries:
      - "opening-post"
      - "notes"
```

### `entries`

`entries` is a map from entry ID to entry definition. Entry IDs are referenced from `sections` and link rules.

Each entry supports:

- `source`: Required source definition block.
- `title`: Optional string. Overrides the extracted article title.
- `metadata`: Optional metadata override block.
- `toc`: Optional TOC override block.
- `links`: Optional per-entry link-policy block.
- `processing`: Optional per-entry processing override block.

Example:

```yaml
entries:
  opening-post:
    source:
      kind: "substack"
      url: "https://example.substack.com/p/opening-post"
    title: "Opening Post"
    metadata:
      published: "2025-01-10"
    processing:
      include_source_url: false
      skip_first_paragraphs: 1
    toc:
      title: "Introduction"
    links:
      mode: "explicit"
      allow_to: ["notes"]
```

### `entries.<id>.source`

Two source kinds are supported:

- `kind: "substack"`: Use a public Substack article URL. Fields: `url` required string.
- `kind: "html"`: Use a local HTML file. Fields: `path` required string, resolved relative to the manifest file.

Examples:

```yaml
source:
  kind: "substack"
  url: "https://example.substack.com/p/my-post"
```

```yaml
source:
  kind: "html"
  path: "articles/local-post.html"
```

You can mix Substack URLs and local HTML files in the same manifest:

```yaml
entries:
  remote-post:
    source:
      kind: "substack"
      url: "https://example.substack.com/p/remote-post"
  local-post:
    source:
      kind: "html"
      path: "articles/local-post.html"
```

### `entries.<id>.toc`

- `title`: Optional string. Overrides the chapter label used in the TOC.
- `hidden`: Optional boolean. Defaults to `false`. When `true`, the entry is omitted from the TOC.

Example:

```yaml
toc:
  title: "Appendix"
  hidden: false
```

### `entries.<id>.links`

- `mode`: Optional string. One of: `auto`, `explicit`, `none`. If omitted, the global `link_rules.mode` is used.
- `allow_to`: Optional list of entry IDs. If set, rewritten internal links are limited to these targets.
- `block_to`: Optional list of entry IDs. These targets are excluded from rewriting.

Example:

```yaml
links:
  mode: "explicit"
  allow_to: ["intro", "appendix"]
  block_to: ["draft-notes"]
```

### `link_rules`

- `link_rules.mode`: Optional string. One of: `auto`, `explicit`, `none`. Defaults to `auto`.
- `link_rules.rewrite_external_substack_links`: Optional boolean. Defaults to `true`. Accepted by the manifest parser, but not currently used to change behavior in `v0.1`.
- `link_rules.preserve_other_external_links`: Optional boolean. Defaults to `true`. Accepted by the manifest parser, but not currently used to change behavior in `v0.1`.
- `link_rules.rules`: Optional list of explicit link rules.

Example:

```yaml
link_rules:
  mode: "explicit"
  rewrite_external_substack_links: true
  preserve_other_external_links: true
  rules:
    - from: ["notes"]
      to: ["intro"]
      match_mode: "canonical-url"
```

### `link_rules.rules[]`

Each rule supports:

- `from`: Required list of selectors describing where the rule applies.
- `to`: Required list of selectors describing eligible targets.
- `match_mode`: Optional string.
  One of: `canonical-url`, `source-url`, `disabled`. Defaults to `canonical-url`.

Supported selectors in `from` and `to`:

- `*`: Match all entries.
- `<entry-id>`: Match one entry by ID.
- `section:<section-id>`: Match all entries referenced by that section.

Example:

```yaml
link_rules:
  mode: "explicit"
  rules:
    - from: ["section:essays"]
      to: ["section:essays"]
      match_mode: "canonical-url"
```

## Notes

- Output paths are resolved relative to the manifest file location.
- Local HTML paths are also resolved relative to the manifest file location.
- `sections` and `entries` are deserialized with empty defaults, but `validate` and `build` expect them to be meaningfully populated.
- `subtitle`, `summary`, `tags`, `rewrite_external_substack_links`, and `preserve_other_external_links` are accepted today but only partially wired into runtime behavior.
- For Substack sources, `v0.1` assumes public posts. Subscriber-only/session-based fetching is not implemented.
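
The manifest-relative path resolution described in the notes above can be sketched in a few lines of Rust. This is a minimal illustration, not code taken from `ebookm`: the helper name `resolve_from_manifest` is hypothetical, and the `dist/example-book.epub` value is an assumed `output.path` for the example manifest.

```rust
use std::path::{Path, PathBuf};

/// Resolve `relative` against the directory that contains the manifest file.
/// Hypothetical helper for illustration; ebookm's real implementation may differ.
fn resolve_from_manifest(manifest: &Path, relative: &str) -> PathBuf {
    // `parent()` is `None` only for an empty or root path; fall back to an
    // empty base so the relative path is returned unchanged in that case.
    let base = manifest.parent().unwrap_or_else(|| Path::new(""));
    base.join(relative)
}

fn main() {
    let manifest = Path::new("examples/example-book.yaml");

    // Assumed `output.path` value, resolved next to the manifest.
    let output = resolve_from_manifest(manifest, "dist/example-book.epub");
    println!("output: {}", output.display());

    // A local HTML source path is resolved the same way.
    let article = resolve_from_manifest(manifest, "articles/intro.html");
    println!("article: {}", article.display());
}
```

In this sketch, `Path::join` returns an absolute argument unchanged, so an absolute path in the manifest would bypass the manifest directory. Whether `ebookm` treats absolute paths that way is not stated here.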