ebookm
ebookm is a Rust command-line tool that compiles a set of Substack posts and local HTML files into a single EPUB.
Current Scope
v0.1 supports:
- YAML manifests
- Public Substack post URLs
- Local HTML files
- Manifest-defined section order and TOC structure
- Per-entry metadata and TOC overrides
- Basic internal link rewriting between included entries
- EPUB generation with bundled article assets
Build
cargo build
Run
Use the CLI through Cargo:
cargo run -p ebookm-cli -- <command>
Available commands:
- build -m <manifest>
- validate -m <manifest>
- inspect <url-or-file>
- init
Quick Start
This repository includes a runnable example manifest and local HTML fixture in examples/.
Validate the example manifest:
cargo run -p ebookm-cli -- validate -m examples/example-book.yaml
Build the example EPUB:
cargo run -p ebookm-cli -- build -m examples/example-book.yaml
The output EPUB will be written to:
examples/dist/example-book.epub
Inspect a local HTML file:
cargo run -p ebookm-cli -- inspect examples/articles/intro.html
Generate a starter manifest:
cargo run -p ebookm-cli -- init
Validate The Output EPUB
There are two useful levels of validation.
Quick XML/XHTML validation for generated chapter files:
unzip -p path/to/book.epub OEBPS/text/chapter.xhtml | xmllint --noout -
If you want to check every generated XHTML file in the EPUB:
mkdir -p /tmp/ebookm-check
unzip -o path/to/book.epub -d /tmp/ebookm-check
find /tmp/ebookm-check/OEBPS -name '*.xhtml' -print -exec xmllint --noout {} \;
Full EPUB validation:
Use epubcheck, which validates the EPUB package itself, including metadata, navigation files, manifest/spine consistency, and XHTML correctness.
epubcheck path/to/book.epub
Practical guidance:
- Use xmllint when you want to quickly confirm that generated XHTML is well-formed XML.
- Use epubcheck when you want proper EPUB-level validation before distributing the file.
- If an EPUB reader only shows part of a chapter, malformed XHTML is a common cause, so xmllint on the generated chapter files is a good first check.
Manifest Reference
The top-level manifest keys are:
- book: EPUB metadata
- output: output path and optional cover image
- defaults: shared normalization and metadata defaults
- sections: ordered TOC and reading-order groups
- entries: source definitions and per-entry overrides
- link_rules: cross-link rewriting behavior
Minimal example:
book:
  title: "My Book"
  author: "Editor"
  language: "en"
  identifier: "urn:uuid:my-book"
output:
  path: "dist/my-book.epub"
sections:
  - id: "part-1"
    title: "Part 1"
    entries:
      - "essay"
entries:
  essay:
    source:
      kind: "html"
      path: "articles/essay.html"
link_rules:
  mode: "auto"
Top-Level Fields
- book: EPUB metadata block. Required.
- output: Output configuration block. Required.
- defaults: Shared defaults applied across entries. Optional.
- sections: Ordered list of sections. Required in practice for a useful build.
- entries: Map of entry IDs to entry definitions. Required in practice.
- link_rules: Global link rewriting policy. Optional.
book
- book.title: Required string. Book title.
- book.author: Optional string. Book-level author.
- book.language: Optional string. Defaults to en.
- book.identifier: Optional string. Defaults to a generated urn:uuid:... identifier.
- book.description: Optional string. Written into the EPUB package metadata.
Example:
book:
  title: "Collected Essays"
  author: "Jane Doe"
  language: "en"
  identifier: "urn:uuid:collected-essays"
  description: "A single-volume EPUB generated by ebookm"
output
- output.path: Required string. Output EPUB path. Resolved relative to the manifest file.
- output.cover_image: Optional string. Path to a cover image, resolved relative to the manifest file.
Example:
output:
  path: "dist/book.epub"
  cover_image: "assets/cover.jpg"
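To make "resolved relative to the manifest file" concrete, here is a minimal Rust sketch. The helper name resolve_from_manifest is hypothetical and is not taken from the ebookm source; it assumes resolution is a plain join against the directory that contains the manifest.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve a path written in the manifest against
/// the directory that contains the manifest file.
fn resolve_from_manifest(manifest_path: &Path, value: &str) -> PathBuf {
    // `parent()` is `None` only for an empty path or a filesystem root;
    // fall back to an empty base so the value is returned as written.
    let base = manifest_path.parent().unwrap_or_else(|| Path::new(""));
    // `Path::join` returns `value` unchanged when it is absolute.
    base.join(value)
}
```

Under these assumptions, a manifest at books/my-book.yaml with output.path set to "dist/book.epub" would write to books/dist/book.epub. How ebookm treats absolute paths is not stated in this document; the sketch simply inherits the behavior of Path::join.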
defaults
- defaults.fetch_images: Optional boolean. Defaults to true. When enabled, image assets referenced from article HTML are fetched and bundled into the EPUB.
- defaults.normalize_substack_embeds: Optional boolean. Defaults to true. Currently removes iframe embeds during normalization.
- defaults.metadata: Optional metadata override block applied after extracted source metadata and before per-entry overrides.
- defaults.processing: Optional shared article-processing and chapter-header defaults.
Example:
defaults:
  fetch_images: true
  normalize_substack_embeds: true
  processing:
    include_author: true
    include_date: true
    include_source_url: true
    skip_first_paragraphs: 0
  metadata:
    author: "Editorial Team"
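The metadata precedence described above (extracted source metadata, then defaults.metadata, then per-entry overrides) can be sketched as a layered merge. This is an illustration only: the Metadata struct, its two fields, and the function names are hypothetical, and the sketch assumes overrides are applied field by field rather than replacing the whole block.

```rust
/// Hypothetical, trimmed-down metadata shape for illustration.
#[derive(Clone, Debug, Default, PartialEq)]
struct Metadata {
    author: Option<String>,
    published: Option<String>,
}

impl Metadata {
    /// Fields set in `over` win; unset fields fall through to `self`.
    fn overlaid_with(&self, over: &Metadata) -> Metadata {
        Metadata {
            author: over.author.clone().or_else(|| self.author.clone()),
            published: over.published.clone().or_else(|| self.published.clone()),
        }
    }
}

/// Assumed order: extracted values, then defaults.metadata, then the entry's metadata.
fn effective_metadata(extracted: &Metadata, defaults: &Metadata, entry: &Metadata) -> Metadata {
    extracted.overlaid_with(defaults).overlaid_with(entry)
}
```

With the example above, an article whose extracted author is "Jane Doe" would be attributed to "Editorial Team" unless the entry sets its own metadata.author.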
defaults.processing and entries.<id>.processing
These fields control chapter-header rendering and article trimming.
For defaults.processing, all fields are concrete values with defaults.
For entries.<id>.processing, the same fields are optional overrides.
- include_author: Boolean. Defaults to true. Controls whether the extracted or overridden author name is shown at the start of the chapter.
- include_date: Boolean. Defaults to true. Controls whether the extracted or overridden publication date is shown at the start of the chapter.
- include_source_url: Boolean. Defaults to true. Controls whether the canonical article URL is shown at the start of the chapter.
- skip_first_paragraphs: Integer. Defaults to 0. Removes the first n paragraph elements from the extracted article body before EPUB generation.
Example:
defaults:
  processing:
    include_author: true
    include_date: false
    include_source_url: false
    skip_first_paragraphs: 0
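The split between concrete defaults and optional per-entry overrides can be sketched as follows. The type and function names are hypothetical, and modelling paragraphs as a slice is a simplification; the actual ebookm types may differ.

```rust
/// Concrete processing settings, as in defaults.processing.
#[derive(Clone, Copy, Debug, PartialEq)]
struct Processing {
    include_author: bool,
    include_date: bool,
    include_source_url: bool,
    skip_first_paragraphs: usize,
}

impl Default for Processing {
    /// Documented defaults: all header fields shown, nothing skipped.
    fn default() -> Self {
        Processing {
            include_author: true,
            include_date: true,
            include_source_url: true,
            skip_first_paragraphs: 0,
        }
    }
}

/// Optional overrides, as in entries.<id>.processing.
#[derive(Clone, Copy, Debug, Default)]
struct ProcessingOverride {
    include_author: Option<bool>,
    include_date: Option<bool>,
    include_source_url: Option<bool>,
    skip_first_paragraphs: Option<usize>,
}

/// Each field uses the entry's value when present, else the shared default.
fn resolve_processing(defaults: Processing, entry: ProcessingOverride) -> Processing {
    Processing {
        include_author: entry.include_author.unwrap_or(defaults.include_author),
        include_date: entry.include_date.unwrap_or(defaults.include_date),
        include_source_url: entry.include_source_url.unwrap_or(defaults.include_source_url),
        skip_first_paragraphs: entry
            .skip_first_paragraphs
            .unwrap_or(defaults.skip_first_paragraphs),
    }
}

/// Drop the first `n` items; asking for more than exist yields an empty slice.
fn skip_first<T>(items: &[T], n: usize) -> &[T] {
    &items[n.min(items.len())..]
}
```

An entry that sets only include_source_url: false and skip_first_paragraphs: 1 would, under this sketch, still show author and date while dropping the source URL and the first paragraph.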
defaults.metadata and entries.<id>.metadata
These fields use the same shape:
- author: Optional string.
- published: Optional date in YYYY-MM-DD format.
- subtitle: Optional string. Accepted by the parser but not yet emitted into the EPUB output.
- summary: Optional string. Accepted by the parser but not yet emitted into the EPUB output.
- tags: Optional list of strings. Accepted by the parser but not yet used in link or EPUB output logic.
Example:
metadata:
  author: "Jane Doe"
  published: "2025-01-10"
  subtitle: "Notebook entry"
  summary: "A short summary"
  tags: ["essay", "history"]
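For readers who want to pre-check published values before running validate, the YYYY-MM-DD shape can be tested with a small standard-library sketch. The function name parse_published is hypothetical, and the check is deliberately loose: it verifies the shape and basic ranges only, not calendar validity, and it may not match how ebookm itself parses dates.

```rust
/// Hypothetical shape check for a YYYY-MM-DD string.
/// Returns (year, month, day) when the shape and basic ranges are valid.
fn parse_published(value: &str) -> Option<(u16, u8, u8)> {
    let parts: Vec<&str> = value.split('-').collect();
    // Exactly three parts with 4, 2, and 2 characters.
    if parts.len() != 3 || parts[0].len() != 4 || parts[1].len() != 2 || parts[2].len() != 2 {
        return None;
    }
    // ASCII digits only; this rejects signs and whitespace.
    if !parts.iter().all(|p| p.bytes().all(|b| b.is_ascii_digit())) {
        return None;
    }
    let year: u16 = parts[0].parse().ok()?;
    let month: u8 = parts[1].parse().ok()?;
    let day: u8 = parts[2].parse().ok()?;
    // Range check only; a date such as 2025-02-31 still passes.
    if (1..=12).contains(&month) && (1..=31).contains(&day) {
        Some((year, month, day))
    } else {
        None
    }
}
```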
sections
sections is an ordered list. Section order controls reading order and TOC grouping.
Each section supports:
- id: Required string. Stable section identifier.
- title: Required string. Section title shown in the TOC.
- entries: Optional list of entry IDs in reading order. Usually should not be empty.
Example:
sections:
  - id: "part-1"
    title: "Part 1"
    entries:
      - "opening-post"
      - "notes"
entries
entries is a map from entry ID to entry definition. Entry IDs are referenced from sections and link rules.
Each entry supports:
- source: Required source definition block.
- title: Optional string. Overrides the extracted article title.
- metadata: Optional metadata override block.
- toc: Optional TOC override block.
- links: Optional per-entry link-policy block.
- processing: Optional per-entry processing override block.
Example:
entries:
  opening-post:
    source:
      kind: "substack"
      url: "https://example.substack.com/p/opening-post"
    title: "Opening Post"
    metadata:
      published: "2025-01-10"
    processing:
      include_source_url: false
      skip_first_paragraphs: 1
    toc:
      title: "Introduction"
    links:
      mode: "explicit"
      allow_to: ["notes"]
entries.<id>.source
Two source kinds are supported:
- kind: "substack": Use a public Substack article URL. Fields: url, required string.
- kind: "html": Use a local HTML file. Fields: path, required string, resolved relative to the manifest file.
Examples:
source:
  kind: "substack"
  url: "https://example.substack.com/p/my-post"

source:
  kind: "html"
  path: "articles/local-post.html"
You can mix Substack URLs and local HTML files in the same manifest:
entries:
  remote-post:
    source:
      kind: "substack"
      url: "https://example.substack.com/p/remote-post"
  local-post:
    source:
      kind: "html"
      path: "articles/local-post.html"
entries.<id>.toc
- title: Optional string. Overrides the chapter label used in the TOC.
- hidden: Optional boolean. Defaults to false. When true, the entry is omitted from the TOC.
Example:
toc:
  title: "Appendix"
  hidden: false
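One plausible reading of how the TOC label is chosen is sketched below. The precedence shown (toc.title, then the entry's title, then the extracted article title) is an assumption inferred from the field descriptions, and toc_label is a hypothetical name rather than an ebookm function.

```rust
/// Hypothetical TOC label resolution.
/// Returns `None` when the entry is hidden from the TOC.
fn toc_label<'a>(
    extracted_title: &'a str,
    entry_title: Option<&'a str>,
    toc_title: Option<&'a str>,
    hidden: bool,
) -> Option<&'a str> {
    if hidden {
        return None;
    }
    // Assumed precedence: toc.title, then entries.<id>.title, then the extracted title.
    Some(toc_title.or(entry_title).unwrap_or(extracted_title))
}
```

Under this reading, an entry with title "Opening Post" and toc.title "Introduction" would appear in the TOC as "Introduction".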
entries.<id>.links
- mode: Optional string. One of: auto, explicit, none. If omitted, the global link_rules.mode is used.
- allow_to: Optional list of entry IDs. If set, rewritten internal links are limited to these targets.
- block_to: Optional list of entry IDs. These targets are excluded from rewriting.
Example:
links:
  mode: "explicit"
  allow_to: ["intro", "appendix"]
  block_to: ["draft-notes"]
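The combined effect of allow_to and block_to can be sketched as a single predicate. The name may_rewrite is hypothetical, and letting block_to win when a target appears in both lists is an assumption; this document does not say how ebookm resolves that conflict.

```rust
/// Hypothetical check: may a link to `target` be rewritten for this entry?
/// `allow_to` is `None` when the field is not set in the manifest.
fn may_rewrite(target: &str, allow_to: Option<&[&str]>, block_to: &[&str]) -> bool {
    // Assumption: a blocked target is never rewritten, even if also allowed.
    if block_to.iter().any(|id| *id == target) {
        return false;
    }
    match allow_to {
        // When allow_to is set, only the listed targets qualify.
        Some(allowed) => allowed.iter().any(|id| *id == target),
        // Otherwise every non-blocked target qualifies.
        None => true,
    }
}
```

With the example above, links to intro and appendix would be rewritten, while links to draft-notes or to any unlisted entry would be left alone.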
link_rules
- link_rules.mode: Optional string. One of: auto, explicit, none. Defaults to auto.
- link_rules.rewrite_external_substack_links: Optional boolean. Defaults to true. Accepted by the manifest parser, but not currently used to change behavior in v0.1.
- link_rules.preserve_other_external_links: Optional boolean. Defaults to true. Accepted by the manifest parser, but not currently used to change behavior in v0.1.
- link_rules.rules: Optional list of explicit link rules.
Example:
link_rules:
  mode: "explicit"
  rewrite_external_substack_links: true
  preserve_other_external_links: true
  rules:
    - from: ["notes"]
      to: ["intro"]
      match_mode: "canonical-url"
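The fallback from a per-entry links.mode to the global link_rules.mode, and from there to the documented default of auto, can be sketched like this. LinkMode, parse_mode, and effective_mode are hypothetical names used for illustration only.

```rust
/// The three documented link modes.
#[derive(Clone, Copy, Debug, PartialEq)]
enum LinkMode {
    Auto,
    Explicit,
    Off, // manifest value "none"
}

/// Map a manifest string to a mode; unknown strings yield `None`.
fn parse_mode(value: &str) -> Option<LinkMode> {
    match value {
        "auto" => Some(LinkMode::Auto),
        "explicit" => Some(LinkMode::Explicit),
        "none" => Some(LinkMode::Off),
        _ => None,
    }
}

/// Entry-level mode wins, then the global mode, then the default (auto).
fn effective_mode(entry_mode: Option<LinkMode>, global_mode: Option<LinkMode>) -> LinkMode {
    entry_mode.or(global_mode).unwrap_or(LinkMode::Auto)
}
```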
link_rules.rules[]
Each rule supports:
- from: Required list of selectors describing where the rule applies.
- to: Required list of selectors describing eligible targets.
- match_mode: Optional string. One of: canonical-url, source-url, disabled. Defaults to canonical-url.
Supported selectors in from and to:
- *: Match all entries.
- <entry-id>: Match one entry by ID.
- section:<section-id>: Match all entries referenced by that section.
Example:
link_rules:
  mode: "explicit"
  rules:
    - from: ["section:essays"]
      to: ["section:essays"]
      match_mode: "canonical-url"
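The three selector forms can be sketched as a small matcher. The function names and the map from section ID to entry IDs are hypothetical; the sketch assumes a rule applies when any from selector matches the source entry and any to selector matches the target entry, which this document implies but does not spell out.

```rust
use std::collections::HashMap;

/// Hypothetical selector check. `sections` maps a section ID to its entry IDs.
fn selector_matches(
    selector: &str,
    entry_id: &str,
    sections: &HashMap<&str, Vec<&str>>,
) -> bool {
    if selector == "*" {
        return true; // wildcard: every entry
    }
    if let Some(section_id) = selector.strip_prefix("section:") {
        // Section selector: the entry must be listed in that section.
        return sections
            .get(section_id)
            .map_or(false, |entries| entries.iter().any(|id| *id == entry_id));
    }
    // Plain selector: exact entry ID.
    selector == entry_id
}

/// Assumed rule semantics: some `from` selector matches the source
/// and some `to` selector matches the target.
fn rule_applies(
    from: &[&str],
    to: &[&str],
    source: &str,
    target: &str,
    sections: &HashMap<&str, Vec<&str>>,
) -> bool {
    from.iter().any(|s| selector_matches(s, source, sections))
        && to.iter().any(|s| selector_matches(s, target, sections))
}
```

For the example above, a section essays containing intro and notes would allow rewriting from intro to notes, but not from intro to an entry outside that section.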
Notes
- Output paths are resolved relative to the manifest file location.
- Local HTML paths are also resolved relative to the manifest file location.
- sections and entries are deserialized with empty defaults, but validate and build expect them to be meaningfully populated.
- subtitle, summary, tags, rewrite_external_substack_links, and preserve_other_external_links are accepted today but only partially wired into runtime behavior.
- For Substack sources, v0.1 assumes public posts. Subscriber-only/session-based fetching is not implemented.