# ebookm

`ebookm` is a Rust command-line tool that compiles a set of Substack posts and local HTML files into a single EPUB.

## Current Scope

`v0.1` supports:

- YAML manifests
- Public Substack post URLs
- Local HTML files
- Manifest-defined section order and TOC structure
- Per-entry metadata and TOC overrides
- Basic internal link rewriting between included entries
- EPUB generation with bundled article assets

## Build

```bash
cargo build
```

## Run

Use the CLI through Cargo:

```bash
cargo run -p ebookm-cli -- <command>
```

Available commands:

- `build -m <manifest>`
- `validate -m <manifest>`
- `inspect <url-or-file>`
- `init`

## Quick Start

This repository includes a runnable example manifest and local HTML fixture in `examples/`.

Validate the example manifest:

```bash
cargo run -p ebookm-cli -- validate -m examples/example-book.yaml
```

Build the example EPUB:

```bash
cargo run -p ebookm-cli -- build -m examples/example-book.yaml
```

The output EPUB will be written to:

```text
examples/dist/example-book.epub
```

Inspect a local HTML file:

```bash
cargo run -p ebookm-cli -- inspect examples/articles/intro.html
```

Generate a starter manifest:

```bash
cargo run -p ebookm-cli -- init
```

## Validate The Output EPUB

There are two useful levels of validation.

Quick XML/XHTML validation for generated chapter files:

```bash
unzip -p path/to/book.epub OEBPS/text/chapter.xhtml | xmllint --noout -
```

If you want to check every generated XHTML file in the EPUB:

```bash
mkdir -p /tmp/ebookm-check
unzip -o path/to/book.epub -d /tmp/ebookm-check
find /tmp/ebookm-check/OEBPS -name '*.xhtml' -print -exec xmllint --noout {} \;
```

Full EPUB validation:

Use `epubcheck`, which validates the EPUB package itself, including metadata, navigation files, manifest/spine consistency, and XHTML correctness.

```bash
epubcheck path/to/book.epub
```

Practical guidance:

- Use `xmllint` when you want to quickly confirm that generated XHTML is well-formed XML.
- Use `epubcheck` when you want proper EPUB-level validation before distributing the file.
- If an EPUB reader only shows part of a chapter, malformed XHTML is a common cause, so `xmllint` on the generated chapter files is a good first check.

## Manifest Reference

The top-level manifest keys are:

- `book`: EPUB metadata
- `output`: output path and optional cover image
- `defaults`: shared normalization and metadata defaults
- `sections`: ordered TOC and reading-order groups
- `entries`: source definitions and per-entry overrides
- `link_rules`: cross-link rewriting behavior

Minimal example:

```yaml
book:
  title: "My Book"
  author: "Editor"
  language: "en"
  identifier: "urn:uuid:my-book"

output:
  path: "dist/my-book.epub"

sections:
  - id: "part-1"
    title: "Part 1"
    entries:
      - "essay"

entries:
  essay:
    source:
      kind: "html"
      path: "articles/essay.html"

link_rules:
  mode: "auto"
```

### Top-Level Fields

- `book`:
  EPUB metadata block. Required.
- `output`:
  Output configuration block. Required.
- `defaults`:
  Shared defaults applied across entries. Optional.
- `sections`:
  Ordered list of sections. Required in practice for a useful build.
- `entries`:
  Map of entry IDs to entry definitions. Required in practice.
- `link_rules`:
  Global link rewriting policy. Optional.

### `book`

- `book.title`:
  Required string. Book title.
- `book.author`:
  Optional string. Book-level author.
- `book.language`:
  Optional string. Defaults to `en`.
- `book.identifier`:
  Optional string. Defaults to a generated `urn:uuid:...`.
- `book.description`:
  Optional string. Written into the EPUB package metadata.

Example:

```yaml
book:
  title: "Collected Essays"
  author: "Jane Doe"
  language: "en"
  identifier: "urn:uuid:collected-essays"
  description: "A single-volume EPUB generated by ebookm"
```

### `output`

- `output.path`:
  Required string. Output EPUB path. Resolved relative to the manifest file.
- `output.cover_image`:
  Optional string. Path to a cover image, resolved relative to the manifest file.

Example:

```yaml
output:
  path: "dist/book.epub"
  cover_image: "assets/cover.jpg"
```

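"Resolved relative to the manifest file" can be pictured with the following minimal Rust sketch. The helper name `resolve_from_manifest` is hypothetical and is not taken from the `ebookm` source. How `ebookm` treats absolute paths is not documented here, so the pass-through behavior below is simply what `Path::join` does.

```rust
use std::path::{Path, PathBuf};

/// Resolve a manifest-relative value against the directory that contains
/// the manifest. Hypothetical helper, for illustration only.
fn resolve_from_manifest(manifest_path: &Path, value: &str) -> PathBuf {
    // A bare file name such as "book.yaml" has an empty parent, and joining
    // onto an empty path leaves `value` as it is.
    let base = manifest_path.parent().unwrap_or_else(|| Path::new(""));
    // `Path::join` returns `value` unchanged when `value` is absolute.
    base.join(value)
}
```

Under this reading, a manifest at `examples/example-book.yaml` with an output path of `dist/example-book.epub` (an assumed value) resolves to `examples/dist/example-book.epub`, which matches the location shown in Quick Start.
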
### `defaults`

- `defaults.fetch_images`:
  Optional boolean. Defaults to `true`. When enabled, image assets referenced from article HTML are fetched and bundled into the EPUB.
- `defaults.normalize_substack_embeds`:
  Optional boolean. Defaults to `true`. Currently removes iframe embeds during normalization.
- `defaults.metadata`:
  Optional metadata override block applied after extracted source metadata and before per-entry overrides.
- `defaults.processing`:
  Optional shared article-processing and chapter-header defaults.

Example:

```yaml
defaults:
  fetch_images: true
  normalize_substack_embeds: true
  processing:
    include_author: true
    include_date: true
    include_source_url: true
    skip_first_paragraphs: 0
  metadata:
    author: "Editorial Team"
```

### `defaults.processing` and `entries.<id>.processing`

These fields control chapter-header rendering and article trimming.

For `defaults.processing`, all fields are concrete values with defaults.
For `entries.<id>.processing`, the same fields are optional overrides.

- `include_author`:
  Boolean. Defaults to `true`. Controls whether the extracted or overridden author name is shown at the start of the chapter.
- `include_date`:
  Boolean. Defaults to `true`. Controls whether the extracted or overridden publication date is shown at the start of the chapter.
- `include_source_url`:
  Boolean. Defaults to `true`. Controls whether the canonical article URL is shown at the start of the chapter.
- `skip_first_paragraphs`:
  Integer. Defaults to `0`. Removes the first `n` paragraph elements from the extracted article body before EPUB generation.

Example:

```yaml
defaults:
  processing:
    include_author: true
    include_date: false
    include_source_url: false
    skip_first_paragraphs: 0
```

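The split between concrete defaults and optional per-entry overrides can be modeled roughly as follows. This is a sketch, not the actual `ebookm` types: the struct names and `resolve_processing` are hypothetical, and only the field names and default values come from this section.

```rust
/// Concrete values, mirroring `defaults.processing`.
#[derive(Debug, Clone, PartialEq)]
struct Processing {
    include_author: bool,
    include_date: bool,
    include_source_url: bool,
    skip_first_paragraphs: usize,
}

impl Default for Processing {
    // Documented defaults: all header fields shown, nothing trimmed.
    fn default() -> Self {
        Processing {
            include_author: true,
            include_date: true,
            include_source_url: true,
            skip_first_paragraphs: 0,
        }
    }
}

/// Optional overrides, mirroring `entries.<id>.processing`.
#[derive(Debug, Clone, Default)]
struct ProcessingOverride {
    include_author: Option<bool>,
    include_date: Option<bool>,
    include_source_url: Option<bool>,
    skip_first_paragraphs: Option<usize>,
}

/// A field set on the entry wins; otherwise the shared default applies.
fn resolve_processing(defaults: &Processing, entry: &ProcessingOverride) -> Processing {
    Processing {
        include_author: entry.include_author.unwrap_or(defaults.include_author),
        include_date: entry.include_date.unwrap_or(defaults.include_date),
        include_source_url: entry.include_source_url.unwrap_or(defaults.include_source_url),
        skip_first_paragraphs: entry
            .skip_first_paragraphs
            .unwrap_or(defaults.skip_first_paragraphs),
    }
}
```

With `include_date: false` in the defaults and an entry that sets only `include_source_url: false` and `skip_first_paragraphs: 1`, the resolved values keep the author line, hide the date and source URL, and drop one leading paragraph.
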
### `defaults.metadata` and `entries.<id>.metadata`

These fields use the same shape:

- `author`:
  Optional string.
- `published`:
  Optional date in `YYYY-MM-DD` format.
- `subtitle`:
  Optional string. Accepted by the parser but not yet emitted into the EPUB output.
- `summary`:
  Optional string. Accepted by the parser but not yet emitted into the EPUB output.
- `tags`:
  Optional list of strings. Accepted by the parser but not yet used in link or EPUB output logic.

Example:

```yaml
metadata:
  author: "Jane Doe"
  published: "2025-01-10"
  subtitle: "Notebook entry"
  summary: "A short summary"
  tags: ["essay", "history"]
```

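The documented order (extracted source metadata, then `defaults.metadata`, then `entries.<id>.metadata`) can be sketched as a three-layer overlay. The `Metadata` struct and both function names are hypothetical. Field-by-field merging, and replacing `tags` wholesale rather than concatenating, are assumptions; the README only states the order in which the layers apply.

```rust
/// Shape shared by `defaults.metadata` and `entries.<id>.metadata`.
/// `published` stays a `YYYY-MM-DD` string to keep the sketch dependency-free.
#[derive(Debug, Clone, Default, PartialEq)]
struct Metadata {
    author: Option<String>,
    published: Option<String>,
    subtitle: Option<String>,
    summary: Option<String>,
    tags: Option<Vec<String>>,
}

impl Metadata {
    /// Overlay `over` on `self`: a field set in `over` wins (assumed per-field merge).
    fn overlay(self, over: &Metadata) -> Metadata {
        Metadata {
            author: over.author.clone().or(self.author),
            published: over.published.clone().or(self.published),
            subtitle: over.subtitle.clone().or(self.subtitle),
            summary: over.summary.clone().or(self.summary),
            tags: over.tags.clone().or(self.tags),
        }
    }
}

/// Apply the layers in the documented order: extracted, defaults, per-entry.
fn effective_metadata(extracted: Metadata, defaults: &Metadata, entry: &Metadata) -> Metadata {
    extracted.overlay(defaults).overlay(entry)
}
```

Under these assumptions, a `defaults.metadata.author` of `"Editorial Team"` replaces the author extracted from the article, while a per-entry `published` date replaces the extracted date for that entry only.
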
### `sections`

`sections` is an ordered list. Section order controls reading order and TOC grouping.

Each section supports:

- `id`:
  Required string. Stable section identifier.
- `title`:
  Required string. Section title shown in the TOC.
- `entries`:
  Optional list of entry IDs in reading order. Should normally be non-empty.

Example:

```yaml
sections:
  - id: "part-1"
    title: "Part 1"
    entries:
      - "opening-post"
      - "notes"
```

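Because section order and per-section entry order together define reading order, the flattening step can be sketched as below. The `Section` struct and both functions are hypothetical stand-ins. Whether `validate` reports undefined entry IDs in exactly this way is an assumption; the check is included only to show how `sections` and `entries` relate.

```rust
use std::collections::HashSet;

/// Minimal stand-in for one item of `sections`.
struct Section {
    id: String,
    entries: Vec<String>,
}

/// Flatten sections into reading order: section order first, then entry order.
fn reading_order(sections: &[Section]) -> Vec<&str> {
    sections
        .iter()
        .flat_map(|s| s.entries.iter().map(String::as_str))
        .collect()
}

/// Return (section ID, entry ID) pairs that point at entries missing from `defined`.
fn dangling_references<'a>(
    sections: &'a [Section],
    defined: &HashSet<&str>,
) -> Vec<(&'a str, &'a str)> {
    let mut missing = Vec::new();
    for section in sections {
        for entry_id in &section.entries {
            if !defined.contains(entry_id.as_str()) {
                missing.push((section.id.as_str(), entry_id.as_str()));
            }
        }
    }
    missing
}
```

Here `defined` would hold the keys of the top-level `entries` map.
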
### `entries`

`entries` is a map from entry ID to entry definition. Entry IDs are referenced from `sections` and link rules.

Each entry supports:

- `source`:
  Required source definition block.
- `title`:
  Optional string. Overrides the extracted article title.
- `metadata`:
  Optional metadata override block.
- `toc`:
  Optional TOC override block.
- `links`:
  Optional per-entry link-policy block.
- `processing`:
  Optional per-entry processing override block.

Example:

```yaml
entries:
  opening-post:
    source:
      kind: "substack"
      url: "https://example.substack.com/p/opening-post"
    title: "Opening Post"
    metadata:
      published: "2025-01-10"
    processing:
      include_source_url: false
      skip_first_paragraphs: 1
    toc:
      title: "Introduction"
    links:
      mode: "explicit"
      allow_to: ["notes"]
```

### `entries.<id>.source`

Two source kinds are supported:

- `kind: "substack"`:
  Use a public Substack article URL.
  Fields:
  `url` required string.
- `kind: "html"`:
  Use a local HTML file.
  Fields:
  `path` required string, resolved relative to the manifest file.

Examples:

```yaml
source:
  kind: "substack"
  url: "https://example.substack.com/p/my-post"
```

```yaml
source:
  kind: "html"
  path: "articles/local-post.html"
```

You can mix Substack URLs and local HTML files in the same manifest:

```yaml
entries:
  remote-post:
    source:
      kind: "substack"
      url: "https://example.substack.com/p/remote-post"

  local-post:
    source:
      kind: "html"
      path: "articles/local-post.html"
```

### `entries.<id>.toc`

- `title`:
  Optional string. Overrides the chapter label used in the TOC.
- `hidden`:
  Optional boolean. Defaults to `false`. When `true`, the entry is omitted from the TOC.

Example:

```yaml
toc:
  title: "Appendix"
  hidden: false
```

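As a rough illustration of how the two `toc` fields combine, the sketch below computes the label for one entry. `TocOverride` and `toc_label` are hypothetical names, and `chapter_title` stands for whatever title the entry would otherwise carry (the `title` override or the extracted title).

```rust
/// Mirrors `entries.<id>.toc`.
#[derive(Default)]
struct TocOverride {
    title: Option<String>,
    hidden: Option<bool>,
}

/// Return the label to show in the TOC, or `None` when the entry is hidden.
fn toc_label(toc: &TocOverride, chapter_title: &str) -> Option<String> {
    // `hidden` defaults to `false` when omitted.
    if toc.hidden.unwrap_or(false) {
        return None;
    }
    // `toc.title` overrides the chapter label; otherwise the chapter title is used.
    Some(toc.title.clone().unwrap_or_else(|| chapter_title.to_string()))
}
```
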
### `entries.<id>.links`

- `mode`:
  Optional string. One of:
  `auto`, `explicit`, `none`.
  If omitted, the global `link_rules.mode` is used.
- `allow_to`:
  Optional list of entry IDs. If set, rewritten internal links are limited to these targets.
- `block_to`:
  Optional list of entry IDs. These targets are excluded from rewriting.

Example:

```yaml
links:
  mode: "explicit"
  allow_to: ["intro", "appendix"]
  block_to: ["draft-notes"]
```

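The mode fallback and the allow/block lists can be read as the following sketch. All type and function names are hypothetical. Letting `block_to` win when a target appears in both lists is an assumption, since the README does not say which takes precedence, and the sketch does not model what each mode does beyond selecting it.

```rust
/// The three documented modes. `Off` stands for the manifest value `none`.
#[derive(Debug, Clone, Copy, PartialEq)]
enum LinkMode {
    Auto,
    Explicit,
    Off,
}

/// Mirrors `entries.<id>.links`.
#[derive(Default)]
struct EntryLinks {
    mode: Option<LinkMode>,
    allow_to: Option<Vec<String>>,
    block_to: Vec<String>,
}

/// Per-entry mode if set, otherwise the global `link_rules.mode`.
fn effective_mode(links: &EntryLinks, global: LinkMode) -> LinkMode {
    links.mode.unwrap_or(global)
}

/// Whether a link to `target` passes the entry's allow and block lists.
fn target_permitted(links: &EntryLinks, target: &str) -> bool {
    // Assumption: `block_to` wins when a target appears in both lists.
    if links.block_to.iter().any(|id| id == target) {
        return false;
    }
    match &links.allow_to {
        // When `allow_to` is set, only listed targets are eligible.
        Some(allowed) => allowed.iter().any(|id| id == target),
        // When it is omitted, every target that is not blocked is eligible.
        None => true,
    }
}
```
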
### `link_rules`

- `link_rules.mode`:
  Optional string. One of:
  `auto`, `explicit`, `none`.
  Defaults to `auto`.
- `link_rules.rewrite_external_substack_links`:
  Optional boolean. Defaults to `true`. Accepted by the manifest parser, but not currently used to change behavior in `v0.1`.
- `link_rules.preserve_other_external_links`:
  Optional boolean. Defaults to `true`. Accepted by the manifest parser, but not currently used to change behavior in `v0.1`.
- `link_rules.rules`:
  Optional list of explicit link rules.

Example:

```yaml
link_rules:
  mode: "explicit"
  rewrite_external_substack_links: true
  preserve_other_external_links: true
  rules:
    - from: ["notes"]
      to: ["intro"]
      match_mode: "canonical-url"
```

### `link_rules.rules[]`

Each rule supports:

- `from`:
  Required list of selectors describing where the rule applies.
- `to`:
  Required list of selectors describing eligible targets.
- `match_mode`:
  Optional string. One of:
  `canonical-url`, `source-url`, `disabled`.
  Defaults to `canonical-url`.

Supported selectors in `from` and `to`:

- `*`:
  Match all entries.
- `<entry-id>`:
  Match one entry by ID.
- `section:<section-id>`:
  Match all entries referenced by that section.

Example:

```yaml
link_rules:
  mode: "explicit"
  rules:
    - from: ["section:essays"]
      to: ["section:essays"]
      match_mode: "canonical-url"
```

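The three selector forms can be illustrated with a small matcher. This is a sketch under stated assumptions rather than the `ebookm` implementation: `selector_matches` and `rule_applies` are hypothetical names, a `section:` selector naming an unknown section is assumed to match nothing, and `match_mode` is left out.

```rust
use std::collections::HashMap;

/// Does `selector` match `entry_id`?
/// `sections` maps a section ID to the entry IDs that section references.
fn selector_matches(
    selector: &str,
    entry_id: &str,
    sections: &HashMap<String, Vec<String>>,
) -> bool {
    // `*` matches every entry.
    if selector == "*" {
        return true;
    }
    // `section:<section-id>` matches every entry referenced by that section.
    if let Some(section_id) = selector.strip_prefix("section:") {
        // Assumption: an unknown section ID matches nothing.
        return sections
            .get(section_id)
            .map_or(false, |ids| ids.iter().any(|id| id == entry_id));
    }
    // Anything else is treated as a literal entry ID.
    selector == entry_id
}

/// A rule covers a (source, target) pair when some `from` selector matches
/// the source entry and some `to` selector matches the target entry.
fn rule_applies(
    from: &[&str],
    to: &[&str],
    source: &str,
    target: &str,
    sections: &HashMap<String, Vec<String>>,
) -> bool {
    from.iter().any(|s| selector_matches(s, source, sections))
        && to.iter().any(|s| selector_matches(s, target, sections))
}
```

Read this way, the rule in the example above covers links between any two entries referenced by the `essays` section.
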
## Notes

- Output paths are resolved relative to the manifest file location.
- Local HTML paths are also resolved relative to the manifest file location.
- `sections` and `entries` are deserialized with empty defaults, but `validate` and `build` expect them to be meaningfully populated.
- `subtitle`, `summary`, `tags`, `rewrite_external_substack_links`, and `preserve_other_external_links` are accepted by the manifest parser today but, as noted in their field descriptions, do not yet change the generated output in `v0.1`.
- For Substack sources, `v0.1` assumes public posts. Subscriber-only/session-based fetching is not implemented.