2026-05-25 17:05:15 +02:00

# ebookm
`ebookm` is a Rust command-line tool that compiles a set of Substack posts and local HTML files into a single EPUB.
## Current Scope
`v0.1` supports:
- YAML manifests
- Public Substack post URLs
- Local HTML files
- Manifest-defined section order and TOC structure
- Per-entry metadata and TOC overrides
- Basic internal link rewriting between included entries
- EPUB generation with bundled article assets
## Build
```bash
cargo build
```
## Run
Use the CLI through Cargo:
```bash
cargo run -p ebookm-cli -- <command>
```
Available commands:
- `build -m <manifest>`
- `validate -m <manifest>`
- `inspect <url-or-file>`
- `init`
## Quick Start
This repository includes a runnable example manifest and local HTML fixture in `examples/`.
Validate the example manifest:
```bash
cargo run -p ebookm-cli -- validate -m examples/example-book.yaml
```
Build the example EPUB:
```bash
cargo run -p ebookm-cli -- build -m examples/example-book.yaml
```
The output EPUB will be written to:
```text
examples/dist/example-book.epub
```
Inspect a local HTML file:
```bash
cargo run -p ebookm-cli -- inspect examples/articles/intro.html
```
Generate a starter manifest:
```bash
cargo run -p ebookm-cli -- init
```
## Validate The Output EPUB
There are two useful levels of validation.
Quick XML/XHTML validation for generated chapter files:
```bash
unzip -p path/to/book.epub OEBPS/text/chapter.xhtml | xmllint --noout -
```
If you want to check every generated XHTML file in the EPUB:
```bash
mkdir -p /tmp/ebookm-check
unzip -o path/to/book.epub -d /tmp/ebookm-check
find /tmp/ebookm-check/OEBPS -name '*.xhtml' -print -exec xmllint --noout {} \;
```
Full EPUB validation:
Use `epubcheck`, which validates the EPUB package itself, including metadata, navigation files, manifest/spine consistency, and XHTML correctness.
```bash
epubcheck path/to/book.epub
```
Practical guidance:
- Use `xmllint` when you want to quickly confirm that generated XHTML is well-formed XML.
- Use `epubcheck` when you want proper EPUB-level validation before distributing the file.
- If an EPUB reader only shows part of a chapter, malformed XHTML is a common cause, so `xmllint` on the generated chapter files is a good first check.
## Manifest Reference
The top-level manifest keys are:
- `book`: EPUB metadata
- `output`: output path and optional cover image
- `defaults`: shared normalization and metadata defaults
- `sections`: ordered TOC and reading-order groups
- `entries`: source definitions and per-entry overrides
- `link_rules`: cross-link rewriting behavior
Minimal example:
```yaml
book:
  title: "My Book"
  author: "Editor"
  language: "en"
  identifier: "urn:uuid:my-book"
output:
  path: "dist/my-book.epub"
sections:
  - id: "part-1"
    title: "Part 1"
    entries:
      - "essay"
entries:
  essay:
    source:
      kind: "html"
      path: "articles/essay.html"
link_rules:
  mode: "auto"
```
### Top-Level Fields
- `book`:
EPUB metadata block. Required.
- `output`:
Output configuration block. Required.
- `defaults`:
Shared defaults applied across entries. Optional.
- `sections`:
Ordered list of sections. Required in practice for a useful build.
- `entries`:
Map of entry IDs to entry definitions. Required in practice.
- `link_rules`:
Global link rewriting policy. Optional.
### `book`
- `book.title`:
Required string. Book title.
- `book.author`:
Optional string. Book-level author.
- `book.language`:
Optional string. Defaults to `en`.
- `book.identifier`:
Optional string. Defaults to a generated `urn:uuid:...`.
- `book.description`:
Optional string. Written into the EPUB package metadata.
Example:
```yaml
book:
  title: "Collected Essays"
  author: "Jane Doe"
  language: "en"
  identifier: "urn:uuid:collected-essays"
  description: "A single-volume EPUB generated by ebookm"
```
### `output`
- `output.path`:
Required string. Output EPUB path. Resolved relative to the manifest file.
- `output.cover_image`:
Optional string. Path to a cover image, resolved relative to the manifest file.
Example:
```yaml
output:
  path: "dist/book.epub"
  cover_image: "assets/cover.jpg"
```
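The manifest-relative resolution described above can be sketched with the Rust standard library alone. This is a minimal illustration, not `ebookm`'s actual code: the function name `resolve_from_manifest` is hypothetical, and the sketch assumes the manifest's parent directory is used as the base.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve `relative` against the directory that
/// contains the manifest file. Not part of ebookm's real API.
fn resolve_from_manifest(manifest_path: &Path, relative: &str) -> PathBuf {
    // `parent()` is `Some("")` for a bare file name such as `book.yaml`,
    // so joining still yields a path relative to the current directory.
    let base = manifest_path.parent().unwrap_or_else(|| Path::new(""));
    // `Path::join` returns `relative` unchanged when it is absolute.
    base.join(relative)
}
```

Under these assumptions, a manifest at `examples/example-book.yaml` with an `output.path` of `dist/example-book.epub` would resolve to `examples/dist/example-book.epub`.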
### `defaults`
- `defaults.fetch_images`:
Optional boolean. Defaults to `true`. When enabled, image assets referenced from article HTML are fetched and bundled into the EPUB.
- `defaults.normalize_substack_embeds`:
Optional boolean. Defaults to `true`. Currently removes iframe embeds during normalization.
- `defaults.metadata`:
Optional metadata override block applied after extracted source metadata and before per-entry overrides.
- `defaults.processing`:
Optional shared article-processing and chapter-header defaults.
Example:
```yaml
defaults:
  fetch_images: true
  normalize_substack_embeds: true
  processing:
    include_author: true
    include_date: true
    include_source_url: true
    skip_first_paragraphs: 0
  metadata:
    author: "Editorial Team"
```
### `defaults.processing` and `entries.<id>.processing`
These fields control chapter-header rendering and article trimming.
For `defaults.processing`, all fields are concrete values with defaults.
For `entries.<id>.processing`, the same fields are optional overrides.
- `include_author`:
Boolean. Defaults to `true`. Controls whether the extracted or overridden author name is shown at the start of the chapter.
- `include_date`:
Boolean. Defaults to `true`. Controls whether the extracted or overridden publication date is shown at the start of the chapter.
- `include_source_url`:
Boolean. Defaults to `true`. Controls whether the canonical article URL is shown at the start of the chapter.
- `skip_first_paragraphs`:
Integer. Defaults to `0`. Removes the first `n` paragraph elements from the extracted article body before EPUB generation.
Example:
```yaml
defaults:
  processing:
    include_author: true
    include_date: false
    include_source_url: false
    skip_first_paragraphs: 0
```
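The relationship between concrete defaults and optional per-entry overrides could be modeled roughly as follows. The type and function names (`Processing`, `ProcessingOverride`, `resolve_processing`, `skip_first`) are hypothetical and only mirror the field names documented above. Returning an empty body when `skip_first_paragraphs` exceeds the paragraph count is an assumption, not documented behavior.

```rust
/// Fully resolved settings (mirrors `defaults.processing`).
#[derive(Debug, Clone, Copy, PartialEq)]
struct Processing {
    include_author: bool,
    include_date: bool,
    include_source_url: bool,
    skip_first_paragraphs: usize,
}

impl Default for Processing {
    // Default values as listed in the field reference.
    fn default() -> Self {
        Processing {
            include_author: true,
            include_date: true,
            include_source_url: true,
            skip_first_paragraphs: 0,
        }
    }
}

/// Optional per-entry overrides (mirrors `entries.<id>.processing`).
#[derive(Debug, Clone, Copy, Default)]
struct ProcessingOverride {
    include_author: Option<bool>,
    include_date: Option<bool>,
    include_source_url: Option<bool>,
    skip_first_paragraphs: Option<usize>,
}

/// Any field set on the entry wins; otherwise the shared default is kept.
fn resolve_processing(defaults: Processing, entry: ProcessingOverride) -> Processing {
    Processing {
        include_author: entry.include_author.unwrap_or(defaults.include_author),
        include_date: entry.include_date.unwrap_or(defaults.include_date),
        include_source_url: entry.include_source_url.unwrap_or(defaults.include_source_url),
        skip_first_paragraphs: entry
            .skip_first_paragraphs
            .unwrap_or(defaults.skip_first_paragraphs),
    }
}

/// Drop the first `n` paragraphs. Assumption: if `n` exceeds the number of
/// paragraphs, the result is simply empty.
fn skip_first<T>(paragraphs: &[T], n: usize) -> &[T] {
    &paragraphs[n.min(paragraphs.len())..]
}
```

In this sketch, an entry that sets only `include_source_url: false` and `skip_first_paragraphs: 1` keeps the default `true` for `include_author` and `include_date`.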
### `defaults.metadata` and `entries.<id>.metadata`
These fields use the same shape:
- `author`:
Optional string.
- `published`:
Optional date in `YYYY-MM-DD` format.
- `subtitle`:
Optional string. Accepted by the parser but not yet emitted into the EPUB output.
- `summary`:
Optional string. Accepted by the parser but not yet emitted into the EPUB output.
- `tags`:
Optional list of strings. Accepted by the parser but not yet used in link or EPUB output logic.
Example:
```yaml
metadata:
  author: "Jane Doe"
  published: "2025-01-10"
  subtitle: "Notebook entry"
  summary: "A short summary"
  tags: ["essay", "history"]
```
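The layering order stated for `defaults.metadata` (extracted source metadata first, then defaults, then per-entry overrides) can be sketched as a simple overlay. The `Metadata` type below is a hypothetical, trimmed-down model with only two of the documented fields, and `effective_metadata` is an illustrative name rather than an `ebookm` function.

```rust
/// Hypothetical, trimmed-down metadata block; every field is optional.
#[derive(Debug, Clone, Default, PartialEq)]
struct Metadata {
    author: Option<String>,
    published: Option<String>, // `YYYY-MM-DD`
}

impl Metadata {
    /// Overlay `over` on top of `self`: any field set in `over` wins.
    fn overlaid_with(&self, over: &Metadata) -> Metadata {
        Metadata {
            author: over.author.clone().or_else(|| self.author.clone()),
            published: over.published.clone().or_else(|| self.published.clone()),
        }
    }
}

/// Apply the documented order: extracted, then defaults, then entry.
fn effective_metadata(extracted: &Metadata, defaults: &Metadata, entry: &Metadata) -> Metadata {
    extracted.overlaid_with(defaults).overlaid_with(entry)
}
```

Under this model, a default `author` replaces the author extracted from the source, and an entry-level `published` value replaces both the extracted and the default date.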
### `sections`
`sections` is an ordered list. Section order controls reading order and TOC grouping.
Each section supports:
- `id`:
Required string. Stable section identifier.
- `title`:
Required string. Section title shown in the TOC.
- `entries`:
Optional list of entry IDs in reading order. In practice, it should not be empty.
Example:
```yaml
sections:
  - id: "part-1"
    title: "Part 1"
    entries:
      - "opening-post"
      - "notes"
```
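As an illustration of how section order maps to reading order, the sketch below flattens sections into a single sequence and lists section references that point at undefined entries. The `Section` struct and both function names are hypothetical, and the reference check is an assumption about what a validator might do, not a description of `validate`.

```rust
/// Only the fields needed for ordering are modeled here.
struct Section {
    id: String,
    entries: Vec<String>,
}

/// Flatten sections into one reading order: section order first,
/// then entry order within each section.
fn reading_order(sections: &[Section]) -> Vec<&str> {
    sections
        .iter()
        .flat_map(|section| section.entries.iter().map(String::as_str))
        .collect()
}

/// Collect `(section id, entry id)` pairs whose entry ID is not defined.
fn dangling_references<'a>(
    sections: &'a [Section],
    defined: &[&str],
) -> Vec<(&'a str, &'a str)> {
    let mut missing = Vec::new();
    for section in sections {
        for entry in &section.entries {
            if !defined.iter().any(|id| *id == entry.as_str()) {
                missing.push((section.id.as_str(), entry.as_str()));
            }
        }
    }
    missing
}
```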
### `entries`
`entries` is a map from entry ID to entry definition. Entry IDs are referenced from `sections` and link rules.
Each entry supports:
- `source`:
Required source definition block.
- `title`:
Optional string. Overrides the extracted article title.
- `metadata`:
Optional metadata override block.
- `toc`:
Optional TOC override block.
- `links`:
Optional per-entry link-policy block.
- `processing`:
Optional per-entry processing override block.
Example:
```yaml
entries:
  opening-post:
    source:
      kind: "substack"
      url: "https://example.substack.com/p/opening-post"
    title: "Opening Post"
    metadata:
      published: "2025-01-10"
    processing:
      include_source_url: false
      skip_first_paragraphs: 1
    toc:
      title: "Introduction"
    links:
      mode: "explicit"
      allow_to: ["notes"]
```
### `entries.<id>.source`
Two source kinds are supported:
- `kind: "substack"`:
Use a public Substack article URL.
Fields:
`url` required string.
- `kind: "html"`:
Use a local HTML file.
Fields:
`path` required string, resolved relative to the manifest file.
Examples:
```yaml
source:
  kind: "substack"
  url: "https://example.substack.com/p/my-post"
```
```yaml
source:
  kind: "html"
  path: "articles/local-post.html"
```
You can mix Substack URLs and local HTML files in the same manifest:
```yaml
entries:
  remote-post:
    source:
      kind: "substack"
      url: "https://example.substack.com/p/remote-post"
  local-post:
    source:
      kind: "html"
      path: "articles/local-post.html"
```
### `entries.<id>.toc`
- `title`:
Optional string. Overrides the chapter label used in the TOC.
- `hidden`:
Optional boolean. Defaults to `false`. When `true`, the entry is omitted from the TOC.
Example:
```yaml
toc:
  title: "Appendix"
  hidden: false
```
### `entries.<id>.links`
- `mode`:
Optional string. One of:
`auto`, `explicit`, `none`.
If omitted, the global `link_rules.mode` is used.
- `allow_to`:
Optional list of entry IDs. If set, rewritten internal links are limited to these targets.
- `block_to`:
Optional list of entry IDs. These targets are excluded from rewriting.
Example:
```yaml
links:
  mode: "explicit"
  allow_to: ["intro", "appendix"]
  block_to: ["draft-notes"]
```
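One way to read the per-entry link policy is sketched below: the entry's `mode` falls back to the global mode when omitted, and `allow_to` and `block_to` filter rewrite targets. All names here are hypothetical. Letting `block_to` take precedence over `allow_to` is an assumption, since the reference above does not say how the two interact.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum LinkMode {
    Auto,
    Explicit,
    Off, // corresponds to `none` in the manifest
}

/// Hypothetical model of `entries.<id>.links`.
#[derive(Default)]
struct EntryLinks {
    mode: Option<LinkMode>,
    allow_to: Option<Vec<String>>,
    block_to: Vec<String>,
}

/// Use the entry's own mode if set, else the global `link_rules.mode`.
fn effective_mode(global: LinkMode, entry: &EntryLinks) -> LinkMode {
    entry.mode.unwrap_or(global)
}

/// Decide whether a link to `target` may be rewritten for this entry.
fn may_rewrite_to(entry: &EntryLinks, target: &str) -> bool {
    // Assumption: a blocked target is excluded even if it is also allowed.
    if entry.block_to.iter().any(|id| id == target) {
        return false;
    }
    match &entry.allow_to {
        // With an allow-list, only listed targets qualify.
        Some(allowed) => allowed.iter().any(|id| id == target),
        // Without one, every non-blocked target qualifies.
        None => true,
    }
}
```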
### `link_rules`
- `link_rules.mode`:
Optional string. One of:
`auto`, `explicit`, `none`.
Defaults to `auto`.
- `link_rules.rewrite_external_substack_links`:
Optional boolean. Defaults to `true`. Accepted by the manifest parser, but not currently used to change behavior in `v0.1`.
- `link_rules.preserve_other_external_links`:
Optional boolean. Defaults to `true`. Accepted by the manifest parser, but not currently used to change behavior in `v0.1`.
- `link_rules.rules`:
Optional list of explicit link rules.
Example:
```yaml
link_rules:
  mode: "explicit"
  rewrite_external_substack_links: true
  preserve_other_external_links: true
  rules:
    - from: ["notes"]
      to: ["intro"]
      match_mode: "canonical-url"
```
### `link_rules.rules[]`
Each rule supports:
- `from`:
Required list of selectors describing where the rule applies.
- `to`:
Required list of selectors describing eligible targets.
- `match_mode`:
Optional string. One of:
`canonical-url`, `source-url`, `disabled`.
Defaults to `canonical-url`.
Supported selectors in `from` and `to`:
- `*`:
Match all entries.
- `<entry-id>`:
Match one entry by ID.
- `section:<section-id>`:
Match all entries referenced by that section.
Example:
```yaml
link_rules:
  mode: "explicit"
  rules:
    - from: ["section:essays"]
      to: ["section:essays"]
      match_mode: "canonical-url"
```
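The three selector forms (`*`, `<entry-id>`, and `section:<section-id>`) could be matched along these lines. This is a sketch under stated assumptions: `selector_matches` and `rule_applies` are hypothetical names, sections are modeled as a plain map from section ID to entry IDs, and a rule is assumed to apply when any `from` selector matches the linking entry and any `to` selector matches the target.

```rust
use std::collections::HashMap;

/// Check one selector against one entry ID.
/// `sections` maps a section ID to the entry IDs it references.
fn selector_matches(
    selector: &str,
    entry_id: &str,
    sections: &HashMap<&str, Vec<&str>>,
) -> bool {
    if selector == "*" {
        return true; // wildcard: every entry
    }
    if let Some(section_id) = selector.strip_prefix("section:") {
        // Section selector: entry must be listed in that section.
        return sections
            .get(section_id)
            .map_or(false, |entries| entries.iter().any(|e| *e == entry_id));
    }
    // Otherwise the selector is a plain entry ID.
    selector == entry_id
}

/// Assumed semantics: some `from` selector matches the source entry
/// and some `to` selector matches the target entry.
fn rule_applies(
    from: &[&str],
    to: &[&str],
    source: &str,
    target: &str,
    sections: &HashMap<&str, Vec<&str>>,
) -> bool {
    from.iter().any(|s| selector_matches(s, source, sections))
        && to.iter().any(|s| selector_matches(s, target, sections))
}
```

With a section `essays` containing `intro` and `notes`, the rule in the example above would cover a link from `intro` to `notes` but not a link to an entry outside that section.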
## Notes
- Output paths are resolved relative to the manifest file location.
- Local HTML paths are also resolved relative to the manifest file location.
- `sections` and `entries` are deserialized with empty defaults, but `validate` and `build` expect them to be meaningfully populated.
- `subtitle`, `summary`, `tags`, `rewrite_external_substack_links`, and `preserve_other_external_links` are accepted by the manifest parser but do not yet affect the generated EPUB or link rewriting in `v0.1`.
- For Substack sources, `v0.1` assumes public posts. Subscriber-only/session-based fetching is not implemented.