initial commit

2026-05-25 17:05:15 +02:00
commit 6ebe505a07
25 changed files with 5929 additions and 0 deletions
@@ -0,0 +1 @@
+/target
@@ -0,0 +1,9 @@
+[workspace]
+members = ["ebookm-core", "ebookm-cli"]
+resolver = "2"
+
+[workspace.package]
+edition = "2024"
+license = "MIT"
+version = "0.1.0"
+
@@ -0,0 +1,482 @@
+# ebookm
+
+`ebookm` is a Rust command-line tool that compiles a set of Substack posts and local HTML files into a single EPUB.
+
+## Current Scope
+
+`v0.1` supports:
+
+- YAML manifests
+- Public Substack post URLs
+- Local HTML files
+- Manifest-defined section order and TOC structure
+- Per-entry metadata and TOC overrides
+- Basic internal link rewriting between included entries
+- EPUB generation with bundled article assets
+
+## Build
+
+```bash
+cargo build
+```
+
+## Run
+
+Use the CLI through Cargo:
+
+```bash
+cargo run -p ebookm-cli -- <command>
+```
+
+Available commands:
+
+- `build -m <manifest>`
+- `validate -m <manifest>`
+- `inspect <url-or-file>`
+- `init`
+
+## Quick Start
+
+This repository includes a runnable example manifest and local HTML fixture in `examples/`.
+
+Validate the example manifest:
+
+```bash
+cargo run -p ebookm-cli -- validate -m examples/example-book.yaml
+```
+
+Build the example EPUB:
+
+```bash
+cargo run -p ebookm-cli -- build -m examples/example-book.yaml
+```
+
+The output EPUB will be written to:
+
+```text
+examples/dist/example-book.epub
+```
+
+Inspect a local HTML file:
+
+```bash
+cargo run -p ebookm-cli -- inspect examples/articles/intro.html
+```
+
+Generate a starter manifest:
+
+```bash
+cargo run -p ebookm-cli -- init
+```
+
+## Validate The Output EPUB
+
+There are two useful levels of validation.
+
+Quick XML/XHTML validation for generated chapter files:
+
+```bash
+unzip -p path/to/book.epub OEBPS/text/chapter.xhtml | xmllint --noout -
+```
+
+If you want to check every generated XHTML file in the EPUB:
+
+```bash
+mkdir -p /tmp/ebookm-check
+unzip -o path/to/book.epub -d /tmp/ebookm-check
+find /tmp/ebookm-check/OEBPS -name '*.xhtml' -print -exec xmllint --noout {} \;
+```
+
+Full EPUB validation:
+
+Use `epubcheck`, which validates the EPUB package itself, including metadata, navigation files, manifest/spine consistency, and XHTML correctness.
+
+```bash
+epubcheck path/to/book.epub
+```
+
+Practical guidance:
+
+- Use `xmllint` when you want to quickly confirm that generated XHTML is well-formed XML.
+- Use `epubcheck` when you want proper EPUB-level validation before distributing the file.
+- If an EPUB reader only shows part of a chapter, malformed XHTML is a common cause, so `xmllint` on the generated chapter files is a good first check.
+
+## Manifest Reference
+
+The top-level manifest keys are:
+
+- `book`: EPUB metadata
+- `output`: output path and optional cover image
+- `defaults`: shared normalization and metadata defaults
+- `sections`: ordered TOC and reading-order groups
+- `entries`: source definitions and per-entry overrides
+- `link_rules`: cross-link rewriting behavior
+
+Minimal example:
+
+```yaml
+book:
+  title: "My Book"
+  author: "Editor"
+  language: "en"
+  identifier: "urn:uuid:my-book"
+
+output:
+  path: "dist/my-book.epub"
+
+sections:
+  - id: "part-1"
+    title: "Part 1"
+    entries:
+      - "essay"
+
+entries:
+  essay:
+    source:
+      kind: "html"
+      path: "articles/essay.html"
+
+link_rules:
+  mode: "auto"
+```
+
+### Top-Level Fields
+
+- `book`:
+  EPUB metadata block. Required.
+- `output`:
+  Output configuration block. Required.
+- `defaults`:
+  Shared defaults applied across entries. Optional.
+- `sections`:
+  Ordered list of sections. Required in practice for a useful build.
+- `entries`:
+  Map of entry IDs to entry definitions. Required in practice.
+- `link_rules`:
+  Global link rewriting policy. Optional.
+
+### `book`
+
+- `book.title`:
+  Required string. Book title.
+- `book.author`:
+  Optional string. Book-level author.
+- `book.language`:
+  Optional string. Defaults to `en`.
+- `book.identifier`:
+  Optional string. Defaults to a generated `urn:uuid:...`.
+- `book.description`:
+  Optional string. Written into the EPUB package metadata.
+
+Example:
+
+```yaml
+book:
+  title: "Collected Essays"
+  author: "Jane Doe"
+  language: "en"
+  identifier: "urn:uuid:collected-essays"
+  description: "A single-volume EPUB generated by ebookm"
+```
+
+### `output`
+
+- `output.path`:
+  Required string. Output EPUB path. Resolved relative to the manifest file.
+- `output.cover_image`:
+  Optional string. Path to a cover image, resolved relative to the manifest file.
+
+Example:
+
+```yaml
+output:
+  path: "dist/book.epub"
+  cover_image: "assets/cover.jpg"
+```
+
+### `defaults`
+
+- `defaults.fetch_images`:
+  Optional boolean. Defaults to `true`. When enabled, image assets referenced from article HTML are fetched and bundled into the EPUB.
+- `defaults.normalize_substack_embeds`:
+  Optional boolean. Defaults to `true`. Currently removes iframe embeds during normalization.
+- `defaults.metadata`:
+  Optional metadata override block applied after extracted source metadata and before per-entry overrides.
+- `defaults.processing`:
+  Optional shared article-processing and chapter-header defaults.
+
+Example:
+
+```yaml
+defaults:
+  fetch_images: true
+  normalize_substack_embeds: true
+  processing:
+    include_author: true
+    include_date: true
+    include_source_url: true
+    skip_first_paragraphs: 0
+  metadata:
+    author: "Editorial Team"
+```
+
+### `defaults.processing` and `entries.<id>.processing`
+
+These fields control chapter-header rendering and article trimming.
+
+For `defaults.processing`, all fields are concrete values with defaults.
+For `entries.<id>.processing`, the same fields are optional overrides.
+
+- `include_author`:
+  Boolean. Defaults to `true`. Controls whether the extracted or overridden author name is shown at the start of the chapter.
+- `include_date`:
+  Boolean. Defaults to `true`. Controls whether the extracted or overridden publication date is shown at the start of the chapter.
+- `include_source_url`:
+  Boolean. Defaults to `true`. Controls whether the canonical article URL is shown at the start of the chapter.
+- `skip_first_paragraphs`:
+  Integer. Defaults to `0`. Removes the first `n` paragraph elements from the extracted article body before EPUB generation.
+
+Example:
+
+```yaml
+defaults:
+  processing:
+    include_author: true
+    include_date: false
+    include_source_url: false
+    skip_first_paragraphs: 0
+```
+
+### `defaults.metadata` and `entries.<id>.metadata`
+
+These fields use the same shape:
+
+- `author`:
+  Optional string.
+- `published`:
+  Optional date in `YYYY-MM-DD` format.
+- `subtitle`:
+  Optional string. Accepted by the parser but not yet emitted into the EPUB output.
+- `summary`:
+  Optional string. Accepted by the parser but not yet emitted into the EPUB output.
+- `tags`:
+  Optional list of strings. Accepted by the parser but not yet used in link or EPUB output logic.
+
+Example:
+
+```yaml
+metadata:
+  author: "Jane Doe"
+  published: "2025-01-10"
+  subtitle: "Notebook entry"
+  summary: "A short summary"
+  tags: ["essay", "history"]
+```
+
+### `sections`
+
+`sections` is an ordered list. Section order controls reading order and TOC grouping.
+
+Each section supports:
+
+- `id`:
+  Required string. Stable section identifier.
+- `title`:
+  Required string. Section title shown in the TOC.
+- `entries`:
+  Optional list of entry IDs in reading order. Usually should not be empty.
+
+Example:
+
+```yaml
+sections:
+  - id: "part-1"
+    title: "Part 1"
+    entries:
+      - "opening-post"
+      - "notes"
+```
+
+### `entries`
+
+`entries` is a map from entry ID to entry definition. Entry IDs are referenced from `sections` and link rules.
+
+Each entry supports:
+
+- `source`:
+  Required source definition block.
+- `title`:
+  Optional string. Overrides the extracted article title.
+- `metadata`:
+  Optional metadata override block.
+- `toc`:
+  Optional TOC override block.
+- `links`:
+  Optional per-entry link-policy block.
+- `processing`:
+  Optional per-entry processing override block.
+
+Example:
+
+```yaml
+entries:
+  opening-post:
+    source:
+      kind: "substack"
+      url: "https://example.substack.com/p/opening-post"
+    title: "Opening Post"
+    metadata:
+      published: "2025-01-10"
+    processing:
+      include_source_url: false
+      skip_first_paragraphs: 1
+    toc:
+      title: "Introduction"
+    links:
+      mode: "explicit"
+      allow_to: ["notes"]
+```
+
+### `entries.<id>.source`
+
+Two source kinds are supported:
+
+- `kind: "substack"`:
+  Use a public Substack article URL.
+  Fields:
+  `url` required string.
+- `kind: "html"`:
+  Use a local HTML file.
+  Fields:
+  `path` required string, resolved relative to the manifest file.
+
+Examples:
+
+```yaml
+source:
+  kind: "substack"
+  url: "https://example.substack.com/p/my-post"
+```
+
+```yaml
+source:
+  kind: "html"
+  path: "articles/local-post.html"
+```
+
+You can mix Substack URLs and local HTML files in the same manifest:
+
+```yaml
+entries:
+  remote-post:
+    source:
+      kind: "substack"
+      url: "https://example.substack.com/p/remote-post"
+
+  local-post:
+    source:
+      kind: "html"
+      path: "articles/local-post.html"
+```
+
+### `entries.<id>.toc`
+
+- `title`:
+  Optional string. Overrides the chapter label used in the TOC.
+- `hidden`:
+  Optional boolean. Defaults to `false`. When `true`, the entry is omitted from the TOC.
+
+Example:
+
+```yaml
+toc:
+  title: "Appendix"
+  hidden: false
+```
+
+### `entries.<id>.links`
+
+- `mode`:
+  Optional string. One of:
+  `auto`, `explicit`, `none`.
+  If omitted, the global `link_rules.mode` is used.
+- `allow_to`:
+  Optional list of entry IDs. If set, rewritten internal links are limited to these targets.
+- `block_to`:
+  Optional list of entry IDs. These targets are excluded from rewriting.
+
+Example:
+
+```yaml
+links:
+  mode: "explicit"
+  allow_to: ["intro", "appendix"]
+  block_to: ["draft-notes"]
+```
+
+### `link_rules`
+
+- `link_rules.mode`:
+  Optional string. One of:
+  `auto`, `explicit`, `none`.
+  Defaults to `auto`.
+- `link_rules.rewrite_external_substack_links`:
+  Optional boolean. Defaults to `true`. Accepted by the manifest parser, but not currently used to change behavior in `v0.1`.
+- `link_rules.preserve_other_external_links`:
+  Optional boolean. Defaults to `true`. Accepted by the manifest parser, but not currently used to change behavior in `v0.1`.
+- `link_rules.rules`:
+  Optional list of explicit link rules.
+
+Example:
+
+```yaml
+link_rules:
+  mode: "explicit"
+  rewrite_external_substack_links: true
+  preserve_other_external_links: true
+  rules:
+    - from: ["notes"]
+      to: ["intro"]
+      match_mode: "canonical-url"
+```
+
+### `link_rules.rules[]`
+
+Each rule supports:
+
+- `from`:
+  Required list of selectors describing where the rule applies.
+- `to`:
+  Required list of selectors describing eligible targets.
+- `match_mode`:
+  Optional string. One of:
+  `canonical-url`, `source-url`, `disabled`.
+  Defaults to `canonical-url`.
+
+Supported selectors in `from` and `to`:
+
+- `*`:
+  Match all entries.
+- `<entry-id>`:
+  Match one entry by ID.
+- `section:<section-id>`:
+  Match all entries referenced by that section.
+
+Example:
+
+```yaml
+link_rules:
+  mode: "explicit"
+  rules:
+    - from: ["section:essays"]
+      to: ["section:essays"]
+      match_mode: "canonical-url"
+```
+
+## Notes
+
+- Output paths are resolved relative to the manifest file location.
+- Local HTML paths are also resolved relative to the manifest file location.
+- `sections` and `entries` are deserialized with empty defaults, but `validate` and `build` expect them to be meaningfully populated.
+- `subtitle`, `summary`, `tags`, `rewrite_external_substack_links`, and `preserve_other_external_links` are accepted today but only partially wired into runtime behavior.
+- For Substack sources, `v0.1` assumes public posts. Subscriber-only/session-based fetching is not implemented.
@@ -0,0 +1,76 @@
+book:
+  title: "Age of Peace"
+  author: "John Gu"
+  language: "en"
+  identifier: "urn:uuid:ageofpeace:johngu"
+  description: "Age of Peace: a novel"
+
+output:
+  path: "AgeOfPeace.epub"
+  cover_image: "age_of_peace_cover.jpg"
+
+defaults:
+  metadata:
+    author: "John Gu"
+  fetch_images: true
+  normalize_substack_embeds: true
+  processing:
+    include_author: false
+    include_date: false
+    include_source_url: false
+
+sections:
+  - id: "part-0"
+    title: "Prelude"
+    entries:
+      - "intro"
+  - id: "part-1"
+    title: "Overture"
+    entries:
+      - "contested_island"
+  - id: "part-2"
+    title: "The High Castle"
+    entries:
+      - "vira"
+      - "madelyna"
+      - "biridana"
+  - id: "part-3"
+    title: "Nameless Country"
+    entries: []
+      
+
+entries:
+  intro:
+    source:
+      kind: "html"
+      path: "ageofpeace/introduction.html"
+  contested_island:
+    source:
+      kind: "substack"
+      url: "https://ageofpeace.substack.com/p/a-contested-island"
+    toc:
+      title: "A Contested Island"
+    processing:
+      skip_first_paragraphs: 1
+  vira:
+    source:
+      kind: "substack"
+      url: "https://ageofpeace.substack.com/p/vira"
+    toc:
+      title: "Vira"
+  madelyna:
+    source:
+      kind: "substack"
+      url: "https://ageofpeace.substack.com/p/madelyna"
+    toc:
+      title: "Madelỳna"
+  biridana:
+    source:
+      kind: "substack"
+      url: "https://ageofpeace.substack.com/p/biridana"
+    toc:
+      title: "Biridana"
+      
+
+link_rules:
+  mode: "auto"
@@ -0,0 +1,33 @@
+<!doctype html>
+<html>
+  <head>
+    <title>Introduction</title>
+    <meta name="author" content="John Gu" />
+    <meta property="article:published_time" content="2025-07-05T00:00:00Z" />
+    <link rel="canonical" href="https://ageofpeace.substack.com/p/introduction-and-welcome" />
+  </head>
+  <body>
+    <article>
+      <p><em>After securing a teaching job at a foreign university on “the thinnest set of credentials,” a young man sets off for life in Varrenia, an impoverished eastern kingdom still emerging from the shadow of a decades-long dictatorship. Years later, living in the decadent capital of Garamdal, our protagonist watches a war unfold in the republic’s restive eastern provinces and reflects on what he has gained — and lost — in a life of travel.</em></p>
+      <h2>About the author</h2>
+      <p>
+In my late twenties, I did a short stint as a grad student in mathematical logic at the University of Amsterdam, where I learned that I am <em>not</em> smart enough to be a mathematician. After dropping out of grad school, I ended up staying in Europe for five years. This novel, greatly influenced by that experience, is my love letter to Europe and to that time.
+      </p>
+      <img src="johngu.jpg" alt="John Gu" />
+      <p>I am also very proud to be able say that I grew up in Houston — I actually come from the same neighborhood as Lizzo, Mo Amer, and Tila Tequila, cultural luminaries all.</p>
+      <h2>Previous publications</h2>
+      <p> TBD </p>
+
+      <h2>Influences</h2>
+      <p>Some readers of my work have remarked that it shares an affinity with the following writers and books. In some cases, these figures represent inspirations that I have leaned into, in others, the similarities (of theme, style, subject matter) are more coincidental:</p>
+      <ul>
+        <li>In Patagonia, Bruce Chatwin</li>
+        <li>Waiting for the Barbarians, J.M. Coetzee</li>
+        <li>Balkan Ghosts, Robert Kaplan</li>
+        <li>A Bend in the River, V.S. Naipaul</li>
+        <li>Paul Bowles</li>
+        <li>Milan Kundera</li>
+      </ul>
+    </article>
+  </body>
+</html>
@@ -0,0 +1,10 @@
+[package]
+name = "ebookm-cli"
+version = "0.1.0"
+edition = "2024"
+
+[dependencies]
+clap = { version = "4.5", features = ["derive"] }
+ebookm-core = { path = "../ebookm-core" }
+miette = { version = "7.2", features = ["fancy"] }
+serde_json = "1.0"
@@ -0,0 +1,89 @@
+use std::path::PathBuf;
+
+use clap::{Parser, Subcommand};
+use ebookm_core::{
+    build_epub, inspect_source, load_manifest, render_init_manifest, validate_manifest,
+};
+use miette::{Context, IntoDiagnostic};
+
+#[derive(Debug, Parser)]
+#[command(
+    name = "ebookm",
+    version,
+    about = "Compile Substack articles into a single EPUB"
+)]
+struct Cli {
+    #[command(subcommand)]
+    command: Commands,
+}
+
+#[derive(Debug, Subcommand)]
+enum Commands {
+    Build {
+        #[arg(short, long)]
+        manifest: PathBuf,
+        #[arg(short, long)]
+        output: Option<PathBuf>,
+    },
+    Validate {
+        #[arg(short, long)]
+        manifest: PathBuf,
+    },
+    Inspect {
+        source: String,
+        #[arg(long, default_value = "json")]
+        format: String,
+    },
+    Init,
+}
+
+fn main() -> miette::Result<()> {
+    let cli = Cli::parse();
+
+    match cli.command {
+        Commands::Build { manifest, output } => {
+            let mut loaded = load_manifest(&manifest).into_diagnostic()?;
+            if let Some(output) = output {
+                loaded.output.path = output.display().to_string();
+            }
+            let warnings = validate_manifest(&loaded).into_diagnostic()?;
+            for warning in warnings {
+                eprintln!("warning: {warning}");
+            }
+            build_epub(&loaded, &manifest).into_diagnostic()?;
+            println!("{}", loaded.output.path);
+        }
+        Commands::Validate { manifest } => {
+            let loaded = load_manifest(&manifest).into_diagnostic()?;
+            let warnings = validate_manifest(&loaded).into_diagnostic()?;
+            for warning in warnings {
+                println!("warning: {warning}");
+            }
+            println!("manifest is valid");
+        }
+        Commands::Inspect { source, format } => {
+            let result = inspect_source(&source).into_diagnostic()?;
+            if format == "json" {
+                println!(
+                    "{}",
+                    serde_json::to_string_pretty(&result)
+                        .into_diagnostic()
+                        .wrap_err("failed to encode JSON")?
+                );
+            } else {
+                println!("title: {}", result.title.unwrap_or_default());
+                println!("author: {}", result.author.unwrap_or_default());
+                println!("published: {}", result.published.unwrap_or_default());
+                println!(
+                    "canonical_url: {}",
+                    result.canonical_url.unwrap_or_default()
+                );
+            }
+        }
+        Commands::Init => {
+            print!("{}", render_init_manifest());
+        }
+    }
+
+    Ok(())
+}
@@ -0,0 +1,25 @@
+[package]
+name = "ebookm-core"
+version = "0.1.0"
+edition = "2024"
+
+[dependencies]
+chrono = { version = "0.4", features = ["serde"] }
+indexmap = { version = "2.7", features = ["serde"] }
+kuchiki = "0.8"
+miette = { version = "7.2", features = ["fancy"] }
+quick-xml = "0.38"
+regex = "1.11"
+reqwest = { version = "0.12", default-features = false, features = ["blocking", "rustls-tls"] }
+scraper = "0.24"
+serde = { version = "1.0", features = ["derive"] }
+serde_json = "1.0"
+serde_yaml = "0.9"
+sha1 = "0.10"
+thiserror = "2.0"
+url = { version = "2.5", features = ["serde"] }
+uuid = { version = "1.18", features = ["v4"] }
+zip = "4.6"
+
+[dev-dependencies]
+tempfile = "3.15"
@@ -0,0 +1,298 @@
+use std::collections::BTreeSet;
+use std::fs::File;
+use std::io::Write;
+use std::path::Path;
+
+use quick_xml::escape::escape;
+use zip::CompressionMethod;
+use zip::write::{SimpleFileOptions, ZipWriter};
+
+use crate::error::{EbookmError, Result};
+use crate::pipeline::BuiltEntry;
+
+pub fn write_epub(
+    manifest: &crate::manifest::Manifest,
+    built: &[BuiltEntry],
+    output_path: &Path,
+    cover_bytes: Option<(String, Vec<u8>)>,
+) -> Result<()> {
+    if let Some(parent) = output_path.parent() {
+        std::fs::create_dir_all(parent).map_err(|source| EbookmError::Io {
+            path: parent.display().to_string(),
+            source,
+        })?;
+    }
+
+    let file = File::create(output_path).map_err(|source| EbookmError::Io {
+        path: output_path.display().to_string(),
+        source,
+    })?;
+    let mut zip = ZipWriter::new(file);
+
+    let stored = SimpleFileOptions::default().compression_method(CompressionMethod::Stored);
+    zip.start_file("mimetype", stored)
+        .map_err(|error| EbookmError::Epub {
+            message: error.to_string(),
+        })?;
+    zip.write_all(b"application/epub+zip")
+        .map_err(|error| EbookmError::Epub {
+            message: error.to_string(),
+        })?;
+
+    let deflated = SimpleFileOptions::default().compression_method(CompressionMethod::Deflated);
+    write_file(&mut zip, "META-INF/container.xml", deflated, CONTAINER_XML)?;
+    write_file(&mut zip, "OEBPS/styles/book.css", deflated, DEFAULT_STYLES)?;
+
+    let nav = build_nav(manifest, built);
+    let ncx = build_ncx(manifest, built);
+    let opf = build_opf(
+        manifest,
+        built,
+        cover_bytes.as_ref().map(|(href, _)| href.as_str()),
+    );
+
+    write_file(&mut zip, "OEBPS/nav.xhtml", deflated, &nav)?;
+    write_file(&mut zip, "OEBPS/toc.ncx", deflated, &ncx)?;
+    write_file(&mut zip, "OEBPS/content.opf", deflated, &opf)?;
+
+    if let Some((href, bytes)) = cover_bytes {
+        write_bytes(&mut zip, &format!("OEBPS/{href}"), deflated, &bytes)?;
+    }
+
+    let mut seen_assets = BTreeSet::new();
+    for entry in built {
+        write_file(
+            &mut zip,
+            &format!("OEBPS/text/{}.xhtml", entry.id),
+            deflated,
+            &entry.chapter.xhtml,
+        )?;
+        for asset in &entry.assets {
+            if seen_assets.insert(asset.href.clone()) {
+                write_bytes(
+                    &mut zip,
+                    &format!("OEBPS/{}", asset.href),
+                    deflated,
+                    &asset.bytes,
+                )?;
+            }
+        }
+    }
+
+    zip.finish().map_err(|error| EbookmError::Epub {
+        message: error.to_string(),
+    })?;
+    Ok(())
+}
+
+fn write_file(
+    zip: &mut ZipWriter<File>,
+    path: &str,
+    options: SimpleFileOptions,
+    contents: &str,
+) -> Result<()> {
+    write_bytes(zip, path, options, contents.as_bytes())
+}
+
+fn write_bytes(
+    zip: &mut ZipWriter<File>,
+    path: &str,
+    options: SimpleFileOptions,
+    contents: &[u8],
+) -> Result<()> {
+    zip.start_file(path, options)
+        .map_err(|error| EbookmError::Epub {
+            message: error.to_string(),
+        })?;
+    zip.write_all(contents).map_err(|error| EbookmError::Epub {
+        message: error.to_string(),
+    })?;
+    Ok(())
+}
+
+fn build_nav(manifest: &crate::manifest::Manifest, built: &[BuiltEntry]) -> String {
+    let mut nav_points = String::new();
+    for section in &manifest.sections {
+        let section_target = section
+            .entries
+            .iter()
+            .find_map(|entry_id| built.iter().find(|candidate| &candidate.id == entry_id))
+            .map(|entry| format!("text/{}.xhtml", entry.id));
+        nav_points.push_str("<li>");
+        if let Some(target) = section_target {
+            nav_points.push_str(&format!(
+                "<a href=\"{}\">{}</a><ol>",
+                escape(&target),
+                escape(&section.title)
+            ));
+        } else {
+            nav_points.push_str(&format!("<span>{}</span><ol>", escape(&section.title)));
+        }
+        for entry_id in &section.entries {
+            if let Some(entry) = built.iter().find(|candidate| &candidate.id == entry_id) {
+                if entry.hidden_from_toc {
+                    continue;
+                }
+                nav_points.push_str(&format!(
+                    "<li><a href=\"text/{}.xhtml\">{}</a></li>",
+                    escape(&entry.id),
+                    escape(&entry.chapter.nav_title)
+                ));
+            }
+        }
+        nav_points.push_str("</ol></li>");
+    }
+
+    format!(
+        r#"<?xml version="1.0" encoding="UTF-8"?>
+<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
+  <head>
+    <title>{}</title>
+    <link rel="stylesheet" type="text/css" href="styles/book.css"/>
+  </head>
+  <body>
+    <nav epub:type="toc" id="toc">
+      <h1>{}</h1>
+      <ol>{}</ol>
+    </nav>
+  </body>
+</html>"#,
+        escape(&manifest.book.title),
+        escape(&manifest.book.title),
+        nav_points
+    )
+}
+
+fn build_ncx(manifest: &crate::manifest::Manifest, built: &[BuiltEntry]) -> String {
+    let mut play_order = 1usize;
+    let mut nav_points = String::new();
+    for section in &manifest.sections {
+        let section_entries: Vec<_> = section
+            .entries
+            .iter()
+            .filter_map(|entry_id| built.iter().find(|candidate| &candidate.id == entry_id))
+            .filter(|entry| !entry.hidden_from_toc)
+            .collect();
+
+        if section_entries.is_empty() {
+            continue;
+        }
+
+        let section_play_order = play_order;
+        play_order += 1;
+
+        let mut child_points = String::new();
+        for entry in &section_entries {
+            child_points.push_str(&format!(
+                "<navPoint id=\"nav-{}\" playOrder=\"{}\"><navLabel><text>{}</text></navLabel><content src=\"text/{}.xhtml\"/></navPoint>",
+                escape(&entry.id),
+                play_order,
+                escape(&entry.chapter.nav_title),
+                escape(&entry.id)
+            ));
+            play_order += 1;
+        }
+
+        nav_points.push_str(&format!(
+            "<navPoint id=\"section-{}\" playOrder=\"{}\"><navLabel><text>{}</text></navLabel><content src=\"text/{}.xhtml\"/>{}</navPoint>",
+            escape(&section.id),
+            section_play_order,
+            escape(&section.title),
+            escape(&section_entries[0].id),
+            child_points
+        ));
+    }
+
+    format!(
+        r#"<?xml version="1.0" encoding="UTF-8"?>
+<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
+  <head>
+    <meta name="dtb:uid" content="{}"/>
+  </head>
+  <docTitle><text>{}</text></docTitle>
+  <navMap>{}</navMap>
+</ncx>"#,
+        escape(&manifest.book.identifier),
+        escape(&manifest.book.title),
+        nav_points
+    )
+}
+
+fn build_opf(
+    manifest: &crate::manifest::Manifest,
+    built: &[BuiltEntry],
+    cover_href: Option<&str>,
+) -> String {
+    let mut manifest_items = String::from(
+        r#"<item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
+<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
+<item id="css" href="styles/book.css" media-type="text/css"/>"#,
+    );
+    let mut spine_items = String::new();
+
+    for entry in built {
+        manifest_items.push_str(&format!(
+            "<item id=\"{}\" href=\"text/{}.xhtml\" media-type=\"application/xhtml+xml\"/>",
+            escape(&entry.id),
+            escape(&entry.id)
+        ));
+        spine_items.push_str(&format!("<itemref idref=\"{}\"/>", escape(&entry.id)));
+        for asset in &entry.assets {
+            manifest_items.push_str(&format!(
+                "<item id=\"{}\" href=\"{}\" media-type=\"{}\"/>",
+                escape(&asset.id),
+                escape(&asset.href),
+                escape(&asset.media_type)
+            ));
+        }
+    }
+
+    if let Some(cover_href) = cover_href {
+        manifest_items.push_str(&format!(
+            "<item id=\"cover\" href=\"{}\" media-type=\"image/jpeg\" properties=\"cover-image\"/>",
+            escape(cover_href)
+        ));
+    }
+
+    let author = manifest
+        .book
+        .author
+        .clone()
+        .unwrap_or_else(|| "Unknown".to_string());
+    let description = manifest.book.description.clone().unwrap_or_default();
+    format!(
+        r#"<?xml version="1.0" encoding="UTF-8"?>
+<package version="3.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="bookid">
+  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
+    <dc:identifier id="bookid">{}</dc:identifier>
+    <dc:title>{}</dc:title>
+    <dc:creator>{}</dc:creator>
+    <dc:language>{}</dc:language>
+    <dc:description>{}</dc:description>
+  </metadata>
+  <manifest>{}</manifest>
+  <spine toc="ncx">{}</spine>
+</package>"#,
+        escape(&manifest.book.identifier),
+        escape(&manifest.book.title),
+        escape(&author),
+        escape(&manifest.book.language),
+        escape(&description),
+        manifest_items,
+        spine_items
+    )
+}
+
+const CONTAINER_XML: &str = r#"<?xml version="1.0" encoding="UTF-8"?>
+<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
+  <rootfiles>
+    <rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
+  </rootfiles>
+</container>"#;
+
+const DEFAULT_STYLES: &str = r#"body { font-family: serif; line-height: 1.5; margin: 5%; }
+h1 { margin-bottom: 0.2em; }
+.chapter-meta { color: #555; font-size: 0.9em; margin-bottom: 1.5em; }
+img { max-width: 100%; height: auto; }
+a { color: #0b4f7a; text-decoration: none; }
+"#;
@@ -0,0 +1,39 @@
+use thiserror::Error;
+
+pub type Result<T> = std::result::Result<T, EbookmError>;
+
+#[derive(Debug, Error)]
+pub enum EbookmError {
+    #[error("failed to read file {path}: {source}")]
+    Io {
+        path: String,
+        #[source]
+        source: std::io::Error,
+    },
+    #[error("failed to parse manifest {path}: {source}")]
+    ManifestParse {
+        path: String,
+        #[source]
+        source: serde_yaml::Error,
+    },
+    #[error("manifest validation failed: {issues:?}")]
+    Validation { issues: Vec<String> },
+    #[error("network request failed for {url}: {source}")]
+    Request {
+        url: String,
+        #[source]
+        source: reqwest::Error,
+    },
+    #[error("invalid source path: {path}")]
+    InvalidSourcePath { path: String },
+    #[error("failed to parse URL {value}: {source}")]
+    UrlParse {
+        value: String,
+        #[source]
+        source: url::ParseError,
+    },
+    #[error("article extraction failed for {input}")]
+    Extraction { input: String },
+    #[error("EPUB generation failed: {message}")]
+    Epub { message: String },
+}
@@ -0,0 +1,268 @@
+use chrono::{DateTime, NaiveDate};
+use scraper::{Html, Selector};
+use serde_json::Value;
+use url::Url;
+
+use crate::error::{EbookmError, Result};
+use crate::source::{LoadedSource, SourceOrigin};
+
+#[derive(Debug, Clone)]
+pub struct ExtractedArticle {
+    pub title: String,
+    pub author: Option<String>,
+    pub published: Option<NaiveDate>,
+    pub canonical_url: Option<Url>,
+    pub body_html: String,
+}
+
+#[derive(Debug, Clone, serde::Serialize)]
+pub struct InspectResult {
+    pub title: Option<String>,
+    pub author: Option<String>,
+    pub published: Option<String>,
+    pub canonical_url: Option<String>,
+}
+
+pub fn extract_article(loaded: &LoadedSource) -> Result<ExtractedArticle> {
+    let document = Html::parse_document(&loaded.html);
+    let json_ld = extract_primary_json_ld(&document);
+    let title = select_content(
+        &document,
+        &[
+            r#"meta[property="og:title"]"#,
+            r#"article .post-title"#,
+            ".post-title",
+            "h1",
+            "title",
+        ],
+        "content",
+    )
+    .or_else(|| {
+        select_text(
+            &document,
+            &[r#"article .post-title"#, ".post-title", "h1", "title"],
+        )
+    })
+    .or_else(|| json_ld_string(&json_ld, "headline"))
+    .ok_or_else(|| EbookmError::Extraction {
+        input: origin_label(&loaded.origin),
+    })?;
+
+    let author = select_content(
+        &document,
+        &[
+            r#"meta[name="author"]"#,
+            r#"meta[property="article:author"]"#,
+        ],
+        "content",
+    )
+    .or_else(|| {
+        select_text(
+            &document,
+            &[
+                "[data-testid='author-name']",
+                ".byline",
+                ".byline-wrapper a",
+                "address",
+            ],
+        )
+    })
+    .or_else(|| json_ld_author(&json_ld));
+
+    let published = select_content(
+        &document,
+        &[r#"meta[property="article:published_time"]"#, "time"],
+        "content",
+    )
+    .or_else(|| select_attr(&document, &["time"], "datetime"))
+    .or_else(|| json_ld_string(&json_ld, "datePublished"))
+    .and_then(parse_date);
+
+    let canonical_url = select_attr(&document, &[r#"link[rel="canonical"]"#], "href")
+        .or_else(|| match &loaded.origin {
+            SourceOrigin::Remote(url) => Some(url.to_string()),
+            SourceOrigin::LocalFile(_) => None,
+        })
+        .and_then(|raw| Url::parse(&raw).ok());
+
+    let body_html = select_html(
+        &document,
+        &[
+            ".available-content .body.markup",
+            ".available-content .markup",
+            "article .body.markup",
+            ".newsletter-post .body.markup",
+            "article",
+            "main",
+            "body",
+        ],
+    )
+    .ok_or_else(|| EbookmError::Extraction {
+        input: origin_label(&loaded.origin),
+    })?;
+
+    Ok(ExtractedArticle {
+        title,
+        author,
+        published,
+        canonical_url,
+        body_html,
+    })
+}
+
+pub fn inspect_article(loaded: &LoadedSource) -> Result<InspectResult> {
+    let extracted = extract_article(loaded)?;
+    Ok(InspectResult {
+        title: Some(extracted.title),
+        author: extracted.author,
+        published: extracted.published.map(|date| date.to_string()),
+        canonical_url: extracted.canonical_url.map(|url| url.to_string()),
+    })
+}
+
+fn select_content(document: &Html, selectors: &[&str], attr: &str) -> Option<String> {
+    selectors.iter().find_map(|selector| {
+        let selector = Selector::parse(selector).ok()?;
+        document
+            .select(&selector)
+            .next()
+            .and_then(|node| node.value().attr(attr))
+            .map(clean_text)
+    })
+}
+
+fn select_text(document: &Html, selectors: &[&str]) -> Option<String> {
+    selectors.iter().find_map(|selector| {
+        let selector = Selector::parse(selector).ok()?;
+        document
+            .select(&selector)
+            .next()
+            .map(|node| clean_text(&node.text().collect::<String>()))
+    })
+}
+
+fn select_attr(document: &Html, selectors: &[&str], attr: &str) -> Option<String> {
+    selectors.iter().find_map(|selector| {
+        let selector = Selector::parse(selector).ok()?;
+        document
+            .select(&selector)
+            .next()
+            .and_then(|node| node.value().attr(attr))
+            .map(clean_text)
+    })
+}
+
+fn select_html(document: &Html, selectors: &[&str]) -> Option<String> {
+    selectors.iter().find_map(|selector| {
+        let selector = Selector::parse(selector).ok()?;
+        document
+            .select(&selector)
+            .next()
+            .map(|node| node.inner_html())
+    })
+}
+
+fn clean_text(value: &str) -> String {
+    value.split_whitespace().collect::<Vec<_>>().join(" ")
+}
+
+fn parse_date(value: String) -> Option<NaiveDate> {
+    DateTime::parse_from_rfc3339(&value)
+        .map(|parsed| parsed.date_naive())
+        .ok()
+        .or_else(|| NaiveDate::parse_from_str(&value, "%Y-%m-%d").ok())
+        .or_else(|| NaiveDate::parse_from_str(&value, "%b %d, %Y").ok())
+}
+
+fn origin_label(origin: &SourceOrigin) -> String {
+    match origin {
+        SourceOrigin::Remote(url) => url.to_string(),
+        SourceOrigin::LocalFile(path) => path.display().to_string(),
+    }
+}
+
+fn extract_primary_json_ld(document: &Html) -> Option<Value> {
+    let selector = Selector::parse(r#"script[type="application/ld+json"]"#).ok()?;
+    for node in document.select(&selector) {
+        let raw = node.inner_html();
+        let Ok(value) = serde_json::from_str::<Value>(&raw) else {
+            continue;
+        };
+        if value.get("@type").and_then(Value::as_str).is_some() {
+            return Some(value);
+        }
+    }
+    None
+}
+
+fn json_ld_string(json_ld: &Option<Value>, key: &str) -> Option<String> {
+    json_ld
+        .as_ref()?
+        .get(key)?
+        .as_str()
+        .map(|value| value.to_string())
+}
+
+fn json_ld_author(json_ld: &Option<Value>) -> Option<String> {
+    let author = json_ld.as_ref()?.get("author")?;
+    if let Some(author_name) = author.get(0).and_then(|entry| entry.get("name")).and_then(Value::as_str) {
+        return Some(author_name.to_string());
+    }
+    if let Some(author_name) = author.get("name").and_then(Value::as_str) {
+        return Some(author_name.to_string());
+    }
+    None
+}
+
+#[cfg(test)]
+mod tests {
+    use super::*;
+    use crate::source::LoadedSource;
+
+    #[test]
+    fn extracts_substack_article_body_without_page_chrome() {
+        let html = r#"<!doctype html>
+<html>
+  <head>
+    <meta property="og:title" content="A Contested Island" />
+    <meta name="author" content="John Gu" />
+    <link rel="canonical" href="https://ageofpeace.substack.com/p/a-contested-island" />
+    <script type="application/ld+json">{"@context":"https://schema.org","@type":"NewsArticle","headline":"A Contested Island","datePublished":"2026-03-03T23:37:00+00:00","author":[{"@type":"Person","name":"John Gu"}]}</script>
+  </head>
+  <body>
+    <article class="typography newsletter-post post">
+      <div class="post-header">
+        <h1 class="post-title">Chapter 1: A Contested Island</h1>
+      </div>
+      <div class="available-content">
+        <div class="body markup">
+          <p>First paragraph.</p>
+          <p>Second paragraph.</p>
+        </div>
+      </div>
+      <div class="post-footer">
+        <button>Share</button>
+      </div>
+    </article>
+  </body>
+</html>"#;
+
+        let loaded = LoadedSource {
+            origin: SourceOrigin::Remote(
+                Url::parse("https://ageofpeace.substack.com/p/a-contested-island").expect("url"),
+            ),
+            html: html.to_string(),
+        };
+
+        let extracted = extract_article(&loaded).expect("extract article");
+        assert_eq!(extracted.title, "A Contested Island");
+        assert_eq!(extracted.author.as_deref(), Some("John Gu"));
+        assert_eq!(
+            extracted.published,
+            Some(NaiveDate::from_ymd_opt(2026, 3, 3).expect("date"))
+        );
+        assert!(extracted.body_html.contains("First paragraph."));
+        assert!(!extracted.body_html.contains("post-header"));
+        assert!(!extracted.body_html.contains("Share"));
+    }
+}
@@ -0,0 +1,167 @@
+use std::collections::{BTreeMap, BTreeSet};
+
+use url::Url;
+
+use crate::manifest::{BuildMode, LinkMatchMode, Manifest};
+
+#[derive(Debug, Clone)]
+pub struct LinkPolicy {
+    pub match_mode: LinkMatchMode,
+    pub targets: BTreeSet<String>,
+}
+
+pub fn build_link_policies(
+    manifest: &Manifest,
+    entry_metadata: &BTreeMap<String, EntryLinkMetadata>,
+) -> BTreeMap<String, LinkPolicy> {
+    entry_metadata
+        .iter()
+        .map(|(entry_id, _metadata)| {
+            let entry = &manifest.entries[entry_id];
+            let mode = entry
+                .links
+                .mode
+                .clone()
+                .unwrap_or(manifest.link_rules.mode.clone());
+            let targets = resolve_targets(manifest, entry_id, &mode);
+            let match_mode = select_match_mode(manifest, entry_id, &mode);
+            (
+                entry_id.clone(),
+                LinkPolicy {
+                    match_mode,
+                    targets,
+                },
+            )
+        })
+        .collect()
+}
+
+#[derive(Debug, Clone)]
+pub struct EntryLinkMetadata {
+    pub source_url: Option<Url>,
+    pub canonical_url: Option<Url>,
+}
+
+fn resolve_targets(manifest: &Manifest, entry_id: &str, mode: &BuildMode) -> BTreeSet<String> {
+    let entry = &manifest.entries[entry_id];
+    let mut targets = BTreeSet::new();
+
+    match mode {
+        BuildMode::None => return targets,
+        BuildMode::Auto => {
+            for candidate in manifest.entries.keys() {
+                if candidate != entry_id {
+                    targets.insert(candidate.clone());
+                }
+            }
+        }
+        BuildMode::Explicit => {
+            for rule in &manifest.link_rules.rules {
+                if rule.match_mode == LinkMatchMode::Disabled {
+                    continue;
+                }
+                if selector_matches_any(&rule.from, manifest, entry_id) {
+                    for target in expand_selectors(&rule.to, manifest) {
+                        if target != entry_id {
+                            targets.insert(target);
+                        }
+                    }
+                }
+            }
+        }
+    }
+
+    if !entry.links.allow_to.is_empty() {
+        targets.retain(|candidate| entry.links.allow_to.contains(candidate));
+    }
+
+    for blocked in &entry.links.block_to {
+        targets.remove(blocked);
+    }
+
+    targets
+}
+
+fn select_match_mode(manifest: &Manifest, entry_id: &str, mode: &BuildMode) -> LinkMatchMode {
+    match mode {
+        BuildMode::None => LinkMatchMode::Disabled,
+        BuildMode::Auto => LinkMatchMode::CanonicalUrl,
+        BuildMode::Explicit => manifest
+            .link_rules
+            .rules
+            .iter()
+            .find(|rule| selector_matches_any(&rule.from, manifest, entry_id))
+            .map(|rule| rule.match_mode.clone())
+            .unwrap_or(LinkMatchMode::CanonicalUrl),
+    }
+}
+
+fn selector_matches_any(selectors: &[String], manifest: &Manifest, entry_id: &str) -> bool {
+    selectors
+        .iter()
+        .any(|selector| selector_matches(selector, manifest, entry_id))
+}
+
+fn selector_matches(selector: &str, manifest: &Manifest, entry_id: &str) -> bool {
+    if selector == "*" {
+        return true;
+    }
+    if selector == entry_id {
+        return true;
+    }
+    if let Some(section_id) = selector.strip_prefix("section:") {
+        return manifest
+            .sections
+            .iter()
+            .find(|section| section.id == section_id)
+            .is_some_and(|section| section.entries.iter().any(|entry| entry == entry_id));
+    }
+    false
+}
+
+fn expand_selectors(selectors: &[String], manifest: &Manifest) -> BTreeSet<String> {
+    let mut expanded = BTreeSet::new();
+    for selector in selectors {
+        if selector == "*" {
+            expanded.extend(manifest.entries.keys().cloned());
+            continue;
+        }
+        if let Some(section_id) = selector.strip_prefix("section:") {
+            if let Some(section) = manifest
+                .sections
+                .iter()
+                .find(|section| section.id == section_id)
+            {
+                expanded.extend(section.entries.iter().cloned());
+            }
+            continue;
+        }
+        if manifest.entries.contains_key(selector) {
+            expanded.insert(selector.clone());
+        }
+    }
+    expanded
+}
+
+pub fn matches_target(
+    href: &Url,
+    policy: &LinkPolicy,
+    target_id: &str,
+    metadata: &EntryLinkMetadata,
+) -> bool {
+    if !policy.targets.contains(target_id) {
+        return false;
+    }
+
+    match policy.match_mode {
+        LinkMatchMode::Disabled => false,
+        LinkMatchMode::CanonicalUrl => metadata
+            .canonical_url
+            .as_ref()
+            .is_some_and(|candidate| candidate.as_str() == href.as_str()),
+        LinkMatchMode::SourceUrl => metadata
+            .source_url
+            .as_ref()
+            .is_some_and(|candidate| candidate.as_str() == href.as_str()),
+    }
+}
@@ -0,0 +1,19 @@
+mod epub;
+mod error;
+pub mod extract;
+pub mod graph;
+pub mod manifest;
+pub mod normalize;
+mod pipeline;
+pub mod source;
+mod template;
+
+pub use error::{EbookmError, Result};
+pub use extract::InspectResult;
+pub use manifest::{
+    BuildMode, EntryDefinition, EntryLinkConfig, LinkMatchMode, LinkRule, Manifest,
+    ProcessingDefaults, ProcessingOverrides,
+};
+pub use pipeline::{
+    build_epub, inspect_source, load_manifest, render_init_manifest, validate_manifest,
+};
@@ -0,0 +1,207 @@
+use chrono::NaiveDate;
+use indexmap::IndexMap;
+use serde::{Deserialize, Serialize};
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct Manifest {
+    pub book: BookMetadata,
+    pub output: OutputConfig,
+    #[serde(default)]
+    pub defaults: DefaultsConfig,
+    #[serde(default)]
+    pub sections: Vec<SectionDefinition>,
+    #[serde(default)]
+    pub entries: IndexMap<String, EntryDefinition>,
+    #[serde(default)]
+    pub link_rules: LinkRulesConfig,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct BookMetadata {
+    pub title: String,
+    #[serde(default)]
+    pub author: Option<String>,
+    #[serde(default = "default_language")]
+    pub language: String,
+    #[serde(default = "default_identifier")]
+    pub identifier: String,
+    #[serde(default)]
+    pub description: Option<String>,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct OutputConfig {
+    pub path: String,
+    #[serde(default)]
+    pub cover_image: Option<String>,
+}
+
+#[derive(Debug, Clone, Default, Serialize, Deserialize)]
+pub struct DefaultsConfig {
+    #[serde(default = "default_true")]
+    pub fetch_images: bool,
+    #[serde(default = "default_true")]
+    pub normalize_substack_embeds: bool,
+    #[serde(default)]
+    pub processing: ProcessingDefaults,
+    #[serde(default)]
+    pub metadata: MetadataOverrides,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct SectionDefinition {
+    pub id: String,
+    pub title: String,
+    #[serde(default)]
+    pub entries: Vec<String>,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct EntryDefinition {
+    pub source: SourceDefinition,
+    #[serde(default)]
+    pub title: Option<String>,
+    #[serde(default)]
+    pub metadata: MetadataOverrides,
+    #[serde(default)]
+    pub processing: ProcessingOverrides,
+    #[serde(default)]
+    pub toc: TocConfig,
+    #[serde(default)]
+    pub links: EntryLinkConfig,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+#[serde(tag = "kind", rename_all = "lowercase")]
+pub enum SourceDefinition {
+    Substack { url: String },
+    Html { path: String },
+}
+
+#[derive(Debug, Clone, Default, Serialize, Deserialize)]
+pub struct MetadataOverrides {
+    #[serde(default)]
+    pub author: Option<String>,
+    #[serde(default)]
+    pub published: Option<NaiveDate>,
+    #[serde(default)]
+    pub subtitle: Option<String>,
+    #[serde(default)]
+    pub summary: Option<String>,
+    #[serde(default)]
+    pub tags: Vec<String>,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct ProcessingDefaults {
+    #[serde(default = "default_true")]
+    pub include_author: bool,
+    #[serde(default = "default_true")]
+    pub include_date: bool,
+    #[serde(default = "default_true")]
+    pub include_source_url: bool,
+    #[serde(default)]
+    pub skip_first_paragraphs: u32,
+}
+
+impl Default for ProcessingDefaults {
+    fn default() -> Self {
+        Self {
+            include_author: true,
+            include_date: true,
+            include_source_url: true,
+            skip_first_paragraphs: 0,
+        }
+    }
+}
+
+#[derive(Debug, Clone, Default, Serialize, Deserialize)]
+pub struct ProcessingOverrides {
+    #[serde(default)]
+    pub include_author: Option<bool>,
+    #[serde(default)]
+    pub include_date: Option<bool>,
+    #[serde(default)]
+    pub include_source_url: Option<bool>,
+    #[serde(default)]
+    pub skip_first_paragraphs: Option<u32>,
+}
+
+#[derive(Debug, Clone, Default, Serialize, Deserialize)]
+pub struct TocConfig {
+    #[serde(default)]
+    pub title: Option<String>,
+    #[serde(default)]
+    pub hidden: bool,
+}
+
+#[derive(Debug, Clone, Default, Serialize, Deserialize)]
+pub struct EntryLinkConfig {
+    #[serde(default)]
+    pub mode: Option<BuildMode>,
+    #[serde(default)]
+    pub allow_to: Vec<String>,
+    #[serde(default)]
+    pub block_to: Vec<String>,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Default)]
+#[serde(rename_all = "lowercase")]
+pub enum BuildMode {
+    #[default]
+    Auto,
+    Explicit,
+    None,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Default)]
+#[serde(rename_all = "kebab-case")]
+pub enum LinkMatchMode {
+    #[default]
+    CanonicalUrl,
+    SourceUrl,
+    Disabled,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct LinkRule {
+    pub from: Vec<String>,
+    pub to: Vec<String>,
+    #[serde(default)]
+    pub match_mode: LinkMatchMode,
+}
+
+#[derive(Debug, Clone, Serialize, Deserialize)]
+pub struct LinkRulesConfig {
+    #[serde(default)]
+    pub mode: BuildMode,
+    #[serde(default = "default_true")]
+    pub rewrite_external_substack_links: bool,
+    #[serde(default = "default_true")]
+    pub preserve_other_external_links: bool,
+    #[serde(default)]
+    pub rules: Vec<LinkRule>,
+}
+
+impl Default for LinkRulesConfig {
+    fn default() -> Self {
+        Self {
+            mode: BuildMode::Auto,
+            rewrite_external_substack_links: true,
+            preserve_other_external_links: true,
+            rules: Vec::new(),
+        }
+    }
+}
+
+fn default_true() -> bool {
+    true
+}
+
+fn default_language() -> String {
+    "en".to_string()
+}
+
+fn default_identifier() -> String {
+    format!("urn:uuid:{}", uuid::Uuid::new_v4())
+}
@@ -0,0 +1,357 @@
+use std::collections::BTreeMap;
+use std::path::Path;
+
+use kuchiki::traits::*;
+use regex::Regex;
+use sha1::{Digest, Sha1};
+use url::Url;
+
+use crate::error::{EbookmError, Result};
+use crate::graph::{EntryLinkMetadata, LinkPolicy, matches_target};
+use crate::manifest::{DefaultsConfig, EntryDefinition};
+use crate::source::{SourceOrigin, resolve_relative_url};
+
+#[derive(Debug, Clone)]
+pub struct Asset {
+    pub id: String,
+    pub href: String,
+    pub media_type: String,
+    pub bytes: Vec<u8>,
+}
+
+#[derive(Debug, Clone)]
+pub struct NormalizedDocument {
+    pub title: String,
+    pub author: Option<String>,
+    pub published: Option<chrono::NaiveDate>,
+    pub canonical_url: Option<Url>,
+    pub body_xhtml: String,
+    pub assets: Vec<Asset>,
+}
+
+pub fn normalize_document(
+    entry_id: &str,
+    entry: &EntryDefinition,
+    defaults: &DefaultsConfig,
+    origin: &SourceOrigin,
+    extracted: crate::extract::ExtractedArticle,
+    policy: &LinkPolicy,
+    entry_metadata: &BTreeMap<String, EntryLinkMetadata>,
+) -> Result<NormalizedDocument> {
+    let mut document = kuchiki::parse_html().one(format!("<div>{}</div>", extracted.body_html));
+
+    remove_nodes(&mut document, "script,style,noscript,button,svg,source");
+    if defaults.normalize_substack_embeds {
+        remove_nodes(&mut document, "iframe");
+    }
+    skip_first_paragraphs(
+        &mut document,
+        entry
+            .processing
+            .skip_first_paragraphs
+            .unwrap_or(defaults.processing.skip_first_paragraphs),
+    );
+    scrub_attributes(&mut document);
+
+    let mut assets = Vec::new();
+    if defaults.fetch_images {
+        collect_images(origin, &mut document, &mut assets)?;
+    }
+
+    rewrite_links(entry_id, &mut document, origin, policy, entry_metadata);
+    let body_xhtml = serialize_document(&document)?;
+
+    Ok(NormalizedDocument {
+        title: entry.title.clone().unwrap_or(extracted.title),
+        author: entry
+            .metadata
+            .author
+            .clone()
+            .or(extracted.author)
+            .or(defaults.metadata.author.clone()),
+        published: entry
+            .metadata
+            .published
+            .or(extracted.published)
+            .or(defaults.metadata.published),
+        canonical_url: extracted.canonical_url,
+        body_xhtml,
+        assets,
+    })
+}
+
+fn remove_nodes(document: &mut kuchiki::NodeRef, selector: &str) {
+    if let Ok(nodes) = document.select(selector) {
+        let selected: Vec<_> = nodes.collect();
+        for node in selected {
+            node.as_node().detach();
+        }
+    }
+}
+
+fn collect_images(
+    origin: &SourceOrigin,
+    document: &mut kuchiki::NodeRef,
+    assets: &mut Vec<Asset>,
+) -> Result<()> {
+    let selected = document
+        .select("img")
+        .map(|items| items.collect::<Vec<_>>())
+        .unwrap_or_default();
+
+    for node in selected {
+        let mut attrs = node.attributes.borrow_mut();
+        let src = attrs
+            .get("src")
+            .or_else(|| attrs.get("data-src"))
+            .map(|value| value.to_string());
+        let Some(src) = src else {
+            continue;
+        };
+
+        if let Ok(asset) = fetch_asset(origin, &src) {
+            attrs.insert("src", format!("../{}", asset.href));
+            assets.push(asset);
+        }
+    }
+
+    Ok(())
+}
+
+fn fetch_asset(origin: &SourceOrigin, src: &str) -> Result<Asset> {
+    match origin {
+        SourceOrigin::LocalFile(base_path) => fetch_local_asset(base_path, src),
+        SourceOrigin::Remote(base_url) => {
+            let resolved = base_url.join(src).map_err(|source| EbookmError::UrlParse {
+                value: src.to_string(),
+                source,
+            })?;
+            fetch_remote_asset(&resolved)
+        }
+    }
+}
+
+fn fetch_local_asset(base_path: &Path, src: &str) -> Result<Asset> {
+    if let Ok(url) = Url::parse(src) {
+        match url.scheme() {
+            "http" | "https" => return fetch_remote_asset(&url),
+            "file" => {
+                let path = url
+                    .to_file_path()
+                    .map_err(|_| EbookmError::InvalidSourcePath {
+                        path: src.to_string(),
+                    })?;
+                return build_asset_from_path(&path);
+            }
+            _ => {}
+        }
+    }
+
+    let path = if Path::new(src).is_absolute() {
+        Path::new(src).to_path_buf()
+    } else {
+        base_path
+            .parent()
+            .unwrap_or_else(|| Path::new("."))
+            .join(src)
+    };
+    build_asset_from_path(&path)
+}
+
+fn fetch_remote_asset(url: &Url) -> Result<Asset> {
+    let bytes = reqwest::blocking::get(url.clone())
+        .and_then(|response| response.error_for_status())
+        .map_err(|source| EbookmError::Request {
+            url: url.to_string(),
+            source,
+        })?
+        .bytes()
+        .map_err(|source| EbookmError::Request {
+            url: url.to_string(),
+            source,
+        })?
+        .to_vec();
+
+    let extension = infer_extension_from_str(url.path());
+    let media_type = infer_media_type(&extension);
+    let digest = Sha1::digest(url.as_str().as_bytes());
+    let id = format!("{:x}", digest);
+    Ok(Asset {
+        id: id.clone(),
+        href: format!("assets/{}.{}", id, extension),
+        media_type,
+        bytes,
+    })
+}
+
+fn build_asset_from_path(path: &Path) -> Result<Asset> {
+    let bytes = std::fs::read(path).map_err(|source| EbookmError::Io {
+        path: path.display().to_string(),
+        source,
+    })?;
+    let extension = infer_extension_from_path(path);
+    let media_type = infer_media_type(&extension);
+    let digest = Sha1::digest(path.display().to_string().as_bytes());
+    let id = format!("{:x}", digest);
+    Ok(Asset {
+        id: id.clone(),
+        href: format!("assets/{}.{}", id, extension),
+        media_type,
+        bytes,
+    })
+}
+
+fn rewrite_links(
+    entry_id: &str,
+    document: &mut kuchiki::NodeRef,
+    origin: &SourceOrigin,
+    policy: &LinkPolicy,
+    entry_metadata: &BTreeMap<String, EntryLinkMetadata>,
+) {
+    let selected = document
+        .select("a[href]")
+        .map(|items| items.collect::<Vec<_>>())
+        .unwrap_or_default();
+
+    for node in selected {
+        let mut attrs = node.attributes.borrow_mut();
+        let href = attrs.get("href").map(|value| value.to_string());
+        let Some(href) = href else {
+            continue;
+        };
+
+        let Some(resolved) = resolve_relative_url(origin, &href) else {
+            continue;
+        };
+
+        if let Some((target_id, _)) = entry_metadata.iter().find(|(target_id, metadata)| {
+            *target_id != entry_id && matches_target(&resolved, policy, target_id, metadata)
+        }) {
+            attrs.insert("href", format!("../text/{}.xhtml", target_id));
+        }
+    }
+}
+
+fn serialize_document(document: &kuchiki::NodeRef) -> Result<String> {
+    let wrapper = document
+        .select_first("div")
+        .map_err(|_| EbookmError::Epub {
+            message: "failed to serialize normalized document".to_string(),
+        })?;
+
+    let mut bytes = Vec::new();
+    for child in wrapper.as_node().children() {
+        child
+            .serialize(&mut bytes)
+            .map_err(|error| EbookmError::Epub {
+                message: error.to_string(),
+            })?;
+    }
+
+    let html = String::from_utf8(bytes).map_err(|error| EbookmError::Epub {
+        message: error.to_string(),
+    })?;
+    Ok(to_xhtml_fragment(&html))
+}
+
+fn scrub_attributes(document: &mut kuchiki::NodeRef) {
+    if let Ok(nodes) = document.select("*") {
+        let selected: Vec<_> = nodes.collect();
+        for node in selected {
+            let mut attrs = node.attributes.borrow_mut();
+            let names: Vec<_> = attrs.map.keys().cloned().collect();
+            for name in names {
+                let local = name.local.to_string();
+                let keep = match node.name.local.as_ref() {
+                    "a" => matches!(local.as_str(), "href" | "title"),
+                    "img" => matches!(local.as_str(), "src" | "alt"),
+                    _ => false,
+                };
+                if !keep {
+                    attrs.map.remove(&name);
+                }
+            }
+        }
+    }
+}
+
+fn skip_first_paragraphs(document: &mut kuchiki::NodeRef, count: u32) {
+    if count == 0 {
+        return;
+    }
+    let selected = document
+        .select("p")
+        .map(|items| items.take(count as usize).collect::<Vec<_>>())
+        .unwrap_or_default();
+    for node in selected {
+        node.as_node().detach();
+    }
+}
+
+fn infer_extension_from_path(path: &Path) -> String {
+    path.extension()
+        .and_then(|value| value.to_str())
+        .filter(|value| !value.is_empty())
+        .unwrap_or("bin")
+        .to_string()
+}
+
+fn infer_extension_from_str(path: &str) -> String {
+    Path::new(path)
+        .extension()
+        .and_then(|value| value.to_str())
+        .filter(|value| !value.is_empty())
+        .unwrap_or("bin")
+        .to_string()
+}
+
+fn infer_media_type(extension: &str) -> String {
+    match extension {
+        "jpg" | "jpeg" => "image/jpeg",
+        "png" => "image/png",
+        "gif" => "image/gif",
+        "svg" => "image/svg+xml",
+        "webp" => "image/webp",
+        _ => "application/octet-stream",
+    }
+    .to_string()
+}
+
+fn to_xhtml_fragment(html: &str) -> String {
+    let img_re = Regex::new(r#"<img([^>]*)>"#).expect("valid img regex");
+    let hr_re = Regex::new(r#"<hr([^>]*)>"#).expect("valid hr regex");
+    let br_re = Regex::new(r#"<br([^>]*)>"#).expect("valid br regex");
+
+    let html = img_re.replace_all(html, "<img$1 />").into_owned();
+    let html = hr_re.replace_all(&html, "<hr$1 />").into_owned();
+    br_re.replace_all(&html, "<br$1 />").into_owned()
+}
+
+#[cfg(test)]
+mod tests {
+    use super::to_xhtml_fragment;
+    use quick_xml::events::Event;
+    use quick_xml::Reader;
+
+    #[test]
+    fn converts_void_html_tags_to_xhtml_self_closing_tags() {
+        let input = r#"<p>Intro</p><picture><img alt="" src="a.jpg"></picture><hr><br>"#;
+        let xhtml = to_xhtml_fragment(input);
+        assert!(xhtml.contains(r#"<img alt="" src="a.jpg" />"#));
+        assert!(xhtml.contains("<hr />"));
+        assert!(xhtml.contains("<br />"));
+
+        let wrapped = format!(
+            r#"<?xml version="1.0" encoding="UTF-8"?><root>{}</root>"#,
+            xhtml
+        );
+        let mut reader = Reader::from_str(&wrapped);
+        loop {
+            match reader.read_event() {
+                Ok(Event::Eof) => break,
+                Ok(_) => {}
+                Err(error) => panic!("invalid XML generated: {error}"),
+            }
+        }
+    }
+}
@@ -0,0 +1,432 @@
+use std::collections::BTreeMap;
+use std::fs;
+use std::path::Path;
+
+use crate::epub::write_epub;
+use crate::error::{EbookmError, Result};
+use crate::extract::{InspectResult, inspect_article};
+use crate::graph::{EntryLinkMetadata, build_link_policies};
+use crate::manifest::{Manifest, SourceDefinition};
+use crate::normalize::{Asset, NormalizedDocument};
+use crate::source::{SourceSpec, load_source};
+use crate::template::INIT_MANIFEST;
+
+#[derive(Debug, Clone)]
+pub struct BuiltChapter {
+    pub nav_title: String,
+    pub xhtml: String,
+}
+
+#[derive(Debug, Clone, Copy)]
+struct ChapterHeaderOptions {
+    include_author: bool,
+    include_date: bool,
+    include_source_url: bool,
+}
+
+#[derive(Debug, Clone)]
+pub struct BuiltEntry {
+    pub id: String,
+    pub hidden_from_toc: bool,
+    pub chapter: BuiltChapter,
+    pub assets: Vec<Asset>,
+}
+
+pub fn load_manifest(path: &Path) -> Result<Manifest> {
+    let contents = fs::read_to_string(path).map_err(|source| EbookmError::Io {
+        path: path.display().to_string(),
+        source,
+    })?;
+    serde_yaml::from_str(&contents).map_err(|source| EbookmError::ManifestParse {
+        path: path.display().to_string(),
+        source,
+    })
+}
+
+pub fn validate_manifest(manifest: &Manifest) -> Result<Vec<String>> {
+    let mut issues = Vec::new();
+    let mut warnings = Vec::new();
+
+    if manifest.book.title.trim().is_empty() {
+        issues.push("book.title must not be empty".to_string());
+    }
+    if manifest.output.path.trim().is_empty() {
+        issues.push("output.path must not be empty".to_string());
+    }
+    if manifest.sections.is_empty() {
+        issues.push("at least one section is required".to_string());
+    }
+    if manifest.entries.is_empty() {
+        issues.push("at least one entry is required".to_string());
+    }
+
+    for section in &manifest.sections {
+        if section.entries.is_empty() {
+            warnings.push(format!("section {} has no entries", section.id));
+        }
+        for entry_id in &section.entries {
+            if !manifest.entries.contains_key(entry_id) {
+                issues.push(format!(
+                    "section {} references unknown entry {}",
+                    section.id, entry_id
+                ));
+            }
+        }
+    }
+
+    for (entry_id, entry) in &manifest.entries {
+        for target in &entry.links.allow_to {
+            if !manifest.entries.contains_key(target) {
+                issues.push(format!(
+                    "entry {entry_id} allow_to target {target} does not exist"
+                ));
+            }
+        }
+        for target in &entry.links.block_to {
+            if !manifest.entries.contains_key(target) {
+                issues.push(format!(
+                    "entry {entry_id} block_to target {target} does not exist"
+                ));
+            }
+        }
+    }
+
+    for rule in &manifest.link_rules.rules {
+        validate_selectors(manifest, &rule.from, "from", &mut issues);
+        validate_selectors(manifest, &rule.to, "to", &mut issues);
+    }
+
+    for entry_id in manifest.entries.keys() {
+        if !manifest.sections.iter().any(|section| {
+            section
+                .entries
+                .iter()
+                .any(|candidate| candidate == entry_id)
+        }) {
+            warnings.push(format!("entry {entry_id} is not referenced by any section"));
+        }
+    }
+
+    if issues.is_empty() {
+        Ok(warnings)
+    } else {
+        Err(EbookmError::Validation { issues })
+    }
+}
+
+pub fn inspect_source(source: &str) -> Result<InspectResult> {
+    let spec = if source.starts_with("http://") || source.starts_with("https://") {
+        SourceSpec::from_definition(
+            &SourceDefinition::Substack {
+                url: source.to_string(),
+            },
+            Path::new("."),
+        )?
+    } else {
+        SourceSpec::from_definition(
+            &SourceDefinition::Html {
+                path: source.to_string(),
+            },
+            Path::new("."),
+        )?
+    };
+    let loaded = load_source(&spec)?;
+    inspect_article(&loaded)
+}
+
+pub fn build_epub(manifest: &Manifest, manifest_path: &Path) -> Result<()> {
+    let manifest_dir = manifest_path.parent().unwrap_or_else(|| Path::new("."));
+
+    let mut entry_specs = BTreeMap::new();
+    let mut loaded_sources = BTreeMap::new();
+    let mut extracted = BTreeMap::new();
+    let mut metadata = BTreeMap::new();
+
+    for (entry_id, entry) in &manifest.entries {
+        let spec = SourceSpec::from_definition(&entry.source, manifest_dir)?;
+        let loaded = load_source(&spec)?;
+        let article = crate::extract::extract_article(&loaded)?;
+        let source_url = match &spec {
+            SourceSpec::SubstackUrl(url) => Some(url.clone()),
+            SourceSpec::LocalHtml(_) => None,
+        };
+
+        metadata.insert(
+            entry_id.clone(),
+            EntryLinkMetadata {
+                source_url,
+                canonical_url: article.canonical_url.clone(),
+            },
+        );
+        entry_specs.insert(entry_id.clone(), spec);
+        loaded_sources.insert(entry_id.clone(), loaded);
+        extracted.insert(entry_id.clone(), article);
+    }
+
+    let policies = build_link_policies(manifest, &metadata);
+    let mut built_entries = Vec::new();
+
+    for section in &manifest.sections {
+        for entry_id in &section.entries {
+            let entry = &manifest.entries[entry_id];
+            let loaded = loaded_sources.get(entry_id).expect("entry was loaded");
+            let article = extracted
+                .get(entry_id)
+                .expect("entry was extracted")
+                .clone();
+            let policy = policies.get(entry_id).expect("policy was built");
+
+            let normalized = crate::normalize::normalize_document(
+                entry_id,
+                entry,
+                &manifest.defaults,
+                &loaded.origin,
+                article,
+                policy,
+                &metadata,
+            )?;
+
+            built_entries.push(BuiltEntry {
+                id: entry_id.clone(),
+                hidden_from_toc: entry.toc.hidden,
+                chapter: build_chapter(entry_id, entry, &manifest.defaults, &normalized),
+                assets: normalized.assets,
+            });
+        }
+    }
+
+    let cover = manifest
+        .output
+        .cover_image
+        .as_ref()
+        .map(|path| load_cover(path, manifest_dir))
+        .transpose()?;
+    let output_path = manifest_dir.join(&manifest.output.path);
+    write_epub(manifest, &built_entries, &output_path, cover)?;
+    Ok(())
+}
+
+pub fn render_init_manifest() -> &'static str {
+    INIT_MANIFEST
+}
+
+fn build_chapter(
+    entry_id: &str,
+    entry: &crate::manifest::EntryDefinition,
+    defaults: &crate::manifest::DefaultsConfig,
+    doc: &NormalizedDocument,
+) -> BuiltChapter {
+    let nav_title = entry.toc.title.clone().unwrap_or_else(|| doc.title.clone());
+    let header = resolve_header_options(entry, defaults);
+    let author = doc.author.clone().unwrap_or_default();
+    let published = doc
+        .published
+        .map(|date| date.to_string())
+        .unwrap_or_default();
+    let mut meta_lines = Vec::new();
+    if header.include_author && !author.is_empty() {
+        meta_lines.push(format!("<p>{}</p>", escape_html(&author)));
+    }
+    if header.include_date && !published.is_empty() {
+        meta_lines.push(format!("<p>{}</p>", escape_html(&published)));
+    }
+    if header.include_source_url {
+        if let Some(url) = doc.canonical_url.as_ref() {
+            let escaped = escape_html(url.as_str());
+            meta_lines.push(format!(r#"<p><a href="{0}">{0}</a></p>"#, escaped));
+        }
+    }
+
+    let meta_block = if meta_lines.is_empty() {
+        String::new()
+    } else {
+        format!(r#"<div class="chapter-meta">{}</div>"#, meta_lines.join(""))
+    };
+
+    let xhtml = format!(
+        r#"<?xml version="1.0" encoding="UTF-8"?>
+<html xmlns="http://www.w3.org/1999/xhtml">
+  <head>
+    <title>{}</title>
+    <link rel="stylesheet" type="text/css" href="../styles/book.css"/>
+  </head>
+  <body id="{}">
+    <h1>{}</h1>
+    {}
+    {}
+  </body>
+</html>"#,
+        escape_html(&doc.title),
+        escape_html(entry_id),
+        escape_html(&doc.title),
+        meta_block,
+        doc.body_xhtml
+    );
+
+    BuiltChapter { nav_title, xhtml }
+}
+
+fn validate_selectors(
+    manifest: &Manifest,
+    selectors: &[String],
+    field: &str,
+    issues: &mut Vec<String>,
+) {
+    for selector in selectors {
+        if selector == "*" {
+            continue;
+        }
+        if manifest.entries.contains_key(selector) {
+            continue;
+        }
+        if let Some(section_id) = selector.strip_prefix("section:") {
+            if manifest
+                .sections
+                .iter()
+                .any(|section| section.id == section_id)
+            {
+                continue;
+            }
+        }
+        issues.push(format!("unknown {field} selector {selector}"));
+    }
+}
+
+fn load_cover(path: &str, manifest_dir: &Path) -> Result<(String, Vec<u8>)> {
+    let full_path = manifest_dir.join(path);
+    let bytes = fs::read(&full_path).map_err(|source| EbookmError::Io {
+        path: full_path.display().to_string(),
+        source,
+    })?;
+    let extension = full_path
+        .extension()
+        .and_then(|value| value.to_str())
+        .unwrap_or("jpg");
+    Ok((format!("assets/cover.{extension}"), bytes))
+}
+
+fn escape_html(value: &str) -> String {
+    value
+        .replace('&', "&amp;")
+        .replace('<', "&lt;")
+        .replace('>', "&gt;")
+        .replace('"', "&quot;")
+}
+
+fn resolve_header_options(
+    entry: &crate::manifest::EntryDefinition,
+    defaults: &crate::manifest::DefaultsConfig,
+) -> ChapterHeaderOptions {
+    ChapterHeaderOptions {
+        include_author: entry
+            .processing
+            .include_author
+            .unwrap_or(defaults.processing.include_author),
+        include_date: entry
+            .processing
+            .include_date
+            .unwrap_or(defaults.processing.include_date),
+        include_source_url: entry
+            .processing
+            .include_source_url
+            .unwrap_or(defaults.processing.include_source_url),
+    }
+}
+
+#[cfg(test)]
+mod tests {
+    use std::fs;
+
+    use tempfile::tempdir;
+    use zip::ZipArchive;
+
+    use super::*;
+
+    #[test]
+    fn validates_and_builds_local_html_manifest() {
+        let temp = tempdir().expect("tempdir");
+        let root = temp.path();
+
+        fs::write(
+            root.join("article.html"),
+            r#"<!doctype html>
+<html>
+  <head>
+    <title>Local Essay</title>
+    <meta name="author" content="Local Author" />
+    <meta property="article:published_time" content="2025-01-10T00:00:00Z" />
+  </head>
+  <body>
+    <article>
+      <p>Hello world.</p>
+      <img src="author.jpg" alt="Author" />
+    </article>
+  </body>
+</html>"#,
+        )
+        .expect("write html");
+        fs::write(root.join("author.jpg"), b"fake-jpeg-data").expect("write image");
+
+        let manifest_path = root.join("book.yaml");
+        fs::write(
+            &manifest_path,
+            r#"book:
+  title: "Local Book"
+  author: "Editor"
+  language: "en"
+  identifier: "urn:uuid:test-book"
+output:
+  path: "dist/test.epub"
+defaults:
+  fetch_images: true
+  normalize_substack_embeds: true
+  processing:
+    include_author: true
+    include_date: false
+    include_source_url: false
+    skip_first_paragraphs: 0
+sections:
+  - id: "part-1"
+    title: "Part 1"
+    entries:
+      - "essay"
+entries:
+  essay:
+    source:
+      kind: "html"
+      path: "article.html"
+link_rules:
+  mode: "auto"
+"#,
+        )
+        .expect("write manifest");
+
+        let manifest = load_manifest(&manifest_path).expect("manifest");
+        validate_manifest(&manifest).expect("manifest valid");
+        build_epub(&manifest, &manifest_path).expect("build epub");
+
+        let epub_path = root.join("dist/test.epub");
+        assert!(epub_path.exists());
+
+        let file = fs::File::open(&epub_path).expect("epub file");
+        let mut archive = ZipArchive::new(file).expect("zip");
+        assert!(archive.by_name("mimetype").is_ok());
+        assert!(archive.by_name("OEBPS/content.opf").is_ok());
+        let mut chapter = archive
+            .by_name("OEBPS/text/essay.xhtml")
+            .expect("chapter file");
+        let mut chapter_contents = String::new();
+        use std::io::Read;
+        chapter
+            .read_to_string(&mut chapter_contents)
+            .expect("read chapter");
+        assert!(chapter_contents.contains("<p>Local Author</p>"));
+        assert!(!chapter_contents.contains("<p>2025-01-10</p>"));
+        assert!(!chapter_contents.contains("urn:uuid:test-book"));
+        assert!(chapter_contents.contains("../assets/"));
+        drop(chapter);
+        assert!(archive
+            .file_names()
+            .any(|name| name.starts_with("OEBPS/assets/") && name.ends_with(".jpg")));
+    }
+}
@@ -0,0 +1,97 @@
+use std::fs;
+use std::path::{Path, PathBuf};
+
+use url::Url;
+
+use crate::error::{EbookmError, Result};
+use crate::manifest::SourceDefinition;
+
+#[derive(Debug, Clone)]
+pub enum SourceSpec {
+    SubstackUrl(Url),
+    LocalHtml(PathBuf),
+}
+
+#[derive(Debug, Clone)]
+pub enum SourceOrigin {
+    Remote(Url),
+    LocalFile(PathBuf),
+}
+
+#[derive(Debug, Clone)]
+pub struct LoadedSource {
+    pub origin: SourceOrigin,
+    pub html: String,
+}
+
+impl SourceSpec {
+    pub fn from_definition(definition: &SourceDefinition, manifest_dir: &Path) -> Result<Self> {
+        match definition {
+            SourceDefinition::Substack { url } => Ok(SourceSpec::SubstackUrl(
+                Url::parse(url).map_err(|source| EbookmError::UrlParse {
+                    value: url.clone(),
+                    source,
+                })?,
+            )),
+            SourceDefinition::Html { path } => {
+                let joined = manifest_dir.join(path);
+                Ok(SourceSpec::LocalHtml(joined))
+            }
+        }
+    }
+}
+
+pub fn load_source(spec: &SourceSpec) -> Result<LoadedSource> {
+    match spec {
+        SourceSpec::SubstackUrl(url) => {
+            let client = reqwest::blocking::Client::builder()
+                .user_agent("ebookm/0.1")
+                .build()
+                .map_err(|source| EbookmError::Request {
+                    url: url.to_string(),
+                    source,
+                })?;
+            let html = client
+                .get(url.clone())
+                .send()
+                .and_then(|response| response.error_for_status())
+                .map_err(|source| EbookmError::Request {
+                    url: url.to_string(),
+                    source,
+                })?
+                .text()
+                .map_err(|source| EbookmError::Request {
+                    url: url.to_string(),
+                    source,
+                })?;
+            Ok(LoadedSource {
+                origin: SourceOrigin::Remote(url.clone()),
+                html,
+            })
+        }
+        SourceSpec::LocalHtml(path) => {
+            let html = fs::read_to_string(path).map_err(|source| EbookmError::Io {
+                path: path.display().to_string(),
+                source,
+            })?;
+            Ok(LoadedSource {
+                origin: SourceOrigin::LocalFile(path.clone()),
+                html,
+            })
+        }
+    }
+}
+
+pub fn resolve_relative_url(origin: &SourceOrigin, href: &str) -> Option<Url> {
+    match origin {
+        SourceOrigin::Remote(base) => base.join(href).ok(),
+        SourceOrigin::LocalFile(path) => {
+            if let Ok(url) = Url::parse(href) {
+                return Some(url);
+            }
+            let parent = path.parent()?;
+            let joined = parent.join(href);
+            Url::from_file_path(joined).ok()
+        }
+    }
+}
@@ -0,0 +1,56 @@
+pub const INIT_MANIFEST: &str = r#"book:
+  title: "Collected Substack Essays"
+  author: "Author Name"
+  language: "en"
+  identifier: "urn:uuid:11111111-2222-3333-4444-555555555555"
+  description: "A compiled EPUB built by ebookm"
+
+output:
+  path: "dist/collection.epub"
+
+defaults:
+  fetch_images: true
+  normalize_substack_embeds: true
+  processing:
+    include_author: true
+    include_date: true
+    include_source_url: true
+    skip_first_paragraphs: 0
+  metadata:
+    author: "Author Name"
+
+sections:
+  - id: "essays"
+    title: "Essays"
+    entries:
+      - "opening-post"
+      - "saved-html"
+
+entries:
+  opening-post:
+    source:
+      kind: "substack"
+      url: "https://example.substack.com/p/opening-post"
+    processing:
+      skip_first_paragraphs: 1
+    toc:
+      title: "Opening Post"
+
+  saved-html:
+    source:
+      kind: "html"
+      path: "articles/saved-post.html"
+    title: "Saved Local Article"
+    links:
+      mode: "explicit"
+      allow_to: ["opening-post"]
+
+link_rules:
+  mode: "auto"
+  rewrite_external_substack_links: true
+  preserve_other_external_links: true
+  rules:
+    - from: ["section:essays"]
+      to: ["section:essays"]
+      match_mode: "canonical-url"
+"#;
@@ -0,0 +1,15 @@
+<!doctype html>
+<html>
+  <head>
+    <title>Introduction</title>
+    <meta name="author" content="ebookm" />
+    <meta property="article:published_time" content="2025-01-10T00:00:00Z" />
+    <link rel="canonical" href="https://example.com/intro" />
+  </head>
+  <body>
+    <article>
+      <p>This is the first article in the bundled example.</p>
+      <p>It demonstrates a local HTML source entry.</p>
+    </article>
+  </body>
+</html>
@@ -0,0 +1,14 @@
+<!doctype html>
+<html>
+  <head>
+    <title>Working Notes</title>
+    <meta name="author" content="ebookm" />
+    <meta property="article:published_time" content="2025-01-11T00:00:00Z" />
+  </head>
+  <body>
+    <article>
+      <p>This second article links to the first one.</p>
+      <p><a href="https://example.com/intro">Go to the introduction</a></p>
+    </article>
+  </body>
+</html>
@@ -0,0 +1,48 @@
+book:
+  title: "ebookm Example Book"
+  author: "ebookm"
+  language: "en"
+  identifier: "urn:uuid:ebookm-example-book"
+  description: "Example manifest shipped with the repository"
+
+output:
+  path: "dist/example-book.epub"
+
+defaults:
+  fetch_images: false
+  normalize_substack_embeds: true
+  metadata:
+    author: "ebookm"
+
+sections:
+  - id: "part-1"
+    title: "Examples"
+    entries:
+      - "intro"
+      - "notes"
+
+entries:
+  intro:
+    source:
+      kind: "html"
+      path: "articles/intro.html"
+    toc:
+      title: "Introduction"
+
+  notes:
+    source:
+      kind: "html"
+      path: "articles/notes.html"
+    title: "Working Notes"
+    links:
+      mode: "explicit"
+      allow_to: ["intro"]
+
+link_rules:
+  mode: "explicit"
+  rewrite_external_substack_links: true
+  preserve_other_external_links: true
+  rules:
+    - from: ["notes"]
+      to: ["intro"]
+      match_mode: "canonical-url"