initial commit
@@ -0,0 +1 @@
|
||||
/target
|
||||
Generated
+3187
File diff suppressed because it is too large
@@ -0,0 +1,9 @@
|
||||
[workspace]
|
||||
members = ["ebookm-core", "ebookm-cli"]
|
||||
resolver = "2"
|
||||
|
||||
[workspace.package]
|
||||
edition = "2024"
|
||||
license = "MIT"
|
||||
version = "0.1.0"
|
||||
|
||||
@@ -0,0 +1,482 @@
|
||||
# ebookm
|
||||
|
||||
`ebookm` is a Rust command-line tool that compiles a set of Substack posts and local HTML files into a single EPUB.
|
||||
|
||||
## Current Scope
|
||||
|
||||
`v0.1` supports:
|
||||
|
||||
- YAML manifests
|
||||
- Public Substack post URLs
|
||||
- Local HTML files
|
||||
- Manifest-defined section order and TOC structure
|
||||
- Per-entry metadata and TOC overrides
|
||||
- Basic internal link rewriting between included entries
|
||||
- EPUB generation with bundled article assets
|
||||
|
||||
## Build
|
||||
|
||||
```bash
|
||||
cargo build
|
||||
```
|
||||
|
||||
## Run
|
||||
|
||||
Use the CLI through Cargo:
|
||||
|
||||
```bash
|
||||
cargo run -p ebookm-cli -- <command>
|
||||
```
|
||||
|
||||
Available commands:
|
||||
|
||||
- `build -m <manifest> [-o <output>]`
|
||||
- `validate -m <manifest>`
|
||||
- `inspect <url-or-file> [--format <format>]` (`json` by default; any other value prints plain text)
|
||||
- `init`
|
||||
|
||||
## Quick Start
|
||||
|
||||
This repository includes a runnable example manifest and local HTML fixture in `examples/`.
|
||||
|
||||
Validate the example manifest:
|
||||
|
||||
```bash
|
||||
cargo run -p ebookm-cli -- validate -m examples/example-book.yaml
|
||||
```
|
||||
|
||||
Build the example EPUB:
|
||||
|
||||
```bash
|
||||
cargo run -p ebookm-cli -- build -m examples/example-book.yaml
|
||||
```
|
||||
|
||||
The output EPUB will be written to:
|
||||
|
||||
```text
|
||||
examples/dist/example-book.epub
|
||||
```
|
||||
|
||||
Inspect a local HTML file:
|
||||
|
||||
```bash
|
||||
cargo run -p ebookm-cli -- inspect examples/articles/intro.html
|
||||
```
|
||||
|
||||
Generate a starter manifest:
|
||||
|
||||
```bash
|
||||
cargo run -p ebookm-cli -- init
|
||||
```
|
||||
|
||||
## Validate The Output EPUB
|
||||
|
||||
There are two useful levels of validation.
|
||||
|
||||
Quick XML/XHTML validation for generated chapter files:
|
||||
|
||||
```bash
|
||||
unzip -p path/to/book.epub OEBPS/text/chapter.xhtml | xmllint --noout -
|
||||
```
|
||||
|
||||
If you want to check every generated XHTML file in the EPUB:
|
||||
|
||||
```bash
|
||||
mkdir -p /tmp/ebookm-check
|
||||
unzip -o path/to/book.epub -d /tmp/ebookm-check
|
||||
find /tmp/ebookm-check/OEBPS -name '*.xhtml' -print -exec xmllint --noout {} \;
|
||||
```
|
||||
|
||||
Full EPUB validation:
|
||||
|
||||
Use `epubcheck`, which validates the EPUB package itself, including metadata, navigation files, manifest/spine consistency, and XHTML correctness.
|
||||
|
||||
```bash
|
||||
epubcheck path/to/book.epub
|
||||
```
|
||||
|
||||
Practical guidance:
|
||||
|
||||
- Use `xmllint` when you want to quickly confirm that generated XHTML is well-formed XML.
|
||||
- Use `epubcheck` when you want proper EPUB-level validation before distributing the file.
|
||||
- If an EPUB reader only shows part of a chapter, malformed XHTML is a common cause, so `xmllint` on the generated chapter files is a good first check.
|
||||
|
||||
## Manifest Reference
|
||||
|
||||
The top-level manifest keys are:
|
||||
|
||||
- `book`: EPUB metadata
|
||||
- `output`: output path and optional cover image
|
||||
- `defaults`: shared normalization and metadata defaults
|
||||
- `sections`: ordered TOC and reading-order groups
|
||||
- `entries`: source definitions and per-entry overrides
|
||||
- `link_rules`: cross-link rewriting behavior
|
||||
|
||||
Minimal example:
|
||||
|
||||
```yaml
book:
  title: "My Book"
  author: "Editor"
  language: "en"
  identifier: "urn:uuid:my-book"

output:
  path: "dist/my-book.epub"

sections:
  - id: "part-1"
    title: "Part 1"
    entries:
      - "essay"

entries:
  essay:
    source:
      kind: "html"
      path: "articles/essay.html"

link_rules:
  mode: "auto"
```
|
||||
|
||||
### Top-Level Fields
|
||||
|
||||
- `book`:
|
||||
EPUB metadata block. Required.
|
||||
- `output`:
|
||||
Output configuration block. Required.
|
||||
- `defaults`:
|
||||
Shared defaults applied across entries. Optional.
|
||||
- `sections`:
|
||||
Ordered list of sections. Required in practice for a useful build.
|
||||
- `entries`:
|
||||
Map of entry IDs to entry definitions. Required in practice.
|
||||
- `link_rules`:
|
||||
Global link rewriting policy. Optional.
|
||||
|
||||
### `book`
|
||||
|
||||
- `book.title`:
|
||||
Required string. Book title.
|
||||
- `book.author`:
|
||||
Optional string. Book-level author.
|
||||
- `book.language`:
|
||||
Optional string. Defaults to `en`.
|
||||
- `book.identifier`:
|
||||
Optional string. Defaults to a generated `urn:uuid:...`.
|
||||
- `book.description`:
|
||||
Optional string. Written into the EPUB package metadata.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
book:
  title: "Collected Essays"
  author: "Jane Doe"
  language: "en"
  identifier: "urn:uuid:collected-essays"
  description: "A single-volume EPUB generated by ebookm"
```
|
||||
|
||||
### `output`
|
||||
|
||||
- `output.path`:
|
||||
Required string. Output EPUB path. Resolved relative to the manifest file.
|
||||
- `output.cover_image`:
|
||||
Optional string. Path to a cover image, resolved relative to the manifest file.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
output:
  path: "dist/book.epub"
  cover_image: "assets/cover.jpg"
```
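
As a rough illustration of what "resolved relative to the manifest file" means, the sketch below joins a manifest-relative value onto the manifest's parent directory. The function name `resolve_relative_to_manifest` is hypothetical and is not taken from `ebookm-core`. Absolute values are passed through unchanged because that is how `Path::join` behaves; whether `ebookm` treats them the same way is an assumption.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical helper: resolve `value` against the directory that
/// contains the manifest file. Not the actual ebookm-core function.
fn resolve_relative_to_manifest(manifest_path: &Path, value: &str) -> PathBuf {
    // A bare file name such as "book.yaml" has an empty parent, which
    // leaves `value` relative to the current directory.
    let base = manifest_path.parent().unwrap_or_else(|| Path::new(""));
    // `join` returns `value` unchanged when it is already absolute.
    base.join(value)
}

fn main() {
    let resolved = resolve_relative_to_manifest(
        Path::new("examples/example-book.yaml"),
        "dist/example-book.epub",
    );
    println!("{}", resolved.display());
}
```

Under these assumptions, a manifest at `examples/example-book.yaml` with `output.path: "dist/example-book.epub"` resolves to the `examples/dist/example-book.epub` location mentioned in the Quick Start.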
|
||||
|
||||
### `defaults`
|
||||
|
||||
- `defaults.fetch_images`:
|
||||
Optional boolean. Defaults to `true`. When enabled, image assets referenced from article HTML are fetched and bundled into the EPUB.
|
||||
- `defaults.normalize_substack_embeds`:
|
||||
Optional boolean. Defaults to `true`. Currently removes iframe embeds during normalization.
|
||||
- `defaults.metadata`:
|
||||
Optional metadata override block applied after extracted source metadata and before per-entry overrides.
|
||||
- `defaults.processing`:
|
||||
Optional shared article-processing and chapter-header defaults.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
defaults:
  fetch_images: true
  normalize_substack_embeds: true
  processing:
    include_author: true
    include_date: true
    include_source_url: true
    skip_first_paragraphs: 0
  metadata:
    author: "Editorial Team"
```
|
||||
|
||||
### `defaults.processing` and `entries.<id>.processing`
|
||||
|
||||
These fields control chapter-header rendering and article trimming.
|
||||
|
||||
For `defaults.processing`, all fields are concrete values with defaults.
|
||||
For `entries.<id>.processing`, the same fields are optional overrides.
|
||||
|
||||
- `include_author`:
|
||||
Boolean. Defaults to `true`. Controls whether the extracted or overridden author name is shown at the start of the chapter.
|
||||
- `include_date`:
|
||||
Boolean. Defaults to `true`. Controls whether the extracted or overridden publication date is shown at the start of the chapter.
|
||||
- `include_source_url`:
|
||||
Boolean. Defaults to `true`. Controls whether the canonical article URL is shown at the start of the chapter.
|
||||
- `skip_first_paragraphs`:
|
||||
Integer. Defaults to `0`. Removes the first `n` paragraph elements from the extracted article body before EPUB generation.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
defaults:
  processing:
    include_author: true
    include_date: false
    include_source_url: false
    skip_first_paragraphs: 0
```
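
The split between concrete defaults and optional per-entry overrides can be sketched as two structs, one with plain fields and one with `Option` fields. The type and method names below (`Processing`, `ProcessingOverride`, `with_override`) are illustrative assumptions, not the actual `ebookm-core` types; only the field names and default values come from the reference above.

```rust
/// Concrete processing settings, using the documented default values.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Processing {
    include_author: bool,
    include_date: bool,
    include_source_url: bool,
    skip_first_paragraphs: usize,
}

impl Default for Processing {
    fn default() -> Self {
        Self {
            include_author: true,
            include_date: true,
            include_source_url: true,
            skip_first_paragraphs: 0,
        }
    }
}

/// Per-entry overrides (hypothetical shape): `None` inherits the default.
#[derive(Debug, Clone, Copy, Default)]
struct ProcessingOverride {
    include_author: Option<bool>,
    include_date: Option<bool>,
    include_source_url: Option<bool>,
    skip_first_paragraphs: Option<usize>,
}

impl Processing {
    /// Apply an entry's overrides on top of the shared defaults.
    fn with_override(self, entry: ProcessingOverride) -> Processing {
        Processing {
            include_author: entry.include_author.unwrap_or(self.include_author),
            include_date: entry.include_date.unwrap_or(self.include_date),
            include_source_url: entry
                .include_source_url
                .unwrap_or(self.include_source_url),
            skip_first_paragraphs: entry
                .skip_first_paragraphs
                .unwrap_or(self.skip_first_paragraphs),
        }
    }
}

fn main() {
    let defaults = Processing {
        include_date: false,
        ..Processing::default()
    };
    let entry = ProcessingOverride {
        skip_first_paragraphs: Some(1),
        ..ProcessingOverride::default()
    };
    println!("{:?}", defaults.with_override(entry));
}
```

Keeping the override struct all-`Option` means an entry only has to list the fields it wants to change.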
|
||||
|
||||
### `defaults.metadata` and `entries.<id>.metadata`
|
||||
|
||||
These fields use the same shape:
|
||||
|
||||
- `author`:
|
||||
Optional string.
|
||||
- `published`:
|
||||
Optional date in `YYYY-MM-DD` format.
|
||||
- `subtitle`:
|
||||
Optional string. Accepted by the parser but not yet emitted into the EPUB output.
|
||||
- `summary`:
|
||||
Optional string. Accepted by the parser but not yet emitted into the EPUB output.
|
||||
- `tags`:
|
||||
Optional list of strings. Accepted by the parser but not yet used in link or EPUB output logic.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
metadata:
  author: "Jane Doe"
  published: "2025-01-10"
  subtitle: "Notebook entry"
  summary: "A short summary"
  tags: ["essay", "history"]
```
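
Since `defaults.metadata` is applied after extracted source metadata and before per-entry overrides, the effective precedence reads as: entry override, then defaults, then extracted value. The sketch below shows one way to layer the three. The `Metadata` struct is trimmed to two fields, dates are kept as plain strings, and the `overlay` and `resolve` helpers are hypothetical names, so treat this as an interpretation of the rule rather than the project's implementation.

```rust
/// Trimmed, hypothetical metadata shape (only two of the documented fields).
#[derive(Debug, Clone, Default, PartialEq)]
struct Metadata {
    author: Option<String>,
    published: Option<String>,
}

/// Overlay `upper` on `base`: a field set in `upper` wins,
/// otherwise the value from `base` is kept.
fn overlay(base: Metadata, upper: &Metadata) -> Metadata {
    Metadata {
        author: upper.author.clone().or(base.author),
        published: upper.published.clone().or(base.published),
    }
}

/// Extracted metadata first, then `defaults.metadata`, then the entry's own block.
fn resolve(extracted: Metadata, defaults: &Metadata, entry: &Metadata) -> Metadata {
    overlay(overlay(extracted, defaults), entry)
}

fn main() {
    let extracted = Metadata {
        author: Some("Source Author".to_string()),
        published: Some("2025-01-10".to_string()),
    };
    let defaults = Metadata {
        author: Some("Editorial Team".to_string()),
        published: None,
    };
    let entry = Metadata {
        author: None,
        published: Some("2025-02-01".to_string()),
    };
    println!("{:?}", resolve(extracted, &defaults, &entry));
}
```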
|
||||
|
||||
### `sections`
|
||||
|
||||
`sections` is an ordered list. Section order controls reading order and TOC grouping.
|
||||
|
||||
Each section supports:
|
||||
|
||||
- `id`:
|
||||
Required string. Stable section identifier.
|
||||
- `title`:
|
||||
Required string. Section title shown in the TOC.
|
||||
- `entries`:
|
||||
Optional list of entry IDs in reading order. Usually should not be empty.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
sections:
  - id: "part-1"
    title: "Part 1"
    entries:
      - "opening-post"
      - "notes"
```
|
||||
|
||||
### `entries`
|
||||
|
||||
`entries` is a map from entry ID to entry definition. Entry IDs are referenced from `sections` and link rules.
|
||||
|
||||
Each entry supports:
|
||||
|
||||
- `source`:
|
||||
Required source definition block.
|
||||
- `title`:
|
||||
Optional string. Overrides the extracted article title.
|
||||
- `metadata`:
|
||||
Optional metadata override block.
|
||||
- `toc`:
|
||||
Optional TOC override block.
|
||||
- `links`:
|
||||
Optional per-entry link-policy block.
|
||||
- `processing`:
|
||||
Optional per-entry processing override block.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
entries:
  opening-post:
    source:
      kind: "substack"
      url: "https://example.substack.com/p/opening-post"
    title: "Opening Post"
    metadata:
      published: "2025-01-10"
    processing:
      include_source_url: false
      skip_first_paragraphs: 1
    toc:
      title: "Introduction"
    links:
      mode: "explicit"
      allow_to: ["notes"]
```
|
||||
|
||||
### `entries.<id>.source`
|
||||
|
||||
Two source kinds are supported:
|
||||
|
||||
- `kind: "substack"`:
|
||||
Use a public Substack article URL.
|
||||
Fields:
|
||||
`url` required string.
|
||||
- `kind: "html"`:
|
||||
Use a local HTML file.
|
||||
Fields:
|
||||
`path` required string, resolved relative to the manifest file.
|
||||
|
||||
Examples:
|
||||
|
||||
```yaml
source:
  kind: "substack"
  url: "https://example.substack.com/p/my-post"
```
|
||||
|
||||
```yaml
source:
  kind: "html"
  path: "articles/local-post.html"
```
|
||||
|
||||
You can mix Substack URLs and local HTML files in the same manifest:
|
||||
|
||||
```yaml
entries:
  remote-post:
    source:
      kind: "substack"
      url: "https://example.substack.com/p/remote-post"

  local-post:
    source:
      kind: "html"
      path: "articles/local-post.html"
```
|
||||
|
||||
### `entries.<id>.toc`
|
||||
|
||||
- `title`:
|
||||
Optional string. Overrides the chapter label used in the TOC.
|
||||
- `hidden`:
|
||||
Optional boolean. Defaults to `false`. When `true`, the entry is omitted from the TOC.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
toc:
  title: "Appendix"
  hidden: false
```
|
||||
|
||||
### `entries.<id>.links`
|
||||
|
||||
- `mode`:
|
||||
Optional string. One of:
|
||||
`auto`, `explicit`, `none`.
|
||||
If omitted, the global `link_rules.mode` is used.
|
||||
- `allow_to`:
|
||||
Optional list of entry IDs. If set, rewritten internal links are limited to these targets.
|
||||
- `block_to`:
|
||||
Optional list of entry IDs. These targets are excluded from rewriting.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
links:
  mode: "explicit"
  allow_to: ["intro", "appendix"]
  block_to: ["draft-notes"]
```
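
One plausible reading of how `allow_to` and `block_to` combine is sketched below: a blocked target is never rewritten, and when an allow list is present only its members are eligible. The `LinkPolicy` type and `permits` method are hypothetical, and giving `block_to` priority over `allow_to` is an assumption; the reference above does not say which list wins when a target appears in both.

```rust
/// Hypothetical per-entry link policy, modeled on the `links` block.
struct LinkPolicy<'a> {
    /// `None` means no allow list was set, so every target is eligible.
    allow_to: Option<Vec<&'a str>>,
    /// Targets that are never rewritten.
    block_to: Vec<&'a str>,
}

impl LinkPolicy<'_> {
    /// Returns true if a link pointing at `target` may be rewritten.
    fn permits(&self, target: &str) -> bool {
        // Assumption: a blocked target loses even if it is also allowed.
        if self.block_to.iter().any(|id| *id == target) {
            return false;
        }
        match &self.allow_to {
            Some(allowed) => allowed.iter().any(|id| *id == target),
            None => true,
        }
    }
}

fn main() {
    let policy = LinkPolicy {
        allow_to: Some(vec!["intro", "appendix"]),
        block_to: vec!["draft-notes"],
    };
    for target in ["intro", "draft-notes", "elsewhere"] {
        println!("{target}: {}", policy.permits(target));
    }
}
```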
|
||||
|
||||
### `link_rules`
|
||||
|
||||
- `link_rules.mode`:
|
||||
Optional string. One of:
|
||||
`auto`, `explicit`, `none`.
|
||||
Defaults to `auto`.
|
||||
- `link_rules.rewrite_external_substack_links`:
|
||||
Optional boolean. Defaults to `true`. Accepted by the manifest parser, but not currently used to change behavior in `v0.1`.
|
||||
- `link_rules.preserve_other_external_links`:
|
||||
Optional boolean. Defaults to `true`. Accepted by the manifest parser, but not currently used to change behavior in `v0.1`.
|
||||
- `link_rules.rules`:
|
||||
Optional list of explicit link rules.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
link_rules:
  mode: "explicit"
  rewrite_external_substack_links: true
  preserve_other_external_links: true
  rules:
    - from: ["notes"]
      to: ["intro"]
      match_mode: "canonical-url"
```
|
||||
|
||||
### `link_rules.rules[]`
|
||||
|
||||
Each rule supports:
|
||||
|
||||
- `from`:
|
||||
Required list of selectors describing where the rule applies.
|
||||
- `to`:
|
||||
Required list of selectors describing eligible targets.
|
||||
- `match_mode`:
|
||||
Optional string. One of:
|
||||
`canonical-url`, `source-url`, `disabled`.
|
||||
Defaults to `canonical-url`.
|
||||
|
||||
Supported selectors in `from` and `to`:
|
||||
|
||||
- `*`:
|
||||
Match all entries.
|
||||
- `<entry-id>`:
|
||||
Match one entry by ID.
|
||||
- `section:<section-id>`:
|
||||
Match all entries referenced by that section.
|
||||
|
||||
Example:
|
||||
|
||||
```yaml
link_rules:
  mode: "explicit"
  rules:
    - from: ["section:essays"]
      to: ["section:essays"]
      match_mode: "canonical-url"
```
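
The three selector forms (`*`, `<entry-id>`, `section:<section-id>`) can be interpreted with a small parser and matcher, sketched below. The `Selector` enum, `parse_selector`, and `selector_matches` are hypothetical names, and the section lookup table is a simplified stand-in for the manifest's `sections` list; none of this is taken from `ebookm-core`.

```rust
use std::collections::HashMap;

/// Hypothetical parsed form of a `from`/`to` selector string.
#[derive(Debug, PartialEq)]
enum Selector<'a> {
    All,
    Section(&'a str),
    Entry(&'a str),
}

fn parse_selector(raw: &str) -> Selector<'_> {
    if raw == "*" {
        Selector::All
    } else if let Some(section_id) = raw.strip_prefix("section:") {
        Selector::Section(section_id)
    } else {
        Selector::Entry(raw)
    }
}

/// Returns true when `entry_id` is matched by `raw_selector`.
/// `sections` maps a section ID to the entry IDs it references.
fn selector_matches(
    raw_selector: &str,
    entry_id: &str,
    sections: &HashMap<&str, Vec<&str>>,
) -> bool {
    match parse_selector(raw_selector) {
        Selector::All => true,
        Selector::Entry(id) => id == entry_id,
        // An unknown section ID matches nothing.
        Selector::Section(section_id) => sections
            .get(section_id)
            .is_some_and(|entries| entries.iter().any(|id| *id == entry_id)),
    }
}

fn main() {
    let mut sections = HashMap::new();
    sections.insert("essays", vec!["intro", "notes"]);
    println!("{}", selector_matches("section:essays", "notes", &sections));
}
```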
|
||||
|
||||
## Notes
|
||||
|
||||
- Output paths are resolved relative to the manifest file location.
|
||||
- Local HTML paths are also resolved relative to the manifest file location.
|
||||
- `sections` and `entries` are deserialized with empty defaults, but `validate` and `build` expect them to be meaningfully populated.
|
||||
- `subtitle`, `summary`, `tags`, `rewrite_external_substack_links`, and `preserve_other_external_links` are accepted by the manifest parser today but do not yet change the generated EPUB.
|
||||
- For Substack sources, `v0.1` assumes public posts. Subscriber-only/session-based fetching is not implemented.
|
||||
Binary file not shown.
|
After Width: | Height: | Size: 926 KiB |
@@ -0,0 +1,76 @@
book:
  title: "Age of Peace"
  author: "John Gu"
  language: "en"
  identifier: "urn:uuid:ageofpeace:johngu"
  description: "Age of Peace: a novel"

output:
  path: "AgeOfPeace.epub"
  cover_image: "age_of_peace_cover.jpg"

defaults:
  metadata:
    author: "John Gu"
  fetch_images: true
  normalize_substack_embeds: true
  processing:
    include_author: false
    include_date: false
    include_source_url: false

sections:
  - id: "part-0"
    title: "Prelude"
    entries:
      - "intro"
  - id: "part-1"
    title: "Overture"
    entries:
      - "contested_island"
  - id: "part-2"
    title: "The High Castle"
    entries:
      - "vira"
      - "madelyna"
      - "biridana"
  - id: "part-3"
    title: "Nameless Country"
    entries: []


entries:
  intro:
    source:
      kind: "html"
      path: "ageofpeace/introduction.html"
  contested_island:
    source:
      kind: "substack"
      url: "https://ageofpeace.substack.com/p/a-contested-island"
    toc:
      title: "A Contested Island"
    processing:
      skip_first_paragraphs: 1
  vira:
    source:
      kind: "substack"
      url: "https://ageofpeace.substack.com/p/vira"
    toc:
      title: "Vira"
  madelyna:
    source:
      kind: "substack"
      url: "https://ageofpeace.substack.com/p/madelyna"
    toc:
      title: "Madelỳna"
  biridana:
    source:
      kind: "substack"
      url: "https://ageofpeace.substack.com/p/biridana"
    toc:
      title: "Biridana"


link_rules:
  mode: "auto"
@@ -0,0 +1,33 @@
|
||||
<!doctype html>
|
||||
<html>
|
||||
<head>
|
||||
<title>Introduction</title>
|
||||
<meta name="author" content="John Gu" />
|
||||
<meta property="article:published_time" content="2025-07-05T00:00:00Z" />
|
||||
<link rel="canonical" href="https://ageofpeace.substack.com/p/introduction-and-welcome" />
|
||||
</head>
|
||||
<body>
|
||||
<article>
|
||||
<p><em>After securing a teaching job at a foreign university on “the thinnest set of credentials,” a young man sets off for life in Varrenia, an impoverished eastern kingdom still emerging from the shadow of a decades-long dictatorship. Years later, living in the decadent capital of Garamdal, our protagonist watches a war unfold in the republic’s restive eastern provinces and reflects on what he has gained — and lost — in a life of travel.</em></p>
|
||||
<h2>About the author</h2>
|
||||
<p>
|
||||
In my late twenties, I did a short stint as a grad student in mathematical logic at the University of Amsterdam, where I learned that I am <em>not</em> smart enough to be a mathematician. After dropping out of grad school, I ended up staying in Europe for five years. This novel, greatly influenced by that experience, is my love letter to Europe and to that time.
|
||||
</p>
|
||||
<img src="johngu.jpg" alt="John Gu" />
|
||||
<p>I am also very proud to be able to say that I grew up in Houston — I actually come from the same neighborhood as Lizzo, Mo Amer, and Tila Tequila, cultural luminaries all.</p>
|
||||
<h2>Previous publications</h2>
|
||||
<p> TBD </p>
|
||||
|
||||
<h2>Influences</h2>
|
||||
<p>Some readers of my work have remarked that it shares an affinity with the following writers and books. In some cases, these figures represent inspirations that I have leaned into; in others, the similarities (of theme, style, subject matter) are more coincidental:</p>
|
||||
<ul>
|
||||
<li>In Patagonia, Bruce Chatwin</li>
|
||||
<li>Waiting for the Barbarians, J.M. Coetzee</li>
|
||||
<li>Balkan Ghosts, Robert Kaplan</li>
|
||||
<li>A Bend in the River, V.S. Naipaul</li>
|
||||
<li>Paul Bowles</li>
|
||||
<li>Milan Kundera</li>
|
||||
</ul>
|
||||
</article>
|
||||
</body>
|
||||
</html>
|
||||
Binary file not shown.
|
After Width: | Height: | Size: 119 KiB |
@@ -0,0 +1,10 @@
|
||||
[package]
|
||||
name = "ebookm-cli"
|
||||
version = "0.1.0"
|
||||
edition = "2024"
|
||||
|
||||
[dependencies]
|
||||
clap = { version = "4.5", features = ["derive"] }
|
||||
ebookm-core = { path = "../ebookm-core" }
|
||||
miette = { version = "7.2", features = ["fancy"] }
|
||||
serde_json = "1.0"
|
||||
@@ -0,0 +1,89 @@
|
||||
use std::path::PathBuf;
|
||||
|
||||
use clap::{Parser, Subcommand};
|
||||
use ebookm_core::{
|
||||
build_epub, inspect_source, load_manifest, render_init_manifest, validate_manifest,
|
||||
};
|
||||
use miette::{Context, IntoDiagnostic};
|
||||
|
||||
#[derive(Debug, Parser)]
|
||||
#[command(
|
||||
name = "ebookm",
|
||||
version,
|
||||
about = "Compile Substack articles into a single EPUB"
|
||||
)]
|
||||
struct Cli {
|
||||
#[command(subcommand)]
|
||||
command: Commands,
|
||||
}
|
||||
|
||||
#[derive(Debug, Subcommand)]
|
||||
enum Commands {
|
||||
Build {
|
||||
#[arg(short, long)]
|
||||
manifest: PathBuf,
|
||||
#[arg(short, long)]
|
||||
output: Option<PathBuf>,
|
||||
},
|
||||
Validate {
|
||||
#[arg(short, long)]
|
||||
manifest: PathBuf,
|
||||
},
|
||||
Inspect {
|
||||
source: String,
|
||||
#[arg(long, default_value = "json")]
|
||||
format: String,
|
||||
},
|
||||
Init,
|
||||
}
|
||||
|
||||
fn main() -> miette::Result<()> {
|
||||
let cli = Cli::parse();
|
||||
|
||||
match cli.command {
|
||||
Commands::Build { manifest, output } => {
|
||||
let mut loaded = load_manifest(&manifest).into_diagnostic()?;
|
||||
if let Some(output) = output {
|
||||
loaded.output.path = output.display().to_string();
|
||||
}
|
||||
let warnings = validate_manifest(&loaded).into_diagnostic()?;
|
||||
for warning in warnings {
|
||||
eprintln!("warning: {warning}");
|
||||
}
|
||||
build_epub(&loaded, &manifest).into_diagnostic()?;
|
||||
println!("{}", loaded.output.path);
|
||||
}
|
||||
Commands::Validate { manifest } => {
|
||||
let loaded = load_manifest(&manifest).into_diagnostic()?;
|
||||
let warnings = validate_manifest(&loaded).into_diagnostic()?;
|
||||
for warning in warnings {
|
||||
println!("warning: {warning}");
|
||||
}
|
||||
println!("manifest is valid");
|
||||
}
|
||||
Commands::Inspect { source, format } => {
|
||||
let result = inspect_source(&source).into_diagnostic()?;
|
||||
if format == "json" {
|
||||
println!(
|
||||
"{}",
|
||||
serde_json::to_string_pretty(&result)
|
||||
.into_diagnostic()
|
||||
.wrap_err("failed to encode JSON")?
|
||||
);
|
||||
} else {
|
||||
println!("title: {}", result.title.unwrap_or_default());
|
||||
println!("author: {}", result.author.unwrap_or_default());
|
||||
println!("published: {}", result.published.unwrap_or_default());
|
||||
println!(
|
||||
"canonical_url: {}",
|
||||
result.canonical_url.unwrap_or_default()
|
||||
);
|
||||
}
|
||||
}
|
||||
Commands::Init => {
|
||||
print!("{}", render_init_manifest());
|
||||
}
|
||||
}
|
||||
|
||||
Ok(())
|
||||
}
|
||||
@@ -0,0 +1,25 @@
|
||||
[package]
|
||||
name = "ebookm-core"
|
||||
version = "0.1.0"
|
||||
edition = "2024"
|
||||
|
||||
[dependencies]
|
||||
chrono = { version = "0.4", features = ["serde"] }
|
||||
indexmap = { version = "2.7", features = ["serde"] }
|
||||
kuchiki = "0.8"
|
||||
miette = { version = "7.2", features = ["fancy"] }
|
||||
quick-xml = "0.38"
|
||||
regex = "1.11"
|
||||
reqwest = { version = "0.12", default-features = false, features = ["blocking", "rustls-tls"] }
|
||||
scraper = "0.24"
|
||||
serde = { version = "1.0", features = ["derive"] }
|
||||
serde_json = "1.0"
|
||||
serde_yaml = "0.9"
|
||||
sha1 = "0.10"
|
||||
thiserror = "2.0"
|
||||
url = { version = "2.5", features = ["serde"] }
|
||||
uuid = { version = "1.18", features = ["v4"] }
|
||||
zip = "4.6"
|
||||
|
||||
[dev-dependencies]
|
||||
tempfile = "3.15"
|
||||
@@ -0,0 +1,298 @@
|
||||
use std::collections::BTreeSet;
|
||||
use std::fs::File;
|
||||
use std::io::Write;
|
||||
use std::path::Path;
|
||||
|
||||
use quick_xml::escape::escape;
|
||||
use zip::CompressionMethod;
|
||||
use zip::write::{SimpleFileOptions, ZipWriter};
|
||||
|
||||
use crate::error::{EbookmError, Result};
|
||||
use crate::pipeline::BuiltEntry;
|
||||
|
||||
/// Writes the EPUB archive to `output_path`: the `mimetype` entry first, then the
/// container file, stylesheet, navigation documents, OPF package, optional cover,
/// and one XHTML chapter per built entry together with its deduplicated assets.
pub fn write_epub(
|
||||
manifest: &crate::manifest::Manifest,
|
||||
built: &[BuiltEntry],
|
||||
output_path: &Path,
|
||||
cover_bytes: Option<(String, Vec<u8>)>,
|
||||
) -> Result<()> {
|
||||
if let Some(parent) = output_path.parent() {
|
||||
std::fs::create_dir_all(parent).map_err(|source| EbookmError::Io {
|
||||
path: parent.display().to_string(),
|
||||
source,
|
||||
})?;
|
||||
}
|
||||
|
||||
let file = File::create(output_path).map_err(|source| EbookmError::Io {
|
||||
path: output_path.display().to_string(),
|
||||
source,
|
||||
})?;
|
||||
let mut zip = ZipWriter::new(file);
|
||||
|
||||
// The EPUB container format requires `mimetype` to be the first entry in the
// archive and to be stored without compression.
let stored = SimpleFileOptions::default().compression_method(CompressionMethod::Stored);
|
||||
zip.start_file("mimetype", stored)
|
||||
.map_err(|error| EbookmError::Epub {
|
||||
message: error.to_string(),
|
||||
})?;
|
||||
zip.write_all(b"application/epub+zip")
|
||||
.map_err(|error| EbookmError::Epub {
|
||||
message: error.to_string(),
|
||||
})?;
|
||||
|
||||
let deflated = SimpleFileOptions::default().compression_method(CompressionMethod::Deflated);
|
||||
write_file(&mut zip, "META-INF/container.xml", deflated, CONTAINER_XML)?;
|
||||
write_file(&mut zip, "OEBPS/styles/book.css", deflated, DEFAULT_STYLES)?;
|
||||
|
||||
let nav = build_nav(manifest, built);
|
||||
let ncx = build_ncx(manifest, built);
|
||||
let opf = build_opf(
|
||||
manifest,
|
||||
built,
|
||||
cover_bytes.as_ref().map(|(href, _)| href.as_str()),
|
||||
);
|
||||
|
||||
write_file(&mut zip, "OEBPS/nav.xhtml", deflated, &nav)?;
|
||||
write_file(&mut zip, "OEBPS/toc.ncx", deflated, &ncx)?;
|
||||
write_file(&mut zip, "OEBPS/content.opf", deflated, &opf)?;
|
||||
|
||||
if let Some((href, bytes)) = cover_bytes {
|
||||
write_bytes(&mut zip, &format!("OEBPS/{href}"), deflated, &bytes)?;
|
||||
}
|
||||
|
||||
let mut seen_assets = BTreeSet::new();
|
||||
for entry in built {
|
||||
write_file(
|
||||
&mut zip,
|
||||
&format!("OEBPS/text/{}.xhtml", entry.id),
|
||||
deflated,
|
||||
&entry.chapter.xhtml,
|
||||
)?;
|
||||
for asset in &entry.assets {
|
||||
if seen_assets.insert(asset.href.clone()) {
|
||||
write_bytes(
|
||||
&mut zip,
|
||||
&format!("OEBPS/{}", asset.href),
|
||||
deflated,
|
||||
&asset.bytes,
|
||||
)?;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
zip.finish().map_err(|error| EbookmError::Epub {
|
||||
message: error.to_string(),
|
||||
})?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn write_file(
|
||||
zip: &mut ZipWriter<File>,
|
||||
path: &str,
|
||||
options: SimpleFileOptions,
|
||||
contents: &str,
|
||||
) -> Result<()> {
|
||||
write_bytes(zip, path, options, contents.as_bytes())
|
||||
}
|
||||
|
||||
fn write_bytes(
|
||||
zip: &mut ZipWriter<File>,
|
||||
path: &str,
|
||||
options: SimpleFileOptions,
|
||||
contents: &[u8],
|
||||
) -> Result<()> {
|
||||
zip.start_file(path, options)
|
||||
.map_err(|error| EbookmError::Epub {
|
||||
message: error.to_string(),
|
||||
})?;
|
||||
zip.write_all(contents).map_err(|error| EbookmError::Epub {
|
||||
message: error.to_string(),
|
||||
})?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
fn build_nav(manifest: &crate::manifest::Manifest, built: &[BuiltEntry]) -> String {
|
||||
let mut nav_points = String::new();
|
||||
for section in &manifest.sections {
|
||||
let section_target = section
|
||||
.entries
|
||||
.iter()
|
||||
.find_map(|entry_id| built.iter().find(|candidate| &candidate.id == entry_id))
|
||||
.map(|entry| format!("text/{}.xhtml", entry.id));
|
||||
nav_points.push_str("<li>");
|
||||
if let Some(target) = section_target {
|
||||
nav_points.push_str(&format!(
|
||||
"<a href=\"{}\">{}</a><ol>",
|
||||
escape(&target),
|
||||
escape(&section.title)
|
||||
));
|
||||
} else {
|
||||
nav_points.push_str(&format!("<span>{}</span><ol>", escape(&section.title)));
|
||||
}
|
||||
for entry_id in &section.entries {
|
||||
if let Some(entry) = built.iter().find(|candidate| &candidate.id == entry_id) {
|
||||
if entry.hidden_from_toc {
|
||||
continue;
|
||||
}
|
||||
nav_points.push_str(&format!(
|
||||
"<li><a href=\"text/{}.xhtml\">{}</a></li>",
|
||||
escape(&entry.id),
|
||||
escape(&entry.chapter.nav_title)
|
||||
));
|
||||
}
|
||||
}
|
||||
nav_points.push_str("</ol></li>");
|
||||
}
|
||||
|
||||
format!(
|
||||
r#"<?xml version="1.0" encoding="UTF-8"?>
|
||||
<html xmlns="http://www.w3.org/1999/xhtml" xmlns:epub="http://www.idpf.org/2007/ops">
|
||||
<head>
|
||||
<title>{}</title>
|
||||
<link rel="stylesheet" type="text/css" href="styles/book.css"/>
|
||||
</head>
|
||||
<body>
|
||||
<nav epub:type="toc" id="toc">
|
||||
<h1>{}</h1>
|
||||
<ol>{}</ol>
|
||||
</nav>
|
||||
</body>
|
||||
</html>"#,
|
||||
escape(&manifest.book.title),
|
||||
escape(&manifest.book.title),
|
||||
nav_points
|
||||
)
|
||||
}
|
||||
|
||||
fn build_ncx(manifest: &crate::manifest::Manifest, built: &[BuiltEntry]) -> String {
|
||||
let mut play_order = 1usize;
|
||||
let mut nav_points = String::new();
|
||||
for section in &manifest.sections {
|
||||
let section_entries: Vec<_> = section
|
||||
.entries
|
||||
.iter()
|
||||
.filter_map(|entry_id| built.iter().find(|candidate| &candidate.id == entry_id))
|
||||
.filter(|entry| !entry.hidden_from_toc)
|
||||
.collect();
|
||||
|
||||
if section_entries.is_empty() {
|
||||
continue;
|
||||
}
|
||||
|
||||
let section_play_order = play_order;
|
||||
play_order += 1;
|
||||
|
||||
let mut child_points = String::new();
|
||||
for entry in &section_entries {
|
||||
child_points.push_str(&format!(
|
||||
"<navPoint id=\"nav-{}\" playOrder=\"{}\"><navLabel><text>{}</text></navLabel><content src=\"text/{}.xhtml\"/></navPoint>",
|
||||
escape(&entry.id),
|
||||
play_order,
|
||||
escape(&entry.chapter.nav_title),
|
||||
escape(&entry.id)
|
||||
));
|
||||
play_order += 1;
|
||||
}
|
||||
|
||||
nav_points.push_str(&format!(
|
||||
"<navPoint id=\"section-{}\" playOrder=\"{}\"><navLabel><text>{}</text></navLabel><content src=\"text/{}.xhtml\"/>{}</navPoint>",
|
||||
escape(&section.id),
|
||||
section_play_order,
|
||||
escape(&section.title),
|
||||
escape(&section_entries[0].id),
|
||||
child_points
|
||||
));
|
||||
}
|
||||
|
||||
format!(
|
||||
r#"<?xml version="1.0" encoding="UTF-8"?>
|
||||
<ncx xmlns="http://www.daisy.org/z3986/2005/ncx/" version="2005-1">
|
||||
<head>
|
||||
<meta name="dtb:uid" content="{}"/>
|
||||
</head>
|
||||
<docTitle><text>{}</text></docTitle>
|
||||
<navMap>{}</navMap>
|
||||
</ncx>"#,
|
||||
escape(&manifest.book.identifier),
|
||||
escape(&manifest.book.title),
|
||||
nav_points
|
||||
)
|
||||
}
|
||||
|
||||
fn build_opf(
|
||||
manifest: &crate::manifest::Manifest,
|
||||
built: &[BuiltEntry],
|
||||
cover_href: Option<&str>,
|
||||
) -> String {
|
||||
let mut manifest_items = String::from(
|
||||
r#"<item id="nav" href="nav.xhtml" media-type="application/xhtml+xml" properties="nav"/>
|
||||
<item id="ncx" href="toc.ncx" media-type="application/x-dtbncx+xml"/>
|
||||
<item id="css" href="styles/book.css" media-type="text/css"/>"#,
|
||||
);
|
||||
let mut spine_items = String::new();
|
||||
|
||||
for entry in built {
|
||||
manifest_items.push_str(&format!(
|
||||
"<item id=\"{}\" href=\"text/{}.xhtml\" media-type=\"application/xhtml+xml\"/>",
|
||||
escape(&entry.id),
|
||||
escape(&entry.id)
|
||||
));
|
||||
spine_items.push_str(&format!("<itemref idref=\"{}\"/>", escape(&entry.id)));
|
||||
for asset in &entry.assets {
|
||||
manifest_items.push_str(&format!(
|
||||
"<item id=\"{}\" href=\"{}\" media-type=\"{}\"/>",
|
||||
escape(&asset.id),
|
||||
escape(&asset.href),
|
||||
escape(&asset.media_type)
|
||||
));
|
||||
}
|
||||
}
|
||||
|
||||
if let Some(cover_href) = cover_href {
|
||||
manifest_items.push_str(&format!(
|
||||
"<item id=\"cover\" href=\"{}\" media-type=\"image/jpeg\" properties=\"cover-image\"/>",
|
||||
escape(cover_href)
|
||||
));
|
||||
}
|
||||
|
||||
let author = manifest
|
||||
.book
|
||||
.author
|
||||
.clone()
|
||||
.unwrap_or_else(|| "Unknown".to_string());
|
||||
let description = manifest.book.description.clone().unwrap_or_default();
|
||||
format!(
|
||||
r#"<?xml version="1.0" encoding="UTF-8"?>
|
||||
<package version="3.0" xmlns="http://www.idpf.org/2007/opf" unique-identifier="bookid">
|
||||
<metadata xmlns:dc="http://purl.org/dc/elements/1.1/">
|
||||
<dc:identifier id="bookid">{}</dc:identifier>
|
||||
<dc:title>{}</dc:title>
|
||||
<dc:creator>{}</dc:creator>
|
||||
<dc:language>{}</dc:language>
|
||||
<dc:description>{}</dc:description>
|
||||
</metadata>
|
||||
<manifest>{}</manifest>
|
||||
<spine toc="ncx">{}</spine>
|
||||
</package>"#,
|
||||
escape(&manifest.book.identifier),
|
||||
escape(&manifest.book.title),
|
||||
escape(&author),
|
||||
escape(&manifest.book.language),
|
||||
escape(&description),
|
||||
manifest_items,
|
||||
spine_items
|
||||
)
|
||||
}
|
||||
|
||||
const CONTAINER_XML: &str = r#"<?xml version="1.0" encoding="UTF-8"?>
|
||||
<container version="1.0" xmlns="urn:oasis:names:tc:opendocument:xmlns:container">
|
||||
<rootfiles>
|
||||
<rootfile full-path="OEBPS/content.opf" media-type="application/oebps-package+xml"/>
|
||||
</rootfiles>
|
||||
</container>"#;
|
||||
|
||||
const DEFAULT_STYLES: &str = r#"body { font-family: serif; line-height: 1.5; margin: 5%; }
|
||||
h1 { margin-bottom: 0.2em; }
|
||||
.chapter-meta { color: #555; font-size: 0.9em; margin-bottom: 1.5em; }
|
||||
img { max-width: 100%; height: auto; }
|
||||
a { color: #0b4f7a; text-decoration: none; }
|
||||
"#;
|
||||
@@ -0,0 +1,39 @@
|
||||
use thiserror::Error;

pub type Result<T> = std::result::Result<T, EbookmError>;

#[derive(Debug, Error)]
pub enum EbookmError {
    #[error("failed to read file {path}: {source}")]
    Io {
        path: String,
        #[source]
        source: std::io::Error,
    },
    #[error("failed to parse manifest {path}: {source}")]
    ManifestParse {
        path: String,
        #[source]
        source: serde_yaml::Error,
    },
    #[error("manifest validation failed: {}", .issues.join("; "))]
    Validation { issues: Vec<String> },
    #[error("network request failed for {url}: {source}")]
    Request {
        url: String,
        #[source]
        source: reqwest::Error,
    },
    #[error("invalid source path: {path}")]
    InvalidSourcePath { path: String },
    #[error("failed to parse URL {value}: {source}")]
    UrlParse {
        value: String,
        #[source]
        source: url::ParseError,
    },
    #[error("article extraction failed for {input}")]
    Extraction { input: String },
    #[error("EPUB generation failed: {message}")]
    Epub { message: String },
}
|
||||
@@ -0,0 +1,268 @@
|
||||
use chrono::{DateTime, NaiveDate};
|
||||
use scraper::{Html, Selector};
|
||||
use serde_json::Value;
|
||||
use url::Url;
|
||||
|
||||
use crate::error::{EbookmError, Result};
|
||||
use crate::source::{LoadedSource, SourceOrigin};
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct ExtractedArticle {
|
||||
pub title: String,
|
||||
pub author: Option<String>,
|
||||
pub published: Option<NaiveDate>,
|
||||
pub canonical_url: Option<Url>,
|
||||
pub body_html: String,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, serde::Serialize)]
|
||||
pub struct InspectResult {
|
||||
pub title: Option<String>,
|
||||
pub author: Option<String>,
|
||||
pub published: Option<String>,
|
||||
pub canonical_url: Option<String>,
|
||||
}
|
||||
|
||||
pub fn extract_article(loaded: &LoadedSource) -> Result<ExtractedArticle> {
|
||||
let document = Html::parse_document(&loaded.html);
|
||||
let json_ld = extract_primary_json_ld(&document);
|
||||
let title = select_content(
|
||||
&document,
|
||||
&[
|
||||
r#"meta[property="og:title"]"#,
|
||||
r#"article .post-title"#,
|
||||
".post-title",
|
||||
"h1",
|
||||
"title",
|
||||
],
|
||||
"content",
|
||||
)
|
||||
.or_else(|| {
|
||||
select_text(
|
||||
&document,
|
||||
&[r#"article .post-title"#, ".post-title", "h1", "title"],
|
||||
)
|
||||
})
|
||||
.or_else(|| json_ld_string(&json_ld, "headline"))
|
||||
.ok_or_else(|| EbookmError::Extraction {
|
||||
input: origin_label(&loaded.origin),
|
||||
})?;
|
||||
|
||||
let author = select_content(
|
||||
&document,
|
||||
&[
|
||||
r#"meta[name="author"]"#,
|
||||
r#"meta[property="article:author"]"#,
|
||||
],
|
||||
"content",
|
||||
)
|
||||
.or_else(|| {
|
||||
select_text(
|
||||
&document,
|
||||
&[
|
||||
"[data-testid='author-name']",
|
||||
".byline",
|
||||
".byline-wrapper a",
|
||||
"address",
|
||||
],
|
||||
)
|
||||
})
|
||||
.or_else(|| json_ld_author(&json_ld));
|
||||
|
||||
let published = select_content(
|
||||
&document,
|
||||
&[r#"meta[property="article:published_time"]"#, "time"],
|
||||
"content",
|
||||
)
|
||||
.or_else(|| select_attr(&document, &["time"], "datetime"))
|
||||
.or_else(|| json_ld_string(&json_ld, "datePublished"))
|
||||
.and_then(parse_date);
|
||||
|
||||
    // Prefer a parseable canonical link. If it is missing or malformed (for example a
    // relative href), fall back to the URL the page was fetched from.
    let canonical_url = select_attr(&document, &[r#"link[rel="canonical"]"#], "href")
        .and_then(|raw| Url::parse(&raw).ok())
        .or_else(|| match &loaded.origin {
            SourceOrigin::Remote(url) => Some(url.clone()),
            SourceOrigin::LocalFile(_) => None,
        });
|
||||
|
||||
let body_html = select_html(
|
||||
&document,
|
||||
&[
|
||||
".available-content .body.markup",
|
||||
".available-content .markup",
|
||||
"article .body.markup",
|
||||
".newsletter-post .body.markup",
|
||||
"article",
|
||||
"main",
|
||||
"body",
|
||||
],
|
||||
)
|
||||
.ok_or_else(|| EbookmError::Extraction {
|
||||
input: origin_label(&loaded.origin),
|
||||
})?;
|
||||
|
||||
Ok(ExtractedArticle {
|
||||
title,
|
||||
author,
|
||||
published,
|
||||
canonical_url,
|
||||
body_html,
|
||||
})
|
||||
}
|
||||
|
||||
pub fn inspect_article(loaded: &LoadedSource) -> Result<InspectResult> {
|
||||
let extracted = extract_article(loaded)?;
|
||||
Ok(InspectResult {
|
||||
title: Some(extracted.title),
|
||||
author: extracted.author,
|
||||
published: extracted.published.map(|date| date.to_string()),
|
||||
canonical_url: extracted.canonical_url.map(|url| url.to_string()),
|
||||
})
|
||||
}
|
||||
|
||||
/// Returns the given attribute from the first selector that yields a non-empty value.
fn select_content(document: &Html, selectors: &[&str], attr: &str) -> Option<String> {
    // Meta-tag lookups are plain attribute lookups, so both helpers share one implementation.
    select_attr(document, selectors, attr)
}

/// Returns the whitespace-normalized text from the first selector that yields non-empty text.
fn select_text(document: &Html, selectors: &[&str]) -> Option<String> {
    selectors.iter().find_map(|selector| {
        let selector = Selector::parse(selector).ok()?;
        document
            .select(&selector)
            .next()
            .map(|node| clean_text(&node.text().collect::<String>()))
            // An empty element (e.g. `<h1></h1>`) should not shadow later candidates.
            .filter(|text| !text.is_empty())
    })
}

/// Returns the given attribute of the first element matched by each selector, in order,
/// skipping selectors whose first match lacks the attribute or has an empty value.
fn select_attr(document: &Html, selectors: &[&str], attr: &str) -> Option<String> {
    selectors.iter().find_map(|selector| {
        let selector = Selector::parse(selector).ok()?;
        document
            .select(&selector)
            .next()
            .and_then(|node| node.value().attr(attr))
            .map(clean_text)
            .filter(|value| !value.is_empty())
    })
}
|
||||
|
||||
fn select_html(document: &Html, selectors: &[&str]) -> Option<String> {
|
||||
selectors.iter().find_map(|selector| {
|
||||
let selector = Selector::parse(selector).ok()?;
|
||||
document
|
||||
.select(&selector)
|
||||
.next()
|
||||
.map(|node| node.inner_html())
|
||||
})
|
||||
}
|
||||
|
||||
fn clean_text(value: &str) -> String {
|
||||
value.split_whitespace().collect::<Vec<_>>().join(" ")
|
||||
}
|
||||
|
||||
fn parse_date(value: String) -> Option<NaiveDate> {
|
||||
DateTime::parse_from_rfc3339(&value)
|
||||
.map(|parsed| parsed.date_naive())
|
||||
.ok()
|
||||
.or_else(|| NaiveDate::parse_from_str(&value, "%Y-%m-%d").ok())
|
||||
.or_else(|| NaiveDate::parse_from_str(&value, "%b %d, %Y").ok())
|
||||
}
|
||||
|
||||
fn origin_label(origin: &SourceOrigin) -> String {
|
||||
match origin {
|
||||
SourceOrigin::Remote(url) => url.to_string(),
|
||||
SourceOrigin::LocalFile(path) => path.display().to_string(),
|
||||
}
|
||||
}
|
||||
|
||||
fn extract_primary_json_ld(document: &Html) -> Option<Value> {
|
||||
let selector = Selector::parse(r#"script[type="application/ld+json"]"#).ok()?;
|
||||
for node in document.select(&selector) {
|
||||
let raw = node.inner_html();
|
||||
let Ok(value) = serde_json::from_str::<Value>(&raw) else {
|
||||
continue;
|
||||
};
|
||||
if value.get("@type").and_then(Value::as_str).is_some() {
|
||||
return Some(value);
|
||||
}
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
fn json_ld_string(json_ld: &Option<Value>, key: &str) -> Option<String> {
|
||||
json_ld
|
||||
.as_ref()?
|
||||
.get(key)?
|
||||
.as_str()
|
||||
.map(|value| value.to_string())
|
||||
}
|
||||
|
||||
fn json_ld_author(json_ld: &Option<Value>) -> Option<String> {
|
||||
let author = json_ld.as_ref()?.get("author")?;
|
||||
if let Some(author_name) = author.get(0).and_then(|entry| entry.get("name")).and_then(Value::as_str) {
|
||||
return Some(author_name.to_string());
|
||||
}
|
||||
if let Some(author_name) = author.get("name").and_then(Value::as_str) {
|
||||
return Some(author_name.to_string());
|
||||
}
|
||||
None
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::*;
|
||||
use crate::source::LoadedSource;
|
||||
|
||||
#[test]
|
||||
fn extracts_substack_article_body_without_page_chrome() {
|
||||
let html = r#"<!doctype html>
|
||||
<html>
|
||||
<head>
|
||||
<meta property="og:title" content="A Contested Island" />
|
||||
<meta name="author" content="John Gu" />
|
||||
<link rel="canonical" href="https://ageofpeace.substack.com/p/a-contested-island" />
|
||||
<script type="application/ld+json">{"@context":"https://schema.org","@type":"NewsArticle","headline":"A Contested Island","datePublished":"2026-03-03T23:37:00+00:00","author":[{"@type":"Person","name":"John Gu"}]}</script>
|
||||
</head>
|
||||
<body>
|
||||
<article class="typography newsletter-post post">
|
||||
<div class="post-header">
|
||||
<h1 class="post-title">Chapter 1: A Contested Island</h1>
|
||||
</div>
|
||||
<div class="available-content">
|
||||
<div class="body markup">
|
||||
<p>First paragraph.</p>
|
||||
<p>Second paragraph.</p>
|
||||
</div>
|
||||
</div>
|
||||
<div class="post-footer">
|
||||
<button>Share</button>
|
||||
</div>
|
||||
</article>
|
||||
</body>
|
||||
</html>"#;
|
||||
|
||||
let loaded = LoadedSource {
|
||||
origin: SourceOrigin::Remote(
|
||||
Url::parse("https://ageofpeace.substack.com/p/a-contested-island").expect("url"),
|
||||
),
|
||||
html: html.to_string(),
|
||||
};
|
||||
|
||||
let extracted = extract_article(&loaded).expect("extract article");
|
||||
assert_eq!(extracted.title, "A Contested Island");
|
||||
assert_eq!(extracted.author.as_deref(), Some("John Gu"));
|
||||
assert_eq!(
|
||||
extracted.published,
|
||||
Some(NaiveDate::from_ymd_opt(2026, 3, 3).expect("date"))
|
||||
);
|
||||
assert!(extracted.body_html.contains("First paragraph."));
|
||||
assert!(!extracted.body_html.contains("post-header"));
|
||||
assert!(!extracted.body_html.contains("Share"));
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,167 @@
|
||||
use std::collections::{BTreeMap, BTreeSet};
|
||||
|
||||
use url::Url;
|
||||
|
||||
use crate::manifest::{BuildMode, LinkMatchMode, Manifest};
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct LinkPolicy {
|
||||
pub match_mode: LinkMatchMode,
|
||||
pub targets: BTreeSet<String>,
|
||||
}
|
||||
|
||||
pub fn build_link_policies(
|
||||
manifest: &Manifest,
|
||||
entry_metadata: &BTreeMap<String, EntryLinkMetadata>,
|
||||
) -> BTreeMap<String, LinkPolicy> {
|
||||
entry_metadata
|
||||
.iter()
|
||||
.map(|(entry_id, _metadata)| {
|
||||
let entry = &manifest.entries[entry_id];
|
||||
let mode = entry
|
||||
.links
|
||||
.mode
|
||||
.clone()
|
||||
.unwrap_or(manifest.link_rules.mode.clone());
|
||||
let targets = resolve_targets(manifest, entry_id, &mode);
|
||||
let match_mode = select_match_mode(manifest, entry_id, &mode);
|
||||
(
|
||||
entry_id.clone(),
|
||||
LinkPolicy {
|
||||
match_mode,
|
||||
targets,
|
||||
},
|
||||
)
|
||||
})
|
||||
.collect()
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct EntryLinkMetadata {
|
||||
pub source_url: Option<Url>,
|
||||
pub canonical_url: Option<Url>,
|
||||
}
|
||||
|
||||
fn resolve_targets(manifest: &Manifest, entry_id: &str, mode: &BuildMode) -> BTreeSet<String> {
|
||||
let entry = &manifest.entries[entry_id];
|
||||
let mut targets = BTreeSet::new();
|
||||
|
||||
match mode {
|
||||
BuildMode::None => return targets,
|
||||
BuildMode::Auto => {
|
||||
for candidate in manifest.entries.keys() {
|
||||
if candidate != entry_id {
|
||||
targets.insert(candidate.clone());
|
||||
}
|
||||
}
|
||||
}
|
||||
BuildMode::Explicit => {
|
||||
for rule in &manifest.link_rules.rules {
|
||||
if rule.match_mode == LinkMatchMode::Disabled {
|
||||
continue;
|
||||
}
|
||||
if selector_matches_any(&rule.from, manifest, entry_id) {
|
||||
for target in expand_selectors(&rule.to, manifest) {
|
||||
if target != entry_id {
|
||||
targets.insert(target);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
if !entry.links.allow_to.is_empty() {
|
||||
targets.retain(|candidate| entry.links.allow_to.contains(candidate));
|
||||
}
|
||||
|
||||
for blocked in &entry.links.block_to {
|
||||
targets.remove(blocked);
|
||||
}
|
||||
|
||||
targets
|
||||
}
|
||||
|
||||
fn select_match_mode(manifest: &Manifest, entry_id: &str, mode: &BuildMode) -> LinkMatchMode {
    match mode {
        BuildMode::None => LinkMatchMode::Disabled,
        BuildMode::Auto => LinkMatchMode::CanonicalUrl,
        // Skip disabled rules, mirroring `resolve_targets`. Otherwise a disabled rule
        // listed first would switch off matching for targets supplied by later rules.
        BuildMode::Explicit => manifest
            .link_rules
            .rules
            .iter()
            .find(|rule| {
                rule.match_mode != LinkMatchMode::Disabled
                    && selector_matches_any(&rule.from, manifest, entry_id)
            })
            .map(|rule| rule.match_mode.clone())
            .unwrap_or(LinkMatchMode::CanonicalUrl),
    }
}
|
||||
|
||||
fn selector_matches_any(selectors: &[String], manifest: &Manifest, entry_id: &str) -> bool {
|
||||
selectors
|
||||
.iter()
|
||||
.any(|selector| selector_matches(selector, manifest, entry_id))
|
||||
}
|
||||
|
||||
fn selector_matches(selector: &str, manifest: &Manifest, entry_id: &str) -> bool {
|
||||
if selector == "*" {
|
||||
return true;
|
||||
}
|
||||
if selector == entry_id {
|
||||
return true;
|
||||
}
|
||||
if let Some(section_id) = selector.strip_prefix("section:") {
|
||||
return manifest
|
||||
.sections
|
||||
.iter()
|
||||
.find(|section| section.id == section_id)
|
||||
.is_some_and(|section| section.entries.iter().any(|entry| entry == entry_id));
|
||||
}
|
||||
false
|
||||
}
|
||||
|
||||
fn expand_selectors(selectors: &[String], manifest: &Manifest) -> BTreeSet<String> {
|
||||
let mut expanded = BTreeSet::new();
|
||||
for selector in selectors {
|
||||
if selector == "*" {
|
||||
expanded.extend(manifest.entries.keys().cloned());
|
||||
continue;
|
||||
}
|
||||
if let Some(section_id) = selector.strip_prefix("section:") {
|
||||
if let Some(section) = manifest
|
||||
.sections
|
||||
.iter()
|
||||
.find(|section| section.id == section_id)
|
||||
{
|
||||
expanded.extend(section.entries.iter().cloned());
|
||||
}
|
||||
continue;
|
||||
}
|
||||
if manifest.entries.contains_key(selector) {
|
||||
expanded.insert(selector.clone());
|
||||
}
|
||||
}
|
||||
expanded
|
||||
}
|
||||
|
||||
pub fn matches_target(
|
||||
href: &Url,
|
||||
policy: &LinkPolicy,
|
||||
target_id: &str,
|
||||
metadata: &EntryLinkMetadata,
|
||||
) -> bool {
|
||||
if !policy.targets.contains(target_id) {
|
||||
return false;
|
||||
}
|
||||
|
||||
match policy.match_mode {
|
||||
LinkMatchMode::Disabled => false,
|
||||
LinkMatchMode::CanonicalUrl => metadata
|
||||
.canonical_url
|
||||
.as_ref()
|
||||
.is_some_and(|candidate| candidate.as_str() == href.as_str()),
|
||||
LinkMatchMode::SourceUrl => metadata
|
||||
.source_url
|
||||
.as_ref()
|
||||
.is_some_and(|candidate| candidate.as_str() == href.as_str()),
|
||||
}
|
||||
}
|
||||
@@ -0,0 +1,19 @@
|
||||
mod epub;
mod error;
pub mod extract;
pub mod graph;
pub mod manifest;
pub mod normalize;
mod pipeline;
pub mod source;
mod template;

pub use error::{EbookmError, Result};
pub use extract::InspectResult;
pub use manifest::{
    BuildMode, EntryDefinition, EntryLinkConfig, LinkMatchMode, LinkRule, Manifest,
    ProcessingDefaults, ProcessingOverrides,
};
pub use pipeline::{
    build_epub, inspect_source, load_manifest, render_init_manifest, validate_manifest,
};
|
||||
@@ -0,0 +1,207 @@
|
||||
use chrono::NaiveDate;
|
||||
use indexmap::IndexMap;
|
||||
use serde::{Deserialize, Serialize};
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct Manifest {
|
||||
pub book: BookMetadata,
|
||||
pub output: OutputConfig,
|
||||
#[serde(default)]
|
||||
pub defaults: DefaultsConfig,
|
||||
#[serde(default)]
|
||||
pub sections: Vec<SectionDefinition>,
|
||||
#[serde(default)]
|
||||
pub entries: IndexMap<String, EntryDefinition>,
|
||||
#[serde(default)]
|
||||
pub link_rules: LinkRulesConfig,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct BookMetadata {
|
||||
pub title: String,
|
||||
#[serde(default)]
|
||||
pub author: Option<String>,
|
||||
#[serde(default = "default_language")]
|
||||
pub language: String,
|
||||
#[serde(default = "default_identifier")]
|
||||
pub identifier: String,
|
||||
#[serde(default)]
|
||||
pub description: Option<String>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct OutputConfig {
|
||||
pub path: String,
|
||||
#[serde(default)]
|
||||
pub cover_image: Option<String>,
|
||||
}
|
||||
|
||||
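#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct DefaultsConfig {
    #[serde(default = "default_true")]
    pub fetch_images: bool,
    #[serde(default = "default_true")]
    pub normalize_substack_embeds: bool,
    #[serde(default)]
    pub processing: ProcessingDefaults,
    #[serde(default)]
    pub metadata: MetadataOverrides,
}

// Implemented by hand: a derived `Default` would set both flags to `false`, so a manifest
// with no `defaults` block would behave differently from one with an empty block.
impl Default for DefaultsConfig {
    fn default() -> Self {
        Self {
            fetch_images: true,
            normalize_substack_embeds: true,
            processing: ProcessingDefaults::default(),
            metadata: MetadataOverrides::default(),
        }
    }
}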
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct SectionDefinition {
|
||||
pub id: String,
|
||||
pub title: String,
|
||||
#[serde(default)]
|
||||
pub entries: Vec<String>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct EntryDefinition {
|
||||
pub source: SourceDefinition,
|
||||
#[serde(default)]
|
||||
pub title: Option<String>,
|
||||
#[serde(default)]
|
||||
pub metadata: MetadataOverrides,
|
||||
#[serde(default)]
|
||||
pub processing: ProcessingOverrides,
|
||||
#[serde(default)]
|
||||
pub toc: TocConfig,
|
||||
#[serde(default)]
|
||||
pub links: EntryLinkConfig,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
#[serde(tag = "kind", rename_all = "lowercase")]
|
||||
pub enum SourceDefinition {
|
||||
Substack { url: String },
|
||||
Html { path: String },
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
|
||||
pub struct MetadataOverrides {
|
||||
#[serde(default)]
|
||||
pub author: Option<String>,
|
||||
#[serde(default)]
|
||||
pub published: Option<NaiveDate>,
|
||||
#[serde(default)]
|
||||
pub subtitle: Option<String>,
|
||||
#[serde(default)]
|
||||
pub summary: Option<String>,
|
||||
#[serde(default)]
|
||||
pub tags: Vec<String>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct ProcessingDefaults {
|
||||
#[serde(default = "default_true")]
|
||||
pub include_author: bool,
|
||||
#[serde(default = "default_true")]
|
||||
pub include_date: bool,
|
||||
#[serde(default = "default_true")]
|
||||
pub include_source_url: bool,
|
||||
#[serde(default)]
|
||||
pub skip_first_paragraphs: u32,
|
||||
}
|
||||
|
||||
impl Default for ProcessingDefaults {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
include_author: true,
|
||||
include_date: true,
|
||||
include_source_url: true,
|
||||
skip_first_paragraphs: 0,
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
|
||||
pub struct ProcessingOverrides {
|
||||
#[serde(default)]
|
||||
pub include_author: Option<bool>,
|
||||
#[serde(default)]
|
||||
pub include_date: Option<bool>,
|
||||
#[serde(default)]
|
||||
pub include_source_url: Option<bool>,
|
||||
#[serde(default)]
|
||||
pub skip_first_paragraphs: Option<u32>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
|
||||
pub struct TocConfig {
|
||||
#[serde(default)]
|
||||
pub title: Option<String>,
|
||||
#[serde(default)]
|
||||
pub hidden: bool,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Default, Serialize, Deserialize)]
|
||||
pub struct EntryLinkConfig {
|
||||
#[serde(default)]
|
||||
pub mode: Option<BuildMode>,
|
||||
#[serde(default)]
|
||||
pub allow_to: Vec<String>,
|
||||
#[serde(default)]
|
||||
pub block_to: Vec<String>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Default)]
|
||||
#[serde(rename_all = "lowercase")]
|
||||
pub enum BuildMode {
|
||||
#[default]
|
||||
Auto,
|
||||
Explicit,
|
||||
None,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Default)]
|
||||
#[serde(rename_all = "kebab-case")]
|
||||
pub enum LinkMatchMode {
|
||||
#[default]
|
||||
CanonicalUrl,
|
||||
SourceUrl,
|
||||
Disabled,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct LinkRule {
|
||||
pub from: Vec<String>,
|
||||
pub to: Vec<String>,
|
||||
#[serde(default)]
|
||||
pub match_mode: LinkMatchMode,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Serialize, Deserialize)]
|
||||
pub struct LinkRulesConfig {
|
||||
#[serde(default)]
|
||||
pub mode: BuildMode,
|
||||
#[serde(default = "default_true")]
|
||||
pub rewrite_external_substack_links: bool,
|
||||
#[serde(default = "default_true")]
|
||||
pub preserve_other_external_links: bool,
|
||||
#[serde(default)]
|
||||
pub rules: Vec<LinkRule>,
|
||||
}
|
||||
|
||||
impl Default for LinkRulesConfig {
|
||||
fn default() -> Self {
|
||||
Self {
|
||||
mode: BuildMode::Auto,
|
||||
rewrite_external_substack_links: true,
|
||||
preserve_other_external_links: true,
|
||||
rules: Vec::new(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn default_true() -> bool {
|
||||
true
|
||||
}
|
||||
|
||||
fn default_language() -> String {
|
||||
"en".to_string()
|
||||
}
|
||||
|
||||
fn default_identifier() -> String {
|
||||
format!("urn:uuid:{}", uuid::Uuid::new_v4())
|
||||
}
|
||||
@@ -0,0 +1,357 @@
|
||||
use std::collections::BTreeMap;
|
||||
use std::path::Path;
|
||||
|
||||
use kuchiki::traits::*;
|
||||
use regex::Regex;
|
||||
use sha1::{Digest, Sha1};
|
||||
use url::Url;
|
||||
|
||||
use crate::error::{EbookmError, Result};
|
||||
use crate::graph::{EntryLinkMetadata, LinkPolicy, matches_target};
|
||||
use crate::manifest::{DefaultsConfig, EntryDefinition};
|
||||
use crate::source::{SourceOrigin, resolve_relative_url};
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct Asset {
|
||||
pub id: String,
|
||||
pub href: String,
|
||||
pub media_type: String,
|
||||
pub bytes: Vec<u8>,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct NormalizedDocument {
|
||||
pub title: String,
|
||||
pub author: Option<String>,
|
||||
pub published: Option<chrono::NaiveDate>,
|
||||
pub canonical_url: Option<Url>,
|
||||
pub body_xhtml: String,
|
||||
pub assets: Vec<Asset>,
|
||||
}
|
||||
|
||||
pub fn normalize_document(
|
||||
entry_id: &str,
|
||||
entry: &EntryDefinition,
|
||||
defaults: &DefaultsConfig,
|
||||
origin: &SourceOrigin,
|
||||
extracted: crate::extract::ExtractedArticle,
|
||||
policy: &LinkPolicy,
|
||||
entry_metadata: &BTreeMap<String, EntryLinkMetadata>,
|
||||
) -> Result<NormalizedDocument> {
|
||||
let mut document = kuchiki::parse_html().one(format!("<div>{}</div>", extracted.body_html));
|
||||
|
||||
remove_nodes(&mut document, "script,style,noscript,button,svg,source");
|
||||
if defaults.normalize_substack_embeds {
|
||||
remove_nodes(&mut document, "iframe");
|
||||
}
|
||||
skip_first_paragraphs(
|
||||
&mut document,
|
||||
entry
|
||||
.processing
|
||||
.skip_first_paragraphs
|
||||
.unwrap_or(defaults.processing.skip_first_paragraphs),
|
||||
);
|
||||
scrub_attributes(&mut document);
|
||||
|
||||
let mut assets = Vec::new();
|
||||
if defaults.fetch_images {
|
||||
collect_images(origin, &mut document, &mut assets)?;
|
||||
}
|
||||
|
||||
rewrite_links(entry_id, &mut document, origin, policy, entry_metadata);
|
||||
let body_xhtml = serialize_document(&document)?;
|
||||
|
||||
Ok(NormalizedDocument {
|
||||
title: entry.title.clone().unwrap_or(extracted.title),
|
||||
author: entry
|
||||
.metadata
|
||||
.author
|
||||
.clone()
|
||||
.or(extracted.author)
|
||||
.or(defaults.metadata.author.clone()),
|
||||
published: entry
|
||||
.metadata
|
||||
.published
|
||||
.or(extracted.published)
|
||||
.or(defaults.metadata.published),
|
||||
canonical_url: extracted.canonical_url,
|
||||
body_xhtml,
|
||||
assets,
|
||||
})
|
||||
}
|
||||
|
||||
fn remove_nodes(document: &mut kuchiki::NodeRef, selector: &str) {
|
||||
if let Ok(nodes) = document.select(selector) {
|
||||
let selected: Vec<_> = nodes.collect();
|
||||
for node in selected {
|
||||
node.as_node().detach();
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn collect_images(
    origin: &SourceOrigin,
    document: &mut kuchiki::NodeRef,
    assets: &mut Vec<Asset>,
) -> Result<()> {
    let selected = document
        .select("img")
        .map(|items| items.collect::<Vec<_>>())
        .unwrap_or_default();

    for node in selected {
        let mut attrs = node.attributes.borrow_mut();
        let src = attrs
            .get("src")
            .or_else(|| attrs.get("data-src"))
            .map(|value| value.to_string());
        let Some(src) = src else {
            continue;
        };

        // An image that cannot be fetched keeps its original `src`; the error is ignored.
        if let Ok(asset) = fetch_asset(origin, &src) {
            attrs.insert("src", format!("../{}", asset.href));
            // Bundle each image once even if the article references it several times,
            // so this entry does not contribute duplicate manifest item ids.
            if !assets.iter().any(|existing| existing.id == asset.id) {
                assets.push(asset);
            }
        }
    }

    Ok(())
}
|
||||
|
||||
fn fetch_asset(origin: &SourceOrigin, src: &str) -> Result<Asset> {
|
||||
match origin {
|
||||
SourceOrigin::LocalFile(base_path) => fetch_local_asset(base_path, src),
|
||||
SourceOrigin::Remote(base_url) => {
|
||||
let resolved = base_url.join(src).map_err(|source| EbookmError::UrlParse {
|
||||
value: src.to_string(),
|
||||
source,
|
||||
})?;
|
||||
fetch_remote_asset(&resolved)
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn fetch_local_asset(base_path: &Path, src: &str) -> Result<Asset> {
|
||||
if let Ok(url) = Url::parse(src) {
|
||||
match url.scheme() {
|
||||
"http" | "https" => return fetch_remote_asset(&url),
|
||||
"file" => {
|
||||
let path = url
|
||||
.to_file_path()
|
||||
.map_err(|_| EbookmError::InvalidSourcePath {
|
||||
path: src.to_string(),
|
||||
})?;
|
||||
return build_asset_from_path(&path);
|
||||
}
|
||||
_ => {}
|
||||
}
|
||||
}
|
||||
|
||||
let path = if Path::new(src).is_absolute() {
|
||||
Path::new(src).to_path_buf()
|
||||
} else {
|
||||
base_path
|
||||
.parent()
|
||||
.unwrap_or_else(|| Path::new("."))
|
||||
.join(src)
|
||||
};
|
||||
build_asset_from_path(&path)
|
||||
}
|
||||
|
||||
fn fetch_remote_asset(url: &Url) -> Result<Asset> {
    let bytes = reqwest::blocking::get(url.clone())
        .and_then(|response| response.error_for_status())
        .map_err(|source| EbookmError::Request {
            url: url.to_string(),
            source,
        })?
        .bytes()
        .map_err(|source| EbookmError::Request {
            url: url.to_string(),
            source,
        })?
        .to_vec();

    let extension = infer_extension_from_str(url.path());
    let media_type = infer_media_type(&extension);
    // Hash the URL to get a stable file name. The manifest id gets an `asset-` prefix
    // because XML ids must not start with a digit, and a hex digest often does.
    let hash = format!("{:x}", Sha1::digest(url.as_str().as_bytes()));
    Ok(Asset {
        id: format!("asset-{hash}"),
        href: format!("assets/{hash}.{extension}"),
        media_type,
        bytes,
    })
}
|
||||
|
||||
fn build_asset_from_path(path: &Path) -> Result<Asset> {
    let bytes = std::fs::read(path).map_err(|source| EbookmError::Io {
        path: path.display().to_string(),
        source,
    })?;
    let extension = infer_extension_from_path(path);
    let media_type = infer_media_type(&extension);
    // Hash the path to get a stable file name. The manifest id gets an `asset-` prefix
    // because XML ids must not start with a digit, and a hex digest often does.
    let hash = format!("{:x}", Sha1::digest(path.display().to_string().as_bytes()));
    Ok(Asset {
        id: format!("asset-{hash}"),
        href: format!("assets/{hash}.{extension}"),
        media_type,
        bytes,
    })
}
|
||||
|
||||
fn rewrite_links(
|
||||
entry_id: &str,
|
||||
document: &mut kuchiki::NodeRef,
|
||||
origin: &SourceOrigin,
|
||||
policy: &LinkPolicy,
|
||||
entry_metadata: &BTreeMap<String, EntryLinkMetadata>,
|
||||
) {
|
||||
let selected = document
|
||||
.select("a[href]")
|
||||
.map(|items| items.collect::<Vec<_>>())
|
||||
.unwrap_or_default();
|
||||
|
||||
for node in selected {
|
||||
let mut attrs = node.attributes.borrow_mut();
|
||||
let href = attrs.get("href").map(|value| value.to_string());
|
||||
let Some(href) = href else {
|
||||
continue;
|
||||
};
|
||||
|
||||
let Some(resolved) = resolve_relative_url(origin, &href) else {
|
||||
continue;
|
||||
};
|
||||
|
||||
if let Some((target_id, _)) = entry_metadata.iter().find(|(target_id, metadata)| {
|
||||
*target_id != entry_id && matches_target(&resolved, policy, target_id, metadata)
|
||||
}) {
|
||||
attrs.insert("href", format!("../text/{}.xhtml", target_id));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn serialize_document(document: &kuchiki::NodeRef) -> Result<String> {
|
||||
let wrapper = document
|
||||
.select_first("div")
|
||||
.map_err(|_| EbookmError::Epub {
|
||||
message: "failed to serialize normalized document".to_string(),
|
||||
})?;
|
||||
|
||||
let mut bytes = Vec::new();
|
||||
for child in wrapper.as_node().children() {
|
||||
child
|
||||
.serialize(&mut bytes)
|
||||
.map_err(|error| EbookmError::Epub {
|
||||
message: error.to_string(),
|
||||
})?;
|
||||
}
|
||||
|
||||
let html = String::from_utf8(bytes).map_err(|error| EbookmError::Epub {
|
||||
message: error.to_string(),
|
||||
})?;
|
||||
Ok(to_xhtml_fragment(&html))
|
||||
}
|
||||
|
||||
fn scrub_attributes(document: &mut kuchiki::NodeRef) {
|
||||
if let Ok(nodes) = document.select("*") {
|
||||
let selected: Vec<_> = nodes.collect();
|
||||
for node in selected {
|
||||
let mut attrs = node.attributes.borrow_mut();
|
||||
let names: Vec<_> = attrs.map.keys().cloned().collect();
|
||||
for name in names {
|
||||
let local = name.local.to_string();
|
||||
let keep = match node.name.local.as_ref() {
|
||||
"a" => matches!(local.as_str(), "href" | "title"),
|
||||
"img" => matches!(local.as_str(), "src" | "alt"),
|
||||
_ => false,
|
||||
};
|
||||
if !keep {
|
||||
attrs.map.remove(&name);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
fn skip_first_paragraphs(document: &mut kuchiki::NodeRef, count: u32) {
|
||||
if count == 0 {
|
||||
return;
|
||||
}
|
||||
let selected = document
|
||||
.select("p")
|
||||
.map(|items| items.take(count as usize).collect::<Vec<_>>())
|
||||
.unwrap_or_default();
|
||||
for node in selected {
|
||||
node.as_node().detach();
|
||||
}
|
||||
}
|
||||
|
||||
fn infer_extension_from_path(path: &Path) -> String {
|
||||
path.extension()
|
||||
.and_then(|value| value.to_str())
|
||||
.filter(|value| !value.is_empty())
|
||||
.unwrap_or("bin")
|
||||
.to_string()
|
||||
}
|
||||
|
||||
fn infer_extension_from_str(path: &str) -> String {
|
||||
Path::new(path)
|
||||
.extension()
|
||||
.and_then(|value| value.to_str())
|
||||
.filter(|value| !value.is_empty())
|
||||
.unwrap_or("bin")
|
||||
.to_string()
|
||||
}
|
||||
|
||||
fn infer_media_type(extension: &str) -> String {
    // Compare case-insensitively so that files such as `photo.JPG` are still recognised.
    match extension.to_ascii_lowercase().as_str() {
        "jpg" | "jpeg" => "image/jpeg",
        "png" => "image/png",
        "gif" => "image/gif",
        "svg" => "image/svg+xml",
        "webp" => "image/webp",
        _ => "application/octet-stream",
    }
    .to_string()
}
|
||||
|
||||
fn to_xhtml_fragment(html: &str) -> String {
    // The HTML serializer leaves void elements unclosed, which XML parsers reject.
    // `\b` keeps `<col` from matching `<colgroup>`.
    let void_re =
        Regex::new(r"<(img|hr|br|wbr|col)\b([^>]*)>").expect("valid void-element regex");
    let html = void_re.replace_all(html, "<${1}${2} />");
    // `&nbsp;` is an HTML-only named entity, which the serializer can emit for
    // non-breaking spaces; XML needs the numeric form.
    html.replace("&nbsp;", "&#160;")
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use super::to_xhtml_fragment;
|
||||
use quick_xml::events::Event;
|
||||
use quick_xml::Reader;
|
||||
|
||||
#[test]
|
||||
fn converts_void_html_tags_to_xhtml_self_closing_tags() {
|
||||
let input = r#"<p>Intro</p><picture><img alt="" src="a.jpg"></picture><hr><br>"#;
|
||||
let xhtml = to_xhtml_fragment(input);
|
||||
assert!(xhtml.contains(r#"<img alt="" src="a.jpg" />"#));
|
||||
assert!(xhtml.contains("<hr />"));
|
||||
assert!(xhtml.contains("<br />"));
|
||||
|
||||
let wrapped = format!(
|
||||
r#"<?xml version="1.0" encoding="UTF-8"?><root>{}</root>"#,
|
||||
xhtml
|
||||
);
|
||||
let mut reader = Reader::from_str(&wrapped);
|
||||
loop {
|
||||
match reader.read_event() {
|
||||
Ok(Event::Eof) => break,
|
||||
Ok(_) => {}
|
||||
Err(error) => panic!("invalid XML generated: {error}"),
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
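`infer_media_type` in the file above matches the extension string exactly, so an upper-case extension such as `JPG` would fall through to `application/octet-stream` unless the caller lowercases it first. A minimal sketch of a case-insensitive lookup, assuming that behaviour is wanted; `media_type_for_path` is a hypothetical name and is not part of this commit:

```rust
use std::path::Path;

/// Hypothetical helper (not in the commit): map a file path to a media type,
/// comparing the extension case-insensitively.
fn media_type_for_path(path: &Path) -> &'static str {
    // Lowercase the extension once so "JPG" and "jpg" are treated alike.
    let extension = path
        .extension()
        .and_then(|value| value.to_str())
        .map(|value| value.to_ascii_lowercase());

    match extension.as_deref() {
        Some("jpg") | Some("jpeg") => "image/jpeg",
        Some("png") => "image/png",
        Some("gif") => "image/gif",
        Some("svg") => "image/svg+xml",
        Some("webp") => "image/webp",
        // Unknown or missing extension: same fallback as the original function.
        _ => "application/octet-stream",
    }
}
```

Returning `&'static str` avoids an allocation per asset; a caller that needs an owned value can still call `.to_string()` on the result.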
@@ -0,0 +1,432 @@
|
||||
use std::collections::BTreeMap;
|
||||
use std::fs;
|
||||
use std::path::Path;
|
||||
|
||||
use crate::epub::write_epub;
|
||||
use crate::error::{EbookmError, Result};
|
||||
use crate::extract::{InspectResult, inspect_article};
|
||||
use crate::graph::{EntryLinkMetadata, build_link_policies};
|
||||
use crate::manifest::{Manifest, SourceDefinition};
|
||||
use crate::normalize::{Asset, NormalizedDocument};
|
||||
use crate::source::{SourceSpec, load_source};
|
||||
use crate::template::INIT_MANIFEST;
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct BuiltChapter {
|
||||
pub nav_title: String,
|
||||
pub xhtml: String,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone, Copy)]
|
||||
struct ChapterHeaderOptions {
|
||||
include_author: bool,
|
||||
include_date: bool,
|
||||
include_source_url: bool,
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct BuiltEntry {
|
||||
pub id: String,
|
||||
pub hidden_from_toc: bool,
|
||||
pub chapter: BuiltChapter,
|
||||
pub assets: Vec<Asset>,
|
||||
}
|
||||
|
||||
pub fn load_manifest(path: &Path) -> Result<Manifest> {
|
||||
let contents = fs::read_to_string(path).map_err(|source| EbookmError::Io {
|
||||
path: path.display().to_string(),
|
||||
source,
|
||||
})?;
|
||||
serde_yaml::from_str(&contents).map_err(|source| EbookmError::ManifestParse {
|
||||
path: path.display().to_string(),
|
||||
source,
|
||||
})
|
||||
}
|
||||
|
||||
pub fn validate_manifest(manifest: &Manifest) -> Result<Vec<String>> {
|
||||
let mut issues = Vec::new();
|
||||
let mut warnings = Vec::new();
|
||||
|
||||
if manifest.book.title.trim().is_empty() {
|
||||
issues.push("book.title must not be empty".to_string());
|
||||
}
|
||||
if manifest.output.path.trim().is_empty() {
|
||||
issues.push("output.path must not be empty".to_string());
|
||||
}
|
||||
if manifest.sections.is_empty() {
|
||||
issues.push("at least one section is required".to_string());
|
||||
}
|
||||
if manifest.entries.is_empty() {
|
||||
issues.push("at least one entry is required".to_string());
|
||||
}
|
||||
|
||||
for section in &manifest.sections {
|
||||
if section.entries.is_empty() {
|
||||
warnings.push(format!("section {} has no entries", section.id));
|
||||
}
|
||||
for entry_id in &section.entries {
|
||||
if !manifest.entries.contains_key(entry_id) {
|
||||
issues.push(format!(
|
||||
"section {} references unknown entry {}",
|
||||
section.id, entry_id
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
for (entry_id, entry) in &manifest.entries {
|
||||
for target in &entry.links.allow_to {
|
||||
if !manifest.entries.contains_key(target) {
|
||||
issues.push(format!(
|
||||
"entry {entry_id} allow_to target {target} does not exist"
|
||||
));
|
||||
}
|
||||
}
|
||||
for target in &entry.links.block_to {
|
||||
if !manifest.entries.contains_key(target) {
|
||||
issues.push(format!(
|
||||
"entry {entry_id} block_to target {target} does not exist"
|
||||
));
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
for rule in &manifest.link_rules.rules {
|
||||
validate_selectors(manifest, &rule.from, "from", &mut issues);
|
||||
validate_selectors(manifest, &rule.to, "to", &mut issues);
|
||||
}
|
||||
|
||||
for entry_id in manifest.entries.keys() {
|
||||
if !manifest.sections.iter().any(|section| {
|
||||
section
|
||||
.entries
|
||||
.iter()
|
||||
.any(|candidate| candidate == entry_id)
|
||||
}) {
|
||||
warnings.push(format!("entry {entry_id} is not referenced by any section"));
|
||||
}
|
||||
}
|
||||
|
||||
if issues.is_empty() {
|
||||
Ok(warnings)
|
||||
} else {
|
||||
Err(EbookmError::Validation { issues })
|
||||
}
|
||||
}
|
||||
|
||||
pub fn inspect_source(source: &str) -> Result<InspectResult> {
|
||||
let spec = if source.starts_with("http://") || source.starts_with("https://") {
|
||||
SourceSpec::from_definition(
|
||||
&SourceDefinition::Substack {
|
||||
url: source.to_string(),
|
||||
},
|
||||
Path::new("."),
|
||||
)?
|
||||
} else {
|
||||
SourceSpec::from_definition(
|
||||
&SourceDefinition::Html {
|
||||
path: source.to_string(),
|
||||
},
|
||||
Path::new("."),
|
||||
)?
|
||||
};
|
||||
let loaded = load_source(&spec)?;
|
||||
inspect_article(&loaded)
|
||||
}
|
||||
|
||||
pub fn build_epub(manifest: &Manifest, manifest_path: &Path) -> Result<()> {
|
||||
let manifest_dir = manifest_path.parent().unwrap_or_else(|| Path::new("."));
|
||||
|
||||
let mut entry_specs = BTreeMap::new();
|
||||
let mut loaded_sources = BTreeMap::new();
|
||||
let mut extracted = BTreeMap::new();
|
||||
let mut metadata = BTreeMap::new();
|
||||
|
||||
for (entry_id, entry) in &manifest.entries {
|
||||
let spec = SourceSpec::from_definition(&entry.source, manifest_dir)?;
|
||||
let loaded = load_source(&spec)?;
|
||||
let article = crate::extract::extract_article(&loaded)?;
|
||||
let source_url = match &spec {
|
||||
SourceSpec::SubstackUrl(url) => Some(url.clone()),
|
||||
SourceSpec::LocalHtml(_) => None,
|
||||
};
|
||||
|
||||
metadata.insert(
|
||||
entry_id.clone(),
|
||||
EntryLinkMetadata {
|
||||
source_url,
|
||||
canonical_url: article.canonical_url.clone(),
|
||||
},
|
||||
);
|
||||
entry_specs.insert(entry_id.clone(), spec);
|
||||
loaded_sources.insert(entry_id.clone(), loaded);
|
||||
extracted.insert(entry_id.clone(), article);
|
||||
}
|
||||
|
||||
let policies = build_link_policies(manifest, &metadata);
|
||||
let mut built_entries = Vec::new();
|
||||
|
||||
for section in &manifest.sections {
|
||||
for entry_id in &section.entries {
|
||||
let entry = &manifest.entries[entry_id];
|
||||
let loaded = loaded_sources.get(entry_id).expect("entry was loaded");
|
||||
let article = extracted
|
||||
.get(entry_id)
|
||||
.expect("entry was extracted")
|
||||
.clone();
|
||||
let policy = policies.get(entry_id).expect("policy was built");
|
||||
|
||||
let normalized = crate::normalize::normalize_document(
|
||||
entry_id,
|
||||
entry,
|
||||
&manifest.defaults,
|
||||
&loaded.origin,
|
||||
article,
|
||||
policy,
|
||||
&metadata,
|
||||
)?;
|
||||
|
||||
built_entries.push(BuiltEntry {
|
||||
id: entry_id.clone(),
|
||||
hidden_from_toc: entry.toc.hidden,
|
||||
chapter: build_chapter(entry_id, entry, &manifest.defaults, &normalized),
|
||||
assets: normalized.assets,
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
let cover = manifest
|
||||
.output
|
||||
.cover_image
|
||||
.as_ref()
|
||||
.map(|path| load_cover(path, manifest_dir))
|
||||
.transpose()?;
|
||||
let output_path = manifest_dir.join(&manifest.output.path);
|
||||
write_epub(manifest, &built_entries, &output_path, cover)?;
|
||||
Ok(())
|
||||
}
|
||||
|
||||
pub fn render_init_manifest() -> &'static str {
|
||||
INIT_MANIFEST
|
||||
}
|
||||
|
||||
fn build_chapter(
|
||||
entry_id: &str,
|
||||
entry: &crate::manifest::EntryDefinition,
|
||||
defaults: &crate::manifest::DefaultsConfig,
|
||||
doc: &NormalizedDocument,
|
||||
) -> BuiltChapter {
|
||||
let nav_title = entry.toc.title.clone().unwrap_or_else(|| doc.title.clone());
|
||||
let header = resolve_header_options(entry, defaults);
|
||||
let author = doc.author.clone().unwrap_or_default();
|
||||
let published = doc
|
||||
.published
|
||||
.map(|date| date.to_string())
|
||||
.unwrap_or_default();
|
||||
let mut meta_lines = Vec::new();
|
||||
if header.include_author && !author.is_empty() {
|
||||
meta_lines.push(format!("<p>{}</p>", escape_html(&author)));
|
||||
}
|
||||
if header.include_date && !published.is_empty() {
|
||||
meta_lines.push(format!("<p>{}</p>", escape_html(&published)));
|
||||
}
|
||||
if header.include_source_url {
|
||||
if let Some(url) = doc.canonical_url.as_ref() {
|
||||
let escaped = escape_html(url.as_str());
|
||||
meta_lines.push(format!(r#"<p><a href="{0}">{0}</a></p>"#, escaped));
|
||||
}
|
||||
}
|
||||
|
||||
let meta_block = if meta_lines.is_empty() {
|
||||
String::new()
|
||||
} else {
|
||||
format!(r#"<div class="chapter-meta">{}</div>"#, meta_lines.join(""))
|
||||
};
|
||||
|
||||
let xhtml = format!(
|
||||
r#"<?xml version="1.0" encoding="UTF-8"?>
|
||||
<html xmlns="http://www.w3.org/1999/xhtml">
|
||||
<head>
|
||||
<title>{}</title>
|
||||
<link rel="stylesheet" type="text/css" href="../styles/book.css"/>
|
||||
</head>
|
||||
<body id="{}">
|
||||
<h1>{}</h1>
|
||||
{}
|
||||
{}
|
||||
</body>
|
||||
</html>"#,
|
||||
escape_html(&doc.title),
|
||||
escape_html(entry_id),
|
||||
escape_html(&doc.title),
|
||||
meta_block,
|
||||
doc.body_xhtml
|
||||
);
|
||||
|
||||
BuiltChapter { nav_title, xhtml }
|
||||
}
|
||||
|
||||
fn validate_selectors(
|
||||
manifest: &Manifest,
|
||||
selectors: &[String],
|
||||
field: &str,
|
||||
issues: &mut Vec<String>,
|
||||
) {
|
||||
for selector in selectors {
|
||||
if selector == "*" {
|
||||
continue;
|
||||
}
|
||||
if manifest.entries.contains_key(selector) {
|
||||
continue;
|
||||
}
|
||||
if let Some(section_id) = selector.strip_prefix("section:") {
|
||||
if manifest
|
||||
.sections
|
||||
.iter()
|
||||
.any(|section| section.id == section_id)
|
||||
{
|
||||
continue;
|
||||
}
|
||||
}
|
||||
issues.push(format!("unknown {field} selector {selector}"));
|
||||
}
|
||||
}
|
||||
|
||||
fn load_cover(path: &str, manifest_dir: &Path) -> Result<(String, Vec<u8>)> {
|
||||
let full_path = manifest_dir.join(path);
|
||||
let bytes = fs::read(&full_path).map_err(|source| EbookmError::Io {
|
||||
path: full_path.display().to_string(),
|
||||
source,
|
||||
})?;
|
||||
let extension = full_path
|
||||
.extension()
|
||||
.and_then(|value| value.to_str())
|
||||
.unwrap_or("jpg");
|
||||
Ok((format!("assets/cover.{extension}"), bytes))
|
||||
}
|
||||
|
||||
fn escape_html(value: &str) -> String {
|
||||
value
|
||||
.replace('&', "&amp;")
.replace('<', "&lt;")
.replace('>', "&gt;")
.replace('"', "&quot;")
|
||||
}
|
||||
|
||||
fn resolve_header_options(
|
||||
entry: &crate::manifest::EntryDefinition,
|
||||
defaults: &crate::manifest::DefaultsConfig,
|
||||
) -> ChapterHeaderOptions {
|
||||
ChapterHeaderOptions {
|
||||
include_author: entry
|
||||
.processing
|
||||
.include_author
|
||||
.unwrap_or(defaults.processing.include_author),
|
||||
include_date: entry
|
||||
.processing
|
||||
.include_date
|
||||
.unwrap_or(defaults.processing.include_date),
|
||||
include_source_url: entry
|
||||
.processing
|
||||
.include_source_url
|
||||
.unwrap_or(defaults.processing.include_source_url),
|
||||
}
|
||||
}
|
||||
|
||||
#[cfg(test)]
|
||||
mod tests {
|
||||
use std::fs;
|
||||
|
||||
use tempfile::tempdir;
|
||||
use zip::ZipArchive;
|
||||
|
||||
use super::*;
|
||||
|
||||
#[test]
|
||||
fn validates_and_builds_local_html_manifest() {
|
||||
let temp = tempdir().expect("tempdir");
|
||||
let root = temp.path();
|
||||
|
||||
fs::write(
|
||||
root.join("article.html"),
|
||||
r#"<!doctype html>
|
||||
<html>
|
||||
<head>
|
||||
<title>Local Essay</title>
|
||||
<meta name="author" content="Local Author" />
|
||||
<meta property="article:published_time" content="2025-01-10T00:00:00Z" />
|
||||
</head>
|
||||
<body>
|
||||
<article>
|
||||
<p>Hello world.</p>
|
||||
<img src="author.jpg" alt="Author" />
|
||||
</article>
|
||||
</body>
|
||||
</html>"#,
|
||||
)
|
||||
.expect("write html");
|
||||
fs::write(root.join("author.jpg"), b"fake-jpeg-data").expect("write image");
|
||||
|
||||
let manifest_path = root.join("book.yaml");
|
||||
fs::write(
|
||||
&manifest_path,
|
||||
r#"book:
|
||||
title: "Local Book"
|
||||
author: "Editor"
|
||||
language: "en"
|
||||
identifier: "urn:uuid:test-book"
|
||||
output:
|
||||
path: "dist/test.epub"
|
||||
defaults:
|
||||
fetch_images: true
|
||||
normalize_substack_embeds: true
|
||||
processing:
|
||||
include_author: true
|
||||
include_date: false
|
||||
include_source_url: false
|
||||
skip_first_paragraphs: 0
|
||||
sections:
|
||||
- id: "part-1"
|
||||
title: "Part 1"
|
||||
entries:
|
||||
- "essay"
|
||||
entries:
|
||||
essay:
|
||||
source:
|
||||
kind: "html"
|
||||
path: "article.html"
|
||||
link_rules:
|
||||
mode: "auto"
|
||||
"#,
|
||||
)
|
||||
.expect("write manifest");
|
||||
|
||||
let manifest = load_manifest(&manifest_path).expect("manifest");
|
||||
validate_manifest(&manifest).expect("manifest valid");
|
||||
build_epub(&manifest, &manifest_path).expect("build epub");
|
||||
|
||||
let epub_path = root.join("dist/test.epub");
|
||||
assert!(epub_path.exists());
|
||||
|
||||
let file = fs::File::open(&epub_path).expect("epub file");
|
||||
let mut archive = ZipArchive::new(file).expect("zip");
|
||||
assert!(archive.by_name("mimetype").is_ok());
|
||||
assert!(archive.by_name("OEBPS/content.opf").is_ok());
|
||||
let mut chapter = archive
|
||||
.by_name("OEBPS/text/essay.xhtml")
|
||||
.expect("chapter file");
|
||||
let mut chapter_contents = String::new();
|
||||
use std::io::Read;
|
||||
chapter
|
||||
.read_to_string(&mut chapter_contents)
|
||||
.expect("read chapter");
|
||||
assert!(chapter_contents.contains("<p>Local Author</p>"));
|
||||
assert!(!chapter_contents.contains("<p>2025-01-10</p>"));
|
||||
assert!(!chapter_contents.contains("urn:uuid:test-book"));
|
||||
assert!(chapter_contents.contains("../assets/"));
|
||||
drop(chapter);
|
||||
assert!(archive
|
||||
.file_names()
|
||||
.any(|name| name.starts_with("OEBPS/assets/") && name.ends_with(".jpg")));
|
||||
}
|
||||
}
|
||||
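`build_chapter` passes the title, entry id and header metadata through `escape_html` before interpolating them into the XHTML template. A single-pass sketch of that kind of escaping is shown here; `escape_xml_text` is a hypothetical name, and the set of escaped characters is assumed to be the four that matter in XML text and double-quoted attributes:

```rust
/// Hypothetical helper (not in the commit): escape text for use in XHTML
/// element content or a double-quoted attribute value.
fn escape_xml_text(value: &str) -> String {
    let mut out = String::with_capacity(value.len());
    for ch in value.chars() {
        match ch {
            // Each character is examined exactly once, so an ampersand
            // produced by an earlier replacement can never be escaped twice.
            '&' => out.push_str("&amp;"),
            '<' => out.push_str("&lt;"),
            '>' => out.push_str("&gt;"),
            '"' => out.push_str("&quot;"),
            other => out.push(other),
        }
    }
    out
}
```

With chained `replace` calls the ampersand has to be handled first for the same reason: replacing `<` before `&` would turn `&lt;` into `&amp;lt;`.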
@@ -0,0 +1,97 @@
|
||||
use std::fs;
|
||||
use std::path::{Path, PathBuf};
|
||||
|
||||
use url::Url;
|
||||
|
||||
use crate::error::{EbookmError, Result};
|
||||
use crate::manifest::SourceDefinition;
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub enum SourceSpec {
|
||||
SubstackUrl(Url),
|
||||
LocalHtml(PathBuf),
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub enum SourceOrigin {
|
||||
Remote(Url),
|
||||
LocalFile(PathBuf),
|
||||
}
|
||||
|
||||
#[derive(Debug, Clone)]
|
||||
pub struct LoadedSource {
|
||||
pub origin: SourceOrigin,
|
||||
pub html: String,
|
||||
}
|
||||
|
||||
impl SourceSpec {
|
||||
pub fn from_definition(definition: &SourceDefinition, manifest_dir: &Path) -> Result<Self> {
|
||||
match definition {
|
||||
SourceDefinition::Substack { url } => Ok(SourceSpec::SubstackUrl(
|
||||
Url::parse(url).map_err(|source| EbookmError::UrlParse {
|
||||
value: url.clone(),
|
||||
source,
|
||||
})?,
|
||||
)),
|
||||
SourceDefinition::Html { path } => {
|
||||
let joined = manifest_dir.join(path);
|
||||
Ok(SourceSpec::LocalHtml(joined))
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
pub fn load_source(spec: &SourceSpec) -> Result<LoadedSource> {
|
||||
match spec {
|
||||
SourceSpec::SubstackUrl(url) => {
|
||||
let client = reqwest::blocking::Client::builder()
|
||||
.user_agent("ebookm/0.1")
|
||||
.build()
|
||||
.map_err(|source| EbookmError::Request {
|
||||
url: url.to_string(),
|
||||
source,
|
||||
})?;
|
||||
let html = client
|
||||
.get(url.clone())
|
||||
.send()
|
||||
.and_then(|response| response.error_for_status())
|
||||
.map_err(|source| EbookmError::Request {
|
||||
url: url.to_string(),
|
||||
source,
|
||||
})?
|
||||
.text()
|
||||
.map_err(|source| EbookmError::Request {
|
||||
url: url.to_string(),
|
||||
source,
|
||||
})?;
|
||||
Ok(LoadedSource {
|
||||
origin: SourceOrigin::Remote(url.clone()),
|
||||
html,
|
||||
})
|
||||
}
|
||||
SourceSpec::LocalHtml(path) => {
|
||||
let html = fs::read_to_string(path).map_err(|source| EbookmError::Io {
|
||||
path: path.display().to_string(),
|
||||
source,
|
||||
})?;
|
||||
Ok(LoadedSource {
|
||||
origin: SourceOrigin::LocalFile(path.clone()),
|
||||
html,
|
||||
})
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
pub fn resolve_relative_url(origin: &SourceOrigin, href: &str) -> Option<Url> {
|
||||
match origin {
|
||||
SourceOrigin::Remote(base) => base.join(href).ok(),
|
||||
SourceOrigin::LocalFile(path) => {
|
||||
if let Ok(url) = Url::parse(href) {
|
||||
return Some(url);
|
||||
}
|
||||
let parent = path.parent()?;
|
||||
let joined = parent.join(href);
|
||||
Url::from_file_path(joined).ok()
|
||||
}
|
||||
}
|
||||
}
|
||||
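For a local file, `resolve_relative_url` first tries the reference as an absolute URL and otherwise joins it onto the directory of the source document. The std-only sketch below mirrors that decision without the `url` crate. It assumes only `http://` and `https://` count as absolute (the original accepts any scheme that `Url::parse` accepts), and `ResolvedRef` and `resolve_local_reference` are hypothetical names:

```rust
use std::path::{Path, PathBuf};

/// Hypothetical result type (not in the commit).
#[derive(Debug, PartialEq)]
enum ResolvedRef {
    /// The reference was already an absolute web URL.
    Absolute(String),
    /// The reference was resolved against the document's directory.
    Local(PathBuf),
}

/// Hypothetical helper: resolve `href` as found inside the local file `document`.
fn resolve_local_reference(document: &Path, href: &str) -> Option<ResolvedRef> {
    // Simplification: treat only http(s) references as absolute.
    if href.starts_with("http://") || href.starts_with("https://") {
        return Some(ResolvedRef::Absolute(href.to_string()));
    }
    // A path with no parent (for example the filesystem root) cannot be resolved.
    let parent = document.parent()?;
    Some(ResolvedRef::Local(parent.join(href)))
}
```

One point that may be worth checking in the original: `Url::from_file_path` returns an error for non-absolute paths, so a manifest directory given as a relative path would make the local branch return `None`.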
@@ -0,0 +1,56 @@
|
||||
pub const INIT_MANIFEST: &str = r#"book:
|
||||
title: "Collected Substack Essays"
|
||||
author: "Author Name"
|
||||
language: "en"
|
||||
identifier: "urn:uuid:11111111-2222-3333-4444-555555555555"
|
||||
description: "A compiled EPUB built by ebookm"
|
||||
|
||||
output:
|
||||
path: "dist/collection.epub"
|
||||
|
||||
defaults:
|
||||
fetch_images: true
|
||||
normalize_substack_embeds: true
|
||||
processing:
|
||||
include_author: true
|
||||
include_date: true
|
||||
include_source_url: true
|
||||
skip_first_paragraphs: 0
|
||||
metadata:
|
||||
author: "Author Name"
|
||||
|
||||
sections:
|
||||
- id: "essays"
|
||||
title: "Essays"
|
||||
entries:
|
||||
- "opening-post"
|
||||
- "saved-html"
|
||||
|
||||
entries:
|
||||
opening-post:
|
||||
source:
|
||||
kind: "substack"
|
||||
url: "https://example.substack.com/p/opening-post"
|
||||
processing:
|
||||
skip_first_paragraphs: 1
|
||||
toc:
|
||||
title: "Opening Post"
|
||||
|
||||
saved-html:
|
||||
source:
|
||||
kind: "html"
|
||||
path: "articles/saved-post.html"
|
||||
title: "Saved Local Article"
|
||||
links:
|
||||
mode: "explicit"
|
||||
allow_to: ["opening-post"]
|
||||
|
||||
link_rules:
|
||||
mode: "auto"
|
||||
rewrite_external_substack_links: true
|
||||
preserve_other_external_links: true
|
||||
rules:
|
||||
- from: ["section:essays"]
|
||||
to: ["section:essays"]
|
||||
match_mode: "canonical-url"
|
||||
"#;
|
||||
@@ -0,0 +1,15 @@
|
||||
<!doctype html>
|
||||
<html>
|
||||
<head>
|
||||
<title>Introduction</title>
|
||||
<meta name="author" content="ebookm" />
|
||||
<meta property="article:published_time" content="2025-01-10T00:00:00Z" />
|
||||
<link rel="canonical" href="https://example.com/intro" />
|
||||
</head>
|
||||
<body>
|
||||
<article>
|
||||
<p>This is the first article in the bundled example.</p>
|
||||
<p>It demonstrates a local HTML source entry.</p>
|
||||
</article>
|
||||
</body>
|
||||
</html>
|
||||
@@ -0,0 +1,14 @@
|
||||
<!doctype html>
|
||||
<html>
|
||||
<head>
|
||||
<title>Working Notes</title>
|
||||
<meta name="author" content="ebookm" />
|
||||
<meta property="article:published_time" content="2025-01-11T00:00:00Z" />
|
||||
</head>
|
||||
<body>
|
||||
<article>
|
||||
<p>This second article links to the first one.</p>
|
||||
<p><a href="https://example.com/intro">Go to the introduction</a></p>
|
||||
</article>
|
||||
</body>
|
||||
</html>
|
||||
Vendored
BIN
Binary file not shown.
@@ -0,0 +1,48 @@
|
||||
book:
|
||||
title: "ebookm Example Book"
|
||||
author: "ebookm"
|
||||
language: "en"
|
||||
identifier: "urn:uuid:ebookm-example-book"
|
||||
description: "Example manifest shipped with the repository"
|
||||
|
||||
output:
|
||||
path: "dist/example-book.epub"
|
||||
|
||||
defaults:
|
||||
fetch_images: false
|
||||
normalize_substack_embeds: true
|
||||
metadata:
|
||||
author: "ebookm"
|
||||
|
||||
sections:
|
||||
- id: "part-1"
|
||||
title: "Examples"
|
||||
entries:
|
||||
- "intro"
|
||||
- "notes"
|
||||
|
||||
entries:
|
||||
intro:
|
||||
source:
|
||||
kind: "html"
|
||||
path: "articles/intro.html"
|
||||
toc:
|
||||
title: "Introduction"
|
||||
|
||||
notes:
|
||||
source:
|
||||
kind: "html"
|
||||
path: "articles/notes.html"
|
||||
title: "Working Notes"
|
||||
links:
|
||||
mode: "explicit"
|
||||
allow_to: ["intro"]
|
||||
|
||||
link_rules:
|
||||
mode: "explicit"
|
||||
rewrite_external_substack_links: true
|
||||
preserve_other_external_links: true
|
||||
rules:
|
||||
- from: ["notes"]
|
||||
to: ["intro"]
|
||||
match_mode: "canonical-url"
|
||||