PDF Generator Gap Analysis

Last updated: 2026-05-27

This document answers one narrow question:

What is still missing if the goal is “LaTeX-like PDF quality”?

It is written against the current code, not against an ideal future system.

Related roadmap:

LaTeX roadmap

Two Different PDF Systems

The repo currently has two different PDF-related paths.

1. TeX-emission path

This path emits TeX and lets LaTeX do composition.

Relevant files:

layouts/_default/baseof.tex
layouts/_default/_markup/render-image.tex
layouts/partials/preprocess-tex.tex

2. Browser PDF compositor

This path has two related but distinct browser implementations:

the editor/Nowtype PDF preview path paginates one rendered DOM tree and displays PDF-style pages for editing and preview;
the CLI book export path assembles Hugo-rendered chapter bodies in Chromium, normalizes/mutates the DOM for print, runs page preparation diagnostics, and writes a PDF through Playwright/Chromium.

Pagination now has a small shared layout-metrics layer in toggleMarkdown.js: collectNowtypePdfLayoutMetrics() measures block, heading, section, and content geometry root-relative before pagination. The paginator, rendered ToC, and smoke diagnostics should consume that metrics bundle instead of reading offsetTop or ad-hoc DOM rectangles independently. That invariant exists because composed Nowtype/Hugo documents can contain nested offset parents; treating raw offsetTop as document-global caused long chapters to collapse into too few browser-PDF pages even though the Markdown content was present.
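
The nested-offset-parent problem can be illustrated with a minimal sketch. It assumes plain objects shaped like DOM elements (offsetTop, offsetParent); rootRelativeTop is a hypothetical helper for illustration, not the actual body of collectNowtypePdfLayoutMetrics(), which may well measure through bounding rectangles instead.

```javascript
// Hypothetical helper: accumulate offsetTop up the offsetParent chain until
// the pagination root is reached, so a nested offset parent does not make a
// deep block look as if it sits near the top of the document.
function rootRelativeTop(node, root) {
  let top = 0;
  let current = node;
  while (current && current !== root) {
    top += current.offsetTop;
    current = current.offsetParent;
  }
  if (current !== root) {
    throw new Error("node is not positioned inside the given root");
  }
  return top;
}

// Plain objects stand in for DOM elements in this sketch.
const root = { offsetTop: 0, offsetParent: null };
const chapter = { offsetTop: 4000, offsetParent: root };
const heading = { offsetTop: 120, offsetParent: chapter };

// Raw offsetTop says 120; the root-relative position is 4120.
const headingTop = rootRelativeTop(heading, root);
```

Reading heading.offsetTop directly would place the heading 120 units into the document, which is the kind of misreading that collapses long chapters into too few pages.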

Relevant files:

cdn/custom/toggleMarkdown.js
LaTeX roadmap
scripts/nowtype-pdf-smoke.js

The browser compositor also has a book-scope preview path for local draft workspaces. For large books it falls back to fetching Hugo-rendered page bodies instead of reparsing one very large Markdown composite. That fallback should keep the CLI thesis builder as the semantic reference: root frontmatter supplies the formal cover metadata, top-level section indexes are Parts, and child leaf pages are Chapters. Browser-only details such as live selection projection must remain presentation helpers; they must not become document structure.

The book exporter now accepts --document-kind thesis|article. thesis keeps the formal title/frontmatter/recto behavior. article keeps the same Hugo rendering and pagination machinery but uses a compact first-page title block and continuous section flow, so journal-style article templates can be built on the same browser compositor rather than on a separate export path.

Generated frontmatter pages such as the table of contents, object lists, glossary, acknowledgements, and part dividers intentionally use the same inner text box as normal chapter body text; only the title page has special page geometry. Figure and table lists are sorted by their resolved book numbers, and caption extraction falls back to Hugo’s print metadata when malformed generated markdown nests one figure inside another caption.

The simple glossary-derived locator index is opt-in (generatedIndex: true in content/_index.md or --generated-index on) because a real thesis index should be authored from topic markers rather than mechanically repeating glossary and nomenclature entries.

Top-level bibliography/ or references/ content is classified as unnumbered backmatter during manifest construction, so it remains in the book order and recto plan without consuming a chapter number. Acknowledgements and generated frontmatter pages are registered as frontmatter entries with uppercase Roman labels. They emit section probes just like body chapters, so the final ToC can include them with resolved page numbers after the second pass.

When the probe PDF overestimates a late backmatter start, the final stamping pass reconciles out-of-range backmatter section starts against the rendered PDF text before drawing running heads. The same final text extraction is used to validate planned blank versos before header/footer stamping, so a page that drifted from “blank” in the probe plan to real chapter content in the final PDF is still stamped and becomes the chapter start for running-head purposes. Section-level running-head probes on the same page as a chapter opener are suppressed so the opener keeps the chapter mark.
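
Sorting figure and table lists by resolved book number needs a comparator that treats 1.10 as later than 1.2. The sketch below is one way to do that; compareBookNumbers and parseBookNumber are hypothetical names, and placing appendix letters after all numeric chapters is an assumption rather than documented exporter behavior.

```javascript
// Split a label such as "2.10" or "A.1" into comparable parts.
function parseBookNumber(label) {
  return label.split(".").map((part) => {
    const n = Number(part);
    // Assumption: alphabetic parts (appendix letters) rank after numeric parts.
    return Number.isNaN(n) ? { rank: 1, value: part } : { rank: 0, value: n };
  });
}

// Compare two labels part by part; a shorter label sorts before its extensions.
function compareBookNumbers(a, b) {
  const pa = parseBookNumber(a);
  const pb = parseBookNumber(b);
  const len = Math.max(pa.length, pb.length);
  for (let i = 0; i < len; i += 1) {
    const x = pa[i];
    const y = pb[i];
    if (!x) return -1;
    if (!y) return 1;
    if (x.rank !== y.rank) return x.rank - y.rank;
    if (x.value < y.value) return -1;
    if (x.value > y.value) return 1;
  }
  return 0;
}

const sortedLabels = ["A.1", "1.10", "1.2", "2.1"].sort(compareBookNumbers);
```

A plain string sort would put 1.10 before 1.2, which is why the numeric parts are compared by value.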

For iteration, the book exporter also accepts --fast (alias --single-pass). This does one browser composition and one PDF write, skipping the probe PDF, recto blank-page correction, and resolved generated-list page numbers. Use the default two-pass path for final output.

Generated glossary content can be grounded by a site-owned definitions document at one of:

data/thesis-definitions.json
data/thesis-definitions.yaml
data/print-definitions.json
data/print-definitions.yaml

The schema is intentionally small:

acronyms:
  TRSB: Time-reversal symmetry breaking
terms:
  loop supercurrent: A superconducting state with internal phase winding.
symbols:
  Delta: Superconducting gap amplitude.

During export, detected acronyms, emphasised terms, and common math symbols that are not defined there are printed as [book:warn] undefined ... lines and also included in diagnostics JSON. Nomenclature labels keep their canonical LaTeX where possible: detected symbols reuse the KaTeX markup already emitted by Hugo, and definition aliases such as alpha/α or Gamma/Γ are canonicalised before rendering so duplicate definition rows collapse to one mathematical label. If the optional generated index is enabled, symbol-index labels use the same rendering path.
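
A minimal sketch of the undefined-term check, assuming the definitions document has already been parsed into an object with the three maps shown above. findUndefined, canonicalSymbol, and the alias table are hypothetical; the real exporter's alias coverage and warning format are not reproduced here.

```javascript
// Illustrative alias subset only; the real table is presumably larger.
const SYMBOL_ALIASES = new Map([
  ["alpha", "α"],
  ["Gamma", "Γ"],
  ["Delta", "Δ"],
]);

function canonicalSymbol(name) {
  return SYMBOL_ALIASES.get(name) ?? name;
}

// Return every detected acronym, term, or symbol with no definition entry.
function findUndefined(definitions, detected) {
  const defined = {
    acronyms: new Set(Object.keys(definitions.acronyms || {})),
    terms: new Set(Object.keys(definitions.terms || {})),
    // Canonicalise definition keys so "Delta" and "Δ" collapse to one label.
    symbols: new Set(Object.keys(definitions.symbols || {}).map(canonicalSymbol)),
  };
  const missing = [];
  for (const kind of ["acronyms", "terms", "symbols"]) {
    for (const raw of detected[kind] || []) {
      const key = kind === "symbols" ? canonicalSymbol(raw) : raw;
      if (!defined[kind].has(key)) missing.push({ kind, key });
    }
  }
  return missing;
}

const definitions = {
  acronyms: { TRSB: "Time-reversal symmetry breaking" },
  terms: { "loop supercurrent": "A superconducting state with internal phase winding." },
  symbols: { Delta: "Superconducting gap amplitude." },
};
const missing = findUndefined(definitions, {
  acronyms: ["TRSB", "BCS"],
  symbols: ["Δ", "alpha"],
});
```

Here Δ resolves to the Delta definition through the alias table, while BCS and α are reported as undefined.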

Equation numbering is supplied by Hugo before the browser compositor sees the page. The shared eq/chapter-number partial follows the same semantic book structure as the browser manifest: top-level sections with child pages are Parts, their ordered child pages are Chapters, and top-level sections without child pages are standalone Chapters. It must not depend on numeric directory prefixes, because current thesis sites use semantic slugs plus front matter weights for ordering.
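
The Part/Chapter rule can be sketched as a small classifier over top-level sections. This is an illustration under stated assumptions, not the manifest builder itself: buildBookStructure and the input shape (slug, weight, children) are hypothetical, and continuing chapter numbers across Parts is an assumption.

```javascript
// sections: top-level sections as { slug, weight, children?: [{ slug, weight }] }
function buildBookStructure(sections) {
  const byWeight = (a, b) => a.weight - b.weight;
  const backmatterSlugs = new Set(["bibliography", "references"]);
  let chapterNumber = 0;
  const entries = [];
  for (const section of [...sections].sort(byWeight)) {
    const children = [...(section.children || [])].sort(byWeight);
    if (backmatterSlugs.has(section.slug)) {
      // Unnumbered backmatter stays in book order without using a chapter number.
      entries.push({ kind: "backmatter", slug: section.slug, number: null });
    } else if (children.length > 0) {
      // A top-level section with child pages is a Part; its children are Chapters.
      entries.push({ kind: "part", slug: section.slug, number: null });
      for (const child of children) {
        chapterNumber += 1;
        entries.push({ kind: "chapter", slug: child.slug, number: chapterNumber });
      }
    } else {
      // A top-level section without children is a standalone Chapter.
      chapterNumber += 1;
      entries.push({ kind: "chapter", slug: section.slug, number: chapterNumber });
    }
  }
  return entries;
}

const structure = buildBookStructure([
  { slug: "references", weight: 90 },
  { slug: "theory", weight: 20, children: [
    { slug: "pairing", weight: 2 },
    { slug: "symmetry", weight: 1 },
  ] },
  { slug: "introduction", weight: 10 },
]);
```

Ordering comes only from the weight field, never from a numeric prefix in the slug, which matches the constraint stated above.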

Citation links are also normalized during book assembly. Per-chapter pages emit numeric #bibreference-N links, but the browser book export removes the chapter-local reference tails and keeps a global bibliography page whose entries use #bibentry-key IDs. The assembler rewrites citation anchors to those global entry IDs before Chromium writes the PDF, so citation numbers remain clickable.
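
The anchor rewrite amounts to a lookup from a chapter-local reference number to a global bibliography key. A minimal sketch, assuming such a per-chapter map is available; rewriteCitationHref, the map shape, and the example keys are hypothetical.

```javascript
// refKeys: Map from chapter-local reference number to global bibliography key.
function rewriteCitationHref(href, refKeys) {
  const match = /^#bibreference-(\d+)$/.exec(href);
  if (!match) return href; // not a chapter-local citation link
  const key = refKeys.get(Number(match[1]));
  // Leave the link untouched when the number cannot be resolved.
  return key ? `#bibentry-${key}` : href;
}

// Hypothetical keys, for illustration only.
const refKeys = new Map([
  [1, "smith2020"],
  [2, "doe2021"],
]);
const rewritten = rewriteCitationHref("#bibreference-2", refKeys);
```

Leaving unresolved links unchanged keeps the failure visible to diagnostics instead of silently pointing at a wrong entry.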

The browser book exporter also exposes a small Hugo-first authoring contract for LaTeX-like print intent. The render hooks emit data-ql-print-* metadata for figures, tables, equations, citations, and local object references; the browser page planner reads those attributes before using any DOM heuristics. This keeps counter/reference authority in Hugo while leaving browser-only measurement and pagination in JavaScript.

Print metadata values are text-only attribute payloads. Do not pass rendered caption HTML through data-ql-print-*; figure captions may contain citations or KaTeX shortcodes that resolve after the image render hook and can otherwise break the surrounding HTML attribute.

Figure numbering is also Hugo-owned: the image render hook asks the shared chapter-label partial for the current chapter or appendix label, then emits labels such as 1.1 or A.1 before the browser pass runs. Generated object lists may clone sanitized caption HTML from the rendered DOM so inline KaTeX is preserved there, but they must not recover semantic numbering by re-parsing captions. Wrapper-based figures should pass the same print metadata when it is available; for legacy wrapper output that has only a local or missing figure number, the browser pass normalizes the final label from the owning chapter section and DOM-local figure order.

The authored surface is:

figures and tables can use data-float="here|top|bottom|page" or classes such as figure-here, float-top, float-bottom, and page-float
side figures can use .left and .right; those are treated as authored placements and are not moved by the deferred float filler
tables can use long-table / table-long, keep-table / table-keep, landscape-table / sidewaystable, and decimal-align / siunitx; numeric columns are also detected automatically for tabular-number alignment
equations are numbered by Hugo with chapter labels for normal chapters and alphabetic labels for appendices before the browser pass stamps references
generated diagnostics report table policy counts, unresolved page references, accessibility warnings, missing image alt text, duplicate IDs, heading-level skips, math accessibility coverage, and overfull/underfull layout warnings
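
The float part of the authored surface can be sketched as a small resolver. The attribute values and class names come from the list above; resolveFloatPolicy, its return shape, the precedence order (side classes first, then data-float, then float classes), and the deferrable flag are assumptions for illustration.

```javascript
const FLOAT_VALUES = new Set(["here", "top", "bottom", "page"]);
const FLOAT_CLASSES = new Map([
  ["figure-here", "here"],
  ["float-top", "top"],
  ["float-bottom", "bottom"],
  ["page-float", "page"],
]);

// Resolve the authored placement for one figure or table.
function resolveFloatPolicy({ dataFloat = "", classes = [] }) {
  let placement = "auto";
  if (classes.includes("left") || classes.includes("right")) {
    // Side figures are authored placements and are not moved by the float filler.
    placement = "side";
  } else if (FLOAT_VALUES.has(dataFloat)) {
    placement = dataFloat;
  } else {
    const hit = classes.find((name) => FLOAT_CLASSES.has(name));
    if (hit) placement = FLOAT_CLASSES.get(hit);
  }
  // Assumption: only top/bottom/page/auto placements may be deferred.
  const deferrable = placement !== "side" && placement !== "here";
  return { placement, deferrable };
}

const topPolicy = resolveFloatPolicy({ dataFloat: "top" });
const sidePolicy = resolveFloatPolicy({ classes: ["left", "float-top"] });
```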

The final export requests Chromium tagged-PDF and outline generation, and the stamping pass sets PDF title, author, subject, keywords, producer, creator, page-label metadata, and the outline page mode. Each run writes an adjacent .build.json manifest with command, Git, Node, Playwright/Chromium, and summary layout diagnostics so exported PDFs can be reproduced or audited later.

These two paths should not be evaluated the same way. The TeX path already gets most of LaTeX’s layout algorithms “for free”. The browser compositor does not.

Short Answer

The TeX path does not need a custom kerning engine.
The TeX path does need better use of LaTeX’s existing float and typography machinery.
The browser compositor is still missing several core LaTeX-class composition features:
- hyphenation-aware line breaking
- badness/demerit-based justification
- real float placement
- footnote-aware page building
- margin-note collision management
- long-table pagination
- structurally correct list pagination

Current State By Path

TeX path: what it already has

The TeX templates already use standard LaTeX figure and wrapfigure environments.

packages are loaded in baseof.tex
images normally emit wrapfigure, figure*, or figure with [htbp] in render-image.tex

That means the TeX path already relies on:

TeX paragraph line breaking
TeX page building
normal float placement
engine-level OpenType shaping and standard font kerning

TeX path: what is currently getting in its own way

The preprocess layer rewrites many figures and tables into forced [H] floats.

Examples:

HTML figure conversion in preprocess-tex.tex
markdown image conversion in preprocess-tex.tex
table wrapping in preprocess-tex.tex

That is the opposite of “intelligent figure placement”. It prevents LaTeX from doing the thing it is good at.

Browser compositor: what it currently does

The browser compositor uses a custom break-candidate and scoring system in computeNowtypePdfPageBreaks().

It currently:

measures logical content height
finds block-top and block-bottom break candidates
marks no-break ranges for paragraphs, headings, figures, and tables
marks whole list containers as keep-together ranges from DOM geometry
applies local penalties around target page height
keeps figure/table blocks together where possible
has best-effort widow/orphan scoring

The code explicitly says it is still clipping one logical DOM and therefore breaking at block boundaries rather than doing true fragment reflow:

toggleMarkdown.js

This is already much better than naive pixel slicing, but it is still not TeX-class composition.
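
The break-candidate idea can be reduced to a short sketch: pick, for each page, the legal candidate that leaves the least underfill. This is a greedy simplification under stated assumptions, not the scoring in computeNowtypePdfPageBreaks(); chooseBreak, paginate, and the input shapes are hypothetical, and widow/orphan scoring is left out.

```javascript
// Pick the legal candidate closest to the target height without overfilling.
function chooseBreak(pageStart, targetHeight, candidates, noBreakRanges) {
  let best = null;
  for (const y of candidates) {
    if (y <= pageStart) continue;
    const used = y - pageStart;
    if (used > targetHeight) continue; // would overfill the page
    // A break strictly inside a keep-together range is illegal.
    const insideKeep = noBreakRanges.some(([top, bottom]) => y > top && y < bottom);
    if (insideKeep) continue;
    const cost = targetHeight - used; // underfill penalty
    if (best === null || cost < best.cost) best = { y, cost };
  }
  return best;
}

// Walk the document, committing one break per page.
function paginate(totalHeight, targetHeight, candidates, noBreakRanges) {
  const breaks = [];
  let start = 0;
  while (totalHeight - start > targetHeight) {
    const best = chooseBreak(start, targetHeight, candidates, noBreakRanges);
    // Fall back to a forced break when no legal candidate fits.
    const y = best ? best.y : start + targetHeight;
    breaks.push(y);
    start = y;
  }
  return breaks;
}

const pageBreaks = paginate(
  2500,
  1000,
  [300, 700, 950, 1200, 1600, 1900, 2300],
  [[900, 1200]],
);
```

In this example the candidate at 950 is closest to the first page target but falls inside the keep range, so the first break lands at 700 instead.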

Browser compositor: one concrete current bug class

List pagination is still structurally wrong in some cases.

The paginator currently scans both list containers and list items as block boundaries:

toggleMarkdown.js

and it treats UL / OL containers as whole no-break ranges:

toggleMarkdown.js

That means the browser paginator can infer an over-large unbreakable region from container geometry instead of from the real list structure. In practice this can produce exactly the bug we have already seen:

a bullet list captures a following paragraph that is not part of the list
the paginator then refuses otherwise legal page breaks
the page underfills and turns into an apparent half page

This is not just a scoring issue. It is a structural ownership bug in how list ranges are inferred.

The same ownership rule also applies to heading keep-with-next logic. If a heading is followed by a ul / ol, the keep range must stop at the first li, not at the list container bottom, or the paginator recreates whole-list overcapture through the heading rule instead of through the list rule.
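
A minimal sketch of the ownership rule, assuming blocks are available as flat, source-ordered records with measured top/bottom values and explicit list items. keepRanges and the record shape are hypothetical; the point is only that ranges come from li geometry, never from the ul / ol container.

```javascript
// blocks: source-ordered records { tag, top, bottom, items?: [{ top, bottom }] }
function keepRanges(blocks) {
  const isList = (block) => block.tag === "ul" || block.tag === "ol";
  const ranges = [];
  blocks.forEach((block, i) => {
    if (isList(block)) {
      // One range per list item, so breaks between items stay legal and the
      // (possibly inflated) container geometry is never used.
      for (const item of block.items) ranges.push([item.top, item.bottom]);
    } else if (/^h[1-6]$/.test(block.tag)) {
      const next = blocks[i + 1];
      let keepBottom = block.bottom;
      if (next) {
        // Keep-with-next stops at the first li, not at the list container bottom.
        keepBottom = isList(next) && next.items.length > 0
          ? next.items[0].bottom
          : next.bottom;
      }
      ranges.push([block.top, keepBottom]);
    } else {
      ranges.push([block.top, block.bottom]);
    }
  });
  return ranges;
}

// The ul container reports bottom 400 and overlaps the following paragraph.
const listRanges = keepRanges([
  { tag: "h2", top: 0, bottom: 40 },
  { tag: "ul", top: 40, bottom: 400, items: [
    { top: 40, bottom: 100 },
    { top: 100, bottom: 160 },
  ] },
  { tag: "p", top: 160, bottom: 400 },
]);
```

With these ranges a break at 160, between the last list item and the following paragraph, is legal even though the container geometry claims the list extends to 400.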

Missing Features Compared With LaTeX

TeX path improvements

These are the things the TeX path should improve first.

1. Stop forcing normal figures and tables to [H]

This is the biggest current problem.

Use [htbp] or a configurable float policy by default, and reserve [H] for explicit author intent or exceptional cases only.

Why:

LaTeX’s float queue is specifically designed to avoid collisions and bad page breaks
[H] disables that machinery

2. Add microtypography

The current base template loads fontspec and unicode-math, but not microtype:

baseof.tex

That means the TeX path is currently missing:

protrusion
font expansion
some spacing refinements that reduce overfull lines and visible rivers

This matters more than hand-implementing kerning.

3. Add stronger keep-with-next controls for headings

The TeX path should use structural controls such as:

needspace
heading penalties
explicit section-before-space / after-space policy

Right now the browser compositor already tries to keep headings with the next block, but the TeX templates do not visibly encode that as a print contract.

4. Improve multi-page table handling

The current preprocess layer converts tables to tabular, tabularx, and adjustbox wrappers, but not longtable:

preprocess-tex.tex

Missing behavior:

repeated headers
proper page breaks inside long tables
row-integrity rules

5. Add print-grade float controls instead of one-off fixes

The TeX path should use explicit float tuning rather than scattered coercions.

Examples worth considering:

placeins
needspace
floatpagefraction
topfraction
bottomfraction
textfraction

Browser compositor improvements

These are the places where the browser path is still fundamentally below LaTeX.

1. Real line-fragment-aware composition

The current paginator still breaks at block boundaries because it does not do true fragment reflow:

toggleMarkdown.js

If the goal is LaTeX-like page quality, the compositor eventually needs:

legal breakpoints within paragraphs
fragment continuation across pages
paragraph layout that is aware of the next page

1a. Correct list ownership and list-fragment modeling

Before more advanced typography work, the browser compositor needs a real list model.

At minimum it needs:

explicit detection of where a list ends structurally, not just visually
no-break ownership that stops at the last list item rather than leaking into following siblings
legal breakpoints between list items when a whole list does not fit
keep-with-next rules for nested lists and list-item continuations

Until that exists, list-heavy pages will keep producing artificial white space and false half-page breaks even if paragraph scoring improves.

2. Hyphenation-aware line breaking

The roadmap already flags this as missing:

LaTeX roadmap

Without hyphenation dictionaries and a soft-hyphen pipeline, dense justified text will always look worse than LaTeX.

3. Better justification using badness, not just local penalties

Also already called out in the roadmap:

LaTeX roadmap

The browser compositor should move toward:

paragraph-level badness scoring
demerits across adjacent lines
penalties for visually poor spacing, not just page-fill distance

4. Real float placement

The roadmap is explicit here too:

float classes missing: LaTeX roadmap
placement policy missing: LaTeX roadmap

Current figure handling is still a local keep-together or shrink decision using page-fit policy:

policy parsing in toggleMarkdown.js
shrink/page-fit logic around figures in toggleMarkdown.js

What is missing is a real deferred float queue with top/bottom/page placement.

There is now a narrower browser-side step in place: markdown image/table placement hints (place, placement, float, pos, position) feed top/bottom/page break preferences in the paginator. In the live compose root, top and page-float figures can now survive all the way into committed page starts instead of being trimmed back into generic block breaks. Queue-feasible bottom floats can now pull an earlier prep break so they land materially lower in the target page instead of snapping to the top of it. bottom placement is still only best-effort in mixed cases: when an earlier competing float decision saturates the available page sequence, the browser path still degrades to ordinary source-order deferral rather than running a real float queue.

The browser runtime now also exposes a float queue report through the __qlNowtypePdfSmoke.getPaginationState() contract. That report records each queued figure/table float, whether it was honored or deferred, and a simple blocked-by-earlier-float attribution for the common mixed-case failure mode.

There is now one bounded queue-arbitration step on top of that reporting: if a page starts with an honored top/page-float, the paginator can hand off early after that float so a following bottom float lands low on the next page, and then prefer a break after that bottom float before looking ahead to a later float page. This is enough to satisfy the common top -> bottom -> page-float sequence in the local harness. It is still not a full global float queue across arbitrary competing floats.

5. Footnote-aware page building

The roadmap still lists per-page footnote placement as incomplete:

LaTeX roadmap

LaTeX does not bolt footnotes on after pagination. It inserts them into page building. The browser path should move toward that model.

The browser runtime now takes a partial step: it estimates per-page note load during pagination, penalizes breaks that would overflow the margin-note lane, and records per-page noteLoad / noteLoadSummary diagnostics in the smoke API. It now also commits note ownership to specific later pages when the original lane overflows, so screen/print preview can render carried notes from pagination state instead of re-collecting them from the clipped DOM.

The live editor path also now preserves source [^id] refs in the canonical markdown buffer during nt_changed sync and resolves misparsed a.previewcite footnote anchors back through source footnote definitions for margin-note planning.

Spread/print page shells and single-page viewport mode can now promote committed footnotes into a real bottom deck that reduces the visible content window on those pages, but the live editor DOM still lacks a native .footnotes structure and pagination still does not jointly solve page breaks with footnote height the way TeX does. So this is a meaningful compositor step, not full TeX-grade footnote/page construction yet.
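
Spill-forward note ownership can be sketched as a carry queue over pages. This is an illustration only: commitNoteOwnership and its input and output shapes are hypothetical, a single lane capacity per page is assumed, and the sketch does not feed note height back into break selection, which is the part TeX solves jointly.

```javascript
// pages: per-page arrays of notes { id, height } in source order.
function commitNoteOwnership(pages, laneCapacity) {
  const committed = [];
  let carried = [];
  for (const notes of pages) {
    const queue = carried.concat(notes);
    const owned = [];
    carried = [];
    let used = 0;
    for (const note of queue) {
      // Preserve source order: once one note spills, all later notes spill too.
      if (carried.length === 0 && used + note.height <= laneCapacity) {
        owned.push(note.id);
        used += note.height;
      } else {
        carried.push(note);
      }
    }
    committed.push({ owned, noteLoad: used, overflow: carried.length > 0 });
  }
  // Notes still carried after the last page have no owner yet.
  return { committed, unplaced: carried.map((note) => note.id) };
}

const noteResult = commitNoteOwnership(
  [
    [{ id: "a", height: 60 }, { id: "b", height: 60 }],
    [{ id: "c", height: 30 }],
    [],
  ],
  100,
);
```

Note b does not fit beside a on the first page, so it is committed to the second page ahead of c.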

6. Margin-note collision avoidance

Also explicitly still in progress:

LaTeX roadmap

The current system has a margin-notes lane, but it is not yet equivalent to a true note-placement engine with collision handling and stable spill behavior. The current improvement is bounded: it can steer page breaks away from predicted note overload, commit spill-forward note ownership, and report overflow risk, but it still does not perform LaTeX-style note allocation within page building.

7. Long-table pagination

Still missing according to the roadmap:

LaTeX roadmap

This is a structural gap, not a cosmetic one.

Algorithms LaTeX Uses That Are Worth Copying

These are the main ideas worth borrowing if the browser compositor is supposed to feel LaTeX-like.

Knuth-Plass line breaking

Use paragraph-level optimization with badness and demerits instead of purely local line fitting.

Why:

reduces rivers
balances adjacent lines
gives more stable typography under small edits
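
A simplified total-fit sketch in the spirit of Knuth-Plass, for illustration. It uses TeX-style badness (about 100 times the cube of the adjustment ratio, capped at 10000) and line demerits of (linePenalty + badness) squared, but omits hyphenation, explicit penalties, and fitness classes. breakParagraph, adjustmentRatio, and the input shapes are hypothetical names.

```javascript
// TeX-style badness: roughly 100 * |r|^3, capped at 10000.
function badness(ratio) {
  return Math.min(10000, Math.round(100 * Math.abs(ratio) ** 3));
}

// Adjustment ratio for setting words[i..j) on one line of lineWidth.
function adjustmentRatio(widths, i, j, lineWidth, space) {
  const gaps = j - i - 1;
  let natural = gaps * space.width;
  for (let k = i; k < j; k += 1) natural += widths[k];
  const slack = lineWidth - natural;
  if (slack === 0) return 0;
  if (gaps === 0) return slack > 0 ? Infinity : -Infinity;
  return slack > 0 ? slack / (gaps * space.stretch) : slack / (gaps * space.shrink);
}

// Dynamic program over all break positions, minimising total demerits.
function breakParagraph(widths, lineWidth, space, linePenalty = 10) {
  const n = widths.length;
  const best = new Array(n + 1).fill(Infinity); // best[j]: min demerits for words[0..j)
  const prev = new Array(n + 1).fill(-1);
  best[0] = 0;
  for (let j = 1; j <= n; j += 1) {
    for (let i = 0; i < j; i += 1) {
      if (best[i] === Infinity) continue;
      let r = adjustmentRatio(widths, i, j, lineWidth, space);
      if (r < -1) continue; // cannot shrink that far: overfull line
      if (j === n && r > 0) r = 0; // the last line may end short
      const total = best[i] + (linePenalty + badness(r)) ** 2;
      if (total < best[j]) {
        best[j] = total;
        prev[j] = i;
      }
    }
  }
  if (best[n] === Infinity) throw new Error("no feasible set of breaks");
  const breaks = [];
  for (let j = n; j > 0; j = prev[j]) breaks.unshift(j);
  return breaks; // end index (exclusive) of each line
}

const lineEnds = breakParagraph(
  [45, 45, 25, 65, 20],
  100,
  { width: 10, stretch: 10, shrink: 5 },
);
```

The returned indices mark where each line ends, so the example paragraph is set as three lines of two, two, and one word.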

Liang hyphenation

Use language-specific hyphenation patterns as part of line breaking.

Why:

reduces spacing distortion in justified paragraphs
avoids overfull lines and ugly word spacing
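
Liang-style pattern matching can be sketched in a few lines. Each pattern carries digits between letters, the maximum digit at every inter-letter position wins, and odd values permit a break. The sketch assumes a small hand-picked pattern list in the style of TeX's English patterns; parsePattern and hyphenate are hypothetical names, and a real pipeline would load a full language pattern set and an exception list.

```javascript
// Parse "hen5at" into text "henat" and inter-letter values [0, 0, 0, 5, 0, 0].
function parsePattern(pattern) {
  const letters = [];
  const values = [0];
  for (const ch of pattern) {
    if (ch >= "0" && ch <= "9") {
      values[values.length - 1] = Number(ch);
    } else {
      letters.push(ch);
      values.push(0);
    }
  }
  return { text: letters.join(""), values };
}

// Split a word at positions where the winning pattern value is odd.
function hyphenate(word, patterns, leftMin = 2, rightMin = 3) {
  const dotted = `.${word.toLowerCase()}.`; // dots mark the word boundaries
  const points = new Array(dotted.length + 1).fill(0);
  for (const { text, values } of patterns.map(parsePattern)) {
    let at = dotted.indexOf(text);
    while (at !== -1) {
      values.forEach((v, k) => {
        points[at + k] = Math.max(points[at + k], v);
      });
      at = dotted.indexOf(text, at + 1);
    }
  }
  const parts = [];
  let start = 0;
  for (let i = leftMin; i <= word.length - rightMin; i += 1) {
    // points[i + 1] is the value just before word[i] in the dotted string.
    if (points[i + 1] % 2 === 1) {
      parts.push(word.slice(start, i));
      start = i;
    }
  }
  parts.push(word.slice(start));
  return parts;
}

const samplePatterns = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"];
const hyphenated = hyphenate("hyphenation", samplePatterns);
```

With this pattern subset the word splits as hy-phen-ation; the resulting parts could be joined with soft hyphens before the text reaches the line breaker.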

TeX page building with penalties

Model pages using:

legal breakpoints
penalties
keep-with-next behavior
widow/orphan control
footnote cost
float cost

Why:

page quality is not just “closest break to target height”

Deferred float queues

Model figures and tables as float candidates with:

inline
top
bottom
float page
margin float

Why:

this is how LaTeX avoids figure/title collisions without forcing every figure exactly where it first appears
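
A deferred float queue can be sketched as follows: a float that does not fit is queued, text keeps flowing, and queued floats get first claim on the top of the next page in source order. This is a minimal illustration under stated assumptions, not a full LaTeX float algorithm; layoutWithFloatQueue and the item shape are hypothetical, and only top placement is modeled.

```javascript
// items: source-ordered records { id, kind: "text" | "float", height }
function layoutWithFloatQueue(items, pageHeight) {
  const pages = [];
  let queue = [];
  let page = { top: [], flow: [], used: 0 };

  // Close the current page and place deferred floats at the top of the next one.
  const flush = () => {
    pages.push(page);
    page = { top: [], flow: [], used: 0 };
    const waiting = queue;
    queue = [];
    let blocked = false;
    for (const float of waiting) {
      // The first float on an empty page is always placed, so the queue drains.
      const fits = page.used === 0 || page.used + float.height <= pageHeight;
      if (!blocked && fits) {
        page.top.push(float.id);
        page.used += float.height;
      } else {
        blocked = true; // keep source order among deferred floats
        queue.push(float);
      }
    }
  };

  for (const item of items) {
    if (item.kind === "float") {
      const fits = page.used + item.height <= pageHeight;
      // A float may not overtake a float that is already deferred.
      if (fits && queue.length === 0) {
        page.flow.push(item.id);
        page.used += item.height;
      } else {
        queue.push(item);
      }
    } else {
      while (page.used > 0 && page.used + item.height > pageHeight) flush();
      page.flow.push(item.id);
      page.used += item.height;
    }
  }
  while (queue.length > 0) flush();
  pages.push(page);
  return pages;
}

const floatPages = layoutWithFloatQueue(
  [
    { id: "t1", kind: "text", height: 60 },
    { id: "f1", kind: "float", height: 50 },
    { id: "t2", kind: "text", height: 30 },
    { id: "t3", kind: "text", height: 40 },
    { id: "f2", kind: "float", height: 20 },
  ],
  100,
);
```

In the example, t2 fills the space that f1 could not use, and f1 moves to the top of the second page instead of forcing an early break.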

Microtypography

Approximate:

protrusion
font expansion
small tracking adjustments

Why:

this is a major part of why LaTeX PDFs look calmer and denser than naive browser-justified text

What Should Stay Delegated To LaTeX

Do not reimplement these in the TeX path unless there is a very strong reason.

standard font kerning and ligatures
paragraph line breaking
normal float placement
page building
bibliography formatting

The TeX path should mostly improve by giving LaTeX better structure and fewer forced placements, not by replacing LaTeX’s own layout logic.

What The Browser Path Must Reimplement

If the browser PDF surface is meant to be genuinely LaTeX-like, it eventually must own approximations of:

hyphenation-aware line breaking
paragraph badness/demerit scoring
penalty-based page breaking
float placement policy
footnote and margin-note placement
long-table pagination

That is the real gap.

Recommended Order

First: TeX path cleanup

Stop forcing [H] except when explicitly requested.
Add microtype.
Add proper long-table support.
Add heading keep-with-next / needspace policy.

Second: browser compositor quality

Fix list ownership and false no-break capture around bullet lists.
Finish no-line-split guarantees.
Add hyphenation dictionaries.
Add better paragraph badness scoring.
Introduce real float classes and float queues.
Move footnotes and margin notes into page building.

Third: convergence

Once both paths are cleaner, align:

counters
references
figure numbering
caption behavior
page semantics

The goal should be:

TeX path for final print authority
browser PDF path for faithful editing and preview

not two unrelated layout systems.