PDF Generator Gap Analysis

Last updated: 2026-05-27

This document answers one narrow question:

  • what is still missing if the goal is “LaTeX-like PDF quality”

It is written against the current code, not against an ideal future system.

Related roadmap:

Two Different PDF Systems

The repo currently has two different PDF-related paths.

1. TeX-emission path

This path emits TeX and lets LaTeX do composition.

Relevant files:

  • layouts/_default/baseof.tex
  • layouts/_default/_markup/render-image.tex
  • layouts/partials/preprocess-tex.tex

2. Browser PDF compositor

This path has two related but distinct browser implementations:

  • the editor/Nowtype PDF preview path paginates one rendered DOM tree and displays PDF-style pages for editing and preview;
  • the CLI book export path assembles Hugo-rendered chapter bodies in Chromium, normalizes/mutates the DOM for print, runs page preparation diagnostics, and writes a PDF through Playwright/Chromium.

Pagination now has a small shared layout-metrics layer in toggleMarkdown.js: collectNowtypePdfLayoutMetrics() measures block, heading, section, and content geometry root-relative before pagination. The paginator, rendered ToC, and smoke diagnostics should consume that metrics bundle instead of reading offsetTop or ad-hoc DOM rectangles independently. That invariant exists because composed Nowtype/Hugo documents can contain nested offset parents; treating raw offsetTop as document-global caused long chapters to collapse into too few browser-PDF pages even though the Markdown content was present.
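The offset-parent problem can be illustrated with a minimal sketch. This is not the code in collectNowtypePdfLayoutMetrics(); the helper name rootRelativeTop is hypothetical, plain objects stand in for DOM elements so it runs in Node, and it assumes the pagination root is itself an offset parent. It ignores borders and transforms.

```javascript
// Hypothetical helper: sum offsetTop along the offsetParent chain until `root`.
function rootRelativeTop(node, root) {
  let top = 0;
  let current = node;
  while (current && current !== root) {
    top += current.offsetTop;
    current = current.offsetParent;
  }
  if (current !== root) {
    throw new Error("node is not positioned inside root");
  }
  return top;
}

// Plain objects stand in for DOM elements (assumption, for a runnable sketch).
const root = { offsetTop: 0, offsetParent: null };
const chapter = { offsetTop: 4200, offsetParent: root };
const heading = { offsetTop: 36, offsetParent: chapter };

console.log(heading.offsetTop);              // raw value, relative to `chapter` only
console.log(rootRelativeTop(heading, root)); // value relative to `root`
```

A heading deep inside a long chapter reports a small raw offsetTop, which is how a paginator that trusts the raw value ends up with too few pages.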

Relevant files:

  • cdn/custom/toggleMarkdown.js
  • LaTeX roadmap
  • scripts/nowtype-pdf-smoke.js

The browser compositor also has a book-scope preview path for local draft workspaces. For large books it falls back to fetching Hugo-rendered page bodies instead of reparsing one very large Markdown composite. That fallback should keep the CLI thesis builder as the semantic reference: root frontmatter supplies the formal cover metadata, top-level section indexes are Parts, and child leaf pages are Chapters. Browser-only details such as live selection projection must remain presentation helpers; they must not become document structure.

The book exporter now accepts --document-kind thesis|article. With thesis, it keeps the formal title/frontmatter/recto behavior. With article, it keeps the same Hugo rendering and pagination machinery but uses a compact first-page title block and continuous section flow, so journal-style article templates can be built on the same browser compositor rather than on a separate export path.

Generated frontmatter pages such as the table of contents, object lists, glossary, acknowledgements, and part dividers intentionally use the same inner text box as normal chapter body text; only the title page has special page geometry. Figure and table lists are sorted by their resolved book numbers, and caption extraction falls back to Hugo’s print metadata when malformed generated markdown nests one figure inside another caption.

The simple glossary-derived locator index is opt-in (generatedIndex: true in content/_index.md or --generated-index on) because a real thesis index should be authored from topic markers rather than mechanically repeating glossary and nomenclature entries.

Top-level bibliography/ or references/ content is classified as unnumbered backmatter during manifest construction, so it remains in the book order and recto plan without consuming a chapter number. Acknowledgements and generated frontmatter pages are registered as frontmatter entries with uppercase Roman labels. They emit section probes just like body chapters, so the final ToC can include them with resolved page numbers after the second pass.

When the probe PDF overestimates a late backmatter start, the final stamping pass reconciles out-of-range backmatter section starts against the rendered PDF text before drawing running heads. The same final text extraction is used to validate planned blank versos before header/footer stamping, so a page that drifted from “blank” in the probe plan to real chapter content in the final PDF is still stamped and becomes the chapter start for running-head purposes. Section-level running-head probes on the same page as a chapter opener are suppressed so the opener keeps the chapter mark.

For iteration, the book exporter also accepts --fast (alias --single-pass). This does one browser composition and one PDF write, skipping the probe PDF, recto blank-page correction, and resolved generated-list page numbers. Use the default two-pass path for final output.

Generated glossary content can be grounded by a site-owned definitions document at one of:

  • data/thesis-definitions.json
  • data/thesis-definitions.yaml
  • data/print-definitions.json
  • data/print-definitions.yaml

The schema is intentionally small:

acronyms:
  TRSB: Time-reversal symmetry breaking
terms:
  loop supercurrent: A superconducting state with internal phase winding.
symbols:
  Delta: Superconducting gap amplitude.

During export, detected acronyms, emphasised terms, and common math symbols that are not defined there are printed as [book:warn] undefined ... lines and also included in diagnostics JSON. Nomenclature labels keep their canonical LaTeX where possible: detected symbols reuse the KaTeX markup already emitted by Hugo, and definition aliases such as alpha/α or Gamma/Γ are canonicalised before rendering so duplicate definition rows collapse to one mathematical label. If the optional generated index is enabled, symbol-index labels use the same rendering path.
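The undefined-entry check and alias canonicalisation described above could look roughly like the following sketch. The function names, the alias table, and the shape of the detected input are assumptions for illustration; only the acronyms/terms/symbols schema and the alpha/α and Gamma/Γ examples come from this document.

```javascript
// Hypothetical alias table; Delta/Δ is added here purely for the example.
const SYMBOL_ALIASES = new Map([
  ["α", "alpha"],
  ["Γ", "Gamma"],
  ["Δ", "Delta"],
]);

function canonicalSymbol(label) {
  return SYMBOL_ALIASES.get(label) ?? label;
}

// Returns detected items that have no entry in the definitions document.
function findUndefined(detected, definitions) {
  const known = {
    acronyms: new Set(Object.keys(definitions.acronyms ?? {})),
    terms: new Set(Object.keys(definitions.terms ?? {})),
    symbols: new Set(Object.keys(definitions.symbols ?? {}).map(canonicalSymbol)),
  };
  const missing = [];
  for (const kind of ["acronyms", "terms", "symbols"]) {
    for (const raw of detected[kind] ?? []) {
      const key = kind === "symbols" ? canonicalSymbol(raw) : raw;
      if (!known[kind].has(key)) missing.push({ kind, key });
    }
  }
  return missing;
}

const definitions = {
  acronyms: { TRSB: "Time-reversal symmetry breaking" },
  terms: { "loop supercurrent": "A superconducting state with internal phase winding." },
  symbols: { Delta: "Superconducting gap amplitude." },
};
const detected = { acronyms: ["TRSB", "BCS"], terms: [], symbols: ["Δ", "α"] };
const missing = findUndefined(detected, definitions);
console.log(missing);
```

Canonicalising both the defined keys and the detected labels is what lets Δ match the Delta definition instead of producing a second row.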

Equation numbering is supplied by Hugo before the browser compositor sees the page. The shared eq/chapter-number partial follows the same semantic book structure as the browser manifest: top-level sections with child pages are Parts, their ordered child pages are Chapters, and top-level sections without child pages are standalone Chapters. It must not depend on numeric directory prefixes, because current thesis sites use semantic slugs plus front matter weights for ordering.
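As a rough illustration of that structure rule, the sketch below numbers a manifest by weight rather than by directory prefix. The manifest shape ({ slug, weight, children }) and the function name numberBook are hypothetical, and it assumes chapter numbers continue across Parts, which this document does not state. Appendices are left out.

```javascript
// Hypothetical manifest shape: { slug, weight, children: [...] }.
const BACKMATTER_SLUGS = new Set(["bibliography", "references"]);

function byWeight(a, b) {
  return a.weight - b.weight;
}

function numberBook(topLevelSections) {
  const entries = [];
  let part = 0;
  let chapter = 0;
  for (const section of [...topLevelSections].sort(byWeight)) {
    if (BACKMATTER_SLUGS.has(section.slug)) {
      // Backmatter stays in book order but consumes no chapter number.
      entries.push({ slug: section.slug, kind: "backmatter", number: null });
    } else if (section.children && section.children.length > 0) {
      part += 1;
      entries.push({ slug: section.slug, kind: "part", number: part });
      for (const child of [...section.children].sort(byWeight)) {
        chapter += 1;
        entries.push({ slug: child.slug, kind: "chapter", number: chapter });
      }
    } else {
      // A top-level section without child pages is a standalone Chapter.
      chapter += 1;
      entries.push({ slug: section.slug, kind: "chapter", number: chapter });
    }
  }
  return entries;
}

const book = [
  { slug: "references", weight: 90, children: [] },
  {
    slug: "theory",
    weight: 10,
    children: [
      { slug: "pairing", weight: 2 },
      { slug: "symmetry", weight: 1 },
    ],
  },
  { slug: "outlook", weight: 20, children: [] },
];
const numbered = numberBook(book);
console.log(numbered);
```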

Citation links are also normalized during book assembly. Per-chapter pages emit numeric #bibreference-N links, but the browser book export removes the chapter-local reference tails and keeps a global bibliography page whose entries use #bibentry-key IDs. The assembler rewrites citation anchors to those global entry IDs before Chromium writes the PDF, so citation numbers remain clickable.
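A minimal sketch of that rewrite, operating on an HTML string so it can run outside a browser. The function name and the per-chapter lookup from local reference number to citation key are assumptions; the assembler itself presumably works on DOM anchors rather than on strings.

```javascript
// Rewrites chapter-local citation hrefs to global bibliography entry IDs.
// `localNumberToKey` is a hypothetical per-chapter lookup: reference number -> citation key.
function rewriteCitationHrefs(html, localNumberToKey) {
  return html.replace(/href="#bibreference-(\d+)"/g, (match, n) => {
    const key = localNumberToKey.get(Number(n));
    // Unknown numbers are left untouched so the problem stays visible in diagnostics.
    return key ? `href="#bibentry-${key}"` : match;
  });
}

const chapterHtml =
  '<p>See <a href="#bibreference-2">[2]</a> and <a href="#bibreference-7">[7]</a>.</p>';
const lookup = new Map([[2, "smith2020"]]); // hypothetical citation key
const rewritten = rewriteCitationHrefs(chapterHtml, lookup);
console.log(rewritten);
```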

The browser book exporter also exposes a small Hugo-first authoring contract for LaTeX-like print intent. The render hooks emit data-ql-print-* metadata for figures, tables, equations, citations, and local object references; the browser page planner reads those attributes before using any DOM heuristics. This keeps counter/reference authority in Hugo while leaving browser-only measurement and pagination in JavaScript. Print metadata values are text-only attribute payloads. Do not pass rendered caption HTML through data-ql-print-*; figure captions may contain citations or KaTeX shortcodes that resolve after the image render hook and can otherwise break the surrounding HTML attribute. Figure numbering is also Hugo-owned: the image render hook asks the shared chapter-label partial for the current chapter or appendix label, then emits labels such as 1.1 or A.1 before the browser pass runs. Generated object lists may clone sanitized caption HTML from the rendered DOM so inline KaTeX is preserved there, but they must not recover semantic numbering by re-parsing captions. Wrapper-based figures should pass the same print metadata when it is available; for legacy wrapper output that has only a local or missing figure number, the browser pass normalizes the final label from the owning chapter section and DOM-local figure order.

The authored surface is:

  • figures and tables can use data-float="here|top|bottom|page" or classes such as figure-here, float-top, float-bottom, and page-float
  • side figures can use .left and .right; those are treated as authored placements and are not moved by the deferred float filler
  • tables can use long-table / table-long, keep-table / table-keep, landscape-table / sidewaystable, and decimal-align / siunitx; numeric columns are also detected automatically for tabular-number alignment
  • equations are numbered by Hugo with chapter labels for normal chapters and alphabetic labels for appendices before the browser pass stamps references
  • generated diagnostics report table policy counts, unresolved page references, accessibility warnings, missing image alt text, duplicate IDs, heading-level skips, math accessibility coverage, and overfull/underfull layout warnings
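The figure and table placement hints in this list could be resolved along these lines. This is a sketch only: the function name, the returned shape, and the precedence order (side placement first, then data-float, then classes) are assumptions, not the planner's documented behavior.

```javascript
const FLOAT_VALUES = new Set(["here", "top", "bottom", "page"]);

// Class names come from the list above; the mapping to policies is an assumption.
const CLASS_TO_POLICY = new Map([
  ["figure-here", "here"],
  ["float-top", "top"],
  ["float-bottom", "bottom"],
  ["page-float", "page"],
]);

function resolveFloatPolicy({ dataFloat, classes = [] }) {
  // Side figures are authored placements and are never moved by the float filler.
  if (classes.includes("left") || classes.includes("right")) {
    return { policy: "side", movable: false };
  }
  if (dataFloat && FLOAT_VALUES.has(dataFloat)) {
    return { policy: dataFloat, movable: dataFloat !== "here" };
  }
  for (const name of classes) {
    const policy = CLASS_TO_POLICY.get(name);
    if (policy) return { policy, movable: policy !== "here" };
  }
  return { policy: "auto", movable: true };
}

console.log(resolveFloatPolicy({ dataFloat: "top", classes: ["float-bottom"] }));
console.log(resolveFloatPolicy({ classes: ["left", "float-top"] }));
```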

The final export requests Chromium tagged-PDF and outline generation, and the stamping pass sets PDF title, author, subject, keywords, producer, creator, page-label metadata, and the outline page mode. Each run writes an adjacent .build.json manifest with command, Git, Node, Playwright/Chromium, and summary layout diagnostics so exported PDFs can be reproduced or audited later.

These two paths should not be evaluated the same way. The TeX path already gets most of LaTeX’s layout algorithms “for free”. The browser compositor does not.

Short Answer

  • The TeX path does not need a custom kerning engine.
  • The TeX path does need better use of LaTeX’s existing float and typography machinery.
  • The browser compositor is still missing several core LaTeX-class composition features:
    • hyphenation-aware line breaking
    • badness/demerit-based justification
    • real float placement
    • footnote-aware page building
    • margin-note collision management
    • long-table pagination
    • structurally correct list pagination

Current State By Path

TeX path: what it already has

The TeX templates already use standard LaTeX figure and wrapfigure environments.

  • packages are loaded in baseof.tex
  • images normally emit wrapfigure, figure*, or figure with [htbp] in render-image.tex

That means the TeX path already relies on:

  • TeX paragraph line breaking
  • TeX page building
  • normal float placement
  • engine-level OpenType shaping and standard font kerning

TeX path: what is currently getting in its own way

The preprocess layer rewrites many figures and tables into forced [H] floats.

Examples:

  • HTML figure conversion in preprocess-tex.tex
  • markdown image conversion in preprocess-tex.tex
  • table wrapping in preprocess-tex.tex

That is the opposite of “intelligent figure placement”. It prevents LaTeX from doing the thing it is good at.

Browser compositor: what it currently does

The browser compositor uses a custom break-candidate and scoring system in computeNowtypePdfPageBreaks().

It currently:

  • measures logical content height
  • finds block-top and block-bottom break candidates
  • marks no-break ranges for paragraphs, headings, figures, and tables
  • marks whole list containers as keep-together ranges from DOM geometry
  • applies local penalties around target page height
  • keeps figure/table blocks together where possible
  • has best-effort widow/orphan scoring

The code explicitly says it is still clipping one logical DOM and therefore breaking at block boundaries rather than doing true fragment reflow:

  • toggleMarkdown.js

This is already much better than naive pixel slicing, but it is still not TeX-class composition.
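To make the candidate-and-penalty idea concrete, here is a deliberately small sketch. It is not computeNowtypePdfPageBreaks(); the function name, the input shapes, and the single "unused space" penalty are simplifying assumptions.

```javascript
// candidates: y-offsets of block tops/bottoms.
// noBreak: [{ top, bottom }] ranges a break may not fall strictly inside.
function pickBreak(candidates, noBreak, pageStart, targetHeight) {
  const limit = pageStart + targetHeight;
  let best = null;
  for (const y of candidates) {
    if (y <= pageStart || y > limit) continue; // must advance and must fit
    const blocked = noBreak.some((range) => y > range.top && y < range.bottom);
    if (blocked) continue;
    const penalty = limit - y; // local penalty: unused space on the page
    if (best === null || penalty < best.penalty) best = { y, penalty };
  }
  return best;
}

const candidates = [300, 620, 900, 1010];
const noBreak = [{ top: 850, bottom: 1000 }]; // e.g. a figure that must stay whole
const chosen = pickBreak(candidates, noBreak, 0, 1000);
console.log(chosen);
```

The candidate at 900 is closest to the target but falls inside the figure, so the break moves back to 620. That is block-boundary breaking, not fragment reflow.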

Browser compositor: one concrete current bug class

List pagination is still structurally wrong in some cases.

The paginator currently scans both list containers and list items as block boundaries:

  • toggleMarkdown.js

and it treats UL / OL containers as whole no-break ranges:

  • toggleMarkdown.js

That means the browser paginator can infer an over-large unbreakable region from container geometry instead of from the real list structure. In practice this can produce exactly the bug we have already seen:

  • a bullet list captures a following paragraph that is not part of the list
  • the paginator then refuses otherwise legal page breaks
  • the page underfills and turns into an apparent half page

This is not just a scoring issue. It is a structural ownership bug in how list ranges are inferred.

The same ownership rule also applies to heading keep-with-next logic. If a heading is followed by a ul / ol, the keep range must stop at the first li, not at the list container bottom, or the paginator recreates whole-list overcapture through the heading rule instead of through the list rule.
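The heading rule can be sketched as follows. The block shape and the function name are hypothetical, and the sketch reads "stop at the first li" as "extend to the bottom of the first list item".

```javascript
// block: { tag, top, bottom, items?: [{ top, bottom }] } in root-relative coordinates.
function headingKeepRange(heading, next) {
  if (!next) return { top: heading.top, bottom: heading.bottom };
  const isList = next.tag === "ul" || next.tag === "ol";
  if (isList && next.items && next.items.length > 0) {
    // Keep the heading with the first item only, never with the whole container.
    return { top: heading.top, bottom: next.items[0].bottom };
  }
  return { top: heading.top, bottom: next.bottom };
}

const heading = { tag: "h2", top: 100, bottom: 140 };
const list = {
  tag: "ul",
  top: 150,
  bottom: 900,
  items: [
    { top: 150, bottom: 190 },
    { top: 190, bottom: 560 },
    { top: 560, bottom: 900 },
  ],
};
const keep = headingKeepRange(heading, list);
console.log(keep);
```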

Missing Features Compared With LaTeX

TeX path improvements

These are the things the TeX path should improve first.

1. Stop forcing normal figures and tables to [H]

This is the biggest current problem.

Use [htbp] or a configurable float policy by default, and reserve [H] for explicit author intent or exceptional cases only.

Why:

  • LaTeX’s float queue is specifically designed to avoid collisions and bad page breaks
  • [H] disables that machinery

2. Add microtypography

The current base template loads fontspec and unicode-math, but not microtype:

  • baseof.tex

That means the TeX path is currently missing:

  • protrusion
  • font expansion
  • some spacing refinements that reduce overfull lines and visible rivers

This matters more than hand-implementing kerning.

3. Add stronger keep-with-next controls for headings

The TeX path should use structural controls such as:

  • needspace
  • heading penalties
  • explicit section-before-space / after-space policy

Right now the browser compositor already tries to keep headings with the next block, but the TeX templates do not visibly encode that as a print contract.

4. Improve multi-page table handling

The current preprocess layer converts tables to tabular, tabularx, and adjustbox wrappers, but not longtable:

  • preprocess-tex.tex

Missing behavior:

  • repeated headers
  • proper page breaks inside long tables
  • row-integrity rules

5. Add print-grade float controls instead of one-off fixes

The TeX path should use explicit float tuning rather than scattered coercions.

Examples worth considering:

  • placeins
  • needspace
  • floatpagefraction
  • topfraction
  • bottomfraction
  • textfraction

Browser compositor improvements

These are the places where the browser path is still fundamentally below LaTeX.

1. Real line-fragment-aware composition

The current paginator still breaks at block boundaries because it does not do true fragment reflow:

  • toggleMarkdown.js

If the goal is LaTeX-like page quality, the compositor eventually needs:

  • legal breakpoints within paragraphs
  • fragment continuation across pages
  • paragraph layout that is aware of the next page

1a. Correct list ownership and list-fragment modeling

Before more advanced typography work, the browser compositor needs a real list model.

At minimum it needs:

  • explicit detection of where a list ends structurally, not just visually
  • no-break ownership that stops at the last list item rather than leaking into following siblings
  • legal breakpoints between list items when a whole list does not fit
  • keep-with-next rules for nested lists and list-item continuations

Until that exists, list-heavy pages will keep producing artificial white space and false half-page breaks even if paragraph scoring improves.
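A minimal version of such a list model, derived from list items rather than from container geometry, might look like this. The names and shapes are hypothetical, and nested lists and list-item continuations are not handled.

```javascript
// list: { items: [{ top, bottom }] } in root-relative coordinates.
function listBreakModel(list) {
  const items = list.items;
  if (items.length === 0) return { noBreak: [], breakpoints: [], end: null };
  return {
    // Each item stays whole on its own.
    noBreak: items.map(({ top, bottom }) => ({ top, bottom })),
    // A break is legal between consecutive items.
    breakpoints: items.slice(1).map((item) => item.top),
    // Ownership ends at the last item, so following siblings are never captured.
    end: items[items.length - 1].bottom,
  };
}

const model = listBreakModel({
  items: [
    { top: 0, bottom: 40 },
    { top: 40, bottom: 90 },
    { top: 90, bottom: 120 },
  ],
});
console.log(model);
```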

2. Hyphenation-aware line breaking

The roadmap already flags this as missing.

Without hyphenation dictionaries and a soft-hyphen pipeline, dense justified text will always look worse than LaTeX.

3. Better justification using badness, not just local penalties

This is also already called out in the roadmap.

The browser compositor should move toward:

  • paragraph-level badness scoring
  • demerits across adjacent lines
  • penalties for visually poor spacing, not just page-fill distance

4. Real float placement

The roadmap is explicit here too.

Current figure handling is still a local keep-together or shrink decision using page-fit policy:

  • policy parsing in toggleMarkdown.js
  • shrink/page-fit logic around figures in toggleMarkdown.js

What is missing is a real deferred float queue with top/bottom/page placement.

There is now a narrower browser-side step in place: markdown image/table placement hints (place, placement, float, pos, position) feed top/bottom/page break preferences in the paginator. In the live compose root, top and page-float figures can now survive all the way into committed page starts instead of being trimmed back into generic block breaks. Queue-feasible bottom floats can now pull an earlier prep break so they land materially lower in the target page instead of snapping to the top of it.

Placement with bottom is still only best-effort in mixed cases: when an earlier competing float decision saturates the available page sequence, the browser path still degrades to ordinary source-order deferral rather than running a real float queue.

The browser runtime now also exposes a float queue report through the __qlNowtypePdfSmoke.getPaginationState() contract. That report records each queued figure/table float, whether it was honored or deferred, and a simple blocked-by-earlier-float attribution for the common mixed-case failure mode.

There is now one bounded queue-arbitration step on top of that reporting: if a page starts with an honored top/page-float, the paginator can hand off early after that float so a following bottom float lands low on the next page, and then prefer a break after that bottom float before looking ahead to a later float page. This is enough to satisfy the common top -> bottom -> page-float sequence in the local harness. It is still not a full global float queue across arbitrary competing floats.

5. Footnote-aware page building

The roadmap still lists per-page footnote placement as incomplete.

LaTeX does not bolt footnotes on after pagination. It inserts them into page building. The browser path should move toward that model.

The browser runtime now takes a partial step: it estimates per-page note load during pagination, penalizes breaks that would overflow the margin-note lane, and records per-page noteLoad / noteLoadSummary diagnostics in the smoke API. It now also commits note ownership to specific later pages when the original lane overflows, so screen/print preview can render carried notes from pagination state instead of re-collecting them from the clipped DOM. The live editor path also now preserves source [^id] refs in the canonical markdown buffer during nt_changed sync and resolves misparsed a.previewcite footnote anchors back through source footnote definitions for margin-note planning.

Spread/print page shells and single-page viewport mode can now promote committed footnotes into a real bottom deck that reduces the visible content window on those pages, but the live editor DOM still lacks a native .footnotes structure and pagination still does not jointly solve page breaks with footnote height the way TeX does. So this is a meaningful compositor step, not full TeX-grade footnote/page construction yet.

6. Margin-note collision avoidance

This is also explicitly still in progress.

The current system has a margin-notes lane, but it is not yet equivalent to a true note-placement engine with collision handling and stable spill behavior. The current improvement is bounded: it can steer page breaks away from predicted note overload, commit spill-forward note ownership, and report overflow risk, but it still does not perform LaTeX-style note allocation within page building.

7. Long-table pagination

This is still missing according to the roadmap.

This is a structural gap, not a cosmetic one.

Algorithms LaTeX Uses That Are Worth Copying

These are the main ideas worth borrowing if the browser compositor is supposed to feel LaTeX-like.

Knuth-Plass line breaking

Use paragraph-level optimization with badness and demerits instead of purely local line fitting.

Why:

  • reduces rivers
  • balances adjacent lines
  • gives more stable typography under small edits
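The badness and demerits quantities can be sketched per line as below. The formulas follow the usual TeX-style shape (badness roughly 100·|r|³ capped at 10000); the function names and the default line penalty of 10 are choices made for this example, and the extra demerits TeX adds for visually incompatible adjacent lines are left out.

```javascript
// Adjustment ratio r: how far the line's glue must stretch (r > 0) or shrink (r < 0).
function adjustmentRatio(naturalWidth, targetWidth, stretch, shrink) {
  const delta = targetWidth - naturalWidth;
  if (delta === 0) return 0;
  const capacity = delta > 0 ? stretch : shrink;
  if (capacity > 0) return delta / capacity;
  return delta > 0 ? Infinity : -Infinity;
}

// TeX-style badness: roughly 100 * |r|^3, capped at 10000.
function badness(ratio) {
  if (ratio < -1) return Infinity; // overfull: glue cannot shrink that far
  if (!Number.isFinite(ratio)) return 10000;
  return Math.min(10000, Math.round(100 * Math.abs(ratio) ** 3));
}

// Demerits for one line; `penalty` is the cost attached to the chosen breakpoint.
function demerits(lineBadness, penalty, linePenalty = 10) {
  const base = (linePenalty + lineBadness) ** 2;
  if (penalty >= 0) return base + penalty ** 2;
  if (penalty > -10000) return base - penalty ** 2;
  return base; // forced break
}

const ratio = adjustmentRatio(380, 400, 20, 10); // line is 20 units short, 20 units of stretch
console.log(ratio, badness(ratio), demerits(badness(ratio), 50));
```

A paragraph-level optimizer then picks the set of breakpoints that minimizes the sum of demerits over all lines, rather than fitting each line greedily.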

Liang hyphenation

Use language-specific hyphenation patterns as part of line breaking.

Why:

  • reduces spacing distortion in justified paragraphs
  • avoids overfull lines and ugly word spacing
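The pattern-matching core of Liang's method is small enough to sketch. The pattern list below is a toy subset that only covers the word "hyphenation"; a real pipeline would load a full per-language pattern file. The function names and the leftMin/rightMin defaults are choices for this example.

```javascript
// "hy3ph" -> { text: "hyph", digits: [0, 0, 3, 0, 0] }: digits[j] is the value before letter j.
function parsePattern(pattern) {
  const letters = [];
  const digits = [0];
  for (const ch of pattern) {
    if (ch >= "0" && ch <= "9") {
      digits[digits.length - 1] = Number(ch);
    } else {
      letters.push(ch);
      digits.push(0);
    }
  }
  return { text: letters.join(""), digits };
}

function hyphenate(word, patterns, leftMin = 2, rightMin = 3) {
  const dotted = `.${word.toLowerCase()}.`;
  const points = new Array(dotted.length + 1).fill(0);
  for (const { text, digits } of patterns) {
    let at = dotted.indexOf(text);
    while (at !== -1) {
      // Keep the highest value seen at each inter-letter position.
      digits.forEach((d, j) => {
        points[at + j] = Math.max(points[at + j], d);
      });
      at = dotted.indexOf(text, at + 1);
    }
  }
  const parts = [];
  let start = 0;
  for (let i = leftMin; i <= word.length - rightMin; i++) {
    // points[i + 1] is the gap before word[i]; the leading dot shifts indices by one.
    if (points[i + 1] % 2 === 1) {
      parts.push(word.slice(start, i));
      start = i;
    }
  }
  parts.push(word.slice(start));
  return parts;
}

// Toy pattern subset (assumption: enough for this one word only).
const patterns = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"]
  .map(parsePattern);

const parts = hyphenate("hyphenation", patterns);
console.log(parts.join("-"));      // hy-phen-ation
console.log(parts.join("\u00AD")); // same word with soft hyphens for the browser
```

Odd values permit a break and even values forbid one, which is why the 4 from "hena4" overrides the 1 from "1tio" between "a" and "t". Joining with U+00AD is one way to feed the result into a soft-hyphen pipeline.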

TeX page building with penalties

Model pages using:

  • legal breakpoints
  • penalties
  • keep-with-next behavior
  • widow/orphan control
  • footnote cost
  • float cost

Why:

  • page quality is not just “closest break to target height”
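A sketch of a page cost that folds penalties and footnote and float height into one decision. The candidate shape, the function names, and the simple "slack plus penalty" cost are assumptions for illustration; TeX's own page builder uses a different cost formula.

```javascript
const INFINITE_PENALTY = 10000;

// candidate: { y, penalty, noteHeight, floatHeight }
function pageCost(candidate, pageStart, pageHeight) {
  if (candidate.penalty >= INFINITE_PENALTY) return Infinity; // forbidden break
  const used = candidate.y - pageStart + candidate.noteHeight + candidate.floatHeight;
  if (used > pageHeight) return Infinity; // text plus inserts overflow the page
  return pageHeight - used + candidate.penalty;
}

function bestBreak(candidates, pageStart, pageHeight) {
  let best = null;
  let bestCost = Infinity;
  for (const candidate of candidates) {
    const cost = pageCost(candidate, pageStart, pageHeight);
    if (cost < bestCost) {
      best = candidate;
      bestCost = cost;
    }
  }
  return best ? { ...best, cost: bestCost } : null;
}

const pageBreak = bestBreak(
  [
    { y: 860, penalty: 0, noteHeight: 0, floatHeight: 0 },
    { y: 940, penalty: 150, noteHeight: 0, floatHeight: 0 }, // would leave a widow
    { y: 980, penalty: 0, noteHeight: 60, floatHeight: 0 },  // footnotes no longer fit
  ],
  0,
  1000
);
console.log(pageBreak);
```

The break at 940 fills the page better than 860, but its widow penalty makes it more expensive, and the break at 980 is ruled out because its footnotes would not fit.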

Deferred float queues

Model figures and tables as float candidates with:

  • inline
  • top
  • bottom
  • float page
  • margin float

Why:

  • this is how LaTeX avoids figure/title collisions without forcing every figure exactly where it first appears
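The deferral idea can be sketched as a first-in, first-out queue. This is a simplification: it has one float area per page with a fixed capacity, it assumes floats arrive in source order, and the names are hypothetical. It does not model top/bottom/page classes or the browser runtime's float queue report.

```javascript
// floats: [{ id, height, anchorPage }] in source order (anchorPage is where the callout sits).
function placeFloats(floats, pageCount, floatCapacityPerPage) {
  const placed = [];
  const queue = [];
  let next = 0;
  for (let page = 0; page < pageCount; page++) {
    let remaining = floatCapacityPerPage;
    // Floats whose callout is on or before this page join the queue.
    while (next < floats.length && floats[next].anchorPage <= page) {
      queue.push(floats[next]);
      next += 1;
    }
    // Place from the head only, so a float never overtakes an earlier one.
    while (queue.length > 0 && queue[0].height <= remaining) {
      const float = queue.shift();
      remaining -= float.height;
      placed.push({ id: float.id, page, deferred: page > float.anchorPage });
    }
  }
  const unplaced = [...queue, ...floats.slice(next)].map((float) => float.id);
  return { placed, unplaced };
}

const result = placeFloats(
  [
    { id: "fig-a", height: 300, anchorPage: 0 },
    { id: "fig-b", height: 400, anchorPage: 0 },
    { id: "tab-c", height: 200, anchorPage: 1 },
  ],
  3,
  500
);
console.log(result);
```

Here fig-b does not fit next to fig-a and is deferred to the next page, which in turn pushes tab-c one page later. That is the "blocked by an earlier float" situation in its simplest form.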

Microtypography

Approximate:

  • protrusion
  • font expansion
  • small tracking adjustments

Why:

  • this is a major part of why LaTeX PDFs look calmer and denser than naive browser-justified text

What Should Stay Delegated To LaTeX

Do not reimplement these in the TeX path unless there is a very strong reason.

  • standard font kerning and ligatures
  • paragraph line breaking
  • normal float placement
  • page building
  • bibliography formatting

The TeX path should mostly improve by giving LaTeX better structure and fewer forced placements, not by replacing LaTeX’s own layout logic.

What The Browser Path Must Reimplement

If the browser PDF surface is meant to be genuinely LaTeX-like, it eventually must own approximations of:

  • hyphenation-aware line breaking
  • paragraph badness/demerit scoring
  • penalty-based page breaking
  • float placement policy
  • footnote and margin-note placement
  • long-table pagination

That is the real gap.

First: TeX path cleanup

  1. Stop forcing [H] except when explicitly requested.
  2. Add microtype.
  3. Add proper long-table support.
  4. Add heading keep-with-next / needspace policy.

Second: browser compositor quality

  1. Fix list ownership and false no-break capture around bullet lists.
  2. Finish no-line-split guarantees.
  3. Add hyphenation dictionaries.
  4. Add better paragraph badness scoring.
  5. Introduce real float classes and float queues.
  6. Move footnotes and margin notes into page building.

Third: convergence

Once both paths are cleaner, align:

  • counters
  • references
  • figure numbering
  • caption behavior
  • page semantics

The goal should be:

  • TeX path for final print authority
  • browser PDF path for faithful editing and preview

not two unrelated layout systems.