PDF Generator Gap Analysis
Last updated: 2026-05-27
This document answers one narrow question:
- what is still missing if the goal is “LaTeX-like PDF quality”?
It is written against the current code, not against an ideal future system.
Related roadmap:
Two Different PDF Systems
The repo currently has two different PDF-related paths.
1. TeX-emission path
This path emits TeX and lets LaTeX do composition.
Relevant files:
- layouts/_default/baseof.tex
- layouts/_default/_markup/render-image.tex
- layouts/partials/preprocess-tex.tex
2. Browser PDF compositor
This path has two related but distinct browser implementations:
- the editor/Nowtype PDF preview path paginates one rendered DOM tree and displays PDF-style pages for editing and preview;
- the CLI book export path assembles Hugo-rendered chapter bodies in Chromium, normalizes/mutates the DOM for print, runs page preparation diagnostics, and writes a PDF through Playwright/Chromium.
Pagination now has a small shared layout-metrics layer in
toggleMarkdown.js:
collectNowtypePdfLayoutMetrics() measures block, heading, section, and content
geometry root-relative before pagination. The paginator, rendered ToC, and smoke
diagnostics should consume that metrics bundle instead of reading offsetTop
or ad-hoc DOM rectangles independently. That invariant exists because composed
Nowtype/Hugo documents can contain nested offset parents; treating raw
offsetTop as document-global caused long chapters to collapse into too few
browser-PDF pages even though the Markdown content was present.
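The difference between a raw offsetTop and a root-relative position can be illustrated with a minimal sketch. rootRelativeTop is a hypothetical helper, not the actual collectNowtypePdfLayoutMetrics() implementation, and it assumes the compose root is itself an offset parent of the measured nodes. Plain objects stand in for DOM elements so the example runs outside a browser.

```javascript
// Hypothetical helper: accumulate offsetTop through the offsetParent chain
// until the compose root is reached.
function rootRelativeTop(node, root) {
  let top = 0;
  let current = node;
  while (current && current !== root) {
    top += current.offsetTop;
    current = current.offsetParent;
  }
  if (current !== root) {
    throw new Error("node is not positioned inside root");
  }
  return top;
}

// Plain objects stand in for DOM elements (assumed shapes, for illustration).
const composeRoot = { offsetTop: 0, offsetParent: null };
const chapterWrapper = { offsetTop: 4200, offsetParent: composeRoot };
const paragraph = { offsetTop: 120, offsetParent: chapterWrapper };

console.log(paragraph.offsetTop);                      // 120, local to the chapter wrapper
console.log(rootRelativeTop(paragraph, composeRoot));  // 4320, document-global
```

Reading 120 as a document-global position would place this paragraph near the top of the book, which is the kind of collapse described above.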
Relevant files:
- cdn/custom/toggleMarkdown.js
- LaTeX roadmap
- scripts/nowtype-pdf-smoke.js
The browser compositor also has a book-scope preview path for local draft workspaces. For large books it falls back to fetching Hugo-rendered page bodies instead of reparsing one very large Markdown composite. That fallback should keep the CLI thesis builder as the semantic reference: root frontmatter supplies the formal cover metadata, top-level section indexes are Parts, and child leaf pages are Chapters. Browser-only details such as live selection projection must remain presentation helpers; they must not become document structure.
The book exporter now accepts --document-kind thesis|article. thesis keeps
the formal title/frontmatter/recto behavior. article keeps the same Hugo
rendering and pagination machinery but uses a compact first-page title block
and continuous section flow, so journal-style article templates can be built on
the same browser compositor rather than on a separate export path.
Generated frontmatter pages such as the table of contents, object lists,
glossary, acknowledgements, and part dividers intentionally use the same inner
text box as normal chapter body text; only the title page has special page
geometry. Figure and table lists are sorted by their resolved book numbers, and
caption extraction falls back to Hugo’s print metadata when malformed generated
markdown nests one figure inside another caption. The simple glossary-derived
locator index is opt-in
(generatedIndex: true in content/_index.md or --generated-index on)
because a real thesis index should be authored from topic markers rather than
mechanically repeating glossary and nomenclature entries. Top-level
bibliography/ or references/ content is classified as unnumbered backmatter
during manifest construction, so it remains in the book order and recto plan
without consuming a chapter number.
Acknowledgements and generated frontmatter pages are registered as frontmatter
entries with uppercase Roman labels. They emit section probes just like body
chapters, so the final ToC can include them with resolved page numbers after the
second pass.
When the probe PDF overestimates a late backmatter start, the final stamping
pass reconciles out-of-range backmatter section starts against the rendered PDF
text before drawing running heads.
The same final text extraction is used to validate planned blank versos before
header/footer stamping, so a page that drifted from “blank” in the probe plan to
real chapter content in the final PDF is still stamped and becomes the chapter
start for running-head purposes. Section-level running-head probes on the same
page as a chapter opener are suppressed so the opener keeps the chapter mark.
For iteration, the book exporter also accepts --fast (alias
--single-pass). This does one browser composition and one PDF write, skipping
the probe PDF, recto blank-page correction, and resolved generated-list page
numbers. Use the default two-pass path for final output.
Generated glossary content can be grounded by a site-owned definitions document at one of:
- data/thesis-definitions.json
- data/thesis-definitions.yaml
- data/print-definitions.json
- data/print-definitions.yaml
The schema is intentionally small:
    acronyms:
      TRSB: Time-reversal symmetry breaking
    terms:
      loop supercurrent: A superconducting state with internal phase winding.
    symbols:
      Delta: Superconducting gap amplitude.

During export, detected acronyms, emphasised terms, and common math symbols
that are not defined there are printed as [book:warn] undefined ... lines and
also included in diagnostics JSON.
Nomenclature labels keep their canonical LaTeX where possible: detected symbols
reuse the KaTeX markup already emitted by Hugo, and definition aliases such as
alpha/α or Gamma/Γ are canonicalised before rendering so duplicate
definition rows collapse to one mathematical label. If the optional generated
index is enabled, symbol-index labels use the same rendering path.
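The alias canonicalisation described above can be sketched as follows. The alias table, the function name canonicaliseSymbols, and the "first definition wins" rule are assumptions made for illustration; the real exporter may resolve duplicates differently.

```javascript
// Hypothetical alias table; a real one would cover the full symbol set.
const SYMBOL_ALIASES = new Map([
  ["alpha", "\\alpha"], ["α", "\\alpha"],
  ["Gamma", "\\Gamma"], ["Γ", "\\Gamma"],
  ["Delta", "\\Delta"], ["Δ", "\\Delta"],
]);

// Collapse definition rows that refer to the same mathematical label.
function canonicaliseSymbols(definitions) {
  const rows = new Map();
  for (const [label, text] of Object.entries(definitions)) {
    const trimmed = label.trim();
    const key = SYMBOL_ALIASES.get(trimmed) || trimmed;
    // Assumption: the first definition encountered wins.
    if (!rows.has(key)) rows.set(key, text);
  }
  return rows;
}

const rows = canonicaliseSymbols({
  Delta: "Superconducting gap amplitude.",
  "Δ": "Gap amplitude (duplicate row).",
  alpha: "Coupling constant.",
});
console.log([...rows.keys()]); // [ '\\Delta', '\\alpha' ]
```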
Equation numbering is supplied by Hugo before the browser compositor sees the
page. The shared eq/chapter-number partial follows the same semantic book
structure as the browser manifest: top-level sections with child pages are
Parts, their ordered child pages are Chapters, and top-level sections without
child pages are standalone Chapters. It must not depend on numeric directory
prefixes, because current thesis sites use semantic slugs plus front matter
weights for ordering.
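A minimal sketch of that structural rule, assuming a manifest of plain objects with slug, weight, and optional children. The shape, the function name numberChapters, and continuous chapter numbering across Parts are all assumptions for illustration. Top-level bibliography/ and references/ sections are treated as unnumbered backmatter, as described earlier.

```javascript
const BACKMATTER_SLUGS = new Set(["bibliography", "references"]);

// Hypothetical manifest shape: { slug, weight, children: [...] }.
function numberChapters(topLevelSections) {
  const byWeight = (a, b) => a.weight - b.weight;
  const result = [];
  let part = 0;
  let chapter = 0;
  for (const section of [...topLevelSections].sort(byWeight)) {
    if (BACKMATTER_SLUGS.has(section.slug)) {
      // Stays in book order but consumes no chapter number.
      result.push({ slug: section.slug, kind: "backmatter", number: null });
      continue;
    }
    const children = [...(section.children || [])].sort(byWeight);
    if (children.length > 0) {
      part += 1;
      result.push({ slug: section.slug, kind: "part", number: part });
      for (const child of children) {
        chapter += 1; // assumption: chapters number continuously across Parts
        result.push({ slug: child.slug, kind: "chapter", number: chapter });
      }
    } else {
      chapter += 1;
      result.push({ slug: section.slug, kind: "chapter", number: chapter });
    }
  }
  return result;
}

const numbered = numberChapters([
  { slug: "methods", weight: 20, children: [
    { slug: "setup", weight: 2 },
    { slug: "model", weight: 1 },
  ] },
  { slug: "introduction", weight: 10 },
  { slug: "references", weight: 90 },
]);
console.log(numbered.map((e) => `${e.kind}:${e.slug}:${e.number}`));
```

Ordering comes only from weights, so renaming a directory does not change any number.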
Citation links are also normalized during book assembly. Per-chapter pages emit
numeric #bibreference-N links, but the browser book export removes the
chapter-local reference tails and keeps a global bibliography page whose entries
use #bibentry-key IDs. The assembler rewrites citation anchors to those global
entry IDs before Chromium writes the PDF, so citation numbers remain clickable.
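The anchor rewrite can be sketched as a string transformation. This is an illustration only: the function name rewriteCitationHrefs and the chapterRefs lookup (chapter-local reference number to citation key) are hypothetical, and the real assembler most likely operates on DOM nodes rather than on serialized HTML.

```javascript
// Hypothetical: chapterRefs maps chapter-local reference numbers to citation keys.
function rewriteCitationHrefs(html, chapterRefs) {
  return html.replace(/href="#bibreference-(\d+)"/g, (match, n) => {
    const key = chapterRefs[Number(n)];
    // Unknown references are left untouched so they can be reported later.
    return key ? `href="#bibentry-${key}"` : match;
  });
}

const chapterHtml = 'See <a href="#bibreference-2">[2]</a> and <a href="#bibreference-9">[9]</a>.';
console.log(rewriteCitationHrefs(chapterHtml, { 2: "knuth1981" }));
// See <a href="#bibentry-knuth1981">[2]</a> and <a href="#bibreference-9">[9]</a>.
```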
The browser book exporter also exposes a small Hugo-first authoring contract for
LaTeX-like print intent. The render hooks emit data-ql-print-* metadata for
figures, tables, equations, citations, and local object references; the browser
page planner reads those attributes before using any DOM heuristics. This keeps
counter/reference authority in Hugo while leaving browser-only measurement and
pagination in JavaScript.
Print metadata values are text-only attribute payloads. Do not pass rendered
caption HTML through data-ql-print-*; figure captions may contain citations or
KaTeX shortcodes that resolve after the image render hook and can otherwise
break the surrounding HTML attribute.
Figure numbering is also Hugo-owned: the image render hook asks the shared
chapter-label partial for the current chapter or appendix label, then emits
labels such as 1.1 or A.1 before the browser pass runs. Generated object
lists may clone sanitized caption HTML from the rendered DOM so inline KaTeX is
preserved there, but they must not recover semantic numbering by re-parsing
captions. Wrapper-based figures should pass the same print metadata when it is
available; for legacy wrapper output that has only a local or missing figure
number, the browser pass normalizes the final label from the owning chapter
section and DOM-local figure order.
The authored surface is:
- figures and tables can use data-float="here|top|bottom|page" or classes such as figure-here, float-top, float-bottom, and page-float
- side figures can use .left and .right; those are treated as authored placements and are not moved by the deferred float filler
- tables can use long-table/table-long, keep-table/table-keep, landscape-table/sidewaystable, and decimal-align/siunitx; numeric columns are also detected automatically for tabular-number alignment
- equations are numbered by Hugo with chapter labels for normal chapters and alphabetic labels for appendices before the browser pass stamps references
- generated diagnostics report table policy counts, unresolved page references, accessibility warnings, missing image alt text, duplicate IDs, heading-level skips, math accessibility coverage, and overfull/underfull layout warnings
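As a sketch of how the figure and table placement hints above might be resolved into one value: the function name resolveFloatPlacement, the precedence of data-float over classes, and the "auto" fallback are assumptions, not the exporter's confirmed behavior.

```javascript
const PLACEMENT_CLASSES = new Map([
  ["figure-here", "here"],
  ["float-top", "top"],
  ["float-bottom", "bottom"],
  ["page-float", "page"],
]);
const VALID_PLACEMENTS = new Set(["here", "top", "bottom", "page"]);

// Hypothetical resolver: data-float wins, then classes, then an "auto" fallback.
function resolveFloatPlacement({ dataFloat = "", classes = [] } = {}) {
  const explicit = dataFloat.trim().toLowerCase();
  if (VALID_PLACEMENTS.has(explicit)) {
    return { placement: explicit, authored: true };
  }
  for (const name of classes) {
    if (PLACEMENT_CLASSES.has(name)) {
      return { placement: PLACEMENT_CLASSES.get(name), authored: true };
    }
    if (name === "left" || name === "right") {
      // Side figures are authored placements; a float filler should not move them.
      return { placement: name, authored: true };
    }
  }
  return { placement: "auto", authored: false };
}

console.log(resolveFloatPlacement({ dataFloat: "top", classes: ["figure-here"] }));
// { placement: 'top', authored: true }
console.log(resolveFloatPlacement({ classes: ["wide", "right"] }));
// { placement: 'right', authored: true }
```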
The final export requests Chromium tagged-PDF and outline generation, and the
stamping pass sets PDF title, author, subject, keywords, producer, creator,
page-label metadata, and the outline page mode. Each run writes an adjacent
.build.json manifest with command, Git, Node, Playwright/Chromium, and summary
layout diagnostics so exported PDFs can be reproduced or audited later.
These two paths should not be evaluated the same way. The TeX path already gets most of LaTeX’s layout algorithms “for free”. The browser compositor does not.
Short Answer
- The TeX path does not need a custom kerning engine.
- The TeX path does need better use of LaTeX’s existing float and typography machinery.
- The browser compositor is still missing several core LaTeX-class composition features:
  - hyphenation-aware line breaking
  - badness/demerit-based justification
  - real float placement
  - footnote-aware page building
  - margin-note collision management
  - long-table pagination
  - structurally correct list pagination
Current State By Path
TeX path: what it already has
The TeX templates already use standard LaTeX figure and wrapfigure environments.
- packages are loaded in baseof.tex
- images normally emit wrapfigure, figure*, or figure with [htbp] in render-image.tex
That means the TeX path already relies on:
- TeX paragraph line breaking
- TeX page building
- normal float placement
- engine-level OpenType shaping and standard font kerning
TeX path: what is currently getting in its own way
The preprocess layer rewrites many figures and tables into forced [H] floats.
Examples:
- HTML figure conversion in preprocess-tex.tex
- markdown image conversion in preprocess-tex.tex
- table wrapping in preprocess-tex.tex
That is the opposite of “intelligent figure placement”. It prevents LaTeX from doing the thing it is good at.
Browser compositor: what it currently does
The browser compositor uses a custom break-candidate and scoring system in
computeNowtypePdfPageBreaks().
It currently:
- measures logical content height
- finds block-top and block-bottom break candidates
- marks no-break ranges for paragraphs, headings, figures, and tables
- marks whole list containers as keep-together ranges from DOM geometry
- applies local penalties around target page height
- keeps figure/table blocks together where possible
- has best-effort widow/orphan scoring
The code explicitly says it is still clipping one logical DOM and therefore breaking at block boundaries rather than doing true fragment reflow:
toggleMarkdown.js
This is already much better than naive pixel slicing, but it is still not TeX-class composition.
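The block-boundary approach can be illustrated with a deliberately small sketch. It is not the code in computeNowtypePdfPageBreaks(): the function names, the underfill-only penalty, and the hard-clip fallback are assumptions, and widow/orphan scoring is omitted.

```javascript
// Pick the legal candidate that leaves the least unused space on the page.
function chooseBreak(pageStart, targetHeight, candidates, noBreakRanges) {
  const limit = pageStart + targetHeight;
  let best = null;
  for (const y of candidates) {
    if (y <= pageStart || y > limit) continue;
    // A candidate strictly inside a no-break range would split a kept block.
    const insideKeep = noBreakRanges.some(([top, bottom]) => y > top && y < bottom);
    if (insideKeep) continue;
    const underfill = limit - y;
    if (best === null || underfill < best.underfill) best = { y, underfill };
  }
  // Assumption: with no legal candidate, clip at the page height.
  return best ? best.y : limit;
}

function paginate(totalHeight, targetHeight, candidates, noBreakRanges) {
  const breaks = [];
  let start = 0;
  while (totalHeight - start > targetHeight) {
    start = chooseBreak(start, targetHeight, candidates, noBreakRanges);
    breaks.push(start);
  }
  return breaks;
}

const blockEdges = [300, 700, 950, 1400, 1800, 2100, 2500];
const keepTogether = [[900, 1400]]; // e.g. a figure that must not be split
console.log(paginate(2500, 1000, blockEdges, keepTogether)); // [ 700, 1400, 2100 ]
```

The first page ends at 700 rather than 950 because 950 falls inside the kept range. A fragment-aware compositor could instead break inside the preceding paragraph and fill the page.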
Browser compositor: one concrete current bug class
List pagination is still structurally wrong in some cases.
The paginator currently scans both list containers and list items as block boundaries:
toggleMarkdown.js
and it treats UL / OL containers as whole no-break ranges:
toggleMarkdown.js
That means the browser paginator can infer an over-large unbreakable region from container geometry instead of from the real list structure. In practice this can produce exactly the bug we have already seen:
- a bullet list captures a following paragraph that is not part of the list
- the paginator then refuses otherwise legal page breaks
- the page underfills and turns into an apparent half page
This is not just a scoring issue. It is a structural ownership bug in how list ranges are inferred.
The same ownership rule also applies to heading keep-with-next logic. If a
heading is followed by a ul / ol, the keep range must stop at the first
li, not at the list container bottom, or the paginator recreates whole-list
overcapture through the heading rule instead of through the list rule.
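The ownership rule can be sketched on measured blocks. The block shape ({ type, top, bottom, items }) and the function name keepRanges are hypothetical; the point is only that list ranges come from the items, never from the container box, and that a heading's keep range ends at the first item.

```javascript
// Hypothetical block shape: { type, top, bottom, items? }, where items are the
// measured <li> children of a list container.
function keepRanges(blocks) {
  const isList = (block) => block.type === "ul" || block.type === "ol";
  const ranges = [];
  blocks.forEach((block, i) => {
    if (isList(block)) {
      // Own each item separately; the container geometry is ignored.
      for (const item of block.items) ranges.push([item.top, item.bottom]);
    } else if (block.type === "heading") {
      const next = blocks[i + 1];
      if (!next) {
        ranges.push([block.top, block.bottom]);
        return;
      }
      // Keep-with-next stops at the first <li>, not at the list bottom.
      const anchor = isList(next) && next.items.length > 0 ? next.items[0] : next;
      ranges.push([block.top, anchor.bottom]);
    } else {
      ranges.push([block.top, block.bottom]);
    }
  });
  return ranges;
}

const measured = [
  { type: "heading", top: 0, bottom: 40 },
  { type: "ul", top: 40, bottom: 640, items: [
    { top: 40, bottom: 140 },
    { top: 140, bottom: 400 },
    { top: 400, bottom: 600 },
  ] },
  { type: "paragraph", top: 640, bottom: 800 },
];
console.log(keepRanges(measured));
// [ [ 0, 140 ], [ 40, 140 ], [ 140, 400 ], [ 400, 600 ], [ 640, 800 ] ]
```

No range spans 40 to 640, so breaks at 140, 400, and 600 stay legal and the following paragraph is never captured by the list.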
Missing Features Compared With LaTeX
TeX path improvements
These are the things the TeX path should improve first.
1. Stop forcing normal figures and tables to [H]
This is the biggest current problem.
Use [htbp] or a configurable float policy by default, and reserve [H] for
explicit author intent or exceptional cases only.
Why:
- LaTeX’s float queue is specifically designed to avoid collisions and bad page breaks
- [H] disables that machinery
2. Add microtypography
The current base template loads fontspec and unicode-math, but not
microtype:
baseof.tex
That means the TeX path is currently missing:
- protrusion
- font expansion
- some spacing refinements that reduce overfull lines and visible rivers
This matters more than hand-implementing kerning.
3. Add stronger keep-with-next controls for headings
The TeX path should use structural controls such as:
- needspace
- heading penalties
- explicit section-before-space / after-space policy
Right now the browser compositor already tries to keep headings with the next block, but the TeX templates do not visibly encode that as a print contract.
4. Improve multi-page table handling
The current preprocess layer converts tables to tabular, tabularx, and
adjustbox wrappers, but not longtable:
preprocess-tex.tex
Missing behavior:
- repeated headers
- proper page breaks inside long tables
- row-integrity rules
5. Add print-grade float controls instead of one-off fixes
The TeX path should use explicit float tuning rather than scattered coercions.
Examples worth considering:
- placeins
- needspace
- floatpagefraction
- topfraction
- bottomfraction
- textfraction
Browser compositor improvements
These are the places where the browser path is still fundamentally below LaTeX.
1. Real line-fragment-aware composition
The current paginator still breaks at block boundaries because it does not do true fragment reflow:
toggleMarkdown.js
If the goal is LaTeX-like page quality, the compositor eventually needs:
- legal breakpoints within paragraphs
- fragment continuation across pages
- paragraph layout that is aware of the next page
1a. Correct list ownership and list-fragment modeling
Before more advanced typography work, the browser compositor needs a real list model.
At minimum it needs:
- explicit detection of where a list ends structurally, not just visually
- no-break ownership that stops at the last list item rather than leaking into following siblings
- legal breakpoints between list items when a whole list does not fit
- keep-with-next rules for nested lists and list-item continuations
Until that exists, list-heavy pages will keep producing artificial white space and false half-page breaks even if paragraph scoring improves.
2. Hyphenation-aware line breaking
The roadmap already flags this as missing:
Without hyphenation dictionaries and a soft-hyphen pipeline, dense justified text will always look worse than LaTeX.
3. Better justification using badness, not just local penalties
Also already called out in the roadmap:
The browser compositor should move toward:
- paragraph-level badness scoring
- demerits across adjacent lines
- penalties for visually poor spacing, not just page-fill distance
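For reference, TeX-style badness and demerits are simple to compute once a line's natural width, stretch, and shrink are known. The sketch below follows the commonly documented formulas (badness roughly 100 times the cube of the adjustment ratio, capped at 10000; demerits built from the line penalty, badness, and break penalty). The function names are illustrative, and the extra demerits TeX adds for adjacent lines of very different tightness are left out.

```javascript
// How much of the available stretch (positive) or shrink (negative) a line uses.
function adjustmentRatio(naturalWidth, targetWidth, stretch, shrink) {
  const gap = targetWidth - naturalWidth;
  if (gap === 0) return 0;
  const flexibility = gap > 0 ? stretch : shrink;
  if (flexibility > 0) return gap / flexibility;
  return gap > 0 ? Infinity : -Infinity;
}

// Roughly 100 * |r|^3, capped at 10000. A line cannot shrink past r = -1.
function badness(ratio) {
  if (!Number.isFinite(ratio) || ratio < -1) return 10000;
  return Math.min(10000, Math.round(100 * Math.abs(ratio) ** 3));
}

// Line demerits; linePenalty defaults to 10 as in plain TeX.
function demerits(lineBadness, breakPenalty = 0, linePenalty = 10) {
  const base = (linePenalty + lineBadness) ** 2;
  if (breakPenalty >= 0) return base + breakPenalty ** 2;
  if (breakPenalty > -10000) return base - breakPenalty ** 2;
  return base; // forced break
}

const ratio = adjustmentRatio(380, 400, 20, 10); // uses all available stretch
console.log(ratio, badness(ratio), demerits(badness(ratio))); // 1 100 12100
```

A paragraph-level optimizer would sum demerits over candidate line sequences and pick the minimum, rather than fitting each line greedily.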
4. Real float placement
The roadmap is explicit here too:
- float classes missing: LaTeX roadmap
- placement policy missing: LaTeX roadmap
Current figure handling is still a local keep-together or shrink decision using page-fit policy:
- policy parsing in toggleMarkdown.js
- shrink/page-fit logic around figures in toggleMarkdown.js
What is missing is a real deferred float queue with top/bottom/page placement.
There is now a narrower browser-side step in place: markdown image/table
placement hints (place, placement, float, pos, position) feed
top/bottom/page break preferences in the paginator. In the live compose root,
top and page-float figures can now survive all the way into committed page
starts instead of being trimmed back into generic block breaks. Queue-feasible
bottom floats can now pull an earlier prep break so they land materially
lower in the target page instead of snapping to the top of it. bottom
placement is still only best-effort in mixed cases: when an earlier competing
float decision saturates the available page sequence, the browser path still
degrades to ordinary source-order deferral rather than running a real float
queue.
The browser runtime now also exposes a float queue report through the
__qlNowtypePdfSmoke.getPaginationState() contract. That report records each
queued figure/table float, whether it was honored or deferred, and a simple
blocked-by-earlier-float attribution for the common mixed-case failure mode.
There is now one bounded queue-arbitration step on top of that reporting:
if a page starts with an honored top/page-float, the paginator can hand off
early after that float so a following bottom float lands low on the next
page, and then prefer a break after that bottom float before looking ahead to a
later float page. This is enough to satisfy the common top -> bottom -> page-float sequence in the local harness. It is still not a full global float
queue across arbitrary competing floats.
5. Footnote-aware page building
The roadmap still lists per-page footnote placement as incomplete:
LaTeX does not bolt footnotes on after pagination. It inserts them into page
building. The browser path should move toward that model. The browser runtime
now takes a partial step: it estimates per-page note load during pagination,
penalizes breaks that would overflow the margin-note lane, and records per-page
noteLoad / noteLoadSummary diagnostics in the smoke API. It now also
commits note ownership to specific later pages when the original lane
overflows, so screen/print preview can render carried notes from pagination
state instead of re-collecting them from the clipped DOM. The live editor path
also now preserves source [^id] refs in the canonical markdown buffer during
nt_changed sync and resolves misparsed a.previewcite footnote anchors back
through source footnote definitions for margin-note planning. Spread/print page
shells and single-page viewport mode can now promote committed footnotes into a
real bottom deck that reduces the visible content window on those pages, but
the live editor DOM still lacks a native .footnotes structure and pagination
still does not jointly solve page breaks with footnote height the way TeX does.
So this is a meaningful compositor step, not full TeX-grade footnote/page
construction yet.
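The difference between bolting notes on afterwards and charging them during page building can be shown with a greedy sketch. The block shape, the function name buildPages, and the fixed separator height are assumptions for illustration. The browser runtime's note-load estimation is more involved, and TeX additionally allows a long footnote to split across pages.

```javascript
// Hypothetical block shape: { id, height, noteHeights: [...] }.
function buildPages(blocks, pageHeight, noteSeparator = 10) {
  const pages = [];
  let page = { ids: [], body: 0, notes: 0 };
  for (const block of blocks) {
    const blockNotes = block.noteHeights.reduce((sum, h) => sum + h, 0);
    const notes = page.notes + blockNotes;
    const separator = notes > 0 ? noteSeparator : 0;
    // Body text and the footnote deck compete for the same page height.
    const total = page.body + block.height + notes + separator;
    if (total > pageHeight && page.ids.length > 0) {
      pages.push(page);
      page = { ids: [], body: 0, notes: 0 };
    }
    page.ids.push(block.id);
    page.body += block.height;
    page.notes += blockNotes;
  }
  if (page.ids.length > 0) pages.push(page);
  return pages;
}

const content = [
  { id: "a", height: 40, noteHeights: [] },
  { id: "b", height: 30, noteHeights: [15] },
  { id: "c", height: 20, noteHeights: [] },
  { id: "d", height: 30, noteHeights: [20] },
];
console.log(buildPages(content, 100).map((p) => p.ids)); // [ [ 'a', 'b' ], [ 'c', 'd' ] ]
```

Measured by body height alone, blocks a, b, and c total 90 and would share the first page, leaving no room for the note that b carries.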
6. Margin-note collision avoidance
Also explicitly still in progress:
The current system has a margin-notes lane, but it is not yet equivalent to a true note-placement engine with collision handling and stable spill behavior. The current improvement is bounded: it can steer page breaks away from predicted note overload, commit spill-forward note ownership, and report overflow risk, but it still does not perform LaTeX-style note allocation within page building.
7. Long-table pagination
Still missing according to the roadmap:
This is a structural gap, not a cosmetic one.
Algorithms LaTeX Uses That Are Worth Copying
These are the main ideas worth borrowing if the browser compositor is supposed to feel LaTeX-like.
Knuth-Plass line breaking
Use paragraph-level optimization with badness and demerits instead of purely local line fitting.
Why:
- reduces rivers
- balances adjacent lines
- gives more stable typography under small edits
Liang hyphenation
Use language-specific hyphenation patterns as part of line breaking.
Why:
- reduces spacing distortion in justified paragraphs
- avoids overfull lines and ugly word spacing
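A compact sketch of the pattern-matching step in Liang's method: each pattern carries digits between letters, the highest digit at each position wins, and odd values mark permitted breaks. The nine patterns below are the ones commonly used to illustrate the word "hyphenation"; a real pipeline would load a complete pattern file for each language, and the function names here are illustrative.

```javascript
// Split a pattern such as "hy3ph" into its letters and inter-letter scores.
function parsePattern(pattern) {
  const letters = [];
  const scores = [0];
  for (const ch of pattern) {
    if (ch >= "0" && ch <= "9") {
      scores[scores.length - 1] = Number(ch);
    } else {
      letters.push(ch);
      scores.push(0);
    }
  }
  return { text: letters.join(""), scores };
}

function hyphenate(word, patterns, leftMin = 2, rightMin = 2) {
  const padded = `.${word.toLowerCase()}.`; // dots mark the word boundaries
  const scores = new Array(padded.length + 1).fill(0);
  for (const { text, scores: patternScores } of patterns.map(parsePattern)) {
    let at = padded.indexOf(text);
    while (at !== -1) {
      patternScores.forEach((score, k) => {
        scores[at + k] = Math.max(scores[at + k], score);
      });
      at = padded.indexOf(text, at + 1);
    }
  }
  const parts = [];
  let start = 0;
  for (let i = leftMin; i <= word.length - rightMin; i++) {
    // scores[i + 1] is the gap before word[i]; odd values allow a break.
    if (scores[i + 1] % 2 === 1) {
      parts.push(word.slice(start, i));
      start = i;
    }
  }
  parts.push(word.slice(start));
  return parts;
}

const samplePatterns = ["hy3ph", "he2n", "hena4", "hen5at", "1na", "n2at", "1tio", "2io", "o2n"];
console.log(hyphenate("hyphenation", samplePatterns).join("-")); // hy-phen-ation
```

In a browser pipeline the resulting break points could be emitted as soft hyphens before layout, so the line breaker can use them.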
TeX page building with penalties
Model pages using:
- legal breakpoints
- penalties
- keep-with-next behavior
- widow/orphan control
- footnote cost
- float cost
Why:
- page quality is not just “closest break to target height”
Deferred float queues
Model figures and tables as float candidates with:
- inline
- top
- bottom
- float page
- margin float
Why:
- this is how LaTeX avoids figure/title collisions without forcing every figure exactly where it first appears
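A deferred float queue can be sketched in a few dozen lines. This is a simplified model, not the browser runtime's float report or LaTeX's full algorithm: it handles only "here" and "top" placement, the item shape and function name are hypothetical, and the rule that a float may not overtake an earlier deferred float mirrors LaTeX's ordering guarantee within a float class.

```javascript
// Hypothetical item shape: { id, kind: "text" | "float", height }.
function layoutWithFloatQueue(items, pageHeight) {
  const pages = [];
  const queue = []; // deferred floats, kept in source order
  let page = { used: 0, content: [] };

  const startNewPage = () => {
    pages.push(page);
    page = { used: 0, content: [] };
    // Deferred floats get first claim on the new page. An oversized float is
    // placed alone on an empty page so the queue always drains.
    while (queue.length > 0 &&
           (page.used === 0 || page.used + queue[0].height <= pageHeight)) {
      const float = queue.shift();
      page.content.push(`${float.id}@top`);
      page.used += float.height;
    }
  };

  for (const item of items) {
    if (item.kind === "float") {
      if (queue.length === 0 && page.used + item.height <= pageHeight) {
        page.content.push(`${item.id}@here`);
        page.used += item.height;
      } else {
        queue.push(item); // defer instead of forcing a bad break
      }
      continue;
    }
    while (page.used > 0 && page.used + item.height > pageHeight) startNewPage();
    page.content.push(item.id);
    page.used += item.height;
  }
  while (queue.length > 0) startNewPage();
  if (page.content.length > 0) pages.push(page);
  return pages.map((p) => p.content);
}

const flow = [
  { id: "t1", kind: "text", height: 60 },
  { id: "f1", kind: "float", height: 50 },
  { id: "t2", kind: "text", height: 30 },
  { id: "f2", kind: "float", height: 20 },
  { id: "t3", kind: "text", height: 40 },
];
console.log(layoutWithFloatQueue(flow, 100));
// [ [ 't1', 't2' ], [ 'f1@top', 'f2@top' ], [ 't3' ] ]
```

Text keeps flowing past f1 instead of leaving a gap on the first page, and f2 waits behind f1 even though it would have fit, which preserves figure order.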
Microtypography
Approximate:
- protrusion
- font expansion
- small tracking adjustments
Why:
- this is a major part of why LaTeX PDFs look calmer and denser than naive browser-justified text
What Should Stay Delegated To LaTeX
Do not reimplement these in the TeX path unless there is a very strong reason.
- standard font kerning and ligatures
- paragraph line breaking
- normal float placement
- page building
- bibliography formatting
The TeX path should mostly improve by giving LaTeX better structure and fewer forced placements, not by replacing LaTeX’s own layout logic.
What The Browser Path Must Reimplement
If the browser PDF surface is meant to be genuinely LaTeX-like, it eventually must own approximations of:
- hyphenation-aware line breaking
- paragraph badness/demerit scoring
- penalty-based page breaking
- float placement policy
- footnote and margin-note placement
- long-table pagination
That is the real gap.
Recommended Order
First: TeX path cleanup
- Stop forcing [H] except when explicitly requested.
- Add microtype.
- Add proper long-table support.
- Add heading keep-with-next / needspace policy.
Second: browser compositor quality
- Fix list ownership and false no-break capture around bullet lists.
- Finish no-line-split guarantees.
- Add hyphenation dictionaries.
- Add better paragraph badness scoring.
- Introduce real float classes and float queues.
- Move footnotes and margin notes into page building.
Third: convergence
Once both paths are cleaner, align:
- counters
- references
- figure numbering
- caption behavior
- page semantics
The goal should be:
- TeX path for final print authority
- browser PDF path for faithful editing and preview
not two unrelated layout systems.