perf: image-heavy DOCX OOM is driven by eager full-image decode at import, before virtualization (complements #2879) #3763

@shabooboo006

What happened?

Type

  • Bug / performance

Summary

This complements #2879 (browser crashes on large DOCX). That issue attributes the
OOM to (1) the full ProseMirror tree in memory and (2) synchronous single-pass
layout — both page-count driven. For image-heavy documents there is a
third, often dominant driver that #2879 doesn't mention:

All embedded images are eagerly decoded into the JS heap at import, before the
PM tree is built and before page virtualization can help. The OOM predictor is
decoded-bitmap memory ≈ Σ(width × height × 4 bytes) across all images, not file
size and not page count.
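
The predictor above can be written as a small helper. This is a minimal sketch, assuming 4 bytes per pixel (RGBA) and a hypothetical array of `{ width, height }` records; `estimateDecodedBytes` is an illustrative name, not a SuperDoc API.

```javascript
// Estimate decoded-bitmap memory, assuming 4 bytes per pixel (RGBA).
// `images` is a hypothetical array of { width, height } records in pixels.
function estimateDecodedBytes(images) {
  return images.reduce((sum, { width, height }) => sum + width * height * 4, 0);
}

// Example: 200 full-page scans at 1600 × 2200 px (the repro's dimensions).
const scans = Array.from({ length: 200 }, () => ({ width: 1600, height: 2200 }));
const totalBytes = estimateDecodedBytes(scans);
console.log((totalBytes / 1e9).toFixed(2), 'GB'); // 2.82 GB
```

File size and page count do not appear anywhere in the calculation; only pixel dimensions and image count do.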

Root cause (with evidence)

  • In super-editor's DocxZipper.getDocxData(), the unzip loop iterates every
    word/media/* entry and, per image, eagerly builds two in-heap copies:
    a base64 data URI (this.mediaFiles[name] = "data:...;base64,...") and an
    object URL (this.media[name] = URL.createObjectURL(...)). It is unconditional —
    no size/lazy gating (grep for lazy|defer|limit|threshold in that loop returns
    nothing). (v1.41 source: super-editor/src/editors/v1/core/DocxZipper.js, the
    word/media branch of the getDocxData loop.)
  • Load order is Editor.loadXmlData() (decodes all media) → build PM doc →
    mount() (DOM). Decode completes before the PM tree exists and before any
    <img> mounts, so the ~5-page virtualization window cannot release it: the
    memory peak happens before the first page renders. PM image nodes store only the
    media path; renderDOM maps path → decoded URL at render time, so the decoded
    payload for all images stays resident regardless of how many pages are mounted.
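
To put a number on the base64 copy described in the first bullet: base64 emits 4 characters for every 3 input bytes, so the data URI string is about a third longer than the encoded file it wraps. The sketch below runs in Node (it uses the global `Buffer`); `base64Length` and the 300 KB file size are illustrative assumptions. It counts only the encoded copies, not the decoded bitmap.

```javascript
// Length, in characters, of the padded base64 encoding of `byteLength` raw
// bytes: 4 output characters for every 3-byte group (rounded up).
function base64Length(byteLength) {
  return 4 * Math.ceil(byteLength / 3);
}

// Cross-check the formula against a real encoding of a small buffer.
const sample = Buffer.alloc(1000);
console.log(base64Length(1000), sample.toString('base64').length); // 1336 1336

// A hypothetical 300 KB embedded JPEG held as a base64 data URI:
const fileBytes = 300 * 1024;
const ratio = base64Length(fileBytes) / fileBytes;
console.log(ratio.toFixed(2)); // 1.33
```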

Why this matters (real-world data)

Three real Chinese bid documents (anonymized; decoded-bitmap = Σ pixels × 4):

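| Doc | File size | Images | Σ pixels | Decoded bitmap | Result |
|-----|-----------|--------|----------|----------------|--------|
| A | 77 MB | 333 | 480 M | ~1.92 GB | opens |
| B | 58 MB | 243 | 533 M | ~2.13 GB | crashed on v1.41 (opens on v1.42, but with image-render glitches) |
| C | 329 MB | 854 | 2243 M | ~9.0 GB | far beyond any tab budget |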

Note the inversion: Doc A is larger, has more images and more pages than
Doc B, yet A opens and the smaller B crashed — because the only metric that orders
them correctly is total decoded pixels, and both sit right at a ~2 GB renderer
ceiling. For Doc C, even downsampling every image to ≤1600px long edge still leaves
~4.7 GB decoded — i.e. with hundreds of images, per-image size reduction can't get
under the ceiling; the count dominates.
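
The "count dominates" arithmetic can be illustrated with a rough sketch. The image dimensions below are made up (854 uniform 2000 × 1300 px images, not Doc C's real data), the 2 GB budget is the approximate ceiling mentioned above, and `fitLongEdge` and `sumDecodedBytes` are hypothetical helper names.

```javascript
// Dimensions after scaling so the long edge is at most `maxEdge` px.
// Images already within the limit are returned unchanged.
function fitLongEdge({ width, height }, maxEdge) {
  const longEdge = Math.max(width, height);
  if (longEdge <= maxEdge) return { width, height };
  const scale = maxEdge / longEdge;
  return { width: Math.round(width * scale), height: Math.round(height * scale) };
}

// Total decoded size, assuming 4 bytes per pixel (RGBA).
function sumDecodedBytes(images) {
  return images.reduce((sum, { width, height }) => sum + width * height * 4, 0);
}

// Hypothetical document: 854 images, each 2000 × 1300 px.
const originals = Array.from({ length: 854 }, () => ({ width: 2000, height: 1300 }));
const capped = originals.map((img) => fitLongEdge(img, 1600));

console.log((sumDecodedBytes(originals) / 1e9).toFixed(2), 'GB'); // 8.88 GB
console.log((sumDecodedBytes(capped) / 1e9).toFixed(2), 'GB');    // 5.68 GB

// How many capped images fit under a ~2 GB budget at once?
const perImage = sumDecodedBytes([capped[0]]);
console.log(Math.floor(2e9 / perImage)); // 300
```

Under these assumptions, capping the long edge saves about a third of the memory, but only around 300 of the 854 images could be resident at once, which is why a bound on how many images are materialized matters more than per-image size.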

Empirical confirmation that image decode (not the PM tree) is the driver here:
pre-processing the .docx to downsample only the embedded images — same media paths,
document.xml byte-identical — cuts decoded-bitmap memory ~2× and makes
previously-crashing files open, without touching the PM tree or layout. So for
image-heavy docs the bottleneck is media decode, separate from #2879's page/layout
root cause.

Expected

Importing an image-heavy DOCX should not hold every image's decoded bitmap in heap
at once. Image materialization should be bounded (ideally tied to the same
viewport/virtualization window the renderer already uses).

Possible directions (non-prescriptive)

  • Lazy / per-page image decode: only build object URLs for images on pages near
    the viewport; release off-screen ones — reuse the existing virtualization window.
  • Stop double-holding each image: currently every image is kept as both a base64
    data URI (~1.33× the bytes, as a giant string) and a blob/objectURL. Keeping only
    URL.createObjectURL(blob) roughly halves the per-image heap and avoids huge strings.
  • Optionally a documented import cap / downsample option for huge media payloads.
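
The first two directions could be combined roughly as follows. This is a sketch under stated assumptions, not SuperDoc's actual design: `LazyMediaRegistry`, `register`, `acquire`, and `releaseOutside` are hypothetical names, and it assumes each image's page index is known (or can be estimated) before render. Only the compressed Blob is kept per image; no base64 string is built, and an object URL exists only while the image's page is inside the window. The URL functions are injectable so the logic can be exercised outside a browser.

```javascript
// Hypothetical registry: stores one compressed Blob per media path and
// creates an object URL only on demand, revoking it when the page leaves
// the virtualization window.
class LazyMediaRegistry {
  constructor({
    createUrl = (blob) => URL.createObjectURL(blob),
    revokeUrl = (url) => URL.revokeObjectURL(url),
  } = {}) {
    this.blobs = new Map(); // media path -> Blob (compressed bytes only)
    this.pages = new Map(); // media path -> page index (assumed known)
    this.urls = new Map();  // media path -> currently live object URL
    this.createUrl = createUrl;
    this.revokeUrl = revokeUrl;
  }

  register(path, blob, pageIndex) {
    this.blobs.set(path, blob);
    this.pages.set(path, pageIndex);
  }

  // Return the object URL for `path`, creating it on first use.
  acquire(path) {
    let url = this.urls.get(path);
    if (url === undefined) {
      const blob = this.blobs.get(path);
      if (blob === undefined) throw new Error(`unknown media path: ${path}`);
      url = this.createUrl(blob);
      this.urls.set(path, url);
    }
    return url;
  }

  // Revoke URLs for images whose page lies outside [firstPage, lastPage].
  releaseOutside(firstPage, lastPage) {
    for (const [path, url] of this.urls) {
      const page = this.pages.get(path);
      if (page < firstPage || page > lastPage) {
        this.revokeUrl(url);
        this.urls.delete(path);
      }
    }
  }
}
```

In this sketch the renderer would call `acquire(path)` when an image node mounts and `releaseOutside(first, last)` whenever the window moves. Revoking a URL only helps once the corresponding `<img>` has been unmounted, so the release step would need to run after the renderer drops off-screen pages.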

Steps to reproduce

Steps to reproduce

  1. Generate an image-heavy .docx (script below; no real data — random-noise images):
    npm i docx sharp && node repro-gen.mjs
  2. Load image-heavy-repro.docx in a SuperDoc editor (e.g. the v1.42 React demo).
  3. Watch JS heap during import (DevTools → Memory / Performance). It climbs to
    multiple GB before the first page renders; at high image counts the tab OOMs.

repro-gen.mjs:

// Generates image-heavy-repro.docx containing N random-noise JPEGs.
import sharp from 'sharp';
import { Document, Packer, Paragraph, ImageRun } from 'docx';
import { writeFileSync } from 'fs';
import { randomFillSync } from 'crypto';

const N = 200, W = 1600, H = 2200; // 200 full-page images ≈ 2.8 GB decoded

// Encode a w × h buffer of random RGB bytes as a JPEG.
const noiseJpeg = async (w, h) => {
  const raw = Buffer.allocUnsafe(w * h * 3);
  randomFillSync(raw);
  return sharp(raw, { raw: { width: w, height: h, channels: 3 } })
    .jpeg({ quality: 85 })
    .toBuffer();
};

// One paragraph per image; each image is displayed at 600 × 825 px.
const children = [];
for (let i = 0; i < N; i++) {
  const jpg = await noiseJpeg(W, H);
  children.push(new Paragraph({
    children: [new ImageRun({ type: 'jpg', data: jpg, transformation: { width: 600, height: 825 } })],
  }));
}

const doc = new Document({ sections: [{ children }] });
writeFileSync('image-heavy-repro.docx', await Packer.toBuffer(doc));
console.log('Σ decoded ≈', (N * W * H * 4 / 1e9).toFixed(1), 'GB');

(Real photos behave the same with a smaller file — noise just makes decoded ≈ file size; lower N to stay near the ~2 GB cliff.)
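
For step 3, the peak can also be captured from the console instead of read off the DevTools graph. This is a rough sketch: `trackPeakHeap` is a hypothetical helper, and the Chrome usage in the comment relies on `performance.memory`, which is non-standard, Chrome-only, and coarse. It reports JS heap only, so memory held outside the JS heap may not show up in the number.

```javascript
// Poll a memory reader at a fixed interval and remember the largest value.
// `readBytes` is any function returning a byte count.
function trackPeakHeap(readBytes, intervalMs = 250) {
  let peak = 0;
  const sample = () => { peak = Math.max(peak, readBytes()); };
  sample(); // take a baseline reading immediately
  const timer = setInterval(sample, intervalMs);
  return {
    // Stop polling, take one final reading, and return the peak in bytes.
    stop() {
      clearInterval(timer);
      sample();
      return peak;
    },
  };
}

// In the Chrome console, before loading the document (assumption: the
// non-standard performance.memory API is available):
//   const t = trackPeakHeap(() => performance.memory.usedJSHeapSize);
//   ...load image-heavy-repro.docx...
//   console.log((t.stop() / 1e9).toFixed(2), 'GB peak');
```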

SuperDoc version

superdoc core v1.42.0 / @superdoc-dev/react v1.13.0

Browser

Chrome
