PDF redaction done right: why black rectangles in Word and Preview don't work

The most common way people redact a PDF is also the most common way people leak the thing they were trying to redact. They open the file in Word, Preview, Acrobat Reader, or some web-based editor; they draw a black rectangle over a name, an address, a salary number, a witness identity; they save; they send the file. The rectangle looks solid. The page looks clean. The text under the rectangle is still there, in the PDF, exactly where it was, ready to be selected and copied by anyone who opens the file.

This is not a fringe mistake or a problem unique to one piece of software. It is the default outcome of the way PDF works as a format and the way most editors treat shapes. It has caused dozens of public leaks involving the U.S. Department of Justice, the TSA, the NSA, the New York City Transit Authority, the New York Times, several major law firms, and more than one government regulator. The most instructive thing about the list is how little the people doing the redacting had in common - except for the assumption that drawing a rectangle is the same as deleting what is underneath.

This piece walks through what actually happens inside a PDF when you "redact" with a black box, the canonical incidents that made this a textbook problem, the techniques that are equally broken, what real redaction looks like at the content-stream level, the metadata trap that catches even people who get the visible part right, and a workflow you can run on your own device for a file you do not want to upload.

What a PDF actually is

A PDF is not a picture of a page. It is a small program that tells a renderer how to draw a page: place this glyph here, in this font, at this size; draw this line; fill this rectangle with this color; show this image at these coordinates. Each page has a "content stream" - an ordered list of drawing instructions - and a set of resources (fonts, images, color profiles) the instructions reference.

When you draw a black rectangle on top of a paragraph, you are appending one more instruction to the content stream: "fill a rectangle of these dimensions with black at this position." The instructions that drew the paragraph are still in the stream, earlier in the order. The renderer obediently draws the paragraph, then draws the rectangle on top of it. Your eyes see the rectangle. The PDF still contains the paragraph.

Selecting text in a PDF does not interact with the rendered pixels at all; it walks the text-drawing instructions in the content stream. That is why you can triple-click on the redacted block, copy, paste into a text editor, and see the original sentence. The selection tool sees the text the rectangle was hiding, because the text is still there to be seen.
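
To make that concrete, here is a minimal sketch in Python. The content stream is hand-written and uncompressed, and the `extract_shown_text` helper and its regex are illustrative assumptions rather than a general extractor; real streams are usually compressed and use font-specific encodings. The point is only that the text-showing instruction is still in the stream after the rectangle has been drawn over it.

```python
import re

# A hand-written, uncompressed page content stream (illustrative only).
# BT/ET bracket a text object, Tf selects a font, Td positions the text,
# and Tj shows a string. The last three lines fill a black rectangle on top.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 700 Td
(Witness: Jane Example) Tj
ET
0 0 0 rg
70 695 160 16 re
f
"""

def extract_shown_text(stream: bytes) -> list[str]:
    """Return the literal strings passed to the Tj operator, in stream order."""
    # Toy pattern: assumes no escaped or nested parentheses inside the string.
    return [s.decode("latin-1") for s in re.findall(rb"\(([^()]*)\)\s*Tj", stream)]

print(extract_shown_text(CONTENT_STREAM))  # ['Witness: Jane Example']
```

The rectangle changes what is painted, not what is extracted: the helper never looks at the `re` and `f` operators at all, which is roughly how a selection or extraction tool behaves.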

The incidents that made this a textbook example

Examples are useful because they show that this is not a problem of careless amateurs. A short, non-exhaustive list of well-documented cases:

Paul Manafort defense filing in a Department of Justice prosecution (2019). Manafort's legal team submitted a court filing with redactions over passages discussing alleged contacts with a Russian associate. Reporters discovered within hours that the text under the black bars in the PDF could be selected and copied, exposing the unredacted passages and forcing the matter back into the news cycle.
TSA Screening Management SOP (2009). The Transportation Security Administration published a redacted version of its screening procedures online. The redactions were rectangles laid over text. The underlying procedures - including details of which travelers received reduced screening - were extracted within a day and reproduced widely in the press.
NSA report on Russian election interference (2017). A contractor leaked a partially redacted intelligence report. Independent of the leak itself, analysts noted that the redactions in the published copy used overlay rectangles in places, exposing fragments of the text underneath.
New York City Transit Authority subway diagrams. Sensitive infrastructure documents released under freedom-of-information requests have repeatedly included overlay-rectangle redactions whose underlying detail was recoverable.
Multiple law-firm court filings. Federal and state court dockets contain a steady drip of refiled documents whose original versions were withdrawn after journalists or opposing counsel demonstrated that the redactions were cosmetic.

The pattern is consistent. The redactor opens the file in a tool that lets them draw shapes, draws shapes, exports a PDF, and assumes the export "burns in" the shapes. None of those tools - Word, Preview, the markup features in Acrobat Reader, most browser-based editors - actually deletes the underlying text. They all just stack the rectangle on top.

Techniques that look like redaction and are not

The black rectangle is the famous one, but it is part of a wider family of techniques that share the same flaw: they hide content visually without removing it from the file.

The black rectangle (or "highlight in black")

Drawing a filled shape over text. The text remains in the content stream and is recoverable by selection or by any text-extraction tool (`pdftotext`, browser PDF viewers, Acrobat's accessibility export, almost any PDF library).

White-on-white text

Changing the font color of sensitive text to white so it disappears against the page. The text is still there, still indexable, still copyable. Inverting the page colors or selecting the whole page makes it visible immediately.

Image overlays

Pasting a screenshot or an opaque image rectangle over the text. Slightly more robust against casual selection because the image sits between the cursor and the text, but the text is still in the underlying content stream and trivially recoverable with text extraction.

Drawing tools in Acrobat Reader's "Comment" / "Markup" menu

These are annotations - they are stored in a separate annotation layer and do not modify the page content stream at all. Removing the annotation in any PDF editor unhides the original page. This is one of the most common sources of accidental disclosure because the tool is called "Black out" or "Redact" in some versions of Reader, even though the actual redaction tool requires Acrobat Pro.

"Print to PDF" with shapes on top

Sometimes works, sometimes does not. If the print path rasterises the page (in which case you have lost text-layer accessibility), the text is gone. If the print path emits a vector PDF that preserves text - which most modern systems do, by design, because it produces smaller, sharper files - the text is preserved and the rectangles are still on top of it. You cannot tell which behavior you got without inspecting the output.
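
One rough way to inspect the output is to look for text-showing operators inside the file's streams. The sketch below is a heuristic under stated assumptions: the helper names are hypothetical, it only understands uncompressed and Flate-compressed streams, it only looks for `Tj` and `TJ`, and binary image data could in principle trigger a false positive. A `True` result means text survived; a `False` result is a hint, not proof.

```python
import re
import zlib

def stream_bodies(pdf_bytes: bytes):
    """Yield each stream body, inflated when it is Flate-compressed."""
    for match in re.finditer(rb"stream\r?\n(.*?)endstream", pdf_bytes, re.DOTALL):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes left before "endstream".
            yield zlib.decompressobj().decompress(raw)
        except zlib.error:
            yield raw  # not Flate data: treat it as an uncompressed stream

def has_text_operators(pdf_bytes: bytes) -> bool:
    """Heuristic: True if any stream contains a BT ... Tj/TJ sequence."""
    pattern = re.compile(rb"\bBT\b.*?\b(?:Tj|TJ)\b", re.DOTALL)
    return any(pattern.search(body) for body in stream_bodies(pdf_bytes))

# Two toy "files": one whose page still draws text, one that only places an image.
vector_page = b"stream\n" + zlib.compress(b"BT /F1 12 Tf (hello) Tj ET") + b"\nendstream"
raster_page = b"stream\n" + zlib.compress(b"q 612 0 0 792 0 0 cm /Im0 Do Q") + b"\nendstream"

print(has_text_operators(vector_page))  # True
print(has_text_operators(raster_page))  # False
```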

Cropping the page

PDF cropping changes the page's "crop box", the visible region inside the full "media box", but does not delete content outside it. Many viewers will reveal the cropped-out content if you adjust the crop box back, and content-extraction tools ignore the crop entirely. If you crop sensitive content out of a page, treat the file as still containing it.
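
In the file itself, a crop is typically just a `/CropBox` entry sitting next to `/MediaBox` in the page dictionary. The sketch below uses a hand-written page dictionary and hypothetical helpers (`read_box`, `hidden_by_crop`) to show that a point can be inside the page and merely outside the visible window; it is a toy parser, not something to run against arbitrary files.

```python
import re
from typing import Optional

# Hand-written page dictionary: a US Letter page cropped to its top half.
PAGE_DICT = b"<< /Type /Page /MediaBox [0 0 612 792] /CropBox [0 396 612 792] >>"

def read_box(page_dict: bytes, name: bytes) -> Optional[tuple]:
    """Read a box array such as /MediaBox [llx lly urx ury] from a page dictionary."""
    match = re.search(rb"/" + name + rb"\s*\[\s*([-\d.\s]+?)\s*\]", page_dict)
    if match is None:
        return None
    return tuple(float(v) for v in match.group(1).decode("ascii").split())

def hidden_by_crop(page_dict: bytes, x: float, y: float) -> bool:
    """True if (x, y) is on the page but outside the visible crop box."""
    media = read_box(page_dict, b"MediaBox")
    crop = read_box(page_dict, b"CropBox") or media  # no CropBox: nothing is cropped

    def inside(box):
        return box[0] <= x <= box[2] and box[1] <= y <= box[3]

    return inside(media) and not inside(crop)

print(hidden_by_crop(PAGE_DICT, 100, 200))  # True: drawn on the page, cropped from view
print(hidden_by_crop(PAGE_DICT, 100, 700))  # False: inside the visible region
```

Anything drawn at the first point is still in the content stream; only the viewing window excludes it.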

What real PDF redaction looks like

Proper redaction is two operations, not one:

1. Remove the underlying content from the page's content stream. The text-drawing operators that produce the sensitive characters are deleted from the stream. Any image or vector content that includes the sensitive information is similarly removed or cropped at the source. After this step, opening the file in any tool and selecting the redacted area returns nothing, because there is nothing there to return.
2. Cover the now-empty area with an opaque shape (usually a black rectangle, sometimes a labeled one - "REDACTED §6", "EXEMPTION (b)(7)", and so on). This is the visual cue that something was removed; without it the page would have an awkward gap. The cover is not what protects the information - the deletion in step one is. The rectangle is just the marker.
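
The two operations above can be sketched on a toy content stream. This is a deliberately simplified illustration, assuming an uncompressed stream with plain `( ... ) Tj` strings; the `redact_stream` function is hypothetical, and it drops the whole matching string rather than individual characters. A real tool also has to handle `TJ` arrays, hex strings, font encodings, text split across several operators, and glyph positions.

```python
import re

def redact_stream(stream: bytes, secret: bytes, box: tuple[int, int, int, int]) -> bytes:
    """Delete text-showing operators that contain `secret`, then add a cover box."""
    show_text = re.compile(rb"\(([^()]*)\)\s*Tj\s*")

    def drop_if_sensitive(match: re.Match) -> bytes:
        # Step one: the operator itself is removed, not hidden.
        return b"" if secret in match.group(1) else match.group(0)

    cleaned = show_text.sub(drop_if_sensitive, stream)

    # Step two: a black filled rectangle marks where content was removed.
    x, y, width, height = box
    marker = b"0 0 0 rg\n%d %d %d %d re\nf\n" % (x, y, width, height)
    return cleaned + marker

page = (
    b"BT\n/F1 12 Tf\n72 700 Td\n"
    b"(Salary: 185000) Tj\n"
    b"0 -16 Td\n"
    b"(Start date: 1 March) Tj\n"
    b"ET\n"
)

redacted = redact_stream(page, b"Salary", (70, 695, 160, 16))
print(b"Salary" in redacted)      # False: nothing left to select or extract
print(b"Start date" in redacted)  # True: unrelated text is untouched
```

If the second step were skipped, the file would still be safe; if the first were skipped, the rectangle would be the only thing standing between a reader and the text.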

A real redaction tool also rewrites the file's cross-reference table and, importantly, does not save the result as an "incremental update" appended to the original. PDFs support incremental saves, where new content is added to the end of the file and the original content stays earlier in the file, recoverable by anyone who opens the bytes in a hex editor and walks the older cross-reference table. A correctly redacted file is rewritten from scratch with only the post-redaction objects.
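
A quick, imperfect check for incremental saves is to count end-of-file markers, since each appended update normally ends with its own `%%EOF`. The helper below is a hypothetical sketch operating on raw bytes; linearized ("fast web view") files can also contain more than one marker, so a count above one is a reason to look closer rather than a verdict.

```python
def count_eof_markers(pdf_bytes: bytes) -> int:
    """Count %%EOF markers; each incremental update normally appends one."""
    return pdf_bytes.count(b"%%EOF")

def looks_incrementally_saved(pdf_bytes: bytes) -> bool:
    """Heuristic only: more than one %%EOF suggests older revisions may remain."""
    return count_eof_markers(pdf_bytes) > 1

# Toy byte strings standing in for real files (object bodies elided on purpose).
clean_save = b"%PDF-1.7\n1 0 obj\n<< >>\nendobj\nstartxref\n9\n%%EOF\n"
appended_save = clean_save + b"1 0 obj\n<< >>\nendobj\nstartxref\n60\n%%EOF\n"

print(looks_incrementally_saved(clean_save))     # False
print(looks_incrementally_saved(appended_save))  # True
```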

And a real redaction workflow does one more thing: it strips the document metadata, comments, form-field history, embedded files, bookmarks, JavaScript, and any other auxiliary structure that might carry a copy or a reference to the redacted content. Visual redaction never touches any of these. Several well-known disclosures happened because the page was redacted correctly but the document properties still listed the names of the people the redaction was protecting.

The metadata trap

A PDF can carry two layers of metadata, and most files produced by office software carry both: the legacy "Info dictionary" (Author, Title, Subject, Keywords, Creator, Producer, CreationDate, ModDate) and the modern XMP packet (an XML block that can carry arbitrary structured metadata, including Adobe-specific revision history, original filenames, and identifiers tied to the originating workstation). On top of those, a typical editor adds bookmarks, comments authored by named users, form fields with default values, and sometimes embedded files - an attached spreadsheet, an original Word source, an email thread.

Visual redaction touches none of this. A redacted memo whose Author field still reads "Jane Q. Whistleblower" defeats itself. A redacted contract with the original DOCX attached as an embedded file defeats itself twice. A scanned report whose XMP packet records the path `C:\Users\j.smith\Documents\Internal\Drafts\Final-Confidential-v3.pdf` tells the reader exactly which workstation produced it and what the file was called internally. A real workflow inspects all of this and removes whatever is not meant to be there.
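
As a rough illustration of how visible this metadata is, the sketch below scans raw file bytes for Info dictionary keys and for the start of an XMP packet. The function names are hypothetical, and the scan rests on loud assumptions: the file is unencrypted, the Info dictionary is not tucked inside a compressed object stream, and values are plain literal strings rather than hex or UTF-16. An empty result therefore does not prove the file is clean.

```python
import re

INFO_KEYS = (b"Author", b"Title", b"Subject", b"Keywords",
             b"Creator", b"Producer", b"CreationDate", b"ModDate")

def scan_info_fields(pdf_bytes: bytes) -> dict[str, str]:
    """Find Info-style keys followed by a literal string, e.g. /Author (Name)."""
    found = {}
    for key in INFO_KEYS:
        match = re.search(rb"/" + key + rb"\s*\(([^()]*)\)", pdf_bytes)
        if match:
            found[key.decode("ascii")] = match.group(1).decode("latin-1")
    return found

def has_xmp_packet(pdf_bytes: bytes) -> bool:
    """XMP packets begin with an <?xpacket begin=...?> processing instruction."""
    return b"<?xpacket begin" in pdf_bytes

# Toy object standing in for a real file; the Producer value is made up.
sample = b"1 0 obj\n<< /Author (Jane Q. Whistleblower) /Producer (ExampleWriter 1.0) >>\nendobj\n"

print(scan_info_fields(sample))  # {'Author': 'Jane Q. Whistleblower', 'Producer': 'ExampleWriter 1.0'}
print(has_xmp_packet(sample))    # False
```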

On Privvert this is a separate tool: the PDF metadata viewer and stripper reads the Info dictionary and the XMP packet locally, shows you what is in there, and lets you remove specific fields or the entire packet before you save. Run it after redaction, every time. The whole pipeline runs in your browser tab; the file is never uploaded.

A workflow that actually protects the document

For anyone who handles confidential PDFs regularly, the workflow below produces a file that is genuinely safe to release. It works for legal filings, FOIA responses, expert reports, internal documents shared with outside counsel, and anything else where a leak would matter.

1. Open the file in a real redaction tool. Acrobat Pro is the incumbent; the in-browser PDF redactor on Privvert does the same content-stream deletion locally, with no upload. Avoid Word, Preview, Acrobat Reader's markup tools, and "draw a shape" features in general-purpose PDF editors.
2. Mark every region that needs to go. Be generous: if a paragraph reveals the redacted information by context (the surrounding sentence makes the missing word obvious), redact more. Pay particular attention to running headers, footers, footnotes, and stamps - they often repeat the same names you redacted in the body.
3. Apply the redactions. This is the step that physically deletes the content. After it runs, save the file under a new name; do not overwrite the original. (Keep the original in a private location in case you need to redo the redaction differently later.)
4. Sanitise metadata in a separate step. Open the saved file in the metadata stripper and remove the Info dictionary fields you do not want disclosed (Author, Title if it leaks anything, original filenames in the XMP), the comments, and any embedded files or attachments you did not intend to share.
5. Verify by extraction, not by eye. Open the file in a fresh viewer, select all, copy, paste into a plain text editor. Read what comes out. If your redacted names appear, the redaction failed and you have to start over. As a second check, open the file in the PDF to text converter and search the resulting text for any of the strings you meant to remove. Do this every time. Visual confirmation is not enough; tools see what the file says, not what the rectangles say.
6. If the file contains scanned pages, OCR-then-redact, or rasterise. A scanned PDF where the OCR layer was added later still has searchable text behind the image; redacting only the image leaves the OCR layer intact. Either redact both layers, or convert the page to a flat raster (Privvert's PDF-to-image tool does this locally) and re-build a fresh PDF from the rasterised pages with no OCR underneath. Black out the sensitive area on the image with the image cropper or editor before re-assembling. You lose accessibility on those pages, but you gain a guarantee that no text layer can betray you.
7. For very long documents, consider splitting first. Use the PDF splitter to extract only the pages you actually need to release. Pages you never include in the released file cannot be un-redacted later, because they are not in the file at all.
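
The verify-by-extraction step is easy to script. The sketch below is one possible shape, not a prescribed tool: `find_leaks` is a hypothetical helper that searches already-extracted text for the strings you meant to remove, and `extract_with_pdftotext` assumes the `pdftotext` command-line tool is installed and on your PATH. Whitespace and case are normalised because extraction often reflows lines.

```python
import re
import subprocess

def normalise(text: str) -> str:
    """Collapse runs of whitespace and ignore case so reflowed text still matches."""
    return re.sub(r"\s+", " ", text).casefold()

def find_leaks(extracted_text: str, forbidden: list[str]) -> list[str]:
    """Return every forbidden string that still appears in the extracted text."""
    haystack = normalise(extracted_text)
    return [term for term in forbidden if normalise(term) in haystack]

def extract_with_pdftotext(path: str) -> str:
    """Extract text locally; assumes the pdftotext CLI is available ("-" = stdout)."""
    result = subprocess.run(["pdftotext", path, "-"],
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example with text pasted from a viewer; the names are made up.
pasted = "Payment approved by J.  Smith\nfor the March invoice"
print(find_leaks(pasted, ["j. smith", "Acme Holdings"]))  # ['j. smith']
```

An empty list means none of the listed strings survived extraction; it does not cover strings you forgot to list, text hyphenated across lines, or content held in metadata and attachments, so it complements the manual checks rather than replacing them.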

Why "just upload it to a redaction website" misses the point

There are plenty of web-based redaction tools, and most of them work the way web-based file converters work: you upload, they process on their server, you download. For a file you needed to redact in the first place, this is the wrong shape of solution. You have, by definition, just sent the unredacted file to someone else's computer in order to redact it. The operator may delete it afterwards. They may not. They may have a robust security posture. They may not. The risk you were trying to manage was disclosure of confidential information; "disclose it to a third party first" is not a sound mitigation.

Local processing changes the calculation. The same content-stream deletion can run in a browser tab using PDF libraries compiled to WebAssembly, with the file loaded via the browser's File API and never sent over the network. You can verify this in any browser's developer tools (Network panel, F12) by watching for the upload that never happens. That is the basis on which every PDF tool on Privvert - including the redactor, metadata stripper, splitter, and text extractor - is built.

Frequently asked questions

If I draw a black rectangle over text in Preview or Word, is the text really still there?

Yes. The rectangle is a separate object stacked on top of the page. The text under it remains in the PDF's content stream, fully selectable and copyable. Anyone who opens the file in Acrobat, a browser PDF viewer, or any text-extraction tool can lift the text back out in seconds. This is not a theoretical risk - it has been the cause of multiple high-profile leaks involving the U.S. Department of Justice, the TSA, the NSA, and several major law firms.

Does flattening the PDF or 'printing to PDF' fix it?

Not reliably. In most tools, flattening merges annotations and form fields into the page content, so a flattened rectangle still sits on top of text that remains in the content stream. Printing to PDF removes the text only if the print path rasterises the page; many print paths emit a vector PDF that keeps the text, and you cannot tell which one you got without inspecting the output. When the page is rasterised, you have turned a searchable, accessible document into a picture of a document - bad for screen readers, bad for search, bad for file size - and whatever remains visible is easy to OCR back into text. Neither approach does anything about attachments, JavaScript, or document metadata, which have to be stripped separately. Real redaction removes only the sensitive content, not the entire document's usability.

Is highlighting text and then changing the font color to black a valid redaction?

No. The text and its formatting are stored together in the PDF; changing color is a presentation tweak, not a deletion. Copy-pasting from the file recovers the text immediately, and many viewers will still display the text if you toggle the rendering or change the background color. Treat any technique that 'hides' rather than 'removes' as broken.

What about metadata - the document Author, Title, comments and revision history?

PDF metadata lives in a separate structure (the Info dictionary and XMP packet) that visual redaction never touches. The same is true of comments, form-field history, JavaScript, embedded files, bookmarks, and the file's own creation/modification timestamps. Several public leaks happened not because the redaction failed but because whoever prepared the file forgot the metadata. A real redaction workflow strips metadata as a separate, explicit step.

Can a redacted PDF be 'un-redacted'?

If it was redacted properly - text removed from the content stream and the area filled with an opaque shape, metadata stripped, the file re-saved without incremental updates - then no, the original content is gone. If it was redacted with any of the broken techniques (overlay rectangles, font-color tricks, white-on-white text, opacity changes, image overlays without text deletion), then yes, in seconds, with tools as ordinary as Ctrl+A and Ctrl+C.

Is it safe to upload a confidential PDF to an online redaction tool?

It defeats most of the point. The file with the unredacted information goes to a third-party server, which is precisely the situation you were trying to avoid. Even if the operator deletes the file afterwards, you have lost custody of it during the upload, processing and storage window, and you have to take their word on the policy. For anything sensitive enough to need redaction in the first place, do the redaction on your own device.

Does Acrobat Pro's redaction tool do it correctly?

Yes - Acrobat Pro has a true two-step redaction tool (mark, then 'apply redactions') that physically removes content from the document and offers to sanitise metadata at the same time. The catch is licensing: Acrobat Pro is a paid subscription, and the free Acrobat Reader does not include the redaction tool. A local in-browser redactor that performs the same content-stream removal gives you the same security property without the license.

Putting it into practice

Real redaction is one of those problems where the right answer is boring and the wrong answer is everywhere. The reason black rectangles persist as a default is that they look like they are working: the redactor sees a clean page, the file opens, the rectangle is solid, and the obvious test passes. The less obvious tests - select all, copy, paste; open in a different viewer; strip the annotations; read the metadata - reliably fail, and they are the tests an adversary will run first.

The good news is that none of this requires special software, an enterprise license, or sending the file anywhere. A modern browser plus a tool that actually edits the content stream is sufficient. If you want to try the workflow on a real document, the PDF redactor, metadata stripper, and the rest of the Privvert toolkit run entirely on your device. For more on why local processing is the right default for any file you would not email to a stranger, the companion piece on the blog goes through the evidence in detail.

And if your "redaction" task is actually about an image rather than a PDF - screenshot of a chat, a photo of a document, a scan with sensitive text in it - the same warnings apply. Crop the sensitive area out with the image cropper rather than painting over it, then strip the image's metadata with the EXIF stripper so you do not accidentally release the original GPS coordinates or device serial number with the cropped picture.