
Document Sanitization

Best practices for metadata removal and permanent byte-level redaction.

What is Document Sanitization?

Document sanitization is the process of removing sensitive information from files before sharing them. This includes both visible content (text, images) and hidden metadata (author information, revision history, embedded data).

Proper sanitization is critical for professionals who handle confidential documents, including legal teams, healthcare providers, government agencies, and corporate security teams. Incomplete sanitization can lead to data breaches, privacy violations, and legal liability.

Hidden Metadata in PDFs

PDF files contain multiple layers of metadata that users often overlook:

  • Document Info: Author, title, subject, keywords, creation date, modification date
  • Application Metadata: Software used to create the PDF, PDF version, producer information
  • XMP Metadata: Extensible Metadata Platform data including custom properties
  • Embedded Files: Attachments hidden within the PDF structure
  • Hidden Layers: Content not visible in the default view but present in the file
  • Revision History: Earlier versions retained by incremental saves, plus annotations and comments
  • JavaScript: Embedded scripts that may collect information
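
As a rough illustration of where these layers show up, the sketch below scans the raw bytes of a PDF for dictionary keys associated with each category. The key names are standard PDF names, but the `MARKERS` mapping and the `scan_pdf_bytes` function are hypothetical helpers written for this article, and a plain byte scan will miss anything stored inside compressed object streams.

```python
import re

# Byte markers for the metadata categories listed above. The key names are
# standard PDF dictionary keys; the grouping is illustrative only.
MARKERS = {
    "document info": [b"/Author", b"/Title", b"/Subject", b"/Keywords",
                      b"/CreationDate", b"/ModDate"],
    "application metadata": [b"/Creator", b"/Producer"],
    "xmp metadata": [b"<x:xmpmeta", b"<?xpacket"],
    "embedded files": [b"/EmbeddedFiles", b"/Filespec"],
    "hidden layers": [b"/OCProperties", b"/OCG"],
    "annotations": [b"/Annots"],
    "javascript": [b"/JavaScript", b"/JS"],
}


def scan_pdf_bytes(data: bytes) -> dict:
    """Return the categories whose markers appear in the raw bytes."""
    found = {}
    for category, markers in MARKERS.items():
        hits = [m.decode() for m in markers if m in data]
        if hits:
            found[category] = hits
    # More than one %%EOF marker usually means incremental saves, so earlier
    # revisions of the document may still be present in the file.
    revisions = len(re.findall(rb"%%EOF", data))
    if revisions > 1:
        found["revision history"] = [f"{revisions} %%EOF markers"]
    return found


# Hand-written sample bytes, not a complete or valid PDF.
sample = (
    b"%PDF-1.7\n1 0 obj\n<< /Author (Jane Doe) /Producer (ExampleWriter 2.1) >>\n"
    b"endobj\n%%EOF\n2 0 obj\n<< /S /JavaScript /JS (app.alert(1)) >>\nendobj\n%%EOF\n"
)
for category, hits in scan_pdf_bytes(sample).items():
    print(category, "->", hits)
```

A scan like this is only a first pass for spotting what a file may contain; a full inspection needs a PDF parser that decodes every stream.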

The Danger of "Redaction" by Overlay

A common mistake is to place black rectangles over sensitive text and believe the information is hidden. This is not redaction. The underlying text remains in the PDF and can be revealed by:

  • Copying and pasting the "covered" text
  • Searching within the PDF document
  • Removing the overlay rectangle in a PDF editor
  • Converting the PDF to another format

Critical Warning

Never rely on visual overlays for redaction. True redaction requires removing the actual text and image data from the PDF at the byte level, not just covering it visually.
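
To see why an overlay does not redact anything, consider a simplified page content stream. The stream below is a hand-written, uncompressed example (real streams are usually compressed and use font-specific encodings), and `extract_literal_strings` is a hypothetical helper. The black rectangle is drawn after the text, yet the text operator is still in the stream.

```python
import re

# A hand-written, uncompressed content stream: one line of text, then a
# black rectangle painted over the same area of the page.
content_stream = (
    b"BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET\n"
    b"0 0 0 rg 70 695 150 16 re f\n"
)


def extract_literal_strings(stream: bytes) -> list:
    """Return the literal strings shown with the Tj operator."""
    pattern = rb"\(((?:[^()\\]|\\.)*)\)\s*Tj"
    return [m.decode("latin-1") for m in re.findall(pattern, stream)]


# The "covered" text is returned unchanged.
print(extract_literal_strings(content_stream))
```

Copy and paste, search, and format conversion all read the same text operators, which is why each of them exposes the covered content.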

Proper Redaction Techniques

Effective redaction involves:

  1. Identify Sensitive Content: Review the document for names, addresses, financial information, and other confidential data
  2. Select Redaction Areas: Mark the exact regions to be redacted
  3. Apply Redaction: Remove the underlying content and replace with a solid fill
  4. Flatten the Document: Ensure redactions cannot be undone
  5. Verify: Check that redacted content cannot be recovered

Byte-Level Redaction

DocuStitch's redaction tool performs byte-level redaction:

// Redaction process
1. Parse PDF structure
2. Identify text objects in redaction zones
3. Remove text objects from content stream
4. Replace with black rectangle
5. Remove any associated annotations
6. Flatten the page to prevent undo
7. Regenerate XREF table

Because the text objects themselves are deleted rather than covered, the redacted information is no longer present in the file and cannot be recovered, even with specialized forensic tools.
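
The core idea of steps 2 to 4 can be sketched on a toy content stream. This is not DocuStitch's implementation: `Rect` and `redact_stream` are hypothetical names, and the sketch assumes uncompressed streams in which each text block has the simple form `BT ... x y Td (text) Tj ET`. A real tool also has to handle text matrices, glyph bounding boxes, font encodings, images, compressed streams, and rewriting the file.

```python
import re
from typing import NamedTuple


class Rect(NamedTuple):
    x: float
    y: float
    width: float
    height: float

    def contains(self, px: float, py: float) -> bool:
        return (self.x <= px <= self.x + self.width
                and self.y <= py <= self.y + self.height)


# One simple text block, and the "x y Td" origin inside it.
BLOCK = re.compile(rb"\bBT\b.*?\bET\b\n?", re.DOTALL)
ORIGIN = re.compile(rb"(-?[\d.]+)\s+(-?[\d.]+)\s+Td\b")


def redact_stream(stream: bytes, zone: Rect) -> bytes:
    """Delete text blocks that start inside the zone, then paint it black."""
    def drop_if_in_zone(match):
        block = match.group(0)
        origin = ORIGIN.search(block)
        if origin:
            x = float(origin.group(1).decode())
            y = float(origin.group(2).decode())
            if zone.contains(x, y):
                return b""  # remove the text operators themselves
        return block

    cleaned = BLOCK.sub(drop_if_in_zone, stream)
    # Draw the black fill only after the underlying text is gone.
    fill = b"0 0 0 rg %g %g %g %g re f\n" % (zone.x, zone.y, zone.width, zone.height)
    return cleaned + fill


page = (
    b"BT /F1 12 Tf 72 720 Td (Quarterly report) Tj ET\n"
    b"BT /F1 12 Tf 72 700 Td (SSN: 123-45-6789) Tj ET\n"
)
redacted = redact_stream(page, Rect(70, 695, 200, 16))
print(redacted.decode("latin-1"))
```

The difference from an overlay is the order of operations: the text is deleted from the stream first, and the rectangle is only a visual marker for where content used to be.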

Metadata Removal

Before sharing documents, remove all non-essential metadata:

  • Document Info: Clear author, title, subject fields
  • Creation Dates: Remove timestamps that may reveal workflow information
  • Application Data: Strip software version and producer information
  • Embedded Files: Remove any attachments
  • Hidden Content: Remove invisible layers and content
  • JavaScript: Remove all embedded scripts
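
One simple way to clear Document Info strings is to overwrite the values in place with spaces of the same length, so that byte offsets in the cross-reference table stay valid. The sketch below assumes an uncompressed Info dictionary that uses literal `(...)` strings; `blank_info_strings` is a hypothetical helper, and it does not touch hex strings, XMP packets, or object streams. It also matches these keys wherever they appear, so `/Title` entries in bookmarks would be blanked as well.

```python
import re

# Info-dictionary keys followed by a literal string value.
INFO_STRING = re.compile(
    rb"(/(?:Author|Title|Subject|Keywords|Creator|Producer|CreationDate|ModDate)"
    rb"\s*\()((?:[^()\\]|\\.)*)(\))"
)


def blank_info_strings(data: bytes) -> bytes:
    """Overwrite matching string values with spaces, preserving length."""
    return INFO_STRING.sub(
        lambda m: m.group(1) + b" " * len(m.group(2)) + m.group(3),
        data,
    )


# Hand-written sample dictionary, not a complete PDF.
sample_info = b"<< /Author (Jane Doe) /Producer (ExampleWriter 2.1) /Pages 3 0 R >>"
cleaned_info = blank_info_strings(sample_info)
print(cleaned_info)
```

Keeping the length unchanged avoids rebuilding the cross-reference table, but a production workflow would normally rewrite the whole file with a PDF library so that no earlier revision survives.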

Common Sanitization Failures

Real-world examples of sanitization failures include:

  • Legal Documents: Redacted text revealed when copied from PDF
  • Government Releases: Hidden layers contained unredacted information
  • Financial Reports: Metadata revealed internal author information
  • Medical Records: Patient identifiers in embedded comments

Sanitization Workflow

A recommended sanitization workflow:

  1. Review document for all sensitive information
  2. Use proper redaction tools (not overlays)
  3. Remove all metadata with a metadata removal tool, then confirm the result in a metadata viewer
  4. Flatten the document to lock changes
  5. Verify by attempting to recover redacted content
  6. Keep original and sanitized versions separate
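
The verification step can be partly automated. The sketch below searches a sanitized file for known sensitive terms, both in the raw bytes and inside Flate-compressed streams, where a plain text search would not see them. `find_leaks` is a hypothetical helper; it assumes the terms are stored as plain single-byte text and ignores other stream filters, font-specific encodings, and text inside images.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
STREAM = re.compile(rb"stream\r?\n(.*?)\nendstream", re.DOTALL)


def find_leaks(data: bytes, sensitive_terms: list) -> list:
    """Return the terms still present in raw bytes or in inflated streams."""
    haystacks = [data]
    for match in STREAM.finditer(data):
        try:
            # decompressobj tolerates trailing bytes after the zlib data.
            haystacks.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            pass  # not Flate data, or a filter this sketch does not handle
    return [term for term in sensitive_terms
            if any(term.encode("latin-1") in h for h in haystacks)]


# Hand-built sample object: the sensitive text sits in a compressed stream.
hidden = zlib.compress(b"BT 72 700 Td (SSN: 123-45-6789) Tj ET")
pdf_bytes = (b"1 0 obj\n<< /Filter /FlateDecode >>\nstream\n"
             + hidden + b"\nendstream\nendobj\n")
print(find_leaks(pdf_bytes, ["123-45-6789", "Jane Doe"]))
```

An empty result from a check like this is necessary but not sufficient; opening the sanitized file in a separate PDF tool and trying to copy, search, and export the redacted regions remains part of verification.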

Compliance Requirements

Various regulations require proper document sanitization:

  • HIPAA: Protected Health Information must be properly redacted before disclosure
  • GDPR: Personal data removal requires complete deletion, not just hiding
  • FOIA: Exempt information must be removed from government releases, not merely obscured
  • Court Rules: Legal filings must redact confidential information properly

Local Processing Advantages

Using local tools for sanitization provides:

  • No data exposure: Sensitive documents never leave your device
  • Verification: When the tool's source code is available, you can inspect it and verify the sanitization logic
  • Audit Trail: You maintain control of the entire process
  • Compliance: Helps satisfy data sovereignty requirements, since files stay on your own systems