Output Formats

PDFPipe supports 10 output formats across three categories. Set the format field in your convert request to choose one.

Format    Value     Category    Content-Type      Tiers
JSON      json      Extraction  application/json  All tiers
Text      text      Extraction  text/plain        All tiers
Markdown  markdown  Extraction  text/markdown     Starter+
XML       xml       Extraction  application/xml   Starter+
CSV       csv       Extraction  text/csv          Starter+
Base64    base64    Encoded     text/plain        Starter+
Binary    binary    Encoded     application/pdf   Starter+
PNG       png       Image       image/png         Starter+
JPG       jpg       Image       image/jpeg        Starter+
WebP      webp      Image       image/webp        Starter+
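
The format values and content types listed above can be mirrored in client code so that an unsupported value is caught before a request is sent. The following is a minimal sketch, not part of the PDFPipe API: the helper names and the "url" request field are assumptions made for illustration, and only the "format" field and its values come from this page.

```python
# Format values, content types, and tiers copied from the format list above.
FORMATS = {
    "json": ("application/json", "All tiers"),
    "text": ("text/plain", "All tiers"),
    "markdown": ("text/markdown", "Starter+"),
    "xml": ("application/xml", "Starter+"),
    "csv": ("text/csv", "Starter+"),
    "base64": ("text/plain", "Starter+"),
    "binary": ("application/pdf", "Starter+"),
    "png": ("image/png", "Starter+"),
    "jpg": ("image/jpeg", "Starter+"),
    "webp": ("image/webp", "Starter+"),
}


def build_convert_request(source_url: str, fmt: str) -> dict:
    """Build a request body, rejecting format values not listed on this page.

    The "url" key is a hypothetical field name; only "format" is documented here.
    """
    if fmt not in FORMATS:
        raise ValueError(f"unsupported format {fmt!r}; expected one of {sorted(FORMATS)}")
    return {"url": source_url, "format": fmt}


def expected_content_type(fmt: str) -> str:
    """Return the Content-Type listed for a format value."""
    return FORMATS[fmt][0]
```

Checking the response Content-Type against expected_content_type() is a cheap way to notice that a different format was returned than the one requested.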

Extraction Formats

Extraction formats parse the PDF content and return structured or plain-text data. These are the most commonly used formats.

JSON

format: "json"

Structured page-by-page extraction with text, tables, metadata, font information, and coordinates. The richest output format.

Example JSON output
{
  "pages": [
    {
      "page": 1,
      "width": 612,
      "height": 792,
      "text": "Invoice #2026-0142...",
      "tables": [
        {
          "rows": [
            ["Item", "Qty", "Price"],
            ["API Credits", "1000", "$49.00"]
          ]
        }
      ],
      "metadata": {
        "fonts": ["Helvetica"],
        "hasImages": false
      }
    }
  ],
  "documentMetadata": {
    "title": "Invoice",
    "author": "Acme Corp",
    "pageCount": 1,
    "fileSize": 24576
  }
}

Common use cases

  • Data extraction and processing pipelines
  • LLM / RAG document ingestion
  • Table extraction for spreadsheets or databases
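
For the table-extraction use case, the rows in the JSON output can be turned into records keyed by column name. This sketch assumes the first row of each table is a header row, as in the example output above; that is an assumption about the data, not a documented guarantee, and tables_as_records is a hypothetical helper name.

```python
import json


def tables_as_records(result: dict) -> list:
    """Return one list of row dicts per table, keyed by the first (header) row.

    Assumes the first row of every table holds the column names.
    """
    tables = []
    for page in result.get("pages", []):
        for table in page.get("tables", []):
            rows = table.get("rows", [])
            if len(rows) < 2:
                continue  # nothing but a header, or an empty table
            header, *body = rows
            tables.append([dict(zip(header, row)) for row in body])
    return tables


# Trimmed copy of the example JSON output shown above.
sample = json.loads("""
{
  "pages": [
    {
      "page": 1,
      "tables": [
        {"rows": [["Item", "Qty", "Price"], ["API Credits", "1000", "$49.00"]]}
      ]
    }
  ]
}
""")

records = tables_as_records(sample)
```

Cell values stay as strings, so numeric columns such as Qty need an explicit conversion before loading into a database.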

Text

format: "text"

Plain-text extraction. All pages are concatenated into one string, separated by page breaks, with no structural metadata.

Example Text output
Invoice #2026-0142
Date: February 15, 2026

Item          Qty    Price
API Credits   1000   $49.00
Priority      1      $29.00

Total: $78.00

Common use cases

  • Full-text search indexing
  • Simple text processing
  • Content previews
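
For search indexing it is often useful to split the text output back into pages. This page does not say which character marks a page break, so the sketch below assumes a form feed ("\f") and exposes it as a parameter; check a real response before relying on it. split_pages is a hypothetical helper name.

```python
def split_pages(text: str, page_break: str = "\f") -> list:
    """Split text output into per-page strings, dropping empty pages.

    The default page_break of form feed is an assumption, not documented behaviour.
    """
    return [page.strip() for page in text.split(page_break) if page.strip()]


pages = split_pages("Invoice #2026-0142\n\fTerms and conditions\n")
```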

Markdown

format: "markdown"

Markdown-formatted text with headings, tables, and lists preserved as Markdown syntax. Ideal for LLM prompts.

Example Markdown output
# Invoice #2026-0142

**Date:** February 15, 2026

| Item | Qty | Price |
|------|-----|-------|
| API Credits | 1000 | $49.00 |
| Priority | 1 | $29.00 |

**Total:** $78.00

Common use cases

  • LLM prompt context (headings and tables stay readable as plain text)
  • Documentation generation
  • Content migration

XML

format: "xml"

Structured XML output with page, text, and metadata elements. Useful for systems that consume XML natively.

Example XML output
<?xml version="1.0" encoding="UTF-8"?>
<document>
  <metadata>
    <title>Invoice #2026-0142</title>
    <pageCount>1</pageCount>
  </metadata>
  <pages>
    <page number="1">
      <text>Invoice #2026-0142...</text>
    </page>
  </pages>
</document>

Common use cases

  • Enterprise system integrations
  • XSLT transformation pipelines
  • Legacy system compatibility
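
The XML output can be read with a standard XML parser. The sketch below uses Python's xml.etree.ElementTree against the example output above; the element and attribute names are taken from that example, and a real response may contain more elements than are shown there. page_texts is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Copy of the example XML output shown above, as bytes so the
# encoding declaration is honoured by the parser.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<document>
  <metadata>
    <title>Invoice #2026-0142</title>
    <pageCount>1</pageCount>
  </metadata>
  <pages>
    <page number="1">
      <text>Invoice #2026-0142...</text>
    </page>
  </pages>
</document>
"""


def page_texts(xml_bytes: bytes) -> dict:
    """Map each page number to the text of that page."""
    root = ET.fromstring(xml_bytes)
    return {
        int(page.get("number")): page.findtext("text") or ""
        for page in root.iterfind("./pages/page")
    }


root = ET.fromstring(SAMPLE)
title = root.findtext("metadata/title")
page_count = int(root.findtext("metadata/pageCount"))
texts = page_texts(SAMPLE)
```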

CSV

format: "csv"

Comma-separated values from the tables detected in the PDF. When a document contains several tables, they are output one after another. Best for PDFs with clear tabular data.

Example CSV output
Item,Qty,Price
API Credits,1000,$49.00
Priority Support,1,$29.00

Common use cases

  • Spreadsheet import (Excel, Google Sheets)
  • Database loading
  • Financial document processing
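
The CSV output can be loaded with any CSV reader. This sketch parses the example output above with Python's csv module and sums the Price column. It handles a single table with a header row; how consecutive tables are separated in the output is not specified on this page, so multi-table documents would need extra handling. csv_records is a hypothetical helper name.

```python
import csv
import io
from decimal import Decimal

# Copy of the example CSV output shown above.
SAMPLE = "Item,Qty,Price\nAPI Credits,1000,$49.00\nPriority Support,1,$29.00\n"


def csv_records(csv_text: str) -> list:
    """Parse one CSV table into a list of dicts keyed by the header row."""
    return list(csv.DictReader(io.StringIO(csv_text)))


records = csv_records(SAMPLE)

# Strip the currency symbol before converting; Decimal avoids float rounding.
total = sum(Decimal(row["Price"].lstrip("$")) for row in records)
```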

Encoded Formats

Encoded formats return the raw PDF file content in a transport-friendly encoding. Useful when you need the original file rather than extracted text.

Base64

format: "base64"

The raw PDF file content encoded as a Base64 string. Useful when you need the original PDF bytes embedded in a JSON payload or email.

Example Base64 output
JVBERi0xLjQKMSAwIG9iago8PAov
VHlwZSAvQ2F0YWxvZwovUGFnZXMg
MiAwIFIKPj4KZW5kb2JqCjIgMCAo...

Common use cases

  • Embedding PDFs in API responses
  • Email attachments
  • Systems that require Base64 input
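
To get the PDF bytes back, decode the Base64 string and confirm that the result starts with the PDF header. A minimal sketch using Python's standard base64 module; decode_pdf is a hypothetical helper name, and removing whitespace first is a precaution in case the string arrives wrapped across lines as in the example above.

```python
import base64


def decode_pdf(b64_text: str) -> bytes:
    """Decode Base64 text to bytes and check for the "%PDF-" header."""
    compact = "".join(b64_text.split())  # drop line breaks and spaces
    data = base64.b64decode(compact, validate=True)
    if not data.startswith(b"%PDF-"):
        raise ValueError("decoded data does not start with a PDF header")
    return data


# First 12 characters of the example output above, split across two lines.
header = decode_pdf("JVBERi0x\nLjQK")
```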

Binary

format: "binary"

The raw PDF file bytes. The result URL serves the original PDF as a binary download.

Example Binary output
(Binary PDF data — download via the presigned result URL)

Common use cases

  • PDF archival and storage
  • Re-serving PDFs that were downloaded as attachments
  • Proxying PDFs through your own system

Image Formats

Image formats render each page of the PDF as a raster image. Useful for previews, thumbnails, and visual processing.

PNG

format: "png"

High-quality rasterised images of each PDF page in PNG format. Compression is lossless, which keeps text edges sharp, so PNG is the best choice for text-heavy documents.

Example PNG output
(PNG image data — download via the presigned result URL)

Common use cases

  • Document thumbnails and previews
  • OCR pre-processing
  • Visual comparison and auditing

JPG

format: "jpg"

Rasterised page images in JPEG format. Smaller file sizes than PNG with lossy compression.

Example JPG output
(JPEG image data — download via the presigned result URL)

Common use cases

  • Web thumbnails where file size matters
  • Social media previews
  • Quick visual previews

WebP

format: "webp"

Rasterised page images in WebP format. WebP files are typically smaller than JPEG or PNG at comparable visual quality, which makes this the best balance of quality and file size for web display.

Example WebP output
(WebP image data — download via the presigned result URL)

Common use cases

  • Web applications optimised for performance
  • Mobile-friendly document previews
  • Progressive web apps
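
Binary and image results are downloaded as raw bytes, so it can help to confirm what was received before storing or serving it. The sketch below checks the standard file signatures for PDF, PNG, JPEG, and WebP. It assumes the payload is a single file; how multi-page image results are packaged is not described on this page. sniff_format is a hypothetical helper name, and its return values reuse the format values from this page.

```python
from typing import Optional


def sniff_format(data: bytes) -> Optional[str]:
    """Guess the PDFPipe format value from a payload's leading bytes."""
    if data.startswith(b"%PDF-"):
        return "binary"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"\xff\xd8\xff"):
        return "jpg"
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "webp"
    return None
```

A None result means the payload matched none of the four signatures and should not be treated as a valid download.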