Use Case

Extract Text from PDF API

Pull clean, plain text from any PDF URL with a single API call. Works with scanned documents, token-gated downloads, and JavaScript-triggered attachments - perfect for search indexing, NLP pipelines, and content migration.

Why is extracting text from PDFs so painful?

PDFs store text as positioned glyphs, not as readable paragraphs. Libraries like pdf.js give you character-level coordinates, but reconstructing reading order, handling multi-column layouts, and merging hyphenated words is on you. Scanned PDFs are even harder - they contain images, not text at all, requiring OCR to extract anything useful.

And if the PDF lives behind authentication, a redirect chain, or a JavaScript-triggered download, your backend can't even fetch the file in the first place. You end up stitching together HTTP clients, headless browsers, OCR engines, and text reconstruction logic.
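To make the text reconstruction part concrete, here is a minimal sketch of a naive do-it-yourself approach. It is an illustration only, not how PDFPipe works internally. The `{ text, x, y }` run shape and the `runsToText` helper are hypothetical; real PDF libraries expose different structures.

```javascript
// Hypothetical input: an array of { text, x, y } runs, with y growing down the page.
function runsToText(runs, lineTolerance = 2) {
  // Group runs into lines: runs whose y is within the tolerance of the
  // first run on the current line are treated as sharing that line.
  const byY = [...runs].sort((a, b) => a.y - b.y);
  const lines = [];
  for (const run of byY) {
    const last = lines[lines.length - 1];
    if (last && Math.abs(run.y - last.y) <= lineTolerance) {
      last.runs.push(run);
    } else {
      lines.push({ y: run.y, runs: [run] });
    }
  }

  // Within each line, read left to right.
  const textLines = lines.map((line) =>
    line.runs
      .sort((a, b) => a.x - b.x)
      .map((run) => run.text)
      .join(" ")
  );

  // Merge words hyphenated across a line break ("auto-\nmation" -> "automation").
  // Caveat: this also merges genuine compounds such as "well-\nknown".
  return textLines.join("\n").replace(/(\p{L})-\n(\p{L})/gu, "$1$2");
}
```

Even this much code ignores multi-column layouts (it would interleave the columns line by line), rotated text, tables, and scanned pages, which is where most of the real effort goes.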

PDFPipe handles all of it. Send us any URL, whether the PDF renders inline or triggers an automatic download, and get back clean, readable plain text. We handle layout analysis, reading order reconstruction, and OCR for scanned documents automatically.

How it works

One POST request. We handle the rest.

1. Send a request

curl
curl -X POST https://api.pdfpipe.dev/v1/convert \
  -H "Authorization: Bearer pk_live_..." \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/report.pdf",
    "format": "text",
    "returnMethod": "inline"
  }'

2. Get the response

JSON response
{
  "requestId": "req_01J9X7K2M...",
  "status": "complete",
  "format": "text",
  "pagesProcessed": 5,
  "creditsUsed": 1,
  "contentType": "text/plain",
  "content": "QUARTERLY FINANCIAL REPORT\nQ4 2025\n..."
}
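Before using the text, it can help to validate the payload. The sketch below relies only on the fields shown in the sample response above. The `readConvertResult` helper is hypothetical, and treating any status other than "complete" as a failure is an assumption, since other status values are not documented here.

```javascript
// Hypothetical helper: pulls the text out of a parsed /v1/convert response.
function readConvertResult(payload) {
  if (payload === null || typeof payload !== "object") {
    throw new TypeError("Expected a parsed JSON object");
  }
  // Assumption: only "complete" means the content is ready to use.
  if (payload.status !== "complete") {
    throw new Error(`Conversion not complete (status: ${payload.status})`);
  }
  // Assumption: with returnMethod "inline", the text arrives in `content`.
  if (typeof payload.content !== "string") {
    throw new Error("Response has no inline text content");
  }
  return {
    requestId: payload.requestId,
    text: payload.content,
    pages: payload.pagesProcessed,
    credits: payload.creditsUsed,
  };
}
```

Keeping this check separate from the HTTP call makes it easy to unit test against saved responses.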

Clean, readable plain text

PDFPipe reconstructs proper reading order from raw PDF coordinates. You get natural paragraphs, preserved line breaks, and clean formatting - ready to feed directly into your search index, NLP model, or content pipeline.

  • Reading order reconstruction across columns
  • OCR for scanned and image-based PDFs
  • Preserved paragraph and section structure
  • Handles multi-language documents
  • Works with token-gated and auto-download PDFs

Sample text output
QUARTERLY FINANCIAL REPORT
Q4 2025

Prepared by: Acme Corp
Date: January 15, 2026

Executive Summary

Revenue for Q4 2025 reached $4.2M, a 23% increase
over the previous quarter. Operating margins improved
to 18.5%, driven by reduced infrastructure costs and
increased automation across the fulfillment pipeline.

Key Metrics
- Revenue: $4,200,000
- Operating Margin: 18.5%
- Customer Acquisition Cost: $142
- Monthly Active Users: 52,400

Node.js
const response = await fetch(
  "https://api.pdfpipe.dev/v1/convert",
  {
    method: "POST",
    headers: {
      "Authorization": "Bearer pk_live_...",
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      url: "https://example.com/report.pdf",
      format: "text",
      returnMethod: "inline",
    }),
  }
);

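// Fail early on HTTP errors instead of parsing an error body as a result
if (!response.ok) {
  throw new Error(`PDFPipe request failed: HTTP ${response.status}`);
}

const data = await response.json();
const text = data.content;

// Split into non-empty lines before feeding your NLP pipeline
const lines = text.split(/\n+/).filter(Boolean);
console.log(`Extracted ${lines.length} lines`);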
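For search indexing, extracted text is often split into paragraph-sized chunks before it is sent to the index. The following is a minimal sketch of one way to do that. The `chunkText` helper and the 500-character default are illustrative assumptions, not part of the PDFPipe API.

```javascript
// Hypothetical helper: groups blank-line-separated paragraphs into chunks
// of at most maxChars characters where possible.
function chunkText(text, maxChars = 500) {
  const paragraphs = text
    .split(/\n{2,}/)
    .map((p) => p.trim())
    .filter(Boolean);

  const chunks = [];
  let current = "";
  for (const paragraph of paragraphs) {
    // +2 accounts for the blank line used to rejoin paragraphs.
    if (current && current.length + paragraph.length + 2 > maxChars) {
      chunks.push(current);
      current = paragraph;
    } else {
      current = current ? `${current}\n\n${paragraph}` : paragraph;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```

A single paragraph longer than `maxChars` is kept whole as its own oversized chunk, so add a sentence-level split if your index enforces a hard limit.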

Works with any language

PDFPipe is a standard REST API. If your language can make HTTP requests, it can extract text from PDFs. No SDKs required - though we provide them for convenience.

JavaScript, Python, Go, Ruby, PHP, Java, C#, curl

Start extracting text from PDFs today

Free tier includes 10 requests per month. No credit card required.