Content Extraction API
The Content Extraction API provides AI-powered tools to extract and analyze content from various file types:
- PDF Extraction - Extract text content and metadata from PDF documents
- Image Analysis - Generate descriptions, captions, and alt text for images using AI
Both APIs accept either of two input sources:
- Public URLs
- Base64-encoded data
PDF Extraction
Extract text content from PDF documents for indexing, analysis, or content management.
Basic PDF Extraction

    query ExtractPDF {
      pdf {
        extract(input: { source: "https://example.com/document.pdf" }) {
          source
          content
          pageCount
          metadata
          processedAt
        }
      }
    }

PDF Extraction with Base64

    query ExtractPDFFromBase64 {
      pdf {
        extract(
          input: {
            source: "data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKMSAwIG9iago8..."
            meta: {
              action: "extract"
              system: "content-management"
              source: "upload-form"
            }
          }
        ) {
          source
          content
          pageCount
          metadata
          processedAt
        }
      }
    }

PDF Extraction Response
The extraction returns:
- source - The source that was processed (URL, file path, or “base64”)
- content - Extracted text content from the PDF
- pageCount - Number of pages in the PDF
- metadata - Document metadata (file size, modification time, etc.)
- processedAt - Timestamp when the extraction was processed
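The fields above can be read from a response roughly as follows. This is a minimal sketch, assuming the API returns the standard GraphQL JSON envelope (a top-level `data` object that mirrors the query, plus an optional `errors` array) and that all five fields were selected. The helper name `read_pdf_extraction` and the sample values are hypothetical.

```python
from typing import Any


def read_pdf_extraction(response: dict[str, Any]) -> dict[str, Any]:
    """Return the `pdf.extract` payload from a GraphQL response body."""
    # GraphQL reports failures in a top-level "errors" array.
    errors = response.get("errors")
    if errors:
        messages = "; ".join(str(e.get("message", e)) for e in errors)
        raise RuntimeError(f"PDF extraction failed: {messages}")

    # The data object mirrors the query shape: pdf -> extract.
    result = response["data"]["pdf"]["extract"]
    return {
        "source": result["source"],
        "content": result["content"],
        "page_count": result["pageCount"],
        # metadata may be null, so fall back to an empty dict.
        "metadata": result.get("metadata") or {},
        "processed_at": result["processedAt"],
    }


# Made-up response, used only to illustrate the expected shape.
sample = {
    "data": {
        "pdf": {
            "extract": {
                "source": "https://example.com/document.pdf",
                "content": "Hello PDF",
                "pageCount": 2,
                "metadata": None,
                "processedAt": "2024-01-01T00:00:00Z",
            }
        }
    }
}

extraction = read_pdf_extraction(sample)
print(extraction["page_count"], extraction["metadata"])
```

Raising on a non-empty `errors` array keeps partial or failed extractions from being processed as if they had succeeded.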
Common Use Cases
Index PDF content for search:

    query ExtractPDFForIndexing {
      pdf {
        extract(input: { source: "https://example.com/whitepaper.pdf" }) {
          content
          pageCount
        }
      }
    }

After extraction, you can index the content using the Index API.
Image Analysis
Generate AI-powered descriptions, captions, and alt text for images to improve accessibility and SEO.
Basic Image Analysis

    query AnalyzeImage {
      image {
        analyze(input: { source: "https://example.com/photo.jpg" }) {
          source
          description
          caption
          altText
          processedAt
        }
      }
    }

Image Analysis with Base64

    query AnalyzeImageFromBase64 {
      image {
        analyze(
          input: {
            source: "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAYABgAAD..."
            meta: { action: "analyze", system: "media-library", source: "upload" }
          }
        ) {
          source
          description
          caption
          altText
          processedAt
        }
      }
    }

Image Analysis Response
The analysis returns:
- source - The source that was analyzed (URL, file path, or “base64”)
- description - Detailed description of the image content
- caption - Concise caption suitable for social media or articles
- altText - Brief alt text for accessibility and screen readers
- processedAt - Timestamp when the analysis was processed
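One way to use the returned `altText` is to place it in the `alt` attribute of an `<img>` tag. The sketch below is illustrative only: `render_img` is a hypothetical helper, not part of the API, and it assumes the alt text arrives as plain text that must be escaped before being written into HTML.

```python
import html


def render_img(src: str, alt_text: str) -> str:
    """Build an <img> tag, escaping both attribute values for HTML output."""
    safe_src = html.escape(src, quote=True)
    safe_alt = html.escape(alt_text, quote=True)
    return f'<img src="{safe_src}" alt="{safe_alt}">'


# Made-up alt text containing characters that need escaping.
print(render_img("https://example.com/photo.jpg", 'A "red" bike & helmet'))
# prints: <img src="https://example.com/photo.jpg" alt="A &quot;red&quot; bike &amp; helmet">
```

Escaping matters here because generated text can contain quotes or ampersands that would otherwise break the attribute.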
Common Use Cases
Generate alt text for accessibility:

    query GenerateAltText {
      image {
        analyze(input: { source: "https://example.com/product-image.jpg" }) {
          altText
        }
      }
    }

Create social media captions:

    query GenerateSocialCaption {
      image {
        analyze(input: { source: "https://example.com/marketing-photo.jpg" }) {
          caption
          description
        }
      }
    }

API Reference
PDFExtractionInput
| Field | Type | Description |
|---|---|---|
| source | String! | PDF source - can be a URL or base64 encoded data (required) | 
| meta | MetaInput | Optional metadata for logging | 
PDFExtractionResult
| Field | Type | Description |
|---|---|---|
| source | String! | The source that was processed | 
| content | String! | Extracted text content from the PDF | 
| pageCount | Int! | Number of pages in the PDF | 
| metadata | Map | Document metadata (file size, modification time, etc.) | 
| processedAt | DateTime! | Timestamp when the extraction was processed | 
ImageAnalysisInput
| Field | Type | Description |
|---|---|---|
| source | String! | Image source - can be a URL or base64 encoded data (required) | 
| meta | MetaInput | Optional metadata for logging | 
ImageAnalysisResult
| Field | Type | Description |
|---|---|---|
| source | String! | The source that was analyzed | 
| description | String! | Detailed description of the image content | 
| caption | String! | Concise caption suitable for social media or articles | 
| altText | String! | Brief alt text for accessibility and screen readers | 
| processedAt | DateTime! | Timestamp when the analysis was processed | 
MetaInput
Optional metadata input for logging.
| Field | Type | Description | 
|---|---|---|
| action | String | The action performed, e.g. extract or analyze |
| system | String | The requester system name | 
| source | String | The requester hostname | 
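The input types above can also be passed as GraphQL variables instead of inline literals. The following is a sketch under assumptions: it presumes the `extract` field's `input` argument is of type `PDFExtractionInput` and that requests are sent as a JSON body with `query` and `variables` keys, which is the common GraphQL-over-HTTP convention. The endpoint is not shown, and `build_extract_request` is a hypothetical helper name.

```python
import json
from typing import Optional

# Query text using a variable; assumes `extract` accepts a PDFExtractionInput.
EXTRACT_PDF_QUERY = """
query ExtractPDF($input: PDFExtractionInput!) {
  pdf {
    extract(input: $input) {
      source
      content
      pageCount
      metadata
      processedAt
    }
  }
}
"""


def build_extract_request(source: str, meta: Optional[dict] = None) -> str:
    """Serialize a JSON request body for a PDF extraction query."""
    variables: dict = {"input": {"source": source}}
    if meta:
        # Keep only the three documented MetaInput fields that were provided.
        allowed = ("action", "system", "source")
        variables["input"]["meta"] = {
            key: meta[key] for key in allowed if meta.get(key) is not None
        }
    return json.dumps({"query": EXTRACT_PDF_QUERY, "variables": variables})


body = build_extract_request(
    "https://example.com/document.pdf",
    meta={"action": "extract", "system": "content-management"},
)
print(json.loads(body)["variables"])
```

Using variables keeps long base64 sources out of the query text and avoids manual string escaping.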
Input Source Formats
All extraction APIs support two input source formats:
Public URL

    source: "https://example.com/file.pdf"
    source: "https://example.com/image.jpg"

Base64 Encoded

    source: "data:application/pdf;base64,JVBERi0xLjQK..."
    source: "data:image/jpeg;base64,/9j/4AAQSkZJRg..."

For base64 encoding, include the data URI scheme with the appropriate MIME type.
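A data URI in this format can be built from raw file bytes with Python's standard library. This is a minimal sketch; `to_data_uri` is a hypothetical helper name, and the caller is assumed to know the correct MIME type for the file.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI with the given MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The first bytes of a PDF 1.4 file encode to the familiar "JVBERi0x..." prefix.
print(to_data_uri(b"%PDF-1.4\n", "application/pdf"))
# prints: data:application/pdf;base64,JVBERi0xLjQK
```

In practice the bytes would come from reading the file in binary mode, and the resulting string is passed as the `source` value.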