Content Extraction API
The Content Extraction API provides AI-powered tools to extract and analyze content from various file types:
- PDF Extraction - Extract text content and metadata from PDF documents
- Image Analysis - Generate descriptions, captions, and alt text for images using AI
Both APIs accept the same two kinds of input source:
- Public URLs
- Base64-encoded data
PDF Extraction
Extract text content from PDF documents for indexing, analysis, or content management.
Basic PDF Extraction
query ExtractPDF {
  pdf {
    extract(input: { source: "https://example.com/document.pdf" }) {
      source
      content
      pageCount
      metadata
      processedAt
    }
  }
}
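A minimal sketch of how a client might package this query as a GraphQL-over-HTTP request. The endpoint URL, the absence of authentication headers, and the helper name `build_extract_request` are assumptions for illustration; only the query text comes from the example above.

```python
import json
import urllib.request

# Hypothetical endpoint: replace with the GraphQL URL of your deployment.
ENDPOINT = "https://api.example.com/graphql"

EXTRACT_PDF_QUERY = """
query ExtractPDF {
  pdf {
    extract(input: { source: "https://example.com/document.pdf" }) {
      source
      content
      pageCount
      metadata
      processedAt
    }
  }
}
"""


def build_extract_request(endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying the query as JSON."""
    body = json.dumps({"query": EXTRACT_PDF_QUERY}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_extract_request()
# Sending is left to the caller, e.g. urllib.request.urlopen(request).
```

The request is built separately from sending it, so credentials or a timeout can be added before the network call.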
PDF Extraction with Base64
query ExtractPDFFromBase64 {
  pdf {
    extract(input: {
      source: "data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKMSAwIG9iago8..."
      meta: {
        action: "extract"
        system: "content-management"
        source: "upload-form"
      }
    }) {
      source
      content
      pageCount
      metadata
      processedAt
    }
  }
}
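The `source` value in this query is a data URI. As a rough sketch, one way to produce such a value from raw file bytes is shown below; the helper name `to_data_uri` is hypothetical and not part of the API.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI with the given MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# A PDF file starts with the "%PDF-" signature; these nine bytes stand in
# for a real document read from disk or an upload form.
pdf_source = to_data_uri(b"%PDF-1.4\n", "application/pdf")
print(pdf_source)  # data:application/pdf;base64,JVBERi0xLjQK
```

The same helper would cover images by passing a MIME type such as `image/jpeg`.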
PDF Extraction Response
The extraction returns:
- source - The source that was processed (URL, file path, or “base64”)
- content - Extracted text content from the PDF
- pageCount - Number of pages in the PDF
- metadata - Document metadata (file size, modification time, etc.)
- processedAt - Timestamp when the extraction was processed
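The fields listed above could be read from a decoded response roughly as follows. This sketch relies on the standard GraphQL response envelope (`data` mirroring the query shape, plus an optional `errors` list). The sample values, the `PDFExtraction` class, and the assumption that `metadata` arrives as a JSON object are all illustrative, not documented behaviour.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class PDFExtraction:
    source: str
    content: str
    page_count: int
    metadata: Any      # "Map" in the schema; assumed to arrive as a JSON object or null
    processed_at: str  # kept as a raw string; the DateTime wire format is not stated


def parse_extraction(response: dict) -> PDFExtraction:
    """Pull the pdf.extract payload out of a decoded GraphQL response."""
    if response.get("errors"):
        raise RuntimeError(f"GraphQL errors: {response['errors']}")
    result = response["data"]["pdf"]["extract"]
    return PDFExtraction(
        source=result["source"],
        content=result["content"],
        page_count=result["pageCount"],
        metadata=result.get("metadata"),
        processed_at=result["processedAt"],
    )


# Hypothetical response; every value below is invented for illustration.
sample_response = {
    "data": {
        "pdf": {
            "extract": {
                "source": "https://example.com/document.pdf",
                "content": "Example text",
                "pageCount": 3,
                "metadata": {"fileSize": 1024},
                "processedAt": "2024-01-01T00:00:00Z",
            }
        }
    }
}

extraction = parse_extraction(sample_response)
print(extraction.page_count, len(extraction.content))
```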
Common Use Cases
Index PDF content for search:
query ExtractPDFForIndexing {
  pdf {
    extract(input: { source: "https://example.com/whitepaper.pdf" }) {
      content
      pageCount
    }
  }
}
After extraction, you can index the content using the Index API.
Image Analysis
Generate AI-powered descriptions, captions, and alt text for images to improve accessibility and SEO.
Basic Image Analysis
query AnalyzeImage {
  image {
    analyze(input: { source: "https://example.com/photo.jpg" }) {
      source
      description
      caption
      altText
      processedAt
    }
  }
}
Image Analysis with Base64
query AnalyzeImageFromBase64 {
  image {
    analyze(input: {
      source: "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAYABgAAD..."
      meta: {
        action: "analyze"
        system: "media-library"
        source: "upload"
      }
    }) {
      source
      description
      caption
      altText
      processedAt
    }
  }
}
Image Analysis Response
The analysis returns:
- source - The source that was analyzed (URL, file path, or “base64”)
- description - Detailed description of the image content
- caption - Concise caption suitable for social media or articles
- altText - Brief alt text for accessibility and screen readers
- processedAt - Timestamp when the analysis was processed
Common Use Cases
Generate alt text for accessibility:
query GenerateAltText {
  image {
    analyze(input: { source: "https://example.com/product-image.jpg" }) {
      altText
    }
  }
}
Create social media captions:
query GenerateSocialCaption {
  image {
    analyze(input: { source: "https://example.com/marketing-photo.jpg" }) {
      caption
      description
    }
  }
}
API Reference
PDFExtractionInput
Field | Type | Description |
---|---|---|
source | String! | PDF source - can be a URL or base64 encoded data (required) |
meta | MetaInput | Optional metadata for logging |
PDFExtractionResult
Field | Type | Description |
---|---|---|
source | String! | The source that was processed |
content | String! | Extracted text content from the PDF |
pageCount | Int! | Number of pages in the PDF |
metadata | Map | Document metadata (file size, modification time, etc.) |
processedAt | DateTime! | Timestamp when the extraction was processed |
ImageAnalysisInput
Field | Type | Description |
---|---|---|
source | String! | Image source - can be a URL or base64 encoded data (required) |
meta | MetaInput | Optional metadata for logging |
ImageAnalysisResult
Field | Type | Description |
---|---|---|
source | String! | The source that was analyzed |
description | String! | Detailed description of the image content |
caption | String! | Concise caption suitable for social media or articles |
altText | String! | Brief alt text for accessibility and screen readers |
processedAt | DateTime! | Timestamp when the analysis was processed |
MetaInput
Optional metadata input, used for logging.
Field | Type | Description |
---|---|---|
action | String | Action performed, e.g. extract or analyze |
system | String | Name of the requesting system |
source | String | Hostname of the requester |
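Because every `MetaInput` field is optional, a client may want to send only the ones it has. The sketch below builds a GraphQL variables object that way. It assumes the `input` argument of `pdf.extract` is declared with the `PDFExtractionInput` type from the reference above. The helper `build_input` and its parameter names are hypothetical.

```python
from typing import Optional

# Query using a variable instead of an inline input literal.
# Assumption: the `input` argument is typed PDFExtractionInput!.
EXTRACT_WITH_VARIABLES = """
query ExtractPDF($input: PDFExtractionInput!) {
  pdf {
    extract(input: $input) {
      content
      pageCount
    }
  }
}
"""


def build_input(
    source: str,
    action: Optional[str] = None,
    system: Optional[str] = None,
    meta_source: Optional[str] = None,
) -> dict:
    """Build the input object, omitting meta fields that were not provided."""
    meta = {
        key: value
        for key, value in {
            "action": action,
            "system": system,
            "source": meta_source,
        }.items()
        if value is not None
    }
    payload: dict = {"source": source}
    if meta:  # leave `meta` out entirely when nothing was supplied
        payload["meta"] = meta
    return payload


variables = {
    "input": build_input(
        "https://example.com/document.pdf",
        action="extract",
        system="content-management",
    )
}
print(variables)
```

Passing the input as a variable also keeps long base64 payloads out of the query string itself.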
Input Source Formats
All extraction APIs support two input source formats:
Public URL
source: "https://example.com/file.pdf"
source: "https://example.com/image.jpg"
Base64 Encoded
source: "data:application/pdf;base64,JVBERi0xLjQK..."
source: "data:image/jpeg;base64,/9j/4AAQSkZJRg..."
For base64 encoding, include the data URI scheme with the appropriate MIME type.
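A client can check which of the two formats a `source` string uses before sending it. The following is a sketch under assumptions: the function name `classify_source` is hypothetical, and treating both `http://` and `https://` as acceptable URL schemes is a guess, since the documentation only shows `https://` examples.

```python
import base64
import binascii


def classify_source(source: str) -> str:
    """Return "url" or "base64" for a source string, or raise ValueError."""
    # Assumption: both http and https URLs are accepted by the API.
    if source.startswith(("http://", "https://")):
        return "url"

    # A base64 data URI looks like: data:<type>/<subtype>;base64,<payload>
    prefix, separator, payload = source.partition(";base64,")
    if separator and prefix.startswith("data:") and "/" in prefix[len("data:"):]:
        try:
            base64.b64decode(payload, validate=True)
        except binascii.Error as exc:
            raise ValueError("payload is not valid base64") from exc
        return "base64"

    raise ValueError("source must be a URL or a base64 data URI")


print(classify_source("https://example.com/file.pdf"))
print(classify_source("data:application/pdf;base64,JVBERi0xLjQK"))
```

Rejecting malformed input on the client avoids a round trip for values the service could not process anyway.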