Content Extraction API

The Content Extraction API provides AI-powered tools to extract and analyze content from PDF documents and images:

  • PDF Extraction - Extract text content and metadata from PDF documents
  • Image Analysis - Generate descriptions, captions, and alt text for images using AI

Both APIs accept input from either of two sources:

  • Public URLs
  • Base64-encoded data

PDF Extraction

Extract text content from PDF documents for indexing, analysis, or content management.

query ExtractPDF {
  pdf {
    extract(input: {
      source: "https://example.com/document.pdf"
    }) {
      source
      content
      pageCount
      metadata
      processedAt
    }
  }
}

query ExtractPDFFromBase64 {
  pdf {
    extract(input: {
      source: "data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKMSAwIG9iago8..."
      meta: {
        action: "extract"
        system: "content-management"
        source: "upload-form"
      }
    }) {
      source
      content
      pageCount
      metadata
      processedAt
    }
  }
}
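
A client would typically send these queries as an HTTP POST with a JSON body, which is the common GraphQL-over-HTTP convention. The Python sketch below shows one way to assemble such a request. It rests on assumptions this page does not confirm: the endpoint URL is a placeholder, authentication headers are left out, and `graphql_string`, `build_extract_query`, and `build_request` are hypothetical helper names.

```python
import json
import urllib.request
from typing import Dict, Optional

# Placeholder endpoint (an assumption); replace with your actual GraphQL URL.
GRAPHQL_ENDPOINT = "https://api.example.com/graphql"


def graphql_string(value: str) -> str:
    """Quote a string as a GraphQL string literal.

    JSON escaping is valid GraphQL escaping for ASCII input such as URLs
    and data URIs, so json.dumps is reused here.
    """
    return json.dumps(value)


def build_extract_query(source: str, meta: Optional[Dict[str, str]] = None) -> str:
    """Return a pdf.extract query with the input values inlined."""
    fields = [f"source: {graphql_string(source)}"]
    if meta:
        pairs = ", ".join(f"{key}: {graphql_string(val)}" for key, val in meta.items())
        fields.append(f"meta: {{ {pairs} }}")
    return (
        "query ExtractPDF { pdf { extract(input: { "
        + ", ".join(fields)
        + " }) { source content pageCount metadata processedAt } } }"
    )


def build_request(
    source: str,
    meta: Optional[Dict[str, str]] = None,
    endpoint: str = GRAPHQL_ENDPOINT,
) -> urllib.request.Request:
    """Wrap the query in a JSON POST request (not sent until urlopen is called)."""
    body = json.dumps({"query": build_extract_query(source, meta)}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result of `build_request(...)` to `urllib.request.urlopen` would send the query. The same pattern should carry over to `image.analyze` by swapping the field names.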

The extraction returns:

  • source - The source that was processed (URL, file path, or “base64”)
  • content - Extracted text content from the PDF
  • pageCount - Number of pages in the PDF
  • metadata - Document metadata (file size, modification time, etc.)
  • processedAt - Timestamp when the extraction was processed
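
As a rough illustration of reading these fields on the client, the Python sketch below unpacks a decoded GraphQL response. It assumes the standard GraphQL response envelope (`data` and `errors` keys) with the result nested under `data.pdf.extract`, mirroring the query shape. `read_extraction` is a hypothetical helper name.

```python
from typing import Any, Dict


def read_extraction(response: Dict[str, Any]) -> Dict[str, Any]:
    """Pull the pdf.extract result out of a decoded GraphQL response."""
    if response.get("errors"):
        messages = "; ".join(
            err.get("message", "unknown error") for err in response["errors"]
        )
        raise RuntimeError(f"GraphQL request failed: {messages}")

    result = response["data"]["pdf"]["extract"]
    return {
        "source": result["source"],
        "content": result["content"],
        "page_count": result["pageCount"],
        # Treat a missing or null metadata value as an empty mapping.
        "metadata": result.get("metadata") or {},
        "processed_at": result["processedAt"],
    }
```

Normalizing a missing `metadata` value to an empty dictionary keeps downstream code from having to special-case it.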

Index PDF content for search:

query ExtractPDFForIndexing {
  pdf {
    extract(input: {
      source: "https://example.com/whitepaper.pdf"
    }) {
      content
      pageCount
    }
  }
}

After extraction, you can index the content using the Index API.

Image Analysis

Generate AI-powered descriptions, captions, and alt text for images to improve accessibility and SEO.

query AnalyzeImage {
  image {
    analyze(input: {
      source: "https://example.com/photo.jpg"
    }) {
      source
      description
      caption
      altText
      processedAt
    }
  }
}

query AnalyzeImageFromBase64 {
  image {
    analyze(input: {
      source: "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAYABgAAD..."
      meta: {
        action: "analyze"
        system: "media-library"
        source: "upload"
      }
    }) {
      source
      description
      caption
      altText
      processedAt
    }
  }
}

The analysis returns:

  • source - The source that was analyzed (URL, file path, or “base64”)
  • description - Detailed description of the image content
  • caption - Concise caption suitable for social media or articles
  • altText - Brief alt text for accessibility and screen readers
  • processedAt - Timestamp when the analysis was processed

Generate alt text for accessibility:

query GenerateAltText {
  image {
    analyze(input: {
      source: "https://example.com/product-image.jpg"
    }) {
      altText
    }
  }
}
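
Once `altText` comes back, it still has to be escaped before it is placed in markup, since generated text may contain quotes or ampersands. The Python sketch below shows one way to do that. `img_tag` is a hypothetical helper, and the sample alt text in the usage note is made up for illustration.

```python
from html import escape


def img_tag(src: str, alt_text: str) -> str:
    """Build an <img> element with attribute values safely escaped."""
    return f'<img src="{escape(src, quote=True)}" alt="{escape(alt_text, quote=True)}">'
```

For example, `img_tag("https://example.com/product-image.jpg", 'Red "trail" shoe & box')` yields an `alt` attribute of `Red &quot;trail&quot; shoe &amp; box`.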

Create social media captions:

query GenerateSocialCaption {
  image {
    analyze(input: {
      source: "https://example.com/marketing-photo.jpg"
    }) {
      caption
      description
    }
  }
}

PDF extraction input

Field        Type       Description
source       String!    PDF source: a URL or base64-encoded data (required)
meta         MetaInput  Optional metadata for logging

PDF extraction result

Field        Type       Description
source       String!    The source that was processed
content      String!    Extracted text content from the PDF
pageCount    Int!       Number of pages in the PDF
metadata     Map        Document metadata (file size, modification time, etc.)
processedAt  DateTime!  Timestamp when the extraction was processed

Image analysis input

Field        Type       Description
source       String!    Image source: a URL or base64-encoded data (required)
meta         MetaInput  Optional metadata for logging

Image analysis result

Field        Type       Description
source       String!    The source that was analyzed
description  String!    Detailed description of the image content
caption      String!    Concise caption suitable for social media or articles
altText      String!    Brief alt text for accessibility and screen readers
processedAt  DateTime!  Timestamp when the analysis was processed

MetaInput

Optional metadata input for logging.

Field        Type       Description
action       String     Performed action, e.g. extract or analyze
system       String     The requester system name
source       String     The requester hostname

All extraction APIs support two input source formats:

Public URL:

  source: "https://example.com/file.pdf"
  source: "https://example.com/image.jpg"

Base64-encoded data:

  source: "data:application/pdf;base64,JVBERi0xLjQK..."
  source: "data:image/jpeg;base64,/9j/4AAQSkZJRg..."

Base64 input must be passed as a data URI: prefix the encoded data with the data: scheme and the appropriate MIME type, followed by ;base64, as in the examples above.
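
A data URI of this form can be built from raw file bytes with Python's standard library. The sketch below is a minimal illustration. `to_data_uri` is a hypothetical helper name, and the caller is assumed to know the correct MIME type.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI with the given MIME type."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The first bytes of a PDF header encode to the prefix used in the examples.
print(to_data_uri(b"%PDF-1.4\n", "application/pdf"))
# -> data:application/pdf;base64,JVBERi0xLjQK
```

In practice the bytes would come from reading the file in binary mode, for instance `open(path, "rb").read()`.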