Content Extraction API

The Content Extraction API provides AI-powered tools to extract and analyze content from various file types:

PDF Extraction - Extract text content and metadata from PDF documents
Image Analysis - Generate descriptions, captions, and alt text for images using AI

Both APIs accept two kinds of input source:

Public URLs
Base64-encoded data

PDF Extraction

Extract text content from PDF documents for indexing, analysis, or content management.

Basic PDF Extraction

query ExtractPDF {
  pdf {
    extract(input: { source: "https://example.com/document.pdf" }) {
      source
      content
      pageCount
      metadata
      processedAt
    }
  }
}
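A minimal sketch of how a client might package this query for an HTTP request, assuming the API follows the common GraphQL-over-HTTP convention of a JSON body with `query` and `variables` keys. The endpoint URL and the `build_payload` helper are hypothetical; the `PDFExtractionInput` type name comes from the API Reference section.

```python
import json

# Hypothetical endpoint; replace with the GraphQL URL of your deployment.
ENDPOINT = "https://api.example.com/graphql"

EXTRACT_PDF = """
query ExtractPDF($input: PDFExtractionInput!) {
  pdf {
    extract(input: $input) {
      source
      content
      pageCount
      metadata
      processedAt
    }
  }
}
"""


def build_payload(source, meta=None):
    """Return the JSON request body for an ExtractPDF query."""
    pdf_input = {"source": source}
    if meta is not None:
        # meta is optional and only used for logging.
        pdf_input["meta"] = meta
    return json.dumps({"query": EXTRACT_PDF, "variables": {"input": pdf_input}})


body = build_payload("https://example.com/document.pdf")
```

The resulting body can be POSTed to the endpoint with any HTTP client using `Content-Type: application/json`. Passing the input as a variable avoids escaping long strings inside the query text.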

PDF Extraction with Base64

query ExtractPDFFromBase64 {
  pdf {
    extract(
      input: {
        source: "data:application/pdf;base64,JVBERi0xLjQKJeLjz9MKMSAwIG9iago8..."
        meta: {
          action: "extract"
          system: "content-management"
          source: "upload-form"
        }
      }
    ) {
      source
      content
      pageCount
      metadata
      processedAt
    }
  }
}
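The `source` value above is a data URI. As a rough illustration, it can be produced from raw file bytes with Python's standard `base64` module; the `to_data_uri` helper name is hypothetical.

```python
import base64


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Encode raw bytes as a base64 data URI suitable for the `source` field."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


# The first bytes of a PDF file ("%PDF-1.4" plus a newline) encode to the
# same prefix as the example query.
print(to_data_uri(b"%PDF-1.4\n", "application/pdf"))
# → data:application/pdf;base64,JVBERi0xLjQK
```

For a real document, read the file in binary mode (`open(path, "rb").read()`) and pass the bytes to the helper.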

PDF Extraction Response

The extraction returns:

source - The source that was processed (URL, file path, or “base64”)
content - Extracted text content from the PDF
pageCount - Number of pages in the PDF
metadata - Document metadata (file size, modification time, etc.)
processedAt - Timestamp when the extraction was processed
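The fields above could be mapped onto a small typed record on the client side. This sketch assumes the standard GraphQL response envelope (`data` and optional `errors`) with the result nested under `pdf.extract`, mirroring the query. The class and function names are hypothetical, and the sample values are made up.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class PDFExtraction:
    source: str
    content: str
    page_count: int
    metadata: Optional[dict]
    processed_at: str  # kept as a string; the timestamp format is not specified here


def parse_extraction(response: dict) -> PDFExtraction:
    """Pull the `pdf.extract` payload out of a GraphQL response body."""
    if response.get("errors"):
        raise RuntimeError(f"GraphQL errors: {response['errors']}")
    result = response["data"]["pdf"]["extract"]
    return PDFExtraction(
        source=result["source"],
        content=result["content"],
        page_count=result["pageCount"],
        metadata=result.get("metadata"),
        processed_at=result["processedAt"],
    )


# Illustrative response only; every value here is invented.
sample = {
    "data": {
        "pdf": {
            "extract": {
                "source": "https://example.com/document.pdf",
                "content": "Hello",
                "pageCount": 1,
                "metadata": {"fileSize": 1024},
                "processedAt": "2024-01-01T00:00:00Z",
            }
        }
    }
}
extraction = parse_extraction(sample)
```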

Common Use Cases

Index PDF content for search:

query ExtractPDFForIndexing {
  pdf {
    extract(input: { source: "https://example.com/whitepaper.pdf" }) {
      content
      pageCount
    }
  }
}

After extraction, you can index the content using the Index API.

Image Analysis

Generate AI-powered descriptions, captions, and alt text for images to improve accessibility and SEO.

Basic Image Analysis

query AnalyzeImage {
  image {
    analyze(input: { source: "https://example.com/photo.jpg" }) {
      source
      description
      caption
      altText
      processedAt
    }
  }
}

Image Analysis with Base64

query AnalyzeImageFromBase64 {
  image {
    analyze(
      input: {
        source: "data:image/jpeg;base64,/9j/4AAQSkZJRgABAQEAYABgAAD..."
        meta: { action: "analyze", system: "media-library", source: "upload" }
      }
    ) {
      source
      description
      caption
      altText
      processedAt
    }
  }
}

Image Analysis Response

The analysis returns:

source - The source that was analyzed (URL, file path, or “base64”)
description - Detailed description of the image content
caption - Concise caption suitable for social media or articles
altText - Brief alt text for accessibility and screen readers
processedAt - Timestamp when the analysis was processed

Common Use Cases

Generate alt text for accessibility:

query GenerateAltText {
  image {
    analyze(input: { source: "https://example.com/product-image.jpg" }) {
      altText
    }
  }
}

Create social media captions:

query GenerateSocialCaption {
  image {
    analyze(input: { source: "https://example.com/marketing-photo.jpg" }) {
      caption
      description
    }
  }
}
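When many images need alt text, GraphQL aliases allow several `analyze` calls in one request. The sketch below builds such a query; `build_batch_query` is a hypothetical helper, and whether the service permits or rate-limits multiple analyses per request is an assumption to verify against your deployment.

```python
FIELDS = "altText caption"


def build_batch_query(urls):
    """Build one query that analyzes several images, one alias per image."""
    if not urls:
        raise ValueError("at least one URL is required")
    count = len(urls)
    # One variable per image, typed with the documented ImageAnalysisInput.
    var_defs = ", ".join(f"$img{i}: ImageAnalysisInput!" for i in range(count))
    # One aliased analyze field per image so results do not collide.
    selections = "\n".join(
        f"    img{i}: analyze(input: $img{i}) {{ {FIELDS} }}" for i in range(count)
    )
    query = f"query AnalyzeImages({var_defs}) {{\n  image {{\n{selections}\n  }}\n}}"
    variables = {f"img{i}": {"source": url} for i, url in enumerate(urls)}
    return query, variables


query, variables = build_batch_query(
    ["https://example.com/a.jpg", "https://example.com/b.jpg"]
)
print(query)
```

Each result then appears in the response under its alias (`img0`, `img1`, and so on) inside `image`.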

API Reference

PDFExtractionInput

Field	Type	Description
source	String!	PDF source: a URL or a base64-encoded data URI (required)
meta	MetaInput	Optional metadata for logging

PDFExtractionResult

Field	Type	Description
source	String!	The source that was processed
content	String!	Extracted text content from the PDF
pageCount	Int!	Number of pages in the PDF
metadata	Map	Document metadata (file size, modification time, etc.)
processedAt	DateTime!	Timestamp when the extraction was processed

ImageAnalysisInput

Field	Type	Description
source	String!	Image source: a URL or a base64-encoded data URI (required)
meta	MetaInput	Optional metadata for logging

ImageAnalysisResult

Field	Type	Description
source	String!	The source that was analyzed
description	String!	Detailed description of the image content
caption	String!	Concise caption suitable for social media or articles
altText	String!	Brief alt text for accessibility and screen readers
processedAt	DateTime!	Timestamp when the analysis was processed

MetaInput

Optional metadata attached to a request for logging purposes.

Field	Type	Description
action	String	Action performed, e.g. extract or analyze
system	String	The requester system name
source	String	The requester hostname

Input Source Formats

All extraction APIs support two input source formats:

URL

source: "https://example.com/file.pdf"
source: "https://example.com/image.jpg"

Base64 Encoded

source: "data:application/pdf;base64,JVBERi0xLjQK..."
source: "data:image/jpeg;base64,/9j/4AAQSkZJRg..."

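Base64 input must be passed as a data URI: the data: scheme, the appropriate MIME type (for example application/pdf or image/jpeg), the ;base64, marker, and then the encoded bytes.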
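A client can check a `source` string against these two formats before sending it. The following is a sketch only: `classify_source` is a hypothetical helper, and accepting plain `http` alongside `https` is an assumption, since the examples here only show `https` URLs.

```python
import base64
import binascii
from urllib.parse import urlparse


def classify_source(source: str) -> str:
    """Return "url" or "base64" for a well-formed source, else raise ValueError."""
    if source.startswith("data:"):
        header, sep, payload = source[len("data:"):].partition(",")
        if not sep or not header.endswith(";base64"):
            raise ValueError("data URI must look like data:<mime>;base64,<data>")
        mime_type = header[: -len(";base64")]
        if "/" not in mime_type:
            raise ValueError("data URI is missing a MIME type")
        try:
            # validate=True rejects characters outside the base64 alphabet.
            base64.b64decode(payload, validate=True)
        except binascii.Error as exc:
            raise ValueError(f"payload is not valid base64: {exc}") from exc
        return "base64"
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https") and parsed.netloc:
        return "url"
    raise ValueError("source must be an http(s) URL or a base64 data URI")
```

For example, `classify_source("https://example.com/file.pdf")` returns `"url"`, while a bare filename such as `"file.pdf"` raises `ValueError`.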