AI Document Processing: OCR, Extraction, and Structured Data

Transform unstructured documents into structured, actionable data

Level: Advanced, 38 minutes

A guide to AI-powered document processing: OCR, layout analysis, information extraction, and structured data output. The techniques apply to processing invoices, contracts, forms, and reports at scale.

Tags: document-processing, ocr, information-extraction, pdf, azure

AI Document Processing

The Document Processing Challenge

Organizations deal with millions of unstructured documents:
  • Invoices and receipts
  • Contracts and legal documents
  • Medical records
  • Government forms
  • Financial reports

OCR with Tesseract + AI Enhancement

    python
    import json

    import cv2
    import pytesseract
    from openai import OpenAI
    from PIL import Image

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def extract_text_from_image(image_path: str) -> str:
        """Denoise an image and run Tesseract OCR on it."""
        # Preprocess image for better OCR
        img = cv2.imread(image_path)
        if img is None:  # cv2.imread returns None instead of raising
            raise FileNotFoundError(f"Could not read image: {image_path}")
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        denoised = cv2.fastNlMeansDenoising(gray, h=10)

        # Tesseract OCR; --psm 6 treats the page as a single uniform block of text
        pil_img = Image.fromarray(denoised)
        text = pytesseract.image_to_string(pil_img, config='--psm 6')
        return text

    def enhance_with_ai(raw_text: str, document_type: str) -> dict:
        """Use AI to fix OCR errors and extract structure"""
        prompt = f"""Clean up this {document_type} OCR text and extract:
    - All field names and values
    - Fix obvious OCR errors

    Return as structured JSON.

    Raw text:
    {raw_text}"""
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
        )
        return json.loads(response.choices[0].message.content)
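Some OCR errors are cheap to repair before the text reaches the model at all. The sketch below illustrates one such pre-pass ahead of `enhance_with_ai`: it swaps letters that are often misread for digits, but only inside tokens that already look numeric. The lookalike table, the "digits must outnumber letters" heuristic, and the names `looks_numeric` and `fix_numeric_tokens` are all assumptions made for illustration, not part of Tesseract or any library.

```python
import re

# Letters that OCR engines often emit in place of digits (illustrative, not exhaustive).
DIGIT_LOOKALIKES = str.maketrans(
    {"O": "0", "o": "0", "l": "1", "I": "1", "S": "5", "B": "8"}
)

# A candidate token may contain only digits, lookalike letters, and amount punctuation.
CANDIDATE = re.compile(r"[0-9OolISB.,$-]+")


def looks_numeric(token: str) -> bool:
    """Heuristic (an assumption): digits must outnumber letters in the token."""
    if not CANDIDATE.fullmatch(token):
        return False
    digits = sum(ch.isdigit() for ch in token)
    letters = sum(ch.isalpha() for ch in token)
    return digits > letters


def fix_numeric_tokens(text: str) -> str:
    """Replace digit lookalikes in numeric-looking tokens; leave other text untouched."""
    def repair(match: re.Match) -> str:
        token = match.group(0)
        return token.translate(DIGIT_LOOKALIKES) if looks_numeric(token) else token

    # \S+ matches each whitespace-delimited token, so the original spacing is preserved.
    return re.sub(r"\S+", repair, text)
```

For example, `fix_numeric_tokens("Invoice Total: $1,2O5.5O for B2B order")` repairs the amount to `$1,205.50` and leaves `B2B` alone, because letters outnumber digits in that token. The heuristic is deliberately conservative and will miss short tokens such as `5O`, so it complements the model-based cleanup and does not replace it.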

Azure Document Intelligence (Form Recognizer)

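    python
    import os

    from azure.ai.formrecognizer import DocumentAnalysisClient
    from azure.core.credentials import AzureKeyCredential

    # Assumed environment variable names; adapt them to your own configuration
    AZURE_ENDPOINT = os.environ["AZURE_ENDPOINT"]
    AZURE_KEY = os.environ["AZURE_KEY"]

    # Named azure_client so it does not shadow the `client` used for the OpenAI calls
    azure_client = DocumentAnalysisClient(
        endpoint=AZURE_ENDPOINT,
        credential=AzureKeyCredential(AZURE_KEY),
    )

    # Analyze the PDF with the prebuilt invoice model and wait for the result
    with open("invoice.pdf", "rb") as f:
        poller = azure_client.begin_analyze_document("prebuilt-invoice", f)
        result = poller.result()

    invoice = result.documents[0]
    print(f"Invoice Number: {invoice.fields['InvoiceId'].value}")
    print(f"Total: {invoice.fields['InvoiceTotal'].value}")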
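Each field returned by the service carries a confidence score alongside its value, which makes it possible to accept high-confidence fields automatically and send the rest to a human reviewer. The following is a minimal sketch of that routing step. `ExtractedField` is a hypothetical stand-in for the SDK's field objects so the example runs without Azure credentials, and the 0.8 threshold is an arbitrary assumption that should be tuned on real documents.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Tuple


@dataclass
class ExtractedField:
    """Hypothetical stand-in for one extracted field and its confidence score."""
    name: str
    value: Any
    confidence: float


def split_by_confidence(
    fields: Iterable[ExtractedField], threshold: float = 0.8
) -> Tuple[Dict[str, Any], List[str]]:
    """Return (accepted values by field name, names of fields needing manual review)."""
    accepted: Dict[str, Any] = {}
    needs_review: List[str] = []
    for item in fields:
        # A missing value always goes to review, whatever its score.
        if item.value is not None and item.confidence >= threshold:
            accepted[item.name] = item.value
        else:
            needs_review.append(item.name)
    return accepted, needs_review


sample = [
    ExtractedField("InvoiceId", "INV-0042", 0.97),
    ExtractedField("InvoiceTotal", 1205.50, 0.62),
    ExtractedField("VendorName", None, 0.91),
]
accepted, needs_review = split_by_confidence(sample)
```

With the sample data, only `InvoiceId` is accepted; `InvoiceTotal` falls below the threshold and `VendorName` has no value, so both land in `needs_review`.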

LLM-Based Extraction with Schema

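    python
    from typing import List, Optional

    from openai import OpenAI
    from pydantic import BaseModel

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    class InvoiceLineItem(BaseModel):
        description: str
        quantity: float
        unit_price: float
        total: float

    class Invoice(BaseModel):
        invoice_number: str
        date: str
        vendor_name: str
        total_amount: float
        line_items: List[InvoiceLineItem]
        tax_amount: Optional[float] = None

    def extract_invoice(text: str) -> Invoice:
        # Passing the Pydantic model as response_format constrains the output to its schema
        response = client.beta.chat.completions.parse(
            model="gpt-4o",
            messages=[
                {"role": "user", "content": f"Extract invoice data: {text}"}
            ],
            response_format=Invoice,
        )
        invoice = response.choices[0].message.parsed
        if invoice is None:  # e.g. the model refused the request
            raise ValueError("Model did not return a parsable invoice")
        return invoice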
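A schema guarantees the shape of the output, not that the numbers agree with each other. A simple arithmetic cross-check can catch many extraction mistakes. The sketch below works on a plain dict with the same field names as the `Invoice` model, so it runs without Pydantic. The function name `check_invoice_totals`, the 0.01 tolerance, and the rule "line items plus tax equals total" are assumptions; real invoices may also carry discounts or shipping.

```python
import math
from typing import List


def check_invoice_totals(invoice: dict, tolerance: float = 0.01) -> List[str]:
    """Return a list of human-readable inconsistencies; an empty list means all checks passed."""
    problems: List[str] = []

    # Each line total should equal quantity * unit_price.
    for index, item in enumerate(invoice["line_items"]):
        expected = item["quantity"] * item["unit_price"]
        if not math.isclose(expected, item["total"], abs_tol=tolerance):
            problems.append(
                f"line item {index}: expected {expected:.2f}, found {item['total']:.2f}"
            )

    # Assumed rule: sum of line totals plus tax equals the invoice total.
    subtotal = sum(item["total"] for item in invoice["line_items"])
    tax = invoice.get("tax_amount") or 0.0
    if not math.isclose(subtotal + tax, invoice["total_amount"], abs_tol=tolerance):
        problems.append(
            f"total: expected {subtotal + tax:.2f}, found {invoice['total_amount']:.2f}"
        )
    return problems


example = {
    "total_amount": 28.05,
    "tax_amount": 2.55,
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 10.0, "total": 20.0},
        {"description": "Gasket", "quantity": 1, "unit_price": 5.5, "total": 5.5},
    ],
}
issues = check_invoice_totals(example)
```

An invoice that fails these checks is a good candidate for a second extraction attempt or for manual review before it is stored.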

Pipeline for Scale

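    python
    from celery import Celery

    # No broker URL is configured here; set one before starting workers
    app = Celery('document_processor')

    @app.task
    def process_document(doc_path: str, doc_type: str):
        # Note: doc_type is accepted but unused; this task always runs invoice extraction
        raw_text = extract_text_from_image(doc_path)
        structured_data = extract_invoice(raw_text)
        save_to_database(structured_data)  # application-specific; not defined in this tutorial
        # model_dump() is the Pydantic v2 method; use .dict() on Pydantic v1
        return {"status": "success", "data": structured_data.model_dump()}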
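The task above accepts a `doc_type` argument but always calls `extract_invoice`. One way to make the document type select the extractor is a small registry, sketched below. The names `EXTRACTORS`, `register_extractor`, and `extract_structured` are hypothetical, and the registered function is a placeholder standing in for a real extractor so the example runs on its own.

```python
from typing import Callable, Dict

# Maps a document type to the function that extracts structured data from its text.
EXTRACTORS: Dict[str, Callable[[str], dict]] = {}


def register_extractor(doc_type: str):
    """Decorator that registers an extractor function under a document type."""
    def decorator(func: Callable[[str], dict]) -> Callable[[str], dict]:
        EXTRACTORS[doc_type] = func
        return func
    return decorator


@register_extractor("invoice")
def extract_invoice_stub(text: str) -> dict:
    # Placeholder: a real pipeline would call the LLM-based extractor here.
    return {"doc_type": "invoice", "chars": len(text)}


def extract_structured(text: str, doc_type: str) -> dict:
    """Dispatch to the extractor registered for doc_type, failing loudly if none exists."""
    try:
        extractor = EXTRACTORS[doc_type]
    except KeyError:
        raise ValueError(f"No extractor registered for document type: {doc_type!r}") from None
    return extractor(text)
```

Failing with an explicit error for unknown types keeps unsupported documents out of the database, where they might otherwise be stored with the wrong schema.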

Related Tools

tesseract, azure-form-recognizer, openai, pydantic