AI-Assisted Data Labeling: Scale Annotation Workflows

Use AI to accelerate and improve training data creation

Advanced, about 38 minutes

Learn to use AI to assist with data labeling, including pre-labeling, active learning, quality control, and weak supervision. The goal is to reduce annotation costs by 60-80% while maintaining data quality.

Tags: data-labeling, annotation, active-learning, weak-supervision, snorkel

AI-Assisted Data Labeling

The Data Labeling Challenge

Manual labeling is expensive ($0.05 to $2 per label)

Quality is inconsistent across annotators

Scale requirements are growing

Expert knowledge bottleneck for specialized domains

Pre-Labeling with Foundation Models

Use zero-shot models to create draft annotations:

python
from transformers import pipeline

# Pre-label text classification
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def pre_label_texts(texts: list, categories: list) -> list:
    results = []
    for text in texts:
        # Labels and scores come back sorted from most to least likely
        result = classifier(text, categories)
        results.append({
            "text": text,
            "pre_label": result["labels"][0],
            "confidence": result["scores"][0],
            # Send low-confidence predictions to a human annotator
            "needs_review": result["scores"][0] < 0.8
        })
    return results

# Pre-label images with vision models
object_detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def pre_label_images(image_paths: list) -> list:
    results = []
    for path in image_paths:
        # Each detection is a dict with "score", "label", and "box"
        detections = object_detector(path)
        results.append({
            "image": path,
            "pre_annotations": detections,
            # Review images with no detections or with any low-score detection
            "needs_review": not detections or any(d["score"] < 0.7 for d in detections)
        })
    return results
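
The `needs_review` flag can drive a simple triage step before annotators see anything. The sketch below is a minimal illustration, assuming records shaped like the output of `pre_label_texts`; the `route_pre_labels` helper and the two-queue split are hypothetical choices made for this example, not part of any library.

```python
def route_pre_labels(records: list) -> dict:
    """Split pre-labeled records into an auto-accept queue and a human-review queue."""
    queues = {"auto_accept": [], "human_review": []}
    for record in records:
        key = "human_review" if record["needs_review"] else "auto_accept"
        queues[key].append(record)
    # Put the least confident items first so reviewers see the hardest cases early.
    queues["human_review"].sort(key=lambda r: r["confidence"])
    return queues


# Hand-written example records (illustrative values, not real model output).
sample = [
    {"text": "Refund still missing", "pre_label": "billing", "confidence": 0.93, "needs_review": False},
    {"text": "It broke, I guess?", "pre_label": "complaint", "confidence": 0.55, "needs_review": True},
    {"text": "Where is my parcel", "pre_label": "shipping", "confidence": 0.71, "needs_review": True},
]
queues = route_pre_labels(sample)
```

Even the auto-accepted queue is worth spot-checking with a small random sample, since a confident pre-label can still be wrong.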

Active Learning

Prioritize labeling the most valuable samples:

python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestClassifier  # one possible model; any classifier with predict_proba works
from sklearn.metrics import pairwise_distances_argmin_min

class ActiveLearner:
    def __init__(self, model):
        # model must already be fitted on the current labeled set
        self.model = model

    def uncertainty_sampling(self, unlabeled_pool: np.ndarray, n_samples: int) -> list:
        probas = self.model.predict_proba(unlabeled_pool)
        # Least-confidence score: 1 minus the top class probability
        uncertainty = 1 - np.max(probas, axis=1)
        # Sort descending and keep the n_samples most uncertain rows
        uncertain_indices = np.argsort(uncertainty)[::-1][:n_samples]
        return uncertain_indices.tolist()

    def diversity_sampling(self, unlabeled_pool: np.ndarray, n_samples: int) -> list:
        # Select diverse samples using k-means clustering
        kmeans = KMeans(n_clusters=n_samples, n_init=10)
        kmeans.fit(unlabeled_pool)
        # Find the pool sample closest to each centroid
        closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, unlabeled_pool)
        return closest.tolist()
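
The score used in `uncertainty_sampling` (one minus the top class probability) is only one way to measure uncertainty. Two other common measures are entropy over the whole distribution and the margin between the two most likely classes. The sketch below shows both in plain Python on lists of probabilities, such as `predict_proba(...).tolist()`; the function names are hypothetical and chosen for this example.

```python
import math


def entropy_scores(probas: list) -> list:
    """Shannon entropy of each predicted distribution (higher = more uncertain)."""
    return [-sum(p * math.log(p) for p in row if p > 0) for row in probas]


def margin_scores(probas: list) -> list:
    """Gap between the two most likely classes (smaller = more uncertain).

    Assumes every row has at least two classes.
    """
    scores = []
    for row in probas:
        top_two = sorted(row, reverse=True)[:2]
        scores.append(top_two[0] - top_two[1])
    return scores


def select_most_uncertain(probas: list, n_samples: int, method: str = "entropy") -> list:
    """Return indices of the n_samples rows the model is least sure about."""
    if method == "entropy":
        scores = entropy_scores(probas)
        order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    elif method == "margin":
        scores = margin_scores(probas)
        order = sorted(range(len(scores)), key=lambda i: scores[i])
    else:
        raise ValueError(f"unknown method: {method}")
    return order[:n_samples]


# Hand-written probabilities for three samples over three classes.
example_probas = [
    [0.90, 0.05, 0.05],  # confident
    [0.40, 0.35, 0.25],  # spread across all classes
    [0.50, 0.49, 0.01],  # two classes nearly tied
]
by_entropy = select_most_uncertain(example_probas, 1, method="entropy")
by_margin = select_most_uncertain(example_probas, 1, method="margin")
```

On this toy input the two measures pick different samples: entropy favors the row spread across all three classes, while margin favors the row where two classes are nearly tied. Which behavior is preferable depends on the task.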

Weak Supervision with Snorkel

Create training data programmatically:

python
from snorkel.labeling import labeling_function, PandasLFApplier

# Label values; Snorkel treats -1 as "abstain"
POSITIVE = 1
NEGATIVE = 0
ABSTAIN = -1

@labeling_function()
def lf_contains_positive_words(x):
    positive_words = ['excellent', 'great', 'amazing', 'love']
    # Substring match, so 'love' also matches 'glove'
    if any(word in x.text.lower() for word in positive_words):
        return POSITIVE
    return ABSTAIN

@labeling_function()
def lf_contains_negative_words(x):
    negative_words = ['terrible', 'awful', 'horrible', 'hate']
    if any(word in x.text.lower() for word in negative_words):
        return NEGATIVE
    return ABSTAIN

# Apply labeling functions
# df_train is assumed to be a pandas DataFrame with a "text" column
applier = PandasLFApplier([lf_contains_positive_words, lf_contains_negative_words])
# L_train has one row per example and one column per labeling function
L_train = applier.apply(df_train)
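
`L_train` is a label matrix with one row per example and one column per labeling function, where -1 means the function abstained. Those votes still have to be combined into a single training label per example. The sketch below uses a plain majority vote over the non-abstaining functions to illustrate the idea. It is a simplification written for this example (the `majority_vote` helper is hypothetical) and is not Snorkel's own aggregation; Snorkel provides its own label models for this step.

```python
from collections import Counter

ABSTAIN = -1


def majority_vote(label_matrix: list, abstain: int = ABSTAIN) -> list:
    """Combine labeling-function votes row by row.

    Returns the most common non-abstain label per row, or `abstain` when
    every function abstained or the top labels are tied.
    """
    combined = []
    for row in label_matrix:
        votes = Counter(label for label in row if label != abstain)
        if not votes:
            combined.append(abstain)
            continue
        ranked = votes.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            combined.append(abstain)  # tie: leave unlabeled rather than guess
        else:
            combined.append(ranked[0][0])
    return combined


# Hand-written label matrix: rows are examples, columns are labeling functions.
example_matrix = [
    [1, -1],   # only the positive-word function fired
    [-1, 0],   # only the negative-word function fired
    [1, 0],    # the functions disagree
    [-1, -1],  # nothing fired
]
labels = majority_vote(example_matrix)
```

With a real label matrix, something like `majority_vote(L_train.tolist())` would apply the same logic. Rows that come back as abstain are natural candidates for human annotation.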

Quality Control

Inter-annotator agreement (Cohen's kappa)

Honeypot samples with known labels

Automated quality checks

Feedback loops to improve annotator accuracy
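
The first two checks in this list are easy to compute directly. The sketch below is a minimal plain-Python version, assuming two annotators labeled the same items in the same order and that honeypot labels are stored in dictionaries keyed by item id; the function names, data layout, and toy values are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both pick the same class independently.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical class; returning 1.0 is a
        # convention chosen for this sketch, since kappa is undefined here.
        return 1.0
    return (observed - expected) / (1 - expected)


def honeypot_accuracy(submitted: dict, gold: dict) -> float:
    """Share of honeypot items (known gold labels) the annotator got right."""
    checked = [item for item in gold if item in submitted]
    if not checked:
        raise ValueError("annotator has not labeled any honeypot items")
    return sum(submitted[i] == gold[i] for i in checked) / len(checked)


# Hand-written toy data.
annotator_a = ["pos", "pos", "neg", "neg", "pos", "neg"]
annotator_b = ["pos", "neg", "neg", "neg", "pos", "pos"]
kappa = cohen_kappa(annotator_a, annotator_b)

gold_labels = {"item-1": "pos", "item-2": "neg", "item-3": "neg", "item-4": "pos"}
submitted_labels = {"item-1": "pos", "item-2": "neg", "item-3": "pos", "item-9": "neg"}
accuracy = honeypot_accuracy(submitted_labels, gold_labels)
```

Kappa corrects raw agreement for the agreement expected by chance, so it is a stricter signal than a simple match rate. Honeypot accuracy is only computed over the gold items an annotator has actually seen.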
