AI-Assisted Data Labeling: Scale Annotation Workflows
Use AI to accelerate and improve training data creation
Learn to use AI to assist with data labeling, including pre-labeling, active learning, quality control, and weak supervision. These techniques can reduce annotation costs by 60-80% while maintaining data quality.
The Data Labeling Challenge
Pre-Labeling with Foundation Models
Use zero-shot models to create draft annotations:

from transformers import pipeline

# Pre-label text classification
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

def pre_label_texts(texts: list, categories: list) -> list:
    """Attach a draft label and confidence score to each text."""
    results = []
    for text in texts:
        # Labels and scores come back sorted by descending score
        result = classifier(text, categories)
        top_score = result["scores"][0]
        results.append({
            "text": text,
            "pre_label": result["labels"][0],
            "confidence": top_score,
            # Route low-confidence drafts to a human annotator
            "needs_review": top_score < 0.8,
        })
    return results

# Pre-label images with vision models
from transformers import pipeline

object_detector = pipeline("object-detection", model="facebook/detr-resnet-50")

def pre_label_images(image_paths: list) -> list:
    """Attach draft object detections to each image."""
    results = []
    for path in image_paths:
        # Each detection is a dict with "score", "label", and "box" keys
        detections = object_detector(path)
        results.append({
            "image": path,
            "pre_annotations": detections,
            # Review images with no detections or with any low-confidence detection
            "needs_review": not detections or any(d["score"] < 0.7 for d in detections),
        })
    return results
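Once pre-labels carry a confidence score and a `needs_review` flag, they can be split into an auto-accepted set and a human review queue. The sketch below assumes records shaped like the text pre-labeling output above; the function name `split_for_review` and the sample records are hypothetical and made up for illustration.

```python
def split_for_review(records: list) -> tuple:
    """Split pre-labeled records into (auto_accepted, review_queue).

    Assumes each record has "confidence" and "needs_review" keys.
    """
    auto_accepted = [r for r in records if not r["needs_review"]]
    # Show annotators the least confident drafts first
    review_queue = sorted(
        (r for r in records if r["needs_review"]),
        key=lambda r: r["confidence"],
    )
    return auto_accepted, review_queue


# Made-up records for illustration only
sample = [
    {"text": "Great battery life", "pre_label": "positive", "confidence": 0.95, "needs_review": False},
    {"text": "It arrived on Tuesday", "pre_label": "neutral", "confidence": 0.55, "needs_review": True},
    {"text": "Not sure how I feel", "pre_label": "negative", "confidence": 0.41, "needs_review": True},
]

accepted, queue = split_for_review(sample)
review_rate = len(queue) / len(sample)
print(f"auto-accepted: {len(accepted)}, to review: {len(queue)}, review rate: {review_rate:.0%}")
```

Tracking the review rate over time gives a rough sense of how much annotator effort the pre-labels save. Spot-checking a sample of auto-accepted items is a reasonable guard against confident mistakes.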
Active Learning
Prioritize labeling the most valuable samples:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances_argmin_min

class ActiveLearner:
    def __init__(self, model):
        # Any fitted classifier exposing predict_proba, e.g. RandomForestClassifier
        self.model = model

    def uncertainty_sampling(self, unlabeled_pool: np.ndarray, n_samples: int) -> list:
        probas = self.model.predict_proba(unlabeled_pool)
        # Select the samples the model is most uncertain about (least confidence)
        uncertainty = 1 - np.max(probas, axis=1)
        uncertain_indices = np.argsort(uncertainty)[::-1][:n_samples]
        return uncertain_indices.tolist()

    def diversity_sampling(self, unlabeled_pool: np.ndarray, n_samples: int) -> list:
        # Select diverse samples using k-means clustering, one cluster per requested sample
        kmeans = KMeans(n_clusters=n_samples, n_init=10, random_state=0)
        kmeans.fit(unlabeled_pool)
        # Find the pool sample closest to each centroid
        closest, _ = pairwise_distances_argmin_min(kmeans.cluster_centers_, unlabeled_pool)
        return closest.tolist()
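To make the uncertainty score concrete, the sketch below computes least confidence (the `1 - max probability` score used above) and entropy, a common alternative, on a small hand-written probability table. It uses only the standard library; the probabilities and function names are illustrative assumptions, not output from a real model.

```python
import math

def least_confidence(probs: list) -> float:
    """1 minus the top class probability: higher means less certain."""
    return 1.0 - max(probs)

def entropy(probs: list) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(pool_probs: list, n_samples: int, score=least_confidence) -> list:
    """Return indices of the n_samples rows with the highest uncertainty score."""
    ranked = sorted(range(len(pool_probs)), key=lambda i: score(pool_probs[i]), reverse=True)
    return ranked[:n_samples]

# Hypothetical class probabilities for four unlabeled samples (three classes)
pool_probs = [
    [0.98, 0.01, 0.01],  # index 0: confident
    [0.40, 0.35, 0.25],  # index 1: spread across all classes
    [0.70, 0.20, 0.10],  # index 2: fairly confident
    [0.50, 0.49, 0.01],  # index 3: torn between two classes
]

print(most_uncertain(pool_probs, 2))                 # [1, 3]
print(most_uncertain(pool_probs, 2, score=entropy))  # [1, 2]
```

The two scores can disagree. Least confidence picks index 3, which is split between two classes, while entropy picks index 2, whose probability mass is spread across all three.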
Weak Supervision with Snorkel
Create training data programmatically:

from snorkel.labeling import labeling_function, PandasLFApplier

# Label conventions: Snorkel treats -1 as "abstain"
ABSTAIN = -1
NEGATIVE = 0
POSITIVE = 1

@labeling_function()
def lf_contains_positive_words(x):
    positive_words = ['excellent', 'great', 'amazing', 'love']
    # Substring match, so 'love' also fires on words such as 'glove'
    if any(word in x.text.lower() for word in positive_words):
        return POSITIVE
    return ABSTAIN

@labeling_function()
def lf_contains_negative_words(x):
    negative_words = ['terrible', 'awful', 'horrible', 'hate']
    if any(word in x.text.lower() for word in negative_words):
        return NEGATIVE
    return ABSTAIN

# Apply labeling functions; df_train is assumed to be a pandas DataFrame with a "text" column
applier = PandasLFApplier([lf_contains_positive_words, lf_contains_negative_words])
# L_train has one row per example and one column per labeling function
L_train = applier.apply(df_train)
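The label matrix produced by the applier has one row per example and one column per labeling function, with -1 marking an abstain. Snorkel ships its own models for combining these votes, such as `LabelModel`. The plain-Python sketch below is a simplified stand-in, not Snorkel's method: it resolves each row by majority vote to show the idea. The function name `majority_vote` and the example matrix are hypothetical.

```python
from collections import Counter

ABSTAIN = -1

def majority_vote(label_matrix: list) -> list:
    """Combine labeling-function votes row by row.

    Returns ABSTAIN for a row when every function abstains
    or when the top two labels are tied.
    """
    combined = []
    for row in label_matrix:
        votes = Counter(v for v in row if v != ABSTAIN)
        if not votes:
            combined.append(ABSTAIN)  # no function fired
            continue
        ranked = votes.most_common(2)
        if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
            combined.append(ABSTAIN)  # tie between labels
        else:
            combined.append(ranked[0][0])
    return combined

# Made-up votes from three labeling functions on four examples
L_example = [
    [1, -1, 1],    # two positive votes
    [-1, 0, -1],   # one negative vote
    [1, 0, -1],    # conflicting votes, tied
    [-1, -1, -1],  # every function abstained
]

print(majority_vote(L_example))  # [1, 0, -1, -1]
```

Rows that stay at -1 have no programmatic label; they would typically be sent to human annotators or left out of training.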
Quality Control