ML Model Versioning with DVC

Data Version Control for ML experiments and model tracking

Advanced, about 18 minutes


mlops production machine-learning dvc experiment-tracking

Overview

DVC (Data Version Control) versions datasets and models alongside code, so ML experiments and the models they produce can be tracked and reproduced. This guide covers a practical implementation for production ML systems.

Why This Matters in MLOps

Setup

bash
# Install required tools
pip install dvc mlflow pandas numpy scikit-learn

# Or with Docker
docker pull python:3.11-slim
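Before moving on, it can help to confirm that the tools from the install step are actually available. The following is a minimal sketch, assuming the `dvc` executable is expected on `PATH` and the Python packages are importable under their usual module names (`sklearn` for scikit-learn). `missing_tools` is a hypothetical helper, not part of DVC or MLflow.

```python
import importlib.util
import shutil


def missing_tools(executables, modules):
    """Return the names of executables and modules that cannot be found."""
    missing = [exe for exe in executables if shutil.which(exe) is None]
    missing += [mod for mod in modules if importlib.util.find_spec(mod) is None]
    return missing


if __name__ == "__main__":
    # Module names are assumptions: scikit-learn is imported as "sklearn".
    problems = missing_tools(
        executables=["dvc"],
        modules=["mlflow", "pandas", "numpy", "sklearn"],
    )
    print("Missing:", problems if problems else "nothing")
```

An empty result means every listed executable and top-level module was found in the current environment.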

Core Implementation

python
import json
import logging
import os
import time
from pathlib import Path

logger = logging.getLogger(__name__)


class MLModelVersioningwithDVC:
    """
    ML Model Versioning with DVC implementation.

    Handles: experiment tracking
    Tool: dvc
    """

    def __init__(self, config: dict | None = None):
        # Fall back to the defaults when no config (or an empty one) is given.
        self.config = config or self._default_config()
        self._setup()

    def _default_config(self) -> dict:
        return {
            "tool": "dvc",
            "environment": os.getenv("ENVIRONMENT", "development"),
            "log_level": "INFO",
        }

    def _setup(self):
        """Configure logging for this component."""
        logging.basicConfig(level=self.config.get("log_level", "INFO"))
        logger.info("Initialized ML Model Versioning with DVC with config: %s", self.config)

    def run(self, **kwargs) -> dict:
        """Execute experiment tracking and report status and timing."""
        # perf_counter is monotonic, so the measured duration cannot go negative.
        start = time.perf_counter()

        try:
            result = self._execute(**kwargs)

            elapsed = time.perf_counter() - start
            logger.info("ML Model Versioning with DVC completed in %.2fs", elapsed)

            return {
                "status": "success",
                "result": result,
                "elapsed_seconds": elapsed,
            }

        except Exception as e:
            # logger.exception also records the traceback.
            logger.exception("ML Model Versioning with DVC failed: %s", e)
            return {
                "status": "failed",
                "error": str(e),
            }

    def _execute(self, **kwargs) -> dict:
        """Core experiment tracking logic. Override to customize."""
        return {"completed": True, "tool": "dvc"}


# Configuration
config = {
    "tool": "dvc",
    "tracking_uri": os.getenv("MLFLOW_TRACKING_URI", "http://localhost:5000"),
    "artifact_root": "./artifacts",
}

# Initialize
processor = MLModelVersioningwithDVC(config)
result = processor.run()
print(json.dumps(result, indent=2))
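The `_execute` hook above returns a placeholder. As a sketch of what an override might compute, the snippet below derives a version record from a model file's content hash, which is the same idea DVC relies on when it tracks a file by checksum. `artifact_version` and the record's field names are illustrative assumptions, not a DVC API.

```python
import hashlib
from pathlib import Path


def artifact_version(path, chunk_size=1 << 20):
    """Build a small version record for a file from its MD5 content hash."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in chunks so large model files are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return {
        "path": str(path),
        "md5": digest.hexdigest(),
        "size_bytes": Path(path).stat().st_size,
    }
```

An override of `_execute` could return this record for a path passed in through `kwargs`, so two runs that produce byte-identical models report the same hash.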

DVC Integration

python
# Specific dvc integration for experiment tracking
import json
from pathlib import Path


def setup_dvc():
    """Write project metadata for experiment tracking into the .dvc directory."""

    print("Setting up dvc for experiment tracking...")

    # Example configuration
    config = {
        "project": "my-ml-project",
        "tool": "dvc",
        "specialty": "experiment tracking",
        "version": "1.0.0",
    }

    # Save configuration. Note: DVC does not read this JSON file; its own
    # settings live in .dvc/config, which `dvc init` creates.
    Path(".dvc").mkdir(exist_ok=True)
    with open(".dvc/config.json", "w") as f:
        json.dump(config, f, indent=2)

    print("dvc configured for experiment tracking")
    return config


config = setup_dvc()
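The function above only writes project metadata; DVC itself is driven through its command line. As a hedged sketch, the helper below lists the commands commonly used to put one model file under DVC control inside an existing Git repository: `dvc init`, `dvc add`, then committing the generated `.dvc` pointer file with Git. The helper names, the `models/model.pkl` path, and the commit message are assumptions for illustration.

```python
import subprocess
from pathlib import PurePosixPath


def versioning_commands(artifact, message):
    """Return the CLI calls that track one artifact with DVC and Git."""
    artifact = PurePosixPath(artifact)
    pointer = f"{artifact}.dvc"                    # metafile written by `dvc add`
    ignore = str(artifact.parent / ".gitignore")   # updated by `dvc add`
    return [
        ["dvc", "init"],
        ["dvc", "add", str(artifact)],
        ["git", "add", pointer, ignore],
        ["git", "commit", "-m", message],
    ]


def run_all(commands, runner=subprocess.run):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        runner(cmd, check=True)
```

For example, `run_all(versioning_commands("models/model.pkl", "Track model v1"))` would execute the four commands in sequence. In a repository that is already initialized for DVC, the first command should be dropped from the list.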

Monitoring and Alerting

python
import logging
import time
from dataclasses import dataclass, field

logger = logging.getLogger(__name__)


@dataclass
class MetricSnapshot:
    """A single metric observation (fields inferred from how record() builds it)."""

    timestamp: float
    metric_name: str
    value: float
    labels: dict = field(default_factory=dict)


class MLOpsMonitor:
    """Monitor experiment tracking metrics."""

    def __init__(self):
        self.metrics: list[MetricSnapshot] = []
        self.thresholds = {
            "error_rate": 0.05,
            "latency_p99_ms": 1000,
            "data_drift_score": 0.3,
        }

    def record(self, metric: str, value: float, labels: dict | None = None):
        snapshot = MetricSnapshot(
            timestamp=time.time(),
            metric_name=metric,
            value=value,
            labels=labels or {},
        )
        self.metrics.append(snapshot)
        self._check_threshold(metric, value)

    def _check_threshold(self, metric: str, value: float):
        threshold = self.thresholds.get(metric)
        # Compare against None so that a threshold of 0 is still enforced.
        if threshold is not None and value > threshold:
            logger.warning(f"ALERT: {metric}={value:.3f} exceeds threshold {threshold}")


monitor = MLOpsMonitor()
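The thresholds above assume aggregate values such as an error rate and a p99 latency, while a service usually emits one sample per request. The sketch below shows one way to reduce raw samples to those two numbers before handing them to `record`. It uses the nearest-rank percentile definition, and `summarize_requests` is a hypothetical helper.

```python
import math


def percentile_nearest_rank(values, pct):
    """Return the pct-th percentile (integer 1-100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize_requests(samples):
    """Reduce (latency_ms, failed) pairs to the metric names used in the thresholds."""
    latencies = [latency for latency, _ in samples]
    failures = sum(1 for _, failed in samples if failed)
    return {
        "latency_p99_ms": percentile_nearest_rank(latencies, 99),
        "error_rate": failures / len(samples),
    }
```

Each entry of the returned dictionary can then be passed to `monitor.record(name, value)`, which applies the matching threshold.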

CI/CD Integration

yaml
# .github/workflows/ml-pipeline.yml
name: ML Pipeline
on:
  push:
    paths: ['src/**', 'data/**']

jobs:
  train-and-evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Run experiment tracking
        run: python -m src.ml_model_versioning_with_dvc
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_URI }}
      - name: Check model quality
        run: python -m src.validate_model
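The final workflow step runs `src.validate_model`, which is not shown here. A minimal sketch of such a quality gate follows, assuming the training step writes its metrics to a JSON file and that higher values are better. The `metrics.json` path, the metric names, and the minimums are all hypothetical.

```python
import json
import sys


def failed_checks(metrics, minimums):
    """Return a message for every metric that is missing or below its minimum."""
    failures = []
    for name, minimum in minimums.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: missing")
        elif value < minimum:
            failures.append(f"{name}: {value} < {minimum}")
    return failures


def main(path="metrics.json"):
    """Load metrics from a JSON file and return a process exit code."""
    with open(path) as f:
        metrics = json.load(f)
    # Hypothetical thresholds; tune them for the model at hand.
    failures = failed_checks(metrics, {"accuracy": 0.90, "f1": 0.85})
    for line in failures:
        print(f"FAILED {line}", file=sys.stderr)
    return 1 if failures else 0
```

Ending the module with `sys.exit(main())` under an `if __name__ == "__main__":` guard turns the return value into the process exit status, and a non-zero status is what makes the GitHub Actions step fail.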

