WhyLabs AI Observatory: Complete Setup Guide

Real-time data and AI monitoring with WhyLabs

WhyLabs built its platform on an idea that outlives any vendor: monitor statistical profiles of your data, not the raw data itself. Its open-source library whylogs generates compact statistical summaries (distributions, missing rates, cardinalities) of whatever flows through your pipeline; the platform watches those profiles for drift and anomalies over time. This guide covers the profile-based monitoring pattern, setup, and where it fits in an LLM-era observability stack. *(Vendor landscape note: check the company's current product status and pricing before committing — this category has consolidated repeatedly; the whylogs pattern itself is open source and portable.)*

The core idea: profiles, not payloads

```python
import pandas as pd
import whylogs as why

# batch_of_predictions: features + predictions + metadata for one scoring batch
df = pd.DataFrame(batch_of_predictions)
profile = why.log(df)  # statistical summary, NOT the rows
profile.writer("whylabs").write()  # or write locally / to your own store
```

A profile captures per-column distributions, null rates, type counts, and frequent items in a few kilobytes. The raw rows never leave your boundary, which is why this pattern tends to clear privacy review where log-everything tools don't: only aggregates are exported, which fits GDPR-style data minimization. (Frequent-item lists can still contain raw values, so review them for sensitive columns.) Profiles from every batch, hour, or day line up into time series, and monitoring becomes one question: *did today's distribution shift against the baseline?*
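
To make "profiles, not payloads" concrete, here is a minimal standard-library sketch of the kind of summary a profile holds. It is an illustration only, not how whylogs stores profiles (whylogs relies on approximate sketching structures rather than exact counts), and the names `profile_column` and `profile_batch` are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev


def profile_column(values, top_k=3):
    """Summarize one column without keeping any of its rows."""
    present = [v for v in values if v is not None]
    summary = {
        "count": len(values),
        "null_rate": (len(values) - len(present)) / len(values) if values else 0.0,
        "cardinality": len(set(present)),
        "frequent_items": Counter(present).most_common(top_k),
    }
    # Add numeric statistics only when every non-null value is a number.
    numeric = [
        v for v in present
        if isinstance(v, (int, float)) and not isinstance(v, bool)
    ]
    if numeric and len(numeric) == len(present):
        summary["mean"] = mean(numeric)
        summary["stddev"] = pstdev(numeric)
    return summary


def profile_batch(rows):
    """Profile a batch given as a list of dicts (one dict per row)."""
    columns = sorted({key for row in rows for key in row})
    return {col: profile_column([row.get(col) for row in rows]) for col in columns}


# Made-up example batch: features plus a prediction column.
batch = [
    {"age": 34, "country": "DE", "prediction": "approve"},
    {"age": None, "country": "DE", "prediction": "deny"},
    {"age": 51, "country": "FR", "prediction": "approve"},
    {"age": 29, "country": "DE", "prediction": "approve"},
]
batch_profile = profile_batch(batch)
print(batch_profile["age"])
```

The returned dictionary is all that would be stored or shipped; the four input rows can be discarded once it is computed.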

What you catch with it

  • Input drift: the upstream team renamed a field; a market shift changed user demographics; a scraper started returning empty strings — all visible as distribution change *before* accuracy visibly degrades.
  • Prediction drift: your classifier's output mix moving (60/40 → 80/20) flags a problem even with no ground-truth labels yet — the key trick for production ML where labels lag by weeks.
  • Data-quality regressions: null-rate spikes, cardinality explosions, schema changes — the boring failures that cause most "model broke" incidents.

Setup is four steps: profile every scoring batch, set baselines (training data or a stable window), alert on divergence metrics per column, and route each alert to the owning team.
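
One common divergence metric for the "alert on divergence" step is the Population Stability Index (PSI). The sketch below applies it to the 60/40 → 80/20 prediction-mix example from the list above. PSI is swapped in here as an illustration; the source does not say which metric to use, and the function names and the `ALERT_THRESHOLD` value are assumptions to tune per column.

```python
import math


def proportions(counts, categories, floor=1e-6):
    """Convert raw counts to proportions, flooring zeros so the log stays finite."""
    total = sum(counts.get(c, 0) for c in categories)
    return {c: max(counts.get(c, 0) / total, floor) for c in categories}


def psi(baseline_counts, current_counts):
    """Population Stability Index between two categorical distributions."""
    categories = sorted(set(baseline_counts) | set(current_counts))
    expected = proportions(baseline_counts, categories)
    actual = proportions(current_counts, categories)
    return sum(
        (actual[c] - expected[c]) * math.log(actual[c] / expected[c])
        for c in categories
    )


# Prediction counts taken from two profiles: a baseline window and today.
baseline = {"approve": 600, "deny": 400}  # 60/40 mix
today = {"approve": 800, "deny": 200}     # 80/20 mix

ALERT_THRESHOLD = 0.1  # hypothetical cutoff, not a recommendation

score = psi(baseline, today)
drifted = score > ALERT_THRESHOLD
print(round(score, 3), drifted)  # prints: 0.196 True
```

No labels are involved: the score is computed purely from the two output distributions, which is what makes it usable while ground truth is still weeks away.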

The LLM-era extension

The same pattern extends to text systems, with embeddings and metrics standing in for tabular columns:

  • Profile prompt/response *metrics*: lengths, language mix, refusal-phrase rates, toxicity/PII detector scores, validation-failure rates — distribution shifts in these catch prompt-injection waves, model-version drift, and traffic-mix changes.
  • Embedding-space drift: embed a sample of inputs; monitor centroid/spread movement — "our users started asking about something new" as a measurable event (langkit-style toolkits package these text metrics around whylogs).
  • This complements, not replaces, trace-level LLM observability (LangSmith/Langfuse/Helicone): traces answer "what happened in this request"; profiles answer "is the population shifting" — mature stacks run both, and profiles are the half that scales to millions of calls for pennies.
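
The first two bullets above can be sketched with the standard library alone. This is a simplified illustration under stated assumptions: the refusal-phrase list is hypothetical, the two-dimensional "embeddings" are made-up stand-ins for real model output, and cosine distance between centroids is just one possible way to measure movement.

```python
import math

# Hypothetical phrase list; a real deployment would maintain its own.
REFUSAL_PHRASES = ("i can't help", "i cannot help", "i'm unable to")


def response_metrics(responses):
    """Reduce a batch of responses to a few numbers that can be profiled."""
    n = len(responses)
    word_counts = [len(r.split()) for r in responses]
    refusals = sum(
        any(phrase in r.lower() for phrase in REFUSAL_PHRASES) for r in responses
    )
    return {"mean_words": sum(word_counts) / n, "refusal_rate": refusals / n}


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    dims = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dims)]


def cosine_distance(a, b):
    """1 - cosine similarity; near 0 means the two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


responses = [
    "Sure, here is the summary.",
    "I can't help with that request.",
    "I'm unable to share that.",
    "The answer is 42.",
]
metrics = response_metrics(responses)

# Made-up embeddings of sampled inputs from a baseline window and from today.
baseline_embeddings = [[1.0, 0.0], [0.9, 0.1]]
today_embeddings = [[0.0, 1.0], [0.1, 0.9]]
shift = cosine_distance(centroid(baseline_embeddings), centroid(today_embeddings))
print(metrics, round(shift, 3))
```

Both outputs are plain numbers per batch, so they can be logged as columns and monitored like any tabular feature.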

Production patterns

  • Profile at every boundary: features in, predictions out, (later) labels — drift *between* boundaries localizes the problem (input drift vs model staleness vs label shift).
  • Segment profiles by the dimensions you'd debug by (region, platform, customer tier) — aggregate drift often hides a single segment on fire.
  • Baseline hygiene: re-baseline deliberately after intentional changes (new model, new market) or every alert becomes "yes, we know."
  • Wire alerts to ownership: drift alerts without a routing table become a muted channel within a month — same lesson as every monitoring system.
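
The segmentation and routing points above can be illustrated together. In this sketch the aggregate null rate barely moves while one segment is badly broken, and each alert carries an owner. The data, the `OWNERS` routing table, the channel names, and the `MAX_INCREASE` threshold are all hypothetical.

```python
def null_rate(rows, column):
    """Fraction of rows where `column` is missing."""
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def segment_null_rates(rows, column, segment_key):
    """Null rate of `column` computed separately for each segment value."""
    segments = {}
    for row in rows:
        segments.setdefault(row[segment_key], []).append(row)
    return {name: null_rate(members, column) for name, members in segments.items()}


# Hypothetical routing table and per-segment baselines.
OWNERS = {"web": "#web-data", "android": "#mobile-data"}
BASELINE_NULL_RATES = {"web": 0.0, "android": 0.0}
MAX_INCREASE = 0.2  # alert when a segment's null rate rises by more than this

# Made-up batch: 90 healthy web rows, 10 android rows of which 8 lost the field.
today = (
    [{"platform": "web", "email": "a@example.com"}] * 90
    + [{"platform": "android", "email": None}] * 8
    + [{"platform": "android", "email": "b@example.com"}] * 2
)

aggregate = null_rate(today, "email")
per_segment = segment_null_rates(today, "email", "platform")
alerts = [
    (segment, OWNERS.get(segment, "#unrouted"))
    for segment, rate in per_segment.items()
    if rate - BASELINE_NULL_RATES.get(segment, 0.0) > MAX_INCREASE
]
print(aggregate, per_segment, alerts)
```

Here the aggregate rate is 0.08, which would not cross the same threshold, while the android segment sits at 0.8 and is routed to its owner.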

FAQ

Do I need this if I have Datadog/Grafana? APM monitors *systems* (latency, errors); this monitors *data and predictions*. The profile metrics can land in your existing dashboards. The gap it fills is statistical, not infrastructural.

Open-source-only path? whylogs profiles, your own storage, and scheduled comparison jobs get you roughly 70% of the value without a platform.
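
A rough sketch of the "own storage plus scheduled comparison" half of that path, using JSON files in place of real whylogs profiles. The file layout, function names, dates, and tolerance are assumptions for illustration; in practice a scheduler such as cron would call `compare_to_baseline` once per day, and the stored files would be whatever format your profiling step writes.

```python
import json
import tempfile
from pathlib import Path


def save_profile(store, day, profile):
    """Persist one day's profile summary as <store>/<day>.json."""
    (store / f"{day}.json").write_text(json.dumps(profile))


def load_profiles(store, days):
    """Load the stored profile summaries for the given days."""
    return [json.loads((store / f"{day}.json").read_text()) for day in days]


def compare_to_baseline(store, baseline_days, today, tolerance=0.05):
    """Return columns whose null rate moved more than `tolerance` from baseline."""
    baseline = load_profiles(store, baseline_days)
    current = load_profiles(store, [today])[0]
    findings = {}
    for column, stats in current.items():
        base = sum(p[column]["null_rate"] for p in baseline) / len(baseline)
        if abs(stats["null_rate"] - base) > tolerance:
            findings[column] = {"baseline": base, "today": stats["null_rate"]}
    return findings


# Demo in a temporary directory with made-up daily summaries.
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    save_profile(store, "2026-06-01", {"email": {"null_rate": 0.01}})
    save_profile(store, "2026-06-02", {"email": {"null_rate": 0.03}})
    save_profile(store, "2026-06-03", {"email": {"null_rate": 0.40}})
    findings = compare_to_baseline(
        store, ["2026-06-01", "2026-06-02"], "2026-06-03"
    )
print(findings)
```

The sketch assumes every column in today's profile also exists in each baseline profile; a schema change would need explicit handling, and is itself worth alerting on.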

When is this overkill? A single low-stakes model where labels arrive instantly and volume is small: eyeball a dashboard instead. The pattern earns its keep when labels lag, volume is real, or compliance asks "how would you know if the model degraded?"


*Last updated: June 2026. Verify current WhyLabs product status and the maintained text-metrics toolkit before adopting; the whylogs pattern is OSS regardless.*
