WhyLabs AI Observatory: Complete Setup Guide

Real-time data and AI monitoring with WhyLabs

By AI Skill Navigation Editorial Team. Published June 12, 2026

WhyLabs & ML Observability: Setup Guide

WhyLabs built its platform on an idea that outlives any vendor: monitor statistical profiles of your data, not the raw data itself. Its open-source library whylogs generates compact statistical summaries (distributions, missing rates, cardinalities) of whatever flows through your pipeline; the platform watches those profiles for drift and anomalies over time. This guide covers the profile-based monitoring pattern, setup, and where it fits in an LLM-era observability stack. *(Vendor landscape note: check the company's current product status and pricing before committing — this category has consolidated repeatedly; the whylogs pattern itself is open source and portable.)*

The core idea: profiles, not payloads

```python
import pandas as pd
import whylogs as why

# batch_of_predictions: your scored batch (features + predictions + metadata)
df = pd.DataFrame(batch_of_predictions)
profile = why.log(df)                         # statistical summary, NOT the rows
profile.writer('whylabs').write()             # or write locally / to your own store
```

A profile captures per-column distributions, null rates, type counts, and frequent items in kilobytes. The raw rows never leave your boundary, which is why this pattern tends to clear privacy review where log-everything tools don't. One caveat: frequent-item summaries can contain raw values, so check which columns you profile before calling it GDPR-safe. Profiles from every batch, hour, or day line up into time series, and monitoring becomes: *did today's distribution shift against the baseline?*
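To make "profile, not payload" concrete, here is a minimal hand-rolled sketch using only the standard library. It is not how whylogs stores profiles internally (production profilers typically use approximate sketches rather than exact counts); the function name `profile_column` and the sample batch are illustrative assumptions.

```python
from collections import Counter


def profile_column(values, top_k=3):
    """Summarize one column; the returned dict holds statistics, not rows."""
    non_null = [v for v in values if v is not None]
    return {
        "count": len(values),
        "null_rate": (len(values) - len(non_null)) / len(values) if values else 0.0,
        "types": dict(Counter(type(v).__name__ for v in non_null)),
        # Exact distinct count for illustration; real profilers approximate this.
        "distinct": len(set(non_null)),
        "frequent_items": Counter(non_null).most_common(top_k),
    }


# Hypothetical scoring batch.
batch = [
    {"country": "DE", "score": 0.91},
    {"country": "DE", "score": 0.42},
    {"country": None, "score": 0.77},
    {"country": "FR", "score": 0.10},
]

profile = {col: profile_column([row[col] for row in batch]) for col in batch[0]}
print(profile["country"])
```

Note that `frequent_items` does carry raw values, which is the reason to think about which columns get profiled.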

What you catch with it

Input drift: the upstream team renamed a field; a market shift changed user demographics; a scraper started returning empty strings — all visible as distribution change *before* accuracy visibly degrades.

Prediction drift: your classifier's output mix moving (60/40 → 80/20) flags a problem even with no ground-truth labels yet — the key trick for production ML where labels lag by weeks.

Data-quality regressions: null-rate spikes, cardinality explosions, schema changes — the boring failures that cause most "model broke" incidents.
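The prediction-drift case above (a 60/40 output mix moving to 80/20) can be scored without any labels. The sketch below uses the Population Stability Index as one possible divergence metric; the choice of PSI, the class names, and the `eps` floor are assumptions for illustration, not something the platform prescribes.

```python
import math


def psi(baseline, current, eps=1e-6):
    """Population Stability Index between two {class: share} dicts."""
    total = 0.0
    for label in set(baseline) | set(current):
        # Floor shares at eps so a class missing on one side does not divide by zero.
        b = max(baseline.get(label, 0.0), eps)
        c = max(current.get(label, 0.0), eps)
        total += (c - b) * math.log(c / b)
    return total


baseline_mix = {"approve": 0.60, "reject": 0.40}
today_mix = {"approve": 0.80, "reject": 0.20}

score = psi(baseline_mix, today_mix)
print(f"PSI = {score:.3f}")  # PSI = 0.196
```

A commonly quoted rule of thumb treats PSI below 0.1 as stable and above 0.25 as a major shift, so this example would land in the "investigate" band. Treat those cut-offs as a starting point and tune them per model.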

Setup is: profile every scoring batch → set baselines (training data or a stable window) → alert on divergence metrics per column → route to the owning team.
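The four setup steps can be sketched as a single comparison job. Everything here is a hypothetical stand-in: the per-column summaries would come from your profiler, and the thresholds and the `OWNERS` table are assumed values you would choose yourself.

```python
# Hypothetical per-column summaries for the baseline window and today's batch.
baseline = {
    "age": {"null_rate": 0.01, "mean": 41.0},
    "email": {"null_rate": 0.02, "mean": None},
}
today = {
    "age": {"null_rate": 0.01, "mean": 40.2},
    "email": {"null_rate": 0.35, "mean": None},
}

# Assumed thresholds and ownership table.
MAX_NULL_RATE_INCREASE = 0.05
MAX_RELATIVE_MEAN_SHIFT = 0.10
OWNERS = {"age": "team-risk", "email": "team-ingest"}


def check_batch(baseline, today):
    """Return (column, message) alerts for columns that diverge from baseline."""
    alerts = []
    for column, base in baseline.items():
        cur = today.get(column)
        if cur is None:
            # A renamed or dropped upstream field shows up here.
            alerts.append((column, "column missing from today's profile"))
            continue
        if cur["null_rate"] - base["null_rate"] > MAX_NULL_RATE_INCREASE:
            alerts.append(
                (column, f"null rate {base['null_rate']:.2f} -> {cur['null_rate']:.2f}")
            )
        if base["mean"] and cur["mean"] is not None:
            shift = abs(cur["mean"] - base["mean"]) / abs(base["mean"])
            if shift > MAX_RELATIVE_MEAN_SHIFT:
                alerts.append((column, f"mean {base['mean']:.1f} -> {cur['mean']:.1f}"))
    return alerts


for column, message in check_batch(baseline, today):
    print(f"[{OWNERS.get(column, 'unowned')}] {column}: {message}")
```

Only `email` alerts in this example: its null rate jumps well past the threshold, while the small change in mean `age` stays inside tolerance.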

The LLM-era extension

The same pattern extends to text systems, with embeddings and metrics standing in for tabular columns:

Profile prompt/response *metrics*: lengths, language mix, refusal-phrase rates, toxicity/PII detector scores, validation-failure rates — distribution shifts in these catch prompt-injection waves, model-version drift, and traffic-mix changes.

Embedding-space drift: embed a sample of inputs; monitor centroid/spread movement — "our users started asking about something new" as a measurable event (langkit-style toolkits package these text metrics around whylogs).
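Both signals above can be sketched with the standard library. This is an illustration under stated assumptions, not the langkit API: the refusal phrase list is made up, the sample responses are invented, and the 3-dimensional vectors stand in for whatever your embedding model actually returns.

```python
import math

REFUSAL_PHRASES = ("i can't help", "i cannot help", "i'm unable to")  # assumed list


def response_metrics(responses):
    """Population-level text metrics for one non-empty batch of responses."""
    n = len(responses)
    refusals = sum(any(p in r.lower() for p in REFUSAL_PHRASES) for r in responses)
    return {
        "refusal_rate": refusals / n,
        "mean_words": sum(len(r.split()) for r in responses) / n,
    }


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(dim) / len(vectors) for dim in zip(*vectors)]


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (math.hypot(*a) * math.hypot(*b))


yesterday_metrics = response_metrics([
    "Here is the summary you asked for.",
    "The capital of France is Paris.",
    "You can reset it from the settings page.",
    "Sure, the function returns a list.",
])
today_metrics = response_metrics([
    "I can't help with that request.",
    "I cannot help with that.",
    "Here is the summary you asked for.",
    "I'm unable to share that information.",
])
print(yesterday_metrics["refusal_rate"], today_metrics["refusal_rate"])

# Toy embeddings: yesterday's inputs cluster on one axis, today's on another.
baseline_centroid = centroid([[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [1.0, 0.1, 0.0]])
today_centroid = centroid([[0.1, 1.0, 0.0], [0.0, 0.9, 0.1], [0.2, 1.0, 0.0]])
centroid_shift = cosine_distance(baseline_centroid, today_centroid)
print(f"centroid shift = {centroid_shift:.2f}")
```

In practice you would compute these per time window on a sample of traffic and track them as ordinary numeric columns, so the same baseline-versus-today comparison applies.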

This complements, not replaces, trace-level LLM observability (LangSmith/Langfuse/Helicone): traces answer "what happened in this request"; profiles answer "is the population shifting" — mature stacks run both, and profiles are the half that scales to millions of calls for pennies.

Production patterns

Profile at every boundary: features in, predictions out, (later) labels — drift *between* boundaries localizes the problem (input drift vs model staleness vs label shift).

Segment profiles by the dimensions you'd debug by (region, platform, customer tier) — aggregate drift often hides a single segment on fire.

Baseline hygiene: re-baseline deliberately after intentional changes (new model, new market) or every alert becomes "yes, we know."

Wire alerts to ownership: drift alerts without a routing table become a muted channel within a month — same lesson as every monitoring system.
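The segmentation point is easiest to see with numbers. In this sketch (hypothetical rows, hypothetical `region` and `email` columns), the aggregate null rate looks tolerable while one segment is almost entirely broken.

```python
from collections import defaultdict


def null_rate_by_segment(rows, segment_key, column):
    """Null rate of `column`, computed separately for each value of `segment_key`."""
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for row in rows:
        segment = row[segment_key]
        totals[segment] += 1
        nulls[segment] += row[column] is None
    return {segment: nulls[segment] / totals[segment] for segment in totals}


# 90 healthy EU rows; 10 APAC rows, 8 of which lost their email field.
rows = (
    [{"region": "eu", "email": "a@example.com"}] * 90
    + [{"region": "apac", "email": None}] * 8
    + [{"region": "apac", "email": "b@example.com"}] * 2
)

overall = sum(row["email"] is None for row in rows) / len(rows)
by_region = null_rate_by_segment(rows, "region", "email")

print(f"overall: {overall:.2f}")  # overall: 0.08
print(by_region)                  # {'eu': 0.0, 'apac': 0.8}
```

An 8% aggregate null rate might sit under an alert threshold, while the 80% rate in `apac` clearly would not, which is the argument for profiling per segment rather than only in aggregate.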

FAQ

Do I need this if I have Datadog/Grafana? APM monitors *systems* (latency, errors); this monitors *data and predictions*. The profile metrics can land in your existing dashboards — the gap it fills is statistical, not infrastructural.

Open-source-only path? whylogs profiles, your own storage, and scheduled comparison jobs get you most of the value without a platform; what you build yourself is the scheduled pipeline that compares each new profile against the baseline.

When is this overkill? Single low-stakes model, labels arrive instantly, volume is small — eyeball a dashboard. The pattern earns its keep when labels lag, volume is real, or compliance asks "how would you know if the model degraded?"

*Last updated: June 2026. Verify current WhyLabs product status and the maintained text-metrics toolkit before adopting; the whylogs pattern is OSS regardless.*
