The Notebook Trap
Most ML projects die in the notebook phase. Not because the model is poor, but because the surrounding infrastructure was never designed for production. A notebook is a research environment. Production requires reproducibility, observability, and governance. The transition demands deliberate architecture.
The notebook trap works like this: a data scientist builds a model in a notebook. The metrics look good. The business stakeholder sees the accuracy and assumes the problem is solved. Then engineering tries to productionise it. Six months later, the project is cancelled.
We have seen this pattern dozens of times. The root cause is not technical incompetence; it is a mismatch between the tools and processes of research and the requirements of production. Notebooks are for exploration. Production requires reproducibility, versioning, testing, monitoring, and integration with the rest of the software lifecycle.
Feature Stores: The Foundation
Feature stores solve the training-serving skew problem. When features are computed differently at training time and inference time, models fail silently in production. A feature store maintains consistent definitions across both paths, versioning features as first-class artefacts.
Training-serving skew is the most insidious problem in production ML. It occurs when the code that generates features during training differs from the code that generates features during inference. Even a small difference - a missing null-handling step, a different scaling method, a timezone conversion error - can reduce model accuracy by 10-20%. The model works in training but fails silently in production because the inputs are not what it expects.
A feature store solves this by defining features declaratively and serving them consistently. When a data scientist defines a feature, the feature store materialises it for both training (batch, historical) and serving (real-time, current). The same transformation code runs in both contexts, eliminating skew.
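As a minimal sketch, a declarative feature definition in Feast might look like the following (the entity, paths, and feature names are illustrative, and the API shown reflects recent Feast releases). The same `FeatureView` drives both the offline materialisation used for training and the online store queried at inference, which is what eliminates the skew:

```python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource
from feast.types import Float32, Int64

# The entity is the key that features are joined on in both paths.
user = Entity(name="user", join_keys=["user_id"])

# Offline source of truth; Feast materialises from here into the online store.
user_stats_source = FileSource(
    path="data/user_stats.parquet",   # illustrative path
    timestamp_field="event_timestamp",
)

# One declarative definition serves both training and inference.
user_stats = FeatureView(
    name="user_stats",
    entities=[user],
    ttl=timedelta(days=1),
    schema=[
        Field(name="orders_last_30d", dtype=Int64),
        Field(name="avg_order_value", dtype=Float32),
    ],
    source=user_stats_source,
)
```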
We typically see two approaches: online stores for real-time inference (sub-10ms latency) and offline stores for batch training jobs. Feast dominates the open-source landscape, with Tecton the leading commercial platform, but simpler implementations using Redis plus Parquet often suffice for teams without dedicated ML infrastructure engineers.
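For teams taking the simpler route, the Redis-plus-Parquet pattern can be sketched as below (the connection details, key scheme, and paths are assumptions). The point is that one batch of computed features feeds both stores, so the online and offline paths cannot diverge:

```python
import json

import pandas as pd
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)  # assumed instance

def materialise(features: pd.DataFrame, entity_key: str = "user_id") -> None:
    """Write one batch of computed features to both stores from the same source."""
    # Online path: one key per entity for low-latency lookups at inference.
    for _, row in features.iterrows():
        r.set(f"features:{row[entity_key]}", row.to_json())
    # Offline path: Parquet snapshot for historical training joins.
    features.to_parquet("offline/user_features.parquet")  # illustrative path

def get_online_features(entity_id: str) -> dict:
    """Fetch the current feature values for one entity at inference time."""
    raw = r.get(f"features:{entity_id}")
    return json.loads(raw) if raw else {}
```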
The critical decision is feature ownership. When data scientists define features ad hoc in notebooks, nobody maintains them. Production feature stores require data engineering involvement and clear ownership boundaries.
Model Registries and Versioning
A model registry is not a model store. It is a metadata system that tracks experiments, artefacts, dependencies, and deployment state. MLflow is the default choice, but Weights & Biases and Neptune offer superior collaboration features for larger teams.
Versioning goes beyond models. You must version your data, your code, your feature pipeline, and your model weights. Use DVC or lakeFS for data versioning. Use Git for code. Use MLflow or Weights & Biases for experiment tracking. If you cannot reproduce a model from six months ago, you do not have a production system.
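As a sketch, logging a training run and promoting it into the MLflow registry might look like this (the tracking URI, experiment, and model names are placeholders, and the toy dataset stands in for real training data):

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder URI
mlflow.set_experiment("churn-model")

X, y = make_classification(n_samples=5_000, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

with mlflow.start_run() as run:
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_val, model.predict_proba(X_val)[:, 1])

    # Track the experiment: parameters, metrics, and the artefact itself.
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("val_auc", auc)
    mlflow.sklearn.log_model(model, artifact_path="model")

    # Promote this run's artefact into the registry as a new model version.
    mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn-model")
```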
Data versioning is the foundation. Unlike code, which is compact text, training datasets can run to terabytes. DVC (Data Version Control) addresses this by storing lightweight metadata in Git and the actual data in object storage (S3, GCS, Azure Blob). When you check out a Git commit, a `dvc pull` fetches the dataset version that commit recorded.
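For example, reading the exact dataset version behind a past training run through DVC's Python API might look like this (the repository URL, path, and tag are hypothetical):

```python
import dvc.api

# Open the dataset version recorded for a past training run, pinned by
# the Git tag or commit rather than a mutable file path.
with dvc.api.open(
    "data/train.parquet",                    # DVC-tracked path (illustrative)
    repo="https://github.com/org/ml-repo",   # hypothetical repository
    rev="model-v1.2",                        # tag pinned at training time
    mode="rb",
) as f:
    raw_bytes = f.read()
```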
Deployment Patterns
Production ML deployment has three dominant patterns, each with different trade-offs:
- Shadow mode. The new model runs alongside the old; predictions are logged but not acted upon. Validates performance without risk. The new model processes real production traffic but its outputs are discarded, while the old model continues to serve users (see the sketch after this list).
- Canary deployment. Small traffic percentage routes to new model, increasing gradually as metrics hold. We typically start with 1% traffic, monitor for 24 hours, then increase to 5%, 10%, 25%, 50%, and finally 100%.
- A/B testing. Two models serve different user segments simultaneously. Measures business impact, not just model accuracy. Requires careful experimental design: random assignment, sufficient sample size, and clear success metrics.
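A minimal sketch of shadow mode follows (the model interface and logging setup are assumptions). The live model answers the request; the candidate is scored off the hot path and its output is only logged for later comparison:

```python
import logging
import threading

logger = logging.getLogger("shadow")

def predict(features, live_model, shadow_model):
    """Serve the live model's prediction; score the shadow model in the background."""
    live_pred = live_model.predict(features)

    def score_shadow():
        try:
            shadow_pred = shadow_model.predict(features)
            # Log both outputs for offline comparison; the shadow result is never served.
            logger.info("shadow_compare live=%s shadow=%s", live_pred, shadow_pred)
        except Exception:
            # A failing candidate must never affect user-facing traffic.
            logger.exception("shadow model raised")

    threading.Thread(target=score_shadow, daemon=True).start()
    return live_pred
```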
Monitoring: What to Watch
Production ML monitoring covers four layers:
- Data drift. Input distributions shift over time — seasonality, user behaviour changes, upstream pipeline modifications. Statistical tests detect this automatically (see the sketch after this list). We typically see data drift after marketing campaigns, product launches, and external events like bank holidays.
- Concept drift. The relationship between inputs and outputs changes. Harder to detect than data drift, requires periodic re-evaluation against labelled holdout sets. Concept drift often occurs when the underlying business process changes.
- Model performance. Accuracy, precision, recall, or business metrics tracked over time. Set alerts on degradation relative to a baseline, not on absolute thresholds. We recommend alerting when performance drops by more than 5% relative to the baseline.
- System health. Latency, throughput, error rates, resource utilisation. A model can be accurate but too slow to serve at peak load. We have seen models with 99.9% accuracy that could not serve traffic at 10 requests per second.
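As a sketch, per-feature data-drift detection with a two-sample Kolmogorov-Smirnov test might look like this (the significance level is a tunable assumption, and each numeric feature is tested independently):

```python
import numpy as np
from scipy.stats import ks_2samp

def feature_drifted(reference: np.ndarray, live: np.ndarray, alpha: float = 0.01) -> bool:
    """Compare the live distribution of one numeric feature against a
    reference window (e.g. the training data) and flag drift when the
    KS test rejects the hypothesis that the two samples match."""
    statistic, p_value = ks_2samp(reference, live)
    return p_value < alpha

# Example: a mean shift in the live window is flagged as drift.
rng = np.random.default_rng(0)
print(feature_drifted(rng.normal(0, 1, 10_000), rng.normal(0.3, 1, 10_000)))  # True
```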
Retraining Strategy
Models degrade. The question is not whether to retrain, but how frequently and what triggers it. Three approaches dominate:
- Scheduled retraining. Weekly or monthly retraining on the latest data. Simple but potentially wasteful. We recommend starting here because it is predictable and easy to operationalise.
- Trigger-based retraining. Automated retraining when performance drops below a threshold or data drift exceeds tolerance. More efficient than scheduled retraining but requires robust monitoring. We typically implement this as a second phase (a sketch of the trigger check follows this list).
- Continuous learning. Incremental model updates from each new batch of data. Complex, prone to catastrophic forgetting, rarely justified for most organisations. Research organisations and tech giants use this, but for most companies, the operational complexity outweighs the benefit.
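A trigger-based check can combine the monitoring signals above. The sketch below assumes AUC as the tracked metric and mirrors the 5% relative-drop alert recommended earlier; both thresholds are tunable assumptions:

```python
def should_retrain(baseline_auc: float, current_auc: float,
                   drift_p_value: float,
                   max_relative_drop: float = 0.05,
                   drift_alpha: float = 0.01) -> bool:
    """Fire the retraining pipeline when performance degrades relative to
    the baseline, or when input drift is detected by the monitoring layer."""
    degraded = current_auc < baseline_auc * (1 - max_relative_drop)
    drifted = drift_p_value < drift_alpha
    return degraded or drifted

# Example: a 6% relative drop in AUC triggers retraining even without drift.
print(should_retrain(baseline_auc=0.90, current_auc=0.845, drift_p_value=0.2))  # True
```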
Our Recommendation
Start with a simple feature store (Feast or custom), MLflow for model registry, and shadow mode deployment. Add monitoring for data drift and model performance. Only after this foundation is solid should you introduce canary deployments and automated retraining. The teams that fail are those that try to build the entire platform at once.