The first week after an inline AI inspection system goes live, every quality engineer watches the detection rate. It's the obvious number — how many defects is the model catching? But once the initial deployment stabilizes, detection rate stops being the most informative metric. The metrics that actually tell you whether the system is working well, degrading slowly, or about to cause a problem are a different set, and they're less obvious.
This post covers the operational metrics we've seen quality engineering teams converge on after a few months of running inline AI inspection. FPA — false positive analysis — is the category most teams underinvest in at deployment but end up relying on heavily once the system is mature. The other metrics in this set are model confidence drift, escape rate per shift, and reject confirmation rate.
False positive rate: the metric that erodes trust
False positives are good parts that the model flags as defective. On a high-volume line, even a small false positive rate produces a significant volume of incorrectly rejected good parts. On a line running 180 parts per minute across two 8-hour shifts, a 0.5% false positive rate rejects about 432 good parts per shift, or 864 per day. That is roughly 26,000 per month if the line runs every day.
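The arithmetic behind that volume estimate is easy to check. This is a minimal sketch, assuming that nearly every inspected part is good (so the count is a slight overestimate) and that the line runs 30 days a month; the variable names are ours and purely illustrative.

```python
# Illustrative arithmetic only. All names and the 30-day month are assumptions.
PARTS_PER_MINUTE = 180
SHIFT_HOURS = 8
SHIFTS_PER_DAY = 2
FALSE_POSITIVE_RATE = 0.005  # 0.5% of good parts flagged as defective
DAYS_PER_MONTH = 30          # assumption: the line runs every day

parts_per_shift = PARTS_PER_MINUTE * 60 * SHIFT_HOURS

# round() guards against floating-point noise in the multiplication
fp_per_shift = round(parts_per_shift * FALSE_POSITIVE_RATE)
fp_per_day = fp_per_shift * SHIFTS_PER_DAY
fp_per_month = fp_per_day * DAYS_PER_MONTH

print(f"parts inspected per shift: {parts_per_shift}")
print(f"false rejects per shift:   {fp_per_shift}")
print(f"false rejects per day:     {fp_per_day}")
print(f"false rejects per month:   {fp_per_month}")
```

Swapping in your own line speed, shift pattern, and measured false positive rate gives a quick estimate of the re-inspection or scrap volume to plan for.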
The production cost of false positives depends on what happens to rejected parts. If rejected parts go to a re-inspection station where an auditor confirms or overrides the model's decision, the cost is re-inspection labor. If rejected parts are automatically scrapped without human review, the cost is the full production value of each incorrectly rejected good part. The second scenario — automatic scrap on AI reject — is defensible only when the false positive rate is very low and the per-part value is also low. For most applications we see, a human confirmation step on rejected parts is the right operating model, especially in the first several months of deployment.
False positive analysis (FPA) is the process of reviewing confirmed false positives to understand why the model made the error. The categories we see most often:
- Surface texture edge cases: Parts with unusual surface finish from a tooling change or a different raw material lot. The model hasn't seen this texture in training and interprets it as a defect signal.
- Lighting variation: Ambient light changes on the production floor — a nearby machine turned on or off, a fixture moved — shift the illumination in a way the model wasn't trained for.
- Part positioning variation: The part arrived at the inspection station slightly off the expected position. The defect-region-of-interest falls at the image edge or the feature of interest is partially occluded.
- Dust and contamination on the camera lens or illumination: Gradual contamination that shifts the image characteristics slowly until the model starts misclassifying.
Each category requires a different response. Surface texture edge cases require adding training examples from the new material or tooling state. Lighting variation requires either fixing the environmental condition or adding training examples across the lighting range. Part positioning variation requires mechanical fixture adjustment. Lens contamination requires a maintenance protocol.
The FPA process only works if you're logging rejected parts with images and timestamps, and if your re-inspection workflow produces confirmed pass/fail labels on what the model rejected. Both of those are operational disciplines, not just software features. Procunit logs every rejection with the image and model confidence score. The confirmation workflow — where an auditor marks whether the rejection was valid — needs to be set up as a production process when the system goes live.
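To make the logging requirement concrete, here is a sketch of the minimum a rejection record needs to carry for FPA to work. The field names, the `confirm` helper, and the category strings are hypothetical and chosen for illustration; this is not Procunit's actual schema.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class RejectionRecord:
    """One model reject, as logged at the inspection station (hypothetical schema)."""
    part_id: str
    image_path: str
    timestamp: datetime
    defect_class: str
    confidence: float
    # Filled in later by the re-inspection auditor; None means "not yet reviewed".
    confirmed_defect: Optional[bool] = None
    fp_category: Optional[str] = None  # e.g. "lighting", "positioning"


def confirm(record: RejectionRecord, is_defect: bool,
            fp_category: Optional[str] = None) -> RejectionRecord:
    """Record the auditor's decision; a category only applies to false positives."""
    record.confirmed_defect = is_defect
    record.fp_category = None if is_defect else fp_category
    return record


def false_positives(records: List[RejectionRecord]) -> List[RejectionRecord]:
    """Reviewed rejects the auditor overrode; unreviewed records are excluded."""
    return [r for r in records if r.confirmed_defect is False]
```

Keeping "not yet reviewed" distinct from "confirmed good" matters: if unreviewed rejects are counted as either outcome, every downstream rate is skewed.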
Model confidence drift
Model confidence is the output score the model assigns to each prediction — a number between 0 and 1 representing how certain the model is of its classification. A high-confidence detection (score 0.93 for a defect class) is more reliable than a low-confidence detection (score 0.61).
Confidence drift is when the distribution of confidence scores shifts over time. If the model was initially producing defect detections at an average confidence of 0.87, and three months later the same defect class is being detected at an average confidence of 0.72, something in the production environment has changed in a way the model wasn't trained for. The model is still detecting (it's still flagging parts), but it's less certain — which means it's more likely to make classification errors at the margin.
Confidence drift is the leading indicator of escape rate problems. By the time escape rate trends upward in your audit data, the model has typically been showing confidence drift for two to four weeks. Catching confidence drift early gives you time to respond with retraining before escapes occur.
The practical metric to track is the rolling 7-day median confidence score per defect class. Set a threshold, typically 0.10 to 0.15 below the initial deployment median (10 to 15 percentage points on the 0-to-1 confidence scale), and treat a breach of that threshold as a retraining trigger. In Procunit, we surface this as a dashboard alert so quality engineers don't have to actively monitor the score distribution. The alert doesn't necessarily mean the model is failing; it means something changed and an investigation is warranted.
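For teams that want to compute this outside a dashboard, the rolling median and threshold check can be sketched in a few lines. The function names, the input shape (a list of date and score pairs for one defect class), and the 0.10 default are assumptions for illustration, not a description of how Procunit implements the alert.

```python
from datetime import date, timedelta
from statistics import median
from typing import List, Optional, Tuple


def rolling_median(scores: List[Tuple[date, float]], as_of: date,
                   window_days: int = 7) -> Optional[float]:
    """Median confidence for one defect class over the window ending on as_of."""
    start = as_of - timedelta(days=window_days - 1)
    window = [conf for day, conf in scores if start <= day <= as_of]
    return median(window) if window else None


def drift_alert(baseline_median: float, current_median: Optional[float],
                max_drop: float = 0.10) -> bool:
    """True when the current median sits max_drop or more below the baseline."""
    if current_median is None:
        return False  # no detections in the window, so nothing to compare
    return current_median <= baseline_median - max_drop
```

Run the check separately for each defect class. A class with very few detections in a week yields a noisy median, so it may be worth requiring a minimum sample count before acting on an alert.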
Escape rate per shift
We've written about escape rate measurement in a separate post. In the context of ongoing operations, the shift-level granularity of escape rate matters for a specific operational reason: if escapes are clustering in particular shifts, the cause is likely something shift-specific — a different operator, a machine condition that develops over the shift cycle, or an ambient condition that differs between day and night shifts.
Shift-level escape rate requires a sampling audit that covers at least one lot per shift per day. This is more labor-intensive than a single daily audit, but for lines where you have reason to suspect shift variation (e.g., a machining operation where tool temperature affects surface finish), the additional resolution is worth the cost.
The control metric: calculate escape rate per shift for each of the past 14 shifts and compare to the overall baseline. If night shift escape rate is running 2× day shift escape rate, investigate what's different. This level of analysis isn't possible with daily aggregate data.
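One way to run that comparison is sketched below. It assumes each audit is recorded as a tuple of shift name, parts audited, and escapes found, in chronological order, and that the baseline escape rate comes from an earlier stable period. The names and the 2x default ratio are illustrative.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Audit = Tuple[str, int, int]  # (shift_name, parts_audited, escapes_found)


def escape_rate_by_shift(audits: List[Audit], last_n: int = 14) -> Dict[str, float]:
    """Pooled escape rate per shift name over the most recent last_n audited shifts."""
    totals = defaultdict(lambda: [0, 0])  # shift -> [audited, escapes]
    for shift, audited, escapes in audits[-last_n:]:
        totals[shift][0] += audited
        totals[shift][1] += escapes
    return {s: e / a for s, (a, e) in totals.items() if a > 0}


def flag_shifts(rates: Dict[str, float], baseline: float,
                ratio: float = 2.0) -> List[str]:
    """Shifts whose escape rate is at least ratio times the baseline rate."""
    return sorted(s for s, r in rates.items() if r >= ratio * baseline)
```

Pooling counts per shift, rather than averaging per-audit rates, keeps small audit samples from dominating the result.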
Reject confirmation rate
Reject confirmation rate is the fraction of model rejects that are confirmed as genuine defects by the re-inspection auditor. It is the complement of the false positive share among rejects: the fraction of rejects that were correct rather than the fraction that were wrong. Note that it is measured over rejected parts only, not over all inspected parts.
A well-tuned model on a stable line should show a reject confirmation rate of 85 to 95%. Below 80%, the false positive rate is high enough that the model is creating substantial re-inspection burden and potentially eroding operator trust in the system. Above 97%, the model's detection threshold may be set too conservatively — it's only flagging high-confidence defects and potentially missing borderline cases.
Track this metric at weekly granularity. The response to a declining confirm rate (increasing FPs) is different from the response to a rising confirm rate (potentially increasing escapes on borderline cases). Both directions matter, and each calls for its own intervention.
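The calculation and the bands above can be sketched as follows. The function names and the returned labels are ours. The text gives no guidance for the gaps between the bands (80 to 85% and 95 to 97%), so labelling those "borderline" is our assumption.

```python
from typing import Optional


def reject_confirmation_rate(confirmed_defects: int,
                             reviewed_rejects: int) -> Optional[float]:
    """Fraction of reviewed rejects that the auditor confirmed as real defects."""
    if reviewed_rejects == 0:
        return None
    return confirmed_defects / reviewed_rejects


def interpret(rate: Optional[float]) -> str:
    """Map a weekly confirmation rate onto the bands described in the text."""
    if rate is None:
        return "no reviewed rejects"
    if rate < 0.80:
        return "high false positive burden"
    if rate > 0.97:
        return "threshold may be too strict"
    if 0.85 <= rate <= 0.95:
        return "typical range"
    return "borderline"  # assumption: the text leaves these gaps unspecified
```

Only count rejects that have actually been reviewed in the denominator; including unreviewed rejects would understate the rate.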
Building the dashboard
The four metrics above (false positive rate, confidence drift per defect class, escape rate per shift, and reject confirmation rate) form the core of a useful quality operations dashboard for inline AI inspection. Everything else is secondary.
The temptation when building a quality dashboard is to add more metrics because more feels like more visibility. In practice, quality engineers we've worked with report that dashboards with more than six or seven primary metrics get ignored. The metrics described above are specifically chosen because each one triggers a distinct operational response when it moves outside control limits. A metric that changes but doesn't require a specific action is noise, not signal.
We're not saying other metrics have no value — throughput, detection count per part family, and shift-by-shift summary are all useful supporting data. But if you have to pick the four metrics that would alert you to a developing problem before it causes a customer impact, the four above are the ones to watch.
When to retrain
Retraining triggers can be proactive (confidence drift alert) or reactive (escape confirmed in audit, FP rate spike after a production change). Both are valid starting points, but proactive retraining — acting on confidence drift before escapes increase — is preferable when the production environment permits it.
Retraining doesn't require starting over. In Procunit, a retraining session adds the new examples (confirmed false positives, confirmed escapes, or images from the new production condition) to the existing training set and runs a fine-tuning pass. For most production environment changes, adding 20 to 40 images and retraining overnight is sufficient to recover model performance. You don't need to collect another full 50-image training set from scratch.
The discipline is keeping the confirmed examples organized — tagging each one with the date, the production condition change that generated it, and the defect class or false positive category. That record becomes your model history and makes future retraining decisions faster because you know what the model has and hasn't been trained for.
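A lightweight way to keep that record is one tagged entry per confirmed example, plus a summary of what the model has been shown. The `TaggedExample` fields and the `coverage` helper below are hypothetical, a sketch of the idea rather than a Procunit feature.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date
from typing import Iterable


@dataclass(frozen=True)
class TaggedExample:
    """One confirmed image added to the training set (hypothetical schema)."""
    image_path: str
    captured_on: date
    condition_change: str  # e.g. "new raw material lot", "fixture moved"
    label: str             # defect class or false positive category


def coverage(examples: Iterable[TaggedExample]) -> Counter:
    """Count examples per (condition_change, label) pair."""
    return Counter((e.condition_change, e.label) for e in examples)
```

Checking `coverage(...)` before a retrain shows at a glance which production conditions already have examples and which do not.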