
Training Defect Detection Models with 50 Samples

September 3, 2024


The central problem in industrial defect detection is not what people think it is. It's not inference speed. It's not edge deployment. It's not model size. The central problem is that defects are rare, which means the training set for any given failure mode is small — often 50 to 100 labeled examples if you're lucky. This shapes every architectural decision we made when building Procunit, and it's different enough from the benchmark-chasing world of academic computer vision that the standard tutorials are nearly useless as guides.

This post describes the techniques we rely on for low-sample defect detection. Not as a survey of the literature — as a description of what we actually use and why.

Why industrial defect datasets are always small

The defect rate on a healthy production line is typically between 0.1% and 2%. On a line running 120 parts per minute with a 0.5% defect rate, you're seeing about 36 defective parts per hour on average. But defects are not uniformly distributed: die wear produces a burst of edge cracks over three days, then the die is replaced and the line runs clean for six weeks. A coating skip defect might appear in clusters when the spray nozzle partially clogs, get fixed by maintenance, then not recur for months.
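The average above is simple arithmetic, and writing it out makes the contrast with bursty reality easier to see. The sketch below is illustrative only: both function names are hypothetical, and the burst schedule is the die-wear example from the paragraph above (three days of cracks, then six clean weeks).

```python
def defects_per_hour(parts_per_minute: float, defect_rate: float) -> float:
    """Average number of defective parts per hour on a steadily running line."""
    return parts_per_minute * 60 * defect_rate


def burst_duty_cycle(burst_days: float, clean_days: float) -> float:
    """Fraction of calendar time during which a bursty failure mode is present."""
    return burst_days / (burst_days + clean_days)


print(defects_per_hour(120, 0.005))  # roughly 36 per hour
print(burst_duty_cycle(3, 42))       # 3 days of cracks, then 6 clean weeks
```

A failure mode that is present only about one day in fifteen can be collected only during that window, which is why examples of any single defect type accumulate so slowly.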

This means that when a manufacturer first deploys an inline vision system, they often have access only to parts from a scrap bin collected over the past few weeks: maybe 40 to 80 defective examples of the failure mode they're most worried about. Sometimes fewer. The constraint is not "we didn't label enough data." The constraint is that the defect physically doesn't exist in larger quantities yet.

We're not saying you can train a production-grade model on 10 images — you can't, and anyone who tells you otherwise is selling something. What we're saying is that 50 well-selected samples, combined with the right training strategy, can get you to a model that works reliably in production. The path from there to a better model is adding real production data over time, not trying to reach 10,000 labels before deployment.

Pretrained feature extractors: the actual foundation

Modern defect detection at small sample sizes only works because of transfer learning from large-scale pretrained models. The ImageNet-pretrained backbone of a ResNet or EfficientNet family model has already learned texture gradients, edge responses, and spatial frequency representations that are directly useful for detecting surface anomalies on metal parts. You're not training a defect detector from scratch on 50 images — you're fine-tuning a feature extractor that already knows what a scratch-like edge gradient looks like.

The practical consequence is that the choice of backbone matters significantly when sample count is low. In our testing on stamped metal and injection-molded plastic defects, EfficientNet-B3 pretrained on ImageNet generally outperforms larger models in the 50–100 sample regime. Counterintuitively, bigger backbones often hurt at small N: they overfit faster, and whatever advantage their pretraining carries gets drowned out by fine-tuning noise.

What you're doing when you fine-tune on 50 defect samples is teaching the last few layers to recognize the specific spatial signatures of your defect types. The pretrained features carry almost all the load. This changes how you should think about data augmentation — you're not trying to artificially inflate your dataset to the scale the backbone was designed for. You're trying to prevent the fine-tuned layers from memorizing the exact appearance of your 50 training examples.
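To make "the pretrained features carry the load" concrete, here is a minimal sketch in plain Python that trains only a small classification head on frozen features. It assumes the feature vectors were already exported from a frozen pretrained backbone; synthetic Gaussian vectors stand in for those embeddings here. The names `train_head` and `predict` and all hyperparameters are hypothetical, and this is not Procunit's training code.

```python
import math
import random


def sigmoid(z: float) -> float:
    # Clamp to keep math.exp from overflowing on extreme logits.
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_head(features, labels, epochs=200, lr=0.1, weight_decay=0.01):
    """Fit a logistic-regression head on frozen, precomputed feature vectors.

    Only `weights` and `bias` are learned; whatever produced `features`
    is never updated. `weight_decay` is an L2 penalty that discourages
    the head from memorizing individual training examples.
    """
    dim = len(features[0])
    n = len(features)
    weights = [0.0] * dim
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j in range(dim):
                grad_w[j] += err * x[j]
            grad_b += err
        for j in range(dim):
            weights[j] -= lr * (grad_w[j] / n + weight_decay * weights[j])
        bias -= lr * grad_b / n
    return weights, bias


def predict(weights, bias, x):
    """Probability that feature vector x belongs to the defect class."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


# Synthetic stand-in for backbone embeddings: 25 clean and 25 defect vectors.
rng = random.Random(0)
clean = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(25)]
defect = [[rng.gauss(1.5, 1.0) for _ in range(8)] for _ in range(25)]
weights, bias = train_head(clean + defect, [0] * 25 + [1] * 25)
```

With so few trainable parameters, the regularization term does the job that a much larger dataset would otherwise do: it keeps the head from fitting the quirks of individual training images.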

Augmentation that preserves defect physics

Standard augmentation pipelines — random crop, horizontal flip, color jitter, random rotation — are all useful but need to be applied with awareness of what they do to defect signatures on industrial surfaces.

Rotation augmentation matters more for surface defects than for object classification because defects don't have a canonical orientation. An edge crack at 45 degrees and an edge crack at 135 degrees are the same defect under the same physics. We apply full 360-degree rotation augmentation for surface defect types. For directional defects like scoring (linear scratches from die contact), we use more limited rotation ranges — full rotation would create training examples that don't correspond to physically realistic scoring angles given the part's travel direction on the line.
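One way to encode that distinction is a per-defect-type rotation limit. This is a sketch under assumptions: the table name, the defect-type keys, and especially the 10-degree limit for scoring are placeholders chosen for illustration, not measured values.

```python
import random

# Hypothetical policy table: maximum rotation in degrees, either direction.
ROTATION_LIMIT_DEG = {
    "edge_crack": 180.0,  # no canonical orientation: full circle
    "scoring": 10.0,      # directional: stay near the part's travel axis
}


def sample_rotation(defect_type: str, rng: random.Random) -> float:
    """Draw a rotation angle in degrees for one augmented copy."""
    limit = ROTATION_LIMIT_DEG[defect_type]
    return rng.uniform(-limit, limit)
```

The appropriate limit for a directional defect depends on how much the part's orientation can actually vary on the line, so it is worth measuring rather than guessing.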

Elastic deformation augmentation simulates the appearance variation you get from slight surface curvature differences across parts — useful for defects on curved stampings where the same defect type looks slightly different depending on which portion of the radius it appears on.

Critically, we do not use aggressive color augmentation beyond modest brightness/contrast jitter. Defect detection on metal parts depends significantly on the reflectance properties of the surface defect versus the surrounding material. A deep score mark has a characteristic dark region under ring illumination precisely because of the local geometry change. Aggressive color augmentation — hue shifts, saturation changes — destroys this signal. The color augmentation that's safe for natural image classification is often actively harmful for surface defect detection.
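A modest brightness/contrast jitter of the kind described above might look like the following sketch. It assumes a grayscale image stored as rows of floats in [0, 1]; the function name and the 10% default ranges are assumptions for illustration.

```python
import random


def jitter_brightness_contrast(image, rng, max_brightness=0.1, max_contrast=0.1):
    """Apply one random brightness/contrast change to a grayscale image.

    The same gain and offset are applied to every pixel, so a defect that
    is darker than its surroundings stays darker. No hue or saturation
    change is involved.
    """
    gain = 1.0 + rng.uniform(-max_contrast, max_contrast)
    offset = rng.uniform(-max_brightness, max_brightness)
    mean = sum(map(sum, image)) / sum(len(row) for row in image)
    return [
        [min(1.0, max(0.0, (p - mean) * gain + mean + offset)) for p in row]
        for row in image
    ]
```

Because the transform is a single positive gain plus an offset, the relative ordering of pixel intensities survives, which is the property that hue and saturation shifts do not guarantee.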

Anomaly detection as a complement, not a replacement

When sample count drops below about 30 defect examples, purely supervised classification becomes unreliable. Below that threshold, we layer in anomaly detection — specifically, feature distribution methods that model what "good" parts look like (which you can always collect plenty of) and flag deviations.

The PatchCore approach — building a memory bank of patch-level feature embeddings from clean parts and scoring test images against nearest-neighbor distances in that bank — generalizes well to a range of defect types without requiring any labeled defect examples at all. It tends to produce a higher false positive rate than a well-trained supervised model, but it provides useful coverage during the early deployment phase before enough real defect examples have been collected.
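The core scoring idea can be sketched in a few lines. This is a simplified illustration, not the published method in full: it assumes patch embeddings have already been extracted by a pretrained backbone, it omits PatchCore's coreset subsampling of the memory bank, and it uses a plain maximum as the image-level score. Function names are hypothetical.

```python
import math


def build_memory_bank(clean_images_patches):
    """Collect every patch embedding from clean parts into one flat list.

    Each element of `clean_images_patches` is the list of patch embeddings
    for one clean image.
    """
    return [patch for image in clean_images_patches for patch in image]


def anomaly_score(test_patches, memory_bank):
    """Image-level score: the largest nearest-neighbor distance of any patch."""
    return max(
        min(math.dist(patch, ref) for ref in memory_bank)
        for patch in test_patches
    )
```

A part whose patches all resemble something in the bank scores near zero; a single patch far from every clean reference is enough to raise the score, which is also why false positives are more common than with a supervised model.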

In practice, we use a hybrid: a supervised detection head for defect types where we have 30+ samples, and an anomaly scoring layer running alongside it for failure modes where the sample count is still building up. The anomaly score doesn't trigger hard rejection on its own; it flags parts for human review until the supervised model has enough data to take over that decision.
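The decision logic of such a hybrid can be sketched as a small routing function. This is a simplification for a single defect type; the function name, argument names, and both threshold defaults are assumptions, and only the 30-sample cutoff comes from the text.

```python
from typing import Optional

MIN_SUPERVISED_SAMPLES = 30  # cutoff taken from the text


def route_part(
    supervised_prob: Optional[float],
    n_training_samples: int,
    anomaly_score: float,
    reject_threshold: float = 0.5,
    review_threshold: float = 1.0,
) -> str:
    """Return 'reject', 'review', or 'pass' for one inspected part."""
    supervised_ready = (
        supervised_prob is not None
        and n_training_samples >= MIN_SUPERVISED_SAMPLES
    )
    if supervised_ready and supervised_prob >= reject_threshold:
        return "reject"
    if anomaly_score >= review_threshold:
        return "review"  # the anomaly score alone never hard-rejects
    return "pass"
```

The asymmetry is deliberate: only the supervised head can remove a part from the line automatically, while the anomaly layer can only ask for a human look.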

Validation strategy at low N

Standard k-fold cross-validation becomes statistically unreliable at 50 samples: with five folds, a single fold contains 10 examples, and performance estimates vary too much across folds to mean anything useful. We use leave-one-out cross-validation for initial performance estimation, accepting that it's computationally expensive but necessary to get stable AUC estimates. Since a single held-out sample has no AUC of its own, the held-out scores have to be pooled across all rounds before the AUC is computed.
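A leave-one-out loop with a pooled AUC can be sketched as follows. The toy centroid model and all function names are hypothetical stand-ins; in a real run, `fit` would fine-tune the detection head and `score` would return its defect probability.

```python
def auc(scores, labels):
    """Probability that a random defect outscores a random clean part.

    Ties count as half. This equals the area under the ROC curve.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def leave_one_out_scores(samples, labels, fit, score):
    """Score each sample with a model trained on all the others."""
    held_out = []
    for i in range(len(samples)):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        held_out.append(score(model, samples[i]))
    return held_out


def fit_centroids(xs, ys):
    """Toy model: the mean feature value of each class."""
    clean = [x for x, y in zip(xs, ys) if y == 0]
    defect = [x for x, y in zip(xs, ys) if y == 1]
    return sum(clean) / len(clean), sum(defect) / len(defect)


def score_centroids(model, x):
    """Higher when x is closer to the defect centroid than the clean one."""
    clean_mean, defect_mean = model
    return abs(x - clean_mean) - abs(x - defect_mean)
```

Usage is `auc(leave_one_out_scores(xs, ys, fit_centroids, score_centroids), ys)`. Each pooled score comes from a slightly different model, so the result is best read as a rough estimate rather than a precise figure.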

The more important validation step is operational: seeded samples. Before going live with a new model, we physically tag a set of known-defective parts and run them through the line. For a 50-sample training run, we typically hold out 15 samples as a seeded test set — never used in training. Seeded validation gives you a real-production estimate of detection rate that k-fold cannot, because it tests the model under actual imaging conditions (lighting variation, vibration, part-to-part positioning variance) rather than on images drawn from the same collection session as training.
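With only 15 seeded parts, the measured detection rate carries a lot of uncertainty, and it helps to report an interval alongside the point estimate. The sketch below uses the standard Wilson score interval; the function name is an assumption, and the 14-of-15 figure in the note after it is a hypothetical outcome, not a result from a real line.

```python
import math


def wilson_interval(detected: int, total: int, z: float = 1.96):
    """Wilson score interval for a detection rate (z=1.96 is about 95%)."""
    p = detected / total
    denom = 1.0 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half
```

For example, `wilson_interval(14, 15)` gives a lower bound near 70%, a reminder that a strong seeded run on 15 parts is encouraging but not yet proof of a high detection rate.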

The iteration loop matters more than the initial model

Here's the thing about 50-sample deployment: the initial model is a starting point, not a finished product. The value of getting something running early is that you start accumulating real production false positives and false negatives immediately. Every confirmed miss — a defective part that the model scored as clean — is a high-value training example. The model's hardest cases (the low-confidence borderline calls) can be automatically flagged for human label review without requiring a human to look at every frame.
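Flagging the borderline calls can be as simple as ranking frames by how close their score sits to the decision threshold. This is a minimal sketch: the function name, the margin, and the review budget are all assumptions chosen for illustration.

```python
def select_for_review(frame_scores, threshold=0.5, margin=0.15, budget=20):
    """Pick the frames whose scores sit closest to the decision threshold.

    `frame_scores` maps a frame id to the model's defect probability.
    Only frames within `margin` of `threshold` are candidates, and at most
    `budget` of them are returned, most uncertain first.
    """
    borderline = [
        (abs(score - threshold), frame_id)
        for frame_id, score in frame_scores.items()
        if abs(score - threshold) <= margin
    ]
    borderline.sort()
    return [frame_id for _, frame_id in borderline[:budget]]
```

The budget keeps the labeling workload bounded per shift, so the review queue stays small even on a line producing thousands of frames an hour.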

A model trained on 50 samples and actively improved through this loop reaches 200-sample performance faster than waiting to collect 200 samples before deploying anything. We built Procunit's training pipeline around this loop specifically because the customers who need inline inspection most urgently are the ones who can't wait three months to collect a large labeled dataset. Deploy early, improve with real production data, build confidence metrics that tell you when the model is ready to take over fully automated rejection.

The 50-sample number is not a fixed minimum — it's the threshold below which the iteration loop starts to matter more than the initial training quality. Get something running, make sure the feedback mechanism is in place, and let the production data do the work.