
Quality Scoring Systems with Deep Learning: Using Embeddings for Assessment

This guide explains how modern AI systems automatically assess data quality—without training large models from scratch.

The approach covered here works for images, text, audio, or any domain with a decent pre-trained encoder.

1. The Problem

Imagine a system that receives thousands of inputs every day: images uploaded by users, MRI scans, audio recordings, or text documents.

Some of these inputs are clear, correctly formatted, and worth processing. Others are blurry, corrupted, wrongly oriented, or simply unusable.

Why manual review doesn’t work:

  • It doesn’t scale to thousands of inputs
  • Rule-based checks fail because quality is subjective
  • Different types of errors are hard to define explicitly

The solution is a system that automatically assigns a quality score.

2. What Is a Quality Score?

A quality score is a number between 0 and 1:

Score      Meaning
0.9 – 1.0  Excellent quality
0.6 – 0.8  Acceptable quality
0.3 – 0.5  Poor quality
0.0 – 0.2  Reject

This is more useful than a binary GOOD/BAD label because quality exists on a spectrum. A score of 0.65 might be acceptable for one application but rejected in another.

3. The Core Idea

Instead of training a large AI model from scratch, we:

  1. Use a pre-trained model (already trained by experts on millions of examples)
  2. Extract useful numerical features from the input
  3. Train a small, simple model to map those features to a quality score
Input → Pre-trained Encoder → Embeddings → Small Head → Quality Score
             (frozen)          (768-dim)    (trainable)    (0.0 - 1.0)

This approach is fast, cheap, reliable, and easy to maintain.

4. Key Concepts

What Is a Pre-trained Model?

A pre-trained model is an AI model that has already learned patterns from millions of examples. For instance, image models trained on millions of photographs have learned to understand shapes, structure, patterns, and anomalies.

Instead of relearning all of this from scratch, we reuse this existing knowledge.

What Is an Encoder?

An encoder is a model that converts raw input into numbers:

Image → Encoder → [768 numbers]

What Are Embeddings?

Those 768 numbers are called embeddings. Think of embeddings as a compressed summary of the input that captures structure, geometry, texture, and consistency.

  • Good-quality inputs produce stable, consistent embeddings
  • Bad-quality inputs produce irregular embeddings
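
For example, here is how a single image becomes an embedding vector. A minimal sketch, assuming encoder is a DINO-style model loaded as shown in Section 6 and sample.jpg is a hypothetical input file; the normalization values are the standard ImageNet statistics discussed in Section 12:

import torch
from PIL import Image
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open('sample.jpg').convert('RGB')
batch = preprocess(image).unsqueeze(0)  # shape: (1, 3, 224, 224)

with torch.no_grad():
    embeddings = encoder(batch)

print(embeddings.shape)  # (1, 768) for DINO/DINOv2 ViT-B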

5. The Architecture

Here’s the minimal setup for the quality scoring head:

import torch
import torch.nn as nn

class QualityScorer(nn.Module):
    def __init__(self, embedding_dim=768):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(embedding_dim, 256),
            nn.ReLU(),
            nn.Dropout(0.1),
            nn.Linear(256, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
            nn.Sigmoid()  # Output between 0-1
        )
    
    def forward(self, embeddings):
        return self.head(embeddings)

Three linear layers. The encoder does the real work.

Why Sigmoid is used: Sigmoid is a mathematical function that converts any number into the range 0.0 to 1.0. Without it, the model could output meaningless values like 5.7 or -2.3.
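
A quick way to see this mapping in action:

import torch

raw_outputs = torch.tensor([5.7, -2.3, 0.0])
print(torch.sigmoid(raw_outputs))  # tensor([0.9967, 0.0911, 0.5000])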

6. Choosing Your Encoder

For image quality assessment:

Encoder            Embedding Dim  Speed   Quality
DINO (ViT-B/16)    768            Medium  Excellent
DINOv2 (ViT-B/14)  768            Medium  Excellent
CLIP (ViT-B/32)    512            Fast    Good
CLIP (ViT-L/14)    768            Medium  Good
ResNet-50          2048           Fast    Decent

Recommendation: Start with DINO or DINOv2. They’re trained with self-supervision and capture structural features well—exactly what quality assessment requires.

import torch

# Option 1: DINO v1
encoder = torch.hub.load('facebookresearch/dino:main', 'dino_vitb16')

# Option 2: DINOv2 (generally better)
encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')

encoder.eval()

# Freeze - no need to train this
for param in encoder.parameters():
    param.requires_grad = False

Why We Freeze the Encoder

Freezing means the encoder’s parameters are not changed during training.

  • The encoder is already very powerful
  • Training it again requires massive compute resources
  • We only need a small adjustment layer on top

This makes training faster, more stable, and requires less data.
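
A quick check that confirms this setup by counting trainable parameters (the head's count comes from the QualityScorer defined in Section 5):

def count_trainable(model):
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

print(f"Encoder trainable params: {count_trainable(encoder):,}")       # 0 after freezing
print(f"Head trainable params: {count_trainable(QualityScorer()):,}")  # 213,377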

Why .eval() Mode Matters

Models behave differently during training versus prediction. Batch normalization and dropout layers produce different outputs in each mode.

If you forget to set evaluation mode, scores may vary randomly on identical inputs—sometimes by 20% or more. Always set both the encoder and quality model to evaluation mode before scoring.
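
A minimal demonstration of the difference, using the QualityScorer from Section 5 (its dropout layer is what makes train-mode outputs vary):

scorer = QualityScorer()
x = torch.randn(1, 768)

scorer.train()
print(scorer(x).item(), scorer(x).item())  # two different scores

scorer.eval()
print(scorer(x).item(), scorer(x).item())  # identical scores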

7. The Inference Pattern

When scoring new data, no training happens—the system only predicts. This makes inference fast, safe, and deterministic.

def score_quality(input_tensor, encoder, quality_head, device='cpu'):
    """
    Returns quality score between 0 and 1.
    Higher = better quality.
    """
    encoder.eval()
    quality_head.eval()
    
    with torch.no_grad():
        input_tensor = input_tensor.to(device)
        
        # Get embeddings from encoder
        embeddings = encoder(input_tensor)
        
        # Handle different encoder output formats
        # DINO/DINOv2: returns tensor directly
        # HuggingFace models: returns dict with 'last_hidden_state'
        if isinstance(embeddings, dict):
            embeddings = embeddings['last_hidden_state']
        
        # If sequence output (batch, seq_len, dim), pool to (batch, dim)
        if len(embeddings.shape) == 3:
            embeddings = embeddings.mean(dim=1)
        
        score = quality_head(embeddings)
    
    return score.cpu().numpy()

Encoder output formats vary between libraries. Verify the output structure before building your pipeline:

# Debug: inspect encoder output
sample_output = encoder(dummy_input)
print(f"Type: {type(sample_output)}")
if hasattr(sample_output, 'shape'):
    print(f"Shape: {sample_output.shape}")
else:
    print(f"Keys: {sample_output.keys()}")

8. Aggregating Multiple Scores

Sometimes one input produces many scores—for example, frames from a video, slices from an MRI, or chunks from a document.

Simple averaging often isn’t the best approach because outliers can skew results.

def aggregate_scores(scores, method='median'):
    """
    Aggregate quality scores from multiple samples.
    
    Args:
        scores: tensor of shape (N,) containing individual scores
        method: 'mean', 'median', 'min', or 'trimmed_mean'
    """
    if method == 'mean':
        return scores.mean()
    
    elif method == 'median':
        return scores.median()  # Robust to outliers
    
    elif method == 'min':
        return scores.min()  # Conservative - worst sample determines outcome
    
    elif method == 'trimmed_mean':
        # Drop top/bottom 25%, average the rest
        k = len(scores) // 4
        sorted_scores = torch.sort(scores)[0]
        return sorted_scores[k:-k].mean() if k > 0 else scores.mean()

When to use each method:

Method        Use Case
median        Default choice. Handles outliers well.
min           When one bad sample should fail the entire batch.
trimmed_mean  When data has noise but sensitivity to trends matters.
mean          When data is clean and every sample is equally trustworthy.

Example of why median works better:

Scores: [0.95, 0.92, 0.90, 0.10]

Mean = 0.72   (misleading — one bad frame drags it down)
Median = 0.91 (accurate — reflects the majority)
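
You can verify this quickly. One subtlety worth knowing: for an even number of values, torch.median returns the lower of the two middle elements rather than their average, so it reports 0.90 here, which is close enough for aggregation purposes:

import numpy as np
import torch

scores = [0.95, 0.92, 0.90, 0.10]
print(f"Mean:   {np.mean(scores):.2f}")    # 0.72
print(f"Median: {np.median(scores):.2f}")  # 0.91
print(torch.tensor(scores).median())       # tensor(0.9000)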

9. Mapping Scores to Labels

Machines output numbers. Humans need actionable decisions.

def score_to_label(score):
    """
    Convert numeric score to actionable label.
    
    Thresholds should be calibrated to your data distribution.
    """
    if score >= 0.7:
        return "high", "Included in processing"
    elif score >= 0.4:
        return "medium", "Included with review flag"
    elif score >= 0.2:
        return "low", "Manual review required"
    else:
        return "rejected", "Excluded from processing"

These thresholds are business choices, not AI rules. Different applications set different thresholds based on their tolerance for risk.

Important: Don’t set thresholds without examining your score distribution first.

import numpy as np

# Score a validation set
# .item() collapses each (1, 1) result array to a plain float
all_scores = np.array([score_quality(x, encoder, head).item() for x in validation_set])

# Check distribution
print(f"Min: {all_scores.min():.3f}")
print(f"Max: {all_scores.max():.3f}")
print(f"Mean: {all_scores.mean():.3f}")
print(f"Std: {all_scores.std():.3f}")

# Percentiles guide threshold selection
for p in [10, 25, 50, 75, 90]:
    print(f"P{p}: {np.percentile(all_scores, p):.3f}")

If the “high quality” threshold is 0.7 but 90% of the data scores above 0.8, the threshold needs adjustment.
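
One way to calibrate, sketched here as a starting point rather than a rule, is to anchor thresholds to percentiles of the validation distribution:

# Hypothetical calibration: top quartile counts as "high",
# bottom decile as "rejected". Adjust to your risk tolerance.
high_threshold = np.percentile(all_scores, 75)
reject_threshold = np.percentile(all_scores, 10)
print(f"high >= {high_threshold:.3f}, rejected < {reject_threshold:.3f}")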

10. Training the Quality Head

Training means showing the model examples with known quality labels. The model gradually learns which embeddings correspond to good quality and which correspond to poor quality.

Only the small model learns—the encoder stays fixed.

import torch.optim as optim

def train_quality_head(encoder, quality_head, train_loader, epochs=10, lr=1e-3):
    """
    Train only the quality head. Encoder stays frozen.
    """
    optimizer = optim.Adam(quality_head.parameters(), lr=lr)
    criterion = nn.BCELoss()  # Binary cross-entropy for 0-1 targets
    
    encoder.eval()  # Frozen
    quality_head.train()
    
    for epoch in range(epochs):
        total_loss = 0
        for inputs, labels in train_loader:
            optimizer.zero_grad()
            
            # Get embeddings (no gradient needed)
            with torch.no_grad():
                embeddings = encoder(inputs)
            
            # Train the head
            predictions = quality_head(embeddings)
            # squeeze(-1) drops only the output dim, so a batch of 1 still works
            loss = criterion(predictions.squeeze(-1), labels.float())
            
            loss.backward()
            optimizer.step()
            
            total_loss += loss.item()
        
        avg_loss = total_loss / len(train_loader)
        print(f"Epoch {epoch+1}/{epochs}, Loss: {avg_loss:.4f}")

Why Binary Cross-Entropy Loss: This loss function measures how wrong the predicted score is compared to the true score. It works well when the output is between 0 and 1.
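
For intuition, here is the loss computed by hand next to nn.BCELoss. A confident correct prediction costs little; a confident wrong one costs a lot:

import torch
import torch.nn as nn

pred = torch.tensor([0.9, 0.2])    # predicted scores
target = torch.tensor([1.0, 0.0])  # true labels

manual = -(target * torch.log(pred) + (1 - target) * torch.log(1 - pred)).mean()
print(manual)                      # tensor(0.1643)
print(nn.BCELoss()(pred, target))  # tensor(0.1643)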

Labeled data is required, but even a few hundred examples work well when the encoder is strong.
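
A minimal sketch of what that labeled data might look like in PyTorch. The names here (tensors, labels) are illustrative placeholders, not a prescribed format:

from torch.utils.data import Dataset, DataLoader

class QualityDataset(Dataset):
    """Pairs each preprocessed input tensor with a quality label in [0, 1]."""
    def __init__(self, tensors, labels):
        self.tensors = tensors  # list of preprocessed input tensors
        self.labels = labels    # list of floats: 1.0 = good, 0.0 = bad

    def __len__(self):
        return len(self.tensors)

    def __getitem__(self, idx):
        return self.tensors[idx], self.labels[idx]

train_loader = DataLoader(QualityDataset(tensors, labels), batch_size=32, shuffle=True)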


11. Production Patterns

Loading Models Efficiently

Initialize once, reuse everywhere.

class QualityService:
    """Singleton service for quality scoring."""
    
    _instance = None
    
    def __new__(cls, model_dir=None, device='cpu'):
        if cls._instance is None:
            cls._instance = super().__new__(cls)
            cls._instance._initialize(model_dir, device)
        return cls._instance
    
    def _initialize(self, model_dir, device):
        self.device = device
        self.encoder = self._load_encoder(model_dir)
        self.head = self._load_head(model_dir)
    
    def _load_encoder(self, model_dir):
        encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
        encoder.eval()
        for param in encoder.parameters():
            param.requires_grad = False
        return encoder.to(self.device)
    
    def _load_head(self, model_dir):
        head = QualityScorer()
        head.load_state_dict(
            torch.load(f"{model_dir}/quality_head.pth", map_location=self.device)
        )
        head.eval()
        return head.to(self.device)
    
    def score(self, input_tensor):
        return score_quality(input_tensor, self.encoder, self.head, self.device)

Batching for Throughput

Score multiple inputs at once:

def score_batch(inputs, encoder, quality_head, batch_size=32, device='cpu'):
    """
    Score a large number of inputs efficiently.
    """
    all_scores = []
    
    for i in range(0, len(inputs), batch_size):
        batch = torch.stack(inputs[i:i+batch_size]).to(device)
        
        with torch.no_grad():
            embeddings = encoder(batch)
            scores = quality_head(embeddings)
        
        all_scores.extend(scores.cpu().numpy().flatten())
    
    return np.array(all_scores)

Caching Embeddings

When scoring the same inputs repeatedly (e.g., A/B testing different thresholds), cache the embeddings:

import hashlib

class EmbeddingCache:
    def __init__(self, encoder, device='cpu', max_size=1000):
        self.encoder = encoder
        self.device = device
        self.cache = {}
        self.max_size = max_size
    
    def get_embedding(self, input_tensor):
        # .cpu().detach() ensures this works for CUDA tensors and tensors with gradients
        key = hashlib.md5(input_tensor.cpu().detach().numpy().tobytes()).hexdigest()
        
        if key not in self.cache:
            # Evict oldest if at capacity
            if len(self.cache) >= self.max_size:
                oldest_key = next(iter(self.cache))
                del self.cache[oldest_key]
            
            with torch.no_grad():
                self.cache[key] = self.encoder(input_tensor.to(self.device)).cpu()
        
        return self.cache[key]

12. Debugging Quality Scores

When scores don’t behave as expected, work through this checklist:

Check Input Normalization

# Encoders expect specific normalization
print(f"Input range: [{input_tensor.min():.2f}, {input_tensor.max():.2f}]")
print(f"Input mean: {input_tensor.mean():.2f}")
print(f"Input std: {input_tensor.std():.2f}")

# ImageNet-pretrained models expect:
# mean = [0.485, 0.456, 0.406]
# std = [0.229, 0.224, 0.225]

Check Embedding Sanity

embeddings = encoder(input_tensor)
print(f"Embedding shape: {embeddings.shape}")
print(f"Embedding range: [{embeddings.min():.2f}, {embeddings.max():.2f}]")
print(f"Any NaN: {torch.isnan(embeddings).any()}")
print(f"Any Inf: {torch.isinf(embeddings).any()}")

Check Model Modes

print(f"Encoder training mode: {encoder.training}")  # Should be False
print(f"Head training mode: {quality_head.training}")  # Should be False

Validate Against Known Samples

# Keep reference samples with known quality
good_sample = load_known_good_sample()
bad_sample = load_known_bad_sample()

# .item() converts each returned (1, 1) array to a float for formatting
good_score = score_quality(good_sample, encoder, head).item()
bad_score = score_quality(bad_sample, encoder, head).item()

print(f"Good sample score: {good_score:.3f}")  # Should be high
print(f"Bad sample score: {bad_score:.3f}")    # Should be low

assert good_score > bad_score, "Model sanity check failed"

13. Complete Example

Putting it all together:

import torch
import torch.nn as nn

# 1. Load encoder (once, at startup)
encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vitb14')
encoder.eval()
for param in encoder.parameters():
    param.requires_grad = False

# 2. Load trained quality head
quality_head = QualityScorer(embedding_dim=768)
quality_head.load_state_dict(torch.load('quality_head.pth'))
quality_head.eval()

# 3. Score new inputs
def assess_quality(input_tensor):
    with torch.no_grad():
        emb = encoder(input_tensor)
        score = quality_head(emb).item()
    
    label, action = score_to_label(score)
    return {
        'score': round(score, 3),
        'label': label,
        'action': action
    }

# Usage
result = assess_quality(preprocessed_input)
print(f"Quality: {result['score']} ({result['label']}) - {result['action']}")
# Output: Quality: 0.847 (high) - Included in processing

14. Key Takeaways

Principle                         Reason
Use pre-trained encoders          They already understand quality-relevant features
Keep the head small               Three linear layers are sufficient; larger heads overfit
Freeze the encoder                Fine-tuning requires massive data for minimal benefit
Use median for aggregation        Robust to outliers in most use cases
Calibrate thresholds empirically  Examine score distributions before setting cutoffs
Cache embeddings when possible    The encoder is the computational bottleneck
Verify .eval() mode               Forgetting this causes inconsistent scores in production

Summary

Training large AI models from scratch to assess quality is unnecessary. Pre-trained encoders already understand structure—quality problems are fundamentally structural problems.

A small scoring head learns what “good” means for your specific domain, using surprisingly little labeled data. This approach is fast, cheap, reliable, and widely used in production systems across industries.