Machine Learning

Optimizing Neural Networks for Production

February 12, 2025
9 min read

Training a neural network is one thing. Deploying it to production where it needs to serve millions of predictions efficiently is another challenge entirely. Let me share what I've learned optimizing dozens of models for production.

The Production Reality Check

Your 500MB PyTorch model that takes 2 seconds per inference isn't going to cut it in production. Real-world constraints:

  • Latency: Users expect sub-100ms responses
  • Cost: GPU instances are expensive
  • Scale: Handling thousands of concurrent requests
  • Memory: Limited RAM on edge devices
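
Before optimizing anything, measure where you actually stand against these constraints. A minimal latency harness might look like this (a sketch; the `measure_latency` helper and its parameters are illustrative, not from any library):

```python
import time

def measure_latency(predict_fn, sample, n=100):
    """Return (p50, p95) latency in milliseconds for a prediction callable."""
    times = []
    for _ in range(n):
        start = time.perf_counter()
        predict_fn(sample)
        times.append((time.perf_counter() - start) * 1000)
    times.sort()
    return times[n // 2], times[int(n * 0.95)]
```

Run it against your model's predict function with a representative input; if p95 is already over your budget, you know how much ground the techniques below need to cover.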

Optimization Techniques

1. Quantization

Convert floating-point weights to lower precision:

import torch

# Post-training quantization
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

# Can reduce model size by 4x with minimal accuracy loss

Typical results: ~75% smaller model, 2-4x faster inference, under 1% accuracy drop
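
You can sanity-check the size reduction yourself. A quick sketch using a toy two-layer model (the `model_size_mb` helper is illustrative):

```python
import io
import torch
import torch.nn as nn

def model_size_mb(model):
    # Serialize the state dict in memory to measure on-disk size
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes / 1e6

fp32 = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 10))
int8 = torch.quantization.quantize_dynamic(fp32, {nn.Linear}, dtype=torch.qint8)
# int8 weights take ~1 byte each vs 4 for fp32, so roughly 4x smaller
```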

2. Pruning

Remove unnecessary weights:

import torch.nn.utils.prune as prune

# Prune 30% of weights by L1 magnitude (applies a mask)
prune.l1_unstructured(model.layer, name='weight', amount=0.3)

# Make the pruning permanent by removing the re-parametrization
prune.remove(model.layer, 'weight')
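
A quick way to verify the effect on a toy layer (illustrative sizes):

```python
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(100, 100)
prune.l1_unstructured(layer, name='weight', amount=0.3)

# Fraction of weight entries zeroed by the pruning mask
sparsity = (layer.weight == 0).float().mean().item()
```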

3. Knowledge Distillation

Train a smaller model to mimic a larger one:

import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=3.0):
    # Soft targets from the teacher, hard labels from the data
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    soft_prob = F.log_softmax(student_logits / temperature, dim=1)
    distill = F.kl_div(soft_prob, soft_targets, reduction='batchmean')
    student_loss = F.cross_entropy(student_logits, labels)
    return 0.7 * distill + 0.3 * student_loss
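
This loss plugs into an ordinary training step. A minimal sketch with toy linear models standing in for a real teacher/student pair (the loss terms are inlined so the snippet is self-contained; hyperparameters are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 3.0
teacher = nn.Linear(20, 5).eval()   # frozen teacher (toy stand-in)
student = nn.Linear(20, 5)
opt = torch.optim.SGD(student.parameters(), lr=0.1)

x = torch.randn(8, 20)
labels = torch.randint(0, 5, (8,))

with torch.no_grad():               # teacher only provides soft targets
    teacher_logits = teacher(x)
student_logits = student(x)

soft_targets = F.softmax(teacher_logits / T, dim=1)
soft_prob = F.log_softmax(student_logits / T, dim=1)
loss = (0.7 * F.kl_div(soft_prob, soft_targets, reduction='batchmean')
        + 0.3 * F.cross_entropy(student_logits, labels))

opt.zero_grad()
loss.backward()
opt.step()
```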

Architecture Optimization

Choose the Right Architecture

Not all architectures are production-friendly:

  • Good: MobileNet, EfficientNet, DistilBERT
  • Challenging: Large Vision Transformers, GPT-style models

Optimize Layers

import torch.nn as nn

# Replace expensive operations with cheaper equivalents
class OptimizedBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Depthwise separable convolution: depthwise 3x3 + pointwise 1x1
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
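
The payoff is easy to quantify: for a 3x3 convolution from 64 to 128 channels, the separable version needs a small fraction of the weights. A quick check (channel sizes are illustrative):

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

in_ch, out_ch = 64, 128
regular = nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch, bias=False),
    nn.Conv2d(in_ch, out_ch, 1, bias=False),
)
# regular: 64*128*3*3 = 73,728 weights; separable: 64*3*3 + 64*128 = 8,768
```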

Deployment Strategies

1. ONNX Runtime

Convert models to ONNX for faster inference:

import torch.onnx

# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=11,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}}
)

# Load with ONNX Runtime and run inference
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
outputs = session.run(None, {'input': dummy_input.numpy()})

Benefit: 2-3x faster inference on CPU

2. TensorRT for GPU

For NVIDIA GPUs:

import tensorrt as trt

# Convert ONNX to TensorRT (TensorRT 8.x API)
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, logger)
parser.parse_from_file("model.onnx")

# Build an optimized, serialized engine
config = builder.create_builder_config()
engine = builder.build_serialized_network(network, config)

Benefit: 5-10x faster on GPU

3. Edge Deployment

For mobile/IoT:

import tensorflow as tf

# Convert to TFLite with default optimizations (includes quantization)
converter = tf.lite.TFLiteConverter.from_saved_model(model_path)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# Run on-device with the TFLite interpreter
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()

Batching Strategies

Dynamic Batching

Process multiple requests together:

import asyncio
import torch

class BatchedPredictor:
    def __init__(self, model, max_batch_size=32, wait_time=0.01):
        self.model = model
        self.max_batch_size = max_batch_size
        self.wait_time = wait_time
        self.queue = []
        self._flush_handle = None

    async def predict(self, input_data):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((input_data, future))

        if len(self.queue) >= self.max_batch_size:
            await self._process_batch()
        elif self._flush_handle is None:
            # Flush a partial batch after wait_time so requests never stall
            loop = asyncio.get_running_loop()
            self._flush_handle = loop.call_later(
                self.wait_time,
                lambda: asyncio.ensure_future(self._process_batch()))

        return await future

    async def _process_batch(self):
        if self._flush_handle is not None:
            self._flush_handle.cancel()
            self._flush_handle = None
        if not self.queue:
            return
        batch = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]

        inputs = torch.stack([item[0] for item in batch])
        with torch.no_grad():
            outputs = self.model(inputs)

        for i, (_, future) in enumerate(batch):
            future.set_result(outputs[i])

Caching Strategies

Smart caching can dramatically reduce compute:

import hashlib

_cache = {}  # input hash -> prediction; bound its size in production

def hash_input(input_tensor):
    # Hash the raw bytes so identical inputs share one cache entry
    return hashlib.md5(input_tensor.numpy().tobytes()).hexdigest()

def predict_with_cache(model, input_tensor):
    key = hash_input(input_tensor)
    if key not in _cache:
        _cache[key] = model(input_tensor)
    return _cache[key]

Monitoring & Profiling

Profile Your Model

import torch

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True
) as prof:
    model(input_tensor)

print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10
))

Production Metrics

Track these in production:

import numpy as np

metrics = {
    'inference_time_p50': np.percentile(times, 50),
    'inference_time_p95': np.percentile(times, 95),
    'inference_time_p99': np.percentile(times, 99),
    'batch_size_avg': np.mean(batch_sizes),
    'gpu_utilization': get_gpu_utilization(),   # e.g. via pynvml
    'memory_usage': get_memory_usage(),
}

Real-World Results

Here's what optimization achieved for a computer vision model:

Before:

  • Model size: 450MB
  • Inference time: 1.2s
  • Cost: $500/month for GPU instances

After:

  • Model size: 85MB (19% of original)
  • Inference time: 120ms (10x faster)
  • Cost: $50/month (CPU instances sufficient)
  • Accuracy: 96.5% (vs 97.2% original, 0.7% drop)

Lessons Learned

  1. Profile First: Don't optimize blindly
  2. Measure Trade-offs: Always track accuracy vs. speed
  3. Start Conservative: Begin with safe optimizations
  4. Test Thoroughly: Validate on real production data
  5. Monitor Continuously: Catch degradation early

Conclusion

Model optimization isn't a one-time task—it's an ongoing process. Start with the low-hanging fruit (quantization, ONNX), measure results, then move to more advanced techniques if needed.

The goal isn't the smallest or fastest model—it's finding the right balance between performance, accuracy, and cost for your specific use case.

Remember: A 95% accurate model that ships is better than a 99% accurate model that never leaves your laptop.

Tags:
Neural Networks
Optimization
Performance
Production
ML
