Optimizing Neural Networks for Production
Training a neural network is one thing. Deploying it to production, where it must serve millions of predictions efficiently, is another challenge entirely. Here is what I've learned from optimizing dozens of models.
The Production Reality Check
Your 500MB PyTorch model that takes 2 seconds per inference isn't going to cut it in production. Real-world constraints:
- Latency: Users expect sub-100ms responses
- Cost: GPU instances are expensive
- Scale: Handling thousands of concurrent requests
- Memory: Limited RAM on edge devices
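These constraints translate directly into capacity math. A minimal back-of-envelope sketch (the helper and all numbers are illustrative, not from any real deployment):

```python
import math

def instances_needed(qps, latency_ms, workers_per_instance=1):
    """Instances required to serve `qps` if each request takes `latency_ms`."""
    per_instance_qps = workers_per_instance * 1000.0 / latency_ms
    return max(1, math.ceil(qps / per_instance_qps))

# 1000 QPS at 120 ms latency with 4 workers per instance -> 30 instances
print(instances_needed(1000, 120, workers_per_instance=4))
```

Cutting latency from 1.2 s to 120 ms in this model cuts the fleet (and the bill) by 10x, which is why the optimizations below pay for themselves quickly.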
Optimization Techniques
1. Quantization
Convert floating-point weights to lower precision:
import torch
# Post-training quantization
model_int8 = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
# Can reduce model size by 4x with minimal accuracy loss
Results: 75% smaller model, 2-4x faster inference, <1% accuracy drop
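You can sanity-check the size claim yourself by serializing both models' state dicts. A quick sketch with a toy model (sizes will vary with architecture):

```python
import io

import torch
import torch.nn as nn

# Toy model; dynamic quantization rewrites the Linear layers to int8
model = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, 10))
model_int8 = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

def serialized_size(m):
    """Bytes needed to torch.save the model's state dict."""
    buf = io.BytesIO()
    torch.save(m.state_dict(), buf)
    return buf.getbuffer().nbytes

print(serialized_size(model), serialized_size(model_int8))
```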
2. Pruning
Remove unnecessary weights:
import torch.nn.utils.prune as prune
# Prune 30% of weights
prune.l1_unstructured(model.layer, name='weight', amount=0.3)
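One thing the snippet above leaves out: `l1_unstructured` attaches a mask rather than deleting weights, so call `prune.remove` to bake the zeros in. Also note that unstructured sparsity mainly helps compression; dense kernels won't run faster by themselves. A quick check on a toy layer:

```python
import torch.nn as nn
import torch.nn.utils.prune as prune

layer = nn.Linear(64, 64)
prune.l1_unstructured(layer, name='weight', amount=0.3)

# ~30% of the weights are now exactly zero
sparsity = float((layer.weight == 0).sum()) / layer.weight.numel()

# Make pruning permanent: drops the mask and the weight_orig buffer
prune.remove(layer, 'weight')
print(f"sparsity: {sparsity:.0%}")
```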
3. Knowledge Distillation
Train a smaller model to mimic a larger one:
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=3.0):
    soft_targets = F.softmax(teacher_logits / temperature, dim=1)
    soft_prob = F.log_softmax(student_logits / temperature, dim=1)
    # Scale by T^2 so the soft-target gradients keep the same magnitude (Hinton et al.)
    distill = F.kl_div(soft_prob, soft_targets, reduction='batchmean') * temperature ** 2
    student_loss = F.cross_entropy(student_logits, labels)
    return 0.7 * distill + 0.3 * student_loss
Architecture Optimization
Choose the Right Architecture
Not all architectures are production-friendly:
- Good: MobileNet, EfficientNet, DistilBERT
- Challenging: Large Vision Transformers, GPT-style models
Optimize Layers
# Replace expensive operations
import torch.nn as nn

class OptimizedBlock(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Depthwise separable convolution: per-channel 3x3, then a 1x1 channel mix
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))
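The savings are easy to quantify. For a 3x3 convolution from 64 to 128 channels (sizes chosen purely for illustration), the separable version needs roughly 8x fewer parameters:

```python
import torch.nn as nn

std = nn.Conv2d(64, 128, 3, padding=1)
separable = nn.Sequential(
    nn.Conv2d(64, 64, 3, padding=1, groups=64),  # depthwise 3x3
    nn.Conv2d(64, 128, 1),                       # pointwise 1x1
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(std), n_params(separable))  # 73856 vs 8960, about 8x fewer
```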
Deployment Strategies
1. ONNX Runtime
Convert models to ONNX for faster inference:
import torch.onnx
# Export to ONNX
torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",
    opset_version=11,
    input_names=['input'],
    output_names=['output'],
    dynamic_axes={'input': {0: 'batch_size'}}
)
# Load with ONNX Runtime
import onnxruntime as ort
session = ort.InferenceSession("model.onnx")
Benefit: 2-3x faster inference on CPU
2. TensorRT for GPU
For NVIDIA GPUs:
import tensorrt as trt
# Convert ONNX to TensorRT
logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
# Explicit-batch network definition (required for ONNX models)
network = builder.create_network(
    1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
)
parser = trt.OnnxParser(network, logger)
parser.parse_from_file("model.onnx")
# Build optimized engine (build_cuda_engine was removed in TensorRT 8)
config = builder.create_builder_config()
engine = builder.build_serialized_network(network, config)
Benefit: 5-10x faster on GPU
3. Edge Deployment
For mobile/IoT:
import tensorflow as tf
# Convert to TFLite
converter = tf.lite.TFLiteConverter.from_saved_model(model_path)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
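On device, inference goes through `tf.lite.Interpreter`. A self-contained sketch using a toy Keras model in place of a real saved model:

```python
import numpy as np
import tensorflow as tf

# Toy stand-in model, converted from memory for brevity
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(2)])
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()

# The interpreter works from tensor indices, not names
interpreter = tf.lite.Interpreter(model_content=tflite_model)
interpreter.allocate_tensors()
inp = interpreter.get_input_details()[0]
out = interpreter.get_output_details()[0]
interpreter.set_tensor(inp['index'], np.random.randn(1, 4).astype(np.float32))
interpreter.invoke()
result = interpreter.get_tensor(out['index'])
print(result.shape)
```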
Batching Strategies
Dynamic Batching
Process multiple requests together:
import asyncio
import torch

class BatchedPredictor:
    def __init__(self, model, max_batch_size=32, wait_time=0.01):
        self.model = model
        self.max_batch_size = max_batch_size
        self.wait_time = wait_time
        self.queue = []

    async def predict(self, input_data):
        future = asyncio.get_running_loop().create_future()
        self.queue.append((input_data, future))
        if len(self.queue) >= self.max_batch_size:
            await self._process_batch()
        else:
            # Flush a partial batch after wait_time so lone requests don't hang
            asyncio.get_running_loop().call_later(
                self.wait_time, lambda: asyncio.ensure_future(self._process_batch())
            )
        return await future

    async def _process_batch(self):
        if not self.queue:
            return
        batch = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]
        inputs = torch.stack([item[0] for item in batch])
        with torch.no_grad():
            outputs = self.model(inputs)
        for i, (_, future) in enumerate(batch):
            if not future.done():
                future.set_result(outputs[i])
Caching Strategies
Smart caching can dramatically reduce compute:
import hashlib
from collections import OrderedDict

_cache = OrderedDict()
MAX_ENTRIES = 1000

def hash_input(input_tensor):
    return hashlib.md5(input_tensor.numpy().tobytes()).hexdigest()

def predict_with_cache(model, input_tensor):
    # lru_cache can't key on tensors, so cache on a hash of the raw bytes
    key = hash_input(input_tensor)
    if key in _cache:
        _cache.move_to_end(key)  # mark as recently used
        return _cache[key]
    output = model(input_tensor)
    _cache[key] = output
    if len(_cache) > MAX_ENTRIES:
        _cache.popitem(last=False)  # evict the least recently used entry
    return output
Monitoring & Profiling
Profile Your Model
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU,
                torch.profiler.ProfilerActivity.CUDA],
    record_shapes=True
) as prof:
    model(input_tensor)

print(prof.key_averages().table(
    sort_by="cuda_time_total", row_limit=10
))
Production Metrics
Track these in production:
import numpy as np

metrics = {
    'inference_time_p50': np.percentile(times, 50),
    'inference_time_p95': np.percentile(times, 95),
    'inference_time_p99': np.percentile(times, 99),
    'batch_size_avg': np.mean(batch_sizes),
    'gpu_utilization': get_gpu_utilization(),
    'memory_usage': get_memory_usage(),
}
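If you don't already record latencies, `np.percentile` works directly on a list of per-request times. A toy illustration with simulated (log-normal) latencies, which is a common rough shape for service latency:

```python
import numpy as np

rng = np.random.default_rng(0)
times_ms = rng.lognormal(mean=4.0, sigma=0.3, size=10_000)  # fake latencies

p50, p95, p99 = np.percentile(times_ms, [50, 95, 99])
print(round(p50, 1), round(p95, 1), round(p99, 1))
```

The tail (p95/p99) always sits well above the median, which is why tracking only average latency hides exactly the requests your users complain about.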
Real-World Results
Here's what optimization achieved for a computer vision model:
Before:
- Model size: 450MB
- Inference time: 1.2s
- Cost: $500/month for GPU instances
After:
- Model size: 85MB (19% of original)
- Inference time: 120ms (10x faster)
- Cost: $50/month (CPU instances sufficient)
- Accuracy: 96.5% (vs. 97.2% original, a 0.7-point drop)
Lessons Learned
- Profile First: Don't optimize blindly
- Measure Trade-offs: Always track accuracy vs. speed
- Start Conservative: Begin with safe optimizations
- Test Thoroughly: Validate on real production data
- Monitor Continuously: Catch degradation early
Conclusion
Model optimization isn't a one-time task—it's an ongoing process. Start with the low-hanging fruit (quantization, ONNX), measure results, then move to more advanced techniques if needed.
The goal isn't the smallest or fastest model—it's finding the right balance between performance, accuracy, and cost for your specific use case.
Remember: A 95% accurate model that ships is better than a 99% accurate model that never leaves your laptop.