Assignment 3 Report Design
Optimize models for practical serving: latency, throughput, memory, and quality retention.
Primary goal: Improve efficiency without violating the quality floor.
Decision metric: P95 latency, subject to a minimum quality threshold.
Deliverable: Deployment recommendation with a rollback strategy.
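One way to read the decision metric is as a constrained choice: discard any candidate below the quality floor, then take the lowest P95 among the rest. The sketch below assumes a single "higher is better" quality score; the `Candidate` type and `pick_candidate` function are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Candidate:
    name: str
    p95_ms: float   # measured P95 latency in milliseconds
    quality: float  # task metric where higher is better (assumption)


def pick_candidate(candidates: Iterable[Candidate],
                   quality_floor: float) -> Optional[Candidate]:
    """Return the lowest-P95 candidate that meets the quality floor."""
    eligible = [c for c in candidates if c.quality >= quality_floor]
    if not eligible:
        # Nothing satisfies the floor: recommend keeping the baseline.
        return None
    return min(eligible, key=lambda c: c.p95_ms)
```

Returning `None` when no candidate qualifies keeps "do not deploy" an explicit outcome rather than silently picking the fastest model.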
| Area | Baseline | Advanced | Success Condition |
|---|---|---|---|
| Export | Native Checkpoint | ONNX / TorchScript | Portable runtime with no measurable accuracy drift. |
| Optimization | FP32 Inference | Quantization + Pruning | At least a 30% latency reduction relative to the FP32 baseline while staying within the quality floor. |
| Serving | Single Requests | Batched and Profiled Pipeline | Higher throughput and stable P95/P99 latency. |
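The optimization success condition in the table can be checked mechanically. This is a minimal sketch, assuming latency is compared at P95 against the baseline and quality is a "higher is better" score; the function names and the `min_drop` parameter are illustrative, not part of the assignment.

```python
def latency_drop(baseline_ms: float, candidate_ms: float) -> float:
    """Fractional latency reduction relative to the baseline (0.30 means 30%)."""
    if baseline_ms <= 0:
        raise ValueError("baseline latency must be positive")
    return (baseline_ms - candidate_ms) / baseline_ms


def meets_optimization_target(baseline_p95_ms: float,
                              candidate_p95_ms: float,
                              candidate_quality: float,
                              quality_floor: float,
                              min_drop: float = 0.30) -> bool:
    """True when the latency drop is at least `min_drop` and quality holds."""
    drop = latency_drop(baseline_p95_ms, candidate_p95_ms)
    return drop >= min_drop and candidate_quality >= quality_floor
```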
Latency
Report P50, P95, and P99; track tail latency explicitly to reflect production user experience.
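Latency percentiles can be computed from raw per-request timings. The sketch below uses the nearest-rank definition of a percentile, which is one of several common conventions; `time_calls` and `latency_summary` are hypothetical helpers, and `fn` stands in for whatever inference call is being measured.

```python
import math
import time


def percentile(samples, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def time_calls(fn, n_warmup=5, n_runs=100):
    """Call fn repeatedly and return per-call latencies in milliseconds."""
    for _ in range(n_warmup):
        fn()  # warm-up calls are not recorded
    latencies_ms = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return latencies_ms


def latency_summary(latencies_ms):
    """Summarize a list of latencies as P50, P95, and P99."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

With only a few hundred samples, P99 rests on a handful of observations, so more runs give a steadier tail estimate.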
Throughput
Measure with realistic concurrency and batching settings.
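A rough way to measure throughput under concurrency is to push a fixed set of requests through a worker pool and divide by wall-clock time. This sketch assumes `handler` is a blocking call that releases the GIL or waits on a model server; for CPU-bound pure-Python handlers, threads would understate real throughput. `make_batches` and `measure_throughput` are illustrative names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, batch_size):
    """Split items into consecutive batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    items = list(items)
    return [items[i:i + batch_size] for i in range(0, len(items), batch_size)]


def measure_throughput(handler, requests, concurrency):
    """Return completed requests per second at the given concurrency level."""
    if concurrency < 1:
        raise ValueError("concurrency must be >= 1")
    requests = list(requests)
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(handler, requests))
    elapsed = time.perf_counter() - start
    return len(results) / elapsed if elapsed > 0 else float("inf")
```

To measure batched serving, pass the output of `make_batches` as `requests` and multiply the result by the batch size to get items per second.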
Efficiency
Include memory peak and estimated serving cost at target traffic.
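Peak memory and serving cost can both be estimated with a few lines. In the sketch below, `tracemalloc` only sees allocations made through Python's allocator, so native or GPU memory used by a model runtime would need the framework's own tooling. The cost model (replicas needed times an hourly price) and all parameter names are simplifying assumptions.

```python
import math
import tracemalloc


def python_peak_bytes(fn):
    """Run fn once and return the peak Python-heap allocation in bytes."""
    tracemalloc.start()
    try:
        fn()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def estimate_hourly_cost(target_rps, per_replica_rps,
                         price_per_replica_hour, headroom=0.0):
    """Return (replicas, hourly cost) needed to serve target_rps.

    headroom is extra capacity as a fraction, e.g. 0.2 for 20% spare.
    """
    if per_replica_rps <= 0:
        raise ValueError("per_replica_rps must be positive")
    needed = target_rps * (1.0 + headroom) / per_replica_rps
    replicas = max(1, math.ceil(needed))
    return replicas, replicas * price_per_replica_hour
```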
Candidate __ reduced P95 latency by __% and memory by __%.
Quality changed by __ on sensitive class/group: __.
Deploy candidate __ with monitored rollback threshold: __.
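A monitored rollback threshold can be expressed as a simple rule over recent measurements. The sketch below assumes P95 is computed per monitoring window and that rollback triggers only after several consecutive breaches, to avoid reacting to a single noisy window. The function name and the `consecutive` default are illustrative choices, not values from the assignment.

```python
def should_roll_back(p95_windows_ms, threshold_ms, consecutive=3):
    """True if the last `consecutive` windows all exceed threshold_ms."""
    if consecutive < 1:
        raise ValueError("consecutive must be >= 1")
    recent = list(p95_windows_ms)[-consecutive:]
    # Require a full run of breaches; fewer windows means not enough evidence.
    return len(recent) == consecutive and all(v > threshold_ms for v in recent)
```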