Assignment 3 Report Design

Deployment and Model Optimization

Optimize models for practical serving: latency, throughput, memory, and quality retention.

Primary goal: Improve efficiency without violating the quality floor.

Decision metric: P95 latency, subject to a minimum quality threshold.

Deliverable: Deployment recommendation with a rollback strategy.

Scope and Targets

Area         | Baseline           | Advanced                      | Success Condition
Export       | Native Checkpoint  | ONNX / TorchScript            | Portable runtime with no measurable accuracy drift.
Optimization | FP32 Inference     | Quantization + Pruning        | At least a 30% latency drop within the quality floor.
Serving      | Single Requests    | Batched and Profiled Pipeline | Higher throughput and stable P95/P99 latency.
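The quantization row above can be illustrated with a minimal sketch of per-tensor symmetric int8 quantization, using only the standard library. This is not the required toolkit for the assignment; the function names and weight values are illustrative.

```python
# Minimal sketch: post-training symmetric int8 quantization of one weight
# tensor. A real deployment would use a framework's quantization toolkit;
# this only shows the scale/round/clip mechanics and the error bound.

def quantize_int8(weights):
    """Map floats to int8 values with a single per-tensor symmetric scale."""
    max_abs = max(abs(w) for w in weights) or 1.0
    scale = max_abs / 127.0
    q = [max(-128, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from int8 values and the scale."""
    return [qi * scale for qi in q]

weights = [0.12, -0.5, 0.33, 0.08, -0.27]          # illustrative values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
# Rounding error is bounded by half a quantization step (scale / 2).
assert max_err <= scale / 2 + 1e-9
print(q, round(max_err, 5))
```

The "no measurable accuracy drift" condition is checked downstream: the error bound above limits per-weight distortion, but end-task quality must still be benchmarked against the baseline.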

Latency: P50 / P95 / P99

Track tail latency explicitly to reflect production user experience.
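The three percentiles can be computed directly from recorded request latencies; a stdlib sketch with made-up sample data:

```python
# Sketch: compute P50/P95/P99 from recorded per-request latencies (ms).
# statistics.quantiles with n=100 returns the 1st..99th percentile cut points.
import statistics

latencies_ms = [12.0, 15.1, 13.4, 90.2, 14.8, 13.0, 16.5, 14.1, 13.7, 250.0,
                12.9, 15.5, 14.4, 13.8, 16.0, 12.7, 14.9, 13.2, 15.8, 14.6]

cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
p50, p95, p99 = cuts[49], cuts[94], cuts[98]
print(f"P50={p50:.1f}ms  P95={p95:.1f}ms  P99={p99:.1f}ms")
```

Note how two slow outliers leave the median almost untouched while inflating P95/P99, which is exactly why the tail is tracked separately from P50.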

Throughput: Requests Per Second

Measure with realistic concurrency and batching settings.
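A minimal throughput harness might look like the sketch below, where `handle_request` is a hypothetical stand-in for the real serving call and the concurrency level mirrors the expected client load:

```python
# Sketch: fire N requests at a stub handler under a fixed concurrency level
# and report requests per second. Replace handle_request with the real
# serving call; the sleep simulates ~5 ms of inference.
import time
from concurrent.futures import ThreadPoolExecutor

def handle_request(_):
    time.sleep(0.005)  # stand-in for model inference latency
    return "ok"

def measure_rps(n_requests=200, concurrency=8):
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(handle_request, range(n_requests)))
    elapsed = time.perf_counter() - start
    return n_requests / elapsed, results

rps, results = measure_rps()
print(f"throughput ~ {rps:.0f} req/s")
```

Running the same harness at several concurrency levels (the "load tiers" of the dashboard section) exposes where throughput saturates and tail latency starts to climb.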

Efficiency: Memory + Cost per 1K Requests

Include peak memory and estimated serving cost at target traffic.

1. Deployment Target

  • Platform, traffic profile, and SLO boundaries.
  • Quality floor and acceptable regression limits.
  • Operational constraints and compliance notes.

2. System Profile

  • Runtime stack, hardware, and serving framework.
  • Batch policy, queueing, and autoscaling assumptions.
  • Profiling setup and benchmark reproducibility details.
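The reproducibility bullet can be made concrete by emitting a metadata record with every benchmark run; the fields below are an illustrative minimum, to be extended with GPU driver, model hash, and framework versions as applicable:

```python
# Sketch: record enough metadata per benchmark run to reproduce the result.
import json
import platform
import random
import sys
import time

SEED = 1234
random.seed(SEED)

run_metadata = {
    "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
    "python": sys.version.split()[0],
    "platform": platform.platform(),
    "seed": SEED,
    "batch_size": 16,   # assumed batch policy under test
    "concurrency": 8,   # assumed client concurrency
}
print(json.dumps(run_metadata, indent=2))
```

Storing this JSON next to the benchmark output (and linking it from the optimization registry) makes every dashboard number traceable to an exact configuration.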

3. Hypothesis Matrix

  • H1: quantization gives largest latency gain.
  • H2: pruning gives better memory-per-quality ratio.
  • H3: batching shifts bottleneck to preprocessing stage.

4. Optimization Registry

  • Technique, config, target model, and version tag.
  • Expected benefit, regression risk, and fallback path.
  • Benchmark command, seed, and artifact links.
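One registry entry per optimization attempt keeps technique, config, risk, and fallback together; a dataclass sketch with illustrative field names and values:

```python
# Sketch: a structured registry entry for one optimization attempt.
# Field names and example values are illustrative, not prescribed.
from dataclasses import dataclass, asdict

@dataclass
class OptimizationEntry:
    technique: str
    config: dict
    target_model: str
    version_tag: str
    expected_benefit: str
    regression_risk: str
    fallback: str
    benchmark_cmd: str
    seed: int

entry = OptimizationEntry(
    technique="dynamic-int8-quantization",
    config={"dtype": "int8", "scope": "linear-layers"},
    target_model="classifier-v2",
    version_tag="opt-q8-001",
    expected_benefit="~30% P95 latency drop",
    regression_risk="accuracy drift on rare classes",
    fallback="serve FP32 checkpoint classifier-v2",
    benchmark_cmd="python bench.py --model opt-q8-001 --seed 1234",
    seed=1234,
)
print(asdict(entry)["version_tag"])
```

Serializing entries with `asdict` gives a flat record that can double as the index for benchmark artifacts.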

5. Benchmark Dashboard

  • Latency distribution and throughput under load tiers.
  • Model size, memory peak, and utilization curves.
  • Quality-retention table against baseline.
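The quality-retention table reduces to a ratio against the baseline metric; a sketch with made-up accuracy numbers and hypothetical candidate names:

```python
# Sketch: quality retention of each candidate relative to the baseline.
baseline = {"accuracy": 0.912}                       # illustrative baseline
candidates = {
    "opt-q8-001":    {"accuracy": 0.905},            # illustrative results
    "opt-prune-001": {"accuracy": 0.898},
}

for name, metrics in candidates.items():
    retention = metrics["accuracy"] / baseline["accuracy"]
    print(f"{name}: accuracy={metrics['accuracy']:.3f} retention={retention:.1%}")
```

A retention floor (e.g. 99% of baseline) turns this table directly into a pass/fail column on the dashboard.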

6. Production Risk Review

  • Numerical stability and precision-sensitive cases.
  • Rollback trigger thresholds and alert signals.
  • Final recommendation ranked by business impact.
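Rollback triggers can be expressed as a single predicate over monitored signals; the thresholds below are illustrative placeholders, with real values set by the SLO and quality floor from the Deployment Target section:

```python
# Sketch: evaluate rollback triggers against current production metrics.
# Thresholds are placeholders; the returned reasons double as alert text.
def should_rollback(metrics, p95_slo_ms=200.0, quality_floor=0.90,
                    error_rate_max=0.01):
    """Return (decision, reasons) for the given metric snapshot."""
    reasons = []
    if metrics["p95_ms"] > p95_slo_ms:
        reasons.append(f"P95 {metrics['p95_ms']:.0f}ms exceeds SLO {p95_slo_ms:.0f}ms")
    if metrics["quality"] < quality_floor:
        reasons.append(f"quality {metrics['quality']:.2f} below floor {quality_floor:.2f}")
    if metrics["error_rate"] > error_rate_max:
        reasons.append(f"error rate {metrics['error_rate']:.3f} above {error_rate_max:.3f}")
    return (len(reasons) > 0, reasons)

ok_metrics = {"p95_ms": 150.0, "quality": 0.93, "error_rate": 0.002}
bad_metrics = {"p95_ms": 260.0, "quality": 0.88, "error_rate": 0.002}
print(should_rollback(ok_metrics)[0], should_rollback(bad_metrics)[0])
```

Wiring the same predicate into both the alerting pipeline and the rollback automation keeps the trigger thresholds defined in exactly one place.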

Executive Summary Template

What improved?

Candidate __ reduced P95 latency by __% and memory by __%.

What trade-off appeared?

Quality changed by __ on sensitive class/group: __.

What is next?

Deploy candidate __ with monitored rollback threshold: __.