Assignment 2 Report Design

Robustness and Generalization Study

Evaluate how models behave under realistic distribution shifts and non-ideal data settings.

Primary goal: Minimize robustness gap between IID and OOD settings.

Decision metric: OOD weighted F1 with ECE as a guardrail.

Deliverable: Robustness profile with actionable mitigation steps.
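
One way to read "ECE as a guardrail" is that a candidate is only eligible if its ECE stays under a limit, and the winner is then chosen on OOD weighted F1 alone. The sketch below assumes that reading. The function name, the dictionary keys, and the `max_ece` limit are hypothetical and are not fixed by this report design.

```python
def select_by_guardrail(candidates, max_ece):
    """Return the candidate with the highest OOD weighted F1 among those
    whose ECE does not exceed max_ece, or None if none qualifies.

    Each candidate is a dict with the (assumed) keys
    "name", "ood_weighted_f1", and "ece".
    """
    eligible = [c for c in candidates if c["ece"] <= max_ece]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c["ood_weighted_f1"])


if __name__ == "__main__":
    # Illustrative placeholder numbers only, not measured results.
    demo = [
        {"name": "baseline", "ood_weighted_f1": 0.71, "ece": 0.04},
        {"name": "ensemble", "ood_weighted_f1": 0.78, "ece": 0.09},
        {"name": "regularized", "ood_weighted_f1": 0.75, "ece": 0.05},
    ]
    print(select_by_guardrail(demo, max_ece=0.06))
```

With this rule a model cannot win on F1 by becoming badly miscalibrated: it is filtered out before the comparison happens.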

Scope and Targets

Area           | Baseline       | Advanced                     | Success Condition
Shift Testing  | IID Validation | OOD Robustness Suite         | Robustness gap reduced by at least 20%.
Data Strategy  | Standard Split | Stress and Stratified Splits | Stable performance across shift buckets.
Model Strategy | Single Model   | Regularization / Ensemble    | Lower variance across seeds and perturbations.

Shift Quality

IID vs OOD F1

Report both absolute scores and relative degradation percentages.
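
As a minimal sketch of the two quantities to report, the helpers below compute the absolute IID-to-OOD gap and the relative degradation percentage. They also show one possible check for the "reduced by at least 20%" success condition, assuming it means a relative reduction of the absolute gap versus the baseline. The function names are hypothetical.

```python
def robustness_gap(iid_f1, ood_f1):
    """Return the absolute gap and the relative degradation (in percent)
    when moving from the IID score to the OOD score."""
    if iid_f1 <= 0:
        raise ValueError("iid_f1 must be positive")
    absolute = iid_f1 - ood_f1
    relative_pct = 100.0 * absolute / iid_f1
    return {"absolute_gap": absolute, "relative_degradation_pct": relative_pct}


def gap_reduction_pct(baseline_gap, new_gap):
    """Percent by which the gap shrank relative to the baseline gap.

    Assumption: the success condition compares absolute gaps.
    """
    if baseline_gap <= 0:
        raise ValueError("baseline_gap must be positive")
    return 100.0 * (baseline_gap - new_gap) / baseline_gap


def meets_success_condition(baseline_gap, new_gap, required_pct=20.0):
    """True if the gap was reduced by at least required_pct percent."""
    return gap_reduction_pct(baseline_gap, new_gap) >= required_pct
```

Reporting both numbers matters because the same absolute gap means something different for a model with a high IID score than for one with a low IID score.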

Calibration

ECE + Confidence Drift

Identify overconfident predictions under synthetic and real shifts.
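
For reference, ECE can be sketched with equal-width confidence bins: within each bin, take the absolute difference between mean confidence and accuracy, then average those differences weighted by bin size. The version below is a plain-Python sketch. The bin count of 10 and the half-open binning `[lo, hi)` are assumptions, and other binning schemes give slightly different values.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE.

    confidences: predicted-class probabilities in [0, 1].
    correct:     booleans, True where the prediction was right.
    """
    n = len(confidences)
    if n == 0 or n != len(correct):
        raise ValueError("inputs must be non-empty and of equal length")

    conf_sum = [0.0] * n_bins
    hit_sum = [0.0] * n_bins
    count = [0] * n_bins
    for c, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        conf_sum[idx] += c
        hit_sum[idx] += 1.0 if ok else 0.0
        count[idx] += 1

    ece = 0.0
    for s_conf, s_hit, k in zip(conf_sum, hit_sum, count):
        if k == 0:
            continue
        # Weight each bin's |accuracy - confidence| by its share of samples.
        ece += (k / n) * abs(s_hit / k - s_conf / k)
    return ece
```

Computing this once on the IID set and once per shift scenario gives a direct view of how calibration moves under shift.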

Stability

Seed and Scenario Variance

Track variance across seeds and scenarios so that one-off improvements are not mistaken for reliable gains.
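
A small sketch of the variance tracking, assuming each scenario has been run with several seeds and the scores are collected in a dictionary keyed by scenario ID. The function name and the input layout are hypothetical.

```python
from statistics import mean, stdev


def summarize_runs(scores_by_scenario):
    """Summarize per-scenario scores collected over multiple seeds.

    scores_by_scenario maps a scenario ID to a list of scores, one per seed.
    Returns mean, sample standard deviation, min, and max per scenario.
    """
    summary = {}
    for scenario, scores in scores_by_scenario.items():
        if len(scores) < 2:
            raise ValueError(f"{scenario}: need at least two seeds")
        summary[scenario] = {
            "mean": mean(scores),
            "std": stdev(scores),  # sample standard deviation (n - 1)
            "min": min(scores),
            "max": max(scores),
        }
    return summary
```

An improvement smaller than the seed-to-seed standard deviation of the baseline is a weak signal and is worth rerunning before it is reported.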

1. Robustness Question

  • Define deployment assumptions and likely shift categories.
  • Set primary robustness metric and failure tolerance.
  • Declare risk-critical classes and use cases.

2. Shift Profile

  • Noise, blur, style/domain, and class-prior shift settings.
  • Severity levels with reproducible generation configs.
  • Expected real-world analog for each synthetic shift.
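
The shift profile above can be sketched as a small, seeded config object, so that every severity level is reproducible and carries its real-world analog. Everything here is an assumption for illustration: the `ShiftConfig` fields, the five severity levels, and the sigma values in `NOISE_SIGMA` are hypothetical, and only Gaussian noise is shown.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ShiftConfig:
    kind: str               # e.g. "gaussian_noise"
    severity: int           # 1 (mild) to 5 (severe)
    seed: int               # fixes the random stream for reproducibility
    real_world_analog: str  # what this synthetic shift stands in for


# Hypothetical severity-to-sigma mapping for inputs scaled to [0, 1].
NOISE_SIGMA = {1: 0.02, 2: 0.05, 3: 0.10, 4: 0.20, 5: 0.30}


def apply_gaussian_noise(values, cfg):
    """Add seeded Gaussian noise to values in [0, 1] and clip the result."""
    if cfg.kind != "gaussian_noise":
        raise ValueError(f"unsupported shift kind: {cfg.kind}")
    rng = random.Random(cfg.seed)  # local generator, no global state
    sigma = NOISE_SIGMA[cfg.severity]
    return [min(1.0, max(0.0, v + rng.gauss(0.0, sigma))) for v in values]
```

Because the generator is created from `cfg.seed` on every call, the same config always produces the same perturbed data.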

3. Hypothesis Matrix

  • H1: regularization reduces overconfidence.
  • H2: data stress splits reduce the generalization gap.
  • H3: ensemble reduces worst-case shift failure.

4. Experiment Registry

  • Scenario ID, model, defense, seed, and config hash.
  • Runtime, compute cost, and artifact references.
  • Reason for each run and expected signal.
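
One possible shape for a registry row is sketched below. The config hash is a truncated SHA-256 of the config serialized as JSON with sorted keys, so two configs with the same content hash identically regardless of key order. The field names, the 12-character truncation, and both function names are hypothetical choices.

```python
import hashlib
import json


def config_hash(config):
    """Short, order-independent hash of a JSON-serializable config dict."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def registry_row(scenario_id, model, defense, seed, config, reason,
                 expected_signal, runtime_s=None, compute_cost=None,
                 artifacts=None):
    """Build one experiment-registry record as a plain dict."""
    return {
        "scenario_id": scenario_id,
        "model": model,
        "defense": defense,
        "seed": seed,
        "config_hash": config_hash(config),
        "runtime_s": runtime_s,        # filled in after the run finishes
        "compute_cost": compute_cost,  # unit is up to the team
        "artifacts": list(artifacts or []),
        "reason": reason,
        "expected_signal": expected_signal,
    }
```

Recording the reason and expected signal before the run makes it harder to rationalize a surprising result after the fact.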

5. Results Dashboard

  • IID/OOD score table with per-scenario breakdown.
  • Robustness gap chart and calibration comparison.
  • Top resilient classes and most fragile classes.
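
The per-class part of the dashboard can be sketched in plain Python: compute F1 per class, aggregate it with support weights, and rank classes by their IID-to-OOD drop. This sketch assumes the usual definition of weighted F1 (per-class F1 weighted by the number of true instances). The function names are hypothetical.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {class: F1} over all classes seen in labels or predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in set(y_true) | set(y_pred):
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def weighted_f1(y_true, y_pred):
    """Per-class F1 averaged with weights equal to true-class support."""
    f1 = per_class_f1(y_true, y_pred)
    support = Counter(y_true)
    return sum(support[c] * f1[c] for c in support) / len(y_true)


def rank_by_f1_drop(iid_f1, ood_f1):
    """Sort classes from most fragile (largest drop) to most resilient."""
    drops = {c: iid_f1[c] - ood_f1.get(c, 0.0) for c in iid_f1}
    return sorted(drops.items(), key=lambda kv: kv[1], reverse=True)
```

The head of the ranked list gives the most fragile classes and the tail gives the most resilient ones.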

6. Risk Review

  • Spurious feature reliance and shortcut behavior.
  • False-positive/false-negative risk severity.
  • Mitigations prioritized by impact and cost.

Executive Summary Template

What improved?

The robust method reduced OOD degradation by __% versus baseline while preserving IID quality.

What remains risky?

Most fragile scenario remains __, with failure mode: __.

What is next?

Prioritize calibration tuning and targeted augmentation for shift: __.