This is a review of "Decoupled Knowledge Distillation", presented at CVPR 2022.
Introduction

Types of Knowledge Distillation
Logits-based method
- Low computational and storage cost, but unsatisfactory performance
Feature-based method
- Superior performance, but extra computational cost and storage usage
∴ The potential of logit distillation seems limited; this paper argues it is actually restricted by the coupled formulation of the classical KD loss.
Decoupled Knowledge Distillation
Target class knowledge distillation (TCKD)
- Binary logit distillation on the target vs. non-target split
Non-target class knowledge distillation (NCKD)
- Knowledge among the non-target logits
Method
Reformulation
Binary probabilities: target class vs. all non-target classes
Probabilities among non-target classes: softmax over the non-target logits only
Vanilla KD
- While NCKD focuses on the knowledge among non-target classes, TCKD focuses on the knowledge related to the target class.
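For reference, a sketch of the reformulation in the paper's notation ($z_j$: logits, $t$: target class, $C$: number of classes, $\mathcal{T}$/$\mathcal{S}$: teacher/student):

```latex
% Binary (target vs. non-target) probabilities
\mathbf{b} = [p_t,\, p_{\setminus t}], \qquad
p_t = \frac{\exp(z_t)}{\sum_{j=1}^{C} \exp(z_j)}, \qquad
p_{\setminus t} = \frac{\sum_{k \neq t} \exp(z_k)}{\sum_{j=1}^{C} \exp(z_j)}

% Probabilities among the non-target classes (target logit excluded)
\hat{p}_i = \frac{\exp(z_i)}{\sum_{k \neq t} \exp(z_k)}, \qquad i \neq t

% Vanilla KD decomposes into a TCKD term and a weighted NCKD term
\mathrm{KD}
  = \mathrm{KL}\!\left(\mathbf{b}^{\mathcal{T}} \,\|\, \mathbf{b}^{\mathcal{S}}\right)
  + \left(1 - p_t^{\mathcal{T}}\right) \mathrm{KL}\!\left(\hat{\mathbf{p}}^{\mathcal{T}} \,\|\, \hat{\mathbf{p}}^{\mathcal{S}}\right)
  = \mathrm{TCKD} + \left(1 - p_t^{\mathcal{T}}\right) \mathrm{NCKD}
```

The $(1 - p_t^{\mathcal{T}})$ factor is exactly the coupling discussed in the "Decoupled Knowledge Distillation" subsection below.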
Effects of TCKD and NCKD

- Solely applying TCKD is unhelpful or even harmful.
- Performance of NCKD alone is comparable to or even better than vanilla KD
∴ Target-class-related knowledge may not be as important as the knowledge among non-target classes.


- TCKD transfers knowledge about the difficulty of training samples and brings clear gains under harder settings (e.g., strong augmentation, noisy labels, ImageNet)
∴ The more difficult the training data is, the more benefit TCKD can provide.
Decoupled Knowledge Distillation
- The NCKD loss is coupled with $(1 - p_t^{\mathcal{T}})$, the teacher's confidence on the target class
: More confident teacher predictions result in smaller NCKD weights, so knowledge on well-predicted samples is highly suppressed
- The weights of TCKD and NCKD are coupled, so the two terms cannot be balanced separately
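Decoupling the two terms yields the DKD loss, where independent weights $\alpha$ and $\beta$ replace the coupled weights $1$ and $(1 - p_t^{\mathcal{T}})$:

```latex
\mathrm{DKD} = \alpha \, \mathrm{TCKD} + \beta \, \mathrm{NCKD}
```

Below is a minimal PyTorch-style sketch of this loss; it is not the authors' released implementation, and the default values of alpha, beta, and the temperature T are illustrative (the paper tunes them per setting). Teacher logits are assumed to be precomputed or detached (e.g., obtained under torch.no_grad()).

```python
import torch
import torch.nn.functional as F


def dkd_loss(logits_student, logits_teacher, target,
             alpha=1.0, beta=8.0, T=4.0, eps=1e-8):
    """Decoupled KD loss: alpha * TCKD + beta * NCKD (illustrative sketch)."""
    num_classes = logits_student.size(1)
    gt_mask = F.one_hot(target, num_classes).bool()  # (B, C), True at the target class

    # TCKD: KL divergence over binary (target vs. non-target) probabilities b = [p_t, p_\t]
    p_s = F.softmax(logits_student / T, dim=1)
    p_t = F.softmax(logits_teacher / T, dim=1)
    b_s = torch.stack([(p_s * gt_mask).sum(1), (p_s * ~gt_mask).sum(1)], dim=1)
    b_t = torch.stack([(p_t * gt_mask).sum(1), (p_t * ~gt_mask).sum(1)], dim=1)
    tckd = (b_t * (torch.log(b_t + eps) - torch.log(b_s + eps))).sum(1).mean() * T ** 2

    # NCKD: KL divergence over probabilities among non-target classes
    # (the target logit is masked out before the softmax)
    ps_hat = F.softmax((logits_student / T).masked_fill(gt_mask, float("-inf")), dim=1)
    pt_hat = F.softmax((logits_teacher / T).masked_fill(gt_mask, float("-inf")), dim=1)
    nckd = (pt_hat * (torch.log(pt_hat + eps) - torch.log(ps_hat + eps))).sum(1).mean() * T ** 2

    # Decoupled weighting: alpha and beta are tuned independently,
    # instead of the coupled weights 1 and (1 - p_t^T) of vanilla KD
    return alpha * tckd + beta * nckd


# Usage: logits are (batch, num_classes), target holds class indices
loss = dkd_loss(torch.randn(4, 100), torch.randn(4, 100), torch.randint(0, 100, (4,)))
```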
Experiments
Ablation: α and β

CIFAR-100 (image classification)

ImageNet (image classification)

MS-COCO (object detection)

Conclusions
- Reformulation of vanilla KD loss into two parts: TCKD and NCKD
- Decoupled Knowledge Distillation overcomes the limitation of the coupled formulation, enabling more effective and flexible knowledge transfer
- Significant improvements on various datasets