
[Paper Review] Decoupled Knowledge Distillation

hakk35 2024. 11. 25. 08:49

This is a review of

"Decoupled Knowledge Distillation"
presented at CVPR 2022.

Introduction

Types of Knowledge Distillation

Logits-based method

  • (+) Computational and storage cost ↓
  • (−) Unsatisfactory performance

Feature-based method

  • (+) Superior performance
  • (−) Extra computational cost and storage usage

∴ The potential of logit distillation is limited.

 

Decoupled Knowledge Distillation

Target classification knowledge distillation (TCKD)

  • Binary logit distillation

Non-target classification knowledge distillation (NCKD)

  • Knowledge among non-target logits

 

 

Method

Reformulation

$$p_i = \frac{\exp(z_i)}{\sum_{j=1}^{C}\exp(z_j)}, \qquad \mathbf{p} = [p_1, p_2, \ldots, p_t, \ldots, p_C] \in \mathbb{R}^{1 \times C}$$

Binary probabilities

$$p_t = \frac{\exp(z_t)}{\sum_{j=1}^{C}\exp(z_j)}, \qquad p_{\setminus t} = \frac{\sum_{k=1, k \neq t}^{C}\exp(z_k)}{\sum_{j=1}^{C}\exp(z_j)}, \qquad \mathbf{b} = [p_t, p_{\setminus t}] \in \mathbb{R}^{1 \times 2}$$

Probabilities among non-target classes

$$\hat{p}_i = \frac{\exp(z_i)}{\sum_{j=1, j \neq t}^{C}\exp(z_j)}, \qquad \hat{\mathbf{p}} = [\hat{p}_1, \ldots, \hat{p}_{t-1}, \hat{p}_{t+1}, \ldots, \hat{p}_C] \in \mathbb{R}^{1 \times (C-1)}$$
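To make the three distributions concrete, here is a minimal PyTorch sketch (my own illustration, not code from the paper; the function name `split_probabilities` is hypothetical) that computes $\mathbf{p}$, $\mathbf{b}$, and $\hat{\mathbf{p}}$ from a batch of logits.

```python
import torch
import torch.nn.functional as F

def split_probabilities(logits, target):
    """Compute p (full softmax), b (target vs. non-target), and p_hat
    (softmax restricted to non-target classes) for a batch of logits.

    logits: (N, C) raw scores, target: (N,) ground-truth class indices.
    """
    p = F.softmax(logits, dim=1)                # p in R^{N x C}
    p_t = p.gather(1, target.unsqueeze(1))      # target probability p_t, shape (N, 1)
    b = torch.cat([p_t, 1.0 - p_t], dim=1)      # b = [p_t, p_{\t}] in R^{N x 2}

    # Mask the target logit so it is excluded from the normalization;
    # the target entry of p_hat becomes exactly 0.
    mask = F.one_hot(target, num_classes=logits.size(1)).bool()
    p_hat = F.softmax(logits.masked_fill(mask, float("-inf")), dim=1)
    return p, b, p_hat
```

For example, `split_probabilities(torch.randn(4, 100), torch.randint(0, 100, (4,)))` returns the three tensors for a 4-sample, 100-class batch.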

Vanilla KD

$$\mathrm{KD} = \mathrm{KL}(\mathbf{p}^{\mathcal{T}} \,\|\, \mathbf{p}^{\mathcal{S}}) = p_t^{\mathcal{T}} \log\!\left(\frac{p_t^{\mathcal{T}}}{p_t^{\mathcal{S}}}\right) + \sum_{i=1, i \neq t}^{C} p_i^{\mathcal{T}} \log\!\left(\frac{p_i^{\mathcal{T}}}{p_i^{\mathcal{S}}}\right)$$

$$\begin{aligned}\mathrm{KD} &= p_t^{\mathcal{T}} \log\!\left(\frac{p_t^{\mathcal{T}}}{p_t^{\mathcal{S}}}\right) + p_{\setminus t}^{\mathcal{T}} \sum_{i=1, i \neq t}^{C} \hat{p}_i^{\mathcal{T}} \left(\log\!\left(\frac{\hat{p}_i^{\mathcal{T}}}{\hat{p}_i^{\mathcal{S}}}\right) + \log\!\left(\frac{p_{\setminus t}^{\mathcal{T}}}{p_{\setminus t}^{\mathcal{S}}}\right)\right) \\ &= p_t^{\mathcal{T}} \log\!\left(\frac{p_t^{\mathcal{T}}}{p_t^{\mathcal{S}}}\right) + p_{\setminus t}^{\mathcal{T}} \log\!\left(\frac{p_{\setminus t}^{\mathcal{T}}}{p_{\setminus t}^{\mathcal{S}}}\right) + p_{\setminus t}^{\mathcal{T}} \sum_{i=1, i \neq t}^{C} \hat{p}_i^{\mathcal{T}} \log\!\left(\frac{\hat{p}_i^{\mathcal{T}}}{\hat{p}_i^{\mathcal{S}}}\right)\end{aligned}$$

$$\mathrm{KD} = \mathrm{KL}(\mathbf{b}^{\mathcal{T}} \,\|\, \mathbf{b}^{\mathcal{S}}) + (1 - p_t^{\mathcal{T}})\,\mathrm{KL}(\hat{\mathbf{p}}^{\mathcal{T}} \,\|\, \hat{\mathbf{p}}^{\mathcal{S}})$$

$$\mathrm{KD} = \mathrm{TCKD} + (1 - p_t^{\mathcal{T}})\,\mathrm{NCKD}$$

  • While NCKD focuses on the knowledge among non-target classes, TCKD focuses on the knowledge related to the target class (the decomposition is verified numerically in the sketch below).
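As a sanity check, the identity $\mathrm{KD} = \mathrm{TCKD} + (1 - p_t^{\mathcal{T}})\,\mathrm{NCKD}$ can be verified numerically. The script below is my own sketch on random logits (temperature omitted for brevity), not code from the paper.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
C, t = 10, 3                                    # number of classes, target index
z_T, z_S = torch.randn(C), torch.randn(C)       # teacher / student logits
p_T, p_S = F.softmax(z_T, dim=0), F.softmax(z_S, dim=0)

# Vanilla KD: KL divergence over the full class distribution.
kd = torch.sum(p_T * torch.log(p_T / p_S))

# TCKD: binary KL over [p_t, 1 - p_t].
b_T = torch.stack([p_T[t], 1 - p_T[t]])
b_S = torch.stack([p_S[t], 1 - p_S[t]])
tckd = torch.sum(b_T * torch.log(b_T / b_S))

# NCKD: KL over the re-normalized non-target probabilities.
nt = [i for i in range(C) if i != t]
ph_T, ph_S = p_T[nt] / p_T[nt].sum(), p_S[nt] / p_S[nt].sum()
nckd = torch.sum(ph_T * torch.log(ph_T / ph_S))

print(torch.allclose(kd, tckd + (1 - p_T[t]) * nckd))   # True
```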

 

Effects of TCKD and NCKD

  • Solely applying TCKD is unhelpful or even harmful.
  • The performance of applying NCKD alone is comparable to, or even better than, vanilla KD.

∴ Target class-related knowledge may not be as important as the knowledge among non-target classes.

∴ The more difficult the training data is, the more benefits TCKD could provide.

 

Decoupled Knowledge Distillation

$$\mathrm{KD} = \mathrm{TCKD} + (1 - p_t^{\mathcal{T}})\,\mathrm{NCKD}$$

  • The NCKD loss is coupled with $(1 - p_t^{\mathcal{T}})$: more confident teacher predictions result in smaller NCKD weights, so the knowledge from well-predicted samples is highly suppressed (e.g., $p_t^{\mathcal{T}} = 0.99$ leaves NCKD with an effective weight of only $0.01$).
  • The weights of TCKD and NCKD are coupled and cannot be balanced independently.

$$\mathrm{DKD} = \alpha\,\mathrm{TCKD} + \beta\,\mathrm{NCKD}$$
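A minimal PyTorch sketch of the DKD loss is given below, written in the spirit of the equation above; the hyper-parameter values ($\alpha$, $\beta$, temperature $T$) are illustrative defaults, not the paper's per-dataset settings.

```python
import torch
import torch.nn.functional as F

def dkd_loss(logits_s, logits_t, target, alpha=1.0, beta=8.0, T=4.0):
    """DKD = alpha * TCKD + beta * NCKD (sketch; alpha, beta, T are
    illustrative values, not the settings reported in the paper)."""
    mask = F.one_hot(target, num_classes=logits_s.size(1)).bool()

    p_s = F.softmax(logits_s / T, dim=1)
    p_t = F.softmax(logits_t / T, dim=1)

    # TCKD: binary KL between [p_target, p_non-target] of student and teacher.
    pt_s, pt_t = p_s[mask].unsqueeze(1), p_t[mask].unsqueeze(1)
    b_s = torch.cat([pt_s, 1.0 - pt_s], dim=1)
    b_t = torch.cat([pt_t, 1.0 - pt_t], dim=1)
    tckd = F.kl_div(b_s.log(), b_t, reduction="batchmean") * (T ** 2)

    # NCKD: KL among non-target classes; a large negative constant pushes
    # the target logit out of the softmax normalization.
    neg = mask.float() * -1e9
    log_ph_s = F.log_softmax(logits_s / T + neg, dim=1)
    ph_t = F.softmax(logits_t / T + neg, dim=1)
    nckd = F.kl_div(log_ph_s, ph_t, reduction="batchmean") * (T ** 2)

    return alpha * tckd + beta * nckd
```

During training this term would typically be added to the standard cross-entropy loss on the student logits.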

 

 

Experiments

Ablation: α and β

 

CIFAR-100

 

ImageNet

 

COCO

 

 

Conclusions

  • Reformulation of vanilla KD loss into two parts: TCKD and NCKD
  • Decoupled Knowledge Distillation overcomes the limitation of the coupled formulation, enabling more effective and flexible knowledge transfer.
  • Significant improvements on various datasets