[Paper Review] ShiftKD: Benchmarking Knowledge Distillation under Distribution Shift

Paper Review/Knowledge Distillation

[Paper Review] ShiftKD: Benchmarking Knowledge Distillation under Distribution Shift

成學 2025. 7. 26. 09:00

This is a Korean review of "ShiftKD: Benchmarking Knowledge Distillation underDistribution Shift" published in arXiv 2025.

TL;DR

Real-world에서는 훈련 데이터와 테스트 데이터 간의 분포 차이가 빈번하게 발생함. 따라서, Domain Shift에서 기존 KD 방법들의 신뢰성과 강건성을 확인해야 함.
두 가지의 일반적인 분포 변화 유형(Diversity shift, Correlation shift)에서 다양한 KD 기법들을 평가하며, 이외에도 데이터 증강, 프루닝, 최적화 알고리즘에 따른 성능 변화를 분석함.

Introduction

잘 학습된 대형 모델이 주어졌을 때, 분포 이동 상황에서도 성능 저하 없이 더 작고 강건한 구조로 압축하는 것이 필요함. 이를 위해 KD가 주목받고 있으나, 기존 방법들은 독립적이고 동일한 분포(i.i.d.)를 전제로 하고 있음.
훈련 환경에서는 깨끗하고 잘 정렬된 데이터가 주어지지만, 실제 depoyment environment에서는 Diversity shift와 Correlation shift가 나타날 수 있음.
- Diversity Shift: 실제 사진 → 만화 스타일 이미지로의 스타일 변화
- Correlation Shift: 레이블-특징 간 연관성 변화
기존의 i.i.d. 가정과는 다른 분포 이동 상황에서 다양한 KD 기법을 평가함. 이를 통해, 기본적인 Vanilla KD 방법도 때로는 충분할 수 있다는 것을 보여주며, 분포 이동 하에서 dark knowledge 및 데이터 증강의 효과 급감 현상을 밝힘.

ShiftKD: Framework to Evaluate Knowledge Distillation to Distribution Shift

Preliminaries

Knowledge Distillation (KD)

KD under distribution shift (non-i.i.d. case)

Non-i.i.d. 상황에서는, 유사하지만 서로 다른 $K$개의 훈련 도메인들 $\mathcal{D}_{\text{tr}} = \left\{ \mathcal{D}_e = (X_e, Y_e) \right\}_{e=1}^{K}$이 주어지며, 각 도메인은 서로 다른 데이터 분포 $P^e_{XY}$를 따름.
분포 이동 상황에서의 KD 목표는 훈련 시 접근할 수 없는 테스트 환경 $\mathcal{D}_\text{te}$에서도 잘 동작할 수 있는 학생모델 $S(X; \theta_s)$를 구축하는 것임.
선생 모델은 분포 이동이 반영된 데이터셋 $\mathcal{D}_\text{tr}$에 대해서 먼저 학습되고, 학생 모델에게 증류함. 이를 통해, 분포가 변화한 테스트셋 $\mathcal{D}_\text{te}$에 대해서 학생의 강건성을 확인함. → 선생모델 자체가 강건하지 않더라도, KD를 통해 강건한 학생을 얻기를 원함.

Framework Setting

Transferable Knowledge algorithms

분포 이동 하에서, 어떤 종류의 지식이 학생이 선생을 잘 따라가도록 도울까?

Distillation Data Manipulation

분포 이동 상황에서 KD의 강건성을 얻기 위해 어떤 데이터 전략을 선택해야 할까?

Optimization option

Types of Distribution shift

Benchmarking Details

Knowledge Transfer Algorithms

Data Manipulation Techniques

Optimization Options

분포 이동 상황에서 KD 성능에 영향을 줄 수 있는 하이퍼파라미터, 사전학습, optimizer, 학생 모델 종류 등을 평가함.

Shifted Datasets

Diversity shift와 Correlation shift의 이동 조건에서 KD 성능을 평가하기 위해 아래의 5가지 데이터셋을 선택함.
- Diversity shift: OOD generalization에서 널리 사용되는 PACS, OfficeHome, DomainNet을 사용함.
- Correlation shift: ColorMNIST(색과 숫자간의 인위적 상관관계)와 CelebA-Blond(성별과 금발 여부간의 상관관계)를 사용함.
이 다섯가지에 국한하지 않고, 대부분의 ODD 벤치마크 데이터셋을 활용할 수 있음.

Evaluation Implementation

Evaluation Metrics

Average Accuracy: 모든 도메인 환경에서 평균적으로 달성한 정확도
Worst-Group Accuracy (WGA): 가장 낮은 성능을 보인 환경에서의 정확도; 분포 이동이 심한 환경에 대한 강건성을 판단하는 기준
Expected Calibration Error (ECE): 모델의 예측 신뢰도와 실제 정확도 간의 차이를 측정하는 calibration 지표; 모델이 얼마나 overconfident 또는 underconfident하는지를 수치화

Benchmarking Details

RQ1: Performance Across Distillation Algorithms

KD를 통해 학습된 학생모델은 일반적인 특징에 초점을 맞추기 때문에 일반화 성능이 향상됨. 이는 학생모델이 선생모델보다 구조적으로 더 단순하게 설계되었기 때문임. 결과적으로, 분포 이동에서도 전반적인 성능향상을 가져옴.
복잡한 KD 기법들이 항상 Vanilla KD보다 큰 이점을 제공하지는 않음.
KD의 성능 개선 효과는 architectural compatibility에 매우 민감하기 때문에, 분포 이동 유형에 따라 KD 기법을 동적으로 조정할 필요가 있음.

Low-level knowledge는 분포 이동 상황에서 학생을 오히려 혼란스럽게 함. High-level semantic feature를 포함하는 마지막 layer를 사용하는 것이 가장 좋은 성능을 보임.
복잡한 KD 기법의 성능 저하 원인은 전달된 특징과 실제 필요한 표현 간의 불일치로 설명됨. 분포 변화 상황에서 선생 모델의 신뢰할 수 없는 저수준 특징에 과도하게 의존하게 되면 학생의 성능 저하로 이어짐. 즉, 모든 계층에서 선생모델을 무작정 따라서는 안되며, 도메인에 독립적이고 의미있는 표현을 선별적으로 정렬해야 함.

선생 모델이 bias를 가지고 있다면, 기존 KD 기법들은 이러한 편향을 학생에게 그대로 전달하게 되어, 학생모델의 성능 향상을 저해함.
i.i.d 환경에서 유용했던 dark knowledge는 분포 이동 환경에서는 오히려 역효과를 낳을 수 있음.

RQ2: The Role of Distillation Data

지식 증류에 사용할 데이터를 신중히 선택하는 것이 중요함. 데이터 조작을 통해, 학습 데이터를 유용하게 변형하여 이 데이터의 분포가 다양한 환경에서 공통적인 분포에 더 가까워지도록 해야함.

RQ3: Possible Causes on Training Options

Connecting KD to Information Theory

KD는 선생모델로부터 오는 유용한 정보만 골라서 사용해야 효과적임. 분포 이동 환경에서는 선생 모델의 잘못된 정보까지 따라하면 오히려 악영향을 미침. 따라서, KD에서도 정보를 선별할 필요가 있음.

Conclusion

KD가 분포 이동 상황에서도 강인한 경량 모델을 만드는데 중요한 역할을 하고 있음.
기존의 복잡한 KD 기법들은 Vanilla KD에 비해 큰 개선을 보여주지 못했음. 따라서, 새로운 알고리즘을 개발할 필요가 있음.
분포 이동 상황에서 학생 모델의 강인성을 향상시킬 새로운 데이터 기반 방법을 만드는 것이 유망한 연구 방향임.

저작자표시 비영리 변경금지 (새창열림)

'Paper Review > Knowledge Distillation' 카테고리의 다른 글

[Paper Review] CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation (0)	2025.02.17
[Paper Review] Scale Decoupled Distillation (2)	2024.12.07
[Paper Review] Logit Standardization in Knowledge Distillation (1)	2024.12.01
[Paper Review] Instance-conditional knowledge distillation for object detection (1)	2024.12.01
[Paper Review] CrossKD: Cross-Head Knowledge Distillation for Object Detection (2)	2024.11.29

현재글[Paper Review] ShiftKD: Benchmarking Knowledge Distillation under Distribution Shift

成學

knowledge_distillation, data_augmentation, computer_vision, discretisation, fluid_dynamics, difference_schemes, overfitting, parameter_matching, object_detection, numerical_methods, upwind_schemes, Generalization, hyperbolic_pdes, neural_operator, radar_system, conservative_methods, computational_fluid_dynamics, euler_equation, dataset_distillation, MIT,

Today :
Yesterday :

일	월	화	수	목	금	토
1	2	3	4	5	6	7
8	9	10	11	12	13	14
15	16	17	18	19	20	21
22	23	24	25	26	27	28

成學

[Paper Review] ShiftKD: Benchmarking Knowledge Distillation under Distribution Shift

TL;DR

Introduction

ShiftKD: Framework to Evaluate Knowledge Distillation to Distribution Shift

Preliminaries

Knowledge Distillation (KD)

KD under distribution shift (non-i.i.d. case)

Framework Setting

Transferable Knowledge algorithms

Distillation Data Manipulation

Optimization option

Types of Distribution shift

Benchmarking Details

Knowledge Transfer Algorithms

Data Manipulation Techniques

Optimization Options

Shifted Datasets

Evaluation Implementation

Evaluation Metrics

Benchmarking Details

RQ1: Performance Across Distillation Algorithms

RQ2: The Role of Distillation Data

RQ3: Possible Causes on Training Options

Connecting KD to Information Theory

Conclusion

'Paper Review > Knowledge Distillation' 카테고리의 다른 글

'Paper Review/Knowledge Distillation'의 다른글

티스토리툴바

[Paper Review] ShiftKD: Benchmarking Knowledge Distillation under Distribution Shift

TL;DR

Introduction

ShiftKD: Framework to Evaluate Knowledge Distillation to Distribution Shift

Preliminaries

Knowledge Distillation (KD)

KD under distribution shift (non-i.i.d. case)

Framework Setting

Transferable Knowledge algorithms

Distillation Data Manipulation

Optimization option

Types of Distribution shift

Benchmarking Details

Knowledge Transfer Algorithms

Data Manipulation Techniques

Optimization Options

Shifted Datasets

Evaluation Implementation

Evaluation Metrics

Benchmarking Details

RQ1: Performance Across Distillation Algorithms

RQ2: The Role of Distillation Data

RQ3: Possible Causes on Training Options

Connecting KD to Information Theory

Conclusion

'Paper Review > Knowledge Distillation' 카테고리의 다른 글

'Paper Review/Knowledge Distillation'의 다른글

관련글

티스토리툴바