[Paper Review] CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation

hakk35 2025. 2. 17. 14:14

This post reviews

"CRKD: Enhanced Camera-Radar Object Detection with Cross-modality Knowledge Distillation"
presented at CVPR 2024.

TL;DR

LiDAR-Camera (LC) fusion achieves the highest performance, but its high cost makes it hard to deploy. Camera-Radar (CR) fusion, by contrast, is generally easy to deploy but performs worse than LC fusion.
This work proposes CRKD, which uses an LC fusion model as the teacher and a CR fusion model as the student, with Bird's-Eye-View (BEV) as the shared feature space.
It proposes four distillation losses and shows strong performance on the nuScenes dataset.

Introduction

LiDAR, camera, and radar are the most commonly used sensors in autonomous driving research, and fusing them is a common way to improve detection performance and robustness.
- LiDAR-Camera (LC) fusion gives the best performance in 3D object detection, but its high cost limits deployment.
- Camera-Only (CO) detection performs well within Bird's-Eye-View (BEV) based frameworks, but it is vulnerable to lighting conditions and has difficulty estimating depth accurately.
- Radar, which is low cost and robust to weather and lighting changes, can address this, but its data is sparse and noisy, which makes Camera-Radar (CR) detectors hard to design.
Knowledge Distillation (KD) can be applied to narrow the performance gap between LiDAR-Only (LO)/LC detectors and CO/CR detectors.
- Existing cross-modal KD methods, aiming to exploit the LiDAR data widely available in open-source datasets, use a single-modality detector as the teacher and transfer knowledge to a LiDAR-based or camera-based student detector.
- Distilling from an LC teacher detector to a CR student detector is important. It has two advantages: the designs of existing strong LC detectors can be reused as they are, and the point cloud representation shared by LiDAR and radar measurements can be exploited.
Therefore, as shown in Figure 1, the paper proposes CRKD: Camera-Radar 3D object detector with cross-modality Knowledge Distillation, which transfers knowledge from an LC teacher detector to a CR student detector.

Related works

Multi-modality 2D Object Detection

Cross-modality Knowledge Distillation

Method

Several KD modules are designed to handle cross-modality fusion-to-fusion KD.
- Cross-stage radar distillation + learning-based calibration: designed so that the radar encoder learns a more accurate scene-level object distribution.
- Mask-scaling feature KD: designed for feature imitation in foreground regions; it accounts for the inaccuracy that arises in the *BEV transformation for objects that are far from the sensor or moving.
- Relation KD: applied to maintain relation consistency in scene-level geometry.
- Response KD: applies class-specific loss weights so that the CR model captures moving objects well.

* The BEV transformation converts camera data into a top-down view.
Errors can arise in this transformation for objects that are far from the sensor or moving.
This is because 1) accurate distance is hard to estimate from cameras alone, and
2) dynamic objects incur larger errors in the transformation.

Model Architecture Refinement

A gated network is added so that the BEVFusion model learns to generate attention weights (i.e., which information is important) from the single-modality feature maps. These weights are used to fuse the complementary modalities adaptively (i.e., camera and radar are fused effectively).
- The gated network learns gated features through the following equations.
  $$ \tilde{F}_{M_1} = F_{M_1} \times \sigma \left( \text{Conv}_{M_1} \left( \text{Concat} \left( F_{M_1}, F_{M_2} \right) \right) \right) $$
  $$ \tilde{F}_{M_2} = F_{M_2} \times \sigma \left( \text{Conv}_{M_2} \left( \text{Concat} \left( F_{M_1}, F_{M_2} \right) \right) \right) $$
- Through the adaptive gated network, the teacher and student models learn which information among the input modalities is relatively important.
- Because the gated feature maps encode informative scene geometry from both input modalities, they enable more effective feature-based distillation.

※ See the original paper for the definition of each notation.
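
The gating equations above can be sketched in NumPy roughly as follows. This is only an illustration under assumptions, not the paper's implementation: the convolution is assumed to be a 1×1 convolution (a per-cell linear map over channels), and the names `gated_features`, `conv1x1`, and the random weights are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1x1(x, weight, bias):
    """Assumed 1x1 convolution: a linear map over channels at every BEV cell."""
    return np.einsum("oc,chw->ohw", weight, x) + bias[:, None, None]

def gated_features(f_m1, f_m2, params_m1, params_m2):
    """Return the gated feature maps for two modalities.

    f_m1, f_m2: BEV feature maps of shape (C, H, W).
    params_m1, params_m2: (weight, bias) pairs with shapes (C, 2C) and (C,).
    """
    concat = np.concatenate([f_m1, f_m2], axis=0)   # (2C, H, W)
    gate_m1 = sigmoid(conv1x1(concat, *params_m1))  # attention weights in (0, 1)
    gate_m2 = sigmoid(conv1x1(concat, *params_m2))
    return f_m1 * gate_m1, f_m2 * gate_m2           # element-wise product

# Toy inputs (random values, purely illustrative).
rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
f_cam = rng.normal(size=(C, H, W))
f_rad = rng.normal(size=(C, H, W))
params_cam = (rng.normal(size=(C, 2 * C)), np.zeros(C))
params_rad = (rng.normal(size=(C, 2 * C)), np.zeros(C))

gated_cam, gated_rad = gated_features(f_cam, f_rad, params_cam, params_rad)
print(gated_cam.shape, gated_rad.shape)
```

Because the sigmoid output lies between 0 and 1, each gate can only attenuate its own modality's features, and the attenuation at each cell depends on both modalities.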

Cross-Stage Radar Distillation (CSRD)

Radar and LiDAR both represent data as point clouds, but the physical meaning differs, so ordinary feature imitation does not work well.
- Radar points are sparser and are interpreted as ¶object-level points that carry velocity information.
- LiDAR is dense and captures †geometry-level information.
Because radar is sparse and represents the scene-level object distribution, the paper proposes Cross-Stage Radar Distillation to exploit this.
- It distills between the radar feature map $(F^S_r)$ and the scene-level objectness heatmap $(Y^T)$ predicted by the LC teacher model.
Radar measurements are generally noisy in range and azimuth, so a ‡calibration module is designed to address this. $(F^S_r \rightarrow \hat{F}^S_r)$
$$ \mathcal{L}_\text{csrd} = \frac{1}{H\times W}\sum^H_i \sum^W_j || \hat{Y}^T_{i, j} - \hat{F}^S_{r i,j} ||_1 $$

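¶ Object-level points indicate the presence and velocity of individual objects.
However, because radar has low resolution and sparse density, it cannot directly capture the overall shape of an object.

† LiDAR provides overall geometric information rather than simple object-level detection.

‡ The module consists of three blocks of 3×3 convolution, batch normalization, and ReLU activation,
followed by a 1×1 convolution that projects to the calibrated feature map.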
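
As a rough sketch of the CSRD loss, the snippet below computes the mean absolute difference between a teacher heatmap and a calibrated radar feature map. It assumes both are single-channel maps on the same H×W BEV grid and does not model the calibration module itself; the name `csrd_loss` and the toy values are hypothetical.

```python
import numpy as np

def csrd_loss(teacher_heatmap, calibrated_radar):
    """L1 difference averaged over the H x W BEV grid.

    Both inputs are assumed to be single-channel maps of shape (H, W).
    """
    if teacher_heatmap.shape != calibrated_radar.shape:
        raise ValueError("heatmap and radar feature map must share a shape")
    h, w = teacher_heatmap.shape
    return float(np.abs(teacher_heatmap - calibrated_radar).sum() / (h * w))

# Toy 2 x 3 BEV grid: the teacher marks one object cell.
y_teacher = np.array([[0.0, 1.0, 0.0],
                      [0.0, 0.0, 0.0]])
# Hypothetical calibrated radar response: weak on the object, one false alarm.
f_radar = np.array([[0.0, 0.5, 0.0],
                    [0.0, 0.0, 0.5]])

loss = csrd_loss(y_teacher, f_radar)
print(loss)
```

The loss penalizes both the under-response on the true object cell and the spurious response elsewhere, which is how the sparse radar features are pushed towards the teacher's scene-level object distribution.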

Mask-Scaling Feature Distillation (MSFD)

Feature distillation is proposed to align the camera feature maps and the fused feature maps.
- In 3D object detection, the commonly used direct feature imitation does not work well because of the large imbalance between foreground and background. Masks are therefore used to distill only the information in foreground regions.
- In addition, the boundary regions of the foreground are known to be effective for KD.
- Combining these, the paper proposes Mask-Scaling Feature Distillation (MSFD), which adaptively adjusts the mask according to each object's range and motion. This compensates for alignment errors that can arise in the BEV transformation.
  $$ \mathcal{L}_\text{msfd} = \frac{1}{H \times W} \sum^H_i \sum^W_j M_{i, j} || F^T_{i,j} - F^S_{i, j} ||_2 $$
- The MSFD loss is computed on the gated camera feature maps $(\tilde{F}^T_c, \tilde{F}^S_c)$ and the fused feature maps $(F^T_f, F^S_f)$.
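
A minimal sketch of the masked loss is shown below. The loss follows the equation above, but the mask construction is an assumption: the review does not give the scaling rule, so `scaled_box_mask` simply enlarges each box about its center by a caller-supplied factor. The function names and the factor 2.0 are hypothetical.

```python
import numpy as np

def scaled_box_mask(h, w, boxes, scales):
    """Binary foreground mask with each box enlarged about its center.

    boxes: (row_min, row_max, col_min, col_max) in grid cells, inclusive.
    scales: one enlargement factor per box (1.0 keeps the box unchanged).
    The scaling rule is hypothetical; the paper ties it to range and motion.
    """
    mask = np.zeros((h, w))
    rows = np.arange(h)[:, None]
    cols = np.arange(w)[None, :]
    for (r0, r1, c0, c1), s in zip(boxes, scales):
        center_r, center_c = (r0 + r1) / 2.0, (c0 + c1) / 2.0
        half_r = s * (r1 - r0 + 1) / 2.0
        half_c = s * (c1 - c0 + 1) / 2.0
        inside = (np.abs(rows - center_r) <= half_r) & (np.abs(cols - center_c) <= half_c)
        mask[inside] = 1.0
    return mask

def msfd_loss(f_teacher, f_student, mask):
    """Masked per-cell L2 distance, averaged over the H x W grid."""
    dist = np.linalg.norm(f_teacher - f_student, axis=0)  # (H, W)
    h, w = mask.shape
    return float((mask * dist).sum() / (h * w))

# One 2 x 2 object on a 6 x 6 grid, first unscaled, then enlarged.
boxes = [(2, 3, 2, 3)]
mask_plain = scaled_box_mask(6, 6, boxes, scales=[1.0])
mask_scaled = scaled_box_mask(6, 6, boxes, scales=[2.0])

f_teacher = np.ones((2, 6, 6))
f_student = np.zeros((2, 6, 6))
loss_plain = msfd_loss(f_teacher, f_student, mask_plain)
loss_scaled = msfd_loss(f_teacher, f_student, mask_scaled)
print(mask_plain.sum(), mask_scaled.sum(), loss_plain, loss_scaled)
```

Enlarging the mask lets the loss cover the boundary region around an object, which is where misalignment from the BEV transformation is most likely to place the student's features.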

Relation Distillation (RelD)

Because maintaining scene-level geometric relations between the teacher and student models is important, the RelD loss is proposed.
- To this end, the cosine similarity of the fused feature map is computed.
  $$ C_{i, j} = \frac{F_i^\top F_j}{|| F_i ||_2 \cdot || F_j ||_2} $$
- The scene-level information gap between the teacher and student is measured with the $L_1$ norm.
  $$ \mathcal{L}_\text{reld} = \frac{1}{H\times W}\sum^H_{i=1}\sum^W_{j=1}|| C^T_{i,j} - C^S_{i, j} ||_1 $$
- To distill scene-level relation information at different scales, a downsampling operation and a convolutional block are applied and a multi-scale RelD loss is computed.
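
The relation loss can be sketched as below, with several assumptions: `i` and `j` are treated as flattened BEV cell indices, the loss is averaged over all entries of the affinity matrix (the normalization in the equation above may differ), and 2×2 average pooling stands in for the downsampling plus convolutional block. All function names are hypothetical.

```python
import numpy as np

def affinity(feature_map, eps=1e-8):
    """Pairwise cosine similarity between all BEV cells.

    feature_map: (C, H, W) -> affinity matrix of shape (H*W, H*W).
    """
    c = feature_map.shape[0]
    flat = feature_map.reshape(c, -1).T                  # (H*W, C)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, eps)                 # avoid division by zero
    return unit @ unit.T

def reld_loss(f_teacher, f_student):
    """Mean absolute difference between teacher and student affinities."""
    return float(np.abs(affinity(f_teacher) - affinity(f_student)).mean())

def avg_pool2(x):
    """2x2 average pooling, a stand-in for the paper's downsampling block."""
    c, h, w = x.shape
    x = x[:, : h // 2 * 2, : w // 2 * 2]
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def multi_scale_reld(f_teacher, f_student, num_scales=2):
    """Sum the relation loss over progressively downsampled feature maps."""
    total = 0.0
    for _ in range(num_scales):
        total += reld_loss(f_teacher, f_student)
        f_teacher, f_student = avg_pool2(f_teacher), avg_pool2(f_student)
    return total

rng = np.random.default_rng(0)
f_t = rng.normal(size=(3, 4, 4))
f_s = rng.normal(size=(3, 4, 4))
print(reld_loss(f_t, f_s), multi_scale_reld(f_t, f_s))
```

Since cosine similarity ignores feature magnitude, this loss only compares how cells relate to each other, so a student whose features are a rescaled copy of the teacher's incurs essentially no penalty.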

Response Distillation (RespD)

Radar has the advantage of measuring velocity directly via the Doppler effect. Therefore, larger weights (i.e., higher priority) are set for dynamic classes so that the student CR model detects moving objects more effectively.
- The classification loss and regression loss are as follows.
  $$ \mathcal{L}_\text{cls} = \sum^K_{i=1}\text{QFL}\left( P^T_{C_i}, P^S_{C_i} \right) \times w_i $$
  $$ \mathcal{L}_\text{reg} = \sum^K_{i=1}\text{SmoothL1}\left( P^T_{B_i}, P^S_{B_i} \right) \times w_i $$
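
The class-weighted losses above can be sketched as follows. Several details are assumptions rather than the paper's settings: the QFL exponent `beta=2.0` is the default from the original Quality Focal Loss formulation, the Smooth L1 threshold is 1.0, and the class weights (2.0 for a dynamic class, 1.0 for a static class) are hypothetical values chosen for illustration.

```python
import numpy as np

def quality_focal_loss(p_teacher, p_student, beta=2.0, eps=1e-12):
    """QFL with the teacher's probabilities used as soft labels."""
    p = np.clip(p_student, eps, 1.0 - eps)
    bce = -(p_teacher * np.log(p) + (1.0 - p_teacher) * np.log(1.0 - p))
    return float((np.abs(p_teacher - p) ** beta * bce).sum())

def smooth_l1(x, y, beta=1.0):
    """Quadratic below beta, linear above."""
    d = np.abs(x - y)
    return float(np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).sum())

def respd_losses(cls_t, cls_s, box_t, box_s, class_weights):
    """Class-weighted classification and regression distillation losses.

    cls_t, cls_s: (K, N) class probabilities; box_t, box_s: (K, N, D) boxes.
    class_weights: K weights, larger for dynamic classes.
    """
    l_cls = sum(w * quality_focal_loss(cls_t[i], cls_s[i])
                for i, w in enumerate(class_weights))
    l_reg = sum(w * smooth_l1(box_t[i], box_s[i])
                for i, w in enumerate(class_weights))
    return l_cls, l_reg

# Hypothetical weights: class 0 is dynamic (e.g. car), class 1 is static (e.g. barrier).
class_weights = [2.0, 1.0]
cls_t = np.array([[0.9, 0.1], [0.8, 0.2]])
cls_s = cls_t.copy()                 # student matches the teacher's scores
box_t = np.zeros((2, 2, 3))
box_s = box_t.copy()
box_s[0, 0, 0] = 0.5                 # regression error on the dynamic class

l_cls, l_reg = respd_losses(cls_t, cls_s, box_t, box_s, class_weights)
print(l_cls, l_reg)
```

The same regression error placed on the static class would contribute half as much, which is how the weighting steers the student towards moving objects.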

Overall Loss Function

$$ \mathcal{L} = \lambda_1 \cdot \mathcal{L}_\text{csrd} + \lambda_2 \cdot \mathcal{L}_\text{msfd} + \lambda_3 \cdot \mathcal{L}_\text{reld} + \lambda_4 \cdot \mathcal{L}_\text{respd} + \lambda_5 \cdot \mathcal{L}_\text{det} $$
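
For completeness, the weighted sum can be written as a small helper. The loss values and the $\lambda$ weights below are placeholders, not the values used in the paper, and `total_loss` is a hypothetical name.

```python
def total_loss(losses, weights):
    """Weighted sum of the named loss terms."""
    if set(losses) != set(weights):
        raise KeyError("losses and weights must have the same keys")
    return sum(weights[name] * value for name, value in losses.items())

# Placeholder values for illustration only.
losses = {"csrd": 0.2, "msfd": 0.5, "reld": 0.1, "respd": 0.4, "det": 1.0}
weights = {"csrd": 0.5, "msfd": 1.0, "reld": 1.0, "respd": 1.0, "det": 1.0}

loss = total_loss(losses, weights)
print(loss)
```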

Attribution-NonCommercial-NoDerivatives
