[Paper] Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction Without Convolutions

ju_young 2022. 10. 3. 01:41

Introduction

As shown in (b) of the figure above, ViT has a columnar structure and works on coarse image patches. For pixel-level dense prediction (object detection, segmentation), this leads to the following limitations.

The output feature map is single-scale and low-resolution.
Computation and memory costs are relatively high.

To overcome these limitations, the paper proposes the Pyramid Vision Transformer (PVT). Panel (c) of the figure above shows PVT, which addresses the following difficulties of the conventional Transformer.

It takes fine-grained image patches as input (e.g., 4x4 pixels per patch), so it can learn high-resolution representations.
It introduces a progressive shrinking pyramid to reduce the Transformer's sequence length.
It adopts spatial-reduction attention (SRA) to reduce the resources needed to learn high-resolution features.

Overall, PVT has the following advantages.

In a conventional CNN backbone the local receptive field grows with network depth, whereas PVT always produces a global receptive field, which makes it well suited to detection and segmentation tasks.
It plugs easily into dense prediction pipelines such as RetinaNet and Mask R-CNN.
Combining PVT with other task-specific Transformer decoders gives a convolution-free pipeline (e.g., PVT+DETR).

Pyramid Vision Transformer (PVT)

1. Overall Architecture

As shown in the figure above, PVT has four stages, much like a CNN backbone, and each stage generates a feature map at a different scale. All stages share a similar structure, consisting of a patch embedding layer and Transformer encoder layers.

Stage 1: given an input image of size $H \times W \times 3$, the image is first divided into $\frac{HW}{4^2}$ patches, each of size $4 \times 4 \times 3$. The patches are flattened and fed to a linear projection, which gives embedded patches of size $\frac{HW}{4^2} \times C_1$. The embedded patches, with a position embedding added, then pass through the Transformer encoder layers, and the output is reshaped into a feature map $F_1$ of size $\frac{H}{4} \times \frac{W}{4} \times C_1$. In the same way, each later stage takes the feature map from the previous stage as input and produces $F_2$, $F_3$, and $F_4$. The strides of the four stages with respect to the input image are 4, 8, 16, and 32.
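The stage-1 shapes above can be checked with a minimal NumPy sketch. The function name `patch_embed`, the image size, and the random weights are assumptions made for illustration; the encoder layers themselves are left out, so only the patch split, the linear projection, and the final reshape are shown.

```python
import numpy as np

def patch_embed(image, patch_size, weight, bias):
    """Split an (H, W, C) array into non-overlapping patches and project each one.

    Returns an array of shape ((H / P) * (W / P), embed_dim).
    """
    H, W, C = image.shape
    P = patch_size
    assert H % P == 0 and W % P == 0, "H and W must be divisible by the patch size"
    # (H/P, P, W/P, P, C) -> (H/P, W/P, P, P, C) -> (H/P * W/P, P*P*C)
    patches = image.reshape(H // P, P, W // P, P, C).transpose(0, 2, 1, 3, 4)
    patches = patches.reshape((H // P) * (W // P), P * P * C)
    return patches @ weight + bias

rng = np.random.default_rng(0)
H, W, C1 = 32, 48, 64                                   # hypothetical sizes
image = rng.standard_normal((H, W, 3))
weight = rng.standard_normal((4 * 4 * 3, C1)) * 0.02    # random stand-in for learned weights
bias = np.zeros(C1)

tokens = patch_embed(image, 4, weight, bias)            # (HW / 4^2, C1)
# Position embedding and the Transformer encoder layers would be applied here.
F1 = tokens.reshape(H // 4, W // 4, C1)                 # (H/4, W/4, C1)
print(tokens.shape, F1.shape)
```

With a 32 x 48 input this gives 96 tokens of dimension 64, reshaped into an 8 x 12 x 64 feature map.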

2. Feature Pyramid for Transformer

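PVT uses a progressive shrinking strategy to control the scale of the feature maps. As described in Overall Architecture above, stage $i$ splits its input into patches, flattens each patch, and projects it to a $C_i$-dimensional embedding. The embedded patches can then be reshaped into a feature map whose height and width are smaller than those of the input. Because the patch size is chosen per stage, the feature map scale at each stage can be adjusted flexibly.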
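As a rough illustration of the shrinking pyramid, the sketch below tracks the feature map shape through four stages. The patch sizes 4, 2, 2, 2 are an assumption chosen so that the strides come out as 4, 8, 16, 32; the channel widths and the helper name `pyramid_shapes` are also illustrative and do not come from this post.

```python
def pyramid_shapes(height, width, patch_sizes, channels):
    """Return the (H_i, W_i, C_i) feature-map shape after each stage."""
    shapes = []
    h, w = height, width
    for p, c in zip(patch_sizes, channels):
        assert h % p == 0 and w % p == 0, "input must be divisible by the patch size"
        h, w = h // p, w // p   # each stage shrinks H and W by its patch size
        shapes.append((h, w, c))
    return shapes

# Assumed settings for illustration (see the note above).
shapes = pyramid_shapes(224, 224, patch_sizes=[4, 2, 2, 2], channels=[64, 128, 320, 512])
for i, (h, w, c) in enumerate(shapes, start=1):
    print(f"F{i}: {h} x {w} x {c}, stride {224 // h}, sequence length {h * w}")
```

For a 224 x 224 input the sequence length drops from 3136 tokens in stage 1 to 49 in stage 4, which is the point of the pyramid.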

3. Transformer Encoder

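PVT replaces the conventional multi-head attention (MHA) layer with spatial-reduction attention (SRA). Like MHA, SRA takes a query (Q), key (K), and value (V) as input and outputs a refined feature. The difference is that SRA reduces the spatial scale of K and V before the attention operation, which substantially cuts computation and memory.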
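To get a feel for the saving, the sketch below counts the entries of the query-key score matrix with and without spatial reduction. It only counts score entries, not full FLOPs, and the reduction ratio of 8 and the function name `attention_score_entries` are assumptions for illustration.

```python
def attention_score_entries(h, w, reduction=1):
    """Number of query-key pairs for an h x w feature map.

    reduction=1 corresponds to standard attention; reduction=R shrinks the
    key/value sequence by a factor of R^2.
    """
    n_queries = h * w
    n_keys = (h // reduction) * (w // reduction)
    return n_queries * n_keys

h, w = 56, 56   # stage-1 feature map of a 224 x 224 input (stride 4)
R = 8           # assumed reduction ratio
full = attention_score_entries(h, w)
reduced = attention_score_entries(h, w, reduction=R)
print(full, reduced, full // reduced)
```

The score matrix shrinks by a factor of $R^2$ (64 here), since only the key/value side of the sequence is reduced while every query position is kept.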

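SRA can be written as follows.

$$\text{SRA}(Q, K, V) = \text{Concat}(\text{head}_0, \ldots, \text{head}_{N_i}) W^O$$

$$\text{head}_j = \text{Attention}(Q W_j^Q, \text{SR}(K) W_j^K, \text{SR}(V) W_j^V)$$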

All of the $W$ terms are linear projection parameters, and $N_i$ is the number of attention heads in stage $i$. Each head therefore has the same dimension, $\frac{C_i}{N_i}$. (Which is presumably why the concat works out...)

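SR() is the operation that reduces the spatial dimension of the input sequence (K, V), and it can be written as follows.

$$\text{SR}(x) = \text{Norm}(\text{Reshape}(x, R_i) W_S)$$

$R_i$ is the reduction ratio of the attention layers in stage $i$. Reshape$(x, R_i)$ reshapes the input sequence $x \in \mathbb{R}^{(H_i W_i) \times C_i}$ into a sequence of size $\frac{H_i W_i}{R_i^2} \times (R_i^2 C_i)$. $W_S \in \mathbb{R}^{(R_i^2 C_i) \times C_i}$ is also a linear projection parameter, which brings the feature dimension back to $C_i$. Norm is layer normalization.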

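Attention is then computed as in the original Transformer, where $d_{head} = \frac{C_i}{N_i}$ is the per-head dimension.

$$\text{Attention}(q, k, v) = \text{Softmax}\left(\frac{q k^{\mathsf{T}}}{\sqrt{d_{head}}}\right) v$$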
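Putting the pieces together, here is a minimal NumPy sketch of SRA for a single image, not the authors' implementation. Several things are assumptions: the tokens are taken to be in row-major order over the $h \times w$ grid, Reshape is interpreted as grouping each $R \times R$ neighbourhood into one token, layer normalization has no learned scale or shift, and all weights are random stand-ins stored in a hypothetical `params` dictionary.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Layer normalization over the last axis (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def spatial_reduction(x, h, w, R, W_s):
    """SR(x) = Norm(Reshape(x, R) W_s) for x of shape (h*w, C)."""
    C = x.shape[-1]
    # Assumed interpretation: group each R x R neighbourhood into one token of size R*R*C.
    x = x.reshape(h // R, R, w // R, R, C).transpose(0, 2, 1, 3, 4)
    x = x.reshape((h // R) * (w // R), R * R * C)
    return layer_norm(x @ W_s)               # (h*w / R^2, C)

def sra(x, h, w, R, num_heads, params):
    """Self-attention where K and V come from the spatially reduced sequence."""
    N, C = x.shape
    d_head = C // num_heads
    kv = spatial_reduction(x, h, w, R, params["W_s"])
    heads = []
    for j in range(num_heads):
        q = x @ params["W_q"][j]             # (N, d_head)
        k = kv @ params["W_k"][j]            # (N / R^2, d_head)
        v = kv @ params["W_v"][j]            # (N / R^2, d_head)
        attn = softmax(q @ k.T / np.sqrt(d_head))   # (N, N / R^2)
        heads.append(attn @ v)
    return np.concatenate(heads, axis=-1) @ params["W_o"]   # (N, C)

rng = np.random.default_rng(0)
h, w, C, R, num_heads = 8, 8, 16, 2, 4       # hypothetical small sizes
d_head = C // num_heads
params = {
    "W_s": rng.standard_normal((R * R * C, C)) * 0.1,
    "W_q": rng.standard_normal((num_heads, C, d_head)) * 0.1,
    "W_k": rng.standard_normal((num_heads, C, d_head)) * 0.1,
    "W_v": rng.standard_normal((num_heads, C, d_head)) * 0.1,
    "W_o": rng.standard_normal((C, C)) * 0.1,
}
x = rng.standard_normal((h * w, C))
out = sra(x, h, w, R, num_heads, params)
print(out.shape)   # (64, 16)
```

The output keeps all 64 query positions, while each head's attention matrix is only 64 x 16 because the key/value sequence was reduced by $R^2 = 4$.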
