The Area Under the ROC Curve (AUC) is a well-known metric for evaluating instance-level long-tail learning problems. In the past two decades, many AUC optimization methods have been proposed to improve model performance under long-tail distributions. In this paper, we explore AUC optimization methods in the context of pixel-level long-tail semantic segmentation, a much more complicated scenario. This task introduces two major challenges for AUC optimization techniques. On one hand, AUC optimization in a pixel-level task involves complex coupling across loss terms, with structured inner-image and pairwise inter-image dependencies, complicating theoretical analysis. On the other hand, we find that mini-batch estimation of AUC loss in this case requires a larger batch size, resulting in an unaffordable space complexity. To address these issues, we develop a pixel-level AUC loss function and conduct a dependency-graph-based theoretical analysis of the algorithm's generalization ability. Additionally, we design a Tail-Classes Memory Bank (T-Memory Bank) to manage the significant memory demand. Finally, comprehensive experiments across various benchmarks confirm the effectiveness of our proposed AUCSeg method.
AUCSeg is a generic optimization method that can be directly applied to any SOTA backbone for semantic segmentation. Specifically, AUCSeg includes two crucial components: (1) AUC optimization where a theoretically grounded loss function is explored for pixel-level semantic segmentation, and (2) Tail-class Memory Bank, an effective augmentation scheme to ensure efficient optimization of the proposed AUC loss.
Semantic segmentation is a multi-class classification task. Therefore, to apply AUC, we follow a popular multi-class AUC formulation, i.e., the One vs. One (ovo) strategy, which averages binary AUC scores over class pairs. Specifically, we denote the height and width of the images as $H$ and $W$, respectively. Let $\mathcal{D}^p = \{(\textbf{X}_{u,v}^i,\textbf{Y}_{u,v}^i)| i \in [1,n],u \in [0, H-1],v \in [0, W-1]\}$ be the set of all pixels, where $n$ is the number of images. The $j$-th element ($j \in [1,n \times H \times W]$) in $\mathcal{D}^p$ is abbreviated as $(\textbf{X}_j^p,\textbf{Y}_j^p)$ for convenience. Given the model prediction $f_{\theta}=(f_{\theta}^{(1)}, \dots, f_{\theta}^{(K)})$, $\forall c \in [K]$, $f_{\theta}^{(c)} \in [0,1]$, where $f_{\theta}^{(c)}$ serves as a continuous score function supporting class $c$ and $K$ denotes the total number of classes, our optimization objective is defined as follows: $$ \ell_{auc}:= \sum\limits_{c=1}^{K}\sum\limits_{c' \neq c}\sum\limits_{\textbf{X}_{m}^p \in \mathcal{N}_{c}}\sum\limits_{\textbf{X}_{n}^p \in \mathcal{N}_{c'}}\frac{1}{|\mathcal{N}_{c}||\mathcal{N}_{c'}|}\ell_{sq}^{c,c',m,n}, $$ where we adopt the widely used square loss $\ell_{sq}(x)=(1-x)^2$ as the surrogate loss; $\ell_{sq}^{c,c',m,n}:= \ell_{sq}(f_{\theta}^{(c)}(\textbf{X}_{m}^p)-f_{\theta}^{(c)}(\textbf{X}_{n}^p))$; $\mathcal{N}_{c}=\{\textbf{X}_{k}^p|\textbf{Y}_{k}^p=c \}$ represents the set of pixels with label $c$ in the set $\mathcal{D}^p$, and $|\mathcal{N}_{c}|$ denotes the size of the set.
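The objective above can be sketched in a few lines of NumPy. This is a minimal illustration of the pairwise OvO square-surrogate loss, not the authors' implementation; the function name and the flattened `(P, K)` score layout are assumptions for the example.

```python
import numpy as np

def pixel_auc_ovo_loss(scores, labels):
    """Pixel-level OvO AUC surrogate with the square loss ell_sq(x) = (1 - x)^2.

    scores: (P, K) array of per-pixel class scores in [0, 1]
            (P = number of pixels, K = number of classes).
    labels: (P,) integer array of pixel labels in [0, K-1].
    """
    K = scores.shape[1]
    total = 0.0
    for c in range(K):
        pos = scores[labels == c, c]       # f^(c) on pixels of class c
        if pos.size == 0:
            continue
        for cp in range(K):
            if cp == c:
                continue
            neg = scores[labels == cp, c]  # f^(c) on pixels of class c'
            if neg.size == 0:
                continue
            # all positive-negative pairs, normalized by |N_c| * |N_c'|
            diff = pos[:, None] - neg[None, :]
            total += np.mean((1.0 - diff) ** 2)
    return total
```

With perfectly separated scores the loss is 0, since every pairwise score difference equals 1 and $\ell_{sq}(1)=0$.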
The proposed loss enjoys a well-guaranteed generalization bound. Specifically, let $\mathbb{E}_{\mathcal{D}}\left[\hat{\mathcal{L}}_{\mathcal{D}}\left(f\right)\right]$ be the population risk of $\hat{\mathcal{L}}_{\mathcal{D}}\left(f\right)$. Assume $\mathcal{F} \subseteq \{ f:\mathcal{X} \to \mathbb{R}^{H \times W \times K} \}$, where $H$ and $W$ represent the height and width of the image, and $K$ represents the number of categories; $\hat{\mathcal{L}}^{(i)}$ is the risk over the $i$-th sample and is $\mu$-Lipschitz with respect to the $l_{\infty}$ norm (i.e., $\Vert \hat{\mathcal{L}}(x)-\hat{\mathcal{L}}(y)\Vert_\infty \le \mu\cdot \Vert x-y \Vert_{\infty}$). Then there exist three constants $A>0$, $B>0$ and $C>0$ such that the following generalization bound holds with probability at least $1-\delta$ over a random draw of i.i.d. training data (at the image level): $$ \begin{aligned} \left|\hat{\mathcal{L}}_{\mathcal{D}}\left(f\right)-\mathbb{E}_{\mathcal{D}}\left[\hat{\mathcal{L}}_{\mathcal{D}}\left(f\right)\right]\right| \leq \frac{8}{N}+&\frac{\eta_{\text{inner}}+\eta_{\text{inter}}}{\sqrt{N}}\sqrt{A\log\left(2B\mu\tau Nk+C\right)} \\ &+3\left(\sqrt{\frac{1}{2N}}+K\sqrt{1-\frac{1}{N}}\right)\sqrt{\log \left(\frac{4K(K-1)}{\delta}\right)}, \end{aligned} $$ where $$ \eta_{\text{inner}}=\frac{48\mu\tau\ln N}{N}, \quad \eta_{\text{inter}}=2\sqrt{2}\tau,\quad \tau=\left(\max\limits_{c\in[K]}\frac{n_{\max}^{(c)}}{n_{\text{mean}}^{(c)}}\right)^2, $$ $n_{\max}^{(c)} = \max_{\mathbf{X}} n(\mathbf{X}^{(c)})$, $n_{\text{mean}}^{(c)} = \frac{1}{N}\sum_{i=1}^{N}n(\mathbf{X}_i^{(c)})$, $N=\left|\mathcal{D}\right|$, $k=H\times W$, and $\mathbf{X}^{(c)}$ represents the pixels of class $c$ in image $\mathbf{X}$.
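The key data-dependent quantity in the bound is the imbalance coefficient $\tau$, which grows when some class occupies far more pixels in one image than on average. A small sketch of how $\tau$ could be computed from per-image, per-class pixel counts (function name and input layout are assumptions for illustration):

```python
import numpy as np

def imbalance_coefficient(pixel_counts):
    """tau = (max_c n_max^(c) / n_mean^(c))^2.

    pixel_counts: (N, K) array where entry [i, c] is the number of
    pixels of class c in image i.
    """
    counts = np.asarray(pixel_counts, dtype=float)
    n_max = counts.max(axis=0)    # n_max^(c): largest count over images
    n_mean = counts.mean(axis=0)  # n_mean^(c): average count over images
    return float(np.max(n_max / n_mean) ** 2)
```

For a balanced dataset where each class covers the same number of pixels in every image, $n_{\max}^{(c)} = n_{\text{mean}}^{(c)}$ and $\tau = 1$, giving the tightest bound.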
Stochastic AUC optimization requires at least one pixel from each class in every mini-batch. Under a long-tail distribution, this forces a large batch size during training, resulting in an overwhelming computational burden.
Therefore, we introduce a novel Tail-class Memory Bank (T-Memory Bank) to efficiently optimize the loss while managing GPU usage effectively. The high-level ideas behind the T-Memory Bank are as follows: (1) identify any tail classes missing from all images within a mini-batch, and (2) randomly replace certain pixels in those images with historical class information stored in the T-Memory Bank. Specifically, the T-Memory Bank comprises three main components, detailed in the paper.
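The two high-level steps above can be sketched as follows. This is a simplified, hypothetical illustration assuming a patch-based store with FIFO eviction; the class name, methods, and storage format are not the paper's exact implementation.

```python
import random

class TMemoryBank:
    """Illustrative tail-class memory bank: store recently seen pixel
    regions per tail class and paste one into the mini-batch whenever
    that class is absent from it."""

    def __init__(self, tail_classes, capacity=16):
        self.tail_classes = set(tail_classes)
        self.capacity = capacity
        self.bank = {c: [] for c in tail_classes}  # class -> stored regions

    def update(self, region, cls):
        """Store a pixel region of tail class `cls` seen during training."""
        if cls in self.tail_classes:
            slot = self.bank[cls]
            if len(slot) >= self.capacity:
                slot.pop(0)  # evict the oldest entry (FIFO)
            slot.append(region)

    def fill_missing(self, classes_in_batch, paste_fn):
        """For every tail class missing from the mini-batch, paste a
        randomly chosen stored region via the caller's `paste_fn`."""
        for c in self.tail_classes - set(classes_in_batch):
            if self.bank[c]:
                paste_fn(random.choice(self.bank[c]), c)
```

In use, `update` would be called as tail-class pixels are encountered, and `fill_missing` once per mini-batch before computing the AUC loss, so that every class pair in the loss has a non-empty pixel set.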
Quantitative comparison of AUCSeg with 13 recent semantic segmentation methods and 6 long-tail approaches on three typical datasets. The champion and the runner-up are highlighted in bold and underline, respectively. Our AUCSeg outperforms all competitors in most scenarios, especially on overall and tail classes.
AUCSeg is a plug-and-play component that can be integrated into almost any existing backbone. The table below presents experimental results on various backbones, demonstrating the superiority of our AUCSeg for long-tailed semantic segmentation.
Refer to the PDF paper linked above for more details on qualitative, quantitative, and ablation studies.
@inproceedings{han2024aucseg,
title={AUCSeg: AUC-oriented Pixel-level Long-tail Semantic Segmentation},
author={Boyu Han and Qianqian Xu and Zhiyong Yang and Shilong Bao and Peisong Wen and Yangbangyan Jiang and Qingming Huang},
booktitle={Advances in Neural Information Processing Systems},
year={2024}
}