The Area Under the ROC Curve (AUC) is a well-known metric for evaluating instance-level long-tail learning problems. In the past two decades, many AUC optimization methods have been proposed to improve model performance under long-tail distributions. In this paper, we explore AUC optimization methods in the context of pixel-level long-tail semantic segmentation, a much more complicated scenario. This task introduces two major challenges for AUC optimization techniques. On one hand, AUC optimization in a pixel-level task involves complex coupling across loss terms, with structured inner-image and pairwise inter-image dependencies, complicating theoretical analysis. On the other hand, we find that mini-batch estimation of AUC loss in this case requires a larger batch size, resulting in an unaffordable space complexity. To address these issues, we develop a pixel-level AUC loss function and conduct a dependency-graph-based theoretical analysis of the algorithm's generalization ability. Additionally, we design a Tail-Classes Memory Bank (T-Memory Bank) to manage the significant memory demand. Finally, comprehensive experiments across various benchmarks confirm the effectiveness of our proposed AUCSeg method.
AUCSeg is a generic optimization method that can be directly applied to any SOTA backbone for semantic segmentation. Specifically, AUCSeg includes two crucial components: (1) AUC optimization where a theoretically grounded loss function is explored for pixel-level semantic segmentation, and (2) Tail-class Memory Bank, an effective augmentation scheme to ensure efficient optimization of the proposed AUC loss.
Semantic segmentation is a multi-class classification task. Therefore, to apply AUC, we follow a popular multi-class AUC formulation, i.e., the One vs. One (ovo) strategy, which averages binary AUC scores over class pairs. Specifically, we denote the height and width of the images as $H$ and $W$, respectively. Let $\mathcal{D}^p = \{(\textbf{X}_{u,v}^i,\textbf{Y}_{u,v}^i)| i \in [1,n],u \in [0, H-1],v \in [0, W-1]\}$ be the set of all pixels, where $n$ is the number of images. The $j$-th element ($j \in [1,n \times H \times W]$) in $\mathcal{D}^p$ is abbreviated as $(\textbf{X}_j^p,\textbf{Y}_j^p)$ for convenience. Given the model prediction $f_{\theta}=(f_{\theta}^{(1)}, \dots, f_{\theta}^{(K)})$, $\forall c \in [K]$, $f_{\theta}^{(c)} \in [0,1]$, where $f_{\theta}^{(c)}$ serves as a continuous score function supporting class $c$ and $K$ denotes the total number of classes, our optimization objective is defined as follows: $$ \ell_{auc}:= \sum\limits_{c=1}^{K}\sum\limits_{c' \neq c}\sum\limits_{\textbf{X}_{m}^p \in \mathcal{N}_{c}}\sum\limits_{\textbf{X}_{n}^p \in \mathcal{N}_{c'}}\frac{1}{|\mathcal{N}_{c}||\mathcal{N}_{c'}|}\ell_{sq}^{c,c',m,n}, $$ where we adopt the widely used square loss $\ell_{sq}(x)=(1-x)^2$ as the surrogate loss; $\ell_{sq}^{c,c',m,n}:= \ell_{sq}(f_{\theta}^{(c)}(\textbf{X}_{m}^p)-f_{\theta}^{(c)}(\textbf{X}_{n}^p))$; $\mathcal{N}_{c}=\{\textbf{X}_{k}^p|\textbf{Y}_{k}^p=c \}$ represents the set of pixels with label $c$ in the set $\mathcal{D}^p$, and $|\mathcal{N}_{c}|$ denotes the size of the set.
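The objective above can be sketched in a few lines of NumPy. This is a minimal illustration of the pairwise OvO square-surrogate loss, not the authors' implementation; the function name and the flattened `(P, K)` score layout are assumptions for the example.

```python
import numpy as np

def pixel_auc_ovo_loss(scores, labels):
    """Pixel-level OvO AUC surrogate with the square loss ell_sq(x) = (1 - x)^2.

    scores: (P, K) array of per-pixel class scores in [0, 1]
            (P = number of pixels, K = number of classes).
    labels: (P,) integer array of pixel labels in [0, K-1].
    """
    K = scores.shape[1]
    total = 0.0
    for c in range(K):
        pos = scores[labels == c, c]       # f^(c) on pixels of class c
        if pos.size == 0:
            continue
        for cp in range(K):
            if cp == c:
                continue
            neg = scores[labels == cp, c]  # f^(c) on pixels of class c'
            if neg.size == 0:
                continue
            # all positive-negative pairs, normalized by |N_c| * |N_c'|
            diff = pos[:, None] - neg[None, :]
            total += np.mean((1.0 - diff) ** 2)
    return total
```

With perfectly separated scores the loss is 0, since every pairwise score difference equals 1 and $\ell_{sq}(1)=0$.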
The proposed loss enjoys a well-guaranteed generalization bound. Specifically, let $\mathbb{E}_{\mathcal{D}}\left[\hat{\mathcal{L}}_{\mathcal{D}}\left(f\right)\right]$ be the population risk of $\hat{\mathcal{L}}_{\mathcal{D}}\left(f\right)$. Assume $\mathcal{F} \subseteq \{ f:\mathcal{X} \to \mathbb{R}^{H \times W \times K} \}$, where $H$ and $W$ represent the height and width of the image, and $K$ represents the number of categories; $\hat{\mathcal{L}}^{(i)}$ is the risk over the $i$-th sample and is $\mu$-Lipschitz with respect to the $l_{\infty}$ norm (i.e., $\Vert \hat{\mathcal{L}}(x)-\hat{\mathcal{L}}(y)\Vert_\infty \le \mu\cdot \Vert x-y \Vert_{\infty}$). Then there exist three constants $A>0$, $B>0$ and $C>0$ such that the following generalization bound holds with probability at least $1-\delta$ over a random draw of i.i.d. training data (at the image level): $$ \begin{aligned} \left|\hat{\mathcal{L}}_{\mathcal{D}}\left(f\right)-\mathbb{E}_{\mathcal{D}}\left[\hat{\mathcal{L}}_{\mathcal{D}}\left(f\right)\right]\right| \leq \frac{8}{N}+&\frac{\eta_{\text{inner}}+\eta_{\text{inter}}}{\sqrt{N}}\sqrt{A\log\left(2B\mu\tau Nk+C\right)} \\ &+3\left(\sqrt{\frac{1}{2N}}+K\sqrt{1-\frac{1}{N}}\right)\sqrt{\log \left(\frac{4K(K-1)}{\delta}\right)}, \end{aligned} $$ where $$ \eta_{\text{inner}}=\frac{48\mu\tau\ln N}{N}, \quad \eta_{\text{inter}}=2\sqrt{2}\tau,\quad \tau=\left(\max\limits_{c\in[K]}\frac{n_{\max}^{(c)}}{n_{\text{mean}}^{(c)}}\right)^2, $$ $n_{\max}^{(c)} = \max_{\mathbf{X}} n(\mathbf{X}^{(c)})$, $n_{\text{mean}}^{(c)} = \frac{1}{N}\sum_{i=1}^{N}n(\mathbf{X}_i^{(c)})$, $N=\left|\mathcal{D}\right|$, $k=H\times W$, and $\mathbf{X}^{(c)}$ represents the pixels of class $c$ in image $\mathbf{X}$.
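The key data-dependent quantity in the bound is the imbalance coefficient $\tau$, which grows when some class occupies far more pixels in one image than on average. A small sketch of how $\tau$ could be computed from per-image, per-class pixel counts (function name and input layout are assumptions for illustration):

```python
import numpy as np

def imbalance_coefficient(pixel_counts):
    """tau = (max_c n_max^(c) / n_mean^(c))^2.

    pixel_counts: (N, K) array where entry [i, c] is the number of
    pixels of class c in image i.
    """
    counts = np.asarray(pixel_counts, dtype=float)
    n_max = counts.max(axis=0)    # n_max^(c): largest count over images
    n_mean = counts.mean(axis=0)  # n_mean^(c): average count over images
    return float(np.max(n_max / n_mean) ** 2)
```

For a balanced dataset where each class covers the same number of pixels in every image, $n_{\max}^{(c)} = n_{\text{mean}}^{(c)}$ and $\tau = 1$, giving the tightest bound.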
Stochastic AUC optimization requires at least one pixel from each class in every mini-batch. Under a long-tail distribution, this forces a large batch size during training, resulting in an overwhelming computational burden.
Therefore, we introduce a novel Tail-class Memory Bank (T-Memory Bank) to efficiently optimize the loss while managing GPU usage effectively. The high-level ideas behind the T-Memory Bank are as follows: (1) identify any tail classes missing from all images within a mini-batch, and (2) randomly replace certain pixels in those images with historical class information stored in the T-Memory Bank. Specifically, the T-Memory Bank comprises three main components, detailed in the paper.
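The two high-level steps above can be sketched as follows. This is a simplified, hypothetical illustration assuming a patch-based store with FIFO eviction; the class name, methods, and storage format are not the paper's exact implementation.

```python
import random

class TMemoryBank:
    """Illustrative tail-class memory bank: store recently seen pixel
    regions per tail class and paste one into the mini-batch whenever
    that class is absent from it."""

    def __init__(self, tail_classes, capacity=16):
        self.tail_classes = set(tail_classes)
        self.capacity = capacity
        self.bank = {c: [] for c in tail_classes}  # class -> stored regions

    def update(self, region, cls):
        """Store a pixel region of tail class `cls` seen during training."""
        if cls in self.tail_classes:
            slot = self.bank[cls]
            if len(slot) >= self.capacity:
                slot.pop(0)  # evict the oldest entry (FIFO)
            slot.append(region)

    def fill_missing(self, classes_in_batch, paste_fn):
        """For every tail class missing from the mini-batch, paste a
        randomly chosen stored region via the caller's `paste_fn`."""
        for c in self.tail_classes - set(classes_in_batch):
            if self.bank[c]:
                paste_fn(random.choice(self.bank[c]), c)
```

In use, `update` would be called as tail-class pixels are encountered, and `fill_missing` once per mini-batch before computing the AUC loss, so that every class pair in the loss has a non-empty pixel set.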
Quantitative comparison of AUCSeg with 13 recent semantic segmentation methods and 6 long-tail approaches on three typical datasets. The champion and the runner-up are highlighted in bold and underline, respectively. Our AUCSeg outperforms all competitors in most scenarios, especially on overall and tail classes.
AUCSeg is a plug-and-play component that can be integrated into almost any existing backbone. The table below presents experimental results on various backbones, demonstrating the superiority of our AUCSeg for long-tailed semantic segmentation.
Refer to the PDF paper linked above for more details on qualitative, quantitative, and ablation studies.
@inproceedings{han2024aucseg,
title={AUCSeg: AUC-oriented Pixel-level Long-tail Semantic Segmentation},
author={Boyu Han and Qianqian Xu and Zhiyong Yang and Shilong Bao and Peisong Wen and Yangbangyan Jiang and Qingming Huang},
booktitle={Advances in Neural Information Processing Systems},
year={2024}
}