FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation

1Institute of Automation, Chinese Academy of Sciences; 2ByteDance Inc; 3School of Artificial Intelligence, University of Chinese Academy of Sciences

FreeSeg, a generic framework for Unified, Universal and Open-Vocabulary Image Segmentation.

Abstract

Recently, open-vocabulary learning has emerged to accomplish segmentation for arbitrary categories described by text, extending segmentation systems to more general-purpose application scenarios. However, existing methods are devoted to designing specialized architectures or parameters for specific segmentation tasks. This customized design paradigm leads to fragmentation between segmentation tasks, hindering the uniformity of segmentation models. Hence, in this paper we propose FreeSeg, a generic framework to accomplish Unified, Universal and Open-Vocabulary Image Segmentation. FreeSeg optimizes an all-in-one network via one-shot training and employs the same architecture and parameters to handle diverse segmentation tasks seamlessly during inference. Additionally, adaptive prompt learning enables the unified model to capture task-aware and category-sensitive concepts, improving model robustness in multi-task and varied scenarios. Extensive experimental results demonstrate that FreeSeg establishes new state-of-the-art results in performance and generalization on three segmentation tasks, outperforming the best task-specific architectures by a large margin: 5.5% mIoU on semantic segmentation, 17.6% mAP on instance segmentation, and 20.1% PQ on panoptic segmentation for unseen classes on COCO.

Method

The proposed unified open-vocabulary segmentation task aims to optimize an all-in-one model that produces semantic, instance, and panoptic segmentation results on arbitrary categories. To address this task, we propose FreeSeg, a framework that accomplishes unified and universal open-vocabulary segmentation. FreeSeg adopts a two-stage design: the first stage extracts universal mask proposals, and the second stage leverages CLIP to perform zero-shot classification on the masks generated in the first stage.
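
To make the two-stage flow concrete, below is a minimal PyTorch-style sketch of the inference path. It is an illustration under assumptions, not FreeSeg's actual implementation: proposal_net, clip_model, and tokenizer are hypothetical stand-ins, and CLIP preprocessing details such as resizing and normalization are omitted.

    import torch
    import torch.nn.functional as F

    def open_vocab_segment(image, class_names, proposal_net, clip_model, tokenizer):
        # Stage 1: class-agnostic mask proposals for the input image.
        # proposal_net is a hypothetical stand-in returning (N, H, W) soft masks.
        masks = proposal_net(image)

        # Stage 2: zero-shot classification of each masked region with CLIP.
        text_tokens = tokenizer([f"a photo of a {c}" for c in class_names])
        text_emb = F.normalize(clip_model.encode_text(text_tokens), dim=-1)  # (C, D)

        region_embs = []
        for m in masks:
            region = image * m.unsqueeze(0)  # zero out pixels outside the proposal
            emb = clip_model.encode_image(region.unsqueeze(0))  # (1, D)
            region_embs.append(F.normalize(emb, dim=-1))
        region_emb = torch.cat(region_embs)  # (N, D)

        # Cosine similarity between region and text embeddings gives per-mask
        # scores over the arbitrary, text-defined category set.
        logits = region_emb @ text_emb.t()  # (N, C)
        return masks, logits.argmax(dim=-1)

Because the category set enters only through the text embeddings, the same masks can be re-classified against any vocabulary at inference time, which is what makes the framework open-vocabulary.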


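The abstract also mentions adaptive prompt learning for capturing task-aware and category-sensitive concepts. The sketch below shows one generic way to realize such a component, with learnable per-task context vectors prepended to class-name token embeddings in the spirit of CoOp-style prompt learning; it is a hypothetical illustration, not FreeSeg's exact design.

    import torch
    import torch.nn as nn

    class AdaptivePrompt(nn.Module):
        # Learnable task-aware context vectors, one set per task
        # (e.g., semantic / instance / panoptic segmentation).
        def __init__(self, num_tasks=3, ctx_len=8, dim=512):
            super().__init__()
            self.task_ctx = nn.Parameter(0.02 * torch.randn(num_tasks, ctx_len, dim))

        def forward(self, class_emb, task_id):
            # class_emb: (C, L, D) token embeddings of the C class names.
            ctx = self.task_ctx[task_id].unsqueeze(0).expand(class_emb.size(0), -1, -1)
            # Prepend the task context so the text encoder sees task-conditioned prompts.
            return torch.cat([ctx, class_emb], dim=1)  # (C, ctx_len + L, D)

The resulting prompts are fed through the text encoder, so a single set of backbone parameters can produce task-specific text embeddings for all three segmentation tasks.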

Qualitative Results


BibTeX

@inproceedings{qin2023freeseg,
      title={FreeSeg: Unified, Universal and Open-Vocabulary Image Segmentation},
      author={Qin, Jie and Wu, Jie and Yan, Pengxiang and Li, Ming and Ren, Yuxi and Xiao, Xuefeng and Wang, Yitong and Wang, Rui and Wen, Shilei and Pan, Xin and others},
      booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year={2023}
}