TediGAN: Text-Guided Diverse Image Generation and Manipulation

Weihao Xia1      Yujiu Yang1      Jing-Hao Xue2      Baoyuan Wu3

Abstract



In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: a StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module trains an image encoder to map real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity module learns text-image matching by mapping images and texts into a common embedding space. The instance-level optimization is for identity preservation in manipulation. Our model can provide the lowest effect guarantee and produce diverse, high-quality images at an unprecedented resolution of $\text{1024}^2$. Using a control mechanism based on style mixing, TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels, with or without instance (text or real image) guidance. To facilitate text-guided multi-modal synthesis, we propose Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation maps, sketches, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method.

Materials


  • arxiv

  • github

  • video

Framework



In this paper, for the first time, we propose a GAN inversion technique that can map multi-modal information, e.g., texts, sketches, or labels, into a common latent space of a pretrained StyleGAN. Based on that, we propose a simple yet effective method for text-guided diverse image generation and manipulation via GAN, abbreviated as TediGAN.
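Because every modality is inverted into the same StyleGAN latent space, codes obtained from different inputs can be combined layer-wise. The snippet below is only a minimal sketch of such style-mixing control, not the official implementation: the number of style layers, the crossover point, and the (commented-out) generator call are assumptions.

```python
# Minimal sketch of style-mixing control over latent codes from different modalities.
import torch

NUM_LAYERS, LATENT_DIM = 18, 512  # StyleGAN at 1024^2 uses 18 style layers

def style_mix(w_structure: torch.Tensor,
              w_attribute: torch.Tensor,
              crossover: int = 8) -> torch.Tensor:
    """Take the coarse/middle layers (structure, e.g. from a sketch or label map)
    from one code and the remaining layers (appearance, e.g. from text) from another."""
    assert w_structure.shape == w_attribute.shape == (NUM_LAYERS, LATENT_DIM)
    mixed = w_attribute.clone()
    mixed[:crossover] = w_structure[:crossover]
    return mixed

# Usage with stand-in codes (in practice these come from the inversion modules):
w_from_sketch = torch.randn(NUM_LAYERS, LATENT_DIM)  # hypothetical inverted sketch code
w_from_text = torch.randn(NUM_LAYERS, LATENT_DIM)    # hypothetical projected text code
w_mixed = style_mix(w_from_sketch, w_from_text)
# image = stylegan_generator(w_mixed.unsqueeze(0))   # pretrained generator, not shown
```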



Our proposed method introduces three novel modules. The first, a StyleGAN inversion module, trains an image encoder that maps a real image into the $\mathcal{W}$ space. The second, a visual-linguistic similarity module, learns linguistic representations that are consistent with the visual ones by projecting both images and texts into this common $\mathcal{W}$ space. The third, an instance-level optimization module, preserves identity after editing: it precisely manipulates the attributes described in the text while faithfully reconstructing the unconcerned ones. Our method can generate diverse and high-quality results at resolutions up to $\text{1024}^2$ and inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels, with or without instance (text or real image) guidance. Thanks to the pretrained StyleGAN model, our method provides the lowest effect guarantee, i.e., it always produces pleasing results no matter how uncommon the given text or image is.
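The sketch below illustrates how the second and third modules fit together, under assumed interfaces: `image_encoder`, `text_encoder`, and `generator` are stand-ins for the trained inversion encoder, the text encoder, and the pretrained StyleGAN; none of the names or hyperparameters come from the released code.

```python
# Minimal sketch of visual-linguistic similarity and instance-level optimization.
import torch
import torch.nn.functional as F

def visual_linguistic_loss(image, caption_tokens, image_encoder, text_encoder):
    """Pull the projected text code toward the inverted image code in the common W space."""
    w_image = image_encoder(image)          # e.g. (B, 18, 512) inverted latent code
    w_text = text_encoder(caption_tokens)   # e.g. (B, 18, 512) projected text code
    return F.mse_loss(w_text, w_image)

def instance_level_optimization(w_init, target_image, generator, steps=200, lr=0.01):
    """Refine the latent code so that text-relevant attributes change while the
    unconcerned regions (and hence identity) stay close to the input image."""
    w = w_init.clone().requires_grad_(True)
    opt = torch.optim.Adam([w], lr=lr)
    for _ in range(steps):
        recon = generator(w)
        loss = F.l1_loss(recon, target_image)  # pixel reconstruction term
        # in practice a perceptual / identity term is added here (omitted)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return w.detach()
```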

Multi-Modal CelebA-HQ Dataset

To fill the gap in text-to-image synthesis datasets for faces, we create the Multi-Modal CelebA-HQ dataset to facilitate the research community. Following the format of the two popular text-to-image synthesis datasets, i.e., CUB for birds and COCO for natural scenes, we create ten unique descriptions for each image in the CelebA-HQ dataset. Besides real faces and textual descriptions, the introduced dataset also contains a label map and a sketch for each image, enabling text-guided generation with multi-modal inputs.
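A minimal sketch of how the four modalities could be paired per image is shown below. The directory layout and file names are assumptions for illustration, not the released structure; adjust the paths to match the actual download.

```python
# Hypothetical loader pairing image, label map, sketch, and captions for one sample.
from pathlib import Path
from PIL import Image

class MMCelebAHQSample:
    def __init__(self, root: str, index: int):
        root = Path(root)
        self.image = Image.open(root / "images" / f"{index}.jpg")
        self.label_map = Image.open(root / "labels" / f"{index}.png")
        self.sketch = Image.open(root / "sketches" / f"{index}.png")
        # ten textual descriptions per image, assumed one per line
        self.captions = (root / "texts" / f"{index}.txt").read_text().splitlines()

# sample = MMCelebAHQSample("multi_modal_celeba_hq", 0)
```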


Results

Our method achieves text-guided diverse image generation and manipulation at an unprecedented resolution of up to $\text{1024}^2$.


Diverse high-resolution ($\text{1024}^2$) results from the text "a smiling young woman with short blonde hair"


Diverse high-resolution ($\text{1024}^2$) results from text and label "he is young and wears beard"


Diverse high-resolution ($\text{1024}^2$) results from text and sketch "a young woman with long black hair"

Citation

If you find our work, code, or pre-trained models helpful for your research, please consider citing:

@inproceedings{xia2020tedigan,
  author    = {Xia, Weihao and Yang, Yujiu and Xue, Jing-Hao and Wu, Baoyuan},
  title     = {TediGAN: Text-Guided Diverse Image Generation and Manipulation},
  booktitle = {CVPR},
  year      = {2021}
}