Unsupervised Multi-Domain Multimodal Image-to-Image Translation with Explicit Domain-Constrained Disentanglement

Weihao Xia1      Yujiu Yang1      Jing-Hao Xue2

Abstract


Image-to-image translation has drawn great attention during the past few years. It aims to translate an image in one domain to a target image in another domain, and many applications can be formulated as image-to-image translation problems. However, three major challenges remain: 1) the lack of large amounts of aligned training pairs for many tasks; 2) the ambiguity of multiple possible outputs from a single input image; and 3) the lack of simultaneous training for multi-domain translation with a single network. Moreover, we observe from experiments that implicit disentanglement of content and style can lead to undesirable results. In this paper, we therefore propose a unified framework that learns to generate diverse outputs from unpaired training data and allows simultaneous multi-domain translation with a single model. Furthermore, we investigate how to extract domain supervision information so as to learn domain-constrained disentangled representations and achieve better image-to-image translation. Extensive experiments show that the proposed method outperforms or is comparable to state-of-the-art methods on various applications.

Materials


  • arxiv

  • github

Framework



The pipeline of our method: (a) $n$ domains; (b) two batches of images $x\in \mathcal{D_X}$, $y\in \mathcal{D_Y}$ with corresponding specific discriminative labels, randomly selected from two different domains; (c) the first translation; (d) style-swapped images; (e) the second translation; and (f) cycle-reconstructed images. To translate images between domains, we first randomly select two domains and then load two batches of images $x\in \mathcal{D_X}$, $y\in \mathcal{D_Y}$ with their corresponding discriminative labels. Images from different domains are encoded into domain-invariant content representations $c$ and domain-specific style representations $s$. The two translations are performed by swapping the style codes and using the generator $G$ to produce the translated images. The first translation constrains the translated images $x^{\prime}$ and $y^{\prime}$ with the proposed disentanglement-constrained loss; the second translation constrains the image reconstruction with the cycle-consistency loss. Because the representations are disentangled, the style representations are constrained to match a prior Gaussian distribution, so several possible outputs can be generated by randomly sampling from this prior. The domain style representations are extracted by the pre-trained feature extractor $E_{\mathcal{Y}}^s$ from image collections of a given style and are used to constrain the disentanglement of content and style (similarly for $y$, omitted from the diagram for simplicity). Simultaneous multi-domain training is implemented by adding a specific discriminative label for each domain.
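To make the two-stage pipeline concrete, below is a minimal PyTorch-style sketch of the forward pass of one training iteration. The `ContentEncoder`, `StyleEncoder`, and `Generator` modules, their shapes, and the way the style code and domain label are injected are illustrative assumptions for this sketch, not the exact architecture used in the paper.

```python
import torch
import torch.nn as nn

class ContentEncoder(nn.Module):
    """E^c: image -> domain-invariant content feature map (toy stand-in)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, 2, 1), nn.ReLU())
    def forward(self, x):
        return self.net(x)

class StyleEncoder(nn.Module):
    """E^s: image -> low-dimensional domain-specific style code."""
    def __init__(self, style_dim=8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, style_dim))
    def forward(self, x):
        return self.net(x)

class Generator(nn.Module):
    """G: (content, style, domain label) -> image; style and label modulate the content."""
    def __init__(self, style_dim=8, n_domains=3):
        super().__init__()
        self.fc = nn.Linear(style_dim + n_domains, 128)
        self.net = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh())
    def forward(self, c, s, label):
        w = self.fc(torch.cat([s, label], dim=1))    # mix style code and one-hot domain label
        return self.net(c * w[:, :, None, None])     # modulate the content feature map

E_c, E_s, G = ContentEncoder(), StyleEncoder(), Generator()

# Two batches from two randomly chosen domains, with one-hot domain labels.
x, y = torch.randn(4, 3, 64, 64), torch.randn(4, 3, 64, 64)
label_x = torch.eye(3)[torch.zeros(4, dtype=torch.long)]
label_y = torch.eye(3)[torch.ones(4, dtype=torch.long)]

# First translation: encode, swap the style codes, generate style-swapped images.
c_x, s_x = E_c(x), E_s(x)
c_y, s_y = E_c(y), E_s(y)
x2y = G(c_x, s_y, label_y)   # content of x rendered with the style of y
y2x = G(c_y, s_x, label_x)   # content of y rendered with the style of x

# Second translation: re-encode the translated images and swap back,
# giving the cycle-reconstructed images that are compared against x and y.
x_rec = G(E_c(x2y), E_s(y2x), label_x)
y_rec = G(E_c(y2x), E_s(x2y), label_y)
```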



Illustration of (a) self translation, (b) intra-domain translation, and (c) inter-domain translation. For ease of comparison we follow the representation used in [1], and to avoid unnecessary confusion we adapt the descriptions. Our model consists of two types of auto-encoders (denoted by red and blue arrows, respectively). As in [2], the latent code of each auto-encoder is composed of a content code $c$ and a style code $s$. The model is trained with adversarial objectives (dotted lines) that ensure the translated images are indistinguishable from real images in the target domain, as well as bidirectional reconstruction objectives (dashed lines) that reconstruct both images and latent codes.
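The bidirectional reconstruction and adversarial objectives can be written compactly as in the sketch below, which continues the toy modules from the previous snippet. The discriminator `D`, the choice of L1 and least-squares losses, and the equal weighting of terms are assumptions for illustration rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def bidirectional_reconstruction(x, E_c, E_s, G, label_x):
    """Self translation: decode (c, s) of x back into its own domain and require
    that both the image and its latent codes survive the round trip (dashed lines)."""
    c_x, s_x = E_c(x), E_s(x)
    x_self = G(c_x, s_x, label_x)                    # (a) self translation
    loss_img = F.l1_loss(x_self, x)                  # image reconstruction
    loss_c = F.l1_loss(E_c(x_self), c_x.detach())    # content-code reconstruction
    loss_s = F.l1_loss(E_s(x_self), s_x.detach())    # style-code reconstruction
    return loss_img + loss_c + loss_s

def adversarial(D, x_fake, label_y):
    """Generator-side least-squares GAN term: the translated image should look
    real to the target-domain discriminator (dotted lines). D is a placeholder."""
    pred = D(x_fake, label_y)
    return F.mse_loss(pred, torch.ones_like(pred))
```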

Results



Citation

@article{xia2020unsupervised,
  title={Unsupervised Multi-Domain Multimodal Image-to-Image Translation with Explicit Domain-Constrained Disentanglement},
  author={Xia, Weihao and Yang, Yujiu and Xue, Jing-Hao},
  journal={Neural Networks},
  year={2020},
  publisher={Elsevier}
}

References

  1. Xun Huang, Ming-Yu Liu, Serge Belongie, and Jan Kautz, "MUNIT: Multimodal Unsupervised Image-to-Image Translation". ECCV, 2018.
  2. Hsin-Ying Lee*, Hung-Yu Tseng*, Jia-Bin Huang, Maneesh Kumar Singh, and Ming-Hsuan Yang, "Diverse Image-to-Image Translation via Disentangled Representations". ECCV, 2018.