[Paper Translation] Fine-tuning StyleGAN2 for Cartoon Face Generation


Original paper: https://arxiv.org/pdf/2106.12445.pdf

Code: https://github.com/happy-jihye/Cartoon-StyleGan2

Fine-tuning StyleGAN2 for Cartoon Face Generation

Jihye Back
Seoul National University

ABSTRACT

Recent studies have shown remarkable success in unsupervised image-to-image (I2I) translation. However, due to data imbalance, learning the joint distribution of multiple domains remains very challenging. Although existing models can generate realistic target images, it is difficult to maintain the structure of the source image. In addition, training a generative model on large data in multiple domains requires a lot of time and computational resources. To address these limitations, we propose a novel image-to-image translation method that generates images of the target domain by fine-tuning a StyleGAN2 pretrained model. The StyleGAN2 model is suitable for unsupervised I2I translation on unbalanced datasets; it is highly stable, produces realistic images, and even learns properly from limited data when simple fine-tuning techniques are applied. Thus, in this paper, we propose new methods to preserve the structure of the source images and generate realistic images in the target domain. The code and results are available at https://github.com/happy-jihye/Cartoon-StyleGan2

1 INTRODUCTION

The style-based generative model [1, 2] is very effective in image generation, producing high-resolution images by learning not only global attributes but also stochastic details. Noting the power of the style-based architecture, several researchers [3, 4, 5, 6, 7, 8, 9, 10] have used this model for tasks such as image editing and image translation. Furthermore, research on applying transfer learning to this model is also active. Because the style-based architecture trains sequentially from low resolution to high resolution, simple fine-tuning techniques make it easy to transfer a model from a source domain to a target domain. Thus, many unsupervised I2I translation methods have adopted transfer learning to reduce resources, memory, and time.

In this paper, we fine-tune the StyleGAN2 model for unsupervised I2I translation. We propose two methods that make the source and target images look similar, as if they were paired images.

  1. FreezeSG, which freezes the initial blocks of the style vectors and the generator. This is very simple and allows the target image to follow the structure of the source image.
  2. Structure Loss, which reduces the distance between the initial blocks of the source generator and the initial blocks of the target generator. Applying Layer Swapping to models trained with this loss is also remarkably effective.

2 RELATED WORK

Generative Adversarial Networks (GANs)

The GAN framework [11] has shown impressive results in image generation [2, 12, 1], image translation [3, 4, 5, 6], image editing [13, 14, 15], and face image synthesis [2, 1]. A GAN is a generative model trained through an adversarial process between a generator G and a discriminator D: the generator creates fake images to confuse the discriminator so that it cannot recognize whether an image is real or fake. In this paper, we aim to align the source distribution with the target domain using the GAN framework.


Image-to-Image Translation

Recent studies have shown remarkable success in image-to-image (I2I) translation [3, 4, 5, 6], which aims to learn the mapping between an input image and an output image. Pix2Pix [5] is the first framework for I2I translation; it uses cGANs [16] to enable successful I2I transformations for a variety of tasks when a paired dataset is available. Because paired training data are limited, several methods have been proposed to address unsupervised image-to-image translation. UNIT [6] assumes a shared latent space, so that image pairs from different domains can be mapped into it. Other unsupervised I2I methods [7, 8], including CycleGAN [9], introduce a cycle-consistency loss to constrain the source image and the target image. However, these frameworks must be trained on data from both the source domain and the target domain simultaneously. Training involves mapping across multiple domains, which requires a lot of computational resources and time. If the data of the domains are severely imbalanced, mode collapse may also occur. To overcome these limitations, we propose a new unsupervised image-to-image translation method that generates images of the target domain using a StyleGAN2 [2] pretrained model.


Figure 1: We compare the results of FreezeSG with existing models (FreezeD, FreezeG). Here, we swapped the output of the 4th style block.

Transfer Learning

Transfer learning is a very powerful and effective method in computer vision. In transfer learning, a new model is trained using the weights of a pre-trained model. This reduces time and computational resources because training fine-tunes a pre-trained model instead of starting from random initialization. Efforts to apply transfer learning to GANs are also active. Mo et al. [17] proposed FreezeD, which freezes the highest-resolution layers of the discriminator during fine-tuning so that the generator learns the target distribution. Karras et al. [18] proposed an adaptive discriminator augmentation (ADA) mechanism that stabilizes training on limited data. They also showed that ADA performs better when combined with FreezeD. These studies show that applying transfer learning to GANs allows image translation from the source domain to the target domain to be done more efficiently.


Given the effectiveness and power of StyleGAN2 [2] and fine-tuning techniques [17, 18], there has been a recent move to apply them to unsupervised I2I translation. FreezeG [19] argued that freezing the generator's low-resolution layers helps maintain the structure of the source domain. In addition, Pinkney et al. [20] proposed a layer swapping method that combines the high-resolution layers of the FFHQ model with the low-resolution layers of the animation model to generate a photo-realistic face. Huang et al. [21] suggested a FreezeFC mechanism that freezes the 8-layer mapping network $w = f(z),\ w \in \mathcal{W}$, of the StyleGAN2 network during fine-tuning to increase the similarity between the source and target domains.


In this paper, we propose a method for unsupervised image-to-image (I2I) translation by applying transfer learning to a StyleGAN2 pretrained model. Inspired by these works [19, 20, 21], we also propose new methods to preserve the structure of the source images and generate realistic images in the target domain.

3 METHOD

3.1 FREEZESG

Noting the effectiveness of FreezeG [19], we analyze the factors that determine the structure of the generated image and find that not only the early layers of the generator but also the style vectors injected into them are involved. Inspired by this, we freeze the initial blocks of the generator and the initial style vectors injected into them during the fine-tuning of StyleGAN2. We call this simple yet effective method Freeze Style vector and Generator (FreezeSG).

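In code, the freezing step amounts to turning off gradients for the early style blocks. The sketch below assumes a rosinality-style stylegan2-pytorch Generator that exposes its 4x4 block as conv1/to_rgb1 and the later blocks as the lists convs/to_rgbs; the attribute layout and the helper name are illustrative, not the repository's exact API.

```python
def freeze_sg(generator, num_blocks=2):
    """Minimal FreezeSG sketch: freeze the first `num_blocks` style blocks of the
    generator. Freezing each block also freezes its modulation (affine) layers,
    i.e. the style vectors injected into those blocks."""
    frozen = [generator.conv1, generator.to_rgb1]              # 4x4 block
    for i in range(num_blocks - 1):                            # 8x8, 16x16, ... blocks
        frozen += [generator.convs[2 * i], generator.convs[2 * i + 1],
                   generator.to_rgbs[i]]
    for module in frozen:
        for p in module.parameters():
            p.requires_grad = False
```

Only the remaining, unfrozen parameters are then passed to the optimizer during fine-tuning.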

Figure 1 shows that FreezeSG reflects the source image better than freezing the generator alone. It also shows that the structure of the source image is better preserved if the low-resolution layers of the source-domain generator and the high-resolution layers of the target-domain generator are integrated by applying the Layer Swapping (LS) technique. When LS is applied, the images generated by FreezeSG are more similar to the source image than those generated with FreezeG or the baseline (FreezeD + ADA).

3.2 STRUCTURE LOSS

Figure 2: The architecture of our generator (using structure loss). To apply a loss to the RGB outputs of three style blocks (4x4 - 16x16), we calculate the MSE loss between the output of (b) a fixed source generator and the output of (c) the target generator being trained, and sum the losses. (d) shows the integration of the low-resolution layers of the source generator and the high-resolution layers of the target generator by applying the Layer Swapping method.

Figure 3: Comparison of training results using structure loss and the baseline (FreezeD). We used the Layer Swapping technique for l = 2.

In Section 3.1, we showed the effectiveness of FreezeSG. However, since it fixes the weights of the generator's low-resolution layers, it is difficult to obtain meaningful results when layer swapping is applied to those layers. We therefore introduce a simple and effective loss function, which we call structure loss, that yields meaningful results even when layer swapping is performed on low-resolution layers.

Adversarial Loss.

The original GAN [11] is a model trained through the adversarial process of a generator G and a discriminator D. The objective of a GAN can be expressed as

$$\mathcal{L}_{adv} = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))] \qquad (1)$$

where G tries to minimize this objective to learn the distribution of the target domain, and D tries to maximize it to discriminate between the real image x and the fake image G(z).

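For reference, Eq. (1) can be written as the following loss helpers. This is a sketch of the classic formulation; the StyleGAN2 implementation itself uses the non-saturating logistic variant, so the helper names and exact forms here are illustrative.

```python
import torch
import torch.nn.functional as F

def d_adv_loss(real_logits, fake_logits):
    """Discriminator side of Eq. (1): maximize log D(x) + log(1 - D(G(z)))."""
    real = F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits))
    fake = F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits))
    return real + fake

def g_adv_loss(fake_logits):
    """Generator side: fool D (non-saturating form of minimizing log(1 - D(G(z))))."""
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
```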

Structure Loss.

We adopt the input/output skips structure among the three architectures (MSG-GAN, input/output skips, and residual nets) explored for the StyleGAN2 model [2]. Figure 2(a) shows this architecture, which is simplified by upsampling and summing the contributions of the RGB outputs corresponding to different resolutions. Based on the fact that the structure of an image is determined at low resolution, we apply the structure loss to the values of the low-resolution layers so that the generated image is similar to the image in the source domain. The structure loss encourages the RGB outputs of the target generator being fine-tuned to stay close to the RGB outputs of the source generator during training.


Structure loss is computed as follows.

  1. To apply the structure loss to n style blocks, we first extract the RGB outputs of both the source generator and the target generator at each resolution.
  2. We calculate the MSE loss between $G^s_{l=k}(w_s)$ and $G^t_{l=k}(w_t)$ at each resolution and sum over the first n layers:

$$\mathcal{L}_{structure} = \sum_{k=1}^{n} \mathbb{E}\left[\left\| G^s_{l=k}(w_s) - G^t_{l=k}(w_t) \right\|_2^2\right] \qquad (2)$$

where $G^s_{l=k}(w_s)$ is the RGB output of the source generator, $G^t_{l=k}(w_t)$ is the RGB output of the target generator, and $k$ is the index of the style block.
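In code, Eq. (2) reduces to an MSE between the per-resolution RGB (skip) outputs of the two generators. The sketch below assumes the generators have been modified to return these intermediate RGB outputs as a list with one tensor per style block; that interface is an assumption made for illustration.

```python
import torch.nn.functional as F

def structure_loss(source_rgbs, target_rgbs, n_blocks=3):
    """Eq. (2) sketch: sum of MSE losses between the RGB outputs of the frozen
    source generator and the target generator for the first `n_blocks` style
    blocks (e.g. 4x4, 8x8, 16x16)."""
    loss = 0.0
    for k in range(n_blocks):
        loss = loss + F.mse_loss(target_rgbs[k], source_rgbs[k].detach())
    return loss
```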

Full Objective.

Finally, the objective functions are expressed as

$$\mathcal{L}_D = -\mathcal{L}_{adv} \qquad (3)$$

$$\mathcal{L}_G = \mathcal{L}_{adv} + \lambda_{structure}\,\mathcal{L}_{structure} \qquad (4)$$

where $\lambda_{structure}$ is a hyper-parameter that controls the relative importance of the source-domain structure. We use $\lambda_{structure} = 1$ in all of our experiments.
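A single generator update under Eq. (4) might then look like the sketch below, reusing the illustrative g_adv_loss and structure_loss helpers from above. The return_rgbs flag on the generator call and feeding the same latent z to both generators are assumptions made for the sketch, not the repository's exact interface.

```python
import torch

lambda_structure = 1.0

def generator_step(g_target, g_source, discriminator, g_optim, z):
    """One fine-tuning step of the target generator with L_G = L_adv + lambda * L_structure."""
    fake, target_rgbs = g_target(z, return_rgbs=True)          # generator being fine-tuned
    with torch.no_grad():
        _, source_rgbs = g_source(z, return_rgbs=True)          # frozen source generator
    loss_g = g_adv_loss(discriminator(fake)) \
             + lambda_structure * structure_loss(source_rgbs, target_rgbs)
    g_optim.zero_grad()
    loss_g.backward()
    g_optim.step()
    return loss_g
```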

Figure 3 shows that our model generates images much more naturally than the baseline [17, 2]. Since the initial layers of G are trained to learn the structure, regions such as jaws and heads are generated well.

Figure 4: Comparison between FreezeG and our models.

4 EXPERIMENTS

4.1 DATASETS

Source domain dataset.

We used Flickr-Faces-HQ (FFHQ) [1] as the dataset for the source domain. It contains 70,000 high-quality images of human faces. We used FFHQ to train a StyleGAN2 model [2] at 256 x 256 resolution.


Target domain dataset.

We experimented with a variety of datasets, including Naver Webtoon [22], MetFaces [18], and Disney [20]. The Naver Webtoon dataset [22] contains face images of webtoon characters serialized on Naver. We built this dataset by crawling webtoons from Naver's webtoon site and cropping the faces to 256 x 256. It covers about 15 webtoons and roughly 8,000 images. We trained on the entire Naver Webtoon dataset, and we also trained on each individual webtoon in this experiment. We also experimented with the other datasets [18, 20], using images at 256 x 256 resolution.


4.2 TRAINING DETAILS

For the method in Section 3.1, we freeze the initial blocks of the generator and the style vectors and then train the model by fine-tuning the StyleGAN2 pre-trained model, using the same objective function as StyleGAN2. We find it effective to freeze two style blocks (4x4 - 8x8) when generating 256 x 256 images. When applying Layer Swapping [20], it is most effective to integrate the low-resolution layers (4x4 - 64x64) of the source generator and the high-resolution layers (64x64 - 256x256) of the target generator.
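The Layer Swapping step just described can be sketched as a weight-level blend of the two generators: the source model supplies the low-resolution blocks (4x4 - 64x64) and the fine-tuned target model supplies the rest. The block below keeps the same assumed conv1/convs/to_rgbs layout as before and is an illustration rather than the repository's exact code; which side owns the boundary resolution is a design choice.

```python
import copy

def layer_swap(source_gen, target_gen, swap_res=64):
    """Return a blended generator: layers up to `swap_res` come from the source
    generator (structure), higher-resolution layers stay from the target
    generator (cartoon texture)."""
    blended = copy.deepcopy(target_gen)
    blended.conv1.load_state_dict(source_gen.conv1.state_dict())        # 4x4 block
    blended.to_rgb1.load_state_dict(source_gen.to_rgb1.state_dict())
    for i in range(len(blended.to_rgbs)):
        res = 4 * 2 ** (i + 1)                                           # 8, 16, 32, ...
        if res <= swap_res:                                              # copy from source
            blended.convs[2 * i].load_state_dict(source_gen.convs[2 * i].state_dict())
            blended.convs[2 * i + 1].load_state_dict(source_gen.convs[2 * i + 1].state_dict())
            blended.to_rgbs[i].load_state_dict(source_gen.to_rgbs[i].state_dict())
    return blended
```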

For the method in Section 3.2, we train by applying the structure loss to the input/output skip architecture of the StyleGAN2 model [2]. We apply this loss to the three lowest-resolution layers of the source and target generators and adopt MSE as the loss function. Furthermore, unlike FreezeSG, the structure loss enables effective layer swapping even for low-resolution layers, so we swapped different low-resolution layers in various experiments (Figure 5).

4.3 EXPERIMENT RESULTS

We showed that our model is more effective than the baseline (FreezeD + ADA) [17, 18] and FreezeG [19]. In addition, when the layer swapping technique was applied, the structure of the source domain was maintained more successfully. We also conducted additional experiments on the MetFaces [18] and Disney [20] datasets, and high-quality images were generated when the structure loss and layer swapping were used together. FreezeSG also maintained the structure of the source image well, but it produced less natural images than those obtained with the structure loss.

5 CONCLUSIONS AND FUTURE WORK

In this paper, we proposed Cartoon-StyleGAN: Fine-tuning StyleGAN2 for Cartoon Face Generation, a novel unsupervised image-to-image translation method. Experimental results show that our methods are effective at making the source and target images similar. Compared to previous studies, they generate more realistic images.

However, our methods still require adjusting which layers to freeze or swap for each dataset. In future work, we expect to make them more stable and further improve their performance.

Figure 5: The results of applying structure loss with Layer Swapping.

Figure 6: The results of FreezeSG with Layer Swapping.

REFERENCES

Figure 7: Comparison between the baseline (FreezeD + ADA) and our model applying the structure loss.

APPENDIX A

A.1 CHANGE FACIAL EXPRESSION AND POSE

Figure 8: The architecture of StyleCLIP.

Figure 9: Result images of text-guided manipulation. We additionally introduce styleclip strength to increase image variation. The third image is the one generated by StyleCLIP, and the sixth image is the one generated with styleclip strength α = 2.

StyleCLIP

Inspired by StyleCLIP [23], which manipulates generated images with text, we change the faces of generated cartoon characters using text. We use the latent optimization method among the three methods of StyleCLIP and additionally introduce a styleclip strength. It moves the latent vector linearly in the direction of the optimized latent vector, making the image change more strongly with the text.

$$w' = \alpha(w - w_s) + w_s \qquad (5)$$

where $\alpha$ is the styleclip strength, $w$ is the trainable latent vector, and $w_s$ is the fixed latent vector.

We first find a latent vector optimized for the text using the source generator and StyleCLIP. Then we feed this latent vector into the fine-tuned target generator to change the facial expression of the cartoon face. Figure 9 shows that StyleCLIP is effective in changing the expression of a webtoon character.
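Eq. (5) and the two-generator pipeline can be sketched as follows; the variable names and the generator call signature are illustrative assumptions rather than the exact code of StyleCLIP or our repository.

```python
def apply_styleclip_strength(w_opt, w_fixed, alpha=2.0):
    """Eq. (5): move the fixed latent w_fixed along the direction of the
    text-optimized latent w_opt, scaled by the styleclip strength alpha."""
    return alpha * (w_opt - w_fixed) + w_fixed

# Usage sketch: optimize w_opt with StyleCLIP against the *source* generator,
# then decode the strengthened latent with the fine-tuned *target* generator.
# w_edit = apply_styleclip_strength(w_opt, w_fixed, alpha=2.0)
# image, _ = target_generator([w_edit], input_is_latent=True)  # assumed call signature
```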
