# STYLEGAN2: Analyzing and Improving the Image Quality of StyleGAN 分析和改善 StyleGAN 的图像质量

Tero Karras
Samuli Laine
Miika Aittala
Janne Hellsten
Jaakko Lehtinen
Timo Aila

> Translator's note: this is the improved version of StyleGAN.

## 摘要

StyleGAN（基于样式的 GAN 架构）在数据驱动的无条件生成图像建模中达到了最先进的结果。我们揭示并分析了它的几种特征性伪影的成因，并提出模型架构和训练方法两方面的改进来消除它们。特别地，我们重新设计了生成器的归一化方法，重新审视了渐进式增长，并对生成器施加正则化，以促使从潜在向量到图像的映射具有良好的条件性。除了改善图像质量外，这个路径长度正则化器还带来了额外的好处：生成器变得明显更容易反转。这使得可以可靠地检测某张图像是否由特定网络生成。我们还进一步可视化了生成器对其输出分辨率的利用程度，并发现了一个容量问题，这促使我们训练更大的模型以进一步提高质量。总体而言，无论是在现有的分布质量指标上，还是在感知到的图像质量上，我们改进的模型都刷新了无条件图像建模的最先进水平。

## ABSTRACT

The style-based GAN architecture (StyleGAN) yields state-of-the-art results in data-driven unconditional generative image modeling. We expose and analyze several of its characteristic artifacts, and propose changes in both model architecture and training methods to address them. In particular, we redesign generator normalization, revisit progressive growing, and regularize the generator to encourage good conditioning in the mapping from latent vectors to images. In addition to improving image quality, this path length regularizer yields the additional benefit that the generator becomes significantly easier to invert. This makes it possible to reliably detect if an image is generated by a particular network. We furthermore visualize how well the generator utilizes its output resolution, and identify a capacity problem, motivating us to train larger models for additional quality improvements. Overall, our improved model redefines the state of the art in unconditional image modeling, both in terms of existing distribution quality metrics as well as perceived image quality.

## 1 介绍 INTRODUCTION

The resolution and quality of images produced by generative methods, especially generative adversarial networks (GAN) [15], are improving rapidly [23, 31, 5]. The current state-of-the-art method for high-resolution image synthesis is StyleGAN [24], which has been shown to work reliably on a variety of datasets. Our work focuses on fixing its characteristic artifacts and improving the result quality further.

The distinguishing feature of StyleGAN [24] is its unconventional generator architecture. Instead of feeding the input latent code z∈Z only to the beginning of the network, the mapping network f first transforms it to an intermediate latent code w∈W. Affine transforms then produce styles that control the layers of the synthesis network g via adaptive instance normalization (AdaIN) [20, 9, 12, 8]. Additionally, stochastic variation is facilitated by providing additional random noise maps to the synthesis network. It has been demonstrated [24, 38] that this design allows the intermediate latent space W to be much less entangled than the input latent space Z. In this paper, we focus all analysis solely on W, as it is the relevant latent space from the synthesis network’s point of view.


Many observers have noticed characteristic artifacts in images generated by StyleGAN [3]. We identify two causes for these artifacts, and describe changes in architecture and training methods that eliminate them. First, we investigate the origin of common blob-like artifacts, and find that the generator creates them to circumvent a design flaw in its architecture. In Section 2, we redesign the normalization used in the generator, which removes the artifacts. Second, we analyze artifacts related to progressive growing [23] that has been highly successful in stabilizing high-resolution GAN training. We propose an alternative design that achieves the same goal — training starts by focusing on low-resolution images and then progressively shifts focus to higher and higher resolutions — without changing the network topology during training. This new design also allows us to reason about the effective resolution of the generated images, which turns out to be lower than expected, motivating a capacity increase (Section 4).

Quantitative analysis of the quality of images produced using generative methods continues to be a challenging topic. Fréchet inception distance (FID) [19] measures differences in the density of two distributions in the high-dimensional feature space of an InceptionV3 classifier [39]. Precision and Recall (P&R) [36, 27] provide additional visibility by explicitly quantifying the percentage of generated images that are similar to training data and the percentage of training data that can be generated, respectively. We use these metrics to quantify the improvements.

Both FID and P&R are based on classifier networks that have recently been shown to focus on textures rather than shapes [11], and consequently, the metrics do not accurately capture all aspects of image quality. We observe that the perceptual path length (PPL) metric [24], introduced as a method for estimating the quality of latent space interpolations, correlates with consistency and stability of shapes. Based on this, we regularize the synthesis network to favor smooth mappings (Section 3) and achieve a clear improvement in quality. To counter its computational expense, we also propose executing all regularizations less frequently, observing that this can be done without compromising effectiveness.

Finally, we find that projection of images to the latent space W works significantly better with the new, path-length regularized generator than with the original StyleGAN. This has practical importance since it allows us to tell reliably whether a given image was generated using a particular generator (Section 5).

Our implementation and trained models are available at https://github.com/NVlabs/stylegan2

## 2 消除因归一化导致的伪影 REMOVING NORMALIZATION ARTIFACTS

We begin by observing that most images generated by StyleGAN exhibit characteristic blob-shaped artifacts that resemble water droplets. As shown in Figure 1, even when the droplet may not be obvious in the final image, it is present in the intermediate feature maps of the generator. The anomaly starts to appear around 64×64 resolution, is present in all feature maps, and becomes progressively stronger at higher resolutions. The existence of such a consistent artifact is puzzling, as the discriminator should be able to detect it.

We pinpoint the problem to the AdaIN operation that normalizes the mean and variance of each feature map separately, thereby potentially destroying any information found in the magnitudes of the features relative to each other. We hypothesize that the droplet artifact is a result of the generator intentionally sneaking signal strength information past instance normalization: by creating a strong, localized spike that dominates the statistics, the generator can effectively scale the signal as it likes elsewhere. Our hypothesis is supported by the finding that when the normalization step is removed from the generator, as detailed below, the droplet artifacts disappear completely.

### 2.1. 生成器架构修正 GENERATOR ARCHITECTURE REVISITED

We will first revise several details of the StyleGAN generator to better facilitate our redesigned normalization. These changes have either a neutral or small positive effect on their own in terms of quality metrics.

Figure 2a shows the original StyleGAN synthesis network g [24], and in Figure 2b we expand the diagram to full detail by showing the weights and biases and breaking the AdaIN operation to its two constituent parts: normalization and modulation. This allows us to re-draw the conceptual gray boxes so that each box indicates the part of the network where one style is active (i.e., “style block”). Interestingly, the original StyleGAN applies bias and noise within the style block, causing their relative impact to be inversely proportional to the current style’s magnitudes. We observe that more predictable results are obtained by moving these operations outside the style block, where they operate on normalized data. Furthermore, we notice that after this change it is sufficient for the normalization and modulation to operate on the standard deviation alone (i.e., the mean is not needed). The application of bias, noise, and normalization to the constant input can also be safely removed without observable drawbacks. This variant is shown in Figure 2c, and serves as a starting point for our redesigned normalization.

### 2.2. 实例归一化修正 INSTANCE NORMALIZATION REVISITED

Given that instance normalization appears to be too strong, how can we relax it while retaining the scale-specific effects of the styles? We rule out batch normalization [21] as it is incompatible with the small minibatches required for high-resolution synthesis. Alternatively, we could simply remove the normalization. While actually improving FID slightly [27], this makes the effects of the styles cumulative rather than scale-specific, essentially losing the controllability offered by StyleGAN (see video). We will now propose an alternative that removes the artifacts while retaining controllability. The main idea is to base normalization on the expected statistics of the incoming feature maps, but without explicit forcing.

Recall that a style block in Figure 2c consists of modulation, convolution, and normalization. Let us start by considering the effect of a modulation followed by a convolution. The modulation scales each input feature map of the convolution based on the incoming style, which can alternatively be implemented by scaling the convolution weights:

$$w'_{ijk} = s_i \cdot w_{ijk} \tag{1}$$

where w and w′ are the original and modulated weights, respectively, si is the scale corresponding to the ith input feature map, and j and k enumerate the output feature maps and spatial footprint of the convolution, respectively.
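As a sanity check, the equivalence between scaling the input feature maps and baking the scales into the convolution weights can be verified numerically. The sketch below uses a 1×1 convolution (expressed with `einsum`) for brevity; all shapes and names are illustrative, not the official implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
in_ch, out_ch, hw = 4, 3, 8
x = rng.standard_normal((in_ch, hw, hw))  # input feature maps
w = rng.standard_normal((in_ch, out_ch))  # 1x1 conv weights w_ij (spatial index k trivial)
s = rng.standard_normal(in_ch)            # per-input-channel style scales s_i

# Path A: modulate the inputs, then convolve.
out_a = np.einsum('ihw,ij->jhw', x * s[:, None, None], w)

# Path B: bake the scales into the weights (Eq. 1), then convolve.
w_mod = s[:, None] * w
out_b = np.einsum('ihw,ij->jhw', x, w_mod)

assert np.allclose(out_a, out_b)  # both paths produce identical outputs
```

The same identity holds per spatial tap of a larger kernel, since the scale s_i multiplies every tap of input channel i.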

Now, the purpose of instance normalization is to essentially remove the effect of s from the statistics of the convolution’s output feature maps. We observe that this goal can be achieved more directly. Let us assume that the input activations are i.i.d. random variables with unit standard deviation. After modulation and convolution, the output activations have standard deviation of

$$\sigma_{j} = \sqrt{\underset{i,k}{{}\displaystyle\sum{}} {w'_{ijk}}^{2}}, \tag{2}$$

i.e., the outputs are scaled by the L2 norm of the corresponding weights. The subsequent normalization aims to restore the outputs back to unit standard deviation. Based on Equation 2, this is achieved if we scale (“demodulate”) each output feature map j by 1/σj. Alternatively, we can again bake this into the convolution weights:

$$w''_{ijk} = w'_{ijk} \bigg/ \sqrt{\underset{i,k}{{}\displaystyle\sum{}} {w'_{ijk}}^{2} + \epsilon}, \tag{3}$$

where ϵ is a small constant to avoid numerical issues.

We have now baked the entire style block to a single convolution layer whose weights are adjusted based on s using Equations 1 and 3 (Figure 2d). Compared to instance normalization, our demodulation technique is weaker because it is based on statistical assumptions about the signal instead of actual contents of the feature maps. Similar statistical analysis has been extensively used in modern network initializers [13, 18], but we are not aware of it being previously used as a replacement for data-dependent normalization. Our demodulation is also related to weight normalization [37] that performs the same calculation as a part of reparameterizing the weight tensor. Prior work has identified weight normalization as beneficial in the context of GAN training [42].
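A minimal sketch of this weight demodulation under the stated statistical assumptions (illustrative shapes, not the official implementation): the style is baked into the weights via Equation 1, the per-output-map standard deviation of Equation 2 is computed from the weights alone, and Equation 3 rescales so that each output map's weights have unit L2 norm.

```python
import numpy as np

rng = np.random.default_rng(1)
in_ch, out_ch, k, eps = 8, 6, 3, 1e-8

w = rng.standard_normal((in_ch, out_ch, k, k))  # base weights w_ijk
s = np.exp(rng.standard_normal(in_ch))          # styles s_i (positive scales)

w_mod = s[:, None, None, None] * w                        # Eq. 1: modulate
sigma = np.sqrt(np.sum(w_mod**2, axis=(0, 2, 3)) + eps)   # Eq. 2: sigma_j per output map
w_demod = w_mod / sigma[None, :, None, None]              # Eq. 3: demodulate

# For i.i.d. unit-variance inputs, each output map then has ~unit std,
# because the weights feeding each output map have unit L2 norm:
assert np.allclose(np.sum(w_demod**2, axis=(0, 2, 3)), 1.0, atol=1e-6)
```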

Our new design removes the characteristic artifacts (Figure 3) while retaining full controllability, as demonstrated in the accompanying video. FID remains largely unaffected (Table 1, rows a, b), but there is a notable shift from precision to recall. We argue that this is generally desirable, since recall can be traded into precision via truncation, whereas the opposite is not true [27]. In practice our design can be implemented efficiently using grouped convolutions, as detailed in Appendix B. To avoid having to account for the activation function in Equation 3, we scale our activation functions so that they retain the expected signal variance.

## 3. 图像质量和生成器平滑度 IMAGE QUALITY AND GENERATOR SMOOTHNESS

While GAN metrics such as FID or Precision and Recall (P&R) successfully capture many aspects of the generator, they continue to have somewhat of a blind spot for image quality. For an example, refer to Figures 13 and 14 that contrast generators with identical FID and P&R scores but markedly different overall quality.

We observe an interesting correlation between perceived image quality and perceptual path length (PPL) [24], a metric that was originally introduced for quantifying the smoothness of the mapping from a latent space to the output image by measuring average LPIPS distances [49] between generated images under small perturbations in latent space. Again consulting Figures 13 and 14, a smaller PPL (smoother generator mapping) appears to correlate with higher overall image quality, whereas other metrics are blind to the change. Figure 4 examines this correlation more closely through per-image PPL scores computed by sampling latent space around individual points in W on StyleGAN trained on LSUN Cat: low PPL scores are indeed indicative of high-quality images, and vice versa. Figure 5a shows the corresponding histogram of per-image PPL scores and reveals the long tail of the distribution. The overall PPL for the model is simply the expected value of the per-image PPL scores.

It is not immediately obvious why a low PPL should correlate with image quality. We hypothesize that during training, as the discriminator penalizes broken images, the most direct way for the generator to improve is to effectively stretch the region of latent space that yields good images. This would lead to the low-quality images being squeezed into small latent space regions of rapid change. While this improves the average output quality in the short term, the accumulating distortions impair the training dynamics and consequently the final image quality.

This empirical correlation suggests that favoring a smooth generator mapping by encouraging low PPL during training may improve image quality, which we show to be the case below. As the resulting regularization term is somewhat expensive to compute, we first describe a general optimization that applies to all regularization techniques.

### 3.1. 延迟正则化（Lazy regularization）

Typically the main loss function (e.g., logistic loss [15]) and regularization terms (e.g., R1 [30]) are written as a single expression and are thus optimized simultaneously. We observe that typically the regularization terms can be computed much less frequently than the main loss function, thus greatly diminishing their computational cost and the overall memory usage. Table 1, row c shows that no harm is caused when R1 regularization is performed only once every 16 minibatches, and we adopt the same strategy for our new regularizer as well. Appendix B gives implementation details.
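The idea can be sketched as a training-step skeleton. Here `main_loss`, `r1_penalty`, and `grad_and_update` are hypothetical stand-ins for the actual training components; the regularizer is scaled by the interval so that its effective strength stays roughly unchanged.

```python
REG_INTERVAL = 16  # run the regularizer once every 16 minibatches

def train_step(step, params, batch, main_loss, r1_penalty, grad_and_update):
    """One lazy-regularized training step (sketch)."""
    loss = main_loss(params, batch)
    if step % REG_INTERVAL == 0:
        # Scale by the interval so the time-averaged regularization strength
        # matches applying the penalty at every step.
        loss = loss + REG_INTERVAL * r1_penalty(params, batch)
    return grad_and_update(params, loss)
```

In practice this also lets the regularization pass use a separate, smaller graph, which is where most of the memory savings come from.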

### 3.2. 路径长度正则化 PATH LENGTH REGULARIZATION

Excess path distortion in the generator is evident as poor local conditioning: any small region in W becomes arbitrarily squeezed and stretched as it is mapped by g. In line with earlier work [33], we consider a generator mapping from the latent space to image space to be well-conditioned if, at each point in latent space, small displacements yield changes of equal magnitude in image space regardless of the direction of perturbation.

At a single w∈W, the local metric scaling properties of the generator mapping g(w):W↦Y are captured by the Jacobian matrix Jw=∂g(w)/∂w. Motivated by the desire to preserve the expected lengths of vectors regardless of the direction, we formulate our regularizer as

$$\mathbb{E}_{{\bf w},{\bf y}\sim\mathcal{N}(0,\mathbf{I})}\left( \left\lVert\mathbf{J}_{\bf w}^{T}{\bf y}\right\rVert_{2}-a \right)^{2} \tag{4}$$

where y are random images with normally distributed pixel intensities, and w∼f(z), where z are normally distributed. We show in Appendix C that, in high dimensions, this prior is minimized when Jw is orthogonal (up to a global scale) at any w. An orthogonal matrix preserves lengths and introduces no squeezing along any dimension.

To avoid explicit computation of the Jacobian matrix, we use the identity $\mathbf{J}_{\bf w}^{T}{\bf y}=\nabla_{\bf w}(g({\bf w})\cdot{\bf y})$, which is efficiently computable using standard backpropagation [6]. The constant $a$ is set dynamically during optimization as the long-running exponential moving average of the lengths $\lVert\mathbf{J}_{\bf w}^{T}{\bf y}\rVert_{2}$, allowing the optimization to find a suitable global scale by itself.
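To make the mechanics concrete, here is a toy sketch of the regularizer for a *linear* generator g(w) = Aw, where the Jacobian is simply A and J_w^T y = A^T y can be formed exactly instead of via backpropagation. Dimensions, iteration count, and the EMA decay are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
latent_dim, img_dim = 16, 64
A = rng.standard_normal((img_dim, latent_dim)) / np.sqrt(latent_dim)  # toy generator g(w) = A @ w

a = 0.0        # running target length (exponential moving average)
decay = 0.01
penalties = []
for _ in range(200):
    w = rng.standard_normal(latent_dim)
    y = rng.standard_normal(img_dim)   # random "image" with N(0, I) pixel intensities
    jtv = A.T @ y                      # J_w^T y (exact here; via backprop in a real network)
    length = np.linalg.norm(jtv)
    a = a + decay * (length - a)       # EMA lets optimization find its own global scale
    penalties.append((length - a) ** 2)  # the per-sample regularization term of Eq. 4

assert a > 0 and len(penalties) == 200
```

Minimizing the penalty pushes all such lengths toward the common value a, regardless of the direction of y.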

Our regularizer is closely related to the Jacobian clamping regularizer presented by Odena et al. [33]. Practical differences include that we compute the products $\mathbf{J}_{\bf w}^{T}{\bf y}$ analytically whereas they use finite differences for estimating $\mathbf{J}_{\bf w}\boldsymbol{\delta}$ with $\mathcal{Z}\ni\boldsymbol{\delta}\sim\mathcal{N}(0,\mathbf{I})$. It should be noted that spectral normalization [31] of the generator [45] only constrains the largest singular value, posing no constraints on the others and hence not necessarily leading to better conditioning.

In practice, we notice that path length regularization leads to more reliable and consistently behaving models, making architecture exploration easier. Figure 5b shows that path length regularization clearly improves the distribution of per-image PPL scores. Table 1, row d shows that regularization reduces PPL, as expected, but there is a tradeoff between FID and PPL in LSUN Car and other datasets that are less structured than FFHQ. Furthermore, we observe that a smoother generator is easier to invert (Section 5).

## 4. 渐进式增长修正 PROGRESSIVE GROWING REVISITED

Progressive growing [23] has been very successful in stabilizing high-resolution image synthesis, but it causes its own characteristic artifacts. The key issue is that the progressively grown generator appears to have a strong location preference for details; the accompanying video shows that when features like teeth or eyes should move smoothly over the image, they may instead remain stuck in place before jumping to the next preferred location. Figure 6 shows a related artifact. We believe the problem is that in progressive growing each resolution serves momentarily as the output resolution, forcing it to generate maximal frequency details, which then leads to the trained network to have excessively high frequencies in the intermediate layers, compromising shift invariance [48]. Appendix A shows an example. These issues prompt us to search for an alternative formulation that would retain the benefits of progressive growing without the drawbacks.

### 4.1. 可选的网络架构 ALTERNATIVE NETWORK ARCHITECTURES

While StyleGAN uses simple feedforward designs in the generator (synthesis network) and discriminator, there is a vast body of work dedicated to the study of better network architectures. In particular, skip connections [34, 22], residual networks [17, 16, 31], and hierarchical methods [7, 46, 47] have proven highly successful also in the context of generative methods. As such, we decided to re-evaluate the network design of StyleGAN and search for an architecture that produces high-quality images without progressive growing.

Figure 7a shows MSG-GAN [22], which connects the matching resolutions of the generator and discriminator using multiple skip connections. The MSG-GAN generator is modified to output a mipmap [41] instead of an image, and a similar representation is computed for each real image as well. In Figure 7b we simplify this design by upsampling and summing the contributions of RGB outputs corresponding to different resolutions. In the discriminator, we similarly provide the downsampled image to each resolution block of the discriminator. We use bilinear filtering in all up and downsampling operations. In Figure 7c we further modify the design to use residual connections. This design is similar to LAPGAN [7] without the per-resolution discriminators employed by Denton et al.
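The output side of the skip design in Figure 7b can be sketched as follows. Nearest-neighbor upsampling stands in for the bilinear filtering used in the paper, and the resolutions and shapes are illustrative.

```python
import numpy as np

def upsample(img, factor):
    """Nearest-neighbor upsampling of a (channels, h, w) array (sketch only)."""
    return img.repeat(factor, axis=1).repeat(factor, axis=2)

def combine_rgb(trgb_outputs):
    """Sum the per-resolution tRGB contributions, upsampled to full resolution.

    trgb_outputs: list of (3, r, r) arrays at resolutions r = 4, 8, ..., R.
    """
    full = trgb_outputs[-1].shape[-1]
    return sum(upsample(img, full // img.shape[-1]) for img in trgb_outputs)

rng = np.random.default_rng(3)
outs = [rng.standard_normal((3, r, r)) for r in (4, 8, 16)]
image = combine_rgb(outs)
assert image.shape == (3, 16, 16)
```

Because each resolution contributes an explicit RGB term, the generator can lean on coarse layers early in training and shift weight to fine layers later, without any change of topology.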

Table 2 compares three generator and three discriminator architectures: the original feedforward networks used in StyleGAN, skip connections, and residual networks, all trained without progressive growing. FID and PPL are provided for each of the 9 combinations. We can see two broad trends: skip connections in the generator drastically improve PPL in all configurations, and a residual discriminator network is clearly beneficial for FID. The latter is perhaps not surprising, since the structure of the discriminator resembles classifiers, where residual architectures are known to be helpful. However, a residual architecture was harmful in the generator; the lone exception was FID in LSUN Car when both networks were residual.

For the rest of the paper we use a skip generator and a residual discriminator, and do not use progressive growing. This corresponds to configuration e in Table 1, and as can be seen in the table, switching to this setup significantly improves FID and PPL.


### 4.2. 分辨率的使用 RESOLUTION USAGE

The key aspect of progressive growing, which we would like to preserve, is that the generator will initially focus on low-resolution features and then slowly shift its attention to finer details. The architectures in Figure 7 make it possible for the generator to first output low resolution images that are not affected by the higher-resolution layers in a significant way, and later shift the focus to the higher-resolution layers as the training proceeds. Since this is not enforced in any way, the generator will do it only if it is beneficial. To analyze the behavior in practice, we need to quantify how strongly the generator relies on particular resolutions over the course of training.

Since the skip generator (Figure 7b) forms the image by explicitly summing RGB values from multiple resolutions, we can estimate the relative importance of the corresponding layers by measuring how much they contribute to the final image. In Figure 8a, we plot the standard deviation of the pixel values produced by each tRGB layer as a function of training time. We calculate the standard deviations over 1024 random samples of w and normalize the values so that they sum to 100%.
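The measurement can be sketched as follows, with synthetic stand-ins for the tRGB outputs (`trgb_samples[k]` represents the k-th layer's output over many random latents):

```python
import numpy as np

def resolution_usage(trgb_samples):
    """Relative contribution of each tRGB layer, as percentages summing to 100."""
    stds = np.array([np.std(layer) for layer in trgb_samples])
    return 100.0 * stds / stds.sum()

rng = np.random.default_rng(4)
# e.g. 1024 samples at three resolutions; here the coarse layer is made dominant
trgb_samples = [3.0 * rng.standard_normal((1024, 3, 4, 4)),
                1.0 * rng.standard_normal((1024, 3, 8, 8)),
                0.5 * rng.standard_normal((1024, 3, 16, 16))]
usage = resolution_usage(trgb_samples)
assert np.isclose(usage.sum(), 100.0)
assert usage[0] > usage[-1]  # coarse layer dominates in this synthetic example
```

Tracking these percentages over training time yields curves like those in Figure 8.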

At the start of training, we can see that the new skip generator behaves similarly to progressive growing — now achieved without changing the network topology. It would thus be reasonable to expect the highest resolution to dominate towards the end of the training. The plot, however, shows that this fails to happen in practice, which indicates that the generator may not be able to “fully utilize” the target resolution. To verify this, we inspected the generated images manually and noticed that they generally lack some of the pixel-level detail that is present in the training data — the images could be described as being sharpened versions of 512^2 images instead of true 1024^2 images.

This leads us to hypothesize that there is a capacity problem in our networks, which we test by doubling the number of feature maps in the highest-resolution layers of both networks. (Specifically, we double the number of feature maps in resolutions 64^2–1024^2 while keeping other parts of the networks unchanged. This increases the total number of trainable parameters in the generator by 22% (25M → 30M) and in the discriminator by 21% (24M → 29M).) This brings the behavior more in line with expectations: Figure 8b shows a significant increase in the contribution of the highest-resolution layers, and Table 1, row f shows that FID and Recall improve markedly.

Table 3 compares StyleGAN and our improved variant in several LSUN categories, again showing clear improvements in FID and significant advances in PPL. It is possible that further increases in the size could provide additional benefits.

## 5. 将图像投影到潜在空间 PROJECTION OF IMAGES TO LATENT SPACE

Inverting the synthesis network g is an interesting problem that has many applications. Manipulating a given image in the latent feature space requires finding a matching latent vector w for it first. Also, as the image quality of GANs improves, it becomes more important to be able to attribute a potentially synthetic image to the network that generated it.

Previous research [1, 10] suggests that instead of finding a common latent vector w, the results improve if a separate w is chosen for each layer of the generator. The same approach was used in an early encoder implementation [32]. While extending the latent space in this fashion finds a closer match to a given image, it also enables projecting arbitrary images that should have no latent representation. Focusing on the forensic detection of generated images, we concentrate on finding latent codes in the original, unextended latent space, as these correspond to images that the generator could have produced.

Our projection method differs from previous methods in two ways. First, we add ramped-down noise to the latent code during optimization in order to explore the latent space more comprehensively. Second, we also optimize the stochastic noise inputs of the StyleGAN generator, regularizing them to ensure they do not end up carrying coherent signal. The regularization is based on enforcing the autocorrelation coefficients of the noise maps to match those of unit Gaussian noise over multiple scales. Details of our projection method can be found in Appendix D.
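A rough sketch of the multi-scale noise regularization described above, assuming a simple 2×2 average-pooling pyramid and illustrative weighting (the actual implementation details are in Appendix D): shifted autocorrelations of each noise map are penalized so that, like unit Gaussian noise, the optimized noise carries no coherent signal.

```python
import numpy as np

def autocorr_penalty(n):
    """Penalize nonzero one-pixel-shift autocorrelation of a 2D noise map."""
    n = (n - n.mean()) / (n.std() + 1e-8)   # normalize to zero mean, unit variance
    size = n.size
    horiz = (n * np.roll(n, 1, axis=1)).sum() / size  # ~0 for white noise
    vert = (n * np.roll(n, 1, axis=0)).sum() / size
    return horiz**2 + vert**2

def multiscale_penalty(n):
    """Apply the penalty over a pyramid of 2x average-pooled scales."""
    total = 0.0
    while n.shape[0] >= 8:
        total += autocorr_penalty(n)
        n = n.reshape(n.shape[0] // 2, 2, n.shape[1] // 2, 2).mean(axis=(1, 3))
    return total

rng = np.random.default_rng(5)
white = rng.standard_normal((64, 64))                 # signal-free noise: low penalty
coherent = np.tile(np.linspace(-1, 1, 64), (64, 1))   # noise carrying coherent signal
assert multiscale_penalty(coherent) > multiscale_penalty(white)
```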

### 5.1. 检测生成的图像 DETECTION OF GENERATED IMAGES

It is possible to train classifiers to detect GAN-generated images with reasonably high confidence [29, 44, 40, 50]. However, given the rapid pace of progress, this may not be a lasting situation. Projection-based methods are unique in that they can provide evidence, in the form of a matching latent vector, that an image was synthesized by a specific network [2]. There is also no reason why their effectiveness would diminish as the quality of synthetic images improves, unlike classifier-based methods that may have fewer clues to work with in the future.

It turns out that our improvements to StyleGAN make it easier to detect generated images using projection-based methods, even though the quality of generated images is higher. We measure how well the projection succeeds by computing the LPIPS [49] distance between original and re-synthesized image as

$$D_{\mathrm{LPIPS}}[\boldsymbol{x},g(\tilde{g}^{-1}(\boldsymbol{x}))]$$

where x is the image being analyzed and $\tilde{g}^{-1}$ denotes the approximate projection operation. Figure 9 shows histograms of these distances for LSUN Car and FFHQ datasets using the original StyleGAN and our best architecture, and Figure 10 shows example projections. As the latter illustrates, the images generated using our improved architecture can be projected into generator inputs so well that they can be unambiguously attributed to the generating network. With original StyleGAN, even though it should technically be possible to find a matching latent vector, it appears that the latent space is in practice too complex for this to succeed reliably. Our improved model with smoother latent space W suffers considerably less from this problem.

## 6. 结论与未来工作 CONCLUSIONS AND FUTURE WORK

We have identified and fixed several image quality issues in StyleGAN, improving the quality further and considerably advancing the state of the art in several datasets. In some cases the improvements are more clearly seen in motion, as demonstrated in the accompanying video. Appendix A includes further examples of results obtainable using our method. Despite the improved quality, it is easier to detect images generated by our method using projection-based methods, compared to the original StyleGAN.

Training performance has also improved. At 1024^2 resolution, the original StyleGAN (config a in Table 1) trains at 37 images per second on NVIDIA DGX-1 with 8 Tesla V100 GPUs, while our config e trains 40% faster at 61 img/s. Most of the speedup comes from simplified dataflow due to weight demodulation, lazy regularization, and code optimizations. Config f (larger networks) trains at 31 img/s, and is thus only slightly more expensive to train than original StyleGAN. With config f, the total training time was 9 days for FFHQ and 13 days for LSUN Car.

As future work, it could be fruitful to study further improvements to the path length regularization, e.g., by replacing the pixel-space L2 distance with a data-driven feature-space metric.

## 7. 致谢 ACKNOWLEDGEMENTS

We thank Ming-Yu Liu for an early review, Timo Viitanen for his help with code release, and Tero Kuosmanen for compute infrastructure.

## B. 实现细节 IMPLEMENTATION DETAILS

### 路径长度正则化 PATH LENGTH REGULARIZATION

$$\gamma_{\textrm{pl}}=\frac{\ln 2}{r^{2}(\ln r-\ln 2)}\textrm{,}$$

#### 特定于数据集的调整 DATASET-SPECIFIC TUNING

## C. 路径长度正则化的作用 EFFECTS OF PATH LENGTH REGULARIZATION

$$\mathcal{L}_{\textrm{pl}}=\mathbb{E}_{\bf w}\mathbb{E}_{\bf y}\left(\left\lVert\mathbf{J}_{\bf w}^{T}{\bf y}\right\rVert_{2}-a\right)^{2}$$

### C.1. EFFECT ON POINTWISE JACOBIANS 对逐像素雅可比矩阵的影响

$$\mathcal{L}_{\bf w}:=\mathbb{E}_{\bf y}\left(\left\lVert\mathbf{J}_{\bf w}^{T}{\bf y}\right\rVert_{2}-a\right)^{2}$$

${\bf J}^{T}_{\bf w}=\mathbf{U}\tilde{\bf\Sigma}\mathbf{V}^{T}$

$$\mathcal{L}_{\bf w}=\mathbb{E}_{\bf y}\left(\left\lVert\mathbf{U}\tilde{\bf\Sigma}\mathbf{V}^{T}{\bf y}\right\rVert_{2}-a\right)^{2}$$

$$\mathcal{L}_{\bf w}=\mathbb{E}_{\tilde{\bf y}}\left(\left\lVert{\bf\Sigma}\tilde{\bf y}\right\rVert_{2}-a\right)^{2}$$

$$\mathcal{L}_{\bf w}=\int\left(\left\lVert{\bf\Sigma}\tilde{\bf y}\right\rVert_{2}-a\right)^{2}p_{\tilde{\bf y}}(\tilde{\bf y})\,\mathrm{d}\tilde{\bf y}=(2\pi)^{-\frac{L}{2}}\int\left(\left\lVert{\bf\Sigma}\tilde{\bf y}\right\rVert_{2}-a\right)^{2}\mathrm{exp}\left(-\frac{\tilde{\bf y}^{T}\tilde{\bf y}}{2}\right)\mathrm{d}\tilde{\bf y}$$

$$\mathcal{L}_{\bf w}\approx(2\pi e/L)^{-L/2}\int_{\mathbb{S}}\int_{0}^{\infty}\left(r\left\lVert{\bf\Sigma}\phi\right\rVert_{2}-a\right)^{2}\mathrm{exp}\left(-\frac{\left(r-\sqrt{L}\right)^{2}}{2\sigma^{2}}\right)\mathrm{d}r\,\mathrm{d}\phi$$

$$\mathcal{L}_{\bf w}\approx(2\pi e/L)^{-L/2}\mathcal{A}(\mathbb{S})\,a^{2}L^{-1}\int_{0}^{\infty}\left(r-\sqrt{L}\right)^{2}\mathrm{exp}\left(-\frac{\left(r-\sqrt{L}\right)^{2}}{2\sigma^{2}}\right)\mathrm{d}r$$

$$a\approx\mathbb{E}_{{\bf w},{\bf y}}\left(\left\lVert\mathbf{J}_{\bf w}^{T}{\bf y}\right\rVert_{2}\right)$$

### C.2. 对生成器映射的全局属性的影响 EFFECT ON GLOBAL PROPERTIES OF THE GENERATOR MAPPING

$\tilde{u}=g\circ u$

$$L=\int_{t_{0}}^{t_{1}}\lvert\tilde{u}^{\prime}(t)\rvert~{}\mathrm{d}t$$

$$L=\int_{t_{0}}^{t_{1}}\lvert J_{g}(u(t))u^{\prime}(t)\rvert~{}\mathrm{d}t$$

$$L=\int_{t_{0}}^{t_{1}}\lvert u^{\prime}(t)\rvert~{}\mathrm{d}t$$

## D. 投影方式细节 PROJECTION METHOD DETAILS

$$L_{\mathit{image}}=D_{\mathrm{LPIPS}}[\boldsymbol{x},g(\tilde{\bf w},\boldsymbol{n}_{0},\boldsymbol{n}_{1},\ldots)]$$

$$L_{i,j}=\left(\frac{1}{r_{i,j}^{2}}\sum_{x,y}\boldsymbol{n}_{i,j}(x,y)\cdot\boldsymbol{n}_{i,j}(x-1,y)\right)^{2}+\left(\frac{1}{r_{i,j}^{2}}\sum_{x,y}\boldsymbol{n}_{i,j}(x,y)\cdot\boldsymbol{n}_{i,j}(x,y-1)\right)^{2}$$