# StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery

## Abstract

Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images. However, discovering semantically meaningful latent manipulations typically involves painstaking human examination of the many degrees of freedom, or an annotated collection of images for each desired manipulation. In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort. We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt. Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation. Finally, we present a method for mapping text prompts to input-agnostic directions in StyleGAN's style space, enabling interactive text-driven image manipulation. Extensive results and comparisons demonstrate the effectiveness of our approaches.

Figure 1: Examples of text-driven manipulations using StyleCLIP. Top row: input images; Bottom row: our manipulated results. The text prompt used to drive each manipulation appears under each column.

## Introduction

Generative Adversarial Networks (GANs) have revolutionized image synthesis, with recent style-based generative models boasting some of the most realistic synthetic imagery to date. Furthermore, the learnt intermediate latent spaces of StyleGAN have been shown to possess disentanglement properties, which enable utilizing pretrained models to perform a wide variety of image manipulations on synthetic, as well as real, images.
Harnessing StyleGAN's expressive power requires developing simple and intuitive interfaces for users to easily carry out their intent. Existing methods for semantic control discovery either involve manual examination, a large amount of annotated data, or pretrained classifiers. Furthermore, subsequent manipulations are typically carried out by moving along a direction in one of the latent spaces, using a parametric model, such as a 3DMM in StyleRig, or a trained normalized flow in StyleFlow. Specific edits, such as virtual try-on and aging, have also been explored. Thus, existing controls enable image manipulations only along preset semantic directions, severely limiting the user's creativity and imagination. Whenever an additional, unmapped, direction is desired, further manual effort and/or large quantities of annotated data are necessary.
In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to enable intuitive text-based semantic image manipulation that is neither limited to preset manipulation directions, nor requires additional manual effort to discover new controls. The CLIP model is pretrained on 400 million image-text pairs harvested from the Web, and since natural language is able to express a much wider set of visual concepts, combining CLIP with the generative power of StyleGAN opens fascinating avenues for image manipulation. Figure 1 shows several examples of unique manipulations produced using our approach. Specifically, in this paper we investigate three techniques that combine CLIP with StyleGAN:

1. Text-guided latent optimization, where a CLIP model is used as a loss network [johnson2016perceptual]. This is the most versatile approach, but it requires several minutes of optimization to apply a manipulation to an image.
2. A latent residual mapper, trained for a specific text prompt. Given a starting point in latent space (the input image to be manipulated), the mapper yields a local step in latent space.
3. A method for mapping a text prompt into an input-agnostic (global) direction in StyleGAN's style space, providing control over the manipulation strength as well as the degree of disentanglement.

The results in this paper and the supplementary material demonstrate a wide range of semantic manipulations on images of human faces, animals, cars, and churches. These manipulations range from abstract to specific, and from extensive to fine-grained. Many of them have not been demonstrated by any of the previous StyleGAN manipulation works, and all of them were easily obtained using a combination of pretrained StyleGAN and CLIP models.

### Vision and Language

Multiple works learn cross-modal vision and language (VL) representations for a variety of tasks, such as language-based image retrieval, image captioning, and visual question answering. Following the success of BERT in various language tasks, recent VL methods typically use Transformers to learn the joint representations. A recent model, based on Contrastive Language-Image Pre-training (CLIP), learns a multi-modal embedding space, which may be used to estimate the semantic similarity between a given text and an image. CLIP was trained on 400 million text-image pairs, collected from a variety of publicly available sources on the Internet. The representations learned by CLIP have been shown to be extremely powerful, enabling state-of-the-art zero-shot image classification on a variety of datasets. We refer the reader to OpenAI's Distill article for an extensive exposition and discussion of the visual concepts learned by CLIP. The pioneering work of Reed et al. approached text-guided image generation by training a conditional GAN, conditioned on text embeddings obtained from a pretrained encoder. Zhang et al. improved image quality by using multi-scale GANs. AttnGAN incorporated an attention mechanism between the text and image features. Additional supervision was used in other works to further improve the image quality. A few studies focus on text-guided image manipulation. Some methods use a GAN-based encoder-decoder architecture, to disentangle the semantics of both input images and text descriptions.
ManiGAN introduces a novel text-image combination module, which produces high-quality images. Differently from the aforementioned works, we propose a single framework that combines the high-quality images generated by StyleGAN with the rich multi-domain semantics learned by CLIP. Recently, DALL·E, a 12-billion parameter version of GPT-3, which at 16-bit precision requires over 24GB of GPU memory, has shown a diverse set of capabilities in generating and applying transformations to images guided by text. In contrast, our approach is deployable even on a single commodity GPU. A concurrent work to ours, TediGAN, also uses StyleGAN for text-guided image generation and manipulation. By training an encoder to map text into the StyleGAN latent space, one can generate an image corresponding to a given text. To perform text-guided image manipulation, TediGAN encodes both the image and the text into the latent space, and then performs style-mixing to generate a corresponding image. In Section 7 we demonstrate that the manipulations achieved using our approach better reflect the semantics of the driving text. In a recent online post, Perez describes a text-to-image approach that combines StyleGAN and CLIP in a manner similar to our latent optimizer in Section 4. Rather than synthesizing an image from scratch, our optimization scheme, as well as the other two approaches described in this work, focuses on image manipulation. While text-to-image generation is an intriguing and challenging problem, we believe that the image manipulation abilities we provide constitute a more useful tool for the typical workflow of creative artists.

### Latent Space Image Manipulation

Many works explore how to utilize the latent space of a pretrained generator for image manipulation. Specifically, the intermediate latent spaces in StyleGAN have been shown to enable many disentangled and meaningful image manipulations. Some methods learn to perform image manipulation in an end-to-end fashion, by training a network that encodes a given image into a latent representation of the manipulated image. Other methods aim to find latent paths, such that traversing along them results in the desired manipulation. Such methods can be categorized into: (i) methods that use image annotations to find meaningful latent paths, and (ii) methods that find meaningful directions without supervision, and require manual annotation for each direction. While most works perform image manipulations in the $\mathcal{W}$ or $\mathcal{W+}$ spaces, Wu et al. proposed to use the StyleSpace $\mathcal{S}$, and showed that it is better disentangled than $\mathcal{W}$ and $\mathcal{W+}$. Our latent optimizer and mapper work in the $\mathcal{W+}$ space, while the input-agnostic directions that we detect are in $\mathcal{S}$. In all three, the manipulations are derived directly from text input, and our only source of supervision is a pretrained CLIP model. As CLIP was trained on hundreds of millions of text-image pairs, our approach is generic and can be used in a multitude of domains without the need for domain- or manipulation-specific data annotation.

## StyleCLIP Text-Driven Manipulation

In this work we explore three ways for text-driven image manipulation, all of which combine the generative power of StyleGAN with the rich joint vision-language representation learned by CLIP. We begin in Section 4 with a simple latent optimization scheme, where a given latent code of an image in StyleGAN's $\mathcal{W}+$ space is optimized by minimizing a loss computed in CLIP space. The optimization is performed for each (source image, text prompt) pair. Thus, despite its versatility, several minutes are required to perform a single manipulation, and the method can be difficult to control.

A more stable approach is described in Section 5, where a mapping network is trained to infer a manipulation step in latent space, in a single forward pass. The training takes a few hours, but it must only be done once per text prompt. The direction of the manipulation step may vary depending on the starting position in $\mathcal{W}+$, which corresponds to the input image, and thus we refer to this mapper as local. Our experiments with the local mapper reveal that, for a wide variety of manipulations, the directions of the manipulation step are often similar to each other, despite different starting points. Also, since the manipulation step is performed in $\mathcal{W}+$, it is difficult to achieve fine-grained visual effects in a disentangled manner. Thus, in Section 6 we explore a third text-driven manipulation scheme, which transforms a given text prompt into an input-agnostic (i.e., global in latent space) mapping direction. The global direction is computed in StyleGAN's style space $\mathcal{S}$, which is better suited for fine-grained and disentangled visual manipulation, compared to $\mathcal{W}+$.

| | pre-processing / training time | inference time | input-image dependent | latent space |
|---|---|---|---|---|
| optimizer | – | 98 sec | yes | $\mathcal{W+}$ |
| mapper | 10 – 12 h (training) | 75 ms | yes | $\mathcal{W+}$ |
| global dir. | 4 h (pre-processing) | 72 ms | no | $\mathcal{S}$ |

Table 1: Our three methods for combining StyleGAN and CLIP. The latent step inferred by the optimizer and the mapper depends on the input image, but the training is only done once per text prompt. The global direction method requires a one-time pre-processing, after which it may be applied to different (image, text prompt) pairs. Times are for a single NVIDIA GTX 1080Ti GPU.

Table 1 summarizes the differences between the three methods outlined above, while visual results and comparisons are presented in the following sections.

## Latent Optimization

A simple approach for leveraging CLIP to guide image manipulation is through direct latent code optimization. Specifically, given a source latent code $w_s \in\mathcal{W}+$ , and a directive in natural language, or a text prompt $t$ , we solve the following optimization problem:

$$\operatorname*{arg\,min}_{w \in \mathcal{W}+} \; D_{\mathrm{CLIP}}(G(w), t) + \lambda_{L2} \left\| w - w_s \right\|_2 + \lambda_{\mathrm{ID}} L_{\mathrm{ID}}(w),$$

where $G$ is a pretrained StyleGAN generator (we use StyleGAN2 in all our experiments) and $D_{\mathrm{CLIP}}$ is the cosine distance between the CLIP embeddings of its two arguments. Similarity to the input image is controlled by the $L_2$ distance in latent space, and by the identity loss:


$$L_{\mathrm{ID}}(w) = 1 - \left\langle R(G(w_s)), R(G(w)) \right\rangle,$$

where $R$ is a pretrained ArcFace network for face recognition, and $\langle\cdot, \cdot\rangle$ computes the cosine similarity between its arguments. We solve this optimization problem through gradient descent, by back-propagating the gradient of the objective above through the pretrained and fixed StyleGAN generator $G$ and the CLIP image encoder. In Figure 3 we provide several edits that were obtained using this optimization approach after 200-300 iterations. The input images were inverted by e4e. Note that visual characteristics may be controlled explicitly (beard, blonde) or implicitly, by indicating a real or a fictional person (Beyonce, Trump, Elsa). The values of $\lambda_{L2}$ and $\lambda_{ID}$ depend on the nature of the desired edit. For changes that shift towards another identity, $\lambda_{ID}$ is set to a lower value.
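The loss being minimized can be sketched in code. The following is a minimal NumPy sketch of the objective; `generate`, `embed_image`, and `embed_face` are stand-ins for the StyleGAN generator, the CLIP image encoder, and the ArcFace network, and the default weight values are illustrative assumptions rather than the paper's tuned hyperparameters.

```python
import numpy as np

def cosine_distance(a, b):
    # D_CLIP: cosine distance between two embeddings
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def styleclip_opt_loss(w, w_s, t_emb, generate, embed_image, embed_face,
                       lam_l2=0.008, lam_id=0.005):
    """Loss minimized over w: CLIP distance + latent L2 term + identity term."""
    img = generate(w)
    clip_term = cosine_distance(embed_image(img), t_emb)
    l2_term = lam_l2 * float(np.linalg.norm(w - w_s))
    # identity loss: 1 - <R(G(w_s)), R(G(w))>, assuming unit-norm face embeddings
    id_term = lam_id * (1.0 - float(np.dot(embed_face(generate(w_s)),
                                           embed_face(img))))
    return clip_term + l2_term + id_term
```

In practice $w$ would be updated by gradient descent through a differentiable generator and CLIP encoder; in an autograd framework the update loop is a few additional lines.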


Figure 2: The architecture of our text-guided mapper (using the text prompt “surprised”, in this example). The source image (left) is inverted into a latent code w. Three separate mapping functions are trained to generate residuals (in blue) that are added to w to yield the target code, from which a pretrained StyleGAN (in green) generates an image (right), assessed by the CLIP and identity losses.

Figure 3: Edits of real celebrity portraits obtained by latent optimization. The driving text prompt and the (λL2,λID) parameters for each edit are indicated under the corresponding result.

## Latent Mapper

| | Mohawk | Afro | Bob-cut | Curly | Beyonce | Taylor Swift | Surprised | Purple hair |
|---|---|---|---|---|---|---|---|---|
| Mean | 0.82 | 0.84 | 0.82 | 0.84 | 0.83 | 0.77 | 0.79 | 0.73 |
| Std | 0.096 | 0.085 | 0.095 | 0.088 | 0.081 | 0.107 | 0.893 | 0.145 |

Table 2: Average cosine similarity between manipulation directions obtained from mappers trained using different text prompts.

The latent optimization described above is versatile, as it performs a dedicated optimization for each (source image, text prompt) pair. On the downside, several minutes of optimization are required to edit a single image, and the method is somewhat sensitive to the values of its parameters. Below, we describe a more efficient process, where a mapping network is trained, for a specific text prompt $t$ , to infer a manipulation step $M_t(w)$ in the $\mathcal{W+}$ space, for any given latent image embedding $w \in\mathcal{W}+$ .

#### Architecture

The architecture of our text-guided mapper is depicted in Figure 2. It has been shown that different StyleGAN layers are responsible for different levels of detail in the generated image. Consequently, it is common to split the layers into three groups (coarse, medium, and fine), and feed each group with a different part of the (extended) latent vector. We design our mapper accordingly, with three fully-connected networks, one for each group/part. The architecture of each of these networks is the same as that of the StyleGAN mapping network, but with fewer layers (4 rather than 8, in our implementation). Denoting the latent code of the input image as $w = (w_{c}, w_{m}, w_{f})$, the mapper is defined by

$$M_t(w) = (M^c_t(w_c), M^m_t(w_m), M^f_t(w_f)).$$
Note that one can choose to train only a subset of the three mappers. There are cases where it is useful to preserve some attribute level and keep the style codes in the corresponding entries fixed.
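The three-part design above can be sketched as follows. This is a minimal NumPy sketch for illustration only: the 512-dimensional codes, the LeakyReLU slope, and the weight initialization are assumptions, and a real implementation would use a trainable framework.

```python
import numpy as np

def make_mlp(dim, layers, rng):
    # fully-connected network, analogous to StyleGAN's mapping network but shallower
    return [rng.standard_normal((dim, dim)) * 0.01 for _ in range(layers)]

def run_mlp(net, x):
    for i, W in enumerate(net):
        x = x @ W
        if i < len(net) - 1:
            x = np.where(x > 0, x, 0.2 * x)  # LeakyReLU (slope assumed)
    return x

class LatentMapper:
    """M_t(w) = (M^c(w_c), M^m(w_m), M^f(w_f)): three separate 4-layer MLPs."""
    def __init__(self, dim=512, layers=4, seed=0):
        rng = np.random.default_rng(seed)
        self.nets = [make_mlp(dim, layers, rng) for _ in range(3)]

    def __call__(self, w_c, w_m, w_f):
        # each sub-mapper yields a residual for its part of the extended latent code
        return tuple(run_mlp(net, w) for net, w in zip(self.nets, (w_c, w_m, w_f)))
```

Training only a subset of the three sub-mappers, as noted above, corresponds to replacing the unused entries of the returned tuple with zeros.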

#### Losses

Our mapper is trained to manipulate the desired attributes of the image as indicated by the text prompt $t$, while preserving the other visual attributes of the input image. The CLIP loss, $L_{\mathrm{CLIP}}(w)$, guides the mapper to minimize the cosine distance in the CLIP latent space:
$$L_{\mathrm{CLIP}}(w) = D_{\mathrm{CLIP}}(G(w + M_t(w)), t),$$

where $G$ again denotes the pretrained StyleGAN generator. To preserve the visual attributes of the original input image, we minimize the $L_2$ norm of the manipulation step in the latent space. Finally, for edits that require identity preservation, we use the identity loss $L_{\mathrm{ID}}$ defined earlier.


Our total loss function is a weighted combination of these losses:

$$L(w) = L_{\mathrm{CLIP}}(w) + \lambda_{L2} \left\| M_t(w) \right\|_2 + \lambda_{\mathrm{ID}} L_{\mathrm{ID}}(w).$$

As before, when the edit is expected to change the identity, we do not use the identity loss. The parameter values we use for the examples in this paper are $\lambda_{L2} = 0.8$ and $\lambda_{ID} = 0.1$, except for the Trump manipulation in Figure 9, where we use $\lambda_{L2} = 2$ and $\lambda_{ID} = 0$. In Figure 4 we provide several examples of hair style edits, where a different mapper is used in each column. In all of these examples, the mapper succeeds in preserving the identity and most of the other visual attributes that are not related to hair. Note that the resulting hair appearance is adapted to the individual; this is particularly apparent in the Curly hair and Bob-cut hairstyle edits.

It should be noted that the text prompts are not limited to a single attribute at a time. Figure 5 shows four different combinations of hair attributes, straight/curly and short/long, each yielding the expected outcome. This degree of control has not been demonstrated by any previous method we are aware of.

Since the latent mapper infers a custom-tailored manipulation step for each input image, it is interesting to examine the extent to which the direction of the step in latent space varies over different inputs. To test this, we first invert the test set of CelebA-HQ using e4e. Next, we feed the inverted latent codes into several trained mappers and compute the cosine similarity between all pairs of the resulting manipulation directions. The mean and the standard deviation of the cosine similarity for each mapper are reported in Table 2. The table shows that even though the mapper infers manipulation steps that are adapted to the input image, in practice, the cosine similarity of these steps for a given text prompt is high, implying that their directions are not as different as one might expect.
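The direction-similarity experiment reported in Table 2 amounts to computing all pairwise cosine similarities between manipulation steps. A minimal NumPy sketch, assuming the steps have already been collected from a trained mapper and flattened into rows of a matrix:

```python
import numpy as np

def direction_similarity_stats(steps):
    """Mean and std of pairwise cosine similarity between manipulation directions.

    steps: (N, D) array, each row a flattened manipulation step M_t(w).
    """
    unit = steps / np.linalg.norm(steps, axis=1, keepdims=True)
    sims = unit @ unit.T                   # all pairwise cosine similarities
    iu = np.triu_indices(len(steps), k=1)  # keep each unordered pair once
    return float(sims[iu].mean()), float(sims[iu].std())
```

A mean close to 1 indicates that the mapper's steps point in nearly the same direction regardless of the input image, which motivates the global directions of the next section.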

Figure 4: Hair style edits using our mapper. The driving text prompts are indicated below each column. All input images are inversions of real images.

## Global Directions

While the latent mapper allows fast inference time, we find that it sometimes falls short when a fine-grained disentangled manipulation is desired. Furthermore, as we have seen, the directions of different manipulation steps for a given text prompt tend to be similar. Motivated by these observations, in this section we propose a method for mapping a text prompt into a single, global direction in StyleGAN's style space $\mathcal{S}$, which has been shown to be more disentangled than other latent spaces. Let $s \in \mathcal{S}$ denote a style code, and $G(s)$ the corresponding generated image. Given a text prompt indicating a desired attribute, we seek a manipulation direction $\Delta s$, such that $G(s + \alpha\Delta s)$ yields an image where that attribute is introduced or amplified, without significantly affecting other attributes. The manipulation strength is controlled by $\alpha$. Our high-level idea is to first use the CLIP text encoder to obtain a vector $\Delta t$ in CLIP's joint language-image embedding space, and then map this vector into a manipulation direction $\Delta s$ in $\mathcal{S}$. A stable $\Delta t$ is obtained from natural language, using prompt engineering, as described below. The corresponding direction $\Delta s$ is then determined by assessing the relevance of each style channel to the target attribute.

More formally, denote by $\mathcal{I}$ the manifold of image embeddings in CLIP's joint embedding space, and by $\mathcal{T}$ the manifold of its text embeddings. We distinguish between these two manifolds, because there is no one-to-one mapping between them: an image may contain a large number of visual attributes, which can hardly be comprehensively described by a single text sentence; conversely, a given sentence may describe many different images. During CLIP training, all embeddings are normalized to a unit norm, and therefore only the direction of an embedding contains semantic information, while the norm may be ignored. Thus, in well trained areas of the CLIP space, we expect directions on the $\mathcal{T}$ and $\mathcal{I}$ manifolds that correspond to the same semantic changes to be roughly collinear (i.e., have a large cosine similarity), and nearly identical after normalization.

Given a pair of images, $G(s)$ and $G(s+\alpha\Delta s)$, we denote their $\mathcal{I}$ embeddings by $i$ and $i + \Delta i$, respectively. Thus, the difference between the two images in CLIP space is given by $\Delta i$. Given a natural language instruction encoded as $\Delta t$, and assuming collinearity between $\Delta t$ and $\Delta i$, we can determine a manipulation direction $\Delta s$ by assessing the relevance of each channel in $\mathcal{S}$ to the direction $\Delta i$.

#### From natural language to Δt

In order to reduce text embedding noise, Radford et al. utilize a technique called prompt engineering, which feeds several sentences with the same meaning to the text encoder, and averages their embeddings. For example, for ImageNet zero-shot classification, a bank of 80 different sentence templates is used, such as a bad photo of a {}, a cropped photo of the {}, a black and white photo of a {}, and a painting of a {}. At inference time, the target class is automatically substituted into these templates to build a bank of sentences with similar semantics, whose embeddings are then averaged. This process improves zero-shot classification accuracy by an additional $3.5\%$ over using a single text prompt. Similarly, we also employ prompt engineering (using the same ImageNet prompt bank) in order to compute stable directions in $\mathcal{T}$. Specifically, our method should be provided with a text description of a target attribute and a corresponding neutral class. For example, when manipulating images of cars, the target attribute might be specified as a sports car, in which case the corresponding neutral class might be a car. Prompt engineering is then applied to produce the average embeddings for the target and the neutral class, and the normalized difference between the two embeddings is used as the target direction $\Delta t$.
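The computation of $\Delta t$ can be sketched as follows. The template list here is a small illustrative subset (the actual method uses the 80-template ImageNet bank), and `embed_text` stands in for the CLIP text encoder.

```python
import numpy as np

TEMPLATES = [  # illustrative subset of the ImageNet prompt bank
    "a photo of a {}",
    "a cropped photo of a {}",
    "a painting of a {}",
]

def average_embedding(class_text, embed_text):
    # embed every templated sentence, average, then renormalize
    embs = np.stack([embed_text(t.format(class_text)) for t in TEMPLATES])
    mean = embs.mean(axis=0)
    return mean / np.linalg.norm(mean)

def target_direction(target, neutral, embed_text):
    """Delta t: normalized difference between target and neutral class embeddings."""
    dt = average_embedding(target, embed_text) - average_embedding(neutral, embed_text)
    return dt / np.linalg.norm(dt)
```

For the car example above, one would call `target_direction("sports car", "car", embed_text)` to obtain the unit-norm direction $\Delta t$.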

#### Channelwise relevance

Next, our goal is to construct a style space manipulation direction $\Delta s$ that would yield a change $\Delta i$, collinear with the target direction $\Delta t$. For this purpose, we need to assess the relevance of each channel $c$ of $\mathcal{S}$ to a given direction $\Delta i$ in CLIP's joint embedding space. We generate a collection of style codes $s \in \mathcal{S}$, and perturb only the $c$-th channel of each style code by adding a negative and a positive value.

Denoting by $\Delta i_c$ the CLIP space direction between the resulting pair of images, the relevance of channel $c$ to the target manipulation is estimated as the mean projection of $\Delta i_c$ onto $\Delta i$ :

$$R_c(\Delta i) = \mathbb{E}_{s \in \mathcal{S}}\left\{ \Delta i_c \cdot \Delta i \right\}$$

In practice, we use 100 image pairs to estimate the mean. The pairs of images that we generate are given by $G(s \pm\alpha\Delta s_c)$ , where $\Delta s_c$ is a zero vector, except its $c$ coordinate, which is set to the standard deviation of the channel. The magnitude of the perturbation is set to $\alpha=5$ .
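The relevance estimate can be sketched as follows; `generate` and `embed_image` are stand-ins for the StyleGAN generator and the CLIP image encoder, and the per-pair direction is taken as the raw embedding difference for simplicity.

```python
import numpy as np

def channel_relevance(c, styles, delta_i, generate, embed_image,
                      channel_std, alpha=5.0):
    """Estimate R_c: mean projection of per-channel image changes onto delta_i."""
    projections = []
    for s in styles:
        step = np.zeros_like(s)
        step[c] = alpha * channel_std          # perturb only channel c
        pos = embed_image(generate(s + step))  # embedding of G(s + alpha*ds_c)
        neg = embed_image(generate(s - step))  # embedding of G(s - alpha*ds_c)
        projections.append(float(np.dot(pos - neg, delta_i)))
    return float(np.mean(projections))
```

In practice the loop runs over the 100 sampled style codes mentioned above, once per channel of $\mathcal{S}$.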

Figure 6 : Image manipulation driven by the prompt “grey hair” for different manipulation strengths and disentanglement thresholds. Moving along the Δs direction, causes the hair color to become more grey, while steps in the −Δs direction yields darker hair. The effect becomes stronger as the strength α increases. When the disentanglement threshold β is high, only the hair color is affected, and as β is lowered, additional correlated attributes, such as wrinkles and the shape of the face are affected as well.

Having estimated the relevance $R_c$ of each channel, we ignore channels whose $R_c$ falls below a threshold $\beta$. This parameter may be used to control the degree of disentanglement in the manipulation: using higher threshold values results in more disentangled manipulations, but at the same time the visual effect of the manipulation is reduced. Since various high-level attributes, such as age, involve a combination of several lower level attributes (for example, grey hair, wrinkles, and skin color), multiple channels are relevant, and in such cases lowering the threshold value may be preferable, as demonstrated in Figure 6. To our knowledge, the ability to control the degree of disentanglement in this manner is unique to our approach. In summary, given a target direction $\Delta i$ in CLIP space, we set

$$\Delta s_c = \begin{cases} \Delta i_c \cdot \Delta i & \text{if } |\Delta i_c \cdot \Delta i| \geq \beta \\ 0 & \text{otherwise} \end{cases}$$
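Assembling the global direction from the per-channel relevances is then a simple threshold, sketched below in NumPy (the final normalization is an assumption for convenience, so that $\alpha$ alone controls the manipulation strength):

```python
import numpy as np

def global_direction(relevances, beta):
    """Zero out channels whose relevance is below the disentanglement threshold."""
    r = np.asarray(relevances, dtype=float)
    ds = np.where(np.abs(r) >= beta, r, 0.0)
    norm = np.linalg.norm(ds)
    return ds / norm if norm > 0 else ds

def apply_direction(s, ds, alpha):
    # manipulated style code: G(s + alpha * ds) renders the edited image
    return s + alpha * ds
```

Raising `beta` zeroes out more channels, trading manipulation strength for disentanglement, exactly as illustrated in Figure 6.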

Figures 7 and 8 show a variety of edits along text-driven manipulation directions determined as described above, on images of faces, cars, and dogs. The manipulations in Figure 7 are performed using StyleGAN2 pretrained on FFHQ. The inputs are real images, embedded in $\mathcal{W+}$ space using the e4e encoder. The figure demonstrates text-driven manipulations of 18 attributes, including complex concepts, such as facial expressions and hair styles. The manipulations in Figure 8 use StyleGAN2 pretrained on LSUN cars (applied to real images) and StyleGAN2-ada pretrained on AFHQ dogs (applied to generated images).


Figure 7: A variety of edits along global text-driven manipulation directions, demonstrated on portraits of celebrities. Edits are performed using StyleGAN2 pretrained on FFHQ [karras2019style]. The inputs are real images, embedded in W+ space using the e4e encoder [tov2021designing]. The target attribute used in the text prompt is indicated above each column.

Figure 8: A variety of edits along global text-driven manipulation directions. Left: using StyleGAN2 pretrained on LSUN cars [yu2015lsun]. Right: using StyleGAN2-ada [karras2020training] pretrained on AFHQ dogs [choi2020stargan]. The target attribute used in the text prompt is indicated above each column.

## Comparisons and Evaluation

We now turn to compare the three methods presented and analyzed in the previous sections among themselves and to other methods. All the real images that we manipulate are inverted using the e4e encoder [tov2021designing].

Text-driven image manipulation methods: We begin by comparing several text-driven facial image manipulation methods in Figure 9. We compare our latent mapper method (Section 5), our global direction method (Section 6), and TediGAN [xia2020tedigan]. For TediGAN, we use the authors’ official implementation, which has been recently updated to utilize CLIP for image manipulation, and thus is somewhat different from the method presented in their paper. We do not include results of the optimization method presented in Section 4, since its sensitivity to hyperparameters makes it time-consuming, and therefore not scalable.

We perform the comparison using three kinds of attributes, ranging from complex yet specific (e.g., “Trump”), through less complex and less specific (e.g., “Mohawk”), to simpler and more common (e.g., “without wrinkles”). The complex “Trump” manipulation involves several attributes, such as blonde hair, squinting eyes, open mouth, a somewhat swollen face, and Trump’s identity. While a global latent direction is able to capture the main visual attributes, which are not specific to Trump, it fails to capture the specific identity. In contrast, the latent mapper is more successful. The “Mohawk hairstyle” is a less complex attribute, as it involves only hair, and it isn’t as specific. Thus, both our methods are able to generate satisfactory manipulations. The manipulation generated by the global direction is slightly less pronounced, since the direction in CLIP space is an average one. Finally, for the “without wrinkles” prompt, the global direction succeeds in removing the wrinkles, while keeping other attributes mostly unaffected, while the mapper fails. We attribute this to $\mathcal{W+}$ being less disentangled. We observed similar behavior on another set of attributes (“Obama”, “Angry”, “beard”). We conclude that for complex and specific attributes (especially those that involve identity), the mapper is able to produce better manipulations. For simpler and/or more common attributes, a global direction suffices, while offering more disentangled manipulations. We note that the results produced by TediGAN fail in all three manipulations shown in Figure 9.

Other StyleGAN manipulation methods: In Figure 10, we show a comparison between our global direction method and several state-of-the-art StyleGAN image manipulation methods: GANSpace [harkonen2020ganspace], InterFaceGAN [shen2020interfacegan], and StyleSpace [wu2020stylespace]. The comparison only examines the attributes that all of the compared methods are able to manipulate (Gender, Grey hair, and Lipstick), and thus it does not include the many novel manipulations enabled by our approach. Since all of these are common attributes, we do not include our mapper in this comparison. Following Wu et al. [wu2020stylespace], the manipulation step strength is chosen such that it induces the same amount of change in the logit value of the corresponding classifiers (pretrained on CelebA).

It may be seen that in GANSpace [harkonen2020ganspace] manipulation is entangled with skin color and lighting, while in InterFaceGAN [shen2020interfacegan] the identity may change significantly (when manipulating Lipstick). Our manipulation is very similar to StyleSpace [wu2020stylespace], which only changes the target attribute, while all other attributes remain the same.

In the supplementary material, we also show a comparison with StyleFlow [abdal2020styleflow], a state-of-the-art non-linear method. Our method produces results of similar quality, despite the fact that StyleFlow simultaneously uses several attribute classifiers and regressors (from the Microsoft face API), and can thus manipulate only a limited set of attributes. In contrast, our method requires no extra supervision.

Figure 9: We compare three methods that utilize StyleGAN and CLIP using three different kinds of attributes.

Figure 10: Comparison with state-of-the-art methods using the same amount of manipulation according to a pretrained attribute classifier.

#### Limitations.

Our method relies on a pretrained StyleGAN generator and a CLIP model for a joint language-vision embedding. Thus, it cannot be expected to manipulate images to a point where they lie outside the domain of the pretrained generator (or remain inside the domain, but in regions that are less well covered by the generator). Similarly, text prompts which map into areas of CLIP space that are not well populated by images cannot be expected to yield a visual manipulation that faithfully reflects the semantics of the prompt. We have also observed that drastic manipulations in visually diverse datasets are difficult to achieve. For example, while tigers are easily transformed into lions (see Figure 1), we were less successful when transforming tigers to wolves, as demonstrated in the supplementary material.

## Conclusions

We introduced three novel image manipulation methods, which combine the strong generative powers of StyleGAN with the extraordinary visual concept encoding abilities of CLIP. We have shown that these techniques enable a wide variety of unique image manipulations, some of which are impossible to achieve with existing methods that rely on annotated data. We have also demonstrated that CLIP provides fine-grained edit controls, such as specifying a desired hair style, while our method is able to control the manipulation strength and the degree of disentanglement. In summary, we believe that text-driven manipulation is a powerful image editing tool, whose abilities and importance will only continue to grow.

## Appendix A: Latent Mapper – Ablation Study

In this section, we study the importance of various choices in the design of our latent mapper (Section 5).

### Architecture

The architecture of the mapper is rather simple, with a relatively small number of parameters, and it has a negligible effect on the inference time. Still, it is natural to compare the presented architecture, which consists of three separate mapping networks, to an architecture with a single mapping network. Intuitively, using a separate network for each group of style vector entries should better enable changes at several different levels of detail in the image. Indeed, we find that for driving text that requires such changes, e.g., Donald Trump, a single mapping network does not yield results as effective as those produced with three. An example is shown in Figure ablation-mapper.

Although the full three-network mapper gives better results for some driving texts, as mentioned in Section 5, not all three networks are needed when the manipulation should not affect certain attributes. For example, the hairstyle edits shown in Figure 5 should not affect the color scheme of the image. Therefore, we perform these edits by training only $M^c$ and $M^m$ , that is, $M_t(w) = (M^c_t(w_c), M^m_t(w_m), 0)$ . We show a comparison in Figure ablation-hair. As can be seen, removing $M_f$ from the architecture yields slightly better results. Nevertheless, for the sake of simplicity and generality, we chose to describe the method with all three networks. The results shown in the main paper were obtained with all three networks, while here we also show results with only two (without $M_f$ ).
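The coarse/medium/fine split and the $M_f = 0$ variant can be sketched as follows. This is a schematic numpy stand-in, not the trained PyTorch model: the tiny random-weight MLPs replace the trained mapping networks, and the 4/4/10 split of the 18 W+ layers into coarse, medium, and fine groups is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp(dims):
    """A tiny fully-connected network; random weights stand in for trained ones."""
    Ws = [rng.normal(0.0, 0.02, (a, b)) for a, b in zip(dims[:-1], dims[1:])]
    def f(x):
        for W in Ws[:-1]:
            x = np.maximum(x @ W, 0.0)  # ReLU
        return x @ Ws[-1]
    return f

# Hypothetical split of the 18 W+ layers into coarse / medium / fine groups.
GROUPS = {"coarse": slice(0, 4), "medium": slice(4, 8), "fine": slice(8, 18)}

class LatentMapper:
    """Predicts a manipulation step for a W+ latent, one network per group."""
    def __init__(self, dim=512, use_fine=True):
        self.nets = {g: mlp([dim, dim, dim]) for g in GROUPS}
        self.use_fine = use_fine
    def __call__(self, w):  # w: (18, dim) latent code in W+
        out = np.zeros_like(w)
        for g, sl in GROUPS.items():
            if g == "fine" and not self.use_fine:
                continue  # M_f omitted: fine (color) layers get a zero step
            out[sl] = self.nets[g](w[sl])
        return out

w = rng.normal(size=(18, 512))
delta = LatentMapper(use_fine=False)(w)  # the hairstyle-edit variant
```

Disabling the fine network leaves the corresponding style entries untouched, which is exactly why the color scheme of the image is preserved in the hairstyle edits.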

### Losses

#### CLIP Loss

To demonstrate the unique power of CLIP for performing a "celeb edit", we conduct the following experiment: instead of using the CLIP loss, we use the identity loss with respect to a single image of the desired celebrity. Specifically, we perform this experiment using an image of Beyoncé. The results are shown in Figure 12. As can be seen, CLIP guides the mapper to perform a unique edit which cannot be achieved by simply using a facial recognition network.

#### ID Loss

Here we show that the identity loss is important for preserving the identity of the person in the input image. When using the default setting of $\lambda_{L2} = 0.8$ with $\lambda_{ID} = 0$ (i.e., no identity loss), we observe that the mapper fails to preserve the identity and introduces large changes. We therefore also experiment with $\lambda_{L2} = 1.6$ ; however, this still does not preserve the original identity well enough. The results are shown in Figure ablation-id.
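The weights discussed in this ablation enter a combined objective: a CLIP term on the edited latent, an L2 penalty on the manipulation step, and the identity term. The following is a schematic version only; `clip_loss` and `id_loss` are placeholders for the CLIP-similarity and face-recognition losses, and the default `lam_id` value is an assumption (only $\lambda_{L2} = 0.8$ is stated above).

```python
import numpy as np

def mapper_loss(w, delta, clip_loss, id_loss, lam_l2=0.8, lam_id=0.1):
    """Schematic mapper objective: CLIP loss on the edited latent, an L2
    penalty on the step size, and an identity-preservation term.
    Setting lam_id = 0 reproduces the no-identity-loss ablation above."""
    edited = w + delta
    return clip_loss(edited) + lam_l2 * np.linalg.norm(delta) + lam_id * id_loss(edited)

# Toy check with zeroed CLIP/ID terms: only the L2 penalty remains.
w = np.zeros(4)
delta = np.array([3.0, 0.0, 4.0, 0.0])  # ||delta|| = 5
loss = mapper_loss(w, delta, clip_loss=lambda x: 0.0, id_loss=lambda x: 0.0)
# loss == 0.8 * 5 == 4.0
```

Raising `lam_l2` (as in the $\lambda_{L2} = 1.6$ experiment) shrinks the step but, as the ablation shows, does not by itself preserve identity.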

## Appendix B: Additional Results

In this section we provide additional results to those presented in the paper. We begin with a variety of image manipulations obtained using our latent mapper. All manipulated images are taken from CelebA-HQ and were inverted by e4e. In Figure supp-hair we show a large gallery of hairstyle manipulations. In Figures supp-women and supp-men we show celeb edits, where the input image is manipulated to resemble a certain target celebrity. In Figure supp-expressions we show a variety of expression edits.

Next, Figure alex-nonfaces-supp shows a variety of edits on non-face datasets, performed along text-driven global latent manipulation directions (Section 6).

Figure disentanglement_strength2 shows image manipulations driven by the prompt "a photo of a male face" for different manipulation strengths and disentanglement thresholds. Moving along the global direction causes the facial features to become more masculine, while steps in the opposite direction yield more feminine features. The effect becomes stronger as the strength $\alpha$ increases. When the disentanglement threshold $\beta$ is high, only the facial features are affected; as $\beta$ is lowered, additional correlated attributes, such as hair length and facial hair, are affected as well.

In Figure compare_linear2, we show another comparison between our global direction method and several state-of-the-art StyleGAN image manipulation methods: GANSpace, InterFaceGAN, and StyleSpace. The comparison examines only the attributes that all of the compared methods are able to manipulate (Gender, Grey hair, and Lipstick), and thus it does not include the many novel manipulations enabled by our approach. Following Wu et al., the manipulation step strength is chosen such that it induces the same amount of change in the logit value of the corresponding classifiers (pretrained on CelebA). It may be seen that in GANSpace the manipulation is entangled with skin color and lighting, while in InterFaceGAN the identity may change significantly (when manipulating Lipstick). Our manipulation is very similar to StyleSpace, which changes only the target attribute, while all other attributes remain the same.

Figure styleflow shows a comparison between StyleFlow and our global directions method. It may be seen that our method is able to produce results of comparable visual quality, despite the fact that StyleFlow requires the simultaneous use of several attribute classifiers and regressors (from the Microsoft face API), and is thus able to manipulate only a limited set of attributes. In contrast, our method required no extra supervision to produce these and all of the other manipulations demonstrated in this work.

Figure global-vs-mapper-supp shows an additional comparison between text-driven manipulation using our global directions method and our latent mapper. Our observations are similar to the ones we made regarding Figure 10 in the main paper.

Finally, Figure tiger demonstrates that drastic manipulations in visually diverse datasets are sometimes difficult to achieve using our global directions. Here we use StyleGAN-ada pretrained on AFHQ wild, which contains wolves, lions, tigers, and foxes. The domain gap between tigers and lions is relatively small, mainly involving color and texture transformations. The gap between tigers and wolves is larger, since in addition to color and texture transformations, it also involves more drastic shape deformations. This figure demonstrates that our global directions method is more successful in transforming tigers into lions, while failing in some cases to transform tigers into wolves.
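For concreteness, the $\alpha$/$\beta$ parameterization of a global-direction edit can be sketched as follows. This is a minimal numpy illustration under the assumption that the edit is applied channel-wise in style space; `s` and `delta_s` are toy stand-ins for a style code and a text-driven direction.

```python
import numpy as np

def apply_global_direction(s, delta_s, alpha, beta):
    """Edit a style code s along the text-driven direction delta_s.
    alpha controls the manipulation strength; channels whose relevance
    |delta_s| falls below the disentanglement threshold beta are zeroed,
    so a higher beta touches fewer channels (a more disentangled edit)."""
    step = np.where(np.abs(delta_s) >= beta, delta_s, 0.0)
    return s + alpha * step

s = np.zeros(6)                                  # toy style code
d = np.array([0.9, 0.1, -0.5, 0.05, 0.3, -0.8])  # toy direction
edited = apply_global_direction(s, d, alpha=3.0, beta=0.4)
# only the three channels with |delta_s| >= 0.4 are moved, scaled by alpha
```

Negative values of `alpha` move in the opposite direction (e.g., more feminine features for the "male face" prompt above).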


## Video

We show examples of interactive text-driven image manipulation in our supplementary video. We use a simple heuristic method to determine the initial disentanglement threshold ( $\beta$ ). The threshold is chosen such that $k$ channels will be active. For real face manipulation, we set the initial strength to $\alpha=3$ and the disentanglement threshold so that $k=20$ . For real car manipulation, we set the initial values to $\alpha=3$ and $k=100$ . For generated cat manipulation, we set the initial values to $\alpha=7$ and $k=100$ .
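The heuristic above, choosing $\beta$ so that $k$ channels remain active, amounts to taking the $k$-th largest channel magnitude of the direction as the threshold. A minimal numpy sketch (the tie-free assumption is noted in the comment):

```python
import numpy as np

def beta_for_k_active(delta_s, k):
    """Return the disentanglement threshold beta such that exactly the k
    channels with the largest |delta_s| survive the >= beta cutoff
    (assuming no ties at the k-th magnitude)."""
    mags = np.sort(np.abs(delta_s))[::-1]  # magnitudes, descending
    return mags[k - 1]

d = np.array([0.9, 0.1, -0.5, 0.05, 0.3, -0.8])  # toy direction
beta = beta_for_k_active(d, k=3)
active = int(np.sum(np.abs(d) >= beta))  # exactly k channels remain active
```

For the interactive sessions in the video, the same computation with $k=20$ (faces) or $k=100$ (cars, cats) yields the initial $\beta$, which the user can then adjust.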