Hebrew University of Jerusalem, Jerusalem, Yerushalayim, Israel
Vision + language
Image and video synthesis; Neural generative models; Vision applications and systems
Inspired by the ability of StyleGAN to generate highly realistic images in a variety of domains, much recent work has focused on understanding how to use the latent spaces of StyleGAN to manipulate generated and real images. However, discovering semantically meaningful latent manipulations typically involves painstaking human examination of the many degrees of freedom, or an annotated collection of images for each desired manipulation. In this work, we explore leveraging the power of recently introduced Contrastive Language-Image Pre-training (CLIP) models in order to develop a text-based interface for StyleGAN image manipulation that does not require such manual effort. We first introduce an optimization scheme that utilizes a CLIP-based loss to modify an input latent vector in response to a user-provided text prompt. Next, we describe a latent mapper that infers a text-guided latent manipulation step for a given input image, allowing faster and more stable text-based manipulation. Finally, we present a method for mapping text prompts to input-agnostic directions in StyleGAN's style space, enabling interactive text-driven image manipulation. Extensive results and comparisons demonstrate the effectiveness of our approaches.
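The first approach, optimizing a latent vector against a CLIP-based loss, can be illustrated with a toy sketch. Everything below is an assumption made for illustration rather than the method's actual implementation. A fixed linear map `embed` stands in for "generate an image with StyleGAN, then encode it with CLIP". The vector `t` stands in for the CLIP embedding of the text prompt. The loss is taken to be the cosine distance between the two embeddings, plus an L2 penalty weighted by `lam` that keeps the edited latent close to the source latent. The function names and default hyperparameters are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def matvec(m, v):
    return [dot(row, v) for row in m]

def cosine_distance(e, t):
    return 1.0 - dot(e, t) / (math.sqrt(dot(e, e)) * math.sqrt(dot(t, t)))

def loss_and_grad(w, w_src, embed, t, lam):
    """Toy loss: cosine distance to the text embedding + lam * ||w - w_src||^2."""
    e = matvec(embed, w)  # stand-in for CLIP(G(w)); an assumption of this sketch
    ne, nt = math.sqrt(dot(e, e)), math.sqrt(dot(t, t))
    cos = dot(e, t) / (ne * nt)
    # Gradient of (1 - cos) with respect to the embedding e.
    grad_e = [-(ti / (ne * nt) - cos * ei / (ne * ne)) for ei, ti in zip(e, t)]
    # Chain rule through the linear map: embed^T @ grad_e.
    grad_w = [sum(embed[i][j] * grad_e[i] for i in range(len(embed)))
              for j in range(len(w))]
    diff = [a - b for a, b in zip(w, w_src)]
    loss = (1.0 - cos) + lam * dot(diff, diff)
    grad = [g + 2.0 * lam * d for g, d in zip(grad_w, diff)]
    return loss, grad

def optimize_latent(w_src, embed, t, lam=0.5, lr=0.05, steps=400):
    """Plain gradient descent on the latent, starting from the source latent."""
    w = list(w_src)
    for _ in range(steps):
        _, g = loss_and_grad(w, w_src, embed, t, lam)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w

if __name__ == "__main__":
    embed = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]  # hypothetical 3-D latent -> 2-D embedding
    w_src = [1.0, 0.0, 0.0]                     # hypothetical source latent
    t = [1.0, 1.0]                              # hypothetical text embedding
    w_edit = optimize_latent(w_src, embed, t)
    print("cosine distance before:", cosine_distance(matvec(embed, w_src), t))
    print("cosine distance after: ", cosine_distance(matvec(embed, w_edit), t))
```

In a real setting the linear map would be replaced by a pretrained generator and a CLIP image encoder, and the gradient would come from automatic differentiation rather than the hand-derived expression used here. The proximity weight `lam` controls the trade-off: larger values keep the result closer to the input, smaller values allow a stronger edit.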