DALL-E
Ground-breaking machine learning news
Introduction:
It was a Tuesday morning. I woke up, and while sipping my morning cup of coffee, found out that OpenAI had dropped a bomb again. This time they didn't stop at language: they built a neural network architecture that takes a text prompt and creates an image for that text. It sent another ripple through the data science and deep learning communities within days, and within 5 days thousands of pieces, from news reports to technical articles, had been written about DALL-E. So here is my take on DALL-E. Sit back and enjoy!
Summary:
In this article, we are going to go through a basic-to-intermediate technical review of the DALL-E and CLIP neural networks. This is not a purely technical article, as it caters to both ML and non-ML readers alike who want to learn about this awesome new thing.
What is DALL-E?
DALL-E is a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions. While OpenAI may have started from concepts in previous research that took detailed text captions and generated somewhat meaningful images, DALL-E reaches a very different level. DALL-E can
(1) create anthropomorphic versions of non-living things,
(2) combine unrelated concepts and seemingly impossible ideas into plausible images, and
(3) take existing images and change, transform and recompose them into new images.
So DALL-E, while not a totally novel concept as some people may end up claiming, is a very novel result on an old question.
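For the more hands-on readers, the core idea of "a GPT-3-style model that outputs images" can be sketched in a few lines of PyTorch. This is only a toy illustration, assuming (as OpenAI describes) that an image is first turned into a grid of discrete tokens; the vocabulary sizes, sequence lengths and model dimensions below are illustrative placeholders, not the real model's.

```python
# Toy sketch: one autoregressive transformer over text tokens followed by
# image tokens. The tokenizers are NOT shown; all sizes are illustrative only.
import torch
import torch.nn as nn

TEXT_VOCAB, IMAGE_VOCAB = 16384, 8192   # assumed vocabulary sizes
TEXT_LEN, IMAGE_LEN = 256, 1024         # e.g. 1024 = a 32x32 grid of image tokens

class TinyTextToImage(nn.Module):
    def __init__(self, dim=512, heads=8, layers=4):
        super().__init__()
        self.embed = nn.Embedding(TEXT_VOCAB + IMAGE_VOCAB, dim)
        self.pos = nn.Embedding(TEXT_LEN + IMAGE_LEN, dim)
        block = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(block, layers)
        self.to_logits = nn.Linear(dim, TEXT_VOCAB + IMAGE_VOCAB)

    def forward(self, tokens):
        # tokens: (batch, seq) -- text tokens first, then image tokens
        seq = tokens.shape[1]
        x = self.embed(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        # causal mask: each position may only attend to earlier positions
        mask = torch.triu(torch.ones(seq, seq, device=tokens.device), 1).bool()
        return self.to_logits(self.transformer(x, mask=mask))
```

At generation time, the text tokens are fed in and the image tokens are then sampled one by one and decoded back into pixels.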
The obvious thing: some examples of DALL-E images:
Everyone reading about DALL-E wants to see one thing first, and yes, you got that right: some images created by DALL-E.
(1) Anthropomorphic: an illustration of a baby daikon radish in a tutu walking a dog
(2) Joining unrelated concepts to create the most plausible image:
(3) Transformation: a cat with and without a hat on top
cat image:
cat with a hat on top!
What types of images have we seen DALL-E generate?
Now, although we don't yet know everything about DALL-E, as the model has not been released openly, according to OpenAI these are the types of images DALL-E creates:
(1) Combining unrelated concepts: for example, an armchair in the shape of an avocado
(2) Animal illustrations: creation of anthropomorphic images
(3) Zero-shot visual reasoning: for example, creating a cat with a hat on top
(4) Geographical knowledge: DALL-E supposedly even knows how to create images specific to geographical locations. For example, look at the following images when we try to generate food from India:
(5) Temporal knowledge: OpenAI claims that DALL-E also knows how to render concepts that vary over time. Look at its attempt to create time-varying images.
Text prompt: a photo of an action movie poster from the...
Finally, these are all claims that people will be able to verify thoroughly once DALL-E's APIs come out in the open.
Technical concepts behind DALL-E:
DALL-E was not the first attempt in the long line of research on text-to-image generation. The revolution started back in 2016 with Scott Reed et al.'s groundbreaking work. In Reed's paper, for the first time, GAN models generated specific images depicting detailed text captions. Normal GAN models are trained to generate fake images until they get really good at producing images that look like the originals. But to generate text-specific images, researchers figured out ways to condition the image generation on the text input.
The idea traces back to Karpathy et al.'s 2015 paper on generating text captions for images, which conditioned LSTMs on the output of a deep convolutional network applied to the image. Building on that idea, and on 2015 work on training recurrent autoencoder models to create images, Reed et al. took the following approach:
"Our approach is to train a deep convolutional generativeadversarial network (DC-GAN) conditioned on text fea-tures encoded by a hybrid character-level convolutional-recurrent neural network. Both the generator networkGand the discriminator networkDperform feed-forward in-ference conditioned on the text feature."
Now, this was still a GAN model at its core. In 2017, Han Zhang et al., who had created the first StackGAN (a stacking of multiple GAN models), proposed the StackGAN++ architecture. This architecture is best described by quoting the abstract of the paper:
"Abstract—Although Generative Adversarial Networks (GANs) have shown remarkable success in various tasks, they still facechallenges in generating high quality images. In this paper, we propose Stacked Generative Adversarial Networks (StackGANs)aimed at generating high-resolution photo-realistic images. First, we propose a two-stage generative adversarial network architecture,StackGAN-v1, for text-to-image synthesis. The Stage-I GAN sketches the primitive shape and colors of a scene based on a given textdescription, yielding low-resolution images. The Stage-II GAN takes Stage-I results and the text description as inputs, and generateshigh-resolution images with photo-realistic details. Second, an advanced multi-stage generative adversarial network architecture,StackGAN-v2, is proposed for both conditional and unconditional generative tasks. Our StackGAN-v2 consists of multiple generatorsand multiple discriminators arranged in a tree-like structure; images at multiple scales corresponding to the same scene are generatedfrom different branches of the tree. StackGAN-v2 shows more stable training behavior than StackGAN-v1 by jointly approximatingmultiple distributions. Extensive experiments demonstrate that the proposed stacked generative adversarial networks significantlyoutperform other state-of-the-art methods in generating photo-realistic images."
Finally, DALL-E builds on all of this. According to the original OpenAI blog post:
The embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP. StackGAN and StackGAN++ are used for creating images and for increasing the visual detail and resolution of the created images.
On top of this, AttnGAN incorporates attention between the text and image features, and proposes a contrastive text-image feature matching loss as an auxiliary objective.
Similar to the rejection sampling used in VQ-VAE-2, here CLIP is used to rerank the top 32 of 512 samples for each caption in all of the interactive visuals.
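Before moving on, a quick note on what "pretrained using a contrastive loss" means in practice. The sketch below shows a CLIP-style symmetric contrastive objective; the image and text encoders themselves are assumed to exist and are not shown.

```python
# Minimal sketch of a CLIP-style contrastive loss: embeddings of matching
# image-text pairs are pulled together, mismatched pairs are pushed apart.
import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    """image_emb, text_emb: (batch, dim) embeddings of matching image-text pairs."""
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.T / temperature               # (batch, batch) similarities
    targets = torch.arange(len(logits), device=logits.device)   # matching pairs on the diagonal
    # symmetric cross-entropy over image->text and text->image directions
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```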
So basically, in DALL-E, GPT-3-style text embeddings and ideas from AttnGAN, StackGAN++ and CLIP come together and are applied jointly. While we have described some parts of this pipeline, AttnGAN and CLIP are more complex, so we only touch on them briefly; a rough sketch of the CLIP re-ranking step follows below.
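Here is that re-ranking step sketched with OpenAI's open-source clip package. Since DALL-E's sampler is not public, candidate_images is simply assumed to be a list of already-generated PIL images (e.g. 512 samples for one caption).

```python
# Sketch of CLIP re-ranking: score each candidate image against the caption
# with CLIP's image-text similarity and keep only the best few.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def rerank(caption, candidate_images, keep=32):
    """candidate_images: list of PIL images, e.g. 512 samples for one caption."""
    image_batch = torch.stack([preprocess(img) for img in candidate_images]).to(device)
    text_tokens = clip.tokenize([caption]).to(device)
    with torch.no_grad():
        image_feats = model.encode_image(image_batch)
        text_feats = model.encode_text(text_tokens)
        image_feats = image_feats / image_feats.norm(dim=-1, keepdim=True)
        text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)
        scores = (image_feats @ text_feats.T).squeeze(1)   # cosine similarity per image
    best = scores.topk(keep).indices.tolist()
    return [candidate_images[i] for i in best]
```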
Conclusion:
In this article, we discussed the wonderful new research on text-to-image generation, looked at the cute little images, and finally went over the technical concepts behind this research at a high level. As we move forward, let's hope we get to play with the API for this awesome new tool and can then explore its different use cases.
Thanks for reading! See you again with exciting new posts about AI and machine learning. If you liked this piece, consider subscribing to get this and more like it straight to your email!
References:
(1) Reed et al., Generative Adversarial Text to Image Synthesis
(2) Zhang et al., StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks
(3) Zhang et al., StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks