
What is DALL-E and how does it create images out of text?

DALL-E

A ground-breaking piece of machine learning news

DALL-E: images generated from text prompts

Introduction:

It was a Tuesday morning. I woke up, and while sipping my morning cup of coffee, found out that OpenAI had dropped a bomb again. This time they didn't stop with language: they created a neural network architecture that takes a text prompt and creates an image for that text. It created another ripple within the data science and deep learning communities, and within five days there were thousands of pieces, from news reports to technical articles, written about DALL-E. So here is my take on DALL-E. Sit back and enjoy!

Summary:

In this article, we are going to go through a basic-to-intermediate technical review of the DALL-E and CLIP neural networks. This is not a purely technical article, as it aims to cater to both ML and non-ML readers alike.

What is DALL-E?

DALL-E is a 12-billion-parameter version of GPT-3 trained to generate images from text descriptions. While OpenAI may have started from concepts in previous research that took detailed text captions and generated somewhat meaningful images, DALL-E reaches a very different level. DALL-E can:

(1) create anthropomorphic versions of non-living things, 

(2) combine unrelated concepts, even impossible ones, into the most plausible images, and

(3) take existing images and change, transform and recompose them into new images.

So DALL-E, while not a totally novel concept as some people may end up claiming, is a very novel result for an old question.

The obvious thing: some examples of DALL-E images

Everyone reading about DALL-E wants to see only one thing. Yes, you got that right: some images created by DALL-E.

(1) Anthropomorphic: an illustration of a baby daikon radish in a tutu walking a dog

 

DALL-E generated image: a baby daikon radish in a tutu walking a dog

(2) Joining unrelated concepts to create the most plausible image:

DALL-E generated image: an armchair shaped like an avocado

(3) Transformation: a cat, with and without a hat on top

Original cat image

The same cat, now with a hat on top!

What types of images have we seen DALL-E generate?

Now, although we don't know everything about DALL-E, as the model is yet to be released openly, according to OpenAI these are the types of images DALL-E creates:

(1) Combining unrelated concepts: for example, the avocado-shaped armchair

(2) Animal illustrations: creation of anthropomorphic images

(3) Zero-shot visual reasoning: for example, the cat with a hat on top

(4) Geographical knowledge: DALL-E supposedly even knows how to create images specific to geographical locations. For example, look at the images generated when we try to create food from India:

 

(5) Temporal knowledge: OpenAI claims that DALL-E even knows how to create images of concepts that vary over time. Look at its attempt to create time-varying images.

Text prompt: a photo of an action movie poster from the...

DALL-E generated action movie posters from different time periods

Finally, these are all claims which people will be able to verify thoroughly once DALL-E's APIs come out in the open.

Technical concepts behind DALL-E:

DALL-E was not the first result in the long-running research on text-to-image creation. The revolution started back in 2016 with Scott Reed et al.'s work. In Reed's paper, for the first time, GAN models used a detailed caption to create specific images matching the text description. Normal GAN models are trained to generate fake images until they get really good at producing images that look like the real ones. But for generating text-specific images, researchers figured out ways to condition the generated image on the text input.

The idea first came from Karpathy et al.'s 2015 paper on generating text captions for images, which conditioned LSTMs on the output of a deep convolutional network applied to the image. Building on that idea, and on 2015 work training recurrent autoencoder models to create images, Reed et al. took the following approach:

"Our approach is to train a deep convolutional generativeadversarial network (DC-GAN) conditioned on text fea-tures encoded by a hybrid character-level convolutional-recurrent neural network. Both the generator networkGand the discriminator networkDperform feed-forward in-ference conditioned on the text feature."

Now, this was still a GAN model at its core. In 2017, Han Zhang et al., who created the first StackGAN (a stacking of multiple GAN models), created the StackGAN++ architecture. This architecture is best described by quoting the abstract of the paper:

"Abstract—Although Generative Adversarial Networks (GANs) have shown remarkable success in various tasks, they still facechallenges in generating high quality images. In this paper, we propose Stacked Generative Adversarial Networks (StackGANs)aimed at generating high-resolution photo-realistic images. First, we propose a two-stage generative adversarial network architecture,StackGAN-v1, for text-to-image synthesis. The Stage-I GAN sketches the primitive shape and colors of a scene based on a given textdescription, yielding low-resolution images. The Stage-II GAN takes Stage-I results and the text description as inputs, and generateshigh-resolution images with photo-realistic details. Second, an advanced multi-stage generative adversarial network architecture,StackGAN-v2, is proposed for both conditional and unconditional generative tasks. Our StackGAN-v2 consists of multiple generatorsand multiple discriminators arranged in a tree-like structure; images at multiple scales corresponding to the same scene are generatedfrom different branches of the tree. StackGAN-v2 shows more stable training behavior than StackGAN-v1 by jointly approximatingmultiple distributions. Extensive experiments demonstrate that the proposed stacked generative adversarial networks significantlyoutperform other state-of-the-art methods in generating photo-realistic images."

Finally, DALL-E builds on all of these. According to the original OpenAI blog post:

The embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP. StackGAN and StackGAN++ are used for creating the images, and for increasing the visual detail and resolution of the created images.

On top of this, AttnGAN incorporates attention between the text and image features, and proposes a contrastive text-image feature matching loss as an auxiliary objective. 

Similar to the rejection sampling used in VQVAE-2, CLIP is used here to rerank the top 32 of 512 samples for each caption in all of the interactive visuals.

So basically, in DALL-E, ideas from GPT-3 text embeddings, AttnGAN, CLIP and StackGAN++ come together and are applied jointly. While we described some parts of this pipeline, AttnGAN and CLIP are more complex, so we leave their details out of this article.
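As a small illustration of the reranking step mentioned above, here is a sketch that scores a set of candidate image embeddings against a caption embedding with cosine similarity and keeps the top 32 of 512, the way CLIP is reported to be used. The embeddings and the sizes here are random stand-ins, not the actual DALL-E or CLIP code.

# A minimal sketch of CLIP-style reranking: generate many candidates for one
# caption, score each by image-text similarity, and keep only the best ones.
# The embeddings below are random stand-ins for real encoder outputs.
import torch
import torch.nn.functional as F

def rerank_with_clip(caption_emb, image_embs, keep=32):
    """caption_emb: (D,) text embedding; image_embs: (N, D) image embeddings.
    Returns the indices of the `keep` images most similar to the caption."""
    sims = F.cosine_similarity(image_embs, caption_emb.unsqueeze(0), dim=1)  # (N,)
    return sims.topk(keep).indices

# Illustrative usage: 512 candidate images, 512-dimensional embeddings
caption_emb = torch.randn(512)
image_embs = torch.randn(512, 512)
best = rerank_with_clip(caption_emb, image_embs, keep=32)
print(best.shape)  # torch.Size([32])

The generator does not have to get every sample right; the contrastively trained scorer acts as a filter that surfaces only the samples that actually match the caption.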

Conclusion:

In this article, we discussed the wonderful new research on text-to-image creation, saw the cute little images, and finally went over the technical concepts behind this research at a high level. As we move forward, let's hope we get to play with the API for this awesome new tool and can then explore its different use cases.

Thanks for reading! See you again with exciting new posts about AI and machine learning. If you liked this piece, consider subscribing to get more like it straight to your email!

References:

(1) Reed et al., Generative Adversarial Text to Image Synthesis (the text-caption-conditioned GAN paper)

(2) Zhang et al., StackGAN: Text to Photo-realistic Image Synthesis with Stacked Generative Adversarial Networks

(3) Zhang et al., StackGAN++: Realistic Image Synthesis with Stacked Generative Adversarial Networks

(4) Colab notebook to play with the CLIP model

(5) The original DALL-E blog post by OpenAI

