
What is DALL-E and how does it create images out of text?

DALL-E

Ground-breaking machine learning news

DALL-E images generated from text

Introduction:

It was a Tuesday morning. I woke up and, sipping my morning cup of coffee, found out that OpenAI had dropped a bomb again. This time they didn't stop at language: they created a neural network that takes a text prompt and creates an image for that text. It sent another ripple through the data science and deep learning communities, and within 5 days thousands of pieces, from news stories to technical articles, had been written about DALL-E. So here is my take on DALL-E. Sit back and enjoy!

Summary:

In this article, we are going to go through a basic-to-intermediate technical review of the DALL-E and CLIP neural networks. This is not a purely technical article, as it is meant to cater to both ML and non-ML people interested in this awesome new thing.

What is DALL-E?

DALL-E is a version of GPT-3 with 12 billion parameters, trained to generate images from text descriptions. While OpenAI may have started with concepts from previous research, which took detailed text captions and generated somewhat meaningful images, DALL-E reaches a very different level. DALL-E can

(1) create anthropomorphic versions of non-living things,

(2) combine unrelated concepts, even impossible ones, into plausible images,

(3) take existing images and change or transform them into new images.

So DALL-E, while not a totally novel concept as some people may end up claiming, is a very novel result on an old question.

The obvious thing: some examples of DALL-E images:

Everyone reading about DALL-E wants only one thing, and yes, you got that right: some images created by DALL-E.

(1) Anthropomorphic: an illustration of a baby daikon radish in a tutu walking a dog

DALL-E generated baby daikon radish walking a dog

(2) Joining unrelated concepts to create the most plausible image:

DALL-E created image of an armchair that looks like an avocado

(3) Transformation: a cat with and without a hat on top

Cat image:

Cat with a hat on top!

What are the types of images we have seen DALL-E generate?

We don't yet know everything about DALL-E, as the model has not been released publicly, but according to OpenAI, these are the types of images DALL-E creates:

(1) Combining unrelated concepts: for example, an armchair that looks like an avocado

(2) Animal illustrations: creation of anthropomorphic images

(3) Zero-shot visual reasoning: for example, creating a cat with a hat on top

(4) Geographical knowledge: DALL-E supposedly even knows how to create images specific to geographical locations. For example, look at the following images generated for food from India:

 

(5) Temporal knowledge: OpenAI claims that DALL-E even knows how concepts vary over time. Look at its attempt to create images that vary with the time period.

Text prompt: a photo of a action movie poster from the...

DALL-E generated action movie posters from different time periods

Finally, these are all claims that people will thoroughly verify once DALL-E's API is publicly available.

Technical concepts behind DALL-E:

DALL-E was not the first step in the long-running research on text-to-image creation. The revolution started back in 2016 with Scott Reed's work. In Reed's paper, for the first time, GAN models used a detailed caption to create specific images matching the text description. A normal GAN is trained to generate fake images until they become hard to tell apart from the original images. But to generate text-specific images, researchers figured out ways to condition the generated image on the text input.
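To make the conditioning idea concrete, here is a minimal NumPy sketch. It is not the network from Reed's paper: the single-layer generator and discriminator, the random weights, and all sizes and names are assumptions chosen purely for illustration. The only point it shows is that both networks receive the text embedding alongside their usual input.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes chosen only for illustration.
NOISE_DIM, TEXT_DIM, IMG_PIXELS = 16, 8, 32 * 32

# Randomly initialised weights stand in for trained layers.
W_gen = rng.normal(scale=0.1, size=(NOISE_DIM + TEXT_DIM, IMG_PIXELS))
W_disc = rng.normal(scale=0.1, size=(IMG_PIXELS + TEXT_DIM, 1))


def generator(noise, text_embedding):
    """Toy generator: map [noise ; text embedding] to a flat image in (-1, 1)."""
    conditioned = np.concatenate([noise, text_embedding], axis=-1)
    return np.tanh(conditioned @ W_gen)


def discriminator(image, text_embedding):
    """Toy discriminator: score an (image, text) pair as a value in (0, 1)."""
    conditioned = np.concatenate([image, text_embedding], axis=-1)
    logit = conditioned @ W_disc
    return 1.0 / (1.0 + np.exp(-logit))


# A batch of 4 noise vectors, all paired with the same caption embedding.
noise = rng.normal(size=(4, NOISE_DIM))
text = np.tile(rng.normal(size=(1, TEXT_DIM)), (4, 1))

fake_images = generator(noise, text)
scores = discriminator(fake_images, text)
print(fake_images.shape, scores.shape)
```

Because the caption embedding is fixed while the noise changes, a trained model of this shape would produce different images that all try to match the same caption.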

The idea came from Karpathy et al.'s 2015 paper on generating text captions for images by conditioning LSTMs on the output of a deep convolutional network applied to the image. Building on that idea, and on 2015 work training recurrent autoencoder models to create images, Reed et al. took the following approach:

"Our approach is to train a deep convolutional generative adversarial network (DC-GAN) conditioned on text features encoded by a hybrid character-level convolutional-recurrent neural network. Both the generator network G and the discriminator network D perform feed-forward inference conditioned on the text feature."

Now, this was still a GAN model at its core. In 2017, Han Zhang et al., who created the first StackGAN (a stacking of multiple GAN models), created the StackGAN++ architecture. This architecture is best described by quoting the abstract of the paper:

"Abstract: Although Generative Adversarial Networks (GANs) have shown remarkable success in various tasks, they still face challenges in generating high quality images. In this paper, we propose Stacked Generative Adversarial Networks (StackGANs) aimed at generating high-resolution photo-realistic images. First, we propose a two-stage generative adversarial network architecture, StackGAN-v1, for text-to-image synthesis. The Stage-I GAN sketches the primitive shape and colors of a scene based on a given text description, yielding low-resolution images. The Stage-II GAN takes Stage-I results and the text description as inputs, and generates high-resolution images with photo-realistic details. Second, an advanced multi-stage generative adversarial network architecture, StackGAN-v2, is proposed for both conditional and unconditional generative tasks. Our StackGAN-v2 consists of multiple generators and multiple discriminators arranged in a tree-like structure; images at multiple scales corresponding to the same scene are generated from different branches of the tree. StackGAN-v2 shows more stable training behavior than StackGAN-v1 by jointly approximating multiple distributions. Extensive experiments demonstrate that the proposed stacked generative adversarial networks significantly outperform other state-of-the-art methods in generating photo-realistic images."

Finally, DALL-E builds on this line of research. The original OpenAI blog post summarizes the earlier work as follows:

In Reed et al.'s approach, the text embeddings are produced by an encoder pretrained using a contrastive loss, not unlike CLIP. StackGAN and StackGAN++ use multi-scale GANs to scale up the image resolution and improve the visual fidelity of the created images.

On top of this, AttnGAN incorporates attention between the text and image features, and proposes a contrastive text-image feature matching loss as an auxiliary objective.

Similar to the rejection sampling used in VQVAE-2, CLIP is used to rerank the top 32 of 512 samples for each caption in all of the interactive visuals.

So basically, these GAN models are the prior work that DALL-E is compared against, not components of it. DALL-E itself is the GPT-3-style transformer described earlier, which generates the image from the text prompt, and CLIP then reranks its samples. AttnGAN and CLIP are more complex, so we skip their details in this article.
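The CLIP reranking step can be sketched roughly as follows. This is a hedged illustration, not OpenAI's code: the `rerank` function, the embedding size, and the random vectors standing in for real text and image encoder outputs are all assumptions. It only shows the general idea of scoring 512 candidates against a caption and keeping the 32 best by cosine similarity.

```python
import numpy as np


def rerank(text_embedding, image_embeddings, top_k):
    """Return indices of the top_k images most similar to the text."""
    # Normalise so the dot product equals cosine similarity.
    text = text_embedding / np.linalg.norm(text_embedding)
    images = image_embeddings / np.linalg.norm(
        image_embeddings, axis=1, keepdims=True
    )
    similarity = images @ text
    # Sort by descending similarity and keep the best top_k.
    return np.argsort(-similarity)[:top_k]


rng = np.random.default_rng(0)
EMBED_DIM = 64  # hypothetical embedding size

# Random vectors stand in for real text/image encoder outputs.
caption_embedding = rng.normal(size=EMBED_DIM)
sample_embeddings = rng.normal(size=(512, EMBED_DIM))

best = rerank(caption_embedding, sample_embeddings, top_k=32)
print(len(best))
```

With a real model, the embeddings would come from encoders trained so that matching captions and images land close together, which is what makes the similarity score a useful filter.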

Conclusion:

In this article, we discussed the wonderful new research on text-to-image creation, saw the cute little images, and finally discussed the technical concepts behind this research at a high level. As we move forward, let's hope we get to play with the API for this awesome new tool and can then explore its different use cases.

Thanks for reading! See you again with exciting new posts about AI and machine learning. If you liked this piece, consider subscribing to get more like it straight to your email!

References:

(1) Generating images from text captions with a GAN

(2) Text-to-image synthesis using StackGAN: stacked generative adversarial networks

(3) StackGAN++: realistic image synthesis with stacked generative adversarial networks

(4) Colab notebook to play with the CLIP model

(5) Original blog post by OpenAI
