CS 180 Project 5

By Jathin Korrapati

Part A: Diffusion Model Setup

Part 0: Setup

We begin this project by loading DeepFloyd, which is a text-to-image model we retrieve from HuggingFace. Here are some of the results:

Part 1: Sampling

Now, we sample from the model in several different ways in order to compare the outputs and see the capabilities of DeepFloyd.

num_inference_steps = 20

num_inference_steps = 100
num_inference_steps = 200

The detail of the prompt seems to be directly related to the quality of the image output: the simpler prompts produced more variable results (the snowy mountain village looks cartoonish, for example). I also tried two larger values for the num_inference_steps variable, 100 and 200; quality was noticeably better at 200, with outputs that were clearer and more faithful to the prompt. I used a random seed of 180 for my image generations here.
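For reference, a minimal sketch of how the two DeepFloyd stages are loaded and sampled with the diffusers library is below. The model IDs follow the DeepFloyd quickstart; exact keyword arguments may differ slightly between diffusers versions, so treat this as an outline rather than the exact code used.

```python
import torch
from diffusers import DiffusionPipeline

# Load the two DeepFloyd IF stages from HuggingFace
# (stage 1 generates 64x64 images, stage 2 upsamples them to 256x256).
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16)
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", variant="fp16", torch_dtype=torch.float16)
# Move to GPU (or use enable_model_cpu_offload) before sampling.

generator = torch.manual_seed(180)   # seed used for the generations above
result = stage_1(prompt="an oil painting of a snowy mountain village",
                 num_inference_steps=20,       # vary this: 20 / 100 / 200
                 generator=generator,
                 output_type="pt")
images = result.images
```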

Part 1.1: Noising

We implement the forward process by taking a clean image and adding noise to it according to the formula $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1 - \bar{\alpha}_t}\,\epsilon$, where $\epsilon \sim N(0, I)$ is sampled from a standard normal distribution. We don't just add noise; we also scale the image accordingly. The results of the forward process are displayed here:
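A minimal sketch of this forward process, assuming the cumulative schedule $\bar{\alpha}_t$ is stored as a 1-D tensor `alphas_cumprod` indexed by timestep:

```python
import torch

def forward(im, t, alphas_cumprod):
    """Noise a clean image im to timestep t: scale it by sqrt(abar_t) and add
    sqrt(1 - abar_t) * eps with eps ~ N(0, I)."""
    abar_t = alphas_cumprod[t]
    eps = torch.randn_like(im)
    x_t = torch.sqrt(abar_t) * im + torch.sqrt(1 - abar_t) * eps
    return x_t, eps
```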

Part 1.2: Classical Denoising

Here, we try to denoise the images we just added noise to by applying Gaussian blur filtering. We use a kernel size of 5 and $\sigma = 2$. The blur averages each pixel with its neighbors inside the kernel, which smooths out some of the high-frequency noise at the cost of also blurring the underlying image. Results are displayed below:
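A sketch of this classical baseline using torchvision's built-in Gaussian blur (the kernel size and sigma here are the values mentioned above):

```python
import torchvision.transforms.functional as TF

def blur_denoise(x_t, kernel_size=5, sigma=2.0):
    """Classical denoising attempt: smooth the noisy image with a Gaussian
    blur, trading high-frequency noise for lost detail."""
    return TF.gaussian_blur(x_t, kernel_size=kernel_size, sigma=sigma)
```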

Part 1.3: One-Step Denoising

Here, we use the pretrained diffusion model, defined as `stage_1.unet` in the code, which has already been trained on a large dataset of images. It predicts the Gaussian noise in a noisy image, which we can then remove. Rearranging the equation from before, given the noise estimate $\epsilon$ we can recover $x_0 = \frac{x_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon}{\sqrt{\bar{\alpha}_t}}$. Results are displayed below:
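A sketch of the one-step estimate, assuming the standard diffusers UNet call signature and that DeepFloyd's variance-predicting UNet carries the noise estimate in the first three output channels:

```python
import torch

def one_step_denoise(x_t, t, prompt_embeds, stage_1, alphas_cumprod):
    """Estimate the clean image x_0 from a noisy image x_t in a single step."""
    with torch.no_grad():
        model_out = stage_1.unet(x_t, t, encoder_hidden_states=prompt_embeds).sample
    eps = model_out[:, :3]                    # noise estimate (first 3 channels)
    abar_t = alphas_cumprod[t]
    x_0 = (x_t - torch.sqrt(1 - abar_t) * eps) / torch.sqrt(abar_t)
    return x_0
```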

Part 1.4: Iterative Denoising

Now, with iterative denoising, the idea is that at each timestep we take a convex combination of the current noisy image $x_t$ and the predicted clean image $x_0$ in order to step to a less-noisy timestep $t'$, using the following equation given to us: $x_{t'} = \frac{\sqrt{\bar{\alpha}_{t'}}\,\beta_t}{1 - \bar{\alpha}_t}x_0 + \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t'})}{1 - \bar{\alpha}_t}x_t + v_{\sigma}$. After implementing this, our results are shown below:
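A sketch of one update of this loop, assuming `alphas_cumprod` holds $\bar{\alpha}_t$, that $\alpha_t = \bar{\alpha}_t / \bar{\alpha}_{t'}$ and $\beta_t = 1 - \alpha_t$ between consecutive strided timesteps, and that the variance term $v_\sigma$ is supplied separately:

```python
import torch

def iterative_denoise_step(x_t, x_0, t, t_prime, alphas_cumprod, v_sigma=0.0):
    """Blend the current noisy image x_t with the clean estimate x_0 to step
    from timestep t down to the less-noisy timestep t' (t' < t)."""
    abar_t = alphas_cumprod[t]
    abar_tp = alphas_cumprod[t_prime]
    alpha_t = abar_t / abar_tp          # per-step alpha between t and t'
    beta_t = 1 - alpha_t
    x_tp = (torch.sqrt(abar_tp) * beta_t / (1 - abar_t)) * x_0 \
         + (torch.sqrt(alpha_t) * (1 - abar_tp) / (1 - abar_t)) * x_t \
         + v_sigma
    return x_tp
```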

Part 1.5: Diffusion Model Sampling

In 1.4, we used the diffusion model to denoise an image, but here we try to generate images from scratch using iterative denoising. We do this by setting $i_{start} = 0$ and passing in pure random noise to see what images the model produces for the prompt embedding "a high quality photo". The results are shown below:

Part 1.6: Classifier-Free Guidance (CFG)

Here, we introduce the notion of classifier-free guidance, which takes both conditional and unconditional noise estimates into account in order to generate a new noise estimate, defined as:

$\epsilon = \epsilon_u + \gamma(\epsilon_c - \epsilon_u)$. We obtain each of these estimates with a forward pass through our UNet, one conditional and one unconditional, and then evaluate the equation above. Here, $\gamma$ controls the strength of the CFG. For our results, we use $\gamma = 7$ to generate images, again for the prompt embedding "a high quality photo."
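A sketch of the CFG noise estimate, reusing the assumptions from the one-step sketch above (diffusers-style UNet call, noise estimate in the first three output channels):

```python
import torch

def cfg_noise_estimate(x_t, t, cond_embeds, uncond_embeds, stage_1, gamma=7.0):
    """Classifier-free guidance: combine conditional and unconditional noise
    estimates from two UNet forward passes."""
    with torch.no_grad():
        eps_c = stage_1.unet(x_t, t, encoder_hidden_states=cond_embeds).sample[:, :3]
        eps_u = stage_1.unet(x_t, t, encoder_hidden_states=uncond_embeds).sample[:, :3]
    return eps_u + gamma * (eps_c - eps_u)
```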

Part 1.7: Image-to-image Translation (SDE Edit)

In this part, we use the SDEdit algorithm to force our UNet to hallucinate images back onto the natural image manifold, starting from noised versions of a reference image. The later we start in the denoising loop (i.e., the less noise we add), the more the final image resembles the reference image. Once again, the prompt "a high quality photo" was used. Results are below:

This is for the Campanile
Lebron
Eiffel Tower

Part 1.7.1: Editing Hand-Drawn and Web Images

Here, we display the results from above with the SDE Edit algorithm on two hand drawn images and one image from the web to show how the edits look with our algorithm. Results are displayed below:

Part 1.7.2: Inpainting

Here, we want to implement something called inpainting. Given an image $x_0$ and a binary mask $\mathbf{m}$, we want to generate a new image that keeps the original content where the mask is 0 and has new content where the mask is 1. We do this by running the diffusion denoising loop again, but after every step we force $x_t$ to agree with the original outside the mask by applying: $x_t \leftarrow \mathbf{m}\,x_t + (1 - \mathbf{m})\,\text{forward}(x_0, t)$. This leaves everything inside the mask free for the model to fill in with new content, while forcing everything outside the mask to stay consistent with the original image. Results are shown below:
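A sketch of this masked update, reusing the `forward` noising helper from the Part 1.1 sketch:

```python
def inpaint_step(x_t, x_orig, mask, t, alphas_cumprod):
    """After each denoising update, replace the region outside the mask
    (mask == 0) with a freshly noised copy of the original image, so only the
    masked region (mask == 1) receives new content."""
    noised_orig, _ = forward(x_orig, t, alphas_cumprod)
    return mask * x_t + (1 - mask) * noised_orig
```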

Part 1.7.3 Text-Conditional Image-to-image Translation

Now, we do the exact same thing as before with SDEdit, but we add a text prompt embedding to guide the diffusion model as it generates images. We change the prompt from "a high quality photo" to one of the precomputed embeddings from before. Results are shown below:

“a rocket ship” for original Campanile image

“a man wearing a hat”
“a pencil”

Part 1.8: Visual Anagrams

The goal here is to generate an image that looks like one thing when upright (e.g., "an oil painting of people around a campfire") but turns into something completely different when flipped upside down (e.g., "an oil painting of an old man"). The process involves denoising the image twice at each step: once with the upright image and the first prompt to get one noise estimate, and once with the image flipped upside down and the second prompt to get a second noise estimate, which we then flip back. We average the two noise estimates and use the result to take the denoising step (a sketch of this combination follows the prompt list). My results and the prompts that I used are displayed below.

The three prompts I use in order are:

  1. “an oil painting of an old man”, “an oil painting of people around a campfire”
  2. “a lithograph of a skull”, “a rocket ship”
  3. “a photo of the amalfi cost”, “a lithograph of waterfalls”
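A sketch of the anagram noise combination, reusing the `cfg_noise_estimate` sketch from Part 1.6 (the flip is over the height axis so the image is turned upside down):

```python
import torch

def anagram_noise(x_t, t, embeds_upright, embeds_flipped, uncond_embeds, stage_1, gamma=7.0):
    """Visual-anagram noise estimate: one CFG estimate on the upright image
    with the first prompt, one on the flipped image with the second prompt
    (flipped back), then the average of the two."""
    eps_1 = cfg_noise_estimate(x_t, t, embeds_upright, uncond_embeds, stage_1, gamma)
    x_flip = torch.flip(x_t, dims=[-2])          # flip upside down
    eps_2 = cfg_noise_estimate(x_flip, t, embeds_flipped, uncond_embeds, stage_1, gamma)
    eps_2 = torch.flip(eps_2, dims=[-2])         # un-flip the noise estimate
    return (eps_1 + eps_2) / 2
```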

Part 1.9: Hybrid Images

Now, what we are trying to do here is implement hybrid images with a diffusion model. Given two prompts, we create a composite noise estimate by passing one prompt's noise estimate through a low-pass filter and the other prompt's noise estimate through a high-pass filter, each estimate coming from a forward pass of the model on a different prompt embedding. We combine the two filtered noises by simply adding them, and then use the result to estimate our hybrid image (a sketch follows the prompt list). My results and prompts are below:

The three prompts I use in order are:

  1. “a lithograph of a skull”, “a lithograph of waterfalls”
  2. “an oil painting of a snowy mountain village”, “an oil painting of people around a campfire”
  3. “a photo of a man”, “a photo of a dog”
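A sketch of the hybrid noise combination, again reusing the `cfg_noise_estimate` helper; the Gaussian kernel size and sigma here are illustrative choices, not necessarily the exact filter I used:

```python
import torchvision.transforms.functional as TF

def hybrid_noise(x_t, t, embeds_low, embeds_high, uncond_embeds, stage_1,
                 gamma=7.0, kernel_size=33, sigma=2.0):
    """Hybrid-image noise estimate: low-pass the noise from one prompt,
    high-pass the noise from the other, and add the two."""
    eps_1 = cfg_noise_estimate(x_t, t, embeds_low, uncond_embeds, stage_1, gamma)
    eps_2 = cfg_noise_estimate(x_t, t, embeds_high, uncond_embeds, stage_1, gamma)
    low = TF.gaussian_blur(eps_1, kernel_size=kernel_size, sigma=sigma)
    high = eps_2 - TF.gaussian_blur(eps_2, kernel_size=kernel_size, sigma=sigma)
    return low + high
```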


Part B: Diffusion Models

Part 1: Training a Single-Step Denoising UNet

Part 1.1: UNet Implementation

We start by implementing the UNet architecture from the original UNet paper, adapted for our purposes here:

Part 1.2: Using the UNet to Train a Denoiser

After implementing the basic structure of the UNet, we now need to train our denoiser. In order to do this, we need to generate training pairs $(x_d, x_c)$, where $d$ means noisy and $c$ means clean. We generate noisy digits using the noising equation $x_d = x_c + \sigma\epsilon$, where $\epsilon \sim N(0, I)$. We also normalize our data, and we generate the following plot for different sigma values:
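A minimal sketch of this noising step for a batch of clean MNIST digits:

```python
import torch

def add_noise(x_clean, sigma):
    """Generate a noisy training input x_d = x_c + sigma * eps from a clean
    digit x_c, with eps ~ N(0, I)."""
    eps = torch.randn_like(x_clean)
    return x_clean + sigma * eps
```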

Part 1.2.1: Training

Here, we set $\sigma = 0.5$ and train our model to perform denoising. We also use a learning rate of 1e-4. Here are the visual results:

Part 1.2.2: Out-of-Distribution Testing

Before, we only experimented with $\sigma = 0.5$, but now let's see how the model performs on sigma values it was not trained on. Here, we generate results for $\sigma \in \{0.0, 0.2, 0.4, 0.6, 0.8, 1.0\}$:

Part 2: Training a Diffusion Model

Part 2.1: Adding Time Conditioning to UNet

In this part, we take our base UNet implementation and add time conditioning so that we can iteratively denoise an image for better results. The model also changes what it predicts: instead of estimating a clean version of the noisy image directly, it estimates the noise that was added to the image.
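A sketch of the resulting training objective, assuming the time-conditioned UNet's forward pass takes the noisy image and a timestep normalized to [0, 1], and that `alphas_cumprod` holds the $\bar{\alpha}_t$ schedule as before:

```python
import torch
import torch.nn.functional as F

def time_conditioned_loss(unet, x_0, alphas_cumprod, T=300):
    """Noise a clean image to a random timestep t and regress the UNet's
    output onto the true noise that was added."""
    t = torch.randint(1, T, (x_0.shape[0],), device=x_0.device)
    abar_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    eps = torch.randn_like(x_0)
    x_t = torch.sqrt(abar_t) * x_0 + torch.sqrt(1 - abar_t) * eps
    eps_pred = unet(x_t, t.float() / T)     # timestep normalized before conditioning
    return F.mse_loss(eps_pred, eps)
```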

Part 2.2: UNet Training

We train our UNet here in a similar way as before, but now using the time-conditioned UNet. Here is the training curve:

Part 2.3: Sampling from UNet

We sample during and after training of our UNet, iteratively removing noise from the pure-noise images we give it. Here are my results from sampling from the model while it's training, at different epochs:
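A sketch of the sampling loop under the same assumptions as above (normalized timestep input), with `betas`, `alphas`, and `alphas_cumprod` being the precomputed noise-schedule tensors:

```python
import torch

@torch.no_grad()
def sample(unet, betas, alphas, alphas_cumprod, T=300,
           shape=(16, 1, 28, 28), device="cuda"):
    """DDPM-style sampling: start from pure noise and iteratively remove the
    predicted noise, injecting fresh noise at every step except the last."""
    x = torch.randn(shape, device=device)
    for t in range(T - 1, -1, -1):
        t_batch = torch.full((shape[0],), t, device=device)
        eps = unet(x, t_batch.float() / T)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = (1 / torch.sqrt(alphas[t])) \
            * (x - (betas[t] / torch.sqrt(1 - alphas_cumprod[t])) * eps) \
            + torch.sqrt(betas[t]) * z
    return x
```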

Part 2.4: Adding Class-Conditioning to UNet

The problem from before is that our model's samples do not always resemble digits; they are often just random generations of images close to them. To get more control, we use a one-hot encoding vector, which we call $c$, a one-dimensional tensor representing a digit from 0 to 9. We pass this label in during the training phase along with the clean image, and so that the model still supports generation without a class, we zero out a percentage of the encodings. We call this $p_{cond}$, which is $0.1$. This is the resulting training loss curve:
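A sketch of building the class vectors and randomly dropping a fraction of them (the helper name here is just for illustration):

```python
import torch
import torch.nn.functional as F

def class_condition(labels, num_classes=10, p_cond=0.1):
    """Build one-hot class vectors c and zero out a fraction p_cond of them so
    the model also learns to generate unconditionally."""
    c = F.one_hot(labels, num_classes).float()
    drop = (torch.rand(labels.shape[0], device=labels.device) < p_cond).float()
    return c * (1 - drop.unsqueeze(1))
```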

Part 2.5: Sampling from Class-Conditioned UNet

We sample very similarly to how we did in 2.3, but now add specific class values in our one-hot encoding $c$ to support class-conditioned UNet sampling. Results are shown below: