The current state of AI image generation (early 2025)

Sven-Erik Jonsson 2025-02-25 12:57:12

Ever since the first diffusion model appeared in 2015, these models have had some serious constraints on what they can achieve due to hardware limits. They still have issues, but progress is being made in interesting ways.

You've probably already read about one of the big players.

All of them have impressive models that can generate really nice 1-2 megapixel pictures for a fee. This can be fun for a while, until you try using them for something professional. Then you realize that things get tricky without addressing a list of issues.

Issues with AI images

  • Relatively low output resolution (2048 x 2048 max, often 1024 x 1024)
  • Incorrect anatomy (wrong number of fingers, missing teeth, weird skin texture)
  • Bad at generating text (nonsensical glyphs)
  • Faulty scale (subjects too large/small)
  • Shadows not matching subject
  • Imperfect patterns (sofas with uneven seats; organic, wavy lines in things that should be straight)
  • Small generative artifacts (small, weird people in the background, random compression artifacts)
  • Poor prompt adherence (two subjects show up when prompting for one)
  • Copyright concerns about the underlying data used to train the model

Human hand, placed on table (Stable Diffusion 3.5)

A luxurious corner sofa (Stable Diffusion 3.5)

Fixing it yourself

Alongside the proprietary models there are some open-source alternatives worth mentioning that you can run on your home computer (given a graphics card with 8 GB of VRAM and at least 16 GB of RAM).

Different checkpoints (a captured state of the model) are downloadable from Hugging Face. They are all free for non-commercial use (some also include commercial licenses).
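As a minimal sketch, a checkpoint can also be fetched programmatically with the huggingface_hub library (the repo and file names below are illustrative, not a recommendation):

    from huggingface_hub import hf_hub_download

    # Downloads the file into the local Hugging Face cache and returns its path.
    path = hf_hub_download(
        repo_id="stabilityai/stable-diffusion-3.5-medium",
        filename="sd3.5_medium.safetensors",
    )
    print(path)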

All suffer from the same issues listed previously. The later models are better in many ways: FLUX can generate readable text to a limited extent, and it has a slightly better understanding of human anatomy (but still not quite right, as shown below with incorrect proportions and merged fingers).

A human hand, placed on table (Flux.1 dev)

A luxurious corner sofa (Flux.1 dev)

A short note on NSFW

Some of these free checkpoints are trained on imagery that is NSFW (Not Safe For Work, meaning that it contains things like nudity, gore and monsters). This sometimes has to be mitigated by providing negative prompts or by selectively excluding results, something to consider when building an automated, public-facing service.

The anatomy of a diffusion model

A diffusion model is usually stored in a safetensors file (other formats like ONNX or .bin also exist), and it typically contains the following components:

  • Image model (the denoising network that operates in latent space)
  • CLIP (Contrastive Language-Image Pretraining)
  • VAE (Variational AutoEncoder)

The CLIP encoder converts the text prompt into conditioning that tells the sampler how to generate the image. The VAE converts images to and from latent space (the model's internal format).
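To make these parts concrete, here is a minimal sketch using the Hugging Face diffusers library that loads a single safetensors checkpoint and exposes the three components (the checkpoint path is a placeholder):

    from diffusers import StableDiffusionPipeline

    # Load all components bundled in a single safetensors checkpoint.
    pipe = StableDiffusionPipeline.from_single_file("path/to/checkpoint.safetensors")

    print(type(pipe.unet).__name__)          # image model (denoises in latent space)
    print(type(pipe.text_encoder).__name__)  # CLIP text encoder
    print(type(pipe.vae).__name__)           # VAE (pixels <-> latents)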

Using a backend

To avoid having to write a large amount of Python code just to test these models (though some might actually be required), a backend software can be used. I personally use ComfyUI, a graph-based editor that enables combining or hot-swapping different models in a workflow. Others of note are AUTOMATIC1111 and InvokeAI.
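For comparison, this is roughly the kind of Python a backend saves you from writing; a hedged sketch with diffusers, assuming a CUDA GPU and a placeholder checkpoint path:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_single_file(
        "path/to/checkpoint.safetensors", torch_dtype=torch.float16
    ).to("cuda")

    # One text-to-image generation, saved to disk.
    image = pipe("a luxurious corner sofa").images[0]
    image.save("sofa.png")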

Controlling image generation

By providing workflows, backends like ComfyUI offer granular control that standardized web interfaces can't accommodate. These workflows can do a lot more than the proprietary web services thanks to their sheer flexibility.

Image to image

A frowning person (Flux.1 dev)

A base image can be sent to the sampler along with the prompt; how much of the original image is kept is determined by parameters such as the denoising strength.
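A minimal image-to-image sketch with diffusers, assuming placeholder file paths; strength controls how far the sampler may stray from the input (0.0 keeps it unchanged, 1.0 nearly ignores it):

    import torch
    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image

    pipe = StableDiffusionImg2ImgPipeline.from_single_file(
        "path/to/checkpoint.safetensors", torch_dtype=torch.float16
    ).to("cuda")

    init = Image.open("smiling_person.png").convert("RGB")
    out = pipe("a frowning person", image=init, strength=0.6).images[0]
    out.save("frowning_person.png")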

Fixed seeds

Randomization can be controlled by using a specific seed to get the same output. Combining a fixed seed with variations of other parameters gives subtle variations that can be selectively edited afterwards to exclude unwanted details (like hands with weird anatomy).
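In diffusers terms this means passing a seeded generator to the pipeline; a short sketch (checkpoint path is a placeholder):

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_single_file(
        "path/to/checkpoint.safetensors", torch_dtype=torch.float16
    ).to("cuda")

    # The same seed plus the same parameters reproduces the same image;
    # varying e.g. guidance_scale while keeping the seed gives subtle variations.
    generator = torch.Generator("cuda").manual_seed(42)
    image = pipe("a smiling person", generator=generator).images[0]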

LoRA

When using text-based prompts to generate a series of images, it is nearly impossible to get the subject to look alike between images due to randomization. A LoRA (Low-Rank Adaptation) is like a patch for the model, where a pre-trained custom subject can be inserted by using keywords in the prompt. LoRA construction is complex and requires a moderate to large number of images of the subject.
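Applying an existing LoRA is simple in comparison; a sketch with diffusers, where the LoRA file name and the trigger word "mysubject" are hypothetical placeholders:

    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_single_file(
        "path/to/checkpoint.safetensors", torch_dtype=torch.float16
    ).to("cuda")

    # Patch the model with the LoRA weights.
    pipe.load_lora_weights("path/to/loras", weight_name="my_subject.safetensors")

    # The trigger keyword baked into the LoRA activates the custom subject.
    image = pipe("a photo of mysubject sitting on a sofa").images[0]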

ControlNet

A ControlNet is an external neural network that controls image generation by supplying additional conditioning on top of the CLIP-encoded text. It was conceived by three researchers (Lvmin Zhang, Anyi Rao, Maneesh Agrawala) in 2023. It can be used for specifying human poses, copying the composition of another image, or altering the style of an image.
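A sketch of the general pattern with diffusers, using Canny edges as the conditioning; the model IDs and file names are illustrative:

    import cv2
    import numpy as np
    import torch
    from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
    from diffusers.utils import load_image
    from PIL import Image

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
    )
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "stable-diffusion-v1-5/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    # Edge-detect a source image; the edges steer the composition.
    src = np.array(load_image("source.png"))
    gray = cv2.cvtColor(src, cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, 100, 200)
    canny = Image.fromarray(np.stack([edges] * 3, axis=-1))

    image = pipe("a painting of a woman", image=canny).images[0]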

Depth

A depth map can be used to control the subject/content of the image.

A painting of a woman (SDXL, Proteus v0.4 checkpoint)

Canny/Scribble

A scribble or edge detected version of a source image can be used to control the image.

A painting of a woman (Flux.1 dev)

Pose

A pose definition can be used to control the image.

A smiling person (SDXL, Proteus v0.4)

Inpainting/Outpainting

Parts of the image can be replaced, partly or in full, based on a mask. This can help remove or reduce unwanted details in the source image.

Example inpainting from a ComfyUI tutorial.
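An inpainting sketch with diffusers; the model ID and file names are illustrative, and the convention is that white mask areas are regenerated while black areas are kept:

    import torch
    from diffusers import StableDiffusionInpaintPipeline
    from PIL import Image

    pipe = StableDiffusionInpaintPipeline.from_pretrained(
        "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
    ).to("cuda")

    image = Image.open("source.png").convert("RGB")
    mask = Image.open("mask.png").convert("RGB")  # white = area to replace

    out = pipe("an empty table", image=image, mask_image=mask).images[0]
    out.save("inpainted.png")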

IP Adapter

IP Adapter is a computer-vision-based way to prompt using images, enabling among other things style transfer.

Retro robot in forest (SDXL, base)
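Diffusers can load IP Adapter weights onto an existing pipeline; a sketch where the repo layout follows the published h94/IP-Adapter weights but should be treated as an assumption:

    import torch
    from diffusers import StableDiffusionPipeline
    from diffusers.utils import load_image

    pipe = StableDiffusionPipeline.from_single_file(
        "path/to/checkpoint.safetensors", torch_dtype=torch.float16
    ).to("cuda")

    pipe.load_ip_adapter(
        "h94/IP-Adapter", subfolder="models", weight_name="ip-adapter_sd15.bin"
    )
    pipe.set_ip_adapter_scale(0.7)  # how strongly the image prompt is weighted

    # The style image acts as part of the prompt.
    style = load_image("retro_robot_style.png")
    image = pipe("a robot in a forest", ip_adapter_image=style).images[0]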

Upscaling

To fix the low-resolution issue, another AI model can be used to upscale the image through different methods, either by inferring detail or by hallucinating more pixels based on the input image and a text prompt. One notable paid product that does this is Topaz Labs Gigapixel.

There are also plugins for ComfyUI that can split the image into tiles and stitch them together to increase resolution. These upscalers work in the model's latent space and use prompts to generate high-quality results.

Close-up of an image upscaled 4x by Ultimate SD Upscale
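For a non-tiled starting point, diffusers also ships a prompt-guided 4x upscaling pipeline; a sketch assuming the published stabilityai upscaler and placeholder file names:

    import torch
    from diffusers import StableDiffusionUpscalePipeline
    from PIL import Image

    pipe = StableDiffusionUpscalePipeline.from_pretrained(
        "stabilityai/stable-diffusion-x4-upscaler", torch_dtype=torch.float16
    ).to("cuda")

    low_res = Image.open("sofa_small.png").convert("RGB")

    # The prompt guides what detail gets hallucinated during upscaling.
    out = pipe(prompt="a luxurious corner sofa", image=low_res).images[0]
    out.save("sofa_4x.png")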

Summary

Given the list of issues with generative AI stated above, and the amount of work that has to be put in to produce high-quality imagery, the need for digital artists and engineers still exists; the difference in quality comes down to the amount of work you put into generating and improving your images.
