Building AI Model from Scratch
Series · 6 posts

Image Generation Pipeline

Stable Diffusion is one of the most capable and widely used open-source image generation models. In this post you'll learn its architecture, use ControlNet to guide what gets generated, fine-tune it with LoRA for a custom style, and build a minimal image generation API.

1. Stable Diffusion Architecture

Stable Diffusion combines three components that work together: a VAE that compresses images, a U-Net that denoises in latent space, and a CLIP text encoder that converts prompts into embeddings.

Text Encoder (CLIP): prompt → embeddings
U-Net (Denoiser): noisy latent + embeddings → clean latent
VAE Decoder (Upscaler): latent → pixel image
Why the latent space? Operating in the compressed latent space (64×64 instead of 512×512) means each denoising step covers ~64x fewer spatial positions (8x fewer per dimension). Training and inference become dramatically faster, with little loss in output quality. This is the “latent” in Latent Diffusion Models (LDM).
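The arithmetic behind that claim can be checked in a few lines. This is a back-of-the-envelope sketch that assumes the Stable Diffusion v1.x shapes (a 512×512 RGB image and a 64×64 latent with 4 channels); the helper name `tensor_size` is made up for illustration.

```python
# Back-of-the-envelope comparison of pixel space vs. latent space.
# Assumption: SD v1.x shapes (512x512 RGB image, 64x64 latent with 4 channels).

def tensor_size(height: int, width: int, channels: int) -> int:
    """Number of scalar values in an image-like tensor."""
    return height * width * channels

pixel_h, pixel_w, pixel_c = 512, 512, 3
latent_h, latent_w, latent_c = 64, 64, 4

# How many times fewer spatial positions the U-Net has to process
spatial_ratio = (pixel_h * pixel_w) // (latent_h * latent_w)

# How many times fewer raw values there are once channels are counted too
value_ratio = tensor_size(pixel_h, pixel_w, pixel_c) / tensor_size(
    latent_h, latent_w, latent_c
)

print(f"Downsampling per side:   {pixel_h // latent_h}x")
print(f"Fewer spatial positions: {spatial_ratio}x")
print(f"Fewer values overall:    {value_ratio:.0f}x")
```

The spatial ratio is the 64x quoted above. Counting channels as well, the latent holds 48x fewer values than the RGB image, because it has 4 channels rather than 3.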

2. Text-to-Image Pipeline

The diffusers library from Hugging Face makes this remarkably simple. Under the hood it runs the full encode → denoise → decode pipeline.

text_to_image.py
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="A futuristic city at sunset, cinematic, 8k, detailed",
    negative_prompt="blurry, low quality, watermark",
    num_inference_steps=30,  # more steps = higher quality, slower
    guidance_scale=7.5,      # how closely to follow the prompt
    height=512,
    width=512,
).images[0]

image.save("output.png")

Key parameters

num_inference_steps
Denoising iterations. 20–30 is usually enough; diminishing returns beyond 50.
guidance_scale
CFG scale. 7–12 is typical. Higher = more prompt-adherent but less creative.
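negative_prompt
What to avoid. Useful for suppressing common artifacts: "blurry, ugly, deformed, watermark".
seed
Fix the random seed for reproducible outputs; vary it to explore different results. The pipeline call has no seed argument; pass generator=torch.Generator("cuda").manual_seed(42) instead.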
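To make guidance_scale less abstract, here is a minimal sketch of the classifier-free guidance (CFG) combination step, written with plain Python lists instead of tensors. The function name `apply_guidance` and the toy numbers are illustrative assumptions; in a real pipeline the same formula is applied to the U-Net's two noise predictions at every step.

```python
# Classifier-free guidance on toy "noise predictions".
# Assumption: plain lists stand in for the U-Net output tensors.

def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Push the prediction away from the unconditional one, toward the prompt.

    guided = uncond + scale * (text - uncond)
    """
    return [
        u + guidance_scale * (t - u)
        for u, t in zip(noise_uncond, noise_text)
    ]

noise_uncond = [0.0, 1.0]  # prediction with an empty prompt
noise_text = [1.0, 1.0]    # prediction with the actual prompt

# scale = 1.0 reproduces the text-conditioned prediction (no extra guidance)
print(apply_guidance(noise_uncond, noise_text, 1.0))

# scale = 7.5 exaggerates the difference between the two predictions
print(apply_guidance(noise_uncond, noise_text, 7.5))
```

Only the component where the two predictions disagree gets amplified, which is why larger scales follow the prompt more literally at the cost of variety.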

3. ControlNet

ControlNet adds structural control to the generation process — you can pass a pose skeleton, edge map, depth map, or segmentation mask to constrain exactly how the image is composed, while the prompt still controls style and content.

OpenPose
Human body pose skeleton controls character positioning
Canny Edges
Edge detection map preserves object structure from reference
Depth
Depth map preserves 3D composition and perspective
Seg / Normal
Segmentation or normal maps for fine-grained scene control
controlnet.py
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)

pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

edge_map = Image.open("reference_edges.png")

image = pipe(
    prompt="A robot in a cyberpunk city, neon lights, cinematic",
    image=edge_map,
    num_inference_steps=30,
).images[0]
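The example above loads a ready-made edge image. To show what such a conditioning image contains, here is a dependency-free toy that marks pixels where brightness changes sharply. It is a simple gradient threshold, not the Canny algorithm; in practice an edge map like this is usually produced with an image library's Canny implementation. The function name `toy_edge_map` and the threshold value are assumptions for illustration.

```python
# Toy edge map: white (255) where brightness jumps, black (0) elsewhere.
# Assumption: a simple gradient threshold as a stand-in for real Canny edges.

def toy_edge_map(gray, threshold=50):
    """gray is a 2D list of 0-255 brightness values; returns a same-size 2D list."""
    height, width = len(gray), len(gray[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Forward differences to the right and downward, clamped at the border
            dx = gray[y][min(x + 1, width - 1)] - gray[y][x]
            dy = gray[min(y + 1, height - 1)][x] - gray[y][x]
            if abs(dx) + abs(dy) >= threshold:
                edges[y][x] = 255
    return edges

# A dark left half next to a bright right half: one vertical edge
gray = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [0, 0, 255, 255],
]
for row in toy_edge_map(gray):
    print(row)
```

The output keeps only the outline of the scene, which is the kind of structure the ControlNet model receives alongside the prompt.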

4. LoRA Fine-tuning

LoRA (Low-Rank Adaptation) lets you fine-tune Stable Diffusion on your own images (10–30 photos) to teach it a specific person, art style, or product. The result is a small adapter file, often around 10 MB and larger at higher ranks, that you load on top of the base model.
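The adapter stays small because LoRA trains two thin matrices per adapted layer instead of the full weight matrix. The sketch below counts parameters for a single layer; the 768×768 layer size and rank 4 are illustrative assumptions, not values taken from a specific training run.

```python
# Parameter count for one weight matrix: full fine-tuning vs. LoRA.
# Assumption: layer size 768x768 and rank 4 are illustrative only.

def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when fine-tuning the whole d_out x d_in matrix."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in the LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_out + d_in)

d_out, d_in, rank = 768, 768, 4
full = full_params(d_out, d_in)
lora = lora_params(d_out, d_in, rank)

print(f"Full matrix:  {full:,} parameters")
print(f"LoRA factors: {lora:,} parameters")
print(f"Reduction:    {full / lora:.0f}x")
```

Raising the rank increases the adapter's capacity and its file size in direct proportion.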

DreamBooth + LoRA workflow

1. Gather 10–30 images of your subject. Varied angles, lighting, backgrounds. Crop to 512×512.
2. Choose a trigger word: a unique token like sks person or myproduct that you'll use in prompts.
3. Fine-tune with LoRA: ~500–1000 steps on a single GPU. Use kohya_ss or the diffusers training script.
4. Load the adapter and use the trigger word in your prompts to activate the fine-tuned concept.
load_lora.py
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load your LoRA weights
pipe.load_lora_weights("./my_lora_weights", adapter_name="my_style")

image = pipe(
    # Use trigger word to activate the fine-tuned concept
    prompt="sks person at the beach, golden hour, portrait",
    num_inference_steps=30,
).images[0]

5. Image-to-Image

Instead of starting from pure noise, image-to-image starts from an existing image with some noise added. This lets you restyle, vary, or repair existing images while preserving their structure.

img2img.py
from diffusers import StableDiffusionImg2ImgPipeline
import torch
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Convert to RGB so grayscale or RGBA inputs don't break the pipeline
init_image = Image.open("sketch.jpg").convert("RGB").resize((512, 512))

image = pipe(
    prompt="A detailed oil painting of a medieval castle",
    image=init_image,
    strength=0.75,       # 0=keep original, 1=ignore original
    guidance_scale=7.5,
).images[0]
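The strength value also determines how much of the denoising schedule actually runs. The sketch below assumes the step count scales as the requested steps times strength, rounded down, which is my understanding of how the diffusers img2img pipeline behaves; treat the exact rounding as an assumption. The function name `img2img_steps` is made up for illustration.

```python
# How strength maps to the number of denoising steps that actually run.
# Assumption: effective steps = int(num_inference_steps * strength), capped at
# num_inference_steps. Check the pipeline source for the exact rule.

def img2img_steps(num_inference_steps: int, strength: float) -> int:
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0.0 and 1.0")
    return min(int(num_inference_steps * strength), num_inference_steps)

for strength in (0.0, 0.3, 0.75, 1.0):
    steps = img2img_steps(50, strength)
    print(f"strength={strength}: {steps} of 50 steps run")
```

Under this assumption, a low strength adds little noise and runs only a few steps, so the input image survives almost unchanged. It also means a low strength needs a higher step count if you want more than a handful of denoising iterations.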

6. Building the API

A minimal FastAPI wrapper around the pipeline, with base64 image response:

image_api.py
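import base64
import io
import threading

import torch
from diffusers import StableDiffusionPipeline
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

# Load the pipeline once at startup, not per request
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")
pipe_lock = threading.Lock()  # one generation at a time on the shared pipeline

class GenRequest(BaseModel):
    prompt: str
    negative_prompt: str = "blurry, low quality"
    steps: int = 30
    guidance: float = 7.5

# Plain def (not async def): FastAPI runs it in a worker thread,
# so the blocking pipe() call does not stall the event loop
@app.post("/generate")
def generate(req: GenRequest):
    with pipe_lock:
        image = pipe(
            prompt=req.prompt,
            negative_prompt=req.negative_prompt,
            num_inference_steps=req.steps,
            guidance_scale=req.guidance,
        ).images[0]

    # Encode the PIL image as a base64 PNG string
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    return {"image": b64, "format": "png"}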
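On the client side, the base64 string has to be decoded back into image bytes. The sketch below simulates a response body with placeholder bytes rather than calling a running server, so the payload and the helper name `decode_image_response` are stand-ins; only the {"image": ..., "format": ...} shape comes from the endpoint above.

```python
# Decoding the /generate response on the client.
# Assumption: the response body is simulated here; no server is contacted.
import base64
import json

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # first 8 bytes of every PNG file

def decode_image_response(body: str) -> bytes:
    """Turn the JSON response body into raw image bytes."""
    payload = json.loads(body)
    return base64.b64decode(payload["image"])

# Simulated server response (placeholder bytes, not a real image)
fake_png = PNG_SIGNATURE + b"placeholder image data"
body = json.dumps(
    {"image": base64.b64encode(fake_png).decode(), "format": "png"}
)

raw = decode_image_response(body)
print(raw.startswith(PNG_SIGNATURE))  # quick sanity check before saving

# With a real response you would then write the bytes to disk:
# with open("result.png", "wb") as f:
#     f.write(raw)
```

Base64 inflates the payload by about a third, so for large images returning raw PNG bytes with an image content type may be the better trade-off.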

Use a task queue for production. Image generation typically takes 2–10 seconds per request on a GPU. Use Celery + Redis or another job queue so that concurrent requests wait in line instead of tying up the API server.
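The submit-then-poll pattern that a job queue provides can be sketched in-process with the standard library. This is not Celery or Redis: jobs live in memory and vanish on restart. `fake_generate` is a hypothetical stand-in for the slow pipeline call, and all function names are illustrative.

```python
# In-process sketch of the submit/poll pattern behind a job queue.
# Assumption: a single-worker thread pool stands in for Celery + Redis.
import uuid
from concurrent.futures import ThreadPoolExecutor

# One worker, so only one generation uses the GPU at a time
executor = ThreadPoolExecutor(max_workers=1)
jobs = {}  # job_id -> Future

def fake_generate(prompt: str) -> str:
    """Hypothetical stand-in for the slow pipe(...) call."""
    return f"image for: {prompt}"

def submit_job(prompt: str) -> str:
    """Queue a generation and return immediately with a job id."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = executor.submit(fake_generate, prompt)
    return job_id

def job_status(job_id: str) -> dict:
    """Report whether a job is unknown, still pending, or finished."""
    future = jobs.get(job_id)
    if future is None:
        return {"status": "unknown"}
    if not future.done():
        return {"status": "pending"}
    return {"status": "done", "result": future.result()}

job_id = submit_job("a lighthouse at dawn")
jobs[job_id].result(timeout=5)  # a real client would poll instead of blocking
print(job_status(job_id))
```

In an API, submit_job would back a POST endpoint that returns the job id, and job_status would back a GET endpoint that the client polls.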

7. Knowledge Check

5 questions

Q1. Why does Stable Diffusion operate in latent space instead of pixel space?
Q2. What does a higher guidance_scale (CFG) value do?
Q3. What does ControlNet add to Stable Diffusion?
Q4. In LoRA fine-tuning, what does the trigger word do?
Q5. In img2img, a strength of 0.0 means...

8. What's Next

Your image pipeline is ready. The final post covers deploying everything to production at scale.
