Image Generation Pipeline

05 / 06 · 2026-06-30 · 24 min read · Stable Diffusion LoRA ControlNet Advanced

Stable Diffusion is one of the most widely used open-source image generation models. In this post you'll understand its architecture, use ControlNet to guide what gets generated, fine-tune it with LoRA for a custom style, and build a complete image generation API.

In this post

1. Stable Diffusion Architecture
2. Text-to-Image Pipeline
3. ControlNet
4. LoRA Fine-tuning
5. Image-to-Image
6. Building the API
7. Knowledge Check
8. What's Next

1. Stable Diffusion Architecture

Stable Diffusion combines three components that work together: a VAE that compresses images, a U-Net that denoises in latent space, and a CLIP text encoder that converts prompts into embeddings.

Text Encoder (CLIP): Prompt → embeddings
⟶ U-Net (Denoiser): Noisy latent → clean latent
⟶ VAE Decoder (Upscaler): Latent → pixel image

💡

Why the latent space? The VAE compresses a 512×512 image into a 64×64 latent (8x smaller per dimension), so each denoising step covers about 64x fewer spatial positions. Training and inference become dramatically faster, with little loss in output quality. This is the “latent” in Latent Diffusion Models (LDM).
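The arithmetic behind that figure can be checked in a few lines. This is a minimal sketch assuming the Stable Diffusion v1 layout: RGB input, a VAE that downsamples 8x per side, and a 4-channel latent. The channel counts are not stated above and are an assumption here.

```python
# Assumed SD v1 layout: RGB input, VAE downsamples 8x per side, 4 latent channels.
IMAGE_SIZE = 512
IMAGE_CHANNELS = 3
VAE_DOWNSAMPLE = 8
LATENT_CHANNELS = 4


def latent_side(image_side: int, factor: int = VAE_DOWNSAMPLE) -> int:
    """Side length of the latent grid for a square image."""
    if image_side % factor != 0:
        raise ValueError(f"image side must be a multiple of {factor}")
    return image_side // factor


side = latent_side(IMAGE_SIZE)
spatial_ratio = (IMAGE_SIZE ** 2) // (side ** 2)      # positions the U-Net must cover
pixel_values = IMAGE_SIZE ** 2 * IMAGE_CHANNELS       # numbers in the pixel image
latent_values = side ** 2 * LATENT_CHANNELS           # numbers in the latent
value_ratio = pixel_values / latent_values

print(f"latent grid: {side}x{side}")
print(f"spatial positions: {spatial_ratio}x fewer")
print(f"stored values: {value_ratio:.0f}x fewer")
```

The spatial ratio is the 64x quoted above. Counting channels as well, the latent holds 48x fewer values than the pixel image.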

2. Text-to-Image Pipeline

The diffusers library from Hugging Face makes this remarkably simple. Under the hood it runs the full encode → denoise → decode pipeline.

text_to_image.py
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

image = pipe(
    prompt="A futuristic city at sunset, cinematic, 8k, detailed",
    negative_prompt="blurry, low quality, watermark",
    num_inference_steps=30,   # more steps = higher quality, slower
    guidance_scale=7.5,       # how closely to follow the prompt
    height=512, width=512,
).images[0]

image.save("output.png")

Key parameters

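num_inference_steps: Denoising iterations. 20–30 is usually enough; returns diminish beyond 50.

guidance_scale: Classifier-free guidance (CFG) scale. 7–12 is typical. Higher values follow the prompt more closely but produce less varied results.

negative_prompt: What to avoid. Useful for suppressing artifacts: "blurry, ugly, deformed, watermark".

seed: Fix the random seed for reproducible outputs. Vary it to explore different results. In diffusers the seed is supplied through the generator argument, for example generator=torch.Generator("cuda").manual_seed(42), rather than a seed parameter.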
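To make guidance_scale concrete: classifier-free guidance runs the denoiser twice per step, once with the prompt and once with the empty (or negative) prompt, then extrapolates from the second prediction toward the first. The sketch below uses plain Python lists as stand-ins for noise-prediction tensors. `apply_cfg` is a hypothetical helper written for illustration, not a diffusers function, and the numbers are toy values.

```python
def apply_cfg(uncond, cond, guidance_scale):
    """Combine two noise predictions with classifier-free guidance.

    uncond: prediction for the empty (or negative) prompt
    cond:   prediction for the actual prompt
    """
    return [u + guidance_scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0, -1.0]   # toy values, not real model output
cond = [1.0, 1.0, 0.0]

low = apply_cfg(uncond, cond, 1.0)    # scale 1 returns the conditional prediction
high = apply_cfg(uncond, cond, 7.5)   # larger scale pushes further along (cond - uncond)
print(low)
print(high)
```

Where the two predictions agree (the middle element), the scale has no effect. Where they differ, a higher scale amplifies the difference, which is why large values follow the prompt more rigidly.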

3. ControlNet

ControlNet adds structural control to the generation process — you can pass a pose skeleton, edge map, depth map, or segmentation mask to constrain exactly how the image is composed, while the prompt still controls style and content.

🦴 OpenPose: Human body pose skeleton controls character positioning

✏️ Canny Edges: Edge detection map preserves object structure from reference

🏔️ Depth: Depth map preserves 3D composition and perspective

🎨 Seg / Normal: Segmentation or normal maps for fine-grained scene control

controlnet.py
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
import torch
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny",
    torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16
).to("cuda")

edge_map = Image.open("reference_edges.png")

image = pipe(
    prompt="A robot in a cyberpunk city, neon lights, cinematic",
    image=edge_map,
    num_inference_steps=30,
).images[0]
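The example loads reference_edges.png without showing where it comes from. In practice such a map is usually produced by running a Canny detector (for example OpenCV's cv2.Canny) over a reference photo. The sketch below is a simplified stand-in in pure Python: it thresholds the intensity change between neighbouring pixels and is not Canny itself (no smoothing, non-maximum suppression, or hysteresis). `edge_map` and the toy image are hypothetical. The output follows the usual convention of white edges on a black background.

```python
def edge_map(gray, threshold=50):
    """Mark pixels whose horizontal or vertical intensity change exceeds threshold.

    gray: list of rows of 0-255 ints.
    Returns a same-sized grid of 0 (black) or 255 (white).
    """
    h, w = len(gray), len(gray[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Forward differences; the last row and column have no neighbour to compare
            dx = abs(gray[y][x + 1] - gray[y][x]) if x + 1 < w else 0
            dy = abs(gray[y + 1][x] - gray[y][x]) if y + 1 < h else 0
            if max(dx, dy) > threshold:
                out[y][x] = 255
    return out


# Toy 4x4 image: dark left half, bright right half
gray = [[10, 10, 200, 200] for _ in range(4)]
edges = edge_map(gray)
for row in edges:
    print(row)
```

The only white pixels sit on the vertical boundary between the dark and bright halves, which is the structure ControlNet would then be asked to preserve.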

4. LoRA Fine-tuning

LoRA (Low-Rank Adaptation) lets you fine-tune Stable Diffusion on your own images — 10–30 photos — to teach it a specific person, art style, or product. The result is a small adapter file (~10 MB) that you load on top of the base model.
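The "low-rank" part explains why the adapter is so small. Instead of updating a full weight matrix W, LoRA trains two thin matrices B and A and applies W + (alpha / r) · B·A, where r is the rank. A minimal sketch in pure Python follows. The 2×2 toy layer, the 768-wide projection, and rank 4 are illustrative assumptions, and `matmul` and `lora_merge` are hypothetical helpers.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_merge(w, b, a, alpha, rank):
    """Return W + (alpha / rank) * (B @ A), the merged weight used at inference."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [[wv + scale * dv for wv, dv in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy 2x2 layer with a rank-1 adapter
w = [[1.0, 0.0], [0.0, 1.0]]
b = [[1.0], [2.0]]        # d_out x r
a = [[0.5, 0.5]]          # r x d_in
merged = lora_merge(w, b, a, alpha=1.0, rank=1)
print(merged)

# Parameter count for one illustrative 768x768 projection at rank 4
d, r = 768, 4
full_params = d * d
lora_params = 2 * d * r
print(full_params, lora_params, full_params // lora_params)
```

For this one layer the adapter stores 96x fewer parameters than a full update, and only B and A need to be saved to disk.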

DreamBooth + LoRA workflow

Gather 10–30 images of your subject. Varied angles, lighting, backgrounds. Crop to 512×512.

Choose a trigger word — a unique token like sks person or myproduct that you'll use in prompts.

Fine-tune with LoRA — ~500–1000 steps on a single GPU. Use kohya_ss or the diffusers training script.

Load the adapter and use the trigger word in your prompts to activate the fine-tuned concept.

load_lora.py
from diffusers import StableDiffusionPipeline
import torch

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Load your LoRA weights
pipe.load_lora_weights("./my_lora_weights", adapter_name="my_style")

image = pipe(
    # Use trigger word to activate the fine-tuned concept
    prompt="sks person at the beach, golden hour, portrait",
    num_inference_steps=30,
).images[0]

5. Image-to-Image

Instead of starting from pure noise, image-to-image starts from an existing image with some noise added. This lets you restyle, vary, or repair existing images while preserving their structure.

img2img.py
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image
import torch

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

init_image = Image.open("sketch.jpg").resize((512, 512))

image = pipe(
    prompt="A detailed oil painting of a medieval castle",
    image=init_image,
    strength=0.75,   # 0=keep original, 1=ignore original
    guidance_scale=7.5,
).images[0]
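strength also affects how many denoising steps actually run: the input image is noised only part of the way along the schedule, and only that part is then denoised. The sketch below shows the relationship as I understand it from the diffusers img2img pipeline. `effective_steps` is a hypothetical helper written for illustration, and the exact rounding may differ between library versions.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Approximate number of denoising steps img2img runs for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Noise is added up to this point in the schedule, then removed step by step
    return min(int(num_inference_steps * strength), num_inference_steps)


for s in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(s, effective_steps(30, s))
```

With 30 requested steps and strength=0.75, only about 22 steps run, so lower strength values are both more faithful to the original and faster.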

6. Building the API

A minimal FastAPI wrapper around the pipeline, with base64 image response:

image_api.py
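from fastapi import FastAPI
from pydantic import BaseModel
from diffusers import StableDiffusionPipeline
import torch
import base64, io

app = FastAPI()

# Load the pipeline once at startup so every request reuses the same weights
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")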

class GenRequest(BaseModel):
    prompt: str
    negative_prompt: str = "blurry, low quality"
    steps: int = 30
    guidance: float = 7.5

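@app.post("/generate")
def generate(req: GenRequest):
    # Plain `def`: FastAPI runs it in a threadpool, so the blocking
    # pipe() call does not stall the event loop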
    image = pipe(
        prompt=req.prompt,
        negative_prompt=req.negative_prompt,
        num_inference_steps=req.steps,
        guidance_scale=req.guidance,
    ).images[0]

    buf = io.BytesIO()
    image.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()

    return {"image": b64, "format": "png"}
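On the client side, the response has to be base64-decoded before it can be saved or displayed. Here is a minimal sketch using only the standard library. `decode_image` is a hypothetical helper, and the payload is a stand-in built locally rather than a real server response.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_image(payload: dict) -> bytes:
    """Turn the {"image": <base64>, "format": "png"} response into raw bytes."""
    raw = base64.b64decode(payload["image"])
    if payload.get("format") == "png" and not raw.startswith(PNG_SIGNATURE):
        raise ValueError("payload is not a valid PNG")
    return raw


# Stand-in payload: just the PNG signature, encoded the same way the endpoint does
fake_payload = {"image": base64.b64encode(PNG_SIGNATURE).decode(), "format": "png"}
data = decode_image(fake_payload)
print(len(data))
```

With a real response, the returned bytes can be written straight to disk with open("output.png", "wb").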

⚡

Use a task queue for production. Image generation takes roughly 2–10 seconds per request, depending on hardware and step count. Use Celery + Redis or another job queue to handle concurrent requests without blocking the API server.
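The submit-then-poll pattern behind a task queue can be sketched with the standard library alone. This is an in-process stand-in, not Celery: `fake_generate`, `submit`, and `result` are hypothetical names, and a real deployment would replace the thread with worker processes and the dictionary with Redis or another result store.

```python
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}                      # job_id -> finished output
results_lock = threading.Lock()


def fake_generate(prompt: str) -> str:
    """Placeholder for the slow pipe(...) call."""
    return f"image for: {prompt}"


def worker():
    # Processes jobs one at a time, so generation calls never overlap
    while True:
        job_id, prompt = jobs.get()
        output = fake_generate(prompt)
        with results_lock:
            results[job_id] = output
        jobs.task_done()


def submit(prompt: str) -> str:
    """Enqueue a job and return its id immediately."""
    job_id = uuid.uuid4().hex
    jobs.put((job_id, prompt))
    return job_id


def result(job_id: str):
    """Return the finished output, or None if the job is still pending."""
    with results_lock:
        return results.get(job_id)


threading.Thread(target=worker, daemon=True).start()

job = submit("a castle at dawn")
jobs.join()                       # a real client would poll instead of blocking
print(result(job))
```

In an API built this way, POST /generate would call submit and return the job id, and a second endpoint would call result until the image is ready.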

7. Knowledge Check

5 questions

Q1. Why does Stable Diffusion operate in latent space instead of pixel space?

Q2. What does a higher guidance_scale (CFG) value do?

Q3. What does ControlNet add to Stable Diffusion?

Q4. In LoRA fine-tuning, what does the trigger word do?

Q5. In img2img, what does a strength of 0.0 mean?

8. What's Next

Your image pipeline is ready. The final post covers deploying everything to production at scale.
