Image Generation Pipeline
1. Stable Diffusion Architecture
Stable Diffusion combines three components that work together: a VAE that compresses images, a U-Net that denoises in latent space, and a CLIP text encoder that converts prompts into embeddings.
Why the latent space? The VAE downsamples each spatial dimension by 8x, so a 512×512 image becomes a 64×64 latent. The U-Net therefore processes ~64x fewer spatial positions per denoising step, which makes training and inference dramatically faster with nearly identical output quality. This is the “latent” in Latent Diffusion Models (LDM).
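The arithmetic behind that figure can be checked in a few lines. The sketch below assumes the usual Stable Diffusion v1 layout (a 3-channel RGB input and a 4-channel latent with an 8x downsampling factor); the function name is illustrative and not part of any library.

```python
def latent_shape(height, width, downsample=8, latent_channels=4):
    """Return (channels, height, width) of the latent for a given image size."""
    if height % downsample or width % downsample:
        raise ValueError("height and width must be divisible by the downsample factor")
    return (latent_channels, height // downsample, width // downsample)


channels, lat_h, lat_w = latent_shape(512, 512)

pixel_positions = 512 * 512        # spatial positions in pixel space
latent_positions = lat_h * lat_w   # spatial positions in latent space
spatial_reduction = pixel_positions // latent_positions

pixel_values = 3 * 512 * 512       # RGB values in the image
latent_values = channels * lat_h * lat_w
value_reduction = pixel_values // latent_values

print(latent_shape(512, 512))  # (4, 64, 64)
print(spatial_reduction)       # 64
print(value_reduction)         # 48
```

The 64x figure counts spatial positions only. Counting raw values, the reduction is 48x, because the latent has four channels where the image has three.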
2. Text-to-Image Pipeline
The diffusers library from Hugging Face makes this remarkably simple. Under the hood it runs the full encode → denoise → decode pipeline.
text_to_image.py

    from diffusers import StableDiffusionPipeline
    import torch

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    image = pipe(
        prompt="A futuristic city at sunset, cinematic, 8k, detailed",
        negative_prompt="blurry, low quality, watermark",
        num_inference_steps=30,  # more steps = higher quality, slower
        guidance_scale=7.5,      # how closely to follow the prompt
        height=512,
        width=512,
    ).images[0]

    image.save("output.png")
Key parameters
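Of the parameters in the call above, guidance_scale is the least self-explanatory. Stable Diffusion uses classifier-free guidance: at each step the U-Net predicts noise twice, once conditioned on the prompt and once on the empty or negative prompt, and the two predictions are combined. The sketch below shows that combination on plain Python lists; the function name is illustrative, and real pipelines do the same thing on tensors.

```python
def apply_guidance(noise_uncond, noise_text, guidance_scale):
    """Classifier-free guidance: move the prediction away from the
    unconditional estimate and toward the prompt-conditioned one."""
    return [u + guidance_scale * (t - u) for u, t in zip(noise_uncond, noise_text)]


uncond = [0.0, 1.0, -0.5]   # prediction without the prompt
text = [1.0, 1.0, 0.5]      # prediction with the prompt

print(apply_guidance(uncond, text, 1.0))  # [1.0, 1.0, 0.5]
print(apply_guidance(uncond, text, 7.5))  # [7.5, 1.0, 7.0]
```

At a scale of 1.0 the result is just the prompt-conditioned prediction. Larger values exaggerate the difference between the two predictions, which is why very high scales tend to give oversaturated or distorted images.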
3. ControlNet
ControlNet adds structural control to the generation process — you can pass a pose skeleton, edge map, depth map, or segmentation mask to constrain exactly how the image is composed, while the prompt still controls style and content.
controlnet.py

    from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
    import torch
    from PIL import Image

    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-canny",
        torch_dtype=torch.float16,
    )

    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    ).to("cuda")

    edge_map = Image.open("reference_edges.png")

    image = pipe(
        prompt="A robot in a cyberpunk city, neon lights, cinematic",
        image=edge_map,
        num_inference_steps=30,
    ).images[0]
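The edge map itself has to come from somewhere. The canny checkpoint is normally paired with Canny edges, usually computed with OpenCV. The sketch below uses Pillow's built-in FIND_EDGES filter plus a threshold as a simpler stand-in, so treat it as an illustration of the input format (white edges on a black background) rather than a replacement for a real Canny detector. The function name and the threshold value are assumptions.

```python
from PIL import Image, ImageDraw, ImageFilter


def make_edge_map(image, threshold=40):
    """Return a 3-channel image with white edges on a black background."""
    gray = image.convert("L")
    edges = gray.filter(ImageFilter.FIND_EDGES)
    # Binarize: keep only responses stronger than the threshold.
    binary = edges.point(lambda v: 255 if v > threshold else 0)
    return binary.convert("RGB")


# Synthetic reference image so the example runs without any files:
# a white square on a black background.
reference = Image.new("L", (64, 64), 0)
ImageDraw.Draw(reference).rectangle([16, 16, 47, 47], fill=255)

edge_map = make_edge_map(reference)
print(edge_map.size, edge_map.mode)  # (64, 64) RGB
```

With a real photo, replace the synthetic image with Image.open(...) and save the result for use as the image argument.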
4. LoRA Fine-tuning
LoRA (Low-Rank Adaptation) lets you fine-tune Stable Diffusion on your own images — 10–30 photos — to teach it a specific person, art style, or product. The result is a small adapter file (~10 MB) that you load on top of the base model.
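The adapter is small because LoRA freezes each original weight matrix W and learns only a low-rank update, W + B·A, where B and A share a small inner dimension r. A rough parameter count for one layer, using an illustrative 768×768 projection and rank 4 (both values are assumptions chosen for the example, not read from a specific checkpoint):

```python
def lora_param_counts(d_out, d_in, rank):
    """Compare a full weight matrix with its low-rank LoRA update."""
    full = d_out * d_in                    # parameters in W
    adapter = d_out * rank + rank * d_in   # parameters in B (d_out x r) and A (r x d_in)
    return full, adapter


full, adapter = lora_param_counts(768, 768, rank=4)
print(full)             # 589824
print(adapter)          # 6144
print(full // adapter)  # 96
```

Only the adapter matrices are trained and saved, which is why the resulting file is a small fraction of the base model's size.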
DreamBooth + LoRA workflow
Choose a trigger word, such as sks person or myproduct, that you'll use in prompts.

load_lora.py

    from diffusers import StableDiffusionPipeline
    import torch

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Load your LoRA weights
    pipe.load_lora_weights("./my_lora_weights", adapter_name="my_style")

    image = pipe(
        # Use trigger word to activate the fine-tuned concept
        prompt="sks person at the beach, golden hour, portrait",
        num_inference_steps=30,
    ).images[0]
5. Image-to-Image
Instead of starting from pure noise, image-to-image starts from an existing image with some noise added. This lets you restyle, vary, or repair existing images while preserving their structure.
img2img.py

    from diffusers import StableDiffusionImg2ImgPipeline
    from PIL import Image
    import torch

    pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    # Resize the input to the resolution the model was trained on
    init_image = Image.open("sketch.jpg").resize((512, 512))

    image = pipe(
        prompt="A detailed oil painting of a medieval castle",
        image=init_image,
        strength=0.75,  # 0=keep original, 1=ignore original
        guidance_scale=7.5,
    ).images[0]
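One consequence of strength that is easy to miss: it also shortens the denoising run. The helper below is a sketch of how the diffusers img2img pipeline appears to derive its starting step from strength. The function name is hypothetical, and the exact rounding may differ between library versions.

```python
def img2img_steps(num_inference_steps, strength):
    """Return (steps_skipped, steps_run) for a given strength."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    steps_run = min(int(num_inference_steps * strength), num_inference_steps)
    steps_skipped = num_inference_steps - steps_run
    return steps_skipped, steps_run


print(img2img_steps(50, 0.75))  # (13, 37)
print(img2img_steps(50, 1.0))   # (0, 50)
print(img2img_steps(50, 0.0))   # (50, 0)
```

Under this assumption, lowering strength both preserves more of the input image and reduces the time per generation, since fewer denoising steps actually run.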
6. Building the API
A minimal FastAPI wrapper around the pipeline, with base64 image response:
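image_api.py

    from diffusers import StableDiffusionPipeline
    from fastapi import FastAPI
    from pydantic import BaseModel
    import base64, io
    import torch

    app = FastAPI()

    # Load the pipeline once at startup, not per request
    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5",
        torch_dtype=torch.float16,
    ).to("cuda")

    class GenRequest(BaseModel):
        prompt: str
        negative_prompt: str = "blurry, low quality"
        steps: int = 30
        guidance: float = 7.5

    @app.post("/generate")
    async def generate(req: GenRequest):
        # pipe(...) is a blocking call, so this handler serves one request at a time
        image = pipe(
            prompt=req.prompt,
            negative_prompt=req.negative_prompt,
            num_inference_steps=req.steps,
            guidance_scale=req.guidance,
        ).images[0]

        # Encode the PIL image as a base64 PNG string
        buf = io.BytesIO()
        image.save(buf, format="PNG")
        b64 = base64.b64encode(buf.getvalue()).decode()
        return {"image": b64, "format": "png"}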
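On the client side, the base64 string has to be decoded back into bytes before it can be written to disk or displayed. A minimal sketch, assuming the response shape returned by the endpoint above ({"image": ..., "format": "png"}); the helper name is illustrative.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def decode_image_response(payload):
    """Decode the API's JSON payload into raw image bytes."""
    raw = base64.b64decode(payload["image"], validate=True)
    if payload.get("format") == "png" and not raw.startswith(PNG_SIGNATURE):
        raise ValueError("payload is labelled png but does not start with the PNG signature")
    return raw


# Simulated response so the example runs without a server.
fake_png = PNG_SIGNATURE + b"example-bytes"
payload = {"image": base64.b64encode(fake_png).decode(), "format": "png"}

image_bytes = decode_image_response(payload)
print(image_bytes == fake_png)  # True
```

With a real response, write image_bytes to a file opened in "wb" mode.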
Use a task queue for production. Image generation takes 2–10 seconds per request, so use Celery + Redis or another job queue to handle concurrent requests without blocking the API server.
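The shape of that pattern, reduced to the standard library: a request enqueues a job and gets an ID back immediately, while a single worker thread drains the queue. This is an in-process stand-in for Celery + Redis, not a substitute for it. run_pipeline is a placeholder for the real pipe(...) call, and all names here are hypothetical.

```python
import queue
import threading
import uuid

jobs = queue.Queue()
results = {}
results_lock = threading.Lock()


def run_pipeline(prompt):
    """Placeholder for the slow pipe(...) call."""
    return f"image for: {prompt}"


def submit(prompt):
    """Enqueue a job and return its ID without waiting for the result."""
    job_id = uuid.uuid4().hex
    with results_lock:
        results[job_id] = {"status": "queued", "result": None}
    jobs.put((job_id, prompt))
    return job_id


def worker():
    """Process jobs one at a time so the GPU is never shared."""
    while True:
        item = jobs.get()
        if item is None:  # sentinel: stop the worker
            jobs.task_done()
            break
        job_id, prompt = item
        output = run_pipeline(prompt)
        with results_lock:
            results[job_id] = {"status": "done", "result": output}
        jobs.task_done()


thread = threading.Thread(target=worker, daemon=True)
thread.start()

job_id = submit("A futuristic city at sunset")
jobs.join()  # a real client would poll a status endpoint instead
print(results[job_id]["status"])  # done

jobs.put(None)  # shut the worker down
thread.join()
```

In an API, submit would back the POST endpoint and a second endpoint would look up results[job_id]. A real deployment needs the queue and results to live outside the web process, which is what Celery + Redis provide.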
7. What's Next
Your image pipeline is ready. The final post covers deploying everything to production at scale.