Working with LLMs
1. How LLMs Work (Briefly)
A Large Language Model is a transformer trained on massive text corpora to predict the next token. Every response you get is the model sampling one token at a time, conditioned on everything before it.
The key insight: because the training data contains virtually every kind of text — code, instructions, reasoning, dialogue — the model learns to do all of those things. Your job as a builder is to steer that capability toward your use case.
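Temperature controls randomness. At temperature 0 the model always picks the most probable next token (greedy decoding), so output is nearly deterministic. At temperature 1 it samples from the model's unmodified distribution, and higher values flatten that distribution further, producing more varied but potentially less accurate outputs.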
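To make the effect of temperature concrete, here is a minimal sketch of temperature-scaled sampling over a handful of made-up token scores. The function names and the example logits are illustrative assumptions, not part of any model's API; real decoders work over vocabularies of tens of thousands of tokens and often add other filters.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn raw scores into probabilities, sharpened or flattened by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0; treat 0 as greedy argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_token(logits, temperature, rng=random):
    """Pick a token index: argmax at temperature 0, weighted sampling otherwise."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

# Hypothetical scores for three candidate tokens.
logits = [2.0, 1.0, 0.1]
cold = softmax_with_temperature(logits, 0.5)     # sharper: top token dominates
neutral = softmax_with_temperature(logits, 1.0)  # the unscaled distribution
hot = softmax_with_temperature(logits, 2.0)      # flatter: more variety
```

Lower temperatures concentrate probability on the top-scoring token, while higher ones spread it across the alternatives, which is where the extra variety (and extra risk) comes from.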
2. Prompt Engineering
The cheapest and fastest way to get more from an LLM. No training required — you just change what you put in the context window.
System prompt
The system prompt sets the model's role, tone, and constraints before the user says anything. It's the most powerful lever you have.
system_prompt.py

```python
# Bad: too vague
system = "You are a helpful assistant."

# Good: specific persona, constraints, format
system = """You are a senior Python engineer.
- Answer only about Python and its ecosystem.
- If unsure, say so; never hallucinate package names.
- Always include a runnable code example.
- Keep answers under 300 words unless asked for more."""
```
Few-shot prompting
Show the model examples of the input-output format you want before giving the real input. This dramatically improves consistency on structured tasks.
few_shot.py

```python
messages = [
    {"role": "user", "content": "Classify: I love this product!"},
    {"role": "assistant", "content": "Sentiment: POSITIVE"},
    {"role": "user", "content": "Classify: The shipping took forever."},
    {"role": "assistant", "content": "Sentiment: NEGATIVE"},
    {"role": "user", "content": "Classify: It does what it says."},
    # Expected model output: "Sentiment: NEUTRAL"
]
```
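Hand-writing the message list gets tedious once you have more than a few examples. As a rough sketch, a small helper can build the same structure from (input, label) pairs; `build_few_shot_messages` and its `task_prefix` parameter are hypothetical names introduced here for illustration.

```python
def build_few_shot_messages(examples, query, task_prefix="Classify: "):
    """Interleave (input, label) pairs as user/assistant turns, then add the real input."""
    messages = []
    for text, label in examples:
        messages.append({"role": "user", "content": f"{task_prefix}{text}"})
        messages.append({"role": "assistant", "content": f"Sentiment: {label}"})
    # The final user turn is the input we actually want classified.
    messages.append({"role": "user", "content": f"{task_prefix}{query}"})
    return messages

examples = [
    ("I love this product!", "POSITIVE"),
    ("The shipping took forever.", "NEGATIVE"),
]
messages = build_few_shot_messages(examples, "It does what it says.")
```

Keeping the examples in a plain list also makes it easy to swap them out or test which ones help most.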
Chain-of-thought
Adding "Think step by step" to a prompt causes the model to reason through the problem before giving an answer. This alone can boost accuracy on math and logic tasks, by as much as 20–40% in some cases, though the gain depends heavily on the model and the task.
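One practical wrinkle is that step-by-step output mixes reasoning with the result, so it helps to ask for a marked final line and parse it. The sketch below assumes an "Answer:" marker convention and uses a hand-written reply in place of a real model call; both helper names are hypothetical.

```python
import re

def with_chain_of_thought(question):
    """Append a reasoning instruction and a fixed marker for the final answer."""
    return (
        f"{question}\n\n"
        "Think step by step, then give the result on a final line "
        "formatted as 'Answer: <value>'."
    )

def extract_answer(completion):
    """Return the text after the last 'Answer:' marker, or None if it is missing."""
    matches = re.findall(r"^Answer:\s*(.+)$", completion, flags=re.MULTILINE)
    return matches[-1].strip() if matches else None

prompt = with_chain_of_thought(
    "A train travels 120 km in 1.5 hours. What is its average speed?"
)

# Hypothetical model reply, written by hand for illustration.
reply = "Speed is distance divided by time.\n120 / 1.5 = 80\nAnswer: 80 km/h"
final = extract_answer(reply)
```

Returning `None` when the marker is missing lets the caller retry or fall back instead of silently treating reasoning text as the answer.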
3. Fine-tuning
Fine-tuning continues training an existing LLM on your own dataset. Use it when you need the model to consistently follow a specific format, tone, or domain vocabulary that prompt engineering alone can't reliably achieve.
Fine-tune when:
- You need a specific output format (JSON, YAML, code)
- Domain jargon matters (legal, medical, internal terminology)
- You have 100+ high-quality labeled examples
- Latency matters and you want a smaller, faster model

Skip fine-tuning when:
- Prompt engineering already works well enough
- You need up-to-date knowledge (use RAG instead)
- You have fewer than 50 training examples
- Your use case changes frequently
LoRA: Efficient Fine-tuning
Fine-tuning all billions of parameters is expensive. LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices — reducing trainable parameters by 100–1,000x while keeping most of the benefit.
lora_finetune.py

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Gated checkpoint: requires accepting the license on Hugging Face.
base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

lora_config = LoraConfig(
    r=16,                                 # rank: higher = more capacity
    lora_alpha=32,                        # scaling factor applied to the adapter update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# Prints trainable vs. total parameter counts; with this config the
# trainable share is roughly 6.8M of about 8B parameters (under 0.1%).
```
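The savings come from simple arithmetic. For a frozen weight matrix of shape d_out x d_in, LoRA trains two small matrices, A (r x d_in) and B (d_out x r), and adds their scaled product to the frozen weights. The sketch below counts parameters for a single layer, assuming a square 4096 x 4096 projection chosen for illustration; the helper names are hypothetical.

```python
def lora_trainable_params(d_in, d_out, r):
    """Parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

def full_params(d_in, d_out):
    """Parameters in the frozen weight matrix the adapter sits beside."""
    return d_in * d_out

# Assumed shape for illustration: a square 4096 x 4096 projection, rank 16.
adapter = lora_trainable_params(4096, 4096, r=16)  # 131,072
dense = full_params(4096, 4096)                    # 16,777,216
reduction = dense / adapter                        # 128.0
```

Because the adapter grows linearly with the rank while the dense matrix grows with the product of its dimensions, halving `r` doubles the reduction for that layer. Model-wide reductions are larger still when only some modules are adapted.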
4. Embeddings & Semantic Search
An embedding converts text into a dense vector of floats. Similar texts produce similar vectors — so you can find related content by comparing vectors in high-dimensional space.
This is fundamentally different from keyword search: embeddings capture meaning, not just matching words.
Cosine similarity is the standard similarity metric. It measures the angle between two vectors: 1.0 means they point in the same direction (near-identical meaning), 0.0 means they are orthogonal (unrelated), and -1.0 means they point in opposite directions.
embeddings.py

```python
import numpy as np
from openai import OpenAI

# Anthropic's Messages API returns generated text, not vectors, so this
# sketch swaps in a dedicated embedding model (OpenAI's text-embedding-3-small).
# Assumes OPENAI_API_KEY is set in the environment.
client = OpenAI()

def embed(text):
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```
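To see how vector comparison turns into search, here is a dependency-free sketch that ranks a tiny corpus against a query. The three-dimensional vectors are invented for illustration (real embeddings have hundreds or thousands of dimensions), and `top_k` is a hypothetical helper, not a library function.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec, corpus, k=2):
    """Rank (label, vector) pairs by similarity to the query, best first."""
    scored = [(cosine_similarity(query_vec, vec), label) for label, vec in corpus]
    scored.sort(reverse=True)
    return [label for _, label in scored[:k]]

# Toy 3-dimensional "embeddings", invented for illustration.
corpus = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("shipping times", [0.1, 0.9, 0.0]),
    ("careers page", [0.0, 0.1, 0.9]),
]
query = [0.8, 0.3, 0.0]  # stands in for "how do I get my money back?"
ranked = top_k(query, corpus, k=2)  # ["refund policy", "shipping times"]
```

A linear scan like this is fine for a few thousand vectors; larger collections usually rely on an approximate nearest-neighbor index instead.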
5. Retrieval-Augmented Generation (RAG)
RAG solves LLMs' biggest weakness: they're frozen at training time. Instead of baking all knowledge into model weights, RAG retrieves relevant documents at query time and puts them in the context window.
rag.py

```python
def rag_query(question, vector_db, llm_client, top_k=3):
    # Step 1: embed the question
    q_embedding = embed(question)

    # Step 2: retrieve relevant chunks
    results = vector_db.query(q_embedding, top_k=top_k)
    context = "\n\n".join(r.text for r in results)

    # Step 3: generate grounded answer
    prompt = f"""Answer the question using ONLY the context below.
If the answer isn't in the context, say "I don't know."

Context:
{context}

Question: {question}"""

    response = llm_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text
```
Chunk size matters. 200–400 tokens per chunk is a common sweet spot. Too small and you lose context; too large and irrelevant content dilutes the signal.
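A minimal chunker along these lines might look like the sketch below. It makes two assumptions the text above does not: whitespace-separated words stand in for real tokenizer tokens, and consecutive chunks overlap by a fixed number of tokens so that a sentence cut at a boundary still appears whole in one chunk. The function names and default values are illustrative.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into windows of chunk_size that share `overlap` tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks

def chunk_text(text, chunk_size=300, overlap=50):
    """Whitespace-split stand-in for a real tokenizer."""
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, chunk_size, overlap)]

# 700 synthetic "tokens" produce windows of 300, 300, and 200 words.
chunks = chunk_text(" ".join(f"w{i}" for i in range(700)))
```

In a real pipeline you would count tokens with the same tokenizer your embedding model uses, and often split on paragraph or sentence boundaries first.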
6. Which Technique to Use
| Technique | Best for | Cost | Freshness |
|---|---|---|---|
| Prompt engineering | Behavior, format, tone | Free | Training cutoff |
| Fine-tuning | Style, domain vocabulary | Medium | Training cutoff |
| RAG | Private or fresh knowledge | Low (infra) | Real-time |
| Embeddings | Search, deduplication | Low | As fresh as index |
In practice, most production apps combine all four. Prompt engineering is always layer one; RAG for knowledge; fine-tuning when needed for style; embeddings underneath the search.
7. What's Next
You now have the core LLM toolkit. Next we apply it — building a full chatbot with memory and tools.