Topics Building AI Model from Scratch
Series · 6 posts Contact

Working with
LLMs

LLMs are the most powerful general-purpose AI building blocks available today. This post covers the four main ways you use them in real applications: prompt engineering, fine-tuning, RAG, and embedding search — with code for each.

1. How LLMs Work (Briefly)

A Large Language Model is a transformer trained on massive text corpora to predict the next token. Every response you get is the model sampling one token at a time, conditioned on everything before it.

The key insight: because the training data contains virtually every kind of text — code, instructions, reasoning, dialogue — the model learns to do all of those things. Your job as a builder is to steer that capability toward your use case.

💡

Temperature controls randomness. At temperature 0 the model picks the most probable next token every time (deterministic). At temperature 1+ it samples more freely, producing more varied but potentially less accurate outputs.

2. Prompt Engineering

The cheapest and fastest way to get more from an LLM. No training required — you just change what you put in the context window.

System prompt

The system prompt sets the model's role, tone, and constraints before the user says anything. It's the most powerful lever you have.

system_prompt.py
# Bad — too vague system = "You are a helpful assistant." # Good — specific persona, constraints, format system = """You are a senior Python engineer. - Answer only about Python and its ecosystem. - If unsure, say so — never hallucinate package names. - Always include a runnable code example. - Keep answers under 300 words unless asked for more."""

Few-shot prompting

Show the model examples of the input-output format you want before giving the real input. This dramatically improves consistency on structured tasks.

few_shot.py
messages = [ {"role": "user", "content": "Classify: I love this product!"}, {"role": "assistant", "content": "Sentiment: POSITIVE"}, {"role": "user", "content": "Classify: The shipping took forever."}, {"role": "assistant", "content": "Sentiment: NEGATIVE"}, {"role": "user", "content": "Classify: It does what it says."}, # Model will output: "Sentiment: NEUTRAL" ]

Chain-of-thought

Adding "Think step by step" to a prompt causes the model to reason through the problem before giving an answer. This alone can boost accuracy on math and logic tasks by 20–40%.

Be specific about format, length, and tone
Use delimiters (```, ---) to separate data from instructions
One task per prompt — break complex tasks into steps
Ask for reasoning before the final answer

3. Fine-tuning

Fine-tuning continues training an existing LLM on your own dataset. Use it when you need the model to consistently follow a specific format, tone, or domain vocabulary that prompt engineering alone can't reliably achieve.

Use fine-tuning when...
  • You need a specific output format (JSON, YAML, code)
  • Domain jargon matters (legal, medical, internal terminology)
  • You have 100+ high-quality labeled examples
  • Latency matters — you want a smaller, faster model
Don't fine-tune when...
  • Prompt engineering already works well enough
  • You need up-to-date knowledge (use RAG instead)
  • You have fewer than 50 training examples
  • Your use case changes frequently

LoRA: Efficient Fine-tuning

Fine-tuning all billions of parameters is expensive. LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices — reducing trainable parameters by 100–1,000x while keeping most of the benefit.

lora_finetune.py
from peft import LoraConfig, get_peft_model from transformers import AutoModelForCausalLM base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B") lora_config = LoraConfig( r=16, # rank — higher = more capacity lora_alpha=32, target_modules=["q_proj", "v_proj"], lora_dropout=0.05, task_type="CAUSAL_LM" ) model = get_peft_model(base_model, lora_config) model.print_trainable_parameters() # trainable params: 4,194,304 || all params: 8,030,261,248 (0.05%)

4. Embeddings & Semantic Search

An embedding converts text into a dense vector of floats. Similar texts produce similar vectors — so you can find related content by comparing vectors in high-dimensional space.

This is fundamentally different from keyword search: embeddings capture meaning, not just matching words.

Cosine similarity is the standard distance metric. It measures the angle between two vectors — 1.0 means identical meaning, 0.0 means unrelated, -1.0 means opposite.

embeddings.py
from anthropic import Anthropic import numpy as np client = Anthropic() def embed(text): resp = client.messages.create( model="claude-3-5-haiku-latest", max_tokens=1, messages=[{"role": "user", "content": text}] ) # In practice, use a dedicated embedding model like text-embedding-3 return np.array(resp.content[0].text) def cosine_similarity(a, b): return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

5. Retrieval-Augmented Generation (RAG)

RAG solves LLMs' biggest weakness: they're frozen at training time. Instead of baking all knowledge into model weights, RAG retrieves relevant documents at query time and puts them in the context window.

1
Index: chunk your documents, embed each chunk, store in a vector DB (Pinecone, Chroma, pgvector)
2
Retrieve: embed the user's question, find the top-k most similar chunks via cosine search
3
Generate: pass retrieved chunks + user question to the LLM as context, get a grounded answer
rag.py
def rag_query(question, vector_db, llm_client, top_k=3): # Step 1: embed the question q_embedding = embed(question) # Step 2: retrieve relevant chunks results = vector_db.query(q_embedding, top_k=top_k) context = "\n\n".join(r.text for r in results) # Step 3: generate grounded answer prompt = f"""Answer the question using ONLY the context below. If the answer isn't in the context, say "I don't know." Context: {context} Question: {question}""" response = llm_client.messages.create( model="claude-sonnet-4-6", max_tokens=512, messages=[{"role": "user", "content": prompt}] ) return response.content[0].text

Chunk size matters. 200–400 tokens per chunk is a common sweet spot. Too small and you lose context; too large and irrelevant content dilutes the signal.

6. Which Technique to Use

Technique Best for Cost Freshness
Prompt engineering Behavior, format, tone Free Training cutoff
Fine-tuning Style, domain vocabulary Medium Training cutoff
RAG Private or fresh knowledge Low (infra) Real-time
Embeddings Search, deduplication Low As fresh as index

In practice, most production apps combine all four. Prompt engineering is always layer one; RAG for knowledge; fine-tuning when needed for style; embeddings underneath the search.

7. Knowledge Check

5 questions · pick the best answer

Q1 of 5
What does temperature control in an LLM?
Q2 of 5
What is the main advantage of LoRA over full fine-tuning?
Q3 of 5
What does an embedding capture that keyword search does not?
Q4 of 5
What problem does RAG primarily solve?
Q5 of 5
Few-shot prompting means...

8. What's Next

You now have the core LLM toolkit. Next we apply it — building a full chatbot with memory and tools.

← Previous Generative AI Fundamentals
Next → Building a Chatbot