Working with
LLMs

03 / 06 · 2026-06-30 · 20 min read · LLMs RAG Fine-tuning Intermediate

LLMs are the most powerful general-purpose AI building blocks available today. This post covers the four main ways you use them in real applications: prompt engineering, fine-tuning, RAG, and embedding search — with code for each.

In this post

1. How LLMs Work (Briefly) 2. Prompt Engineering 3. Fine-tuning 4. Embeddings & Semantic Search 5. Retrieval-Augmented Generation (RAG) 6. Which Technique to Use 7. Knowledge Check 8. What's Next

1. How LLMs Work (Briefly)

A Large Language Model is a transformer trained on massive text corpora to predict the next token. Every response you get is the model sampling one token at a time, conditioned on everything before it.

The key insight: because the training data contains virtually every kind of text — code, instructions, reasoning, dialogue — the model learns to do all of those things. Your job as a builder is to steer that capability toward your use case.

💡

Temperature controls randomness. At temperature 0 the model picks the most probable next token every time (deterministic). At temperature 1+ it samples more freely, producing more varied but potentially less accurate outputs.

2. Prompt Engineering

The cheapest and fastest way to get more from an LLM. No training required — you just change what you put in the context window.

System prompt

The system prompt sets the model's role, tone, and constraints before the user says anything. It's the most powerful lever you have.

system_prompt.py
# Bad — too vague
system = "You are a helpful assistant."

# Good — specific persona, constraints, format
system = """You are a senior Python engineer.
- Answer only about Python and its ecosystem.
- If unsure, say so — never hallucinate package names.
- Always include a runnable code example.
- Keep answers under 300 words unless asked for more."""

Few-shot prompting

Show the model examples of the input-output format you want before giving the real input. This dramatically improves consistency on structured tasks.

few_shot.py
messages = [
    {"role": "user",      "content": "Classify: I love this product!"},
    {"role": "assistant", "content": "Sentiment: POSITIVE"},
    {"role": "user",      "content": "Classify: The shipping took forever."},
    {"role": "assistant", "content": "Sentiment: NEGATIVE"},
    {"role": "user",      "content": "Classify: It does what it says."},
    # Model will output: "Sentiment: NEUTRAL"
]

Chain-of-thought

Adding "Think step by step" to a prompt causes the model to reason through the problem before giving an answer. This alone can boost accuracy on math and logic tasks by 20–40%.

Be specific about format, length, and tone

Use delimiters (```, ---) to separate data from instructions

One task per prompt — break complex tasks into steps

Ask for reasoning before the final answer

3. Fine-tuning

Fine-tuning continues training an existing LLM on your own dataset. Use it when you need the model to consistently follow a specific format, tone, or domain vocabulary that prompt engineering alone can't reliably achieve.

Use fine-tuning when...

You need a specific output format (JSON, YAML, code)
Domain jargon matters (legal, medical, internal terminology)
You have 100+ high-quality labeled examples
Latency matters — you want a smaller, faster model

Don't fine-tune when...

Prompt engineering already works well enough
You need up-to-date knowledge (use RAG instead)
You have fewer than 50 training examples
Your use case changes frequently

LoRA: Efficient Fine-tuning

Fine-tuning all billions of parameters is expensive. LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices — reducing trainable parameters by 100–1,000x while keeping most of the benefit.

lora_finetune.py
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3-8B")

lora_config = LoraConfig(
    r=16,               # rank — higher = more capacity
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM"
)

model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()
# trainable params: 4,194,304 || all params: 8,030,261,248 (0.05%)

4. Embeddings & Semantic Search

An embedding converts text into a dense vector of floats. Similar texts produce similar vectors — so you can find related content by comparing vectors in high-dimensional space.

This is fundamentally different from keyword search: embeddings capture meaning, not just matching words.

⚡

Cosine similarity is the standard distance metric. It measures the angle between two vectors — 1.0 means identical meaning, 0.0 means unrelated, -1.0 means opposite.

embeddings.py
from anthropic import Anthropic
import numpy as np

client = Anthropic()

def embed(text):
    resp = client.messages.create(
        model="claude-3-5-haiku-latest",
        max_tokens=1,
        messages=[{"role": "user", "content": text}]
    )
    # In practice, use a dedicated embedding model like text-embedding-3
    return np.array(resp.content[0].text)

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

5. Retrieval-Augmented Generation (RAG)

RAG solves LLMs' biggest weakness: they're frozen at training time. Instead of baking all knowledge into model weights, RAG retrieves relevant documents at query time and puts them in the context window.

Index: chunk your documents, embed each chunk, store in a vector DB (Pinecone, Chroma, pgvector)

Retrieve: embed the user's question, find the top-k most similar chunks via cosine search

Generate: pass retrieved chunks + user question to the LLM as context, get a grounded answer

rag.py
def rag_query(question, vector_db, llm_client, top_k=3):
    # Step 1: embed the question
    q_embedding = embed(question)

    # Step 2: retrieve relevant chunks
    results = vector_db.query(q_embedding, top_k=top_k)
    context = "\n\n".join(r.text for r in results)

    # Step 3: generate grounded answer
    prompt = f"""Answer the question using ONLY the context below.
If the answer isn't in the context, say "I don't know."

Context:
{context}

Question: {question}"""

    response = llm_client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        messages=[{"role": "user", "content": prompt}]
    )
    return response.content[0].text

⚡

Chunk size matters. 200–400 tokens per chunk is a common sweet spot. Too small and you lose context; too large and irrelevant content dilutes the signal.

6. Which Technique to Use

Technique	Best for	Cost	Freshness
Prompt engineering	Behavior, format, tone	Free	Training cutoff
Fine-tuning	Style, domain vocabulary	Medium	Training cutoff
RAG	Private or fresh knowledge	Low (infra)	Real-time
Embeddings	Search, deduplication	Low	As fresh as index

In practice, most production apps combine all four. Prompt engineering is always layer one; RAG for knowledge; fine-tuning when needed for style; embeddings underneath the search.

7. Knowledge Check

5 questions · pick the best answer

Q1 of 5

What does temperature control in an LLM?

Q2 of 5

What is the main advantage of LoRA over full fine-tuning?

Q3 of 5

What does an embedding capture that keyword search does not?

Q4 of 5

What problem does RAG primarily solve?

Q5 of 5

Few-shot prompting means...

8. What's Next

You now have the core LLM toolkit. Next we apply it — building a full chatbot with memory and tools.