
Easy Config via CLI Tool

Detection logic belongs in code. Thresholds, exclusion lists, and output modes belong in configuration. When operators have to read Python source to change how sensitive the detector is, you've made a tooling error — not a tuning problem. This post designs a YAML config file and CLI flag system that gives operators full control without requiring a single line of code change, while keeping secrets safely outside of config files entirely.

The operator vs developer distinction

Two different people interact with a PII detection system:

  • Developers write detection patterns, implement redaction strategies, and define what entity types the system understands. They work in code and commit changes to version control.
  • Operators run the system against specific datasets, tune sensitivity for their risk tolerance, exclude test data and synthetic fixtures, and change output formats to match their downstream pipelines. They should never need to read source code to do any of this.

Conflating these roles produces systems that are either locked down (operators can't tune anything without a developer) or dangerous (developers put policy-level decisions in code where they're hard to review separately). The config layer is the boundary between them.

Easy config philosophy: operators should never need to read source code to change system behavior. If an operator needs to open a .py file to adjust a threshold, the interface is wrong — move it to YAML.

The YAML config file

A single pii-guard.yaml file controls all tunable behavior. The design principle is that the file should be self-documenting — comments explain each key and its valid values.

pii-guard.yaml — Annotated configuration file
# pii-guard.yaml
# All keys are optional; omit a key to fall back to its built-in default.

# detection_mode controls which entity types are checked.
# Options: "strict" (all types) | "standard" (PII only) | "minimal" (high-value PII only)
detection_mode: standard

# confidence_threshold: minimum model confidence to treat a candidate as confirmed PII.
# Range: 0.0–1.0. Lower = more aggressive. Default: 0.75
confidence_threshold: 0.75

# exclusion_patterns: glob patterns relative to the scan root.
# Files matching any pattern are skipped entirely.
exclusion_patterns:
  - "tests/**"
  - "fixtures/**"
  - "**/*.min.js"
  - "docs/schema-examples/**"

# exclusion_strings: literal strings that, when present in a span's surrounding
# context (±50 chars), suppress detection. Use for known-safe templates.
exclusion_strings:
  - "example.invalid"
  - "test@"
  - "000-00-0000"
  - "555-"

# redaction_strategy: how to handle confirmed PII in the output.
# Options: "mask" | "replace" | "pseudonymize" | "fpe"
redaction_strategy: mask

# output_mode: format for the results report.
# Options: "json" | "text" | "sarif"  (SARIF for IDE/CI integration)
output_mode: text

# audit_level: controls how much is written to the audit log.
# Options: "full" | "summary" | "off"
audit_level: full

# max_workers: parallel document processing threads. Default: 4
max_workers: 4

# model: which LLM to use for Pass 2 confirmation.
# API key must be set via environment variable, never here.
model: claude-sonnet-4-5
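
The ±50-character context rule for exclusion_strings can be sketched roughly as follows. This is a minimal illustration, not the tool's actual matcher: the function name is_suppressed, its parameters, and the choice to include the span itself in the searched window are all assumptions.

```python
from __future__ import annotations


def is_suppressed(
    text: str,
    start: int,
    end: int,
    exclusion_strings: list[str],
    window: int = 50,  # assumed to correspond to the "±50 chars" in the config comment
) -> bool:
    """Return True if any exclusion string occurs in the span or within
    `window` characters on either side of it."""
    lo = max(0, start - window)
    hi = min(len(text), end + window)
    context = text[lo:hi]
    # Skip empty strings: "" is a substring of everything and would suppress all spans.
    return any(s and s in context for s in exclusion_strings)


doc = "Contact us at test@example.invalid or call the front desk."
span_start = doc.index("test@example.invalid")
span_end = span_start + len("test@example.invalid")
print(is_suppressed(doc, span_start, span_end, ["example.invalid"]))  # True
```

Short entries such as "555-" or "test@" match very broadly under a rule like this, so it is worth keeping the list as specific as the data allows.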

All config keys at a glance

Key                   Type       Default            Controls
detection_mode        string     standard           Which entity type categories are included in the sweep
confidence_threshold  float      0.75               Minimum model confidence to confirm a candidate as PII
exclusion_patterns    list[str]  []                 Glob patterns for files/directories to skip
exclusion_strings     list[str]  []                 Context strings that suppress detection when present near a span
redaction_strategy    string     mask               How confirmed PII is transformed in the output
output_mode           string     text               Format of the scan results report
audit_level           string     full               Verbosity of the append-only audit log
max_workers           int        4                  Parallel worker threads for document processing
model                 string     claude-sonnet-4-5  LLM endpoint for Pass 2 confirmation calls
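
As a rough sketch of how exclusion_patterns could be applied, the snippet below matches each path, relative to the scan root, against the configured globs using Python's fnmatch module. The helper name is_excluded is hypothetical, and fnmatch only approximates shell-style globbing: its * also matches /, and a pattern such as **/*.min.js will not match a file sitting directly in the scan root.

```python
from __future__ import annotations

from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def is_excluded(path: PurePosixPath, scan_root: PurePosixPath, patterns: list[str]) -> bool:
    """Return True if `path`, taken relative to `scan_root`, matches any glob."""
    rel = path.relative_to(scan_root).as_posix()
    # fnmatchcase is case-sensitive on every platform, which keeps results reproducible.
    return any(fnmatchcase(rel, pattern) for pattern in patterns)


root = PurePosixPath("/repo")
patterns = ["tests/**", "fixtures/**", "**/*.min.js"]

print(is_excluded(PurePosixPath("/repo/tests/unit/test_a.py"), root, patterns))  # True
print(is_excluded(PurePosixPath("/repo/static/js/app.min.js"), root, patterns))  # True
print(is_excluded(PurePosixPath("/repo/src/main.py"), root, patterns))           # False
```

A real implementation may prefer a gitignore-style matcher if operators expect ** to behave exactly as it does in their shell or in .gitignore files.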

CLI usage examples

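CLI flags map onto the YAML keys (for example, --mode sets detection_mode and --threshold sets confidence_threshold) and take precedence over the file. The one exception is --exclude, which appends to exclusion_patterns rather than replacing it. This lets CI/CD pipelines override thresholds without editing the committed config file.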

shell — CLI usage examples
# Scan a directory using the default config file (looks for pii-guard.yaml in cwd)
pii-guard check ./docs

# Strict mode: check all entity types including CUI categories
pii-guard check ./docs --mode strict

# Raise threshold for a low-risk code review scan
pii-guard check ./src --threshold 0.90

# Exclude test directories and use JSON output for downstream parsing
pii-guard check ./data --exclude "tests/**" --exclude "*.fixture.json" --output json

# CI/CD pipeline: strict, JSON output, fail on any detection (exit code 1)
pii-guard check ./reports --mode strict --output json --fail-on-detect

# Override config file location
pii-guard check ./docs --config /etc/pii-guard/prod.yaml

# Dry run: detect but don't write redacted output or audit log
pii-guard check ./inbox --dry-run

# Scan and redact, writing redacted copies to ./redacted/
pii-guard redact ./inbox --strategy pseudonymize --output-dir ./redacted
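
The --fail-on-detect behavior can be pictured as a small mapping from scan results to a process exit code. The sketch below is an assumption about how that might look: the Finding record and the exit_code helper are hypothetical, and returning 0 when the flag is absent is a guess, since the examples above only state that a detection yields exit code 1 when the flag is set.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Finding:
    """Hypothetical result record for one confirmed detection."""
    path: str
    entity_type: str
    confidence: float


def exit_code(findings: list[Finding], fail_on_detect: bool) -> int:
    """Return 1 only when the flag is set and at least one finding exists."""
    return 1 if (fail_on_detect and findings) else 0


findings = [Finding("reports/q3.txt", "EMAIL", 0.93)]
print(exit_code(findings, fail_on_detect=True))   # 1
print(exit_code(findings, fail_on_detect=False))  # 0
print(exit_code([], fail_on_detect=True))         # 0

# In a real entry point this value would be passed to sys.exit().
```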

Python: loading and merging config with CLI overrides

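The config loader follows a strict precedence order: CLI flags beat environment variables, which beat the YAML config, which beats the built-in defaults. This means the same config file works in development (with default thresholds) and in a stricter CI pipeline (with --threshold 0.60). Note that the listing below implements the CLI, YAML, and default layers; the only value it reads from the environment is the API key.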
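
One way to picture the full four-layer lookup is with collections.ChainMap, which searches its mappings left to right and returns the first hit. This is a minimal sketch rather than the loader's actual implementation: the PII_GUARD_ environment variable prefix and the helper names are assumptions made for illustration.

```python
from __future__ import annotations

from collections import ChainMap

DEFAULTS = {"detection_mode": "standard", "confidence_threshold": 0.75}
ENV_PREFIX = "PII_GUARD_"  # hypothetical naming scheme, not defined by the tool


def env_layer(environ: dict[str, str]) -> dict[str, object]:
    """Pick up PII_GUARD_<KEY> variables and cast each to the default's type."""
    layer: dict[str, object] = {}
    for key, default in DEFAULTS.items():
        raw = environ.get(ENV_PREFIX + key.upper())
        if raw is not None:
            layer[key] = type(default)(raw)
    return layer


def resolve(cli: dict, environ: dict[str, str], yaml_data: dict) -> ChainMap:
    """Earlier mappings win: CLI > environment > YAML > defaults."""
    cli_layer = {k: v for k, v in cli.items() if v is not None}  # drop unset flags
    return ChainMap(cli_layer, env_layer(environ), yaml_data, DEFAULTS)


settings = resolve(
    cli={"confidence_threshold": 0.60, "detection_mode": None},
    environ={"PII_GUARD_DETECTION_MODE": "strict", "PII_GUARD_CONFIDENCE_THRESHOLD": "0.9"},
    yaml_data={"detection_mode": "minimal"},
)
print(settings["confidence_threshold"])  # 0.6 (CLI beats environment)
print(settings["detection_mode"])        # strict (environment beats YAML)
```

Filtering out None values in the CLI layer matters: an unset flag must fall through to the next layer instead of overriding it with "nothing".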

config.py — Config loading and CLI merge
from __future__ import annotations
import argparse
import os
from dataclasses import dataclass, field
from pathlib import Path
from typing import Literal

import yaml

# ── Config dataclass ──────────────────────────────────────────────────────────

@dataclass
class PiiGuardConfig:
    detection_mode: Literal["strict", "standard", "minimal"] = "standard"
    confidence_threshold: float = 0.75
    exclusion_patterns: list[str] = field(default_factory=list)
    exclusion_strings: list[str] = field(default_factory=list)
    redaction_strategy: Literal["mask", "replace", "pseudonymize", "fpe"] = "mask"
    output_mode: Literal["json", "text", "sarif"] = "text"
    audit_level: Literal["full", "summary", "off"] = "full"
    max_workers: int = 4
    model: str = "claude-sonnet-4-5"
    # Secrets: always from environment, never from YAML
    anthropic_api_key: str = field(
        default_factory=lambda: os.environ.get("ANTHROPIC_API_KEY", "")
    )

# ── YAML loader ───────────────────────────────────────────────────────────────

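def _load_yaml(path: Path) -> dict:
    """Return the parsed YAML mapping, or {} if the file is missing or empty."""
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as f:
        data = yaml.safe_load(f) or {}
    if not isinstance(data, dict):
        raise ValueError(f"{path}: top-level YAML value must be a mapping")
    return data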

# ── Main config builder ───────────────────────────────────────────────────────

def build_config(cli_args: argparse.Namespace, config_path: Path | None = None) -> PiiGuardConfig:
    """
    Merge config sources in precedence order:
    CLI flags > YAML file > dataclass defaults.
    The environment supplies only the API key (see PiiGuardConfig).
    """
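    # Start with YAML. A missing default file is fine (defaults apply),
    # but a path the caller named explicitly must exist.
    explicit_path = config_path or getattr(cli_args, "config", None)
    config_path = Path(explicit_path or "pii-guard.yaml")
    if explicit_path is not None and not config_path.exists():
        raise FileNotFoundError(f"Config file not found: {config_path}")
    yaml_data = _load_yaml(config_path)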

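    # Build config from YAML values, falling back to the defaults
    cfg = PiiGuardConfig(
        detection_mode=yaml_data.get("detection_mode", "standard"),
        confidence_threshold=yaml_data.get("confidence_threshold", 0.75),
        # `or []` handles a key that is present but empty (parsed as None);
        # list() copies so later CLI extends don't mutate the parsed YAML.
        exclusion_patterns=list(yaml_data.get("exclusion_patterns") or []),
        exclusion_strings=list(yaml_data.get("exclusion_strings") or []),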
        redaction_strategy=yaml_data.get("redaction_strategy", "mask"),
        output_mode=yaml_data.get("output_mode", "text"),
        audit_level=yaml_data.get("audit_level", "full"),
        max_workers=yaml_data.get("max_workers", 4),
        model=yaml_data.get("model", "claude-sonnet-4-5"),
    )

    # Apply CLI overrides (only if explicitly provided)
    if getattr(cli_args, "mode", None):
        cfg.detection_mode = cli_args.mode
    if getattr(cli_args, "threshold", None) is not None:
        cfg.confidence_threshold = cli_args.threshold
    if getattr(cli_args, "exclude", None):
        cfg.exclusion_patterns.extend(cli_args.exclude)
    if getattr(cli_args, "strategy", None):
        cfg.redaction_strategy = cli_args.strategy
    if getattr(cli_args, "output", None):
        cfg.output_mode = cli_args.output

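    # Validate. argparse checks CLI choices, but YAML values arrive unchecked.
    allowed = {
        "detection_mode": ("strict", "standard", "minimal"),
        "redaction_strategy": ("mask", "replace", "pseudonymize", "fpe"),
        "output_mode": ("json", "text", "sarif"),
        "audit_level": ("full", "summary", "off"),
    }
    for key, choices in allowed.items():
        if getattr(cfg, key) not in choices:
            raise ValueError(f"{key} must be one of {choices}, got {getattr(cfg, key)!r}")
    if not 0.0 <= cfg.confidence_threshold <= 1.0:
        raise ValueError(f"confidence_threshold must be 0.0–1.0, got {cfg.confidence_threshold}")
    if not cfg.anthropic_api_key:
        raise EnvironmentError("ANTHROPIC_API_KEY environment variable is not set")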

    return cfg

# ── CLI argument parser ───────────────────────────────────────────────────────

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="pii-guard",
        description="Detect and redact PII/CUI in documents",
    )
    sub = parser.add_subparsers(dest="command", required=True)

    # pii-guard check 
    check_cmd = sub.add_parser("check", help="Scan documents for PII/CUI")
    check_cmd.add_argument("path", type=Path, help="File or directory to scan")
    check_cmd.add_argument("--config", type=Path, default=None, help="Config file path")
    check_cmd.add_argument("--mode", choices=["strict", "standard", "minimal"])
    check_cmd.add_argument("--threshold", type=float, metavar="0.0-1.0")
    check_cmd.add_argument("--exclude", action="append", metavar="GLOB", default=[])
    check_cmd.add_argument("--output", choices=["json", "text", "sarif"])
    check_cmd.add_argument("--fail-on-detect", action="store_true")
    check_cmd.add_argument("--dry-run", action="store_true")

    # pii-guard redact 
    redact_cmd = sub.add_parser("redact", help="Redact PII/CUI from documents")
    redact_cmd.add_argument("path", type=Path)
    redact_cmd.add_argument("--config", type=Path, default=None)
    redact_cmd.add_argument("--strategy", choices=["mask", "replace", "pseudonymize", "fpe"])
    redact_cmd.add_argument("--output-dir", type=Path, required=True)

    return parser
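
The range check on the threshold can also happen at parse time, so a bad --threshold value is rejected with a normal argparse usage error before any config is loaded. The sketch below uses argparse's documented type= hook and ArgumentTypeError; the converter name unit_float is hypothetical and the parser here is a stripped-down stand-in for the one above.

```python
import argparse


def unit_float(raw: str) -> float:
    """argparse `type=` converter that accepts only numbers in [0.0, 1.0]."""
    try:
        value = float(raw)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{raw!r} is not a number") from None
    # NaN fails this comparison as well, so it is rejected too.
    if not 0.0 <= value <= 1.0:
        raise argparse.ArgumentTypeError(f"{value} is outside the range 0.0-1.0")
    return value


parser = argparse.ArgumentParser(prog="pii-guard")
parser.add_argument("--threshold", type=unit_float, metavar="0.0-1.0")

args = parser.parse_args(["--threshold", "0.9"])
print(args.threshold)  # 0.9
```

The check inside build_config is still needed, because a threshold read from YAML never passes through argparse.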

🚨 Never put API keys in YAML config files. Even if the file is listed in .gitignore, config files end up in backups, deployment tarballs, and log directories. The config loader above reads ANTHROPIC_API_KEY exclusively from the environment. Store and rotate the key through your secret manager (Vault, AWS Secrets Manager, GitHub Actions secrets), and never commit it.
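
A loader could go one step further and refuse any config file that appears to contain a secret. The following is a hedged sketch of such a guard, not part of the loader above: the function name find_secret_like_keys and the key-name pattern are assumptions, and a name-based check like this will miss secrets stored under innocuous keys.

```python
import re

# Hypothetical heuristic: key names that usually indicate a credential.
SECRET_KEY_PATTERN = re.compile(r"(api[_-]?key|secret|token|password)", re.IGNORECASE)


def find_secret_like_keys(data: dict, prefix: str = "") -> list[str]:
    """Return dotted paths of keys whose names look like credentials."""
    hits: list[str] = []
    for key, value in data.items():
        dotted = f"{prefix}{key}"
        if SECRET_KEY_PATTERN.search(str(key)):
            hits.append(dotted)
        if isinstance(value, dict):  # descend into nested mappings
            hits.extend(find_secret_like_keys(value, prefix=dotted + "."))
    return hits


yaml_data = {"model": "claude-sonnet-4-5", "anthropic_api_key": "REDACTED"}
print(find_secret_like_keys(yaml_data))  # ['anthropic_api_key']
```

Raising an error when this list is non-empty turns the "never in YAML" rule into something the tool enforces rather than something operators have to remember.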
