Easy Config via CLI Tool
The operator vs developer distinction
Two different people interact with a PII detection system:
- Developers write detection patterns, implement redaction strategies, and define what entity types the system understands. They work in code and commit changes to version control.
- Operators run the system against specific datasets, tune sensitivity for their risk tolerance, exclude test data and synthetic fixtures, and change output formats to match their downstream pipelines. They should never need to read source code to do any of this.
Conflating these roles produces systems that are either locked down (operators can't tune anything without a developer) or dangerous (developers put policy-level decisions in code where they're hard to review separately). The config layer is the boundary between them.
Easy config philosophy: operators should never need to read source code to
change system behavior. If an operator needs to open a .py file to adjust a
threshold, the interface is wrong — move it to YAML.
The YAML config file
A single pii-guard.yaml file controls all tunable behavior. The design principle
is that the file should be self-documenting — comments explain each key and its valid values.
# pii-guard.yaml
# All keys are optional — omit to use the default shown in the comment.
# detection_mode controls which entity types are checked.
# Options: "strict" (all types) | "standard" (PII only) | "minimal" (high-value PII only)
detection_mode: standard
# confidence_threshold: minimum model confidence to treat a candidate as confirmed PII.
# Range: 0.0–1.0. Lower = more aggressive. Default: 0.75
confidence_threshold: 0.75
# exclusion_patterns: glob patterns relative to the scan root.
# Files matching any pattern are skipped entirely.
exclusion_patterns:
- "tests/**"
- "fixtures/**"
- "**/*.min.js"
- "docs/schema-examples/**"
# exclusion_strings: literal strings that, when present in a span's surrounding
# context (±50 chars), suppress detection. Use for known-safe templates.
exclusion_strings:
- "example.invalid"
- "test@"
- "000-00-0000"
- "555-"
# redaction_strategy: how to handle confirmed PII in the output.
# Options: "mask" | "replace" | "pseudonymize" | "fpe"
redaction_strategy: mask
# output_mode: format for the results report.
# Options: "json" | "text" | "sarif" (SARIF for IDE/CI integration)
output_mode: text
# audit_level: controls how much is written to the audit log.
# Options: "full" | "summary" | "off"
audit_level: full
# max_workers: parallel document processing threads. Default: 4
max_workers: 4
# model: which LLM to use for Pass 2 confirmation.
# API key must be set via environment variable, never here.
model: claude-sonnet-4-5
All config keys at a glance
| Key | Type | Default | Controls |
|---|---|---|---|
| detection_mode | string | standard | Which entity type categories are included in the sweep |
| confidence_threshold | float | 0.75 | Minimum model confidence to confirm a candidate as PII |
| exclusion_patterns | list[str] | [] | Glob patterns for files/directories to skip |
| exclusion_strings | list[str] | [] | Context strings that suppress detection when present near a span |
| redaction_strategy | string | mask | How confirmed PII is transformed in the output |
| output_mode | string | text | Format of the scan results report |
| audit_level | string | full | Verbosity of the append-only audit log |
| max_workers | int | 4 | Parallel worker threads for document processing |
| model | string | claude-sonnet-4-5 | Which LLM is used for Pass 2 confirmation calls |
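The two exclusion rows in the table can be sketched as follows. This is a minimal sketch, not pii-guard's actual matcher: the function names are hypothetical, glob matching is assumed to follow Python's fnmatch rules (where * also crosses directory separators), and the context window is assumed to cover the span itself plus 50 characters on each side.

```python
from __future__ import annotations

from fnmatch import fnmatchcase


def is_path_excluded(rel_path: str, patterns: list[str]) -> bool:
    """True if a path relative to the scan root matches any exclusion glob."""
    # Assumption: fnmatch semantics, so "*" (and therefore "**") also matches "/".
    return any(fnmatchcase(rel_path, pattern) for pattern in patterns)


def is_span_suppressed(
    text: str,
    start: int,
    end: int,
    exclusion_strings: list[str],
    window: int = 50,
) -> bool:
    """True if any exclusion string occurs in the span or within `window` chars of it."""
    lo = max(0, start - window)          # clamp at the start of the document
    hi = min(len(text), end + window)    # clamp at the end of the document
    context = text[lo:hi]
    return any(s in context for s in exclusion_strings)


if __name__ == "__main__":
    doc = "Reach the demo account at test@example.invalid for a walkthrough."
    start = doc.index("test@")
    end = start + len("test@example.invalid")
    print(is_path_excluded("tests/unit/test_scan.py", ["tests/**"]))
    print(is_span_suppressed(doc, start, end, ["example.invalid"]))
```

Under these fnmatch rules a root-level file such as app.min.js would not match "**/*.min.js", because the pattern requires at least one slash. A gitignore-style matcher treats that case differently, so the real tool's behavior should be checked rather than inferred from this sketch.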
CLI usage examples
CLI flags mirror the YAML keys and always take precedence, so CI/CD pipelines can override thresholds without editing the committed config file. One nuance: in the loader shown later, --exclude adds to exclusion_patterns instead of replacing the list.
# Scan a directory using the default config file (looks for pii-guard.yaml in cwd)
pii-guard check ./docs
# Strict mode: check all entity types including CUI categories
pii-guard check ./docs --mode strict
# Raise threshold for a low-risk code review scan
pii-guard check ./src --threshold 0.90
# Exclude test directories and use JSON output for downstream parsing
pii-guard check ./data --exclude "tests/**" --exclude "*.fixture.json" --output json
# CI/CD pipeline: strict, JSON output, fail on any detection (exit code 1)
pii-guard check ./reports --mode strict --output json --fail-on-detect
# Override config file location
pii-guard check ./docs --config /etc/pii-guard/prod.yaml
# Dry run: detect but don't write redacted output or audit log
pii-guard check ./inbox --dry-run
# Scan and redact (writes redacted copies to ./redacted/)
pii-guard redact ./inbox --strategy pseudonymize --output-dir ./redacted
Python: loading and merging config with CLI overrides
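The config loader follows a strict precedence order: CLI flags beat
environment variables, which beat the YAML config, which beats the
built-in defaults. This means the same config file works in development
(with default thresholds) and in a strict CI pipeline (with --threshold 0.60,
since a lower threshold confirms more candidates). In the loader listed in
this section, the only value read from the environment is ANTHROPIC_API_KEY;
every other key comes from CLI flags, the YAML file, or the defaults.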
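Taken on its own, the four-layer precedence chain can be sketched with collections.ChainMap, separately from the loader listing in this section. The PII_GUARD_ prefix, the env_layer and resolve helpers, and the trimmed DEFAULTS dict are all hypothetical and only illustrate how an environment layer for tunable keys might slot in.

```python
from __future__ import annotations

from collections import ChainMap
from typing import Any, Mapping

# Trimmed defaults for illustration; the real config has more keys.
DEFAULTS: dict[str, Any] = {
    "detection_mode": "standard",
    "confidence_threshold": 0.75,
    "output_mode": "text",
}

ENV_PREFIX = "PII_GUARD_"  # hypothetical naming convention


def env_layer(environ: Mapping[str, str]) -> dict[str, Any]:
    """Collect PII_GUARD_<KEY> variables, coerced to the type of the default."""
    layer: dict[str, Any] = {}
    for key, default in DEFAULTS.items():
        raw = environ.get(ENV_PREFIX + key.upper())
        if raw is not None:
            layer[key] = type(default)(raw)  # e.g. "0.9" -> 0.9 for float keys
    return layer


def resolve(
    cli: Mapping[str, Any],
    environ: Mapping[str, str],
    yaml_data: Mapping[str, Any],
) -> dict[str, Any]:
    """Merge layers so that CLI > environment > YAML > defaults."""
    # Flags the user did not pass arrive as None and must not shadow lower layers.
    cli_set = {k: v for k, v in cli.items() if v is not None}
    # ChainMap searches its maps left to right, so the first hit wins.
    return dict(ChainMap(cli_set, env_layer(environ), dict(yaml_data), DEFAULTS))
```

In real use the second argument would be os.environ. Passing it in explicitly keeps the function easy to test without patching the process environment.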
from __future__ import annotations
import argparse
import os
from dataclasses import dataclass, field
from pathlib import Path
from typing import Literal
import yaml
# ── Config dataclass ──────────────────────────────────────────────────────────
@dataclass
class PiiGuardConfig:
    detection_mode: Literal["strict", "standard", "minimal"] = "standard"
    confidence_threshold: float = 0.75
    exclusion_patterns: list[str] = field(default_factory=list)
    exclusion_strings: list[str] = field(default_factory=list)
    redaction_strategy: Literal["mask", "replace", "pseudonymize", "fpe"] = "mask"
    output_mode: Literal["json", "text", "sarif"] = "text"
    audit_level: Literal["full", "summary", "off"] = "full"
    max_workers: int = 4
    model: str = "claude-sonnet-4-5"
    # Secrets: always from environment, never from YAML
    anthropic_api_key: str = field(
        default_factory=lambda: os.environ.get("ANTHROPIC_API_KEY", "")
    )
# ── YAML loader ───────────────────────────────────────────────────────────────
def _load_yaml(path: Path) -> dict:
    if not path.exists():
        return {}
    with path.open() as f:
        return yaml.safe_load(f) or {}
# ── Main config builder ───────────────────────────────────────────────────────
def build_config(cli_args: argparse.Namespace, config_path: Path | None = None) -> PiiGuardConfig:
    """
    Merge config sources in precedence order:
    CLI flags > environment variables > YAML file > dataclass defaults

    In this loader the only environment-sourced value is ANTHROPIC_API_KEY.
    """
    # Start with YAML (or empty dict if file not found)
    if config_path is None:
        config_path = Path(getattr(cli_args, "config", None) or "pii-guard.yaml")
    yaml_data = _load_yaml(config_path)
    # Build config from YAML values, falling back to the defaults
    cfg = PiiGuardConfig(
        detection_mode=yaml_data.get("detection_mode", "standard"),
        confidence_threshold=yaml_data.get("confidence_threshold", 0.75),
        # Copy the lists so CLI overrides never mutate the parsed YAML;
        # "or []" also tolerates a key that is present but empty (null).
        exclusion_patterns=list(yaml_data.get("exclusion_patterns") or []),
        exclusion_strings=list(yaml_data.get("exclusion_strings") or []),
        redaction_strategy=yaml_data.get("redaction_strategy", "mask"),
        output_mode=yaml_data.get("output_mode", "text"),
        audit_level=yaml_data.get("audit_level", "full"),
        max_workers=yaml_data.get("max_workers", 4),
        model=yaml_data.get("model", "claude-sonnet-4-5"),
    )
    # Apply CLI overrides (only if explicitly provided)
    if getattr(cli_args, "mode", None):
        cfg.detection_mode = cli_args.mode
    if getattr(cli_args, "threshold", None) is not None:
        cfg.confidence_threshold = cli_args.threshold
    if getattr(cli_args, "exclude", None):
        # --exclude adds to the YAML patterns rather than replacing them
        cfg.exclusion_patterns.extend(cli_args.exclude)
    if getattr(cli_args, "strategy", None):
        cfg.redaction_strategy = cli_args.strategy
    if getattr(cli_args, "output", None):
        cfg.output_mode = cli_args.output
    # Validate
    if not 0.0 <= cfg.confidence_threshold <= 1.0:
        raise ValueError(f"confidence_threshold must be 0.0–1.0, got {cfg.confidence_threshold}")
    if not cfg.anthropic_api_key:
        raise EnvironmentError("ANTHROPIC_API_KEY environment variable is not set")
    return cfg
# ── CLI argument parser ───────────────────────────────────────────────────────
def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="pii-guard",
        description="Detect and redact PII/CUI in documents",
    )
    sub = parser.add_subparsers(dest="command", required=True)
    # pii-guard check
    check_cmd = sub.add_parser("check", help="Scan documents for PII/CUI")
    check_cmd.add_argument("path", type=Path, help="File or directory to scan")
    check_cmd.add_argument("--config", type=Path, default=None, help="Config file path")
    check_cmd.add_argument("--mode", choices=["strict", "standard", "minimal"])
    check_cmd.add_argument("--threshold", type=float, metavar="0.0-1.0")
    check_cmd.add_argument("--exclude", action="append", metavar="GLOB", default=[])
    check_cmd.add_argument("--output", choices=["json", "text", "sarif"])
    check_cmd.add_argument("--fail-on-detect", action="store_true")
    check_cmd.add_argument("--dry-run", action="store_true")
    # pii-guard redact
    redact_cmd = sub.add_parser("redact", help="Redact PII/CUI from documents")
    redact_cmd.add_argument("path", type=Path)
    redact_cmd.add_argument("--config", type=Path, default=None)
    redact_cmd.add_argument("--strategy", choices=["mask", "replace", "pseudonymize", "fpe"])
    redact_cmd.add_argument("--output-dir", type=Path, required=True)
    return parser

Never put API keys in YAML config files. Even if the file is in
.gitignore, config files end up in backups, deployment tarballs, and log
directories. The config loader above reads ANTHROPIC_API_KEY exclusively
from the environment. Store and rotate it through your secret manager (Vault,
AWS Secrets Manager, GitHub Actions secrets), and never commit it.
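One way to back this rule with code is to have the loader refuse any parsed YAML that contains secret-looking keys. The following is a minimal sketch under stated assumptions: find_secret_like_keys and its name pattern are illustrative and not part of pii-guard, and the check inspects key names only, not values.

```python
from __future__ import annotations

import re
from typing import Any

# Hypothetical heuristic: key names that suggest a credential.
SECRET_KEY_PATTERN = re.compile(r"api[_-]?key|secret|token|password", re.IGNORECASE)


def find_secret_like_keys(data: dict[str, Any], prefix: str = "") -> list[str]:
    """Return dotted paths of keys whose names look like credentials."""
    hits: list[str] = []
    for key, value in data.items():
        dotted = f"{prefix}{key}"
        if SECRET_KEY_PATTERN.search(str(key)):
            hits.append(dotted)
        if isinstance(value, dict):
            # Recurse so nested sections are checked as well.
            hits.extend(find_secret_like_keys(value, prefix=f"{dotted}."))
    return hits


def assert_no_secrets(data: dict[str, Any]) -> None:
    """Raise if the parsed YAML appears to carry credentials."""
    hits = find_secret_like_keys(data)
    if hits:
        raise ValueError(
            "Secret-like keys found in config file: "
            + ", ".join(hits)
            + ". Set them via environment variables instead."
        )
```

Called on the dict returned by a YAML loader, a check like this turns the convention into a hard failure at startup rather than relying on review.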