Runs morie_siu_anomaly_check() on each of the supplied case_numbers and aggregates the results per field. Output is a data frame with one row per SIU column, sorted by ascending agreement rate between the LLM auditor and the C++ parser. The rows at the top (lowest agreement) are the parser fields that most deserve regex / extraction-logic fixes.

Usage

morie_siu_audit_columns(
  case_numbers,
  model = c("ollama", "gemini"),
  cache_dir = file.path(tempdir(), "morie", "siu"),
  max_html_chars = 80000L,
  max_examples_per_field = 5L,
  progress = TRUE
)

Arguments

case_numbers

Character vector of SIU case numbers to audit.

model

One of "ollama" (default; free, runs locally, zero-config when an Ollama daemon is on localhost:11434), "gemini" (paid), or "claude" (paid). A character vector enables fail-over: the first model whose call succeeds wins. The default c("ollama", "gemini") tries the local free model first and only escalates to paid Gemini if Ollama isn't installed or fails – so morie costs $0 to use as long as you have a free Gemma / Qwen / Llama running locally (e.g. ollama pull gemma3:4b).
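The fail-over behaviour described above can be pictured with a minimal sketch. This is not morie's internal code: first_success() and fake_backend() are hypothetical names invented for illustration, and the sketch assumes a backend signals failure by raising an R error.

```r
# Hypothetical helper: try each backend in order and return the first success.
# `call_backend` stands in for whatever function actually talks to a model.
first_success <- function(models, call_backend) {
  for (m in models) {
    # A failing backend raises an error; treat that as "try the next one".
    result <- tryCatch(call_backend(m), error = function(e) NULL)
    if (!is.null(result)) {
      return(list(model = m, result = result))
    }
  }
  stop("all models failed: ", paste(models, collapse = ", "))
}

# Toy backend: pretend the local Ollama daemon is unreachable.
fake_backend <- function(model) {
  if (model == "ollama") stop("connection refused")
  paste("verdict from", model)
}

out <- first_success(c("ollama", "gemini"), fake_backend)
out$model
```

With this ordering the paid backend is only reached when the local one errors, which is the cost argument made for the default.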

cache_dir

Directory holding the harvester's SIU.csv and the optional html/ subdirectory.

max_html_chars

Soft cap on the HTML payload sent to the model (default 80,000 – larger than any real SIU report, small enough to stay under typical context budgets).

max_examples_per_field

Maximum disagreement examples retained per field (default 5).

progress

Logical; print a per-case progress line.

Value

A data frame with columns field, n_audited, n_agree, n_disagree, n_unclear, agree_rate. Sorted ascending by agree_rate so the most-broken fields land at the top. The "examples" attribute holds nested data frames of flagged cases per field.
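To make the shape of the return value concrete, here is a small sketch that builds a table with the same columns from toy per-case verdicts. The verdict labels, the field names, and the definition agree_rate = n_agree / n_audited are assumptions made for illustration; the package may compute the rate differently (for example by excluding unclear verdicts).

```r
# Toy per-case verdicts; field names and verdict labels are illustrative.
verdicts <- data.frame(
  field   = c("date", "date", "date", "officer", "officer", "officer"),
  verdict = c("agree", "agree", "unclear", "agree", "disagree", "disagree"),
  stringsAsFactors = FALSE
)

# Count each verdict type for one field.
summarise_field <- function(v) {
  data.frame(
    n_audited  = length(v),
    n_agree    = sum(v == "agree"),
    n_disagree = sum(v == "disagree"),
    n_unclear  = sum(v == "unclear")
  )
}

by_field   <- split(verdicts$verdict, verdicts$field)
summary_df <- do.call(rbind, lapply(by_field, summarise_field))
summary_df <- cbind(field = names(by_field), summary_df,
                    stringsAsFactors = FALSE)

# Assumed definition of the agreement rate.
summary_df$agree_rate <- summary_df$n_agree / summary_df$n_audited

# Ascending sort: the least-agreed-upon field comes first.
summary_df <- summary_df[order(summary_df$agree_rate), ]
rownames(summary_df) <- NULL
summary_df
```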

Details

Examples of LLM-flagged disagreements are attached as the "examples" attribute of the returned data frame (one nested data frame per field), with at most max_examples_per_field cases each. Each example carries the case_number, the parser_value, and the LLM's one-sentence reason – enough for a maintainer to pop the cached HTML for that case, see who's right, and decide whether to refine the regex pattern for that field.
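As a rough sketch of how such an attribute could be assembled and read back, the snippet below caps a toy set of flagged cases per field and attaches the result to a data frame. All case numbers and values are made up, and the column name reason is an assumption; only case_number and parser_value are named in the description above.

```r
# Toy disagreement records (made-up values, assumed `reason` column name).
flagged <- data.frame(
  field        = c("officer", "officer", "officer", "date"),
  case_number  = c("A-1", "A-2", "A-3", "A-4"),
  parser_value = c("", "Smith", "N/A", "2020-13-01"),
  reason       = c("Name is present in the report.",
                   "Report names a different officer.",
                   "Report lists an officer.",
                   "Month is out of range."),
  stringsAsFactors = FALSE
)

max_examples_per_field <- 2L

# One nested data frame per field, truncated to the cap.
examples <- lapply(
  split(flagged[, c("case_number", "parser_value", "reason")], flagged$field),
  function(d) utils::head(d, max_examples_per_field)
)

audit <- data.frame(field = c("officer", "date"), stringsAsFactors = FALSE)
attr(audit, "examples") <- examples

# Look up the retained examples for one field by name.
attr(audit, "examples")[["officer"]]
```

Indexing the attribute by field name is the same access pattern used in the Examples section.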

Designed for cheap local audit: with model = "ollama" pointed at a local Gemma / Qwen / DeepSeek instance, auditing 50-100 cases costs zero API spend and finishes in a few minutes. With model = c("gemini", "ollama") the chain uses paid Gemini first and silently falls back to the local model on quota / network errors.

Examples

if (FALSE) { # \dontrun{
Sys.setenv(
  OLLAMA_HOST = "http://localhost:11434",
  OLLAMA_MODEL = "gemma3:4b"
)
csv <- morie_fetch_siu(cache_html = TRUE)
df <- utils::read.csv(csv, colClasses = "character")
# Draw 50 non-empty case numbers (named `cases` to avoid masking base::sample)
cases <- sample(df$case_number[nzchar(df$case_number)], 50L)
audit <- morie_siu_audit_columns(cases, model = "ollama")
# Worst 8 fields, ripe for parser fixes:
head(audit, 8)
# See concrete disagreements for the worst field:
attr(audit, "examples")[[audit$field[1L]]]
} # }