Per-column accuracy audit: estimate every SIU column's correctness
Source: R/siu.R
morie_siu_audit_columns.Rd
Runs morie_siu_anomaly_check() on a vector of case_numbers
and aggregates the verdicts per field across them. The output is a data
frame with one row per SIU column, ordered by how often the LLM auditor
agreed with the C++ parser, lowest agreement first. The top rows are the
parser fields that most deserve regex / extraction-logic fixes.
Arguments
- case_numbers
Character vector of SIU case numbers to audit.
- model
One of "ollama" (default; free, runs locally, zero-config when an Ollama daemon is on localhost:11434), "gemini" (paid), or "claude" (paid). A character vector enables fail-over: the first model whose call succeeds wins. The default c("ollama", "gemini") tries the local free model first and only escalates to paid Gemini if Ollama isn't installed or fails, so morie costs $0 to use as long as you have a free Gemma / Qwen / Llama running locally (e.g. ollama pull gemma3:4b).
- cache_dir
Directory holding the harvester's SIU.csv and the optional html/ subdirectory.
- max_html_chars
Soft cap on the HTML payload sent to the model (default 80,000 – larger than any real SIU report, small enough to stay under typical context budgets).
- max_examples_per_field
Maximum disagreement examples retained per field (default 5).
- progress
Logical; print a per-case progress line.
Value
A data frame with columns field, n_audited,
n_agree, n_disagree, n_unclear,
agree_rate. Sorted ascending by agree_rate so the
most-broken fields land at the top. The "examples"
attribute holds nested data frames of flagged cases per field.
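The shape of this table can be illustrated with a minimal sketch of the aggregation step. It is not the package's implementation: the `verdicts` data frame, its column names, the verdict labels, and the helper `summarise_verdicts()` are all hypothetical, and it assumes agree_rate is n_agree divided by n_audited, which the documentation does not state.

```r
# Mock per-case verdicts. In practice these would come from
# morie_siu_anomaly_check(); the column names and labels are assumptions.
verdicts <- data.frame(
  case_number = c("case-1", "case-1", "case-2", "case-2", "case-3", "case-3"),
  field       = c("date", "officer", "date", "officer", "date", "officer"),
  verdict     = c("agree", "disagree", "agree", "unclear", "agree", "disagree"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row per field, worst agreement first.
summarise_verdicts <- function(v) {
  per_field <- lapply(split(v, v$field), function(d) {
    n_agree <- sum(d$verdict == "agree")
    data.frame(
      field      = d$field[1L],
      n_audited  = nrow(d),
      n_agree    = n_agree,
      n_disagree = sum(d$verdict == "disagree"),
      n_unclear  = sum(d$verdict == "unclear"),
      agree_rate = n_agree / nrow(d),  # assumed definition
      stringsAsFactors = FALSE
    )
  })
  out <- do.call(rbind, per_field)
  out <- out[order(out$agree_rate), , drop = FALSE]  # ascending
  rownames(out) <- NULL
  out
}

audit_sketch <- summarise_verdicts(verdicts)
audit_sketch
```

In this mock data the officer field never agrees, so it sorts to the first row, mirroring how the most-broken fields land at the top of the real output.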
Details
Examples of LLM-flagged disagreements are attached as the
"examples" attribute of the returned data frame (one
nested data frame per field), with at most
max_examples_per_field cases each. Each example carries
the case_number, the parser_value, and the LLM's one-sentence
reason, which is enough for a maintainer to open the cached HTML for
that case, see who's right, and decide whether to refine the
regex pattern for that field.
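For a review pass it can be convenient to flatten the per-field examples into a single table. The sketch below runs on a mock object rather than real output: the `reason` column name, the mock values, and the helper `flatten_examples()` are assumptions for illustration, and it assumes the "examples" attribute is a list named by field.

```r
# Mock stand-in for the return value of morie_siu_audit_columns().
# All values are invented; the `reason` column name is an assumption.
audit <- data.frame(
  field      = c("officer", "date"),
  agree_rate = c(0.4, 0.9),
  stringsAsFactors = FALSE
)
attr(audit, "examples") <- list(
  officer = data.frame(
    case_number  = c("case-1", "case-2"),
    parser_value = c("", "2"),
    reason       = c("Report names one officer.", "Report lists three officers."),
    stringsAsFactors = FALSE
  ),
  date = data.frame(
    case_number  = "case-3",
    parser_value = "2021-13-01",
    reason       = "Month is out of range.",
    stringsAsFactors = FALSE
  )
)

# Hypothetical helper: stack the nested data frames, tagging each row
# with the field it belongs to.
flatten_examples <- function(audit) {
  ex <- attr(audit, "examples")
  rows <- lapply(names(ex), function(f) {
    if (nrow(ex[[f]]) == 0L) return(NULL)  # skip fields with no examples
    data.frame(field = f, ex[[f]], stringsAsFactors = FALSE)
  })
  out <- do.call(rbind, rows)
  rownames(out) <- NULL
  out
}

review <- flatten_examples(audit)
review
```

The flattened table can then be written out with utils::write.csv() and worked through case by case alongside the cached HTML.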
Designed for cheap local audit: with model = "ollama"
pointed at a local Gemma / Qwen / DeepSeek instance, auditing
50-100 cases costs zero API spend and finishes in a few
minutes. With model = c("gemini", "ollama") the chain
uses paid Gemini first and silently falls back to the local
model on quota / network errors.
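The first-success fail-over described above can be sketched generically as follows. This is an illustration of the pattern, not morie's code: `call_with_failover()` and the two stub backends are hypothetical, and the stubs simulate a quota error rather than calling any real API.

```r
# Try each backend in order; return the first result that does not error.
# `backends` is a named list of functions taking a prompt (hypothetical).
call_with_failover <- function(backends, prompt) {
  errors <- character(0)
  for (name in names(backends)) {
    result <- tryCatch(
      list(ok = TRUE, value = backends[[name]](prompt)),
      error = function(e) list(ok = FALSE, message = conditionMessage(e))
    )
    if (result$ok) {
      return(list(model = name, value = result$value))
    }
    errors[name] <- result$message  # remember why this backend failed
  }
  stop("all backends failed: ",
       paste(names(errors), errors, sep = ": ", collapse = "; "))
}

# Stub backends: the paid one simulates a quota error, the local one answers.
backends <- list(
  gemini = function(prompt) stop("quota exceeded"),
  ollama = function(prompt) paste("local answer to:", prompt)
)

res <- call_with_failover(backends, "check field")
res$model
```

Because the order of the list decides which backend is tried first, swapping the two entries gives the local-first behaviour of the default chain.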
Examples
if (FALSE) { # \dontrun{
  Sys.setenv(
    OLLAMA_HOST = "http://localhost:11434",
    OLLAMA_MODEL = "gemma3:4b"
  )
  csv <- morie_fetch_siu(cache_html = TRUE)
  df <- utils::read.csv(csv, colClasses = "character")
  # Draw up to 50 non-empty case numbers; naming the result `cases`
  # avoids shadowing base::sample().
  ids <- df$case_number[nzchar(df$case_number)]
  cases <- sample(ids, min(50L, length(ids)))
  audit <- morie_siu_audit_columns(cases, model = "ollama")
  # Worst 8 fields, ripe for parser fixes:
  head(audit, 8)
  # See concrete disagreements for the worst field:
  attr(audit, "examples")[[audit$field[1L]]]
} # }