Row-level sanity check on a parsed SIU table (regex-only, no LLM)
Source: R/siu.R
morie_siu_sanity_check.Rd
For every row in a parser-emitted SIU table, flag cells that
don't match the expected format for their column – case_number
that doesn't look like an SIU case id, date_*_iso that isn't a
valid ISO 8601 date, number_of_* that isn't a positive integer,
charges_recommended that isn't "Yes" / "No", etc. Returns a
data frame ranked by issue count so the most-broken rows surface
at the top for manual inspection against the cached HTML.
Value
A data frame with one row per source row, columns:
case_number, drid, issues_count (integer
number of suspicious cells), issues (semicolon-separated
string of field:reason pairs). Ordered descending by
issues_count.
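As a rough illustration of working with the issues column, the sketch below tallies which fields are flagged most often. The sanity data frame here is a hand-built stand-in, not real output; the field names, the reason labels, and the assumption that pairs are separated by a semicolon (with optional surrounding whitespace) are all hypothetical.

```r
# Hand-built stand-in for the data frame returned by morie_siu_sanity_check().
# Field names and reason labels below are hypothetical.
sanity <- data.frame(
  case_number  = c("row-A", "row-B", "row-C"),   # placeholder ids
  issues_count = c(2L, 1L, 0L),
  issues       = c(
    "date_incident_iso:not_iso_date; charges_recommended:not_yes_no",
    "date_incident_iso:not_iso_date",
    ""
  ),
  stringsAsFactors = FALSE
)

# Split every issues string into its field:reason pairs.
pairs <- unlist(strsplit(sanity$issues, ";", fixed = TRUE))
pairs <- trimws(pairs)
pairs <- pairs[nzchar(pairs)]          # drop empties from clean rows

# Keep only the field name (everything before the first colon).
fields <- sub(":.*$", "", pairs)

# Count how often each field was flagged, most frequent first.
field_counts <- sort(table(fields), decreasing = TRUE)
print(field_counts)
```

A field that dominates this tally usually points at a systematic parser problem rather than a handful of odd source rows.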
Details
Designed to be a fast first-pass quality filter – it runs in
milliseconds and needs no network, no LLM, and no API key. It
does not try to verify correctness against the underlying report
(that's what morie_siu_audit_columns() is for); it only checks
that each value matches the expected format for its field. A
clean sanity check is therefore necessary but not sufficient for
correctness.
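To make the idea of a format-only check concrete, here is a minimal sketch in base R. It is not the code in R/siu.R: the helper names, the column names date_incident_iso and number_of_officers (hypothetical instances of date_*_iso and number_of_*), the reason labels, and the "; " separator are all assumptions, and the case_number pattern is left out because the expected id format is not stated here.

```r
# Hypothetical format predicates; the real rules in R/siu.R may differ.
is_iso_date <- function(x) {
  # Check the YYYY-MM-DD shape, then let as.Date() reject impossible dates.
  shaped <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  parsed <- as.Date(ifelse(shaped, x, NA_character_), format = "%Y-%m-%d")
  shaped & !is.na(parsed)
}
is_positive_int <- function(x) grepl("^[1-9][0-9]*$", x)
is_yes_no       <- function(x) x %in% c("Yes", "No")

# Toy parsed table with placeholder ids; row-B is broken on purpose.
rows <- data.frame(
  case_number         = c("row-A", "row-B"),
  date_incident_iso   = c("2023-01-15", "2023-02-30"),
  number_of_officers  = c("2", "0"),
  charges_recommended = c("No", "Maybe"),
  stringsAsFactors = FALSE
)

# One logical column per checked field: TRUE means the cell looks suspicious.
bad <- cbind(
  date_incident_iso   = !is_iso_date(rows$date_incident_iso),
  number_of_officers  = !is_positive_int(rows$number_of_officers),
  charges_recommended = !is_yes_no(rows$charges_recommended)
)
labels <- paste0(
  colnames(bad), ":",
  c("not_iso_date", "not_positive_int", "not_yes_no")
)

# Collapse each row's failures into one field:reason string.
out <- data.frame(
  case_number  = rows$case_number,
  issues_count = as.integer(rowSums(bad)),
  issues       = apply(bad, 1, function(b) paste(labels[b], collapse = "; ")),
  stringsAsFactors = FALSE
)

# Most-broken rows first.
out <- out[order(-out$issues_count), ]
print(out)
```

Note that "2023-02-30" passes the regex but fails as.Date(), which is why the date check combines both; none of these predicates says anything about whether a well-formed value is the right one.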
Examples
if (FALSE) { # \dontrun{
csv <- morie_fetch_siu(cache_dir = tempdir(), cache_html = TRUE)
sanity <- morie_siu_sanity_check(csv)
head(sanity, 10) # worst 10 rows -- inspect against HTML
table(sanity$issues_count)
} # }