
For every row in a parser-emitted SIU table, flags cells that don't match the expected format for their column: a case_number that doesn't look like an SIU case id, a date_*_iso that isn't a valid ISO 8601 date, a number_of_* that isn't a positive integer, a charges_recommended that isn't "Yes" / "No", etc. Returns a data frame ranked by issue count, so the most-broken rows surface at the top for manual inspection against the cached HTML.

Usage

morie_siu_sanity_check(df)

Arguments

df

A data frame in the morie SIU 64-column schema, or a path to such a CSV.

Value

A data frame with one row per source row, columns: case_number, drid, issues_count (integer number of suspicious cells), issues (semicolon-separated string of field:reason pairs). Ordered descending by issues_count.
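Because issues packs several field:reason pairs into one string, it can help to unpack it when looking for columns that fail systematically. The following is a minimal sketch on a hand-built stand-in for the return value. The case numbers, field names, and reason labels are hypothetical, drid is omitted for brevity, and the exact spacing around the semicolon is an assumption (handled with trimws()).

```r
# Hypothetical stand-in for the data frame returned by morie_siu_sanity_check().
sanity <- data.frame(
  case_number  = c("X1", "X2", "X3"),
  issues_count = c(2L, 1L, 0L),
  issues       = c("date_a_iso:bad;charges_recommended:bad", "date_a_iso:bad", ""),
  stringsAsFactors = FALSE
)

# Split each issues string into its field:reason pairs.
# An empty string yields zero pairs, so clean rows drop out.
pairs <- strsplit(sanity$issues, ";", fixed = TRUE)

# One row per flagged cell.
long <- data.frame(
  case_number = rep(sanity$case_number, lengths(pairs)),
  pair        = trimws(unlist(pairs)),
  stringsAsFactors = FALSE
)
long$field  <- sub(":.*$", "", long$pair)     # text before the first colon
long$reason <- sub("^[^:]*:", "", long$pair)  # text after the first colon

# Which fields are flagged most often?
field_counts <- sort(table(long$field), decreasing = TRUE)
print(field_counts)
```

A field that dominates this tally more likely points to a parser problem in that column than to isolated bad rows.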

Details

Designed to be a fast first-pass quality filter: it runs in milliseconds and needs no network, no LLM, and no API key. It doesn't try to verify correctness against the underlying report (that's what morie_siu_audit_columns() is for); it only checks that each value matches the expected format for its field. A clean sanity check is necessary but not sufficient for correctness: a well-formed date, for example, can still be the wrong date.
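To illustrate what a format-only check of this kind can look like, here is a rough sketch in base R. It is not the package's implementation: the helper names, the reason labels, the demo column names (date_report_iso, number_of_officers), and the choice to skip missing values are all assumptions, and case_number is left unchecked because its expected pattern isn't stated here.

```r
# Hypothetical checkers; each returns TRUE where the value is well-formed.
is_iso_date <- function(x) {
  # Shape check first, then a calendar check (rejects e.g. 2023-02-30).
  shape_ok <- grepl("^[0-9]{4}-[0-9]{2}-[0-9]{2}$", x)
  candidate <- x
  candidate[!shape_ok] <- NA
  shape_ok & !is.na(as.Date(candidate, format = "%Y-%m-%d"))
}
is_positive_int <- function(x) grepl("^[1-9][0-9]*$", x)
is_yes_no <- function(x) x %in% c("Yes", "No")

sketch_sanity_check <- function(df) {
  issues <- vector("list", nrow(df))  # one character vector of pairs per row
  for (col in names(df)) {
    values <- as.character(df[[col]])
    # Pick a checker by column-name pattern; unmatched columns are skipped.
    if (grepl("^date_.*_iso$", col)) {
      ok <- is_iso_date(values); reason <- "not_iso_date"
    } else if (grepl("^number_of_", col)) {
      ok <- is_positive_int(values); reason <- "not_positive_int"
    } else if (col == "charges_recommended") {
      ok <- is_yes_no(values); reason <- "not_yes_no"
    } else {
      next
    }
    # Assumption: missing values are not flagged.
    for (i in which(!ok & !is.na(values))) {
      issues[[i]] <- c(issues[[i]], paste0(col, ":", reason))
    }
  }
  out <- data.frame(
    case_number  = df$case_number,
    issues_count = lengths(issues),
    issues       = vapply(issues, paste, character(1), collapse = ";"),
    stringsAsFactors = FALSE
  )
  out[order(-out$issues_count), , drop = FALSE]  # worst rows first
}

# Hypothetical three-row input.
demo <- data.frame(
  case_number         = c("A", "B", "C"),
  date_report_iso     = c("2023-01-15", "2023-02-30", "15/01/2023"),
  number_of_officers  = c("2", "0", "3"),
  charges_recommended = c("No", "maybe", "Yes"),
  stringsAsFactors = FALSE
)
result <- sketch_sanity_check(demo)
print(result)
```

In this sketch row B collects three issues, row C one, and row A none, so B sorts to the top. Note that row A would pass even if its date were factually wrong, which is the gap the audit step is meant to cover.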

Examples

if (FALSE) { # \dontrun{
csv <- morie_fetch_siu(cache_dir = tempdir(), cache_html = TRUE)
sanity <- morie_siu_sanity_check(csv)
head(sanity, 10) # worst 10 rows -- inspect against HTML
table(sanity$issues_count)
} # }