For one case_number, line up the parser's value for each mapped field against the same field in a user-supplied external data source and, critically, show the surrounding report HTML so the user can adjudicate any disagreement against the actual source document.

Usage

morie_siu_compare(
  case_number,
  external,
  field_map = NULL,
  external_case_col = "Q1",
  cache_dir = file.path(tempdir(), "morie", "siu")
)

Arguments

case_number

A case number (e.g. "17-OVI-201").

external

A data frame of external answers, OR a path to an .xlsx file (read with readxl). Must contain a column whose values match SIU case numbers (default external_case_col = "Q1").

field_map

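A named list mapping external-column names (the list names) to morie field names (the list values). The default, NULL, uses the default field map described under Details.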

external_case_col

Name of the external column carrying the case-number key.

cache_dir

Directory holding the harvester's SIU.csv and optional cached HTML.
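To make the roles of external and external_case_col concrete, here is a minimal sketch of how the row for one case might be located in an external table. The helper external_row_for_case() is hypothetical and not part of morie; the function's actual matching rules may differ. Only the .xlsx branch needs readxl.

```r
# Hypothetical helper (not part of morie): find the external row for one case.
external_row_for_case <- function(external, case_number,
                                  external_case_col = "Q1") {
  # Accept either a data frame or a path to an .xlsx workbook.
  if (is.character(external)) {
    external <- as.data.frame(readxl::read_excel(external))
  }
  stopifnot(is.data.frame(external))
  if (!external_case_col %in% names(external)) {
    stop("Column '", external_case_col, "' not found in external data.")
  }
  # Compare as trimmed strings so factors and stray whitespace still match;
  # which() drops NA keys instead of returning NA rows.
  keys <- trimws(as.character(external[[external_case_col]]))
  external[which(keys == trimws(case_number)), , drop = FALSE]
}

# Made-up external table; the second key is a placeholder, not a real case.
external <- data.frame(
  case_id  = c("17-OVI-201", "00-XXX-000"),
  officers = c(1L, 2L)
)
row <- external_row_for_case(external, "17-OVI-201",
                             external_case_col = "case_id")
```

If the key column is missing, the sketch stops with an error rather than silently returning no rows, which makes a wrong external_case_col easy to spot.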

Value

A data frame with one row per mapped field: field, parser_value, external_value, agree, and html_excerpt (a 240-character window around the first occurrence of either value in the cleaned report text). When parser and external disagree, the html_excerpt is the tie-breaker.
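The excerpt logic can be pictured with a small sketch. The helper excerpt_window() is hypothetical, and centring the window on the match is an assumption; the source only states that the window is 240 characters wide and surrounds the first occurrence of either value in the cleaned report text.

```r
# Hypothetical sketch (not morie's implementation): a fixed-width window
# around the earliest literal occurrence of any candidate value.
excerpt_window <- function(text, values, width = 240L) {
  values <- as.character(values)
  values <- values[!is.na(values) & nzchar(values)]
  # Position of the first literal match per value; regexpr() gives -1 if absent.
  hits <- vapply(
    values,
    function(v) as.integer(regexpr(v, text, fixed = TRUE)),
    integer(1)
  )
  hits <- hits[hits > 0L]
  if (length(hits) == 0L) {
    return(NA_character_)
  }
  # Assumption: start half a window before the match, clamped to the text start.
  start <- max(1L, min(hits) - width %/% 2L)
  substr(text, start, start + width - 1L)
}

# Synthetic text: 300 "a", the marker, then 300 "b".
text <- paste0(strrep("a", 300), "NEEDLE", strrep("b", 300))
out <- excerpt_window(text, c("NEEDLE", "zzz"))
```

When neither value occurs in the text, the sketch returns NA rather than an arbitrary slice of the report.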

Details

The ground truth is the SIU director's-report HTML itself. The HTML is what the SIU published; the parser's job is to extract structured fields from it faithfully, and any field's correctness is decidable by reading the cached HTML for that case. Any external reference – a hand-coded survey, an independently-scraped CSV, a colleague's analysis – is just another extraction attempt, possibly with its own errors. This function does not endorse any external source; it only displays both side-by-side with the HTML excerpt so you can decide.
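One possible way to work through disagreements is sketched below. The data frame is a hand-built stand-in with the columns listed under Value; its values are made up, and storing parser_value and external_value as character is an assumption.

```r
# Hand-built stand-in for a comparison result (made-up values).
cmp <- data.frame(
  field          = c("police_service", "number_of_officers_involved"),
  parser_value   = c("Example Police Service", "2"),
  external_value = c("Example Police Service", "1"),
  agree          = c(TRUE, FALSE),
  html_excerpt   = c("... Example Police Service ...",
                     "... two subject officers ..."),
  stringsAsFactors = FALSE
)

# Keep only the disputed fields and print each next to its excerpt.
disputed <- cmp[!cmp$agree, ]
for (i in seq_len(nrow(disputed))) {
  cat("Field:   ", disputed$field[i], "\n")
  cat("Parser:  ", disputed$parser_value[i], "\n")
  cat("External:", disputed$external_value[i], "\n")
  cat("Excerpt: ", disputed$html_excerpt[i], "\n\n")
}
```

Reading the excerpt next to both candidate values keeps the decision anchored to the published report rather than to either extraction.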

The default field map covers the common SIU-extraction column layout (Q1 = case_number, Q3 = police_service, Q4 = number_of_officers_involved, ...). Pass a custom field_map for any other external schema.
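As an illustration of the mapping direction, a custom field_map for a made-up external schema could look like the sketch below. The external column names service, officers and case_id are hypothetical; the morie field names are the ones mentioned above. The renaming step only shows what the map expresses and is not necessarily how the function applies it.

```r
# List names are external columns; list values are morie field names.
field_map <- list(
  service  = "police_service",
  officers = "number_of_officers_involved"
)

# Made-up external table using its own column names.
external <- data.frame(
  case_id  = "17-OVI-201",
  service  = "Example Police Service",
  officers = 1L
)

# Select the mapped columns and relabel them with morie field names.
mapped <- external[, names(field_map), drop = FALSE]
names(mapped) <- unlist(field_map, use.names = FALSE)
```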

Examples

if (FALSE) { # \dontrun{
# Caller supplies their own external table; nothing about the
# mapping or the file format is canonical to morie.
external <- data.frame(case_id = "17-OVI-201", officers = 1L)
cmp <- morie_siu_compare(
  "17-OVI-201",
  external = external,
  field_map = list(officers = "number_of_officers_involved"),
  external_case_col = "case_id"
)
subset(cmp, !agree)
} # }