
Fetches and parses the Ontario Special Investigations Unit (police-oversight) corpus – every director's report and the news releases they link – into a single CSV with the canonical 64-column schema, one row per case.

Usage

morie_fetch_siu(
  cache_dir = file.path(tempdir(), "morie", "siu"),
  overwrite = FALSE,
  max_drid = NULL,
  concurrency = 4L,
  rate_rps = 4,
  use_manifest = TRUE,
  lang = c("all", "en", "fr"),
  cache_html = FALSE,
  progress = TRUE
)

Arguments

cache_dir

Output directory. Defaults to a session-scoped subdirectory of tempdir(), which R removes when the session ends. For persistent cross-session caching, pass cache_dir = morie_cache_dir("siu") instead; see morie_cache_dir and morie_cache_clear.

overwrite

Logical; if FALSE and SIU.csv already exists in cache_dir, its path is returned without reparsing.
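
The short-circuit described above can be pictured with a few lines of base R. This is only a sketch of the documented behaviour, not the package's internal code; the helper name existing_csv_or_null is hypothetical.

```r
# Hypothetical helper (not part of the package): returns the path of an
# existing SIU.csv when overwrite = FALSE, otherwise NULL to signal that
# a fresh fetch-and-parse run would be needed.
existing_csv_or_null <- function(cache_dir, overwrite = FALSE) {
  csv_path <- file.path(cache_dir, "SIU.csv")
  if (!overwrite && file.exists(csv_path)) {
    csv_path
  } else {
    NULL
  }
}
```

With overwrite = TRUE the helper always returns NULL, mirroring the documented rule that an existing file is reused only when overwrite is FALSE.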

max_drid

Highest director's-report id to fetch. NULL (default) uses the shipped manifest's maximum plus a small margin, falling back to discovery from the SIU site.

concurrency

Maximum simultaneous HTTP transfers. The default of 4 is a polite setting when paired with rate_rps = 4; raising either value above about 8 risks triggering WAF (web application firewall) interstitials that return short non-report HTML.

rate_rps

Maximum request starts per second across the pool (token-bucket throttle). Default 4 is the rate the package was empirically validated against; lower it on poor connections or contested endpoints.
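
For readers unfamiliar with token-bucket throttling, the idea can be sketched in base R. The real throttle lives in the package's C++ code, so this is an illustration only: the function make_bucket is hypothetical, the burst capacity equal to rate_rps is an assumption, and the clock is passed in as a plain number of seconds to keep the sketch deterministic.

```r
# Hypothetical token bucket (illustration only, not the package's code).
# Tokens refill at rate_rps per second up to `capacity`; each request
# start consumes one token and is refused when fewer than one is left.
make_bucket <- function(rate_rps, capacity = rate_rps) {
  tokens <- capacity   # assumption: the bucket starts full
  last <- 0            # time of the previous call, in seconds
  function(now) {
    tokens <<- min(capacity, tokens + (now - last) * rate_rps)
    last <<- now
    if (tokens >= 1) {
      tokens <<- tokens - 1
      TRUE             # request may start
    } else {
      FALSE            # caller should wait and retry
    }
  }
}

try_start <- make_bucket(rate_rps = 4)
```

Under these assumptions, four calls to try_start(0) succeed, a fifth at the same instant is refused, and one more token becomes available a quarter of a second later.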

use_manifest

If TRUE (default), restrict the sweep to the known-valid drids in the shipped manifest (inst/extdata/siu_drid_manifest.csv.gz), still topping up with any drid above the manifest's maximum up to max_drid. Skipping the holes in the id range cuts the fetch by roughly 30-50 percent on a typical run.
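
The manifest-plus-top-up rule can be illustrated with a small base R sketch. The function sweep_drids and the toy ids below are hypothetical; the package's own selection logic may differ in detail.

```r
# Hypothetical sketch of which drids a sweep would visit.
# With use_manifest = TRUE: the known-valid ids, plus every id above the
# manifest's maximum up to max_drid. Otherwise: every id from 1 to max_drid.
sweep_drids <- function(manifest_drids, max_drid, use_manifest = TRUE) {
  max_drid <- as.integer(max_drid)
  if (!use_manifest) {
    return(seq_len(max_drid))
  }
  known <- sort(unique(as.integer(manifest_drids)))
  manifest_max <- max(known)
  top_up <- if (max_drid > manifest_max) {
    seq(manifest_max + 1L, max_drid)
  } else {
    integer(0)
  }
  c(known, top_up)
}

# Toy manifest with holes at 3, 4, 6, 7 and 8.
sweep_drids(c(1L, 2L, 5L, 9L), max_drid = 12L)
```

In this toy case the sweep visits 7 ids instead of 12, because the holes below the manifest's maximum are never requested.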

lang

Language filter on the manifest. "all" (default, back-compat) fetches every known-valid drid – English and French copies of each case – and then collapses to one row per case_number with English winning the dedupe. "en" fetches only the English drids (about half the size of the corpus and half the network round trips); "fr" fetches only French. Use "en" for the fastest cold-start when you only need the canonical English text.
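
The "English wins the dedupe" rule for lang = "all" can be sketched in base R. The case_number column comes from the description above; the lang column, the function name dedupe_english_first, and the sample values are assumptions made for illustration.

```r
# Hypothetical sketch: keep one row per case_number, preferring English.
# Sorting on (lang != "en") puts English rows first within each case,
# so duplicated() then drops the non-English copies.
dedupe_english_first <- function(cases) {
  ord <- order(cases$case_number, cases$lang != "en")
  sorted <- cases[ord, , drop = FALSE]
  sorted[!duplicated(sorted$case_number), , drop = FALSE]
}

# Made-up case numbers: case "A" has both copies, case "B" only French.
cases <- data.frame(
  case_number = c("A", "A", "B"),
  lang = c("fr", "en", "fr"),
  stringsAsFactors = FALSE
)
dedupe_english_first(cases)
```

A case that exists only in French is still kept, since the rule prefers English rather than requiring it.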

cache_html

If TRUE, gzip and save the raw HTML of every fetched director's-report and news-release page under <cache_dir>/html/drid_NNNN.html.gz and <cache_dir>/html/nrid_NNNN.html.gz. This is the persistent ground truth for every row in the emitted CSV: any later discrepancy between the parser and a human coder can be adjudicated against the saved HTML without re-hitting SIU. Adds ~80-100 MB to cache_dir for a full run; default FALSE (the harvester remains lean unless you ask).
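
When adjudicating a row against the saved pages, the gzipped HTML can be read back with base R alone. The helpers below are hypothetical conveniences, not package functions; they list the files that are actually present rather than assuming how the NNNN part of each file name is padded.

```r
# Hypothetical helpers for inspecting a cache written with cache_html = TRUE.

# List every cached page under <cache_dir>/html.
list_cached_html <- function(cache_dir) {
  list.files(file.path(cache_dir, "html"),
             pattern = "\\.html\\.gz$", full.names = TRUE)
}

# Read one gzipped page back as a single string.
read_cached_html <- function(path) {
  con <- gzfile(path, open = "rt", encoding = "UTF-8")
  on.exit(close(con))
  paste(readLines(con, warn = FALSE), collapse = "\n")
}
```

For example, read_cached_html(list_cached_html(cache_dir)[1]) would return the first cached page as text, assuming a run with cache_html = TRUE has already populated cache_dir.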

progress

Logical; print progress messages.

Value

Path to the written SIU.csv.

Details

The parser is implemented entirely in C/C++ (src/siu_parser.cpp): libcurl drives the HTTP transport, a concurrent curl_multi pool fetches the more than 9,000 pages, and the 64-field extraction is done with C++ std::regex. There is no Python dependency.

This is the Ontario Special Investigations Unit – distinct from the federal Structured Intervention Units and from OTIS. The parsed corpus is not shipped with the package; each user runs the parser themselves, which is fair use of public oversight reports.

Examples

if (FALSE) { # \dontrun{
# Network: parses the full Ontario SIU corpus (~15-25 min at the
# default polite rate of 4 RPS).
csv <- morie_fetch_siu(cache_dir = tempdir())
siu <- utils::read.csv(csv)
nrow(siu)
} # }