Fetches and parses the Ontario Special Investigations Unit (police-oversight) corpus – every director's report and the news releases they link – into a single CSV with the canonical 64-column schema, one row per case.
Arguments
- cache_dir
Output directory. Defaults to a session-scoped subdirectory of tempdir() that R cleans up automatically. For persistent cross-session caching, pass cache_dir = morie_cache_dir("siu") instead; see morie_cache_dir and morie_cache_clear.
- overwrite
Logical; if FALSE and SIU.csv already exists in cache_dir, its path is returned without reparsing.
- max_drid
Highest director's-report id to fetch. NULL (default) uses the shipped manifest's maximum plus a small margin, falling back to discovery from the SIU site.
- concurrency
Maximum simultaneous HTTP transfers. The default of 4 is a polite rate when paired with rate_rps = 4; raising either value above about 8 risks triggering WAF interstitials, which return short non-report HTML.
- rate_rps
Maximum request starts per second across the pool (token-bucket throttle). The default of 4 is the rate the package was empirically validated against; lower it on poor connections or contested endpoints.
- use_manifest
If TRUE (default), restrict the sweep to the known-valid drids in the shipped manifest (inst/extdata/siu_drid_manifest.csv.gz), still topping up with any drid above the manifest's max up to max_drid. Cuts the fetch by ~30-50 percent on a typical run by skipping holes.
- lang
Language filter on the manifest. "all" (default, back-compat) fetches every known-valid drid – English and French copies of each case – and then collapses to one row per case_number, with English winning the dedupe. "en" fetches only the English drids (about half the size of the corpus and half the network round trips); "fr" fetches only French. Use "en" for the fastest cold start when you only need the canonical English text.
- cache_html
If TRUE, gzip and save the raw HTML of every fetched director's-report and news-release page under <cache_dir>/html/drid_NNNN.html.gz and <cache_dir>/html/nrid_NNNN.html.gz. This is the persistent ground truth for every row in the emitted CSV: any later discrepancy between the parser and a human coder can be adjudicated against the saved HTML without re-hitting SIU. Adds ~80-100 MB to cache_dir for a full run; default FALSE (the harvester remains lean unless you ask).
- progress
Logical; print progress messages.
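As a rough illustration of how use_manifest, max_drid and lang interact, the following base-R sketch plans a sweep from a toy manifest and then collapses parsed rows to one per case with English winning. It is not the package's code: the helper names plan_sweep and dedupe_cases, the manifest column names drid and lang, the case numbers, and the choice to include top-up ids regardless of the language filter are all assumptions made for the example.

```r
# Toy stand-in for the shipped manifest. Column names are assumed,
# not taken from the package.
manifest <- data.frame(
  drid = c(101L, 102L, 104L, 105L),   # 103 is a "hole" that gets skipped
  lang = c("en", "fr", "en", "fr"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: which drids would a sweep visit?
plan_sweep <- function(manifest, lang = "all", max_drid = NULL) {
  known <- if (lang == "all") {
    manifest$drid
  } else {
    manifest$drid[manifest$lang == lang]
  }
  top <- max(manifest$drid)
  # Top up with ids above the manifest's max, up to max_drid.
  # (Assumption: top-up ids are tried whatever the language filter.)
  extra <- if (!is.null(max_drid) && max_drid > top) {
    seq.int(top + 1L, max_drid)
  } else {
    integer(0)
  }
  sort(unique(c(known, extra)))
}

# Hypothetical helper: one row per case_number, English preferred.
dedupe_cases <- function(rows) {
  # FALSE sorts before TRUE, so English rows come first within each case.
  ordered <- rows[order(rows$case_number, rows$lang != "en"), ]
  ordered[!duplicated(ordered$case_number), ]
}

ids_all <- plan_sweep(manifest, "all", max_drid = 107L)
ids_en  <- plan_sweep(manifest, "en",  max_drid = 107L)

# Invented parsed rows: case 001 exists in both languages, 002 only in French.
parsed <- data.frame(
  case_number = c("24-OCI-001", "24-OCI-001", "24-OFD-002"),
  lang        = c("fr", "en", "fr"),
  drid        = c(102L, 101L, 105L),
  stringsAsFactors = FALSE
)
one_per_case <- dedupe_cases(parsed)
```

In this toy run the English copy is kept for the case that has both languages, while the French-only case survives unchanged, which is the behaviour the lang = "all" description implies.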
Details
The parser is implemented entirely in C/C++ (src/siu_parser.cpp): libcurl drives the HTTP transport, a concurrent curl_multi pool fetches the 9,000+ pages, and the 64-field extraction uses C++ std::regex. There is no Python dependency.
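The pacing of that pool is governed by the rate_rps token bucket. The package's throttle lives in the C++ source and is not reproduced here; the sketch below is only an R simulation of a generic token bucket, showing when queued requests would be allowed to start. The function name token_bucket_starts and the burst parameter (bucket capacity) are assumptions made for the illustration.

```r
# Simulate the start times (in seconds) of n back-to-back queued requests
# under a token bucket that refills at rate_rps tokens per second.
# `burst` is the bucket capacity; it is an assumed parameter, not a
# documented argument of the package.
token_bucket_starts <- function(n, rate_rps = 4, burst = 1) {
  tokens <- burst        # the bucket starts full
  now <- 0               # simulated clock
  starts <- numeric(n)
  for (i in seq_len(n)) {
    if (tokens < 1) {
      # Wait just long enough for the refill to reach one whole token.
      now <- now + (1 - tokens) / rate_rps
      tokens <- 1
    }
    tokens <- tokens - 1 # each request start spends one token
    starts[i] <- now
  }
  starts
}

paced   <- token_bucket_starts(5, rate_rps = 4, burst = 1)
bursted <- token_bucket_starts(6, rate_rps = 4, burst = 4)
```

With a capacity of one token the starts are spaced evenly at 1 / rate_rps seconds; a larger capacity lets an initial burst through before the same steady spacing takes over.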
This is the Ontario Special Investigations Unit – distinct from the federal Structured Intervention Units and from OTIS. The parsed corpus is not shipped with the package; each user runs the parser themselves, which is fair use of public oversight reports.