Parses one SIU director's-report HTML page (or one news-release page) into a structured row list. The production parser lives in the Rcpp / C++ backend (.siu_parse_report, .siu_parse_news); this pure-R port is provided as a reference implementation and as a fallback for environments where the compiled libmorie backend is unavailable.

Details

Suggested dependencies: rvest and xml2. When both are installed, these functions use them for DOM walking; otherwise they fall back to regular expressions over flat, tag-stripped text. Either way the parser is pure and performs no network access: hand it a raw HTML string and it returns a row list matching SIU_COLUMNS.
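The optional-dependency pattern described above could look roughly like the following. This is a minimal sketch, not the package's actual internals: the helper names `has_dom_backend`, `strip_tags`, and `html_to_text` are hypothetical, and the real fallback may strip markup differently.

```r
# Hypothetical helpers for illustration; not the package's actual API.

# TRUE only when both suggested packages are installed.
has_dom_backend <- function() {
  requireNamespace("rvest", quietly = TRUE) &&
    requireNamespace("xml2", quietly = TRUE)
}

# Regex fallback: drop script/style blocks, strip tags, collapse whitespace.
strip_tags <- function(html) {
  txt <- gsub("(?is)<(script|style)\\b.*?</\\1\\s*>", " ", html, perl = TRUE)
  txt <- gsub("(?s)<[^>]+>", " ", txt, perl = TRUE)
  txt <- gsub("&nbsp;", " ", txt, fixed = TRUE)
  trimws(gsub("\\s+", " ", txt))
}

# Pure function: takes an HTML string and never touches the network.
html_to_text <- function(html) {
  if (has_dom_backend()) {
    doc <- xml2::read_html(html)
    trimws(gsub("\\s+", " ", rvest::html_text(doc)))
  } else {
    strip_tags(html)
  }
}
```

Checking with `requireNamespace()` at call time, rather than attaching the packages, keeps rvest and xml2 as suggested dependencies instead of hard ones.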

The parser is hardened against changes in the SIU page markup over time by:

  • looking for several label variants per field,

  • falling back to regex on stripped text when DOM structure shifts,

  • preserving the verbatim narrative_full regardless of parse success.
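The three hardening measures above can be sketched in base R as follows. The field names, label variants, and function names here are hypothetical and chosen only for illustration. The sketch also assumes the stripped text keeps one label per line, which may not hold for the real fallback.

```r
# Hypothetical sketch; field names and label variants are illustrative only.
label_variants <- list(
  case_number   = c("Case Number", "Case #", "Case No."),
  incident_date = c("Date of Incident", "Incident Date")
)

# Try each label variant in order; return the first value found, else NA.
extract_field <- function(text, variants) {
  for (label in variants) {
    # \Q...\E makes PCRE match the label literally (e.g. the "." in "No.").
    pattern <- paste0("(?i)\\Q", label, "\\E[ \\t]*:?[ \\t]*([^\\r\\n]+)")
    m <- regmatches(text, regexec(pattern, text, perl = TRUE))[[1]]
    if (length(m) == 2L) return(trimws(m[2]))
  }
  NA_character_
}

# Build one row; the narrative is stored verbatim whatever the fields yield.
parse_row_sketch <- function(text) {
  row <- lapply(label_variants, function(v) extract_field(text, v))
  # Keep the verbatim narrative even if every field above came back NA.
  row$narrative_full <- text
  row
}
```

Because `narrative_full` is assigned unconditionally, a page whose labels have all changed still yields a row that can be re-parsed later from the stored text.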