Translate SIU report text into any target language via local LLM
Source: R/siu.R
morie_siu_translate.Rd
For SIU cases whose parser-emitted text isn't in the reader's
preferred language, translate the long-form text fields into
target_lang via a local Ollama model (default $0 cost,
no API key) and save each translation as a canonical override.
Subsequent morie_fetch_siu() runs then return text in
target_lang for those cases automatically.
morie_siu_translate_fr_to_en is a thin
back-compat wrapper that calls morie_siu_translate
with target_lang = "en", source_lang = "fr".
Usage
morie_siu_translate(
  target_lang = NULL,
  source_lang = NULL,
  case_numbers = NULL,
  model = "ollama",
  fields = c("narrative_summary", "news_release_summary", "news_release_title",
    "relevant_legislation"),
  cache_dir = file.path(tempdir(), "morie", "siu"),
  progress = TRUE
)
morie_siu_translate_fr_to_en(
  case_numbers = NULL,
  model = "ollama",
  fields = c("narrative_summary", "news_release_summary", "news_release_title",
    "relevant_legislation"),
  cache_dir = file.path(tempdir(), "morie", "siu"),
  progress = TRUE
)
Arguments
- target_lang
Target ISO 639-1 language code (or full language name). Defaults to
Sys.getenv("MORIE_USER_LANG") or, failing that, the first two characters of Sys.getenv("LANG"), so it picks up the user's system locale automatically.
- source_lang
Source language code, or
NULL (default) to use each row's parsed_language field.
- case_numbers
Character vector of SIU case numbers to translate. Defaults to every row whose
_language differs from target_lang and has no override yet.
- model
LLM model chain (see
morie_siu_llm_extract). Default "ollama" for $0 cost via a local model such as Gemma.
- fields
Which text fields to translate. Defaults to the long-form fields that benefit most from translation:
narrative_summary, news_release_summary, news_release_title, relevant_legislation.
- cache_dir
Directory holding the harvester's SIU.csv and cached HTML.
- progress
Print per-case progress.
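The default resolution of target_lang described above can be sketched as follows. This is a minimal illustration, not the package's own code: resolve_target_lang is a hypothetical helper, and the lower-casing and the check that LANG starts with a two-letter code are assumptions added so that locales such as "C" or "POSIX" are not mistaken for a language.

```r
# Hypothetical helper, not part of the documented API.
# Resolution order: explicit argument, MORIE_USER_LANG, then LANG.
resolve_target_lang <- function(target_lang = NULL) {
  if (!is.null(target_lang) && nzchar(target_lang)) {
    return(target_lang)
  }
  user_lang <- Sys.getenv("MORIE_USER_LANG", unset = "")
  if (nzchar(user_lang)) {
    return(user_lang)
  }
  # A locale such as "fr_CA.UTF-8" yields "fr".
  # Assumption: only accept LANG when it starts with a two-letter code.
  locale <- Sys.getenv("LANG", unset = "")
  if (grepl("^[A-Za-z]{2}([_.@]|$)", locale)) {
    return(tolower(substr(locale, 1, 2)))
  }
  NA_character_
}
```

A caller could then check for NA and ask the user to pass target_lang explicitly.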
Details
Use cases:
- French-only SIU reports (a few per year of SIU output) that have no English-paired drid: translate to "en" so downstream analyses can join them with the rest.
- English SIU reports that a Hindi / Spanish / Mandarin / Punjabi / Arabic / etc. reader needs: translate to their first language for accessibility.
- Any cross-language pivot for community-oriented publication, where the reader's first language isn't what the SIU originally published in.
The function is idempotent: it skips cases that already have an override on file
for this target_lang. It is also self-improving: every translation
accumulates in <cache_dir>/canonical_overrides.csv, so
the SIU table becomes more accessible every time you run this.
Maintainers can promote the resulting overrides into the
shipped inst/extdata/siu_canonical_overrides.csv.gz.
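The skip-if-already-translated behaviour can be illustrated with a small sketch. It is not the package's implementation: pending_cases is a hypothetical helper, the column names of the overrides table (case_number, target_lang) are assumptions, and the case numbers are made-up toy values.

```r
# Hypothetical sketch; column names of the overrides table are assumed.
# Returns case numbers that still need a translation into target_lang.
pending_cases <- function(siu, overrides, target_lang) {
  needs_translation <- !is.na(siu$parsed_language) &
    siu$parsed_language != target_lang
  # Cases that already have an override for this target language.
  done <- overrides$case_number[overrides$target_lang == target_lang]
  siu$case_number[needs_translation & !(siu$case_number %in% done)]
}

# Toy data for illustration only.
siu <- data.frame(
  case_number = c("A-001", "A-002", "A-003"),
  parsed_language = c("fr", "en", "fr"),
  stringsAsFactors = FALSE
)
overrides <- data.frame(
  case_number = "A-001",
  target_lang = "en",
  stringsAsFactors = FALSE
)

pending_cases(siu, overrides, "en")
```

Here only "A-003" is still pending. Once an override for it is recorded, a second call returns an empty vector, which is what makes repeated runs safe.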
For the best speed and quality on multilingual translation, set
OLLAMA_MODEL=translategemma:latest, a Gemma model
fine-tuned for translation. Otherwise the function uses whatever model
OLLAMA_MODEL points at.
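One way to apply this recommendation without overriding a model the user has already chosen is sketched below. use_translation_model is a hypothetical convenience function, not part of the package; it only reads and sets the OLLAMA_MODEL environment variable.

```r
# Hypothetical helper: set OLLAMA_MODEL to the translation-tuned model
# only when the user has not already chosen one.
use_translation_model <- function(model = "translategemma:latest") {
  current <- Sys.getenv("OLLAMA_MODEL", unset = "")
  if (!nzchar(current)) {
    Sys.setenv(OLLAMA_MODEL = model)
    current <- model
  }
  # Return the model name that is now in effect.
  invisible(current)
}

active_model <- use_translation_model()
```

Calling this before morie_siu_translate() keeps any existing OLLAMA_MODEL setting intact.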
Examples
if (FALSE) { # \dontrun{
  Sys.setenv(
    OLLAMA_HOST = "http://localhost:11434",
    OLLAMA_MODEL = "translategemma:latest"
  )
  csv <- morie_fetch_siu(cache_html = TRUE)
  # Translate every non-English row to English:
  morie_siu_translate(target_lang = "en")
  # Or translate everything to Hindi for a Hindi-first reader:
  morie_siu_translate(target_lang = "hi")
  # Re-fetch picks up the new overrides automatically:
  csv <- morie_fetch_siu(overwrite = TRUE)
} # }