
For SIU cases whose parser-emitted text isn't in the reader's preferred language, translate the long-form text fields into target_lang via a local Ollama model (default $0 cost, no API key) and save each translation as a canonical override. Subsequent morie_fetch_siu() runs then return text in target_lang for those cases automatically.

morie_siu_translate_fr_to_en is a thin back-compat wrapper that calls morie_siu_translate with target_lang = "en", source_lang = "fr".

Usage

morie_siu_translate(
  target_lang = NULL,
  source_lang = NULL,
  case_numbers = NULL,
  model = "ollama",
  fields = c("narrative_summary", "news_release_summary", "news_release_title",
    "relevant_legislation"),
  cache_dir = file.path(tempdir(), "morie", "siu"),
  progress = TRUE
)

morie_siu_translate_fr_to_en(
  case_numbers = NULL,
  model = "ollama",
  fields = c("narrative_summary", "news_release_summary", "news_release_title",
    "relevant_legislation"),
  cache_dir = file.path(tempdir(), "morie", "siu"),
  progress = TRUE
)

Arguments

target_lang

Target ISO 639-1 language code (or full language name). When NULL (the default), it is taken from Sys.getenv("MORIE_USER_LANG") or, failing that, from the first two characters of Sys.getenv("LANG"), so it picks up the user's system locale automatically.

source_lang

Source language code, or NULL (default) to use each row's parsed _language field.

case_numbers

Character vector of SIU case numbers to translate. Defaults to every row whose _language differs from target_lang and has no override yet.

model

LLM model chain (see morie_siu_llm_extract). Default "ollama", which runs a local model (Gemma or similar) at $0 cost.

fields

Which text fields to translate. Defaults to the long-form fields that benefit most from translation: narrative_summary, news_release_summary, news_release_title, relevant_legislation.

cache_dir

Directory holding the harvester's SIU.csv and cached HTML.

progress

Print per-case progress.
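
The fallback order described for target_lang can be sketched as follows. This is a minimal illustration, not the package's internal code: resolve_target_lang() is a hypothetical helper name, and the handling of an empty or "C" locale is left out because the documentation does not describe it.

```r
# Hypothetical helper (not exported by the package): mirrors the documented
# fallback order for `target_lang`.
resolve_target_lang <- function(target_lang = NULL) {
  # 1. An explicit argument wins.
  if (!is.null(target_lang) && nzchar(target_lang)) {
    return(target_lang)
  }
  # 2. Otherwise use MORIE_USER_LANG if it is set and non-empty.
  user_lang <- Sys.getenv("MORIE_USER_LANG", unset = "")
  if (nzchar(user_lang)) {
    return(user_lang)
  }
  # 3. Otherwise take the first two characters of LANG,
  #    e.g. "fr_CA.UTF-8" becomes "fr".
  substr(Sys.getenv("LANG", unset = ""), 1, 2)
}

Sys.setenv(MORIE_USER_LANG = "hi", LANG = "fr_CA.UTF-8")
resolve_target_lang()      # "hi": the environment variable is set
resolve_target_lang("es")  # "es": an explicit argument takes precedence
Sys.unsetenv("MORIE_USER_LANG")
resolve_target_lang()      # "fr": first two characters of LANG
```

Setting MORIE_USER_LANG once (for example in .Renviron) would therefore let every call omit target_lang.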

Value

Invisibly, a data frame of newly-recorded (case_number, field, verified_value) translations.

Details

Use cases:

  • French-only SIU reports (a few per year of SIU output) that have no English-paired drid – translate to "en" so downstream analyses can join them with the rest.

  • English SIU reports that a Hindi / Spanish / Mandarin / Punjabi / Arabic / etc. reader needs – translate to their first language for accessibility.

  • Any cross-language pivot for community-oriented publication, where the reader's first language isn't what the SIU originally published in.

The function is idempotent: it skips cases that already have an override on file for this target_lang. It is also self-improving: every translation accumulates in <cache_dir>/canonical_overrides.csv, so the SIU table becomes more accessible each time you run it. Maintainers can promote the resulting overrides into the shipped inst/extdata/siu_canonical_overrides.csv.gz.
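
The skip logic behind idempotency, and the default selection of case_numbers, can be illustrated with toy data. This is only a sketch under assumptions: pending_cases() is a hypothetical helper, the case numbers are placeholders, and the target_lang column on the overrides table is an assumed way of recording which language an override belongs to (the documented columns are case_number, field and verified_value).

```r
# Toy stand-in for the harvested SIU table (placeholder case numbers).
siu <- data.frame(
  case_number = c("CASE-1", "CASE-2", "CASE-3"),
  `_language` = c("fr", "en", "fr"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Toy stand-in for canonical_overrides.csv. The `target_lang` column is an
# assumption made for this sketch.
overrides <- data.frame(
  case_number = "CASE-1",
  field = "narrative_summary",
  verified_value = "(translated text)",
  target_lang = "en",
  stringsAsFactors = FALSE
)

# Hypothetical helper: cases whose language differs from the target and that
# have no override recorded for that target yet.
pending_cases <- function(siu, overrides, target_lang) {
  needs_translation <- siu[["_language"]] != target_lang
  done <- overrides$case_number[overrides$target_lang == target_lang]
  siu$case_number[needs_translation & !(siu$case_number %in% done)]
}

pending_cases(siu, overrides, "en")  # "CASE-3"

# Recording a translation for CASE-3 leaves nothing pending on the next run.
overrides <- rbind(
  overrides,
  data.frame(
    case_number = "CASE-3",
    field = "narrative_summary",
    verified_value = "(translated text)",
    target_lang = "en",
    stringsAsFactors = FALSE
  )
)
length(pending_cases(siu, overrides, "en"))  # 0
```

Because the check is keyed on the target language in this sketch, translating the same table to another language (say "hi") would still treat all three cases as pending.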

For the best speed and quality on multilingual translation, set OLLAMA_MODEL=translategemma:latest, a Gemma model fine-tuned for translation. Otherwise the function uses whatever model OLLAMA_MODEL points at.

Examples

if (FALSE) { # \dontrun{
Sys.setenv(
  OLLAMA_HOST = "http://localhost:11434",
  OLLAMA_MODEL = "translategemma:latest"
)
csv <- morie_fetch_siu(cache_html = TRUE)
# Translate every non-English row to English:
morie_siu_translate(target_lang = "en")
# Or translate everything to Hindi for a Hindi-first reader:
morie_siu_translate(target_lang = "hi")
# Re-fetch picks up the new overrides automatically:
csv <- morie_fetch_siu(overwrite = TRUE)
} # }