
Sends the cached director's report HTML for one case to a large-language-model endpoint and asks it to return the 64-column morie schema as JSON. The result uses the same row format as the C++ parser, so it can be passed directly to morie_siu_compare() as the external argument for an independent diff against the parser.

Usage

morie_siu_llm_extract(
  case_number,
  model = c("ollama", "gemini"),
  cache_dir = file.path(tempdir(), "morie", "siu"),
  max_html_chars = 80000L,
  mock_response_text = NULL
)

Arguments

case_number

An SIU case number (e.g. "17-OVI-201").

model

One of "ollama" (free, runs locally, zero-config when an Ollama daemon is listening on localhost:11434), "gemini" (paid), or "claude" (paid). A character vector enables fail-over: the models are tried in order and the first one whose call succeeds wins. The default c("ollama", "gemini") tries the free local model first and escalates to paid Gemini only if Ollama is not installed or its call fails, so morie costs nothing to run as long as a free Gemma / Qwen / Llama model is available locally (e.g. after ollama pull gemma3:4b).

cache_dir

Directory holding the harvester's SIU.csv and the optional html/ subdirectory.

max_html_chars

Soft cap, in characters, on the HTML payload sent to the model (default 80,000: larger than any real SIU report, small enough to stay under typical context budgets).

mock_response_text

For testing only: if non-NULL, skip the network call and use this string as the model's raw reply.
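
The fail-over behaviour described for model can be sketched in a few lines of base R. This is only an illustration of the "first success wins" idea, not the package's internal code: try_models() and fake_call() are hypothetical names, and the stand-in backend simply simulates an unreachable Ollama daemon.

```r
# Hypothetical helper: try each backend in order and return the first success.
try_models <- function(models, call_model) {
  errors <- character()
  for (m in models) {
    result <- tryCatch(
      list(ok = TRUE, value = call_model(m)),
      error = function(e) list(ok = FALSE, message = conditionMessage(e))
    )
    if (result$ok) {
      return(list(model = m, value = result$value))
    }
    errors[m] <- result$message  # remember why this backend failed
  }
  stop(
    "all models failed: ",
    paste(names(errors), errors, sep = ": ", collapse = "; ")
  )
}

# Stand-in backend (assumption): "ollama" is unreachable, "gemini" answers.
fake_call <- function(model) {
  if (model == "ollama") stop("connection refused")
  '{"case_number": "17-OVI-201"}'
}

res <- try_models(c("ollama", "gemini"), fake_call)
res$model
```

Collecting the per-model error messages means that, if every backend fails, the final error still says which one failed and why.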

Value

A one-row data frame with the 64 morie SIU columns. Any field the model could not extract is the empty string (matching the C++ parser's convention).
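
The empty-string convention can be illustrated with a minimal sketch. The three column names below are placeholders (the real schema has 64 columns and is not listed on this page), and pad_to_schema() is a hypothetical helper, not a morie function.

```r
# Hypothetical subset of the schema; the real one has 64 columns.
schema <- c("case_number", "incident_date", "police_service")

# Build a one-row data frame with every schema column present.
# Fields the model did not return (or returned as NULL/NA) become "".
pad_to_schema <- function(fields, schema) {
  row <- setNames(as.list(rep("", length(schema))), schema)
  for (nm in intersect(names(fields), schema)) {
    value <- fields[[nm]]
    if (!is.null(value) && length(value) == 1 && !is.na(value)) {
      row[[nm]] <- as.character(value)
    }
  }
  as.data.frame(row, stringsAsFactors = FALSE)
}

# Made-up model reply after JSON decoding: one field missing, one unknown.
parsed <- list(case_number = "17-OVI-201", incident_date = NULL, extra = "ignored")
row_df <- pad_to_schema(parsed, schema)
```

Because every column is always present and always character, a row built this way can be compared field by field with a parser row without special-casing missing values.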

Details

The cached HTML remains the ground truth. This function does not claim the LLM is more accurate than the regex parser; it provides a fast second extraction so disagreements between two independent methods (regex vs. LLM) can be flagged for human review against the saved report.
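
To make the "flag disagreements for review" idea concrete, here is a minimal base-R sketch that lists the fields on which two one-row data frames differ. It is not how morie_siu_compare() is implemented; flag_disagreements() is a hypothetical helper, and the column names and values are made up for illustration.

```r
# Hypothetical helper: list fields where the parser and LLM rows disagree.
flag_disagreements <- function(parser_row, llm_row) {
  shared <- intersect(names(parser_row), names(llm_row))
  first_value <- function(row, nm) as.character(row[[nm]][1])
  a <- vapply(shared, function(nm) first_value(parser_row, nm),
              character(1), USE.NAMES = FALSE)
  b <- vapply(shared, function(nm) first_value(llm_row, nm),
              character(1), USE.NAMES = FALSE)
  a[is.na(a)] <- ""  # treat NA like the empty-string convention
  b[is.na(b)] <- ""
  differs <- trimws(a) != trimws(b)
  data.frame(
    field = shared[differs],
    parser = a[differs],
    llm = b[differs],
    stringsAsFactors = FALSE
  )
}

# Made-up rows: the two methods agree on everything except one field.
parser_row <- data.frame(case_number = "17-OVI-201",
                         police_service = "Service A",
                         incident_date = "",
                         stringsAsFactors = FALSE)
llm_row <- data.frame(case_number = "17-OVI-201",
                      police_service = "Service B",
                      incident_date = "",
                      stringsAsFactors = FALSE)

review <- flag_disagreements(parser_row, llm_row)
```

Each row of the result names one field to check by hand against the saved report; agreement between the two methods is not proof of correctness, only a reason to look elsewhere first.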

Credentials are read from environment variables only – never hard-coded, never passed as function arguments – so secrets do not leak into call traces, logs, or scripts. Set GOOGLE_API_KEY for Gemini, ANTHROPIC_API_KEY for Claude, or OLLAMA_HOST (e.g. "http://localhost:11434" or an OllamaFreeAPI base URL) plus optionally OLLAMA_MODEL (default "llama3.2:3b") for Ollama-compatible open-weight endpoints.
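
Reading configuration from the environment might look like the sketch below. get_credential() is a hypothetical helper, not part of morie, and the fallback host is an assumption based on the localhost:11434 default mentioned for Ollama; only the variable names and the "llama3.2:3b" default come from the text above.

```r
# Hypothetical helper: fetch a secret from the environment or fail loudly.
get_credential <- function(var) {
  value <- Sys.getenv(var, unset = "")
  if (!nzchar(value)) {
    stop("environment variable ", var, " is not set", call. = FALSE)
  }
  value
}

# Non-secret settings can fall back to defaults when unset.
ollama_host  <- Sys.getenv("OLLAMA_HOST",  unset = "http://localhost:11434")
ollama_model <- Sys.getenv("OLLAMA_MODEL", unset = "llama3.2:3b")
```

Putting a line such as GOOGLE_API_KEY=... in ~/.Renviron lets R load the key at startup, so it never has to appear in a script or a function call.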

Examples

if (FALSE) { # \dontrun{
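# Demo only: in real use, set GOOGLE_API_KEY outside the script (e.g. in ~/.Renviron).
Sys.setenv(GOOGLE_API_KEY = "your-gemini-key")
# Force the Gemini backend instead of the default Ollama-first fail-over.
r <- morie_siu_llm_extract("17-OVI-201", model = "gemini")
# Diff parser vs LLM against the HTML: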
morie_siu_compare(
  "17-OVI-201",
  external = r,
  field_map = setNames(as.list(names(r)), names(r)),
  external_case_col = "case_number"
)
} # }