Problem: the tolerance field for floating-point comparison was named
`epsilon`, which is an implementation detail, not the user-visible concept.
Solution: rename to `precision` in run.lua type annotations, internal
variables, and comparison logic.
Problem: output comparison used exact string equality after whitespace
normalisation, causing correct solutions to fail on problems where
floating-point answers are accepted within a tolerance (e.g. 1e-6).
Solution: add an optional ui.panel.epsilon config value. When set,
actual and expected output are compared token-by-token: numeric tokens
are compared with math.abs(a - b) <= epsilon, non-numeric tokens fall
back to exact string equality. Per-problem epsilon can also be stored
in the cache and takes precedence over the global default.
Problem: test cases were executed sequentially, each waiting for the
previous process to finish before starting the next. On problems with
many test cases this meant wall-clock run time scaled linearly.
Solution: fan out all test case processes simultaneously. A remaining
counter fires on_done once all callbacks have returned. on_each is
called per completion as before; callers that pass on_each ignore its
arguments so the index semantics change is non-breaking.
Problem: with debug = true, there is not enough diagnostic output to
troubleshoot environment or execution issues. The resolved python path,
scraper commands, and compile/run shell commands are not logged.
Solution: add logger.log calls at key decision points: python env
resolution (nix vs uv vs discovery), uv sync stderr output, scraper
subprocess commands, and compile/run shell strings. All gated behind
the existing debug flag so they only appear when debug = true.