Problem: test cases were executed sequentially, each waiting for the
previous process to finish before starting the next. On problems with
many test cases this meant wall-clock run time scaled linearly.
Solution: fan out all test case processes simultaneously. A remaining
counter fires on_done once all callbacks have returned. on_each is
called per completion as before; callers that pass on_each ignore its
arguments so the index semantics change is non-breaking.