Error Highlighting in PoprC

Posted on July 21, 2020

What is an error anyway…

I have refrained from implementing error messages for PoprC until now.

There are several reasons for this:

Failure is okay, and expected.

Other than trivial syntax errors, there is only one possible error in a Popr program - program non-determinism. Determinism is the property of having a unique successful execution path for any input, so a non-deterministic program has either no successful path, or multiple successful paths, for some input. While the program must be deterministic, it’s okay for any part to fail (e.g. division when the divisor is zero) as long as it is handled by the consumer of that computation. Every branch must fail except one, so failure is common and expected.
```
1 0 /     __ Division by zero
"one" 2 + __ Addition operates on integers
3 False ! __ Asserting False
```
These all fail, but appending True | to any one of these will result in True.

This means detecting true errors requires the entire program context, which is not always available. Automatically proving determinism is not implemented yet, and is not solvable in general.
Reduction is lossy, which can make locating errors in the source imprecise.

Location information is discarded during reduction, and yet the result might fail after reduction. The reduction graph is more expressive than the source language, so a failure might not correspond to a precise location in the original source code.

A simple example is that in 1 2 swap 3 + odd !, 2 is not involved in the error, but after reducing the addition, the location of the result must include the constants 1 and 3 from which the result (4) is derived.
Source locations take up space.

Popr uses a very compact internal representation (IR) graph. For a long time, I’ve been able to fit everything I’ve needed for the IR into 64 byte nodes, but there’s no space for another byte, much less another pointer.

Despite this, the previous behavior of omitting output for all errors is not an acceptable user experience, especially for new users. Just printing “Error!” isn’t acceptable, either. So we need a useful error message, if not a perfectly accurate one. It should be useful for catching simple errors without looking through the logs.

So I made some concessions:

While some failures are expected, the programmer rarely intends an expression that will always fail (such as 1 0 /.)
Imprecise locations are better than nothing.
Error reporting is worth increasing the IR node size, even though this space is “wasted” on a successful compilation.
If this ever becomes a concern, I can use a compile time flag to disable it.

How it works

All nodes are annotated with source locations (seg_t) at parse time. This requires an extra 16 bytes (+25%). Oh well.

union cell {
  uintptr_t c[10]; // <--- UP FROM 8
  struct {
    union {
      cell_t *alt;
      const char *word_name; /* entry */
    };
    union {
      cell_t *tmp;
      val_t tmp_val;
      const char *module_name; /* entry */
      char_class_t char_class; /* tok_list */
    };
    enum op op;
    seg_t src; // <--- SOURCE LOCATION
    union {
      uint8_t pos; /* see below */
      uint8_t arg_index; /* arg index (for dep vars) */
      uint8_t var_index; /* final index for vars in trace */
      priority_t priority; /* used in func_list() & delay_branch() */
    };
    refcount_t n;
    csize_t size;
    union {
      expr_t expr;
      value_t value;
      tok_list_t tok_list;
      entry_t entry;
      mem_t mem;
    };
  }
};

When a node is reduced, the source range in the context is expanded to include that node. If a node fails, the context is limited to just that node. This benefits from logic that prioritizes early failures, so that code is often not highlighted if it doesn’t contribute to the failure.

The final source location from the context is propagated to the result node.

// Reduce then split c->arg[n]
response reduce_arg(cell_t *c,
                csize_t n,
                context_t *ctx) {
  cell_t **ap = &c->expr.arg[n];
  response r = reduce(ap, ctx);
  if(r <= DELAY) {
    ctx->up->alt_set |= ctx->alt_set;
    ctx->up->text = seg_range(ctx->up->text, ctx->text); // <--- PROPAGATE SOURCE LOCATION ON REDUCTION
    split_arg(c, n, dup_alt);
  } else if(r == FAIL) {
    ctx->up->text = ctx->text; // <--- BLAME ONLY THIS REDUCTION ON FAILURE
  }
  return r;
}

On each failure, the location from the context is logged to a buffer.

// Reduce *cp with type t
response reduce(cell_t **cp, context_t *ctx) {
  cell_t *c = *cp;
  const char *module_name, *word_name;
  get_name(c, &module_name, &word_name); // debug
  assert_error(ctx->depth < MAX_CALL_DEPTH, "stack too deep %C", c);

  while(c) {
    assert_error(is_closure(c));
    c = *cp = fill_incomplete(c);
    stats.reduce_cnt++;
    ctx->text = c->src;
    op op = c->op;
    response r = op_call(op, cp, ctx);

    // prevent infinite loops when debugging
    assert_counter(LENGTH(cells));

    if(!*cp) {
      LOG(MARK("FAIL") ": %O %C (%s.%s) %L @abort",
          op, c, module_name, word_name, ctx->loc.raw);
      log_fail(ctx); // <--- LOCATION LOGGED HERE
    }
    c = *cp;
    if(r <= DELAY || (r == RETRY && ctx->retry)) {
      ctx->retry = false;
      return r;
    }
  }

  *cp = &fail_cell;
  return FAIL;
}

If reduction fails entirely, meaning that the expression will always fail, all locations from the failure log are flattened into a set of non-overlapping locations.

size_t get_flattened_error_ranges(seg_t src, pair_t *res) {
  if(!src.s || !src.n) return 0;

  uintptr_t
    l[fail_location_n],
    r[fail_location_n];

  // load offsets into l and r
  COUNTUP(i, fail_location_n) {
    l[i] = clamp(0, (intptr_t)src.n, fail_location[i].s - src.s);
    r[i] = clamp(0, src.n, l[i] + fail_location[i].n);
  }

  return flatten_ranges(l, r, res, fail_location_n);
}

void highlight_errors(seg_t src) {
  pair_t res[fail_location_n];
  size_t n = get_flattened_error_ranges(src, res);
  uintptr_t last = 0;
  COUNTUP(i, n) {
    printf("%.*s", (int)(res[i].first - last), src.s + last);
    printf(UNDERLINE_START);
    print_seg_escape((seg_t) { .s = src.s + res[i].first, .n = res[i].second - res[i].first });
    printf(UNDERLINE_END);
    last = res[i].second;
  }
  printf("%.*s\n", (int)(src.n - last), src.s + last);
}

Ranges are flattened by sorting the start and end points (l & r), incrementing/decrementing depth, and recording the transitions from and to depth == 0. Each transition to a non-zero depth is the start of a flattened range, and each transition back to zero is the end. These will always alternate, so they can be stored as pair_ts.

size_t flatten_ranges(uintptr_t *l, uintptr_t *r, pair_t *res, size_t n) {
  quicksort(l, WIDTH(l), n);
  quicksort(r, WIDTH(r), n);
  int depth = 0;
  size_t out_n = 0;
  uintptr_t *l_end = l + n;
  LOOP(n * 2) {
    if(l >= l_end || *l > *r) {
      depth--;
      if(!depth) {
        res->second = *r;
        res++;
        out_n++;
      }
      r++;
    } else if (*l < *r) {
      if(!depth) {
        res->first = *l;
      }
      depth++;
      l++;
    } else { // *l == *r
      l++;
      r++;
    }
  }
  return out_n;
}

The original source is printed to the screen, with the failure locations underlined.

For example:

1 2 odd ! 3 +

Because 2 odd is True, causing ! to fail.

Or:

1 2 + A *

Because A is a symbol, and * operates on integers.

There’s still more work to do

This is just the first small step in error reporting. There is much left to do:

While I find source location is usually all I need to spot an error (what is wrong), it would also be useful to add an explanation why it is wrong.
Locations are only reported in the expression be compiled, but it might be helpful to see failures in source expanded from a function call.
The programmer probably doesn’t expect a branch to fail for all inputs, so an option would be useful to report this as an error.

Try it!

Try it out and let me know what you think. For example, try this:

1 3 4 - 2 2 - * /

Note that only the responsible code is underlined in this case.