How do you evaluate agent systems beyond "did it produce correct output"?

Output quality is only part of the picture; the quality of the system's design matters too. We're looking for evaluation frameworks that capture elegance, maintainability, and composability in addition to correctness.