I call it the balance arena. Come up with a better name if you like. The idea is that you create small, organized mazes with triggers at either end. These triggers would basically represent stairs. The mazes would be able to take many shapes - long corridors, random trees, etc. They are filled with varying numbers of creatures at, above, or below your level, as desired, and you fight them. Once you've killed them all, you hit the trigger at the other end, and they respawn. This continues until you die or stop. No loot drops and your character never gains experience (unless you want them to, of course). You would begin with a specific build, including stats and gear, and just gather data.
You count all the turns that happen, and measure things like the following:
-How much damage is dealt overall, and the damage per turn
-How much healing happens, of what type, and how much per turn
-For example, how much healing came from infusions and how much came from skills.
-How often you activated abilities, counting sustained and activated abilities separately
-How many enemies were killed with what skills
-How many enemies died with what debuffs on them
-How much faster enemies died with debuffs than without
-How many turns a character lasts, on average
-How many times a class dies to a certain boss, over 150 tries
-Freaking anything else you can think of
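To make the list above a bit more concrete, here is a rough sketch of what a per-session tracker could look like. It is written in Python purely for illustration; the class, method, and source names (`ArenaStats`, `record_heal`, "infusion", and so on) are all made up for this example and are not taken from the game's code.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field


@dataclass
class ArenaStats:
    """Running totals for one arena session (every name here is hypothetical)."""
    turns: int = 0
    damage_dealt: float = 0.0
    # Healing split by where it came from, e.g. "infusion" or "skill".
    healing_by_source: defaultdict = field(
        default_factory=lambda: defaultdict(float))
    # How many enemies each skill finished off.
    kills_by_skill: Counter = field(default_factory=Counter)
    # How many enemies died with each debuff on them.
    kills_by_debuff: Counter = field(default_factory=Counter)

    def end_turn(self):
        self.turns += 1

    def record_damage(self, amount):
        self.damage_dealt += amount

    def record_heal(self, source, amount):
        self.healing_by_source[source] += amount

    def record_kill(self, skill, debuffs=()):
        self.kills_by_skill[skill] += 1
        for debuff in debuffs:
            self.kills_by_debuff[debuff] += 1

    def damage_per_turn(self):
        # Avoid dividing by zero before the first turn has ended.
        return self.damage_dealt / self.turns if self.turns else 0.0

    def healing_per_turn(self):
        total = sum(self.healing_by_source.values())
        return total / self.turns if self.turns else 0.0


# Two turns of made-up combat.
stats = ArenaStats()
stats.record_damage(30)
stats.record_heal("infusion", 10)
stats.end_turn()
stats.record_damage(50)
stats.record_kill("melee_attack", debuffs=["stun"])
stats.end_turn()
print(stats.damage_per_turn())   # 40.0
print(stats.healing_per_turn())  # 5.0
```

The game would call the `record_*` methods from wherever damage, healing, and deaths are already resolved, and the averages fall out at the end of the run.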
Here are some scenarios that I'm envisioning, just as examples of the kind of data you could gather.
-Rogues seem really weak to start out, so you could test various builds at level 3, then drop Bill in and fight him 50 times, or something, and see how they fare compared to other classes.
-Sun paladins seem weak after level 30, in my experience. Create a bunch of level 35 paladins and test them for a while until you see what builds are strong or weak, then buff or nerf accordingly.
-Test every class against 100 enemies at your level, then drop in Bill. You gain experience, and fight Bill at the level you reach. This helps you test the exp penalty vs. the benefits in the early game. Or, just test certain boss fights that appear late in the game with, say, 1,000,000 experience or something to level up with. You could see how a Higher with a 15% exp penalty fares vs. a Cornac with no penalty, given the extra levels the Cornac will have.
-People complain about skeleton mages. Test each class at level 5 against wave after wave of those, and see what succeeds and what doesn't.
-Test each class against random enemies 3 levels higher, then test them all against single types of enemy 3 levels higher, one wave at a time. See which classes fare better in crazy situations.
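The repeated-fight scenarios above all boil down to the same loop: run one fight many times and count how it went. A minimal sketch in Python, assuming a hypothetical `fight` function that plays out one encounter and returns True if the character survived. The `coin_flip_fight` below is only a placeholder so the example runs; the real arena would run actual combat there.

```python
import random


def run_trials(fight, tries, seed=0):
    """Run `fight` `tries` times and summarise wins and deaths.

    `fight` is a hypothetical callable that takes a random.Random and
    returns True if the character won the encounter.
    """
    if tries <= 0:
        raise ValueError("tries must be positive")
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    wins = sum(1 for _ in range(tries) if fight(rng))
    return {
        "tries": tries,
        "wins": wins,
        "deaths": tries - wins,
        "win_rate": wins / tries,
    }


def coin_flip_fight(rng):
    # Placeholder only: stands in for a real fight against, say, Bill.
    return rng.random() < 0.5


result = run_trials(coin_flip_fight, tries=50)
print(result)
```

Swapping in a different build, class, or boss just means passing a different `fight` function, so the same summary can be compared across all of them.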
To sum up, just imagine, when some newb complains about Sun Paladins being underpowered, instead of going, "No they're not!", saying something like, "Well, they last on average 10% longer than half the classes, do 5% less damage, but defend against 10% more attacks and damage, at level 40. Also, I've run them through the end game fight 25 times in greens, and won 22 times."
This is just a rough outline of an idea. If anyone likes the concept, it could certainly be improved.
EDIT - One last thing. It's probably a good idea to make sure that your testers meet a minimum skill level, such as people who've won the game, to make sure that your data is accurate.