For Coding:

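  • HumanEval: 164 hand-written Python function-completion tasks, checked by unit tests and usually reported as pass@1 (higher % = better)
  • MBPP: Mostly Basic Python Problems, about 1,000 entry-level Python tasks, each with test cases
  • MultiPL-E: HumanEval and MBPP translated into many other programming languages, for multi-language coding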
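
Coding benchmark percentages such as HumanEval's are commonly reported as pass@k: the chance that at least one of k sampled solutions passes all unit tests. The list above does not say which metric is meant, so treat the following as a minimal sketch of the standard unbiased estimator rather than the scoring used by any particular leaderboard. The function name and arguments are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed all tests
    k: number of attempts the metric allows
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples failed, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical numbers: 10 samples, 3 of which passed.
print(f"pass@1 = {pass_at_k(10, 3, 1):.2f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.2f}")
```

The benchmark score is the average of this value over all problems, so a model with a higher percentage solves more tasks within the allowed number of attempts.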

For General Reasoning:

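  • MMLU: Multiple-choice questions across 57 subjects, testing general knowledge (target: 60%+)
  • GSM8K: Grade-school math word problems that need multi-step arithmetic
  • BBH: BIG-Bench Hard, a set of 23 difficult BIG-Bench tasks that test complex, multi-step reasoning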