For Coding:
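- HumanEval: 164 hand-written Python coding tasks, scored by the share of problems whose generated code passes the unit tests (higher % = better)
- MBPP: Mostly Basic Python Problems; roughly 1,000 short, entry-level Python programming problems with test cases
- MultiPL-E: Multi-language coding; HumanEval and MBPP translated into other programming languages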
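Coding benchmarks such as HumanEval are usually reported as pass@k: the chance that at least one of k sampled solutions passes the tests. The sketch below uses the commonly cited unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them pass. The function names `pass_at_k` and `benchmark_score` and the sample counts are illustrative assumptions, not part of any official harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed the tests, k: sampling budget.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over all problems, as a percentage.

    results holds one (n, c) pair per problem (hypothetical input format).
    """
    if not results:
        raise ValueError("results must not be empty")
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Made-up counts for three problems, 10 samples each.
    demo = [(10, 5), (10, 10), (10, 0)]
    print(f"pass@1 = {benchmark_score(demo, k=1):.1f}%")
```

With k = 1 the estimator reduces to c / n per problem, so the reported percentage is simply the average fraction of samples that pass.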
For General Reasoning:
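- MMLU: General knowledge; multiple-choice questions across 57 subjects (target: 60%+ accuracy)
- GSM8K: Grade-school math word problems that need multi-step arithmetic
- BBH: BIG-Bench Hard; complex multi-step reasoning tasks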