LM Benchmarks - Harbor

For Coding:

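HumanEval: 164 hand-written Python coding tasks, checked with unit tests (higher % = better)
MBPP: Mostly Basic Python Problems, about 1,000 entry-level Python programming problems
MultiPL-E: Multi-language coding (HumanEval and MBPP translated into other programming languages)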
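
The percentages for these coding benchmarks are usually reported as pass@k: the chance that at least one of k generated samples passes a problem's unit tests. Here is a minimal sketch of the commonly used unbiased estimator, assuming you already have per-problem counts of generated and passing samples. The function names `pass_at_k` and `benchmark_score` are illustrative and not part of any particular evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem
    c: number of those samples that passed the unit tests
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failing samples: every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_score(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds (n, c) per problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


if __name__ == "__main__":
    # Hypothetical counts for three problems, 10 samples each.
    demo = [(10, 5), (10, 10), (10, 0)]
    print(f"pass@1 = {benchmark_score(demo, k=1):.1%}")
```

With k=1 this reduces to the fraction of samples that pass, which is what a single "higher % = better" number typically reflects.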

For General Reasoning:

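MMLU: Multiple-choice general knowledge questions across 57 subjects (target: 60%+)
GSM8K: Grade-school math word problems
BBH: BIG-Bench Hard, 23 challenging tasks that test complex reasoning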
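
These reasoning benchmarks are typically scored as plain accuracy: the share of questions where the model's final answer matches the reference. The sketch below assumes answers have already been extracted as strings (for example a choice letter for MMLU or a final number for GSM8K) and compares the score with the 60% MMLU target listed above. The names `accuracy`, `meets_target`, and `MMLU_TARGET` are hypothetical.

```python
MMLU_TARGET = 0.60  # the "60%+" target from the list above


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Exact-match accuracy after trimming surrounding whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references differ in length")
    if not references:
        raise ValueError("no examples to score")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def meets_target(score: float, target: float = MMLU_TARGET) -> bool:
    """True when the score reaches the target threshold."""
    return score >= target


if __name__ == "__main__":
    # Hypothetical answers for four multiple-choice questions.
    preds = ["A", "B", "C", "D"]
    refs = ["A", "B", "C", "A"]
    score = accuracy(preds, refs)
    print(f"accuracy = {score:.0%}, meets target: {meets_target(score)}")
```

Real harnesses differ in how they extract and normalize answers, so scores are only comparable when the same prompt format and extraction rules are used.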