Revision: 15/02/2025
Benchmark Command¶
The benchmark command runs OpenBench LLM benchmarking and evaluation.
Usage¶
Subcommands¶
list¶
List available benchmark categories.
Options:
- --details: Show detailed info.
- -f, --format <type>: Output format (json, table, markdown).
Categories:
- knowledge: General knowledge (MMLU, TriviaQA)
- coding: Code generation (HumanEval, MBPP)
- math: Mathematical reasoning (GSM8K, MATH)
- reasoning: Logic and deduction (ARC, HellaSwag)
- cybersecurity: Security-related tasks
- search: Retrieval and search quality
run¶
Run a standardized LLM benchmark.
Options:
- -c, --category <name>: Benchmark category (default: search).
- -p, --provider <name>: LLM provider: groq, openai, anthropic (default: groq).
- -m, --model <name>: Model name (default: llama-3.1-8b-instant).
- -n, --samples <number>: Max samples to evaluate (default: 100).
- --no-cache: Disable result caching.
- -f, --format <type>: Output format.
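To make the defaults of the run options concrete, here is a minimal Python sketch that mirrors them with argparse. It is an illustration only, not the command's actual implementation: the function name `build_run_parser` and the `prog` string are hypothetical, and because the accepted values for `--format` are not stated for this subcommand, the sketch leaves that option unconstrained.

```python
import argparse


def build_run_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented `run` options and defaults."""
    parser = argparse.ArgumentParser(prog="benchmark run")
    parser.add_argument("-c", "--category", default="search",
                        help="Benchmark category")
    parser.add_argument("-p", "--provider", default="groq",
                        choices=["groq", "openai", "anthropic"],
                        help="LLM provider")
    parser.add_argument("-m", "--model", default="llama-3.1-8b-instant",
                        help="Model name")
    parser.add_argument("-n", "--samples", type=int, default=100,
                        help="Max samples to evaluate")
    parser.add_argument("--no-cache", action="store_true",
                        help="Disable result caching")
    # Accepted values are not documented for `run`, so none are enforced here.
    parser.add_argument("-f", "--format", default=None,
                        help="Output format")
    return parser


if __name__ == "__main__":
    # Override two options; the rest fall back to the documented defaults.
    args = build_run_parser().parse_args(["-c", "coding", "-n", "50"])
    print(args.category, args.provider, args.model, args.samples, args.no_cache)
```

Under these assumptions, omitting every flag yields the search category on groq with llama-3.1-8b-instant, 100 samples, and caching enabled.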
corpus¶
Evaluate a specific Deposium corpus for retrieval quality with custom query-document pairs.
Options:
- -t, --tenant <id>: Tenant ID.
- -s, --space <id>: Space ID.
- -q, --queries <file>: JSON file with query-document pairs.
- -p, --provider <name>: LLM provider.
- -m, --model <name>: Model name.
compare¶
Compare benchmark results across multiple models.
Options:
- --models <list>: Comma-separated list of model names (e.g., model1,model2).
- -c, --category <name>: Filter by category.
- -n, --samples <number>: Samples limit.
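The comma-separated value passed to --models can be split into individual model names along the following lines. This is a sketch under assumptions: the helper name `parse_models` is hypothetical, and trimming whitespace and dropping empty entries are choices made here, not behavior confirmed for the actual command.

```python
def parse_models(value: str) -> list[str]:
    """Split a comma-separated --models value into a list of model names.

    Assumption: surrounding whitespace is ignored and empty entries
    (for example from a trailing comma) are dropped.
    """
    return [name.strip() for name in value.split(",") if name.strip()]


if __name__ == "__main__":
    print(parse_models("model1,model2"))
```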
Query Format for Corpus Benchmark¶
queries.json should look like: