En chat
mt-en
Multi-Turn Benchmark for large language models.
References:
- Data Source
- Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
```jsonnet
{
  class_path: 'ChatResponse',
  init_args: {
    eval_dataset: {
      class_path: 'ChatbotBench',
      init_args: {
        path_or_name: 'mt-en',
        ref_path_or_name: 'mt-en-ref-gpt4',
      },
    },
    metrics: [
      { class_path: 'OutputLengthStats' },
    ],
    gen_kwargs: { max_new_tokens: 1024 },
    batch_size: 4,
  },
}
```
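If you want to run the benchmark with different generation settings, a minimal sketch is to merge an override on top of the preset. This assumes the preset above is saved locally as `mt-en.jsonnet` (the actual file location inside the package is not shown here), and the override values are illustrative:

```jsonnet
// Merge an override onto the preset object. Jsonnet's `+:` keeps all other
// fields from the preset and only replaces the ones listed here.
(import 'mt-en.jsonnet') + {
  init_args+: {
    gen_kwargs+: { max_new_tokens: 512 },  // shorter completions than the preset's 1024
    batch_size: 8,                         // illustrative value, not part of the preset
  },
}
```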
vicuna-en
Vicuna Benchmark for large language models.
References:
- Data Source
```jsonnet
{
  class_path: 'ChatResponse',
  init_args: {
    eval_dataset: {
      class_path: 'ChatbotBench',
      init_args: {
        path_or_name: 'vicuna-en',
        ref_path_or_name: 'vicuna-en-ref-gpt4',
      },
    },
    metrics: [
      { class_path: 'OutputLengthStats' },
    ],
    gen_kwargs: { max_new_tokens: 1024 },
    batch_size: 4,
  },
}
```
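The two presets share the same `ChatResponse` structure and differ only in the dataset and reference names. As a sketch of that shared shape, the Jsonnet below factors it into a local helper; the function name and the idea of bundling both setups into one file are illustrative, not how the package organizes its presets:

```jsonnet
// Hypothetical helper that parameterizes the common ChatResponse setup.
local chatResponseSetup(name, refName) = {
  class_path: 'ChatResponse',
  init_args: {
    eval_dataset: {
      class_path: 'ChatbotBench',
      init_args: {
        path_or_name: name,
        ref_path_or_name: refName,
      },
    },
    metrics: [
      { class_path: 'OutputLengthStats' },
    ],
    gen_kwargs: { max_new_tokens: 1024 },
    batch_size: 4,
  },
};

// The two presets, expressed as instances of the same template.
{
  'mt-en': chatResponseSetup('mt-en', 'mt-en-ref-gpt4'),
  'vicuna-en': chatResponseSetup('vicuna-en', 'vicuna-en-ref-gpt4'),
}
```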