Ja chat

aio_chat

AI王 (AI King) is a Japanese quiz dataset developed for research and competition purposes. This is an evaluation setup for chat LLMs.

References:

  • Hugging Face Dataset
  • AI王 〜クイズAI日本一決定戦〜 (AI King: Japan's Best Quiz AI Competition)
  • JAQKET: クイズを題材にした日本語 QA データセットの構築 (JAQKET: Building a Japanese QA Dataset from Quiz Questions)
    local dataset_base_args = {
      class_path: 'HFChatDataset',
      init_args: {
        path: 'llm-book/aio',
        input_template: '{{ question }}',
        reference_list_template: '{{ answers }}',
        dataset_kwargs: { trust_remote_code: true },
      },
    };
    
    {
      class_path: 'ChatResponse',
      init_args: {
        eval_dataset: dataset_base_args { init_args+: { split: 'validation' } },
        few_shot_generator: {
          class_path: 'RandomFewShotGenerator',
          init_args: {
            dataset: dataset_base_args { init_args+: { split: 'train' } },
            num_shots: 4,
          },
        },
        metrics: [
          {
            class_path: 'CharF1',
            init_args: {
              lm_output_processor: { class_path: 'AIONormalizer' },
              reference_processor: { class_path: 'AIONormalizer' },
            },
          },
          {
            class_path: 'ExactMatch',
            init_args: {
              lm_output_processor: { class_path: 'AIONormalizer' },
              reference_processor: { class_path: 'AIONormalizer' },
            },
          },
        ],
        gen_kwargs: { max_new_tokens: 32 },
        batch_size: 4,
      },
    }
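The CharF1 metric above scores answers at the character level, which tolerates minor surface variation in Japanese answers where ExactMatch would score zero. A minimal sketch of character-level F1, assuming a bag-of-characters overlap (the library's normalization via AIONormalizer and its exact scoring details may differ):

```python
from collections import Counter

def char_f1(prediction: str, reference: str) -> float:
    """Character-level F1: harmonic mean of precision and recall
    over the multiset of characters shared by the two strings."""
    overlap = sum((Counter(prediction) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(prediction)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```

For example, `char_f1("徳川家康公", "徳川家康")` still earns partial credit (8/9) for the extra honorific character, while an exact match would fail.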
    

elyza_tasks_100

A dataset developed by ELYZA Inc. for evaluating instruction-tuned models.

References:

  • Hugging Face Dataset
  • Official blog (公式ブログ)
    {
      class_path: 'ChatResponse',
      init_args: {
        eval_dataset: {
          class_path: 'HFChatDataset',
          init_args: {
            path: 'elyza/ELYZA-tasks-100',
            split: 'test',
            input_template: '{{ input }}',
            reference_template: '{{ output }}',
            extra_info_templates: { eval_aspect: '{{ eval_aspect }}' },
          },
        },
        metrics: [
          { class_path: 'OutputLengthStats' },
        ],
        gen_kwargs: { max_new_tokens: 1024 },
        batch_size: 4,
      },
    }
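Because ELYZA-tasks-100 responses are open-ended and typically graded by a human or LLM judge against each sample's eval_aspect, the only automatic metric here is OutputLengthStats. A sketch of what such a metric aggregates, assuming character counts (the library may report different fields or count tokens instead):

```python
def output_length_stats(outputs: list[str]) -> dict[str, float]:
    """Aggregate simple length statistics over generated outputs.
    Hypothetical re-implementation for illustration only."""
    lengths = [len(o) for o in outputs]
    return {
        "avg_output_length": sum(lengths) / len(lengths),
        "max_output_length": float(max(lengths)),
        "min_output_length": float(min(lengths)),
    }
```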
    

mgsm_ja_chat

Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems. This is an evaluation setup for chat LLMs on the Japanese subset of the benchmark.

References:

  • Hugging Face Dataset
  • Language Models are Multilingual Chain-of-Thought Reasoners
    local dataset_base_args = {
      class_path: 'HFChatDataset',
      init_args: {
        path: 'juletxara/mgsm',
        subset: 'ja',
        reference_template: '{{ answer }}',
      },
    };
    
    {
      class_path: 'ChatResponse',
      init_args: {
        eval_dataset: dataset_base_args { init_args+: { split: 'test', input_template: '問題: {{ question }}' } },
        few_shot_generator: {
          class_path: 'RandomFewShotGenerator',
          init_args: {
            dataset: dataset_base_args { init_args+: { split: 'train', input_template: '{{ question }}' } },
            num_shots: 4,
          },
        },
        metrics: [
          { class_path: 'ExactMatch', init_args: { lm_output_processor: { class_path: 'RegexExtractor', init_args: { pattern: '-?[0-9.,]+' } } } },
        ],
        gen_kwargs: { max_new_tokens: 256 },
      },
    }
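Here the RegexExtractor pulls the numeric answer out of the model's chain-of-thought text before ExactMatch compares it to the reference. A sketch of that extraction step with Python's re module, assuming the extractor keeps the last match in the output (the library's tie-breaking rule may differ):

```python
import re

# Mirrors the config: an optional minus sign, then digits, dots, commas.
PATTERN = re.compile(r"-?[0-9.,]+")

def extract_answer(lm_output: str) -> str:
    """Return the last numeric span in the output, or '' if none."""
    matches = PATTERN.findall(lm_output)
    return matches[-1] if matches else ""
```

Taking the last match is what makes chain-of-thought work with ExactMatch: intermediate numbers in the reasoning are ignored, and only the final stated answer is compared.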
    

mt-ja

Multi-Turn Benchmark for large language models in Japanese.

References:

  • Data Source
    {
      class_path: 'ChatResponse',
      init_args: {
        eval_dataset: {
          class_path: 'ChatbotBench',
          init_args: {
            path_or_name: 'mt-ja',
            ref_path_or_name: 'mt-ja-ref-gpt4',
          },
        },
        metrics: [
          { class_path: 'OutputLengthStats' },
        ],
        gen_kwargs: { max_new_tokens: 1024 },
        batch_size: 4,
      },
    }
    

rakuda-v2-ja

The Rakuda benchmark consists of 40 Japanese questions about Japan-specific topics, designed to evaluate the capabilities of AI assistants in Japanese.

References:

  • Original Repository
  • Hugging Face Dataset
    {
      class_path: 'ChatResponse',
      init_args: {
        eval_dataset: {
          class_path: 'ChatbotBench',
          init_args: {
            path_or_name: 'rakuda-v2-ja',
          },
        },
        metrics: [
          { class_path: 'OutputLengthStats' },
        ],
        gen_kwargs: { max_new_tokens: 1024 },
        batch_size: 4,
      },
    }
    

vicuna-ja

Vicuna Benchmark for large language models in Japanese.

References:

  • Data Source
    {
      class_path: 'ChatResponse',
      init_args: {
        eval_dataset: {
          class_path: 'ChatbotBench',
          init_args: {
            path_or_name: 'vicuna-ja',
            ref_path_or_name: 'vicuna-ja-ref-gpt4',
          },
        },
        metrics: [
          { class_path: 'OutputLengthStats' },
        ],
        gen_kwargs: { max_new_tokens: 1024 },
        batch_size: 4,
      },
    }