Ja generation

aio

AI王 (AI king) is a Japanese quiz dataset developed for research and competition purposes.

References:

Hugging Face Dataset
AI王〜クイズAI日本一決定戦〜 (AI King: Japan's Quiz AI Championship)

local dataset_base_args = {
  class_path: 'HFGenerationDataset',
  init_args: {
    path: 'sbintuitions/aio-extended-answers',
    split: 'validation',
    reference_list_template: '{{ answers }}',
  },
};

{
  class_path: 'Generation',
  init_args: {
    eval_dataset: dataset_base_args,
    prompt_template: '{{ question }}答えは「',
    metrics: [
      {
        class_path: 'CharF1',
        init_args: {
          lm_output_processor: { class_path: 'AIONormalizer' },
          reference_processor: { class_path: 'AIONormalizer' },
        },
      },
      {
        class_path: 'ExactMatch',
        init_args: {
          lm_output_processor: { class_path: 'AIONormalizer' },
          reference_processor: { class_path: 'AIONormalizer' },
        },
      },
    ],
    gen_kwargs: { max_new_tokens: 64, stop_sequences: ['」'] },
    batch_size: 1,
  },
}
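
The prompt above ends with an opening bracket 「 and generation stops at the closing 」, so the text in between is taken as the answer. A minimal Python sketch of that truncation follows, assuming the earliest stop sequence wins. `truncate_at_stop`, the sample question, and the sample continuation are hypothetical and are not part of the library.

```python
def truncate_at_stop(text: str, stop_sequences: list[str]) -> str:
    """Cut text at the earliest occurrence of any stop sequence."""
    cut = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1:
            cut = min(cut, idx)
    return text[:cut]


# Hypothetical question, filled into the same shape as prompt_template.
prompt = "{question}答えは「".format(question="日本で一番高い山は？")

# Hypothetical raw model continuation of that prompt.
raw_continuation = "富士山」です。"

answer = truncate_at_stop(raw_continuation, ["」"])
print(prompt)
print(answer)
```

Only the text before the first 」 is kept, and that is the string the metrics receive after normalization.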

jcommonsenseqa

JCommonsenseQA is a Japanese version of CommonsenseQA, a multiple-choice question answering dataset that requires commonsense reasoning. The dataset was built by crowdsourcing, using seeds extracted from the knowledge base ConceptNet. In this setup, the model is shown the five choices in the prompt and generates the text of the answer it selects.

References:

Hugging Face Dataset
Original Repository
JGLUE: Japanese General Language Understanding Evaluation
JGLUE: 日本語言語理解ベンチマーク (JGLUE: Japanese Language Understanding Benchmark)

local dataset_base_args = {
  class_path: 'HFGenerationDataset',
  init_args: {
    path: 'sbintuitions/JCommonsenseQA',
    split: 'validation',
    reference_template: '{% set choices = [choice0, choice1, choice2, choice3, choice4] %}{{ choices[label] }}',
  },
};

local template_ = |||
  以下はタスクを説明する指示と、追加の背景情報を提供する入力の組み合わせです。要求を適切に満たす回答を書いてください。
  ### 指示
  質問と回答の選択肢を入力として受け取り、選択肢から回答を選択してください。回答の他には何も含めないことを厳守してください。

  ### 入力：
  質問：主に子ども向けのもので、イラストのついた物語が書かれているものはどれ？
  選択肢：世界,写真集,絵本,論文,図鑑
  ### 回答：
  絵本

  ### 入力：
  質問：未成年者を監護・教育し，彼らを監督し，彼らの財産上の利益を守る法律上の義務をもつ人は？
  選択肢：浮浪者,保護者,お坊さん,宗教者,預言者
  ### 回答：
  保護者

  ### 入力：
  質問：数字の１を表すときに使う体は？
  選択肢：胸,肉球,背中,人差し指,親指
  ### 回答：
  人差し指

  ### 入力：
  質問：火を起こすとあらわれるもくもくするものは？
  選択肢：歯の変色,ガス,中毒,爆発,煙
  ### 回答：
  煙

  ### 入力：
  質問：{{ question }}
  選択肢：{{ choice0 }},{{ choice1 }},{{ choice2 }},{{ choice3 }},{{ choice4 }}
  ### 回答：
|||;

{
  class_path: 'Generation',
  init_args: {
    eval_dataset: dataset_base_args,
    prompt_template: template_,
    metrics: [
      { class_path: 'ExactMatch' },
    ],
    gen_kwargs: { max_new_tokens: 64, stop_sequences: ['\n\n'] },
    batch_size: 1,
  },
}
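
The `reference_template` above collects the five choice fields into a list and indexes it with `label`, and the result is compared to the model output by exact match. A small Python sketch of the same logic follows. `build_reference` and `exact_match` are hypothetical helpers, and the record reuses the first few-shot example from the prompt with an assumed `label` of 2.

```python
def build_reference(record: dict) -> str:
    """Mirror the template: choices[label] over choice0..choice4."""
    choices = [record[f"choice{i}"] for i in range(5)]
    return choices[record["label"]]


def exact_match(output: str, reference: str) -> bool:
    """Strict string equality; any extra text makes the match fail."""
    return output == reference


record = {
    "question": "主に子ども向けのもので、イラストのついた物語が書かれているものはどれ？",
    "choice0": "世界",
    "choice1": "写真集",
    "choice2": "絵本",
    "choice3": "論文",
    "choice4": "図鑑",
    "label": 2,  # assumed index of the gold choice
}

reference = build_reference(record)
print(reference)
print(exact_match("絵本", reference))
print(exact_match("絵本です", reference))
```

Because the comparison is strict, the instruction in the prompt to output nothing but the answer matters for the score.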

jnli

JNLI is a Japanese version of the NLI (Natural Language Inference) dataset. The sentence pairs are extracted from image captions and annotated by crowd workers.

References:

Hugging Face Dataset
Original Repository
JGLUE: Japanese General Language Understanding Evaluation
JGLUE: 日本語言語理解ベンチマーク (JGLUE: Japanese Language Understanding Benchmark)

local dataset_base_args = {
  class_path: 'HFGenerationDataset',
  init_args: {
    path: 'llm-book/JGLUE',
    subset: 'JNLI',
    reference_template: "{{ ['\"含意\"', '\"矛盾\"', '\"中立\"'][label] }}",
    dataset_kwargs: { trust_remote_code: true },
  },
};

{
  class_path: 'Generation',
  init_args: {
    eval_dataset: dataset_base_args { init_args+: { split: 'validation' } },
    few_shot_generator: {
      class_path: 'BalancedFewShotGenerator',
      init_args: {
        dataset: dataset_base_args { init_args+: { split: 'train' } },
        num_shots: 3,
      },
    },
    prompt_template: |||
      前提と仮説の関係を「中立」、「含意」、「矛盾」の中から回答してください。
      {% for item in few_shot_data %}
      前提：「{{ item.sentence1 }}」
      仮説：「{{ item.sentence2 }}」
      関係：「{{ item.references[0] }}」
      {% endfor %}
      前提：「{{ sentence1 }}」
      仮説：「{{ sentence2 }}」
    ||| + '関係：「',
    metrics: [
      { class_path: 'ExactMatch' },
    ],
    gen_kwargs: { max_new_tokens: 6, stop_sequences: ['前提', '」'] },
  },
}
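
With three labels and `num_shots: 3`, a balanced few-shot sample can contain one example per label. The Python sketch below shows one plausible way to draw such a sample. It is not the implementation of `BalancedFewShotGenerator`; `balanced_sample`, the `reference` key, and the toy pool are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def balanced_sample(pool: list[dict], num_shots: int, seed: int = 0) -> list[dict]:
    """Draw the same number of examples for every label, then shuffle."""
    by_label = defaultdict(list)
    for example in pool:
        by_label[example["reference"]].append(example)

    labels = sorted(by_label)
    if num_shots % len(labels) != 0:
        raise ValueError("num_shots must be a multiple of the number of labels")

    per_label = num_shots // len(labels)
    rng = random.Random(seed)
    shots = []
    for label in labels:
        shots.extend(rng.sample(by_label[label], per_label))
    rng.shuffle(shots)  # avoid a fixed label order in the prompt
    return shots


# Hypothetical training pool: 12 sentence pairs, 4 per label.
pool = [
    {"sentence1": f"前提{i}", "sentence2": f"仮説{i}", "reference": label}
    for i, label in enumerate(["含意", "矛盾", "中立"] * 4)
]

shots = balanced_sample(pool, num_shots=3)
for shot in shots:
    print(shot["sentence1"], shot["sentence2"], shot["reference"])
```

Balancing keeps the few-shot block from favoring one label, which matters when the model tends to copy the most frequent answer it has just seen.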

jsquad

JSQuAD is a Japanese version of SQuAD, a reading comprehension dataset. The passages are extracted from Japanese Wikipedia, and the questions and answers are created by crowd workers.

References:

Hugging Face Dataset
Original Repository
JGLUE: Japanese General Language Understanding Evaluation
JGLUE: 日本語言語理解ベンチマーク (JGLUE: Japanese Language Understanding Benchmark)

local dataset_base_args = {
  class_path: 'HFGenerationDataset',
  init_args: {
    path: 'sbintuitions/JSQuAD',
    split: 'validation',
    reference_list_template: '{{ answers.text }}',
  },
};

local template_ = |||
  以下はタスクを説明する指示と、追加の背景情報を提供する入力の組み合わせです。要求を適切に満たす回答を書いてください。
  ### 指示
  質問に対する回答を文章から一言で抽出してください。回答は名詞で答えてください。 それ以外には何も含めないことを厳守してください。

  ### 入力：
  文章：聖武天皇 [SEP] 文武天皇の第一皇子として生まれたが、慶雲4年6月15日（707年7月18日）に7歳で父と死別、母・宮子も心的障害に陥ったため、その後は長らく会うことはなかった。物心がついて以後の天皇が病気の平癒した母との対面を果たしたのは齢37のときであった。このため、同年7月17日（707年8月18日）、父方の祖母・元明天皇（天智天皇皇女）が中継ぎの天皇として即位した。和銅7年6月25日（714年8月9日）には首皇子の元服が行われて同日正式に立太子されるも、病弱であったこと、皇親勢力と外戚である藤原氏との対立もあり、即位は先延ばしにされ、翌霊亀元年9月2日（715年10月3日）に伯母（文武天皇の姉）・元正天皇が「中継ぎの中継ぎ」として皇位を継ぐことになった。24歳のときに元正天皇より皇位を譲られて即位することになる。
  質問：文武天皇の第一皇子として生まれたのは？
  ### 回答：
  聖武天皇

  ### 入力：
  文章：通称 [SEP] 人名としての通称は通り名、二つ名、異名、などと呼ばれる事もある。近世までは、本名（実名）は「」と呼ばれ、公言は避ける習慣があった。そのため、人を呼ぶ時は「仮名」「字」などの通称、官職名を用いるのが一般的だった。今日でも「総理」「大臣」「社長」「専務」などと呼びかけに使うのがこれにあたる。
  質問：人名としての通称は何と呼ばれているか
  ### 回答：
  通り名、二つ名、異名

  ### 入力：
  文章：坂本龍一 [SEP] 2014年7月10日、所属事務所エイベックス・ミュージック・クリエイティヴから中咽頭癌であること、療養に専念するためにコンサート活動などを中止する旨が発表された。かつてはインタビューなどで度々自身の健康状態や体力に自信を表しており、コンサート等公演スケジュールを自身の健康に起因する理由でキャンセルしたことがなかった。
  質問：坂本龍一が療養に専念するためコンサート活動などを中止すると発表したのはいつか。
  ### 回答：
  2014年7月10日

  ### 入力：
  文章：リリーフ [SEP] プレッシャーの比較的かからない状態で投げることができるので、若手投手のテストの場としたり、故障明けや登板間隔の開いた投手を調整目的で登板させることもある。敗戦処理であっても好投すれば次回から先発や接戦での中継ぎに起用されるようになる場合もあり、幸い打線の援護を受けてチームが逆転すれば勝利投手に輝くこともある。
  質問：打線の援護を受けてチームが逆転するとどんな投手になる？
  ### 回答：
  勝利投手

  ### 入力：
  文章：{{ context }}
  質問：{{ question }}
  ### 回答：
|||;

{
  class_path: 'Generation',
  init_args: {
    eval_dataset: dataset_base_args,
    prompt_template: template_,
    metrics: [
      { class_path: 'CharF1' },
      { class_path: 'ExactMatch' },
    ],
    gen_kwargs: { max_new_tokens: 64, stop_sequences: ['\n\n'] },
    batch_size: 1,
  },
}
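
Since `reference_list_template` yields several acceptable answers per question, a partial-credit score can be taken as the best score over all references. The Python sketch below uses one common formulation of character-level F1, based on bag-of-characters overlap. The exact formula behind the `CharF1` metric is not stated here and may differ, so `char_f1` and `best_char_f1` are hypothetical stand-ins, and the reference list is made up from the second few-shot example.

```python
from collections import Counter


def char_f1(output: str, reference: str) -> float:
    """F1 over characters, counting each character with multiplicity."""
    overlap = sum((Counter(output) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(output)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def best_char_f1(output: str, references: list[str]) -> float:
    """Score against every reference and keep the highest value."""
    return max(char_f1(output, ref) for ref in references)


# Hypothetical reference list for one question.
references = ["通り名、二つ名、異名", "通り名"]

print(char_f1("通り名", references[0]))
print(best_char_f1("通り名", references))
```

A short answer that is fully contained in a longer reference gets full precision but low recall, which is why an exact-match metric is reported alongside it.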

mgsm_ja

Multilingual Grade School Math Benchmark (MGSM) is a benchmark of grade-school math problems. This setup uses the Japanese subset of the benchmark.

References:

Hugging Face Dataset
Language Models are Multilingual Chain-of-Thought Reasoners

local dataset_base_args = {
  class_path: 'HFGenerationDataset',
  init_args: {
    path: 'juletxara/mgsm',
    subset: 'ja',
    reference_template: '{{ answer_number }}',
  },
};

{
  class_path: 'Generation',
  init_args: {
    eval_dataset: dataset_base_args { init_args+: { split: 'test' } },
    few_shot_generator: {
      class_path: 'RandomFewShotGenerator',
      init_args: {
        dataset: dataset_base_args { init_args+: { split: 'train' } },
        num_shots: 4,
      },
    },
    prompt_template: |||
      {% for item in few_shot_data %}
      {{ item.question }}
      {{ item.answer }}
      {% endfor %}
      問題: {{ question }}
    ||| + 'ステップごとの答え:',
    metrics: [
      { class_path: 'ExactMatch', init_args: { lm_output_processor: { class_path: 'RegexExtractor', init_args: { pattern: '-?[0-9.,]+' } } } },
    ],
    gen_kwargs: { max_new_tokens: 256, stop_sequences: ['問題:'] },
  },
}
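
The model writes a step-by-step solution, so the final number has to be pulled out of the generated text before the exact-match comparison. The Python sketch below applies the same pattern, `-?[0-9.,]+`, and keeps the last match. Keeping the last match is an assumption about how the extractor behaves, and `extract_last_number` and the sample output are hypothetical.

```python
import re

# Same pattern as in the config: optional minus sign, then digits, dots, commas.
PATTERN = re.compile(r"-?[0-9.,]+")


def extract_last_number(text: str) -> str:
    """Return the last substring matching PATTERN, or '' if none is found."""
    matches = PATTERN.findall(text)
    return matches[-1] if matches else ""


# Hypothetical chain-of-thought output.
output = "彼女は毎日16 - 3 - 4 = 9個の卵を売ります。9 * 2 = 18なので、答えは18です。"

print(PATTERN.findall(output))
print(extract_last_number(output))
```

Two consequences of the pattern are worth knowing: a thousands separator is kept, so "1,000" would not equal a reference of "1000", and an ASCII period directly after the number is captured, so "The answer is 18." yields "18." rather than "18".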

wrime_pos_neg

WRIME (dataset of Writers’ and Readers’ Intensities of eMotion for their Estimation) is constructed by annotating Internet posts with both the writer’s subjective emotional intensity and the reader’s objective one. This setup converts the original dataset into binary sentiment classification.

References:

Hugging Face Dataset
Original Repository
WRIME: A New Dataset for Emotional Intensity Estimation with Subjective and Objective Annotations
A Japanese Dataset for Subjective and Objective Sentiment Polarity Classification in Micro Blog Domain

local dataset_base_args = {
  class_path: 'HFGenerationDataset',
  init_args: {
    path: 'llm-book/wrime-sentiment',
    reference_template: "{{ ['\"ポジティブ\"', '\"ネガティブ\"'][label] }}",
    dataset_kwargs: { trust_remote_code: true },
  },
};

{
  class_path: 'Generation',
  init_args: {
    eval_dataset: dataset_base_args { init_args+: { split: 'validation' } },
    few_shot_generator: {
      class_path: 'BalancedFewShotGenerator',
      init_args: {
        dataset: dataset_base_args { init_args+: { split: 'train' } },
        num_shots: 4,
      },
    },
    prompt_template: |||
      文の極性について「ポジティブ」か「ネガティブ」かで答えてください。
      {% for item in few_shot_data %}
      文：{{ item.sentence }}
      極性：「{{ item.references[0] }}」
      {% endfor %}
      文：{{sentence}}
    ||| + '極性：「',
    metrics: [
      { class_path: 'ExactMatch' },
    ],
    gen_kwargs: { max_new_tokens: 8, stop_sequences: ['」'] },
  },
}

xlsum_ja

XLSum is a comprehensive and diverse dataset comprising 1.35 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. This setup uses the Japanese subset of the dataset.

References:

Hugging Face Dataset
Original Repository
XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages

local dataset_base_args = {
  class_path: 'HFGenerationDataset',
  init_args: {
    path: 'csebuetnlp/xlsum',
    subset: 'japanese',
    reference_template: '{{ summary }}',
  },
};

{
  // Intended for LLMs with a short context window. Note: max_text_length and max_summary_length are not set in this config.
  class_path: 'Generation',
  init_args: {
    eval_dataset: dataset_base_args { init_args+: { split: 'validation' } },
    few_shot_generator: {
      class_path: 'BalancedFewShotGenerator',
      init_args: {
        dataset: dataset_base_args { init_args+: { split: 'train' } },
        num_shots: 1,
      },
    },
    prompt_template: |||
      文章を１〜３文で要約してください。
      {% for item in few_shot_data %}
      文章: {{ item.text }}
      要約: {{ item.references[0] }}
      {% endfor %}
      文章: {{ text }}
    ||| + '要約:',
    metrics: [
      {
        class_path: 'ROUGE',
        init_args: { tokenizer: { class_path: 'SacreBleuTokenizer', init_args: { name: 'ja-mecab' } } },
      },
    ],
    gen_kwargs: { max_new_tokens: 100, stop_sequences: ['\n'] },
  },
}