PairwiseJudge

assistant_judge_en_single_turn¶

This is a configuration for evaluting the quality of responses generated by an AI assistant. Originally used to generate scores for MT-bench or Vicuna-bench.

Adapted from lm-sys/FastChat.

{
  class_path: 'ChatLLMPairwiseJudge',
  init_args: {
    language_model: { class_path: 'OpenAIChatAPI', init_args: { model: 'gpt-4-turbo-2024-04-09' } },
    prompt_template: {
      class_path: 'Jinja2PromptTemplate',
      init_args: {
        template: std.stripChars(|||
          {% set question = model1_item["extra_info"]["messages"][0]["content"] -%}
          {% set model1_messages = model1_item["extra_info"]["messages"] -%}
          {% set model2_messages = model2_item["extra_info"]["messages"] -%}
          [Instruction]
          {% if references|length > 0 -%}
          Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. Your evaluation should consider correctness and helpfulness. You will be given a reference answer, assistant A's answer, and assistant B's answer. Your job is to evaluate which assistant's answer is better. Begin your evaluation by comparing both assistants' answers with the reference answer. Identify and correct any mistakes. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[1]]" if assistant 1 is better, "[[2]]" if assistant 2 is better, and "[[3]]" for a tie.
          {%- else -%}
          Please act as an impartial judge and evaluate the quality of the responses provided by two AI assistants to the user question displayed below. You should choose the assistant that follows the user's instructions and answers the user's question better. Your evaluation should consider factors such as the helpfulness, relevance, accuracy, depth, creativity, and level of detail of their responses. Begin your evaluation by comparing the two responses and provide a short explanation. Avoid any position biases and ensure that the order in which the responses were presented does not influence your decision. Do not allow the length of the responses to influence your evaluation. Do not favor certain names of the assistants. Be as objective as possible. After providing your explanation, output your final verdict by strictly following this format: "[[1]]" if assistant 1 is better, "[[2]]" if assistant 2 is better, and "[[3]]" for a tie.
          {%- endif %}

          [Question]
          {{ question }}

          {% if references|length > 0 -%}
          [The Start of Reference Answer]
          {{ references[0] }}
          [The End of Reference Answer]
          {% endif -%}
          [The Start of Assistant 1's Answer]
          {% if model1_messages|length == 1 %}{{ model1_item["lm_output"] }}{% else %}{{ model1_messages[1]["content"] }}{% endif %}
          [The End of Assistant's Answer]
          [The Start of Assistant 2's Answer]
          {% if model2_messages|length == 1 %}{{ model2_item["lm_output"] }}{% else %}{{ model2_messages[1]["content"] }}{% endif %}
          [The End of Assistant's Answer]
        |||, '\n'),
      },
    },
  },
}

assistant_judge_ja_single_turn¶

This is a configuration for evaluting the quality of responses generated by an AI assistant. Originally used to generate scores for the Japanese versions of MT-bench or Vicuna-bench.

Translated and adapted from lm-sys/FastChat.

{
  class_path: 'ChatLLMPairwiseJudge',
  init_args: {
    language_model: { class_path: 'OpenAIChatAPI', init_args: { model: 'gpt-4-turbo-2024-04-09' } },
    prompt_template: {
      class_path: 'Jinja2PromptTemplate',
      init_args: {
        template: std.stripChars(|||
          {% set question = model1_item["extra_info"]["messages"][0]["content"] -%}
          {% set model1_messages = model1_item["extra_info"]["messages"] -%}
          {% set model2_messages = model2_item["extra_info"]["messages"] -%}

          [ユーザの質問]
          {{ question }}

          {% if references|length > 0 -%}
          [参考回答の開始]
          {{ references[0] }}
          [参考回答の終了]
          {% endif -%}
          [アシスタント1の回答開始]
          {% if model1_messages|length == 1 %}{{ model1_item["lm_output"] }}{% else %}{{ model1_messages[1]["content"] }}{% endif %}
          [アシスタント1の回答終了]
          [アシスタント2の回答開始]
          {% if model2_messages|length == 1 %}{{ model2_item["lm_output"] }}{% else %}{{ model2_messages[1]["content"] }}{% endif %}
          [アシスタント2の回答終了]
        |||, '\n'),
      },
    },
    system_message: {
      class_path: 'Jinja2PromptTemplate',
      init_args: {
        template: std.stripChars(|||
          {% if references|length > 0 -%}
          あなたは、回答の質をチェックするための審判員です。以下に示されるユーザーの質問に対する2つのAIアシスタントの応答の品質を評価してください。回答の内容がユーザーの指示に従っており、ユーザーの質問によりよく答えているアシスタントを選んでください。参照回答、アシスタント1の回答、アシスタント2の回答が与えられるので、どちらのアシスタントの回答が優れているかを評価してください。評価の際には、まずそれぞれのアシスタントの回答を参照回答と比較し、回答の誤りを見つけて修正してください。立場が偏らないようにし、回答の提示順があなたの判断に影響しないようにしてください。回答の長さが評価に影響しないこと、特定のアシスタントの名前を好まないこと、できるだけ客観的であること、に気をつけてください。説明の後に、最終的な判断を以下の形式に従って出力してください：アシスタント1が優れていれば[[1]]、アシスタント2が優れていれば[[2]]、同点の場合は[[3]]
          {%- else -%}
          あなたは、回答の質をチェックするための審判員です。以下に示されるユーザーの質問に対する2つのAIアシスタントの応答の品質を評価してください。回答の内容がユーザーの指示に従っており、ユーザーの質問によりよく答えているアシスタントを選んでください。具体的には、回答の有用性、関連性、正確性、深さ、創造性、詳細レベルなどの要素を考慮する必要があります。評価の際には、まず2つの回答を比較し、簡単な説明をしてください。立場が偏らないようにし、回答の提示順があなたの判断に影響しないようにしてください。回答の長さが評価に影響しないこと、特定のアシスタントの名前を好まないこと、できるだけ客観的であること、に気をつけてください。説明の後に、最終的な判断を以下の形式に従って出力してください：アシスタント1が優れていれば[[1]]、アシスタント2が優れていれば[[2]]、同点の場合は[[3]]
          {%- endif %}
        |||, '\n'),
      },
    },
  },
}