Evaluate with LLM Judges

Evaluating chat models is difficult because their responses are open-ended and manually reviewing every response does not scale. One solution is to use an LLM as the evaluator.

Single Judge Evaluation

First, we need to generate responses from the chat model. In this example, we use ChatGPT with the following command:

export OPENAI_API_KEY="YOUR_API_KEY"

flexeval_lm \
  --language_model OpenAIChatAPI \
  --language_model.model "gpt-3.5-turbo" \
  --eval_setup "mt-en" \
  --save_dir "results/mt-en_gpt3.5-turbo"

Now you have the model outputs in results/mt-en_gpt3.5-turbo/outputs.jsonl.
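
If you want a quick look at what was generated, you can pretty-print the first record with jq (the exact fields depend on the eval setup):

head -n 1 results/mt-en_gpt3.5-turbo/outputs.jsonl | jq .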

Let's evaluate the responses with GPT-4. The LLM evaluation is implemented as a Metric class, and here we will use a preset metric named assistant_eval_en_single_turn. You can check its configuration with the following command:

flexeval_presets assistant_eval_en_single_turn

In this metric, GPT-4 is asked to rate each response with a score from 1 to 10. The score is extracted as the last digit found in the evaluator's output.
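
For intuition only (this is not flexeval's actual parsing code), picking up a trailing score from a free-form judge output can be sketched with standard shell tools:

# Hypothetical judge output; the trailing "8" is what would be picked up as the score
echo "The answer is accurate and well structured. Rating: 8" | grep -oE '[0-9]+' | tail -n 1   # prints 8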

Tip

To take a closer look at the prompt template, pipe the preset through jsonnet and jq:

flexeval_presets assistant_eval_en_single_turn | jsonnet - | jq -r ".init_args.prompt_template.init_args.template"

Perform the automatic evaluation with GPT-4 using the following command:

flexeval_file \
   --eval_file "results/mt-en_gpt3.5-turbo/outputs.jsonl" \
   --metrics "assistant_eval_en_single_turn" \
   --save_dir "results/mt-en_gpt3.5-turbo/eval_by_gpt"

☕️ It may take a while to finish the evaluation...

Run cat results/mt-en_gpt3.5-turbo/eval_by_gpt/metrics.json and you will see an evaluation result like {"llm_score": 7.795}. The evaluation for each response is stored in results/mt-en_gpt3.5-turbo/eval_by_gpt/outputs.jsonl.
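
As a quick sanity check, the evaluation is stored per response, so the line counts of the two files should match:

wc -l results/mt-en_gpt3.5-turbo/outputs.jsonl results/mt-en_gpt3.5-turbo/eval_by_gpt/outputs.jsonl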

You can check the output of the evaluator LLM in the llm_score_output field.

head -n 1 results/mt-en_gpt3.5-turbo/eval_by_gpt/outputs.jsonl | jq -r ".llm_score_output"

Info

flexeval_file runs the same evaluation as flexeval_lm, but on an existing output file. So, in principle, you can perform the same evaluation with flexeval_lm in one go:

flexeval_lm \
  --language_model OpenAIChatAPI \
  --language_model.model "gpt-3.5-turbo" \
  --eval_setup "mt-en" \
  --metrics+="assistant_eval_en_single_turn" \
  --save_dir "results/mt-en_gpt3.5-turbo"

Still, we recommend separating the response generation (flexeval_lm) from the evaluation (flexeval_file) so that you don't lose the responses because of errors in the evaluation process.
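
If you still want a single entry point, a minimal sketch is a small shell script that runs the two stages back to back and stops on the first error, so a failure in the judge step never touches the generated outputs (paths and model names follow the example above):

#!/bin/bash
set -euo pipefail  # abort on the first error instead of judging partial outputs

# Stage 1: generate responses and save them to disk
flexeval_lm \
  --language_model OpenAIChatAPI \
  --language_model.model "gpt-3.5-turbo" \
  --eval_setup "mt-en" \
  --save_dir "results/mt-en_gpt3.5-turbo"

# Stage 2: judge the saved responses; rerunning only this command re-reads the file from stage 1
flexeval_file \
  --eval_file "results/mt-en_gpt3.5-turbo/outputs.jsonl" \
  --metrics "assistant_eval_en_single_turn" \
  --save_dir "results/mt-en_gpt3.5-turbo/eval_by_gpt"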

Pairwise Judge Evaluation

Sometimes, evaluating chat models individually cannot capture subtle differences between models. In such cases, pairwise evaluation is useful.

The overall process is to generate the responses with flexeval_lm and then compare them with flexeval_pairwise.

In this example, we will compare the responses from GPT-3.5 and GPT-4o.

First, generate the responses with GPT-3.5. You can skip this step if you have already generated them.

export OPENAI_API_KEY="YOUR_API_KEY"

flexeval_lm \
  --language_model OpenAIChatAPI \
  --language_model.model "gpt-3.5-turbo" \
  --eval_setup "mt-en" \
  --save_dir "results/mt-en_gpt3.5-turbo"

Generate the responses with GPT-4o.

flexeval_lm \
  --language_model OpenAIChatAPI \
  --language_model.model "gpt-4o" \
  --eval_setup "mt-en" \
  --save_dir "results/mt-en_gpt-4o"
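
If you want to see how the judge will be prompted before running the comparison, you can try inspecting the judge preset used below in the same way as the metric preset:

flexeval_presets assistant_judge_en_single_turn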

Now, compare the responses with GPT-4.

flexeval_pairwise \
  --lm_output_paths.gpt_3_5 "results/mt-en_gpt3.5-turbo/outputs.jsonl"  \
  --lm_output_paths.gpt_4o "results/mt-en_gpt-4o/outputs.jsonl"  \
  --judge "assistant_judge_en_single_turn" \
  --save_dir "results/mt-en_gpt3.5_vs_gpt4o"

☕️ It may take a while to finish the evaluation...

You can see the result in results/mt-en_gpt3.5_vs_gpt4o/scores.json.

{
    "win_rate": {
        "gpt_4o": 85.3125,
        "gpt_3_5": 14.6875
    },
    "bradley_terry": {
        "gpt_4o": 1152.8129577165919,
        "gpt_3_5": 847.1870422834081
    }
}

The win_rate shows the percentage of wins for each model, and bradley_terry shows each model's strength estimated with the Bradley-Terry model from the pairwise comparisons.
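
Since scores.json is plain JSON, pulling the numbers into another script is straightforward with jq (keys as shown above):

jq ".win_rate" results/mt-en_gpt3.5_vs_gpt4o/scores.json
jq -r ".bradley_terry.gpt_4o" results/mt-en_gpt3.5_vs_gpt4o/scores.json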

What's Next?