# How to configure your own evaluation

## Overview
`flexeval` allows you to evaluate any language model with any task, any prompt, and any metric via the `flexeval_lm` command.
The CLI is built on top of jsonargparse, which allows flexible configuration either via CLI arguments or via a configuration file.
There are many ways to write configuration files, but for now let's see how to define a config for the `--eval_setup` argument.
You can check the configuration of a preset setup by running the following command:

```bash
flexeval_presets commonsense_qa
```

This command shows the configuration for the `commonsense_qa` setup.
The content is written in the jsonnet format, which is a superset of JSON.
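For example, jsonnet accepts everything JSON does while also allowing comments, local variables, and trailing commas. A minimal illustration (not tied to any particular flexeval class):

```jsonnet
// jsonnet is a superset of JSON: comments, local variables,
// and trailing commas are all valid on top of plain JSON.
local max_tokens = 32;
{
  "gen_kwargs": {"max_new_tokens": max_tokens},
}
```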
> **Tip:** If you want to convert it to JSON, install the `jsonnet` command and run `flexeval_presets commonsense_qa | jsonnet -`.
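To see which presets exist in the first place, running the command without arguments is assumed here to print the list of preset names; check `flexeval_presets --help` if your version behaves differently.

```bash
# List the available presets (assumed behavior; confirm with --help)
flexeval_presets
```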
The skeleton of the configuration is as follows:
```jsonnet
{
  "class_path": "Generation",
  "init_args": {
    "eval_dataset": {"class_path": "HFGenerationDataset", "init_args": ...},
    "prompt_template": {"class_path": "Jinja2PromptTemplate", "init_args": ...},
    "gen_kwargs": {"max_new_tokens": 32, "stop_sequences": ["」"]},
    "metrics": [{"class_path": "CharF1"}, {"class_path": "ExactMatch"}],
    "batch_size": 4
  }
}
```
The fields `class_path` and `init_args` directly mirror the initialization of the specified class.
At the top level, `"class_path": "Generation"` specifies what kind of `EvalSetup` to use.
Currently, there are four types of `EvalSetup`: `Generation`, `ChatResponse`, `MultipleChoice`, and `Perplexity`.
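To make the mirroring concrete, the sketch below spells out the correspondence for a single entry. `SomeMetric` and `some_arg` are hypothetical names used purely for illustration; the actual parameters of each class are listed in the API reference.

```jsonnet
// Mirrors the constructor call SomeMetric(some_arg=1), where SomeMetric
// and some_arg are hypothetical names used for illustration.
// Classes needing no arguments can be written as {"class_path": "ExactMatch"}.
{"class_path": "SomeMetric", "init_args": {"some_arg": 1}}
```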
A `Generation` setup is composed of the following components:
- `eval_dataset`: The dataset to evaluate. You can choose from concrete classes inheriting `GenerationDataset`. Most presets use `HFGenerationDataset`, which loads datasets from the Hugging Face Hub.
- `prompt_template`: The template that generates the prompts fed to the language model. We have `Jinja2PromptTemplate`, which uses Jinja2 to embed the data from `GenerationDataset` into the prompt (a sketch follows this list).
- `gen_kwargs`: The keyword arguments passed to `LanguageModel.batch_complete_text`. For example, `max_new_tokens` and `stop_sequences` are used to control the generation process. Acceptable arguments depend on the underlying implementation of the generation function (e.g., `generate()` in `transformers`).
- `metrics`: The metrics to compute. You can choose from concrete classes inheriting `Metric`. These modules take the outputs of the language model, the references, and dataset values, and compute the metrics.
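As a concrete sketch of the `prompt_template` component: assuming the dataset exposes a `question` field (a hypothetical name; use whatever fields your `GenerationDataset` actually provides) and that `Jinja2PromptTemplate` takes the template string via a `template` argument (check the API reference to confirm), a minimal entry could look like this:

```jsonnet
{
  "class_path": "Jinja2PromptTemplate",
  "init_args": {
    // "question" is a hypothetical dataset field used for illustration;
    // the "template" argument name is assumed -- see the API reference.
    "template": "Q: {{ question }}\nA:",
  },
}
```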
Please refer to the API reference for available classes and their arguments.
## Customizing the Configuration
Writing a configuration file from scratch is a bit cumbersome, so we recommend starting from the preset configurations and modifying them as needed.
```bash
flexeval_presets commonsense_qa > my_config.jsonnet
```
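You can then edit `my_config.jsonnet` locally. The snippet below is only a sketch of the kind of change you might make (tweaking `gen_kwargs` and `batch_size`); keep the other fields dumped from the preset in place rather than copying this verbatim.

```jsonnet
{
  "class_path": "Generation",
  "init_args": {
    // ... keep the other fields dumped from the preset ...
    "gen_kwargs": {"max_new_tokens": 64, "stop_sequences": ["\n"]},
    "batch_size": 8,
  },
}
```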
Then, pass your config file to the `--eval_setup` argument.
```bash
flexeval_lm \
  --language_model HuggingFaceLM \
  --language_model.model "sbintuitions/tiny-lm" \
  --eval_setup "my_config.jsonnet"
```
> **Info:** Under the hood, a preset name like `commonsense_qa` is resolved to the corresponding configuration file under `flexeval/preset_configs` in the library.
## Argument Overrides

jsonargparse allows you to flexibly combine configuration files and CLI arguments: values defined in a config file can be overridden by specifying them on the command line.
```bash
flexeval_lm \
  --language_model HuggingFaceLM \
  --language_model.model "sbintuitions/tiny-lm" \
  --eval_setup "commonsense_qa" \
  --eval_setup.batch_size 8
```
The value of `--eval_setup.batch_size` overrides the one defined in the `commonsense_qa` config file.
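Nested fields can be targeted the same way. For dict-valued fields such as `gen_kwargs`, passing a JSON value is assumed here to work, following jsonargparse's usual conventions; verify against `flexeval_lm --help` for your version.

```bash
# Override a nested dict field with a JSON value
# (syntax assumed from jsonargparse conventions; confirm with --help).
flexeval_lm \
  --language_model HuggingFaceLM \
  --language_model.model "sbintuitions/tiny-lm" \
  --eval_setup "commonsense_qa" \
  --eval_setup.gen_kwargs '{"max_new_tokens": 16}'
```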
## What's Next?
- Proceed to the How-to guides to find examples that suit your needs.
- Look at the API reference to see the available classes and their arguments.