From 19228600f14eea433c54e17c164c4efe3a029d77 Mon Sep 17 00:00:00 2001
From: haoyuren <13851610112@163.com>
Date: Fri, 4 Jul 2025 03:17:39 -0700
Subject: Add GenderBench for group entropy equalization research

- Integrated GenderBench evaluation suite for gender bias testing
- Added modified MBPP.py for enhanced code evaluation
- Setup complete for implementing gender debiasing through entropy minimization
---
 genderbench/README.md | 255 ++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 255 insertions(+)
 create mode 100644 genderbench/README.md

diff --git a/genderbench/README.md b/genderbench/README.md
new file mode 100644
index 0000000..1c65074
--- /dev/null
+++ b/genderbench/README.md
@@ -0,0 +1,255 @@

# GenderBench - Evaluation suite for gender biases in LLMs

`GenderBench` is an evaluation suite designed to measure and benchmark gender biases in large language models. It uses a variety of tests, called **probes**, each targeting a specific type of unfair behavior. Our goal is to cover as many types of unfair behavior as possible.

This project has two purposes:

1. **To publish the results we measured for various LLMs.** Our goal is to inform about the state of the field and raise awareness about the gender-related issues that LLMs have.

2. **To allow researchers to run the benchmark on their own LLMs.** Our goal is to make research in the area easier and more reproducible. `GenderBench` can serve as a base for pursuing various fairness-related research questions.

The probes we provide here are often inspired by existing published scientific methodologies. Our philosophy when creating the probes is to prefer quality over quantity, i.e., we carefully vet the data and evaluation protocols to ensure high reliability.

## ⚠️ Report

↗ GenderBench Report 1.1 is available here.

This is the current version of the **GenderBench Report**, summarizing the results for a selected set of 12 LLMs with the most recent version of `GenderBench`.

## Documentation

↗ Documentation.

This is the developer documentation; it can help you run the code and implement additional probes.

## Licensing & Fair Use

Read our full [`LICENSE`](https://github.com/matus-pikuliak/genderbench/blob/main/LICENSE) before using or sharing this repository.

- The **code** in this repository is licensed under the MIT License.
- Some **resources** in the `src/genderbench/resources` folder are used under **fair use** for research and educational purposes. See the appropriate `FAIR_USE.md` files for details.
- Some **resources** in the `src/genderbench/resources` folder are licensed under various additional licenses. See the appropriate `LICENSE` files.

**Do not use or redistribute** the `resources` folder unless you verify that you comply with applicable laws.

## Usage

This section is for researchers who want to run `GenderBench` themselves. `GenderBench` can evaluate an arbitrary text generator, i.e., anything that implements a `generate(texts: list[str]) -> list[str]` method (a minimal generator sketch is shown after the installation steps).

### Installation

1. Install the package from this repository:

```
pip install genderbench
```

2. Optionally, set the environment variables. Default values are shown below:

```
# Directory where the logs are stored. Logs include probing results.
export GENDER_BENCH_LOG_DIR=logs/
```
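For reference, here is a minimal sketch of what such a generator can look like. This toy class is not part of the `GenderBench` API (the package's own `RandomGenerator` is used in the probe example below); all it does is satisfy the `generate` contract described above:

```python
# Toy generator satisfying the `generate(texts: list[str]) -> list[str]`
# contract described above. Hypothetical example, not part of GenderBench;
# replace the body with real model or API calls.
class EchoGenerator:
    def generate(self, texts: list[str]) -> list[str]:
        return [f"Echo: {text}" for text in texts]
```

Anything shaped like this, from a local model wrapper to an API client, can be passed to the probes and harnesses below.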
### Probes

Each `Probe` is designed to measure one or more types of harmful behavior. Here is a snippet showing how to use one of the existing probes to measure the intensity of harmful behavior. `GestCreativeProbe` analyzes the genders of novel characters written by the model and whether they agree with gender-stereotypical reasoning (e.g., a character that is supposed to be emotional is written as a woman).

```python
>>> from genderbench.generators.random import RandomGenerator
>>> from genderbench.probes import GestCreativeProbe

>>> generator = RandomGenerator(["He was a punk", "She did ballet"])
>>> probe = GestCreativeProbe(num_repetitions=10)
>>> marks, metrics = probe.run(generator)
>>> print(marks)

{
    'stereotype_rate': {
        'mark': 0,
        'value': (-0.000424083125234594, 0.021459989952670712),
        'description': 'The extent to which protagonists generated by the model based on specific traits reflect stereotypical norms.',
        'harm_types': ['Stereotypical reasoning'],
        'mark_ranges': {0: [(-1, 0.05)], 1: [(0.05, 0.15)], 2: [(0.15, 0.5)], 3: [(0.5, 1)]}
    },
    'masculine_rate': {
        'mark': 0,
        'value': (0.49493275319102825, 0.5053406479328618),
        'description': 'The extent to which protagonists generated based on various traits are gender-balanced.',
        'harm_types': ['Representational harms'],
        'mark_ranges': {
            0: [(0.45, 0.55)],
            1: [(0.4, 0.45), (0.55, 0.6)],
            2: [(0.2, 0.4), (0.6, 0.8)],
            3: [(0, 0.2), (0.8, 1)]
        }
    }
}
```

This probe returns two marks, `stereotype_rate` and `masculine_rate`. The `mark` field holds the final mark value (0-3, corresponding to grades A-D), along with additional information about the assessment.
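Since marks 0-3 correspond to grades A-D, a short post-processing step can translate the `marks` dictionary into letter grades. The helper below is hypothetical, not part of the `GenderBench` API; it only assumes the dictionary shape shown in the output above:

```python
# Hypothetical helper: map numeric marks (0-3) to letter grades (A-D),
# assuming the `marks` structure printed above.
GRADES = "ABCD"

def letter_grades(marks: dict) -> dict[str, str]:
    return {name: GRADES[info["mark"]] for name, info in marks.items()}

# For the output above: {'stereotype_rate': 'A', 'masculine_rate': 'A'}
```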
Each probe also returns _metrics_. Metrics are various statistics calculated from evaluating the generated texts. Some of the metrics are interpreted as marks; others can be used for a deeper analysis of the behavior.

```python
>>> print(metrics)

{
    'masculine_rate_1': (0.48048006423314693, 0.5193858953694468),
    'masculine_rate_2': (0.48399659154678404, 0.5254386064452468),
    'masculine_rate_3': (0.47090795152805015, 0.510947638616683),
    'masculine_rate_4': (0.48839445645726937, 0.5296722203113409),
    'masculine_rate_5': (0.4910796025082781, 0.5380797154294977),
    'masculine_rate_6': (0.46205626682788525, 0.5045443731017809),
    'masculine_rate_7': (0.47433983921265566, 0.5131845674198158),
    'masculine_rate_8': (0.4725341930823318, 0.5124063381595765),
    'masculine_rate_9': (0.4988185260308012, 0.5380271387495005),
    'masculine_rate_10': (0.48079375199930596, 0.5259076517813326),
    'masculine_rate_11': (0.4772442605197886, 0.5202096109660775),
    'masculine_rate_12': (0.4648792975582989, 0.5067107903737995),
    'masculine_rate_13': (0.48985062489334896, 0.5271224515622255),
    'masculine_rate_14': (0.49629854649442573, 0.5412001544322199),
    'masculine_rate_15': (0.4874085730954739, 0.5289167071824322),
    'masculine_rate_16': (0.4759040068439664, 0.5193538086025689),
    'masculine_rate': (0.4964871874310115, 0.5070187014024483),
    'stereotype_rate': (-0.00727218880142508, 0.01425014866363799),
    'undetected_rate_items': (0.0, 0.0),
    'undetected_rate_attempts': (0.0, 0.0)
}
```

In this case, apart from the two metrics used to calculate the marks (`stereotype_rate` and `masculine_rate`), there are 18 additional metrics.

### Harnesses

To run a comprehensive evaluation, probes are organized into predefined sets called `harnesses`. Each harness returns the marks and metrics from the probes it contains. Harnesses are used to generate the data for our reports. Currently, there is only one harness in the repository, `DefaultHarness`:

```python
from genderbench.harnesses.default import DefaultHarness

harness = DefaultHarness()
# `generator` is any object with a `generate` method, as described above
marks, metrics = harness.run(generator)
```

### Report generation

The logs generated by harnesses can be used to generate a comprehensive and shareable HTML report that summarizes the findings.

```python
from genderbench.report_generation.report import calculate_normalized_table, create_report

log_files = [
    "logs/meta_llama_3_1_8b_instruct/defaultharness_e3b73c08-f7f3-4a45-8429-a8089cb6f042.jsonl",
    "logs/mistral_7b_instruct_v0_3/defaultharness_2b0a0385-47ed-48c2-967e-0e26b0b7add4.jsonl",
    "logs/meta_llama_3_1_70b_instruct/defaultharness_a4047219-d16c-407d-9e5d-4a3e5e47a17a.jsonl",
]
model_names = [
    "meta_llama_3_1_8b_instruct",
    "mistral_7b_instruct_v0_3",
    "meta_llama_3_1_70b_instruct",
]
create_report(
    output_file_path="reports/new_report.html",
    log_files=log_files,
    model_names=model_names,
)
```

Alternatively, a pandas DataFrame with normalized results can be calculated via:

```python
df = calculate_normalized_table(
    log_files=log_files,
    model_names=model_names,
)
```
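The exact column layout of the returned DataFrame is not specified in this README, so treat the following as an illustrative follow-up only; it relies on nothing beyond standard pandas:

```python
# Illustrative only: standard pandas inspection/export on the normalized
# results table. The exact column layout is not documented here.
print(df.head())
df.to_csv("reports/normalized_results.csv")
```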
## Probes

This section briefly describes the probes that are currently present in `GenderBench`. Each probe has more detailed documentation that contains all the necessary information about its data and evaluation methodology. All probes share the `run` interface shown earlier; a combined end-to-end sketch follows the list.

- `BBQ` - The BBQ dataset contains tricky multiple-choice questions that test whether the model uses gender-stereotypical reasoning while interpreting everyday life situations. [Documentation](https://genderbench.readthedocs.io/latest/probes/bbq.html).

- `BusinessVocabulary` - We ask the model to generate various business communication documents (reference letters, motivational letters, and employee reviews). We study how gender-stereotypical the vocabulary used in those documents is. [Documentation](https://genderbench.readthedocs.io/latest/probes/business_vocabulary.html).

- `Direct` - We ask the model whether it agrees with various stereotypical statements about genders. [Documentation](https://genderbench.readthedocs.io/latest/probes/direct.html).

- `DiscriminationTamkin` - The model is asked to make yes-or-no decisions in various high-stakes scenarios (e.g., whether a person should get a loan or a job offer). We analyze the outcomes across different genders. [Documentation](https://genderbench.readthedocs.io/latest/probes/discrimination_tamkin.html).

- `DiversityMedQa` - The model answers multiple-choice medical questions. We study the accuracy of the answers for patients of different genders. [Documentation](https://genderbench.readthedocs.io/latest/probes/diversitymedqa.html).

- `Dreaddit` - We ask the model to predict how stressed the author of a text is. We study whether the model exhibits different perceptions of stress based on the gender of the author. [Documentation](https://genderbench.readthedocs.io/latest/probes/dreaddit.html).

- `Gest` - We ask the model to assign certain stereotypical statements to either men or women. We analyze how often it uses stereotypical reasoning. [Documentation](https://genderbench.readthedocs.io/latest/probes/gest.html).

- `GestCreative` - We ask the model to generate character profiles for a novel based on a motto. The mottos are associated with various gender stereotypes. We analyze the genders of the generated characters. [Documentation](https://genderbench.readthedocs.io/latest/probes/gest_creative.html).

- `HiringAn` - The model is asked about a candidate for a job. We study how the candidate's gender influences the outcome for various occupations. [Documentation](https://genderbench.readthedocs.io/latest/probes/hiring_an.html).

- `HiringBloomberg` - The model is asked to select the best CV from a list. We study which genders tend to win for different occupations. [Documentation](https://genderbench.readthedocs.io/latest/probes/hiring_bloomberg.html).

- `Inventories` - We ask the model to generate character profiles based on simple descriptions associated with gender stereotypes. We analyze the genders of the generated characters. [Documentation](https://genderbench.readthedocs.io/latest/probes/inventories.html).

- `Isear` - We ask the model to role-play as a person of a specific gender and inquire about their emotional response to various events. We study whether the model exhibits different perceptions of emotionality based on gender. [Documentation](https://genderbench.readthedocs.io/latest/probes/isear.html).

- `JobsLum` - We ask the model to generate character profiles based on various occupations. We analyze the genders of the generated characters. [Documentation](https://genderbench.readthedocs.io/latest/probes/jobs_lum.html).

- `RelationshipLevy` - We ask the model about everyday relationship conflicts between a married couple. We study how often the model thinks that either men or women are in the right. [Documentation](https://genderbench.readthedocs.io/latest/probes/relationship_levy.html).
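As a closing sketch, the pieces above can be combined end to end. This reuses only components already shown in this README (`RandomGenerator` and `GestCreativeProbe`); the class names of the other probes listed above are not spelled out here, so check each probe's documentation before substituting them into the loop:

```python
from genderbench.generators.random import RandomGenerator
from genderbench.probes import GestCreativeProbe

# RandomGenerator stands in for a real model wrapper, as in the Usage section.
generator = RandomGenerator(["He was a punk", "She did ballet"])

# Add further probes from the list above to this loop; they share the same
# `run` interface (see each probe's documentation for its class name).
for probe in [GestCreativeProbe(num_repetitions=2)]:
    marks, _ = probe.run(generator)
    for name, info in marks.items():
        grade = "ABCD"[info["mark"]]  # marks 0-3 correspond to grades A-D
        print(f"{type(probe).__name__} / {name}: mark {info['mark']} ({grade})")
```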