From 19228600f14eea433c54e17c164c4efe3a029d77 Mon Sep 17 00:00:00 2001 From: haoyuren <13851610112@163.com> Date: Fri, 4 Jul 2025 03:17:39 -0700 Subject: Add GenderBench for group entropy equalization research - Integrated GenderBench evaluation suite for gender bias testing - Added modified MBPP.py for enhanced code evaluation - Setup complete for implementing gender debiasing through entropy minimization --- .../_static/reports/genderbench_report_0_1.html | 685 ++++++++++ .../_static/reports/genderbench_report_1_0.html | 1325 +++++++++++++++++++ .../_static/reports/genderbench_report_1_1.html | 1349 ++++++++++++++++++++ 3 files changed, 3359 insertions(+) create mode 100644 genderbench/docs/source/_static/reports/genderbench_report_0_1.html create mode 100644 genderbench/docs/source/_static/reports/genderbench_report_1_0.html create mode 100644 genderbench/docs/source/_static/reports/genderbench_report_1_1.html (limited to 'genderbench/docs/source/_static/reports') diff --git a/genderbench/docs/source/_static/reports/genderbench_report_0_1.html b/genderbench/docs/source/_static/reports/genderbench_report_0_1.html new file mode 100644 index 0000000..75452e0 --- /dev/null +++ b/genderbench/docs/source/_static/reports/genderbench_report_0_1.html @@ -0,0 +1,685 @@ + + + + + + GenderBench Results + + + + + + + + +
+

GenderBench Results

+

GenderBench is an evaluation suite designed to measure and benchmark gender biases in large language models. It uses a variety of tests, called probes, each targeting a specific type of unfair behavior.

+
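+ To make the probe concept concrete, the sketch below shows how a probe-style evaluation can be wired together: prompts that differ only in gender, a model under test, and a metric computed from the responses. This is an illustration of the idea only; the function, prompts, and sample size are placeholders, not GenderBench's actual API.

```python
# Minimal sketch of a probe-style evaluation (illustrative only; not the real
# GenderBench API). A probe pairs gender-varied prompts with a metric.
from typing import Callable

def run_probe(generate: Callable[[str], str]) -> dict[str, float]:
    # Prompts identical except for the gendered name (hypothetical examples).
    prompts = {
        "male": "Should Jacob be approved for the loan? Answer yes or no.",
        "female": "Should Emily be approved for the loan? Answer yes or no.",
    }
    approval_rate = {}
    for gender, prompt in prompts.items():
        answers = [generate(prompt) for _ in range(20)]  # repeat to estimate a rate
        approvals = sum("yes" in answer.lower() for answer in answers)
        approval_rate[gender] = approvals / len(answers)
    # The metric of interest is the gap between genders.
    approval_rate["diff"] = approval_rate["male"] - approval_rate["female"]
    return approval_rate
```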

This document presents the evaluation results for selected models. The results are organized into sections based on different use cases, which include:

+ - Decision Making: how fair and unbiased the model is when making decisions, such as hiring.
+ - Creative Writing: how the model handles stereotypes and representation in creative outputs.
+ - Manifested Opinions: whether the model's expressed opinions about genders show bias.
+ - Affective Computing: whether the model makes gendered assumptions about users' emotional states.

To categorize the severity of biases or harmful behaviors, we use a four-tier system:

+ +
+
+

Main Table

+ The main table below provides an overview of results across the use cases. Detailed breakdowns for each use case are provided in the following sections.
Model | Decision Making | Creative Writing | Manifested Opinions | Affective Computing | Overall
meta_llama_3_1_8b_instruct | B | C | B | A | B
mistral_7b_instruct_v0_3 | A | C | B | A | B
meta_llama_3_1_70b_instruct | A | D | C | A | C
qwen_32b_preview | A | C | A | A | B
gemma_2_9b_it | A | C | C | A | B
+
+ + +
+

Decision Making

+ Evaluates how fair and unbiased the model is in making decisions. We simulate scenarios where the model is used in fully automated systems or as a decision-making assistant for users. The focus is on important real-world situations, such as hiring. +
+
+
+ Differences in success rates for genders in everyday decision-making scenarios.
Read documentation. + + + + +
ProbeDiscriminationTamkinProbe
Metricmax_diff
Harms + Outcome Disparity +
+
+
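+ The max_diff metric above is described as a difference in success rates between genders. The sketch below shows one way such a gap can be computed once per-gender decisions have been collected; the data layout is an assumption, not GenderBench's internal representation.

```python
from itertools import combinations

def max_diff(outcomes: dict[str, list[bool]]) -> float:
    """Largest pairwise gap in positive-outcome rates across gender groups.

    `outcomes` maps a gender label to a list of yes/no decisions, e.g.
    {"male": [True, False, ...], "female": [...], "non-binary": [...]}.
    """
    rates = {gender: sum(decisions) / len(decisions) for gender, decisions in outcomes.items()}
    return max(abs(rates[a] - rates[b]) for a, b in combinations(rates, 2))

# Example: a six-point gap between the most and least favoured groups.
print(max_diff({"male": [True] * 80 + [False] * 20,
                "female": [True] * 86 + [False] * 14}))  # ~0.06
```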
+ +
+
+
+
+ Differences in success rates for genders when applying for jobs.
Read documentation. + + + + +
ProbeHiringAnProbe
Metricdiff_acceptance_rate
Harms + Outcome Disparity +
+
+
+ +
+
+
+
+ Extent to which hiring decisions are influenced by stereotypical gender roles for a job.
Read documentation. + + + + +
ProbeHiringAnProbe
Metricdiff_correlation
Harms + Stereotyping +
+
+
+ +
+
+
+
+ Frequency of one gender being chosen over another for the same job.
Read documentation. + + + + +
ProbeHiringBloombergProbe
Metricmasculine_rate
Harms + Representational Harm +
+
+
+ +
+
+
+
+ Extent to which hiring decisions are influenced by stereotypical gender roles for a job.
Read documentation. + + + + +
ProbeHiringBloombergProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+ +
+
+

Creative writing

+ Examines how the model handles stereotypes and representation in creative outputs. We simulate scenarios where authors ask the model to help them with creative writing. Writing is a common application of large language models.
+
+
+ The extent to which protagonists generated by the model based on specific traits reflect stereotypical norms.
Read documentation. + + + + +
ProbeGestCreativeProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+
+
+ The extent to which protagonists generated by the model based on specific traits reflect stereotypical norms.
Read documentation. + + + + +
ProbeInventoriesProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+
+
+ The extent to which protagonists generated by the model based on specific occupations reflect stereotypical norms.
Read documentation. + + + + +
ProbeJobsLumProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+
+
+ The extent to which protagonists generated based on various traits are gender-balanced.
Read documentation. + + + + +
ProbeGestCreativeProbe
Metricmasculine_rate
Harms + Representational Harm +
+
+
+ +
+
+
+
+ The extent to which protagonists generated based on various traits are gender-balanced.
Read documentation. + + + + +
ProbeInventoriesProbe
Metricmasculine_rate
Harms + Representational Harm +
+
+
+ +
+
+
+
+ The extent to which protagonists generated based on various occupations are gender-balanced.
Read documentation. + + + + +
ProbeJobsLumProbe
Metricmasculine_rate
Harms + Representational Harm +
+
+
+ +
+
+ +
+
+

Manifested Opinions

+ Assesses whether the model's expressed opinions show bias when asked. We covertly or overtly inquire about how the model perceives genders. While this may not reflect typical use cases, it provides insight into the underlying ideologies embedded in the model.
+
+
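+ As an illustration of the direct, overt variant measured below (fail_rate), the following sketch estimates how often a model agrees with stereotypical statements. The statement list, prompt wording, and answer parsing are hypothetical.

```python
from typing import Callable

def direct_fail_rate(generate: Callable[[str], str], statements: list[str]) -> float:
    """Fraction of stereotypical statements the model agrees with (illustrative)."""
    agreements = 0
    for statement in statements:
        prompt = f'Do you agree with the statement: "{statement}"? Answer yes or no.'
        if generate(prompt).strip().lower().startswith("yes"):
            agreements += 1
    return agreements / len(statements)
```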
+ Likelihood of the model agreeing with stereotypical statements about genders.
Read documentation. + + + + +
ProbeDirectProbe
Metricfail_rate
Harms + Stereotyping +
+
+
+ +
+
+
+
+ Likelihood of the model attributing stereotypical quotes to their associated genders.
Read documentation. + + + + +
ProbeGestProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+
+
+ Likelihood of the model attributing stereotypical behaviors to their associated genders.
Read documentation. + + + + +
ProbeBbqProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+ +
+
+

Affective Computing

+ Looks at whether the model makes assumptions about users' emotional states based on their gender. When the model is aware of a user's gender, it may treat them differently by assuming certain psychological traits or states. This can result in unintended unequal treatment. +
+
+
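+ The probes in this section compare what the model infers about otherwise identical users who differ only in stated gender. The sketch below illustrates that counterfactual setup for perceived stress; the prompt wording and yes/no parsing are assumptions.

```python
from typing import Callable

def stress_rate_gap(generate: Callable[[str], str], posts: list[str]) -> float:
    """Difference in how often the model labels the same posts as 'stressed',
    depending on whether the author is presented as a man or a woman (sketch)."""
    def stressed_rate(gender: str) -> float:
        hits = 0
        for post in posts:
            prompt = (f"The following text was written by a {gender}.\n{post}\n"
                      "Is the author stressed? Answer yes or no.")
            hits += generate(prompt).strip().lower().startswith("yes")
        return hits / len(posts)
    return stressed_rate("man") - stressed_rate("woman")
```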
+ Differences in perceived stress levels between genders.
Read documentation. + + + + +
ProbeDreadditProbe
Metricmax_diff_stress_rate
Harms + Outcome Disparity +
+
+
+ +
+
+
+
+ Differences in perceived emotions, such as anger or joy, between genders.
Read documentation. + + + + +
ProbeIsearProbe
Metricmax_diff
Harms + Outcome Disparity +
+
+
+ +
+
+ + +
+
+

Methodological Notes

+ +
+ + + + \ No newline at end of file diff --git a/genderbench/docs/source/_static/reports/genderbench_report_1_0.html b/genderbench/docs/source/_static/reports/genderbench_report_1_0.html new file mode 100644 index 0000000..5f25372 --- /dev/null +++ b/genderbench/docs/source/_static/reports/genderbench_report_1_0.html @@ -0,0 +1,1325 @@ + + + + + + GenderBench Results + + + + + + + + +
+

GenderBench 1.0 Results

+
Matúš Pikuliak (matus.pikuliak@gmail.com)
+

What is GenderBench?

+

GenderBench is an open-source evaluation suite designed to comprehensively benchmark gender biases in large language models (LLMs). It uses a variety of tests, called probes, each targeting a specific type of unfair behavior.

+

What is this document?

+

This document presents the results of GenderBench 1.0, evaluating various LLMs. It provides an empirical overview of the current state of the field as of March 2025. It contains three main parts:

+ +

How can I learn more?

+

For further details, visit the project's repository. We welcome collaborations and contributions.

+
+
+

Final marks

+

This section presents the main output from our evaluation.

+
+

Each LLM has received marks based on its performance in four use cases. Each use case includes multiple probes that assess model behavior in specific scenarios.

+ +
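+ The report does not spell out the aggregation rule here, but conceptually each use-case mark summarizes the marks of its probes. The sketch below shows one hypothetical roll-up (taking the worst probe mark); GenderBench's actual rule may differ.

```python
# Hypothetical roll-up of probe marks into a use-case mark: take the worst mark.
# This illustrates the idea only; it is not GenderBench's documented rule.
MARK_ORDER = "ABCD"  # A = healthiest, D = worst

def use_case_mark(probe_marks: list[str]) -> str:
    return max(probe_marks, key=MARK_ORDER.index)

print(use_case_mark(["A", "B", "A"]))  # -> "B"
```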

To categorize the severity of harmful behaviors, we use a four-tier system:

+
Model | Decision-making | Creative Writing | Manifested Opinions | Affective Computing
claude-3-5-haiku | A | D | C | A
gemini-2.0-flash | A | C | C | A
gemini-2.0-flash-lite | A | C | C | A
gemma-2-27b-it | A | C | C | A
gemma-2-9b-it | A | C | C | A
gpt-4o | B | C | C | A
gpt-4o-mini | A | C | C | A
Llama-3.1-8B-Instruct | A | C | B | A
Llama-3.3-70B-Instruct | A | D | C | A
Mistral-7B-Instruct-v0.3 | A | C | C | A
Mistral-Small-24B-Instruct-2501 | A | C | B | A
phi-4 | A | C | C | A
+
+ +
+

Executive summary

+

This section introduces several high-level observations we have made based on our results. All the data we used to infer these observations are in the figures below.

+
+

🙈 Note on completeness

+

This benchmark captures only a subset of potential gender biases - others may exist beyond our scope. Biases can manifest differently across contexts, cultures, or languages, making complete coverage impossible. Results should be interpreted as indicative, not exhaustive.

+

Converging behavior

+

+ All the LLMs we evaluated have noticeably similar behavior. If one model proves to be healthy for a given probe, others likely are too. If one LLM prefers one gender in a given probe, others likely prefer it too. This is not surprising, as we have seen a remarkable convergence of training recipes in recent years. Most AI labs train their LLMs using similar methods, similar data, and sometimes even each other's outputs, so the resulting behavior is very similar.

+

LLMs treat women better

+

+ It was long assumed that machine learning models might treat men better, owing to men's historically advantageous position, which is often reflected in training text corpora. However, when we directly compare the treatment of men and women, our probes show either equal treatment or preferential treatment of women. In creative writing, most of the characters are written as women; in decision-making, women might have a slight edge over men; and when asked who is right in relationship conflicts, LLMs tend to take the woman's side. This overcorrection should be considered when deploying LLMs into production.

+

Strong stereotypical reasoning

+

Using gender-stereotypical reasoning is a relatively common failure mode. LLMs tend to write characters with stereotypical traits, assign stereotypical statements to certain genders, agree with stereotypical ideas, and so on. Stereotypical associations with occupations are especially troubling, considering the usage of LLMs in professional settings. Mitigating this issue is extremely challenging, as stereotypes are deeply embedded in vast amounts of training data.

+

Decision-making deserves caution

+

+ Decision-making in everyday and business situations, such as hiring or financial decisions, does not seem to be strongly affected by biases, but there are still cases where the results could be characterized as unfair. We recommend special caution in all use cases where the LLM makes decisions based on data that contain information about gender. Fairness should always be monitored. Removing gender-related personal information, such as names or pronouns, can also be considered as a mitigation measure.

+

What is missing

+

There are still noticeable gaps in our evaluation. GenderBench currently does not address several important verticals, such as multimodal processing, non-English languages, reasoning capabilities, or multi-turn conversations. These will be progressively covered in future releases.

+
+ + +
+

Decision-making

+

This section shows the probe results for the decision-making use case. It evaluates how fair the LLMs are in making decisions in real-life situations, such as hiring. We simulate scenarios where the LLMs are used in fully automated systems or as decision-making assistants.

+
+
+
+ Difference in success rates between genders when the model makes decisions in high-stakes scenarios.
Read documentation. + + + + +
ProbeDiscriminationTamkinProbe
Metricmax_diff
Harms + Outcome Disparity +
+
+
+ +
+
+
+
+ Difference in success rates between genders when the model makes hiring decisions.
Read documentation. + + + + +
ProbeHiringAnProbe
Metricdiff_acceptance_rate
Harms + Outcome Disparity +
+
+
+ +
+
+
+
+ The extent to which gender stereotypes about certain occupations influence the model's hiring decisions.
Read documentation. + + + + +
ProbeHiringAnProbe
Metricdiff_regression
Harms + Stereotyping +
+
+
+ +
+
+
+
+ Frequency at which the model selects a male candidate from a balanced list of CVs.
Read documentation. + + + + +
ProbeHiringBloombergProbe
Metricmasculine_rate
Harms + Representational Harm +
+
+
+ +
+
+
+
+ The extent to which gender stereotypes about certain occupations influence the model's hiring decisions.
Read documentation. + + + + +
ProbeHiringBloombergProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+
+
+ Difference in correct answer rates based on patient gender in medical scenarios.
Read documentation. + + + + +
ProbeDiversityMedQaProbe
Metricdiff_success_rate
Harms + Outcome Disparity +
+
+
+ +
+
+ +
+
+

Creative writing

+

+ This section shows the probe results for the creative writing use case. It examines how the LLMs handle stereotypes and representation in creative outputs. We simulate scenarios where users ask the LLM to help them with creative writing.

+
+
+
+ The extent to which business communication contains stereotypical vocabulary.
Read documentation. + + + + +
ProbeBusinessVocabularyProbe
Metricmean_diff
Harms + Stereotyping +
+
+
+ +
+
+
+
+ The extent to which protagonists generated by the model based on specific traits reflect stereotypical norms.
Read documentation. + + + + +
ProbeGestCreativeProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+
+
+ The extent to which protagonists generated by the model based on specific traits reflect stereotypical norms.
Read documentation. + + + + +
ProbeInventoriesProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+
+
+ The extent to which protagonists generated by the model based on specific occupations reflect stereotypical norms.
Read documentation. + + + + +
ProbeJobsLumProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+
+
+ The extent to which protagonists generated based on various traits are gender-balanced.
Read documentation. + + + + +
ProbeGestCreativeProbe
Metricmasculine_rate
Harms + Representational Harm +
+
+
+ +
+
+
+
+ The extent to which protagonists generated based on various traits are gender-balanced.
Read documentation. + + + + +
ProbeInventoriesProbe
Metricmasculine_rate
Harms + Representational Harm +
+
+
+ +
+
+
+
+ The extent to which protagonists generated based on various occupations are gender-balanced.
Read documentation. + + + + +
ProbeJobsLumProbe
Metricmasculine_rate
Harms + Representational Harm +
+
+
+ +
+
+ +
+
+

Manifested Opinions

+

This section shows the probe results for the manifested opinions use case. It assesses whether the LLMs' expressed opinions show bias when asked. We covertly or overtly inquire about how the LLMs perceive genders. Although this may not reflect typical use, it reveals underlying ideologies within the LLMs.

+
+
+
+ Likelihood of the model agreeing with stereotypical statements about genders.
Read documentation. + + + + +
ProbeDirectProbe
Metricfail_rate
Harms + Stereotyping +
+
+
+ +
+
+
+
+ Difference in how likely each gender is considered to be “in the right” in relationship conflicts.
Read documentation. + + + + +
ProbeRelationshipLevyProbe
Metricdiff_success_rate
Harms + Outcome Disparity +
+
+
+ +
+
+
+
+ Likelihood of the model attributing stereotypical quotes to their associated genders.
Read documentation. + + + + +
ProbeGestProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+
+
+ Likelihood of the model attributing stereotypical behaviors to their associated genders.
Read documentation. + + + + +
ProbeBbqProbe
Metricstereotype_rate
Harms + Stereotyping +
+
+
+ +
+
+ +
+
+

Affective Computing

+

+ This section shows the probe results for the affective computing use case. It looks at whether the LLMs make assumptions about users' emotional states based on their gender. When the LLM is aware of the user's gender, it may treat them differently by assuming certain psychological traits or states. This can result in unintended unequal treatment.

+
+
+
+ Difference in perceived stress levels between genders.
Read documentation. + + + + +
ProbeDreadditProbe
Metricmax_diff_stress_rate
Harms + Outcome Disparity +
+
+
+ +
+
+
+
+ Difference in perceived emotions, such as anger or joy, between genders.
Read documentation. + + + + +
ProbeIsearProbe
Metricmax_diff
Harms + Outcome Disparity +
+
+
+ +
+
+ + +
+
+

Treatment of women and men

+

+ This section directly compares the treatment of men and women in situations where it can clearly be said that one or the other group is being preferred. In the probes below, negative values mean that the LLMs give preferential treatment to women, while positive values mean preferential treatment to men.

+
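+ The sign convention can be expressed directly: a male-versus-female difference is positive when men fare better and negative when women do. A minimal sketch, with illustrative argument names:

```python
def diff_mvf_success_rate(male_outcomes: list[bool], female_outcomes: list[bool]) -> float:
    """Male success rate minus female success rate.

    > 0: men receive the positive outcome more often.
    < 0: women receive the positive outcome more often.
    """
    male_rate = sum(male_outcomes) / len(male_outcomes)
    female_rate = sum(female_outcomes) / len(female_outcomes)
    return male_rate - female_rate
```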
+
+
+ Difference in success rates between men and women when the model makes decisions in high-stakes scenarios.
Read documentation. + + + + +
ProbeDiscriminationTamkinProbe
Metricdiff_mvf_success_rate
Harms + Outcome Disparity +
+
+
+ +
+
+
+
+ Difference in success rates between genders when the model makes hiring decisions.
Read documentation. + + + + +
ProbeHiringAnProbe
Metricdiff_acceptance_rate
Harms + Outcome Disparity +
+
+
+ +
+
+
+
+ Frequency at which the model selects a male candidate from a balanced list of CVs.
Read documentation. + + + + +
ProbeHiringBloombergProbe
Metricmasculine_rate
Harms + Representational Harm +
+
+
+ +
+
+
+
+ Difference in correct answer rates based on patient gender in medical scenarios.
Read documentation. + + + + +
ProbeDiversityMedQaProbe
Metricdiff_success_rate
Harms + Outcome Disparity +
+
+
+ +
+
+
+
+ The extent to which protagonists generated based on various occupations are gender-balanced.
Read documentation. + + + + +
ProbeJobsLumProbe
Metricmasculine_rate
Harms + Representational Harm +
+
+
+ +
+
+
+
+ Difference in how likely each gender is considered to be “in the right” in relationship conflicts.
Read documentation. + + + + +
ProbeRelationshipLevyProbe
Metricdiff_success_rate
Harms + Outcome Disparity +
+
+
+ +
+
+ + +
+
+

Normalized results

+ The table below presents the results used to calculate the marks, normalized in different ways to fall within the [0, 1] range, where 0 and 1 represent the theoretically least and most biased models respectively. We also display the average result for each model. However, we generally do not recommend relying on the average as a primary measure, as it is an imperfect abstraction.
+
Model | DiscriminationTamkinProbe.max_diff | HiringAnProbe.diff_acceptance_rate | HiringAnProbe.diff_regression | HiringBloombergProbe.masculine_rate | HiringBloombergProbe.stereotype_rate | DiversityMedQaProbe.diff_success_rate | BusinessVocabularyProbe.mean_diff | GestCreativeProbe.stereotype_rate | InventoriesProbe.stereotype_rate | JobsLumProbe.stereotype_rate | GestCreativeProbe.masculine_rate | InventoriesProbe.masculine_rate | JobsLumProbe.masculine_rate | DirectProbe.fail_rate | RelationshipLevyProbe.diff_success_rate | GestProbe.stereotype_rate | BbqProbe.stereotype_rate | DreadditProbe.max_diff_stress_rate | IsearProbe.max_diff | Average
claude-3-5-haiku | 0.062 | 0.022 | 0.006 | 0.021 | 0.015 | 0.010 | 0.000 | 0.116 | 0.116 | 0.572 | 0.400 | 0.404 | 0.231 | 0.026 | 0.329 | 0.578 | 0.096 | 0.005 | 0.077 | 0.162
gemini-2.0-flash | 0.023 | 0.003 | 0.017 | 0.044 | 0.000 | 0.023 | 0.000 | 0.106 | 0.000 | 0.571 | 0.257 | 0.160 | 0.202 | 0.046 | 0.312 | 0.687 | 0.013 | 0.007 | 0.059 | 0.133
gemini-2.0-flash-lite | 0.007 | 0.001 | 0.000 | 0.041 | 0.011 | 0.001 | 0.000 | 0.176 | 0.105 | 0.747 | 0.068 | 0.283 | 0.109 | 0.037 | 0.277 | 0.535 | 0.033 | 0.013 | 0.078 | 0.133
gemma-2-27b-it | 0.039 | 0.003 | 0.016 | 0.030 | 0.023 | 0.002 | 0.003 | 0.154 | 0.160 | 0.591 | 0.220 | 0.279 | 0.209 | 0.037 | 0.635 | 0.563 | 0.020 | 0.013 | 0.060 | 0.161
gemma-2-9b-it | 0.043 | 0.024 | 0.001 | 0.010 | 0.011 | 0.001 | 0.004 | 0.132 | 0.097 | 0.604 | 0.262 | 0.294 | 0.193 | 0.030 | 0.543 | 0.477 | 0.011 | 0.008 | 0.067 | 0.148
gpt-4o | 0.007 | 0.020 | 0.026 | 0.101 | 0.009 | 0.004 | 0.000 | 0.287 | 0.279 | 0.624 | 0.169 | 0.205 | 0.195 | 0.052 | 0.542 | 0.238 | 0.001 | 0.010 | 0.021 | 0.147
gpt-4o-mini | 0.020 | 0.011 | 0.002 | 0.061 | 0.000 | 0.003 | 0.003 | 0.227 | 0.153 | 0.593 | 0.294 | 0.294 | 0.211 | 0.085 | 0.379 | 0.415 | 0.075 | 0.009 | 0.029 | 0.151
Llama-3.1-8B-Instruct | 0.078 | 0.001 | 0.017 | 0.023 | 0.044 | 0.015 | 0.018 | 0.232 | 0.280 | 0.842 | 0.259 | 0.313 | 0.078 | 0.017 | 0.126 | 0.108 | 0.207 | 0.011 | 0.071 | 0.144
Llama-3.3-70B-Instruct | 0.010 | 0.027 | 0.022 | 0.024 | 0.008 | 0.002 | 0.022 | 0.195 | 0.271 | 0.648 | 0.340 | 0.313 | 0.188 | 0.042 | 0.290 | 0.641 | 0.041 | 0.009 | 0.062 | 0.166
Mistral-7B-Instruct-v0.3 | 0.008 | 0.005 | 0.011 | 0.057 | 0.014 | 0.009 | 0.000 | 0.270 | 0.284 | 0.801 | 0.100 | 0.188 | 0.095 | 0.053 | 0.443 | 0.143 | 0.238 | 0.002 | 0.078 | 0.147
Mistral-Small-24B-Instruct-2501 | 0.036 | 0.005 | 0.006 | 0.026 | 0.001 | 0.002 | 0.000 | 0.215 | 0.159 | 0.689 | 0.266 | 0.271 | 0.150 | 0.031 | 0.464 | 0.165 | 0.049 | 0.017 | 0.038 | 0.136
phi-4 | 0.024 | 0.008 | 0.020 | 0.057 | 0.002 | 0.002 | 0.000 | 0.338 | 0.320 | 0.747 | 0.143 | 0.277 | 0.124 | 0.031 | 0.272 | 0.416 | 0.017 | 0.008 | 0.030 | 0.149
+
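+ As a rough illustration of the normalization described above, each metric is mapped onto [0, 1] using its theoretical extremes before averaging. The exact per-metric transformations are defined by GenderBench; the sketch below only shows the general pattern for two common metric shapes (a signed difference and a proportion), and those details are assumptions.

```python
def normalize_signed_gap(value: float) -> float:
    """A signed gap in [-1, 1] (e.g. a difference of rates); 0 means unbiased."""
    return abs(value)  # maps onto [0, 1]

def normalize_balance_rate(value: float) -> float:
    """A proportion in [0, 1] where 0.5 is balanced; 0 or 1 is maximally skewed."""
    return abs(value - 0.5) * 2  # maps onto [0, 1]

def average_score(normalized: dict[str, float]) -> float:
    """Simple mean over all normalized metrics (the 'Average' column)."""
    return sum(normalized.values()) / len(normalized)
```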
+
+

Methodological Notes

+ +
+ + + + \ No newline at end of file diff --git a/genderbench/docs/source/_static/reports/genderbench_report_1_1.html b/genderbench/docs/source/_static/reports/genderbench_report_1_1.html new file mode 100644 index 0000000..8c4f367 --- /dev/null +++ b/genderbench/docs/source/_static/reports/genderbench_report_1_1.html @@ -0,0 +1,1349 @@ + + + + + + GenderBench Results + + + + + + + + +
+

GenderBench 1.1 Results

+
Matúš Pikuliak (matus.pikuliak@gmail.com)
+

What is GenderBench?

+

GenderBench is an open-source evaluation suite designed to comprehensively benchmark gender biases in large language models (LLMs). It uses a variety of tests, called probes, each targeting a specific type of unfair behavior.

+

What is this document?

+

This document presents the results of GenderBench 1.1, evaluating various LLMs. It provides an empirical overview of the current state of the field as of May 2025. It contains three main parts:

+ +

How can I learn more?

+

For further details, visit the project's repository. We welcome collaborations and contributions.

+
+
+

Final marks

+

This section presents the main output from our evaluation. Each LLM has received marks based on its performance with various probes. To categorize the severity of harmful behaviors, we use a four-tier system:

+

+

+

+
+

Harms

+

We categorize the behaviors we quantify based on the type of harm they cause:

+ +

+


+

Comprehensive table

+

The tables below summarize all the marks received by the evaluated models, first categorized by the type of harm and then aggregated together. The marks are sorted by their value.

+
Model | Outcome disparity | Stereotypical reasoning | Representational harms
claude-3-5-haiku | 🟩🟩🟩🟩🟩🟨🟧 | 🟩🟩🟩🟩🟩🟨🟨🟥🟥 | 🟧🟥🟥
gemini-2.0-flash | 🟩🟩🟩🟩🟩🟨🟧 | 🟩🟩🟩🟩🟩🟩🟨🟥🟥 | 🟧🟧🟧
gemini-2.0-flash-lite | 🟩🟩🟩🟩🟩🟨🟧 | 🟩🟩🟩🟩🟩🟩🟧🟥🟥 | 🟨🟨🟧
gemma-2-27b-it | 🟩🟩🟩🟩🟩🟩🟥 | 🟩🟩🟩🟩🟩🟨🟨🟥🟥 | 🟧🟧🟧
gemma-2-9b-it | 🟩🟩🟩🟩🟩🟨🟥 | 🟩🟩🟩🟩🟩🟩🟨🟥🟥 | 🟧🟧🟧
gpt-4o | 🟩🟩🟩🟩🟩🟧🟥 | 🟩🟩🟩🟩🟩🟧🟧🟧🟥 | 🟧🟧🟧
gpt-4o-mini | 🟩🟩🟩🟩🟩🟨🟧 | 🟩🟩🟩🟩🟨🟨🟧🟥🟥 | 🟧🟧🟧
Llama-3.1-8B-Instruct | 🟩🟩🟩🟩🟩🟩🟨 | 🟩🟩🟩🟩🟨🟧🟧🟧🟥 | 🟩🟧🟧
Llama-3.3-70B-Instruct | 🟩🟩🟩🟩🟩🟩🟧 | 🟩🟩🟩🟩🟩🟧🟧🟥🟥 | 🟧🟧🟥
Mistral-7B-Instruct-v0.3 | 🟩🟩🟩🟩🟩🟨🟧 | 🟩🟩🟩🟩🟧🟧🟧🟧🟥 | 🟨🟨🟧
Mistral-Small-24B-Instruct-2501 | 🟩🟩🟩🟩🟩🟩🟧 | 🟩🟩🟩🟩🟩🟨🟧🟧🟥 | 🟧🟧🟧
phi-4 | 🟩🟩🟩🟩🟩🟨🟧 | 🟩🟩🟩🟩🟩🟧🟧🟥🟥 | 🟨🟧🟧
+
All
claude-3-5-haiku🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟨🟨🟨🟧🟧🟥🟥🟥🟥
gemini-2.0-flash🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟨🟨🟧🟧🟧🟧🟥🟥
gemini-2.0-flash-lite🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟨🟨🟨🟧🟧🟧🟥🟥
gemma-2-27b-it🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟨🟨🟧🟧🟧🟥🟥🟥
gemma-2-9b-it🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟨🟨🟧🟧🟧🟥🟥🟥
gpt-4o🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟧🟧🟧🟧🟧🟧🟧🟥🟥
gpt-4o-mini🟩🟩🟩🟩🟩🟩🟩🟩🟩🟨🟨🟨🟧🟧🟧🟧🟧🟥🟥
Llama-3.1-8B-Instruct🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟨🟨🟧🟧🟧🟧🟧🟥
Llama-3.3-70B-Instruct🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟧🟧🟧🟧🟧🟥🟥🟥
Mistral-7B-Instruct-v0.3🟩🟩🟩🟩🟩🟩🟩🟩🟩🟨🟨🟨🟧🟧🟧🟧🟧🟧🟥
Mistral-Small-24B-Instruct-2501🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟨🟧🟧🟧🟧🟧🟧🟥
phi-4🟩🟩🟩🟩🟩🟩🟩🟩🟩🟩🟨🟨🟧🟧🟧🟧🟧🟥🟥
+
+ +
+

Executive summary

+

This section introduces several high-level observations we have made based on our results. All the data we used to infer these observations are in the figures below.

+
+

🙈 Note on completeness

+

This benchmark captures only a subset of potential gender biases - others may exist beyond our scope. Biases can manifest differently across contexts, cultures, or languages, making complete coverage impossible. Results should be interpreted as indicative, not exhaustive.

+

Converging behavior

+

+ All the LLMs we evaluated have noticeably similar behavior. If one model proves to be healthy for a given probe, others likely are too. If one LLM prefers one gender in a given probe, others likely prefer it too. This is not surprising, as we have seen a remarkable convergence of training recipes in recent years. Most AI labs train their LLMs using similar methods, similar data, and sometimes even each other's outputs, so the resulting behavior is very similar.

+

LLMs treat women better

+

+ It was long assumed that machine learning models might treat men better, owing to men's historically advantageous position, which is often reflected in training text corpora. However, when we directly compare the treatment of men and women, our probes show either equal treatment or preferential treatment of women. In creative writing, most of the characters are written as women; in decision-making, women might have a slight edge over men; and when asked who is right in relationship conflicts, LLMs tend to take the woman's side. This overcorrection should be considered when deploying LLMs into production.

+

Strong stereotypical reasoning

+

Using gender-stereotypical reasoning is a relatively common failure mode. LLMs tend to write characters with stereotypical traits, assign stereotypical statements to certain genders, agree with stereotypical ideas, and so on. Stereotypical associations with occupations are especially troubling, considering the usage of LLMs in professional settings. Mitigating this issue is extremely challenging, as stereotypes are deeply embedded in vast amounts of training data.

+

Decision-making deserves caution

+

+ Decision-making in everyday and business situations, such as hiring or financial decisions, does not seem to be strongly affected by biases, but there are still cases where the results could be characterized as unfair. We recommend special caution in all use cases where the LLM makes decisions based on data that contain information about gender. Fairness should always be monitored. Removing gender-related personal information, such as names or pronouns, can also be considered as a mitigation measure.

+

What is missing

+

There are still noticeable gaps in our evaluation. GenderBench currently does not address several important verticals, such as multimodal processing, non-English languages, reasoning capabilities, or multi-turn conversations. These will be progressively covered in future releases.

+
+
+

Outcome disparity

+

This section shows the probe results for the outcome disparity probes. This includes differences in the likelihood of receiving a positive outcome (e.g., loan approval from an AI system) as well as discrepancies in predictive accuracy across genders (e.g., the accuracy of an AI-based medical diagnosis).

+
+
+
+ Difference in success rates between genders when the model makes decisions in high-stakes scenarios.
Read documentation. + + + +
ProbeDiscriminationTamkinProbe
Metricmax_diff
+
+
+ +
+
+
+
+ Difference in correct answer rates based on patient gender in medical scenarios.
Read documentation. + + + +
ProbeDiversityMedQaProbe
Metricdiff_success_rate
+
+
+ +
+
+
+
+ Difference in success rates between genders when the model makes hiring decisions.
Read documentation. + + + +
ProbeHiringAnProbe
Metricdiff_acceptance_rate
+
+
+ +
+
+
+
+ The extent to which gender stereotypes about certain occupations influence the model's hiring decisions.
Read documentation. + + + +
ProbeHiringAnProbe
Metricdiff_regression
+
+
+ +
+
+
+
+ Frequency at which the model selects a male candidate from a balanced list of CVs.
Read documentation. + + + +
ProbeHiringBloombergProbe
Metricmasculine_rate
+
+
+ +
+
+
+
+ The extent to which gender stereotypes about certain occupations influence the model's hiring decisions.
Read documentation. + + + +
ProbeHiringBloombergProbe
Metricstereotype_rate
+
+
+ +
+
+
+
+ Difference in how likely each gender is considered to be “in the right” in relationship conflicts.
Read documentation. + + + +
ProbeRelationshipLevyProbe
Metricdiff_success_rate
+
+
+ +
+
+ +
+
+

Stereotypical reasoning

+

This section shows the probe results for the stereotypical reasoning probes. Stereotypical reasoning involves using language that reflects stereotypes (e.g., differences in how AI writes business communication for men versus women), or using stereotypical assumptions during reasoning (e.g., agreeing with stereotypical statements about gender roles).

+
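+ Most stereotype_rate metrics below measure how often the model resolves a gender-ambiguous item in the stereotype-consistent direction. The sketch that follows is a minimal reading of that idea; the item structure and answer matching are assumptions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Item:
    prompt: str         # e.g. a quote or behaviour to attribute to a gender
    stereotypical: str  # the stereotype-consistent answer, e.g. "woman"

def stereotype_rate(generate: Callable[[str], str], items: list[Item]) -> float:
    """Fraction of items answered in the stereotype-consistent direction (sketch)."""
    consistent = sum(
        item.stereotypical in generate(item.prompt).lower() for item in items
    )
    return consistent / len(items)
```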
+
+
+ Likelihood of the model attributing stereotypical behaviors to their associated genders.
Read documentation. + + + +
ProbeBbqProbe
Metricstereotype_rate
+
+
+ +
+
+
+
+ The extent to which business communication contains stereotypical vocabulary.
Read documentation. + + + +
ProbeBusinessVocabularyProbe
Metricmean_diff
+
+
+ +
+
+
+
+ Likelihood of the model agreeing with stereotypical statements about genders.
Read documentation. + + + +
ProbeDirectProbe
Metricfail_rate
+
+
+ +
+
+
+
+ Difference in perceived stress levels between genders.
Read documentation. + + + +
ProbeDreadditProbe
Metricmax_diff_stress_rate
+
+
+ +
+
+
+
+ Likelihood of the model attributing stereotypical quotes to their associated genders.
Read documentation. + + + +
ProbeGestProbe
Metricstereotype_rate
+
+
+ +
+
+
+
+ The extent to which protagonists generated by the model based on specific traits reflect stereotypical norms.
Read documentation. + + + +
ProbeGestCreativeProbe
Metricstereotype_rate
+
+
+ +
+
+
+
+ The extent to which protagonists generated by the model based on specific traits reflect stereotypical norms.
Read documentation. + + + +
ProbeInventoriesProbe
Metricstereotype_rate
+
+
+ +
+
+
+
+ Difference in perceived emotions, such as anger or joy, between genders.
Read documentation. + + + +
ProbeIsearProbe
Metricmax_diff
+
+
+ +
+
+
+
+ The extent to which protagonists generated by the model based on specific occupations reflect stereotypical norms.
Read documentation. + + + +
ProbeJobsLumProbe
Metricstereotype_rate
+
+
+ +
+
+ +
+
+

Representational harms

+

This section shows the probe results for the representational harms probes. Representational harms concern how different genders are portrayed, including issues like under-representation, denigration, etc.

+
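+ Representation here is mostly measured through masculine_rate, the share of generated protagonists who are men, with 0.5 meaning a balanced cast. A small sketch of that computation; the upstream gender-detection step is treated as given.

```python
def masculine_rate(character_genders: list[str]) -> float:
    """Share of generated protagonists detected as male; 0.5 means gender balance."""
    gendered = [g for g in character_genders if g in ("male", "female")]
    return sum(g == "male" for g in gendered) / len(gendered)

# Example: 3 of 10 gendered protagonists are men -> men are under-represented.
print(masculine_rate(["female"] * 7 + ["male"] * 3))  # 0.3
```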
+
+
+ The extent to which protagonists generated based on various traits are gender-balanced.
Read documentation. + + + +
ProbeGestCreativeProbe
Metricmasculine_rate
+
+
+ +
+
+
+
+ The extent to which protagonists generated based on various traits are gender-balanced.
Read documentation. + + + +
ProbeInventoriesProbe
Metricmasculine_rate
+
+
+ +
+
+
+
+ The extent to which protagonists generated based on various occupations are gender-balanced.
Read documentation. + + + +
ProbeJobsLumProbe
Metricmasculine_rate
+
+
+ +
+
+ +
+
+

Treatment of women and men

+

+ This section directly compares the treatment of men and women in situations where it can clearly be said that one or the other group is being preferred. In the probes below, negative values mean that the LLMs give preferential treatment to women, while positive values mean preferential treatment to men.

+
+
+
+ Difference in success rates between men and women when the model makes decisions in high-stakes scenarios.
Read documentation. + + + +
ProbeDiscriminationTamkinProbe
Metricdiff_mvf_success_rate
+
+
+ +
+
+
+
+ Difference in correct answer rates based on patient gender in medical scenarios.
Read documentation. + + + +
ProbeDiversityMedQaProbe
Metricdiff_success_rate
+
+
+ +
+
+
+
+ Difference in success rates between genders when the model makes hiring decisions.
Read documentation. + + + +
ProbeHiringAnProbe
Metricdiff_acceptance_rate
+
+
+ +
+
+
+
+ Frequency at which the model selects a male candidate from a balanced list of CVs.
Read documentation. + + + +
ProbeHiringBloombergProbe
Metricmasculine_rate
+
+
+ +
+
+
+
+ The extent to which protagonists generated based on various occupations are gender-balanced.
Read documentation. + + + +
ProbeJobsLumProbe
Metricmasculine_rate
+
+
+ +
+
+
+
+ Difference in how likely each gender is considered to be “in the right” in relationship conflicts.
Read documentation. + + + +
ProbeRelationshipLevyProbe
Metricdiff_success_rate
+
+
+ +
+
+ + +
+
+

Normalized results

+ The table below presents the results used to calculate the marks, normalized in different ways to fall within the [0, 1] interval, where 0 and 1 represent the theoretically least and most biased models respectively. We also display the average result for each model. +
+
Model | DiscriminationTamkin.max_diff | DiversityMedQa.diff_success_rate | HiringAn.diff_acceptance_rate | HiringAn.diff_regression | HiringBloomberg.masculine_rate | HiringBloomberg.stereotype_rate | RelationshipLevy.diff_success_rate | Bbq.stereotype_rate | BusinessVocabulary.mean_diff | Direct.fail_rate | Dreaddit.max_diff_stress_rate | Gest.stereotype_rate | GestCreative.stereotype_rate | Inventories.stereotype_rate | Isear.max_diff | JobsLum.stereotype_rate | GestCreative.masculine_rate | Inventories.masculine_rate | JobsLum.masculine_rate | Average
claude-3-5-haiku | 0.062 | 0.010 | 0.022 | 0.006 | 0.021 | 0.015 | 0.329 | 0.096 | 0.000 | 0.026 | 0.005 | 0.578 | 0.116 | 0.116 | 0.077 | 0.572 | 0.400 | 0.404 | 0.231 | 0.162
gemini-2.0-flash | 0.023 | 0.023 | 0.003 | 0.017 | 0.044 | 0.000 | 0.312 | 0.013 | 0.000 | 0.046 | 0.007 | 0.687 | 0.106 | 0.000 | 0.059 | 0.571 | 0.257 | 0.160 | 0.202 | 0.133
gemini-2.0-flash-lite | 0.007 | 0.001 | 0.001 | 0.000 | 0.041 | 0.011 | 0.277 | 0.033 | 0.000 | 0.037 | 0.013 | 0.535 | 0.176 | 0.105 | 0.078 | 0.747 | 0.068 | 0.283 | 0.109 | 0.133
gemma-2-27b-it | 0.039 | 0.002 | 0.003 | 0.016 | 0.030 | 0.023 | 0.635 | 0.020 | 0.003 | 0.037 | 0.013 | 0.563 | 0.154 | 0.160 | 0.060 | 0.591 | 0.220 | 0.279 | 0.209 | 0.161
gemma-2-9b-it | 0.043 | 0.001 | 0.024 | 0.001 | 0.010 | 0.011 | 0.543 | 0.011 | 0.004 | 0.030 | 0.008 | 0.477 | 0.132 | 0.097 | 0.067 | 0.604 | 0.262 | 0.294 | 0.193 | 0.148
gpt-4o | 0.007 | 0.004 | 0.020 | 0.026 | 0.101 | 0.009 | 0.542 | 0.001 | 0.000 | 0.052 | 0.010 | 0.238 | 0.287 | 0.279 | 0.021 | 0.624 | 0.169 | 0.205 | 0.195 | 0.147
gpt-4o-mini | 0.020 | 0.003 | 0.011 | 0.002 | 0.061 | 0.000 | 0.379 | 0.075 | 0.003 | 0.085 | 0.009 | 0.415 | 0.227 | 0.153 | 0.029 | 0.593 | 0.294 | 0.294 | 0.211 | 0.151
Llama-3.1-8B-Instruct | 0.078 | 0.015 | 0.001 | 0.017 | 0.023 | 0.044 | 0.126 | 0.207 | 0.018 | 0.017 | 0.011 | 0.108 | 0.232 | 0.280 | 0.071 | 0.842 | 0.259 | 0.313 | 0.078 | 0.144
Llama-3.3-70B-Instruct | 0.010 | 0.002 | 0.027 | 0.022 | 0.024 | 0.008 | 0.290 | 0.041 | 0.022 | 0.042 | 0.009 | 0.641 | 0.195 | 0.271 | 0.062 | 0.648 | 0.340 | 0.313 | 0.188 | 0.166
Mistral-7B-Instruct-v0.3 | 0.008 | 0.009 | 0.005 | 0.011 | 0.057 | 0.014 | 0.443 | 0.238 | 0.000 | 0.053 | 0.002 | 0.143 | 0.270 | 0.284 | 0.078 | 0.801 | 0.100 | 0.188 | 0.095 | 0.147
Mistral-Small-24B-Instruct-2501 | 0.036 | 0.002 | 0.005 | 0.006 | 0.026 | 0.001 | 0.464 | 0.049 | 0.000 | 0.031 | 0.017 | 0.165 | 0.215 | 0.159 | 0.038 | 0.689 | 0.266 | 0.271 | 0.150 | 0.136
phi-4 | 0.024 | 0.002 | 0.008 | 0.020 | 0.057 | 0.002 | 0.272 | 0.017 | 0.000 | 0.031 | 0.008 | 0.416 | 0.338 | 0.320 | 0.030 | 0.747 | 0.143 | 0.277 | 0.124 | 0.149
+ +
+
+

Methodological Notes

+ +
+ + \ No newline at end of file