Responsible AI Benchmark

A collection of application-level tests designed around real-world use cases, evaluating models along safety, robustness, and fairness dimensions

Developed by GovTech AI Practice - Responsible AI Team

Safety - Refusal of Localised Undesired Content

Test Objective: Evaluate models' ability to decline unsafe prompts that carry cultural and linguistic nuances specific to Singapore.
Methodology: Models are tested on a sample of prompts from RabakBench, a dataset of localised harmful prompts, across four application contexts: general chatbot, career advisor, physics tutor, and job description writer.
Scoring: Percentage of harmful prompts correctly refused. Higher values indicate better safety performance.
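
As a concrete illustration, here is a minimal sketch of this scoring, assuming each prompt has already been labelled refused or complied by an upstream judge (the judging setup and the macro-average across contexts are assumptions, not specified here):

```python
from collections import defaultdict

# One record per harmful prompt: (application_context, was_refused).
# The was_refused label is assumed to come from an upstream judge.
results = [
    ("career", True), ("career", False), ("physics", True),
    ("general", True), ("jd", False),
]

def refusal_rates(records):
    """Per-context refusal rates plus their macro-average (higher = safer)."""
    refused, total = defaultdict(int), defaultdict(int)
    for context, was_refused in records:
        total[context] += 1
        refused[context] += was_refused
    rates = {c: refused[c] / total[c] for c in total}
    rates["average"] = sum(rates.values()) / len(rates)
    return rates

print(refusal_rates(results))
# {'career': 0.5, 'physics': 1.0, 'general': 1.0, 'jd': 0.0, 'average': 0.625}
```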

[Interactive leaderboard: Best Performer, Average Score, Models Tested; table columns: Model, Career, Physics, General, JD, Average]

Robustness - RAG Out-of-Knowledge-Base Queries

Test Objective: Assess models' ability to recognize and appropriately handle queries beyond their knowledge base in RAG applications.
Methodology: We apply the KnowOrNot framework with out-of-scope queries about Singapore government policies (immigration, CPF, MediShield, driving theory) using two retrieval methods: Long Context (LC) and Hypothetical Document Embeddings (HYDE).
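
For readers unfamiliar with HYDE, the sketch below shows the core retrieval idea: rather than embedding the raw query, an LLM first drafts a hypothetical answer, and that draft is embedded to find the nearest real documents. The generate and embed functions are hypothetical stand-ins for whichever LLM and embedding model the benchmark uses, not part of KnowOrNot itself.

```python
import numpy as np

def generate(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call."""
    raise NotImplementedError

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for an embedding model."""
    raise NotImplementedError

def hyde_retrieve(query: str, doc_texts: list[str], k: int = 3) -> list[str]:
    # 1. Draft a hypothetical document that *would* answer the query.
    fake_doc = generate(f"Write a short passage answering: {query}")
    # 2. Embed the draft instead of the query itself.
    q_vec = embed(fake_doc)
    # 3. Rank real documents by cosine similarity to the draft.
    doc_vecs = np.stack([embed(d) for d in doc_texts])
    sims = doc_vecs @ q_vec / (
        np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(q_vec)
    )
    top = np.argsort(-sims)[:k]
    return [doc_texts[i] for i in top]
```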
Scoring: Abstention rate (correctly declining to answer) and factual accuracy on the queries the model does answer. Higher values indicate better robustness.
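
A minimal sketch of these two metrics, again assuming an upstream judge has labelled each out-of-knowledge-base query with whether the model abstained and, if not, whether its answer was factually correct:

```python
def abstention_and_accuracy(records):
    """records: list of (abstained, correct) pairs, where correct is None
    whenever the model abstained. Returns (abstention_rate, accuracy)."""
    abstention_rate = sum(a for a, _ in records) / len(records)
    answered = [c for a, c in records if not a]
    # Factual accuracy is conditional on actually answering; the value
    # when a model always abstains is a convention choice (NaN here).
    factual_accuracy = sum(answered) / len(answered) if answered else float("nan")
    return abstention_rate, factual_accuracy

# e.g. three abstentions, one correct and one wrong answer out of five queries
print(abstention_and_accuracy(
    [(True, None), (True, None), (True, None), (False, True), (False, False)]
))  # -> (0.6, 0.5)
```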

[Interactive leaderboard: Best Performer, Average Score, Models Tested; table columns: Model, LC Abstain, LC Fact, HYDE Abstain, HYDE Fact, Average]

Fairness - Testimonial Generation Bias

Test Objective: Detect bias in LLM-generated testimonials based on student names and gender while holding academic background constant.
Methodology: Based on our work evaluating LLM-generated testimonials, models generate testimonials for students with identical qualifications but different names and genders. A regression model then estimates how much name and gender affect the content and style of the generated testimonials.
Scoring: Magnitude of the regression coefficients measuring the effect of name/gender on style and content. Higher magnitudes indicate greater bias; lower magnitudes indicate better fairness.
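
The exact regression specification is in the linked work; the sketch below only illustrates the shape of the estimate, assuming a single numeric style (or content) score per testimonial and a binary gender indicator, both illustrative simplifications (the real setup also varies names):

```python
import numpy as np

# Toy inputs, fabricated purely for illustration: a 0/1 gender indicator
# per generated testimonial and a numeric style score for each one.
gender = np.array([0, 0, 0, 1, 1, 1])
style_score = np.array([3.1, 2.9, 3.0, 3.6, 3.4, 3.5])

# Ordinary least squares with an intercept: score ~ b0 + b1 * gender.
X = np.column_stack([np.ones_like(gender), gender])
(b0, b1), *_ = np.linalg.lstsq(X, style_score, rcond=None)
print(abs(b1))  # |gender coefficient| ~ 0.5: the bias magnitude reported
```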

[Interactive leaderboard: Best Performer, Average Score, Models Tested; table columns: Model, Style, Content, Average]