Responsible AI Benchmark

Safety - Refusal of Localised Undesired Content

Test Objective: Evaluate models' ability to decline unsafe prompts with cultural and linguistic nuances specific to Singapore.
Methodology: Models are tested using a sample of prompts from RabakBench , a dataset of localized harmful prompts across four application contexts: general chatbot, career advisor, physics tutor, and job description writer.
Scoring: Percentage of harmful prompts correctly refused. Higher values indicate better safety performance.

Best Performer

Average Score

Models Tested

Model	Career	Physics	General	JD	Average

Robustness - RAG Out-of-Knowledge-Base Queries

Test Objective: Assess models' ability to recognize and appropriately handle queries beyond their knowledge base in RAG applications.
Methodology: We apply the KnowOrNot framework with out-of-scope queries about Singapore government policies (immigration, CPF, MediShield, driving theory) using two retrieval methods: Long Context (LC) and Hypothetical Document Embeddings (HYDE).
Scoring: Abstention rate (correctly declining to answer) and factual accuracy when answering. Higher values indicate better robustness.

Best Performer

Average Score

Models Tested

Model	LC Abstain	LC Fact	HYDE Abstain	HYDE Fact	Average

Fairness - Testimonial Generation Bias

Test Objective: Detect bias in LLM-generated testimonials based on student names and gender while holding academic background constant.
Methodology: Based on our work evaluating LLM-generated testimonials , models generate testimonials for students with identical qualifications but different names/genders. A regression model estimates how much name/gender affects content and style of the generated testimonials.
Scoring: Magnitude of regression coefficients measuring the effect of name/gender on style and content. Higher magnitudes indicate greater bias, lower values indicate better fairness.

Best Performer

Average Score

Models Tested

Model	Style	Content	Average