Open Source AI Test Tools
Guardrails
Guardrails AI, by Guardrails AI (cf. Guardrails Hub)
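The snippet below is a minimal sketch of validating LLM output with Guardrails AI. It assumes a validator such as ToxicLanguage has been installed from the Guardrails Hub; the import paths and validator arguments vary across guardrails versions, so treat the exact names as assumptions rather than the library's definitive API.

```python
# Minimal sketch, assuming `pip install guardrails-ai` and a validator installed from the
# Guardrails Hub (e.g. `guardrails hub install hub://guardrails/toxic_language`).
# Import paths and validator arguments vary across guardrails versions.
from guardrails import Guard
from guardrails.hub import ToxicLanguage

guard = Guard().use(ToxicLanguage, threshold=0.5, on_fail="exception")

try:
    outcome = guard.validate("Draft model output to be checked goes here.")
    print("passed:", outcome.validation_passed)
except Exception as err:
    print("blocked by guard:", err)
```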
Safety and Trustworthiness
ChatProtect (see "Self-Contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation", Mündler et al., 2024)
DecodingTrust (see "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models", Wang et al., 2023) by AI Secure (UIUC Secure Learning Lab)
HELM: Holistic Evaluation of Language Models, by Stanford CRFM
LLM-rules (see "Can LLMs Follow Simple Rules?", Mu et al., 2023)
TruthfulQA (see "TruthfulQA: Measuring How Models Mimic Human Falsehoods", Lin et al., 2022); a dataset-loading sketch follows this list
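Most of the tools above ship their own harnesses; as one concrete example of pulling an item from this list into a test pipeline, the sketch below loads TruthfulQA from the Hugging Face Hub. The dataset id and field names are taken from the Hub card and may change.

```python
# Dataset id and field names follow the Hugging Face Hub card for TruthfulQA.
from datasets import load_dataset

tqa = load_dataset("truthful_qa", "generation", split="validation")
example = tqa[0]
print(example["question"])
print("best answer:", example["best_answer"])
print("known falsehoods:", example["incorrect_answers"][:2])
```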
Explainability and Transparency
SHAP: SHapley Additive exPlanations (see "A Unified Approach to Interpreting Model Predictions", Lundberg et al., 2017); a combined SHAP and LIME sketch follows this list
LIME: Local Interpretable Model-agnostic Explanations (see "'Why Should I Trust You?': Explaining the Predictions of Any Classifier", Ribeiro et al., 2016), by Marco Tulio Correia Ribeiro, Google DeepMind
RepE: Representation Engineering (see "Representation Engineering: A Top-Down Approach to AI Transparency", Zou et al., 2023)
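Since SHAP and LIME are the two most commonly scripted entries here, the following sketch runs both against the same scikit-learn classifier. The dataset and model are placeholders chosen only to keep the example self-contained.

```python
# Combined SHAP and LIME sketch; the scikit-learn dataset and random forest are placeholders.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# SHAP: Shapley-value attributions for tree ensembles.
shap_values = shap.TreeExplainer(model).shap_values(data.data[:50])
print("mean |SHAP value|:", float(np.abs(np.array(shap_values)).mean()))

# LIME: fit a sparse local surrogate model around a single prediction.
explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
explanation = explainer.explain_instance(data.data[0], model.predict_proba, num_features=5)
print(explanation.as_list())
```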
Security
CyberSecEval: the Purple Llama cybersecurity benchmark suite, by Meta
garak: an LLM security scanner, by Leon Derczynski (NVIDIA); an invocation sketch follows this list
Rebuff by Protect AI
Cf. the LLM Security & Privacy resource list maintained by Chawin Sitawarin
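garak is driven from the command line; the call below is a hedged sketch of scanning a small Hugging Face model with its DAN-style jailbreak probes. The flag names follow garak's CLI help, but probe names change between releases, so listing them first is advisable.

```python
# Hedged sketch of driving garak from Python via its CLI.
import subprocess

# Probe names vary by garak release, so enumerate them first.
subprocess.run(["python", "-m", "garak", "--list_probes"], check=True)

subprocess.run(
    [
        "python", "-m", "garak",
        "--model_type", "huggingface",   # generator family
        "--model_name", "gpt2",          # small placeholder model
        "--probes", "dan",               # DAN-style jailbreak probes (name is an assumption)
    ],
    check=True,
)
```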
Deepfake Detection
AugLy: an augmentations library offering function- and class-based audio, image, text, and video transforms, by Meta Research (see documentation)
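In a deepfake-detection workflow, AugLy's transforms are typically used to generate manipulated variants of clean media for training or stress-testing detectors. The sketch below uses two of its text transforms because they need no media files; the image and audio modules follow the same function- and class-based pattern (function names taken from the AugLy docs).

```python
# Two of AugLy's text transforms; augly.image and augly.audio expose analogous APIs.
import augly.text as textaugs

texts = ["The quick brown fox jumps over the lazy dog."]
print(textaugs.simulate_typos(texts))
print(textaugs.insert_punctuation_chars(texts))
```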
Adversarial Attack
Adversarial Robustness Toolbox (ART) by LF AI & Data Trusted AI (see documentation); an evasion-attack sketch follows this list
LLM Attacks (see "Universal and Transferable Adversarial Attacks on Aligned Language Models", Zou et al., 2023)
TextAttack (see "TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP", Morris et al., 2020) (see documentation)
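As an illustration of how these libraries get scripted, the sketch below runs ART's FastGradientMethod evasion attack against a deliberately tiny, untrained PyTorch classifier on random data. The model and inputs are placeholders; only the ART class names and arguments are meant to mirror the library's documented API.

```python
# Sketch of an ART evasion attack; the untrained model and random inputs are placeholders.
import numpy as np
import torch
import torch.nn as nn
from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import PyTorchClassifier

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder classifier
classifier = PyTorchClassifier(
    model=model,
    loss=nn.CrossEntropyLoss(),
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    input_shape=(1, 28, 28),
    nb_classes=10,
    clip_values=(0.0, 1.0),
)

x = np.random.rand(8, 1, 28, 28).astype(np.float32)  # placeholder input batch
x_adv = FastGradientMethod(estimator=classifier, eps=0.1).generate(x=x)

clean_preds = classifier.predict(x).argmax(axis=1)
adv_preds = classifier.predict(x_adv).argmax(axis=1)
print("predictions flipped by the attack:", int((clean_preds != adv_preds).sum()))
```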
Regulatory Compliance
COMPL-AI: an open-source, compliance-centered evaluation framework that assesses generative AI models against the EU AI Act, combining several open source AI test tools and benchmarks
Corporate Social Responsibility (CSR)
Adaptation Vectors: a simple modeling tool for analyzing the rate of technological change versus workers' ability to adapt to it, by Aethercloud
Bias
BBQ (see "BBQ: A Hand-Built Bias Benchmark for Question Answering", Parrish et al., 2022)
BOLD (see "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation", Dhamala et al., 2021)
Evaluate by Hugging Face, which includes the HONEST measurement (see "HONEST: Measuring Hurtful Sentence Completion in Language Models", Nozza et al., 2021); a usage sketch follows this list
RedditBias (see "RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models", Barikeri et al., 2021)
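The Hugging Face Evaluate entry above is the easiest to script: the sketch below loads the HONEST measurement and scores toy completions per identity group. The measurement id, the token-list input format, and the result key follow the measurement card and should be treated as assumptions that may shift between evaluate releases.

```python
# Measurement id, input format, and result key follow the Hugging Face "honest" card.
import evaluate

honest = evaluate.load("honest", "en")

# Each completion is a list of tokens; `groups` supplies one identity label per completion.
completions = [
    ["she", "works", "as", "a", "nurse"],
    ["she", "dreams", "of", "being", "a", "doctor"],
    ["he", "works", "as", "an", "engineer"],
    ["he", "dreams", "of", "being", "a", "doctor"],
]
groups = ["female", "female", "male", "male"]

result = honest.compute(predictions=completions, groups=groups)
print(result["honest_score_per_group"])
```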
Benchmarks
BIG-bench (see "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models", Srivastava et al., 2022)
BoolQ (see "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions", Clark et al., 2019)
HellaSwag (see "HellaSwag: Can a Machine Really Finish Your Sentence?", Zellers et al., 2019)
MMLU: Massive Multitask Language Understanding (see "Measuring Massive Multitask Language Understanding", Hendrycks et al., 2021); several of these benchmarks load directly from the Hugging Face Hub, as sketched after this list
RealToxicityPrompts (see "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models", Gehman et al., 2020)
TriviaQA (see "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension", Joshi et al., 2017)
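Several of the benchmarks above are mirrored on the Hugging Face Hub, so a quick smoke test can pull them with the datasets library. The dataset ids and field names below are commonly used mirrors and may differ from the canonical releases.

```python
# Hub ids and field names are common mirrors of these benchmarks; treat them as assumptions.
from datasets import load_dataset

boolq = load_dataset("google/boolq", split="validation")
hellaswag = load_dataset("Rowan/hellaswag", split="validation")
mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")

print(boolq[0]["question"], "->", boolq[0]["answer"])
print(hellaswag[0]["ctx"], hellaswag[0]["endings"])
print(mmlu[0]["question"], mmlu[0]["choices"], "answer index:", mmlu[0]["answer"])
```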
Coding
HumanEval (see "Evaluating Large Language Models Trained on Code", Chen et al., 2021)
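The sketch below follows the usage pattern in the openai/human-eval repository: read the problems, attach one completion per task, write a JSONL file, and score it with the package's evaluate_functional_correctness command. Here generate_completion is a deliberately trivial stub standing in for the model under test.

```python
# Follows the usage pattern of openai/human-eval; generate_completion is a placeholder stub.
from human_eval.data import read_problems, write_jsonl

def generate_completion(prompt: str) -> str:
    return "    return 0\n"  # trivial stand-in completion, just to show the expected format

problems = read_problems()  # {task_id: {"prompt": ..., "entry_point": ..., "test": ...}}
samples = [
    {"task_id": task_id, "completion": generate_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Score the file with the package's CLI: evaluate_functional_correctness samples.jsonl
```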
Reasoning
ARC (see "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge", Clark et al., 2018)
Superhuman AI Consistency (see "Evaluating Superhuman Models with Consistency Checks", Fluri et al., 2024)
Fairness
FaiRLLM (see "Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation", Zhang et al., 2023)