Open Source AI Test Tools
Guardrails
Guardrails AI, by Guardrails AI (cf. Guardrails Hub)
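The snippet below is a minimal sketch of validating LLM output with Guardrails AI. It assumes a validator such as ToxicLanguage has been installed from the Guardrails Hub; the import paths and validator arguments vary across guardrails versions, so treat the exact names as assumptions rather than the library's definitive API.

```python
# Minimal sketch, assuming `pip install guardrails-ai` and a validator installed from the
# Guardrails Hub (e.g. `guardrails hub install hub://guardrails/toxic_language`).
# Import paths and validator arguments vary across guardrails versions.
from guardrails import Guard
from guardrails.hub import ToxicLanguage

guard = Guard().use(ToxicLanguage, threshold=0.5, on_fail="exception")

try:
    outcome = guard.validate("Draft model output to be checked goes here.")
    print("passed:", outcome.validation_passed)
except Exception as err:
    print("blocked by guard:", err)
```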
Safety and Trustworthiness
ChatProtect (see "Self-Contradictory Hallucinations of Large Language Models: Evaluation, Detection and Mitigation", Mündler et al., 2024)
DecodingTrust (see "DecodingTrust: A Comprehensive Assessment of Trustworthiness in GPT Models", Wang et al., 2023) by AI Secure (UIUC Secure Learning Lab)
HELM: Holistic Evaluation of Language Models, by Stanford CRFM
LLM-rules (see "Can LLMs Follow Simple Rules?", Mu et al., 2023)
TruthfulQA (see "TruthfulQA: Measuring How Models Mimic Human Falsehoods", Lin et al., 2022); a dataset-loading sketch follows this list
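Most of the tools above ship their own harnesses; as one concrete example of pulling an item from this list into a test pipeline, the sketch below loads TruthfulQA from the Hugging Face Hub. The dataset id and field names are taken from the Hub card and may change.

```python
# Dataset id and field names follow the Hugging Face Hub card for TruthfulQA.
from datasets import load_dataset

tqa = load_dataset("truthful_qa", "generation", split="validation")
example = tqa[0]
print(example["question"])
print("best answer:", example["best_answer"])
print("known falsehoods:", example["incorrect_answers"][:2])
```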
Explainability and Transparency
SHAP: SHapley Additive exPlanations (see "A Unified Approach to Interpreting Model Predictions", Lundberg et al., 2017); a combined SHAP and LIME sketch follows this list
LIME: Local Interpretable Model-agnostic Explanations (see "'Why Should I Trust You?': Explaining the Predictions of Any Classifier", Ribeiro et al., 2016), by Marco Tulio Correia Ribeiro, Google DeepMind
RepE: Representation Engineering (see "Representation Engineering: A Top-Down Approach to AI Transparency", Zou et al., 2023)
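Since SHAP and LIME are the two most commonly scripted entries here, the following sketch runs both against the same scikit-learn classifier. The dataset and model are placeholders chosen only to keep the example self-contained.

```python
# Combined SHAP and LIME sketch; the scikit-learn dataset and random forest are placeholders.
import numpy as np
import shap
from lime.lime_tabular import LimeTabularExplainer
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

data = load_breast_cancer()
model = RandomForestClassifier(n_estimators=100, random_state=0).fit(data.data, data.target)

# SHAP: Shapley-value attributions for tree ensembles.
shap_values = shap.TreeExplainer(model).shap_values(data.data[:50])
print("mean |SHAP value|:", float(np.abs(np.array(shap_values)).mean()))

# LIME: fit a sparse local surrogate model around a single prediction.
explainer = LimeTabularExplainer(
    data.data,
    feature_names=list(data.feature_names),
    class_names=list(data.target_names),
    mode="classification",
)
explanation = explainer.explain_instance(data.data[0], model.predict_proba, num_features=5)
print(explanation.as_list())
```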
Security
CyberSecEval: the Purple Llama cybersecurity benchmark suite, by Meta
garak: an LLM security scanner, by Leon Derczynski (NVIDIA); an invocation sketch follows this list
Rebuff by Protect AI
Cf. the LLM Security & Privacy resource list maintained by Chawin Sitawarin
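garak is driven from the command line; the call below is a hedged sketch of scanning a small Hugging Face model with its DAN-style jailbreak probes. The flag names follow garak's CLI help, but probe names change between releases, so listing them first is advisable.

```python
# Hedged sketch of driving garak from Python via its CLI.
import subprocess

# Probe names vary by garak release, so enumerate them first.
subprocess.run(["python", "-m", "garak", "--list_probes"], check=True)

subprocess.run(
    [
        "python", "-m", "garak",
        "--model_type", "huggingface",   # generator family
        "--model_name", "gpt2",          # small placeholder model
        "--probes", "dan",               # DAN-style jailbreak probes (name is an assumption)
    ],
    check=True,
)
```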
Deepfake Detection
AugLy: an augmentations library offering function- and class-based audio, image, text, and video transforms, by Meta Research (see documentation)
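In a deepfake-detection workflow, AugLy's transforms are typically used to generate manipulated variants of clean media for training or stress-testing detectors. The sketch below uses two of its text transforms because they need no media files; the image and audio modules follow the same function- and class-based pattern (function names taken from the AugLy docs).

```python
# Two of AugLy's text transforms; augly.image and augly.audio expose analogous APIs.
import augly.text as textaugs

texts = ["The quick brown fox jumps over the lazy dog."]
print(textaugs.simulate_typos(texts))
print(textaugs.insert_punctuation_chars(texts))
```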
Adversarial Attack
Adversarial Robustness Toolbox (ART) by LF AI & Data Trusted AI (see documentation); an evasion-attack sketch follows this list
LLM Attacks (see "Universal and Transferable Adversarial Attacks on Aligned Language Models", Zou et al., 2023)
TextAttack (see "TextAttack: A Framework for Adversarial Attacks, Data Augmentation, and Adversarial Training in NLP", Morris et al., 2020) (see documentation)
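As an illustration of how these libraries get scripted, the sketch below runs ART's FastGradientMethod evasion attack against a deliberately tiny, untrained PyTorch classifier on random data. The model and inputs are placeholders; only the ART class names and arguments are meant to mirror the library's documented API.

```python
# Sketch of an ART evasion attack; the untrained model and random inputs are placeholders.
import numpy as np
import torch
import torch.nn as nn
from art.attacks.evasion import FastGradientMethod
from art.estimators.classification import PyTorchClassifier

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 10))  # placeholder classifier
classifier = PyTorchClassifier(
    model=model,
    loss=nn.CrossEntropyLoss(),
    optimizer=torch.optim.Adam(model.parameters(), lr=1e-3),
    input_shape=(1, 28, 28),
    nb_classes=10,
    clip_values=(0.0, 1.0),
)

x = np.random.rand(8, 1, 28, 28).astype(np.float32)  # placeholder input batch
x_adv = FastGradientMethod(estimator=classifier, eps=0.1).generate(x=x)

clean_preds = classifier.predict(x).argmax(axis=1)
adv_preds = classifier.predict(x_adv).argmax(axis=1)
print("predictions flipped by the attack:", int((clean_preds != adv_preds).sum()))
```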
Regulatory Compliance
COMPL-AI: an open-source, compliance-centered evaluation framework that assesses generative AI models against the EU AI Act, combining several open source AI test tools and benchmarks
Corporate Social Responsibility (CSR)
Adaptation Vectors: a simple modeling tool for analyzing the rate of technological change versus workers' ability to adapt to it, by Aethercloud
Bias
BBQ (see "BBQ: A Hand-Built Bias Benchmark for Question Answering", Parrish et al., 2022)
BOLD (see "BOLD: Dataset and Metrics for Measuring Biases in Open-Ended Language Generation", Dhamala et al., 2021)
Evaluate by Hugging Face, which includes the HONEST measurement (see "HONEST: Measuring Hurtful Sentence Completion in Language Models", Nozza et al., 2021); a usage sketch follows this list
RedditBias (see "RedditBias: A Real-World Resource for Bias Evaluation and Debiasing of Conversational Language Models", Barikeri et al., 2021)
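The Hugging Face Evaluate entry above is the easiest to script: the sketch below loads the HONEST measurement and scores toy completions per identity group. The measurement id, the token-list input format, and the result key follow the measurement card and should be treated as assumptions that may shift between evaluate releases.

```python
# Measurement id, input format, and result key follow the Hugging Face "honest" card.
import evaluate

honest = evaluate.load("honest", "en")

# Each completion is a list of tokens; `groups` supplies one identity label per completion.
completions = [
    ["she", "works", "as", "a", "nurse"],
    ["she", "dreams", "of", "being", "a", "doctor"],
    ["he", "works", "as", "an", "engineer"],
    ["he", "dreams", "of", "being", "a", "doctor"],
]
groups = ["female", "female", "male", "male"]

result = honest.compute(predictions=completions, groups=groups)
print(result["honest_score_per_group"])
```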
Benchmarks
BIG-bench (see "Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models", Srivastava et al., 2022)
BoolQ (see "BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions", Clark et al., 2019)
HellaSwag (see "HellaSwag: Can a Machine Really Finish Your Sentence?", Zellers et al., 2019)
MMLU: Massive Multitask Language Understanding (see "Measuring Massive Multitask Language Understanding", Hendrycks et al., 2021); several of these benchmarks load directly from the Hugging Face Hub, as sketched after this list
RealToxicityPrompts (see "RealToxicityPrompts: Evaluating Neural Toxic Degeneration in Language Models", Gehman et al., 2020)
TriviaQA (see "TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension", Joshi et al., 2017)
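Several of the benchmarks above are mirrored on the Hugging Face Hub, so a quick smoke test can pull them with the datasets library. The dataset ids and field names below are commonly used mirrors and may differ from the canonical releases.

```python
# Hub ids and field names are common mirrors of these benchmarks; treat them as assumptions.
from datasets import load_dataset

boolq = load_dataset("google/boolq", split="validation")
hellaswag = load_dataset("Rowan/hellaswag", split="validation")
mmlu = load_dataset("cais/mmlu", "abstract_algebra", split="test")

print(boolq[0]["question"], "->", boolq[0]["answer"])
print(hellaswag[0]["ctx"], hellaswag[0]["endings"])
print(mmlu[0]["question"], mmlu[0]["choices"], "answer index:", mmlu[0]["answer"])
```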
Coding
HumanEval (see "Evaluating Large Language Models Trained on Code", Chen et al., 2021)
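The sketch below follows the usage pattern in the openai/human-eval repository: read the problems, attach one completion per task, write a JSONL file, and score it with the package's evaluate_functional_correctness command. Here generate_completion is a deliberately trivial stub standing in for the model under test.

```python
# Follows the usage pattern of openai/human-eval; generate_completion is a placeholder stub.
from human_eval.data import read_problems, write_jsonl

def generate_completion(prompt: str) -> str:
    return "    return 0\n"  # trivial stand-in completion, just to show the expected format

problems = read_problems()  # {task_id: {"prompt": ..., "entry_point": ..., "test": ...}}
samples = [
    {"task_id": task_id, "completion": generate_completion(problem["prompt"])}
    for task_id, problem in problems.items()
]
write_jsonl("samples.jsonl", samples)
# Score the file with the package's CLI: evaluate_functional_correctness samples.jsonl
```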
Reasoning
ARC (see "Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge", Clark et al., 2018)
Superhuman AI Consistency (see "Evaluating Superhuman Models with Consistency Checks", Fluri et al., 2024)
Fairness
FaiRLLM (see "Is ChatGPT Fair for Recommendation? Evaluating Fairness in Large Language Model Recommendation", Zhang et al., 2023)