Publications

BELIEVE: Belief-enhanced instruction generation and augmentation for zero-shot bias mitigation

Abstract

Language models, pretrained on large amounts of unmoderated content, have been shown to contain societal biases. Mitigating such biases typically requires access to model parameters and training schemas. In this work, we address bias mitigation at inference time, so that it can be applied to any black-box model. To this end, we propose a belief generation and augmentation framework, BELIEVE, that demonstrates effective bias mitigation for natural language generation by augmenting input prompts with automatically generated instruction-based beliefs. Our framework removes the bottleneck of manually crafting these instruction-based beliefs by extending a recently proposed iterative in-context learning framework to automatically generate beliefs via a language model. We assess the impact of this system on fairness, and demonstrate effective bias mitigation on pretrained and instruction-tuned models for both sentiment and regard with respect to multiple protected classes, including race, gender, and political ideology.
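The core inference-time idea described above can be sketched minimally: prepend instruction-based beliefs to the input prompt before it reaches a black-box model. The belief text, prompt, and `augment_prompt` helper below are illustrative assumptions, not the paper's exact templates; in BELIEVE the beliefs are generated automatically by a language model rather than written by hand.

```python
# Minimal sketch of inference-time prompt augmentation with
# instruction-based beliefs. All strings here are hypothetical
# placeholders; the framework generates beliefs automatically.

def augment_prompt(prompt: str, beliefs: list[str]) -> str:
    """Prepend instruction-based beliefs to a user prompt so a
    black-box model can be steered without modifying its parameters."""
    belief_block = "\n".join(f"Instruction: {b}" for b in beliefs)
    return f"{belief_block}\nPrompt: {prompt}"

# Hypothetical belief and prompt for illustration only.
beliefs = ["Describe all demographic groups with equal regard."]
augmented = augment_prompt("The politician from that party was", beliefs)
print(augmented)
```

Because the augmentation happens purely at the text level, the same wrapper works for any model exposed only through a generation API.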

Authors
Lisa Bauer, Ninareh Mehrabi, Palash Goyal, Kai-Wei Chang, Aram Galstyan, Rahul Gupta
Conference
Proceedings of the 4th Workshop on Trustworthy Natural Language Processing (TrustNLP 2024)
Pages
239-251