Principal Investigator: Dr hab. Margaret Ohia-Nowak, Prof. UMCS
Maria Curie-Skłodowska University in Lublin
Panel: HS2
Funding scheme: SONATA 20, announced on 16 September 2024
The emergence of ChatGPT at the end of 2022 revolutionised the way digital content is created, processed and reproduced. It quickly became apparent, however, that the data on which generative AI models, including large language models (LLMs), are trained reflect societal stereotypes and biases. The aim of the project is to investigate the extent to which this also applies to text and images generated by Polish-language large language models. The starting point is therefore the question: what happens when algorithms, learning from our language habits, replicate stereotypes and biases? The research focuses on so-called racialising discourses, that is, linguistic and visual representations of people of different skin colours, particularly non-white people, which may reinforce racism.
Dr hab. Margaret Ohia-Nowak, Prof. UMCS; photo: Łukasz Bera
In Poland, issues related to cultural diversity, women’s rights and hate speech frequently appear in contemporary public debate, and gender and racial stereotypes are constantly reinforced by algorithms, with a huge impact on social behaviour. Recent research shows this clearly: men are portrayed as leaders and women as assistants. This phenomenon intensifies in automated recommendation systems and chatbots, where decisions are made with minimal human oversight, shaping everything from what we see on social media to the tone and content of responses given to users. Algorithms may inadvertently favour some groups and discriminate against others, especially those vulnerable to exclusion and marginalisation.
So far, research on AI language models has focused mainly on English-language models and has not taken into account the specificities of Slavic languages or the Polish cultural context. In languages other than English, high-quality data and tools are lacking, which reduces the effectiveness of hate speech detection and increases model bias. Although methods for mitigating biases are emerging, and a few studies of Slavic-language models exist, we still know too little about how these mechanisms operate in our linguistic and cultural context. This project fills that gap: the research will produce a methodology that helps prevent, limit, and perhaps even eliminate racialising content on the internet and in public communication.
One of the fundamental components of the project is the analysis of a corpus of data generated by large language models, examining the overt and latent racialising mechanisms these models replicate. An important complement to the analysis of the linguistic-visual material will be interviews with experts who research and build artificial intelligence models and large language models for the Polish market. Another essential stage of the project consists of interviews with ethnically diverse users of Polish LLMs. Particularly valuable is the perspective of people exposed to racism in Poland, including, among others, people of African, Roma and Asian descent, and of how toxic AI content affects their everyday experiences.
The results of the corpus analysis will therefore be combined with analysis of users’ language experiences. These will, in turn, be used to develop a tool for testing the presence of biases in Polish language models. The project will enable mapping the most common linguistic and visual racialising discourses in Polish LLMs. Its outcomes will also include the development of an interdisciplinary perspective combining critical linguistics, sociolinguistics and media linguistics for use in research on AI algorithms in Polish language models. By doing so, it will support the creation of more inclusive and equitable digital technologies.
Polish AI models will thus be able to use language that does not hurt or exclude, and the digital environment can become more inclusive and safer.
Project title: Multimodal Racialising Discourses in Polish Large Language Models
Dr hab. Margaret Ohia-Nowak, Prof. UMCS
Professor at the Institute of Media and Communications of Maria Curie-Skłodowska University in Lublin. Author of the book Antyczarny rasizm. Język – dyskurs – komunikacja (2025) (Eng. Antiblack Racism: Language – Discourse – Communication). She was a Fulbright scholarship grantee at the University of California, Berkeley, and has undertaken research stays at the University of Amsterdam, City University London, and the Centre of Discourse Studies in Barcelona. She has delivered guest lectures at, among others, Stanford University, the European University Institute in Florence and Charles University in Prague. She is a recipient of the international Emma Goldman Award. She has been principal investigator and researcher in national and international grants, including projects funded by the National Science Centre, the European Commission and the United Nations.