Methods for anonymising qualitative data

For questions, contact your privacy officer (PO). On My EUR you can find contact details for your faculty PO.

Things to think about before data collection

One of the best ways to protect the privacy of research participants is not to collect certain identifiable information at all. While planning your research, please consider data minimisation. For example, do not ask for participants' full names during a recorded interview.

In the absence of consent, the data you disclose must be anonymous. Anonymisation is best planned early in the research process, to help reduce anonymisation costs. Note that removing too much information from qualitative data, such as text or audio/video recordings, can distort the data, making them unusable, unreliable or misleading. To balance privacy protection with keeping data useful, anonymisation should be considered alongside informed consent and access controls.

Planning ahead and agreeing with participants during the consent process on what may and may not be recorded or transcribed can be a much more effective way of creating data that accurately represents the research process and the contribution of participants. For example, if an employer’s name cannot be disclosed, it should be agreed in advance that it will not be mentioned during an interview. This is easier than spending time later removing it from a recording or transcript.

Personal data contains information that directly or indirectly identifies a natural person (for definitions and examples see this link). Generally speaking, direct identifiers and strong indirect identifiers need to be removed or replaced with pseudonyms. Indirect identifiers can either be removed or categorised. In the case of qualitative data, categorising means coarsening identifying information. This concerns indirect identifiers such as: Postal code, District/Part of town, Municipality of residence, Region, Municipality type, Year of birth, Age, Household composition, Occupation, Education, Mother tongue, Nationality, Workplace/Employer, Crime or punishment, Position of trust or membership, as well as all special categories of personal data.
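As an illustration of what categorising (coarsening) can look like in practice, here is a minimal Python sketch. The function names, the band width, the number of postal code characters kept and the occupation lookup are all assumptions chosen for this example, not prescribed values; the appropriate level of coarsening depends on your data and should be discussed with your privacy officer.

```python
def age_band(age: int, width: int = 10) -> str:
    """Replace an exact age with a band, e.g. 34 becomes '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def coarsen_postal_code(code: str, keep: int = 2) -> str:
    """Keep only the first `keep` characters of a postal code and mask the rest."""
    compact = code.replace(" ", "")
    return compact[:keep] + "x" * (len(compact) - keep)


# Hypothetical lookup table: specific occupation -> broader category.
OCCUPATION_CATEGORIES = {
    "cardiologist": "medical specialist",
    "welder": "skilled trade worker",
}


def coarsen_occupation(occupation: str) -> str:
    """Map a specific occupation to a broader category, if one is defined."""
    return OCCUPATION_CATEGORIES.get(occupation.lower(), "other occupation")
```

For example, `age_band(34)` gives `'30-39'` and `coarsen_postal_code("3062 PA")` gives `'30xxxx'`.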

Best practices for pseudonymisation/anonymisation of qualitative data

Anonymisation of audio-visual data, such as editing of digital images or audio recordings, should be done sensitively. Bleeping out real names or place names is acceptable, but disguising voices by altering the pitch in a recording, or obscuring faces by pixelating sections of a video image, significantly reduces the usefulness of data. These processes are also highly labour intensive and expensive.

If confidentiality of audio-visual data is an issue, it is better to obtain the participant’s consent to use and share the data unaltered. Where anonymisation would result in too much loss of data content, regulating access to data can be considered as a better strategy.

  • Plan anonymisation (and experiment with a couple of files) at the time of transcription or initial write-up (longitudinal studies may be an exception if relationships between waves of interviews need special attention for harmonised editing). 

  • Use pseudonyms or generic descriptors to edit identifying information, rather than blanking-out that information. 

  • Use pseudonyms or replacements that are consistent throughout the research team and the project. For example, use the same pseudonyms in publications and follow-up research.

  • Identify replacements in text clearly, for example with [brackets] or using XML tags such as <seg>word to be anonymised</seg>.

  • Use 'search and replace' techniques carefully so that unintended changes are not made, and misspelt words are not missed.

  • Keep unedited versions of data (but store them separately) for use within the research team and for preservation (for persons who have both the unedited version and the anonymised version, the data is pseudonymised).

  • Create a pseudonymisation key (also known as an anonymisation log) of all replacements, aggregations or removals made and store such a log securely and separately from the anonymised data files.
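Several of the practices above (consistent pseudonyms, clearly marked replacements, careful search and replace, and a log of all changes) can be combined in a short script. The following is a minimal sketch in Python, not a complete anonymisation tool. The function name `pseudonymise` and the example names are hypothetical, and the sketch only finds exact spellings, so misspelt identifiers still need a manual check.

```python
import re


def pseudonymise(text, replacements):
    """Replace each identifier with a [bracketed] pseudonym.

    Returns the edited text and a log of the replacements made. The log
    contains the original identifiers, so it must be stored securely and
    separately from the edited text.
    """
    log = []
    # Handle longer identifiers first, so a full name is replaced
    # before a first name that is part of it.
    for original in sorted(replacements, key=len, reverse=True):
        marked = f"[{replacements[original]}]"
        # Word boundaries (\b) prevent unintended changes inside longer
        # words, e.g. "Annabel" is left alone when replacing "Anna".
        pattern = re.compile(r"\b" + re.escape(original) + r"\b")
        text, count = pattern.subn(lambda match: marked, text)
        log.append({"original": original,
                    "replacement": replacements[original],
                    "count": count})
    return text, log


# Hypothetical example data.
example = "Anna works at Acme Ltd. Annabel met Anna there."
key = {"Anna": "Participant 1", "Acme Ltd": "Employer A"}
edited, log = pseudonymise(example, key)
print(edited)
# [Participant 1] works at [Employer A]. Annabel met [Participant 1] there.
```

A count of 0 in the log can be a hint that an identifier is spelt differently in the text than in the key.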

  1. Find and highlight direct identifiers by reading the transcript. 
  2. Assess indirect identifiers: 
    • Can the identity of a participant be known from information in the data file? 
    • Can a third party be disclosed or harmed from information in the data file? 
  3. Assess the wider picture:

    • Which identifying information about an individual participant can be noted from all the data and documentation available to a user?

  4. Remove (or pseudonymise) direct identifiers.

  5. Redact or categorise (in)direct identifiers.

  6. Re-assess any remaining disclosure risk.
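One narrow part of re-assessing the remaining disclosure risk can be supported by a script: checking whether known identifiers, or misspelt variants of them, are still present in the edited text. The sketch below uses Python's standard `difflib` module for approximate matching. The function name `residual_identifiers` and the similarity cutoff of 0.8 are assumptions for this example. It does not replace reading the transcript, since it cannot judge indirect identification from context.

```python
import re
from difflib import get_close_matches


def residual_identifiers(text, known_identifiers, cutoff=0.8):
    """Return words that equal or closely resemble a known identifier."""
    known = [identifier.lower() for identifier in known_identifiers]
    hits = []
    # Look at each run of letters in the text, ignoring digits and punctuation.
    for word in re.findall(r"[^\W\d_]+", text):
        if get_close_matches(word.lower(), known, n=1, cutoff=cutoff):
            hits.append(word)
    return hits


# Hypothetical example: two misspelt identifiers were missed earlier.
edited = "[Participant 1] said that Ana moved to Roterdam."
print(residual_identifiers(edited, ["Anna", "Rotterdam"]))
# ['Ana', 'Roterdam']
```

A lower cutoff finds more variants but also flags more ordinary words, so the flagged words still need human review.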

Tips and tricks

The UK Data Service has developed a Text anonymisation helper tool, which comes with installation instructions. It is an MS Word macro add-on that aids the anonymisation of qualitative data. The tool does not anonymise or change the data; it finds and highlights numbers and words starting with capital letters in the text. Numbers and capitalised words are often disclosive: they can be names, companies, birth dates, addresses, educational institutions or countries.
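For transcripts that are not in MS Word, the same idea (flagging numbers and capitalised words for manual review) can be approximated with a few lines of Python. This is a rough sketch of the general approach, not the implementation of the UK Data Service tool; the function name `flag_candidates` and the `**` markers are assumptions for this example.

```python
import re


def is_candidate(word):
    """A word is a candidate if it starts with a capital letter or a digit."""
    return word[0].isupper() or word[0].isdigit()


def flag_candidates(text):
    """Mark candidate words with ** so a human reviewer can check them.

    Nothing is removed or replaced; the text is only annotated.
    """
    return re.sub(
        r"\w+",
        lambda m: f"**{m.group(0)}**" if is_candidate(m.group(0)) else m.group(0),
        text,
    )


print(flag_candidates("I moved to Leiden in 1998."))
# **I** moved to **Leiden** in **1998**.
```

As the example shows, words at the start of a sentence are flagged as well, so the output contains false positives and always needs a manual check.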

CESSDA has a detailed example/exercise of anonymising a transcript at the bottom of this page.

On the page of the Finnish Social Science Data Archive you can find practical tips and a detailed guide of techniques for anonymisation of qualitative data (which can also be used in case anonymisation can only be done to a degree).

The UK Data Service has a page on best practices for transcribing audio-visual data. If you decide (or are considering) to use external transcribers or automatic speech recognition (ASR) software for an initial transcription, contact your privacy officer before using the software to discuss whether agreements need to be signed, and if so which ones.

Advice on this page is compiled based on the information provided by the UK Data Service, CESSDA and the Finnish Social Science Data Archive.

The open-source text anonymisation software Textwash allows researchers who know Python basics to automatically detect and replace potential identifiers in English-language text. More information can be found in this paper by Kleinberg and colleagues (2022) and on the project’s GitHub page. Building on Textwash, the tool FAMTAFOS will feature an easy-to-use desktop app that allows users to anonymise English and Dutch texts at scale (expected release in Spring 2023).

This page was last updated in January 2023. Did you find a broken link or (seemingly) incorrect information? Please send an email with the title 'Website content' to datasteward@eur.nl.
