Methods for anonymising quantitative data

For questions, contact your privacy officer (PO). On My EUR you can find contact details for your faculty PO.

Data minimisation

Minimise the amount of personal data you need to collect. What you do not have, you do not need to remove or store separately. There must be a clear, specified need for collecting personal data. As a researcher, you should be able to specify why certain personal data need to be collected and why they are relevant for your research. Already at the planning stage of your project, consider which background information on research participants you need and how detailed this information must be.

For more information on how to minimise data, for both quantitative and qualitative research, see the Finnish Social Science Data Archive.

Anonymous by design

If the research can be done without processing any personal data, the GDPR does not apply. After all, if data cannot be linked to a person, there is no risk of causing harm to that person.

When using data collected by organizations such as Statistics Netherlands (CBS) or the Centerdata LISS panel, data will usually be provided anonymously. The national research infrastructure for the social sciences in the Netherlands (ODISSEI: Open Data Infrastructure for Social Science and Economic Innovations) also provides opportunities to re-use data securely.

Below you will find examples of measures for collecting data anonymously. Please note that this list is not exhaustive and does not guarantee success:

  • It is harder to anonymise data from a small or narrowly defined group, so it is better to sample widely. Note that if you collect data from a specific region/area or organization, that region or organization is itself an identifier.

  • Only ask for general demographic data or background information. This makes it harder to single out individuals within the data set. Preferably ask for general instead of precise background information (e.g., age ranges instead of precise age or birthdate) and use closed questions.

  • Use closed questions about opinions, feelings, personal situations, and the like. With open-ended questions you have less control over what the participants write down (you only know if it is anonymous after you receive the answers). In case you need open-ended questions:

    • Instruct the participants clearly what to write (or what not to write).

    • Check and anonymise answers if necessary (see 'Best practices' below or simply remove the personal data).

  • Inform participants and ask for their consent without asking for personal data (see example from Radboud University here).

  • Check the settings of the survey tool; the tool should not collect more data than required and the link to the survey should be anonymous (see instructions for Qualtrics here or on the Qualtrics website here).

  • Note that if you use a platform to collect the data, the data might be anonymous to you, but not to the platform. Use only EUR-approved platforms for data collection, so that you can be sure an agreement between EUR and the platform on the handling of the data is in place.

  • Double-check that the data is also anonymous to others before you release the data in a repository.
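
The check on open-ended answers mentioned in the list above can be partly supported by a simple scan. The following is a minimal sketch in Python, assuming the answers are available as a list of strings. The patterns and the function name `flag_for_review` are illustrative assumptions, not part of any EUR tool. They only catch obvious cases such as email addresses and phone numbers, so a manual check remains necessary.

```python
import re

# Illustrative patterns (assumptions): they only catch obvious cases.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE = re.compile(r"\+?\d[\d \-]{7,}\d")


def flag_for_review(answers):
    """Return (index, reason) pairs for answers that may contain personal data."""
    flagged = []
    for index, text in enumerate(answers):
        if EMAIL.search(text):
            flagged.append((index, "possible email address"))
        elif PHONE.search(text):
            flagged.append((index, "possible phone number"))
    return flagged


# Made-up example answers to an open-ended question.
answers = [
    "The workload is too high.",
    "You can reach me at j.doe@example.org for details.",
    "Call me on 06 1234 5678.",
]

flagged = flag_for_review(answers)
for index, reason in flagged:
    print(f"Answer {index}: {reason}")
```

Names, addresses, and descriptions of specific events are not detected by such patterns, which is why the scan only marks answers for review and does not alter them.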

Best practices for anonymising quantitative data 

The UK Data Service and CESSDA ERIC provide a list of best practices that can be used to anonymise quantitative data:

  • Remove direct identifiers from a dataset. Store the removed identifiers separately if needed. Note that as long as the removed identifiers are kept, the data is pseudonymous, not anonymous.

  • Aggregate or reduce the precision of a variable (e.g., for variables such as age, place of residence, geo-locations). 

  • Generalise the meaning of a detailed text variable (from open-ended free-text questions).

  • Restrict the upper or lower ranges of a continuous variable to hide exceptional cases within the dataset (e.g., for variables such as age and income).
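
Two of the practices above, reducing the precision of a variable and restricting the upper range of a continuous variable, can be sketched as follows. This is a minimal illustration in Python, assuming records are held as dictionaries. The cap values, band width, and function names are assumptions chosen for the example and should be set per dataset.

```python
def age_band(age, width=10):
    """Map an exact age to a range label such as '30-39'."""
    lower = (age // width) * width
    return f"{lower}-{lower + width - 1}"


def top_code(value, cap):
    """Replace any value above the cap with the cap itself."""
    return min(value, cap)


# Assumed thresholds for illustration only.
AGE_CAP = 80
INCOME_CAP = 100000

# Made-up example records.
records = [
    {"age": 34, "income": 41000},
    {"age": 58, "income": 250000},
    {"age": 91, "income": 18000},
]

anonymised = []
for record in records:
    anonymised.append({
        # Ages at or above the cap share one open-ended band.
        "age_band": f"{AGE_CAP}+" if record["age"] >= AGE_CAP else age_band(record["age"]),
        # Incomes above the cap are reported as the cap value.
        "income": top_code(record["income"], INCOME_CAP),
    })

for row in anonymised:
    print(row)
```

Whatever thresholds are chosen, it helps to document them alongside the dataset, so that users know that a capped value means "this amount or more".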

On the website of the Finnish Social Science Data Archive you can find a detailed guide to anonymisation techniques for quantitative data including many examples.

Statistical methods for anonymisation

A widespread technique for anonymising data is k-anonymity. The idea is to generalise variables to minimise the risk of re-identification of individuals or groups of individuals. A dataset is k-anonymous if each individual in the dataset cannot be distinguished from at least k-1 other individuals in the same dataset using the same set of identifiers. Thus, for every combination of values of the (indirect) identifiers that occurs in the dataset, there are at least k individuals with the same values (Machanavajjhala et al. 2007). k-anonymity does not always prevent the disclosure of sensitive information: someone with background knowledge can still learn it when the sensitive values within a k-anonymised group lack diversity. One solution is to consider the criterion of l-diversity, which requires enough diversity in the values of sensitive information within each group to prevent disclosure. For more information, see the website of the Finnish Social Science Data Archive and this paper by Machanavajjhala and colleagues (2007).
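
To make the two criteria concrete, the sketch below computes k and l for a small made-up dataset in Python. It assumes the simplest reading of l-diversity, namely the number of distinct sensitive values per group. The function name `k_and_l`, the column names, and the data are illustrative assumptions. For real datasets, a dedicated package is a better choice.

```python
from collections import defaultdict


def k_and_l(rows, quasi_identifiers, sensitive):
    """Return (k, l) for a non-empty list of row dictionaries.

    k: size of the smallest group sharing the same quasi-identifier values.
    l: smallest number of distinct sensitive values within any such group.
    """
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[name] for name in quasi_identifiers)
        groups[key].append(row[sensitive])
    k = min(len(values) for values in groups.values())
    l_div = min(len(set(values)) for values in groups.values())
    return k, l_div


# Made-up data: age band and region are the indirect identifiers,
# diagnosis is the sensitive variable.
rows = [
    {"age_band": "30-39", "region": "South", "diagnosis": "asthma"},
    {"age_band": "30-39", "region": "South", "diagnosis": "diabetes"},
    {"age_band": "30-39", "region": "South", "diagnosis": "asthma"},
    {"age_band": "40-49", "region": "North", "diagnosis": "flu"},
    {"age_band": "40-49", "region": "North", "diagnosis": "flu"},
]

k, l_div = k_and_l(rows, ["age_band", "region"], "diagnosis")
print(f"k = {k}, l = {l_div}")
```

In this example the data is 2-anonymous, but the second group contains only one diagnosis. Anyone who knows that a person aged 40-49 from the North is in the dataset therefore learns that person's diagnosis, which is the situation l-diversity is meant to prevent.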

If you are using R, there are packages available that can be used for anonymisation:

  • sdcMicro is an R package suitable for anonymisation of large datasets. For more information, see this paper by Templ and colleagues (2015).
  • synthpop is an R package for creating synthetic data, in which the original data is replaced to prevent disclosure while the statistical features of the data are preserved. For more information, see this paper by Nowok and colleagues (2016).

This page was last updated in January 2023. Did you find a broken link or (seemingly) incorrect information? Please send an email with the title 'Website content' to datasteward@eur.nl.
