A dataset based on original data from a university, yet impossible to trace back to real students. Dominique van Deursen, data scientist at the Business Intelligence Competence Center (BICC) of our university, recently launched such a simulation dataset with synthetic student data in collaboration with VU University Amsterdam. She and her team have also been nominated for a Computable Award for the Student Data Simulation Dataset.
Dominique van Deursen has been working at the General Administration Service BICC for about two years now. As a data scientist, she is involved in all kinds of information provision in and around the university. Recently, the synthetic student data project came her way. It was carried out within the framework of Zone Studiedata, a joint venture between Erasmus University Rotterdam and VU University Amsterdam. Zone Studiedata is a nationwide initiative dedicated to educational improvement through the use of IT. Versnellingsplan Onderwijsinnovatie met ICT is the initiator and client of the simulation dataset.
To begin with, what are synthetic data?
"Many people think that synthetic data are derived from original data in some way, but that is not the case. Synthetic data are generated entirely by computer simulation and cannot be compared to anonymised original data. The persons in the simulation dataset do not exist; they have been artificially generated by a computer algorithm."
What can you use a simulation dataset based on study data for?
"The simulation dataset that we have made consists of study enrolment data and study progress data, which is only a small part of all the types of study data that exist. The use of study data also varies widely. Think, for example, of a student advisor who wants certain data in order to guide students better, lecturers who use it to gain insights for designing their courses, or a programme director who wants to compare the quality of one programme with that of another. The common goal is usually to improve the quality of education."
Study data, especially enrolment data and study progress data, often contain privacy-sensitive information. For what reason was this simulation dataset created?
"Yes, that is correct. Study enrolment data and study progress data contain personal data. Even without a name attached, someone with the necessary background knowledge can still deduce who the person behind the data is. That is a risk, and as a user of data you should not take it lightly. When using study data, you will therefore inevitably have to deal with legislation on privacy-sensitive data: the General Data Protection Regulation (GDPR)."
"Because of this legislation, cooperation between students and teachers from different institutions is difficult. That is understandable, because in many cases it involves extremely privacy-sensitive information. Naturally, you do not want just any third party to be able to use your personal data, even with good intentions. To make cooperation between different institutions possible, a dataset is needed that allows people to use the data it contains, but at the same time guarantees the privacy of the people in the dataset. That is how the idea of developing a simulation dataset consisting of synthetic data came about."
Are there already people who have used the dataset in their research?
"Yes, an EUR student who used our synthetic dataset recently graduated. She used the dataset to investigate whether more ethical models are possible in analyses of how universities can better guide students during their education. Think of entry requirements: you have to have achieved certain grades in your previous education. How do you deal with that in the case of international students? They often did not have a Dutch education, which means you cannot measure them against the same yardstick and have to use different indicators. Using the synthetic dataset, the student shows which indicators are meaningful and ethical to use."
With the simulation dataset, you have been nominated for a Computable Award. How does the nomination affect you?
"I think it's great that the project has been nominated and that synthetic data are getting more visibility as a result. I hope that other projects can come out of it. I recently heard that municipalities are also interested in using synthetic data. This would allow you, for example, to use privacy-sensitive information from a certain municipality in a way that is responsible and at the same time safeguards privacy. Synthetic data are a very good alternative for that kind of issue too."
A synthetic dataset works very well for various educational purposes
Marlon Domingus is Data Protection Officer at our university and advises and informs the Executive Board (CvB) and EUR staff on the obligations of the GDPR. In this role, he was also involved in Dominique van Deursen's project.
"If you want to work with large datasets in a collaborative manner, there are various challenges. You have to take many organisational and technical measures to ensure that you can guarantee the privacy of the personal data. Such a synthetic dataset works very well for various educational purposes. With a very rich dataset, you can learn all kinds of things about analysis without it being personal data that has legal protection under the GDPR. I enjoyed involving Dominique in a project with the city of The Hague and Statistics Netherlands to come up with a synthetic dataset. They were also impressed by Dominique's experience in the Study Data simulation dataset project. Funding is still being sought for such a project."