Homework 2
Download the Covid-19 dataset and anonymise it using the ARX tool. Explain your choices throughout the process.
Choose identifying, quasi-identifying, sensitive and insensitive attributes. For quasi-identifiers, define levels of anonymisation (create hierarchies). Choose privacy models; it is enough to use k-anonymity and l-diversity. Run the anonymisation process and choose an appropriate transformation. Analyse the anonymised dataset in terms of utility. Try to make sure that no columns besides the identifying attributes are entirely deleted, while keeping the number of suppressed records under control. Export the final ARX project file.
Create a report, where you explain why you made the choices you made (you can briefly describe previous attempts if they gave unsatisfactory results).
- Explain why you chose the levels of generalisation that you did (if you chose ordering, explain why you ordered things the way you did; if you chose intervals, explain how you chose them; if you created a custom hierarchy, explain the logic behind it).
- In your own words, add a short explanation of what guarantees the privacy models offer in terms of quasi-identifiers and sensitive attributes.
- Briefly explain which transformation you chose and why.
- Explain what level the transformation chose for each attribute. Do this by attribute (for instance “Age: Level 2” is sufficient, the actual levels are visible from the project file).
- Report the minimal class size from the input and output data.
- Report how many records were suppressed (Analyze utility → Class sizes) and which attributes had the most missing values (Analyze utility → Quality models).
- Analyze the risk of the input data: report the estimated percentage of records from the input data that had a re-identification risk higher than 50%.
- Analyze the risk of the output data: report the estimated percentage of records from the output data that have a re-identification risk below 5%.
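To make the class-size and risk items above more concrete, here is a minimal Python sketch of what these numbers mean. It assumes the simple prosecutor-style model in which a record's re-identification risk is 1 divided by the size of its equivalence class; the estimates ARX shows in its risk view may rely on other models, so treat this only as an aid for interpreting the figures, not as a replacement for reading them from the tool. The column names and records are hypothetical and are not taken from the Covid-19 dataset.

```python
from collections import Counter

def class_sizes(rows, quasi_identifiers):
    """Return, for each row, the size of the equivalence class it belongs to."""
    keys = [tuple(row[q] for q in quasi_identifiers) for row in rows]
    counts = Counter(keys)
    return [counts[key] for key in keys]

def share_with_risk(sizes, predicate):
    """Fraction of records whose risk (assumed to be 1 / class size) satisfies predicate."""
    return sum(1 for size in sizes if predicate(1 / size)) / len(sizes)

# Hypothetical toy records with two quasi-identifiers.
rows = [
    {"age": "20-39", "sex": "F"},
    {"age": "20-39", "sex": "F"},
    {"age": "20-39", "sex": "F"},
    {"age": "40-59", "sex": "M"},
    {"age": "40-59", "sex": "M"},
    {"age": "60-79", "sex": "F"},
]

sizes = class_sizes(rows, ["age", "sex"])
min_class_size = min(sizes)                            # the "k" the data currently satisfies
high_risk = share_with_risk(sizes, lambda r: r > 0.5)  # share of records with risk above 50%
low_risk = share_with_risk(sizes, lambda r: r < 0.05)  # share of records with risk below 5%

print(min_class_size, high_risk, low_risk)
```

In this toy example only the single record that sits alone in its class has a risk above 50%, and no record reaches a risk below 5%, because that would require classes of more than 20 records.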
Hints
If the anonymisation suppresses entire attributes or too many records:
- Check the suppression level (suggested: 100%);
- Define more hierarchy levels for quasi-identifiers;
- Play around with sensitive attributes and quasi-identifiers (The distinction between them is not always clear. Turning a sensitive value into a quasi-identifier allows ARX to generalise it, while making a quasi-identifier into a sensitive attribute will mean that it is not considered in the k-anonymous classes).
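The effect described in the last hint can be illustrated with a small Python sketch. It computes the k of k-anonymity (the smallest equivalence class) and a distinct l-diversity value (the smallest number of different sensitive values in any class) for a given choice of quasi-identifiers. The function name, column names and records are hypothetical, and distinct l-diversity is only one of several l-diversity variants, so this is an illustration under those assumptions rather than a description of what ARX computes internally.

```python
from collections import defaultdict

def k_and_l(rows, quasi_identifiers, sensitive):
    """Return (k, l): the smallest class size and the smallest number of
    distinct sensitive values found in any equivalence class."""
    groups = defaultdict(list)
    for row in rows:
        key = tuple(row[q] for q in quasi_identifiers)
        groups[key].append(row[sensitive])
    k = min(len(values) for values in groups.values())
    l = min(len(set(values)) for values in groups.values())
    return k, l

# Hypothetical toy records.
rows = [
    {"age": "30-39", "region": "North", "outcome": "recovered"},
    {"age": "30-39", "region": "North", "outcome": "hospitalised"},
    {"age": "30-39", "region": "South", "outcome": "recovered"},
    {"age": "30-39", "region": "South", "outcome": "recovered"},
]

# Region treated as a quasi-identifier: two classes of two records each,
# and the "South" class contains only one distinct outcome.
print(k_and_l(rows, ["age", "region"], "outcome"))

# Region no longer treated as a quasi-identifier: the classes merge into one.
print(k_and_l(rows, ["age"], "outcome"))
```

Dropping region from the quasi-identifiers merges the two classes, so both k and l go up without any generalisation. The price is that region is then no longer protected by the k-anonymous classes, which is exactly the trade-off the hint asks you to think about.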
The ARX tool documentation is available here. Click on the blue links to read more about each particular part of the program.
Stuck? Don't know where to begin? The general order in which you should proceed is:
- Create a new project and load in the dataset.
- Explore the dataset from the Analyze Utility view. How are columns distributed?
- Define column types. What are the unique values and how many does each column have? Which columns could be considered identifiers, which ones quasi-identifiers, which ones are sensitive?
- Define privacy models. In the practice session, we used k-anonymity for quasi-identifiers and l-diversity for sensitive features.
- Define transformation hierarchies. For binary attributes it might be sufficient to simply create a suppression rule. For others, try to figure out what is a reasonable hierarchy type (intervals, ordering, masking). Try to define hierarchies that have multiple steps, if it is reasonable for a particular feature.
- Run the anonymisation, then from "Explore Results" select the best transformation (or any that you like) and apply it.
- Check the resulting data ("Analyze Utility"). What kind of transformation was applied? How many measures remain? Have the distributions changed?
- Check the risk statistics ("Analyze Risk"). How are the risks distributed? How has the risk distribution changed after the transformation?
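For the hierarchy step above, it can help to think of a hierarchy as a table with one row per original value and one column per generalisation level. The following Python sketch builds such rows for an interval hierarchy and a masking hierarchy. The function names, the interval widths (5 and 20 years) and the example values are assumptions chosen for illustration; in ARX you would normally define the same structure with the hierarchy wizard.

```python
def interval_hierarchy(age, widths=(5, 20)):
    """Levels for one age: the exact value, intervals of growing width, then '*'."""
    levels = [str(age)]
    for width in widths:
        low = (age // width) * width
        levels.append(f"{low}-{low + width - 1}")
    levels.append("*")  # top level: the value is fully suppressed
    return levels

def masking_hierarchy(value):
    """Levels for one string: mask one more character from the right at each level."""
    return [value[:len(value) - i] + "*" * i for i in range(len(value) + 1)]

print(interval_hierarchy(37))     # ['37', '35-39', '20-39', '*']
print(masking_hierarchy("5041"))  # ['5041', '504*', '50**', '5***', '****']
```

Choosing widths so that each wider interval contains whole narrower ones (here 20 is a multiple of 5) keeps the levels nested, and having several intermediate levels gives the anonymisation more options between "unchanged" and "fully suppressed".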
Deadline: March 13th, 2024, 23:59 EET.
Submission Form
Submit a zip archive containing the ARX project file (*.deid) and the PDF of the report.