Optimizing the Privacy Risk-Utility Framework in Data Publication
In the past decade, we have witnessed rapid growth in the quantity, quality, and diversity of the personal data we shed through our daily activities. At the same time, it is increasingly recognized that personal data has tremendous value in supporting a variety of endeavors, which has led to the sharing of such data in de-identified form for secondary uses. This practice has raised numerous privacy concerns, in particular that de-identified data will be linked to, or reveal sensitive information about, the corresponding named individual. As a result, disclosure control methods have been developed to mitigate the risks of sharing such data. Yet disclosure risk is often measured simply as the probability that a record can be linked. Traditional views of disclosure risk neglect that an adversary is typically driven by gain (which may be economic in nature), has limited resources, and must undertake a series of actions to achieve a successful attack. Consequently, a demonstration of possible disclosure does not necessarily indicate the likely level of disclosure risk. Applying de-identification methods without accounting for this fact can lead to overprotection, which causes unnecessary data distortion and diminishes the legitimate usage of the data, as well as underprotection, which could harm relationships between data subjects and data publishers.

The goal of this thesis is to build a framework for reasoning about privacy disclosure risk given a dataset and the context in which it is published. The context includes the adversary's decision making process, the adversary's gain from a successful attack, and the adversary's cost for carrying out the attack. The thesis also develops methods to find data publishing solutions that provide desirable tradeoffs between identity disclosure risk and data utility. This dissertation makes three main contributions.
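The economic view of the adversary sketched above can be illustrated with a minimal example: a gain-driven adversary attempts an attack only when the expected payoff is positive. This is a toy sketch under assumed names and numbers, not the thesis's actual model.

```python
# Toy sketch of an economically rational adversary's decision rule.
# All function names and numeric values are illustrative assumptions.

def expected_payoff(p_success: float, gain: float, cost: float) -> float:
    """Expected net payoff of attempting a re-identification attack."""
    return p_success * gain - cost

def rational_adversary_attacks(p_success: float, gain: float, cost: float) -> bool:
    """A gain-driven adversary attacks only if the expected payoff is positive."""
    return expected_payoff(p_success, gain, cost) > 0

# A record that is technically linkable (p_success > 0) may still be safe
# in practice if the attack is unprofitable for the adversary.
print(rational_adversary_attacks(p_success=0.9, gain=100.0, cost=30.0))   # True
print(rational_adversary_attacks(p_success=0.9, gain=100.0, cost=200.0))  # False
```

The second call shows why demonstrating that a disclosure is *possible* does not establish that it is *likely*: raising the adversary's cost (or lowering the gain) can make an otherwise feasible attack irrational.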
First, this dissertation proposes a novel re-identification risk framework that formalizes the incentive and deterrence mechanisms at play in the real-world environment into which a de-identified dataset is released. This framework explicitly models the adversary as an optimal planning agent using a factored Markov decision process (FMDP). Second, this dissertation investigates the deterrent effect of certain types of penalties. Specifically, it considers a temporal penalty for violating the terms of a data use agreement (i.e., holding a data user out of a system for a prespecified period of time). This is accomplished by analyzing how the value of datasets, measured through the impact of publications based upon them, evolves over time. The analysis suggests that such temporal penalties may not be appropriate for protecting data from attack. Finally, this dissertation develops a de-identification policy discovery platform that selects high-performing de-identification policies by accounting for the tradeoff between risk and utility criteria.
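The third contribution, selecting policies that trade off risk against utility, can be sketched with a simple Pareto-frontier filter: a policy is worth considering only if no other policy has both lower risk and higher utility. The policy names and scores below are illustrative assumptions; the thesis's actual discovery platform is more involved.

```python
# Minimal sketch of risk-utility policy selection via Pareto dominance.
# Each policy is (name, re-identification risk, data utility); lower risk
# and higher utility are better. Values are hypothetical.

def pareto_frontier(policies):
    """Keep policies not dominated by any other (lower-or-equal risk AND
    higher-or-equal utility, strictly better on at least one criterion)."""
    frontier = []
    for name, risk, utility in policies:
        dominated = any(
            r <= risk and u >= utility and (r < risk or u > utility)
            for _, r, u in policies
        )
        if not dominated:
            frontier.append((name, risk, utility))
    return frontier

policies = [
    ("suppress-all", 0.01, 0.10),    # very safe, but little utility left
    ("generalize-zip", 0.10, 0.70),
    ("release-raw", 0.90, 1.00),     # full utility, high risk
    ("bad-policy", 0.50, 0.40),      # dominated by generalize-zip
]
print(pareto_frontier(policies))
```

Dominated policies such as `bad-policy` are discarded outright; the data publisher then chooses among the remaining frontier points according to how much risk is tolerable.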