
Privacy Preserving Machine Learning: Part 1

Deep Learning with Differential Privacy (DP-SGD Explained)

December 25, 2021

7 min read

Last updated: December 26, 2021


Check this post out on YouTube!

Consider subscribing while you're there?

Many of the world’s greatest machine learning problems require access to sensitive data, such as patient records for cancer detection. The issue is that machine learning models have a tendency to memorise private data, even if they aren’t overfitting.

How can we train our models whilst still preserving privacy?

Through a technique called differential privacy.

In this post I’ll explain what differential privacy is, and go through the seminal differentially-private machine learning algorithm: DP-SGD (differentially-private stochastic gradient descent).

Why not just anonymise the data?

Wait wait, why do we need a special privacy-preserving algorithm at all? Surely the easiest solution is to just remove any private information by anonymising the data?

Well, that doesn’t work.

Netflix anonymised the training set for its recommender challenge by replacing all names with random IDs. It turned out that by linking this dataset with public IMDB review ratings, researchers were able to deanonymise people.

Likewise, Strava got into hot water when it released an anonymised aggregate heatmap of location data, as it inadvertently revealed the location of military bases.

Every time we perform any form of statistical analysis, we leak a little information about the data.

See, if I tell you that the mean of the ages of the participants of a dataset is 34, then I’ve revealed a bit of information about the participants.

If we want total privacy, we learn nothing about the data.

Revealing the mean age tells us less than revealing all of the participants’ ages. But that’s a bit wishy-washy: let’s quantify this loss of privacy.

Enter Differential Privacy

Differential privacy lets us quantify the “privacy” of any mechanism $M$ that interacts with data $D$ to produce some output.

The definition of a mechanism is intentionally broad: it covers everything from statistical analysis through database queries to even simple pen-and-paper calculations.

M: Data -> Output

In the case of machine learning, the mechanism $M$ isn’t just the final model: it’s the entire process of training a model on the dataset $D$ and using that model’s predictions.

Let’s use an example to illustrate the goal of differential privacy.

Example: training on a cancer dataset.

Say we’re training a model on a cancer dataset $D$, and we ask it to predict if a particular patient Bob has cancer. We’ll call this entire mechanism (training a model and then predicting whether Bob has cancer) $M$.

$P(M(D)=\text{“Bob has cancer”}) = 0.55$

If we train a model on the same dataset augmented with Bob’s records and its confidence improves from 55% to 57%,

$P(M(D+\text{Bob})=\text{“Bob has cancer”}) = 0.57$

then maybe Bob’s data says he has cancer, but maybe it’s just that the model got more confident overall with more data. Bob’s privacy comes from this plausible deniability of him having cancer.

What if it actually became 80% confident Bob has cancer?

$P(M(D+\text{Bob})=\text{“Bob has cancer”}) = 0.80$

Just by adding Bob to the dataset, the outcome “Bob has cancer” has become significantly more likely, so it’s pretty clear that Bob does in fact have cancer, and the mechanism $M$ leaked information about Bob!

We can quantify the privacy loss as the log of the ratio of the probabilities before and after Bob was added to the dataset: i.e.

$\log{\frac{P(M(D+\text{Bob})=\text{“Bob has cancer”})}{P(M(D)=\text{“Bob has cancer”})}}$.

ASIDE (if you’re interested): This definition of privacy loss comes from information theory. Let the events be defined as:

$A$: the event $M(D)=\text{“Bob has cancer”}$

$B$: the event $M(D+\text{Bob})=\text{“Bob has cancer”}$

Then from information theory we have that the information associated with $A$ is defined as $I(A) = -\log{P(A)}$, and likewise for $B$.

The privacy loss is the difference in information between these two events (i.e. the information we gained by adding/removing Bob’s record):

$I(A) - I(B) = -\log{P(A)} - (-\log{P(B)}) = \log{\frac{P(B)}{P(A)}}$

So if we plug in the numbers for our two examples, we have:

$\log{\frac{0.57}{0.55}} = 0.0357$

$\log{\frac{0.80}{0.55}} = 0.375$

We can quantifiably say that the second outcome has roughly 10x the privacy loss.
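
To make these numbers concrete, here's a quick Python sketch (standard library only) of the privacy-loss calculation above:

import math

def privacy_loss(p_with, p_without):
    """Privacy loss: log of the ratio of outcome probabilities
    with and without Bob's record in the dataset."""
    return math.log(p_with / p_without)

print(privacy_loss(0.57, 0.55))  # ≈ 0.0357: plenty of plausible deniability
print(privacy_loss(0.80, 0.55))  # ≈ 0.375: roughly 10x the privacy loss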

The key idea behind differential privacy is to limit this difference: the model should make a similar inference about me whether or not I individually opt in to the dataset, thus preserving my privacy.

In the equations that follow, which formally define differential privacy, we look at datasets $D$ and $D'$ that differ in one data record (i.e. we add/remove a record). We consider the privacy loss on a set of outcomes $S$:

$\log{\frac{P(M(D) \in S)}{P(M(D') \in S)}}$.

Mathematical definitions of differential privacy

If we had zero privacy loss, then from our definition the probabilities $P(M(D) \in S)$ and $P(M(D') \in S)$ would be the same → we wouldn’t have learnt anything from that record we added/removed.

So we have to be a bit more flexible and specify an upper bound on our privacy loss, which we denote with an $\epsilon$. We call this our privacy budget.

$\log{\frac{P(M(D) \in S)}{P(M(D') \in S)}} \leq \epsilon$.

We tune this $\epsilon$ to trade off how much we learn against the privacy of the individual.

$\epsilon$-Differential Privacy

We can rearrange this inequality to get our first mathematical definition of privacy.

$P(M(D) \in S) \leq e^\epsilon \, P(M(D') \in S)$.

So if we can bound this privacy loss by $\epsilon$ for any datasets $D$ and $D'$, whatever the outcome we observe, we have a formal guarantee of privacy. We say the mechanism $M$ is $\epsilon$-differentially private.
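
As a toy illustration of this definition, take the classic randomised response mechanism (nothing to do with deep learning, but a nice sanity check): each participant answers a sensitive yes/no question truthfully with probability 0.75 and flips their answer otherwise. The worst-case ratio of output probabilities between two datasets differing in one person’s answer is 0.75/0.25 = 3, so the mechanism is $\ln{3} \approx 1.1$-differentially private. A minimal sketch:

import math
import random

def randomised_response(true_answer: bool, p_truth: float = 0.75) -> bool:
    """Report the true answer with probability p_truth, otherwise flip it."""
    return true_answer if random.random() < p_truth else not true_answer

# Worst case over outputs and neighbouring datasets:
# P(output = Yes | answer = Yes) / P(output = Yes | answer = No) = 0.75 / 0.25 = 3
p_truth = 0.75
epsilon = math.log(p_truth / (1 - p_truth))  # ln(3) ≈ 1.10
print(f"randomised response is {epsilon:.2f}-differentially private")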

($\epsilon$-$\delta$)-Differential Privacy

This requirement is very strict: in practice we allow for a tiny failure probability $\delta$ (typically much smaller than the inverse of the number of records, so it is unlikely to affect any given individual).

$P(M(D) \in S) \leq e^\epsilon \, P(M(D') \in S) + \delta$.

We say the mechanism $M$ is ($\epsilon$-$\delta$)-differentially private.

Privacy is no longer subjective

I want to just take a moment to pause here and appreciate how elegant this solution is:

  • Privacy isn’t subjective - we can quantitatively compare mechanisms by comparing their privacy budgets.
  • This privacy guarantee is rock solid - the bound holds no matter how an attacker tries to post-process the outputs of the mechanism or combine them with other data.

How do we achieve this guarantee?

So differential privacy gives us this definition, but how do we achieve it in practice?

Adding noise

We can limit the privacy loss by adding noise. Think of an image: the more noise you add, the fuzzier the image, so the less you learn from it.

Every time you add a bit of random noise, you spread the range of possible values the input could have been. So if my true age is 35 and I add noise sampled from a Gaussian distribution with a standard deviation of 5, then the reported value could plausibly correspond to an age anywhere between 25 and 45 (a 95% confidence interval).

Noise spreads the range of possible values the input could have been.

The noise essentially smooths the probabilities and prevents one outcome from becoming much more likely. The smaller your privacy budget, the more noise you need to add.
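
Here's a small sketch of the age example above (assuming NumPy), just to show the fuzzing numerically:

import numpy as np

rng = np.random.default_rng(seed=0)

true_age = 35
noise_std = 5

# Report a noisy version of the age.
noisy_age = true_age + rng.normal(loc=0.0, scale=noise_std)
print(f"reported age: {noisy_age:.1f}")

# ~95% of reports fall within 1.96 standard deviations of the true value,
# i.e. roughly anywhere between 25 and 45.
print(f"95% interval: [{true_age - 1.96 * noise_std:.0f}, {true_age + 1.96 * noise_std:.0f}]")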

Subsample and aggregate

There’s another technique we can use to add “randomness”: sampling. If we have a large dataset and we sample a random fraction $q$ rather than the entire dataset, then the privacy amplification theorem states that the privacy budget we need is reduced by a factor of $q$.

I.e. an ($\epsilon$-$\delta$)-differentially private mechanism $M$ becomes ($q\epsilon$-$q\delta$)-differentially private.

Intuitively, this is because a given record only reveals information when it is actually sampled, which happens a $q$ fraction of the time.
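
In code, this amplification is just a rescaling of the budget. A sketch, assuming the simple $q\epsilon$/$q\delta$ approximation quoted above (which holds for small $\epsilon$):

def amplify_by_subsampling(epsilon: float, delta: float, q: float):
    """Approximate privacy amplification from running the mechanism
    on a random fraction q of the dataset."""
    return q * epsilon, q * delta

# e.g. a (1.0, 1e-5)-DP mechanism run on a random 1% sample of the data
print(amplify_by_subsampling(1.0, 1e-5, q=0.01))  # ≈ (0.01, 1e-07)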

Composition of privacy budgets

Now if you’ve ever played the game 20 questions, you know you can home in on the answer by asking enough questions. (This is the Fundamental Law of Information Recovery.)

So the more queries we have, the larger the privacy loss.

Luckily for us, this is easy to compute: we can add up our privacy budgets:

If we have a mechanism $M_1$ that has a budget of $\epsilon_1$, and another mechanism $M_2$ that has a budget of $\epsilon_2$, the overall budget we need is at most $\epsilon_1 + \epsilon_2$.

This is because the bounds on the probabilities are multiplicative:

$e^{\epsilon_1} \cdot e^{\epsilon_2} = e^{\epsilon_1 + \epsilon_2}$

And it turns out this also holds for ($\epsilon$-$\delta$)-differential privacy, i.e. the overall mechanism ($M_1$ followed by $M_2$) is (($\epsilon_1 + \epsilon_2$)-($\delta_1 + \delta_2$))-differentially private.
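
As a minimal sketch, representing each mechanism's budget as an (epsilon, delta) pair:

def compose(*budgets):
    """Basic composition: the epsilons and deltas add up across mechanisms."""
    total_epsilon = sum(eps for eps, _ in budgets)
    total_delta = sum(delta for _, delta in budgets)
    return total_epsilon, total_delta

# Running M1 with budget (0.5, 1e-6) and then M2 with budget (0.3, 1e-6)
print(compose((0.5, 1e-6), (0.3, 1e-6)))  # ≈ (0.8, 2e-06)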

This budget is a loose upper bound and with careful analysis we can prove tighter bounds. But more on that in a second! Let’s apply what we’ve learnt so far to our deep learning models.

DP-SGD (Deep Learning with Differential Privacy)

Now we have the building blocks in place, we can tackle the seminal paper that applies differential privacy to deep learning models.

This paper proposes a modification to the stochastic gradient descent algorithm to make it differentially private: i.e. differentially-private stochastic gradient descent, or DP-SGD for short.

ASIDE: why modify the training process (SGD) to get a differential privacy guarantee?

Intuitively, through DP-SGD we bound the amount of information the parameters have about the data. So no matter how many times we use the model parameters to produce outputs, we can’t leak any more information than this bound.

The alternative would be to train the model normally and then add noise to its outputs. However as we’ve discussed earlier, the privacy loss would accumulate every time you queried the model, so you’d have to add noise proportional to the number of queries. You can imagine how you’d end up with practically useless outputs in the limit.

The DP-SGD Algorithm

[Figure: the DP-SGD algorithm]

The aim of the algorithm is to limit the privacy loss per gradient update, by post-processing the gradient update in 2 steps:

  1. We clip the gradients - scaling each per-example gradient so it has a max L2 norm of $C$. Clipping the gradient bounds the amount of information in a given update, which lets us reason about the maximum privacy loss.

  2. We then add noise proportional to our clipping norm $C$ to the gradient updates. You can see we sample from a normal (Gaussian) distribution with standard deviation $C\sigma$. We call $\sigma$ the noise multiplier. (Both steps are sketched in code below.)
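
Here's a minimal NumPy sketch of these two steps for a single minibatch, assuming the per-example gradients are already available as a 2-D array (one row per example). It illustrates the idea rather than reproducing the paper's TensorFlow implementation:

import numpy as np

def dp_sgd_gradient(per_example_grads: np.ndarray, clip_norm: float,
                    noise_multiplier: float, rng: np.random.Generator) -> np.ndarray:
    """One DP-SGD gradient estimate: clip each example's gradient to L2 norm <= C,
    sum them, add Gaussian noise with std C * sigma, then average over the batch."""
    batch_size = per_example_grads.shape[0]

    # 1. Clip: scale each per-example gradient so its L2 norm is at most clip_norm.
    norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
    clipped = per_example_grads / np.maximum(1.0, norms / clip_norm)

    # 2. Noise: add Gaussian noise with standard deviation clip_norm * noise_multiplier.
    noise = rng.normal(scale=clip_norm * noise_multiplier, size=clipped.shape[1])
    return (clipped.sum(axis=0) + noise) / batch_size

rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 10))  # 32 examples, 10 parameters
update = dp_sgd_gradient(grads, clip_norm=1.0, noise_multiplier=1.1, rng=rng)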

These hyperparameters $C$ and $\sigma$ can be tuned to get a particular ($\epsilon$-$\delta$) guarantee for each individual step of training.

The specific value of $C$ doesn’t actually affect the ($\epsilon$-$\delta$) guarantee, since a larger value of $C$ means more noise will be added to compensate.

It’s possible to show that $\sigma = \sqrt{2 \log{\frac{1.25}{\delta}}} / \epsilon$ gives us the ($\epsilon$-$\delta$) guarantee for that individual step of gradient descent. For brevity, we’ll skip the mathematical derivation in this blog post.
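
As a worked example (the values of $\epsilon$ and $\delta$ below are just illustrative), here's that formula in code:

import math

def gaussian_noise_multiplier(epsilon: float, delta: float) -> float:
    """Noise multiplier sigma giving an (epsilon, delta) guarantee
    for a single step, via the Gaussian mechanism bound."""
    return math.sqrt(2 * math.log(1.25 / delta)) / epsilon

print(gaussian_noise_multiplier(epsilon=1.0, delta=1e-5))  # ≈ 4.84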

So now that we have an ($\epsilon$-$\delta$) guarantee for an individual step of gradient descent, how do we compute the privacy guarantee of the overall training process?

Moments accountant

This leads us to the second key contribution of this paper: the privacy analysis through a technique called the moments accountant.

The maths for the moments accountant technique is rather heavy, so I’ll have a follow-up blog post dedicated to it, but in this post I want to demonstrate its benefits.

Getting a tighter bound on the privacy loss means we can run our training process for longer given the same privacy budget.

Let’s look at a simple privacy analysis based on the techniques we’ve looked at so far:

If each step has an ($\epsilon$-$\delta$) guarantee, then:

  • Sampling a minibatch with some probability $q = \frac{L}{N}$ (where $L$ is the batch size and $N$ the dataset size) gives us a revised guarantee of ($q\epsilon$-$q\delta$)

  • Adding up the $T$ steps of gradient descent gives us an overall ($qT\epsilon$-$qT\delta$) differential privacy guarantee

So you can see our budget scales with the number of steps $T$.
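
Putting this simple analysis into code (a sketch with illustrative hyperparameters, not the moments accountant):

def naive_dp_sgd_budget(step_epsilon: float, step_delta: float,
                        batch_size: int, dataset_size: int, num_steps: int):
    """Naive accounting: amplify each step by the sampling rate q = L/N,
    then add the per-step budgets over T steps."""
    q = batch_size / dataset_size
    return q * num_steps * step_epsilon, q * num_steps * step_delta

# e.g. 10,000 steps at batch size 256 over 60,000 examples,
# with a (1.0, 1e-5) guarantee per step
print(naive_dp_sgd_budget(1.0, 1e-5, batch_size=256, dataset_size=60_000, num_steps=10_000))
# ≈ (42.7, 4.3e-4) - the naive bound blows up quickly, hence the need for a tighter analysis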

Prior to this paper, the tightest bound came from the strong composition theorem, which gave a bound of ($O(q\epsilon \sqrt{T \log{\frac{1}{\delta}}})$, $Tq\delta$)-differential privacy.

The moments accountant gets us an even tighter bound: the overall training process is ($O(q\epsilon \sqrt{T})$, $\delta$)-differentially private.

See that’s two major improvements:

  • First, the budget is now proportional to $\sqrt{T}$ not $T$, so you can run the algorithm for longer
  • Secondly, the failure probability is $\delta$ (not $Tq\delta$), so it is independent of the number of training steps $T$.

[Figure: comparing the bounds]

Implementing DP-SGD

DP-SGD is implemented as part of the TensorFlow Privacy library. PyTorch also has an equivalent library for differential privacy called Opacus.

This makes using differential privacy as simple as just swapping out your optimiser.

mnist_dpsgd_tutorial.py
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers import dp_optimizer

# replace this
optimizer = tf.train.GradientDescentOptimizer(
    learning_rate=__)

# with this
optimizer = dp_optimizer.DPGradientDescentGaussianOptimizer(
    l2_norm_clip=__,
    noise_multiplier=__,
    num_microbatches=__,
    learning_rate=__)

Rounding off this blog post with some practical notes:

  • We don’t need to do learning rate decay, as DP-SGD never gets to a regime where we’d converge (because of the noise).

  • The size of your minibatch affects privacy - larger batches mean less noisy training but less privacy, as we’re sampling a larger fraction of the data (remember $q=\frac{L}{N}$).

  • Rather than clipping each example’s gradient separately, TensorFlow Privacy groups examples into microbatches whose averaged gradients are clipped, for performance reasons.

  • In practice DP-SGD tends to only work well for larger datasets, as the larger the dataset the less noise you need to achieve a given privacy budget.

