Quantifying and Understanding Memorization in Deep Neural Networks
Date Posted:
March 22, 2023
Date Recorded:
March 21, 2023
Speaker(s):
Chiyuan Zhang, Google
All Captioned Videos | Brains, Minds and Machines Seminar Series
Description:
Abstract: Deep learning algorithms are well known to have a propensity for fitting the training data very well and memorizing idiosyncratic properties of the training examples. From a scientific perspective, understanding memorization in deep neural networks sheds light on how those models generalize. From a practical perspective, understanding memorization is crucial to addressing privacy and security issues that arise when deploying models in real-world applications. In this talk, we present a series of studies centered on quantifying memorization in neural network models. We explain why, in many real-world tasks, memorization is necessary for optimal generalization. We also present quantitative studies on memorization, forgetting, and unlearning in both vision and language models, to better understand the behaviors and implications of memorization in those models.
TOMASO POGGIO: I'm Tomaso Poggio. Welcome to the CBMM seminar. It's great to have Chiyuan Zhang today. One of the few advantages of getting older, like I'm doing, is to have great students. And welcome back.
And so Chiyuan is one of them. I don't know if you know, but there is a paper that he published as a result of a summer project at Google in 2016, which was called "Understanding Deep Learning Requires Rethinking Generalization." It was a milestone, at least in the theory of neural networks, and has more than 5,000 citations.

And actually, this is a good opportunity because I owe you an apology, an official apology. At the time, I teased him quite a bit about that paper, because they were claiming essentially that classical machine learning theory cannot explain what they found, and I was objecting to that. But I must say, it took me six years to now finally prove that I was right. But the point is, the paper was really important.
[LAUGHTER]
It was really important because they showed that you can have interpolation of the training data and, at the same time, good generalization. Now, people should have known better, because there is a classical linear case that is a good example of it, and it's the [INAUDIBLE].

But of course, giving a demonstration on CIFAR with a deep network was a different thing, and it really was-- more or less every theory paper cites that [INAUDIBLE]. So, Chiyuan--
Now, he has done a lot of other interesting things in the meantime. Everything he did with me is forgotten. This was-- invariances in speech recognition. But it only has hundreds of citations instead of 5,000.

But anyway, the more recent work he's going to speak about gives, I think, quite an intuitive, high-level understanding-- and not only high-level-- of possibly how [INAUDIBLE] models, including transformers, work: this issue of generalization and memorization, or learning rules and exceptions. And so, Chiyuan.
CHIYUAN ZHANG: Thanks, Tommy.
[APPLAUSE]
Thanks for the very nice introduction. I'm really happy to be back, and I hope you will agree with what I'm talking about today. So today, I'm going to talk about quantifying and understanding memorization in neural networks. It covers a number of recent projects that I worked on with many of the collaborators here.

Yeah, feel free to interrupt me at any time if you have any questions. So without further ado-- maybe I don't need to motivate why we want to study this. The central topic we study is large neural networks that have the potential capacity to memorize. And we all know that those neural networks can do amazing things.

The example I'm showing here is Midjourney, which is an online service where you can generate art basically by describing what you want. And this is GitHub Copilot, which can basically write code based on your comments. Nowadays we see many, many more such amazing services and models coming up that can do all kinds of amazing things.
And at the same time, we know that those models have huge capacity; they can memorize things. This is an example with GPT-3-- maybe I should update it to GPT-4 now.

But this is a concrete example: I prompted it with the first sentence of a classical novel, and then it completes, in the highlighted text, the first several paragraphs of the book. It even uses the British spellings, which are marked as spelling errors by our US spell checker.

So it basically memorized at least several paragraphs of this book verbatim. But so what? Why do we need to study this? We know it memorized, but does it have any implications at all?
So at least for me, there are two kinds of reasons why we want to study this. The first one is more of a scientific motivation, where we want to study it from the perspective of understanding the generalization and learning behavior of neural networks.

For example, this is another, kind of old, chat bot that I interacted with. I was asking it to do a simple calculation, and it gave me the correct answer, and then, along with the answer, it gave me a website. I looked up the website, and there actually is a website that has a page for every pair of numbers up to-- I don't know up to how many-- with the correct answer on the page.

So I'm not saying that this chat bot is not doing the right thing or is cheating. I'm saying that when we see neural networks apparently generalizing through reasoning and through programming, we might be overinterpreting it. It would be great to have a deeper understanding of how the neural network does such things.

Maybe it's via really amazing compositional generalization, or maybe it's seeing similar examples in the training data and basically doing some pattern matching. So I think one motivation is understanding generalization.
The other motivation is more practical: we have issues such as privacy and copyright when those models memorize. This is an example from an earlier paper showing that, for example, if you have a language model and there is sensitive information in your training data, such as a Social Security Number, the model might memorize it and then leak that information when under attack.

Another issue, also related to art, is that we have those generative models that can generate amazing art, and it's very easy for other people to create interesting art without going through a lot of artistic training.

But at the same time, those artists who trained for a very long time and went through a very hard process of finding their own style might get that style copied very easily by other people who just prompt the model, even out of the box, when the artist's work was somehow included in the internet-crawled training data.

So yeah, those are the more practical issues. If we are able to understand the memorization of models in this sense, we might be able to modify it so that we have better control over those issues.
So yeah, the outline of the talk will be mostly four parts. I will probably have time to cover the first three parts; if I go too quickly, I will get to the last one. But mostly, I will start with a formal definition of what I mean by memorization when I talk about it.

And then I will talk about memorization mostly in image classification. This is more related to the understanding-generalization perspective.

And then in the last part, I will talk about memorization in language models, which covers two types of memorization. One is really those models generating training data; the other is more related, again, to generalization and learning-theory-related questions.
So without further ado, I want to first explain why we need to define memorization. The main reason is that even though memorization is an everyday concept that we all know, it can mean a lot of different things, even when we constrain it to the context of machine learning and deep learning.

For example, we have the classical theory of overfitting. When you do curve fitting, if you use a polynomial of too high a degree, it will fit the noise and give you those crazy interpolation results. We say it memorized the noise in the training data.

Or maybe your model learns outliers or even mislabeled examples. We know that models can fit randomly labeled examples, and we also call this behavior memorization when we describe it.

Another scenario is when you have generative models that are trained to generate something. Those models can also just copy-- or approximately copy-- what they see in the training data, and we also call this memorization.
Beyond that, there are other things: you can probably attack a model by extracting training data from, say, the model weights, or even the activations, or even the gradients during training, and so on.

So there are many different notions of memorization. What we want here today is mainly a notion that is related to generalization. We want to use this as an angle, or a perspective, to get better insight into the generalization behavior of machine learning models.
So I want to come back to the classical plot that you see in your first machine learning class, where you have model complexity on the horizontal axis and error on the vertical axis. There is a region of overfitting where the training error is very low, but your test error is very high.

We call this kind of gap the generalization gap. When this gap is large, we say the model overfits, or we say it memorized the training data without learning a pattern that generalizes. So we want to extend this notion to measure memorization. But this is a notion about a model-- a model that memorized.

What we really want is to compute or test whether or not an individual example is memorized. So can we extend this notion to measure a per-example property? In order to do that, let's go back to the formula we usually use to define the generalization gap. There's nothing complicated here.
Basically, if you have a machine learning model f_S that's trained on the training set S, you measure the training error, which is just the performance on the training set, minus the test error, which is the performance on the test set.

I guess the point here is that you are basically measuring the performance of f_S on the examples it has seen during training minus the performance on examples it has not seen during training, where both are sampled IID from the same distribution. And you define this gap as the generalization gap.
Now you can massage this formula a bit. Instead of looking at a particular model, let's say we want to measure the generalization gap of a particular example z. And then, again, we measure a gap.

But now the first term is the following: we sample some model that is trained on some training set, and in this first term, z is included in that training set. So basically we compute the average performance of models that have seen this example minus the average performance of models that have not seen this example. And this gap we call the generalization gap, or memorization score, of this example.
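Written out, this is essentially the following quantity (a hedged reconstruction in my own notation, with A the training algorithm and z = (x, y) a labeled example; it should match the verbal definition above):

```latex
\mathrm{mem}(z; A, S) \;=\;
\Pr_{f \sim A(S)}\bigl[f(x) = y\bigr]
\;-\;
\Pr_{f \sim A(S \setminus \{z\})}\bigl[f(x) = y\bigr],
\qquad z = (x, y) \in S .
```

A small gap means models predict z correctly whether or not they trained on it; a large gap means the prediction relies on having seen z itself.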
Basically, what we're doing here is a leave-one-out measurement, which Tommy has been doing for many decades. Given an example, you train a number of models with this example and the other examples, and then you train another bunch of models without that example-- we call them the in-models and out-models-- and you compute the difference between them.

If the gap is large, we call this example memorized. The intuition is that if the gap is small for an example, then even models that have not seen it during training are still able to predict it quite well.

That means there are probably many other examples in the training set that encode similar information, and the model is able to learn the generalizable pattern. So we say this example is learned-- the model is generalizing instead of memorizing the idiosyncratic information in that example.

On the other hand, if a model is only able to make a very good prediction after seeing the example in its training set, then it's essentially memorizing the unique information encoded in that example. So this gives us a leave-one-out estimation procedure, but it is expensive to do, especially considering that the data sets we're dealing with nowadays are pretty large.
So we can do something slightly smarter, which we call subset estimation. Instead of training a bunch of in- and out-models for every example you want to measure, you just take random subsets of your training set and train models on them.

Then, for every example you want to measure, you go back and filter the models into those that have seen versus have not seen this example. I have a figurative illustration of the procedure here, assuming you have a training set of many images.

What you do is randomly subsample, let's say, 70% of your training data, remove the remaining 30% from your training set, and train the model with your standard state-of-the-art training pipeline to get a model f hat. Then you repeat the same thing many, many times, each time with an independent subset sample of the training set.
With that, you will essentially have two matrices. The number of rows corresponds to the number of times you repeated the training, and the number of columns corresponds to the number of training examples in your training set.

The first matrix is the prediction correctness: the entry at the i-th row and j-th column says whether or not the i-th model I trained predicts the j-th training example correctly. And the mask down below is the binary mask of whether or not I included the j-th example in the i-th training run.

And if I just take the j-th column-- each of the columns-- it basically gives me all the information I need to calculate the generalization gap, or the memorization score, that we defined earlier. Yeah. So that's basically the definition.
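A minimal sketch of this subset estimation, assuming we already have the two matrices described above (the array names, shapes, and the 0.25 threshold at the end are my own illustrative choices, not the authors' released code):

```python
import numpy as np

def memorization_scores(correct: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Per-example memorization scores from subset-trained models.

    correct: (num_runs, num_examples) binary matrix; correct[i, j] = 1 if the
             i-th trained model predicts training example j correctly.
    mask:    (num_runs, num_examples) binary matrix; mask[i, j] = 1 if example j
             was included in the training subset of run i.
    """
    in_runs = (mask == 1)
    out_runs = (mask == 0)
    # Average accuracy on example j over models that did / did not see it.
    acc_in = (correct * in_runs).sum(axis=0) / np.maximum(in_runs.sum(axis=0), 1)
    acc_out = (correct * out_runs).sum(axis=0) / np.maximum(out_runs.sum(axis=0), 1)
    return acc_in - acc_out  # large gap = memorized

# e.g. scores = memorization_scores(correct, mask); memorized = scores > 0.25
```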
And a few comparisons to other definitions that people have used in the literature. The first one is, again, overfitting to random labels. It's kind of similar, but here we are not restricted to random labels or outliers or mislabeled examples.

I'll show some examples in a moment. What we found is that there is actually a continuous spectrum: if you think about data you find in the real world, there is a continuous spectrum from very canonical examples, to rare subpopulations, to complete outliers and even mislabeled examples.

So this notion captures that. It also has some connection to interpolation, but interpolation does not really distinguish between memorizing common examples and memorizing outliers. It's not super related, I think, to spurious features, and it has a connection to membership inference attacks, which people use in privacy to measure privacy leakage.

And it's related to training data reconstruction, but training data reconstruction in many cases concerns generative models, while here we mostly talk about classification models. I will talk about language models in the second part of the talk, so there will be some connection there.
So I guess that's the definition we're working with for memorization. In the first part, we're going to look at some results in image classification, and then we will talk about language models.

So this is exactly the procedure I described: we train a number of models, we compute the mask, and then we compute the memorization score for each example.

Then we can rank the examples according to their scores, or we can threshold them: if the memorization gap-- the generalization gap-- is small, we say the example is not memorized, and if it's large, we say it's memorized.
What we see is that the score matches our intuition quite well. For example, this is the peacock class in the ImageNet data set. The canonical examples are essentially not memorized, because each of them provides more or less similar visual information to help the model discriminate this class.

On the other hand, if you look at the memorized examples, they're much less canonical, and they essentially encode rare subpopulations. For example, I didn't know there is a population of white peacocks until I saw those images. But there are also some very ambiguous and outlier examples that may or may not be classified as peacock depending on the context.

And here are just more examples of this; this is the class toaster. Again, the canonical examples are not memorized, but the memorized examples contain a lot of outliers and ambiguous examples.
One thing we looked at once we had the memorization score is the learning dynamics, or learning behavior, of those examples throughout training. What we're plotting here is this: we group the examples into subsets of [INAUDIBLE], ranging from low memorization to high memorization, and we track how well they are learned throughout training.

And maybe somewhat unsurprisingly, we found that, even across different optimizers, the memorized examples are learned much later than the non-memorized examples. That might explain why early stopping helps a lot in some cases.
For the same reason, we imagined that after you identify memorized examples-- because some of them are outliers-- you might want to remove those examples. Maybe that helps generalization; that's a natural thing to try, and that's what we tried here on the SVHN data set. And it helps a bit, up to some extent.

However, if you try it on many other data sets-- here, we tried ImageNet and CIFAR-100-- it actually hurts performance to remove memorized examples. Surprisingly, removing memorized examples hurts even more than removing random examples.

So if you remove an equal number of examples randomly sampled from the training set, the test performance will drop, but if you remove the top-k memorized examples, the performance drops even more.
So that's an interesting phenomenon, which is connected to the observation, or trend, that larger models usually lead to better generalization-- that's what people observe in practice. And a larger model usually means more memorization, because those models have more capacity to memorize.

Here, I'm just citing some figures from the literature. We have the scaling laws in language models: basically, the larger your model and the more compute it gets, the better the performance. And here we have the double descent phenomenon, where we also see that after a certain threshold, a larger model basically leads to better generalization.

So on one hand, we're thinking that generalization and memorization are kind of opposite things, but on the other hand, we are seeing this phenomenon where memorization seems essential to generalization, and removing memorized examples hurts generalization. So what's going on here?
So we tried to answer this question. In the end, I think at least the intuition is pretty simple: even when you are dealing with a standard machine learning benchmark data set with balanced classes-- for example, 10,000 examples in each class-- there are still subpopulations within each class that have different properties, or different frequencies, when you sample them.

On the one hand, you have those canonical examples that have high frequency, but you also have a long tail which contains many subpopulations that are rare subpopulations, or rare instances, in your training data.

Memorization helps because memorizing those rare instances, or subpopulations, in detail could potentially help the test accuracy: in the test set, you might also run into similar examples from the tail.

So even though for each individual subpopulation the frequency is very low, if you consider the entire tail, there is actually a non-trivial probability of seeing such examples at test time. And one of our collaborators, Vitaly, has a theory paper with a model that describes this behavior.
We're not going to go into the details here, but essentially what it says is that the generalization error of any classifier, any model, will be at least the optimal generalization error that you can achieve plus some term, and this term can be lower-bounded if your model does not memorize the training examples.

This theoretical model is constructed on a simple discrete learning scenario that can be extended to a mixture-of-distributions case, but it's still synthetic. The key is that if the distribution follows a long tail, we can show that achieving optimal generalization is only possible when you memorize everything.

But one thing we wanted to do is verify whether or not this synthetic, theoretical model holds in real-world data. In order to do that, we basically tried to compute memorization in real-world data and to measure the generalization impact of those memorized examples.
For each memorized example, if the hypothesis is true, then there is going to be a test example that's also in the tail and that matches the corresponding training example, and we want to find it. But the question is how to find it. The answer is that we can extend the original memorization equation to compute a kind of influence.

Here, we were measuring the performance gap on the example itself when we include or exclude it from the training set. But we can decouple the two: we can measure the impact of including or excluding this example on the performance of another example. This gives us the influence of a particular training example on a particular test example.
So here is, again, an illustration of what we are trying to do. Assume we have some training set and a subset of it; the purple-colored ones are memorized examples. What we do is compute the influence using the formula described on the previous slide.

We identify the train-test pairs whose influence exceeds a certain threshold. After we identify those, we can essentially remove the corresponding training examples that are memorized and have strong influence on a particular test example, and then measure how that impacts the test performance.
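A minimal sketch of this decoupled influence estimate, reusing the subset-trained runs from before; the test-correctness matrix and the 0.15 threshold are my own illustrative choices, not the paper's exact setup:

```python
import numpy as np

def influence_matrix(test_correct: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Influence of each training example on each test example.

    test_correct: (num_runs, num_test) binary matrix; 1 if run i predicts
                  test example k correctly.
    mask:         (num_runs, num_train) binary inclusion matrix, as before.
    Returns a (num_train, num_test) matrix: average test accuracy over runs
    that included a training example minus runs that excluded it.
    """
    in_runs = (mask == 1).astype(float)    # (runs, train)
    out_runs = (mask == 0).astype(float)
    acc_in = in_runs.T @ test_correct / np.maximum(in_runs.sum(axis=0), 1)[:, None]
    acc_out = out_runs.T @ test_correct / np.maximum(out_runs.sum(axis=0), 1)[:, None]
    return acc_in - acc_out

# infl = influence_matrix(test_correct, mask)
# strong_pairs = np.argwhere(infl > 0.15)   # (train_idx, test_idx) pairs above threshold
```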
For example, on this CIFAR-100 data set, we identified around 1,000 unique training examples that have strong influence on a test example.
TOMASO POGGIO: How big is the training set in CIFAR?
CHIYUAN ZHANG: It's 50,000.
TOMASO POGGIO: 50,000.
CHIYUAN ZHANG: Yeah. So it's not a very big subset. It also depends on the threshold you choose; we tried to make it conservative so that it's not including other examples. So we see a test accuracy drop for the same model after we remove those 1,000 examples.

And again, this is consistent with what we observed before, because memorized examples help generalization. The interesting thing is that this performance drop, 2% to 2.5%, is equivalent to removing about 11,000 random examples-- you have to remove that many to achieve the same drop.

But what's more interesting is that the test accuracy on the test examples that are highly influenced by those removed examples drops significantly, and this drop almost entirely explains the drop in the overall test accuracy.
And here we have some illustrative example images we found. In the first column, we have training images, and in the second column, we have test images. The test images are ranked by the influence of that particular training image on them.

So the first test image receives the strongest influence, and the remaining images follow in decreasing order. Here, we sample a few images from different influence ranges, from high to intermediate.

What you essentially see is that the first image on the right-- the most strongly influenced test image-- is very visually similar, and sometimes maybe even just a different crop of the same image. And that image is usually not a canonical instance of that class.

Here, we are seeing relatively intermediate influence, and you see they are still quite visually related, and this particular training example provides very strong support for correctly classifying these test images.
Similar things happen on the CIFAR data sets, and it's even more severe there, in the sense that there are a lot of near-duplicate images between the training set and the test set in CIFAR.

I think when people constructed such data sets they did deduplication, but the earlier algorithms were not good enough to identify those images, and those ended up being the most strongly coupled pairs.

So in summary, I think this affirms the hypothesis that the reason memorization helps generalization in those large models is at least mostly explained by this long-tail hypothesis. And next, I want to talk a bit about memorization in language models. Yeah, question?
AUDIENCE: Just to understand this memorization score metric. So it seems that it defines or it captures how canonical an image is. There are other ways of defining this. Have you checked whether these metrics are-- like how related these two metrics are?
CHIYUAN ZHANG: Yeah, I think that's a very good question. The relationship of this memorization metric to other memorization metrics-- I talked about it a bit in the table I presented. For example, there is memorization in the sense that you are essentially overfitting to random labels.

That is what people often call memorization, and that's actually captured by this: if you have random labels, then, except for a chance of guessing correctly, you will by definition have a high gap in the memorization score.

There is also the notion of interpolation, where your model has large enough capacity that it fits all the training examples. But that notion is general: a canonical example is also fit to the training data, so it does not distinguish between a more canonical example and a more outlier example.

And we're going to talk in a few moments about language models, which are essentially generative models, where one very natural thing to measure is whether what you generate is basically copied from the training data. That is also a notion of memorization.
And there is actually anti-correlation between this notion of memorization and the memorization score we defined earlier. I hope that answers your-- yeah. OK.
AUDIENCE: I had another question.
CHIYUAN ZHANG: Yeah.
AUDIENCE: OK. People in this building have also defined a memorability score for images. I wonder whether you have checked whether there is a correlation between your score and that memorability?
CHIYUAN ZHANG: Yeah, that's a good question. One interesting thing here is that the memorization score we define is purely based on the learning behavior of a model; it kind of disregards the content of the image.

I think the memorability score is defined by humans' responses, and maybe there are some characteristics of image content that are more memorable. But here, we define it mostly via the dynamics of model learning. We didn't explicitly measure the content, because it's hard to estimate computationally.

But I think in the end it's not completely irrelevant to what the content is. It's more like measuring the relationship between a particular training example and the rest of the training set-- whether there are many visually or semantically similar examples, or whether it's an outlier or a rare subpopulation.
AUDIENCE: I see.
CHIYUAN ZHANG: Yeah.
AUDIENCE: So if I followed it correctly, the data points with high influence are the ones at the modes, not in the long tail. Is that correct, first of all?
CHIYUAN ZHANG: So if your example is not in the long tail, it's not possible to have very strong influence because the influence will be kind of equally distributed among many training examples. So there's no single training example that can incur a dominant influence on a test example.
So if you imagine you have a canonical peacock and you have another peacock in your test set, this example cannot influence the test example, because even if you remove it, there are many other canonical peacocks that provide the same visual information for the model to learn to recognize the test peacock.

Only if this one itself is unique-- in the tail, rare-- can it have a dominating influence on the test example. So it's--
AUDIENCE: So I guess, could you clarify further how you go about identifying the long tail here? Because here, it seems like even the high-influence ones are probably more likely to be the modes, or the canonical examples you're talking about.

But the low-influence ones would actually be the long tail-- the rarer ones that will somehow magically be helpful later on, right?
CHIYUAN ZHANG: Yes, yes. So what we did is a two-step procedure. We first find the memorized examples, which, by definition, are the examples in the tail. And then, among those examples, we find the ones that have the strongest, dominating influence on a single test example or a few test examples.

So it's a two-step procedure, so we don't include examples that are the mode of a dominating cluster. But also, by calculation, those examples-- the modes of a dominating, very high-frequency subcluster-- will not have a very high influence score under this definition anyway. Yeah. Cool. Another question?
AUDIENCE: [INAUDIBLE] you use the word memorization, but is it possible to recover these images? Whatever you call memorized images-- did you recover them as well? Is there a way to recover them [INAUDIBLE]?

CHIYUAN ZHANG: Yeah. So by recovering, you mean attacking-- like extracting. So there is a deep connection between memorization under this definition and membership inference attacks, which are not exactly extracting the example. The membership inference attack is the canonical privacy attack, where the attacker wants to figure out whether or not a particular example was included in your training set.

And it was recently shown that the most memorized examples are the most vulnerable to such a privacy attack. Maybe there are ways you can extend this membership inference attack to reconstructing those images from the model, but I don't know if there are existing algorithms to do so.
Now I'll quickly talk about language models. This is just a basic introduction to language models, but I'm sure everybody here knows what a language model is. In particular, I want to talk about some scaling laws we measured for memorization in language models.

I will also talk about some work we did trying to prevent memorization in language models. And in the end, I will compare verbatim, or textual, memorization to semantic-level memorization, which also loops back to what we talked about earlier for vision models.

I guess the special thing about a language model is that it's a generative model-- it actually generates examples. So the natural thing we can do is compare the generated text with the training set to see whether or not it matches. If it matches, that's a very natural definition of memorization.
Essentially, maybe your model didn't learn how to speak English, but it stores a lot of essays in its weights and it can repeat the same thing when you give it the right prompt. So we want to measure to what extent the model memorizes things.

The protocol here is very simple. We just sample a prefix from the training set, we prompt the model with it and ask it to generate a completion, and then we compare with the training data to see if it matches.

And it does not necessarily need to match the original example the prefix came from, because there are many other examples in the training set that could contain a similar prefix but a different completion, and the model might match one of those. So we check all the training examples to see if there is a match.
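A minimal sketch of this extraction protocol as I understand it; `generate` stands in for any language model's continuation function, and the 50-token prefix/continuation lengths and token-level matching are illustrative assumptions, not the paper's exact setup:

```python
def is_memorized(doc_tokens, training_docs, generate, prefix_len=50, cont_len=50):
    """Prompt with a training prefix and check whether the model's continuation
    appears verbatim in ANY training document, not just the one the prefix came from."""
    prefix = doc_tokens[:prefix_len]
    continuation = generate(prefix, max_new_tokens=cont_len)
    return any(contains(train_doc, continuation) for train_doc in training_docs)

def contains(haystack_tokens, needle_tokens):
    """True if needle_tokens occurs as a contiguous subsequence of haystack_tokens."""
    n = len(needle_tokens)
    return any(haystack_tokens[i:i + n] == needle_tokens
               for i in range(len(haystack_tokens) - n + 1))
```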
One thing we measure is how the model's memorization behavior is impacted by different parameters. Here, we are measuring the model size. We know that larger models generalize better, but we want to measure whether larger models also memorize more.

Here, we have essentially a log-linear curve showing a positive correlation between model size and memorization. One interesting question we were asking is: when we see more matches with the training data, is it because the model is actually memorizing more training data, or is it because the model is becoming better at English, say, and it knows how to write more correct or meaningful sentences that happen to match the training data more?

To test that hypothesis, we have baseline models which are trained on a different training set, also with increasing model sizes. It turns out that for the baseline models not trained on this particular training set, the memorization rate does not increase significantly with model size.

So I think it's fair to say that this actually measures memorization rather than language capability. The other interesting thing about this definition of memorization is that it measures, operationally, how much data you can extract. It's kind of a lower bound on what the model memorizes-- maybe the model stores more information in its weights that you are not able to extract with this mechanism.
So here, what we're measuring is this: we increase the length of the prompt we give to the model to see whether giving more context allows the model to recover the memorized text with higher chance. And it turns out to be true.

So what we are measuring, essentially, is only what we call discoverability-- that is, what we are able to extract. If you give the model more context, you can extract more.

And the last kind of scaling law we discovered is that it turns out there are a lot of repetitions, or near-repetitions, in training sets that are crawled from the internet, and the repetition of data has a very large impact on the memorization rate. Basically, the more repetitions you have, the more likely the text is going to be memorized.
With those measurements in hand, we are now trying to see if we can prevent the language model from memorizing. I guess the most actionable thing is the repetition of data, because we don't want to restrict the model size.
Apparently, larger models are going to be way better than smaller models, and people won't give that up. [INAUDIBLE] is not something that we can control: if an adversary comes, they are able to do whatever they want. So repetition in the training data is the most actionable.
So what we did here is implement an efficient algorithm to detect near-duplicate examples in common language model training sets. What we found is that there are a lot of duplicates in many common data sets, including near-duplicates between training examples and also between training and test examples. They're not very clean.

Maybe 3% does not sound that large, but those data sets are also quite large: 3% of the C4 data set is about 10 million documents. Another fun example we found is that there is one advertisement, I think, that's repeated about 60,000 times in this single data set. So the models are seeing a very skewed distribution of the text data [INAUDIBLE].

Here are some examples of near-duplicates we found in those data sets. Some of them are just templated text with different fill-ins, and some are, say, a news article and another one citing it and copying a large paragraph verbatim, and so on.

And what we found is that if you detect those near-duplicates and then deduplicate your training data to remove them, it has almost no impact-- or a slight improvement-- on your model utility, the perplexity. But it drastically reduces the memorization rate of those models.
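As a toy illustration of near-duplicate detection in the spirit described here, using word n-gram Jaccard similarity; the real work used far more scalable techniques, so this quadratic sketch (and its 0.8 threshold) is only my own illustrative assumption:

```python
def ngrams(text: str, n: int = 5) -> set:
    """Word n-grams of a document, used as its fingerprint set."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def near_duplicates(docs, threshold: float = 0.8):
    """Index pairs of documents whose n-gram Jaccard similarity exceeds the
    threshold. O(len(docs)^2), so only suitable for small corpora."""
    fingerprints = [ngrams(d) for d in docs]
    pairs = []
    for i in range(len(docs)):
        for j in range(i + 1, len(docs)):
            inter = len(fingerprints[i] & fingerprints[j])
            union = len(fingerprints[i] | fingerprints[j]) or 1
            if inter / union >= threshold:
                pairs.append((i, j))
    return pairs

# deduplicated = [d for k, d in enumerate(docs)
#                 if k not in {j for _, j in near_duplicates(docs)}]
```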
Another way to prevent memorization is to do some checking at inference time: I check whether the 10-gram I'm about to generate matches any training document, and if it does, I reject it and ask the model to generate another one.

This is a very simple solution that actually turns out to be very "effective," with the quotation marks. Recall that we have a very nice data structure called a Bloom filter that can do this very efficiently. It gives us a zero false-negative rate, meaning it is kind of conservative, but it will never miss a memorized n-gram.
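A minimal, self-contained sketch of this kind of n-gram Bloom-filter check; the class, helper names, filter size, and word-level tokenization are my own simplifications for illustration, not the production implementation:

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter: no false negatives, a small rate of false positives."""
    def __init__(self, size_bits: int = 1 << 24, num_hashes: int = 4):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8 + 1)

    def _positions(self, item: str):
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

def build_ngram_filter(training_docs, n: int = 10) -> BloomFilter:
    """Index every word n-gram of the training corpus."""
    bf = BloomFilter()
    for doc in training_docs:
        words = doc.split()
        for i in range(len(words) - n + 1):
            bf.add(" ".join(words[i:i + n]))
    return bf

def violates(candidate: str, bf: BloomFilter, n: int = 10) -> bool:
    """Reject a candidate continuation if any of its n-grams appears in training."""
    words = candidate.split()
    return any(" ".join(words[i:i + n]) in bf for i in range(len(words) - n + 1))
```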
So here is what we do with this Bloom-filter-based memorization prevention mechanism. What we're measuring here is the BLEU score-- approximate memorization, basically-- because the filter perfectly prevents verbatim memorization.

What we see is that it no longer scales with the model size. However, when I say it works, with quotation marks, it's because verbatim memorization, even though it's very easy to measure, is not really the main type, or the only type, of memorization a model does. A model does a lot of approximate memorization.

So if you measure an approximate memorization rate with some BLEU score threshold, it will be much higher than the verbatim rate. And here is, I guess, a fun example we found with GitHub Copilot.
We don't know exactly what they do with their model, but they have a similar switch you can turn on so that it will refuse to generate a code completion if it matches public code.

So here is a very famous inverse square root algorithm that is available online. If we just give the blue line as the prompt, it will generate a few lines and then realize, oh, I'm copying this algorithm verbatim-- I will stop here.

However, if you just add a pound character as a prefix-- a fake Python comment-- then, because this is C code, the training data will never have this comment, so the text will not match any training data. And now the model is very smart.

It realized that, OK, I'm going to prepend this pound character to every line I generate, and it also now gets around this text-based matching feature, because it no longer verbatim-matches any of the training data at all. And it will happily generate the entire algorithm here.
And this is another example: if you change the variable names to French, I think it is, the model will generate the whole thing but change some of the variable names to French. It has memorized the essential information, the algorithm, and it's almost verbatim.

Maybe some of us did similar things-- changing variable names when copying other people's programming solutions. So I guess this tells us that a verbatim memorization check is something very easy to work with, but it's not going to work in many cases.
Here are some other examples that basically do style transfer: lowercase, uppercase, changed white space. And here are some language models, not code models, that also do similar things that defeat such a memorization check.

A model can split a word into multiple tokens, or it can use a different word that means the same thing, and it can change uppercase to lowercase.
Because of that, in the last three minutes I will very briefly talk about some earlier efforts we made to go beyond verbatim memorization. I'll skip some of this. Something that is probably related to what people in this building think about: there's actually a lot of study in psychology and cognitive science about different types of memory-- there's a whole taxonomy of memory.

And not all memory is bad, even in the context of machine learning models. In particular, I want to talk about the subcategory of explicit memory, where there is a difference between episodic memory and semantic memory.

Very roughly, what semantic memory encodes is general knowledge-- things you might want your language model to know, such as that Paris is a city in France.

Episodic memory is more like detailed information about a specific episode or event that happened. So we want our model to know common knowledge and common sense so that it can be useful, but we don't want the model to learn very specific things about a specific user, which might leak information and probably would not be very useful anyway.
The definition from the cognitive science perspective is intuitive, but it's not very easy to compute or measure. What we found is that it actually matches the memorization notion we defined earlier.

We use the intuition that semantic memory is something that is repeated many, many times and shared by many examples in the training set; the knowledge that Paris is a city in France is probably encoded in many examples.

Private information, or episodic memory, is probably something that is not represented that many times. So by using frequency, we can come up with something we can compute, and it essentially goes back to the memorization score that we defined earlier. Here, we call it counterfactual memorization, but it's essentially the same.
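Written out, this counterfactual memorization appears to be the same leave-one-out gap as before, just applied to documents; the notation here is my own, with acc denoting the model's accuracy (e.g. per-token) on document d:

```latex
\mathrm{mem}(d) \;=\;
\mathbb{E}_{S \,\ni\, d}\bigl[\mathrm{acc}(f_S, d)\bigr]
\;-\;
\mathbb{E}_{S \,\not\ni\, d}\bigl[\mathrm{acc}(f_S, d)\bigr]
```

High-frequency facts shared across many documents give a small gap; rare, document-specific content gives a large one.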
So if you have a document that says "my Social Security Number is blah, blah, blah," hopefully this piece of information is not contained in many documents on the internet. And in that case, this memorization gap, or memorization score, is going to be high.

Because, assuming that piece of information is private and not common knowledge shared by many examples, if you remove it from the training set, the model will not be able to complete your Social Security Number very easily, and this gap is going to be large.

But for common knowledge that is encoded in many examples, this gap is going to be small. So we can essentially measure the same kind of memorization, and hopefully approximate the notions in the taxonomy of memory, which allows us to capture different types of memory.
So here, let's just look at the histogram on top. It's basically the distribution of the memorization score computed on three different data sets; the plot is shown in log scale. What we see is that most examples have low memorization-- there's a dominant mode here-- but there is a kind of tail of high-memorization examples.

And here are some text examples with different memorization scores. What we found is that the most highly memorized examples are actually not that interesting: those are things like all-capital text, foreign-language text, or unstructured text.

So basically, those are very atypical examples of English [INAUDIBLE], so the model basically has to memorize them because they don't follow the usual English grammar.

The intermediately memorized examples become slightly more interesting, in the sense that those are things like news articles reporting a specific episode or event. And when you go to very low memorization, you essentially see repeated tags, common information, and so on.
And here, just to relate to one of the earlier questions, we have the relation between memorization and the number of duplicates. We have the scaling law that memorization in the text-matching, verbatim sense basically scales with the number of duplicates: the more duplicates you have, the more memorization you're going to get.

But what we see here is that there is an anti-correlation: a high duplicate count leads to low memorization in this semantic, counterfactual sense, though the converse is not necessarily true.

Another thing we can measure is, again, influence. It's the same as in the image classification case: we look at which training examples are highly memorized and which test examples they highly influence.

We also found many semantically almost identical articles where, textually, there are some explicit or implicit edits that make them hard to match textually.
So this figure shows the memorization score and the influence score. What we see is that in order to have high influence, the memorization has to be high. I guess this answers the earlier question: can we have high influence when an example is at the mode-- when it's actually not highly memorized?

From this figure, that is mostly not the case. But not all highly memorized examples have high influence on a test example. That also depends on the exact test set you sample: the larger the test set you sample, the more coupled pairs you are going to find.
Yeah. So those are high-influence pairs with generated examples. This is the last slide of this section and of the talk. We do have some limitations in this semantic similarity measure and semantic memorization. One is that it is computationally expensive.

For those large language models, we still need to train at least hundreds of models in order to measure such memorization, and for the currently largest, GPT-scale models, it's not really feasible to do such retraining.

So one thing we've been looking at is retrieval-based models, with which we can easily do this subset measurement without retraining. The other thing is that we don't really have ground truth for the measurement of influence and memorization. Even though we talk about semantic memorization, we can show examples and some statistics, but we cannot measure the accuracy or performance of this algorithm.

So that's one limitation we are looking to solve. I will skip the last section here-- I'll stop here. If there are any questions, I'm happy to answer.
[APPLAUSE]