Photographic Image Priors in the Era of Machine Learning
May 19, 2023
May 9, 2023
Eero Simoncelli, Silver Professor; Professor of Neural Science, Mathematics, Data Science and Psychology, NYU
All Captioned Videos Brains, Minds and Machines Seminar Series
Abstract: Inference problems in machine or biological vision generally rely on knowledge of prior probabilities, such as spectral or sparsity models. In recent years, machine learning has provided dramatic improvements in most of these problems using artificial neural networks, which are typically optimized using nonlinear regression to provide direct solutions for each specific task. As such, the prior probabilities are implicit, and intertwined with the tasks for which they are optimized. I'll describe properties of priors implicitly embedded in denoising networks, and describe methods for drawing samples from them. Extensions of these sampling methods enable the use of the implicit prior to solve any deterministic linear inverse problem, with no additional training, thus extending the power of supervised learning for denoising to a much broader set of problems. The method relies on minimal assumptions, exhibits robust convergence over a wide range of parameter choices, and achieves state-of-the-art levels of unsupervised performance for deblurring, super-resolution, and compressive sensing. It also can be used to examine perceptual implications of physiological information processing.
Bio: Eero received a BA in Physics from Harvard (1984), a Certificate of Advanced Study in Math(s) from University of Cambridge (1986), and an MS and PhD in Electrical Engineering and Computer Science from MIT (1988/1993). He was an assistant professor in the Computer and Information Science Department at the University of Pennsylvania from 1993 to 1996, and then moved to NYU as an assistant professor of Neural Science and Mathematics (later adding Psychology, and most recently, Data Science). Eero received an NSF CAREER award in 1996, an Alfred P. Sloan Research Fellowship in 1998, and became an Investigator of the Howard Hughes Medical Institute in 2000. He was elected a Fellow of the IEEE in 2008, and an associate member of the Canadian Institute for Advanced Research in 2010. He has received two Outstanding Faculty awards from the NYU GSAS Graduate Student Council (2003/2011), two IEEE Best Journal Article awards (2009/2010) and a Sustained Impact Paper award (2016), an Emmy Award from the Academy of Television Arts and Sciences for a method of measuring the perceptual quality of images (2015), and the Golden Brain Award from the Minerva Foundation, for fundamental contributions to visual neuroscience (2017). His group studies the representation and analysis of visual images in biological and machine systems.
MODERATOR: Welcome, everybody. It's great to see everybody here. It's a great pleasure to welcome Eero Simoncelli back to CBMM and BCS. Eero got his start here-- well, not his real start, but got his PhD here, did it with Ted Adelson as a student in EECS, right? Yeah. And everybody in the room is probably very familiar with his trajectory and his many contributions. And he's a really remarkable guy in my mind just because of the breadth of the contributions that he's made to so many different fields.
So his PhD did really pioneering work on visual motion processing, introducing the idea that you could do Bayesian analysis of visual motion and introducing the idea of a slow prior, but then ended up explaining lots of different stuff and was one of the first kind of influential applications of Bayesian approaches to perception. He's made lots of important contributions to image processing. So he's got this method for measuring image similarity that turned out to be wildly incredibly popular, that many of you may know.
He's also made lots of contributions for visual neuroscience, modeling the responses of individual neurons with lots of influential collaborations with physiologists, in particular. Lots of important contributions to perception. So he's done massively influential work on crowding, probably most important paper in texture perception. And so it's really kind of a shockingly diverse portfolio.
He's been at NYU for most of his career. So he was briefly at Penn straight out of grad school, was then hired by Center for Neuroscience. I had the great pleasure of doing a postdoc in his lab, which was a fantastic place full of smart and friendly people, right before I came here. And for the last couple of years, he's been the director of the Flatiron Institute, which is a new institute for computational neuroscience that he started a couple of years ago. So that's getting up and running and is a new chapter in things for him.
And so I think what he's going to tell us about today is some work that he's done since he moved there. So hopefully that did you justice. I got told I was introducing him two minutes ago. So that was my attempt to summarize the many contributions. So welcome Eero. Thank you. And look forward to your talk.
EERO SIMONCELLI: Thanks, Jeff. Thanks, Josh. And thanks Gabriel. Thanks for inviting me. That was Josh's way of saying that I'm all over the map, and I still haven't figured out what I want to study. I guess that's been like that since I was a graduate student. And I forgot, I was going to add one more thing.
I'm actually not the director of the Flatiron Institute. That's the one thing you got wrong, Josh. I'm the director of a new center within the Flatiron Institute. The Flatiron Institute is an internal research endeavor by the Simons Foundation aimed at using computational methods and tools and ideas and theories to advance science in different directions. It actually consists of five different centers. And I'm the director of the Center for Computational Neuroscience.
Enough said. So I'm going to tell you about stuff that's developed in my group over the last-- it's really about four years when we succumbed to the deep net craze. We were resisting, but I'm told resistance is futile, and I think that's correct. We basically had to give up about four years ago. And we started thinking about what to do about this.
And so I'll tell you basically some bits and pieces of the story of where we've gone on this and some things we've learned along the way. And the talk is not explicitly about biology, but there is some biology in it. And the kind of work that I do is really broadly conceptual. It's about vision. So that's really the theme.
So just to emphasize that, so when you were looking at that picture of the cows, which I don't know why I always start my talks with the picture of the cows. It has nothing to do with the topic of my talk. I just love the photograph.
And when you're looking at that, of course, the visual information's coming into your eyes. It's working its way through your eyes, through your lateral geniculate nucleus-- down your optic nerve to the LGN, back to the cortex. And then it winds its way through your cortex, and out of that emerges sight, whatever it is that you think you're looking at, your ability to interpret things, to recognize them, to react to them emotionally, to understand their material properties, to understand where they are in space-- three-dimensional space, et cetera.
So my group, in general, is interested in how that happens. How do those neurons encode visual information? What is it that they're doing to represent that information? How does that allow or limit our perceptual capabilities? And how do we build engineered systems that can exploit that, maybe use some of the same principles, maybe mesh with those biological systems?
And all of it really is built around a foundation of thinking about, what are the principles, what are the fundamentals of visual signals and their properties? And in fact, for today's talk, the core really, if I think all the way back to when I was a graduate student starting out with Ted, I did my master's thesis on wavelets. They will appear in this talk and in multiscale representations. And a lot of what drove that work, and continues to drive my thinking about vision, is understanding prior probabilities, understanding what is it that makes a visual image different from white noise? What is it that characterizes the properties that we see in the visual world? How does our brain learn to understand those things?
So your brain and many artificial visual systems actually have priors for the signals that they have to operate on. Sometimes those are quite explicit. Sometimes they're implicit. Sometimes they're built into the algorithmic structure. If we build an algorithm for processing images, we might build the priors into the algorithm.
But usually they're there, in some form or another. And for your visual system, you know that you have things like priors for images because if I ask you to look at something like this, and I say, well, which of these images is the correct or true image and which ones are distorted, every one of you in the room can answer the question, and you'll all get it right. So do you know? How does your brain know? How does your visual system know?
What is it that you've learned through your lifetime-- it probably doesn't take all that much experience to be able to answer this question. A young child could do it. What is it that you learned about images that tells you what they're supposed to look like and what they're not supposed to look like that allows you to answer that question?
So there's a long history to building image priors, explicit image priors. It goes back to the beginning of the era of signal processing. Maybe that's like the 1940s or something. And this is my brief, in-a-nutshell summary of that. And a lot of times, so because it's difficult to build a prior on a signal that lives in a high-dimensional space, we all know that there's something that the statisticians tell us about called the curse of dimensionality, which, more or less, is the idea that you can't just make histograms.
If you try to make histograms of things in high-dimensional spaces, you're never going to fill up the bins because the number of bins that you have goes up as a power. It's exponential in the number of dimensions. So the curse of dimensionality says you're not going to figure this out from data.
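To make the bin-counting argument concrete, here is a minimal Python sketch. The choice of 10 bins per axis and the list of dimensions are illustrative assumptions, not numbers from the talk.

```python
def histogram_bins(bins_per_axis: int, num_dims: int) -> int:
    """Number of cells in a histogram that uses the same bin count on every axis."""
    return bins_per_axis ** num_dims

# Illustrative assumption: 10 bins per axis.
# Even a tiny 8x8 image patch has 64 dimensions.
for num_dims in (1, 2, 3, 16, 64):
    print(f"{num_dims:3d} dimensions -> {histogram_bins(10, num_dims):.3e} bins")
```

The cell count grows exponentially with the number of dimensions, so no realistic data set can populate the histogram of even a small image patch.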
And so the old approach to this was always, well, think about symmetry properties of the signal. Think about the generation of the signal, the physical properties. Make various kinds of structural assumptions that make sense. And then come up with a very simple description, like a parametric description, of what's going on.
So we have-- the tradition in the field started with Gaussian models, of course, the simplest parametric model we have for probability distributions. And there are various reasons why those things show up. And probably many of you know those. And that model, which kind of takes off in the '50s, is the bedrock model. It's the thing you still find in most textbooks as a description of natural signals, and in particular, natural images.
And that model leads to our abilities to solve various problems. I'll walk you through in just a second an example, a quick example. In the '90s-- really sort of in the '80s, but it really starts to take off in the '90s-- there are these new observations that are made, which are that if you look at images locally, you find out that their distributions are very not Gaussian. They're very sharply peaked. They have heavy tails. They don't look Gaussian at all.
And so that is the beginning of a new era of thinking about images and their properties, also natural sounds. And we can think of that as the sparse model era. And usually those involve local filters, like the things that you find in multiscale decompositions like the Laplacian pyramid or the wavelet decomposition.
And the models, depending on who you ask and how they like to describe them, they're either heavy-tailed probability distributions, or they're things that involve delta functions and uniform distributions combined. And the way that pans out, when you actually think about what's going on in the space, is that you transform your image in the high-dimensional space into-- basically you can do it with an orthogonal transform. So it's just a rotation of the space.
And in that new space, most of the data lie along small groups of axes. That's the sparsity property. And what that turns into geometrically is that the model is a union of subspaces that lie along subsets of the axes.
So if you think in three dimensions, of three axes, and you take the three planes, the horizontal plane and this vertical plane and this vertical plane, you can think of most of the data being concentrated around those planes, not living in the ambient space, but concentrated around those planes. And that takes off, then, as a methodology and a conceptualization of how to think about these signals, how to process them, how to solve problems, how to solve inverse problems or inference problems, involving measurements of those signals. So that's the 90s.
In the 2000s, a whole bunch of people, including my group, but many, many groups, worked on trying to evolve those a little bit to say, well, actually those things that describe these distributions as sparse or heavy tailed, they're just describing marginal distributions. What you really want is to start looking at the joint action of these things, the interactions of these things in neighborhoods. And once you do that, you realize they're not independent, and you do actually need to start processing and describing what's going on, at least in little groups of them.
And that leads to a whole set of models that I don't know what to call them. But there's joint sparse models or adaptive sparse models that know how to find clusters or clumps of stuff that belong together. So that's the early 2000s. And before I jump to the modern era and my nightmare with deep nets, I'll say a few things about these models.
So how do we test them? So the purpose of learning these priors is not just because I want a probability distribution. I kind of like them. It's not because I want to generate pretty pictures, although that's what people often do with them. In the brain, the purpose is presumably because you want to be able to fill in missing information.
Or if you gather information under bad conditions-- it's a rainy stormy night. You're looking through your car windshield. The windshield wipers are going like this. And you're trying to figure out what's out there. You're using an enormous amount of knowledge about how the world works and how images work, in combination with these measurements that you're making that are really crummy. And you're trying to fill in all the missing bits so that you can make reasonable decisions.
There's more basic things. Like, you have a blind spot in each retina, where there are no photoreceptors. So the usual story that's told is that you somehow fill that information in. You have a sense that you know what's there, even though you are not making any measurements in that region of the image. How do you do that? Well, you have priors. You have an understanding of how images are supposed to behave.
So how do we test them? So the simplest inverse problem we can solve is probably the denoising problem. And if you set it up using a prior, it looks kind of like this. It's a problem in Bayesian inference, or Bayesian estimation. So over here is an original image. I'll just use x for the original signal. Here's y, the noisy observation. And here's the denoised one that's actually a denoised one that came from a deep net. So it's pretty impressive, you can probably see already.
And how do you formulate this problem? It's pretty straightforward. I know most of you have probably seen this. If you haven't, then just try to catch the concept. I'm only going to show a tiny bit of math in this talk. So you won't get buried in equations. It's just a few examples of equations just to illustrate ideas.
So this is a least squares-- a formulation of a least squares, or a minimum mean squared error estimator for the image. And we write that as x-hat of y. It's the thing that minimizes here the squared error between our estimate and the true thing. And we do that by computing the expectation. It's the expected squared error.
We compute the expectation over the conditional distribution of the true thing given the noisy thing. That's the posterior distribution of the true image. And so this is our loss function, the squared error. And it turns out that when you want to minimize the squared error, you can calculate this directly. And the thing that minimizes the squared error is just the conditional mean.
It's kind of like the thing that minimizes squared variability. If you're taking the squared variability around some point, the best thing to choose is the mean. You want to pick the mean.
So the mean is written this way. That's the expected value of x given y. And we can write that out using Bayes' rule like this. So what was the point of all of that? The point of that is that there are three ingredients. There's the loss function, the squared error. There's the measurement model, which is p of y given x. That's the noise distribution. And then there's the prior.
So the three things come together. And we turn the crank, and we get our answer. And it looks like we have to calculate this big integral. So these ingredients will come back, which is why I'm going through them at this point.
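The slide equations are not reproduced in the transcript. One standard way to write what is described, with x the clean image, y the noisy observation, and x-hat of y the estimate (the exact notation is an assumption), is:

```latex
\begin{aligned}
\hat{x}(y) &= \arg\min_{\hat{x}} \; \mathbb{E}\!\left[ \,\|\hat{x} - x\|^2 \;\middle|\; y \right] \\
           &= \mathbb{E}[\,x \mid y\,] = \int x \, p(x \mid y) \, dx \\
           &= \frac{\int x \, p(y \mid x) \, p(x) \, dx}{\int p(y \mid x) \, p(x) \, dx}
\end{aligned}
```

The three ingredients appear explicitly: the squared-error loss in the first line, the measurement model p(y | x), and the prior p(x) in the last.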
So what happens in the model from the 50s, the Gaussian spectral model? Well, that's pretty easy. What you do is you transform into the Fourier space. You do a Fourier transform using these big sinusoidal basis functions. And then in that space, you multiply each one of these frequency responses by a constant. The constant depends on the signal-to-noise ratio.
The low frequencies have better signal-to-noise ratio. So those you basically keep. The high frequencies you throw away because they have lousy signal-to-noise ratio. You do a low-pass filter. You throw away the high frequencies, keep the low frequencies. Roughly speaking, you just kind of chop and keep some stuff.
So interestingly, this is a projection into a subspace. You're taking the full space, and you're smashing it into the low frequencies, throwing away the high frequencies. And it doesn't work that well. But it is the classic thing, and it is the thing in the textbooks.
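The per-frequency scaling described here is commonly called the Wiener filter. Below is a minimal NumPy sketch of it. The 1/f^2 power spectrum, the image size, and the noise variance are illustrative assumptions, and the "image" is synthetic Gaussian noise shaped to have that spectrum, not a photograph.

```python
import numpy as np

def wiener_denoise(noisy, signal_power, noise_var):
    """Scale each Fourier coefficient by S / (S + N), then transform back."""
    gain = signal_power / (signal_power + noise_var)  # near 1 at high SNR, near 0 at low SNR
    return np.real(np.fft.ifft2(np.fft.fft2(noisy) * gain))

rng = np.random.default_rng(0)
n = 64
fy = np.fft.fftfreq(n)[:, None]
fx = np.fft.fftfreq(n)[None, :]
freq = np.sqrt(fx**2 + fy**2)
freq[0, 0] = 1.0 / n                      # avoid dividing by zero at the DC term

# Assumed spectral prior: power falls as 1/f^2 (in units of per-pixel variance).
signal_power = 1.0 / freq**2
noise_var = 100.0

# Synthesize a Gaussian "image" with that spectrum by shaping white noise.
white = rng.standard_normal((n, n))
clean = np.real(np.fft.ifft2(np.sqrt(signal_power) * np.fft.fft2(white)))
noisy = clean + np.sqrt(noise_var) * rng.standard_normal((n, n))

denoised = wiener_denoise(noisy, signal_power, noise_var)
print("noisy MSE:   ", np.mean((noisy - clean) ** 2))
print("denoised MSE:", np.mean((denoised - clean) ** 2))
```

The gain is a smooth roll-off rather than a hard cutoff, but the effect is the one described: low frequencies pass nearly untouched and high frequencies are almost entirely suppressed.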
What happens in the era of the wavelets? I'm having trouble seeing my own slides. Sorry. I'm leaning out here with the angle. What happens in the era of the sparse wavelet priors? Well, now what you do is you do a different transform. Think of it as local Fourier transform.
These are the wavelet decompositions, little oriented filters. And in each of those, you notice that the coefficients have this sparse behavior. And what you do is you say, if they're small, throw them away. But if they're large, I'm going to keep them.
So now the function is not linear. It's something maybe like this. And what the function is doing is saying, keep only the large amplitude coefficients. Throw away the small ones. So interestingly, again, this is a projection onto a subspace.
But now it's an adaptive projection. You're not saying, always go onto the low-frequency space, throw away the high frequencies. You're saying, well, I'm going to be selective. I'm going to go through and look at all the coefficients. I'm going to pick the ones that are below some threshold. I'm going to throw those away. Those are getting projected out.
But all the ones that are clearing the threshold I'm going to keep them. So this is still projection onto a subspace. But now the subspace depends on the image. So that's an abstract concept for what's going on, but it helps later with the intuition, so.
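The keep-large, drop-small rule can be sketched in a few lines of NumPy. This is a simplified illustration, not the method from the slides: it uses a 1D piecewise-constant signal and an orthonormal Haar transform in place of oriented 2D wavelet filters, and the threshold of three noise standard deviations is an assumed choice.

```python
import numpy as np

def haar_forward(x, levels):
    """Orthonormal Haar transform: returns coarse coefficients and a list of detail bands."""
    details = []
    approx = np.asarray(x, dtype=float)
    for _ in range(levels):
        even, odd = approx[0::2], approx[1::2]
        details.append((even - odd) / np.sqrt(2))   # local differences (sparse for smooth signals)
        approx = (even + odd) / np.sqrt(2)          # local averages, passed to the next level
    return approx, details

def haar_inverse(approx, details):
    """Invert haar_forward, starting from the coarsest level."""
    for d in reversed(details):
        out = np.empty(2 * approx.size)
        out[0::2] = (approx + d) / np.sqrt(2)
        out[1::2] = (approx - d) / np.sqrt(2)
        approx = out
    return approx

def hard_threshold_denoise(noisy, threshold, levels=4):
    """Zero every detail coefficient whose magnitude is below the threshold."""
    approx, details = haar_forward(noisy, levels)
    kept = [np.where(np.abs(d) > threshold, d, 0.0) for d in details]
    return haar_inverse(approx, kept)

rng = np.random.default_rng(0)
clean = np.concatenate([np.zeros(50), 4 * np.ones(80), np.ones(70), 5 * np.ones(56)])
sigma = 0.5
noisy = clean + sigma * rng.standard_normal(clean.size)
denoised = hard_threshold_denoise(noisy, threshold=3 * sigma)
print("noisy MSE:   ", np.mean((noisy - clean) ** 2))
print("denoised MSE:", np.mean((denoised - clean) ** 2))
```

Which coefficients survive depends on where the edges fall in this particular signal, which is the sense in which the subspace is chosen adaptively.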
And then Martin Wainwright and I came up with this joint Gaussian scale mixture model. This is about the joint statistics of the wavelets. And again, you can look at these as-- the idea is basically that the dependence that one of these things-- the responses depend on its neighbors. And the dependencies can be captured by just thinking about what the local amplitude looks like.
So again, it's amplitude that's going to drive the decision. And again, we're going to do something like projecting onto a low-dimensional space. And now it's joint. So this is a picture of the actual calculation that comes out of using this for denoising that says that if I have a noisy child coefficient, and I also have its noisy parent, I should do something like throw away the small coefficients, but only if the parent is also small, right? So now there's a dependency on-- and of course, this generalizes to a whole neighborhood, as I showed you in that little cartoon.
So what's the point of all of this? The point of this is that we had a long history of building methodologies for doing denoising based on priors. And in the end, they all come down to projections onto subspaces. They all come down to taking the data and smashing it onto some surface, which is an appropriate surface, where the signal likes to live-- or is most likely to live maybe is the right way to say it-- and kind of clearing out the stuff that's eh. It's buried in noise. It's too small. Or maybe it's not likely to be there. In any case.
And so all of these methods are doing things like that. So now along come the deep nets. And they change everything. And we still don't know how they work, not really. But they did change everything.
And it's inescapable, as I was saying in the beginning of my talk, right? I succumbed to this four years ago. I've been avoiding it, trying to pretend it's people just screwing around. It's not really going anywhere. It's another one of these waves. Like a lot of the things that show up in NeurIPS, it'll be gone in three years. And then we can ignore it safely. But that didn't happen.
So it's still here. And it seems to be getting bigger and better. So the truth is that it is amazing, actually, what you can do with these simple networks. So this is a really dumb one here. Sorry, that's not meant to be offensive to the authors. They did the simplest thing they could do, which is just to stack up a bunch of convolutions, 64 channels per layer, 20 layers. I don't know how these numbers are arrived at, but this is what they did-- 3-by-3 filters, batch normalization, rectifiers.
But it's just rectifiers. It's convolutions and rectifiers, 20 stages of it. Train the thing on denoising. How does that work? You get a big database of clean images. You add some noise to them, and then you train the thing. Basically, you train it using nonlinear regression to remove the noise. That's it.
And it has 700,000 parameters. That's all those filters that you have to learn. And you train it on a big data set, the Berkeley data set in this case. And the performance is stunning. That's actually an example image.
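As a rough check on that figure, the convolution weights of such a stack can be counted directly. The sketch below assumes a single-channel (grayscale) input and output and ignores biases and batch-normalization parameters; none of those details are stated in the talk.

```python
def conv_params(in_channels: int, out_channels: int, kernel: int = 3) -> int:
    """Weights in one convolution layer, without biases."""
    return in_channels * out_channels * kernel * kernel

layers, width = 20, 64
first = conv_params(1, width)                        # grayscale image -> 64 channels (assumed)
middle = (layers - 2) * conv_params(width, width)    # 18 layers of 64 -> 64 channels
last = conv_params(width, 1)                         # 64 channels -> grayscale image (assumed)
total = first + middle + last
print(total)  # 664704
```

Under these assumptions the count lands at about 665,000, the same order as the roughly 700,000 quoted.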
I'm not here to talk so much about the performance. But it is stunning. It's way better than any of those old things-- way better. And so we set about trying to figure out, how does this work? Like, how is this even possible? And nicely enough, after a bit of effort and a lot of missteps, we realized a couple of things.
First of all, we found a way to take that network and make it universal. So we stripped out a bunch of things and simplified it even more. And it turned out that you could just apply it. You could train it on very small amounts of noise, and it would generalize to any amount of noise. That was a bit of a shock.
So training it on basically almost invisible noise allowed it to handle even huge amounts of noise and without losing any performance. And the other advantage to doing that, to setting it up the way we did, is that it became more accessible in terms of understanding and analysis. So it turns out that when you analyze what this thing is doing in different regions of the space-- maybe it shouldn't have been a surprise-- it's projecting onto subspaces, low-dimensional subspaces. And it's extremely adaptive.
So unlike the wavelet decomposition, where the projection was onto these subspaces that are aligned with the axes-- I told you imagine the 3D case where you have the three planes-- this thing knows how to find very complex structure, notice it in the noise, find the appropriate space, and project onto it, wherever it is. The spaces all go through the origin. So if you imagine little slivers of planes that you're projecting onto, and they're all connected up because the whole thing is continuous. It's a cone. It's a generalized cone.
The underlying geometry of the structure of this can be approximately described as something that looks kind of like that, like a flower almost. It's got undulations. It's rapidly varying. And it's very, very adaptive. So it's not something that we knew how to write down or how to express or how to learn by hand, in any way. But this is what this network seems to arrive at.
So that's great. And now I want to do something with it. The thing knows how to denoise. We know how to train that with regression. Great. But now we're in this-- pardon my language, but we're in an ass-backwards situation. We started with the denoiser. We were supposed to learn a prior and then go use it to solve different kinds of problems that we wanted to solve, to solve inverse problems, to solve inference problems. And now somehow we ended up on the wrong end of this.
We solved with regression to get a denoiser. It does something spectacular, but we don't know how. And we didn't want a denoiser.
Who wants to solve denoising, especially Gaussian denoising? It's a boring problem. Nobody cares. Except this thing knows what an image is supposed to look like. So the question is, how do you take that and yank out of it the implicit prior and put it to use in other problems, in lots of other problems? And that's what I'm going to tell you about.
So that's not a new idea. The idea of taking a denoiser and using it to solve other problems is not a new idea. And if you go back in the literature, you can find these things called plug-and-play methods that try to do an approximate version of this inside of an iterative scheme. There's these denoising score-matching methods that were kind of independently developed in the machine learning community. And you can see some names that look like machine learning names.
And so those are developed in parallel, often not aware of each other, certainly not citing each other. So two parallel literatures developing side by side. This is kind of the image processing and IEEE community. And this is the machine learning community. And there they are. They're both happily ignoring each other. And then along comes--
AUDIENCE: As usual.
EERO SIMONCELLI: As usual. Well, often. So along comes Jascha Sohl-Dickstein, who has this kind of strange but interesting paper in 2015 about how you can think about diffusion models and maybe thinking about trying to reverse that. He doesn't have any examples. He just suggests the idea.
And that takes off. People start noticing that. And so when we started our work, which was in 2019, we started working on this, and trying to put this together, we noticed there were a couple of papers that were really interesting for us. And we tried to piece together what was going on there with what we had already learned about this denoiser.
And so the story that we arrived at-- I'm going to actually run through the derivation. So I'm going to drag you through the math because it's so simple. And it's kind of shocking, I think.
If you take that mean squared error denoiser, the thing that minimizes the squared error, it computes the conditional mean, the expected value of x given y. There's the expression. And it turns out-- and so what is that thing? Well, if you've got Gaussian noise, the distribution of the noisy, or the noisy data, p of y, is just a blurred version of the prior.
So here's the prior p of x. This is an integral. It's integrating against a Gaussian. So this is a Gaussian blur of the prior. It's just a fuzzy version of the prior.
So now we can do just a little bit of math. It's basically algebra. By taking the gradient of this and then dividing it by itself and multiplying by sigma squared, we get this funny looking expression, which is that this thing, which you can calculate off of this blurred prior, is the estimator minus the noisy observation. Or rewriting, it looks like this. So the estimator, which, remember, was supposed to look like an integral over this gigantic space, can be rewritten as something which is a step, a gradient step.
This is strange. That's an integral. This is a derivative. What? And this thing is not an approximation. This is exact. This thing is exactly equal to that thing. And this thing is much more convenient, actually.
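The slide equations are not in the transcript. The steps described (take the gradient, divide by the density itself, multiply by sigma squared) can be written out as follows, assuming additive Gaussian noise of variance sigma squared in N dimensions:

```latex
\begin{aligned}
p(y) &= \int g(y - x)\, p(x)\, dx,
  \qquad g(z) = (2\pi\sigma^2)^{-N/2} \exp\!\left(-\frac{\|z\|^2}{2\sigma^2}\right) \\
\nabla_y\, p(y) &= \int \frac{x - y}{\sigma^2}\; g(y - x)\, p(x)\, dx \\
\sigma^2\, \frac{\nabla_y\, p(y)}{p(y)} &= \int (x - y)\, p(x \mid y)\, dx \;=\; \hat{x}(y) - y \\
\hat{x}(y) &= y + \sigma^2\, \nabla_y \log p(y)
\end{aligned}
```

The third line uses Bayes' rule, p(x | y) = g(y - x) p(x) / p(y), and the last line rewrites the ratio as the gradient of the log of the blurred density.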
So where does this come from? So when my student discovered this, I told him he was crazy. He wrote it on the board. This is something we did in 2010. He wrote it on the board. I said, it can't be right.
He said, it is right. He rederived it. He was a math student so I should have believed him. But I didn't. And he wrote it again. And I said, all right, this is really strange. And we went over this again and again and found different ways to write it, and eventually generalized it to a large class of measurement models not just Gaussian noise. But the Gaussian noise one is particularly interesting.
And then it turns out that this is known in some very old statistics literature from the '60s, known as the empirical Bayes approach basically to describing a number of different problems in statistics. So Miyasawa is the main reference that we usually cite. This is a guy, who in 1961, derives this, and just derives it and says, look at this.
What he didn't realize is that this would be important for machine learning. And there are related things that came from Stein, Charles Stein, and other statisticians that are also, I think, highly relevant to the current machine learning revolution, but are largely forgotten. These are in dusty, old journals in the library. They're often not scanned, not in electronic form.
And they're a peculiar little corner of the statistics world. So anyway, we discovered them and tried to resurrect them. But we're not statisticians, so nobody paid any attention. So the bottom line is, this is a very strange rewriting of this in terms of gradient.
And I already said some of these things. But this is not an approximation. It's exactly equivalent. It looks like gradient descent, but this is not iterative. This is one-- you take one step, and you're there. That's the answer.
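The exactness claim is easy to check numerically in one dimension. The sketch below assumes a two-component Gaussian mixture prior (the weights, means, and variances are arbitrary illustrative choices) and compares the posterior mean computed by brute-force integration with the single gradient step on the log of the blurred prior.

```python
import numpy as np

def gauss(z, var):
    """Zero-mean Gaussian density with variance var, evaluated at z."""
    return np.exp(-z**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# Assumed toy prior: mixture of two Gaussians.
weights = np.array([0.3, 0.7])
means = np.array([-2.0, 2.0])
prior_var = 0.25
noise_var = 1.0

def posterior_mean_by_integration(y, grid=np.linspace(-10, 10, 20001)):
    """E[x | y] as a ratio of numerical integrals over x."""
    prior = sum(w * gauss(grid - m, prior_var) for w, m in zip(weights, means))
    posterior = prior * gauss(y - grid, noise_var)   # unnormalized p(x | y)
    return np.sum(grid * posterior) / np.sum(posterior)

def posterior_mean_by_gradient_step(y):
    """y + sigma^2 * d/dy log p(y), where p(y) is the prior blurred by the noise."""
    var = prior_var + noise_var                      # each blurred component is still Gaussian
    components = weights * gauss(y - means, var)
    p_y = components.sum()
    grad_p_y = np.sum(components * (means - y) / var)
    return y + noise_var * grad_p_y / p_y

for y in (-3.0, 0.0, 1.5):
    print(y, posterior_mean_by_integration(y), posterior_mean_by_gradient_step(y))
```

The two columns agree to within the accuracy of the numerical integration, with no iteration involved.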
And the last thing is that the prior-- the answer to my question, how am I going to get the prior out of that network is this last point. The prior's sort of in there. Now, instead of being embedded in this giant integral-- there's p of x-- It's kind of in here. This is a blurred version of the prior. And it's in there in that network somehow. That network knows how to compute this, right?
I've trained it, and it's really good at computing this. So now we want to put it to work. So here's a picture just to illustrate what's going on. Imagine that your prior is that you have a signal that lies on some curvy manifold in a two-dimensional space. That's the green thing. So you're drawing points randomly from that curvy thing.
And your job is to figure out, well, if I make a noisy observation, how would I push it back onto that curvy thing? If that was a plane or a line, it would be easy. You would do a projection. But it's not. It's this weird curvy thing.
So this expression, which is the least squares estimator, which you can train on data drawn from that green curve, does this. So each of these red points are samples of noisy things. The line shows you what happens when you compute that. And the fuzzy gray thing underneath is the blurred prior.
So if you take that green thing, which lies only on that little curved thing, and you blur it, you get the gray thing. So here, with a lot of noise, this is what it would look like in an image. It's very blurred. And the steps you're taking are up the gradient of that very blurred thing.
Here it's less blurred with less noise. Here it's less blurred with less noise. So each of these, the steps are smaller. And you get closer and closer to going back to the green manifold.
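A small numerical check of this picture, as a sketch under stated assumptions: the "green curve" is replaced by a hypothetical set of points along a sine wave, so the prior is a uniform mixture of point masses and the least squares denoiser has a closed form. No network is involved; the function names are illustrative only. The code compares the denoiser's step with sigma squared times a finite-difference gradient of the log of the blurred prior.

```python
import numpy as np

def mmse_denoise(y, points, sigma):
    """Exact least squares estimate when the prior is uniform over `points`."""
    d2 = np.sum((points - y) ** 2, axis=1)
    w = np.exp(-(d2 - d2.min()) / (2 * sigma**2))  # posterior weights (stabilized)
    w /= w.sum()
    return w @ points

def log_blurred_prior(y, points, sigma):
    """log p_sigma(y), dropping the Gaussian normalizer (it does not depend on y)."""
    a = -np.sum((points - y) ** 2, axis=1) / (2 * sigma**2)
    return a.max() + np.log(np.mean(np.exp(a - a.max())))

def numerical_gradient(f, y, eps=1e-5):
    """Central finite differences, one coordinate at a time."""
    grad = np.zeros_like(y)
    for i in range(len(y)):
        step = np.zeros_like(y)
        step[i] = eps
        grad[i] = (f(y + step) - f(y - step)) / (2 * eps)
    return grad

# Hypothetical curvy manifold in 2-D, standing in for the green curve.
t = np.linspace(0.0, 2.0 * np.pi, 400)
curve = np.stack([t, np.sin(2.0 * t)], axis=1)

y = np.array([2.0, 1.5])   # a noisy observation off the curve
sigma = 0.5                # assumed noise level

residual = mmse_denoise(y, curve, sigma) - y
score = numerical_gradient(lambda v: log_blurred_prior(v, curve, sigma), y)
```

With a larger `sigma` the weights spread over more of the curve and the step points toward a broad average; with a smaller `sigma` the step is shorter and lands nearer the curve, which is the behavior described for the three panels.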
So in this case, these little-- you can't really see it. The lighting is a little too high. But these things are almost landing on that green manifold. Whereas here, they're landing all over the place. You can see a bunch here that are landing somewhere in the middle. Thank you. Speak, and it happens. This is pretty good. What else can I get you guys to do back there?
So you see these things are not even close to the manifold in many cases. Whereas these are almost landing on it almost all the time, or very close. So we can use that idea to do basically coarse-to-fine gradient ascent. That's the idea.
Start out far from the manifold, very blurry, and start taking some steps. And as you get closer-- I didn't tell you this-- but the denoiser that we built is universal and blind. It doesn't know how much noise has been added to the image. You just hand it the image, and it'll clean it up.
So you can hand it a very noisy image. It works. You hand it a slightly noisy image. It works. It just knows. It doesn't have any auxiliary information, no side information, no lines. It's not multiple networks. It's one network. It does the whole thing.
So that thing knows how to figure out how close it is to the manifold. The network knows where am I, which way do I want to go, and how far do I need to go to get to that manifold. And so you can just allow it to control step sizes and to work its way along some path until it converges. And shockingly-- yeah?
AUDIENCE: Does the sample size, the training sample size affect--
EERO SIMONCELLI: Yes.
AUDIENCE: --anything about what you just said?
EERO SIMONCELLI: Yes. So the training-- you need a pretty big sample set, but not as big as you thought because, well, we used to think the curse of dimensionality meant that it would have to be absolutely enormous. But you could train these things on pretty decent-sized images, and it works. So I'm going to say more about that in a little bit. But for now, pretty big data set, but something doable in today's world. Maybe it wasn't doable 25 years ago.
So this is the algorithm. Sorry for the algorithmic layout. This is apparently what everybody expects these days. But basically, this thing is the denoiser minus the original, the noisy image. So this is something we can compute from the network. And all we're going to do is take little steps there, right here. Each y gets-- you start out with your initial noisy, very just basically a sample of noise, and you take these little tiny steps that are fractions of this thing until you converge. That's it.
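The loop just described can be sketched in a few lines. Everything here is a toy under stated assumptions: `toy_denoiser` is a hypothetical stand-in for the trained blind denoiser (it simply snaps a point onto the unit circle, which plays the role of the manifold), and the step fraction `h` and the stopping tolerance are illustrative choices, not values from the talk.

```python
import numpy as np

def toy_denoiser(y):
    """Stand-in for a trained blind denoiser: maps y onto the unit circle."""
    return y / np.linalg.norm(y)

def ascend_by_partial_steps(y0, denoiser, h=0.05, tol=1e-6, max_iters=10_000):
    """Repeatedly move a fraction h of the way along the denoiser residual."""
    y = np.array(y0, dtype=float)
    steps_taken = 0
    for _ in range(max_iters):
        d = denoiser(y) - y              # denoiser output minus its input
        if np.linalg.norm(d) < tol:      # residual has shrunk: we are on the manifold
            break
        y = y + h * d                    # take only a small fraction of the step
        steps_taken += 1
    return y, steps_taken

start = np.array([3.0, -2.0])            # arbitrary starting point
final, steps_taken = ascend_by_partial_steps(start, toy_denoiser)
```

Note that no step-size schedule is supplied: the residual itself shrinks as the point approaches the circle, so the steps get smaller on their own.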
And so just to illustrate how it looks, here's a video. That's a trajectory. As we get closer and closer, the network is estimating that there's less and less noise. And so that blurred thing, it's getting less and less blurred until you land on the underlying manifold.
And so here's a picture of a whole bunch of them. The video is just showing you individual trajectories. But here's a whole bunch of them. You can see they're curved. They always land on the manifold. They land all over the manifold, even in these little crevices. They managed to work their way into little crevices and take little turns to land right on this thing.
AUDIENCE: Can you tell us a little bit about how you made those images?
EERO SIMONCELLI: Oh, it was really fun. I'll send you the Matlab code.
AUDIENCE: OK. [LAUGHS]
EERO SIMONCELLI: So it works. So these are fairly small images. These are, I think, 50 by 50 that are generated from this process. And this thing is trained on natural images. So it generates things that look like little chunks of natural images. They have features. They have edges. They have sharp things.
Sometimes they have textures. They have corners and junctions. They don't look like noise. They look like they could be snippets out of natural images. If you-- sorry. Yeah?
AUDIENCE: Early on you showed us an equation. You said this is just one step. It's not iterative. And now we're doing--
EERO SIMONCELLI: And now it's iterative.
EERO SIMONCELLI: I know. Yes. Sorry. I slipped that by you. You weren't supposed to notice. Bill was my office mate when I was a graduate student. So he knows everything and notices everything.
So what I said was correct. The original form is a one-shot deal. That is the least squares denoiser. If you're going onto a surface that's curved and you have a noisy version, then the expected value, the mean of the posterior, is not going to lie on the surface. It's going to be the average of a bunch of possibilities that lie along the curve. And it'll be somewhere inside the curve.
So what you want to do is take very small steps, fractions of the recommended step, because you're not trying to do least squares denoising. You're using the least squares denoiser, but really what you want is to use that gradient that it has. So the trick is to say, OK, I train-- this is why everything is, as I said, ass backwards. You started by training the denoiser. It knows how to compute the gradient.
But it's taking these big steps to land right on the least squares solution, which is not what we want. So what you do instead is take little tiny steps in the direction of that gradient. And since the denoiser's universal, you keep recomputing it. And each time you recompute it, it's getting a slightly different direction and a different step size.
So it not only chooses a good path, as I was showing you with those curved trajectories, but it's reducing the step sizes as it gets closer in an appropriate way. So it controls the step sizes for the entire iteration. There's no schedule. There's no magic set of parameters that we're choosing to get to set the step size. It really is a dead-simple algorithm, where it just controls its own steps the whole way through. Does that help? OK.
So if you train it on digits, like the MNIST, then it generates things that look like digits. If you train it on faces, it generates faces. These are tiny faces. I think these are something like 32-by-32 or maybe 40-by-40 faces.
And most of you know-- I work in a lab where we're using small computers and small data sets and just trying to understand how these things work. But out there in the real world, this is happening. So basically what we started looking at in terms of these processes turned into what's now known as diffusion modeling. And people started generating fantastic images, very high quality, and quite a bit larger. So these are all larger than the things I was showing you. And it's completely exploded.
AUDIENCE: Eero, sorry.
EERO SIMONCELLI: Yeah?
AUDIENCE: Just another basic question to piggy back on what Bill said. So I still don't totally get how this actually works. So you've got the denoising network which takes an image in and outputs an image, right?
EERO SIMONCELLI: Yes.
EERO SIMONCELLI: So take the difference--
AUDIENCE: What corresponds to this little step that you're doing?
EERO SIMONCELLI: Got it. So take the difference-- it went by quickly. Take the difference between the output image of the denoiser and the input image of the denoiser, i.e. what did it do? What step did it take? It's those little red lines in my plots, right? That's what I want. That's the gradient.
AUDIENCE: And what do you add that to?
EERO SIMONCELLI: That thing is the gradient. That thing is sigma squared times the gradient of the log of p of y, the blurred prior. That's what that vector is.
AUDIENCE: OK. So that--
EERO SIMONCELLI: And I know how to compute that anywhere in space.
AUDIENCE: So you have one for the image--
EERO SIMONCELLI: Yes.
AUDIENCE: --and one denoised output. You get this one delta. And then you add that to the noise sample here?
EERO SIMONCELLI: Yes. But you add only a little bit of it, the exact amount.
AUDIENCE: OK. So that's one step.
EERO SIMONCELLI: You take 1%.
AUDIENCE: What's the next iteration period for this?
EERO SIMONCELLI: Compute it again. You've just moved a little bit. You're now in a different part of the space. You've made a little step. Change the image a little bit. Compute it again. Run it through denoiser there again.
AUDIENCE: I see, so you--
EERO SIMONCELLI: It gives back a new gradient.
AUDIENCE: That noise sample's what you're adding to the original image to noise it up.
EERO SIMONCELLI: Just as an initialization, start with just a pure noise image. That's a starting point. Start in a random place in space. Take a step, a little step. Recompute. Take another step. Recompute. Take another step. Recompute. Take another step. Each time you have to call the denoiser.
AUDIENCE: Well, when you're recomputing, what's the input on step two to the denoiser?
EERO SIMONCELLI: The output from the previous step. I'm starting with an image. And then I'm adjusting it a little bit. Call the denoiser again. Take another step, this time in the new direction that the denoiser told me to go, do it again. That's it. You can inject-- I didn't say this. I'm skipping past some things.
You can inject noise in this process and create things that have a little bit more entropy. And you can inject the noise in a controlled way so that it doesn't overwhelm the gradient that you've computed so that it's not bigger than the gradient you inject-- your step. And you can also build things that converge using that algorithm. It's a very simple algorithm also, just a little bit of randomness injected, a little stochastic.
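A sketch of the stochastic variant, with loud caveats: the talk does not give the noise-scaling rule, so the formula for `gamma` below is an assumption (it ties the injected noise to the size of the current residual through a parameter `beta`, with `beta = 1` meaning no injected noise). `toy_denoiser` is again a hypothetical stand-in that maps points onto the unit circle, and `h`, `beta`, and `sigma_stop` are illustrative values.

```python
import numpy as np

def toy_denoiser(y):
    """Stand-in for a trained blind denoiser: maps y onto the unit circle."""
    return y / np.linalg.norm(y)

def stochastic_ascent(y0, denoiser, h=0.1, beta=0.5, sigma_stop=1e-4,
                      max_iters=5000, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    y = np.array(y0, dtype=float)
    for _ in range(max_iters):
        d = denoiser(y) - y
        # Noise level implied by the residual (the denoiser is blind, so we infer it).
        sigma = np.linalg.norm(d) / np.sqrt(y.size)
        if sigma <= sigma_stop:
            break
        # Assumed rule: injected noise is proportional to sigma, so it can never
        # overwhelm the step; the effective noise shrinks by (1 - beta*h) per step.
        gamma = np.sqrt((1 - beta * h) ** 2 - (1 - h) ** 2) * sigma
        y = y + h * d + gamma * rng.standard_normal(y.shape)
    return y

sample = stochastic_ascent([3.0, -2.0], toy_denoiser, rng=np.random.default_rng(0))
```

Different seeds land at different places on the circle, which is the extra entropy mentioned above; the deterministic version would always land where the starting ray meets the circle.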
So those are examples. Now what we really want to do is not make pretty pictures. I told you that before. I want to solve inverse problems. So I want to figure out, how do I take this thing and combine it with some measurement constraint and solve for an image that satisfies the measurement constraints and is a good image, a good natural image drawn from this prior. And it turns out that you can modify this algorithm to do that. And the critical piece is right here in the middle.
It's basically saying-- I don't know if most of you can grok this. But this is the denoiser step. And we're multiplying it by something that's projecting out the measurement. And this is the part that the measurement is telling us.
So the measurement is a linear measurement in the cases I'm going to look at. And I take my original image. I make some measurement. I measure some things. I'm going to show you five examples of this. So if you're not getting the abstraction, you'll see some examples that will tell you, at least what the idea is.
So I'm going to make some measurements. I'm going to hold on to those measurements. And now I'm going to enforce the measurements gradually in concert with using the gradient that came from the denoiser in the orthogonal space. So I'm using the denoiser to fix the stuff that I don't know. And I'm using the constraint to fix the stuff that I do know, right? So it's just two pieces partitioned, and I'm combining them.
So how does it look? This is the picture that you have in your head from this little abstract diagram. There's a blue line here, which I guess you can't see. But that's the constraint. So let's say I make a one-dimensional measurement. That tells me that I want an answer that lies on that blue line.
But I also want something that's drawn from this prior, the green wiggle. So I want things that are the intersection of the blue line and the green wiggle. And when I start from random starting points, and I follow this procedure, I land up in these intersections. So that's the basic idea.
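The blue-line picture can be reproduced in a toy setting. This is a sketch under assumptions, not the talk's implementation: the prior is the unit circle (via a hypothetical `toy_denoiser`), the measurement matrix `M` is assumed to have orthonormal columns, and the update splits into the denoiser residual with the measured directions projected out plus a term that pulls the measured coordinates toward their observed values.

```python
import numpy as np

def toy_denoiser(y):
    """Stand-in for a trained blind denoiser: maps y onto the unit circle."""
    return y / np.linalg.norm(y)

def constrained_ascent(y0, denoiser, M, x_c, h=0.1, n_iters=500):
    """M: (N, k) with orthonormal columns; x_c = M.T @ x holds the k measurements."""
    y = np.array(y0, dtype=float)
    for _ in range(n_iters):
        f = denoiser(y) - y
        prior_part = f - M @ (M.T @ f)     # denoiser step, measured directions removed
        data_part = M @ (x_c - M.T @ y)    # enforce the measurement gradually
        y = y + h * (prior_part + data_part)
    return y

M = np.array([[1.0], [0.0]])   # one-dimensional measurement: the first coordinate
x_c = np.array([0.6])          # observed value, i.e. the line x1 = 0.6
solution = constrained_ascent([2.0, 1.5], toy_denoiser, M, x_c)
```

The line x1 = 0.6 meets the circle at two points; which one is reached depends on the starting point, mirroring the statement that random starts end up at the intersections.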
So let's use it for an image processing problem. So here's an example of a linear measurement filling in this hole. So the linear measurement, think of it as a giant matrix with ones on the diagonal for all these pixels that I'm keeping and zeros for all the ones that I'm throwing away, right? So all the middle ones are thrown away.
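As a concrete instance of that "giant matrix", here is a minimal sketch for a tiny image. The 8-by-8 size and the position of the hole are arbitrary choices for illustration.

```python
import numpy as np

height, width = 8, 8
keep = np.ones((height, width), dtype=bool)
keep[2:6, 2:6] = False                      # the hole: middle pixels are thrown away

# Diagonal matrix acting on the flattened image: 1 for kept pixels, 0 otherwise.
A = np.diag(keep.ravel().astype(float))

image = np.random.default_rng(0).random((height, width))
measured = (A @ image.ravel()).reshape(height, width)
```

In practice one would apply the boolean mask directly rather than build the matrix; writing it as a matrix just makes clear that inpainting is a linear measurement.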
And here's three examples. Because this is stochastic, I'm starting from a random starting point. Here's three examples of how it takes that and fills in the missing bits. And here's the one that was trained on digits. If you do this and erase the top of this seven, it'll hand you back these three things, for example, which are obviously different digits. But how would it know if you've erased the top of the seven? So the point is that it does something quite reasonable in matching things up and producing something that fits with the measurement.
Here's a bunch of examples on more complicated images. And you can see that it actually works pretty well in all these cases. At the top is the original. The middle is the missing square. The bottom is what was reconstructed using this algorithm.
And it's worth pointing out this is the same denoising-- all the examples I'm about to show you-- I'm going to give five of these quickly. It's all the same denoising network. We didn't retrain it for each of these applications. It's one denoising network. It was trained once on a big database. We put it aside. That's our engine, our prior engine. And we're just reusing it to solve these different problems, right?
The whole point of the Bayesian paradigm is you get to separate out the ingredients and reuse them as appropriate, right? You don't have to train a new network to solve this problem, a new network to solve that problem, a new one, right? Each problem we're using the same prior, the same denoiser.
So this is another one, dropping pixels randomly. So we only keep 10% of the pixels. We drop 90%, recover the images. It actually works, I think, shockingly well. It manages to get out things like texture, which you would think you would never be able to get out from that sparse sampling. But it manages to figure out where the fur needs to go and even where the whiskers need to go just from this very small amount of information that's in these random samples.
This one Bill will recognize. I won't say anymore. Here are the original images. Here are the low resolution. These are 4 by 4 block averaged. So that's another linear measurement. I'm measuring the average of each block of pixels on non-overlapping blocks. Those are linear measurements. I can solve that problem.
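Block averaging can likewise be written as a matrix with one row per block. The sketch below builds that matrix for a small image and checks it against a direct reshape-and-mean; the sizes and function names are illustrative assumptions.

```python
import numpy as np

def block_average_matrix(height, width, block=4):
    """One row per non-overlapping block; each row averages that block's pixels."""
    rows = []
    for r in range(0, height, block):
        for c in range(0, width, block):
            weights = np.zeros((height, width))
            weights[r:r + block, c:c + block] = 1.0 / block**2
            rows.append(weights.ravel())
    return np.array(rows)

def block_average(image, block=4):
    """Same operation computed directly, without forming the matrix."""
    h, w = image.shape
    return image.reshape(h // block, block, w // block, block).mean(axis=(1, 3))

image = np.random.default_rng(0).random((8, 8))
A = block_average_matrix(8, 8, block=4)
low_res = (A @ image.ravel()).reshape(2, 2)
```

The rows are mutually orthogonal because the blocks do not overlap; multiplying `A` by the block size (here 4) would make them orthonormal, which is convenient if the measurement is to be used as a projection.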
I go ahead and I solve it. These are two other things that are published that use deep nets actually, the Ulyanov paper from 2020 and the Matias paper from 2019. And I think you can see-- I hope you can see, even with the current lighting, that this is much sharper and much crisper as a representation of these images. And in fact, even in this case, which I always like, where you see really heavy aliasing-- so you see the sort of-- these are thin, diagonal lines that end up with these jagged aliasing structures that show up when you block average them.
These two methods both end up replicating the jaggedness. And this actually does a pretty good job of giving you back something that's reasonably continuous and straight. So to me this is shocking how well the prior works in this context. Again, this is the same-- apart from-- sorry. We have two denoisers here. One of them was trained on color images, and one was trained on black and white and grayscale images. So I slightly cheated when I told you it's the same denoiser.
It's two denoisers. But they're trained the same way. And so all the color examples are coming from the color one. And the grayscale ones are coming from the grayscale one.
So this is the last one. This is compressive sensing. You project on to random basis functions, again, 10% the number of the pixels. This is a published result that uses a deep net that's trained on a particular set of random measurements. Ours works on any random measurements. It doesn't need any extra training. You just fire away and do it. And I think, again, the results are, I think, impressive.
Somebody told me I need to have tables of numbers. So we made some tables of numbers. It actually comes out really well in the tables of numbers. It's both fast and high quality.
I mentioned this earlier. If you run just the straight version on the spatial super resolution, it's slightly worse than some of these other algorithms. But if you then average over 10 samples, it's better again because the mean squared error-- if you want to minimize mean squared error, you want the average of the posterior. And so ours is more or less giving you samples of the posterior.
If you average over them, you'll make the mean squared error better. But the results will be worse. They will be visually worse. They're a little bit blurred. So it runs really fast. Sorry.
AUDIENCE: I'm confused again.
EERO SIMONCELLI: You're confused?
AUDIENCE: Well you took these small steps to avoid taking the mean--
EERO SIMONCELLI: I know.
AUDIENCE: --of your posterior. Now you're taking the mean of your posterior.
EERO SIMONCELLI: Only to give good numbers for the table. I'm serious. Ted's smiling because he hounded me mercilessly when I was working on my master's thesis. Because every time we were doing denoising, Ted would always say they were too blurry. And I would say, but those are the optimal answers. I can't do anything about it. And Ted would say, well, you should sharpen it a little. I said, you can't do that! That's the optimal solution. I can't touch it.
And the answer, of course, Ted was right. It only took me 30 years to realize it. The answer is-- it's what I said, right? So if you want something that looks like a natural image, you want something from that manifold, that cone-shaped manifold. And so you really want to get a sample from the manifold, even if it's going to be on average a little bit worse in terms of squared error.
If you wanted to minimize squared error, if that's your goal, then you need to average over those. And you'll get something that's a little bit blurry.
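A toy illustration of this trade-off, under assumptions chosen purely for clarity: plausible "images" are points on the unit circle, and both the true signal and the samples are draws from the same hypothetical posterior (angles scattered around zero).

```python
import numpy as np

rng = np.random.default_rng(0)

def points_on_circle(n, spread=0.6):
    """Draw n points on the unit circle with angles scattered around zero."""
    angles = rng.normal(0.0, spread, size=n)
    return np.stack([np.cos(angles), np.sin(angles)], axis=1)

truths = points_on_circle(2000)    # possible true signals
samples = points_on_circle(200)    # samples a posterior sampler might return
posterior_mean = samples.mean(axis=0)

# Squared error when reporting the average of the samples.
mse_of_mean = np.mean(np.sum((truths - posterior_mean) ** 2, axis=1))
# Squared error when reporting a single sample, averaged over all samples.
diffs = truths[:, None, :] - samples[None, :, :]
mse_of_samples = np.mean(np.sum(diffs ** 2, axis=2))

radius_of_mean = np.linalg.norm(posterior_mean)
```

The average always has the lower squared error, but its radius is below one: it sits inside the circle, off the manifold. That is the toy analogue of a blurry image with a better score in the table.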
AUDIENCE: And so why do you bother doing the diffusion thing? Why not do it doing one step, then?
EERO SIMONCELLI: You can't. How do you project onto a curved surface that you haven't even defined? See, that's the thing. It's ass backwards, I told you. You're starting with a denoiser. It's designed to go to the mean.
That sounds like not a useful thing, except that you rewrite it, and you say, oh, it's going to the mean by taking a step in the direction of this gradient. I want the gradient. I don't want the mean. I want the gradient. And that's what we're doing. We're grabbing that gradient and taking these little steps.
It's actually-- each piece of this is simple, but it's kind of hard to assemble the whole sequence and make sense of it because it's so weird. The formula was weird. So there's two more tables of numbers. Ours works really well. Is that what you're supposed to say with tables of numbers, works really well?
You can do some nonlinear things. And we've been looking into examples of this. And there are more in process. But this is quantization. So if I do a three-bit quantization, so I take this thing and quantize it to eight levels of gray, you can see lots of contouring and funny looking artifacts.
You can recover from this. It's not a linear measurement. But it's a measurement-- it's not hard to combine a gradient with these constraints that are about things that live in boxes. That's what quantization is. You just alternate between projecting into the box and taking a gradient step. And you can get results like this that look pretty good. I don't have any tables of numbers for these.
Let me show you two quick biological examples. Well, let me show you one quick biological example because I feel like I'm going to run out of time here. And I want to get to the last little chunk.
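The alternation can be sketched with the same kind of toy as before. Assumptions: the prior is the unit circle (hypothetical `toy_denoiser`), and the quantizer has told us only that each coordinate lies in the interval from 0.5 to 1.0, so the feasible set is a box.

```python
import numpy as np

def toy_denoiser(y):
    """Stand-in for a trained blind denoiser: maps y onto the unit circle."""
    return y / np.linalg.norm(y)

def ascent_in_box(y0, denoiser, lower, upper, h=0.1, n_iters=500):
    y = np.clip(np.array(y0, dtype=float), lower, upper)
    for _ in range(n_iters):
        y = y + h * (denoiser(y) - y)    # partial step along the denoiser residual
        y = np.clip(y, lower, upper)     # project back into the quantization cell
    return y

lower = np.array([0.5, 0.5])   # lower edges of the observed quantization bins
upper = np.array([1.0, 1.0])   # upper edges of the observed quantization bins
recovered = ascent_in_box([3.0, -2.0], toy_denoiser, lower, upper)
```

Projection onto a box is just clipping each coordinate, which is why this nonlinear measurement is easy to combine with the gradient steps.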
So let me show you one biological example. Here we go. So I'll do this one. We wanted to do a really difficult nonlinear inverse problem. And we wanted it to be more biological.
So I have a colleague at Stanford that I work with for many years, I collaborate with, EJ Chichilnisky. He records from retinas in a dish. He records from ganglion cells, the spike trains of those ganglion cells. And he can expose them to light by focusing it through an imaging objective and shining patterns of light on the retina that's been taken out of a macaque monkey.
So this is an example. So that's the chip that he uses. It's got 500 electrodes. He presses the retina into that little central dish against the electrodes.
And after he's done the recordings and analyzed the responses, he can actually separate out the cells into different types. And he gets these fantastic mosaics that cover the back of your eye with their receptive fields. And in particular, here are the major cell types that are in your retinal ganglion cells that form your optic nerve. So these are the things that send the message. Everything you see comes out of the responses of these cells, these and another 16 or so types. These are the four major ones.
So these are the cells that are basically gathering visual information and sending it down your optic nerve. So what we wanted to do is to ask the question, well, if we have this nice prior, can we recover information from the spike trains of these cells? What's happening here? There we go.
So to do that, we have a prior, and we need a likelihood function. We've got spike trains, but we need something that describes the relationship between the image and the spike trains. So for that, we turned back to some old work that Jonathan Pillow did in my lab using something called a GLM, or a recursive LNP model.
So this is a model that we can use to fit those ganglion cells. It's got a receptive field up front, a linear stimulus filter. It's got an exponential nonlinearity. It then generates spikes according to the rate that is specified by the output of that nonlinearity. And then it feeds the spikes back through a filter back into a summing junction that then gets injected back into the cell.
And the idea is that these cells have state. Once they fire a spike, one thing is that they've got a refractory period. So they won't fire another spike for a millisecond or two. But they also have these rebound effects. So once they fire a spike, after a little bit of time, they're a little more likely to fire a spike.
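A minimal single-cell simulation of the model just described, as a sketch: linear stimulus filter, exponential nonlinearity, Poisson spike generation, and a post-spike filter fed back into the summing junction. All filter values, the bias, and the 1 ms bin width are hypothetical numbers chosen for illustration, not fitted parameters.

```python
import numpy as np

def simulate_glm(stimulus, stim_filter, post_spike_filter, bias, dt, rng):
    """Simulate one cell; returns spike counts per bin and the rate in each bin."""
    n_bins = len(stimulus)
    # Linear stage: causal filtering of the stimulus.
    drive = np.convolve(stimulus, stim_filter)[:n_bins]
    spikes = np.zeros(n_bins, dtype=int)
    rate = np.zeros(n_bins)
    for t in range(n_bins):
        # Feedback stage: recent spikes weighted by the post-spike filter.
        k = min(t, len(post_spike_filter))
        recent = spikes[t - k:t][::-1]                 # most recent bin first
        feedback = post_spike_filter[:k] @ recent
        # Exponential nonlinearity gives the instantaneous rate (spikes per second).
        rate[t] = np.exp(bias + drive[t] + feedback)
        spikes[t] = rng.poisson(rate[t] * dt)
    return spikes, rate

rng = np.random.default_rng(0)
stimulus = rng.standard_normal(2000)
stim_filter = np.array([0.5, 0.3, 0.1])
# Strongly negative first tap: refractory. Positive later taps: rebound.
post_spike_filter = np.array([-50.0, 0.5, 0.2])
spikes, rate = simulate_glm(stimulus, stim_filter, post_spike_filter,
                            bias=np.log(50.0), dt=0.001, rng=rng)
```

The negative first tap suppresses firing in the bin right after a spike, and the positive taps that follow make the cell slightly more likely to fire again a little later, matching the refractory and rebound effects described above.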
And you see these-- so you can fit this model. The amazing thing about this model is not only that it explains the data pretty well, but that you can fit it to the data. It's a convex optimization problem to fit this to data. And you can get those filters, the stimulus filter and the post-spike filter, in such a way that they really do capture the details of the spike timing.
So that's going to be our likelihood model. In fact, we'll make it even more complicated. We have one that has cross coupling between the cells. So you've got that big population of cells. It's a couple hundred cells. And we have them connected up to their nearest neighbors using additional filters.
That entire thing can be fit to the data. So we take a data set from the retina. We fit this model to the data set. We now have a description of how images get turned into spikes. And now what we want to do is if I'm at the receiving end of those spikes, I'm in the brain somewhere-- well, the lateral geniculate nucleus, and I'm receiving those spikes, what do I know about the image that landed on the retina?
And so what we're going to do is we're going to reconstruct that image by combining the likelihood, the description of the probability of spikes given the image. We're going to combine that with our prior, which we have implicitly in a denoiser. And we're going to smash those two things together and get an answer.
The cool thing here is that you could also say, well, why don't you just train a deep net to decode the spikes? That would be more direct. Yes, it would be more direct. But you would need a lot of data. You need a lot of spikes and a lot of images to do that. EJ can only hold these things for a day or two. So we're not going to-- and he's got lots of other experiments he wants to run. So we don't have a lot of spikes and a lot of images. We have a relatively small number of those.
We have, in fact, enough to fit this model, which is a relatively simple model. But we don't need to fit the prior to any of this. The prior, we go off and we fit that to a big pile of images from the Berkeley data set or whatever, right? So these things are fit separately, separate components. And we smash them together only when we want to solve the inference problem.
And so these are just to give you an idea of what this could do. These are examples. So on the left is the actual ground truth image that was shown to the retina. And the next column is what you get out of a linear reconstruction. This is what we get out of-- I should have simplified these names, sorry. This is our method.
This is what you get if you replace that GLM model with a simpler linear-- basically linear model, so simplify the likelihood. And this is what you get if you replace the prior, the one that comes from the denoiser with a 1/f spectral model. So the prior matters a lot. The LNP versus GLM model matters not quite as much, but still a fair amount. You can see a lot of detail is lost, for example, in this cricket when you go from here to here.
And you can see that we're recovering-- of course, these images are not-- there's information lost. But we can get a lot more out of them by paying attention to the spike trains and the details of the spike trains as modeled by that model, the GLM model. One last chunk I want to tell you about.
So we're continuing to work on developing inverse problems. Some of you are probably wondering about texture since I've done a lot of work on texture modeling. We're working on thinking about how to incorporate these into texture models. And if I had that to show you, I would love to. But I don't.
So I want to show you one more thing. This is a presentation that was at ICLR last week. And I'm just going to show you the highlights of that. But it's to resolve this issue.
So we went back, and we were trying to think, what's going on? How are these things representing prior information? And we discovered that in order to do things like that with big images, you need really big networks.
So most of these things are using networks with hundreds of millions, or even billions of parameters in order to capture and generate things like that. And it seems that every time you make the images bigger, you have to make the networks that much bigger, right? So they're scaling pretty badly. It's not quite the curse of dimensionality. It's not exponential. But they have to be really huge.
So we started digging into trying to figure out what's going on. So we can generate faces out of a small model like this, but they're small faces. And if we take that small model and we train it to denoise bigger images, the same size model-- and when I say same size, it's the same network and the same architecture, but the critical thing is what's called the receptive field. So each output pixel you can trace back through the gradients, through the architecture of the network, and ask, which pixels in the input are being used to compute that output pixel?
And here it's global. The whole image is being used to compute every output pixel. But here it's local. This pixel came from that box, the content of that box. And when you run this on the synthesis procedure, it doesn't work. You get textures. You get bits of face-like tissue pasted together.
It does not generate faces. So it's lost-- so if you want to be technical about it, this is basically a Markov model. It's capturing local structure in overlapping neighborhoods. And iterating that Markov model to generate samples does not give you things that capture the global structure. So the lesson is that if you want to capture the global structure, you need things with global receptive fields.
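To make the receptive-field notion concrete, here is the standard bookkeeping for a plain stack of convolutions. The layer lists are hypothetical; the talk does not say how many layers its networks have.

```python
def receptive_field(layers):
    """layers: list of (kernel_size, stride) pairs for a stack of convolutions.

    Returns the side length, in input pixels, of the region that can
    influence a single output pixel.
    """
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump   # each layer widens the view by (kernel - 1) steps
        jump *= stride              # striding makes later steps cover more input pixels
    return rf

small_net = [(3, 1)] * 4            # four 3x3 layers, stride 1
deeper_net = [(3, 1)] * 21          # twenty-one 3x3 layers, stride 1
```

For stride-1 layers the receptive field grows only linearly with depth, so covering a whole large image this way requires a very deep (or very wide-kernel) network. That is the scaling problem the multiscale construction is meant to avoid.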
But we don't want to make bigger and bigger models. So what do we do? Let's make the image smaller. And of course, this is going to go back to the same multiscale and wavelet and pyramid style tricks that I learned from Ted when I was a master's student. So here's an image. Make the image smaller.
Of course, when you make it small-- like blur and downsample. And when you do that, you lose a lot of information. So put the information back. In fact, let's make an orthogonal transformation. This W is a matrix. It takes this thing, transforms it to that. And what is that? Well, that's just four channels in a deep net, if you like, four channels where there's a low frequency channel and three others that have the vertical, horizontal, and diagonal information.
All together these are equivalent. In fact, this is an invertible, orthogonal transformation. And so the probability distribution-- anything that you can write for probabilities on this can be re-expressed as a joint probability over these. And of course, now we want to write them conditionally because so far we didn't do anything yet. We're just rewriting using probability and now Bayes' rule. So I'm going to rewrite them as a distribution on this low-pass guy, this low-resolution version of the image, and a conditional distribution of these three conditioned on that.
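One transform with exactly these properties is a single level of the orthonormal Haar wavelet, sketched below. Choosing Haar is an assumption of this note; the talk only says that W is orthogonal and yields a low-frequency channel plus three detail channels. The image is assumed to have even height and width.

```python
import numpy as np

def haar_split(image):
    """One level of an orthonormal 2x2 Haar transform: low-pass plus 3 detail bands."""
    a, b = image[0::2, 0::2], image[0::2, 1::2]
    c, d = image[1::2, 0::2], image[1::2, 1::2]
    low = (a + b + c + d) / 2
    col_diff = (a - b + c - d) / 2    # differences across columns
    row_diff = (a + b - c - d) / 2    # differences across rows
    diagonal = (a - b - c + d) / 2
    return low, (col_diff, row_diff, diagonal)

def haar_merge(low, details):
    """Exact inverse of haar_split."""
    col_diff, row_diff, diagonal = details
    h, w = low.shape
    image = np.empty((2 * h, 2 * w))
    image[0::2, 0::2] = (low + col_diff + row_diff + diagonal) / 2
    image[0::2, 1::2] = (low - col_diff + row_diff - diagonal) / 2
    image[1::2, 0::2] = (low + col_diff - row_diff - diagonal) / 2
    image[1::2, 1::2] = (low - col_diff - row_diff + diagonal) / 2
    return image

image = np.random.default_rng(0).random((8, 8))
low, details = haar_split(image)
rebuilt = haar_merge(low, details)
energy_in = np.sum(image ** 2)
energy_out = np.sum(low ** 2) + sum(np.sum(band ** 2) for band in details)
```

Because the transform is orthogonal, it preserves energy and has unit Jacobian, so a density on the image can be rewritten as a joint density on the four channels without any correction factor.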
And now there's one more piece, which is we're going to assume-- this is where the assumption-- so far, there was no assumption, but now there's an assumption. I'm going to assume that this piece can be made local. That is it does not need to be global. It's a tiny model.
And this one has to be global because it has to capture the global structure of the face. And of course, this may still be kind of a big image. So in order to do this more effectively, we can telescope it and make essentially a pyramid or if you like, a wavelet decomposition of this image. And we can repeat this and we write it out probabilistically.
The overall distribution of this image on the left is just a product over each of these things. It's the little guy at the top. That's global. That's this one right here-- little gal at the top, sorry. And then these four are combined with this conditional local thing to give you this one. And then you do the same thing, do the same thing, and that gives you the whole thing.
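Written out, the factorization reads as follows. The symbols are an assumption of this note, not the slide's notation: x_0 is the full-resolution image, x_j is its low-pass version after j levels, and \bar{x}_j holds the three detail channels at level j, so that (x_j, \bar{x}_j) = W x_{j-1}.

```latex
p(x_{0}) \;=\; p(x_{J}) \prod_{j=1}^{J} p\!\left(\bar{x}_{j} \,\middle|\, x_{j}\right),
\qquad
(x_{j}, \bar{x}_{j}) \;=\; W x_{j-1}.
```

Each step uses only the orthogonality of W, which gives p(x_{j-1}) = p(x_j, \bar{x}_j), followed by the product rule. The one modeling assumption is that each conditional factor can be represented by a local model, while only the small image x_J at the top needs a global one.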
So this is a very simple structure with one assumption, which is that these can be done locally. So we implement the locality of those by making, again, a deep net. But this is a conditional deep net. It takes a little side input of this low-frequency thing with these noisy things, and it figures out how to denoise those. So that's a conditional denoiser.
We put that all together in an architecture like this, which basically does conditional denoising. And at the very top it does global denoising. And it works. And the question then is, well, how local can we make it? The key thing is we're assuming it's local. And this is only going to be a win if we can make it really local.
We don't want global representations because they require really big networks. Even worse, the network has to keep getting bigger as they make the image bigger. So what we want is something where we can keep using local networks, no matter how big the image is. And this works quite well.
So this is denoising results as you make the receptive fields, the size of the network, smaller and smaller. And the top curve is a 43-by-43 receptive field. And as you drop the size of the receptive field, the performance starts to drop. These are all plots of the input noise versus the output noise. You can see they drop, and they drop quite substantially. This is a log plot. This is PSNR.
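For readers unfamiliar with the measure, PSNR is ten times the base-10 log of the squared peak value over the mean squared error. A minimal sketch, assuming pixel values scaled so that the peak is 1.0 (the talk does not state its convention):

```python
import numpy as np

def psnr(clean, estimate, peak=1.0):
    """Peak signal-to-noise ratio in dB; higher means less residual noise."""
    clean = np.asarray(clean, dtype=float)
    estimate = np.asarray(estimate, dtype=float)
    mse = np.mean((clean - estimate) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def noise_std_for_psnr(target_db, peak=1.0):
    """Standard deviation of additive noise that yields the given PSNR."""
    return peak * 10.0 ** (-target_db / 20.0)

clean = np.zeros((16, 16))
noisy = clean + 0.1          # constant error of 0.1, so the MSE is 0.01
example_db = psnr(clean, noisy)
```

Because the scale is logarithmic, a fixed vertical gap between two curves corresponds to a fixed ratio of mean squared errors.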
When you do it multiscale, there's basically no drop. So the performance is completely preserved all the way through this until the very end. And that's how it works out when you look at the results.
So here, just to show that to you visually, here's a very noisy image, 7 dB on the input. And here are the three results that you get from these three different-sized networks if you just apply them in the image domain. And here's what you get if you do the multiscale thing. So it basically is very stable and consistent down to very small sizes of network.
So we can make much smaller networks to handle any size image by doing this conditioning trick. And just to show you that it works-- so now this I showed you before. It didn't work if you tried to use a small network to generate a big image. And now it does. Now we can generate faces with a little tiny network, a sequence of little tiny networks and a little one that captures that.
That's it. That's all I was going to say. I'm sorry I'm slightly over time. Well, I'm actually under time in terms of my clock on my computer. But I'm over time in terms of the wall. So the old method was to build a density model and then use it to solve inference problems. And we did that by basically integrating Bayesian.
The new method is you train a denoiser with tons of data. And now you solve inference problems by doing what I'm calling empirical Bayes ascent. I'm just going to do gradient descent using the denoiser. And so we're working on lots of things to expand and generalize this, including thinking about how this might be relevant in implementation in biological systems that need to learn priors.
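A toy sketch of the idea of climbing the implicit prior with a denoiser follows. It rests on a standard result for least-squares denoising of Gaussian noise: the denoiser residual, D(y) - y, equals the noise variance times the gradient of the log density of the noisy data. Everything else here is an assumption made for illustration. The "trained denoiser" is replaced by the closed-form optimal denoiser for a zero-mean Gaussian prior, the noise variance is held fixed, and the names `gaussian_prior_denoiser` and `ascend` are hypothetical. The stochastic, coarse-to-fine sampling procedure used in the actual work is not reproduced.

```python
import numpy as np

def gaussian_prior_denoiser(y, noise_var, prior_var=1.0):
    """Optimal least-squares denoiser when x ~ N(0, prior_var) per pixel.

    Stand-in for a trained network; only valid for this toy prior.
    """
    return y * prior_var / (prior_var + noise_var)

def ascend(y0, denoiser, noise_var, step=0.5, n_steps=50):
    """Deterministic ascent on log p(y) using only the denoiser.

    The residual denoiser(y) - y is proportional to grad log p(y),
    so stepping along it moves y toward higher probability.
    """
    y = np.array(y0, dtype=float)
    for _ in range(n_steps):
        residual = denoiser(y, noise_var) - y
        y = y + step * residual
    return y
```

With this toy prior the only mode is the zero image, so the iterates shrink toward zero. With a denoiser trained on photographs, the same update would instead move toward images the network treats as more probable.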
I don't have an answer for that yet. But we are thinking about it and working on it. I have one student who's working on that project. And I want to thank all my coauthors that worked on this. A lot of the work that I told you about was done by Zahra Kadkhodaie. Florentin and Stéphane have recently joined in. All the multiscale stuff was done with them. So this is a joint project with Stéphane Mallat. And I didn't show you the stuff that I did with Ling-Qi and David. But I showed you this. Eric Wu and EJ Chichilnisky did the retinal work. Thanks.