L. Mahadevan: Ill Posed Problems
Date Posted:
May 29, 2014
Date Recorded:
May 29, 2014
CBMM Speaker(s):
L. Mahadevan
Brains, Minds and Machines Summer Course 2014
Description:
Topics: Inverse problems are pervasive in physics, geophysics, biology, engineering; definition of direct vs. inverse problem; direct problem may involve linear or nonlinear operator, and there may be noise; definition of well-posed problems (solution exists and is unique and stable); issues of sampling, noise, stability; inverse problems are typically ill-posed and can be regularized, for example, by minimizing both accuracy and smoothness; linear (matrix) inverse problems, SVD, least squares solution, eliminating infeasible solutions, selecting relative weight of accuracy vs. smoothness terms; example: blurring/deblurring using Tikhonov regularization; probabilistic (Bayesian) approach to inverse problems, e.g. MLE and MAP; minimizing least squares error for linear inverse problems is analogous to maximizing the likelihood if noise is Gaussian
LAKSHMINARAYANAN MAHADEVAN: There are also a number of teaching fellows who are helping us, officially and unofficially, so I'd like to introduce them. Andrei Barbu, Leyla Isik, Emily Mackevicius, and Yasmine Meroz, if you can raise your hands and introduce yourselves to everybody over here. So for the students over here who are going to be doing projects, they are your point people for the projects.
In fact, that leads me to the next point to be made, which is that according to the schedule, the description of the projects was actually planned for tomorrow. But because this is a two-week course, and you already heard Bill Reznikoff tell you that the only way to learn science is to do it, we're going to get started immediately.
And therefore, this afternoon at 2:30, right here, all the students please come back. And the TFs will essentially help you get acquainted with the projects. I hope everybody got the email that we sent out a few days ago. The goal of that description is, of course, for you to essentially self organize so that you can get some progress and then report to everybody else in about two weeks, in fact. And so please, of course, come here this afternoon.
Also, I think there are people from very different backgrounds who have come to this school-- mathematicians, computer scientists, cognitive scientists, psychologists, statisticians. And so because of these heterogeneous backgrounds, and given that much of the progress that eventually needs to be done will require [INAUDIBLE] methods, if you've forgotten the math or you need a refresher, Emily and the other TFs have kindly offered to essentially put together a math and a computation camp for you this afternoon. So following the project descriptions at about 4:30, again, over here.
There are a number of events which allow you to get to know each other through informal meetings like the one yesterday evening, like the barbecue on Saturday, the outing on Sunday. But in addition, so that you also get to know each other in terms of research, we have organized a half day of presentations on June 3 at 9:00 AM. Please give your slides to one of the TFs, presumably the one that you're essentially going to work with through the term of the course. Make sure that there are just two slides, one introducing the question, one introducing where you are, so that we can essentially all hear each other. So I hope you'll be willing to do that next Monday or Tuesday.
So what I'd like to do for you over the next hour or so is to tell you a little bit about what Tommy just stopped with, which is the question of why we should care about inverse problems and how we are going to essentially think about this. And this is going to be a theme that's essentially going to be pervasive right through the course. And the reason is very simple.
From even the perspective of my lecture, everything that you're trying to infer from what I'm speaking is based on partial information. You have some information associated with my voice. You have some information associated with my wild gesticulations, which will become wilder with time. You have partial information associated with the text that's there on the screen. And you're trying to integrate all that and essentially try and develop a whole.
In fact, we think of science as the ability to start with the theory and make predictions, but in fact, it's almost always the other way around. You have some data, partial information, and you're trying to construct a theory. So inverse problems are pervasive everywhere. Some of you have seen this. Others may not have. And so this introduction, I hope, will essentially bring everybody onto the same page.
So as I said, over the next some time, I'm going to go through a set of topics associated with inverse problems. And this is the outline for the next hour. I'll tell you about some simple and not so simple inverse problems. The not so simple one is, of course, as Tommy already pointed out, a very big part of learning, which is a very big part of intelligence.
I want to tell you about the conflicting requirements that inverse problems essentially bring to the table associated with accuracy and smoothness. For example, if I wanted to essentially fit a curve through a bunch of points, a perfectly accurate model would be a polynomial which essentially has a degree, which is essentially the number of points. But of course, it's going to be very non-smooth. And so there is this issue if it's very smooth, I have something which is completely flat. If it's very non-smooth, it's fitting everything and it's not very robust. So these conflicting requirements are things that we constantly have to deal with.
And using a bunch of examples, I will focus on what is now fairly well understood, the idea of what's a general linear inverse problem. In order to solve that, I will recapitulate some aspects of matrix algebra. For those of you who have forgotten it, for those who need a refresher, come back this afternoon. Stop us. We're all there for two weeks. This is going to be a bit of knowledge through the fire hose, if you will, if you haven't seen this before.
So I'll first focus on a deterministic framework and introduce, for those of you who have not seen it, very quickly, and refresh for those of you who have, the idea of solving these problems using singular value decomposition. Then focus on the idea that these sometimes conflicting requirements of accuracy and smoothness in the context of these inverse problems tend to be ill-posed in a way which I'll make very precise in a minute.
We have to look at regularization. And I'll tell you about a famous regularization which is about 40 years old [INAUDIBLE] Russian mathematician, Tikhonov, all in the deterministic framework. And then I'll close by telling you about a probabilistic framework for doing exactly the same thing. And again, as I said, this is going to be something which you're going to see again and again through the next two weeks in multiple different guises.
Much of what I'm going to tell you, in fact, can be found in a very nice online textbook. This is a very introductory lecture at this point. And here is the website. I should also tell you all these slides will be available to everybody in the course via the Dropbox or through some other mechanism which will allow us to essentially communicate.
Also, I should say please stop me whenever you feel fit if I'm going too fast. I'm going to be able to speed up, but I can certainly slow down. So onwards, then.
So inverse problems are not new. Inverse problems have been around for a very long time. And I'll give you an example in a couple of minutes. As I said, much of science can be thought of as inverse problems. Given data, what kind of theory should you essentially construct?
But in a more humble, pedestrian framework, inverse problems are associated, for example, in physics, which was where they first started to appear. The idea of scattering theory. The idea is I have waves which are essentially moving past an obstacle. I essentially pick up the scattered wave fields. I want to now try and construct the obstacle. How do I essentially construct that?
In quantum mechanics, I have a scattered wave field. Can I construct the potential? In cluster wave propagation, in radar, for example, can I construct the obstacle which essentially scattered the wave field?
Geophysics, essentially the same problem, worth a lot more money than in physics because if I can essentially prospect using seismic or electrical signals, I should be able to tell you something about whether there is gas or there is oil, what the density of rocks is, and so on. So again, I have a wave which is essentially impinging on a medium. I don't know what the medium is, but I'm essentially now trying to recover from the signal that scattered from the medium some properties associated with the medium.
Same kind of problem in biology. Instead of using acoustic waves or light waves, you use x-rays. You scatter them off a molecule. You want to try and understand what its structure is. So many of these problems have the sense of reversing what you normally do, which is: given a wave, given an obstacle, find out what the scattered wave field is. Over here, given the scattered wave field, find out what the obstacle is.
Something that you're going to hear a lot about, something that many of you are familiar with, already over here, people talk a lot about fMRI, the ability to essentially try and map activity in the brain. But everybody over here also knows, even though we like to forget, what we're really mapping when we're essentially looking at fMRI is blood flow. And trying to understand from that how neurons are firing is an inverse problem. It's a very hard problem. We don't quite know how to solve it. We are trying to essentially make guesses.
In a different setting, in engineering, often we are fitting models. We are trying to deconvolve images. Again, an inverse problem.
So what are inverse problems? Very, very general setting, which will become more and more clear. So given that there is an image-- for example, in front of me, all of you. As an operator, which is my lens, my retina and so on, I get some data, which is essentially going up to my brain all the way from [? v1 ?] to [? v6. ?] There is some noise. I can essentially ask what the direct problem is or I can ask what the inverse problem is.
So the direct problem is given the object, give me the data. The inverse problem is given the data, please give me the image. And here is a description which I like very much, which I'll just read out to you. "Two problems are inverses of one another if the formulation of each involves all or part of the solution of the other."
And what is direct and what is inverse is essentially purely historical, often. It's not really something that is set in stone. The better studied one is typically called the direct problem. The poorly studied one is called the inverse problem.
So the direct problem over here, as I already told you, is given f, find d. The inverse problem is given d, find f. And A could be a linear operator. A could be a nonlinear operator. There could be noise which is additive. There could be noise which is multiplicative. So there's a huge variety of phenomena which fit into this innocuous looking equation.
In the context of the things that we're going to see over here, there is a classic one. The classic one is trying to understand images from scattering. If I'm thinking now about light and how light interacts with a body, a simple question that one can ask is given the shape of the body, can you describe the shadow? Very well defined. You give me a body. You essentially use ray optics. You have a light field that essentially propagates or interacts with the body. You can construct the shadow.
The inverse problem is characterizing shape from shadow. And immediately, you know very, very well that it's ill-defined, ill-posed. And the reason it's ill-posed is if I have a concave body, the shadow will always be convex, and so there's a problem.
In fact, I think this is probably the oldest inverse problem there is because I read a description in the literature that Aristotle thought about this problem of characterizing shape from shadow in the context of asking whether the shape of the Earth was that of a circular drum or that of a sphere. I will not read out the entire description, except to point out that in order to solve this problem, he had to have multiple different pieces of information. It wasn't sufficient for him just to get the shadow of the Earth on the moon, which would be a circle if the Earth was a circular drum, but would also be a circle if it was part of a sphere.
But in addition, he recognized that essentially, if I move from one place to the other on the Earth, the celestial map changed. In addition, he recognized that in fact, when I move from one place to the other, bodies always fell the same way, towards the center. Using these additional pieces of information, he was essentially able to deduce, or he thought he was able to deduce, that it's more likely than not that the Earth is a sphere. It took a lot longer.
And so again, already in this very simple example, you see that you're bringing in information from different bits in order to essentially build what you actually know from facts. So the facts are the shadow is a circle. The other bits of information that you have are beliefs, or some sort of confidence, in whether or not the particles fall centrally at all locations, whether or not the celestial map essentially changes from one location to the other. And you're trying to build in different parts of your sensory input to essentially create a model, or to create some view of what the shape is, given that you essentially got shadows.
So starting with that, in fact, you can essentially try and describe a whole series of problems which are inverses of each other. One is well-posed and the other is ill-posed in a way which will become clear as soon as I give you the first example. For example, in arithmetic, if I multiply a number by another small number, that's a very well-defined problem.
But if I essentially reverse the question, instead of multiplying q by a, I actually ask, what happens if I essentially divide s by a? And a is a small number over here. Then you know that that's a very badly defined problem just from experience. Because if I essentially divide by a very small number, any small changes that I make in that small number get [INAUDIBLE].
So in arithmetic, multiplication is well-posed. Division is ill-posed. You change that simple problem, instead of multiplication of one number by another, multiply a vector by a matrix, you have the generalization for linear algebra. And we'll spend a fair bit of time on that. Again, linear algebra, multiplying a vector by a matrix is well-posed.
Dividing a vector by a matrix is extremely ill-posed, potentially. And the reason it's ill-posed is because the matrix may not be square. And even if it is square, it may have eigenvalues which are very small. And for either one of these reasons, essentially, you're going to have lots of trouble.
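To make this concrete, here is a small numerical sketch with NumPy (my own illustration, not from the lecture): a square matrix with one very small eigenvalue. Multiplying by it is harmless, but "dividing" by it amplifies a tiny perturbation of the data by a factor of one over the small eigenvalue.

```python
import numpy as np

# A square matrix with eigenvalues 1 and 1e-6.
A = np.diag([1.0, 1e-6])
f = np.array([1.0, 1.0])

# Forward problem: multiply the vector by the matrix. Well-posed.
d = A @ f

# Perturb the data slightly, as measurement noise would.
d_noisy = d + np.array([0.0, 1e-4])

# Inverse problem: "divide" by the matrix.
f_rec = np.linalg.solve(A, d)        # exact data recovers f exactly
f_bad = np.linalg.solve(A, d_noisy)  # the 1e-4 perturbation is amplified

print(f_rec)
print(f_bad)
```

The perturbation in the data gets divided by the small eigenvalue, so a change of 1e-4 in d produces a change of 100 in the recovered f.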
So you see an example in arithmetic, you see an example in algebra. And of course, naturally, you want to move further with an example from calculus, and also life, I would claim. In life, differentiation is ill-posed. Differentiation is bad. And integration is well-posed and integration is good. Integration is always smooth. And one of the things that we're going to do with all of you guys is essentially integrate.
So what has all this got to do with brains, mind, and machines? Tommy has already said that. I'm going to essentially tell you a slightly different version of that. I'm thinking about how we respond to what's happening in the world. We get sensory input, which is proximal, and we are trying to essentially use some computation or some set of computations in our brain to try and get perception.
I get input from vision, I get input from olfaction, I get input from movement. I'm trying to put all that together and say, and now taking all these things together, this is the person. And that person is defined in terms of all these different sensory inputs.
And in fact, I like this view, which [INAUDIBLE] many times before, and it's worth emphasizing, that every problem in physics, if I essentially turn around, becomes a problem in perception at some level. So for example, in physics, the problems have all been associated with understanding how light interacts with matter. And the inverse problem is the problem of vision in psychology, in neuroscience.
In the context of mechanics, forward problem, Newtonian problem, is given initial conditions, given the forces, please tell me how the bodies move. The inverse problem in neuroscience is associated with movement-- trying to understand, basically asking, what happened in your brain? What happened in terms of planning, given that you moved around in certain ways?
Olfaction is the inverse problem of hydrodynamics because hydrodynamics tells you the forward problem. Given a concentration of a molecule, how does it essentially change as a function of space? How does it change as a function of time? The inverse problem is measure the concentration and please tell me something about where it came from, when it came from, how long has it been around.
And in all these problems, inverse problem, you typically have partial information and you want to essentially determine the object. And the reason you want to do that is because if you can do that, you can classify. And of course, you can compress. And again, this is a theme that you will see right through the next couple of weeks.
So I want to now give you two or three little examples. The first one is an example of this in calculus. The second one is an example from numerical analysis. And then the third one essentially generalizes these ideas to focus on what the direct and what the inverse problems are, to emphasize the point.
So in Newtonian mechanics, we have, for example, mass times acceleration is the force. And if I'm given initial conditions for the position and for the velocity and I'm given the force, then I can essentially solve the problem forward. Please tell me what the trajectory is. You give me some initial conditions. You give me the force, and I'm done.
So the direct problem, the problem of integration, the smooth problem, the well-posed problem, is given the force or the acceleration, find the displacement. Easy to do. We learn this, and we've learned this for many, many years.
The inverse problem is: given u, find q. So given the displacement, find the force. And you immediately know that this is not nice. And the reason it's not nice is because essentially, every time I differentiate, I start to make errors. I make errors because the differentiation is essentially going to depend on the smallest window that I take for time, and I have to divide. Every time I differentiate, I have to essentially take the position at one point, the position at a different point, and divide by the distance, or the temporal variable at a time step, and that becomes very, very bad.
I'll give you an example completely quantitatively without essentially going through anything more than what's written over here. So for example, suppose I consider a potential solution u and I add to it a function which seems innocuous. I haven't defined what n is; n is just a number over here: u_n(t) = u(t) + cos(nt)/n.

If I differentiate this twice, I get q_n, which is a function of n: q_n(t) = q(t) - n cos(nt). And you immediately see that as n becomes larger and larger, u_n approaches u, so u_n is the approximation to u, which means the solutions are well behaved. But q_n and q diverge from each other because as n goes to infinity, the q's become worse and worse.
And you see this transparently because unlike in the forward Newtonian mechanics integration problem, where things are smoothed out as I integrate (if I have a function which has jumps and bumps and I integrate, everything smooths), if I differentiate, it becomes worse and worse. And here is a fantastically simple [INAUDIBLE] example where the function converges perfectly nicely as this parameter n goes to infinity. But in fact, the quantity that I'm interested in, the acceleration, diverges.
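As a quick check of this example (my own sketch, not part of the lecture), take u(t) = t^2/2, so that the true force is q(t) = u''(t) = 1, and add the perturbation cos(nt)/n:

```python
import numpy as np

t = np.linspace(0.0, 1.0, 1001)
u = 0.5 * t**2          # true displacement; the force q(t) = u''(t) = 1

for n in [10, 100, 1000]:
    un = u + np.cos(n * t) / n      # perturbed displacement u_n
    qn = 1.0 - n * np.cos(n * t)    # its second derivative q_n, computed analytically
    # The displacements converge (error at most 1/n) while the forces diverge (error n).
    print(n, np.max(np.abs(un - u)), np.max(np.abs(qn - 1.0)))
```

The displacement error shrinks like 1/n while the force error grows like n: the inverse problem has no continuous dependence on the data.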
So building on that simple example, we can now construct this idea, which goes back to Hadamard of what a well-posed problem is. And it doesn't necessarily have to be a forward or a backward problem. It's just a question of whether it's well posed or not. Tommy alluded to this a few minutes ago and I want to emphasize.
The first problem, the first question that one needs to address if I give you an equation-- an algebraic equation, a differential equation. It doesn't really matter what, linear algebra equation. The first question is, is there a solution? Second question is if there is a solution, how many such solutions are there? And the third is the question of whether the solution is stable. So small changes in my initial conditions or small changes in my data, do they lead to large changes in the solution?
If all three of these are true, that is to say, you have existence, you have uniqueness, and you have stability, and stability in this case means continuous dependence on the data, then Hadamard told us that the problem is what's called well-posed. Well-posed in the sense that if I give you now this differential equation, algebraic equation, or, in the context of perception, I give you a set of sensory inputs, can I essentially deduce the object from which the sensory inputs are coming?
If all these three conditions are satisfied, the answer is yes, I can do it, and I can do it uniquely, and that's great. If any one of these is violated, the problem is ill-posed. That was Hadamard's contribution from almost 100 years ago. In 1902, he published a very beautiful paper essentially where he describes this.
Of these, the one that we're going to be worried about the most is the last. And the reason we're going to worry about the last one the most is because I'm thinking about linear equations, which is going to be the focus for today. We're going to start thinking about nonlinear equations and nonlinear ways of trying to invert problems later on.
In linear equations, existence is almost a given. Not quite, but it's almost a given under conditions which I will explicitly indicate in a few minutes. Linear equations, if I have solutions, they're unique. Definition.
This is going to be the scorpion's sting. This is going to be where things are going to become difficult because it's no longer clear. It will not be clear, and I'll show you why-- I've actually shown you why in this very simple example already-- that you don't necessarily have continuous dependence on the data. That is to say, if I invert the problem, I'm going to have multiple ways of interpretation.
In vision, for example, the whole idea of illusions is essentially associated precisely with that. Because I get certain projections, and those projections are not sufficient for me to essentially extract or recover the object from which those projections came. And so you get tricked.
So how do we solve these? Or can we solve these at all? So it took another 50 years from Hadamard to a Russian mathematician and his colleagues, Andrei Nikolaevich Tikhonov and [INAUDIBLE], who in the 1940s and '50s recognized that even though problems are ill-posed, you can actually make progress. You can actually start approximating solutions. And I want to tell you about that after I give you the second example.
And the second example is more of a numerical example rather than an analytic example. It still makes the same point. So, for example, if I'm given a set of points associated with how a function changes in time, how can we estimate the slope?
It's an obvious question that comes up in the context of vision, where I'm essentially trying to detect differences between light and shadow. It comes up in motion, where I'm essentially looking at a body which is moving past me. In all these problems, I have to essentially calculate or estimate gradients. And so there's a very basic question that comes up. If I want to calculate gradients, I have to estimate the slope.
I'll start with a graph, which is essentially showing you what, for example, might be input into your senses of some stimulus. So the solid line is what should be, in an ideal world. And on top of that has been superimposed some noise. And what you're given is the sum of these two, which is essentially indicated in red over here.
So I have some noisy data. And generally, noise is associated with what is happening at high frequency, short wavelengths. And so what you want to do is to try to extract from this some measure of the slope. So you have high frequency noise superimposed on low frequency data.
And if I want to try and estimate the slope of the function, the first question that you've got to ask yourself is, how do I sample? Because basically, I'm not going to be able to get infinite sampling. I'm not going to be able to get the value of the function at every point in time. So I have to first ask myself, how frequently should I sample?
And based on that, I have to ask myself, are there optimal ways of sampling? If I don't sample well, it's useless. If I basically sample two points, I'm never going to be able to capture even the low frequency content. If I sample too strongly, I'm going to claim, I'm going to tell you immediately, then I'm going to be trying to over fit. Because if I sample too strongly, I'm going to pick up every nuance of this high frequency information. And then I'm going to make a mistake because the slopes are going to change enormously.
So again, there is this very simple question which I hope we'll come back to over the next couple of days when we think about questions of sampling. There are different views and different ways. There is a classical view, which says that if I have frequencies which are bounded-- so as long as the signal has an upper limit above which there are no frequencies, there is a theorem which goes back 75 years ago to a mathematician, Nyquist, who said that in fact, there is an optimal rate.
But more recently, there have been very, very nice views of the sampling problem which say that in fact, I can get away with a lot less. The Nyquist view is the pessimistic view, if you will. A more modern view is the optimistic view, which says that in fact, most functions are fairly well behaved, and so I don't have to sample all that often. Most functions, in particular, are those associated with natural scenes in vision, natural movements and motions, and so on and so forth.
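Here is a tiny illustration of why the sampling rate matters (my own sketch, not from the lecture): a 9 Hz sine sampled at only 10 samples per second, well below its Nyquist rate of 18, produces exactly the same samples as a 1 Hz sine of opposite sign, so the two are indistinguishable from the data.

```python
import numpy as np

fs = 10.0                 # sampling rate: 10 samples per second
k = np.arange(20)
t = k / fs                # sample times

fast = np.sin(2 * np.pi * 9.0 * t)    # a 9 Hz signal, undersampled
alias = -np.sin(2 * np.pi * 1.0 * t)  # a 1 Hz impostor

# The two signals agree at every sample point: aliasing.
print(np.max(np.abs(fast - alias)))
```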
I've already told you that differentiation is inherently unstable and noisy. And so if I actually started with noisy data-- let me call the noisy data sampled at points tk, sk, which would be the true value of the function plus some noise. We're trying to estimate the slope.
So naively, we would say the slope is just the difference of the function at two particular points, t_k and t_{k+1}, at which I have sampled, divided by delta t. And if I essentially use and recognize the fact that the approximation to the function essentially has noise built in, I can rewrite this as f(t_{k+1}) minus f(t_k) over delta t, plus epsilon_{k+1}, the noise at t_{k+1}, minus epsilon_k, the noise at t_k, divided by delta t.
And the problem is that as I make my sampling narrower and narrower, as delta t becomes smaller and smaller, the first term becomes better and better, but the noise term becomes worse and worse. It's a very simple example which says, now I have a problem. I have a problem because you may think that to become more and more accurate, I'll reduce my sampling time, increase my sampling frequency. But if I have noise, eventually, it becomes worse and worse. It becomes worse and worse precisely because of the noise.
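A small numerical sketch of this trade-off (my own illustration, with an assumed noise level of 1e-3): estimate the slope of f(t) = sin(t) from noisy samples by forward differences, and watch the worst-case error blow up as delta t shrinks.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_error(dt, sigma=1e-3):
    """Worst-case error of a forward-difference slope from noisy samples of sin(t)."""
    t = np.arange(0.0, 1.0, dt)
    s = np.sin(t) + sigma * rng.standard_normal(t.size)  # noisy data s_k
    est = np.diff(s) / dt          # (s_{k+1} - s_k) / delta t
    true = np.cos(t[:-1])          # the exact derivative
    return np.max(np.abs(est - true))

coarse = slope_error(1e-1)   # truncation error dominates: modest
fine = slope_error(1e-5)     # noise term (eps_{k+1} - eps_k)/dt dominates: huge
print(coarse, fine)
```

Refining delta t past a certain point makes the estimate dramatically worse, exactly as described above.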
So if you were trying to accurately estimate the slope, you might say, what I really need to do is take the actual slope at t_k and my approximation, this difference quotient, and just minimize the square of the errors. It's a very natural thing for you to tell me.
But this is going to become worse and worse as delta t becomes smaller and smaller. And as delta t becomes smaller and smaller, the noise term, which you've forgotten about, which you have no control over, essentially starts to diverge. So if you're solely thinking about accuracy, increasing the frequency of sampling makes things worse and worse.
On the other hand, you'd say, we hope-- and hope is what drives, I think, all of us in science. Certainly the first thing in the morning, if I didn't have a hope of trying to solve the problem that I was starting out to do, I'd never succeed. So our hope is that, in fact, there is some underlying signal over here.
And if there is some underlying signal, maybe I want to essentially balance this competition associated with trying to maximize accuracy with another trend, which is essentially look at smoothness. Because I'm hoping that there is a real signal, and if there is a signal, then there is an underlying smooth quantity below. And therefore, I want to essentially now worry not only about accuracy, but about smoothness.
And this is the idea that Tikhonov essentially told us how to characterize and quantify. And the idea is instead of now maximizing the accuracy, minimizing the squared error, you also add a contribution. And the contribution over here is your approximation for the slope at time tk plus 1 minus the approximation of the slope at time tk.
Why? Because if I do that, then I'm allowing the slope to vary, but not too strongly. Effectively, this is looking at the derivative of the slope. If the derivative of the slope is essentially moving around rapidly, the acceleration is very, very large. That's bad. It's unlikely that I'll essentially see a solution like that.
But when I say that it's unlikely, you've already made the hypothesis. You've made a hypothesis. You've said that there is a prior. And the prior is that somewhere under this mess, there is a signal. That's the hope that drives us.
And so we're essentially saying, optimize some combination of accuracy and smoothness. And this parameter alpha is what essentially controls how much I'm weighting the smoothness compared to the accuracy. As alpha becomes larger and larger-- if alpha becomes infinite, then in minimizing this functional, who cares about the accuracy? Because if there is even a small change in the slope, you're going to penalize it. And so what are you going to get? You're going to get, as the approximation to that complex curve, a straight line.
But as alpha goes to zero, you suffer because then, you start to pick up the jitter. And so alpha is a filter, or alpha plays the role of a filter. How much am I willing to pay attention to the high frequency components compared to the low frequency components?
So the little example-- and I'm really sorry that it's not showing up as well as on my screen. This is an example where I have this little function, which has a linear part, f of t is proportional to t, and an oscillatory part. That's the actual derivative. You add some noise and you try to reconstruct the function by minimizing this quantity.
And as alpha goes to zero, you do a disastrous job because of the noise. As alpha becomes larger and larger-- for some optimal value of alpha, for example, over here, alpha [INAUDIBLE] minus 2, you're doing a reasonably good job. As alpha becomes very large, exactly what you expect. You just replace the whole thing by the average.
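Here is a sketch of that experiment (my own reconstruction, with assumed noise level and alpha values, not the lecturer's exact numbers): reconstruct the slope of a linear-plus-oscillatory function by minimizing the sum of the squared error and alpha times the squared differences of the slope.

```python
import numpy as np

rng = np.random.default_rng(1)
dt = 1e-3
t = np.arange(0.0, 1.0, dt)
# f(t) has a linear part and an oscillatory part; its true slope is 0.5 + cos(4 pi t).
s = 0.5 * t + np.sin(4 * np.pi * t) / (4 * np.pi) + 1e-3 * rng.standard_normal(t.size)
true_slope = 0.5 + np.cos(4 * np.pi * t[:-1])
raw = np.diff(s) / dt               # naive slope estimate: dominated by noise

def tikhonov_slope(raw, alpha):
    """Minimize sum_k (g_k - raw_k)^2 + alpha * sum_k (g_{k+1} - g_k)^2 over g."""
    n = raw.size
    D = np.diff(np.eye(n), axis=0)  # first-difference operator
    return np.linalg.solve(np.eye(n) + alpha * D.T @ D, raw)

errs = {}
for alpha in [0.0, 100.0, 1e6]:
    g = tikhonov_slope(raw, alpha)
    errs[alpha] = np.sqrt(np.mean((g - true_slope) ** 2))
print(errs)
```

An intermediate alpha beats both extremes: alpha = 0 picks up the jitter, and a very large alpha flattens the oscillation down to its average, just as described above.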
So how can we now quantify this in a more general sense? As I said, please do stop me for any reason whatsoever, even if you want to take a break. If there's a majority, then I will.
So I want to now generalize these two examples from an analytic perspective and from a simple differentiation perspective now to ask how we should think about linear inverse problems. And linear inverse problems I can always couch as matrix products because any linear operator I can discretize, and if I discretize, I can write it essentially as a matrix.
So my linear forward problem would be given f, given A, find d. f is an image, A is a measurement matrix, or a projection, and d is data. Forward problem is f is given. Find d. The inverse problem is given d, find f, which involves the inversion of a matrix. I could have noise, and I'll come back to that in a while. But for now, I'm going to focus on the case when I don't have noise at all.
I already told you that there are potential difficulties if my matrix is not square. And in general, it won't be square. So I'm going to essentially tell you, just for notation, that d is an m by 1 vector. A is a rectangular matrix, m by n in dimension. And f, of course, by construction, then, must be a vector which is n by 1.
The rank of the matrix is r. The rank of the matrix, if it's a square matrix, is the number of nonzero eigenvalues. If it's a non-square matrix, it's the number of nonzero singular values, which I'll essentially refresh for you in a minute.
The boring problem, the one that all of you certainly know how to solve, was the one in high school that was taught to you where m was the same as n, and m and n were also the same as r. So you had a full rank matrix which was square.
So basically, I have a set of simultaneous equations. Number of unknowns is equal to the number of equations. And no equation was linearly dependent on any other. That means it was full rank. That's the boring problem. And so we're going to immediately forget about that because you know how to solve it.
The interesting problem, the real problem, is either when m is much less than n-- and I'll come back to the condition on r. So if m is less than n, then it potentially could be underdetermined, because I just don't have enough equations for the unknowns. Or if m is greater than n, in which case it's overdetermined. I have more equations than unknowns.
How can we essentially solve either one of these? I'll come back to this later. I'll focus on this for now. This is a more tractable version. In some sense, if it's underdetermined, I have a lot of freedom. Solutions potentially could be non-unique because I have more unknowns than equations. So I can arbitrarily start throwing out some equations. Or equivalently, I can zero out some of the variables.
And this is a whole field of signal processing, which has exploded in the last few years under the guise of what's called compressive sensing, matching pursuit, and so on and so forth. For now, let's focus on the case when the number of equations is more than the number of unknowns.
So I have over-determination, which is typically what's going to happen, for example, every time you try and fit a line through a set of points. A line has basically two quantities which describe it, its slope and an intercept. I have a whole number of points I want to fit it through. What do you do? You minimize the error, or the least squared error.
I want to now formalize that question because that's essentially the heart of these inverse problems. And I want to formalize it by essentially asking that we solve this equation, d is equal to Af. We invert A. There are two ways.
One is to generalize the idea of an inverse. The inverse of a square matrix we know how to compute. The inverse of a rectangular matrix, there is something called the generalized Moore-Penrose inverse. If you've seen this, then you know what I'm talking about. If you haven't seen this, look it up afterwards.
I won't tell you about that. I'll tell you an equivalent, and a very powerful computational approach, which is to recognize that for any rectangular matrix, A, there is a theorem which says that I can always rewrite that matrix A in terms of what's called traditionally a singular value decomposition, which is a product of three matrices, U times sigma times V transpose, where U essentially has for its columns orthonormal vectors, which correspond to the left singular vectors, or what's also called the data space.
And the reason they're called the data space is because U has dimension m by m, and the data has dimension m by 1. So essentially, it's the space in which the data lives, if you want, with respect to which I can essentially express everything using U.
Sigma is a rectangular matrix. So U is a square matrix, V is a square matrix, and sigma is a rectangular matrix which has a certain number of nonzero diagonal elements. Everything else is zero, including a block which is completely zero. And the number of nonzero elements is essentially the rank of the matrix, r.
And then V is another square matrix, and it corresponds to what's called the right singular vector, or the image space. And the reason it's called the image space is because it's operating on the image. If I write down A times f, V acts on f. So that corresponds to the image.
So in terms of the singular value decomposition, which I hope all of you are familiar with. And if you aren't, come and talk to us later on. We can essentially write down a solution formally, and actually quantify how we should invert this matrix equation, d is equal to Af, by trying to minimize what's called the L2 norm, or the Euclidean norm.
And here's the formal [INAUDIBLE]. So it basically says the least squares idea is to find this approximation, f, which is the solution of the following optimization problem, which is the argument associated with minimizing this norm, d minus Af. Which norm do we pick?
Typically, what's picked is the 2 norm, the Euclidean distance, essentially. Equivalently, geometrically, what we're trying to do is to find the smallest sphere whose center is as close to all the points as you possibly can have. That's the geometric interpretation of the least squares problem.
How do we do that? We do that by expressing A in terms of its singular values and then rewriting this equation, d minus Af, or equivalently the norm by essentially writing d minus Af squared can be written down in terms of the singular vectors, the left singular vector, the right singular vector, and the singular values. So this is a little bit of tedious algebra that I have to be able to at least tell you about, show you, and you can work through this later on.
d is the data, and d must be represented in terms of the left singular vectors, which correspond to the data space. So I project the data onto the left singular vectors. That's my projection. And then times the singular vectors themselves. That's d.
And A is the singular value decomposition I've written down over here. Singular values multiplied by the left singular vector times the right singular vector projected on the image.
AUDIENCE: I have a question.
LAKSHMINARAYANAN MAHADEVAN: Yeah.
AUDIENCE: So the only regularizers I ever see people use are L2, L1, or L0. Is that for any reason besides computational or analytical?
LAKSHMINARAYANAN MAHADEVAN: Short answer is no. So L2 is good in certain situations. It's very bad in certain other situations. Again, if I think about trying to fit a set of points with a line, if I have outliers, then L2 is a disaster because the outliers essentially control everything.
And then you would say, maybe what I want to do is not to go to L2 but to go to L0. Basically, I just count whether each variable is zero or not zero. But L0 turns out to be non-convex. This is going to very rapidly become a technical discussion.
And if it's non-convex, you don't have any good methods to do that. And so there are some theorems which show that almost always, you can replace minimizing the L0 norm by minimizing the L1 norm. And minimizing the L1 norm allows you to essentially use methods from convex optimization. So the short answer to your question is they are all associated with computational ease rather than anything else.
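The outlier point above has a one-line illustration (numbers invented): when fitting a constant to data, the L2 minimizer is the mean and the L1 minimizer is the median, and only the former is dragged away by a single gross outlier.

```python
import numpy as np

# Fitting a constant to data with one gross outlier (all values invented).
x = np.array([1.0, 1.1, 0.9, 1.0, 100.0])

# L2 minimizer (the mean) is controlled by the outlier;
# L1 minimizer (the median) essentially ignores it.
print(np.mean(x))     # ~20.8
print(np.median(x))   # 1.0
```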
So I want to minimize this norm. And if I just write it out explicitly, I can write it out in terms of two parts, one which has the sum going from k equals 1 to the rank, and then the rest of it associated with going from the rank plus 1 all the way to m. So I've done nothing except expand this out into two parts.
And now I just write out the Euclidean norm of this vector. And I'm sorry that this is at the bottom of the page because I realize that this is not optimal for lectures since people at the back can't see it. But if you can't read it from the back, I'll just tell you it has two parts to it, one which corresponds to effectively a term, or a set of terms, which depends on the image, f, and another which has no dependence on the image.
And the term which has the dependence on the image is essentially written down as a square. And so if I want to minimize it, I can always make that zero. The other part I have no control over. And if I make that zero, I essentially get what we are trying to understand, what we're trying to derive.
We're interested in trying to essentially get f. And so we get the components of f. What components of f do we get? The components of f which are projected onto the image space because I've written down everything in terms of the image space. So I write down my image in terms of an expansion in terms of the singular vectors associated with the image space. And I write down the data in terms of an expansion associated with the data space.
And so I get what I need. And now I want to summarize what we know. What we know is if the rank of the matrix is equal to the dimension of the image, n, then you have a unique image. And here is an expression. All I'm going to basically point out is that that is my expression.
So the most important thing to point out over here is that if you look at the expression for the image, which is what we were interested in, inverting the matrix, you see that there is a dependence on 1 over the singular values. That makes sense, because I can essentially think about the linear algebra problem as a simple generalization of my arithmetic problem. My arithmetic problem was, I'm dividing. The linear algebra problem is just a fancy way of dividing: rather than by a number, I'm dividing by a matrix.
But if I write down the matrix in terms of its singular values, then all I've got to do is divide different components by different singular values. So it's a completely natural generalization of the idea of arithmetic division.
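This SVD inversion formula can be checked on a small overdetermined fit; a sketch with invented data, fitting a line (slope and intercept) through noisy points and computing f as the projections of d onto the left singular vectors, divided by the singular values, expanded in the right singular vectors.

```python
import numpy as np

# Overdetermined line fit: m = 50 points, n = 2 unknowns (slope, intercept).
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 50)
d = 2.0 * t + 1.0 + 0.05 * rng.standard_normal(50)   # invented noisy data

A = np.column_stack([t, np.ones_like(t)])            # m x n design matrix

# Least-squares via SVD: f = sum_k (u_k . d / sigma_k) v_k.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
f = Vt.T @ ((U.T @ d) / s)

# Agrees with the library's least-squares solver, and recovers the slope
# and intercept up to the noise.
print(f)
print(np.allclose(f, np.linalg.lstsq(A, d, rcond=None)[0]))
```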
You can ask the question, what would happen if one of the singular values became small? And you have a problem because you had a singular value which is small. And if the data had noise, effectively replace d by d plus noise, the noise is going to be divided by a small number. And if it's divided by a small number, I'm going to run into trouble.
So this is the heart of the matter. And this also raises an important issue, which is why should we make the cutoff from nonzero to zero, which defines the rank, be something which is God given? There's no reason because after all, singular values of a matrix can have some distribution. There can be some which are small, some which are large. Why should we make zero something which is super critical? Because if there's a singular value which is small, I'm going to run into trouble.
So that suggests two things. The first thing it suggests that you don't want to think about the singular value decomposition of a matrix or an operator in general. You want to think about an effective singular value decomposition. You just decide-- again, this depends on your priors. This depends on how strongly you believe the data.
You have to decide how many singular values I want to keep, and I just truncate. So I have the effective rank, and therefore, an effective singular value decomposition. And the important message, even if the details are not completely obvious, is that there is some notion of a belief, there is some notion of a prior, which tells you how much of the information that I actually get, how much of the data that I actually get should I actually trust. And the rest of it is noise. And this is something, again, as I said, you're going to see again and again.
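This "effective rank" idea can be sketched numerically with an invented matrix whose last singular value is nearly zero: keeping every singular value divides the noise by that tiny number, while truncating it gives a far better reconstruction. All the numbers here are made up.

```python
import numpy as np

rng = np.random.default_rng(2)
# Build a 6x6 matrix with prescribed singular values, the last one tiny.
Q1, _ = np.linalg.qr(rng.standard_normal((6, 6)))
Q2, _ = np.linalg.qr(rng.standard_normal((6, 6)))
s_true = np.array([5.0, 3.0, 2.0, 1.0, 0.5, 1e-8])
A = Q1 @ np.diag(s_true) @ Q2.T

f_true = rng.standard_normal(6)
d = A @ f_true + 1e-4 * rng.standard_normal(6)      # small noise

def svd_solve(A, d, keep):
    # Invert using only the first `keep` singular values (effective SVD).
    U, s, Vt = np.linalg.svd(A)
    return Vt[:keep].T @ ((U[:, :keep].T @ d) / s[:keep])

f_full = svd_solve(A, d, keep=6)    # divides noise by 1e-8: disaster
f_trunc = svd_solve(A, d, keep=5)   # truncate: effective rank 5
print(np.linalg.norm(f_full - f_true), np.linalg.norm(f_trunc - f_true))
```

Where to cut is exactly the prior the lecture describes: how many singular values you trust.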
So if r is equal to n, as I said, we have a unique image. I just told you about it, and I have written it down explicitly. If there was noise in the data, so the data is now the operator times the image plus some noise. I write down the data in terms of a description that I have already.
And if I essentially take this and substitute into my approximation for f, then I find that I get the real f. And then I get a whole bunch of stuff, which corresponds to the projection of the noise on the data space, multiplied by the image singular vector, divided by sigma k. And I explicitly now compute for you the dependence of this approximation on the noise and on the singular value.
So if the noise is small but if the singular value is also small, the ratio of these two can be large, and that will give you trouble. That's my explicit calculation for what I just told you in words. The noise gets amplified by the small singular values. And again-- question?
AUDIENCE: Yeah. Just want to make sure I'm following the notation. So what we did on the previous slide was reduce the U, sigma, and V matrices to the lowercase u--
LAKSHMINARAYANAN MAHADEVAN: Those are just associated with the columns of U, sigma, and V.
AUDIENCE: Which are nonzero?
LAKSHMINARAYANAN MAHADEVAN: They're just normalized vectors. That's it. So the small u's, the small v's, and the small sigmas are just the elements, or the columns, associated with the matrices U, V, and sigma-- just the top part of the diagonal. The number of nonzero entries in the diagonal is exactly equal to the rank. The rest of it is just a block of zeros.
AUDIENCE: Do you also refer to them as singular value?
LAKSHMINARAYANAN MAHADEVAN: Yes, singular values, the generalization of eigenvalues for non-square matrices. If you like, more precisely, the singular values are essentially the square roots of the eigenvalues of A transpose A or AA transpose.
AUDIENCE: Thanks.
LAKSHMINARAYANAN MAHADEVAN: Thank you. Please do continue to stop and interrupt. So question. Now, if I already told you that m equals n equals r-- boring. If r is equal to n, and it's different from m, I can still have a unique image. But what happens if the rank is less than n? What happens if your matrix A has a smaller rank than the dimension of the image space?
That basically corresponds to saying that there is a nonzero null space of A. Or mathematically, it says that A can operate on a linear combination of v's-- v's correspond to the image space-- and still have a contribution which gives rise to zero where the coefficients are not all zero. Basically, it says that there are vectors v so that A times v is equal to 0. And v is not zero, and certainly, A is not zero.
If that's the case, you now have a peculiar problem. The peculiar problem is that I can add to my approximation that I already had, approximation I've written down over here for f. So the approximation solution of this matrix equation, A times f is equal to d, to which I can add any linear combination associated with the null space of A. And that's guaranteed to satisfy the equation A times f is equal to d because A times f will give you this part, which will give you d. And A times this, by definition, is going to give you zero.
Again, geometrically, what this is saying is if there are parts of my signal which are essentially going to be associated with the null space, then I can always add any contribution from the null space to my solution and it's a perfectly valid one.
And to make a connection to what I showed you three or four slides ago, that was exactly like solving the inverse problem for the Newtonian mechanics question where I could add to the U 1 over n cosine nt. It makes no difference whatsoever. This is the exact analog of that problem, except now in the matrix context.
So now I have a problem because generically, the rank of these matrices is small. And if it is small, I have a nonzero null space. And if I have a nonzero null space, then I have a whole bunch of solutions which are also possible, not the one that I'm interested in alone.
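The null-space ambiguity is easy to check numerically; a made-up rank-1 matrix suffices: if A v = 0, then f and f plus any multiple of v produce exactly the same data.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])          # rank 1: second row = 2 x first
v = np.array([1.0, 1.0, -1.0])           # a null vector: A @ v = 0
f = np.array([0.5, -1.0, 2.0])           # some candidate image

# f and f + c v give exactly the same data, for any c.
print(np.allclose(A @ v, 0.0))
print(np.allclose(A @ f, A @ (f + 10.0 * v)))
```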
So what do I do? So I now need to figure out how to make this better. So before I go there, I want to talk about what happens if the rank of the matrix is less than m. And m is the dimension of the data space.
Now there's another problem because if the rank is smaller than the dimension of the data space, that means I may not even get a solution because there are certain kinds of data for which I have no way of knowing how to invert the matrix. Again, I can potentially couch this in the context of perception, but I will leave it in the context of this matrix inversion. There are certain quantities, there are certain parts of the data which have no way of being represented in the image of A because A has essentially got too small a rank compared to the dimension of [INAUDIBLE].
So there may be no solution. And if there is no solution, the best I can do is essentially get an approximation. And again, an approximation can be done using least squares. If I'm doing an approximation using least squares, it becomes part of the same problem.
So what do I want to do? I want to summarize now what we've done so far. So starting with this extremely simple idea that division of a number by a small number is an ill-posed problem, going from there into a simple differential equation where I showed you that-- and I want to say the same issue hits us in the face when I'm thinking about matrix inversion.
And so we've got to now ask, how do we essentially resolve this problem of dividing by a small number, particularly if the data is noisy? What's the answer? The answer I already told you. The answer that Tikhonov told us is to say I must have some other norm, some other way of evaluating how I'm doing, which means I have to introduce some notion of smoothness, which is the same thing as saying I must know something about f. All these are equivalent.
How do we choose among many different feasible solutions in a case which is very typical, where the rank is small compared to both n and m, the dimension of both the image space and the data space? And the answer is in words-- and it becomes quite hard and there are many different ways of doing this in equations. The answer in words is that we need some criterion to eliminate infeasible solutions. Or equivalently, we want to build a bias towards feasible solutions.
So the Tikhonov idea, which you already saw in the context of trying to ask how I should describe a smooth interpolation of slopes of functions where I'm not sampling infinitely often and where there is noise, is to say that in addition to minimizing this misfit norm, or the L2 norm, or the Euclidean norm, you may also want to essentially minimize some other quantity which is strictly a function of f. And f is the image itself.
And that's the notion of saying, I actually know something else. I know that that image is smooth. I know it's not varying rapidly on small scales. Or I may know something not only just about the image. I may know about some operator, L, operating on the image. So f infinity, if you will, is some potential feasible solution that somebody told you, and I want to minimize the distance of f compared to f infinity.
Or I may want to minimize the derivative of f minus f infinity. This is, for example, L is a linear operator operating on f minus f infinity, and the first derivative will have this form: a matrix which has minus 1 along its diagonal and 1 just above the diagonal. So I may want to minimize the Euclidean norm relative to a default solution, or the bias, or some smooth version of that.
But now I have a conundrum. The conundrum is, do I minimize this norm or do I minimize that norm? And what do we do? You just weight them. You now say, how much belief do you have in basically the default solution? And how much belief do you have in the data? In the absence of data, I basically use the default solution. In the presence of data, I gradually start to be led by the data. That clear?
This is the idea which essentially goes back to Tikhonov, now formalized in this linear algebraic setting. It replaces the previous formulation, which was minimizing the L2 norm of d minus Af. Now I minimize a weighted sum of the L2 norm associated with the misfit and a regularization term, which characterizes how smooth or how rough the solution should be. The critical quantity is this parameter lambda, or rather lambda squared, because lambda squared is always positive, called the regularization parameter. If lambda is equal to zero, then I recover least squares, the original version. But least squares is very sensitive to noise, in particular if there are singular values which are small, because I'm dividing the noise by a small quantity.
On the other hand, as lambda becomes infinity, then the misfit essentially plays no role because I'm weighting very strongly the deviation of f from this prior, this bias towards f infinity. And so I ignore the data and I just focus on the prior. So this now is the analog of what I told you before for how to essentially estimate the slope of a function. But now it generalizes to a linear algebra problem.
AUDIENCE: So is the reason why we call this roughness because we have a prior for low polynomial functions?
LAKSHMINARAYANAN MAHADEVAN: Exactly. Exactly. Now, you have to decide, or somebody has to decide, what variation do I use? So in other words, do I want to look at function? Do I want to look at the slope? Do I want to look at the second derivative? And then how strongly do I want to essentially bias it?
Both of those are very important. And those things I can't tell you. You have to decide. The user has to decide. That is an a priori choice, not an automatic procedure.
So now I want to show you something that happens, which again, from the arithmetic problem, you already see. So in the arithmetic problem, I want to divide a number by a small number. What do I do to essentially regularize that? Because I don't control the small number, I just add to the small number another small number, but that number I control.
And if I control that number, it doesn't really matter what the previous small number is because that other small number that I control is never going to be smaller than critical value, so I can always be sure that the function remains under control. I want to use exactly the same idea over here.
So if I write the solution of this minimization problem in terms of trying to characterize f, minimizing that would be saying that I want to take the derivative of this function with respect to the quantity that I'm trying to essentially find, which is f sub k, the unknown elements of the vector f. Then this 2 norm is lambda squared times f minus f infinity transpose times L transpose L times f minus f infinity, because it's just L times this multiplied by its transpose.
And similarly, the same thing over here, which I've done over here, equal to 0. Take the derivative of that with respect to f, and you get an equation which is this. I didn't write this out. Previously, we had Af is equal to d or, in the squared version, A transpose Af is equal to A transpose d because A transpose A is a square matrix. A is rectangular but A transpose A is a square matrix. And the Moore-Penrose inverse is the inverse of essentially A transpose A, the inverse of that multiplied by A transpose.
Now what I've got, because I've regularized it, is an extra lambda squared times L transpose L times f infinity on the right side. And I have an extra lambda squared times L transpose L times f on the left side. If L is the identity-- so I'm not worried about smoothness of derivatives, I'm simply worried about smoothness of the function itself-- that is called Tikhonov regularization.
And if L is the identity, then you see that I have A transpose A plus lambda squared times the identity. And now I've done something very nice, which Tikhonov essentially told us, which is that no matter what the singular values of A are-- so therefore, the singular values of A transpose A would be sigma k squared.
I add to those lambda squared. So even if sigma is going to zero, if my regularization parameter is large enough, if my lambda is large enough, lambda squared plus sigma k squared never goes to zero. And so when I divide, which is what I'm doing, morally, when I'm solving a matrix equation, all I'm doing is dividing. And if I divide, I don't care now because I'm never going to be able to divide by zero because even if sigma k becomes small or even if it's equal to zero, I always have lambda to control.
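This identity is easy to verify numerically. A sketch with an invented matrix and lambda: with L the identity and f infinity zero, the Tikhonov solution of (A transpose A plus lambda squared I) f = A transpose d has SVD components sigma_k (u_k . d) / (sigma_k squared plus lambda squared), so the division is always by something bounded away from zero.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((8, 5))   # invented rectangular matrix
d = rng.standard_normal(8)        # invented data
lam = 0.7

# Direct solve of the regularized normal equations.
f_direct = np.linalg.solve(A.T @ A + lam**2 * np.eye(5), A.T @ d)

# Same thing through the SVD, component by component: never divide by
# anything smaller than lambda.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
f_svd = Vt.T @ (s * (U.T @ d) / (s**2 + lam**2))

print(np.allclose(f_direct, f_svd))   # the two routes agree
```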
So I hope that you're seeing this very simple, almost stupid trend. If I want to divide, I want to make sure I don't divide by zero. That's it. That's all there is. The rest of it is formalism.
So in fact, if I want to write that out explicitly, so what I want to do is now solve this problem when L weighting for the preference, or the prior, is identity. I want to solve this problem now rather than the previous problem, which was A transpose A times f is equal to A transpose d. Now I want to solve this regularized problem.
How do I do it using singular values? I use the singular value decomposition. I stick that in here. There's a little bit of algebra, which I'm not going to walk you through. But I'll tell you the result because the result is extremely intuitive.
The result is that the components of f now are a weighted average of the data and the prior. And what are the weights? The weights are ratios: the square of the singular value divided by the square of the singular value plus the square of the regularization parameter. Note that the sum of the weights is one.
Note further that for all the components where I have data, everything up to the rank r of the matrix, my solution depends both on the data and on the prior. While for the places where I have no data-- so previously, remember that I said that if I have a matrix which has deficient rank, r less than n, I could always add a linear combination of the null space vectors, and I didn't know how to do that.
Now I say every time I have to do that, in fact, I throw out the contribution from the null space and I essentially use the prior. So wherever I have data, use it. Wherever I don't have data, use the prior. So in the data space, use the weighted sum of the data and the prior. In the null space, use the default solution.
The additional thing that you see is that when the singular values are large compared to what? Compared to the regularization parameter, lambda squared over lambda squared plus sigma squared goes to 0, which means whenever the singular values are large corresponding to the long wavelength parts, which is where the signal is, you depend on the data. Whenever you have high frequency bits corresponding to small singular values, you basically are not going to be dominated by this term. So if sigma is small compared to lambda, this term is going to be much smaller than that, and so you're going to be biased towards the prior.
So the intuition that you have is: whenever you have information that you trust-- long scale, large time-- you believe the data. The high frequency parts, the small scale parts, where you don't believe the data, you basically just throw out, because they're likely to be noisy.
AUDIENCE: Where is this notion of high and low frequency coming out in this?
LAKSHMINARAYANAN MAHADEVAN: High and low frequency corresponds to whether the singular values are large or not. So think large singular values correspond essentially to eigenvectors which are varying over the scale of the system, or singular vectors which have nonzero components over the entire system, while the small singular values correspond to things which are essentially varying over individual bits in the matrix itself, so on a very small scale. So large singular values, it's large scale. Small singular values, it's small scale. And so wherever you have small singular values, you switch to the default. Wherever you have large singular values, you use the data.
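A minimal numeric illustration of these weights (the singular values and lambda are invented): w_data = sigma squared over sigma squared plus lambda squared and w_prior = lambda squared over the same denominator sum to one, and switch from "trust the data" for large singular values to "fall back on the prior" for small ones.

```python
import numpy as np

lam = 1.0
sigma = np.array([100.0, 10.0, 1.0, 0.1, 0.01])   # invented spectrum

w_data = sigma**2 / (sigma**2 + lam**2)    # weight on the data term
w_prior = lam**2 / (sigma**2 + lam**2)     # weight on the prior term

# Data dominates only where sigma >> lambda; the weights always sum to one.
print(np.round(w_data, 4))
print(np.allclose(w_data + w_prior, 1.0))
```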
So noise is no longer amplified. The last bit is now to ask yourself, how do you choose lambda? I told you that if lambda is large, you're biased towards the prior. You're biased towards the default solution. If lambda is small, you're biased towards the data driven solution. And there is a rule of thumb which we sort of know which is indicated using this plot. I'll show it to you in the context of an example in just a minute.
So what's plotted over here, the vertical axis is the prior. How far am I from the default solution? So as I go further and further up, that's bad for the prior, which means the prior doesn't matter. I'm dominated by the data driven solution.
The horizontal axis is the misfit. So small values of d minus Af correspond to moving towards the origin. But as I move towards the origin, I'm very susceptible to noise. And large values of this are associated with, essentially dominated by, the prior.
And so typically, what you'll find is that if you plot the prior norm relative to the misfit norm, for small values of lambda, you're here. For large values of lambda, you're over there. And that typically is a very rapid change in many, many problems.
And you want to pick the lambda associated with this rapid transition because too small a value of lambda, you're too susceptible to noise. Too large a value of lambda, you're ignoring the data completely. You're using only the prior. And so in this sweet spot, you're essentially not susceptible to noise on the one hand, but you're still being driven by the data.
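A numeric sketch of this trade-off, with an invented ill-conditioned matrix, noise level, and lambda grid, and with the default solution f infinity taken as zero so the "prior norm" is just the norm of f: sweeping lambda traces the two arms of the L-curve.

```python
import numpy as np

rng = np.random.default_rng(4)
# Invented ill-conditioned problem: columns scaled over 4 decades.
A = rng.standard_normal((30, 10)) @ np.diag(np.logspace(0, -4, 10))
f_true = rng.standard_normal(10)
d = A @ f_true + 1e-3 * rng.standard_normal(30)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
misfits, solnorms = [], []
for lam in np.logspace(-4, 1, 6):
    f = Vt.T @ (s * (U.T @ d) / (s**2 + lam**2))   # Tikhonov, L = identity
    misfits.append(np.linalg.norm(d - A @ f))      # horizontal axis
    solnorms.append(np.linalg.norm(f))             # vertical axis

# As lambda grows, the misfit grows and the solution norm shrinks:
# the two ends of the L-shaped curve.
print(np.all(np.diff(misfits) > 0), np.all(np.diff(solnorms) < 0))
```

The sweet spot is the corner between the noise-dominated arm (small lambda) and the prior-dominated arm (large lambda).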
I want to show you now an example from blurring and deblurring. And I'll just walk through this relatively quickly. So imagine that I have a set of pixels which essentially characterizes an image on a grayscale. And I take this pixellated image. For example, if I basically have a particular location, I have a particular value of intensity which is nonzero only at one location and zero everywhere else.
Now I blur it. When I blur it, I basically say that instead of having only one pixel, I also have it at some neighboring pixel. So I take the image and I convolve it with this blurring operator. So that's my actual image, which is a delta function associated with nonzero at one location and being zero everywhere else, with coefficients which tell me essentially at which locations the pixel is nonzero.
And I have this blurring kernel. And if I essentially operate with the blurring kernel on the actual image, I get my data. So this is my matrix. This is my image. And if I basically operate the matrix on the image, I get out my data.
So written out explicitly in the formalism we just had, y is x convolved with h, or in matrix notation, Ax plus the noise. And here is the example. Start with text. You blur it, and you can barely read it.
And now, you basically first try to invert that image bare-handed. When you invert that image, this is what you get. It doesn't really matter how you did it. You just invert the matrix using, for example, MATLAB and the singular value decomposition, and you get a mess.
You now do what I just told you. You stick in a regularization with lambda equals-- I can even barely read it-- 0.1. No real improvement. Lambda equals 1, not really much better. Lambda equals 10, maybe you can start to read it now. I hope you can read it. Lambda equals 100.
And I'll start reading. "The ill conditioning of problems does not mean that a meaningful approximate solution cannot be captured. Rather, it simply means that linear algebra cannot be used in a straightforward way to compute such a solution. Instead, we need to have more sophisticated methods. And the sophisticated method is the Tikhonov regularization."
You increase lambda even further, and now what happens is that you're eventually biasing towards the prior. So you're starting to smear out things more and more. And if I actually plot on the vertical axis the prior norm and on the horizontal axis the data misfit norm, you see exactly what I showed you before, this L-shaped curve.
So here, in this zone, the data is not telling you anything. It's all driven by the prior. While here, the data is telling you everything but it's too susceptible to noise. And somewhere over here, you have the best of both worlds. So lambda equals about between 10 and 100 is where you essentially are doing much better. And I'm sorry this doesn't show up as well, again, on the screen. [INAUDIBLE] resolution. Question.
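The L-shaped trade-off can be reproduced with a small lambda sweep. Again a hedged sketch with assumed toy data, not the lecture's images: as lambda grows, the data misfit can only increase while the prior norm can only decrease, and the corner of the resulting curve marks the sweet spot:

```python
import numpy as np

# A sketch of the L-curve (assumed toy problem, not the lecture's data):
# sweep lambda and record the data-misfit norm against the prior norm.
rng = np.random.default_rng(1)
n = 40
idx = np.arange(n)
A = np.exp(-0.5 * ((idx[:, None] - idx[None, :]) / 2.0) ** 2)
A /= A.sum(axis=1, keepdims=True)

x_true = np.sin(2 * np.pi * idx / n)               # a smooth "scene"
y = A @ x_true + 1e-3 * rng.standard_normal(n)

lambdas = [1e-6, 1e-4, 1e-2, 1.0, 100.0]
misfits, priors = [], []
for lam in lambdas:
    x_lam = np.linalg.solve(A.T @ A + lam * np.eye(n), A.T @ y)
    misfits.append(np.linalg.norm(A @ x_lam - y))  # how well the data is fit
    priors.append(np.linalg.norm(x_lam))           # how "wild" the solution is

# Too small a lambda: tiny misfit, but a large noise-dominated solution norm.
# Too large a lambda: the solution collapses to the prior and the misfit blows up.
for lam, m, p in zip(lambdas, misfits, priors):
    print(lam, m, p)
```

Plotting `priors` against `misfits` traces out the L-shaped curve from the slide, and the lambda at the corner is the one you want.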
AUDIENCE: What would be a prior to something?
LAKSHMINARAYANAN MAHADEVAN: In this particular case-- I am glad that you asked. Ask me later. I've forgotten what the prior was over here. It's something very innocuous. The prior was a uniform grayscale because what you saw previously was incredibly ragged, jagged, and you're just smoothing it out. That's it. It was something very, very innocuous. So you're just switching from the super jagged to tabula rasa, and in between is where all the interesting information is.
So I want to finish in the next 10 minutes or so by basically telling you, how do I re-couch everything that I've told you, which is, at the end of the day, just a division problem. At least in linear algebra, it's just a question of inverting a matrix, easier said than done. How do I rethink all of this now in a probabilistic context?
Why do I want to think about it in a probabilistic context? Because if I can start to think about it that way, then I may have access and recourse to a whole bunch of other tools to solve this problem. Linear algebra, I want to invert matrices. In probabilistic contexts, I may be able to essentially use stochastic methods. I may be able to just sample. And by sampling sufficiently and sampling severally, I may be able to do a better job. So it's useful to ask whether there's an analog.
So what is the probabilistic analog to inverse problems? Again, I rephrased the problem. I just changed my notation. Rather than a times f is equal to d, now it will be y is equal to cx. Or in the non-linear regime, y, the image, is equal to some operator, f, which could be associated with the optics of the eye and all the processing that happens afterwards operating on the scene, which is what I'm given, plus noise.
So the forward problem in the probabilistic setting is to find the probability density of y, given the probability density of x. I'm not now couching this in deterministic language. In the probabilistic language, I'm given the scene. Find the image. But rather than find the image, I'm saying, find the probabilistic distribution of the image.
So that would be written down as follows. If f is the probability density, then f of y given x -- given the scene, find the image -- would be the probability density f sub n of the noise, for -- I will tell you in a minute -- a multivariate Gaussian, evaluated at n equal to y minus f of x.
So n is the noise. I need some distribution for the noise. And that distribution for the noise will essentially tell me what the probabilistic distribution is for the forward problem. You give me the scene, and I will tell you what the image is. I'll now give you the image not as a deterministic quantity. I'll give it to you as a probabilistic distribution function.
The inverse problem would be to find the inverse probability density. Given the image, find the scene. Or equivalently, in the language of probability, conditional probability, given the image, tell me the scene. Given y, please tell me what x is, which in the deterministic case would be inverting that. Everybody with me?
How do we do this? How do I go from a conditional probability of y, given x, to find out a conditional probability, f of x given y. I think everybody over here knows. The way you do it is to use Bayes' theorem. You basically use the idea of going back to Bayes, which says that f of x given y is f of y given x times f of x divided by normalization, which is f of y.
There are names for these things. f of x is the prior: in the absence of any information, what is it that I know about the scene? f of y given x, which is the forward probability density, is called the likelihood, which reflects your ability to have some sort of model: given the scene, what would be the image? The left side is called the posterior, and f of y is just the normalization factor.
So the Bayesian approach to inverse problems is to essentially try and ask, what should I do in order to maximize the posterior? If the prior was a constant, I want to just maximize the likelihood. And typically, because maximizing the likelihood and maximizing its log are the same thing -- the logarithm is a monotonic function -- you typically use the log. It's much easier.
Why the log? Two reasons. One, it's monotonic. Two, because if I have Gaussians then I have exponentials. And logs of exponentials are nice things [INAUDIBLE].
So the Bayesian approach to these inverse problems depends on whether I have-- if I have a uniform prior, maximizing the likelihood is good enough because that's the same as maximizing the posterior. And if I have a nontrivial prior, then I really want to [INAUDIBLE].
How does this relate to everything that we've seen? So maximizing log of f of x given y, there's going to be a constant which I will get when I take the log of that, which is what I have over here. And then there is a notation which will become obvious as soon as I show you the next slide. You have two contributions, one which is the log of this bit, the likelihood, and the other one which is the log of that bit, which is the prior.
The negative log of the likelihood is called the misfit because that's telling you about the model. The negative log of the prior is called the preference because that's telling you how much I trust what I actually know. So in general, you want to essentially maximize this quantity. I've already told you that maximizing this quantity is equivalent to minimizing the misfit or minimizing some combination of the misfit and the preference, depending on whether [INAUDIBLE].
I want to finish now by essentially working through this for the same problem that we did deterministically, which was d is equal to af plus n. But over here, slightly differently, it is y is equal to cx plus n. So if my noise distribution is a multidimensional Gaussian with zero mean and covariance matrix gamma, then f of y given x -- given the scene, what is the image, the forward problem -- is f of the noise evaluated at y minus cx, because that's my model, because I have a linear model.
And if I have a multivariate Gaussian, then this is my distribution: 1 over 2 pi to the n over 2 -- n is the dimension -- times the square root of the determinant of the covariance matrix, and the exponential. And you recognize that this is just your usual Gaussian, except now written down in vector and matrix notation.
That's my likelihood. And you'll see immediately that if I take the log of this and I take two times that and I multiply by a minus sign, then I get my misfit, which is this quantity over here, and you should recognize this. What do you recognize this as? This is just least squares. In particular, it's least squares up to this factor, gamma.
So if I had independent, identically distributed random variables, then gamma would be a constant on the diagonal, which I can pull out. And that constant will be what? It will be just the variance. So if my noise is independent and identically distributed, gamma is the variance times the identity. Then this quantity will look like that. And up to the constant sigma squared, y minus cx transpose times y minus cx is just the 2-norm squared.
So what have we seen? We've seen that minimizing the least square error for inverse problems which are linear is exactly the same as maximizing the likelihood if I have Gaussian noise. So now we've essentially been able to convert the deterministic description to the probabilistic description. And if I have a probabilistic description, now I can throw out all the probabilistic tools to solve these problems, rather than only linear algebra. There are assumptions, as I said, the key one being that I'm assuming that I have Gaussian noise.
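A small numerical sketch of this equivalence, on assumed toy data rather than anything from the lecture: the least-squares estimate for y = Cx + n is exactly the maximizer of the Gaussian likelihood, i.e., the minimizer of the negative log-likelihood:

```python
import numpy as np

# A sketch (assumed toy data): for y = Cx + n with iid Gaussian noise, the
# least-squares estimate is exactly the maximizer of the likelihood.
rng = np.random.default_rng(2)
m, k = 30, 3
C = rng.standard_normal((m, k))
x_true = np.array([1.0, -2.0, 0.5])
sigma = 0.1
y = C @ x_true + sigma * rng.standard_normal(m)

def neg_log_lik(x):
    # Gaussian negative log-likelihood: the least-squares misfit plus a constant.
    r = y - C @ x
    return 0.5 * (r @ r) / sigma**2 + 0.5 * m * np.log(2 * np.pi * sigma**2)

x_ls, *_ = np.linalg.lstsq(C, y, rcond=None)       # minimize ||y - Cx||^2

print(x_ls)                                        # close to x_true
print(neg_log_lik(x_ls))                           # no other x does better
```

Perturbing `x_ls` in any direction only increases `neg_log_lik`, which is the analytic statement that least squares and maximum likelihood coincide under Gaussian noise.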
What happens if there are singular values? Same question that I asked before. In particular, what happens if I have singular values of c which are small? Again, I have an ill-posed problem. Again, I have to basically think about regularization. And now what I do is instead of essentially maximizing likelihood, I say, really what I want is f of x given y. And f of x given y is f of y given x times f of x.
And what is f of x? It's my prior. It's what I actually think the scene is. Somebody told me or I had experienced. So I need now to build in the fact that my prior is not uniform, that my prior is essentially localized in some region associated with a particular scene. Does that make sense?
So how do we do that? We basically say again, for simplicity, and for a reason which will become even more clear mathematically in a minute, I basically say my prior has also got a Gaussian distribution with a different covariance, p. And now I calculate the log of this quantity multiplied by 2. And I have just exponents showing up.
And the maximum posterior estimate would correspond to minimizing E plus S, where E was the misfit part and S was the preference part. And this is what it looks like when L is the identity.
If I now compare that with this quantity, what do you see? This is now the Tikhonov regularization problem. The Tikhonov regularization problem basically said, I want to minimize a weighted sum of the Euclidean norm, which is driven by data, and a smoothed-out prior norm, which depends on how far I am from a feasible solution, and the regularization parameter is lambda. And you look at this and you look at that, and you see they're essentially identical.
What does that say? That in fact, Tikhonov regularization for linear inverse problems is identical to a maximum a posteriori calculation using Bayes' theorem if my prior is Gaussian and if the noise in my linear system is also Gaussian. So again, what I'm showing you -- this has been known for a while -- is that the deterministic problem has a perfect analogy to the probabilistic problem.
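This equivalence is easy to check numerically. A sketch with assumed toy matrices (not the lecture's example): with noise covariance sigma-squared times the identity and a zero-mean Gaussian prior with covariance p-squared times the identity, the MAP estimate coincides with the Tikhonov solution at lambda equal to sigma squared over p squared:

```python
import numpy as np

# Sketch of the equivalence (assumed toy data): with Gaussian noise of
# covariance sigma^2*I and a zero-mean Gaussian prior of covariance p^2*I,
# the MAP estimate for y = Cx + n is the Tikhonov solution at lam = sigma^2/p^2.
rng = np.random.default_rng(3)
m, k = 20, 5
C = rng.standard_normal((m, k))
y = rng.standard_normal(m)
sigma, p = 0.5, 2.0

# MAP: set the gradient of (misfit + preference) to zero and solve.
x_map = np.linalg.solve(C.T @ C / sigma**2 + np.eye(k) / p**2,
                        C.T @ y / sigma**2)

# Tikhonov: minimize ||y - Cx||^2 + lam*||x||^2 with lam = sigma^2/p^2.
lam = sigma**2 / p**2
x_tik = np.linalg.solve(C.T @ C + lam * np.eye(k), C.T @ y)

print(np.allclose(x_map, x_tik))   # → True
```

The two normal-equation systems are the same up to an overall factor of sigma squared, so the solutions agree to machine precision.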
And again, I can start to deploy the tools that we now have at our disposal to solve the probabilistic version rather than the deterministic version. I'm just about done, more or less, I think, as we planned.
So really, what I've done is to tell you one very, very simple thing: that dividing by a small number is a bad idea. And if I want to essentially get around that, I need to make sure that I control what I divide by. And the idea of controlling what I divide by is the idea of smoothing. And so there are these two things which are constantly battling every time I'm solving an inverse problem: how well I want to essentially invert, the accuracy metric, and how smooth I want the solution to be, which is saying something about priors. Somebody has to tell you that the solution should not have all these jagged edges. If there was signal in that and you smooth it out, you lost it. You have to have some notion of how much belief there is in that part.
AUDIENCE: Even though they're equivalent, is there any intrinsic advantage or disadvantage to assuming a deterministic or a stochastic framework?
LAKSHMINARAYANAN MAHADEVAN: I think with the advent of a lot of compute power, stochastic approaches are going to become a lot easier to explore, especially if you don't know how to do it. You don't have a good way to essentially invert. A priori, no, but there is a bit of caution that needs to be drawn.
In general, all these problems that we're talking about have noise. So in either the deterministic case, where you sort of think of noise as an afterthought, or the probabilistic view, where the noise comes front and center, there is a key assumption that was made. And the assumption that's made is that the noise is well behaved -- in other words, Gaussian, and [INAUDIBLE]. And essentially, it has all kinds of nice properties associated with having an expectation value and a variance and [INAUDIBLE] and so on and so forth.
So if that's true-- in many signals, it's not true. And if it's not true, you've got to revisit everything. And unfortunately, solving inverse problems is still a bit of an art as soon as you go beyond this very, very simple minded framework. This is a very powerful framework because it allows you to move around and make connections.
But as I've written down, and that's where all the dragons are, how do I essentially do this recursively? How do I use this learning? What do I do with non-linear problems? What do I do when I have non-Gaussian noise? And what do I do if I start giving up the norms, the L2 norms, if I give up the Euclidean norm?
It's very much part of navigating through this territory that we [INAUDIBLE] now. So I don't have a precise answer except that it depends on the kind of noise.
AUDIENCE: And regarding the problem of dividing by small numbers, unless you actually divide by zero, the solution is still continuous.
LAKSHMINARAYANAN MAHADEVAN: Indeed. Finish your question.
AUDIENCE: So I'm just thinking maybe you should think about how well is the problem solved, and not just-- I mean, according to the definition--
LAKSHMINARAYANAN MAHADEVAN: The definition of ill-posedness basically says that if I have small changes in the data, will I get small changes in the solution? And if I get large changes in the solution, that's very bad news.
AUDIENCE: It's not exactly continuity. It's more like stability.
LAKSHMINARAYANAN MAHADEVAN: It is stability, but stability is defined as continuous dependence on initial conditions. And by continuous dependence, what I'm saying is, what is the slope associated with the changes in the solution for changes in the data? That's intimately related, in this problem, to the condition number of the matrix. So yes, they're all connected, and we should be careful in the limit. Andrew?
AUDIENCE: I guess just to play devil's advocate, so what's small? So say I have something that looks diagonal, but every entry is 10 to the minus 10, right? In that case, small is relative. Everything on there is small. But maybe if I'm dealing with [INAUDIBLE] that 10 to the minus 10 is really small [INAUDIBLE], I don't care.
LAKSHMINARAYANAN MAHADEVAN: Yeah. So what is small is defined in terms of a parameter which I should have introduced, which I didn't, which would be the ratio of the largest singular value to the smallest singular value, because that's what really matters. If all the singular values are in the order of 10 to the minus 10, I just divide by 10 to the minus 10 and I'm good.
What really matters is how much amplification happens at different scales. That's precisely what somebody asked me about. These length scales matter: large singular values are associated with large-scale features of the scene. Small singular values are associated with small-scale features of the scene -- the hair on my head, or whatever little is left, and so on and so forth. So what really matters is the condition number or, equivalently, sigma 1 over sigma r, where you basically are looking at [INAUDIBLE].
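The point that "small is relative" can be made in a few lines of NumPy (an illustration, not from the lecture): a matrix whose entries are all 10 to the minus 10 has condition number 1 and is harmless to invert, while an order-one matrix with a wide spread of singular values is the dangerous one:

```python
import numpy as np

# Sketch of "small is relative": what matters is the ratio sigma_1/sigma_r
# (the condition number), not the absolute size of the singular values.
A_tiny = 1e-10 * np.eye(4)            # every entry is tiny, but well conditioned
s_tiny = np.linalg.svd(A_tiny, compute_uv=False)
print(s_tiny[0] / s_tiny[-1])         # condition number 1: inversion is harmless

A_bad = np.diag([1.0, 1e-2, 1e-6, 1e-10])   # O(1) entries, huge spread
s_bad = np.linalg.svd(A_bad, compute_uv=False)
print(s_bad[0] / s_bad[-1])           # condition number 1e10: noise amplified
```

In the first case you just divide everything by 10 to the minus 10 and you're fine; in the second, inverting amplifies whatever lives along the small singular directions by ten orders of magnitude.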
AUDIENCE: I have a question about the notion of stability.
LAKSHMINARAYANAN MAHADEVAN: Can you speak up, please?
AUDIENCE: Yeah, sorry. I have a question about the notion of stability. I don't think this is [INAUDIBLE] for linear systems, but if you have a system where you have qualitatively different behavior depending on different conditions, you get discontinuity in solutions -- but not everywhere in the space, just in a part of the space. Is that an ill-posed problem? Like a bifurcation, for example.
LAKSHMINARAYANAN MAHADEVAN: Very good. So a bifurcation, in the neighborhood of a bifurcation, a dynamical system, you don't have a strict notion of stability according to what I just said. However, what you can do is something called unfolding the bifurcation. So you ask which modes are the ones that are growing as you move through the instability, pull those out, and look at the remaining part.
And so you essentially use the fact that hopefully everything doesn't go berserk on you simultaneously. By everything, I mean all scales. There is a particular scale on which things happen. If that's the case, you pull that out, you look at the rest of the system, and then you deal with it separately. You just basically break up the system dynamics into a part which is varying rapidly associated with the growth of instability and the rest.
As long as you have the separation, as long as you have a small number of modes which are going unstable -- small number of modes meaning, for example, if I have everybody milling around in this room simultaneously compared to just a few people moving out, the first is a much harder problem than the second one because I don't necessarily have scale separation. If I have scale separation, you can use the same ideas: I essentially break up the problem into little bits and then bite into each one separately. And if I'm being vague, I'll sit down with you afterwards and essentially [INAUDIBLE] more precise.
There was another question somewhere. No? That bell probably is a good time for us to take a five minute break, and then we will continue with Tommy, continue talking about generalizing from this in the context of--