Question Answering for Language and Vision (40:04)
July 5, 2017
August 22, 2016
Brains, Minds and Machines Summer Course 2016
Richard Socher, MetaMind (A Salesforce Company)
A model based on dynamic memory networks casts many aspects of natural language processing as question-answering tasks. The model uses gated recurrent units from recurrent neural networks and can also be applied to the task of answering questions about natural images.
RICHARD SOCHER: All right. Thanks for coming, everybody. It is probably one of the latest talks I've given in a long time, so I'll try to keep it entertaining. We'll have cool live demos towards the end, so that'll be fun.
You can say like, I want to put this and this in. And then I can play with it until it breaks. I'm sure it will eventually.
I do think language is kind of the most interesting manifestation of intelligence in some ways. I feel like it's what distinguishes us the most from other animals, along with the kind of power that we have with it. So today I want to talk to you about question answering, for both natural language understanding and computer vision.
And maybe I will start with asking this question to all of you, which is, maybe we can cast all the different natural language processing problems as question answering problems. What do I mean by this? So let's go through a couple of examples.
The first one is what we traditionally see as question answering which is you get a couple of logical kinds of facts. In this case, Mary walked to the bathroom, Sandra went to the garden, Daniel went back to the garden, Sandra took the milk there. And you ask, where's the milk?
And you kind of have to logically reason. You might go through this and find facts about the milk. All right, Sandra took the milk there. And now you have to do some co-reference resolution, or anaphora resolution: find what "there" refers to. Maybe the last time Sandra was mentioned, she went into the garden. The answer is garden. So it's some relatively simple transitive reasoning.
That's kind of what we would usually associate with question answering. But you can go further and say, well, the statement, the input is everybody's happy. And the question is, what's the sentiment? And the answer is positive, something we would usually consider as a text classification problem instead of a question answering problem.
You can then go further and say, well Jane has a baby in Dresden. And you can ask, what are the named entities? This is usually seen as a sequence tagging problem. But again, we can cast it as a question answering problem, same with part of speech tagging.
And even other more complex things like, what's the translation into French of this sentence? And the answer is again just the translation. So at that point, you might say, well, fine, Richard. That's maybe mildly interesting, but it's not a very useful abstraction.
Until you say, well, let's take that idea, which indeed I may have had after a couple of drinks at NIPS with some friends of mine, and make it useful: all right, let's try to build a single joint model for any arbitrary question answering problem. And as soon as we can actually throw any question at it and it gives us back the right kind of answer, we think we've made a significant step towards natural language understanding.
Now there are two major obstacles right now to make that happen. The first one is that we don't even have a single architecture, let alone a single model where all the weights are shared, we don't even have a single architecture that gets consistently state of the art performance across a bunch of different tasks.
So for question answering, this kind of logical stuff I just showed you, the best model was a strongly supervised memory network from Jason Weston and some coworkers at Facebook. Sentiment analysis had Tree-LSTMs, tree-structured long short-term memory architectures, from Kai Sheng Tai. And part-of-speech tagging had a bidirectional LSTM-CRF model. So you may notice that all of them are now somewhat neural, or at least have that word in their title, but they're all different kinds of architectures.
The second obstacle then is that fully joint multitask learning is really, really hard. When you try to do that, it means you can't do any per-task hyperparameter optimization anymore, because it's all the same model. And you're not allowed to change your hyperparameters, like the size of hidden layers and things like that, anymore.
Whenever we do this, then we usually don't actually mean that we share everything. We usually mean at least you change your final classifier and maybe you share some of the features. We also often, when we look at transfer learning and multitask learning, we only look at the final task. There's even sort of the source and the target tasks that a lot of papers define.
And the accuracy that is mentioned in the end is only on the target task, and everybody ignores the source. In computer vision, you would say, oh look, this pre-trained AlexNet worked so well on my task. But after you train it on your task, does anybody ever go back and look at how much worse they now do on ImageNet? No. Nobody mentions that.
And so none of these papers, I feel like, are actually getting us in the right direction of having a single model that can do all of these different tasks. When we do share some parameters, it's usually restricted to much lower layers. So in natural language processing, we have, for instance, word vectors that we share now. And in computer vision, we have the lower layers of a CNN, which are essentially just edge filters and colored blobs and things like that.
If we then do it and show, in this sort of restricted setting, some improvements, it's because all the tasks are related. But of course, humans are able to learn across unrelated tasks: I can learn how to ride a bike and learn a language, and just because I get really good at Chinese doesn't mean I'm going to forget how to ride a bike, right? So ideally, a good model would have some ability to continuously get better when the tasks are related and not get worse accuracy when the tasks aren't related.
So basically, this is a very long sort of distant goal that we have to create a single model. Now this talk, I'll basically tackle just the first obstacle which is try to find a single architecture where we still are allowed to change the hyperparameters in a variety of ways, but at least try to have a single model that gets consistently state of the art accuracies across a wide range of different tasks. And that's exactly what led us to this dynamic memory network.
So on a very high level, the idea is that we might want to allow the model to take multiple passes over an input. So if I give you, for instance, this story here, it's relatively simple, a bunch of people just going to different places, and then you ask where they are. But if I now ask you to read the story, read a little bit.
And now I ask, like, where's John? It's hard for you to actually be able to do this unless, of course, they're very important stories of your life. Like you might remember the first time you ever met your wife or something or somebody becomes your wife, right? Or first time you had some important life event.
So generally, it's hard for us to store everything in our working memory. And we hence want to allow the model to take passes over past experiences. Now, who here is familiar with recurrent neural nets? Ah, it's roughly half.
All right, so sorry for that half. I'll go over that really quickly. The main idea is you have a vector representation for each word. You can for now assume that vector representation, in the simplest form, is randomly initialized. In slightly more sophisticated forms, it captures co-occurrence statistics of large corpora.
And then we basically have a single neural network function whose parameters are shared at each time step. So we basically put in just a simple linear layer plus nonlinearity, where all the weights at each time step are shared. That's the simplest form.
Unfortunately, when you make the same kind of transformation at every time step, it's a very strong assumption that every word you add to your current hidden state modifies the state in its entirety. And so instead-- and I'm trying to keep the equations to a minimum, this is probably the only slide that has a ton of equations-- they're all very simple.
Who is familiar with a single layer neural network? Cool. All right. Basically, we're just putting them together. That's kind of all deep learning is in many cases, just putting layers together. And in our case, in a particularly simple way.
So you can just imagine, for instance, that you have your xt, the word vector at time step t-- again, it can initially just be randomly initialized. And ht minus 1 is the hidden state from the previous time step. And you can kind of assume that you just concatenate those two and have a single layer neural network. But we split it up so that you first multiply xt with Wz in the first element of this equation, and then you do the same with ht minus 1.
You add a bias term, and then you have an element-wise nonlinearity, which in this case is a sigmoid. So if zt were just our final hidden state, this would be a standard recurrent neural network. Now what we instead do is we do this twice with two different sets of parameters, W with superscript z here and r.
And we'll call these our update and our reset gates. And now where it changes slightly from a standard recurrent neural network to this gated recurrent unit is that we have this reset gate here and we have an element-wise multiplication. Now, what does this do?
Basically, if the reset gate here is-- and remember, this is sigmoid, so all the numbers here are between 0 and 1-- if this reset gate here is all 0, what that means is we basically will ignore all the stuff from the past. And we say what you just read, that word is more important than anything else in the past.
So for instance, if you're in sentiment and you just ask like, OK, what's the sentiment of this sentence? And it first talks about the plot and how Mary met Jane. And they went over to Australia and blah, blah, blah, none of that matters.
And then you say, the movie was awesome. Now, awesome is like the one thing you actually have to care about. So you can essentially ignore all the stuff from the past and let awesome, the word vector xt that you have at that time step, determine most of this intermediate ht hidden state.
Now of course, that's very crude. These are all different numbers. r is a vector as well. So you can kind of assume what I just said, but for a subset of the neurons at each time step.
All right. Bear with me, one last line. And the second one here is this update gate and basically, it kind of allows you to do the opposite which is ignore currently what's going on. You can say, again, simple example of sentiment, this movie was awesome. Now let me tell you about the plot, blah, blah, blah, blah, blah.
And basically, if zt here was all 1's, then you basically multiply all these 1's here with the previous time step. And you do 1 minus 1, it's 0. And you ignore what's currently going on at this time step. So it allows you now to copy bits over over very long time steps.
And if you're familiar with backpropagation, which we use even for simple neural networks, what this allows us to avoid is the vanishing gradient problem. Because if your update gate here is very large, you can now propagate many time steps into the past without getting a smaller and smaller gradient. And that allows us to have much longer dependencies on time steps from further in the past.
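Putting the three equations together, here is a minimal sketch of one GRU step in NumPy. The weight names and the params layout are my own, for illustration only, and I follow the convention used in the talk, where an update gate z near 1 copies the previous hidden state forward:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, params):
    """One GRU time step. params maps each gate name to a tuple
    (W, U, b): input weights, recurrent weights, and bias."""
    Wz, Uz, bz = params["z"]  # update gate parameters
    Wr, Ur, br = params["r"]  # reset gate parameters
    Wh, Uh, bh = params["h"]  # candidate-state parameters
    z = sigmoid(Wz @ x_t + Uz @ h_prev + bz)              # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)              # reset gate
    h_tilde = np.tanh(Wh @ x_t + Uh @ (r * h_prev) + bh)  # candidate state
    # z near 1 copies the past state forward; z near 0 takes the candidate.
    return z * h_prev + (1.0 - z) * h_tilde
```

If r comes out near all zeros, the candidate state depends only on the current word, which is the "the movie was awesome" case from the sentiment example.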
All right, so this is our basic unit. It's these couple of almost all straightforward single layers plus a couple of element wise products. Straightforward chain rule for all the stuff that we're going to train this with. Now let's put this basic LEGO block together in this kind of monster of a model, and I'll walk through it at a very high level and then we'll zoom into each of the different modules.
One thing we tried to do here is actually modularize our models. Each module here will communicate with all the other ones simply based on vectors. So during forward propagation, you will just send whenever there's a line going from one module to another, you will send vectors. And during back propagation, you send error signals back.
So this whole model can be trained in an end to end fashion-- which end to end is kind of a magical word for us deep learning people. We want everything to be trained. Here's my error, here's my raw input, the model figures out everything in between. And if not, then it's like, eh, it's kind of hacky and pipeliney.
Now, what's going on here? So let's walk through a simple example. Let's say this was my story. We have a bunch of different sentences here. Mary got the milk there, John moved to the bedroom, and so on. And we have a question, where is the football?
Now what this model will do is it will take the question, it will pipe it through a simple GRU and get the vector representation for that question. And then it will take that and condition an attention mechanism and basically conditionally pay attention to only those facts that actually matter to this current question at hand. So let's say the question is, where is the football?
Well, basically what the model learns in its first pass is to look for anything that mentions a football. Fairly straightforward. So we might go through here and say, all right, John put the football there. But then maybe this is less important, because there's another fact here that John put down the football. And maybe that's the most important one, because it's the last time in this sequence that the football was mentioned.
So what the model now does is put a large attention mechanism on this which essentially will just be another neural network. And whenever it pays a lot of attention to a certain state, it will give that state as an input to yet another GRU. And that GRU, in our what we call episodic memory module-- don't over-interpret that too much, we're not neuroscientists-- but the main idea here is you can now selectively pay attention based on a trigger onto the past hidden states of the time sequences that you've seen.
So that's kind of the neurosciencey part of the neuroscience sort of motivation. Now this GRU will basically agglomerate all the facts that have been useful so far and that have been paid attention to. But at the end the model might decide, well, I still don't know enough. Just because I know that John put down the football is not enough to actually answer where the football is.
But now that I know that John and football are somehow related, maybe I can go over the inputs again. And now look for facts that mention John and the football. This is exactly what happens here.
So now we say, all right, John moved to the bedroom and John went to the hallway. The model learns that the later time step matters more. It agglomerates all these facts in our second memory state, m with superscript 2. That vector is given to yet another neural network sequence model, which then outputs the answer with a straightforward softmax, a logistic regression type layer.
So that's sort of the high level model. We have different modules inside: a question module that triggers an attention mechanism that then selectively pays attention to different inputs. Whenever there's high attention on a certain input, it gives that as input to the episodic memory module, which goes over the facts that seem relevant, agglomerates them, and gives the final hidden state to an answer module, which outputs the answer. So if you understood that, the next couple of slides will be quite straightforward.
The input module in the simplest form is a standard GRU, a standard recurrent neural network. You can use LSTMs there too, and in fact, it works even better if you use a bidirectional one, going from left to right and right to left, to create your hidden states. But the simplest form is just a standard recurrent neural network with these gated recurrent units.
Same with the question module. The final question vector, which we often abbreviate here as q, is just the last hidden state of the question GRU. So, two recurrent neural networks, straightforward. Now, how does the episodic memory module work, where we actually compute the attention?
So in the next slide, I'll describe how we compute this gi. The gi will be a simple function too, and it actually is a single number, not a vector. And it basically puts almost another GRU on top of the GRU, but not quite: it's just a high-level single number that says, should you do what you standardly do with the GRU, or do you just want to copy your entire hidden state, with all its elements, to the next time step?
So instead of the normal GRU, which has an element-wise product here, this gi, the gate or attention value, is a single number. And if gi is 0, we basically copy the entire hidden state of all the facts we've agglomerated so far. And intuitively, what that means is, if the sentence mentions that Sandra went back to the kitchen, it is so irrelevant to the question of where the football is that we just want to skip it entirely and not mess with our hidden state right now.
It's like when your professor tells you, remember this, and then he rambles on about some side story. And you just try to always remember that one thing. So now, how do we compute this gi? It's a fairly straightforward neural network.
We basically compute here a couple of element-wise similarities between our current sentence si, the question vector, and our current memory state of the facts we've agglomerated so far. At time 0, we just initialize the memory m to be the question. And these are just element-wise multiplications and element-wise differences.
So basically, these are just two ways to measure similarity between vectors-- the sentence vector against the question vector, and the sentence vector against the memory. And then we take this feature vector and pipe it through a straightforward two layer neural net. And then we have a softmax so that all our attention gates sum to 1.
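As a sketch, that gate computation might look like this in NumPy. The exact feature vector in the paper has more terms, so treat the feature list and the weight shapes here as illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # shift for numerical stability
    return e / e.sum()

def attention_gates(S, q, m, W1, b1, w2, b2):
    """Compute one scalar gate per sentence vector s in S, conditioned
    on the question q and the current memory m. Features per sentence
    (a subset of the paper's, chosen for illustration): element-wise
    products and absolute differences with q and with m."""
    scores = []
    for s in S:
        z = np.concatenate([s * q, s * m, np.abs(s - q), np.abs(s - m)])
        hidden = np.tanh(W1 @ z + b1)    # first layer of the 2-layer net
        scores.append(w2 @ hidden + b2)  # scalar score for this sentence
    return softmax(np.array(scores))     # gates sum to 1 across sentences
```

The softmax at the end is what makes all the attention gates over the sentences sum to 1, as described above.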
So now, in the simplest form, we actually give supervision of which information, which sentence, matters in what sequence. And so we can train the model to know when it's done or not. And if it's not done, it goes over the inputs again.
Now, this is my only slide mentioning neuroscience. The reason we kind of call this episodic memory is that episodic memory in humans is also the memory of autobiographical events, of times and places. And you can trigger it in some way. And in that sense, the question, through this attention mechanism, allows us to trigger getting back into the state that we were in at some time in the past.
And what's interesting also is that it's known that the hippocampus, the seat of the episodic memory in humans, is also needed for transitive inference. And when we only allow our DMN to go over the input once, it can also not do transitive inference well at all. And so there's some interesting correlation there.
Now the answer module, like I said, is also a straightforward GRU. And in many cases, it's even just a single time step. Or if we know that for that particular data set the answer is always only a single word, you can skip that and go directly to a straightforward softmax layer. All right.
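In that single-word case, the answer module reduces to a softmax classifier over answer words, conditioned on the final memory vector. A toy sketch, where the vocabulary and the weight matrix are made up for illustration:

```python
import numpy as np

def answer_module(m, W, vocab):
    """Single-word answer module: softmax over a candidate vocabulary,
    conditioned on the final memory vector m. W maps the memory to one
    logit per vocabulary word."""
    logits = W @ m
    e = np.exp(logits - logits.max())  # numerically stable softmax
    probs = e / e.sum()
    return vocab[int(np.argmax(probs))], probs
```

For the bAbI-style questions, vocab would be something like location words, and the model returns the highest-probability word as the answer.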
A couple of related works, but maybe not super interesting to go into the details right now. I think memory networks from Jason Weston are the most similar; they also introduced that data set, and they're the ones we compare to here. The main difference is that we use recurrent neural network sequence models for all the different modules, while they have these pretty specific functions that really only work for that exact data set and couldn't be used for other kinds of tasks, like the ones we're going to show really good results for as well.
So, a kind of broad range of applications. This is the bAbI 1k data set from Facebook with all these different tasks. So pathfinding, I think, is where you basically say-- actually, I'll just show you some examples.
All right, here we go. This one. It's kind of interesting. And you'll see also, once you look at this real data set, that it is still quite constrained in the kinds of things that it can do. And it also doesn't really get much better-- or it doesn't generalize very well at all.
So if you change it and try to combine two types of reasoning-- right now we trained a different model on each of the different tasks. So if you then tried to combine the indefinite knowledge with the pathfinding and the transitive reasoning, it's not going to work. It really can only do one kind of reasoning and nothing else. And I think the same is true for a lot of these models.
So here you basically say-- you give us a description of where all these things are in relationship to one another. And you have to kind of build a mental model now of the world in order to answer the questions such as, how do you go from the garden to the bathroom? You have to go North twice. That's pretty hard. Nobody seems to be able to build that mental model yet.
Unless you train it on 10,000 examples, and then it works. So this is just the 1,000 example version. So it doesn't seem to be a fundamental thing that it cannot ever do it; it's just hard with 1,000 training samples. Once you have more training examples, a lot gets easier.
So let's look into the demo. Let's start with these kinds of things. So this one here is a temporal reasoning task. So I'll give you a second to see if you can answer the question.
Anybody? Shout it out.
RICHARD SOCHER: That's right. Cinema. All right, let's do another one.
What color is Bernhard? Wrong. It's not green.
You all have to reason over the facts that are there.
RICHARD SOCHER: That's right. And what's cool is we can look at where the model pays attention, and indeed, it kind of goes over these facts. And indeed, some frogs can actually be yellow. It basically first pays attention to the fact that mentions Bernhard, knows to pay attention to frogs, then pays attention to the frog fact, knowing, OK, Lily is a frog.
Then it pays attention to the Lily fact, and then to the final pseudo-sentence that we have in there, which says, I'm done reading. Once it pays all its attention to that last sentence, the model knows it's done. And the cool thing is it does all of that quite discrete reasoning still completely through vectors.
We didn't give it like the lists of colors and a list of people and then if x is a y of that list type and so on. It's just like, yes, it's a lot of those kinds of stories, but it does pick up these patterns eventually entirely. So what gets cooler is I think all the next couple of results which is it's the same kind of architecture.
Now granted, they have different hyperparameters; we sometimes go over different numbers of episodes, the hidden dimensions are slightly different, and so on, but it's the same general type of architecture at least. And here we did sentiment analysis. And this is actually right now the most accurate model for sentiment analysis on a pretty commonly used benchmark data set, actually one that I developed in my PhD, called the Stanford Sentiment Treebank.
And it outperforms a variety of recursive, syntactically inspired models that I developed during my PhD, as well as convolutional neural networks, which I think make no sense for language, but that's just me. Also paragraph vectors from Google, and even this cool constituency tree long short-term memory structure.
And here the model is actually quite good. And so we can play around with a couple of different inputs, because it has actually seen a lot of different movie reviews. So just to give you a couple of examples here.
The best way to hope for any chance of enjoying this film is by lowering expectations. And if you now want to push the state of the art in sentiment, those are the kinds of sentences you have to get right. Because all the simple stuff you can get with Naive Bayes, n-grams, and an SVM.
Like whenever a sentence mentions awesome, in like 95% of the cases it's probably a positive sentence. But here you really have to be a little more clever. And it'll get that right.
Here's another example. The film starts out as competent but unremarkable and gradually grows into something of considerable power. Again, most simpler models would get this wrong, especially the previous example. If you look at words in isolation, or use topic models and stuff like that, those kinds of models would say, well, it's got best, hope, chance, enjoying, hardly any really obviously negative single words or bigrams, like, bam, it's positive.
But this actually can reason over the sequence. Now, we did some analysis. For sentiment, we just need two passes. And for some of these bAbI tasks, the very transitive reasoning ones, we really do need five passes to agglomerate all the facts. And this is if you don't give any supervision of which facts to pay attention to in what sequence; you just give it this long story with 20 to 50 different sentences, and the question, and the final answer.
And it needs to figure out which kind of sequence of sentences may have been relevant and so on. So for sentiment, we can actually look at-- I mean, we did also already for this logical reasoning-- but for sentiment, it's kind of interesting to look at the attention of the model as it goes over the inputs. And here you see the darker the spot is, the more attention it pays to that word.
And here we picked just a couple of examples that the DMN gets wrong with just a single pass over the input, and only gets right with two passes over the input. And again, at this point you're really pushing the state of the art here. Maybe not very significantly, but these are really, really hard examples.
So, in its ragged, cheap, and unassuming way, the movie works. It doesn't get it right if it just goes over the input once. But if it can go over it twice, you basically see the general pattern, which is that in the beginning it's a little cautious, pays more attention to things that obviously have some sentiment attached to them.
So cheap and unassuming are just adjectives. Adjectives should pay more attention, especially if they convey some sentiment. But then it basically pays a lot more attention to the movie actually working in the second pass than it did in the first and gets it right that this is actually positive. And also pays even more attention to unassuming and much less to cheap.
Another one here, which I kind of like: the best way to hope for any chance of enjoying this film is by lowering your expectations. Again, the first pass doesn't pay much attention to stuff. And the second pass really adds a little attention to lowering and especially to expectations.
And maybe two more. These examples with but as a contrastive conjunction are kind of interesting. Because in many cases, when you have an x but y structure, you really have to sort of pragmatically pay more attention to what comes after the but. You usually hedge and then you say oh, but let me relativize that and change my opinion.
And so, this film starts out as competent but unremarkable and gradually grows into something of considerable power really pays a lot more attention to the things that come after the but, which is kind of what you would want. And lastly, my response to the film is best described as lukewarm. Best in a single unigram model is just such a strong indicator: when you train it with just logistic regression and look at the features that matter most, best will usually have the largest weight for the positive class. So it can't help but really pay attention to best. But then it kind of agglomerates the sentence and the whole context, and it pays much more attention to lukewarm.
I like this one too. Despite the glowing reviews, this movie wasn't an especially surprising or interesting experience. Again, it learned sort of the despite structure and gets this right as well. And then lastly, and this is kind of more for bragging rights, I'm not a big fan of part-of-speech tagging as a task, but part-of-speech tagging is a task that's been done for over two decades. And literally, that same data set has thousands of citations, so you kind of know it's overfit at that point. And people have done sort of graduate student descent on their objective functions on it.
But it's the most accurate model. The same kind of architecture that did logical reasoning, that did fuzzy reasoning over sentiment, when you trigger the answer module at every time step instead of just at the end, you get an extremely accurate sequence model. So we can train that as well or look at that demo.
We can actually do named entity recognition too. Now, for named entity recognition, to really get the state of the art, there are these features called gazetteers. And this is kind of a BS feature, because it's a list of all the names of people, and all the cities in the world, and all the countries in the world, and all the corporations you can find. And you just kind of add that as a feature, and it's like, OK, that's not that interesting. So this is not the state of the art for named entity recognition, but it doesn't use gazetteer features or anything like that.
So now here's a cool story which is we have had a postdoc-- or I guess a former postdoc-- join us. And he was a computer vision guy, [INAUDIBLE]. He's awesome. And [INAUDIBLE] said, well, Richard, if your question answering system is so awesome, why don't we do visual question answering with it?
So he took the code and literally replaced only the input module, to not read in sentences sequentially but read in images sequentially, by just taking the layer below the fully connected layers, where you have a 14 by 14 grid, one vector for each of the 14 by 14 regions. And he had a GRU go over that. So he replaced that module with this input module for images and then ran it.
So this is the only change to the model. Basically, you have the CNN, get the 14 by 14 patches, a vector for each patch, and then have a neural sequence model go over all the feature vectors from all those patches. In this case, it's a bidirectional GRU instead of a unidirectional GRU. And what was amazing is that with that architecture, we uploaded it and had the best model on visual question answering in the world.
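The reshaping he describes, from a CNN feature map to a sequence of region vectors a GRU can scan, might look like the sketch below. The 14x14 grid matches the talk; the 512 channel count, the hidden size, and the plain tanh recurrence are simplifying assumptions standing in for the bidirectional GRU:

```python
import numpy as np

rng = np.random.default_rng(0)
feature_map = rng.random((14, 14, 512))  # stand-in for the CNN's output grid
regions = feature_map.reshape(-1, 512)   # sequence of 196 region vectors

# Scan the regions with a bare-bones recurrent update (a stand-in for the
# bidirectional GRU in the talk) to produce one hidden "fact" per region.
d_hidden = 64
Wx = rng.normal(scale=0.01, size=(d_hidden, 512))
Wh = rng.normal(scale=0.01, size=(d_hidden, d_hidden))
h = np.zeros(d_hidden)
facts = []
for x in regions:
    h = np.tanh(Wx @ x + Wh @ h)  # recurrent update over the region sequence
    facts.append(h)
facts = np.stack(facts)           # (196, d_hidden): one vector per patch
```

The episodic memory module then attends over these 196 fact vectors exactly as it attends over sentence vectors in the text model.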
And that was our ICML paper this year. And yeah, it outperformed a bunch of other very interesting models, some that included interesting modules and reasoning steps and things like that. And I said, well, this was kind of magical. Usually you fiddle around a lot with everything until you get really great performance.
So I was almost a little skeptical. And so we started looking at the attention. So this is kind of the visual equivalent of these plots where the brighter the area is, the more the model paid attention to that area of the image.
And this is completely unsupervised in terms of the attention; you don't have segmentations or anything like that. It's just, at training time you have the image, the question, and the answer. And you have to figure out which parts of the image you have to pay attention to in order to answer the question correctly.
So we started looking at a couple. And I was kind of impressed. Let's just go through some of them. What is the main color on the bus? The answer is blue.
And it indeed actually finds the bus and pays attention to that area of the bus. And we're like, OK, it's kind of a centered object. It's pretty big. It's not that crazy.
What types of trees are in the background? OK, it pays attention to the background. Kind of cool, but still like gigantic area of the image.
Now this one: how many pink flags are there? Answer, two, and it actually pays most of the attention to those two pink flags. Now, that's kind of cool, but I have to acknowledge that if you had like 52 different flags, it wouldn't be able to do it. At training time it has basically seen somewhere between zero and five objects, and it hasn't really learned to count tiny little things all over an image.
Then you ask, like, is this in the wild? And now it's kind of cool. It's paying attention to the most man-made structure in the image, which is somewhat hidden by trees and occluded by these elephants. And it still pays attention to roughly the right area.
And I think the left one may have just gotten lucky. I don't think it really has good world knowledge of "flamboyantly" and what that means for badminton players. And on the top right here, we probably would over-interpret its capabilities if we assumed that it really understood that these are two different pictures and that it's exactly the same girl in both pictures.
This one probably saw a girl, and it's a who question, so it just gives "girl" as the answer. But where the domain is relatively straightforward and doesn't require too much world knowledge, you can actually see it pick up certain patterns from the visual information. It does get an amazing couple of things right.
So, what is the boy holding? It indeed pays attention to the arms and then answers surfboard, which is pretty cool. At some point, you worry that maybe all it learned was language patterns and it just randomly put attention on things that sort of semi made sense.
And so I like the top right one here to disprove that worry, which is: if you just looked at language and you asked what color the bananas are, it would say yellow every time. But it actually pays attention to them and then says that they're green. And where I really thought it captured some actual patterns, and isn't just doing straightforward co-occurrence counting, is: what is the pattern on the cat's fur on its tail? It pays most of the attention to the tail and says stripes, and so on.
So we have a live demo too. But we have this picture right here. What color is the building that has a clock? Which is a slightly more interesting example because there are at least two buildings, only one has a clock, and it got that right.
And now this one is the last slide with a bunch of questions. We had just put together this demo, and I was lucky enough to get to chat with John Markoff from the New York Times. And he said, I want to ask a question myself. And I'm like, oh, boy. First, I hoped it wouldn't crash, because we had just hacked it up the night before.
And then he asked the question: is the girl wearing a hat? And it starts computing. We hadn't done the 90% of engineering to make it super fast that I talked about before; we had just hacked something up. So it's cranking, and I was like, oh, boy.
I started coming up with an excuse, like, well, the hat's kind of black, and it's a black background, so it's kind of hard, and so on. And it finally popped up and said yes. And so that was the one sample he took, and then he thought it was a pretty cool model.
And then after the interview I tried a bunch of other stuff. And then I was like actually kind of getting pretty excited. It got all of these right. And these are just like the first 10 questions or so that came to my mind.
So, what is the girl holding? A tennis racket. Doing? Playing tennis. Wearing a hat? Yes.
What is the girl wearing? Again, this model is only trained on single-word answers, so it just says the most prevalent and most obvious kind of object, which is shorts. What is the color of the ground? Brown.
And then I started to say, all right, let's try to mess with it and see if it just thinks all the tennis courts it's seen are brown. So I said, what color is the ball? And it actually pays attention to that ball and says it's yellow.
And then, what color is her skirt? It's white. So even though the model might think it's shorts, when you actually ask for the skirt, it's still roughly invariant to those things. And then, what did the girl just hit? A tennis ball.
And then I did get it to fail after that, though it didn't fit on this slide: I asked it, is the girl about to hit the ball? And, did the girl just hit the ball? And it said yes both times. So I knew, OK, it can't do that. It hasn't captured all the co-occurrence statistics of what the various arm poses have to be in order to make it obvious that the ball was already hit or was about to be hit.
But nonetheless, it's a pretty awesome model. I was very excited about it. So hopefully I could convince you that indeed a lot of NLP tasks can be cast as question answering and that the DMN can accurately solve a bunch of different question answering problems.
And the goal that we're now working on is to actually have this be a single model, with all the weights shared and the same decoder and everything the same. Because now that we have this different formalism, not just an xi and a yi for a supervised task, but an xi, a question qi, and an output yi, we can literally have all the same weights in all of them, at least in theory. We don't have to replace the softmax outputs depending on the task or anything.
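The (xi, qi, yi) formalism mentioned here is easy to make concrete. A hypothetical illustration, with made-up examples, of how different supervised NLP tasks can all be rephrased as (input, question, answer) triples that a single shared-weight model could in principle consume:

```python
# Each task becomes the same kind of record: an input text, a question
# that names the task, and an answer. The examples are invented.
examples = [
    ("Mary walked to the bathroom. Sandra took the milk to the garden.",
     "Where is the milk?", "garden"),
    ("The movie was breathtaking.",
     "What is the sentiment?", "positive"),
    ("Jane went to the store.",
     "What are the part-of-speech tags?", "NNP VBD TO DT NN ."),
]

def as_qa_triple(x, q, y):
    # One uniform schema, so one model with one decoder fits every task.
    return {"input": x, "question": q, "answer": y}

dataset = [as_qa_triple(*ex) for ex in examples]
print(len(dataset), dataset[1]["answer"])
```

The point of the shared schema is exactly what the talk says: the question carries the task specification, so nothing task-specific has to be swapped in or out of the model.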
But it's ongoing. And shameless plug again, if you want to work on that kind of stuff, let me know. We're looking to hire more people. And yeah, this is also kind of our example of the kind of research that we've been doing first at MetaMind and now at Salesforce. And hopefully shows you that you can do some kind of cool research in industry.
It's not super theoretical. The kind of cool thing about a lot of these deep learning models is that there are actually a lot of people in industry who really care about having the most advanced sentiment analysis model out there, but it's still an interesting new model and a new framework. Cool. Thanks for staying around so late.