An overview of Generative AI: music, video and image creation
Date Posted:
February 6, 2024
Date Recorded:
August 12, 2023
Speaker(s):
Douglas Eck, Google DeepMind
All Captioned Videos Brains, Minds and Machines Summer Course 2023
GABRIEL KREIMAN: And it's a great privilege today to introduce not one, but two amazing figures from Google here. First here is Philip Nelson, who has given talks here in the past. And I would like to encourage everybody to go to our website and see lectures that he gave in the past, which are quite spectacular. We're in for a treat today.
And we have another person from Google, in this case, Douglas Eck, from-- a senior lead at Google DeepMind, who is going to be sharing with us work on generative AI. And I'd like to just say a couple of quick words first in terms of the bio. I'm not sure I'll get it right. Correct me if I'm wrong. So he was--
DOUGLAS ECK: A generative model. Generative.
GABRIEL KREIMAN: Generative. Yes. So these are all hallucinations, perhaps. But anyway, so he was trained at Indiana University. And then he did a postdoc with Schmidhuber in Switzerland, of LSTM fame. And then from there, he went to work at the Montreal Institute for Learning Algorithms. He was one of the pioneers, seeing the birth and growth of the Mila Institute, which many of you are probably extremely familiar with, and he has made seminal contributions to machine learning.
And then from there, he joined Google about a little bit over a decade ago, more or less, I think. And he's been contributing to many, many different areas in reinforcement learning, large language models, the interactions between machines and humans, and in particular, I think, to AI and aspects of music, which I'm particularly-- which I'm very excited and looking forward to hearing about today. So without further ado, thank you very much again for joining us, and I look forward to your talk.
DOUGLAS ECK: Cool. All right. I'm really honored to be here. And I want to warn you all that the rest of your day was filled with brilliant mathematicians talking to you about computer science theory. And there's none of that now. We're going to talk about-- it's the theory week, but it's the applied evening.
But there are some interesting ideas here. So we're going to talk about generative AI. And I'm going to try to give you a somewhat biased, opinionated view of where we are now and, drawing on my own experience, where we've been in the past.
To define terms, we're really talking about generative AI not in the sense of mathematical generative models. We're not talking about Michael Jordan. We're talking fundamentally about models that can generate novel and useful content.
So we'll talk a little bit about diffusion models, and a little bit about language models, of course. And of course, it's-- I grabbed this slide; I might not have drawn it this way. I probably would have put generative AI completely inside of machine learning, because it's not clear to me that there's a lot of useful generative AI happening outside of machine learning; these systems need to learn.
And in terms of what we'll talk about today, there's a lot of text-to's on this slide. Text-to-text. So that's-- ChatGPT is a text-to-text model, right? We think we're chatting with something. But what we're really doing is throwing a string at a model and it's giving us back a string. And we're reading it and we're interpreting it as a chat.
Text-to-image. So presumably-- just, always curious, how many here have used Stable Diffusion or Midjourney to make an image? OK. How many have used ChatGPT or ChatBot or Bard? OK. And then of course-- not of course-- but we have the ability to generate video and audio.
And we're going to talk mostly about these last three. There's been so much focus on text-to-text that I didn't feel like it needed a lot of treatment. Or put differently, there's been so much focus on text-to-text that it's its own talk, right? So that's not to diminish that work in any way. Just to say that this is what we're going to focus on.
OK. So, you know, I like these obligatory charts at the beginning of talks this decade. These straight lines are, of course, straight lines only if we remember that this is a logarithmic scale. So this one is the price performance of computation from 1939 to 2021.
So it looks like, in terms of price, it's a little bit of a plateau here. But since this is a log scale, it's pretty safe to say that some version of Moore's law is holding. Things are getting faster and cheaper at the same time.
And then this is the training compute, in FLOPs, for milestone models. And the one that's labeled here is AlphaGo, but this would include a number of other breakthroughs. Presumably we have some of the GPT models in there somewhere. It would've been nice to have more of them labeled. But you get the idea.
This is also a logarithmic scale. So it's a straight line here, which means it's exponential growth in terms of compute. So we've got a lot of machinery, right? And that gives us the chance to try some things out that we couldn't try before.
This slide is one of my favorite views of the field. I think it's really, really hard to overstate how much has happened in the last decade in the space of, in this case, image generation. And this blew my mind. How many people were here for GANs? You know? OK. A lot of people.
This blew my mind and it blew a lot of people's minds that the method worked and that you could generate these faces and these faces look real. Wow. You kind of have to stand back. Don't blow them up like that. If you leave them postage stamp, they look great. Keep them icon size.
And this is the era of "this face is not real." In 2019, you start to see other GANs generating more and more realistic images. And now this I just plucked from the web today. 2023. This happens to be from someone's Medium post about Midjourney. And this is what we can make with a clever prompt.
Now a couple of things happened here. One is the image quality has gone up. But arguably, the image quality was already pretty good in 2019. The thing is, there was no real easy way to control the model. So in both of these models, we're just sampling from the distribution and moving around.
And here, we're telling the model what to make with a text prompt. I should have put the text prompt up. But some of the famous ones are a cute dog in a sushi house or something like that. Right? You can get your cute dog in a sushi house from Midjourney or from any number of models.
You would have had to look around the latent space of an image generation model for hundreds of years to find that. You have no real way to get there. So user control is a big deal, and it's going to be a theme.
And I put this slide here because I wanted to talk for a second about what effect this is going to have on the economy and on society. I think it's fair to say that the impact of generative AI, if you look at it now-- I mean, we can't read the future, but it has every chance to be as fundamentally changing as the mobile phone and the internet, right? So I think it'll give rise to new industries. We'll see other industries fold.
What we've seen in the past is that technological advances tend to grow economies, not because of altruism, but because it's not a zero-sum game. So you provide people with more tooling. And people just still keep staying up all night and working really hard and trying to get things done. And so new things happen.
I put up the image of an avalanche because I think it's a really fitting analogy for where we are right now. So I'm using the term avalanche in the sense that if you're hiking in the mountains, we'll say there's an avalanche on Wednesday. If you're hiking in the mountains on Tuesday, you're fine. Right? And if you're hiking in the mountains on Thursday or Friday, you're fine, too, probably. I mean, you might wait for the dust to settle a little bit.
But look out if you're hiking in the mountains on Wednesday. And history has shown that when big technological changes come, they come like avalanches. So they have every chance of really disadvantaging the people who are there on the mountain on that day, and this often means people that are mid-career and late-career in their field. And the survival techniques for being caught in an avalanche are actually-- it's not a terrible analogy.
It's swim, try to keep your head above water, keep moving, keep moving, keep moving, don't get stuck. If you stop and get stuck, then the avalanche settles like concrete around you. And it's really not a terrible analogy for what one would hope people in different industries affected by generative AI would do right now.
One example of this avalanche-like behavior is talkies. For those of you that are not native English speakers or under the age of 50, a talkie is-- a talkie was the term used for motion pictures made with sound. And when we look back, given cinema right now-- we have Barbie this year, we have Oppenheimer-- I don't think very many people are really mourning the loss of silent films.
But when it happened, it caught some people by surprise. I found this on Wikipedia today while I was building my slides during Tommy's talk. But this happened.
This is an unkind cover featuring Norma Talmadge. As a movie historian put it, sound proved the incongruity of her salon prettiness and tenement voice. So some people just couldn't make the shift. They couldn't make the shift to talking films. The microphone didn't work for them.
And other folks did, right? And so there's this idea of being flexible and leaning in. And I didn't want to downplay this. I know that this is a scientific audience, but there will be economic changes that follow the AI systems that we're building.
And I think it's important to keep that in mind. And so one of the things that we've been thinking about in my group at Google is how we can roll out these technologies and build them in a way that they're value-add, that they're building new opportunities, not closing down opportunities for people.
So let's dive in. We're going to go through two or three different media types. And please, please raise your hand if you have questions.
There's a world where this whole slide deck gets derailed and you just ask questions. That's fine, too. OK. It's an evening talk.
I want to talk about a couple of models that came from Google. Frankly, it's not important that they came from Google. We have really great work happening, obviously with Stability, Stable Diffusion from Stability AI. And Midjourney is really important. But I wanted to dive in a little bit on the technical details so people had an idea of what's happening under the covers with these models.
On the left is an image from Imagen. Sometimes people say Imagen. And this is one of the early diffusion-based models that came out of Google. I would say one kind of interesting aside is Google's been really conservative in terms of launching this technology, and I think largely for good reasons. So we haven't rushed anything to market that would enable the creation of really horrible materials, et cetera, et cetera.
And I'm not saying that anybody else in the field has done that. I think everybody is moving at the rate that they can move. But just so you understand, as researchers: of the team that did Imagen, four of them quit Google just to do a startup. They're very good researchers. And they didn't leave because they wanted to get rich.
It really wasn't about fortune. I know because I talk to them all. It's that they wanted to see the work that they're doing actually get out in the world. And so I'm very interested in this, especially amongst the people that are just doing your PhDs, just doing your postdocs: you're living in a world that we weren't living in. I was living in a world where the models that we were building barely worked. We were thrilled to see them do anything interesting at all.
People thought we were crazy, right? I was doing early deep learning work. I was trying to train LSTMs with 5,000 or 50,000 lines of C++ code that didn't quite work. So sitting on top of these frameworks-- and you're laughing, but there's a serious point here, right? We're at a moment of opportunity. And it shows.
It shows in academia and in industry that people are saying, I want to see this stuff happen and really matter in the world. And I think it's an interesting challenge to make sure that work does happen and fuels innovation, and we still have people getting their PhDs and doing their postdocs and working at universities.
So far, it's fine. The supply and demand curve has been just fine. Anybody looking for faculty jobs, it's hard to get faculty jobs still, right? It's not super easy. So it's not like industry has pulled all the people away.
On the right is a model called Parti. Both of these generate pixels based upon a prompt. They do it very differently. We can dive in a little bit on the details. It's also kind of interesting to see that they have-- well, I'll come back to that. Let me keep moving.
So Imagen-- this is the-- I just screenshotted some of the papers so you'd see the citations-- is a diffusion-based model. You can find more about Imagen if you'd like at imagen.research.google. And what it is, at the highest level, is a diffusion model, which is an invertible process that moves from noise to structure.
So you learn how-- it's kind of clever, right? You learn how to take something structured and figure out how to turn it into noise. But you do it in a way that's invertible. And so then you can start with some noise of the right type and turn around and generate structure. You can go back and forth.
And diffusion is the term that's used because it does relate to diffusion processes in nature. And I think Jascha Sohl-Dickstein did the first paper. Yeah? Because it's super-resolution. It's a parameter in the model, right?
I mean, you can just keep super-resolving up to higher and higher resolutions as you like. Right? I mean, and even in non-- even in diffusion processes where there's no latent diffusion, where you're just doing pixel-based diffusion, you can still just keep increasing the pixel, pixel size. Yeah.
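To make the noise-to-structure idea a bit more concrete, here is a minimal, hedged sketch of a DDPM-style training step in PyTorch. The denoising network `eps_model`, the linear noise schedule, and the image shapes are illustrative assumptions, not the actual Imagen training code.

```python
import torch
import torch.nn.functional as F

# Illustrative linear noise schedule (an assumption, not Imagen's actual schedule).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)   # cumulative product, one value per timestep

def diffusion_training_step(eps_model, x0):
    """One denoising step on a batch of images x0 of shape (B, C, H, W):
    corrupt x0 at a random timestep, then train the network to predict the noise."""
    b = x0.shape[0]
    t = torch.randint(0, T, (b,))                          # random timestep per image
    noise = torch.randn_like(x0)                           # the target the model must recover
    a_bar = alpha_bars[t].view(b, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise   # forward (noising) process
    pred = eps_model(x_t, t)                               # network predicts the added noise
    return F.mse_loss(pred, noise)                         # simple DDPM objective

# Sampling runs this process in reverse: start from pure noise and repeatedly
# remove the predicted noise, stepping from t = T-1 down to 0.
```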
And so, some technical details. Basically, it's a cascade. So here's the answer for this model-- thanks, you looked ahead at the slides, and I forgot to do that-- it's 1024 x 1024.
But I know for a fact that we had bigger ones. This is three cascades, so you start small and keep getting bigger and bigger and super-resolving. And this, by the way-- I'm going to make it a point to talk about a lot of the non-technical parts-- the super-resolution caught artists by surprise in the following way.
Musicians have known for a long time that if they want to sell their music, they had probably better not put the full length MP3s on their website for free. You put the MP3s on your website for free, someone downloads them and uses them and there's nothing left to sell. But most artists, most visual artists have websites with relatively low resolution images of their art, which seems harmless enough.
Like, what can I do with this 1024 x 768 or whatever, you know, postcard-sized image? That doesn't replace the art on the wall. But in fact, with super-resolution being able to just continue to increase the resolution of the images, it becomes possible to make very, very high quality images. And there's an article I highly recommend by a journalist named Rachel Metz, from the Bay Area.
She was at CNN at the time. Now she's at Bloomberg. And she was interviewing some artists who were looking at Stable Diffusion. And one of the reasons that they were focusing on Stable Diffusion, this particular model by Stability, is that Stability had published what training sets they trained on. And so it was possible to look in the training sets and find certain art.
And this woman, she was an artist from Oregon, her art was verified to be in the training set. And so they were playing around with the model, the text-to-image model, and they were making art in her style. It's a pretty straightforward prompt, prompting, right? And it made this art. And I was really struck by her take on this.
The headline was like, artists are enraged. Artists are furious. Whatever. But she wasn't. She was in shock. She said something like-- I can almost remember exactly what she said. She says, wow, these are really good.
And then she's looking and she's going to Rachel, and she's saying, this looks like my art, right? Like, she's actually asking for verification. And so you just have this-- it's just, her words are just, it's shock. Right? And so I think it's really important to keep in mind that people whose art is up there and we're taking it and we're doing cool things with it, this is a failure mode, right?
We shouldn't have text-to-image models that are, in an unattributed form, taking people's work that they're trying to sell and super-resolving it into new versions. Right? So one research direction, if I could plant a seed, that might sound boring but is actually super fascinating, is generative models where attribution can be controlled fairly explicitly. This might look like LoRA models, low-rank adapters, where what you do is you move the effects of a person's art into some low-rank vectors.
And then when you want to bring them into play and have that person's style affect the neural network, you bring them in, and then you have an attribution chain that is explicit: I explicitly brought this person's work into play by adding these particular vectors to the mix, and it changed the output. There's a rough sketch of that idea below. But one way or the other, we've got to figure out how to build a marketplace around generative AI.
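To make the low-rank adapter idea concrete, here is a minimal PyTorch sketch, assuming a single frozen linear layer: a per-artist low-rank update lives in its own named module, so it can be switched on or off and logged explicitly. The class, its rank, and its scaling are illustrative assumptions, not any production attribution system.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight plus an optional low-rank update (x @ A @ B).
    Keeping the update in a separate, named module gives an explicit
    attribution hook: you know exactly whose adapter was active."""
    def __init__(self, base: nn.Linear, rank: int = 4, scale: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # the base model stays frozen
        self.A = nn.Parameter(torch.randn(base.in_features, rank) * 0.01)
        self.B = nn.Parameter(torch.zeros(rank, base.out_features))
        self.scale = scale
        self.enabled = True                    # flip off to remove this artist's influence

    def forward(self, x):
        out = self.base(x)
        if self.enabled:
            out = out + self.scale * (x @ self.A @ self.B)
        return out

# Hypothetical usage: wrap one layer of a generator with an artist-specific adapter.
layer = LoRALinear(nn.Linear(512, 512), rank=8)
```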
Otherwise, as we iterate forward, the outcome is just going to be bad, right? It's going to be negative. So I think we can do this. And I think in the meantime, we should just not train on commercial art. But anyway, getting back on track-- Yes.
See, this-- forget about the math. These are the questions. This is what people raise their hands for, right? Yeah. So this is why we've been cautious releasing these technologies. Look, there are two answers, right?
One of them is we're a big company. There are lawsuits. Obviously, there's just a kind of business risk here. But beneath it, our AI principles are really well laid out. And there's a couple of slides on them in here.
I can tell you that the researchers that I work with, we do care about this. Right? We don't want to screw things up. And so we've been very slow and very reluctant to throw stuff out there. And we've taken heat, even internally, from researchers who just want to get their stuff out; people have quit. Right? So it's been hard.
It's been hard for me to see us move slowly sometimes. And I'm also, in the end, glad we've slowed things down. But we've absolutely moved slowly because of the ethical impact and not feeling like we had the right story around the data sets. The answer is there's absolutely room for abuse, in the sense of what, in the radio industry, we call payola: you would pay money to have the radio stations play your music.
I think we've done a pretty good job with YouTube in avoiding that. I mean, YouTube is the largest creator marketplace. And of course, there's a recommender system running, right? So things are getting recommended over other things. That's inherent.
But I think we've been trying really hard to avoid that. I would be perfectly happy to see these marketplaces evolve with either government regulatory control or consortiums of companies or companies that are working together with academia. I would be completely open to that, right?
I don't think that any company employee should stand up with a straight face and be like, no, no, no. We've got this. Don't worry. We then go back behind closed doors and game things.
So I think it's perfectly legitimate to ask, but I can tell you that we're trying not to do that. And in fact, we don't have this marketplace right now, so it's all hypothetical. And let's keep moving. Yeah?
AUDIENCE: So how much do we have to figure out top-down control of the role of generative AI and how it will fit into the marketplace, versus just allowing people to shift their mindset about what it means to distribute art? Among other artists' opinions that I've heard, which I found at first even more surprising than the person you mentioned who was in awe when she saw her art being replicated, some artists were actually financially stoked, because they realized this is a way to infinitely replicate their art and make it really famous on a different order of magnitude, basically.
And so they offer deals to people who want to play around and replicate their art using AI and share the profits and so on. So I don't know. How do you think about certain maybe like, unforeseen market mechanisms?
DOUGLAS ECK: It's a great question. All the questions have been great so far. So whatever we do, there's going to be someone unhappy, because there are people who want us to just pull the plug and never do this. And there are other people who are incredibly excited. Right? So there's a middle ground here.
For me, the biggest risk is to repeat what happened with music streaming. With music streaming, we-- the companies were so reluctant to offer high quality-- by the companies, I mean labels. The labels did not want streaming to exist. Let's be real.
They fought it. They fought it as long as they could at that time. In the labels now, there's a new generation of people who are really leaning into technology. But--
AUDIENCE: What year is it from?
DOUGLAS ECK: Napster? So what? 2009? Am I close?
AUDIENCE: Earlier.
DOUGLAS ECK: Earlier?
AUDIENCE: Late 90s.
DOUGLAS ECK: Oh, no. That's right. I'm off by a decade. I'm off. As you rack up decades, you'll see.
AUDIENCE: Before iTunes.
DOUGLAS ECK: That's right. Pre-iTunes. Yeah. You know, so we had LimeWire. We had BitTorrent. And then Spotify came along. And they just basically did it against the industry's wishes, but they built really high quality, low latency streamers at the time.
And so I think at the very least, we need to think about roughing out and building out a kind of marketplace that does that in a really transparent way and ideally without a bunch of fancy pixel-level attribution math. But just like, hey, this token was present, this person's music is in there, and this art is in there, like, build that up. On top of that, there's all these other issues around how people monetize.
You'll have copycats, right? You can fan out 1,000 ways in which this can all be gamed, right? But then, that's true of so many marketplaces. I think you just have to think about how to balance. I want to move on. OK.
One more question. Yeah. Sorry? I have some cool demos, so I don't want to completely run out the clock.
AUDIENCE: Sorry. It might be a difficult question, but I believe that Google was very careful about releasing stuff and all that. But now that there's a lot of value in those models, there have been some other companies that clearly didn't care that much about how they released stuff. They created competition, and that now kind of forces companies like Google to also release stuff.
This also creates kind of a huge problem as well. It might have huge societal implications. And I don't know. What's your take on it?
DOUGLAS ECK: So first, it's messy. It is true that there has been some urgency at Google to make sure that we take advantage of generative models, especially for search. So we've done Bard and are trying to move in that direction. I genuinely-- I'm always uncomfortable when I say something positive about the company I work for, because I have that same kind of inherent cynicism that many people have.
Don't stand up and be a shill for your company. But you know, I'm really pretty proud of what we've done. And I don't think we're the only company to act this way. We really have AI principles. We have safety and data access principles.
If you want to get fired at Google, start trying to scrape around in private data. You won't last, right? It's taken very, very seriously. And I think, really, my hope is that some of the larger players in the industry work towards making functioning marketplaces and work with governments to get the right regulation in place. That doesn't mean that they're inherently altruistic.
But that's the role of large industry at a time like this, to create these ecosystems and make them work. Beyond that, we'll see how it plays out. I mean, I think I'm happy with the people I'm working with. I have not seen behavior from other companies that I've thought was truly heinous in any way.
I think the big companies, I think, are-- everybody is trying to do the right thing. And you'll just have to see how it rolls out. OK.
I'm going to go a little faster through some of this, because I see where the questions are going. And they're fun. So this is just another way of looking at how this upsampling works. You basically use a text model, even though it's one small square up here. It's actually frozen, so no training is done on this language model.
It's embedding in some way this string. And then we learn the mapping, the alignment between this embedding and the initial image. And then the embeddings are still there to keep the-- well, sorry. You start with the embeddings plus noise. You get your initial image based upon the semantic content in the embedding that's been learned.
And then you keep these embeddings there, the linguistic intent of these embeddings. And they're there all along the way, so that the image doesn't drift away from what it means. And you continue to upsample and make the images higher and higher quality. Yes?
No. I think it's just that they worked. This is expediency. You can imagine training, you know, propagating a gradient such that what you get is the right language model for text-to-image. People have just found that frozen existing text models work fine. So you have this model that's trained just on embedding text in some way.
OK. Oops. Yeah. So the embeddings are frozen. And then these are, yeah-- these cascades, I believe-- so I'm-- this work reported up to me, but I didn't do this research. But my belief-- if there's someone that understands cascaded diffusion better, tell me. My understanding is that there's no gradient propagating here.
So this model is trained separately and is doing this particular super-resolution and each of the super-resolution stacks is trained separately. And people are nodding. And no one is standing up and shaking their fists, so I think I got it right.
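Here is a hedged pseudocode sketch of that pipeline, just to pin down the data flow: the text encoder is frozen, and each diffusion stage is trained independently but conditioned on the same text embeddings. The function and parameter names are placeholders, not the real Imagen API.

```python
def generate_image(prompt, text_encoder, base_model, sr_models,
                   base_size=64, steps=256):
    """Cascaded text-to-image sampling (illustrative only).
    text_encoder : frozen language model -> embeddings (never fine-tuned)
    base_model   : diffusion model that samples a small image from noise
    sr_models    : super-resolution diffusion models, each trained separately
    """
    emb = text_encoder(prompt)        # frozen: embeddings only, no gradient flows back

    # Stage 1: sample a low-resolution image from pure noise, guided by the text.
    image = base_model.sample(shape=(base_size, base_size, 3),
                              cond=emb, steps=steps)

    # Stages 2..N: each super-resolver upsamples the previous output; the text
    # embeddings are passed in again so the content doesn't drift.
    for sr in sr_models:
        image = sr.sample(low_res=image, cond=emb, steps=steps)

    return image
```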
All right. I want to get through the image stuff. So just to give you a flavor, I'm showing you a second model, because it's a completely different technology yet it still does a pretty good job of generating images. This is Parti, and this is another Google project. But again, there are other language model-based image generators out there. I'm using Google stuff because I have it around.
And here, the basic idea is to use a transformer and train an encoder-decoder pair. So we have our string here. We're going to encode it into tokens. And then we're going to use this to predict some coarse image tokens. And then we're going to super-resolve here.
I could spend-- this could be its own talk, and the architecture is more complicated. For the sake of argument and to get to some other demos, because it's not what the conversation is really about, what I wanted to show you is that you can do a lot just with language models, right? You don't necessarily need to use diffusion. And the other point that I want to plant is that I think a lot of what we're getting from large language models, not everything, but a lot of what we're getting with large language models can be thought of as translation.
But you want to think about translation more broadly than translation between one language and another. You want to talk about translating across domains. So in a very real way, this transformer model is taking a sequence of tokens in the English language and turning it into a sequence of tokens in the language of images. And that is then being super-resolved to be a little bit crisper using diffusion here with Imagen. Though frankly, you can get OK images by just continuing along this language modeling game.
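To make the translation framing concrete, here is a hedged sketch of language-model-style image generation: text tokens go into an encoder, and the decoder samples discrete image tokens one at a time, exactly like next-word prediction. The encoder, decoder, and start token are assumed interfaces, and a separate image tokenizer (a ViT-VQGAN-style model in Parti's case) would turn the tokens back into pixels.

```python
import torch

def sample_image_tokens(text_tokens, encoder, decoder,
                        num_image_tokens=1024, bos_token=0):
    """Autoregressive text-to-image as translation (illustrative sketch).
    encoder(text_tokens)          -> contextual text embeddings ("memory")
    decoder(image_tokens, memory) -> logits of shape (len(image_tokens), vocab)
    """
    memory = encoder(text_tokens)
    image_tokens = [bos_token]                       # assumed start-of-image token
    for _ in range(num_image_tokens):
        logits = decoder(torch.tensor(image_tokens), memory)
        probs = torch.softmax(logits[-1], dim=-1)    # distribution over the image vocabulary
        image_tokens.append(torch.multinomial(probs, 1).item())
    return image_tokens[1:]                          # a separate tokenizer decodes these to pixels
```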
And what we'll see later, in case I don't get to it, is another example of this where, in robotics, there have been really interesting breakthroughs in robotics, where, in the first paper called "SayCan," the language model takes a high-level intent, like I spilled my Coke, please get me another one, to the house robot. And that's a really hard high-level intent to translate into XYZ movements on a robot arm and motor movements for the wheels. It's just a lot of work.
In the first paper, the language model is used to break that down into a bunch of subcommands using prompt tuning. And so it takes that and it learns to say, oh, navigate to the refrigerator. Open the refrigerator door. Take your arm and find a Coke. Grab the Coke from the refrigerator.
Close the refrigerator door. Turn around and go back to the user, et cetera. Right? These are all much more actionable. And then the robot has its sensors. And it's using another model to turn around and generate a bunch of candidate actions.
And then the union of those is taken, rather, the intersection, and the robot acts on those. And it actually works, right? So the language model is translating into English language strings that can then be interpreted by the robot. And a second follow on paper, this was in the New York Times two weeks ago. Cool article. Just Google "New York Times robots" and you'll find it.
The next step is taken. The transformer model just generates a bunch of tokens that are interpreted directly as robot control movements. So it's quite directly and literally tokenizing the actions, the XYZ movements, just tokenizing them into a softmax. And then the robot is being made to move by the language model. So it's directly translating language into movement. Yeah?
AUDIENCE: Is this the RT-2--
DOUGLAS ECK: Yes. It's RT-2. It's exactly RT-2. So I bring this up early, in case we don't get to it: this idea of translation really goes a long way. I think that the chatbot setting is a little bit deceiving, because we anthropomorphize the agent. Just like, if anybody knows what Eliza is, the Lisp program that acted like a psychotherapist, it just kept saying, tell me more.
You're like, whoa. It's alive. So you know, I think we assign a lot of agency to these models that probably isn't really there. But this translation, this domain adaptation, this domain translation is real. And it's very interesting. We've already seen an example here for images. And we'll see more.
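As a tiny, hedged sketch of the SayCan-style combination described a moment ago: the language model scores how much each candidate skill would help the instruction, an affordance model scores how feasible that skill is from the robot's current state, and the robot picks the skill that does well on both. The function names and the simple multiplicative combination are illustrative; see the SayCan paper for the actual formulation.

```python
def pick_next_skill(instruction, skills, lm_score, affordance_score):
    """SayCan-style skill selection (illustrative only).
    lm_score(instruction, skill) -> how useful the skill is for the request
    affordance_score(skill)      -> how likely the robot can execute it right now
    Combining the two is the 'intersection' of language usefulness and feasibility."""
    best_skill, best = None, float("-inf")
    for skill in skills:    # e.g. "go to the fridge", "open the door", "pick up the Coke"
        combined = lm_score(instruction, skill) * affordance_score(skill)
        if combined > best:
            best_skill, best = skill, combined
    return best_skill
```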
There's also something nice in the Parti paper. They very carefully did a qualitative comparison of different model sizes-- it may be hidden by the podium-- 350 million, 750 million, 3 billion, and 20 billion parameters. And here, they were asking it to spell something, "welcome friends."
And in the example, you just see, as the models get bigger, this scaling effect. Scale upon scale, the model just does it better and better, without any architectural changes at all. It's just a question of scale. So I thought it was nice to see that clearly mapped out.
So let's talk a little bit about video. It's going to be a very similar story. I'm going to talk about a couple of projects that are ongoing. One is Imagen Video, which is really the same Imagen diffusion work done in the space of video by just adding a third, temporal dimension. And the other is Phenaki, which is a language modeling approach.
So again, you can learn about Imagen online here. You can just Google Imagen if you want. This was a nice early-- now early, it's a year and a half old-- bit of work on diffusion. And to give you a better idea of what's happening, it's basically a cascade of super-resolution modules that take the text prompt, generate those embeddings we saw, and then gradually super-resolve until we're up to about 1024 x 768, 24 frames per second video.
And this is a little bit about U-Net architectures. Do me a favor. I'm going to skip this, because it'll take time to unpack. And you can, to your heart's content, go read about diffusion and U-Nets. And it'll be just fine. Because the questions will be about other stuff that's coming.
Whoops. And then just to compare against that, we have Phenaki, which is a language modeling approach, also from Google. And here-- this diagram is a mess-- but what it amounts to is a big encoder that basically encodes patches of images into tokens. So you're tokenizing each frame. And then we're going to use a transformer to pull these back together and, at decoding time, have some autoregressive components that will cause the video to have coherence in time, which is the big challenge with video.
The big challenge with video is making sure that the subsequent frames all tell the same story, that you don't have the green dress turning into a purple dress or you don't have the person's face turning into a blob. There's a need for temporal coherence over time, yet the images change over time. And that's a big challenge.
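Here is a hedged sketch of how that language-model view of video generation can work: each frame becomes a block of discrete tokens, and the transformer keeps conditioning on recently generated tokens, which is what carries coherence from frame to frame and across successive prompts. The tokenizer and transformer interfaces, the context window, and the token counts are all illustrative assumptions, not Phenaki's actual architecture.

```python
def generate_video_tokens(prompts, tokenizer, transformer,
                          tokens_per_frame=256, frames_per_prompt=24,
                          context_frames=4):
    """Illustrative language-model video generation.
    Each prompt drives a stretch of frames; the autoregressive context is what
    keeps a green dress green from one frame to the next."""
    video_tokens = []
    for prompt in prompts:                       # the prompt can change over time
        text = tokenizer.encode_text(prompt)
        for _ in range(frames_per_prompt):
            context = video_tokens[-context_frames * tokens_per_frame:]
            frame = transformer.sample(context=context, cond=text,
                                       num_tokens=tokens_per_frame)
            video_tokens.extend(frame)
    return video_tokens                          # the tokenizer decodes these back to pixels
```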
And so Phenaki is quite good at short videos. And the real win here-- be ready to read this-- the win for Phenaki is the storytelling; the individual frames are not quite as high quality. That's partially just for compute reasons. It's expensive. If we spend more compute, we can super-resolve these better.
Yeah, go ahead. Yes. Exactly. Yeah. SSR is spatial super-resolution and TSR is temporal super-resolution, so that you pull things together over time. Yeah. There we go. A blue balloon stuck in the branches of a redwood tree. Camera pans from the tree with a single blue balloon to the zoo entrance.
Camera pans to the zoo entrance. Camera quickly moves into the zoo. First person view of flying inside a beautiful garden. The head of a giraffe emerges from the side. Giraffe walks toward the tree.
Zoom into the giraffe's mouth. Giraffe gets close to a branch and picks a blue balloon. A single helium balloon with a white string is flying towards a giraffe's head. Giraffe chewing with the blue balloon nearby. Camera tilting up, following the single balloon flying away.
So the reason I bothered to read it is there's some small amount of drama here. But also, those really were the prompts used to make the video. And one of the cool things about this autoregressive model is that it can carry coherence across these prompts. So the video doesn't change immediately, right?
And we've got things happening internally where we're redoing small scenes from Metropolis, which is a movie that's now in the public domain, and playing around with it. And there's a whole world here, a really interesting creative hook, of being able to actually tell the story you want to tell. And I think these videos are really about storytelling, more than anything else. It's really, really interesting to be able to tell a story.
AUDIENCE: So we've seen the power of Stable Diffusion. And I was wondering if it's only for images and videos, or if it has been tried for other modalities, like audio? And if not, why is that the case?
DOUGLAS ECK: The answer is yes. So--
AUDIENCE: OK.
DOUGLAS ECK: So there's an open-source project called Riffusion, like guitar riffs, which takes audio, basically generates spectrograms, which are just pictures of audio, right? The horizontal axis is time, and the vertical axis is frequency. And then you use an algorithm called Griffin-Lim, which is signal processing to invert that back into audio. And so there's already-- there's code in open-source projects out there where you just generate high-resolution spectrograms.
And as far as I can tell, you can generate high-resolution spectrograms all you want and do something with them. The quality is terrible, and it's not that easy to control, but it's an interesting direction. And that's for audio.
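To show the inversion step he's describing, here is a small sketch using librosa's Griffin-Lim implementation: it round-trips a waveform through a magnitude spectrogram (the "picture of audio") and back. A Riffusion-style system would generate the spectrogram image with a diffusion model instead of computing it from a file; the file name and STFT settings here are placeholder assumptions.

```python
import numpy as np
import librosa
import soundfile as sf

# Load audio and compute a magnitude spectrogram: time on one axis, frequency on the other.
y, sr = librosa.load("input.wav", sr=22050)               # "input.wav" is a placeholder path
S = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))   # the phase is thrown away here

# Griffin-Lim iteratively estimates the phase that the magnitude spectrogram discarded,
# which is exactly what you need when a model only generates the spectrogram image.
y_rec = librosa.griffinlim(S, n_iter=64, n_fft=2048, hop_length=512)
sf.write("reconstructed.wav", y_rec, sr)
```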
Diffusion is a nice general purpose process. It's easy to condition. It's an interesting process. It's maybe parallelizable in ways we can't see right now. And there's a lot of work out there. So diffusion for language is a good question, right? It's turned out to be frustratingly slow. The place to use diffusion would be in the decoding phase, and the interesting part about decoding with diffusion is that it's non-causal. So it's happening in parallel. Basically, let's say you wanted to diffuse a paragraph.
You would start with some fairly low dimensional representation of the paragraph and just keep diffusing. But as you're doing so, you're getting all of the conditional dependencies from all the words mashed up, right? It's not just left to right. And so there's this world where maybe very long paragraphs-- like, if you wanted to generate a thousand words of text, you might be able to do it more efficiently or better with diffusion.
There are lots of ways, though. You could talk about optimal transport. There are all sorts of ways where you can treat the sequence of tokens that you're generating as a probability distribution that you want to understand better and realize that left to right autoregression is just kind of one greedy way to get there. So people have been playing-- the problem is really that it's slow. And you can do a lot of tricks with autoregression and beam search and things like that that just prove to be hard to beat for sentence-level work.
Yes. You can tell the model how much time to spend on a given phrase. And that's an easy one-- that's just like a Colab hack. They did that in a day. Some of the people internally were playing with that. So yeah. And this is just the beginning, right?
There are all sorts of ways to condition this. There are ways to ensure coherence over time. Certainly, we can make the image quality better. I would just take the image quality with a grain of salt. It really was just a compute constraint-- how much compute we wanted to use for this particular training run.
And I would focus on the storytelling aspect. I think that's the most interesting part of this Phenaki experiment: honestly, once you can go this long, you can keep going. Right? There's no real constraint. The autoregressive model is just going to keep carrying coherence across time for you.
This is the part I wanted to talk about. But OK. One more.
AUDIENCE: I'm just wondering, all these demos that you showed, it's pretty cool. But are they cherry-picked? Or it's-- can you show us some of the failure examples, where the model just miserably failed?
DOUGLAS ECK: Online, the Phenaki paper is published. And there are lots of non-cherry-picked examples. And of course, your mileage may vary. This is worse than cherry-picked. This was-- OK.
So there's someone on our team. Her name is Irina Blok. And she's a full-time designer. That's what she does. She's a designer. If any of you know what Android is, she did the little Android logo, the little robot.
And her full-time job is to play around with models like this and try to do cool things and push the envelope of what can be done. She made this. And she showed us that you can chain together these prompts. We knew mathematically-- you have engineers trying this-- that it kind of works. But she's-- we call her the prompt whisperer, because she knows how to do this. That's her nickname.
And you know, we have other examples internally that are quite good. And you see once someone with an artistic, with a design perspective comes in and starts working with these tools, it flips cherry-picking on its head. Because it's not cherry-picking. It's-- she worked. She stayed up hours doing this, right?
And so it's fine that it's in that sense cherry-picked. It's a piece of-- it's craft, right? I won't call it art, but it's craft. But yes. I'm not trying to convince you that this model is better than another or that random samples from the model are great or anything like that.
You have to evaluate that on your own; there are tons of samples online for all these models. But here, yeah. It's exactly cherry-picked. It's constructed. Yeah?
So she did-- I don't know how long it took her to do this one. But she did a more recent one, which was like a 3 minute Black Mirror kind of episode thing that I'd love to be able to show off someday. And she like-- she's like, yeah, I did it on Sunday. And we're all like, how did you do this in one day?
And the compute necessarily varies. But we'll stand this model up. And it's not that big of a model. I think the parameters were-- I mean, it's like-- I think it's fitting on like a 4x4 DP. It's not that big.
And then so it's not like-- we do have-- we are Google. We have a lot of resources. But I promise you-- hold on-- I promise you. I promise you. No, no, no. You can laugh at me if you want.
But when you're training large language models like Bard for production reasons and your crazy artist comes along and says, can I have more resources? The answer is not always yes, right? So we struggle to get resources to do this kind of work, right? I mean, you know--
AUDIENCE: [INAUDIBLE]
DOUGLAS ECK: I don't remember. I'll tell you offline. Again, I just don't remember how big the model is. But compared to the largest language models, these models, in training and at inference, aren't that bad. Because the language models they rely on are already trained. They're frozen.
You have to load those in, and then you have all these cascading models at work. It takes more to train, right? But those are already trained. And these are faster at inference time than the diffusion models, a couple orders of magnitude faster, partly because they're working with a frozen language model.
OK. On to music. I have to finish the music part. I don't care how late we stay. So I've done this for a long time. And you're in for a treat.
You're going to hear some of the music that my LSTM neural network generated in, wait for it, 2002. 2002. Is there anyone in the room that was-- OK. I won't say that. I don't want to know the answer. 2002 was a long time ago.
The code was written in C++. And so what we were doing was basically looking at the capacity. There were some serious ideas behind doing music. Music has interesting repeated structure and hierarchical structure. Unlike language, it's pretty easy to measure when you've got it wrong. Right?
You can just tell: well, you're not in the right key, you're playing the wrong chord. Right? And so we wanted to see if we could build a kind of stable pattern generator-- the same kind of papers coming out that year. We were looking at context-free and context-sensitive grammars, like aⁿbⁿcⁿ, and at just being able to generate sinusoids-- just understanding whether LSTM had really done anything interesting with respect to the vanishing gradient problem and the ability to stably create pattern generators. And wow, it's really good. Let me play it for you.
[AUDIO PLAYBACK]
Yeah. It's-- yeah. This is only five minutes and 47 seconds long. Sit back. Now, I mean, it didn't stand the test of time. But that's OK.
It's also only been cited so much. If you worked on recurrent neural networks in music in the early 2000s, these are the citation counts you could only dream of having: someone cited this paper 121 times.
So let's move forward. I'm afraid. Anybody heard of Magenta? Oh, good. A few hands. So this was a project I created. I'm very proud to have created this in 2015.
Just saying, generative models are coming. We can clearly generate text. And there was handwriting generation happening from Alex Graves. And there was a bunch of cool stuff happening with some image generation. And I actually believe this was going to change the world.
We're going to see these models get better and better. And let's explore what's possible. And so what we did was we focused on music, partially because it's what I do, but also because music and art are relatively user-safe compared to dialogue, because language is inherently messy, and compared to things like, I don't know, running heart-lung machines or whatever else you might want to do with your generative model.
And so to show you, not much had happened in the intervening time. This was almost like a test file just to see what we could do to get some code working. But this is a little RNN. This is also an LSTM from 2016.
[AUDIO PLAYBACK]
So OK. So we put this online. And we announced this project. And we put up a GitHub repo, a little bit of code. And then something kind of cool happened. A few randos found this online.
And they started playing along with it. And this was really cool. This is people. They didn't even-- they didn't even tell us. They just did it.
[AUDIO PLAYBACK]
So the piano is us. It's the same piano.
[END PLAYBACK]
He got the chords wrong, but that's OK. Added a bridge that wasn't really there. But we were so thrilled, right? So this is what this project was about, was trying to interact with musicians and figure some of this space out.
And this happened two weeks after we launched. And they didn't even tell us it happened. And we were thrilled to see it happen. So Magenta is still going on.
[AUDIO PLAYBACK]
Oh, sorry. I had to go through that. So let's go a little bit further on. This is 2018. Now Transformer comes along. OK? Two things happen.
Think of what happened in just a year, right? Transformer comes along-- it came from our group. Some of the folks from Transformer are on this paper. I should have put up the paper, Music Transformer; I just put up the blog posting that I copied.
But those are not all the people on the paper. I think Noam Shazeer and Ashish Vaswani from the Transformer paper are on this paper, too. Music Transformer. And we did two things. One thing we did was, in a separate paper, we figured out how to take-- so we started with piano music.
Now it handles more than just piano music-- but we figured out how to take a piece of audio and basically derive the MIDI score, the MIDI file, from it. So MIDI is just the note onsets and how loud they are and what the pitch is, commonly used in music. But crucially, we got the timing right and the loudness right. A piano performance slows down and speeds up and gets louder and softer. And that's what makes a performance a performance and sound good, right?
And we encoded time-- this is Ian Simon, who primarily thought of this. The file is not sampled in time; there's no sampling rate. Rather, a note is played, sampled from one distribution. Then we sample from another distribution how far to move the clock.
And then we sample another note, and then sample whether to make it louder or softer, et cetera. So loudness is one set of logits, time is another set of logits, and then pitch. Putting the MIDI file back together in that format means that a long note doesn't require multiple inference steps to exist.
It starts and you move the clock. And that's what makes a long note. And you generate the note offset. So it's this interesting control language for music. And I'm going to play you a piece from Music Transformer where there's no control here.
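Before the example, here is a toy sketch of that kind of event vocabulary. The real Music Transformer vocabulary is similar in spirit (note-on, note-off, time-shift, and velocity events), but the names, bin sizes, and note format below are illustrative assumptions.

```python
def encode_performance(notes, time_step_ms=10, max_shift_ms=1000):
    """Turn notes, given as dicts with onset/offset times in seconds plus pitch
    and velocity, into a flat event sequence. A long note costs only a few events
    (set velocity, note-on, later note-off); the clock only advances through
    TIME_SHIFT events, so duration is never sampled frame by frame."""
    events = []
    clock = 0.0
    boundaries = sorted(
        [(n["onset"], "on", n) for n in notes] +
        [(n["offset"], "off", n) for n in notes],
        key=lambda e: e[0])                     # interleave note-ons and note-offs in time
    for t, kind, n in boundaries:
        while t - clock > 1e-6:                 # advance the clock in bounded shifts
            shift = min(t - clock, max_shift_ms / 1000.0)
            events.append(("TIME_SHIFT", round(shift * 1000 / time_step_ms)))
            clock += shift
        if kind == "on":
            events.append(("SET_VELOCITY", n["velocity"]))
            events.append(("NOTE_ON", n["pitch"]))
        else:
            events.append(("NOTE_OFF", n["pitch"]))
    return events

# Example: one middle C held for four seconds becomes a velocity event, a note-on,
# four TIME_SHIFT events, and a note-off, no matter how long the note rings.
print(encode_performance([{"onset": 0.0, "offset": 4.0, "pitch": 60, "velocity": 80}]))
```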
We're just sampling from the model randomly. But I think you'll hear it has much more musical structure than we were getting with RNNs, partially from the training data and partially from Transformer. And also really cool, you'll see these arcs. And I think this is one of the best visualizations of Transformer. I actually would suggest using this to teach classes, even if you don't care about music.
Because what you'll see is in colored arcs, what the attention heads are paying attention to. And the color of the arc is the head. There are eight heads in this model. And the arc is going back to what it's actually attending to most. Right?
And what you'll see, and I don't want to stop it, is there are these repeated segments. You can see them right here. This is a repetition of this, which is a repetition of this. When these repeated segments-- think of them like musical refrains-- when they happen, immediately the model just locks in on the previous version of that segment.
So it's doing repetition with modification over and over again. And I think it's kind of a cool sounding piece of music. So let's listen to this.
[AUDIO PLAYBACK]
Then it stops. So I was pleased with that. And-- does anybody-- do you agree it's kind of cool to see the structure? Yeah? OK, good. Lots of examples of that. OK. Oops. Moving way forward in time, to the present, I'm going to talk about a modern language modeling approach to music, called MusicLM. This was released this year. It's also available for you to play with at aitestkitchen.withgoogle.com. And the paper's up here.
And there's a-- now we've flipped from MIDI to audio. So we were looking at MIDI. We were just generating MIDI onsets and then a synthesizer was generating that piano sound. Now we're talking about generating audio.
And then, in a way very similar to video, you really care about high quality in time, right? But you also want long-term consistency. And it's hard to get both. So WaveNet gave you really amazing quality but low consistency in time. So it'll sound like babbling.
[AUDIO PLAYBACK]
OK. And these are both old models. I'm not trying to pick on them-- Aaron's WaveNet is an amazing piece of work, don't get me wrong. It's just not conditioned on phonemes here, so it sounds like that. And here's something else. Jukebox is a little bit lower quality, but at least it's consistent.
[AUDIO PLAYBACK]
OK. So what this model is up to is-- it's actually really cool. I encourage checking these papers out. First thing we have is an audio codec. Someone actually used a neural network to build what amounts to an MP3 encoder-decoder pair. So it compresses the audio, but it does so in a pure language model fashion.
So it compresses the audio into a sequence of language tokens and then decodes from those tokens. And it's actually quite competitive. So you can do compression with language models. And one of the reasons that it's quite good is there's a bit of a GAN-like discriminator in there. Play.
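Here is a hedged sketch of the residual vector quantization idea at the heart of SoundStream-style codecs: each stage quantizes whatever the previous stages missed, and the chosen codebook indices are the discrete tokens the language model then works with. The codebook sizes, dimensions, and absence of any training loop are simplifying assumptions.

```python
import torch

def residual_vector_quantize(x, codebooks):
    """x: (batch, dim) latent frames from the codec's encoder.
    codebooks: list of (codebook_size, dim) tensors, one per quantizer stage.
    Returns per-stage token indices and the quantized reconstruction."""
    residual = x
    quantized = torch.zeros_like(x)
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)     # distance to every codeword
        idx = dists.argmin(dim=-1)            # the discrete tokens for this stage
        chosen = cb[idx]
        quantized = quantized + chosen
        residual = residual - chosen          # the next stage codes what's left over
        indices.append(idx)
    return indices, quantized

# Example: 100 latent frames of dimension 64, three stages of 1024 codewords each.
frames = torch.randn(100, 64)
books = [torch.randn(1024, 64) for _ in range(3)]
tokens, recon = residual_vector_quantize(frames, books)
```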
And then we condition with text. And so there's another paper called "MuLan," which basically jointly embeds audio and text. So it's kind of like CLIP. It's like CLIP for music. And so this allows us to take a string, like rock song with distorted guitar, and correlate that with the audio.
So we have a frozen neural network again, a frozen language model that gives us an embedding. And then we embed the audio. And then we learn the standard contrastive loss. So this gives us the ability to do text-to-music.
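A minimal sketch of that standard contrastive (CLIP-style) loss, assuming you already have batched audio and text embeddings from the two towers: matched pairs sit on the diagonal of the similarity matrix, and each row and column is treated as a softmax classification over the batch. The temperature value is an arbitrary placeholder.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """audio_emb, text_emb: (batch, dim), where row i of each comes from the same clip.
    Pulls matched audio/text pairs together and pushes mismatched pairs apart."""
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature               # (batch, batch) cosine similarities
    labels = torch.arange(a.shape[0])            # the i-th audio matches the i-th text
    return (F.cross_entropy(logits, labels) +
            F.cross_entropy(logits.T, labels)) / 2
```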
So maybe a little bit of extra detail here that I'll go through a little fast. Fundamentally, we're building encoders and decoders using SoundStream as our language and using MuLan to handle the text-to-music part. And then inference is quite simple. We are going to decode using SoundStream. And crucially here, as long as we have a MuLan text embedding to condition on, we're in good shape.
And that MuLan text embedding could come from a string like, play me some hillbilly banjo. Or it could be an audio file. And you can calculate your MuLan embedding from that. So you have two different ways to control things.
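A hedged sketch of that two-way conditioning at inference time: whichever kind of prompt you have is reduced to a single MuLan embedding, the language model generates audio tokens conditioned on it, and SoundStream decodes the tokens back into a waveform. All of the object and method names here are placeholders, not the real MusicLM API.

```python
def generate_music(prompt, mulan, music_lm, soundstream, seconds=30):
    """Illustrative MusicLM-style inference path.
    prompt: either a text string or an audio waveform used as the conditioning signal."""
    if isinstance(prompt, str):
        cond = mulan.embed_text(prompt)      # e.g. "play me some hillbilly banjo"
    else:
        cond = mulan.embed_audio(prompt)     # or a clip you hum, whistle, or upload
    tokens = music_lm.sample(cond=cond, seconds=seconds)   # discrete audio tokens
    return soundstream.decode(tokens)        # tokens back to a waveform
```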
So here's an example of what it sounds like. And remember, this is audio now. The line there is where the prefix audio ends-- the prompt is prefix audio, so that was used to calculate the MuLan embedding. And then it continues from there.
[AUDIO PLAYBACK]
There's still some artifacts. It's like 16 kilohertz audio. And it sounds better than that, though. I need to spend the time to make all these clips the same volume. Sorry. Yeah?
AUDIENCE: Like the prompted video, can you inject a little bit of-- can you start out with classical music, then move on to rock, then move--
DOUGLAS ECK: Oh, yes. Definitely. Definitely. And we're playing with these. The nice thing-- so this question hits at-- I think a lot of the research that we need to do going forward in this space to build useful tools is more about clever ways to condition the model user interface design, allowing users to be able to state their intent in different, fun, clever ways. And it really does matter. I know it's getting late, so I want to get through these.
I'm-- these are all available online. So I have other demos I want to show you. I'll do one of them for you. This jazz song has a memorable saxophone solo.
[AUDIO PLAYBACK]
Wasn't really a memorable saxophone solo. There's cherry-picking, right? All right. But come back to those later, because I have a couple that I want to show you, too.
Just in general, you can play with these online. Actually, is this out? Some version of this demo is out. That's a-- I screenshotted our internal one. That's an internal go link.
But I believe the SingSong demo is out somewhere. Let's do one more here. Fingerstyle guitar. Oh. One more thing. Oh, I'm sorry. We jumped.
We made a jump. That was MusicLM. SingSong is using MusicLM. The paper is out. I think I'm one slide off here. SingSong is MusicLM, but it is conditioned on your voice.
So you sing, and it tries to build a performance around your voice. So it's kind of like karaoke in reverse. Instead of-- you sing and it plays the music. Right? And so when you listen to these melodies, it's the same. It's the same source melody.
[AUDIO PLAYBACK]
So that's the same. Here's the same melody but with a different prompt.
[AUDIO PLAYBACK]
But honestly, stylistically it kind of kept that descending pentatonic scale and added something to it. And then--
AUDIENCE: experiments.withgoogle.com.
DOUGLAS ECK: Thank you. Thank you. So here's-- actually, there are more examples here. I want to-- I'm just going to-- let's see. What is this one? I forget what this one is.
[AUDIO PLAYBACK]
- But it's all over when we are amazed. Oh, how do I know that we're all still the same. Well, I chewed it over.
[END PLAYBACK]
DOUGLAS ECK: And now we'll put that vocal back in. It sounds nice to have it in there, right? But the vocal is just provided by the singer. And we'll get some accompaniment to it.
[AUDIO PLAYBACK]
- But it's all over when we are amazed. Oh, how do I know that--
[END PLAYBACK]
DOUGLAS ECK: OK. You guys can laugh. You laugh now. You laugh now.
So that's it for the fun part. No. AI principles are fun. Come on. Let's get around the campfire. We're going to sing this.
No, I don't want to make fun of the principles. I think it's important that academia and companies and the government work together and find the right way to make this technology roll out in ways that are beneficial. And I think it's easy to screw up. But I think it's also reasonably possible to get it right. So look at the kinds of things that we're saying we want to do: be socially beneficial.
Pay close attention to bias, trust and safety, accountability, privacy, and scientific excellence, and then really only make the technology available for uses that are in accord with these principles. You know, we've laid this out much more in depth. And I'll give credit: Microsoft has done a great job in this space.
I'm not calling out anybody. Anthropic has been really active in this space. There are lots of great academic groups that are working in this space. And I think it's really important to really pay attention to these principles.
I'd also point out that-- it's decoupled now. There it goes. These principles get built into the engineering of these pipelines. So you're filtering queries. You're filtering copyrighted audio out of your training data. And we have that ability at YouTube.
Again, there are filters after you generate, in case you've overfit and memorized something. Flagging and not flagging-- all of this is happening. And I think, again, building some of this technology directly into our generative models, so that we have rights control and other aspects of this handled well, is really important.
And it also shows up in our papers. So these are three big language modeling papers. And they all have fairly well thought out, I think, evaluations and ethics sections.
I'll close with my favorite quote. I'm not going to get to the robotics stuff. But wow, we've really gone a long time. We're way over. OK.
So, what makes art cool with respect to technology? Technology is everywhere in art. Right? Film cameras, pianos, oil paints, synthesizers, whatever-- that's all technology. Even, arguably, the fat mixed with charcoal mixed with some blood for cave paintings is some kind of technological use, right? And we always, as artists, break it. So you put on your artist hat.
And you're like, oh, what am I going to do with this thing? Someone gives you a drum machine, like the 808, and, you know, hip hop is born in the Bronx-- not because they just played the presets, but because they made new rhythms, played around, made things a little grungy. And this is from Brian Eno.
Whatever you now find weird, ugly, uncomfortable, and nasty about a new medium will surely become its signature. CD distortion, the jitteriness of digital video, the crap sound of 8-bit, the distorted guitar sound is the sound of something too loud for the medium supposed to carry it. The blues singer with the cracked voice is the sound of an emotional cry too powerful for the throat that releases it. The excitement of grainy film, of bleached-out black and white, is the excitement of witnessing events too momentous for the medium assigned to record them.
I think that's deeply true. And if there's a quote that embodies what I want, and what we want in Magenta, it's this one: to figure out how to break and push the boundaries of AI. And to do that, we have to work with artists. Because you, as an engineer or as a machine learning researcher, you could be an artist, too.
But it's like changing hats. You have your engineer hat on. You try to make it work. You make-- If you have your engineer hat on, you're making a guitar amplifier. You're trying to make it not distort.
You take off that hat and you put on your guitarist hat. And you're going to see what you can do with it. Kick it and break it and see what you can do with it and make some crazy sounds. I think that's really beautiful.
So with that, I'll stop. Thanks for your patience. I know this was a long talk. I hope it was cool. And we still have time for questions. Or we can go drink. I don't know what we're supposed to do.