Transformative Generative Models
July 5, 2018
July 2, 2018
Prof. Lior Wolf, Tel Aviv University and Facebook AI Research
Abstract: Generative models are constantly improving, thanks to recent contributions in adversarial training, unsupervised learning, and autoregressive models. In this talk, I will describe new generative models in computer vision, voice synthesis, and music.
In music – I will describe the first music translation method to produce convincing results (
In voice synthesis – I will discuss the current state of multi-speaker text to speech (
PRESENTER: It's great to welcome Lior back. He has been here about 10 years ago, for three years or so, as a postdoc of my group after being a student of [INAUDIBLE] in Israel. So it's great, always, to welcome back somebody of the academic family. Lior has been quite active and producing a lot of interesting results, first in computer vision. And in the last few years, in machine learning, his group has been one of the most active in machine learning in Israel. Among other things he had a role in developing face recognition algorithms that are now used in Facebook. And he's a professor at Tel Aviv University and also a research scientist at Facebook Israel.
I've always found his ideas very interesting, and I find that what he will speak about today, which is basically unsupervised learning, quite relevant, especially now that I think he's not committed 100% to [? GANs, ?] but to other approaches as well. I'm not particularly a fan of [? GANs, ?] as you may understand. But I'm very curious to hear what you have to say. So let's start. And I think because of the flight constraint, I suggest to have questions during the talk so you can manage them more efficiently.
By the way, I think we are streaming to an audience of MIT alumni, so welcome to them and to you. And let's start.
LIOR WOLF: Thank you very much, [INAUDIBLE], for the kind introduction. And let's have this completely informal. If you have a question, just interrupt me and ask whatever you want. And I am a big fan of [? GANs. ?] I think there's been a huge evolution in unsupervised learning, and right now it's going through a phase where you can actually see real-world applications of unsupervised learning, not necessarily with [? GANs, ?] maybe with other techniques. So it's an exciting time.
I will talk a little bit about semi-supervised learning, then I will go to supervised. Then I will go to unsupervised. But what's common in all these cases is that we are going to use generative models, models that generate, not just classification, but an image, a video, a voice clip, and so on. So the work that I'm going to talk about today was done within Facebook AI Research in Tel Aviv in collaboration with the team over there, Yaniv, Adam, Eliya and Noam Mor.
And when we started, we didn't have a clear research agenda. But as time went on, it became clear that our agenda is about creating personalized generative models. So when we talk about which models, we are talking about creating your own Bitmoji or your own avatar that looks like you, sounds like you, behaves like you, and so on. And I can also put some music in there. We want to personalize music for your needs.
So I'm going to talk about-- it's going to be pretty varied. I will start with computer vision and computer graphics, our avatar work. I will then move to speech synthesis and from there to music.
So the first task that we tackled at Facebook AI Research in Tel Aviv was the following, how do you create an avatar just for you? So we want an avatar that looks like you. It's very hard to annotate, to create an avatar for each person in the training set. So supervised learning is not really an option. We've thought, well, let's use unsupervised learning without using any matches between the domains. So there is one domain, which is the domain of facial photographs. This is what you see over here. And there is another domain. In this case, those are Bitmoji images.
So there are two visual domains. We would like to map between the two visual domains, and we don't have matching samples. So we call it unsupervised or semi-supervised. Why semi-supervised? Because in order to glue between the domains, we are using a face recognition network. In a face recognition network, the input is an image, the output is a description of the face.
And we pose it as a computational problem: we learn a generative function g such that f of g of x, the face descriptor of what you generate given an input face image x, is similar to f of x, the description of x itself. So we trained under this constraint without having any matching supervision between x and g of x.
And this works. I will show you an example that this works, even though we apply f to g of x, not to x, and it was never trained on images from this domain, from the domain of generated images. So looking at it as a deep learning architecture, we take the input image, we use f. We get f of x. Then we have a generative function g that generates this image.
And the main constraint that we have looks like this. It says that if we apply f to this image obtaining this expression, then it's the same as the f of x that you see over there. So this is the main constraint that we are using. And there are a few other constraints that we have, including the [? GAN ?] that we throw in, and then we are able to achieve what we want. We are able to create the mapping that maps between a face image and an avatar image.
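A minimal sketch of the main constraint just described, f of the generated image staying close to f of x. The linear maps standing in for f and g, the vector sizes, and the squared-distance loss are all assumptions for illustration; in the talk f is a pretrained face recognition network and g is a learned generator that works from f of x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: fixed random linear maps instead of real networks.
W_f = rng.normal(size=(8, 16))   # f: 16-dim "image" -> 8-dim face descriptor
W_g = rng.normal(size=(16, 8))   # g: 8-dim descriptor -> 16-dim "avatar image"

def f(image):
    return W_f @ image

def g(descriptor):
    return W_g @ descriptor

def f_constancy_loss(x, f, g):
    """Squared distance between f(x) and f applied to the generated image."""
    fx = f(x)
    generated = g(fx)              # the generator works from the descriptor f(x)
    return float(np.sum((f(generated) - fx) ** 2))

x = rng.normal(size=16)
loss = f_constancy_loss(x, f, g)   # training would push this toward zero
```

Note that no paired avatar for x appears anywhere in the loss, which is what makes the setup unsupervised with respect to the two image domains.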
So what you see on the left is the input. This is what we asked a human artist to create just for comparing yourself to it, not for training, and this is what our system creates. So what you might notice is that what our system creates is much more identifiable. In fact, if we hid our created image inside 100,000 images, then the rank of our image when we use retrieval is 16. 16 out of 100,000 is quite good for face recognition. So we are able to create images like this that are identifiable. Here is one example.
The problem is that it doesn't look like the original domain. It looks somehow like the original domain. The [? GAN ?] thinks that it looks like the original domain, but you can see it doesn't really look like the original domain. The original domain is built by some sort of an engine that takes as input a vector of parameters.
One parameter is the type of hair. Another parameter is to select the nose out of 10 different nose types. Another one is to select the shape of the chin and so on. So we would like to recover the vector of parameters, not just the image, but also the vector of parameters. And we would like to do this, again, in a way that is unsupervised. We can't collect enough examples to train a supervised system.
So we revisit the problem that we had before. We still have the same constraint that we had before, this constraint, but we also have another constraint. The constraint is that the image that we generate, g of x, is such that there exists a vector of parameters such that applying the graphical engine to this vector of parameters generates this image. So now we have this constraint as well as this constraint.
And we create a network. It's a compound network as well. It's slightly more complex. It has this engine as part of the network. So it's a hybrid network that has learned networks as well [INAUDIBLE] inside, and we're able to achieve the task and create both Bitmoji types of emojis and social 3D avatars, such as the ones that you see here, for every input.
So we started with this [INAUDIBLE] of models in vision. And then we said, this was exactly a year ago, we said that something interesting is going on in voice. And we decided that we would go and work in voice instead of vision without much background. And the first task that we decided to work on is text-to-speech. The input is text. The output is a voice clip that we want to achieve.
And the reason that we realized that this was a good time to get into this domain was a very [INAUDIBLE] Google WaveNet that came at the end of 2016, which was able to generate text to speech, at least for one speaker where you have a lot of training data. But it sounded extremely natural. So this work was a breakthrough in the sense that it was the first one that you can actually hear very natural audio as output using learning techniques that employ neural networks.
So we thought, well, this kind of solves the problem, but there are many other things that we would like to do. One example is we would like to deal with multiple speakers. So we would like to have, not just one speaker, but multiple speakers. And we would like to have every single speaker speak, after we record them, for just a short while.
Maybe you just speak two sentences, and then we can hear you speak the text that we want you to. And in addition, we also wanted the property that we call in the wild. In the wild means that we don't record you in a studio. For example, what we do, we take your voice from YouTube videos with a lot of noise in the background and so on. So this is what we call in the wild. We want to sample your voice outside the lab.
Another thing we would like to achieve is to control your intonation. If you are speaking, you don't speak in a single way. You speak in multiple ways depending on your current state, and we want to have control over the intonation that is being produced. And one last thing that we've worked on recently. We want to take somebody, let's say somebody who speaks Hebrew, and we want to make them speak English without the accent, what would he sound like if he was a native speaker in English, all these sorts of language transfer and voice conversions.
So in this case, what we did, we tried many different architectures. At the end, we decided we would invent our own architecture, which we called the phonological loop. So the phonological loop is something from the cognitive literature about producing speech. And the main component is some sort of memory that contains both auditory memory and the current context of the text. So there is a central memory that mixes both the information about the text and about the audio output that is being generated.
So this is the input text. We have an attention mechanism on top of the input text. This is then inserted into this memory over here. And the memory-- there is another network that writes to the output. So at every time point, you write the next frame in time for the audio.
And, of course, there are multiple speakers, and each one of them speaks in their own voice. The special thing about this memory, it behaves like a buffer. So it's a fixed-length buffer. Every time point you insert a new vector, and you pop out the first vector that was inserted. So it's a buffer that keeps on updating as you speak, mixing both the text and what you can hear, the output of the system.
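A minimal sketch of the buffer behavior described above: a fixed-length memory where each time step inserts one new vector and drops the oldest one. The class name `BufferMemory` and the concatenation used to mix text context with the previous audio frame are assumptions for illustration; the actual system uses learned networks for this step.

```python
from collections import deque
import numpy as np

class BufferMemory:
    """Fixed-length FIFO memory: each step inserts one vector and drops the oldest."""

    def __init__(self, length, dim):
        self.slots = deque([np.zeros(dim) for _ in range(length)], maxlen=length)

    def step(self, text_context, previous_audio_frame):
        # Hypothetical mixing rule: concatenate the attended text context and
        # the previous output frame (a learned network in the real system).
        new_vector = np.concatenate([text_context, previous_audio_frame])
        self.slots.append(new_vector)   # deque(maxlen=...) discards the oldest entry
        return np.stack(self.slots)     # shape: (length, dim)

memory = BufferMemory(length=3, dim=4)
for t in range(5):
    state = memory.step(np.full(2, float(t)), np.full(2, -float(t)))
# After five steps only the vectors from steps 2, 3 and 4 remain in the buffer.
```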
And what you get as a result is a system that [INAUDIBLE] voices that are highly identifiable. If you train another network to judge whether the voice belongs to the original speaker, then it can, with very, very high accuracy, identify who is the original speaker. And then if you look at the academic benchmarks, which are based on the single speaker-- and what's nice about the system, it's the only multi-speaker system out there whose code is available as open source code.
SYNTHESIZED VOICE: My code is available and you can, in fact, train me from scratch.
LIOR WOLF: And people train it on multiple different languages, Korean, and French, and so on. So it's really nice to see the nice community on GitHub around this code.
As promised, we can sample people in the wild, in these [INAUDIBLE] videos. So imagine election type of videos, where there's a lot of noise, clapping, multiple speakers. Still, we are able to get the main speaker.
SYNTHESIZED VOICE: I am a neural network designed by Facebook.
SYNTHESIZED VOICE: I am neural network designed by Facebook.
SYNTHESIZED VOICE: I am a neural network designed by Facebook.
SYNTHESIZED VOICE: I am a neural network designed by Facebook.
SYNTHESIZED VOICE: I am a neural network designed by Facebook.
LIOR WOLF: Let's move to the next slide.
And we can control the intonation.
SYNTHESIZED VOICE: I am looking for a job as a digital assistant.
SYNTHESIZED VOICE: I am looking for a job as a digital assistant.
SYNTHESIZED VOICE: I am looking for a job as a digital assistant.
LIOR WOLF: So those are extreme cases. On purpose, we are taking extreme cases. And the way that we actually create these intonations is by priming, which means that we let the person speak quietly some sentence and then paste the sentence that we want him to speak. So this is kind of a warm start. And if the sentence is such that it elicits emotions-- for example, if it contains a lot of "I," then the person becomes very much excited, and you can see this type of very excited intonation.
AUDIENCE: So, Lior, years ago, [INAUDIBLE] has this [INAUDIBLE] that produces in [INAUDIBLE] video. You decided to do the same thing for speech and not [INAUDIBLE]--
LIOR WOLF: I was telling the team this story so many times. There was MIT working with Tony, and this other guy, Tony, did perfect video that he could control the lips. The video was perfect. The audio was very difficult. He invested so much time and really tried hard, but could not mimic somebody else's voice.
And this is why it was so exciting for me 10, 12 years later to come back to the problem and being able to do something that solves this. And many of the things that you see here, they are definitely due to the advent of deep learning methods. And what we can do now is things that 10 years ago were just way, way too difficult.
AUDIENCE: So, Lior, at that time, we had, say, video [INAUDIBLE] singing contemporary song [INAUDIBLE] the same with the voice?
LIOR WOLF: Yes. I will present music in a second. So we can definitely now take one singer and move the voice to another singer. This is what I'm going to show next, not with singing yet.
And this is the last slide that I'm going to have on voice, so if you have a question this is the time. What we recently did for ICML was [INAUDIBLE] component that really makes the sampling of the voice much shorter. So you can speak one or two sentences. We will capture your voice. This is by creating a feed-forward network that takes the voice sample, and it creates the embedding of the speaker.
So I will now move to music. So any questions [INAUDIBLE]? Yes, please.
AUDIENCE: So the key here is that you have to input. So you have some sort of-- let's say you're recording an original sample of the voice donator, if you will, and then what you're doing is you're continuously updating it by having more than one user say the same phrase over and over? Like, are these voices being-- is it taking the, so to speak, the average sound or something like that?
LIOR WOLF: Not really. This is a supervised learning problem. You have many speakers with many [INAUDIBLE] voices. The problem is it's not aligned. So you have an attention mechanism that aligns between the two. And once you learn enough speakers, the system is able to generalize and capture new speakers.
And the other question, yes? What's your [INAUDIBLE]
AUDIENCE: Are you generating a time frame, frequency domain, [INAUDIBLE] WaveNet, [INAUDIBLE]?
LIOR WOLF: So in this case, it's a vocoder. When we did this work, it was purely based on [INAUDIBLE] vocoders. And we are now shifting-- based on the work that you will see next, we are shifting more into working with WaveNets on the actual audio.
Any other questions?
AUDIENCE: Yes, please. [INAUDIBLE] So can the ideas be applied for speech recognition, speech-to-text?
LIOR WOLF: So we tried, at one point, to apply this phonological loop, this buffer-type memory, to other problems, where normally you would use [? GANs or ?] [INAUDIBLE] and we got very similar results or slightly better. And we are not sure that in other problems the advantage would be as clear as in this case, but it could be. And for speech-to-text, we did not try.
Moving on to new topic.
So now I'm going to go back into unsupervised learning, and I will discuss another domain, a domain of music generation. So before I start, let me just give some credit to the YouTube channel of [INAUDIBLE]. Because when I presented these slides, I already saw the slides that were created by [INAUDIBLE], so I was highly influenced by the slides that this YouTube channel created to explain our work. So many of my slides are very much the same as the slides they use over there.
So what is the goal in this case? The goal is to take one musical instrument, for example, and convert it into another one. So we are not using any notes in the middle. We directly convert audio to audio. And the nice thing about it is that it works for unseen instruments, so this is why we say it's universal. So once you train the system, you can use as input any new musical instrument that you would like to use, even if it was not used during training. So it generalizes to new domains.
So during training I give examples from, let's say, five different musical instruments. And it learns how to generate the same five musical instruments. Then after the network is trained, I can give it a new musical instrument-- for example, this flute, it never saw it before, never heard it before, and it would be able to generate in any one of these instruments that it saw during training.
So this is actually a universal way to mimic music. And the reason that this is so exciting, at least for me, is that humans can, of course, do it. You can hear something and then whistle, or hum, or clap, or play it if you are trained. And, of course, many animals-- songbirds can do it and so on, but no computer can do it. This system is one of the first ones, if not the first one, to present such a solution.
So let's define the task just a little bit. We're not translating between musical instruments. We're doing something a little bit more elaborate. We are translating between musical domains. Musical domain, in this case, is basically a CD. So if we have a CD of Mozart orchestra, this is one domain. And then domain b can be some opera singing of Beethoven's songs or anything like this. And as I said, there aren't any other systems that can present such an ability in a convincing way. The [INAUDIBLE] team tried to figure out how they would solve it.
So one solution would be, maybe, to take the music and transcribe the music and then apply some sort of synthesize [INAUDIBLE] in order to play it again. The problem is that if you're thinking about polyphonic music, about music that is just a little bit more complex than playing one note at a time, then there isn't any very successful music transcription system. So humans can transcribe music. This is something that a trained musician can do. It's a human ability. They can do it. Computers, so far, cannot do it. And this is, in fact, what we are working on right now.
Another approach the [INAUDIBLE] guys tried to suggest is maybe use a translation network. So essentially, this is a problem of translating between one sequence to another sequence, which is exactly what you would do in natural language translation. You would take the original input, and you fit it into some sort of a recurrent neural network. You would then open it again as a sequence in another domain.
What is the main problem in employing such a method? If we try to employ such a method, I would need supervised samples. I would need the same sample here as the input and here as the output. And this is almost impossible to collect. I know it because we have been collecting samples for validation. It's almost impossible to collect such samples. And another thing is that music, by itself, is not just a musical instrument.
The domains are much richer. It's a specific composer being played by a specific orchestra on a specific set of musical instruments, and not even a single instrument, but multiple instruments playing together. So it's very hard to model this explicitly, very hard to collect data. So what [INAUDIBLE] is that we will employ unsupervised learning.
So the only way you could solve such a problem, because it's a difficult problem, is by employing unsupervised learning. As the workhorse that we employ in this work, there is something that is called a WaveNet autoencoder. It's from a 2017 paper by Google. And it basically employs a WaveNet decoder.
And what we call a WaveNet encoder, it's actually a convolutional neural network that goes over the input. So there is the input in one domain, and then there is the WaveNet encoder, and then there is some sort of latent space. This is the representation you get from this, and it's used to condition. It's not the actual input.
The WaveNet is an autoregressive model, which means that every [INAUDIBLE], the input is the previous outputs, but it's conditioned. It's the second input that conditions the generation process, and every time it generates the next point in time. So this is called a WaveNet encoder-decoder, a WaveNet autoencoder, and it's an autoregressive model in the sense that at every time point, it just creates the next point in time. And then it uses it as the input when it continues to sample, this way, more and more audio.
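A minimal sketch of the autoregressive loop described above, where each new sample depends on the previous outputs plus a conditioning signal and is then fed back as input. The `toy_decoder` function is a hypothetical placeholder, not a WaveNet: a real WaveNet decoder predicts a distribution over quantized sample values with dilated convolutions and samples from it.

```python
import numpy as np

def generate(next_sample_fn, conditioning, seed_samples, num_steps):
    """Autoregressive loop: each new sample is computed from earlier outputs
    plus a conditioning value, then fed back in as input."""
    samples = list(seed_samples)
    window = len(seed_samples)                    # toy receptive field
    for t in range(num_steps):
        context = np.array(samples[-window:])     # most recent outputs
        samples.append(next_sample_fn(context, conditioning[t]))
    return np.array(samples[window:])

# Hypothetical toy "decoder": mean of the context plus the conditioning value.
def toy_decoder(context, cond):
    return float(context.mean() + cond)

out = generate(toy_decoder, conditioning=np.zeros(4),
               seed_samples=[1.0, 3.0], num_steps=4)
```

In the music system the conditioning values would come from the encoder's latent representation rather than from a vector of zeros.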
So same thing, different view. We have the input domain. We encode it. We get some sort of a feature vector. There is a feature space that encodes the input music, and then we decode it and get the next value in the musical piece. So in this case, this is what was done in 2017, you start with one instrument, and you end up with the same instrument. We generalize it to multiple instruments with a universal encoder. Our architecture looks like this. We have one encoder and multiple decoders.
So we say music is universal. There should be only one encoder. Whatever you play to it, it is able to put it in some latent representation space. And then since it's universal, once you train it, then you can take a new instrument, play it to the network, and you can decode through each one of the decoders to each of these instruments.
Any questions up to this point?
AUDIENCE: Is this published?
LIOR WOLF: Our work is published. Source code will be out soon. It's published on arXiv. It's submitted. But you can read it right now, and the source code will very soon be out.
So let's look at the latent space. What we would like to have is a universal latent space. And if it's universal, it should be invariant, meaning that it should not just memorize the input. It should put everything in a way that regardless of the input domain, regardless of the input instrument, the representation would only depend on the actual music that is being played. So this is actually the main battle that we have to fight. So we apply two different techniques in order to tackle this.
One of them is from the domain of the [INAUDIBLE] literature by [INAUDIBLE] in 2016. We use what is called a domain confusion network. A domain confusion network is a network that, given the latent representation, tries to figure out what is the input domain. So it wants to classify during training; train, maybe, on six different domains. It wants to tell you which domain is the domain from which the music actually came from. This is the domain confusion network.
And the reason that we use this domain confusion network is because we want the encoder to generate vectors that the domain confusion network would fail on. So the encoder's task is to make life hard for the domain confusion network. The domain confusion network [INAUDIBLE] differentiate between the instruments, tries to do the best job it can in telling this latent representation came from this domain, this latent representation came from that domain. The encoder would like to generate something that is general, that is invariant to the input, but still contains the information needed in order to reconstruct the signal.
So this is one technique we use. The other technique that we use is basically distorting the input of the network. So we deliberately add noise to the input. I will talk in a second about this noise. So instead of the original input, we have some sort of random distortion, but, still, we demand from the network to reconstruct the original signal. By doing so, we make the network work harder. This is some sort of a denoising autoencoder. We want the network to learn the principles behind the music.
The type of noise that we add is basically taking the music and making it slightly out of tune. So we create off tune, and we are telling the network it's your task to give us back the correct music, the music as it should sound. So we are taking random segments. And for each random segment, we apply pitch shift by a random amount in order to obtain the perturbed input.
But, still, the network has to reconstruct, not the distorted input, it has to reconstruct the input itself. And the reason that we are doing it is that we want to avoid memorization. So we would like to have the network think in abstract terms about the music, not just to memorize the input, not to act like a vocoder. A vocoder compresses the input, keeps it, and then you are able to work on it. We want the network to understand the underlying principles and not just memorize what the input is.
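A minimal sketch of the perturbation described above: pick random segments and detune each by a random amount, while keeping the clean signal as the reconstruction target. The function `perturb_pitch`, the segment length, the number of segments, and the shift range are all assumptions, and the resampling trick is a crude stand-in for a proper pitch shifter (it also slightly warps timing inside the segment).

```python
import numpy as np

def perturb_pitch(signal, rng, num_segments=2, max_semitones=0.5, segment_len=400):
    """Return a copy of `signal` in which random segments are locally detuned."""
    out = signal.copy()
    n = len(signal)
    for _ in range(num_segments):
        start = int(rng.integers(0, n - segment_len))
        semitones = rng.uniform(-max_semitones, max_semitones)
        rate = 2.0 ** (semitones / 12.0)          # playback-rate change for this shift
        # Re-read the segment at a slightly different rate (linear interpolation).
        positions = np.clip(start + rate * np.arange(segment_len), 0, n - 1)
        out[start:start + segment_len] = np.interp(positions, np.arange(n), signal)
    return out

rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clean = np.sin(2 * np.pi * 440.0 * t)             # one second of a 440 Hz tone
noisy_input = perturb_pitch(clean, rng)
# Training pair: the network receives `noisy_input` but must reconstruct `clean`.
```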
So the entire process, as I said before, it's an autoregressive process, so we have some vector of measurements in the input space, we represent it in the latent space, and then recreate the next point in time. During training, we compare the next point in time to what we know should be the next point in time in the input, and then we append the original point in time and create a new [INAUDIBLE] and continue this way.
So note that I'm actually using here the actual input, not the one that is predicted. So during training, I'm using the actual input. This is what is called [INAUDIBLE] it's very common when training recurrent neural networks. We did it in the voice system as well. It enables you to train the system much faster.
Looking at the loss of the system, we have the sample sj; we have the encoder E, the decoder Dj for the various domains. j equals 1 is one domain, j equals 2 is another domain, and we have the domain confusion network C. There are basically two loss terms. The first one is saying that if you take the original sample sj, you perturb it using the process O, you encode it, and then decode it using the decoder j (during training, we only decode using the correct decoder because we need to compute the loss), then you get the original sample back, sj. So you compute the distance between what you generated and sj. [INAUDIBLE]
The second term is the domain confusion loss. You take a sample, you perturb it, you apply the encoder, and then you try to classify it. You want to make the classification fail. The job of the encoder here is to maximize this loss.
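Written out, the two loss terms described above can be sketched as follows. The symbols follow the talk: sample s^j from domain j, perturbation O, shared encoder E, per-domain decoder D^j, and domain confusion network C. The generic loss function L, the number of domains k, and the weight lambda that balances the two terms are assumptions not stated in the talk.

```latex
% Encoder and decoders: reconstruct the clean sample from the perturbed one,
% while making the domain classification of the latent code fail.
\[
\min_{E,\,D^1,\dots,D^k} \;
\sum_{j} \sum_{s^j}
  \mathcal{L}\bigl(D^j(E(O(s^j))),\, s^j\bigr)
  \;-\; \lambda\, \mathcal{L}\bigl(C(E(O(s^j))),\, j\bigr)
\]

% Domain confusion network: trained against the encoder to recover the domain j.
\[
\min_{C} \;
\sum_{j} \sum_{s^j}
  \mathcal{L}\bigl(C(E(O(s^j))),\, j\bigr)
\]
```

The minus sign in the first objective is what makes the encoder maximize the classification loss, matching the statement that the encoder's job is to make the classifier fail.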
Any question up to this point?
AUDIENCE: Will you say something about the network [INAUDIBLE]?
LIOR WOLF: Yes. So the encoder is a convolutional neural network, and the decoder is a WaveNet. So the WaveNet that we took was basically the WaveNet from the literature. But then we wanted to make it faster, so Nvidia released their own implementation of WaveNet with a slightly different architecture. We shifted into this architecture simply because if we are using the original WaveNet implementation, generating even 10 seconds of music can take 10 minutes. If we are using Nvidia's architecture of WaveNet, then it's highly optimized, and we can generate in much shorter times, even close to real time.
So let's talk about inference. So given some musical domain, I encode, I then decode it using the same decoder, and I get back the input audio. This is simple autoencoding. This is what we did during training. But during test, I actually do more. I encode the one input domain. I can decode with each one of the decoders that I trained.
So this is task number one, just decode into one of the training domains. And then there is task number two. I can look outside the training set into a new musical instrument, let's say a flute, and I can decode it in each one of these domains.
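A minimal sketch of the inference setup described above: one shared encoder and one decoder per training domain, so any input, seen or unseen, can be decoded into every training domain. The class `UniversalTranslator`, the domain names, and the toy encoder and decoders are hypothetical placeholders for the trained networks.

```python
import numpy as np

class UniversalTranslator:
    """One shared encoder, one decoder per training domain."""

    def __init__(self, encoder, decoders):
        self.encoder = encoder      # callable: audio -> latent
        self.decoders = decoders    # dict: domain name -> callable latent -> audio

    def translate(self, audio, target_domain):
        # The input may come from any domain, including one unseen in training.
        return self.decoders[target_domain](self.encoder(audio))

    def translate_to_all(self, audio):
        latent = self.encoder(audio)            # encode once, decode many times
        return {name: dec(latent) for name, dec in self.decoders.items()}

# Toy placeholders: downsample to "encode", upsample (optionally negated) to "decode".
translator = UniversalTranslator(
    encoder=lambda audio: audio[::2],
    decoders={
        "domain_a": lambda z: np.repeat(z, 2),
        "domain_b": lambda z: -np.repeat(z, 2),
    },
)
unseen_input = np.arange(8.0)                   # stands in for, say, a flute clip
outputs = translator.translate_to_all(unseen_input)
```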
So I think it's time for us to hear a little bit of the results. Those are the six domains that we were using.
So this is the input.
This is the output with some singing for example.
SYNTHESIZED VOICE: (SINGING IN ITALIAN)
LIOR WOLF: Another input.
This is out of the training set. We didn't have any guitar over there.
This is way out of the training set.
Somebody is hitting the left side of the piano.
I suggest we stop here. Any questions about this? Yes, please.
LIOR WOLF: So what do you mean by multiple inputs?
LIOR WOLF: So the inputs are very complex. It does not really mind if it's polyphonic, if there are multiple instruments, if there is singing together with some instruments. It can take all of these, extract the dominant-- I don't want to say theme. I don't want to say melody. All of these have specific meanings. But you can extract the dominant music and then translate it to another domain, and you can find many samples like this. We are not limiting in any way the input.
AUDIENCE: Have you tried doing something like [INAUDIBLE] in the latent [INAUDIBLE]?
LIOR WOLF: Yes, it's going to be the next slide, yes. Yes, please.
AUDIENCE: When giving an unseen domain to the shared encoder, how many domains is it necessary to train with for a good quality? Is one sufficient? Do you get diminishing results after--
LIOR WOLF: So what you see here is the results we published so far based on the six musical domains that they listed. Granted, we did not try 3, and we did not try 20. We do have experience with the 10 that we trained later on. It works just as well. There is still much room for making further experiments. Definitely, one wouldn't be enough. Six is enough, which is a relatively low number, so it's a bit surprising. Maybe three would be enough. We still don't know. This is an excellent question.
Another question, yes?
AUDIENCE: I'd be curious to hear a summary-- maybe you have [INAUDIBLE]-- but a summarization of [? your take on ?] what is actually being learned and generalized. So it's clearly, clearly pitch. The notion of pitch is clearly being extracted and generalized.
For the decoders, the plausible sounds that the instruments in a particular domain could possibly make are not fully being learned because the physically impossible sound [INAUDIBLE]. I am curious as to what you think what is and isn't actually being learned.
LIOR WOLF: So first of all, the output domain is captured relatively well by the WaveNet. The WaveNet is very good at capturing the output domain. So what you get sounds like the actual domain. I'm not going to share numerical results, but we did create-- in the paper, you can see many tables. Humans are asked to listen whether it sounds authentic or not, and compared to human musicians that do the same task is it authentic, less authentic, and so on. So we have all these kind of experiments in the paper.
Now you mentioned pitch. So just to make this point, the pitch in the input and the pitch in the output is not the same. Different musical instruments have different ranges of pitch. Music is played in a different way, and it's not just preserving the pitch. But we do have some experiments on converting very simple music and showing that it preserves the pitch just to show a numeric result that it does something reasonable.
But in general, it does much more than just preserve the pitch. And [INAUDIBLE] the pitch by itself, if you look at software to extract pitch, it's highly unreliable. There is no software that you can use today that extracts pitch in a way that is reliable and that you can trust. There are all sorts of approximations in the computational [INAUDIBLE] literature. Even this by itself is a task that cannot be solved very reliably at the moment.
So let me make other comments. What we played here is basically what we get out of the network without any filtering, without trying to change. For example, the whistle result is me whistling into WhatsApp while trying not to laugh. And you can hear that there is overexposure over there. We didn't try to clean the overexposure. And you [INAUDIBLE] hear that I'm whistling a little bit off tune. It doesn't really matter. The network still is able to perform quite robustly and produce a desirable outcome.
So let's think about the additivity of the latent space. Because [INAUDIBLE] we know that if we have a latent representation that is additive, then we are doing something right. For example, if you look at the [INAUDIBLE]. And think about the work of a human DJ. So a human DJ has to connect two musical pieces together. They have to very carefully make the rhythm match, the harmony, the pitch, connect at the right place, and so on. We can actually achieve this just by the linearity of the latent space. So this is one piece.
This is another.
So now I'm going to mix between the two linearly. I'm going to start with A, shift linearly to B, and continue with B. And it's going to be successful if there's not-- it's not going to be noticeable where the shift is.
So, at least to me, it sounds natural first shifting between two very different musical pieces. But, it does so in a way that is very smooth, which means that probably the latent space has this property of linearity, which allows us to explore it in a natural way.
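A minimal sketch of the linear mix described above: start with the latent code of piece A, shift linearly to the latent code of piece B, and continue with B. The function `crossfade_latents` and the array shapes are assumptions for illustration; in the real system the mixed latent sequence would then be passed to a decoder to produce audio.

```python
import numpy as np

def crossfade_latents(latent_a, latent_b, fade_start, fade_end):
    """Play A, blend linearly from A to B between fade_start and fade_end, then play B."""
    steps = np.arange(latent_a.shape[0])
    weight_b = np.clip((steps - fade_start) / (fade_end - fade_start), 0.0, 1.0)
    weight_b = weight_b[:, None]                  # broadcast over latent channels
    return (1.0 - weight_b) * latent_a + weight_b * latent_b

latent_a = np.zeros((10, 3))                      # (time steps, latent channels)
latent_b = np.ones((10, 3))
mixed = crossfade_latents(latent_a, latent_b, fade_start=2, fade_end=6)
```

If the latent space were not close to linear in this sense, the blended codes in the middle of the fade would decode to audio that does not sound like music at all.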
So let me summarize what we've discussed today.
So we discussed three different generative models that are good enough to be employed in applications. One of them is how to generate the vector of parameters you need in order to capture a face as an avatar. The second one is how to create a personalized text-to-speech network that is able to mimic a voice even after listening to a very short sample that has some noise inside of it. And the third one is a music translation network that can take an audio file and produce an audio output in a completely different musical domain. That's it. Thank you very much.