Getting started with TensorFlow 2.0 tutorial
August 21, 2019
August 15, 2019
Brains, Minds and Machines Summer Course 2019
Josh Gordon, Google
goo.gle/mbl-slides or CBMM server
JOSH GORDON: So I'm Josh from Google. And today we're going to do a getting started with TensorFlow 2.0.
Just while I'm setting up, could you do me a favor and raise your hand if you're taking a machine learning class? Can be academic or online. Awesome. How about a deep learning class? Half. OK, so I know a couple people haven't taken a machine learning class, but most have. So I'll assume that you're familiar with machine learning but you're relatively new to deep learning. We'll come back to that.
All right. So let's see. Today we're going to talk about TensorFlow 2. Here's what we're going to try and get through as much as we can. So I will talk for like five minutes. And then I'll stop talking. And you can do a quick exercise. And as always, we're going to start with MNIST because, why not? After that, we'll look at convolution. I have a lot more to say about convolution because I find it much more interesting.
When you're learning deep learning, oftentimes you'll start by looking at these ridiculous pictures of these fully connected deep neural networks with a stack of dense layers. And personally, when I see something like that, I have approximately zero intuition for how and why they work. So that sucks. And it's horrible.
But if you look, surprisingly, at a convolutional neural network, which sounds much fancier, I find it much more intuitive. Deep learning is, unfortunately, concept soup. And the state of deep learning right now is that you can write a deep neural network in five minutes. I'll show you how to do that. Actually, less than five minutes.
But it can take six plus months to really get familiar with all the concepts. But this is a good thing. And it means you can spend more of your time thinking and less of your time messing around with the frameworks. So that's great.
And then, assuming we get through that, I will talk through some more advanced stuff. Some of my favorite examples-- Deep Dream, Style Transfer, time series, stuff like that.
All right. So I will come back to-- actually, let me just talk about what deep learning is. So here's a picture that I pulled from Wikipedia, of where we are. And I ran it through our latest tutorial for a Deep Dream. And this is that photo Deep Dream-ified. So if you take a look at this for a second, what do you see occurring in the Deep Dream-ified? And then I'll explain how this relates to the fundamental ideas behind deep learning.
I know this is a bit of a random aside. But I wanted to start by talking about something a little bit more interesting than MNIST. So what do you see in this photograph that wasn't here before? Yeah. There's a beaver. Yeah, there's lots of cute little beavers. And there's quadra eyed beavers. There's sheep dogs, eyes everywhere, peacocks, snakes, stuff like that.
So deep learning is representation learning. And let me explain what that means. When I started studying machine learning, the models that I learned about were decision trees. And I absolutely love decision trees because if you train a tree and you ask the question, "How is it that this tree is classifying a piece of data?" you can print out the tree and read the rules. It's awesome. Really, really important.
So for those of you who've taken a machine learning class, think about what would happen if you tried to classify a photograph like this using a decision tree. The features that the tree can look at are going to be the pixels.
And so that means if you're the root node in the tree, you'll find whatever pixel in your training set happens to be the most informative to split the data. And you'll ask a really silly question. You'll say like, "If pixel intensity is greater than 128, then ask about the next pixel intensity."
And on a 1,000 by 1,000 by three image-- three because there's three color channels, red, green, and blue-- you have three million features, none of which are really informative at all. And so if you think about how wide and how deep the decision tree that you might train to classify an image is, it's useless. It doesn't mean anything.
What deep learning does is you basically-- I'm going to fast forward like 80 slides just to show the idea that I want to talk about here. What deep learning in a nutshell is and why I like talking about convolution. This is what deep learning does.
So basically, the way I like to think of it is, what we're looking at here is this is a deep convolutional neural network. And we'll come back to this.
But the way I like to think about deep learning is there's two parts. The first part is what you see on the bottom here. And this is what you would cover in a machine learning class. And here, this is a schematic for multi-class logistic regression. What each of these little cubes represents is a feature. And if we were working with raw image data, these features would be pixels. And they're fully connected to an output layer. And you can imagine maybe this output node or neuron is collecting evidence that it's a cat, it's a dog, it's a sheep, whatever.
So this is just a multi-class logistic regression unit. What the deep in deep learning does is the convolutional base that we're looking at above this. What the base does, it's a series of convolutional layers, in this case, or a series of dense layers, which I hate, in other cases. And what they're doing is they're looking at the raw pixels from the input image. And they're extracting features as they go.
So the purpose of the first layer is to transform pixels to edges. The second layer, from edges to textures, textures to more complex textures, and so on, and so forth. What this means is that by the time you're training a logistic regression model, it's no longer taking the pixels as input. Instead, these are high level features. And they're high level features that were automatically learned from the data.
So deep learning learns a representation of the data that you can classify with a linear layer, in a nutshell. So other words for deep learning are automatic feature engineering or representation learning. And unlike in traditional machine learning, where 10 years ago you might have come up with features like shapes and textures using a library like OpenCV, or you would've written a whole bunch of giant Python pre-processing scripts, you can learn all these features automatically.
And the reason I-- I'm going to flip back like 50 slides, which you should never do in a presentation. So flipping back 50 slides, the reason Deep Dream is interesting, the reason we can modify this image to make all these psychedelic shapes appear, is because we begin with an image classifier. And this is an experiment where we're asking the classifier to show us the type of features that it's learned from data. But we'll come back to that.
Anyway, TensorFlow is an open source machine learning library. For the purposes of your research, there are many awesome open source machine learning libraries. The truth is learning one is hard. After you've learned one, learning multiple gets easier.
What's nice about TensorFlow 2, which is what I work on and I'll talk about today, is you can very, very roughly think of TensorFlow 2 as Keras, plus PyTorch, plus a lot of other awesome stuff. And the reason I teach TensorFlow in my classes at night is not because I work for Google. It's because when students learn how to use TensorFlow, it's easier for them to branch to wherever. So it's a good place to start.
Here are some resources for you. We're a few weeks away from releasing TensorFlow 2. It's in beta right now. Our website is tensorflow.org/beta. And you should skip everything else on the website that's not there. For news and updates, we have a blog and a Twitter. I'll share these slides afterwards, by the way. So you don't have to write everything down.
And then what I wanted to mention, too, we're going to use Python today. But machine learning is very rapidly branching out beyond Python. And I was totally wrong about this when I was first introduced to this idea.
I'm going to point it at you. And I'm probably going to accidentally unplug things. But if I manage not to screw this up, this should be really cool.
And there's other things you can do, too. This, I believe, only works for one person. But here we're getting a part map. So there's a lot of value in running things in the browser. I just wanted to mention that as an FYI. It's not just Python anymore.
I'm going to blaze through this. Also, FYI, another thing you can do is you can train models in Python and deploy them on iOS, Android, Raspberry Pi, embedded devices, whatever.
I don't have a slide for this. But let me just tell you. Briefly, TensorFlow 2 is a C++ engine. Any time you write your code in Python, what happens is behind the scenes that code is accelerated by C++. The reason this is important is it's easier to write your code in Python.
But also, a lot of the time, when you run your model, you don't want to run it using a Python interpreter. You might want to run it on a phone or in a browser. And so TensorFlow gives you a way to save your models in a machine independent format that lets you deploy it where you want. So that's valuable.
Anyway, this is the picture that I hate. But what we're looking at here is this is a fully connected deep neural network. And we're looking at a series of three dense layers. I'll break this down in a bit. Each dense layer is taking a linear combination of the input features and some weights, and applying a non-linearity, and forwards that result to the next layer.
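As a rough illustration of what one dense layer computes, here is a minimal NumPy sketch. The shapes (784 inputs, 10 outputs), the random weights, and the choice of ReLU as the non-linearity are assumptions for illustration, not taken from the slide.

```python
import numpy as np

def dense(x, weights, bias):
    """One dense layer: a linear combination of inputs and weights,
    followed by a non-linearity (ReLU is assumed here)."""
    return np.maximum(0.0, x @ weights + bias)

rng = np.random.default_rng(0)
x = rng.random((1, 784))              # one flattened 28x28 image (made-up data)
weights = rng.normal(size=(784, 10))  # one weight per (input pixel, output unit) pair
bias = np.zeros(10)

out = dense(x, weights, bias)         # this result would be forwarded to the next layer
print(out.shape)
```

Stacking several calls like this, each with its own weights, gives the kind of fully connected network shown in the picture.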
Instead of looking at this, because no one has intuition for that, I want to give you intuition for what a single dense layer does. So the data set that I want to talk about is MNIST. And if we Google MNIST really quickly. MNIST, it's an old computer vision data set. It's the hello world of deep learning. There's 60,000 images of digits. The important thing about these is they're 28 by 28 by one. So they're black and white digits.
What I want to show you is what a dense layer does, one dense layer, if you train it to classify MNIST digits. And this is fast. But what we're looking at here, that's our cartoon dense layer. At the top, those would be the pixels from a single image that we're feeding through the network. Each of the gray lines represents a weight. And each of the green nodes represents an output.
Let's imagine that we've trained this dense layer for a long time on all 60,000 digits. And now we ask the question, what is it that the weights are doing that lets us classify the images?
What we're looking at here is a visualization of the learned weights. So this is a fully connected layer, which means there's one weight for every input pixel. So every pixel in the image would be connected to one weight. This guy here would be the upper left input pixel.
And the reason that's an array is, you can imagine, that dense layers can only take arrays as input. So we've unstacked the rows of the image and lined it up into an array. So this might be Pixel 1, Pixel 2, Pixel 3.
And what we're doing here is if-- I'm sorry. The weight for Pixel 1, Pixel 2, Pixel 3. What we're doing here is we've colored the weights. So if a weight is very high, we've colored it in red. And if a weight is very low, we've colored it in blue. And if you visualize the weights, you see this red band around the output for the 0. And that's because there's many different ways to draw zeros. But most people don't cross the center of the image, which is blue.
And so what a single dense layer is doing for every input feature, it's basically assigning one weight that says if this feature is present, how much evidence does that give me that it corresponds to the output class? So a single dense layer is fast. But a single dense layer is something we can interpret.
In terms of writing the code, let me stop talking. And I'll give you an exercise. And let me show you one more thing before we do that. I just want to show you how to write a deep neural network in TensorFlow 2 in like two seconds. Just so you know.
This code right here. We're defining a model. We're adding a single dense layer. That would correspond to that diagram right there. If we wanted to go from a dense layer to a neural network, we would add one line of code like this. And now we have a neural network. We've added a second dense layer. And that's a hidden layer.
If we wanted a deep neural network, we would copy and paste this. And now we have a deep neural network. And now you will have a deeper neural network, and so on, and so forth.
So what I'm trying to communicate here is this part. There's like six months of learning here, like I said earlier. It may be faster for you. But it took me a while to go through all of these concepts.
When you look at this, I see not a lot of code. And I see a lot of things. First of all, let me give you some terminology. There's different ways to define deep neural networks in TensorFlow 2. This is the simplest one. And here, we're saying our network is a stack of layers. That's what it means by sequential. This is great 95 percent of the time.
As it happens, dense layers can only take arrays as input. Flatten is a special layer. It's a pre-processing layer. It basically says, "Give me an image. And I will unroll it into an array." So the output of this layer is an array. Great.
Terminology. This is the depth of the network, the number of the layers that we've added.
The rough intuition. You'll see this with convolution. But, roughly, the more layers you have, the more combinations of features you can detect. So maybe pixels come in, edges, textures.
And then you have logistic regression and classify it based on the textures it's discovered. You also have the width of the network. And that's the number of units or neurons per layer. Here, we've pulled 128 out of a hat. The more units per layer, the more patterns you can detect at that layer. Great.
Lots of hyper parameters here. Designing neural networks for a problem is a bit of an art. And over time, you learn basically good starting points. So I've looked at MNIST for a long time. So I know from experience what designs might work. Unfortunately, we often have to search around a little bit. And to be honest, it's pretty hacky how people do it. If you're working on a more substantial problem, usually the way you get a starting design is you find a paper that's close, use that as a starting point, modify from there.
There's more hyper parameters, too. There's things like the type of activation function, which we'll talk about later. Good news. ReLU is almost always the one you want.
Here, what we're saying is, well, whatever. Softmax is a fancy way of saying give me a probability distribution. So, the output of some network. The way to read this, you start by looking at only two things.
The first is the input. So this network is going to take some data that's 28 by 28. So a square is going in, in this case, an image. And the thing that's coming out-- you can ignore all this junk in the middle-- the thing that's coming out is going to be 10 numbers, all of which range between 0 and 1. And they sum to 1. So basically, this means give me a probability distribution based on some input data.
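A minimal sketch of the kind of model being described, assuming the layers named in the talk (Flatten, a 128-unit ReLU dense layer, and a 10-unit softmax output). The random input stands in for a real MNIST image, and the optimizer and loss in `compile` are assumptions.

```python
import numpy as np
import tensorflow as tf

# A stack of layers: image in, 10 probabilities out.
model = tf.keras.models.Sequential([
    tf.keras.layers.Flatten(),                        # unroll the 28x28 image into an array
    tf.keras.layers.Dense(128, activation='relu'),    # hidden layer; 128 is the width
    tf.keras.layers.Dense(10, activation='softmax'),  # one output per digit class
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])

# Made-up data in place of a real digit: a batch of one 28x28 image.
images = np.random.rand(1, 28, 28).astype('float32')
probs = model(images).numpy()

print(probs.shape)  # 10 numbers for the one input image
print(probs.sum())  # the softmax outputs sum to (approximately) 1
```

Copying the middle `Dense` line adds depth; changing 128 changes the width.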
All right. Let me give you an exercise and explain how to run it. Just so you know, TensorFlow is open source. You can definitely install it on your laptop. It's great. Today, we're going to run it in the cloud just to save time. And we're going to use one of my favorite tools, which is called Colab.
Has anyone seen a Colab? Half? OK. We're just going to use Colab. If you have any Colab questions, I'm happy to help.
Let's start this right now. So on your laptop, what you should do is-- I have two exercises for you. And the first one is what I recommend you start with. If you've been working with TensorFlow for a long time, I have an advanced exercise which I'll show you right after this. But if you're new to it, you should do this. Please go to this link.
And what this will do is this will connect you to our hello world tutorial for MNIST. It's close to the minimum amount of code you need to write an image classifier. And I'll put this slide back. Actually, no. You're going to need this because I have to have a second slide up in a sec. So definitely write this one down. Or go to bit.ly/mnist-seq. S-E-Q for sequential.
Let me show you. This has a long link. You're going to need this as a reference. I'm going to bring it up on my screen and show you how to get to it. So you don't have to write down this long link. If you go to tensorflow.org/beta, then you go to Machine Learning Basics, Classify Images. So this is tensorflow.org/beta, Machine Learning Basics, Classify Images.
This is a good reference if you want to learn more about MNIST. And I haven't described it to save some time. But if you need reading, this gives much more detail on what you're about to do. So, Classify Images.
And let me show you what I want you to get to.
What our beginner tutorial is missing is this diagram. And you're going to add this. So a blessing of deep neural networks is that if you create a deep enough network, and the layers are wide enough, and you train it for long enough, it will memorize pretty much any data set. And this is great. They're very powerful.
We don't want to memorize the training data, though. What you usually need to do is get high accuracy on the validation set. So the way-- there's a key parameter that you need to set. When you are training your networks, of all the parameters-- meaning how many layers, what's the width of a layer-- the one that you really need to get right is this one at the very end. It's epochs. And, roughly, this corresponds to how long you're training the model for.
An epoch means you've used every example from the training set once to update your weights. So here we are using them all five times. If this number is too large, you will overfit the training set. If it's too small, you'll underfit. So to set it properly, it's not rocket science.
Usually, what we do is we make plots like this. And here, we're plotting our accuracy on the training set and the validation set over time. So epochs is on the x-axis. Accuracy is on the Y. If we set epochs to like 20, probably the accuracy on the training set is going to hit one.
But what will happen, you'll notice that the accuracy on the validation set, it's going to start to drop. And the goal is, basically, the correct value for the number of epochs. You'd want to stop training this thing when the accuracy on the validation set begins to decrease, because that means you're beginning to overfit.
So what you're going to do, go to bit.ly/mnist-seq. Add plots for the training and validation accuracy and loss. And then find the right number of epochs to train that model for.
And here is the code. I'm giving you the code that you can use, so you can start modifying that example with code like this. And try and get those plots. And then find the right number of epochs. And why don't we work on that for-- I'll start talking again at 4:05. So, 15 minutes.
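The code from the slide is not in the transcript, so what follows is only a sketch of one way to do the exercise. It assumes the history dictionary that `model.fit(..., validation_data=...)` returns when the model is compiled with `metrics=['accuracy']`, which uses the keys `accuracy` and `val_accuracy`. The numbers below are made up, and `best_epoch` and `plot_history` are hypothetical helper names.

```python
def best_epoch(history):
    """Return the 1-based epoch at which validation accuracy peaks."""
    val_acc = history['val_accuracy']
    return val_acc.index(max(val_acc)) + 1

def plot_history(history):
    """Plot training and validation accuracy per epoch (needs matplotlib)."""
    import matplotlib.pyplot as plt
    epochs = range(1, len(history['accuracy']) + 1)
    plt.plot(epochs, history['accuracy'], label='training')
    plt.plot(epochs, history['val_accuracy'], label='validation')
    plt.xlabel('epoch')
    plt.ylabel('accuracy')
    plt.legend()
    plt.show()

# Made-up numbers standing in for `model.fit(...).history`.
example_history = {
    'accuracy':     [0.91, 0.95, 0.97, 0.98, 0.99, 0.995],
    'val_accuracy': [0.93, 0.96, 0.97, 0.975, 0.973, 0.971],
}

# Training accuracy keeps rising, validation accuracy peaks and then drops.
print(best_epoch(example_history))
```

With a real run, you would pass `history.history` from `model.fit` to both helpers and pick the number of epochs where the validation curve stops improving.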
If you've been working with TensorFlow 2 for a long time, or TensorFlow 1 for a long time-- I'll put that back in a sec. Here's a more advanced exercise. Whoops. Nice. Well, you can see the answer. But try not to see the answer. And look at this later.
We were at IJCAI a couple days ago in Macau. And so it's bit.ly, slash, and my friend had a typo. Here's a bizarre go link or a bit.ly link-- bit.ly/ijcav_adv. And that's a more advanced exercise where you write some pieces of a neural network from scratch.
So if you want the advanced one, it's I-J-C-A-V, underscore, A-D-V. And let me put back the beginner one, which you should probably start with: bit.ly/mnist-seq. And here's some code that you can use.
And then if anyone has any questions, please raise your hand. And I'll come around. There are one or two people new to Colab. It might be distracting for a lot of you, but I'm happy to-- I can give a quick intro to Colab while you're working on this. Quick intro to Colab, anybody? Awesome, OK. And if you have any questions, please raise your hand.
Another thing you can try. If you add the plots quickly, the next task that a lot of you are going to care about is how accurate of a model can you train on MNIST without overfitting on the validation set. And the way to train a more accurate model is to add more dense layers or to increase the width of the dense layers you have. That will give you more capacity.
But the larger your model, the more likely you are to overfit. And you'll see that there's layers you can play with, like Dropout and things like that, that you can read about or I'll talk about in a little bit.
OK I'll keep talking. And we can keep working on this in a little bit. So, TensorFlow 2. First of all, here's how you install the thing, if you're not working in Colab. Basically, what I want to mention right now is, while it's in beta, it's important to get a named release. So if you want to install the latest beta, here it is.
Just FYI, in Colab, if you used it last time, you can enable a GPU with the Edit, Notebook settings. Although if you want to enable a GPU, you also need to install the GPU version of TensorFlow 2.
Good practice. While we're working on upgrading, at the top of your scripts, just print out what version of TensorFlow you have just to make sure things are working.
So here's the first difference of TensorFlow 2 and TensorFlow 1. And let me explain what the name means. So a tensor is a fancy word for an array. So a scalar is a tensor. An array is a tensor. A matrix is a tensor. A cube is a tensor. So a tensor is an array.
Flow refers to a data flow graph. Under the hood, in C++, a data flow graph is built for your program, compiled, and executed. In TensorFlow 2, you don't need to be aware of that or see it unless, very rarely, you care about it.
So with TensorFlow 2 installed, what we're doing is we're creating two constants, 1 and 2. And we're going to add them together. And if we print this out, you'll see 1 plus 2 is 3, as you would expect. The shape is saying it's a tuple. And it has a data type.
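A minimal sketch of the example described here, assuming TensorFlow 2 is installed. The variable names are assumptions.

```python
import tensorflow as tf

# Good practice: check which version you are running.
print(tf.__version__)

x = tf.constant(1)
y = tf.constant(2)
z = x + y

# Printing shows the value (3) along with the shape and the data type.
print(z)
```

The shape is an empty tuple because `z` is a scalar, and the data type defaults to a 32-bit integer for Python integers.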
Roughly, TensorFlow 2 works like NumPy. Instead of NumPy ndarrays, we have TensorFlow tensors. The main difference is a TensorFlow tensor can be accelerated on a GPU. And we can backprop through it.
Let me show you how this is different from TensorFlow 1. Actually, before I get to TensorFlow 1, here's how this starts being useful in TensorFlow 2. In the last exercise, you poked around briefly with dense layers. Here, what I've done is I've just written some code. And I've imported a dense layer. And I've set it up in such a way that the behavior is very simple. And I know what it's going to do.
And then, I'm creating some data. And I'm forwarding the data through the dense layer. And you can see, just by running this in Python, exactly what the result is. So this is a great way to poke around and exactly understand the behavior of your layers very easily. So this is really useful.
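The exact settings on the slide are not in the transcript. One way to get a dense layer with very simple, predictable behavior is to drop the bias and initialize every weight to one, so the layer just sums its inputs. That choice is an assumption for this sketch.

```python
import tensorflow as tf

# One output unit, no bias, all weights set to 1: the output is the sum of the inputs.
layer = tf.keras.layers.Dense(1, use_bias=False, kernel_initializer='ones')

data = tf.constant([[1.0, 2.0, 3.0]])  # a batch with one example of three features
result = layer(data)                   # forward the data through the layer

print(result)  # holds [[6.]], since 1 + 2 + 3 = 6
```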
Also, TensorFlow tensors. If you get tired of TensorFlow, they have a .numpy() method. So you can switch back from tensors to NumPy. And TensorFlow operations will work with NumPy ndarrays. And NumPy operations will work with TensorFlow tensors. So they're close friends. And that should work most of the time.
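A short sketch of that interoperability, assuming TensorFlow 2 and NumPy are installed. The particular operations (`np.add`, `tf.multiply`) are just examples.

```python
import numpy as np
import tensorflow as tf

t = tf.constant([[1.0, 2.0], [3.0, 4.0]])

as_array = t.numpy()                       # tensor -> NumPy ndarray
added = np.add(t, 1)                       # a NumPy operation applied to a tensor
product = tf.multiply(np.ones((2, 2)), 42) # a TensorFlow operation applied to an ndarray

print(type(as_array))
print(added)
print(product)
```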
TensorFlow 1.0 was, sadly, different. So here, we're going to try and add some numbers again. And this won't work as expected. So in TensorFlow 1-- this was a long time ago, in 2015.
Basically, TensorFlow 1 is what you would have wanted if you were an engineer at a very large software company and your problem was, how can I do massively distributed deep learning? In TensorFlow 1, you build a data flow graph. And you need to be aware of what that graph is. And then, you run the graph.
And here, if we make those constants again, and we print Z, we don't get three. Instead, what we get is Z prints out to be this add operation. And that's an operation on some data flow graph. To actually do the addition, you had to make a session. And then, in the session, you would execute-- this should say Z, not X. You would run Z. And this added a lot of mental overhead.
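For comparison, the TensorFlow 1 style can still be reproduced under TensorFlow 2 through the `tf.compat.v1` module. This is a sketch under that assumption, not the slide's code: original TensorFlow 1 programs used `tf.Session` directly.

```python
import tensorflow as tf

# Build a data flow graph explicitly, the way TensorFlow 1 worked.
graph = tf.Graph()
with graph.as_default():
    x = tf.constant(1)
    y = tf.constant(2)
    z = x + y
    print(z)  # a symbolic tensor for the add operation, not the number 3

# To actually do the addition, make a session and run z.
with tf.compat.v1.Session(graph=graph) as sess:
    result = sess.run(z)

print(result)
```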
So this is gone. It works imperatively by default, which is great. If you want to make-- so I'm going to skip forward like 50 million slides again. And I'm just going to cut right to it and show you the one line of Python you need in TensorFlow 2 to make your code run fast and in graph mode.
The only piece of code you need to know in TensorFlow 2 that doesn't look like regular Python is going to be a single Python annotation. So here's some code. I've created some random LSTM cell. This is just Python and TensorFlow 2. I'm making some data. I'm calling the cell. And I have some crappy benchmark to see how long that takes to run.
To accelerate that, I can add a single line, which is @tf.function. And let me explain what this does. In this example, which is old, by the way, it made it like nine times faster. That would be probably slower now. But makes it much faster. Here's how this works.
So the reason we're using a C++ back end is Python is slow at multiplying matrices. This is why NumPy is so popular. You write your code in Python. In NumPy, the matrices are multiplied in C. The results go back to Python. You get 100x or 10x speed up, depending on what you're doing. Awesome.
One problem with the TensorFlow program or NumPy program is you're going from Python to NumPy-- I'm sorry. Python to C, Python to C, Python to C. So you're ping ponging back and forth between these environments. So you get latency.
If you're a compilers engineer, which I'm not, there's other things you can do to accelerate programs if you can look at the whole program at once. You can compile it. You can prune pieces that aren't used. Anyway, what tf.function basically says-- and it's applied recursively-- is take any code that appears in this block. Send it to the back end all at once. The back end compiles it, does its magic, does the math, delivers the result once.
So you run the whole code in C. And then you get the result once. So it saves you from the ping pong. And it can do some tricks. So that's it. So TensorFlow 2 is Python plus @tf.function. And anything you can stick in a tf.function, you can stick in a saved model that will run on devices without a Python interpreter. So that's good news.
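A minimal sketch of the decorator in use. The toy function below is an assumption chosen to keep the example short. It is not the LSTM benchmark from the slide, and it will not show a meaningful speedup on its own.

```python
import tensorflow as tf

@tf.function  # trace this Python function into a graph that runs in the back end
def add_and_square(a, b):
    return tf.square(a + b)

out = add_and_square(tf.constant(1.0), tf.constant(2.0))
print(out)  # holds 9.0, since (1 + 2) squared is 9
```

The function is called like regular Python; only the decorator line changes.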
While we're here, in case you do distributed training down the road, distributed training in TensorFlow 2.0 is also much, much easier. So basically, ignoring the indentation mistakes, here's something that looks very similar to that little MNIST model we looked at a second ago.
To run this on one machine with multiple GPUs, it's just this. So we have different distribution strategies. The way it works is you create a strategy. Every strategy has its scope. Indentation problems? Oh no, it's not. I'm just tired from jet lag. Create a model inside the scope. Compile it. And when you do fit, this will do data parallelism, which is the easiest way to do distributed training.
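A sketch of that pattern, assuming `tf.distribute.MirroredStrategy` is the strategy meant for one machine with multiple GPUs. The model mirrors the earlier MNIST example, and the commented `fit` call uses hypothetical variable names.

```python
import tensorflow as tf

# Mirrors the model across the GPUs on this machine (falls back to one device if there are none).
strategy = tf.distribute.MirroredStrategy()

with strategy.scope():
    # Create and compile the model inside the scope.
    model = tf.keras.models.Sequential([
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(128, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])

print(strategy.num_replicas_in_sync)  # how many copies of the model are trained in parallel

# Calling fit then does data parallelism (train_images and train_labels are hypothetical):
# model.fit(train_images, train_labels, epochs=5)
```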
What I should tell you is that this is the easy part. So, wrapping your model in a scope. And there are scopes for different contexts. This is not hard. You just have to read about the scopes and figure out what they are.
What's hard is your input pipeline. So the bottleneck is basically going to be reading data off disk and getting it onto the GPUs fast enough that they're not starving, which means sitting around waiting for data. And that's getting easier.
Let me just show you what that involves right now. When you imported the data set in this little hello world example, we used these Keras data sets. And Keras is a wonderful library I'll talk about in a sec. It's built into TensorFlow 2. It has a lot of small data sets that you can import in memory. Almost always, your data sets are not going to be in memory. They're going to be sitting on disk.
In TensorFlow 2, the way you get a data set off disk and onto the GPUs quickly is you use something called tf.data. And briefly, the best tutorial that we have that you can check out right now-- and don't do this now, but for the future just so you have a reference. We're working on cleaning these up. Load and pre-process data and, strangely, images is the one. In my experience, that's the one you want even if you're not working with images.
But let me explain how this works. That is not images. tf.data is a tool to build input pipelines. The way it basically works is this.
So the first thing I want to show you is that TensorFlow 2-- the way to think about it is NumPy. So if you have a NumPy operation like np.sum, you can usually find an equivalent in TensorFlow 2. There's some other stuff, too. We have image modules with things for like loading images off disk, and resizing them, and decoding them, and doing stuff like that. So there's different utility modules.
But here we have some code that takes some image, and decodes it as a JPEG, and does some math on it, and whatever. So there's some TensorFlow 2 code. tf.data is a tool that you can use to build data pipelines out of these. So basically, you can start constructing a data set.
And here, you can say things like a data set is a stream of data. tf.data has different operations that you can apply to that stream that are useful. So here, we're saying shuffle the data. And maybe later we'll say batch the data and repeat the data set for a long time.
And then, there's these interesting tools like prefetch. And this is something you can do with tf.data that you can't do easily with NumPy. So what prefetch is trying to say is get the next batch of data onto the GPU. So it's there when it finishes processing the current batch. So you don't have latency.
And there's all sorts of fancy tricks like this. TLDR. tf.data is useful. It's a bit of a hassle. And it's complex. But if you're doing larger experiments, it's worth using and worth learning.
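A small sketch of a pipeline with the operations mentioned (shuffle, batch, repeat, prefetch). To stay self-contained it builds the stream from made-up in-memory arrays. A real pipeline would typically start from files on disk, and the buffer and batch sizes here are arbitrary.

```python
import numpy as np
import tensorflow as tf

# Made-up data standing in for images and labels read from disk.
images = np.random.rand(100, 28, 28).astype('float32')
labels = np.random.randint(0, 10, size=100)

dataset = (
    tf.data.Dataset.from_tensor_slices((images, labels))  # a stream of (image, label) pairs
    .shuffle(buffer_size=100)  # shuffle the data
    .batch(32)                 # group examples into batches
    .repeat()                  # repeat the data set indefinitely
    .prefetch(1)               # prepare the next batch while the current one is processed
)

# Pull one batch off the stream to see what comes out.
for batch_images, batch_labels in dataset.take(1):
    print(batch_images.shape, batch_labels.shape)
```

A dataset like this can be passed to `model.fit` in place of in-memory arrays; because of `repeat()`, you would also set `steps_per_epoch`.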
All right. Let's see. All right. So Keras is built into TensorFlow 2. And it's a huge part of TensorFlow 2. Keras is a separate library.
Let me explain what this is. If you go to Keras.io, this is one of my favorite all time libraries next to Scikit-learn. It's a deep learning library. It's wonderful. And what Keras does, it's basically an API without an implementation.
So Keras defines different ways of defining deep neural networks. And everything at Keras.io works in TensorFlow 2. Keras defines two APIs, sequential and functional. Sequential is for building a stack of layers. Functional is for building a graph.
What Keras doesn't say. And you've seen these things, dense and sequential. Keras doesn't say anything about how you actually run this code on a GPU. If you do pip install Keras, you get what you get at Keras.io. And automatically, behind the scenes, Keras will install what it calls a tensor processing library. So it will install TensorFlow, or MXNet, or CNTK. And it calls that library to do the math. You never see it.
In TensorFlow 2, Keras is built in. And TensorFlow 2 is a superset of what you get at Keras.io. So if TensorFlow 2 is installed, you can say from tensorflow.keras import whatever you want. And any code you find at Keras.io will work identically in TensorFlow 2 just by changing an import. So instead of import keras, you write from tensorflow import keras. And that's it.
So if you're new to this stuff, Keras is famous for being one of the easiest to use libraries and the best documented. It's a perfectly good place to start learning. Nothing you learn in Keras.io is a waste of your time because it all works identically in TensorFlow 2.
Just so you know, Colab has Keras installed also by default. So you have to be careful with your imports. If you're importing things from Keras and you see "Using TensorFlow backend," that's a mistake. You don't want that. You just want to get your imports from tensorflow.keras.
I put some notes on the slides for you when I upload them. All right. So you've seen sequential models. So that's a stack of layers. That, by the way, existed in TensorFlow 1. It's the same in TensorFlow 2. It didn't change at all.
Functional models are what you would use to build a model that's a DAG. And so if you start learning about things like residual networks and things where you have skip connections between layers, you can define them using the functional API.
There's a third method that I'll talk more about, which is the subclassing API. And this feels a little bit like object-oriented NumPy. This is very, very similar to a library called Chainer and similar to a library called PyTorch.
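To make the skip connection idea concrete, here is a small sketch using the functional API. The input size and layer widths are assumptions made up for this example; a real residual network would use convolutional layers and many more blocks.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Each layer is called on the output of an earlier one, which builds a graph.
inputs = tf.keras.Input(shape=(32,))
x = layers.Dense(32, activation="relu")(inputs)
h = layers.Dense(32, activation="relu")(x)

# Skip connection: add the earlier activations back in, so the graph
# is a DAG rather than a plain stack of layers.
x = layers.Add()([x, h])

outputs = layers.Dense(10, activation="softmax")(x)
model = tf.keras.Model(inputs=inputs, outputs=outputs)
```

A Sequential model cannot express the Add step, because each layer there only ever sees the layer right before it.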
And the way this works is, here, we're defining our model by extending a class. And this class happens to be model. It's provided by the library. You can write your own if you don't like this one. And what we're doing is, in the constructor, we're defining a couple of layers.
And in the call method, we're defining the forward pass of our model or our layer. So if you call this model on some data, you can see that the data will pass through the dense layer. If you're curious exactly what the output is, you can just print that out because it's just Python.
This is really, really great from the research side if you're defining new layers and stuff like that. You can interactively see how they work.
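A rough sketch of the subclassing style described above. The class name MyModel, the layer sizes, and the random input are hypothetical; the slide's actual model may differ.

```python
import tensorflow as tf

class MyModel(tf.keras.Model):
    def __init__(self, num_classes=10):
        super().__init__()
        # Layers are created once, in the constructor.
        self.dense_1 = tf.keras.layers.Dense(32, activation="relu")
        self.dense_2 = tf.keras.layers.Dense(num_classes, activation="softmax")

    def call(self, inputs):
        # The forward pass: data flows through the layers in order.
        x = self.dense_1(inputs)
        return self.dense_2(x)

model = MyModel()
data = tf.random.normal((4, 20))   # a made-up batch of 4 examples
output = model(data)               # runs the forward pass immediately

# Because it's just Python, the result can be inspected directly.
print(output.numpy())
```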
All three of these model styles can be trained in two ways. One way is model.fit, which you've seen, which you should always use unless you need to write custom code. The way to write a custom training loop in TensorFlow 2-- and we'll do this in a second with linear regression-- is called the gradient tape.
So here, what we're doing is we're creating our model. And then, all these models are trained by gradient descent. The way we get the gradients is back propagation. The way TensorFlow will give you the gradients for the weights in your model is using something called a tape. What we start doing is we record all the operations under this with block on a tape. It builds a computational graph, plays the graph backwards to get the gradients.
But basically, what's happening is we're calling the model on some images. We're getting the output of the model. We're computing our loss. And then, we're getting the gradients of the loss with respect to all the variables in the model. And if you print these out, these are your gradients.
If you were doing research in optimization, and you're working on the Rachel optimizer or the Josh optimizer, you can rewrite this however you want.
If you were doing just SGD, you would multiply the gradients by a learning rate and update your model variables. Or if you're doing gradient clipping, it's really easy to write that. So this is a really, really nice way to do auto-- basically, to do back prop in TensorFlow.
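Here is one way the tape-based training step could look, as a sketch rather than the code from the slide. The model, the random stand-in data, the learning rate, and the clipping threshold are all assumptions for illustration.

```python
import tensorflow as tf

tf.random.set_seed(0)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(10),
])
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
learning_rate = 0.1

# Random stand-ins for a batch of flattened images and their labels.
images = tf.random.normal((8, 784))
labels = tf.random.uniform((8,), maxval=10, dtype=tf.int32)

def train_step(images, labels):
    # Record the forward pass on the tape.
    with tf.GradientTape() as tape:
        logits = model(images)
        loss = loss_fn(labels, logits)
    # Gradients of the loss with respect to every trainable variable.
    grads = tape.gradient(loss, model.trainable_variables)
    # Gradient clipping is one extra line.
    grads = [tf.clip_by_norm(g, 1.0) for g in grads]
    # Plain SGD: step each variable against its gradient.
    for var, g in zip(model.trainable_variables, grads):
        var.assign_sub(learning_rate * g)
    return loss

first_loss = float(train_step(images, labels))
for _ in range(20):
    last_loss = float(train_step(images, labels))
```

Swapping in a different update rule only means changing the few lines after tape.gradient.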
The best way to get started with poking around with the gradients, in my experience, is linear regression. So we're going to skip this stuff.
Let's take a look at-- the next exercise is linear regression. But it's written the slow way. And so we're going to pretend like we don't have dense layers. We don't have model.fit. Let's do linear regression with a gradient tape. And this is good. So you can actually see what the gradients are that you get.
And let me see what this notebook gives you. You might have to clear the output. So it's bit.ly/tf-ws1.
And I think the real power of these deep learning libraries, it's-- regardless of which library you're doing, it's that they can do auto diff. Once you have an easy way to get gradients-- here we're going to get them for linear regression. Great. But almost with exactly the same code, we get the gradients for Deep Dream. So this scales up in a really surprising way.
So, tf-ws1. And probably, you're going to need to clear the output. I think I forgot to clear it. But what we're doing here is we're going to fit a model, y equals mx plus b, to some data. So we created some random data. We're going to create a model, y equals mx plus b. And we're going to use two TensorFlow variables. Normally, you don't have to write code at this low level unless you're doing some sort of research. But here, we're creating variables.
This is the forward pass of our model, y equals mx plus b. M is the slope. B is the intercept. Our loss is going to be squared error. When you see names in TensorFlow that aren't quite identical to NumPy, that usually means there's a subtle difference in how these work.
And I think the reason that it says reduce_mean, instead of just mean, is you can imagine if you have a GPU, and you have a long list of floating point numbers, and you're taking the average, if the GPU doesn't guarantee the order in which it takes the average, it's possible you'll have very slightly different results every run based on floating point arithmetic errors. So that's just trivia, why that's called reduce_mean.
Anyway, what I wanted to show you is, at the end of this notebook, if you're new to gradient descent, it makes this nice little plot that you can look at. And what we're seeing here, we're visualizing the loss of our model as a function of the slope and the intercept. And that's our starting loss. We get the gradient. The notebook will give you the code to take a step in the negative direction of the gradient. And down we go.
And here's the gradient tape loop that I showed you from the slide in action. And what's cool is if you run this, you can literally print out the gradients for m and b and see exactly what they are, which is cool.
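The linear regression exercise might look roughly like the sketch below. This is not the notebook's code: the synthetic data, the true slope and intercept, the learning rate, and the step count are all assumptions chosen so the example runs on its own.

```python
import tensorflow as tf

tf.random.set_seed(0)

# Synthetic data scattered around a known line (values are arbitrary).
true_m, true_b = 3.0, 2.0
x = tf.random.uniform((200,), minval=-1.0, maxval=1.0)
y = true_m * x + true_b + tf.random.normal((200,), stddev=0.1)

# The two variables of the model y = m*x + b.
m = tf.Variable(0.0)
b = tf.Variable(0.0)

def predict(x):
    return m * x + b

def squared_error(y_pred, y_true):
    return tf.reduce_mean(tf.square(y_pred - y_true))

learning_rate = 0.1
for step in range(300):
    with tf.GradientTape() as tape:
        loss = squared_error(predict(x), y)
    # One gradient per variable, each just a number.
    grad_m, grad_b = tape.gradient(loss, [m, b])
    m.assign_sub(learning_rate * grad_m)
    b.assign_sub(learning_rate * grad_b)
    if step % 100 == 0:
        print(step, float(loss), float(grad_m), float(grad_b))
```

Printing grad_m and grad_b at each step is a quick way to watch them shrink as the fit improves.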
So why don't we take-- let's take like eight minutes and poke around with this. So if you want to run it from scratch, edit, clear all output. I'll start talking again shortly, at like 4:25. So it's bit.ly/tf-ws1.
Also, in case you're new to back prop, let me just point you to a really nice article. We don't have time to cover it right now. But if you Google for this, if you want to learn how auto diff works, wonderful article by Chris Olah, Calculus on Computational Graphs.
And the reason I like this article as a teaching tool for back prop is it actually does it. So it has an example. It's not just like, here's some equations. So it's really nice in Chris's article. What he does is he builds up a very simple computational graph.
And this is what TensorFlow does, too, behind the scenes. This is a computational graph for-- we're doing like a plus b times c, or something like that. And he'll build the graph. Show you the forward pass and the backward pass to get the gradients. It's really, really nice. So, Calculus on Computational Graphs.
By the way, if people are getting an error message with length, Colab has TensorFlow 1 installed on it by default. And we'll get rid of that as soon as TensorFlow 2 is out. So if you're getting-- tensor has no property length. The very first cell will install TensorFlow 2. And you'll have to run that one. And then, that error message should go away.
So let me briefly explain gradient descent and gradients. And so I'll do this in two ways. So one is the numeric gradient. And the other is the analytic gradient. Basically, deep neural networks work the same way as linear regression in this sense. You always start-- if you're doing a deep neural network or linear regression, there are two things you need.
The first thing you need is a model. Here, our model is y equals mx plus b. With a neural network, our model is going to be Keras, sequential, dense, dense, dense. It's a much bigger model. Same thing.
When you call the model, that's called the forward pass. You take some data, pass it through the model, get a result. The next thing you need, which is very important, is called a loss function, which is synonymous with error. And all that is, is a way to quantify how bad of a prediction you've made.
In linear regression, the loss function is our squared error. So for the entire training set or whatever data we forwarded through the model, we take the point we predicted, which is the blue line, subtract the point we wanted, which is the blue dot, square it, and we sum that up over the whole training set. The point is loss is just a number. In classification, we'll use something called cross entropy. But it still gives us just a number.
And this is gradient descent. As soon as you can plot your loss as a function of your variables-- linear regression, there's two-- the slope and intercept. Deep neural networks, there might be a million. But the concept is identical. We're almost done.
Because our loss quantifies how bad of a job we're doing, if we minimize the loss, that means we have a good model. So we want to go down the hill. Deep neural networks don't have a global minimum like this. Or they're not convex like this. This is the special case. But we'll get to some minimum.
There's a concept in calculus called the gradient. And the gradient is a vector of partial derivatives that points uphill, which is why the negative gradient is the direction that points downhill. The good news is if you haven't taken a calculus class in 20 years, and you don't remember what that means, you can sort of understand it intuitively. So loss is a function of our variables.
The gradient looks at each variable independently. So let's just look at b. Our variables are just numbers. There's only two things we can do to a number. We can make it bigger. Or we can make it smaller by some amount.
If you forget calculus, you can calculate the numeric gradient like this. For each variable in your model, make it slightly bigger. Recompute your loss. Then make it slightly smaller. Recompute your loss. Figure out which way makes your loss go down.
Well, actually in this case, the way that makes your loss go up is the gradient. And the negative gradient is the direction that makes it go down. So you wiggle each one a little bit and recompute it. That gives you the direction. Wiggle each one a little bit. That gives you the direction.
The problem with just doing this numerically is if you have a million variables, you have to do a million forward passes of your data. So this is really slow. If you remember tricks from calculus, you can get it in time that's linear in the number of nodes on the computational graph.
So basically, calculus is a much faster way to get the gradient. But the point is, regardless of how you compute it, the gradient descent step is easy. That just means apply the gradient. Literally, take a step. So, wiggle your parameters a little bit. Get the gradient again, and again, and again, and again, and again. And so neural networks are trained identically.
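The wiggle-and-recompute idea can be sketched in plain Python. The tiny data set, the step size eps, and the learning rate below are made up for illustration; the point is only that each variable needs its own extra forward passes.

```python
def loss(m, b, xs, ys):
    # Mean squared error of the line y = m*x + b.
    return sum((m * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def numeric_gradient(m, b, xs, ys, eps=1e-4):
    # Wiggle one variable at a time: slightly bigger, slightly smaller,
    # and see how the loss changes.
    grad_m = (loss(m + eps, b, xs, ys) - loss(m - eps, b, xs, ys)) / (2 * eps)
    grad_b = (loss(m, b + eps, xs, ys) - loss(m, b - eps, xs, ys)) / (2 * eps)
    return grad_m, grad_b

xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # points on the line y = 2x + 1

m, b = 0.0, 0.0
learning_rate = 0.05
for _ in range(2000):
    grad_m, grad_b = numeric_gradient(m, b, xs, ys)
    # Gradient descent: step in the negative gradient direction.
    m -= learning_rate * grad_m
    b -= learning_rate * grad_b
```

With two variables this costs four loss evaluations per step. With a million variables it would cost two million, which is why the analytic gradient matters.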
All right. Really, really quickly, I just want to look at some of the building blocks of these DNNs. So basically, you'll see this like a billion times. There's cartoon diagrams of a neuron, which I like to think of as a little logistic regression unit.
So here, what we have is some input data. These could be pixels on an image. Each pixel is being multiplied by a weight. We sum it up. We apply non-linearity. And that gives us the output of one neuron. I don't like this diagram. I don't like the math, either. But you can look at the math. It's a sum of the inputs multiplied by the weights, and then a non-linearity.
But let me show you a diagram that makes a little bit more sense. So here's the way I like to start thinking about it. So here's a diagram that corresponds to that. Here's the diagram of our little neuron. And here's what's happening when it actually computes on some data.
So we have an image. Let's pretend this just has four pixels. And we'll pretend it's black and white. Ignore the colors. The Flatten layer that we've been working with unrolls that image into an array. So after we flatten it, here's the pixel values from that image.
Here, we have four pixels. So we have four weights. I ran out of room. So there should be four inputs up there. But I just drew three. What we do is we do a dot product of the weights and the inputs. We add a bias. And we get a result.
So what a single neuron-- we haven't done non-linearities yet. What a single neuron is doing is giving you a score for something. And you can think of this neuron as telling us how plane-like is that image maybe. What's nice is we can start adding-- see how this is already starting to look like a little neural network? Now instead of one neuron, we have a dense layer. All we had to do to get a dense layer is we added one more output.
Actually, we could have had a dense layer. In Keras, you could have written exactly what you see here. It's model.add(Dense(1)) for one neuron. This would be Dense(2). And what we have here is now we have two outputs. Because it's fully connected, adding a second output means we've added a second set of weights. And what's really nice about this is instead of a dot product, we're doing a matrix multiply. So the forward pass of one dense layer is one matrix multiply, which is really, really nice.
And there are other cool things you can do, too. Here, we're classifying one image at a time. We can also, still with one matrix multiply, classify multiple images at a time. And what we've done here is we've added a batch of images. And here we have two.
And what I'm trying to show you here is still a matrix multiply. But now we get scores of multiple images at the same time. And so we're classifying two images at once. And this is what a dense layer is doing.
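As a sketch of that matrix multiply, here is a dense layer with four inputs and two outputs applied to a batch of two images in NumPy. The pixel values, weights, and biases are invented numbers, and no non-linearity is applied yet.

```python
import numpy as np

# A batch of two flattened 4-pixel images (made-up values).
images = np.array([[0.0, 1.0, 2.0, 3.0],
                   [1.0, 0.0, 1.0, 0.0]])      # shape (2, 4)

# One column of weights per output neuron: 4 inputs, 2 outputs.
weights = np.array([[1.0,  0.0],
                    [0.0,  1.0],
                    [1.0,  0.0],
                    [0.0, -1.0]])              # shape (4, 2)
bias = np.array([0.5, -0.5])                   # one bias per output

# The whole forward pass of the dense layer is one matrix multiply.
scores = images @ weights + bias               # shape (2, 2)
print(scores)
```

Each row of scores belongs to one image and each column to one output neuron.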
If you look at model.fit-- let's see if this works. I'm not connected. That's why.
You can look at the documentation for all these little different methods we're calling. And you'll see that one of the parameters you can set inside model.fit is the batch size. And the batch size in TensorFlow, if you're using these Keras APIs, defaults to 32, which is fine.
When you're doing gradient descent, the larger your batch size, the more accurate of an update you're going to make. But the slower it is to compute. A batch size of one would be one example at a time. That's stochastic gradient descent. A batch size equal to the length of your training set would be batch gradient descent. And what everyone does in practice is a mini batch, which is a number greater than 1 and less than the size of your data set. 32 is usually what you want. But the point is matrix multiply.
To get a neural network from that, you just need one more dense layer. So you need a non-linearity. And you need a dense layer. The intuition for the non-linearity, I guarantee you some of you have a much better sense of this than I do. I don't like this. But I'll show you a demo of how it works.
So to get to a neural network, we just need two more things. We have our matrix multiply. We have a non-linearity. And we have another dense layer.
There are a bunch of non-linearities. A lot of you have an awesome math background. If you're multiplying a series of matrices without the non-linearities, that reduces to multiplying by just one matrix.
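The collapse of stacked linear layers into a single matrix can be checked numerically. This sketch uses random matrices and leaves out biases to keep it short; the shapes are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))     # a batch of 5 inputs
W1 = rng.normal(size=(4, 8))    # first dense layer's weights
W2 = rng.normal(size=(8, 3))    # second dense layer's weights

# Two dense layers with no activation in between...
two_layers = (x @ W1) @ W2
# ...compute the same thing as one dense layer with weights W1 @ W2.
one_layer = x @ (W1 @ W2)
same_without_relu = np.allclose(two_layers, one_layer)

# With a ReLU in between, the two layers no longer collapse.
with_relu = np.maximum(x @ W1, 0.0) @ W2
same_with_relu = np.allclose(with_relu, one_layer)

print(same_without_relu, same_with_relu)
```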
So there are a stack of different activation layers you can add. Some of the ones originally used were things like sigmoids. Now a good default would be ReLU. I know these are tiny diagrams. Sigmoid looks really nice. It takes a number and squashes it to be between 0 and 1, which makes a lot of intuitive sense but has really bad properties for gradient descent.
And the bad properties when you're using sigmoid activations-- and these weren't understood for a while. If you have a very large value or very small value going into a sigmoid, and you think about the derivative of a sigmoid, it flattens out towards the extremes. So using sigmoids can cause your gradient descent to run very, very slowly.
Later, it was found that ReLU, which looks a little silly (it's basically an on/off switch), will make your models train much faster. So the good news is applying the non-linearities is simple. They're applied element by element.
So here, maybe we've done our dense layer. We've done the matrix multiply. And if these are the scores we got, we can apply ReLU to them like this. It's just going to be, if it's less than 0, it's 0. If it's greater than 0, it just passes through unchanged. So that's how you would apply ReLU to the output of your matrix multiply. And this in Keras would be model.add(layers.Dense(3, activation='relu')).
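ReLU on its own is short enough to write by hand. The scores below are made-up numbers standing in for the output of a matrix multiply.

```python
import numpy as np

def relu(scores):
    # Negative values become 0; positive values pass through unchanged.
    return np.maximum(scores, 0.0)

scores = np.array([-2.0, 0.5, 3.0])
activated = relu(scores)
print(activated)
```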
And then to get a neural network, you just need a single, one more dense layer on top of that. Basically, on these slides, if you want to poke around with why you need the non-linearities, I linked some code that trains the deep neural network without non-linearities and tries to classify this data set. If you delete the non-linearities, it gives you a linear decision boundary. If you add the non-linearities, it gives you a nonlinear decision boundary.
And I'll show you a demo of this in a second. But basically, the idea is if you forget your ReLUs-- here's some DNN. If you strip these ReLUs, if you write None instead of ReLU, this has the same power as this network right here. The intermediate layers collapse to nothing.
And let me show you a quick demo of this. There's a cool website. It's playground.tensorflow.org. And this is a little neural network running in the browser. And this was before TensorFlow.js, just FYI. And it's sort of a funny thing. It's awesome, and really powerful, and horribly documented, and can be a little bit hard to understand.
But basically, if I delete the hidden layers, we're looking at a single dense layer or one neuron. And if we pick a linear data set, and I hit Play, we can classify the data set with our neuron.
If I have a nonlinear data set-- here we have these two circles, blue dots in the center, orange dots outside-- we can't split the thing. We can't draw a line to split them. If you add a hidden layer, now we have a neural network. And the hidden layer will do feature engineering. And it will-- I don't have a slide for this. But we'll just skip it.
There's a trick you can use to classify a nonlinear data set with a linear layer. And that's if you do feature engineering. But it doesn't matter. The neural network is doing feature engineering to let us classify the data. If you delete the activations, though, so if we switch the activation to linear, which is none, our neural network can't do it. And so you have to have the activation functions to have the hidden layers do something.
All right. Really quickly, and then we'll do some more code. There's just two concepts that I wanted to briefly mention, because we have alphabet soup. So the output of a dense layer is just some scores. And after we apply the activation function, we still just have scores. Usually, when you're doing classification, what you want are probabilities.
So there's a function that you'll see at the end of your networks called softmax. And softmax takes scores. And it returns a probability distribution. So that's what softmax is doing.
The other thing you'll see is, in linear regression, the loss function is squared error. When you're doing classification, the loss function is usually cross entropy. And all I wanted to say right now, when you see the term cross entropy, what that's saying is compare two probability distributions. So softmax gives us probabilities. And we need to compare those probabilities to the thing that we wanted.
So this is called a one hot encoding. And let's say we were classifying this image of a bird. And maybe there's 10 possible outputs for the image. Our label, or the value that we want for the bird, is-- let's say 2 corresponds to bird. So we have a 1 here and zeros everywhere else. That's the probability distribution we wanted. This is the probability distribution we got from making a prediction with our model. Cross entropy will compare these and return a number. So it's another loss function, just FYI.
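A minimal softmax in plain Python, as a sketch. The input scores are arbitrary, and subtracting the maximum first is a common numerical-stability step that is an addition here, not something from the talk.

```python
import math

def softmax(scores):
    # Shift by the max so the exponentials can't overflow; this does
    # not change the resulting probabilities.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

The outputs are all positive, sum to 1, and keep the same ordering as the scores that went in.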
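The bird example can be sketched in a few lines. The two predicted distributions are invented for illustration, and the small eps inside the log is an assumption added to avoid taking log of zero.

```python
import math

def cross_entropy(target, predicted, eps=1e-12):
    # Compare the distribution we wanted with the one we got.
    return -sum(t * math.log(p + eps) for t, p in zip(target, predicted))

num_classes = 10
bird = 2   # index 2 corresponds to "bird" in this example

# One hot label: a 1 for bird and zeros everywhere else.
one_hot = [1.0 if i == bird else 0.0 for i in range(num_classes)]

# A confident, correct prediction and a completely unsure one.
confident = [0.01] * num_classes
confident[bird] = 0.91
unsure = [0.1] * num_classes

loss_confident = cross_entropy(one_hot, confident)
loss_unsure = cross_entropy(one_hot, unsure)
print(loss_confident, loss_unsure)
```

Either way the result is a single number, and it is smaller when more probability lands on the right class.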
All right. So here's another notebook. And then after this, we'll do convolution, which is much more interesting than these dense layers. So this is another notebook where you're going to write a neural network for Fashion MNIST. And this is dense layers still. The link is ijcai, this time, underscore one dash a. Let's take 10 minutes. And you can hack on that.
By the way, the goal there was not to give you all the details of softmax or cross entropy. It's just so you know, OK, that's ballpark what those terms are trying to do. And you can go from there.
Oh, I just wanted to mention so you don't get stuck on this. The goal of this notebook was just to briefly introduce two things. The goal of this notebook is briefly introducing tf.data.
So you're seeing instead of-- when you're pre-processing images, Keras has awesome, really thoughtful, easy to use pre-processing utilities. Things like flow_from_directory, data augmentation. They're wonderful and awesome. They work in TensorFlow 2 also. The goal of this notebook is to show you a lower level way to do it, which is why we're using things like dataset.map and writing your own pre-processing functions from scratch.
Just in case you're stuck, the first step is to batch the data. And the way you batch the data is just nice. If you're seeing you can't-- ah. So my friend wrote this in Google Drive. If you can't edit the notebook, it's because it's not on GitHub. You have to click on Open in Playground. And that will give you a copy of it.
But let me just show you how to do the batching step. Just for step one, you can just do .batch and then the batch size. So that's all you need.
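A small sketch of the map and batch steps with tf.data. The random arrays stand in for the notebook's images, and the preprocess function and batch size of 32 are assumptions for illustration.

```python
import tensorflow as tf

# Random stand-ins for 100 grayscale 28x28 images and their labels.
images = tf.random.uniform((100, 28, 28), maxval=256, dtype=tf.int32)
labels = tf.random.uniform((100,), maxval=10, dtype=tf.int32)

def preprocess(image, label):
    # A hand-written pre-processing function: scale pixels to [0, 1].
    image = tf.cast(image, tf.float32) / 255.0
    return image, label

dataset = (tf.data.Dataset.from_tensor_slices((images, labels))
           .map(preprocess)
           .shuffle(100)
           .batch(32))

# 100 examples in batches of 32 gives three full batches and one of 4.
batch_shapes = [tuple(img.shape) for img, _ in dataset]
print(batch_shapes)
```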
All right. So continuing our warp speed intro to deep learning. So, convolution. Basically, you'll hear a lot about CNNs. And convolutional neural networks are way more-- they're much better suited to image classification than dense networks. And I'll briefly explain why.
So first of all, convolution. Not a deep learning concept. And you'll see this a lot in deep learning. I know some of you have an electrical engineering background. You'll know way more about convolution than I ever will. In deep learning, we take concepts from other fields. And we use [INAUDIBLE] kind of remedial way.
And I have some code that I wrote in SciPy. And we're going to convolve over a picture of an astronaut to detect the edges on the photo.
And quickly, does anyone know who the picture of the astronaut is in SciPy? Who got built into SciPy? What do you have to do to become part of SciPy? Anyway, that's Eileen Collins. And she was the first woman to command the space shuttle Columbia, which is where I stole the slide from.
So anyway, the way we're going to detect edges on Eileen is we're going to use a filter or a kernel. Things in deep learning often have like five names for no reason. So we're going to use a filter or a kernel. The brief idea is there are nine numbers, eight of which are negative one, one of which is eight. The eight negative ones cancel out the eight, so the values sum to zero.
We put the kernel on top of the image, and we do the dot product of the values in the kernel with the pixels. And there's just code in SciPy you can look at later.
Let me show you what I mean by that. So here's our image. And here's our kernel or our filter. And the way we can convolve or we slide over this image is we stick the filter on top of the image. We take the dot product of the filter and the image values. And we write it in the output image.
And then, convolve literally means slide. Slide, dot product, output. Slide, dot product, output. Slide, dot product, output. And so we get an output image by convolving.
And in CNNs, the filter values are learned exactly like parameters inside dense layers are learned. So they're learned by gradient descent. They start life as small random numbers. And what's interesting about convolution is if you have the right numbers for the kernel, you get really powerful things.
So this is an edge detector. And this is the way Photoshop works as well. It's convolution to detect edges, to blur images, to sharpen images. The difference is in Photoshop, they have these really nice kernels that are very carefully hand designed. This is like the crappiest one you can write. But it works.
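The slide, dot product, output loop can be written out directly in NumPy. This is a sketch with a tiny synthetic image instead of the astronaut photo, and it skips padding and stride. Like most deep learning code, it does not flip the kernel, which makes no difference for this symmetric one.

```python
import numpy as np

def convolve2d(image, kernel):
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Put the kernel on top of the image at (i, j), take the
            # dot product, and write it to the output image.
            patch = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(patch * kernel)
    return output

# The edge detector: eight negative ones around a central eight.
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]], dtype=float)

# A 6x6 image that is dark on the left and bright on the right.
image = np.zeros((6, 6))
image[:, 3:] = 1.0

edges = convolve2d(image, kernel)
print(edges)
```

The output is zero over the flat regions and nonzero only in the columns next to the vertical edge.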
And here's the difference between convolution and dense layers. This is already much more powerful than a dense layer. So with just nine numbers, we can find edges anywhere on the image. To do that with a dense layer, a dense layer would have to separately learn to detect edges at every location in the image. So this little thing has the same power as like Dense(1000). So, much more efficient.
It's slower because we have to convolve it and slide it around the image to do the math. But it's much more efficient in terms of the number of parameters. So this is a big deal.
What's great is in deep learning-- well, first of all, here's how you use convolution inside TensorFlow. You can write a little convolutional layer. And I'll explain what this means. Here, we have some layer that's going to take an input image as input. The input image is going to be 10 by 10 by 3, meaning it has three color channels-- red, green, and blue.
Here, pulled it out of a hat. We're going to learn a filter that's 4 by 4. The larger your filter is, the more sophisticated [INAUDIBLE] detect but the slower they are. Common filter sizes are not 4 by 4. They're usually 3 by 3 or 5 by 5. I stole these slides from a friend. I had to change the kernel size to match the slides. And we're going to learn four filters. And I'll show you what that means in a second.
So here's convolving in 3D. And this becomes very powerful very quickly. So, convolution in 3D. Instead of having a 2D filter, we now have a 3D filter. And already, this gets a little bit harder to wrap our heads around exactly what this filter is doing. But it's basically looking at every color channel separately.
The good news is we can convolve in exactly the same way that we can convolve in 2D. So we stick the filter over the image. We take a dot product here. And we write that down as the output value. And I'm skipping things like padding, and stride, and stuff like that. But basically, if you do a lot of sliding, take a lot of dot products, you end up with an output image.
What this is called, this is an activation map. So this is showing you the regions where the filter was most strongly activated. And that just means the dot product was high. So if it was an edge detection filter, this would be the locations of the edges.
What's nice, again, these filters are learned. This is also an image. There's no reason that we can't just stick this in Matplotlib and display it as an image like we did with the results of convolving over Eileen. And so you can visualize very easily exactly what the output is of all these filters. So that's a nice property.
And then it gets powerful. If we add another filter, we get another output image. And all the filters are learned, random weights initialization. So hopefully, they'll be detecting different things.
And here's what's cool. I guess I deleted the slide accidentally. But you can imagine if we had 4 output filters, or 10. Let's say we had 4 output filters. That would mean we've gone from a 10 by 10 by 3 image to a 10 by 10 by 4 output image. So we've left color space. And we've entered activation space.
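A quick shape check of the layer described above, as a sketch. The talk skips padding, so both settings are shown here: the 10 by 10 by 4 output assumes padding="same", while the Keras default of "valid" shrinks the width and height.

```python
import tensorflow as tf

# One random 10x10 image with three color channels.
images = tf.random.normal((1, 10, 10, 3))

# Four learned 4x4 filters, with and without padding.
conv_same = tf.keras.layers.Conv2D(filters=4, kernel_size=(4, 4), padding="same")
conv_valid = tf.keras.layers.Conv2D(filters=4, kernel_size=(4, 4), padding="valid")

same_shape = tuple(conv_same(images).shape)     # (batch, height, width, filters)
valid_shape = tuple(conv_valid(images).shape)

# Each filter is 3D: 4 x 4 across the image and 3 deep for the colors.
kernel_shape = tuple(conv_same.kernel.shape)

print(same_shape, valid_shape, kernel_shape)
```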
And the hope is that this will learn edges in some orientation, edges in other colors, different kind of colors. So we're getting maps describing where features are in the image. And where it starts getting powerful very quickly is when you add a second convolutional layer.
And the important thing is convolutional layer one. If you're a filter in this layer, you have to look at pixels and compute features. But if you're a filter in this layer, you get to look at the features this guy already computed and compute features of them. So if these are edges, maybe you'll learn to detect shapes, which is really cool.
And here's how this works. So let's say in our first convolutional layer, we learned four filters pulled out of a hat. The next convolutional layer looks through all four activation maps of the previous one. If we had learned 32 activation maps in the last layer, these filters would look through all 32 of them. So these filters are really, really powerful. Basically, they're taking dot products of features, different types of features.
The things they can compute are very powerful. And they get powerful very fast. But again, convolution works in exactly the same way. Every filter produces a single activation map. And if we had eight of them, we'd get eight activation maps. So what happens is, basically, the image gets deeper.
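To see that second-layer filters look through every activation map from the first layer, the kernel shapes can be inspected. This is a sketch: the 3 by 3 kernel size and padding="same" are assumptions, and the filter counts of four and eight follow the numbers used in the talk.

```python
import tensorflow as tf
from tensorflow.keras import layers

images = tf.random.normal((1, 10, 10, 3))

conv_1 = layers.Conv2D(4, (3, 3), padding="same")   # looks at pixels
conv_2 = layers.Conv2D(8, (3, 3), padding="same")   # looks at conv_1's features

maps_1 = conv_1(images)
maps_2 = conv_2(maps_1)

# Kernel shape is (height, width, input depth, number of filters).
# conv_2's input depth is 4: each of its filters spans all four maps.
print(tuple(conv_1.kernel.shape), tuple(conv_2.kernel.shape))
print(tuple(maps_2.shape))   # eight filters give eight activation maps
```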
And then, as I've drawn it here-- I didn't have time to talk about things like max pooling. But there's ways you can make the image-- basically, this is a big chunk of image. And it's slow to convolve over. As a way to speed this up, you might see things like max pooling layers. And what max pooling layers are, they reduce the width and the height of the image. But they leave the depth unchanged. So basically, what max pooling will do.
One funny thing you'll learn, by the way, if you start teaching this stuff. There's like two or three-- more than two or three- but there's a small number of people that make really excellent diagrams. And every other class in the world steals them.
The best example of this. I'd say like 95 percent of the classes I've seen have borrowed this diagram, which is written by-- you've all seen it, yeah? Which is written by Chris Olah. And it's the same thing with max pooling from Stanford.
But anyway, what max pooling does is just taking-- this is max pooling of two. So what we're saying is this is too hard to process computationally. We want a hack to make it smaller. What we're going to do is for every 2 by 2 region, we're just going to copy out the strongest activation to the output. So this reduces the image size by 75 percent. It's lossy, but whatever. That's max pooling.
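Max pooling over 2 by 2 regions can be sketched in NumPy for a single activation map. The input values are made up, and the reshape trick is one of several ways to write it.

```python
import numpy as np

def max_pool_2x2(activation_map):
    h, w = activation_map.shape
    # Drop any odd leftover row or column, then split into 2x2 blocks.
    trimmed = activation_map[:h - h % 2, :w - w % 2]
    blocks = trimmed.reshape(h // 2, 2, w // 2, 2)
    # Keep only the strongest activation in each block.
    return blocks.max(axis=(1, 3))

activation_map = np.array([[1, 3, 2, 0],
                           [4, 2, 1, 1],
                           [0, 0, 5, 6],
                           [1, 2, 7, 8]], dtype=float)

pooled = max_pool_2x2(activation_map)
print(pooled)
```

Sixteen values go in and four come out, which is the 75 percent reduction. In a CNN this is applied to each activation map separately, so the depth stays the same.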
And then, yeah, we talked about that earlier. What I want to do is talk about Deep Dream really quick. Well, there's a couple of things you can do with this.
Anyway, one question you might have is, by the time you get to layer 17, what are these filters actually responding to? Before we write a CNN, let me just talk about a couple of things you can do with this. So here are three things you can do with CNNs. The first is you can write one from scratch. And that's the next exercise. And that's writing a model Keras sequential, convolution, max pooling, convolution, max pooling, dense. And that's great.
The other thing you can do is transfer learning. And this is a really, really powerful concept. So, the idea of transfer learning. Usually, in machine learning, you have a small amount of data. Let's say your friend in the past had a lot of data. And maybe she trained a convolutional network on ImageNet from Stanford. So ImageNet, a million pictures in 1,000 different classes, takes a day to train a model on.
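A CNN written from scratch in that order might look like the sketch below. The input shape assumes 28 by 28 grayscale images such as Fashion MNIST, and the filter counts are arbitrary; the exercise notebook may use different values.

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    # Convolution, max pooling, convolution, max pooling...
    layers.Conv2D(16, (3, 3), activation="relu", input_shape=(28, 28, 1)),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    # ...then flatten the activation maps and classify with a dense layer.
    layers.Flatten(),
    layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```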
So let's say she trained it. And then you wanted to reuse her model to train your own. Instead of starting from scratch, what you could do is, let's say this is her model. And this dense layer at the end is classifying things from ImageNet. So, cats, dogs, snakes, peacocks, whatever.
Let's say you have Hondas and Toyotas. To do transfer learning, you delete this dense layer. You keep the rest of the CNN that she previously trained, unchanged. You add your own dense layer and outputs just for the classes that you care about. And then you relearn just this dense layer. But you leave the convolutional base unchanged. And the idea here is you use her CNN as a preprocessor.
So it takes an image. It gives you good features for the image. And then you learn a dense layer using those features. That's called transfer learning. It's a really, really interesting idea. The idea is using knowledge that you've learned on a previous task on another task. You might know of other examples of this. The only one I know that works-- well, that's no longer true.
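The mechanics of transfer learning can be sketched without downloading anything. In the code below, the small untrained base is only a stand-in for a friend's pretrained network (in practice you might load one from tf.keras.applications), and the two output classes stand for Hondas and Toyotas. All sizes are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

# Stand-in for a previously trained convolutional base. A real one would
# arrive with learned weights and its original dense layer removed.
base = tf.keras.Sequential([
    layers.Conv2D(8, (3, 3), activation="relu", input_shape=(32, 32, 3)),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
])

# Leave the convolutional base unchanged during training.
base.trainable = False

# Use the base as a preprocessor and add a new dense layer on top
# with outputs just for the classes we care about.
inputs = tf.keras.Input(shape=(32, 32, 3))
features = base(inputs, training=False)
outputs = layers.Dense(2, activation="softmax")(features)
model = tf.keras.Model(inputs, outputs)

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Only the new dense layer's kernel and bias will be updated by fit.
num_trainable = len(model.trainable_variables)
print(num_trainable)
```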
This works really well for images. And it's starting to work really well for NLP. But that's very, very recent with models like BERT. But I bet there's more potential here, too.
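The transfer learning recipe above can be sketched as follows, assuming TensorFlow 2. MobileNetV2 is used here only because it is small; the helper name build_transfer_model, the 160 by 160 input size, and the global average pooling layer in front of the new dense layer are assumptions for illustration.

```python
import tensorflow as tf


def build_transfer_model(num_classes=2, input_shape=(160, 160, 3),
                         weights="imagenet"):
    """Reuse a pre-trained convolutional base and learn only a new dense layer."""
    # include_top=False drops the original ImageNet dense classifier.
    base = tf.keras.applications.MobileNetV2(
        input_shape=input_shape, include_top=False, weights=weights)
    base.trainable = False  # leave the convolutional base unchanged

    inputs = tf.keras.Input(shape=input_shape)
    features = base(inputs, training=False)  # the base acts as a preprocessor
    pooled = tf.keras.layers.GlobalAveragePooling2D()(features)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(pooled)
    return tf.keras.Model(inputs, outputs)


# Example (downloads ImageNet weights on first use):
# model = build_transfer_model(num_classes=2)  # e.g. Hondas vs. Toyotas
# model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
```

Only the final dense layer has trainable weights, so training is fast even with a small data set.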
The third thing you can do with convolution is trying to understand what these filters do. So basically, let me see what we have here because of the time.
I'm just going to talk about Deep Dream for a minute. And then I'll give you some exercises you can do to write a CNN and then to do transfer learning.
So here's the idea with Deep Dream. All right. Has anyone seen Deep Dream before? Does anyone know why Deep Dream exists? Was the goal of Deep Dream like, let's smoke too much and generate psychedelic images from neural networks? So Deep Dream is-- so far, in a really hand-wavy way, I've been like, trust me. We get this magical feature hierarchy. It's going to be great. And Deep Dream is a way to actually show that this exists. So it's a way to investigate the representations learned by a neural network.
So basically, let me show you the results and then explain what they are. Just in terms of terminology. By the way, when you add these layers, you can name them. So if we wanted to, I could put a parameter, comma, name equals layer 1, or whatever.
This layer has four filters, each of which is 4 by 4. And let me show you what the authors of the Deep Dream paper-- before I go into Deep Dream, let me show you what the filters are learning to detect. So each of these is one filter in the first convolutional layer in a neural network. And these images were produced by starting with random noise and modifying the noise until the filter is maximally excited.
So this is an image. If you were the first filter in the first convolutional layer, and you saw this, you would produce your highest possible activation. And in layer 1, we're seeing that filters are responding to different colors. Right? This layer probably, by the way, looks like Conv2D, 32, 3 by 3, or something like that. So there might be 32 of these individual filters. They respond to different colors, and edges, and different orientations. So that's the first layer of a CNN. The names here are a little funny in this network.
But as we go deeper, these are the images that the filters in the next layer get really excited by. And already, they're getting a little complex. These are like texture-y, right? As you go deeper into the network-- I'm not going to go through all of them. Take forever. They get more and more complex as you go.
And what's interesting is if you start poking around really deep, they start to look like things we recognize. So we see like peacocks, and feathers, and cool textures. And I don't know what all this stuff is.
The reason that we're seeing these particular images is this model was trained on ImageNet. And these are the features that it found to be useful to classify ImageNet images. Presumably, if the image had features that looked like this, it might be a bee. And the reason that you see these features tessellating along the image is probably because convolution does this slide-y operation where you apply the filter in different regions.
And then if you go really deep, you see things that start making sense to us, like saxophones, broccoli, and who knows what.
But anyway, let me explain how we get these. And this is what Deep Dream is doing. First of all, there's two things you can ask. One is you can say let's find the image that excites one filter from some layer. And that's what we're doing here. Two, you can say let's find an image that excites the entire layer. So if this is layer 5, let's make this layer as excited as it can be.
And here's how that works. It's a really, really powerful idea. And the code is surprisingly short. So in Deep Dream, you start with the picture. Great. The next thing you need is a model that was trained on a large data set of images. And it doesn't matter what model you use.
Here, we're importing. This is transfer learning, almost. We're importing a model called Inception. There is a-- if you learn more about CNNs, there's a box of famous models. There's things like VGG, Inception, ResNet, whatever. Inception is one of them. And these are all different architectures.
And what I mean by an architecture is, basically, when you do model equals sequential, add dense, add dense, add dense, that's an architecture. This would just be a fancier architecture built with the functional API, or the subclassing API, or whatever. But that's Inception. It's some CNN with some fancy bits added.
If you were doing transfer learning, this line here says give me the CNN but not the dense layer. And later, you could add your own dense layer to this. Here, we're saying give me the weights that we previously learned on ImageNet. So this is a trained model.
The next thing we're going to do is we're going to take an image. And we're going to pass it through the model. And what we want are the activations at a certain layer. So our goal is going to be to modify the image to get those activations as high as possible. And as we modify the image, presumably we'll add more things that-- whatever that layer is detecting we want to appear in the image.
So because the layers in this model are named, we can look at the summary and find the names we want. And then here, we're using the functional API just to write a new model where we pass in an image. And we get the activation maps out of the layers. And if you pass an image through this and you run it, you'll see a bunch of matrices, which are literally the output of the convolutional filters at those layers. And there's going to be a lot of numbers.
But here's how Deep Dream works. And this is kind of magical. Some of this code is boilerplate. But you always need a loss function. And the loss function here is just this. We're summing up all the activations. So if we want to find an image that excites this layer, we want to maximize this list of activations. So we literally sum them. Or here, we're taking the mean. But same thing. And that's almost it.
Just so you know, we updated this like yesterday. So I haven't actually seen this brand new version yet. But I can give you the idea.
So what we do. When we call this, we're going to pass some image through a model. Inside here, we get the list of activations. And this is the sum or the average of the activations. We need to maximize this.
And here's the insight behind Deep Dream. So normally, when you have a deep learning model, you adjust the weights on your model to fit the data. Here, we're going to leave the model alone. And we're going to adjust the image to fit the model. So we get the gradient of the loss with respect to the pixels of the image.
And if you print this out, these gradients will have exactly the same shape as the image, which means you can directly add them to the image. And we want to do gradient ascent because we want to make the loss go up. And so there's some normalizing code here. But the important part, they changed this slightly. But it's right here. We're doing gradient ascent. So we're adding the gradients to the image multiplied by a learning rate.
And at every step of this, it's amazing. That's all the code you need to Deep Dream-ify an image. So if you picked a layer that responds to sheep, those gradients will make your image slightly more sheep-like, which is nuts.
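The Deep Dream loop described above can be sketched as follows, assuming TensorFlow 2. The helper names (make_feature_extractor, activation_loss, gradient_ascent_step), the step size, and the clipping range are assumptions for illustration; the official tutorial code may differ in detail. The layer names in the usage comment are assumed to match InceptionV3's summary.

```python
import tensorflow as tf


def make_feature_extractor(base_model, layer_names):
    """Functional-API model: image in, activations of the named layers out."""
    outputs = [base_model.get_layer(name).output for name in layer_names]
    return tf.keras.Model(inputs=base_model.input, outputs=outputs)


def activation_loss(image, extractor):
    """Sum over the chosen layers of each layer's mean activation."""
    batch = tf.expand_dims(image, axis=0)  # the model expects a batch
    activations = extractor(batch)
    if not isinstance(activations, (list, tuple)):
        activations = [activations]
    return tf.reduce_sum([tf.reduce_mean(a) for a in activations])


def gradient_ascent_step(image, extractor, step_size=0.01):
    """Change the image, not the weights, so the loss goes up."""
    with tf.GradientTape() as tape:
        tape.watch(image)  # image is a plain tensor, so watch it explicitly
        loss = activation_loss(image, extractor)
    grads = tape.gradient(loss, image)           # same shape as the image
    grads /= tf.math.reduce_std(grads) + 1e-8    # normalize the gradients
    image = image + step_size * grads            # gradient ascent
    return loss, tf.clip_by_value(image, -1.0, 1.0)


# Example (downloads ImageNet weights on first use):
# base = tf.keras.applications.InceptionV3(include_top=False, weights="imagenet")
# dream_model = make_feature_extractor(base, ["mixed3", "mixed5"])
# for _ in range(100):
#     loss, image = gradient_ascent_step(image, dream_model)
```

The image is assumed to be a float tensor of shape (height, width, 3) already scaled to the range the model expects, here taken to be -1 to 1.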
And there's two versions of the Deep Dream tutorial. So the first half of it, we tried to write the minimum amount of code to make it work. And that produces these slightly staticky images. The second half of the tutorial-- that's the research inside. The second half, which is a little bit more complicated, has different tricks to make them really high resolution and stuff like that.
But the point is I just wanted to talk about Deep Dream after convolution because it really proves the point. I really like it. It means this isn't BS. And these layers are actually learning this feature hierarchy. And we can see it. And we can reuse it. And it's cool.
Anyway, let's do this for 10 minutes. I know it's fast. But the goal here is to start writing a CNN for a data set called CIFAR-10. And CIFAR-10 is in the MNIST family. It's a small data set. But it's color. So it's a little bit more interesting. And there's a reference tutorial you can look at which has background on how convolutional layers work in TensorFlow 2.
OK, so let me point you to one or two more things. I'm going to take like two minutes and just point you to-- we've spent a lot of the summer working on the tutorials. So let me just point you to some of the latest ones, just to save some time.
So basically, for transfer learning, we have two different tutorials you can check out. And let me explain why we have two different ones from an industry perspective. So basically, there are two repositories of pre-trained models in TensorFlow 2.
One list of pre-trained models, which are awesome, are these Keras Applications. And if you Google around for Keras Applications, there are these one-liners where you can import a lot of famous CNNs, usually with weights trained on ImageNet. So a lot of our tutorials are using MobileNet V2.
By the way, here's a really simple line of research that's interesting. Previously, with CNNs, the goal was how accurate of a model can we train. The new goal today is often how small of a model can we train that's accurate enough. And the goal is to get it fast enough to run on a phone or in a web browser. It's not rocket science. Basically, what people do is they do experiments with different numbers of layers. They look at like the accuracy-speed trade-off.
Anyway, so one tutorial has these applications from Keras. They're great. The other has a larger repository of pre-trained models from TensorFlow Hub. And TensorFlow Hub is a more recent collection of models. And we're working on expanding this for TensorFlow 2. The truth is it doesn't really matter which one you use. Sometimes, big companies build two of everything and see which one works better. Keras applications are older. And they existed before TensorFlow 2. They're great.
Anyway, you can try either. And whichever one you feel is easier is the one that you should use. So that's transfer learning.
A really cool thing today is GANs. And TensorFlow 2 has really, really awesome tutorials for GANs. We've got three of them, plus a VAE, plus an adversarial. Well, actually, let's just look at the GANs. That is not a GAN. Has anyone worked with GANs? A couple. All right.
So basically, for people that are new to GANs, they ask a really hard question. So everything we've looked at so far is, here's a picture. Classify the picture. The question GANs ask is, generate me a picture. And the goal is to generate a picture that looks real.
And the challenge with generating things with deep learning is that we need a loss function. So everything we do in deep learning is we're doing gradient-based optimization against some loss function like squared error, or cross entropy, or maximize the activation of some layer. It's hard to get a loss function for generating cats.
And the way GANs work is research from Ian Goodfellow in 2014. And what Ian realized is we can get a loss function for generating images for free. And we already have it. It's an image classifier. And so if you train an image classifier to say, is this image of a cat real or fake? That's just a standard convolutional network. You can train a second CNN to generate images of cats. And you can train them against each other.
And so you have this game where you have a generator and a discriminator. And they're trained in parallel. And basically, over time, the generator learns to generate more realistic pictures of cats. And the discriminator becomes better and better at telling real cats apart from fake cats. Over time, they hopefully reach equilibrium, at which point you can generate pictures of cats.
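A rough sketch of the two losses in this game, using NumPy only. It assumes the discriminator outputs a probability that an image is real and uses binary cross-entropy, which is a common choice; the function names are hypothetical.

```python
import numpy as np


def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    probs = np.clip(probs, eps, 1.0 - eps)
    losses = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return float(np.mean(losses))


def discriminator_loss(real_probs, fake_probs):
    """The discriminator wants real images scored 1 and generated images scored 0."""
    real_loss = binary_cross_entropy(np.ones_like(real_probs), real_probs)
    fake_loss = binary_cross_entropy(np.zeros_like(fake_probs), fake_probs)
    return real_loss + fake_loss


def generator_loss(fake_probs):
    """The generator wants its images scored as real (1) by the discriminator."""
    return binary_cross_entropy(np.ones_like(fake_probs), fake_probs)


# A generator that fools the discriminator gets a lower loss than one that does not.
print(generator_loss(np.array([0.9])), generator_loss(np.array([0.1])))
```

The classifier's output is the only training signal the generator sees, which is the sense in which the loss function comes for free.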
And this tutorial here is the minimum amount of code you need to train a GAN or generative adversarial network to generate images of MNIST. And that is a visualization of the MNIST images being generated over time.
And if you look at this, it will look very, very similar. There's two chunks. Chunk 1 is the discriminator. And if we had more time, this would have looked almost identical to the image classifier you would have written on the last exercise. The discriminator is just a run of the mill CNN.
The trick is the generator. And when you're new to deep learning, you'll start looking at code like this. And you'll recognize some layers. And you won't recognize others.
So let me just walk through what some of these layers are. And by the way, the best way to go through these. The papers are linked at the top of the tutorials. If you read the paper, you track the code at the same time. It's much, much easier than just the paper.
So basically, you'll see layers like dense. You've seen that. Leaky ReLU is a friend of ReLU with just slightly different properties. This would work fine with ReLU also. You haven't seen batch normalization. And you haven't seen these Conv2D transpose layers. Let me see if I have slides on Conv2D transpose really quick.
So one challenge with GANs is we don't want to generate the same image every time. Otherwise, the discriminator would just learn that that's the fake image. So we need to randomly seed the generator. So the way the generator in a GAN works is, here's some random numbers. Use these to parameterize the image that you generate.
Maybe the first random number tells you how one-like it is. And the second one tells you how two-like it is. And what the GAN has to do in the generator, it has to go from a list of numbers to an image. And we usually do upsampling to do that. Conv2D transpose is an upsampling layer. There's two ways to do upsampling. One is you can just double the size of the image and average the pixels. Two is we can do a learned upsampling.
And what I'd recommend doing as you're going through layers, when you see a layer like this, take a few moments or a day or two. And dig around. And try and understand what it's trying to do. And so when I was going through convolution 2D transpose, I was like, what the hell is that? And so what I usually do is-- these are slides from the summer workshop. What I usually do is try the simplest mini example of the layer and just work out an example. [INAUDIBLE] they look.
So this is convolution transpose. And here's a quick example of how we can go from a small image to a larger image with a learned upsampling. So this is a lot like convolution. Here's our filter. And the basic idea. The details aren't important. I just want to show you that it's a thing. You would take the small image. This is how the Keras Conv2D transpose layer works. You take the image. And you use it to parameterize the filter.
So instead of the dot product, it's the image value. Multiply it against all the filter values. And you write that down on the output image. And then, just like convolution, you slide. So we slide again. We multiply that by the filter. We write down the output values and sum them. Slide again. We slide again.
But the point is that's all the layer is doing. It's complicated but not so bad if you have the time to go through a small example. Conv2D transpose is a learned upsampling.
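The small worked example described above can be sketched in NumPy. This is a simplified single-channel version with no padding and no bias, so it mirrors the arithmetic rather than the full Keras layer; the function name and the toy values are assumptions.

```python
import numpy as np


def conv2d_transpose(image, kernel, stride=2):
    """Scale the kernel by each input pixel, stamp it on the output, sum overlaps."""
    h, w = image.shape
    kh, kw = kernel.shape
    out = np.zeros(((h - 1) * stride + kh, (w - 1) * stride + kw))
    for i in range(h):
        for j in range(w):
            top, left = i * stride, j * stride  # slide by the stride
            out[top:top + kh, left:left + kw] += image[i, j] * kernel
    return out


small = np.array([[1.0, 2.0],
                  [3.0, 4.0]])

# With a 2x2 kernel of ones and stride 2, each pixel becomes a 2x2 block,
# which is the same as fixed "repeat every pixel" upsampling.
print(conv2d_transpose(small, np.ones((2, 2)), stride=2))

# With stride 1 the stamped kernels overlap, and overlapping values are summed.
print(conv2d_transpose(small, np.ones((2, 2)), stride=1))
```

In the real layer the kernel values are learned, so the network decides how to fill in the larger image instead of using a fixed rule.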
The other layer in there that we haven't talked about is batch normalization. And let me just briefly explain the idea there. So in a lot of the code you'll look at for MNIST, if you're training an image classifier, one of the first things we do is we normalize the data. So we import some images. And usually, the pixels range between 0 and 255. And the first thing you'll see in a lot of tutorials is we divide by 255, which makes them range between 0 and 1.
The reason we do that, briefly, is that basically neural networks don't like large numbers as input. And there's different reasons why they don't like large numbers, one of which is if you remember the slides from a dense layer, the input value is multiplied by a weight. If we have a very large input value or a very large weight, we could get overflow, numeric overflow with floating point problems. And it can have bad properties for gradient descent. So we normally normalize the numbers to be between 0 and 1.
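As a small illustration of the divide-by-255 step, here is a NumPy sketch. The random array stands in for a batch of 8-bit grayscale images; it is made-up data, not MNIST.

```python
import numpy as np


def normalize_pixels(images):
    """Scale 8-bit pixel values from [0, 255] to floats in [0, 1]."""
    return images.astype("float32") / 255.0


rng = np.random.default_rng(0)
fake_images = rng.integers(0, 256, size=(4, 28, 28), dtype=np.uint8)
scaled = normalize_pixels(fake_images)
print(scaled.dtype, scaled.min(), scaled.max())
```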
Here's the insight behind this amazing layer called batch normalization. And let me explain what's happening and why this is important. And you'll also see that a lot of the research today in deep learning is not rocket science. It's just very early in the field.
So here's our beginners tutorial. And we import MNIST. We normalize its [? reproduction. ?] We import MNIST. We normalize it to be between 0 and 1. This means if you are the first dense layer, if you are this guy, all the input values are between 0 and 1. You learn the weights.
However, if you're the second dense layer, your input values are not necessarily between 0 and 1. They're the outputs of whatever this previous dense layer has produced, which means your job is harder. So this dense layer just has to learn weights for that fixed distribution. But the distribution coming into this guy is changing, which makes the job harder than it needs to be.
So what batch normalization does, it's a layer that you would add right here. And if you wanted to, right here. And the basic idea is it's a normalizing layer. And there's a lot more details you can read about. But the basic idea is-- let me see if I'm awake enough to go through this.
Here's some features coming into a layer. And these are examples. And what we're doing is batch normalization computes the mean and standard deviation of each feature. And it just normalizes it before it goes into the next layer. So batch norm, basically, in a nutshell is, let's re-normalize the data to make the distribution going into the layer change slower. So it can speed up learning a lot.
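The core computation can be sketched in NumPy as follows. This covers only the training-time normalization over one batch; the scale and shift parameters (gamma and beta, which a real layer learns) and the small eps constant are included as plain arguments, and the running statistics a real layer keeps for inference are left out.

```python
import numpy as np


def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each feature (column) using the mean and variance of the batch."""
    mean = x.mean(axis=0)   # one mean per feature
    var = x.var(axis=0)     # one variance per feature
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta


rng = np.random.default_rng(0)
# 256 examples, 3 features with a large mean and spread (made-up data).
batch = rng.normal(loc=50.0, scale=10.0, size=(256, 3))
normalized = batch_norm(batch)
print(normalized.mean(axis=0), normalized.std(axis=0))
```

After the layer, every feature has roughly zero mean and unit standard deviation, whatever the previous layer produced.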
Another layer you'll see in DCGAN is Dropout. And here's Dropout. The good news is-- let's see. You see these Dropout layers. Dropout is a really, really nice layer. And it's easy to use. Dropout is a great way to prevent overfitting.
And here's the basic idea. So we have some cartoon network full of dense layers. And let's say we're overfitting. We're memorizing the training data. What Dropout does is it randomly deactivates, on every batch, a subset of the neurons. And it does that by setting their activations to zero.
So Dropout basically says this network is too powerful. Let's randomly turn off a bunch of neurons at every step. And the reason that this helps prevent overfitting, it makes it harder for the network to learn. And the idea is that because it can't rely on any individual neuron being on at any individual step, it has to learn redundant representations. So it has to learn different ways of detecting the same feature. So Dropout is a layer to prevent overfitting.
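A minimal NumPy sketch of the idea, with a hypothetical function name. The rescaling of the surviving activations by 1 / (1 - rate) is a detail not mentioned in the talk; it is the usual "inverted dropout" convention, which keeps the expected activation unchanged.

```python
import numpy as np


def dropout(activations, rate, rng, training=True):
    """Randomly zero a fraction `rate` of activations during training."""
    if not training or rate == 0.0:
        return activations  # dropout is switched off at inference time
    keep_mask = rng.random(activations.shape) >= rate
    # Scale the survivors so the expected value of each activation is unchanged.
    return activations * keep_mask / (1.0 - rate)


rng = np.random.default_rng(0)
activations = np.ones(10)
print(dropout(activations, rate=0.5, rng=rng))
print(dropout(activations, rate=0.5, rng=rng, training=False))
```

A new mask is drawn on every call, so a different subset of neurons is turned off at each step.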
What's cool is if you learn about Dropout, it was invented by Geoff Hinton. And Geoff Hinton had this really cool thing on Reddit where somebody asked him what's the intuition behind Dropout.
And this is what Geoff said. So basically, Geoff was saying-- I think he must be a really nice guy because he was trying to make friends with his bank teller. And he was having trouble making friends with his bank teller because the teller kept being changed. And basically, he asked the bank, why are you changing the bank teller? And it's so he can't defraud the bank. And so because he can't rely on any individual bank teller being there at any individual day, you can't form a friendship. You can't come up with a conspiracy to defraud the bank.
And Dropout, in the same way, prevents neurons from always being present. So that's the intuition. I don't think like this when I go to the bank.
All right. Let's just-- I just want to point you to one more thing. Then, we're going to play a game. And then, we're going to stop. If you have time, there's two awesome new GAN tutorials you can go through and read the papers. The reason I like these is it's complete code that works. And it works with a click, which is nice.
The first is Pix2Pix, out of Berkeley, which is beautiful. This is a conditional GAN. And the goal here is not generate me a random image that looks real. It's generate me an image that looks real, that also resembles an input image I give you. So this is an input image for the building facade or facade, not sure. This is the building that image actually corresponds to. And this is what's generated by Pix2Pix.
And so this is an image with similar pixel values to this image that the discriminator isn't able to distinguish from real or fake. And if you start looking through these GANs, the main thing to look at is the loss function.
And so if you want to understand the evolution from DCGAN, which is MNIST, to Pix2Pix, look at the loss function. And the main difference is in the loss function. The loss function for DCGAN is trick the discriminator. The loss function in Pix2Pix is trick the discriminator and minimize the L1 distance between the input image and the output image, which forces the output image to look similar to [? right. ?] That's the main difference.
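The difference between the two generator losses can be sketched in NumPy. In the Pix2Pix paper the L1 term compares the generated image with the paired target image, which is what this sketch does. The function names are hypothetical, and the weight of 100 on the L1 term follows the value used in the paper.

```python
import numpy as np


def binary_cross_entropy(labels, probs, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    probs = np.clip(probs, eps, 1.0 - eps)
    losses = -(labels * np.log(probs) + (1.0 - labels) * np.log(1.0 - probs))
    return float(np.mean(losses))


def dcgan_generator_loss(fake_probs):
    """DCGAN: the only goal is to trick the discriminator."""
    return binary_cross_entropy(np.ones_like(fake_probs), fake_probs)


def pix2pix_generator_loss(fake_probs, generated, target, l1_weight=100.0):
    """Pix2Pix: trick the discriminator and stay close to the paired target image."""
    adversarial = dcgan_generator_loss(fake_probs)
    l1 = float(np.mean(np.abs(target - generated)))
    return adversarial + l1_weight * l1


target = np.zeros((4, 4))
probs = np.array([0.5])
print(pix2pix_generator_loss(probs, target.copy(), target))  # perfect match
print(pix2pix_generator_loss(probs, target + 1.0, target))   # every pixel off by 1
```

The extra term is what ties the generated image to the specific input pair instead of to any realistic-looking image.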
And then the same reasoning applies to CycleGAN, which is the latest one we have. And CycleGAN does unpaired image translation. And so with Pix2Pix, you have to have paired training data. So, a facade and a building, a map image and a satellite image.
There's lots of things you might want to do GANs for that you can't get paired training data for, like day to night. Even day to night is hard to get paired training data. If we took a picture of the MBL at night and in the day, things change. Cars move around. People move around. So it's hard to get. And of course, there isn't paired training data for horses to zebras because it doesn't exist.
But CycleGAN can do this. And the insight with CycleGAN was you don't have to have a one to one mapping. What you need is a directory. So if you have a directory of horse pictures and a directory of zebra pictures, you can exploit supervision at the level of sets. So CycleGAN was a really cool thing. All the code is there.
All right. Let's stop with that. Let's play one game really quick. And I'll point you to two games that, if you're teaching, they can be fun to help keep students engaged. And there's a point to them.
So could I have a quick volunteer? And this person should be proud of their artistic ability. Very good artist, which I am not. Thank you. Come on up. Has anyone seen Quick Draw before? So this is great with kids and adults. Anyway, so Quick Draw, by the way, just got way harder. And I'll explain why in a sec. So if you could do a-- you've seen this. Go for it. So let's try and do like two or three quick draws. OK, OK. So let's try and do two or three. Oh well. I hope this isn't blazingly loud. Draw shorts.
QUICK DRAW: I see music note. Oh, I know it's shorts.
JOSH GORDON: We don't-- that was amazing. So we don't have audio. But usually it speaks to you as you're playing it. So I see baseball. I see shorts. So try it again. That was great.
QUICK DRAW: I see shoe, or suitcase, or square, or camera, or stereo. I see stove.
JOSH GORDON: Yeah, it's a stove.
QUICK DRAW: Oh, I know. It's--
JOSH GORDON: Yeah, there you go. So let's do one more. This is actually really good. Cannonball.
QUICK DRAW: I see line, or rainbow, or potato, or peanut, or pond. I see watermelon or steak. Oh, I know. It's cannon. This is actually surprisingly-- OK.
STUDENT: Do I keep going?
JOSH GORDON: Keep going.
QUICK DRAW: I see nose, or line, or pond, or pool. I see skateboard, or sandwich, or hockey puck. Oh, I know. It's hamburger.
JOSH GORDON: Let's do one more. You might've set the new record for Quick Draw.
QUICK DRAW: I see line, or diving board, or circle, or peanut. I see potato. Oh, I know. It's steak.
JOSH GORDON: All right. We're going to stop there. So, nice job. Thank you very much. Thanks.
All right. So the first thing I have to tell you about Quick Draw. If you're teaching a class, MNIST is boring as hell. But it's a good place to start. A good homework for the students is Quick Draw. So let me point you to some code.
If anyone wants the URL, you can grab this screenshot. And there's probably official versions of this, too. But what this is. It's a little Python file you can use to make a Quick Draw data set. So Quick Draw, which your images are now a part of. Quick Draw has an academic data set of probably like 20 million plus Quick Draw diagrams at this point.
And this code, you can pick which class names. You can say, yo, give me all the planes, cars, trucks. And you can say how many images you want, anywhere from like five all the way up to millions. And what's nice about this is students, when they go home, they can get some experience like training a model on a large amount of data. And it doesn't have to be a blocker because they can select how much they want. So it's a nice thing, too. Yeah, and this will walk you through how to get the images out of Quick Draw and stuff like that.
The other thing I have to say about Quick Draw, which is really interesting. If you look at these drawings, they're not just pictures. But they're sequences of brushstrokes. And so what's cool is these are different elephants from the Quick Draw elephant set. And what's cool is the different colors and different brushstrokes. I don't know the order. But that's too.
And what's cool is there's some really good research. And given that we have this database, what else can we do with these brushstrokes? And you can use RNNs, which are usually used to generate text and stuff like that. But the insight from David Ha's group is you can generate Quick Draw images using an RNN.
And there's a really cool game for this. It's called Sketch RNN. Sketch RNN is an RNN that's been trained on the Quick Draw data set. And what's cool is you can pick an image from Quick Draw. So if we pick-- I don't want to. We'll pick penguins. It's marine biology, right? So it goes like that.
And then you start drawing a penguin. And then you stop. Sketch RNN attempts to auto complete your penguin. And it looks silly. But it's super impressive. You think about how hard this is to write.
And so basically, what we're seeing is, of the people in the Quick Draw data set that started drawing a penguin in the way that I did, these are the brushstrokes that might follow. And it's kind of cool. I'm not sure this will work with penguins. But maybe people start drawing the beak. And so it's really cool. And there's a surprisingly large number of images that you can draw.
So this one is not as immediately actionable. You can show students the Quick Draw data set. Then they can go train a classifier. This one is more of like, hey, FYI, this is super cool. All the code for this is online. It's just someday I would love for us to have a short tutorial.
And the reason I wanted to mention this is if you think about how this generates images, here's some research. This is very different from GANs. So GANs synthesize these beautiful photorealistic images pixel by pixel. But that's not how people draw, right? If you start drawing a scene, you're not going to draw it pixel by pixel. You draw in brushstrokes. So this is an RNN-based solution that learns to draw images brushstroke by brushstroke, which is very, very different.
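To make the brushstroke idea concrete, here is a small pure-Python sketch. It assumes a drawing stored as a list of strokes, each stroke being a list of x values and a list of y values, and converts it to a sequence of pen offsets with a pen-lifted flag, the kind of sequence an RNN could be trained on. The function name and the toy drawing are made up for illustration.

```python
def to_offsets(strokes):
    """Turn absolute stroke coordinates into (dx, dy, pen_lifted) steps."""
    steps = []
    prev_x, prev_y = 0, 0
    for xs, ys in strokes:
        last = len(xs) - 1
        for k, (x, y) in enumerate(zip(xs, ys)):
            pen_lifted = 1 if k == last else 0  # 1 marks the end of a stroke
            steps.append((x - prev_x, y - prev_y, pen_lifted))
            prev_x, prev_y = x, y
    return steps


# Two strokes: a horizontal line, then a vertical line from its right end.
drawing = [
    [[0, 10], [0, 0]],
    [[10, 10], [0, 5]],
]
print(to_offsets(drawing))
```

A model that predicts the next step in this sequence is drawing stroke by stroke rather than pixel by pixel.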
Anyway, that's all I got. So basically, here's some tutorials you can look at. Also, for people that are teaching, here's three book recommendations. The first two books are not academic. They're like 40 bucks. These are how you do the thing books. How do you train an image classifier? How do you train a text classifier? How do you make an RNN work? They're both great.
If you get the first book, only get the TensorFlow 2 version, which is in prerelease right now. So, the second edition. The Deep Learning with Python book. This is a Manning book. It's by François Chollet, who wrote Keras. Everything from this book works in TensorFlow 2 just by changing an import.
And then if you want a textbook, it's free. It's Ian Goodfellow's Deep Learning book. This is a little bit-- it might be instructive for some of you. I struggled with this a bit. It was really hard. To me, this is more of a really great reference.
Yeah, that's all I got. Thanks a lot. Can I answer any questions? Or yeah, thanks.