Decoding Animal Behavior Through Pose Tracking
July 10, 2020
Talmo Pereira, Princeton University
Behavioral quantification, the problem of measuring and describing how an animal interacts with the world, has been gaining increasing attention across disciplines as new computational methods emerge to automate this task and increase the expressiveness of these descriptions. In neuroscience, understanding behavior is crucial to the interpretation of the structure and function of biological neural circuits, but tools to measure what an animal is doing have lagged behind the ability to record and manipulate neural activity.
In order to get a handle on how neural computations enable animals to produce complex behaviors, we turn to pose tracking with high-speed videography as a means of measuring how the brain controls the body. By quantifying movement patterns of the humble fruit fly, we demonstrate how advances in computer vision and deep learning can be leveraged to describe the "body language" of freely moving animals. We further demonstrate that these techniques can be applied to a diverse range of animals, ranging from bees and flies to mice and giraffes.
This talk will describe our work in generalizing deep learning-based methods developed for human pose estimation to the domain of animals. We tackle the problems of learning with few labeled examples, dataset-tailored neural network architecture design, and multi-instance pose tracking to build a general-purpose framework for studying animal behavior. Finally, we'll explore how postural dynamics can be used in unsupervised action recognition to create interpretable descriptions of unconstrained behavior.
Slides, code and data for tutorial:
In the tutorial part of the session, we will work through the usage of our framework SLEAP (https://sleap.ai) to see how we can train and evaluate deep learning models for animal pose tracking right in the browser. No data is required, but we will provide a short tutorial on using SLEAP with your own data, for which a laptop with Miniconda (https://docs.conda.io/en/latest/miniconda.html) installed is recommended.
Speaker Bio: Talmo Pereira is a PhD candidate in Neuroscience at Princeton University where he uses deep learning and computer vision to develop new tools for studying animal behavior. His recent work has demonstrated how advances in deep learning-based human pose estimation and tracking can be adapted to the domain of animals to solve problems in fields ranging from neuroscience to ecology. This work has been published in Nature Methods and featured in The Scientist, Nature Lab Animal, Nature Toolbox, and Quanta Magazine. Talmo was a research intern in Perception at Google AI working on pose-based action recognition, an NSF Graduate Research Fellow, and was recently a recipient of the Porter Ogden Jacobus Fellowship, Princeton University's top graduate student honor.
JANELLE: We have Talmo Pereira speaking to us today. So I'm excited. I actually worked with Talmo last summer at Google, and he told me a lot about his work on using pose tracking software. And so it seemed like that would be a cool tutorial to present to the department. So Talmo is a PhD candidate in neuroscience at Princeton University. And so today, he'll be talking about his work using deep learning to perform pose tracking.
TALMO PEREIRA: Thanks, Janelle. All right, guys. Excited to be virtually here. So today we're going to be doing this in two parts. First, I'll be giving a little-- somewhat long talk on how we've been using animal pose tracking to better understand behavior. And afterwards, we'll do about 30 minutes of a tutorial on using the software that we've developed to do multi-animal pose tracking.
So as we go along, there will be several hopefully natural stopping points between sections if you guys want to stop and ask questions. And of course, the tutorial afterward will be a lot more interactive. So let's get into it. So behavior. Behavior, particularly through the lens of neuroscience, is typically schematized in something that looks like this, where we have a body that controls effectors in response to sensors to produce behavior.
Put more biologically speaking, we can consider this as being the brain, which through the spinal cord or equivalents controls muscles to create behavior to interact with the physical world in response to signals from the environment. In other terms, behavior then could be considered akin to body movement. And this is going to be the [INAUDIBLE] definition that we're operating on.
So our goal with the methods I'll be describing today is to go from videos of behaving animals, that is, moving animals-- in this case, we have here a fruit fly-- to some quantitative measure of movement. So how are we going to go about doing this? So first, we're going to start with a short background on animal tracking using computer vision.
That's going to motivate us for how we started developing our software called LEAP, in which we first adapted deep learning for animal pose estimation. Then we'll get into our-- the new version and the successor of LEAP called SLEAP, for Social LEAP, which enables very general-purpose animal pose tracking. And we'll get into the challenges that emerged as we evolved from one to the other.
And finally, I want to-- I was going to end with a couple of stories of how we can use unsupervised machine learning to recognize behavior directly from pose tracking of our animal body movement signals. So background. How do we go about tracking animals? The coarsest level of description, and this is something that you might not have even formalized, but certainly you've seen in the past, starts from these coarse descriptions, from a centroid or the position of the animal to an ellipse.
And that might look a little bit like this if we overlay it on a mouse. We'll fit an ellipse and a centroid, and we'll be able to derive its direction. The next level of description allows us to do this with multiple animals. This introduces a variety of challenges, particularly pertaining to how we keep track of individual animals across time. Then there's the finer description of behavior, which is now all the rage, particularly in neuroscience and behavioral neuroscience, that is, to do pose estimation, in which we go from having a single point or pair of points to many landmarks around the body of the animal.
And the final frontier, as we'll hope to get to soon, collectively as a field, is to do 3D pose estimation. So I'm not going to be talking about this in this current talk, but there's lots of exciting directions at this finest level of description. So why is it not enough to do the classical tracking, to do center and orientation? Well, consider these three different fruit fly behaviors.
For something like walking or general locomotion, this is where traditionally you would use the center of mass. And as you can see from this tracking here, you can indeed extract a signal that corresponds to these dynamics. You see the movement of the center of mass. Perhaps orientation gave you a little more information.
But for other movements that do not involve locomotion-- like, say, grooming, where a fruit fly uses its two front limbs to rub its head-- there is no change in the parameters that we can capture with classical tracking, nor for this behavior called the wing extension. So we're missing information, even in the locomotion case, about the specifics of the gait, and in these cases about the limbs, which are simply not captured in this coarse description of behavior.
So this is where pose estimation comes in. The idea is to extract the position of every body part such that we can now capture dynamics or motion of every actuatable limb of the animal and therefore have a complete description of all the types of behaviors that its motor system can elicit. So how are we going to go about doing this? So for this, we developed a system called LEAP.
LEAP builds on top of previous work in human pose estimation using deep learning, which has had tremendous success in taking images that are completely unconstrained, running them through a convolutional neural network that then is trained to predict these heat maps. So these are images in which the brightest pixels, like these red ones here, correspond to the likelihood of the body part being at that location.
So this totally changed the game in the human pose estimation field, in particular when it came up with this representation, which is an image rather than the raw coordinates. So doing this makes it-- it's very amenable to neural network training, particularly in these fully convolutional networks. And so this is the approach that we decided to adopt.
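To make the confidence-map representation concrete, here is a toy sketch in Python (not SLEAP's or LEAP's actual code): a ground-truth body part is rendered as an unnormalized 2D Gaussian, and for a single animal the coordinates can be recovered with a global argmax. The Gaussian width `sigma` is an arbitrary choice here.

```python
import math

def confidence_map(height, width, cx, cy, sigma=1.5):
    """Unnormalized 2D Gaussian centered on the ground-truth part (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]

def global_peak(cmap):
    """Recover coordinates from a single-instance map via global argmax."""
    best = max((v, x, y) for y, row in enumerate(cmap) for x, v in enumerate(row))
    return best[1], best[2]

cmap = confidence_map(32, 32, cx=10, cy=20)
assert global_peak(cmap) == (10, 20)
```

A network trained against such maps only has to produce an image, which is what makes the representation so amenable to fully convolutional architectures.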
But we had a problem in attempting to do this with animals. That is, how do we deal with having no training data? So although these approaches in humans were wildly successful, they relied on having these training data sets on the order of tens of thousands to hundreds of thousands of images. And that's simply not tractable to annotate and gather for lab animals, particularly when you have a new animal, a new morphology or new imaging [INAUDIBLE].
So we developed-- to deal with this, we developed a software framework called LEAP. LEAP stands for LEAP Estimates Animal Pose, in which we have this general pipeline, where we go from raw images of our animals into a convolutional neural network that then is going to output these part confidence maps. So these are these probabilistic representations. Then we'll generate the pose.
So getting into the details here, the idea was to create a simple and lightweight encoder-decoder style convolutional neural network. And what that means, basically, is that we start with the raw image of our animal. That's going to be fed into a series of convolutions with pooling. This will enable us to capture features across different scales in the image and then up-sample it until we can now output a set-- a multi-channel image where each channel corresponds to the part confidence maps for each individual body part.
And then we can just train the network based on the ground truth body parts that we generate from label data such that it can now do this on unlabeled images. Great. So we implemented it, and it's a tiny network, so the first thing that we notice that we're able to achieve was very fast training on a single GPU. So during the training phase, in as few as 15 to 20 minutes on a single desktop GPU, you can already get to virtually human-level or convergent accuracy.
And so here on the right what you have is a visualization of the accuracy distribution, that's these circles here, over multiple training epochs. And as you can see, by 15 epochs or so, and that's about 15 minutes on just a desktop GPU-- a single one rather than on a huge GPU cluster-- you already get to virtually perfect accuracy.
So because we could do this fast training, this enabled a paradigm called human-in-the-loop training. And this is the key to being able to adapt this to new data sets of new animals with new sets of body parts. And so the way that works is that we have a GUI in which you load up your images for labeling. You click and drag the markers for the body parts that you've defined such that they're positioned on top of the correct locations.
You can label as few as something like 10 frames, train just for a few minutes, then the software would predict on the unlabeled frames, which then get imported back in. And all you'd have to do is correct those labels. And as you saw from the visualization on the previous slide, very quickly we can get to a pretty reasonable approximate accuracy, such that as we continue going through this loop of labeling and training and estimating and then correcting, the amount of time that it takes to generate a new training sample to label a new image decreases drastically.
Because now, as we increase the number of labeled frames, all we have to do is fix fewer and fewer mistakes. So this means I basically can get to a full data set of maybe 1,000 images or more in effectively an afternoon of sitting. So something that's a very tractable timescale for experimentalists, for lab scientists.
Further, we found that, even with very few labels, on the order of hundreds, you get to the plateau of how accurate you'll be. So put together, this means that this whole system could start completely from scratch and get you to a fully labeled data set that gets you accuracy that you can actually do work with within a very short amount of time, with very little labor required.
In fact, when we finished creating our data set and trained it with our full set of training images, we get incredibly accurate predictions. So as we can see here, we have localization errors on the order of microns relative to the size of a fruit fly, which in image space is only a couple of pixels, even in the 90th or higher percentiles.
So what does that allow us to do? Well, for behaviors like locomotion, where previously you could have used classical methods to perhaps track the centroid, now we can use this to actually quantify the gait itself, so the gait stride and how each leg is employed separately. And later, I'll show you how we used this to dig into how locomotion works at a higher-- at a finer resolution than we could before.
But this also allows us to capture behaviors like grooming, which previously we could not measure very easily, because now we can track the positions of these limbs even when they're closely interacting with other body parts. Great. So here you can see the resulting signal. And this is the kind of data that we want to be able to quantify, the more interesting behaviors. You can see that we can capture the cyclical structure, for instance, of this grooming bout.
So we designed this for fruit flies because I work in a fruit fly lab. But what we found quickly was that this actually generalizes very easily to a number of different animals. And if this video playback is choppy, you can check out the presentation link from the GitHub or [INAUDIBLE] shortener to check out the full resolution videos. But as you can see, we and other people who have used our software have applied it to a variety of animals, including giraffes, ants, moths, and, of course, mice.
As well as a more eclectic set of animal body morphologies like hydra, or here, this was part of our original LEAP tutorial, depth-camera-imaged cats chasing a laser pointer, or, just downloaded from Twitter-- from social media, a cat chasing a ball. So this is all to illustrate that it can handle really a wide diversity of image conditions. It's really not limited to particularly the lab setting or some more controlled image set.
So great. We did that, and we could track single animals. But there are several other challenges that LEAP didn't quite handle, and in particular, dealing with multiple animals. So that led us to design the successor of LEAP called SLEAP. Oh, sorry. Maybe I should stop here for questions before I go on.
JANELLE: Yeah, it looks like there's a couple of questions in the chat. So we can either-- you can read them off, or--
TALMO PEREIRA: Great. I can do that. So how many FPS on what GPU? So when we did our original benchmarking, we were able to get 185 frames per second on a GTX 1080 Ti. However, and you'll see this especially in SLEAP, we're able to achieve incredibly higher frame rates by reducing the resolution of the input or the output. There are some tricks that have to do with [INAUDIBLE] with accuracy.
But effectively, if you can deal with coarse enough accuracy, you can go up to like thousands of frames per second, completely enabling real-time prediction. So the requirements in terms of image, video, and spatial temporal resolution, that is going to be a complete continuous parameter. So the rule of thumb is, if you can see it and reliably click on the body parts, then you should have sufficient spatial resolution to be able to reliably locate those body parts.
If the body parts are very blurry, this is probably a result of having a low temporal resolution. So that's when you might want to increase the frame rate of your camera. And you're going to notice this, again, while you're doing your labeling. But there's no hard limits. So do you have to fix the labels on frame from [INAUDIBLE] or once when it's fixed the rest follow? So you'll fix the labels on a subset of the frames, and you can at any point stop fixing the labels, retrain, and re-predict.
Have you tried this with insects like caterpillars, which appear to lengthen or shorten as they crawl and are able to raise their bodies or climb vertically? Yeah. So I'll bring back this slide here. So not quite caterpillars, but we have done this with worms that have similar image feature appearance. And things lengthening and shortening, as you can see with this hydra for example, aren't a terrible issue.
And part of dealing with that is in having architectures that can integrate information across scales. In terms of going in and out of the perspective of the frame, that is, climb vertically, for example-- so you can see in this example here where the cat is coming from further from the camera towards the camera. And it's one situation in which you can handle that.
But here in the mouse example, perhaps you can see that a little bit more [INAUDIBLE] in which you can see what happens when the mouse rears even away from the camera. So what happens is you can mark key points as being missing altogether, in which case the network will learn to predict that points are occluded. Or you can force it to predict where you think the body parts are. It's all about how you choose to label.
So all right, LEAP was pretty cool, estimates animal poses-- or rather animal pose-- but we're in a fruit fly courtship lab, and part of our goal was to actually develop these algorithms so that we could track the movements of multiple animals while they're socially interacting. And that introduces a set of different issues that pertain to the problem of having multiple animals.
So Rebecca just asked in chat about dealing with blobby key points. There are certainly a lot of discussion having to do with that. It's a big issue. You can see that with mice, for example, we typically have that problem because it's hard to see key points, particularly from above. But let's discuss that afterwards. There's lots that you can do in that domain.
So then we developed SLEAP with several goals, first to have a more sophisticated GUI, have it be all in Python, and have it be easy to install. If you've worked with any of the deep learning based frameworks, you might be familiar with the headache that is installing CUDA and having GPU support. So it's been a while-- a long time in development, and we now have it publicly available at sleap.ai. And this is going to be the software we're going to be working with, and we're going to go through the tutorial in just a little bit.
So in particular, SLEAP was developed to deal with two main problems. One is, how do we deal with the problem of having multiple instances? And by instance, I mean multiple copies, if you will, of each animal. And the second is, how do I adapt to different data sets? So although I showed you a variety of different data sets in which we're able to apply LEAP, it still doesn't work as well as we'd like it to across the board for different data sets with different kinds of image properties. And we'll look at those in a moment.
So first, the problem of dealing with multiple instances. So this is actually a very principled-- a set of very principled approaches that come from classical computer vision, in which they break down into either a top-down approach in which, basically, you begin by predicting something about a high-level description of your system, in this case our animals, to in some sense simulating what would happen if we isolated the instances, that is, if they were centered and cropped, to then analyzing those simulated images.
And what that's going to look like is that we're going to crop the images of our animals. In our alternative approach, we have this bottom-up technique. And just [INAUDIBLE] have a little bottom-up problem. Have a little animal problem right here that's seeking to deal with the bottom-up approach. [INAUDIBLE] chewing through my computer cables.
But in classical computer vision, this technique involves first analyzing features in the image. So in our case, that's going to entail [INAUDIBLE] the body parts of the animal and then parsing them using [INAUDIBLE] information into a full prediction of the pose of the animal. So going from low to high level. And so how do we implement these two approaches in SLEAP?
So first, the top-down. We start by isolating the instances, and we do this by finding an anchor point on the animals, so usually something like the center of mass, a centroid, or a very easily visible body part. We are going to then crop centered bounding boxes around our animals such that we're simulating what would have happened if the camera were following the animal. And we're always centered around that particular body part.
And then we're going to basically perform single instance part detection. So just like we did with the original LEAP, we're going to do it such that we're going to predict the location of the body parts using these confidence maps, but only for this centered instance. And so the key thing here is that the network learns to deal with multiple animals by leveraging the fact that the focal animal is the one that's in the center of the crop.
And in that way, it can learn to ignore detections of the same body parts of animals that are in the surrounding. And great. This works remarkably well. This is something that could be easily adapted for a variety of approaches, including the other existing single animal pose trackers. So what about the bottom-up approach? So this actually deals with this completely differently.
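The anchored cropping step of the top-down approach can be sketched in a few lines of Python (a toy illustration, not SLEAP's implementation; the image is a plain 2D list here, and zero-padding at borders is one of several reasonable choices):

```python
def centered_crop(image, anchor_x, anchor_y, size):
    """Crop a size x size box centered on the anchor, zero-padding at borders,
    so the focal animal always sits at the crop center."""
    h, w = len(image), len(image[0])
    half = size // 2
    crop = [[0] * size for _ in range(size)]
    for dy in range(size):
        for dx in range(size):
            y, x = anchor_y - half + dy, anchor_x - half + dx
            if 0 <= y < h and 0 <= x < w:
                crop[dy][dx] = image[y][x]
    return crop

# A bright pixel at the anchor lands at the center of the crop.
img = [[0] * 5 for _ in range(5)]
img[1][3] = 9
crop = centered_crop(img, anchor_x=3, anchor_y=1, size=3)
assert crop[1][1] == 9
```

Because every training crop is centered the same way, the second-stage network can exploit that invariant to ignore body parts belonging to neighboring animals.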
In the bottom-up approach, we first start by finding-- by a [INAUDIBLE] they can predict these confidence maps for all of the instances. So we're going to take in-- rather than crop, we're going to take in the entire image. And within this image, we're going to have a network that can predict the location of every body part of every animal. And we can convert these confidence maps back into coordinates by doing local rather than global peak detection.
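The switch from global to local peak detection is the key mechanical difference in the bottom-up approach. As a hedged sketch (pure Python, not SLEAP's code; the threshold value is an arbitrary assumption), each strict local maximum above a threshold yields one candidate detection per instance:

```python
import math

def gaussian_map(h, w, centers, sigma=1.0):
    """Max of Gaussians, one per instance's ground-truth part location."""
    return [[max(math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
                 for cx, cy in centers)
             for x in range(w)] for y in range(h)]

def local_peaks(cmap, threshold=0.3):
    """All strict local maxima above threshold: one peak per instance,
    instead of the single global argmax used for one animal."""
    h, w = len(cmap), len(cmap[0])
    peaks = []
    for y in range(h):
        for x in range(w):
            v = cmap[y][x]
            if v < threshold:
                continue
            neighbors = [cmap[ny][nx]
                         for ny in range(max(0, y - 1), min(h, y + 2))
                         for nx in range(max(0, x - 1), min(w, x + 2))
                         if (ny, nx) != (y, x)]
            if all(v > n for n in neighbors):
                peaks.append((x, y))
    return peaks

# Two flies' thoraxes in one map yield two peaks, not one.
cmap = gaussian_map(16, 24, centers=[(4, 4), (20, 12)])
assert sorted(local_peaks(cmap)) == [(4, 4), (20, 12)]
```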
And great. So you say, well, maybe now I have all the body parts for all the animals, and maybe I'm done. But you have to deal subsequently with the grouping problem. The grouping problem entails resolving situations like this, where say that you have two copies, two instances of the same body part, let's say the thorax of a fruit fly, and the tip of the right mid leg of a fruit fly. And you have these two.
You have two different hypotheses for how they might be grouped together, one in which this body-- this thorax is grouped with this leg, and this thorax is grouped with that leg, versus the actual correct hypothesis of grouping this body part with this-- this thorax with this leg and this thorax with that leg. And there is no information in the images alone that allows us to disambiguate between these two hypotheses.
So what we're going to do, then, is we're going to borrow another technique from computer vision called part affinity fields. In this technique, the idea is to represent the connectivity between body parts as a set of unit vectors that points along the direction from the source body part to the destination body part. So this is something that, again, we're going to define from the ground truth data.
And all you need to do in addition to having the locations of the body parts is you need to define a directed graph that spans all of the body parts. So this is defined once in the humans, but obviously you'll have to define it differently for each type of animal. But once you do, you can leverage these part affinity fields, which incidentally are predicted by a neural network that then parse out the body parts into which animals they belong to.
And so if we take a line integral, that is, if we take the average of all of those directed arrows from each pair of candidates, then we can weight these edges here, or the strength of the connection, by how much the arrow in space aligns with the arrows that were predicted in the part affinity fields. And we'll do this for every connection that forms the body skeleton graph.
If we do this for every connection, I'm showing you here all the possible [INAUDIBLE] candidates. It might look a little bit like this. And what you might be able to see if you start squinting is that there are clearly some very weak connections and some very strong ones. So with a greedy matching algorithm, under the assumption that this is a directed tree, we can actually find a globally optimal solution using greedy matching locally, so by resolving all of those edges. And we'll get something that looks like this.
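The scoring-and-matching idea described here can be illustrated with a deliberately tiny example (a sketch, not the real part affinity field parser, which samples a dense vector field along each candidate line; here the predicted field direction is collapsed to a single unit vector per leg tip, and all coordinates are made up):

```python
import math

def unit(vx, vy):
    n = math.hypot(vx, vy)
    return (vx / n, vy / n)

def paf_score(src, dst, field_dir):
    """Alignment between the candidate connection src->dst and the
    direction the network predicted along that limb (1 = perfect)."""
    ux, uy = unit(dst[0] - src[0], dst[1] - src[1])
    return ux * field_dir[0] + uy * field_dir[1]

# Two flies: thoraxes and right-mid-leg tips.
thoraxes = [(0.0, 0.0), (10.0, 0.0)]
legs = [(2.0, 2.0), (12.0, 2.0)]
# Pretend the network predicted the true limb direction near each leg tip.
field = {0: unit(2, 2), 1: unit(2, 2)}

# Greedy matching: repeatedly take the best-scoring remaining pair.
pairs = sorted(((paf_score(t, l, field[j]), i, j)
                for i, t in enumerate(thoraxes)
                for j, l in enumerate(legs)), reverse=True)
assigned, used_t, used_l = {}, set(), set()
for score, i, j in pairs:
    if i not in used_t and j not in used_l:
        assigned[i] = j
        used_t.add(i)
        used_l.add(j)

assert assigned == {0: 0, 1: 1}
```

The wrong-pairing hypotheses get low (here even negative) alignment scores, which is exactly the information the images alone could not provide.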
So the output of our neural networks, just so you can visualize that over time, looks like this for our part affinity fields. So this is akin to our confidence maps that we saw previously. And this is what it looks like after parsing. So you can see that, even when they're closely interacting, we're still able to resolve those body parts. In fact, even when they're completely overlapping. So the network is able to handle this problem of inferring these part affinity fields quite well.
And so summarizing this and putting this in the context of how we do it with the single instance [INAUDIBLE] estimation, in the single case, we have an image that goes through a neural network to generate our confidence maps that have a single peak, in which we can do global peak detection for each of these different colors which belong to different image channels.
Then, in the top-down approach, as we mentioned, we start with a full frame. We find-- we do [INAUDIBLE] as we find some sort of anchor from which we can crop the instances. And then we feed those through a second neural network that predicts the locations for the individual instances. In the bottom-up approach, we have a single neural network as opposed to two that takes a full frame and simultaneously predicts both the confidence maps and the part affinity fields, which allows us to then do our grouping through our graph parsing algorithm.
Great. So that's wonderful, and we'll talk a little bit about the pros and cons of each of these methods. But clearly, you can already see that in the top-down approach, we have to deal with two separate branches of the network or two separate networks entirely, whereas in the bottom-up approach, we only have to do this once.
And the second biggest difference being that this, the top-down method, will scale with the number of instances that you have. That is, you're going to have to run each crop through the second network such that for each animal that you have, you're going to run-- you're going to incur a linearly increasing runtime cost. So before I go on to problem two, I'll address the one question that we have in the chat.
How do you deal with occlusions, especially when one animal is entirely bisected by another animal, as [INAUDIBLE] occurs with mice? And two, when a key point is totally occluded and you get missing data, how do you typically deal with this in [INAUDIBLE]? So for the first part, we deal with occlusions in a couple of different ways.
And the top-down method actually deals with this intrinsically, because we can choose an anchor part. And as long as the crop is centered on that anchor, the network can implicitly learn to bridge those gaps. But this will be a problem for techniques like the part affinity fields, in which it depends on how you have defined the skeleton tree. But we can talk about that a little bit more later.
And when a key point is totally occluded, you get missing data, and this is something that you can actually just [INAUDIBLE] as part of the labeling procedure such that the network can actually tell you that it does not think the body part is present. We deal with it with a variety of different ways, and it depends what kind of analysis that you're doing.
But a simple way to deal with it is to impute the missing locations through a simple interpolation or through, for example, state-space filtering models, like a [INAUDIBLE] filter, things that can predict, given context, where those body parts are. For the most part, we try-- we attempt to be largely agnostic to this as far as SLEAP goes and leave that decision down to the user for whatever is more appropriate for their downstream task. And we can totally talk about that also later in the tutorial.
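The simple interpolation option mentioned here might look like this for a single coordinate of one keypoint across frames (a minimal sketch; real tracks are 2D per keypoint, and state-space filters would replace the linear fill):

```python
def interpolate_track(xs):
    """Fill occluded (None) positions in a 1-D keypoint track by linear
    interpolation between the nearest observed frames."""
    xs = list(xs)
    known = [i for i, v in enumerate(xs) if v is not None]
    for i, v in enumerate(xs):
        if v is not None:
            continue
        prev = max((k for k in known if k < i), default=None)
        nxt = min((k for k in known if k > i), default=None)
        if prev is None or nxt is None:
            continue  # can't interpolate past the edges; leave missing
        t = (i - prev) / (nxt - prev)
        xs[i] = xs[prev] + t * (xs[nxt] - xs[prev])
    return xs

assert interpolate_track([1.0, None, None, 4.0]) == [1.0, 2.0, 3.0, 4.0]
```

Whether this is appropriate depends on the downstream analysis, which is why the framework leaves the choice to the user.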
So getting back into the problems we [INAUDIBLE] in SLEAP, so the second problem was adapting to different data sets. And so first things first. We found that even with multi-instance data sets, we are still able to retain our sample efficiency. So what that means is that here what we have plotted is, on the y-axis, accuracy.
This is an accuracy metric. We can talk about it if you guys are curious, but it's just an overall accuracy metric, more suited for multi-instance prediction. And what we find is that, as we increase the number of labeled frames, we're able to still get to a plateau of accuracy fairly quickly.
So we went ahead and started applying this to different data sets. And what we found was that, depending on what animal that you're trying to track in what particular setting, you're going to deal with very different image type, image features, and properties [INAUDIBLE] distributions of the animals themselves that are going to lead to a use of the two different kinds of approaches, as well as adapting your neural networks to better deal with different cases.
So we can have situations where we have very large instances like these mice that occupy a large part of the image but have a small field of view and yet have coarse features. So this goes back to the blobby type problem that Rebecca mentioned earlier, so where we might have less certainty about where the body parts might actually be located.
Versus a data set in which we have very small instances but a large field of view, so a lot of background pixels to deal with, and very fine features. So in this case, we want to have as much resolution as possible, but we want to be able to ignore the background as much as possible as well. In contrast-- and then, the hardest case being where we have very large instances and very large fields of view, and yet still very fine features, like these antennae, these beaks.
So how are we going to go about dealing with all these different scenarios? Well, we started to look at, specifically, the neural networks that were dealing with the different problems, with the task of predicting the different representation that we talked about, like confidence maps and part affinity fields. And we abstracted that away into this general template of encoder-decoder convolutional networks.
And so in this general architecture design, we start with the input image followed by a series of encoding steps. So these extract features across larger and larger scales to then up-sample it back to produce our output through the decoder. And specifically, the blocks that are contained in here are convolutional blocks. So as we increase the number of filters or parameters in these, we increase the representational capacity of our network. That's one easy knob for us to turn.
But the number of down-sampling blocks is one of the key things that we're going to be playing with. And so these are able to increase the receptive field size, or how large an area of the image the network is able to reason about. The up-sampling blocks are going to be able to recover spatial resolution so you can get finer features back at the end. And skip connections allow you to fuse features across scales.
And so as a point of reference, here are three common architectures that you may be familiar with. The LEAP architecture effectively looked a little bit like this, where we have no skip connections but a symmetric encoder and decoder. DeepLab, if you're familiar with that, and I don't know who isn't at this point, uses a ResNet, MobileNet, or another pre-trained network as the encoder to extract features and generate a small-resolution but deep feature bank, which is then up-sampled back to the output.
And U-Net, a very popular architecture in image segmentation and biomedical applications, uses a set of repeated pooling and up-sampling steps with skip connections. So this is effectively just like the LEAP architecture but with skip connections, which enable the recovery of multi-scale features. And as it turns out, this is something that's going to be important for generalizing across different image properties.
So looking back at our diagram here of the different blocks. The reason to abstract away the notion of neural network architectures into these different high-level blocks is that we then have these high-level knobs [INAUDIBLE] And specifically, we're going to be experimenting with this knob, the down-sampling block, because we know that's going to increase the receptive field size and therefore enable us to better deal with the large [INAUDIBLE] for example.
So there's just a simple formula that describes how sets of repeated pooling and convolution layers control the receptive field size. And as a visualization, you can see here that for convolutional kernels in an image, once you pool, a single pixel after pooling has been computed from a larger area in the original layers before the pooling.
So then, if we increase the number of down-sampling blocks, we increase the number of pixels that we can perform computation on or extract features from as we go down [INAUDIBLE] So what do we do? We vary the receptive field size and train networks at each of these different levels.
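That formula can be written down in a few lines. This is the standard receptive-field recursion (not code from SLEAP), assuming each layer is described by a kernel size and stride:

```python
def receptive_field(layers):
    """Receptive field of the last layer, given (kernel_size, stride) pairs."""
    rf, jump = 1, 1                  # jump = cumulative stride so far
    for k, s in layers:
        rf += (k - 1) * jump         # each layer widens the RF by (k-1)*jump
        jump *= s
    return rf

# one "block" = two 3x3 convolutions followed by 2x2 stride-2 pooling
block = [(3, 1), (3, 1), (2, 2)]
for n in range(1, 6):
    print(n, receptive_field(block * n))
```

The receptive field roughly doubles with each additional down-sampling block, which is why this one knob has such a large effect.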
And what we find is that for the fruit fly [INAUDIBLE] in which we have tiny instances but a large field of view and fine features, we do way better using the top-down approach than the bottom-up across the board, but we also hit a plateau in accuracy as we reach a particular receptive field size. So what does this mean?
It means that rather than choosing the largest possible network that can integrate over a very large region of the image, especially when we're doing the top-down approach in which we first do the cropping, we can parameterize a neural network that is very specifically tuned to the size of features in your particular data set.
And so as a point of reference, a U-Net that has a receptive field size of 360 pixels and achieves plateau-level, effectively perfect, accuracy has orders of magnitude fewer floating point operations than a general-purpose ResNet-50 like that used in one of the [INAUDIBLE] And the reason for this is that rather than having a network that is a generalist, one that can handle features across many scales in many different situations, you instead create a network that is specifically designed to deal with the features of your specific data. It's about creating specialist networks as opposed to generalist ones.
If we look at our accuracy spatially, you can see that we have virtually perfect accuracy, comparable to what we had in the original LEAP, in fact even higher, now that the outer circles here correspond to the 95th percentile, so we can reduce the number of outliers considerably. And effectively, the vast majority of body part predictions are within just two pixels of their ground truth location.
So what about the other animals, like, say, our blobby mice? Here, we find a bit of a different story: as we increase the receptive field size, we still see this improvement in accuracy, but we actually do better using the bottom-up approach rather than top-down. And one of the reasons for this, we think, has to do with the coarser features and therefore the need to reason about spatial geometry at a global scale.
So if you recall, in the bottom-up approach, the network sees the entire image at once, rather than just a centered crop as in top-down. When the animals are very large and encompass a large area of the crop, you're probably going to have most of the other animals in that crop regardless. And so the network has to learn to ignore a lot of the background, as well as other animals that may occupy the same region of the crop as the focal animal.
And we see this generalize again to our bee data set, in which we have these very large instances that occupy a very large fraction of the frame. And putting these side by side, you can see this trend once again, of both the differences between the different approaches to multi-instance pose estimation, and the plateau in accuracy as a function of this architectural parameter, the receptive field size.
So this is all to say we have specific network architectures that do better for specific data sets. And the way that we can select those is guided entirely by the geometry and features of your particular data. So we can look at a couple of examples, and then we can go to questions if anyone has any. So here you can see-- these might be a little choppy over Zoom, but you can check out the video link below.
You can see on the left there a pair of interacting mice. They're interacting very closely, in fact in contact almost all the time, as well as the fruit flies, where some of them are even completely overlapping. We're able to keep track of all the body parts, even the ones that are slightly occluded, and their relative identities, without mixing up body parts very frequently.
The same goes for the bee data, in which we're tracking many body parts in a very small region of the field of view. And cool. So I'll mention again that one particular observation about these different neural network architectures is that they generalize both to the single-animal case and the multi-animal case. The takeaway is that more carefully designed neural network architectures can perform better, with fewer operations, when they're tuned to the data set that they're being applied to.
Sage asks: is there any upper limit to how many animals can be tracked? There is not. The only difference is that in the top-down approach, the processing time is going to scale linearly with the number of animals. That is to say, the four-animal case will take approximately twice as long as the two-animal case, not exactly the same, but at least for the second stage. But there is no hard limit other than memory constraints on your GPU. And honestly, it will just run slower, if anything.
Jake asked about the data augmentation used to train the networks [INAUDIBLE] and whether augmentation improves results. It absolutely does. And so all these networks are trained from scratch without any pre-training, except for the ResNets, which we trained both with and without pre-trained weights. And one thing that was crucial, especially when we're in the [INAUDIBLE] regime where we have very few examples, is to perform data augmentation. So this includes things like rotations, scaling, random uniform and Gaussian noise, as well as contrast adjustments.
All of these have a huge impact in forcing the network to generalize, even within the same condition, even if it's kept relatively constant. So, for example, even when the scale is fixed, in other words the animals are always at the same distance from the camera, doing scale augmentation improves how much networks can generalize, because they learn to have more wiggle room in the image features that they capture.
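To make that concrete, here's a minimal sketch of such an augmentation pass using NumPy and SciPy. The parameter ranges are illustrative, not SLEAP's defaults, and in a real pipeline the labeled keypoints would have to be transformed identically to the image:

```python
import numpy as np
from scipy import ndimage

rng = np.random.default_rng(0)

def augment(img):
    """One random augmentation pass: rotation, noise, and contrast jitter.

    Simplified sketch only; SLEAP's real augmentation pipeline (and the
    matching transformation of the labeled keypoints) differs in detail.
    Assumes a grayscale image with values in [0, 1].
    """
    angle = rng.uniform(-180, 180)
    out = ndimage.rotate(img, angle, reshape=False, mode="nearest")
    out = out + rng.normal(0, 0.01, out.shape)        # additive Gaussian noise
    out = out + rng.uniform(-0.02, 0.02, out.shape)   # additive uniform noise
    out = (out - 0.5) * rng.uniform(0.8, 1.2) + 0.5   # contrast jitter
    return np.clip(out, 0, 1)

img = rng.random((64, 64))   # stand-in for a grayscale video frame
aug = augment(img)
```

Each call produces a slightly different training example from the same labeled frame, which is what stretches a small labeled set further.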
Remy asks: in the top-down approach, how do you deal with the translation invariance of CNNs? Do you encode position with respect to the center in some way? So in the top-down approach, we do implicitly encode this position by doing this anchored crop. We can also do a bounding box crop. But either way, the relative positions of the body parts of the focal animal, the centered animal, are going to be with respect to the center coordinate of our crop.
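The anchored crop itself is simple to sketch. This hypothetical helper (not SLEAP's implementation) crops a fixed-size window around an anchor point, zero-padding at the borders, so every body part ends up positioned relative to the crop center:

```python
import numpy as np

def anchored_crop(img, anchor_xy, size):
    """Crop a size x size window centered on an anchor point (e.g. a
    predicted centroid), padding with zeros at the borders.

    Illustrative sketch only; assumes a single-channel image and an even
    crop size.
    """
    x, y = int(round(anchor_xy[0])), int(round(anchor_xy[1]))
    half = size // 2
    padded = np.pad(img, half)            # zero-pad so edge crops stay valid
    # after padding, pixel (y, x) moved to (y + half, x + half),
    # so the centered window is simply padded[y:y+size, x:x+size]
    return padded[y:y + size, x:x + size]
```

Because the focal animal always lands at the center of the crop, the second-stage network only ever has to predict offsets relative to that fixed anchor.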
There's no explicit encoding like, say, you would have in a 2D transformer or anything like that. The translation invariance is true up until a particular level of pooling. Once we do enough pooling, we have a receptive field that is integrating over a large enough portion of the image that the kernel of a convolution deep in the network is actually not as translation invariant, technically equivariant, as the initial layers. That is to say, its kernel is looking at all corners of the image simultaneously.
And finally, would it also be able to differentiate between overlap and [INAUDIBLE] Yes. And again, this is all going to be a continuously varying performance situation, where you're going to run into different structural sources of noise pertaining to which body parts are present or not. We found that overlapping animals are a little more easily dealt with in the bottom-up approach than in the top-down.
And you can think of this as relating to the fact that, in top-down, you need to have a particular anchor point or centroid that's reliably visible in order to have this sort of prior on the position of the animal within the crop. But your mileage may vary, and the best thing to do is just to try it out. Cool. I'm going to move on in the interest of time and quickly tell you a couple of stories about how we've been applying this sort of tracking to do unsupervised behavior recognition, and then we can move on to the tutorial.
So as I mentioned, our main way to quantify behavior in the form of movement is through these postural trajectories. In particular, if we look at a set of stereotyped behaviors, we can see that the dynamics, that is, the particular trajectories of the movement of body parts, have a particular pattern. You see these oscillations occur at regular frequencies, or, in the case of [INAUDIBLE] a very characteristic shape that the movement of the body parts elicits.
And so the goal, then, is to be able to recognize behaviors without giving them labels ahead of time, by keying in on the statistics of the motion of the body parts. The way we do this is by using an approach developed in one of the groups that I'm in here at Princeton, called MotionMapper. It works roughly like this. Say that we have several time series, and these could be the motion of multiple body parts, in which we have characteristically different dynamics along the different segments of this clip.
What we're going to do first is extract spectral features. These are multi-time scale features; in other words, we're converting each of these time series into spectrograms. So this captures the power of these oscillations across different frequencies or time scales. We're then going to concatenate these spectrograms and slice through each time step, so that we now have a feature vector that contains the power at different time scales for each of the different time series.
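A stripped-down version of that feature extraction might look like this. Note that MotionMapper actually uses a wavelet transform; this sketch substitutes windowed Fourier power spectra for simplicity:

```python
import numpy as np

def multiscale_features(series, win=64, hop=8):
    """Slide a window over each body-part time series, take its power
    spectrum, and concatenate across parts into one feature vector per
    time step. A simplified stand-in for MotionMapper's wavelet features."""
    n_parts, T = series.shape
    taper = np.hanning(win)
    feats = []
    for t in range(0, T - win, hop):
        window = series[:, t:t + win] * taper           # (n_parts, win)
        power = np.abs(np.fft.rfft(window, axis=1)) ** 2
        feats.append(power.reshape(-1))                 # concatenate parts
    return np.array(feats)   # (n_steps, n_parts * (win // 2 + 1))

# two synthetic "body parts" oscillating at different frequencies
t = np.arange(512)
series = np.stack([np.sin(2 * np.pi * t / 8), np.cos(2 * np.pi * t / 16)])
feats = multiscale_features(series)
```

Each row of `feats` is one time step's multi-time-scale description, which is exactly what gets embedded into the low-dimensional manifold next.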
And we're going to embed each of these time steps, each of these multi-time scale representations of our behavioral dynamics, into a low-dimensional manifold. What this allows us to do is capture the relationships, or the structure, across the different body parts in a low-dimensional space.
And in that space, because we've pushed these points together, we're able to do something like density-based segmentation or clustering. If we cluster in this space, we're going to get clusters of points with stereotyped, or very self-similar, dynamics. And the idea is that self-similar dynamics should correspond to self-similar movements and therefore to self-similar behaviors.
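The segmentation step can be sketched in the same spirit. This toy version, a crude stand-in for MotionMapper's kernel density estimate plus watershed segmentation, histograms the embedded points, smooths the histogram, finds density peaks, and assigns each point to its nearest peak:

```python
import numpy as np

def density_clusters(points, bins=32, smooth=2, thresh=0.2):
    """Toy density-based segmentation of a 2D embedding. Illustrative only."""
    H, xe, ye = np.histogram2d(points[:, 0], points[:, 1], bins=bins)
    k = np.ones(2 * smooth + 1) / (2 * smooth + 1)      # separable box filter
    H = np.apply_along_axis(np.convolve, 0, H, k, mode="same")
    H = np.apply_along_axis(np.convolve, 1, H, k, mode="same")
    # local maxima above a fraction of the global density peak
    P = np.pad(H, 1)
    neigh = np.stack([P[1 + di:bins + 1 + di, 1 + dj:bins + 1 + dj]
                      for di in (-1, 0, 1) for dj in (-1, 0, 1)
                      if (di, dj) != (0, 0)])
    peaks = np.argwhere((H >= neigh.max(axis=0)) & (H > thresh * H.max()))
    cx = (xe[:-1] + xe[1:]) / 2                         # bin centers
    cy = (ye[:-1] + ye[1:]) / 2
    centers = np.stack([cx[peaks[:, 0]], cy[peaks[:, 1]]], axis=1)
    # assign each embedded point to its nearest density peak
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers
```

In the real pipeline the density is estimated with a proper kernel and segmented with a watershed transform, but the intuition is the same: dense regions of the embedding become candidate behaviors.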
So what does that look like? In our single-fly data set, an embedding will look a little bit like this. And great, we have a bunch of peaks and clusters that are outlined, but no labels or any other way to infer what a behavior is from this directly. So the way we're going to be able to tell what's going on here is by assigning labels to each of these previously identified clusters.
And we could use a variety of different approaches, for instance just looking at examples from each cluster a posteriori and giving names to the behaviors. But I want to show you how pose tracking allows us to do this in a way that's a little more principled and gives us a little more information. So let's consider our map and look specifically at just one of these clusters right here.
And so we're looking at this part of our behavioral manifold. And what we can do is take all the points that fell into this region of the space and look at the average feature vector. That is, for every body part, what is the power at all the frequencies that we've measured? And what we can see is that for this particular cluster, we have non-zero power at the wings and particularly the hind legs, across a wide range of frequencies.
So this could already lead us to believe, if we have some prior knowledge about how fruit flies move, that this is indeed going to be posterior grooming. And we can look at examples. So if we look at the adjacent cluster here, we find that it actually corresponds to hind grooming. You can see the tracking overlaid on an example [INAUDIBLE] sampled from this state, and these are the relevant rows of that feature vector.
What you can see is that for the right hind legs, we have high power, and for the left ones, we do not. Similarly, on the other side of this super-cluster of behaviors, we have left hind grooming, where the fly is moving its left leg primarily. And the average feature vector reflects that exactly. And our middle cluster here corresponds to bilateral hind grooming, which, again, you can see involves coordination of both legs.
And they all share approximately the same set of frequencies at which those legs are moving, with lower power at the individual hind leg segments. So cool. This allows us, without telling our algorithm anything about what a fly is or what grooming behaviors are, to identify these different kinds of behaviors, including ones that are mirror-symmetric, purely from the statistics of how pose varies over time.
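Computing those per-cluster average feature vectors is a one-liner once you have the cluster labels; a minimal sketch:

```python
import numpy as np

def cluster_signatures(feats, labels):
    """Average multi-scale feature vector for each cluster. Rows with high
    power indicate which body parts (and frequencies) define that behavior."""
    return {c: feats[labels == c].mean(axis=0) for c in np.unique(labels)}

# toy example: two clusters with power in different feature dimensions
feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
labels = np.array([0, 0, 1])
sigs = cluster_signatures(feats, labels)
```

Reading off which entries of each signature are large is exactly the "left hind leg vs. right hind leg" comparison described above.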
But cool. So that's one behavior, grooming, that we couldn't capture before pose tracking. One that we could capture before is locomotion. Here's what the feature vector looks like for locomotion. At first glance, it's not terribly informative, because what we see is just power across all of the legs for some high frequency [INAUDIBLE] And great, we could've probably gotten something akin to this, telling us that a fly is locomoting, just by computing the centroid velocity.
But if we actually break down this super-cluster along this little peninsula, what we find is that as we go from the top region of this manifold down to the bottom end of the peninsula, we go from coarser, more diffuse leg oscillations, to more sharply defined and higher-peaked frequencies, to even higher-peaked and more coordinated locomotion.
And so what might not be obvious from the get-go is that if we look at the peak frequencies, this is the main characteristic that changes across these clusters. And if we characterize each of these clusters by their [INAUDIBLE] distribution, we find that they are indeed characterized by these different leg oscillation beats.
So if we plot this against velocity, we find that there is a total overlap in the distributions of their forward velocities that is not present in the peakiness of their frequency [INAUDIBLE]. And so this is something that we could not distinguish before, when we just looked at the forward centroid velocity. Now we have a very specific description of how that forward velocity is produced through how those legs are moving.
So great. We discovered some behaviors without any supervision, directly from the postural dynamics. And so I'll move on to future directions and the tutorial. But before that, I'll just take a second for questions, and then we can quickly move on to doing this ourselves. So [INAUDIBLE] asks, how do [INAUDIBLE] compare to other unsupervised clustering methods like k-means or random forests? Can you take [INAUDIBLE] transition probability into account in order to [INAUDIBLE] behaviors that never go together?
Absolutely. So density-based clustering is the first-pass, easiest thing that you could do. You don't need to use [INAUDIBLE] or watershed. There's a good paper by [INAUDIBLE] comparing across these different methods, including using things like Gaussian mixture models. But there's a whole litany of different approaches to doing unsupervised behavioral clustering, including ones that particularly look at transition structure rather than being more agnostic to how the behaviors are grouped together.
And I'm happy to talk more about that. We're currently writing a review paper that goes through the general classes of approaches. But feel free to chat with me offline about these techniques. And cool. I'm just going to quickly move on to the future directions so that we can get started with the tutorial. So moving forward, one of the things that we want to do is to further ameliorate the problems having to do with manual labeling of poses, particularly as it becomes more laborious with multiple animals.
So there are advances in computer vision that enable unpaired pose estimation, where you just have a set of poses, perhaps derived from a different data set, together with images from your data set, and you train a network that learns to disentangle the two, such that you can produce pose estimates without doing any labeling at all, provided that you have some distribution of realistic poses for your particular kind of animal.
Or using things like a prior on the skeleton of your animal, together with fully unsupervised learning, to force a network to generate confidence maps from which the original image can be reconstructed. This forces the network to learn to compress information about anatomy without actually having to give it any labels.
And finally, we're working on applying this stuff back to our original model of a simplified brain-behavior generator, through techniques for generative modeling, where we use neural networks that output pose directly from sensory inputs. And this is a framework that can subsume everything from supervised to unsupervised techniques.
And some work has already demonstrated how we can use tracking to infer the sensory inputs of the animal, like projecting the scene onto its view, as well as using neural networks to output a pose directly and then looking at the neural network latents, that is, the computations the network uses, as a way to characterize those behaviors. So I just want to do quick acknowledgments. Thanks, everyone. Thanks to all the people at Princeton in particular who have helped and worked with me on this project, funding, collaborators, everybody at MIT. Thanks to [INAUDIBLE] in particular for inviting me. And thanks, you guys, for listening.
And if we're ready, I think we can move on to the tutorial. Yeah. So there it is. And we're going to go through a couple of different steps here. As I mentioned, if you go to the GitHub and just scroll down, you'll find the tutorial section. You won't need anything for this part of the tutorial. All we're going to do is run some stuff in Colab. And I already provided some of the data that we're going to use as well.
And so for the training part of the tutorial, we're actually going to go through and do what I demonstrated in the middle part of the talk, which is to train a SLEAP model for top-down, multi-instance prediction. So if we open up our first Colab here. [INAUDIBLE] connect. Great. If you click on that link, you'll see our notebook looks a little bit like this.
You've got to make sure that you're connected to Colab. If you hover your mouse over here, you want to make sure that it says GPU as the Colab back end. And if not, you can go to Runtime, change runtime type, and make sure that the hardware accelerator is set to GPU. Great. We'll do that. And all that you need to do in order to run SLEAP on Colab is to run pip install sleap.
So we'll go ahead and say that we're fearless adventurers, and we're going to go ahead and click and install SLEAP. Should be done relatively quickly. And we'll move ahead as soon as that's done, and then we'll be able to import SLEAP. So SLEAP will include everything you need, including TensorFlow. And if you're doing this locally and you follow our conda instructions rather than pip, it'll also download the GPU drivers for you. But I highly recommend you try this out in Colab first.
So while it's doing that, we'll start walking through the subsequent cells. Here we just have a few checks of the system. We'll come to that in a moment. And in particular, we're going to download our training data. Here, what I've done is I've exported a couple of our training data sets. This includes the images as well as the labels for our fruit fly data sets.
And this is something that you can do yourself from the GUI, so that later on, if you're trying to use your own data, you can just export a training package, as we call them-- these are single files that contain both the images and the labels-- and get them into Colab by downloading them using curl. Or Colab also lets you, just by clicking here on the side, upload files directly, or mount them through your Google Drive. But for now, we're just going to download them from my Dropbox, because it's easier to do in a single line.
So let's check how our installation's going. Almost there. It uninstalls some things in addition to installing. Cool. Maybe if anybody has any questions, now's a good time for that.
AUDIENCE: So I have a question, Talmo. If we want to run it not from CoLab but from our own computer, can you show us how to do it?
TALMO PEREIRA: Yeah, absolutely. And so at the bottom of our tutorial here, we have a section that says, how do I get some more SLEAP? Something in short supply for us grad students. So you can go to our website at sleap.ai and follow the tutorial for installation, creating a project, and labeling, with everything step-by-step for all your OS's.
It includes little animations of how to go about adding videos, defining which points you want tracked, generating your initial labels, training, and predicting again. You can do the whole human [INAUDIBLE] procedure, as well as export your data for analysis afterwards. So our website is pretty complete. There's a tutorial there that'll walk you through it. It's just a little bit more [INAUDIBLE]
So we've done our pip install here. Everything is working. We've run this cell here. It shows that we have our GPU, and SLEAP is able to detect that the GPU is available and accessible. So we'll download the training data. Should be relatively fast, might take a couple of seconds. We'll unzip it in a moment, and then we're going to configure our training job.
And so in this section down here-- there you go, we've downloaded our training files-- we're going to define a SLEAP training job. As we covered in the talk, there are several different approaches and many different neural network architectures that you can use through SLEAP. So we've created this configuration system that allows you to specify every aspect: how data is handled, the neural network architecture, the outputs that you want the network to generate, how you want the training to be done, as well as what metadata you want to generate.
So here is just the configuration. We'll just run this cell. We're starting off from our default job configuration, that's all of these different sections, and we're going to edit just a couple of the parameters so that we can do our centroid model prediction, to then set up our training [INAUDIBLE] At this point, the SLEAP trainer will load up the videos; detect things like what size your images are and what format; create your network with its parameters and layers; hook it up to the correct output; load up your training sets and optimizers; and create visualization hooks.
And because we're in Colab, we can also run TensorBoard directly in the browser. Cool. And you'll note, by the way, that SLEAP already created a models folder here, in which our specific run, the baseline model for centroids, is going to be stored. And you'll note that among these files we also have JSON files that contain a serialized representation of those configurations, so that you can always reproduce it, as well as a copy of the training data.
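To give a feel for what such a serialized training job might contain, here's a hypothetical example. The field names below are invented for illustration and are not SLEAP's actual configuration schema:

```python
import json

# Hypothetical training-job configuration, illustrating the idea of a fully
# serialized run. These field names are NOT SLEAP's actual schema.
job = {
    "data": {"scale": 1.0, "augmentation": {"rotate": True, "noise": True}},
    "model": {"backbone": "unet", "filters": 16, "down_blocks": 4},
    "head": {"type": "centroid", "sigma": 5.0},
    "optimization": {"batch_size": 4, "epochs": 50},
}

# Writing the config to JSON alongside the trained weights is what makes
# every run reproducible from its models folder.
serialized = json.dumps(job, indent=2)
restored = json.loads(serialized)
```

The point is simply that everything needed to rebuild the run, data handling, architecture, output head, and optimizer settings, lives in one round-trippable file.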
So this allows you to fully reproduce a model from every run. Cool. We have our TensorBoard running now, and all we need to do to actually get SLEAP to run is call trainer.train(). This will start generating the training data and feeding it into our optimization. As soon as it finishes preloading the data, you'll start to see how it does everything from loading, formatting, and augmenting the data, to feeding it into the network for training.
So great. It generates the data sets, and we're starting the train loop. We're going to do 200 batches per epoch for 50 epochs at most. So already, we're streaming some stuff into TensorBoard here. And the first thing that we see is that in this Graphs tab, you can visualize the specific architecture that we configured. This is the U-Net architecture. We start from here, and it lists the series of convolutions and pooling.
So just like that diagram I showed you before, we have pooling steps and skip connections that go up and connect the encoder to the decoder. Cool. That's our architecture. But if we click on Scalars, we'll begin to see the losses for every output with respect to the training and validation [INAUDIBLE].
And a little more informative: if we click on the Images tab, you'll be able to see a direct visualization of the network's actual predictions so far. So if we click on this, you see a prediction on the training set and on the validation set. And as you can see, even as of the first epoch, it's already doing a reasonable job of predicting these confidence maps, these big blobs here, where the red points are the predicted locations of the anchor points, the centroids, and the green ones are the ground truth locations.
So even with a single training epoch, we're already pretty close. That took about 20 to 30 seconds, and you can see that every iteration goes by pretty quickly. So we'll leave that guy training for now, and later we'll be able to scroll through these and see how training progresses in real time. After we finish training, we'll have the best model so far in our models folder, and we'll be able to download it locally with this last cell here, which we'll run after it's finished training.
In the meantime, we can already start our second model training in parallel. This is going to be our top-down confidence map model. This guy takes as input those cropped images and predicts the confidence maps for the centered animal. So I'll go back to the tutorial page here for a second and show you that this is what the final predictions will look like for the centroid training.
And after we do our second training, of the top-down model, we'll get predictions that look a little bit like this for our centered animals. We'll look at that in a second while it's starting. After you finish running both of those, you'll have downloaded two different zip files that contain the actual trained models. You can run those locally, or you can upload them to Colab to do inference, or predictions, on new data.
If you want to run either of these notebooks locally, you can just clone the repository and download the data using these commands. So you'll have a folder here for your models as well as for the training data. They're too big for GitHub, but the links are there if you want to reproduce this locally.
So let's check in on our training. Here are our centroids. We're at the 10th epoch already, and we can look at how the training is going on TensorBoard. Refresh here. And it seems like it's already converging to be able to produce these confidence maps very reliably. Great. Both on the training and on the validation [INAUDIBLE]. Cool.
So in parallel, we are also training in a new tab here. We're still installing SLEAP here, and we'll soon be able to start the training on the top-down model. But maybe in the interest of time, we can hop back here in a moment once this guy's finished running. And by the way, these notebooks will take about 30 minutes end to end to install, download the data, train, and download the results. Cool.
And this one's formatted much the same way. The only difference in the second notebook is that we're going to be training our top-down model, so our configuration is going to be slightly different. Now we have an output head of our model that outputs the centered-instance confidence maps rather than the centroids. We'll see what that looks like in just a couple of moments. Looks like we're just downloading the data here. Cool.
JANELLE: While it's downloading, can I ask a quick question?
TALMO PEREIRA: Yeah, absolutely.
JANELLE: So what are the major failure cases that you might run into if you're training this on your own data? So for instance, are there any tricks with initialization that you need to be aware of with these networks?
TALMO PEREIRA: Yeah. Absolutely not. In fact, all these networks are trained from scratch, and we put a lot of work into making it seem as magical as possible. But definitely, as you saw from the presentation, different approaches will perform better on different data sets, and it's not exactly knowable a priori which one is going to be the most optimal for your data. But we do provide these sorts of points of reference.
And so when you're training this through the GUI directly, we have some baseline job configuration profiles that tend to work in the general case but maybe won't work best in your specific case. We'll see what the GUI looks like in just a little bit. But for the most part, all you really need to do is do the labeling, do it for 10 or 20 frames, and just try out training.
I usually like to start off with the top-down approach, because it's a little more flexible since you're breaking up the two different components into two different networks, and see how well you're doing at that point. It'll import the predictions back in, and you'll get a baseline sense of how it's going. You'll also be able to visualize the training in real time, even when you're doing it with the GUI. That will immediately give you a sense of whether there's a major failure mode in how you've configured SLEAP. We have a lot more information on the website as well.
Clay asked, could SLEAP also [INAUDIBLE] for freely moving mice implanted with a miniscope and a wire hooked on their heads? Yeah, absolutely. And we've done that before. I don't have that immediately available here. But if you want to track the implant, for example, you can define it as an additional body part to track, or you can just choose to ignore it altogether. Either way, it'll be able to handle it.
So what people have done in the past, using either LEAP or DeepLabCut, to deal with multiple animals is to treat the separate animals as actually the same instance. That is, you'll create a bunch of body parts that are called, like, forepaw or foreleg of the animal with the implant and the animal without the implant, or the white mouse and the black mouse, something that distinguishes their appearance.
And then the network will hopefully learn from context that, even though individual body parts may appear very similar, attending to [INAUDIBLE] a particular feature that differentiates the animals will allow it to assign the body parts the correct pseudo-name or label. This is not ideal, because if body parts appear similar, what you're going to see is the prediction of that body part flickering between the two different animals, because you're not explicitly representing the notion that they're different instances.
SLEAP handles all of this transparently. We actually treat each animal instance as if you have multiple copies of the same body parts. So if there is a body part or landmark that is unique to a particular animal, that's fine. You can add it to the skeleton and simply mark it as not visible in the other animal, and it'll handle it transparently. Cool.
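To make the instance idea concrete, here is a toy sketch of that data model: each animal carries its own copy of the skeleton, and a node that doesn't exist on one animal is simply marked not visible. (Pure illustration; the `Instance` class and node names here are hypothetical, not SLEAP's actual classes.)

```python
from dataclasses import dataclass

# One shared skeleton; "implant" only physically exists on one animal.
SKELETON = ["head", "thorax", "implant"]

@dataclass
class Instance:
    # One copy of the skeleton per animal: node name -> (x, y), or None if not visible.
    points: dict

mouse_with_implant = Instance(
    points={"head": (10.0, 12.0), "thorax": (15.0, 20.0), "implant": (9.0, 10.0)}
)
mouse_without = Instance(
    points={"head": (40.0, 42.0), "thorax": (45.0, 50.0), "implant": None}
)

def visible_nodes(inst):
    """Nodes that are actually present/visible on this animal."""
    return [n for n in SKELETON if inst.points.get(n) is not None]
```

Because each instance is explicit, a missing node is just `None` on that animal, and there is no flickering of a shared label between animals.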
So because we're almost out of time, let's just have a look at how we're doing here. So here is our centered-instance confidence map network. Let's switch over to the images. As you can see, in the time it took me to say that, we start with relatively rough predictions of where individual body parts are, but over the course of just a few iterations, it very quickly converges on being able to detect all those different features.
And it's still going to get some stuff mixed up. As you can see here, it'll maybe still assign some body parts incorrectly. But it will very quickly learn to resolve this, as you see there. And again, the training will keep going just like it did over here, and eventually you'll be able to [INAUDIBLE] the trained models. So in lieu of actually waiting and downloading those, I've already trained them and made them available so that you can just download them.
Now we can move on to how we're going to predict on new data. So this is going to be a third CoLab notebook in which we're going to take an unlabeled clip and apply these models to generate new predictions. So let's connect to CoLab again in this third notebook. And I'm going to have to wait a second so that it can install SLEAP again. This is one of the downsides of using CoLab: it has to install everything from scratch every time you start a new instance.
But we'll live with it. That will only take a couple of seconds. Couple minutes, rather. Maybe I'll take this time to also demonstrate what's going to happen next, which is after we run the predictions, we're going to get the outputs in a single file that we can download. But we can then visualize the predictions locally. I'll show you how to visualize it on CoLab, but most of the time you're going to be running the predictions either on the cluster or on CoLab and then downloading it so that you can actually look at the results on your own computer.
And for that, you can look at them through the SLEAP GUI. I have some instructions here on how to install SLEAP; a new [INAUDIBLE] environment is the easiest way to go about doing it. Here I have a terminal already open, in which I'm going to activate my SLEAP environment and then call our sleap-label shortcut on that predictions file. I've already saved the predictions file, which we're going to see how to generate in just a moment once CoLab finishes installing over here.
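For reference, the local GUI workflow described here looks roughly like the following. This is an assumed recipe: the exact conda channels and package spec vary by SLEAP version, so check the install instructions at https://sleap.ai; `sleap-label` is the GUI entry point.

```shell
# Create and activate a fresh conda environment, then install SLEAP into it
# (assumed channels/package spec; see https://sleap.ai for the current command):
conda create -y -n sleap -c sleap -c conda-forge sleap
conda activate sleap

# Open the SLEAP GUI on a saved predictions file:
sleap-label predictions.slp
```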
But this is what the final result will look like in our GUI. So here, we've opened up predictions.slp, and we can scroll through all the frames of the video, as well as, say, zoom in to inspect the results a little bit more closely. You can do that just by scrolling your mouse wheel. And we have all sorts of interesting information here, like the tracks that the animals were assigned to. This is the skeleton, and here's the video that we used for prediction.
And you can see that not only is it able to keep track of the animals contiguously, the keypoints are stable and accurate, and it's able to handle even when they're very closely interacting. And again, this is on unlabeled, held-out data that was not used to train those models. So here, frame by frame, you can see how SLEAP handles situations like motion blur as well as heavy occlusion.
So in this GUI, if we had had any tracking errors, you would be able to correct them here by transposing the tracks, deleting them, or otherwise correcting the predictions. If you want to use this for training, you can just double-click one of the instances, and that will create a training instance from our data. It'll do this for, say, both of these guys and mark all the points as present. And we can just click and drag to correct those if we needed to.
Typically, I'll do this in a single labeling data set that contains more than a single video. That's what our GUI looks like. And once you're happy with it, you can export an analysis file that you can load up in MATLAB or Python. Cool. Oh, it looks like we're done. So here, what happened in our inference notebook is that we downloaded the models that we trained using the other two notebooks, and we downloaded a little test clip, an mp4 file.
For inference in SLEAP, we have a command line interface as well as the ability to do it through the GUI directly. In the GUI, you can just go to Predict and Run Inference and select your models. But on CoLab, we can do it interactively by importing our predictor classes. So we can create a top-down model predictor by giving it the path to the two different models that we trained, then create a tracker that associates those detections across frames.
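The two-stage top-down idea being described can be sketched in plain numpy: stage one proposes a centroid per animal in the full frame, and stage two runs on a crop centered on each centroid. This is a toy with stand-in "detectors" (real SLEAP uses a trained neural network for each stage; all helper names here are hypothetical):

```python
import numpy as np

def find_centroids(frame, threshold=0):
    """Stage 1 stand-in: return one centroid as the mean of bright pixels.
    (A toy that assumes a single animal; SLEAP's centroid model handles many.)"""
    ys, xs = np.nonzero(frame > threshold)
    return [(int(ys.mean()), int(xs.mean()))] if len(ys) else []

def crop_around(frame, center, size=32):
    """Extract a size x size crop centered on a detected animal."""
    cy, cx = center
    half = size // 2
    return frame[max(cy - half, 0):cy + half, max(cx - half, 0):cx + half]

frame = np.zeros((128, 128))
frame[60:70, 80:90] = 1.0  # a fake "animal"

centroids = find_centroids(frame)
crops = [crop_around(frame, c) for c in centroids]
# Stage 2 would run the centered-instance model on each crop to get body parts.
```

Splitting the problem this way is what makes the approach flexible: each stage is a separate network that can be retrained or swapped independently.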
[INAUDIBLE] support for a lot of different types of video data. And we can, for the most part, load them in directly by using sleap.Video and loading them from the file name. It'll figure out everything from the format, as well as treating it effectively like a [INAUDIBLE], such that video I/O is relatively transparent. And this allows you to create custom pipelines if you want to integrate it into some other analysis pipeline that you already have set up for your data.
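The "treat the video like an array" idea can be illustrated with a toy wrapper. This is not SLEAP's actual `Video` class, just a sketch of the array-like interface such an object exposes:

```python
import numpy as np

class ArrayLikeVideo:
    """Toy stand-in for a video object that behaves like a
    (frames, height, width, channels) array."""

    def __init__(self, frames):
        self._frames = np.asarray(frames)

    @property
    def shape(self):
        return self._frames.shape

    def __len__(self):
        return self._frames.shape[0]

    def __getitem__(self, idx):
        # Frame access by index or slice, like video[i] or video[i:j].
        return self._frames[idx]

# 100 fake grayscale frames of 384 x 384 pixels:
video = ArrayLikeVideo(np.zeros((100, 384, 384, 1), dtype=np.uint8))
frame = video[0]
```

Because downstream code only needs indexing and `shape`, the same analysis pipeline works regardless of the underlying video format.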
Cool. We have the data loaded, we have a predictor set up, and from that we've already loaded our trained neural network models. Then we can just predict on the video. We're going to set make_labels to true so that we create one of these labels files, the thing that we loaded up into the GUI. That will give us our predicted labels object.
And great. Once we have that guy, we can plot it. We have some plotting utilities, and here I've just set up some interactive plotting in matplotlib that you can see in CoLab. You can scroll through different frames, and in real time CoLab will load up the frame, load up the predictions, and overlay them.
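Overlaying predictions on a frame with matplotlib boils down to an imshow plus a scatter. A minimal headless sketch (the frame and keypoint coordinates here are made up, not SLEAP output):

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this runs without a display
import matplotlib.pyplot as plt
import numpy as np

frame = np.zeros((128, 128), dtype=np.uint8)     # stand-in video frame
points = np.array([[40.0, 50.0], [60.0, 70.0]])  # (x, y) predicted keypoints

fig, ax = plt.subplots()
ax.imshow(frame, cmap="gray")                    # the video frame
ax.scatter(points[:, 0], points[:, 1], s=20)     # keypoints on top
ax.set_axis_off()
fig.savefig("overlay.png", dpi=72)
plt.close(fig)
```

In a notebook, wrapping this in an interactive widget that swaps `frame` and `points` per frame index gives the scrollable view described above.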
So as you can see, this is obviously not as interactive or as high-res as you can get with the GUI, but it's a useful way to inspect the results of SLEAP directly in your CoLab notebook. And obviously, if you want to do it in the GUI, what you'll want to do is save those labels to our predictions.slp file and then download it directly from CoLab, which you could also do through the Files tab here in CoLab. Ta-da. Great.
So recapping, what we did was train some centroid models that can predict the location of anchor parts of multiple animals. We trained a top-down part detection model that can, given a crop of a centered animal, predict the location of body parts. And we loaded up these trained models so that we can track a new contiguous clip of data, and downloaded the results so that we can inspect them locally on our computer. And you can do this on your laptop or something without actually having access to a GPU.
And that about covers it. So thank you, guys. If you're interested in using SLEAP for your work, check out the link at the bottom here. Give us a Twitter follow, and we'll be announcing new versions of SLEAP with new features, from our fancy tracking to getting your quantitative behavior on. So thanks, guys.