Exiting flatland: measuring, modeling, and synthesizing animal behavior in 3D
Date Posted:
April 9, 2021
Date Recorded:
April 8, 2021
Speaker(s):
Jesse Marshall, Harvard University
Description:
Mechanistic studies of complex, ethological animal behaviors are poised to define the next decade of neuroscience. Fully understanding the ontogeny, evolution, and neural basis of these behaviors requires precise 3D measurements of their underlying kinematics. While 2D convolutional networks have allowed for kinematic correlates to be monitored in repetitive behavioral tasks, they are ill-suited to track keypoints in 3D and across multiple behaviors. To address this, we developed a pair of tools, CAPTURE and DANNCE, that enable continuous whole-body 3D kinematic tracking across species, behaviors, and environments (Marshall et al., Neuron, 2021; Dunn*, Marshall*, et al., Nature Methods, 2021). CAPTURE uses chronically attached retroreflective markers and motion capture to continuously track the head, trunk, and limbs of rats in 3D. DANNCE generalizes this tracking capacity to animals not bearing markers by leveraging projective geometry to construct inputs to a 3D CNN that learns to perform 3D geometric reasoning and identify body keypoints in 3D. Together these approaches enable new lines of inquiry into computational ethology, the neural basis of behavior, and artificial models of behavioral production using deep imitation learning. I will discuss the technical details of these approaches and demonstrate how to use them to analyze the structure of animal behavior across multiple timescales.
Speaker Bio:
Jesse Marshall is currently a K99/R00 postdoctoral fellow with Bence Ölveczky in the Harvard Department of Organismic and Evolutionary Biology, where he invents new techniques for behavioral tracking and uses them to investigate the neural basis of movement. He completed his PhD in Physics in 2016 with Mark Schnitzer at Stanford University, where he developed new optical approaches for recording neural activity and applied them to elucidate the neural basis of movement disorders. He received his undergraduate degrees in physics and mathematics from the University of Chicago in 2009.
MODERATOR: Today we're very happy to welcome Jesse Marshall to present with us. Jesse is a postdoctoral fellow with Bence Olveczky in the Harvard Department of Organismic and Evolutionary Biology. He received his undergraduate degrees in physics and mathematics from the University of Chicago. He then completed his PhD in physics with Mark Schnitzer at Stanford University, where he developed new optical approaches for recording neural activity and applied them to elucidate the neural basis of movement disorders.
And today at the Olveczky lab he continues to invent new techniques for behavior tracking and uses them to investigate the neural basis of movement. Jesse is also the recipient of a K99/R00 fellowship. So, Jesse, thank you so much for joining us today and please take it away.
JESSE MARSHALL: Yeah. Thanks so much for the introduction and it's a real pleasure to be speaking with you all today. Hopefully in the coming months, if you have any follow up questions, you can come by the lab. But until then we'll settle for this. So, yeah, I'm going to tell you about a pair of new techniques that we developed for measuring animal behavior, and then their applications in modeling and synthesizing this behavior in silico.
And animal behavior has been a hot topic in neuroscience in recent years. But it's of basic interest to a much broader range of disciplines. So in biology, investigators from genomics to psychology are interested in understanding the diverse biological basis of animal behavior, from its genetic components all the way up to neural circuits. People in medicine and engineering, as well as biology, are also interested in using animal behavior as a tool for preclinical testing of therapies and animal models of disease, for biologically inspired robotics, and to build more humane agricultural systems.
And much of the impetus and technologies for measuring behavior have come traditionally from work in humans, where performance capture has found a niche in sports and Hollywood, and, more recently, augmented reality and virtual reality applications are driving a big boom in markerless pose detection approaches. But the reason that animal behavior has become, I think, so important and so vital to neuroscience in recent years is that we've gotten very good at measuring from the brain. So in this classic plot from Konrad Kording and Ian Stevenson, each point here is a study, using electrophysiology, and they're plotting the number of simultaneously recorded neurons as a function of the study's publication date.
And the y-axis here is on a log scale, and so you can see that there has been an exponential increase in our ability to simultaneously record from neurons. And in recent years, that means hundreds or thousands of neurons, even in freely moving animals. And so we've gotten very good at recording from the brain.
In contrast, if we just put up a sort of schematic plot of our ability to record from animal behavior, say, the number of simultaneously recorded keypoints on an animal's body, that we can measure, this has really not kept pace with our ability to record from the brain. Even modern approaches can record maybe five keypoints from an animal, but still often across a very limited range of different behaviors. So there's a really tremendous gap between our ability to record from the brain and our ability to record behavior.
And this really limits our ability to understand, say, the function of motor systems which we think control diverse parts of the body. It limits our ability to understand sensory systems, which we now know are strongly affected by the ongoing behavior and movements of animals, and a diverse range of other questions in neuroscience, such as social behaviors, where without a detailed understanding of animals' body language, it'll be very difficult to decipher the types of social interactions that are going on.
And this imbalance between recording from the brain or recording animal behavior is sort of endemic to neuroscience. So, you know, it's not just recording neural activity. We've developed diverse approaches for manipulating brain function through engineered fluorescent probes and ion channels. We've developed increasingly baroque microscopes to record from these probes and to perturb the activity of identified cells and neural circuits.
And in connectomics, we have 61-beam scanning electron microscopes that allow us to record the anatomy of neural circuits with really nanometer-scale precision. In contrast, we will stop at almost nothing to avoid measuring the behavior of animals. By and large, the gold standard or typical means of measuring behavior in studies is just to simply describe it, or to hand annotate it.
More recently, work using videography, and digitized using tools from machine learning, has allowed for some quantification of the ongoing behavior of animals. But it still pales in comparison to our ability to record from the brain. And I think that this gap is now widely recognized. And I think in the next five or 10 years, we're going to see substantial improvements in our ability to record from behavior. And I think where these innovations are going is to allow for measurement of behavior in increasingly naturalistic and ethological paradigms.
And so we can have animals in complex visual environments and complex tasks, or in even ecological scenarios. We can use non-invasive measurements with cameras or, say, a probe on the animal's head cap, to record the full 3D pose and body surface of animals, to track their eye and whisker and other effector positions, to track the underlying muscular activity driving a lot of these kinematic changes, and measuring the animal's endocrine neuromodulatory state, as well as a broad range of other behavioral parameters. And, in many cases, we have tools for measuring these variables, often in isolation and in very reduced settings.
And I think a lot of the work that needs to happen over the next decade is making these measurements more integrated, easier to use, and available over a much broader range of different behaviors. Now today I'm not going to be talking about all of these. I'm just going to be talking about measurements of the animal's 3D pose. And this is a measurement that has seen some progress in recent years through the development of convolutional neural networks for pose tracking. However, as I'll show, most of these networks are sort of explicitly designed for 2D pose tracking, and they scale poorly to measurements in 3D, and measurements across multiple behaviors.
And so while I think they've been very impactful for neuroscientists looking at behavioral tasks, when you have a fairly limited range of different behaviors animals perform, they scale poorly to these more complex, naturalistic environments. However, there is a tool that is very effective for 3D pose tracking in complex optical environments. But it's only really been used in humans. And this is motion capture, which is a technique that might be familiar to many of you from Hollywood, where you can take an actor such as Andy Serkis, you can put them in a suit with a number of markers attached.
And these markers have a special property known as retroreflectivity, where light that shines on the marker is reflected straight back. And so if you use a specialized type of camera, known as a motion capture camera, that has a ring light around the camera lens, then you can use this to obtain a very high signal-to-noise recording of the markers' position. If you do this across multiple different cameras, then you can triangulate the position of these markers in 3D, to then visualize the actor as an animated character like Gollum.
But while motion capture has been very important for human studies, it's been difficult to apply to model systems, because it's hard to put animals in a suit, and it's hard to keep these little tiny foam markers attached for long periods of time. We got around this by deploying another technique that's been well-established in humans, namely body piercings. So we developed a set of markers made of high-index-of-refraction glass that could act as almost perfect retroreflectors. And then we developed a custom set of body piercings, where we could attach these markers to the animal chronically for long periods of time.
And these, much like human body piercings, have very minimal effects on the animal's ongoing behavior. And, as you can see from this video, on the right where I have an animal with a number of markers attached to it behaving in an arena, there is a very high signal to noise that you can obtain using these markers. So they're the sort of glowing white spheres that you can see in the movie.
Using these attached markers, we developed an approach known as Capture. And so we have an animal that's actively behaving in an open field arena, and then we attach a set of 20 markers to the animal's head, trunk, and limbs. And then we use a calibrated array of 12 cameras to record the position of these markers and triangulate them into 3D with sub-millimeter precision and millisecond timescale resolution.
This allows us to visualize the full pose of the animal, which we visualize as a wireframe, as you can see here, with the head in blue, the trunk in red, and the limbs in different colors. Now one of the advantages of motion capture, compared to, say, video data, is that the data is very lightweight. We're just recording the marker positions rather than the full video files. And this allows us to make these recordings continuously for 24/7 across days and weeks, across the full range of different behaviors that animals perform.
And this allows us to really get a very ground truth assessment of the diverse types of behaviors that animals make. But it's not just a useful tool for behavioral identification and approaches in computational ethology. We can also use Capture to report the exact behavioral kinematics of animals. So this is a side by side movie where I'm showing you, on the left, the animal's movements, visualized as this wireframe, and then on the right, the velocity of three markers on the animal's head, trunk, and hind limbs.
And this is slowed down by about four-fold. And so, as this animal makes a behavior known as the wet dog shake, you can see, we can report very highly precise oscillations of activity, in these different markers. And so Capture is, I think, a very powerful tool for high precision, long timescale recording in rats. But, of course, there's a number of different applications in behavior where it would be nice to not have to attach markers to animals. And it would also be nice to not have to use a large motion capture array, but to use a smaller set of, say, off-the-shelf machine vision cameras.
And it would, of course, be nice to use this in a greater diversity of model systems, such as mice or marmosets. And the question of how to read out an animal's pose from just normal video cameras is a classic question in computer vision known as markerless pose detection, or here, specifically, markerless 3D pose detection. And the conventional way this is done in neuroscience today is to use 2D ConvNets.
So, in this approach you would take multiple video inputs of an animal from different perspectives. You would then take a convolutional network, such as LEAP or an animal part tracker like DeepLabCut, and then you would label a set of, say, a few hundred frames of this animal behaving in these video frames. And then you would use these examples to extract a set of keypoint predictions of the animal.
And then, using the known position of these cameras, you can triangulate these predictions into 3D, to read out the animal's full 3D pose. The challenge, however, is that these approaches have really only been applied across single behaviors and fairly restrictive environments. And so it's not clear how many cameras, or how many training frames you would need to extend them to measurements across multiple behaviors in 3D. So to test this, we used Capture to collect a large ground truth training and benchmark data set, called Rat 7M.
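As a concrete reference, here is a minimal sketch (in Python, not drawn from any particular package) of the triangulation step just described, assuming each camera's 3x4 projection matrix is known from calibration:

```python
import numpy as np

def triangulate(points_2d, proj_mats):
    """Linear (DLT) triangulation of one keypoint seen by several cameras.

    points_2d : list of (x, y) pixel coordinates, one per camera
    proj_mats : list of 3x4 projection matrices P = K [R | t]
    Returns the 3D point, in the calibration frame, as a length-3 array.
    """
    rows = []
    for (x, y), P in zip(points_2d, proj_mats):
        # Each view contributes two linear constraints on the homogeneous point X:
        #   x * (P[2] @ X) = P[0] @ X   and   y * (P[2] @ X) = P[1] @ X
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

# Hypothetical usage: one keypoint predicted by 2D networks in three camera views.
# P1, P2, P3 = ...  # 3x4 projection matrices from the calibration step
# point_3d = triangulate([(412.3, 288.1), (390.7, 301.4), (405.0, 295.2)], [P1, P2, P3])
```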
And so you can see we're recording capture, which I'm showing as the wireframe here on the left. And then we had six synchronized video cameras that you can see on the right, that were calibrated in the same reference frame as the motion capture array. And we could then project these capture recordings into the six video frames to get perfectly labeled examples of the animal behavior, resulting in this large seven million frame data set that we could use to really benchmark, how much data do you need to train these 2D ConvNets to do accurate 3D pose estimation.
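The reverse operation, projecting the 3D mocap markers into each calibrated video camera to generate 2D labels, is just the pinhole camera model. A minimal sketch that ignores lens distortion (a real pipeline would apply the distortion model from calibration):

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 world-frame points into one camera (ideal pinhole, no distortion).

    K : 3x3 intrinsic matrix, R : 3x3 rotation, t : length-3 translation
    Returns Nx2 pixel coordinates.
    """
    cam = points_3d @ R.T + t          # world frame -> camera frame
    uv = cam @ K.T                     # camera frame -> homogeneous pixel coordinates
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

# Hypothetical usage: turn one frame of mocap markers into 2D labels for camera 1.
# labels_cam1 = project_points(marker_xyz_frame, K1, R1, t1)
```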
This data set is divided into different action categories as well, both to train the networks over a balanced set of data, and then also for benchmarking future pose detection algorithms and evaluating their performance within specific action types. So if we take Rat 7M and we train DeepLabCut here, making predictions using six cameras and giving it 10,000 of these perfectly labeled training frames, we can compare this to Capture. And so I'm showing you here side-by-side movies with the Capture recordings on the left and the DeepLabCut recordings here on the right, and the re-projections on the top, and just the wireframe reconstructions on the bottom.
And you can see that the wireframe reconstructions from DeepLabCut show much larger jitter than the Capture measurements. If we quantify this by varying the number of training frames for DeepLabCut, from 100 to 100,000, and the number of cameras, from three to 12, and looking at the mean error relative to the Capture measurements, what we find is that, in general, DeepLabCut can still only reach a precision of about 18 millimeters for these recordings, which is comparable to the distance between two markers on the rat's forelimb.
And it really means that these tools, which I should emphasize are really fantastic for measurements in behavioral tasks, these are things we use in the lab every day, but when you try to extend them to these much more diverse sets of behaviors, in free-moving animals, they really struggle to perform well. And there's really five reasons, at least, for why these 2D ConvNets sort of struggle in 3D keypoint detection. So the first is occlusions. When animals are freely moving, parts of the body are occluded. And these 2D networks really can't make predictions of an occluded marker.
The second is perspective changes. So if an animal is close to the camera or further away, its relative size will change. And this will mean that the filters that the network is trying to use to extract keypoints are changing, and so you need to add labeled examples. And it also means that it's hard to use spatial statistics, such as, say, the length between the arm markers or the distance between, say, the head and the spine. It's harder to use these spatial priors to constrain or refine predictions.
A third challenge is just behavioral diversity. If animals are making multiple behaviors, then you need to have more labels for each of these behaviors, which can end up requiring many, many labels. Kind of relatedly, in comparison to 2D, where some of the success of these 2D ConvNets for animal keypoint detection comes from having large pre-training data sets in humans, with labeled keypoint examples, these data sets are much fewer and have much more limited diversity in the case of 3D pose detection.
And lastly, these networks simply lack many of the needed inductive biases and architectural features to perform 3D keypoint detection. So camera predictions are made independently. So a camera on the left-hand side of the animal doesn't constrain what a camera on the right-hand side of the animal is thinking. And then all of the reasoning that's going on in these networks is very much in 2D and doesn't have any sense of 3D reasoning.
So to address all of these challenges, we developed a new convolutional network approach for 3D pose detection, that we call Dance, that is coming out in Nature Methods in I think the May issue. And so Dance overcomes many of these weaknesses inherent to 2D pose detection by using a trick known as unprojection. So in addition to having these video frames, we know where many of these cameras are in space. We know their position, their orientation and focal length.
Using this information, we can reconstruct the set of light rays consistent with a given image of an animal in 3D, and this is this differentiable unprojection operation. If we do this from multiple different views, what we get is a fully 3D feature space, where the position of different keypoints in 3D space corresponds to the intersections of light rays in these unprojected views. So if we discretize this space, we can then train a 3D ConvNet to identify the position of keypoints from these light ray intersections.
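A rough sketch of that unprojection idea, reduced to its geometric core; this is not the DANNCE implementation, just an illustration in NumPy with nearest-neighbor sampling:

```python
import numpy as np

def unproject_to_volume(image, K, R, t, center, vol_size_mm=120.0, n_vox=64):
    """Fill an n_vox^3 grid of voxels centered on `center` with pixels from one view.

    Each voxel center is projected into the image (pinhole model, no distortion)
    and sampled with a nearest-neighbor lookup; bilinear sampling is used in practice.
    Stacking the volumes from several cameras gives a 3D feature space in which
    keypoints sit at the intersections of the back-projected rays.
    """
    half = vol_size_mm / 2.0
    axis = np.linspace(-half, half, n_vox)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    grid = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3) + center   # (n_vox^3, 3) in mm

    cam = grid @ R.T + t                                             # world -> camera frame
    uv = cam @ K.T
    uv = uv[:, :2] / uv[:, 2:3]                                      # pixel coordinates
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, image.shape[1] - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, image.shape[0] - 1)
    return image[v, u].reshape(n_vox, n_vox, n_vox, -1)

# The 3D ConvNet input is the concatenation of these volumes over all cameras:
# volume = np.concatenate([unproject_to_volume(img, K, R, t, com_3d)
#                          for img, (K, R, t) in zip(images, cameras)], axis=-1)
```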
Now, there's a couple aspects of this that I want to mention. The first is that these volumes are centered on the position of the animal. And this network is trained using Rat 7M, which contains 30 different camera views. And so, as a result, the network is very robust to changes in the position of cameras that you're using to make your recordings. A second feature is that this feature space, because it's in units of millimeters, is metric. And this allows it to both be robust to different changes in perspectives, whether the animal is close or far away from a camera, and it also allows the network to learn spatial priors about how far apart different keypoints are, and uses these to constrain its predictions in an end-to-end fashion.
And, lastly, because this feature space is explicitly 3D, the network naturally learns to, say, constrain predictions from one camera by the predictions from another. So if we compare Dance to DeepLabCut, here for tracking rats not bearing markers, I'm again showing you the video re-projections on the top, and in comparison to the wireframe reconstructions of animals here on the bottom. And once again, the 3D tracking with DeepLabCut is showing far more jitter in these markerless animals.
If we quantify this, visualizing the mean error versus Capture, with Dance here in shades of blue and DeepLabCut here in shades of orange and peach, Dance achieves errors as low as 3 millimeters for recordings in these animals, whereas the DeepLabCut errors, even for six cameras and having several thousand training frames, really only get down to errors of about, say, 15 or 20 millimeters. So Dance is a fantastic tool for pose tracking in rats, but is also readily extendable to recordings in other species.
And so we collaborated with multiple other labs, including Kyle Severson and Fan Wang, who are now your colleagues over at MIT. And here you can see Kyle's mouse, where he's doing some really beautiful pose detection over here on the left, and other labs, the Aronov lab at Columbia and David Hildebrand's work at Rockefeller in Winrich Freiwald's lab, are getting some fantastic recordings of chickadees and marmosets. And so Dance is, I think, somewhat unexpectedly able to generalize across these different species and environments.
And the code for this is available up on GitHub, and we'll be running through example applications of this in the tutorial today. So I think that Capture and Dance together illustrate how a lot of the progress in behavioral tracking will be made in the coming years: we're going to use high precision behavioral observatories like Capture, or, say, those made using synthetic data sets, to train tailored deep neural networks, so those with inductive biases specifically suited to the pose detection problem, and teach these high precision measurements to generalize across different hardware and species. And I think what's coming is higher resolution measurements of animal kinematics, and eventually of skeletal kinematics, of animals' body surfaces, and of increasingly complex and naturalistic behaviors such as social behaviors. Already we're able to use Capture to record across multiple different animals, and I think that these data sets will again be very useful, both for benchmarking the many existing approaches that are out there for social behavioral tracking in animals, and for developing new algorithms like Dance for 3D pose tracking in multiple behaving animals at once.
So these behavioral techniques raise a number of questions and open a lot of new avenues in behavioral analysis. But before I turn to that, I'm going to say that it's time to start the Colab demonstration, simply because it takes a little while for the code to run. And I was going to give a quick tour of this, but I think I'm only sharing this window. So if you have Colab open, I'm just going to start running the notebook, and then we're going to turn back to it once we sort of formally start the hands-on portion of the talk.
I just wanted to get this starting now because we have to download some video files into Colab, which inevitably takes a little bit of time. So with that running, and I'm going to take silence as a sign that people are able to get this running to their satisfaction, I'm going to turn to talk a little bit about behavioral analysis. So Capture and Dance are able to record the 3D kinematics of animal behaviors over extremely long time scales. So Capture can record kinematics over multiple days, as you can see here, where we're able to visualize, say, the transitions of animals between periods of wakefulness and sleep.
Of course, if we look on very fine timescales, these approaches can also be used to identify the presence of different behaviors, say, walking and rearing, and also the really underlying kinematics driving these behaviors. So if you look at these bottom traces on the left, you can see some various oscillatory activity in the green trace, the right hind limbs, so this is the animal scratching, at the very beginning of the sequence. You can then see a very high velocity wet dog shake type behavior that I introduced before. And so on and so forth.
So these approaches are able to record the sort of millisecond time scale kinematic motifs of different limbs, their organization into behaviors, the organization of these behaviors in a more intermediate time scale into repeated patterns, like behavioral sequences, and then their long term organization into sequences and states. And one of the values of 3D kinematics is that these measurements can be made reproducibly across different labs. So if we measure some 3D kinematics here at Harvard, and a group at MIT, say, Kyle is recording some kinematics over in Fan Wang's lab, then we can compare and we can have common definitions for what behaviors we're observing.
And so I think my hope in the long term is that, in the coming years, behavioral analysis turns into something like analysis of nucleotide sequences in DNA, where you can take a DNA sequence of A's, C's, T's, and G's like you see here, you can copy it and enter it into a BLAST search, a Basic Local Alignment Search Tool, to search a very large database of different sequences that have been measured and to identify the exact protein that you're looking at. So when you run BLAST, it takes a second because it's a government agency, but you end up with a bunch of hits. You can see the top one is this potassium inward rectifying channel, KCNJ2, and then we can go do more than just identifying it.
We can go through a gene ontology database, search for KCNJ2. Again we find a number of hits. We can look at this protein in mice. And we see that this inward rectifying potassium channel participates in establishing the action potential waveform and excitability of neural and muscle tissues. So we can see that it's located in neurons, in the heart, in the muscles. We can look at its subcellular organization, and even its exact protein sequence, or a crystallographic structure. And this knowledge in genomics is really at your fingertips.
And I think our hope is that in the long term, we can take these kinematic recordings and have similar databases where we can start to just read out the behavior that an animal is engaged in and other aspects, descriptive aspects, of these behaviors in a sort of behavioral ontology search. But, of course, we're not there yet. And so we've developed a set of pipelines for trying to do some of this work in identifying structure in these behavioral data sets.
And our entry point to analyzing these kinematics is first identifying stereotyped behaviors that the animals perform. And the pipeline for doing this is to take a Capture or Dance recording of 3D kinematics and define a set of features describing the animal's pose and kinematics on individual frames. And so these are often things like the Eigen poses of the animal, and a time frequency transform, like a wavelet transform, of these Eigen poses, so that you're getting information about the kinematics in a local window.
You can then take this high dimensional feature set and embed it in 2D using t-SNE. So we now have a density map here where different density peaks correspond to performance of similar behaviors. And then from this embedding space, we can then cluster out different peaks that we observe and then annotate them with different behavioral names. Now this procedure of taking a high dimensional data set defining a correspondingly high dimensional set of features and then clustering it is common to many workflows in biological and other sciences. And oftentimes there's a conversation about what algorithm you're using, say, for dimensionality reduction, whether it's t-SNE or IsoMap or UMAP, and about, say, the types of clustering you're doing, whether it is a K-means or an information based clustering or hierarchical clustering.
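A toy version of this pipeline, using scikit-learn and PyWavelets as stand-ins for the exact tools used in the papers (k-means is substituted here for the density-peak/watershed clustering usually applied to these maps, just to keep the sketch short):

```python
import numpy as np
import pywt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans

def behavior_map(aligned_keypoints, n_pcs=10, fps=100.0, n_clusters=40):
    """Toy behavioral embedding: eigenposes + wavelets -> t-SNE -> clusters.

    aligned_keypoints : (n_frames, n_keypoints * 3) egocentrically aligned poses
    Returns a 2D embedding (n_frames, 2) and an integer cluster label per frame.
    """
    # 1) Eigenposes: project aligned poses onto their top principal components.
    scores = PCA(n_components=n_pcs).fit_transform(aligned_keypoints)

    # 2) Time-frequency features: wavelet transform of each eigenpose score
    #    (these scales span very roughly 1.5-40 Hz for 100 fps data).
    scales = np.geomspace(2, fps / 2, num=15)
    dyn = [np.abs(pywt.cwt(scores[:, i], scales, "morl")[0]).T for i in range(n_pcs)]
    features = np.concatenate([scores] + dyn, axis=1)   # pose + local dynamics per frame

    # 3) Embed in 2D (subsample frames in practice; t-SNE on every frame is expensive).
    embedding = TSNE(n_components=2, init="pca").fit_transform(features)

    # 4) Cluster the embedding into candidate behaviors.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(embedding)
    return embedding, labels
```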
And I'd just like to emphasize here that these distinctions between types of algorithms are important. But, ultimately, probably the biggest factor that determines what types of behavioral clusters you get out from these pipelines is the types of features that you put into them. And so if you look at papers that are really doing unsupervised behavioral analysis, oftentimes there's an extensive methods section on the features that are used, because there are diverse types of features that can impact your ability to track different behaviors. So it's not just the Eigen poses and their velocity, it's also the animal's, say, center of mass, and the center of mass on different time scales, and maybe the distance of the animal, say, from a wall.
And so, oftentimes, I think it's important to think critically about the types of features that one uses. And I think we're still a bit in the infancy of determining what the exact set of features to use for behavioral analysis is. So with these pipelines, I'll show you an example of what one of these behavioral maps looks like. So here I'm going to show a movie where, on the left-hand side, is a single cluster visualized from the t-SNE space. And I'm going to show you examples from different clusters in the space, and, on the right-hand side, these wireframe representations of the animal, made from six randomly drawn instances from that cluster.
So these different colored regions in the t-SNE space, that I'm sort of switching between, correspond to different categories of behavior, so grooming, scratching, walking, locomotion, et cetera. And if I was to show you different sub-clusters within these colored regions, they would typically correspond to different kinematic or postural variants of these behaviors, so rearing to different heights, walking at different speeds, grooming different parts of the body, and so on and so forth. With these approaches, we can both identify behaviors, but then we can also extract out their underlying kinematics. And so we can determine exact kinematic fingerprints of these behaviors.
So for rhythmic behaviors, we can look at, say, the power spectral density of markers. So take a marker on the trunk, look at its frequency transform to see how much power is in different frequency bands. And we find, for instance, that there are many different types of wet dog shakes that have very variable amplitudes, but they all occur at a frequency of precisely 15 hertz. Similarly, there are a variety of different types of left grooming behaviors, again, with very different amplitudes, grooming different parts of the body.
But they all really have a center frequency of just about 4 hertz. In contrast, scratching behaviors, say, are far more variable in their underlying frequency. And I think that these kinematic fingerprints are sort of an example of what would go into a behavioral ontology, where we could have, for different species, the exact kinematic fingerprints underlying the different behaviors that occur in that species. And it's not just the short timescale kinematics that we can start to analyze. We can also use these kinematic recordings to understand the organization of behavior on longer time scales.
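Before turning to those longer timescales, a minimal sketch of the spectral fingerprinting just described, using scipy's Welch estimator on a single marker's speed trace (the default frame rate here is only illustrative):

```python
import numpy as np
from scipy.signal import welch

def marker_psd(marker_xyz, fps=100.0):
    """Power spectral density of one marker's speed trace.

    marker_xyz : (n_frames, 3) marker position over time, in mm
    Returns (frequencies in Hz, power); wet dog shakes should show a sharp peak
    near 15 Hz, grooming bouts a broader peak around 4 Hz.
    """
    speed = np.linalg.norm(np.diff(marker_xyz, axis=0), axis=1) * fps   # mm/s
    freqs, power = welch(speed, fs=fps, nperseg=int(fps))               # ~1 Hz resolution
    return freqs, power
```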
So if we start from an ethogram, a description of the animal's behavioral usage over time, we can smooth this ethogram on different time scales, and then look at the pairwise relationship or pairwise similarity between different time points, which now represent a density vector of behaviors that occur in a local window. And so the off-diagonal peaks in these similarity maps correspond to reuse of different patterns of behavior, so similar sets of grooming or rearing or locomotor sequences. And then we can use a clustering approach to just identify repeated patterns of behavior, on these various timescales.
And what this spits out, on short time scales of 15 seconds, are different behavioral sequences like walking or scratching or grooming, and we find that if you look at the transition matrix of the patterns it extracts, they have a sequential ordering consistent with a stereotyped sequence. And if we look on longer time scales, so if we take the ethogram, we smooth it on a longer timescale, and look for repeated patterns in this smoothed ethogram, we find more behavioral states, corresponding to changes in arousal as the animal is exploring in the arena, or a maintenance state consisting of various types of grooming behavior, or performance of a behavioral task that we often have our animals use. So with this suite of approaches, I think we have a basic scaffold for taking these kinematic data streams and really bringing some order to them, by identifying these different behaviors, the exact kinematic features of these behaviors, and their longer timescale organization into sequences and states.
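A compact sketch of that longer-timescale analysis, with hierarchical clustering standing in for whatever clustering one prefers (function and variable names here are illustrative):

```python
import numpy as np
from scipy.ndimage import uniform_filter1d
from scipy.cluster.hierarchy import linkage, fcluster

def behavioral_patterns(ethogram, n_behaviors, window_frames, n_patterns=10):
    """Find repeated patterns of behavior usage on a chosen timescale.

    ethogram      : (n_frames,) integer behavior label per frame
    window_frames : smoothing window, e.g. 15 s * frame rate for sequences,
                    much longer for states
    Returns a window-by-window similarity matrix and a pattern label per window.
    """
    # One-hot encode the ethogram, then smooth each behavior's usage over the window,
    # giving a local density vector of behaviors at every frame.
    onehot = np.eye(n_behaviors)[ethogram]
    density = uniform_filter1d(onehot, size=window_frames, axis=0)

    # Subsample one density vector per window to keep the matrices tractable.
    windows = density[::window_frames]

    # Off-diagonal structure in this similarity matrix marks re-used sequences/states.
    similarity = np.corrcoef(windows)

    # Cluster the windows into repeated patterns of behavioral usage.
    labels = fcluster(linkage(windows, method="ward"), t=n_patterns, criterion="maxclust")
    return similarity, labels
```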
The last component of this work that I want to mention is that we can also use these kinematic recordings for generative modeling of behavior itself. And so what we can do is we can take a physical model of an animal. And so this is a model in a physics simulator that has a skeleton and a set of masses associated with different points in the skeleton. And we can use deep reinforcement learning to both train this skeleton from scratch to solve multiple tasks, and then we can also use these kinematic data and use imitation learning to train this network to imitate the exact behaviors that these animals are performing.
And I think what these approaches will allow us to do, taking inspiration mostly from the work going on at MIT, is to take these networks, which really can act as artificial motor systems, and compare them with the motor system of real animals to identify commonalities and also differences in the structure of motor representations, their hierarchical organization, and differences across nuclei. And so I think these approaches are a way that we can start to extend many of these types of network analyses, which I think have been, again, so beautifully illustrated in sensory systems, to the motor domain.
You know with this, I'd just like to say that we are recruiting. If you know any fantastic, say, undergraduates or technicians interested in work over the summer or, say, next year, we're looking for people across diverse areas in the lab. So feel free to reach out if you or someone you know is looking for a spot at the intersection of animal behavior and neuroscience and deep learning.
And I'd just like to close by thanking everyone that contributed to this work, in particular Tim Dunn, who is a former graduate student at Harvard but is now an assistant professor at Duke University, as well as Diego Aldarondo, who's a fantastic graduate student in the Olveczky lab that's contributed to most of the projects you see here. So with that, maybe I'll take any questions on the sort of talk portion of the tutorial, before transitioning to the more hands-on tutorial section.
AUDIENCE: Yeah, so all of this tracking stuff is really cool. I was wondering if you'd be able to share what the limitations are. So like are there places where the best system fails and/or is it perfect and everyone should be using it.
JESSE MARSHALL: Yeah, no, that's a great question. And I think that the two things I'll say are, number one there's a lot of room left at the bottom in that there's a lot of the parts of the body that we still are not able to measure. And so some of my other work involves recordings from the motor system and trying to correlate motor responses with behavior. And there's still questions about what parts of the body, if any, are driving these changes, in addition to, say, sort of sensory systems.
And so I think that there's still a lot of work that we need to do to really get exact recordings of the animal's behavior. With respect to Dance, which I think is probably the most accessible of these two techniques of Capture and Dance, I think the limitation with these approaches is still training data. And so I think that, compared to 2D ConvNets, Dance is far more sample-efficient, right, where Kyle, I think, is labeling several hundred frames of mice that are freely behaving. I think to get that level of precision with DeepLabCut you would need tens of thousands of frames.
So it's far more sample-efficient, but hundreds of frames is still not nothing. And, I think, you also find with these that if you substantially change the environment, you need to label additional training frames. And so I think that this is obviously a very active area in machine learning. And so there's approaches using everything from types of augmentation and neural rendering to synthetic data, that I think are addressing these. But that's still sort of, I would say, an ongoing challenge for biological investigation.
AUDIENCE: Jesse, can you talk a little bit more about how one would apply this to a species that looks different, with a different body plan, for example.
JESSE MARSHALL: Yeah, that's a fantastic question and we'll be touching on some of this in the tutorial section. But, in general, Dance is very flexible where you'll just kind of define a set of keypoints that you want to measure and like a set of skeletal linkages between them. And then you'll supply that set of labeled data to the algorithm. And it should extend that to that species.
AUDIENCE: Yeah, I guess the part that I didn't quite understand is would you have to start with the markers on animals, or is the idea that it would transfer somehow?
JESSE MARSHALL: Yeah, great question. This would transfer. And so you wouldn't need to have a full marker data set. You could use just, say, 100 or a couple hundred labeled time points of the markerless animal's behavior. So for all the work that our collaborators and us did in mice and marmosets and chickadees, none of those animals were wearing markers.
AUDIENCE: I guess just a quick follow up to that question, and maybe this was partially answered by that. I'm interested in different species, larger species. Obviously like humans are very important to study. But has any work tried to apply this yet to something like rhesus macaques or species that are a little more different kinematically than the movement of a mouse?
JESSE MARSHALL: Yeah, I think that's a fantastic question also. And I think related to all of this is the question, this network has been trained on rats. And so it's very hard to empirically guess or gauge how effective the transfer capabilities are to different species. I can say that many aspects of the network make transfer to new environments fairly robust. And so the fact that volumes are centered on animals, the fact that it's in this sort of metric space, make it much easier to scale to new environments, because what the network has learned is not just about where rat keypoints are in space, but it's learned a more general ability to reason in 3D about how different light rays intersect to define keypoints.
It's learned to start to learn relationships between how keypoints on one side of the animal constrain the other. And like all things in deep learning, I mean, it's a little squishy. It's a little empirical. But there's a lot of reasons to think that this ability to reason geometrically is driving an increased ability to transfer to new settings. And as for larger species, well, we have tried it on humans, both adults and infants, and it works well.
And so, yeah, the size is not really a large driver. You just need to change a couple of parameters that you use in the network.
AUDIENCE: Awesome. That's great to hear. Is there like a-- is that work published, the stuff on humans yet, or--
JESSE MARSHALL: The humans will not be part of the paper, I think. It's just we applied it to one of the human benchmark data sets. But the paper will be out soon.
AUDIENCE: Gotcha. Thanks.
AUDIENCE: Yes, on the topic of transferring, you could transfer to other animals. And it sounds like you have for humans. But if you wanted to grade how accurately it's performing beyond just like an eyeball test, you would still need either some kind of data set that has accurately measured markers, or do that yourself with the original Capture setup with the body piercings. Is that right?
JESSE MARSHALL: Yeah. Yeah, so for empirically validating transfer success, we have typically used just human observers. And so if we have a mouse, we'll have two people label that mouse, a set of say 20 keypoints on the mouse, and then gauge how the Dance predictions compare to these human labelers. And in general, we find that the Dance-to-human error is comparable to the human-to-human error.
And it is, though, I'll say, you know, it's not going to be quite as precise as a rat where you have several million frames of training data. But it still is, I think, better, and it can also label occluded keypoints not visible to humans.
AUDIENCE: My other question is, do you have any plans, like imminent plans, to extend this to tracking multiple objects? Because there's a whole lot of data association problems that get much more complicated when you have multiple targets.
JESSE MARSHALL: Yeah, a great question. We are very interested in extending this to multiple objects. I think the 3D capabilities are very beneficial for multi-animal tracking, because, if you think about two rodents interacting, there's obviously a lot of occlusions. And having a method that's far more robust to occlusions can really help. But we don't yet have anything to share on that front.
AUDIENCE: Yeah, great talk. I'm wondering if you have tried any different animal disease models, or something like drug induction, to alter the behaviors, and then tested your model to see if it can detect that change. Have you ever tried that type of test?
JESSE MARSHALL: Yeah, that's a great question. We haven't done anything with Dance in mice, I think. In rats we have done work in the Capture paper with drugs as well as some models of autism. But I think that for the kinds of analysis approaches for identifying, say, behavioral structure on multiple timescales and then comparing these across species, you could sort of extrapolate out from what we've done in the Capture paper.
AUDIENCE: Great. Thank you.
AUDIENCE: Does this technique presume that you have three or more cameras? And do those cameras have to be at a certain angle relative to one another?
JESSE MARSHALL: Yeah. So that's a great question. And so Dance requires at least one camera and knowledge of the animal's center of mass, which generally requires having two cameras. But so, because it's learned about 3D structure, it is possible in principle to make predictions from a single camera. But in general we use somewhere between, I would say, three and six cameras, which is I think generally possible in most lab recording environments.
With respect to camera arrangement, we've had good results with many different types of camera arrangements, from having one on either side and one on the top, to having them sort of elevated and inclined at different configurations. The one that I would avoid is having two cameras facing each other, because this can cause really profound degeneracies in triangulations. But we haven't found, in general, that one configuration works substantially better than others. I would say that most of the variation that we've had is simply, if you have enough labelled training frames up to, say, a couple hundred, then performance is generally good for everything.
I'll give a couple of slides overview. And then we can go to the notebook. And maybe I'll switch my screen share there. And so the tutorial will go over application of Dance to a recording of a mouse from multiple video cameras. And so to use Dance, there's a few different steps.
And so sort of the entry point, which we've just been discussing, is that you need to have a set of cameras, probably at least three, and these cameras need to be calibrated in some global reference frame, which is a sort of standardized procedure in computer vision, to determine their 3D position and aspects of the lens, like its distortions and focal length. In addition, it typically helps if the video from these cameras is compressed in some manner and formatted so that Dance can read it in. So it needs to have the right file structure.
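For reference, the intrinsic half of that calibration is standard OpenCV. A minimal single-camera sketch with a checkerboard follows; the image folder is hypothetical, and solving for extrinsics in a shared world frame (e.g. with an L-frame) is a separate step handled by the scripts linked in the repo:

```python
import glob
import cv2
import numpy as np

# Checkerboard with 9x6 inner corners and 25 mm squares (adjust to your board).
PATTERN, SQUARE_MM = (9, 6), 25.0
board = np.zeros((PATTERN[0] * PATTERN[1], 3), np.float32)
board[:, :2] = np.mgrid[0:PATTERN[0], 0:PATTERN[1]].T.reshape(-1, 2) * SQUARE_MM

obj_points, img_points, size = [], [], None
for path in glob.glob("calibration/cam1/*.png"):        # hypothetical image folder
    gray = cv2.cvtColor(cv2.imread(path), cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, PATTERN)
    if found:
        obj_points.append(board)
        img_points.append(corners)
        size = gray.shape[::-1]

# Intrinsic matrix K and lens distortion coefficients for this camera.
err, K, dist, rvecs, tvecs = cv2.calibrateCamera(obj_points, img_points, size, None, None)
```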
And then you need some training examples. And so for a new animal, a new data set, you need some examples of where the center of mass of the animal is, and the different keypoints that you want to measure, typically about a hundred of those. With this in hand you can then use the Dance algorithm.
And so the first thing Dance is going to do is it's going to find the center of mass of the animal. Then, using the multiple video cameras, we can triangulate the position of the center of mass using this measured camera calibration. Then the network will anchor a grid on that 3D position. And you can take these images, you can project the grid onto the images and populate all of the voxels with the pixel values. And then we're going to transfer these pixels into voxels, and then use a 3D ConvNet to process these volumes and output a confidence map, which will then be processed to yield the final keypoint predictions.
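The very last step, going from 3D confidence maps back to coordinates, amounts to finding each map's peak in the voxel grid and converting voxel indices back to millimeters. A minimal argmax version is sketched below; the released code uses more refined read-outs, such as a spatial average over the map:

```python
import numpy as np

def confmaps_to_keypoints(confmaps, center, vol_size_mm=120.0):
    """Convert per-keypoint 3D confidence maps to world coordinates.

    confmaps : (n_vox, n_vox, n_vox, n_keypoints) network output
    center   : the 3D center of mass the volume was anchored on, in mm
    """
    n_vox = confmaps.shape[0]
    axis = np.linspace(-vol_size_mm / 2, vol_size_mm / 2, n_vox)
    keypoints = []
    for k in range(confmaps.shape[-1]):
        i, j, l = np.unravel_index(np.argmax(confmaps[..., k]), confmaps.shape[:3])
        keypoints.append(center + np.array([axis[i], axis[j], axis[l]]))
    return np.stack(keypoints)          # (n_keypoints, 3) in mm
```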
So supporting all of this, we have a GitHub repo, which we'll be drawing from today. And this is just a screen capture of the base of that GitHub repo. And I've highlighted a couple of features of this, so there's this beginning step, right, where you have to get your cameras. You've got to synchronize them. You have to calibrate them. You probably want to compress them.
And so, in red, I've highlighted here that we have a couple of different approaches people have used for camera calibration, one just using a checkerboard and an L frame, another using a laser pointer that you sort of just shine in the arena, and then you can use that to get the calibration for the cameras. There's also two scripts for camera compression. I'm going to highly recommend this one written by Kyle Severson, which does onboard GPU compression of videos. And so he's recording from, I think, six high definition cameras with a single computer and compressing all of them in real time on the GPU.
And this just makes these recordings far, far, far easier. And so this is an external repo that is linked to in the Dance repo. And then there's another repo that I've highlighted in green, written by Diego Aldarondo, who's a grad student in the Olveczky lab, that is a really fantastic tool for this process of labeling training data. And I'm going to give some examples of that in a second. And so these sort of steps are all to make this beginning process of just setting up the experiment a lot easier.
And so I'll give just some examples of this labeling process. And so the nice thing about this Label 3D tool is you can visualize simultaneously like all the different cameras you're recording. So this is a rat in an arena on a green screen recorded with six cameras and so, with Label 3D it'll bring up images of all the cameras. And then one of the nice things about it is you can just click on the center of mass here in a couple of different views. And then you can just hit a button to triangulate it across the rest of the views, which will accelerate the labeling process, because you can use the triangulation from two views and then triangulate it into 3D and then project it into all the different image frames.
This becomes very handy when it comes to measuring the animal's full kinematics. And so here we're labeling 23 keypoints on this rat. And so first we're going to zoom in and just visualize the rat, and then start to label the keypoints on the head and spine and limbs. And I think where this Label 3D approach really comes in handy is the limbs are often occluded in multiple views. But you can find the set of views, say these bottom two, where here the left forelimb is visible, and you can label the keypoints in just those views where it's visible. And then as I'll show in a sec, you can triangulate these predictions to all of the other views.
And then this is again very useful for really augmenting all of the training data you get, because you're only labeling a subset of views. Then you're getting the predictions from all of the cameras. And here you can see that we've triangulated these predictions and now have the labelled examples from all of the different cameras.
So this is the Label 3D software. And I think it's very helpful for working with these 3D tracking approaches. With that I'm going to transition to the Colab, which is the perfect spot. So the demo will throw a memory error, as a side note. But so this tutorial goes through a few aspects of Dance and the behavioral analysis. So we start by just setting up Dance and installing dependencies, downloading the GitHub repo, and creating a virtual environment.
And, following that, we're going to download a set of video data from the Dance GitHub repository, and then run Dance predictions on that. So these first cells we're installing Anaconda into the Colab notebook. We are then cloning the Dance GitHub repo. And so that is just put into the Colab directory. So if you go and do this yourself, and if you look in the content folder of your Colab notebook, then there will be a subdirectory for Dance, and in that contains the entire GitHub repo.
Following that, the script creates the Dance Anaconda environment, which I've just sort of already installed. It's using the FFmpeg package to deal with most of the video processing, as well as dependencies like PyTorch and cuDNN. And so that is sort of the set of dependencies that Dance uses. And so once that's installed, we have all these directories and we can start to analyze some video cameras.
So the GitHub repo contains a number of demos in markerless mice. And because of the size of the video files, they're small but they're not that small, so we find it's best to not keep them in the GitHub repo, but we have a URL that contains a link to the video files. And so in the repo if you go into demo, Markerless Mouse One videos, there is a text file that contains a link to the videos, just a Dropbox link.
And so this first part, the reason that I wanted to start this Colab a little earlier, is that it downloads these video files, which takes about five minutes. And once they've downloaded, they're in this video directory of the Markerless Mouse One folder. And so this folder is a Dance project folder, and the organization of it is going to be standardized across all of the data sets you analyze using Dance, where you have a video folder that contains the videos you've recorded, organized by camera one, camera two, camera three, camera four, et cetera. There's also a folder for the center of mass labeling that has the prediction results as well as training weights for the center of mass detection network.
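To make that organization concrete, the project folder being described looks roughly like this (folder and file names approximate the demo folders in the repo, which are the authoritative reference):

```
Markerless_Mouse_1/            # one project folder per recording
├── io.yaml                    # paths to the folders below
├── videos/
│   ├── Camera1/0.mp4
│   ├── Camera2/0.mp4
│   └── ...                    # one subfolder per camera
├── COM/
│   ├── predict_results/       # center-of-mass predictions
│   └── train_results/         # center-of-mass network weights
└── DANNCE/
    ├── predict_results/       # e.g. the save_data_AVG keypoint predictions
    └── train_results/         # DANNCE network weights
```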
And then, similarly, there is a subdirectory for the Dance code that contains the prediction results, as well as training and the network weights. I am not going to go over training and labeling in this demo. I did see that [? Tomoe, ?] I think, had covered some of this in his tutorial. And I think we'll sort of stand on his shoulders. But the process is well described in the GitHub repo. And I've sort of included this excerpt here where, once you have this project folder set up, you also have this, I forgot to mention this, io.yaml file.
So this contains a description of where everything is located in this project folder. So it tells the network where, say, the center of mass directory is, and where the Dance directories are, and so on and so forth. Then, using this, the center of mass network is trained by just running com-train, and then com-predict, and then giving a path to the center of mass configuration file, which I think we have some examples of here in the configs folder.
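To give a sense of what such a file contains, here is an illustrative sketch; the key names below are stand-ins, and the repo's wiki and demo io.yaml files define the actual schema:

```yaml
# io.yaml (illustrative key names, not the authoritative schema)
com_train_dir: ./COM/train_results/        # where COM training output goes
com_predict_dir: ./COM/predict_results/    # where COM predictions are written
com_predict_weights: ./COM/train_results/weights.hdf5
dannce_train_dir: ./DANNCE/train_results/  # where DANNCE training output goes
dannce_predict_dir: ./DANNCE/predict_results/
```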
So this is the actual config file that's being read by, say, com-predict or com-train, and it just has several of the different hyperparameters that you would use. And you don't, I think, typically need to modify any of them to get it to work. But if you really wanted to change, say, the batch size, or the learning rate, you could change them in here. I'm just going to close that. So that's training the COM finder, and in this demo the COM finder has already been trained. And we have the prediction results as a MAT file and a pickle.
And the network has also been trained. And so we're just going to download, in this next cell, the weights for the training. And so, similarly, if you wanted to run the Dance training, you would just use dannce-train and then you would give a path to the Dance configuration file that's in that same configs folder. And there's a different set of hyperparameters that I'll mention in a bit. So with these network weights, we can then run the prediction. And so this typically takes a second to run, so I won't quite go through it here. But you just run dannce-predict and then you give it a link to this config file. And in this config file, again, there's a number of different parameters.
And you don't really need to change any of them to get it to work out of the box, aside from maybe one, which is this volume size parameter, which is just the size of the 3D volume in millimeters, anchored on the animal. And so this can vary if you are using, say, a marmoset, and you want it to be more on the order of, say 500 millimeters, versus a mouse where it's going to be on the order of, say 100 millimeters. And this can be important because, in contrast to 2D networks, we're only using, say, 64 voxels on a side. And so there is some sort of resolution trade-off that you're going to see with the volume size there.
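That trade-off is easy to work out: with 64 voxels per side, the spatial resolution is simply the volume size divided by 64.

```python
# Voxel edge length for a 64-voxel-per-side volume, using the sizes mentioned above:
for species, vol_size_mm in [("mouse", 100), ("marmoset", 500)]:
    print(f"{species}: {vol_size_mm / 64:.1f} mm per voxel")
# mouse: 1.6 mm per voxel
# marmoset: 7.8 mm per voxel
```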
There's a number of other parameters here. And I just want to mention that, if you look at the Dance GitHub page, there is a wiki, and the wiki has sort of an in-depth discussion of all of the different required parameters. So, say, the batch size or the number of epochs, or the volume size that you're using, which, for instance, this should be big enough to fit the entire animal with a little wiggle room to accommodate noise in the center of mass, as well as other optional parameters. If you are a real power user and wanted to change, say, the GPU ID or the amount of median filtering that goes on, you could change those in the Config file as well.
But I think for most users, really, the biggest parameter you need to change is just the size of the 3D volume that goes in. And so the prediction results, because we started this before, have already been run. So if we look in the Dance folder of this demo, then we can see that there is a save_data_AVG file that has been generated. So let me give you an example of what some of these predictions look like. So this is just the view from one of the cameras which, because of the way Colab works, we just have to compress the video and have a few extra lines of code to visualize it.
So that's going to take just a second to load in and run.
AUDIENCE: Are we able to get the code to run, like run a prediction?
JESSE MARSHALL: Yes, so that I've actually pre-run. And so we could start that in a second. But the prediction, I should say, can run at about 10 hertz and is very parallelizable, so we can record for a few hours each day and just kind of run the predictions more or less in real time, but in Colab it can take a little while to run. So I wanted to spare everyone the waiting time for that.
Yeah, I don't know what's going on with this video file, but this is, I'll describe in words, is just one of the videos that we record, as well as the re-projected points from this prediction file, on top of the videos. It's really taking its time. Oh, I see.
AUDIENCE: Jesse, so there is a comment in the chat.
JESSE MARSHALL: Yeah.
AUDIENCE: About some people having issues with unzipping the videos. Have you encountered this problem?
JESSE MARSHALL: No, I actually haven't seen that. I have to say I'm not a Colab power user, so my first bet might be Stack Overflow. But if other people can locate this, I think one of the challenges with Colab is, like, I don't know if I have any sort of privileges that other people don't have, but if anyone can recapitulate that I'd be very interested. So, OK, this is just an example of one of Kyle's videos here, fitting, given that this is the talk at MIT.
And you can see it's a very high-resolution recording of a mouse. We've just taken several of these videos and run the keypoint prediction on top of them, which is, of course, taking its sweet time. Oh, I see, it's waiting for me to give input. So hopefully this will run in a second.
But, yeah, this is just an example of what these points look like re-projected on top of the video, which is very small; I think that's all we're going to get. But this is just what it outputs: the 22 keypoints projected right on top of the animal.
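For reference, re-projecting the 3D predictions onto a camera view is just the pinhole camera model applied with that camera's calibration; a minimal sketch, ignoring lens distortion and assuming a world-to-camera convention of x_cam = R x_world + t (your calibration may use the transpose or inverse):

```python
import numpy as np

def reproject(points_3d, K, R, t):
    """Project Nx3 world-frame keypoints to Nx2 pixel coordinates
    (pinhole model, no lens distortion)."""
    cam = points_3d @ R.T + t        # world -> camera coordinates
    uvw = cam @ K.T                  # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]  # perspective divide

# Overlay on a frame, e.g.:
# plt.imshow(frame); plt.scatter(*reproject(pose, K, R, t).T, s=5, c="r")
```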
So that is the DANNCE pipeline and project folder organization. And I should also mention that a lot of these steps, as far as calibrating cameras, synchronizing cameras, compressing camera data, and putting it in the project folder, are going to be common to any deep learning approach to pose detection, and to 3D pose detection in particular. So even if you opt not to use DANNCE, I think a lot of these steps will carry over to whatever approach you do use.
But once you have these prediction results, we can start to use some of these approaches for analyzing 3D kinematics. We can start by just loading in the prediction data, this save_data_AVG file from DANNCE predict, and plotting it. And you can see that DANNCE outputs these very smooth time traces of the 3D kinematic predictions.
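In Python that load-and-plot step is only a few lines; here's a sketch with scipy, where the file name and the 'pred' field with a frames x 3 x keypoints layout are assumptions about the saved output, so check the shapes in your own file:

```python
from scipy.io import loadmat
import matplotlib.pyplot as plt

data = loadmat("save_data_AVG.mat")      # assumed output filename
pred = data["pred"]                      # assumed shape: (n_frames, 3, n_keypoints)
print(pred.shape)

# x, y, z traces of the first keypoint over time
for i, name in enumerate(["x", "y", "z"]):
    plt.plot(pred[:, i, 0], label=name)
plt.xlabel("frame"); plt.ylabel("position (mm)"); plt.legend()
plt.show()
```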
I'm going to skip ahead a little bit and pretend that we ran that over a full recording of a mouse, so not just over several minutes, but over an hour-long recording session. There's a separate zip file here, which hopefully everybody can open, that has these larger recordings. And I should say that we typically switch a little bit between Python and MATLAB, just because a lot of the 3D visualization code in MATLAB is a little easier to use, so here we're just going to convert this data file, which is a .mat file, into a Python-readable format.
So then we have a data structure that has all of these kinematics, and from that we want to compute a behavioral embedding, one of those behavioral maps that I showed you previously. To do that, the first step is to compute the eigenpostures of the animal. We just take these keypoint measurements, the 3D marker positions over time, although I'll note that I also did a little bit of sleight of hand and put them in an aligned coordinate frame: we fix the animal's center of mass, align it along an axis, and then look at deviations around that aligned animal.
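That alignment is just removing the center of mass and rotating each frame so the body axis points in a fixed direction; a minimal sketch, assuming the keypoints come in as frames x keypoints x 3 and that you pick two trunk keypoints to define the heading:

```python
import numpy as np

def align_egocentric(kp, i_front, i_back):
    """kp: (n_frames, n_keypoints, 3). Center on the mean keypoint position and
    rotate about z so the trunk axis points along +x in every frame."""
    centered = kp - kp.mean(axis=1, keepdims=True)                 # remove center of mass
    heading = centered[:, i_front, :2] - centered[:, i_back, :2]   # trunk vector in xy
    theta = np.arctan2(heading[:, 1], heading[:, 0])               # heading angle per frame
    c, s = np.cos(-theta), np.sin(-theta)
    rot = np.stack([np.stack([c, -s], -1), np.stack([s, c], -1)], -2)  # (n_frames, 2, 2)
    aligned = centered.copy()
    aligned[..., :2] = np.einsum("fij,fkj->fki", rot, centered[..., :2])
    return aligned.reshape(len(kp), -1)   # flatten to (n_frames, 3 * n_keypoints)
```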
Then we can just run PCA on this set of marker positions. So we can take this 81-dimensional matrix of aligned keypoints, by, here, I guess a million different time points (it's about an hour of data at 100 hertz), and extract out just the top 10 eigenpostures of the animal. If we visualize the scores as a heat map, you can see that there's some transition over time; there are these little green blobs which probably correspond to when the animal is rearing or grooming or something like that.
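That step is stock PCA; sketched with scikit-learn, assuming `aligned` is the frames x 81 matrix from the alignment step:

```python
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

pca = PCA(n_components=10)
scores = pca.fit_transform(aligned)        # (n_frames, 10) eigenposture scores
print(pca.explained_variance_ratio_.sum())

# Heat map of the eigenposture scores over a chunk of the recording
plt.imshow(scores[:20000].T, aspect="auto", cmap="viridis")
plt.xlabel("frame"); plt.ylabel("eigenposture")
plt.show()
```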
So that's one set of features that we can use for computing a behavioral embedding; we could compute just a pose embedding. But it often helps to have some kinematics associated with those, so in addition to the instantaneous posture, we're also going to use a wavelet transform to look at local variations of this posture in time, compressed with a time-frequency transform. Here I'm just taking a wavelet transform of these eigenpostures, using the CWT function on the different score components, so just the eigenposture scores of the animal.
One small detail is that here I'm using a Mexican hat wavelet, just because the Morlet wavelet wasn't built in, but that's a minor detail; if you're doing this on your own, you'll probably want to vary it a little bit to see what type of features make the most sense for your application. But once again, if we take the wavelet transforms and concatenate them over the 10 eigenpostures, we end up with something that is about, I want to say, 150-dimensional. So we have a big 150 by a million matrix that's the wavelet components over time.
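A sketch of that wavelet step using PyWavelets, which is an assumed dependency here rather than necessarily what the notebook calls; the Mexican hat wavelet matches the talk, while the range of scales is an arbitrary choice you'd tune to your frame rate:

```python
import numpy as np
import pywt

# scores: (n_frames, 10) eigenposture scores from the PCA step
scales = np.geomspace(1, 100, 15)   # timescales to probe; tune to your frame rate

wavelet_feats = np.concatenate(
    [np.abs(pywt.cwt(scores[:, i], scales, "mexh")[0]) for i in range(scores.shape[1])],
    axis=0,
)   # (10 * 15, n_frames): the ~150-dimensional wavelet feature matrix
```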
Then we can once again run PCA on those, so we have the eigenpostures as well as the principal components of the wavelet transform of those eigenpostures, which is going to be another roughly 10 by a million matrix. So now we have a bunch of different features describing the animal's behavior on individual frames, and we can use t-SNE, or your favorite embedding approach, to visualize variations in this pose. Here I'm just going to subsample and look at frames every second.
And here I'm going to start by just looking at an embedding of the animal's pose. So I'm going to take these eigenposture scores and run t-SNE on them, here with a perplexity of 30, using two components; it'll just take a second to run. I should say that generally for t-SNE I use a perplexity of 30 if there are fewer than 100,000 samples, and more if you have larger samples, just to get the space looking nice.
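With scikit-learn that cell is essentially the following, where `scores_sub` stands for the eigenposture scores subsampled to roughly one frame per second:

```python
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

tsne = TSNE(n_components=2, perplexity=30, init="random", random_state=0)
pose_embedding = tsne.fit_transform(scores_sub)

plt.scatter(pose_embedding[:, 0], pose_embedding[:, 1], s=1)
plt.show()
```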
I guess the other detail with t-SNE is that you typically want to subsample the data appropriately before putting it in; for most applications you might want to, say, pre-run K-means on the data set and smartly sample the time points you're putting in, because with t-SNE the size of a region in space corresponds more to the number of time points you have than to the kinematic diversity, since it doesn't preserve global distances. Your mileage may vary a little there; you might want to try UMAP or something like that. But those are a couple of tricks of the trade.
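One way to do that balanced subsampling, sketched with MiniBatchKMeans; the cluster count and the samples-per-cluster cap are arbitrary choices here:

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

# features: (n_frames, n_dims) pose or wavelet features
kmeans = MiniBatchKMeans(n_clusters=500, random_state=0).fit(features)

rng = np.random.default_rng(0)
idx = []
for k in range(kmeans.n_clusters):
    members = np.where(kmeans.labels_ == k)[0]
    take = min(200, len(members))              # cap frames drawn from each cluster
    if take:
        idx.append(rng.choice(members, size=take, replace=False))
features_balanced = features[np.concatenate(idx)]
```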
Right, so if you look at the embedding, you get out something that looks like a bunch of stuff, but you can see that it's highly structured; there's a lot of small structure in this space. And if we were to visualize the time points around these, which we haven't yet worked into this Colab notebook, you would see that they correspond to similar types of behaviors. These outlying regions would probably be some types of grooming over here, maybe you'd have a large rearing cluster over there, and then typically there's a big interior portion that's various types of walking or postural-adjustment behaviors.
You can do the same thing with the wavelets: run t-SNE on the wavelet-transformed data to get another map of all of the kinematic variation. So that's going to run, and it'll look a little squishier than the pose one. And finally, maybe I'll skip running this in the interest of time, but we can get the embedding of both of these together, which will visualize not only changes in pose, but also kinematic variation across those different poses.
And this can be useful for getting at the longer-timescale organization of behavior in the data set. So this will run, and then lastly I'll give another visualization of what the annotation looks like. Following the t-SNE embedding, the next step in all of this would be to cluster the t-SNE space, using either K-means on the high-dimensional space or a watershed transform in the t-SNE space, either of which I think typically produces fairly comparable results, and then annotate the different clusters you see.
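A sketch of the watershed flavor of that clustering, in the usual style of smoothing a 2D histogram of the embedding and watershedding the inverted density; the bin count, smoothing width, and density threshold are all knobs you'd tune:

```python
import numpy as np
from scipy.ndimage import gaussian_filter
from skimage.feature import peak_local_max
from skimage.segmentation import watershed

# pose_embedding: (n_samples, 2) t-SNE coordinates
H, xedges, yedges = np.histogram2d(pose_embedding[:, 0], pose_embedding[:, 1], bins=200)
density = gaussian_filter(H, sigma=2.0)

# Seed the watershed at local maxima of the smoothed density.
peaks = peak_local_max(density, min_distance=5)
markers = np.zeros_like(density, dtype=int)
markers[tuple(peaks.T)] = np.arange(1, len(peaks) + 1)

labels = watershed(-density, markers, mask=density > 0.01 * density.max())
# Each sample can then be assigned to the watershed region its bin falls in.
```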
AUDIENCE: I have a question about t-SNE, actually. Can you go over the parameters that t-SNE accepts, and how we can fine-tune those parameters?
JESSE MARSHALL: Yeah, that's a great question. So for t-SNE, the basic parameters, I would say, are the number of components, the number of dimensions you're embedding into, which typically everyone sets to 2. And I should say a lot of this is based on work by Gordon Berman, who's now at Emory; he's done some comparisons between two and three dimensions and found they produce similar embedding results. The perplexity basically determines how many nearest neighbors it's looking at and gauging distances by. So t-SNE is really a local approach, where it's basing similarities on these 30 or 200 nearest neighbors, at least that's my intuition.
And then it's largely ignoring distances on much larger scales, so you can interpret small-scale features, but maybe not larger-scale distances. So again, I use a perplexity of 30 for smaller data sets like this, which is kind of the default, and I use more when you have a very large data set of several hundred thousand frames. There are other things that can go into t-SNE, and I don't know if they're all built into the Python version, but you can change aspects like the number of iterations or how the gradient is managed.
But I would say the bigger one is the initialization. Dmitry Kobak has done some interesting work on this; there's been a whole UMAP-versus-t-SNE debate, I don't know if it was a Twitter thing, in the single-cell community. But basically their contention is that if you initialize t-SNE with the principal components of the data, you get out reproducible results, and it's comparable to UMAP in that regard.
So that is something I actually would recommend, and I haven't done it here, but you can initialize t-SNE with PC1 versus PC2, and the advantage is that you'll get very similar t-SNE maps from run to run. There's also an approximation known as the Barnes-Hut approximation that is used in, for instance, this version of t-SNE. The original version of t-SNE was super, super slow and couldn't really be extended to, say, 10,000 or 100,000 points; there's Barnes-Hut, and there's also a Fourier-transform-based one.
I think the Fourier transform one is maybe not an approximation you necessarily need; I don't know that I or Gordon have had great results with that one, but I like Barnes-Hut, and there's a parameter you can set that trades off its speed and accuracy. But I would say these all have relatively weak effects. If you look at the dynamics t-SNE versus the pose t-SNE, they look very different, right? So the biggest thing that is going to affect the results you see is the features that you're putting in.
The other thing I should mention, the last one, is the distance metric that you use. Here we're using Euclidean distance, but you can also use, say, the Kullback-Leibler divergence between different points as your distance metric, or pick your poison, right? But, yeah, I would emphasize the feature space. What we're going to get out now is a blend between pose and dynamics, and you can also scale how much you want to weight, say, pose versus dynamics.
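Pulling those knobs together in the scikit-learn interface, as a sketch (which metrics are supported depends on your scikit-learn version):

```python
from sklearn.manifold import TSNE

tsne = TSNE(
    n_components=2,
    perplexity=30,         # increase for data sets well beyond ~100k frames
    init="pca",            # PCA initialization for run-to-run reproducibility
    method="barnes_hut",   # the Barnes-Hut approximation discussed above
    angle=0.5,             # Barnes-Hut speed/accuracy trade-off
    metric="euclidean",    # or another metric, pick your poison
    random_state=0,
)
embedding = tsne.fit_transform(features_balanced)
```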
That's the basic story, though I should mention the last thing I've learned along the way with t-SNE. I mentioned balancing before: the size of these different regions in t-SNE space is determined more by the amount of data you have than by their kinematic diversity. That would be a little different with UMAP. But t-SNE isn't going to collapse everything into one point just because the frames are all very similar; it's going to occupy a broader region. So balancing is important.
And then with t-SNE, I have not gotten good results with more than, say, a couple hundred thousand frames. There's a phenomenon, which I think is poorly understood, known as compression of the t-SNE space: if you put in a couple hundred thousand or more points, the space starts to look kind of funny. I've seen this noted in the literature, but I do not actually know why. I'm sure you could think of reasons why it would occur, but it's just a weird thing at this point.
But, yeah, so the joint embedding now has that squishy feel but a little more structure than just the dynamics one. You can see it has a different orientation from the dynamics map because of the random initialization. And I'm just going to decrease the amount of dynamics information that we feed in and try that.
Yeah, there's a question in the chat about feature selection. I would say there's a default set of features that I use, which are just the eigenpostures of the animal and their wavelet transform. There's also a repo associated with the CAPTURE project that has more code on this, which I'll just try to pull up. But, yeah, basically there's a default set of features, and I think that's going to be useful for most applications.
But if you want to get creative, you can start adding other kinematic features of, say, the animal's center of mass or things like that. Let's see, this CAPTURE demo I think has more of the features here. But I would say that, aside from eigenpostures and their wavelet transform, which again is based on this older work by Gordon Berman starting from images, there is a bit of flexibility in what you use, and it depends on whether it's separating out the things you want to separate out.
Generally, we find that for the mice, if we just use eigenpostures and their wavelet transform, we separate out grooming and rearing and various subtypes of grooming, you know, body grooming and walking and so on and so forth. So I think that's good for most applications. But if you're finding that the space doesn't have enough structure, say there are different behaviors that have different center-of-mass velocities but otherwise look similar, you might want to supplement this with other sets of features, as in the sketch below.
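If you do want to fold in something like center-of-mass velocity, it's just a matter of appending extra columns to the feature matrix before embedding; the smoothing window and the z-scoring below are arbitrary choices:

```python
import numpy as np

# com: (n_frames, 3) center-of-mass trajectory; features: (n_frames, n_dims)
com_speed = np.linalg.norm(np.gradient(com, axis=0), axis=1)         # speed per frame
com_speed = np.convolve(com_speed, np.ones(25) / 25, mode="same")    # light smoothing

# z-score so the new column is on a comparable scale to the other features
com_speed = (com_speed - com_speed.mean()) / com_speed.std()
features_plus = np.column_stack([features, com_speed])
```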
MODERATOR: Yeah, so thanks so much, Jesse. I think this is a really great resource and toolbox for everyone doing animal behavior analysis. So thank you.
JESSE MARSHALL: Yeah. Thanks, everyone, for your time. So happy to answer any more questions as well.