Identifying Subgroups in Biomedical Datasets using Data Attribution
Understanding how training data influences model predictions ("data attribution") is an active area of machine learning research. In this tutorial, we will introduce a data attribution method (datamodels: https://gradientscience.org/datamodels-1/) and explore how it can be applied in the life sciences to identify meaningful subgroups in biomedical datasets, such as disease subtypes. We will begin with a simple example from image classification (CIFAR-10), with a step-by-step guide that demonstrates how the data attribution method works in practice. Since the approach involves training thousands of lightweight classifiers, we will focus on strategies for fast and efficient model training. Next, we will explore its applications in biomedical science, with a focus on single-cell and genetic datasets, highlighting the biological insights gained from applying this computational approach. The tutorial will conclude with an interactive, hands-on session using Google Colab, where participants can apply the techniques themselves and explore the approach further. This session is designed to be accessible to participants of all coding and machine learning experience levels, whether you are new to machine learning or curious about its intersection with biomedical applications.
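To make the idea concrete, here is a minimal sketch of the datamodels recipe on synthetic data; it is not the tutorial's code. It trains many lightweight classifiers on random halves of a training set, records each one's correct-class margin on a single target example, and then regresses those margins on the subset-inclusion masks. Everything in it is an assumption made for illustration: the four Gaussian clusters, the nearest-centroid classifier standing in for the tutorial's models, the names `train_and_score`, `masks`, and `weights`, and the use of ridge regression where the datamodels work fits a sparse (L1-regularized) linear model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data (assumption): 2 classes, each made of 2 hidden subgroups.
centers = np.array([[-2.0, 2.0], [-2.0, -2.0], [2.0, 2.0], [2.0, -2.0]])
n_per = 25
subgroup = np.repeat(np.arange(4), n_per)  # hidden subgroup of each training point
labels = subgroup // 2                     # observed class label (0 or 1)
X = centers[subgroup] + rng.normal(scale=0.5, size=(4 * n_per, 2))
n_train = X.shape[0]

# Target example: a class-0 point located in hidden subgroup 0.
x_target = np.array([-2.0, 2.0])
y_target = 0


def train_and_score(mask):
    """Fit a nearest-centroid classifier on the masked subset and return
    the correct-class margin for the target example."""
    centroids = [X[mask & (labels == c)].mean(axis=0) for c in (0, 1)]
    dist = [np.sum((x_target - mu) ** 2) for mu in centroids]
    # Positive when the target is closer to its own class centroid.
    return dist[1 - y_target] - dist[y_target]


# Step 1: train many cheap models, each on a random ~50% subset.
n_models = 2000
masks = rng.random((n_models, n_train)) < 0.5
margins = np.array([train_and_score(m) for m in masks])

# Step 2: fit a linear map from inclusion masks to margins.
# Ridge regression is used here for brevity (a swap for the sparse fit).
A = masks.astype(float)
A -= A.mean(axis=0)
y = margins - margins.mean()
lam = 1.0
weights = np.linalg.solve(A.T @ A + lam * np.eye(n_train), A.T @ y)

# Step 3: inspect the weights of training points in the target's class.
print("mean weight, same subgroup as target:", weights[subgroup == 0].mean())
print("mean weight, other subgroup in class:", weights[subgroup == 1].mean())
```

Each entry of `weights` estimates how much including one training point changes the model's margin on the target. In this toy setup, training points of the target's class split by sign: those from the target's own hidden subgroup get positive weights, and those from the other subgroup get negative ones, even though the classifier never saw subgroup labels. Note that points from the opposite class can also receive large weights, so the subgroup reading here is restricted to the target's class.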
- Slides: https://drive.google.com/file/d/1qGahNYBUnThba07D2D9gZTviiU_kOedF/view?u...
- Github repository of tutorial code: https://github.com/djunamay/datamodels_tutorial
- Code with outputs: https://colab.research.google.com/drive/1u2jZzWs7SVT6kj-O8rMsUphHfvyeqnH...
- Code without outputs: https://colab.research.google.com/drive/1lwl7-Xsc7lg9bTg97hEEqPt54x-J1qe...