Existing approaches to labeling images and videos with natural-language sentences generate either a single sentence or a collection of unrelated sentences. Humans, however, produce coherent sets of sentences that reference one another and describe the salient activities and relationships depicted. In keeping with the CBMM goal of understanding human intelligence, we are devising new approaches that reproduce this behavior. The goal of this project is to produce rich descriptions of sequences of events, static and changing spatial relationships, properties of objects, and other features.