University of Texas – Austin
Learning to compose photos and videos from passive cameras.
Degree: PhD, Computer Science, 2019, University of Texas – Austin
Photo and video overload is well known to most computer users. With cameras on mobile devices, it is all too easy to snap images and videos spontaneously, yet it remains much harder to organize or search through that content later. With increasingly portable wearable and 360° computing platforms, the overload problem is only intensifying. Wearable and 360° cameras passively record everything they observe, unlike traditional cameras that require active human attention to capture images or videos.
In my thesis, I explore the idea of automatically composing photos and videos from unedited videos captured by "passive" cameras. Passive cameras (e.g., wearable cameras, 360° cameras) offer a more relaxed way to record our visual world, but they do not always capture frames that look like intentional human-taken photos. In wearable cameras, many frames will be blurry, contain poorly composed shots, and/or simply have uninteresting content. In 360° cameras, a single omni-directional image captures the entire visual world, and the photographer's intention and attention in that moment are unknown. To this end, I consider the following problems in the context of passive cameras: 1) what visual data to capture and store, 2) how to identify foreground objects, and 3) how to enhance the viewing experience.
First, I explore the problem of finding the best moments in unedited videos. Not everything observed in a wearable camera's video stream is worthy of being captured and stored. People can easily distinguish well-composed moments from accidental shots from a wearable camera. This prompts the question: can a vision system predict the best moments in unedited video? I first study how to find the best moments in terms of short video clips. My key insight is that video segments from shorter user-generated videos are more likely to be highlights than those from longer videos, since users tend to be more selective about the content when capturing shorter videos. Leveraging this insight, I introduce a novel ranking framework to learn video highlight detection from unlabeled videos. Next, I show how to predict snap points in unedited video – that is, those frames that look like intentionally taken photos. I propose a framework to detect snap points that requires no human annotations. The main idea is to construct a generative model of what human-taken photos look like by sampling images posted on the Web. Snapshots that people upload to share publicly online may vary vastly in their content, yet all share the key facet that they were intentional snap point moments. This makes them an ideal source of positive exemplars for our target learning problem. In both settings, despite learning without any explicit labels, my proposed models outperform discriminative baselines trained with labeled data.
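The duration-based ranking idea above can be illustrated with a small sketch. This is not the thesis's actual model: it assumes a linear scorer, a pairwise hinge loss, and synthetic 4-dimensional segment features, and the names `pairwise_hinge_loss` and `train_ranker` are hypothetical. The only supervision is whether a segment came from a short or a long video, which serves as a noisy proxy for "highlight" versus "non-highlight".

```python
import numpy as np


def pairwise_hinge_loss(w, short_feats, long_feats, margin=1.0):
    """Mean hinge loss over all (short, long) segment pairs for a linear scorer."""
    s_short = short_feats @ w  # scores of segments from short videos
    s_long = long_feats @ w    # scores of segments from long videos
    # Each short-video segment should outscore each long-video segment by `margin`.
    gaps = margin - (s_short[:, None] - s_long[None, :])
    return np.maximum(gaps, 0.0).mean()


def train_ranker(short_feats, long_feats, lr=0.05, epochs=200, margin=1.0):
    """Fit the linear scorer by subgradient descent on the pairwise hinge loss."""
    w = np.zeros(short_feats.shape[1])
    n_pairs = short_feats.shape[0] * long_feats.shape[0]
    for _ in range(epochs):
        gaps = margin - (short_feats @ w)[:, None] + (long_feats @ w)[None, :]
        active = gaps > 0  # pairs that still violate the margin
        # Subgradient of the mean hinge loss: each active pair contributes
        # -(x_short - x_long).
        grad = -(active.sum(axis=1) @ short_feats
                 - active.sum(axis=0) @ long_feats) / n_pairs
        w -= lr * grad
    return w


if __name__ == "__main__":
    # Synthetic stand-in features (an assumption): segments from short videos
    # are drawn from a slightly shifted distribution.
    rng = np.random.default_rng(0)
    short_feats = rng.normal(0.5, 1.0, size=(50, 4))
    long_feats = rng.normal(-0.5, 1.0, size=(50, 4))

    w = train_ranker(short_feats, long_feats)
    print("loss before training:", pairwise_hinge_loss(np.zeros(4), short_feats, long_feats))
    print("loss after training: ", pairwise_hinge_loss(w, short_feats, long_feats))
```

At test time, such a scorer would be applied to every segment of a new unedited video and the top-scoring segments kept as candidate highlights. Because video duration is only a noisy label, a practical system would likely need a loss that tolerates mislabeled pairs; the plain hinge loss here is the simplest choice for illustration.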
Next, I introduce a novel approach to automatically segment foreground objects in images and videos. Identifying key objects is an important intermediate step for automatic photo composition. It is also a prerequisite in graphics…
Advisors/Committee Members: Grauman, Kristen Lorraine, 1979- (advisor), Hays, James (committee member), Huang, Qixing (committee member), Niekum, Scott (committee member).
Subjects/Keywords: Passive cameras; Video highlight detection; Snap point detection; Image segmentation; Video segmentation; Viewing panoramas
APA (6th Edition):
Xiong, B. (2019). Learning to compose photos and videos from passive cameras. (Doctoral Dissertation). University of Texas – Austin. Retrieved from http://dx.doi.org/10.26153/tsw/5847
Chicago Manual of Style (16th Edition):
Xiong, Bo. “Learning to compose photos and videos from passive cameras.” 2019. Doctoral Dissertation, University of Texas – Austin. Accessed August 03, 2020.
MLA Handbook (7th Edition):
Xiong, Bo. “Learning to compose photos and videos from passive cameras.” 2019. Web. 03 Aug 2020.
Xiong B. Learning to compose photos and videos from passive cameras. [Internet] [Doctoral dissertation]. University of Texas – Austin; 2019. [cited 2020 Aug 03]. Available from: http://dx.doi.org/10.26153/tsw/5847.
Council of Science Editors:
Xiong B. Learning to compose photos and videos from passive cameras. [Doctoral Dissertation]. University of Texas – Austin; 2019. Available from: http://dx.doi.org/10.26153/tsw/5847