You searched for +publisher:"Georgia Tech" +contributor:("Batra, Dhruv").
Showing records 1 – 18 of 18 total matches.
No search limiters apply to these results.

Georgia Tech
1.
Chattopadhyay, Prithvijit.
Evaluating visual conversational agents via cooperative human-AI games.
Degree: MS, Computer Science, 2019, Georgia Tech
URL: http://hdl.handle.net/1853/61308
▼ As AI continues to advance, human-AI teams are inevitable. However, progress in AI is routinely measured in isolation, without a human in the loop. It is crucial to benchmark progress in AI, not just in isolation, but also in terms of how it translates to helping humans perform certain tasks, i.e., the performance of human-AI teams. This thesis introduces a cooperative game – GuessWhich – to measure human-AI team performance in the specific context of the AI being a visual conversational agent. GuessWhich involves live interaction between the human and the AI. The AI, which we call Alice, is provided an image which is unseen by the human. Following a brief description of the image, the human questions Alice about this secret image to identify it from a fixed pool of images. We measure performance of the human-Alice team by the number of guesses it takes the human to correctly identify the secret image after a fixed number of dialog rounds with Alice. We compare performance of the human-Alice teams for two versions of Alice. Our human studies suggest a counter-intuitive trend – that while AI literature shows that one version outperforms the other when paired with an AI questioner bot, we find that this improvement in AI-AI performance does not translate to improved human-AI performance. As this implies a mismatch between benchmarking of AI in isolation and in the context of human-AI teams, this thesis further motivates the need to evaluate AI additionally in the latter setting to effectively leverage the progress in AI for efficient human-AI teams.
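The team metric described above reduces to counting how many guesses the human needs to find the secret image. A minimal sketch, with made-up game outcomes and hypothetical names for the two Alice variants (they are not taken from the thesis):

```python
# Illustrative sketch only: score a GuessWhich-style game by the number of
# guesses needed to identify the secret image; lower is better.
from statistics import mean

def guesses_needed(ranked_pool, secret_image):
    """How many guesses it takes when the human guesses down their ranked pool."""
    return ranked_pool.index(secret_image) + 1

# Hypothetical results for two versions of Alice over the same set of games.
games_sl = [["img3", "img1", "img7"], ["img2", "img5", "img9"]]
games_rl = [["img1", "img3", "img7"], ["img5", "img2", "img9"]]
secrets = ["img1", "img5"]

mean_sl = mean(guesses_needed(pool, s) for pool, s in zip(games_sl, secrets))
mean_rl = mean(guesses_needed(pool, s) for pool, s in zip(games_rl, secrets))
print(f"mean guesses, Alice-SL: {mean_sl:.1f}  Alice-RL: {mean_rl:.1f}")
```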
Advisors/Committee Members: Parikh, Devi (advisor), Batra, Dhruv (advisor), Lee, Stefan (advisor).
Subjects/Keywords: Visual conversational agents; Visual dialog; Human-AI teams; Reinforcement learning; Machine learning; Computer vision; Artificial intelligence
APA (6th Edition):
Chattopadhyay, P. (2019). Evaluating visual conversational agents via cooperative human-AI games. (Masters Thesis). Georgia Tech. Retrieved from http://hdl.handle.net/1853/61308
Chicago Manual of Style (16th Edition):
Chattopadhyay, Prithvijit. “Evaluating visual conversational agents via cooperative human-AI games.” 2019. Masters Thesis, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/61308.
MLA Handbook (7th Edition):
Chattopadhyay, Prithvijit. “Evaluating visual conversational agents via cooperative human-AI games.” 2019. Web. 16 Apr 2021.
Vancouver:
Chattopadhyay P. Evaluating visual conversational agents via cooperative human-AI games. [Internet] [Masters thesis]. Georgia Tech; 2019. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/61308.
Council of Science Editors:
Chattopadhyay P. Evaluating visual conversational agents via cooperative human-AI games. [Masters Thesis]. Georgia Tech; 2019. Available from: http://hdl.handle.net/1853/61308
2.
Deshraj.
EvalAI: Evaluating AI systems at scale.
Degree: MS, Computer Science, 2018, Georgia Tech
URL: http://hdl.handle.net/1853/60738
▼ Artificial Intelligence research has progressed tremendously in the last few years. Several new multi-modal datasets and tasks have been introduced, making it much harder to compare new algorithms with existing ones. To solve this problem, this thesis introduces EvalAI, an open source platform for evaluating and comparing machine learning and artificial intelligence algorithms at scale. This platform is built to provide an open source, standardized, scalable solution for evaluating learned models using automatic metrics as well as with human-in-the-loop evaluation. By simplifying and standardizing the process of benchmarking, EvalAI seeks to lower the barrier to entry for participating in the global scientific effort to push the frontiers of machine learning and artificial intelligence, increasing the rate of measurable progress in these communities.
Advisors/Committee Members: Batra, Dhruv (advisor), Parikh, Devi (advisor), Lee, Stefan (advisor).
Subjects/Keywords: Machine learning; Artificial intelligence; Evalai; Deep learning; Computer vision; Reinforcement learning; Systems; Scale; Data science; Kaggle
APA (6th Edition):
Deshraj. (2018). EvalAI: Evaluating AI systems at scale. (Masters Thesis). Georgia Tech. Retrieved from http://hdl.handle.net/1853/60738
Note: this citation may be lacking information needed for this citation format:
Author name may be incomplete
Chicago Manual of Style (16th Edition):
Deshraj. “EvalAI: Evaluating AI systems at scale.” 2018. Masters Thesis, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/60738.
Note: this citation may be lacking information needed for this citation format:
Author name may be incomplete
MLA Handbook (7th Edition):
Deshraj. “EvalAI: Evaluating AI systems at scale.” 2018. Web. 16 Apr 2021.
Note: this citation may be lacking information needed for this citation format:
Author name may be incomplete
Vancouver:
Deshraj. EvalAI: Evaluating AI systems at scale. [Internet] [Masters thesis]. Georgia Tech; 2018. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/60738.
Note: this citation may be lacking information needed for this citation format:
Author name may be incomplete
Council of Science Editors:
Deshraj. EvalAI: Evaluating AI systems at scale. [Masters Thesis]. Georgia Tech; 2018. Available from: http://hdl.handle.net/1853/60738
Note: this citation may be lacking information needed for this citation format:
Author name may be incomplete

Georgia Tech
3.
Prabhu, Viraj Uday.
Few-shot learning for dermatological disease diagnosis.
Degree: MS, Computer Science, 2019, Georgia Tech
URL: http://hdl.handle.net/1853/61296
▼ In this thesis, we consider the problem of clinical image classification for the purpose of aiding doctors in dermatological disease diagnosis. Diagnosis of dermatological disease conditions from images poses two major challenges for standard off-the-shelf techniques: First, the distribution of real-world dermatological datasets is typically long-tailed. Second, intra-class variability is large. To address the first issue, we formulate the problem as low-shot learning, where once deployed, a base classifier can rapidly generalize to diagnose novel conditions given very few labeled examples. To model intra-class variability effectively, we propose Prototypical Clustering Networks (PCN), an extension to Prototypical Networks that learns a mixture of "prototypes" for each class. Prototypes are initialized for each class via clustering and refined via an online update scheme. Classification is performed by measuring similarity to a weighted combination of prototypes within a class, where the weights are the inferred cluster responsibilities. We demonstrate the strengths of our approach in effective diagnosis on a realistic dataset of dermatological conditions.
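The PCN scoring rule quoted above can be sketched in a few lines. This is an illustrative reconstruction, not the thesis code; the softmax-over-distance responsibilities, class names, and dimensions are assumptions:

```python
# Minimal numpy sketch: each class keeps several prototypes, and a query is
# scored against a responsibility-weighted combination of them.
import numpy as np

def pcn_class_score(query, prototypes):
    """query: (d,) embedding; prototypes: (k, d) prototypes for one class.
    Returns a negative squared distance, higher means more likely."""
    d2 = np.sum((prototypes - query) ** 2, axis=1)   # (k,) squared distances
    resp = np.exp(-d2) / np.exp(-d2).sum()           # assumed cluster responsibilities
    combined = resp @ prototypes                     # weighted prototype
    return -np.sum((combined - query) ** 2)

rng = np.random.default_rng(0)
query = rng.normal(size=8)
class_protos = {c: rng.normal(size=(3, 8)) for c in ["eczema", "psoriasis"]}
pred = max(class_protos, key=lambda c: pcn_class_score(query, class_protos[c]))
print("predicted condition:", pred)
```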
Advisors/Committee Members: Parikh, Devi (advisor), Batra, Dhruv (committee member), Lee, Stefan (committee member).
Subjects/Keywords: Image classification; Low shot learning; Automated diagnosis
APA (6th Edition):
Prabhu, V. U. (2019). Few-shot learning for dermatological disease diagnosis. (Masters Thesis). Georgia Tech. Retrieved from http://hdl.handle.net/1853/61296
Chicago Manual of Style (16th Edition):
Prabhu, Viraj Uday. “Few-shot learning for dermatological disease diagnosis.” 2019. Masters Thesis, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/61296.
MLA Handbook (7th Edition):
Prabhu, Viraj Uday. “Few-shot learning for dermatological disease diagnosis.” 2019. Web. 16 Apr 2021.
Vancouver:
Prabhu VU. Few-shot learning for dermatological disease diagnosis. [Internet] [Masters thesis]. Georgia Tech; 2019. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/61296.
Council of Science Editors:
Prabhu VU. Few-shot learning for dermatological disease diagnosis. [Masters Thesis]. Georgia Tech; 2019. Available from: http://hdl.handle.net/1853/61296

Georgia Tech
4.
Tendulkar, Purva Milind.
Computational methods for creative inspiration in thematic typography and dance.
Degree: MS, Computer Science, 2020, Georgia Tech
URL: http://hdl.handle.net/1853/63699
▼ As progress in technology continues, there is a need to adapt and upscale tools used in artistic and creative processes. This can take the form of generative tools that provide inspiration to artists, human-AI co-creative tools, or tools that can understand and automate time-consuming labor so that artists can focus on the creative side of their art. This thesis aims to address two of these challenges: generating tools for inspiration and automating labor-intensive, tedious work. We approach this by attempting to create interesting art by combining the best of what humans are naturally good at – heuristics of 'good' art that an audience might find appealing – and what machines are good at – optimizing well-defined objective functions. Specifically, we introduce two tasks – 1) artistic typography given an input word and theme, and 2) dance generation given any input music. We evaluate our approaches on both these tasks and show that humans find the results generated by our approaches more creative compared to meaningful baselines. The comments received from participants in our studies reveal that they found our tasks fun and intriguing. This further motivates us to push research towards using technology for creative applications.
Advisors/Committee Members: Parikh, Devi (advisor), Batra, Dhruv (committee member), Riedl, Mark (committee member).
Subjects/Keywords: Creativity; Human studies; Typography; Dance; Music; AI; Computer vision
APA (6th Edition):
Tendulkar, P. M. (2020). Computational methods for creative inspiration in thematic typography and dance. (Masters Thesis). Georgia Tech. Retrieved from http://hdl.handle.net/1853/63699
Chicago Manual of Style (16th Edition):
Tendulkar, Purva Milind. “Computational methods for creative inspiration in thematic typography and dance.” 2020. Masters Thesis, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/63699.
MLA Handbook (7th Edition):
Tendulkar, Purva Milind. “Computational methods for creative inspiration in thematic typography and dance.” 2020. Web. 16 Apr 2021.
Vancouver:
Tendulkar PM. Computational methods for creative inspiration in thematic typography and dance. [Internet] [Masters thesis]. Georgia Tech; 2020. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/63699.
Council of Science Editors:
Tendulkar PM. Computational methods for creative inspiration in thematic typography and dance. [Masters Thesis]. Georgia Tech; 2020. Available from: http://hdl.handle.net/1853/63699

Georgia Tech
5.
Raval, Ananya.
Generation of Linux commands using natural language descriptions.
Degree: MS, Computer Science, 2018, Georgia Tech
URL: http://hdl.handle.net/1853/59849
▼ Translating natural language into source code or programs is an important problem in natural language understanding – both in terms of practical applications and in terms of understanding usage of language to affect action. In this domain, we consider the problem of translating natural language descriptions of Linux commands into the corresponding commands. This is useful from the point of view of users who want to get commands executed but lack expertise to come up with them on the bash terminal. The major contribution of this thesis is a parallel corpus for translating natural language into Linux commands. The corpus contains 4561 unique commands and 3-4 descriptions for each command, making a total of 11177 pairs. Along with the corpus, simple classification settings using Support Vector Machines and translation settings using Sequence to Sequence Recurrent Neural Network based models are studied to provide benchmarks for machine learning model performance on the collected dataset. This document provides analysis of the collected dataset, and describes the results and findings from models trained on the dataset.
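A rough idea of the SVM classification baseline mentioned above, treating each unique command as a class over TF-IDF features of its description; the tiny corpus below is illustrative and not the released dataset:

```python
# Hedged sketch of an SVM classification baseline for description -> command.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for the parallel corpus described in the abstract.
descriptions = [
    "list all files including hidden ones",
    "show the current working directory",
    "count the lines in a file",
]
commands = ["ls -a", "pwd", "wc -l"]

model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(descriptions, commands)
print(model.predict(["print the directory I am in"]))
```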
Advisors/Committee Members: Parikh, Devi (advisor), Batra, Dhruv (committee member), Chau, Duen Horng (Polo) (committee member).
Subjects/Keywords: Natural language processing; Program synthesis; Neural machine translation
APA (6th Edition):
Raval, A. (2018). Generation of Linux commands using natural language descriptions. (Masters Thesis). Georgia Tech. Retrieved from http://hdl.handle.net/1853/59849
Chicago Manual of Style (16th Edition):
Raval, Ananya. “Generation of Linux commands using natural language descriptions.” 2018. Masters Thesis, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/59849.
MLA Handbook (7th Edition):
Raval, Ananya. “Generation of Linux commands using natural language descriptions.” 2018. Web. 16 Apr 2021.
Vancouver:
Raval A. Generation of Linux commands using natural language descriptions. [Internet] [Masters thesis]. Georgia Tech; 2018. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/59849.
Council of Science Editors:
Raval A. Generation of Linux commands using natural language descriptions. [Masters Thesis]. Georgia Tech; 2018. Available from: http://hdl.handle.net/1853/59849

Georgia Tech
6.
Agrawal, Aishwarya.
Visual question answering and beyond.
Degree: PhD, Interactive Computing, 2019, Georgia Tech
URL: http://hdl.handle.net/1853/62277
▼ In this dissertation, I propose and study a multi-modal Artificial Intelligence (AI) task called Visual Question Answering (VQA) – given an image and a natural language question about the image (e.g., "What kind of store is this?", "Is it safe to cross the street?"), the machine's task is to automatically produce an accurate natural language answer ("bakery", "yes"). Applications of VQA include – aiding visually impaired users in understanding their surroundings, aiding analysts in examining large quantities of surveillance data, teaching children through interactive demos, interacting with personal AI assistants, and making visual social media content more accessible. Specifically, I study the following – 1) how to create a large-scale dataset and define evaluation metrics for free-form and open-ended VQA, 2) how to develop techniques for characterizing the behavior of VQA models, and 3) how to build VQA models that are less driven by language biases in training data and are more visually grounded, by proposing –
a) a new evaluation protocol,
b) a new model architecture, and
c) a novel objective function.
Most of my past work has been towards building agents that can "see" and "talk". However, for a lot of practical applications (e.g., physical agents navigating inside our houses executing natural language commands) we need agents that can not only "see" and "talk" but can also take actions. In chapter 6, I present future directions towards generalizing vision and language agents to be able to take actions.
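A hedged sketch of the consensus-style accuracy commonly used for open-ended VQA, where an answer counts as fully correct if enough annotators gave it; the threshold of three agreeing annotators follows the public VQA benchmark and is a simplification, not necessarily the exact protocol of the dissertation:

```python
# Consensus-style VQA accuracy: full credit if at least three annotators agree.
def vqa_accuracy(predicted, human_answers):
    agreeing = sum(a == predicted for a in human_answers)
    return min(agreeing / 3.0, 1.0)

print(vqa_accuracy("bakery", ["bakery", "bakery", "bread shop",
                              "bakery", "pastry shop"]))   # -> 1.0
print(vqa_accuracy("yes", ["no", "yes", "no", "no", "no"]))  # -> 0.333...
```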
Advisors/Committee Members: Batra, Dhruv (advisor), Parikh, Devi (committee member), Hays, James (committee member), Zitnick, C. Lawrence (committee member), Vinyals, Oriol (committee member).
Subjects/Keywords: Visual question answering; Deep learning; Computer vision; Natural language processing; Machine learning
APA (6th Edition):
Agrawal, A. (2019). Visual question answering and beyond. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/62277
Chicago Manual of Style (16th Edition):
Agrawal, Aishwarya. “Visual question answering and beyond.” 2019. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/62277.
MLA Handbook (7th Edition):
Agrawal, Aishwarya. “Visual question answering and beyond.” 2019. Web. 16 Apr 2021.
Vancouver:
Agrawal A. Visual question answering and beyond. [Internet] [Doctoral dissertation]. Georgia Tech; 2019. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/62277.
Council of Science Editors:
Agrawal A. Visual question answering and beyond. [Doctoral Dissertation]. Georgia Tech; 2019. Available from: http://hdl.handle.net/1853/62277

Georgia Tech
7.
Yang, Jianwei.
Structured visual understanding, generation and reasoning.
Degree: PhD, Interactive Computing, 2020, Georgia Tech
URL: http://hdl.handle.net/1853/62744
▼ The world around us is highly structured. In the real world, a single object usually consists of multiple components organized in some structures (e.g., a person has different body parts), and multiple objects usually exist in a scene and interact with each other in predictable ways (e.g., man playing basketball). This structure manifests itself in the visual data that captures the world around us and in the text describing it and thus can potentially provide a strong inductive bias to various vision tasks. In this thesis, we focus on exploiting the structures existing in visual data to improve visual understanding, generation and reasoning. Specifically, for visual understanding, we model structure at different levels to improve image classification, scene graph generation and representation learning. In visual generation, we exploit the foreground-background structure in images to generate images in a layer-wise manner to reduce blending artifacts between foreground and background. Finally, we use the structured visual representations as the intermediate interface to bridge visual perception and reasoning to address different vision and language tasks, including image captioning and visual question generation. Through extensive experiments, we demonstrate that leveraging structure in visual data can not only improve the model performance, but also make vision and language models more grounded and interpretable.
Advisors/Committee Members: Parikh, Devi (advisor), Batra, Dhruv (committee member), Crandall, David (committee member), Lee, Stefan (committee member), Hoffman, Judy (committee member).
Subjects/Keywords: Scene graph; Structured visual understanding; Visual generation; Reasoning; Vision and language
APA (6th Edition):
Yang, J. (2020). Structured visual understanding, generation and reasoning. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/62744
Chicago Manual of Style (16th Edition):
Yang, Jianwei. “Structured visual understanding, generation and reasoning.” 2020. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/62744.
MLA Handbook (7th Edition):
Yang, Jianwei. “Structured visual understanding, generation and reasoning.” 2020. Web. 16 Apr 2021.
Vancouver:
Yang J. Structured visual understanding, generation and reasoning. [Internet] [Doctoral dissertation]. Georgia Tech; 2020. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/62744.
Council of Science Editors:
Yang J. Structured visual understanding, generation and reasoning. [Doctoral Dissertation]. Georgia Tech; 2020. Available from: http://hdl.handle.net/1853/62744

Georgia Tech
8.
Lu, Jiasen.
Visually grounded language understanding and generation.
Degree: PhD, Computer Science, 2020, Georgia Tech
URL: http://hdl.handle.net/1853/62745
▼ The world around us involves multiple modalities – we see objects, feel texture, hear sounds, smell odors and so on. In order for Artificial Intelligence (AI) to make progress in understanding the world around us, it needs to be able to interpret and reason about multiple modalities. In this thesis, I take steps towards studying how inducing appropriate grounding in deep models improves multi-modal AI capabilities, in the context of vision and language. Specifically, I cover these four tasks: visual question answering, neural image captioning, visual dialog and vision and language pretraining. In visual question answering, we collected a large scale visual question answering dataset and I study various baselines to benchmark these tasks. To jointly reason about image and question, I propose a novel co-attention mechanism that can learn fine-grained grounding to answer the question. In image captioning, I address the model designs for grounded caption generation of an image. A key focus is to extend the model with the ability to know when to look at the image when generating each word. For the words which have explicit visual correspondence, we further propose a novel approach that reconciles classical slot filling approaches with modern neural captioning approaches. As a result, our model can produce natural language explicitly grounded in entities that object detectors find in the image. In visual dialog, I study both sides of the visual dialog agents – questioner and answerer. For modeling the answerer, which answers visual questions in dialog, I introduce a novel discriminant perceptual loss that transfers knowledge from a discriminative model to a generative model. For modeling the questioner, I consider an image guessing game as a test-bed for balancing task performance and language drift. I propose a Dialog without Dialog task, which requires agents to generalize from single round visual question generation with full supervision to a multi-round dialog-based image guessing game without direct language supervision. The proposed visually grounded dialog models can adapt to new tasks while exhibiting less linguistic drift. In vision and language pretraining, I study more general models that can learn visual groundings from massive meta-data on the internet. I also explore multi-task vision and language representation learning. Our results not only show that a single model can perform all 12 vision and language tasks, but also that joint training can lead to improvements in task metrics compared to single-task training with the same architecture. Through this work, I demonstrate that inducing appropriate grounding in deep models improves multi-modal AI capabilities. Finally, I briefly discuss the challenges in this domain and the extensions of recent works.
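One step of a parallel co-attention computation in the spirit of the mechanism mentioned above; the shapes, the bilinear weight matrix, and the max-pooling over affinities are illustrative assumptions, not the thesis model:

```python
# Numpy sketch: word-region affinities, then attention over regions and words.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
Q = rng.normal(size=(12, 64))    # 12 question-word features (stand-ins)
V = rng.normal(size=(49, 64))    # 49 image-region features (7x7 grid, stand-ins)
W = rng.normal(size=(64, 64))    # "learned" bilinear affinity weights (random here)

C = np.tanh(Q @ W @ V.T)                  # (12, 49) word-region affinities
att_regions = softmax(C.max(axis=0))      # attention over image regions
att_words = softmax(C.max(axis=1))        # attention over question words
v_att = att_regions @ V                   # attended image feature (64,)
q_att = att_words @ Q                     # attended question feature (64,)
print(v_att.shape, q_att.shape)
```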
Advisors/Committee Members: Parikh, Devi (advisor), Batra, Dhruv (advisor), Corso, Jason J. (advisor), Riedl, Mark Owen (advisor), Hoffman, Judy (advisor).
Subjects/Keywords: Computer vision; Natural language processing; Visual question answering; Multi-task learning; Deep learning
APA (6th Edition):
Lu, J. (2020). Visually grounded language understanding and generation. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/62745
Chicago Manual of Style (16th Edition):
Lu, Jiasen. “Visually grounded language understanding and generation.” 2020. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/62745.
MLA Handbook (7th Edition):
Lu, Jiasen. “Visually grounded language understanding and generation.” 2020. Web. 16 Apr 2021.
Vancouver:
Lu J. Visually grounded language understanding and generation. [Internet] [Doctoral dissertation]. Georgia Tech; 2020. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/62745.
Council of Science Editors:
Lu J. Visually grounded language understanding and generation. [Doctoral Dissertation]. Georgia Tech; 2020. Available from: http://hdl.handle.net/1853/62745

Georgia Tech
9.
Das, Abhishek.
Building agents that can see, talk, and act.
Degree: PhD, Interactive Computing, 2020, Georgia Tech
URL: http://hdl.handle.net/1853/62768
▼ A long-term goal in AI is to build general-purpose intelligent agents that simultaneously possess the ability to perceive the rich visual environment around us (through vision, audition, or other sensors), reason and infer from perception in an interpretable and actionable manner, communicate this understanding to humans and other agents (e.g., hold a natural language dialog grounded in the environment), and act on this understanding in physical worlds (e.g., aid humans by executing commands in an embodied environment). To be able to make progress towards this grand goal, we must explore new multimodal AI tasks, move from datasets to physical environments, and build new kinds of models. In this dissertation, we combine insights from different areas of AI – computer vision, language understanding, reinforcement learning – and present steps to connect the underlying domains of vision and language to actions towards such general-purpose agents. In Part 1, we develop agents that can see and talk – capable of holding free-form conversations about images – and reinforcement learning-based algorithms to train these visual dialog agents via self-play. In Part 2, we extend our focus to agents that can see, talk, and act – embodied agents that can actively perceive and navigate in partially-observable simulated environments, to accomplish tasks such as question-answering. In Part 3, we devise techniques for training populations of agents that can communicate with each other, to coordinate, strategize, and utilize their combined sensory experiences and act in the physical world. These agents learn both what messages to send and who to communicate with, solely from downstream reward without any communication supervision. Finally, in Part 4, we use question-answering as a task-agnostic probe to ask a self-supervised embodied agent what it knows about its physical world, and use it to quantify differences in visual representations agents develop when trained with different auxiliary objectives.
Advisors/Committee Members: Batra, Dhruv (advisor), Parikh, Devi (committee member), Hays, James (committee member), Pineau, Joelle (committee member), Malik, Jitendra (committee member).
Subjects/Keywords: Computer vision; Natural language processing; Machine learning; Embodiment
APA (6th Edition):
Das, A. (2020). Building agents that can see, talk, and act. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/62768
Chicago Manual of Style (16th Edition):
Das, Abhishek. “Building agents that can see, talk, and act.” 2020. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/62768.
MLA Handbook (7th Edition):
Das, Abhishek. “Building agents that can see, talk, and act.” 2020. Web. 16 Apr 2021.
Vancouver:
Das A. Building agents that can see, talk, and act. [Internet] [Doctoral dissertation]. Georgia Tech; 2020. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/62768.
Council of Science Editors:
Das A. Building agents that can see, talk, and act. [Doctoral Dissertation]. Georgia Tech; 2020. Available from: http://hdl.handle.net/1853/62768

Georgia Tech
10.
Cogswell, Michael Andrew.
Disentangling neural network representations for improved generalization.
Degree: PhD, Interactive Computing, 2020, Georgia Tech
URL: http://hdl.handle.net/1853/62813
▼ Despite the increasingly broad perceptual capabilities of neural networks, applying them to new tasks requires significant engineering effort in data collection and model design. Generally, inductive biases can make this process easier by leveraging knowledge about the world to guide neural network design. One such inductive bias is disentanglement, which can help prevent neural networks from learning representations that capture spurious patterns that do not generalize past the training data, and instead encourage them to capture factors of variation that explain the data generally. In this thesis we identify three kinds of disentanglement, implement a strategy for enforcing disentanglement in each case, and show that more general representations result. These perspectives treat disentanglement as statistical independence of features in image classification, language compositionality in goal driven dialog, and latent intention priors in visual dialog. By increasing the generality of neural networks through disentanglement we hope to reduce the effort required to apply neural networks to new tasks and highlight the role of inductive biases like disentanglement in neural network design.
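One common way to encourage statistically independent features, in the spirit of the first perspective above, is to penalize the off-diagonal entries of the batch covariance of hidden activations (a DeCov-style regularizer); the sketch below is illustrative and not necessarily the thesis's exact loss:

```python
# Covariance penalty on a batch of activations; synthetic data for illustration.
import numpy as np

def decorrelation_penalty(h):
    """h: (batch, features) hidden activations."""
    centered = h - h.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / h.shape[0]       # (features, features)
    off_diag = cov - np.diag(np.diag(cov))         # keep only cross-feature terms
    return 0.5 * np.sum(off_diag ** 2)

h = np.random.default_rng(0).normal(size=(32, 16))
print("penalty:", decorrelation_penalty(h))
```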
Advisors/Committee Members: Batra, Dhruv (advisor), Parikh, Devi (committee member), Hays, James (committee member), Goel, Ashok (committee member), Lee, Stefan (committee member).
Subjects/Keywords: Deep learning; Disentanglement; Compositionality; Representation learning; Visual dialog; Language emergence
APA (6th Edition):
Cogswell, M. A. (2020). Disentangling neural network representations for improved generalization. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/62813
Chicago Manual of Style (16th Edition):
Cogswell, Michael Andrew. “Disentangling neural network representations for improved generalization.” 2020. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/62813.
MLA Handbook (7th Edition):
Cogswell, Michael Andrew. “Disentangling neural network representations for improved generalization.” 2020. Web. 16 Apr 2021.
Vancouver:
Cogswell MA. Disentangling neural network representations for improved generalization. [Internet] [Doctoral dissertation]. Georgia Tech; 2020. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/62813.
Council of Science Editors:
Cogswell MA. Disentangling neural network representations for improved generalization. [Doctoral Dissertation]. Georgia Tech; 2020. Available from: http://hdl.handle.net/1853/62813

Georgia Tech
11.
Hsu, Yen-Chang.
Learning from pairwise similarity for visual categorization.
Degree: PhD, Electrical and Computer Engineering, 2020, Georgia Tech
URL: http://hdl.handle.net/1853/62814
▼ Learning high-capacity machine learning models for perception, especially for high-dimensional inputs such as in computer vision, requires a large amount of human-annotated data. Many efforts have been made to construct such large-scale, annotated datasets. However, there are not many options for transferring knowledge from those datasets to other tasks with different categories, limiting the value of these efforts. While one common option for transfer is reusing a learned feature representation, other options for reusing supervision across tasks are generally not considered due to the tight association between labels and tasks. This thesis proposes to use an intermediate form of supervision, pairwise similarity, for enabling the transferability of supervision across different categorization tasks that have different sets of classes. We show that pairwise similarity, defined as whether two pieces of data have the same semantic meaning or not, is sufficient as the primary supervision for learning categorization problems such as clustering and classification. We investigate this idea by answering two transfer learning questions: how and when to transfer. We develop two loss functions for answering how to transfer and show the same framework can support supervised, unsupervised, and semi-supervised learning paradigms, demonstrating better performance over previous methods. This result makes discovering unseen categories in unlabeled data possible by transferring a learned pairwise similarity prediction function. Additionally, we provide a decomposed confidence strategy for answering when to transfer, achieving state-of-the-art results on out-of-distribution data detection. Lastly, we apply our loss function to the application of instance segmentation, demonstrating the scalability of our method in utilizing pairwise similarity within a real-world problem.
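A minimal sketch of learning from pairwise similarity as described above: score how likely two samples share a class by the inner product of their predicted class distributions and compare it against a binary same/different label; this is an illustrative formulation, not the dissertation's exact objective:

```python
# Pairwise-similarity loss over predicted class distributions; numbers are made up.
import numpy as np

def pairwise_similarity_loss(p_i, p_j, same, eps=1e-7):
    """p_i, p_j: predicted class-probability vectors; same: 1 if same class else 0."""
    s = float(np.dot(p_i, p_j))          # predicted probability the pair matches
    s = np.clip(s, eps, 1.0 - eps)
    return -(same * np.log(s) + (1 - same) * np.log(1.0 - s))

p_a = np.array([0.9, 0.05, 0.05])
p_b = np.array([0.8, 0.1, 0.1])
p_c = np.array([0.1, 0.8, 0.1])
print(pairwise_similarity_loss(p_a, p_b, same=1))   # small loss: agree, labeled same
print(pairwise_similarity_loss(p_a, p_c, same=0))   # small loss: differ, labeled different
```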
Advisors/Committee Members: Kira, Zsolt (advisor), Vela, Patricio (committee member), Batra, Dhruv (committee member), Hoffman, Judy (committee member), Odom, Phillip (committee member).
Subjects/Keywords: Transfer learning; Pairwise similarity; Clustering; Deep learning; Neural networks; Classification; Out-of-distribution detection; Instance segmentation; Lane detection
APA (6th Edition):
Hsu, Y. (2020). Learning from pairwise similarity for visual categorization. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/62814
Chicago Manual of Style (16th Edition):
Hsu, Yen-Chang. “Learning from pairwise similarity for visual categorization.” 2020. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/62814.
MLA Handbook (7th Edition):
Hsu, Yen-Chang. “Learning from pairwise similarity for visual categorization.” 2020. Web. 16 Apr 2021.
Vancouver:
Hsu Y. Learning from pairwise similarity for visual categorization. [Internet] [Doctoral dissertation]. Georgia Tech; 2020. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/62814.
Council of Science Editors:
Hsu Y. Learning from pairwise similarity for visual categorization. [Doctoral Dissertation]. Georgia Tech; 2020. Available from: http://hdl.handle.net/1853/62814

Georgia Tech
12.
Ramasamy Selvaraju, Ramprasaath.
Explaining model decisions and fixing them via human feedback.
Degree: PhD, Computer Science, 2020, Georgia Tech
URL: http://hdl.handle.net/1853/62867
▼ Deep networks have enabled unprecedented breakthroughs in a variety of computer vision tasks. While these models enable superior performance, their increasing complexity and lack of decomposability into individually intuitive components make them hard to interpret. Consequently, when today’s intelligent systems fail, they fail spectacularly, disgracefully, giving no warning or explanation. Towards the goal of making deep networks interpretable, trustworthy and unbiased, in this dissertation, we will present my work on building algorithms that provide explanations for decisions emanating from deep networks in order to —
1. understand/interpret why the model did what it did,
2. enable knowledge transfer between humans and AI,
3. correct unwanted biases learned by AI models, and
4. encourage human-like reasoning in AI.
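Grad-CAM, listed in the keywords below, is one such explanation method from this line of work; a minimal sketch of its computation, with synthetic activations and gradients standing in for a real network:

```python
# Grad-CAM sketch: weight each conv feature map by its pooled gradient, sum,
# and keep positive evidence. Inputs here are synthetic, not from a real model.
import numpy as np

def grad_cam(activations, gradients):
    """activations, gradients: (channels, h, w) from the last conv layer for one image."""
    weights = gradients.mean(axis=(1, 2))                      # alpha_k: pooled gradients
    cam = np.tensordot(weights, activations, axes=([0], [0]))  # weighted sum -> (h, w)
    cam = np.maximum(cam, 0.0)                                  # ReLU: positive influence only
    return cam / (cam.max() + 1e-8)                             # normalize for display

rng = np.random.default_rng(0)
A = rng.normal(size=(512, 14, 14))       # stand-in feature maps
dYdA = rng.normal(size=(512, 14, 14))    # stand-in gradients of the class score
print(grad_cam(A, dYdA).shape)           # (14, 14) coarse localization map
```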
Advisors/Committee Members: Parikh, Devi (advisor), Batra, Dhruv (committee member), Hoffman, Judy (committee member), Lee, Stefan (committee member), Kim, Been (committee member).
Subjects/Keywords: Visual explanations; Interpretability; Computer vision; Vision and language; Deep learning; Grad-CAM
APA (6th Edition):
Ramasamy Selvaraju, R. (2020). Explaining model decisions and fixing them via human feedback. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/62867
Chicago Manual of Style (16th Edition):
Ramasamy Selvaraju, Ramprasaath. “Explaining model decisions and fixing them via human feedback.” 2020. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/62867.
MLA Handbook (7th Edition):
Ramasamy Selvaraju, Ramprasaath. “Explaining model decisions and fixing them via human feedback.” 2020. Web. 16 Apr 2021.
Vancouver:
Ramasamy Selvaraju R. Explaining model decisions and fixing them via human feedback. [Internet] [Doctoral dissertation]. Georgia Tech; 2020. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/62867.
Council of Science Editors:
Ramasamy Selvaraju R. Explaining model decisions and fixing them via human feedback. [Doctoral Dissertation]. Georgia Tech; 2020. Available from: http://hdl.handle.net/1853/62867

Georgia Tech
13.
Chandrasekaran, Arjun.
Towards natural human-AI interactions in vision and language.
Degree: PhD, Interactive Computing, 2019, Georgia Tech
URL: http://hdl.handle.net/1853/62323
▼ Inter-human interaction is a rich form of communication. Human interactions typically leverage a good theory of mind, involve pragmatics, story-telling, humor, sarcasm, empathy, sympathy, etc. Recently, we have seen a tremendous increase in the frequency and the modalities through which humans interact with AI. Despite this, current human-AI interactions lack many of these features that characterize inter-human interactions. Towards the goal of developing AI that can interact with humans naturally (similar to other humans), I take a two-pronged approach that involves investigating the ways in which both the AI and the human can adapt to each other's characteristics and capabilities. In my research, I study aspects of human interactions, such as humor, story-telling, and the humans' abilities to understand and collaborate with an AI. Specifically, in the vision and language modalities,
1. In an effort to improve the AI's capabilities to adapt its interactions to a human, we build computational models for (i) humor manifested in static images, (ii) contextual, multi-modal humor, and (iii) temporal understanding of the elements of a story.
2. In an effort to improve the capabilities of a collaborative human-AI team, we study (i) a lay person's predictions regarding the behavior of an AI in a situation, (ii) the extent to which interpretable explanations from an AI can improve performance of a human-AI team.
Through this work, I demonstrate that aspects of human interactions (such as certain forms of humor and story-telling) can be modeled with reasonable success using computational models that utilize neural networks. On the other hand, I also show that a lay person can successfully predict the outputs and failures of a deep neural network. Finally, I present evidence that suggests that a lay person who has access to interpretable explanations from the model can collaborate more effectively with a neural network on a goal-driven task.
Advisors/Committee Members: Parikh, Devi (advisor), Batra, Dhruv (committee member), Chernova, Sonia (committee member), Riedl, Mark (committee member), Bansal, Mohit (committee member).
Subjects/Keywords: AI; Neural networks; Human-AI interaction; Human-AI collaboration; Humor; Narrative; Storytelling; Explainable AI; Interpretability; Predictability; GuessWhich; Human-in-loop evaluation
APA (6th Edition):
Chandrasekaran, A. (2019). Towards natural human-AI interactions in vision and language. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/62323
Chicago Manual of Style (16th Edition):
Chandrasekaran, Arjun. “Towards natural human-AI interactions in vision and language.” 2019. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/62323.
MLA Handbook (7th Edition):
Chandrasekaran, Arjun. “Towards natural human-AI interactions in vision and language.” 2019. Web. 16 Apr 2021.
Vancouver:
Chandrasekaran A. Towards natural human-AI interactions in vision and language. [Internet] [Doctoral dissertation]. Georgia Tech; 2019. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/62323.
Council of Science Editors:
Chandrasekaran A. Towards natural human-AI interactions in vision and language. [Doctoral Dissertation]. Georgia Tech; 2019. Available from: http://hdl.handle.net/1853/62323

Georgia Tech
14.
Shaban, Amirreza.
Low-shot learning for object recognition, detection, and segmentation.
Degree: PhD, Interactive Computing, 2020, Georgia Tech
URL: http://hdl.handle.net/1853/63599
▼ Deep Neural Networks are powerful at solving classification problems in computer vision. However, learning classifiers with these models requires a large amount of labeled training data, and recent approaches have struggled to adapt to new classes in a data-efficient manner. On the other hand, the human brain is capable of utilizing already known knowledge in order to learn new concepts with fewer examples and less supervision. Many meta-learning algorithms have been proposed to fill this gap but they come with their practical and theoretical limitations. We review the well-known bi-level optimization as a general framework for few-shot learning and hyperparameter optimization and discuss the practical limitations of computing the full gradient. We provide theoretical guarantees for the convergence of the bi-level optimization using the approximated gradients computed by the truncated back-propagation. In the next step, we propose an empirical method for few-shot semantic segmentation: instead of solving the inner optimization, we propose to directly estimate its result by a general function approximator. Finally, we will discuss extensions of this work with the focus on weakly-supervised object detection when full supervision is not available for the few training examples.
Advisors/Committee Members: Boots, Byron (advisor), Hays, James (committee member), Batra, Dhruv (committee member), Kira, Zsolt (committee member), Li, Fuxin (committee member).
Subjects/Keywords: Few-shot learning; Low-shot learning; Bi-level optimization; Few-shot semantic segmentation; Video object segmentation; Weakly-supervised few-shot object detection
APA (6th Edition):
Shaban, A. (2020). Low-shot learning for object recognition, detection, and segmentation. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/63599
Chicago Manual of Style (16th Edition):
Shaban, Amirreza. “Low-shot learning for object recognition, detection, and segmentation.” 2020. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/63599.
MLA Handbook (7th Edition):
Shaban, Amirreza. “Low-shot learning for object recognition, detection, and segmentation.” 2020. Web. 16 Apr 2021.
Vancouver:
Shaban A. Low-shot learning for object recognition, detection, and segmentation. [Internet] [Doctoral dissertation]. Georgia Tech; 2020. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/63599.
Council of Science Editors:
Shaban A. Low-shot learning for object recognition, detection, and segmentation. [Doctoral Dissertation]. Georgia Tech; 2020. Available from: http://hdl.handle.net/1853/63599

Georgia Tech
15.
Castro, Daniel Alejandro.
Understanding the motion of a human state in video classification.
Degree: PhD, Computer Science, 2019, Georgia Tech
URL: http://hdl.handle.net/1853/61262
▼ For the last 50 years we have studied the correspondence between human motion and the action or goal they are attempting to accomplish. Humans themselves subconsciously learn subtle cues about other individuals that give them insight into their motivation and overall sincerity. In contrast, computers require significant guidance in order to correctly determine deceivingly basic activities. Due to the recent advent of deep learning, many algorithms do not make explicit use of motion parameters to categorize these activities. With the recent advent of widespread video recording and the sheer amount of video data being stored, the ability to study human motion has never been more essential. In this thesis, we propose that our understanding of human motion representations and its context can be leveraged for more effective action classification. We explore two distinct approaches for understanding human motion in video. Our first approach for classifying human activities is within an egocentric context. In this approach, frames are captured every minute, forming a low frame rate video that represents a summary of a person's day. The challenge in this context is that you do not have an explicit visual representation of a human. To tackle this problem, we leverage contextual information alongside the image data to improve the understanding of our daily activities. In this approach, motion is implicitly represented in the image data given that we do not have a visual representation of a human pose. We combine existing neural network models with contextual information using a process we label a late-fusion ensemble. We rely on the convolutional network to encode high-level motion parameters which we later demonstrate perform comparably to explicitly encoding motion representations such as optical flow. We also demonstrate that our model extends to other participants with only two days of additional training data. This work enabled us to understand the importance of leveraging context through parameterization for learning human activities. In our second approach, we improve this encoding by learning from three representations that attempt to integrate motion parameters into video categorization: (1) regular video frames, (2) optical flow, and (3) human pose representation. Regular video frames are most commonly used in video analysis on a per-frame basis due to the nature of most video categories. We introduce a technique which enables us to combine contextual features with a traditional neural network to improve the classification of human actions in egocentric video. Then, we introduce a dataset focused on humans performing various dances, an activity which inherently requires its motion to be identified. We discuss the value and relevance of this dataset alongside the most commonly used video datasets and among a handful of recently released datasets which are relevant to human motion. Next, we analyze the performance of existing algorithms with each of the motion parameterizations mentioned above. This assists…
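A minimal sketch of a late-fusion ensemble in the spirit described above: average per-class probabilities from an image model and a model over contextual features; the classes, weights, and numbers are illustrative only:

```python
# Late fusion of two probability vectors; inputs are made-up stand-ins for
# an image CNN and a contextual (time-of-day / day-of-week) model.
import numpy as np

def late_fusion(p_image, p_context, w=0.5):
    """p_image, p_context: per-class probability vectors from two separate models."""
    p = w * p_image + (1.0 - w) * p_context
    return p / p.sum()

p_img = np.array([0.6, 0.3, 0.1])   # e.g. P(cooking), P(working), P(commuting)
p_ctx = np.array([0.2, 0.7, 0.1])   # from contextual features
fused = late_fusion(p_img, p_ctx)
print("predicted activity index:", int(np.argmax(fused)))
```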
Advisors/Committee Members: Essa, Irfan (advisor), Batra, Dhruv (committee member), Hays, James (committee member), Parikh, Devi (committee member), Sukthankar, Rahul (committee member).
Subjects/Keywords: Action recognition; Dance videos; Human pose; Pose parameterization
APA (6th Edition):
Castro, D. A. (2019). Understanding the motion of a human state in video classification. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/61262
Chicago Manual of Style (16th Edition):
Castro, Daniel Alejandro. “Understanding the motion of a human state in video classification.” 2019. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/61262.
MLA Handbook (7th Edition):
Castro, Daniel Alejandro. “Understanding the motion of a human state in video classification.” 2019. Web. 16 Apr 2021.
Vancouver:
Castro DA. Understanding the motion of a human state in video classification. [Internet] [Doctoral dissertation]. Georgia Tech; 2019. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/61262.
Council of Science Editors:
Castro DA. Understanding the motion of a human state in video classification. [Doctoral Dissertation]. Georgia Tech; 2019. Available from: http://hdl.handle.net/1853/61262

Georgia Tech
16.
Drews, Paul Michael.
Visual attention for high speed driving.
Degree: PhD, Electrical and Computer Engineering, 2018, Georgia Tech
URL: http://hdl.handle.net/1853/61183
▼ Coupling of control and perception is an especially difficult problem. This thesis investigates this problem in the context of aggressive off-road driving. By jointly developing a robust 1:5 scale platform and leveraging state of the art sampling based model predictive control, the problem of aggressive driving on a closed dirt track using only monocular camera images is addressed. It is shown that a convolutional neural network can directly learn a mapping from input images to top-down cost map. This cost map can be used by a model predictive control algorithm to drive aggressively and repeatably at the limits of grip. Further, the ability to learn an end-to-end trained attentional neural network gaze strategy is developed that allows both high performance and better generalization at our task of high speed driving. This gaze model allows us to utilize simulation data to generalize from our smaller oval track to a much more complex track setting. This gaze model is compared with that of human drivers performing the same task. Using these methods, repeatable, aggressive driving at the limits of handling using monocular camera images is shown on a physical robot.
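A toy illustration of how a predicted top-down cost map can drive a sampling-based model predictive control step; the kinematics, the random cost map, and the plain best-of-N sampling are stand-ins for the thesis's learned network and controller, not a reproduction of them:

```python
# Sample steering sequences, roll out a toy kinematic model over a cost map,
# and keep the cheapest rollout. Everything here is synthetic for illustration.
import numpy as np

rng = np.random.default_rng(0)
cost_map = rng.random((100, 100))    # would come from the CNN in the thesis
cost_map[40:60, :] = 0.0             # a low-cost "track" corridor

def rollout_cost(steering_seq, x=50.0, y=5.0, heading=np.pi / 2, speed=2.0):
    total = 0.0
    for steer in steering_seq:
        heading += steer
        x += speed * np.cos(heading)
        y += speed * np.sin(heading)
        i = int(np.clip(x, 0, 99)); j = int(np.clip(y, 0, 99))
        total += cost_map[i, j]
    return total

candidates = rng.normal(scale=0.1, size=(64, 20))   # 64 sampled steering sequences
best = candidates[np.argmin([rollout_cost(c) for c in candidates])]
print("first steering command of best rollout:", best[0])
```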
Advisors/Committee Members: Rehg, James M. (advisor), Theodorou, Evangelos A. (committee member), Boots, Byron (committee member), Batra, Dhruv (committee member), Fox, Dieter (committee member).
Subjects/Keywords: Robotics; Computer vision; Autonomous vehicles; Neural networks; High speed
APA (6th Edition):
Drews, P. M. (2018). Visual attention for high speed driving. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/61183
Chicago Manual of Style (16th Edition):
Drews, Paul Michael. “Visual attention for high speed driving.” 2018. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/61183.
MLA Handbook (7th Edition):
Drews, Paul Michael. “Visual attention for high speed driving.” 2018. Web. 16 Apr 2021.
Vancouver:
Drews PM. Visual attention for high speed driving. [Internet] [Doctoral dissertation]. Georgia Tech; 2018. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/61183.
Council of Science Editors:
Drews PM. Visual attention for high speed driving. [Doctoral Dissertation]. Georgia Tech; 2018. Available from: http://hdl.handle.net/1853/61183

Georgia Tech
17.
Vijayakumar, Ashwin Kalyan.
Improved search techniques for structured prediction.
Degree: PhD, Interactive Computing, 2020, Georgia Tech
URL: http://hdl.handle.net/1853/63701
▼ Many useful AI tasks like machine translation, captioning or program synthesis, to name a few, can be abstracted as structured prediction problems. For these problems, the search space is well-defined but extremely large — all English language sentences for captioning or translation and similarly, all programs that can be generated from a context-free grammar in the case of program synthesis. Therefore, inferring the correct output (a sentence or a program) given the input (an image or user-defined specifications) is an intractable search problem. To overcome this, heuristics — hand designed or learnt from data — are often employed. In my work, I propose modified search procedures to output multiple diverse sequences and then, for the task of outputting programs, I propose a novel search procedure that accelerates existing techniques via heuristics learnt from deep networks. Going further, I propose to study the role of memory and search, i.e., process each new query with the memory of previous queries — specifically in the context of solving mathematical problems. In the context of sequence prediction tasks like image captioning or translation, I introduce Diverse Beam Search (DBS), an approximate inference technique to decode multiple relevant and diverse outputs. With the objective of producing multiple sentences that are different from each other, DBS modifies the commonly used Beam Search procedure by greedily imposing diversity constraints. In follow-up work, we directly formulate the task of modeling a set of sequences and propose a trainable search procedure dubbed diff-BS. While both algorithms are task-agnostic, image-captioning is used as the test-bed to demonstrate their effectiveness. In the context of program-synthesis, I propose Neural Guided Deductive Search (NGDS), which accelerates deductive search via learnt heuristics. We find that our approach results in a significant speedup without compromising on the quality of the solutions found. Further, I will discuss the application of this technique in the context of programming by examples and synthesis of hard problems for a given solver. Finally, I study the interplay between memory and search, specifically in the context of mathematical problem solving. Analogical reasoning is a strategy commonly adopted by humans while solving problems, i.e., new and unseen problems are solved by drawing parallels to previously seen problems. Inspired by such an approach, I propose to learn suitable representations for “problems” that allow the reuse of solutions from previously seen problems as a building block to construct the solution for the problem at hand.
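A heavily simplified sketch of the diversity term in Diverse Beam Search: when a later group of beams extends its hypotheses, tokens already chosen by earlier groups at that time step are penalized. Real DBS maintains full beams per group; this shows only the scoring rule for one step, with a made-up vocabulary:

```python
# One-step DBS-style scoring with a Hamming diversity penalty against earlier groups.
import numpy as np

def diverse_group_scores(log_probs, tokens_used_by_earlier_groups, diversity_strength=0.5):
    """log_probs: (vocab,) next-token scores for one hypothesis in the current group."""
    penalty = np.zeros_like(log_probs)
    for tok in tokens_used_by_earlier_groups:    # count how often each token was reused
        penalty[tok] += 1.0
    return log_probs - diversity_strength * penalty

vocab = ["a", "dog", "cat", "runs", "sleeps"]
log_probs = np.log(np.array([0.05, 0.4, 0.35, 0.15, 0.05]))
scores = diverse_group_scores(log_probs, tokens_used_by_earlier_groups=[1])  # "dog" taken
print("group 1 picks:", vocab[int(np.argmax(log_probs))])   # dog
print("group 2 picks:", vocab[int(np.argmax(scores))])      # cat, steered away from dog
```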
Advisors/Committee Members: Batra, Dhruv (advisor), Parikh, Devi (committee member), Boots, Byron (committee member), Jain, Prateek (committee member), Polozov, Oleksandr (committee member), Rajpurohit, Tanmay (committee member).
Subjects/Keywords: Sequence decoding; Program synthesis
APA (6th Edition):
Vijayakumar, A. K. (2020). Improved search techniques for structured prediction. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/63701
Chicago Manual of Style (16th Edition):
Vijayakumar, Ashwin Kalyan. “Improved search techniques for structured prediction.” 2020. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/63701.
MLA Handbook (7th Edition):
Vijayakumar, Ashwin Kalyan. “Improved search techniques for structured prediction.” 2020. Web. 16 Apr 2021.
Vancouver:
Vijayakumar AK. Improved search techniques for structured prediction. [Internet] [Doctoral dissertation]. Georgia Tech; 2020. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/63701.
Council of Science Editors:
Vijayakumar AK. Improved search techniques for structured prediction. [Doctoral Dissertation]. Georgia Tech; 2020. Available from: http://hdl.handle.net/1853/63701
18.
Vedantam, Shanmukha Ramak.
Interpretation, grounding and imagination for machine intelligence.
Degree: PhD, Interactive Computing, 2018, Georgia Tech
URL: http://hdl.handle.net/1853/60799
▼ Understanding how to model computer vision and natural language jointly is a long-standing challenge in artificial intelligence. In this thesis, I study how modeling vision and language using semantic and pragmatic considerations can help derive more human-like inferences from machine learning models. Specifically, I consider three related problems: interpretation, grounding and imagination. In interpretation, the goal is to get machine learning models to understand an image and describe its contents using natural language in a contextually relevant manner. In grounding, I study how to connect natural language to referents in the physical world, and understand if this can help learn common sense. Finally, in imagination, I study how to ‘imagine’ visual concepts completely and accurately across the full range and (potentially unseen) compositions of their visual attributes. This thesis analyzes these problems from computational as well as algorithmic perspectives and suggests exciting directions for future work.
Advisors/Committee Members: Parikh, Devi (advisor), Batra, Dhruv (committee member), Eisenstein, Jacob (committee member), Zitnick, C Lawrence (committee member), Murphy, Kevin (committee member).
Subjects/Keywords: Computer vision; Machine learning; Artificial intelligence
APA (6th Edition):
Vedantam, S. R. (2018). Interpretation, grounding and imagination for machine intelligence. (Doctoral Dissertation). Georgia Tech. Retrieved from http://hdl.handle.net/1853/60799
Chicago Manual of Style (16th Edition):
Vedantam, Shanmukha Ramak. “Interpretation, grounding and imagination for machine intelligence.” 2018. Doctoral Dissertation, Georgia Tech. Accessed April 16, 2021.
http://hdl.handle.net/1853/60799.
MLA Handbook (7th Edition):
Vedantam, Shanmukha Ramak. “Interpretation, grounding and imagination for machine intelligence.” 2018. Web. 16 Apr 2021.
Vancouver:
Vedantam SR. Interpretation, grounding and imagination for machine intelligence. [Internet] [Doctoral dissertation]. Georgia Tech; 2018. [cited 2021 Apr 16].
Available from: http://hdl.handle.net/1853/60799.
Council of Science Editors:
Vedantam SR. Interpretation, grounding and imagination for machine intelligence. [Doctoral Dissertation]. Georgia Tech; 2018. Available from: http://hdl.handle.net/1853/60799