
A New Dataset to Develop Smart Assistants for Specialized Training with Augmented Reality



Project Sponsor:

    Project Abstract

    Emergency response personnel (e.g., firefighters, medical personnel, and utility workers) require specialized training to act in time- and precision-sensitive tasks. Comprehensive training requires time, practice, and continuous guidance from a professional and experienced trainer who can predict and correct the trainee’s actions. The trainer-to-trainee ratio currently limits the number of individuals who can be trained at a time. Ideally, such training could be carried out by an automatic, smart agent using augmented reality devices such as the HoloLens. In this project, we aim to develop a system for guided monitoring of a person’s actions as they learn a specialized task.


    Project Description & Overview

    Smart assistants can guide a trainee’s actions as they learn a specialized task. Such a system must: 1) identify the task being performed, and 2) predict the trainee’s actions. To do so, the assistant must process the trainee’s field of view (i.e., egocentric video) and surrounding sounds, carry out object recognition (including the trainee’s own body parts, such as the arms), attend to relevant objects, and predict future actions.

    Existing approaches rely on multimodal datasets of egocentric video and audio in which an individual is seen carrying out a task. These datasets must be annotated so that actions in the egocentric video are associated with clear human-language descriptors. Annotation can be carried out entirely by a person, but this is time-consuming and error-prone. An alternative is automatic annotation via speech recognition, provided the video features the individual narrating their own activities. Narration, however, introduces pauses between actions while the individual speaks, and errors when the individual stops to think about what to say or talks and acts at the same time. As a result, currently existing smart assistants are limited by the data used for their development.
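
    To illustrate the speech-recognition route to annotation, the sketch below uses the open-source Whisper model to produce a time-stamped transcript of a clip’s audio track. This is only an assumption-laden example (the file name is a placeholder, and Whisper is not prescribed by the proposal), not project code.

        # Minimal sketch: time-stamped speech transcription to seed annotations.
        # Assumes the openai-whisper package and ffmpeg are installed; the clip
        # path is a placeholder.
        import whisper

        model = whisper.load_model("base")      # small pretrained ASR model
        result = model.transcribe("clip.mp4")   # placeholder egocentric recording

        # Each segment carries start/end times that can anchor action annotations.
        for seg in result["segments"]:
            print(f'{seg["start"]:7.2f}s - {seg["end"]:7.2f}s  {seg["text"].strip()}')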

    This project aims to: 1) improve data quality with a new dataset of egocentric video and audio in which an individual receives verbal instructions from a third party; 2) benchmark pre-trained machine learning models that carry out video summarization and audio-visual correspondence; and 3) evaluate action prediction models. Hence, the project’s question is: do multimodal egocentric recordings of instructed actions result in better annotations and better predictions of human performance by an artificial agent?
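
    As a purely hypothetical illustration of the third aim, action prediction could be scored against the new annotations with a simple label-matching metric; the action names below are invented placeholders.

        # Hypothetical scoring sketch: top-1 accuracy of predicted next-action
        # labels against annotated ground truth (placeholder label strings).
        from sklearn.metrics import accuracy_score

        annotated = ["slice_bread", "add_cheese", "toast", "add_lettuce"]
        predicted = ["slice_bread", "add_cheese", "add_lettuce", "add_lettuce"]

        print("top-1 accuracy:", accuracy_score(annotated, predicted))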


    Datasets

    Collecting data from emergency response workers would be logistically challenging and is not necessary to first address our research question (whether multimodal egocentric video of instructed actions results in better annotations for predicting human behavior). Instead, we will use videos recorded by a real-life worker at the Subway restaurant chain. He uploads to YouTube every day, and his videos are openly available. The videos feature him making specific menu items while following the verbal instructions of customers. He started his YouTube channel in June and has already uploaded 7 hours of egocentric multimodal video, and his list of videos continues to grow every day. Moreover, we have established direct contact with him and shared our research ideas; if this project is approved, he will support us by uploading at least 10 minutes of his real-life footage at work per day.
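
    As an illustration only (the channel URL is a placeholder and this tool is not mandated by the proposal), openly available footage of this kind could be pulled for offline processing with the yt-dlp library:

        # Sketch: downloading openly available videos for offline processing.
        # Assumes the yt-dlp package is installed; the URL is a placeholder.
        from yt_dlp import YoutubeDL

        options = {
            "format": "mp4",                      # keep audio and video together
            "outtmpl": "data/%(title)s.%(ext)s",  # save under a local data/ folder
        }

        with YoutubeDL(options) as ydl:
            ydl.download(["https://www.youtube.com/@example-channel"])  # placeholder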


    Competencies

    The students should be comfortable with Python and familiar with data analysis tools such as NumPy and pandas. A machine learning background is also desirable (e.g., basic classification models such as random forests, and train/test splits for evaluation).
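
    For reference, a basic workflow of the kind alluded to above (an illustrative example on a toy dataset, not project code) looks like this in scikit-learn:

        # Basic classification background: a random forest with a train/test split.
        from sklearn.datasets import load_iris
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.model_selection import train_test_split

        X, y = load_iris(return_X_y=True)
        X_train, X_test, y_train, y_test = train_test_split(
            X, y, test_size=0.25, random_state=0
        )

        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X_train, y_train)
        print("test accuracy:", clf.score(X_test, y_test))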


    Learning Outcomes & Deliverables

    To conduct such a project we need audio-visual annotations. First, students will learn how to use existing models for automatic speech recognition and visual object detection. This will result in a real-world, egocentric audio-visual dataset of an instruction-following task with annotated actions. Second, students will learn to evaluate the performance of existing video summarization and audio-visual correspondence models against their newly curated dataset. This will result in a study of the performance of different off-the-shelf models on egocentric multimodal data in an instruction-following task. Finally, students will learn how to use and benchmark state-of-the-art multimodal action prediction models. The third deliverable will be a report summarizing the work carried out and the main conclusions, along with the associated code.
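
    To give a flavor of the first step (reusing existing models for visual object detection), the sketch below runs a pretrained torchvision detector on a single extracted frame. The frame path is a placeholder, and this particular detector is an assumption rather than a requirement of the project.

        # Sketch: off-the-shelf object detection on one video frame.
        # Assumes torchvision >= 0.13; the image path is a placeholder.
        import torch
        from torchvision.io import read_image
        from torchvision.models.detection import (
            FasterRCNN_ResNet50_FPN_Weights,
            fasterrcnn_resnet50_fpn,
        )

        weights = FasterRCNN_ResNet50_FPN_Weights.DEFAULT
        model = fasterrcnn_resnet50_fpn(weights=weights).eval()
        preprocess = weights.transforms()

        frame = read_image("frame_0001.jpg")  # placeholder: one extracted frame
        with torch.no_grad():
            prediction = model([preprocess(frame)])[0]

        # Report detected object classes with confidence above 0.8.
        categories = weights.meta["categories"]
        for label, score in zip(prediction["labels"], prediction["scores"]):
            if score > 0.8:
                print(categories[label], float(score))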


    Student Team Members

    Siyong Liu, Sonam Sonam, Sijian Wang, Miaojing Yang