See Readme file for details of updates.
July 15: Typos in scene_id.csv in the training and test datasets have been corrected, and the files have been re-uploaded to the original URL.
July 28: Regular and Challenge track submission deadlines are extended.
Aug 8: Please submit the challenge track result file in docx format instead of csv format, because of a limitation on submission file formats in CMT.
Aug 20: You can see the result leader board here.
Oct 16: The workshop program timetable has been updated.
Please see the notice for Challenge and Regular track speakers here.
Nov 8: Workshop photos have been uploaded. Link


Call for Papers (Download)

Comprehensive video understanding has recently received increasing attention from the computer vision and multimedia communities, with the goal of building machines that can understand videos as humans do. Currently, most work on untrimmed video recognition focuses on isolated and independent problems such as action recognition or scene recognition. While these address different aspects of video understanding, there are strong mutual relationships and correlations between action and scene. To achieve accurate, human-level understanding of untrimmed videos, a comprehensive understanding of various aspects, such as what the actors are doing and where they are doing it, is of great importance.

This workshop aims to provide a forum for exchanging ideas in comprehensive video understanding, with a particular emphasis on untrimmed video summarization with temporal action and scene recognition. Papers presented at this workshop should address at least one video understanding problem, including but not limited to the topics listed under the Regular Track below.

This workshop consists of two tracks: Regular Track and Challenge Track.

Regular Track

The regular track invites papers that address video summarization, action recognition, and scene recognition problems. We solicit original contributions that address a wide range of theoretical and practical issues, including but not limited to:

  • Video summarization in untrimmed video
  • Action recognition in untrimmed video
  • Scene recognition in untrimmed video
  • Video summarization with action/scene information

Challenge Track

The challenge track focuses on evaluating video summarization with temporal action and scene recognition on a new untrimmed video dataset, called the Untrimmed Video Summarization with Action and Scene Recognition Dataset. Details of the challenge can be found here.


Date: 27 October, 2019
Time: Half Day - PM (13:00-17:00)
Location: Room 307 A-C

Time Description
13:00 Opening Comments
13:10 Invited Talk 1 [Title] Towards Embodied Action Understanding
[Speaker] Ivan Laptev
13:50 Invited Talk 2 [Title] Weakly Supervised Action Localization
[Speaker] Bohyung Han
14:30 Challenge Briefing [Speaker] Kuk-Jin Yoon
14:45 Challenge Talk 1 [Title] Comprehensive Video Understanding: Video summarization with content-based video recommender design
[Authors] Yudong Jiang, Kaixu Cui, Bo Peng, Changliang Xu
15:00 Challenge Talk 2 [Title] Temporal U-Nets for Video Summarization with Scene and Action Recognition
[Authors] Heeseung Kwon, Woohyeon Joseph Shim, Minsu Cho
15:15 Challenge Talk 3 [Title] Video Summarization by Learning Relationships between Action and Scene
[Authors] Jungin Park, Jiyoung Lee, Sangryul Jeon, Kwanghoon Sohn
15:30 Coffee Break
16:20 Regular Talk 1 [Title] Enhancing Temporal Action Localization with Transfer Learning from Action Recognition
[Authors] Alexander Richard, Ahsan Iqbal, Jürgen Gall
16:40 Regular Talk 2 [Title] Video Multitask Transformer Network
[Authors] Hongje Seong, Junhyuk Hyun, Euntai Kim
17:00 Closing Comments
** For Challenge and Regular track speakers **
Please prepare a presentation in ppt format: 10~15 minutes for the Challenge Track and 15~20 minutes for the Regular Track.
There is no poster session, so you do not need to prepare a poster.

Invited Talk 1
Title: Towards Embodied Action Understanding
Speaker: Ivan Laptev
Abstract: Computer vision has come a long way towards automatic labeling of objects, scenes and human actions in visual data. While this recent progress already powers applications such as visual search and autonomous driving, visual scene understanding remains an open challenge beyond specific applications. In this talk I will outline limitations of human-defined labels and will argue for the task-driven approach to scene understanding. Towards this goal I will describe our recent efforts on learning visual models from narrated instructional videos. I will present methods for automatic discovery of actions and object states associated with specific tasks such as changing a car tire or making coffee. Along these efforts, I will describe a state-of-the-art method for text-based video search using our recent dataset with automatically collected 100M narrated videos. Finally I will present our on-going work on visual scene understanding for real robots where we learn agents to discover sequences of actions to achieve particular tasks.

Invited Talk 2
Title: Weakly Supervised Action Localization
Speaker: Bohyung Han
Abstract: The success of deep learning in computer vision is partly attributed to the construction of large-scale labeled datasets such as ImageNet. However, computer vision problems often require substantial human efforts and interventions to obtain accurate annotations due to dynamic aspects of class labels, needs for pixel-level labeling, and annotation ambiguities. Hence, collecting high quality large-scale labeled datasets is very time consuming and even infeasible. Such challenges are even aggravated in the problems related to videos. This talk first addresses problem formulations and existing algorithms about weakly supervised action localization and then discusses the limitations of the existing techniques and the potential solutions to the issues.

Paper Submission

You are cordially invited to submit papers to the 2nd Workshop and Challenge on ‘Comprehensive Video Understanding in the Wild – Untrimmed video summarization with temporal action and scene recognition (CoView 2019).’ The workshop invites full research papers of 4 to 8 pages, plus additional pages for references. Reference pages do not count toward the 4-to-8-page limit.

All papers must be formatted using the ICCV template style, which can be obtained at ICCV style.

Important Dates

July 22, 2019: Abstract submission.
July 29, 2019: Workshop paper submission deadline.
Aug 5, 2019: Workshop paper submission deadline. (Extended) (11:59 PM Pacific Time)
Aug 19, 2019: Notification of acceptance.
Aug 30, 2019: Camera-ready papers submission.
Oct 27, 2019: Workshop.

Click the following link to go to the submission site:
The link will be provided soon.

Contact Email :


Video summarization with action and scene recognition in untrimmed videos

This challenge aims to explore new approaches and bold ideas for video summarization with action and scene recognition in untrimmed videos, and to evaluate the ability of the algorithms. The challenge is intended to address the joint and comprehensive understanding of untrimmed videos, with a particular emphasis on video summarization with temporal action and scene recognition. While most recent work on untrimmed video understanding focuses on each of these problems separately, there are strong mutual relationships and correlations among importance, action, and scene.


For the challenge, we built the Video Summarization with Temporal Action and Scene recognition Dataset, which consists of untrimmed videos sampled from the YouTube-8M dataset, the Dense-Captioning dataset (Krishna et al.), and the Video summarization dataset (Song et al.), with annotated action and scene class labels for each video. It contains about 1500 videos, split into 1200 for training and 300 for testing. Each video is annotated as follows: we divided the video into 5-second segments and asked 20 users to annotate the importance score and the action/scene labels of each segment. The importance score indicates how important each segment is compared to other segments from the same video, on a scale from 0 (not important) to 2 (very important). The action and scene labels are selected from 99 and 79 classes, respectively. The average of the importance scores and the most-voted action/scene labels become the segment-level information of the video. All segments have scores and action/scene class labels.
The training dataset can be downloaded directly from here.
The test dataset can be downloaded directly from here.
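
The segment-level aggregation described above (averaging the importance scores and taking the most-voted action and scene labels) can be sketched as follows. This is only an illustration: the function name, the argument layout, and the example votes are hypothetical, and the tie-breaking rule (the first label seen wins) is an assumption, since the page does not say how ties were resolved.

```python
from collections import Counter
from statistics import mean


def aggregate_segment(importance_votes, action_votes, scene_votes):
    """Combine the annotators' votes for one 5-second segment.

    importance_votes: scores in {0, 1, 2}, one per annotator
    action_votes / scene_votes: class indices, one per annotator
    """
    return {
        # Segment importance is the average over annotators.
        "importance": mean(importance_votes),
        # Segment labels are the most-voted classes
        # (ties go to the label seen first; this is an assumption).
        "action": Counter(action_votes).most_common(1)[0][0],
        "scene": Counter(scene_votes).most_common(1)[0][0],
    }


# Hypothetical votes from four annotators (the dataset used 20).
segment = aggregate_segment(
    importance_votes=[2, 1, 2, 0],
    action_votes=[17, 17, 3, 17],
    scene_votes=[5, 8, 5, 5],
)
print(segment)
```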

Evaluation Metric

The challenge uses both a quantitative and a qualitative evaluation metric. For the quantitative evaluation, we compute the sum of the importance scores of the summarized segments and also compare the scene and action labels of the submitted summary segments with those of the ground-truth (GT) summary segments. With N the number of test videos and N_s the number of segments selected for summarization, the importance score metric is defined as

$$ \frac{1}{N} \sum_{k=1}^{N} \frac{\sum_{i=1}^{N_s} \text{Importance Score}(k, \mathrm{Sub}_i)}{\sum_{i=1}^{N_s} \text{Importance Score}(k, \mathrm{GT}_i)} $$

where Importance Score(k, GT_i) is the importance score of the i-th segment of the ground-truth summary of the k-th video, and Importance Score(k, Sub_i) is the importance score of the i-th segment of the submitted summary for the k-th video. Importance scores come from the multi-user annotations and are shared by the ground-truth and submitted summaries. The ground-truth summary consists of the top N_s segments with the highest scores, and N_s is set to 6 for all videos. To help in understanding this metric, we provide toy example code here.
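
As a rough illustration of the importance score metric defined above, the following sketch computes it for two made-up videos. It is not the official toy example or evaluation kit: the function name, the dictionary layout, and the scores are hypothetical, and it assumes each submission lists N_s distinct segment indices and that the ground-truth total is non-zero.

```python
def summary_score(scores_per_video, submitted_per_video, n_s=6):
    """Mean over videos of (submitted importance sum) / (ground-truth importance sum).

    scores_per_video: {video_id: [annotated importance score of each segment]}
    submitted_per_video: {video_id: [indices of the n_s submitted segments]}
    """
    ratios = []
    for video_id, scores in scores_per_video.items():
        # Ground-truth summary: the n_s segments with the highest scores.
        gt_total = sum(sorted(scores, reverse=True)[:n_s])
        # Submitted summary: look up the same annotated scores.
        sub_total = sum(scores[i] for i in submitted_per_video[video_id])
        ratios.append(sub_total / gt_total)
    return sum(ratios) / len(ratios)


# Made-up annotated scores for two short videos.
scores = {
    "a": [2.0, 1.5, 1.0, 0.5, 0.0, 1.0, 2.0, 0.5],
    "b": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
}
submitted = {
    "a": [0, 1, 2, 3, 4, 5],  # sums to 6.0, while the best possible sum is 8.0
    "b": [0, 1, 2, 3, 4, 5],  # sums to 6.0, which is also the best possible
}
result = summary_score(scores, submitted)
print(result)  # (0.75 + 1.0) / 2 = 0.875
```

Because the denominator is the best achievable sum for each video, every per-video ratio lies between 0 and 1, and a score of 1 means the submission selected segments as important as the ground-truth summary.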

In addition, we will use the Top-K hamming score of the action and scene labeling results. With K the number of predictions and L the number of labels, the Top-K hamming score, H(K), is defined as

$$ H(K) = \frac{1}{N}\sum_{n=1}^{N}\sum_{label=1}^{L}\sum_{k=1}^{K}\frac{\mathrm{AND}\left(k\text{-th prediction}_{label},\ \mathrm{GT}_{label}\right)}{L} $$

with AND(A,B)=1 only if A and B have exactly the same label index for action or scene. We will use the Top-K hamming score as the challenge criterion, and it will also be provided to you as additional information on your prediction results. The value of K will be set to a reasonable value soon. For the qualitative evaluation, we will conduct a user study through Amazon Mechanical Turk, creating multiple tasks to qualitatively evaluate the submissions.
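
One possible reading of the Top-K hamming score is sketched below. Several points are assumptions rather than statements from the page: L is taken to be the number of label types (here action and scene), each sample has one ground-truth index per label type, and the K predictions for a label type are distinct, so at most one of them can match. The function name and the example data are hypothetical.

```python
def top_k_hamming(predictions, ground_truth, k):
    """Average, over samples, of the fraction of label types whose
    ground-truth index appears among the top-k predictions.

    predictions: list of {label_type: [ranked predicted class indices]}
    ground_truth: list of {label_type: ground-truth class index}
    """
    total = 0.0
    for pred, gt in zip(predictions, ground_truth):
        # AND(...) is 1 when the k-th prediction equals the ground-truth index.
        hits = sum(
            1
            for label_type, gt_index in gt.items()
            for candidate in pred[label_type][:k]
            if candidate == gt_index
        )
        total += hits / len(gt)  # divide by L, the number of label types
    return total / len(ground_truth)  # divide by N, the number of samples


# Made-up ranked predictions and ground truth for two samples.
preds = [
    {"action": [3, 7, 1], "scene": [4, 2, 9]},
    {"action": [5, 6, 8], "scene": [1, 0, 2]},
]
gts = [
    {"action": 7, "scene": 9},
    {"action": 2, "scene": 1},
]
print(top_k_hamming(preds, gts, k=2))  # 0.5
print(top_k_hamming(preds, gts, k=3))  # 0.75
```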

Submission Format

Please use the following CSV format when submitting your results for the challenge:

The submitted file should contain the header [video_id, starting time, scene label 1, scene label 2, … , scene label 5, action label 1, action label 2, … , action label 5]. The submitted file must contain only the information for the 6 predicted segments. You can download an evaluation kit here.
In addition, please use the fact sheet format when submitting your fact sheet for the challenge. You can download the fact sheet format here.
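
A result file with the header described above could be written along the following lines. This is a sketch under assumptions: the exact header spellings, the unit of the starting time, and the example row are illustrative only, and the evaluation kit remains the authoritative reference for the expected format.

```python
import csv
import io

# Header as listed on this page: video id, starting time,
# five scene labels, then five action labels (12 columns in total).
HEADER = (
    ["video_id", "starting time"]
    + [f"scene label {i}" for i in range(1, 6)]
    + [f"action label {i}" for i in range(1, 6)]
)


def write_submission(rows, out):
    """Write the header and one row per predicted segment to a text stream."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(HEADER)
    for row in rows:
        if len(row) != len(HEADER):
            raise ValueError(f"expected {len(HEADER)} columns, got {len(row)}")
        writer.writerow(row)


# Hypothetical row: one segment starting at 35 (assumed seconds),
# with five scene predictions followed by five action predictions.
buffer = io.StringIO()
write_submission([["video_0001", 35, 4, 2, 9, 1, 7, 12, 3, 44, 8, 20]], buffer)
print(buffer.getvalue())
```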

Note that the challenge submission file is accepted only as a supplementary file on the CMT submission site (NOT BY EMAIL!).
Because of a limitation on submission file formats in CMT, please submit the challenge track result file in docx format instead of csv format.
You can download a docx file example here.

Important Dates

April 8, 2019: Release of training data.
July 8, 2019: Release of test data and evaluation kit.
Aug 2, 2019: Test results (including fact sheets) submission deadline.
Aug 9, 2019: Test results (including fact sheets) submission deadline. (Extended) (11:59 PM Pacific Time)
Aug 19, 2019: Final test results release to the participants.
Aug 30, 2019: Camera-ready papers submission deadline for entries from the challenges.

Leader Board
Submission ID   Summary score   Rank   Scene & action score   Rank   Final rank (rank sum)
ID 13           0.8512          1      0.7325                 3      1 (4)
ID 3            0.8343          3      0.8036                 2      2 (5)
ID 19           0.8409          2      0.7294                 4      3 (6)
ID 5            0.7594          7      0.8114                 1      4 (8)
ID 15           0.8151          4      0.3872                 5      5 (9)
ID 16           0.7985          5      0.3831                 6      6 (11)
ID 14           0.7696          6      0.0494                 7      7 (13)
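
The final ranking in the leader board appears to follow from adding the two per-metric ranks and sorting by that sum. The sketch below reproduces the rank sums from the scores listed above. The variable and function names are hypothetical, and since no ties occur in this table, no tie-breaking rule is implemented or implied.

```python
# (summary score, scene & action score) per submission ID, from the leader board.
results = {
    13: (0.8512, 0.7325),
    3: (0.8343, 0.8036),
    19: (0.8409, 0.7294),
    5: (0.7594, 0.8114),
    15: (0.8151, 0.3872),
    16: (0.7985, 0.3831),
    14: (0.7696, 0.0494),
}


def ranks(values):
    """Map each key to its 1-based rank, with the highest score ranked first."""
    ordered = sorted(values, key=values.get, reverse=True)
    return {key: position for position, key in enumerate(ordered, start=1)}


summary_rank = ranks({sid: s[0] for sid, s in results.items()})
label_rank = ranks({sid: s[1] for sid, s in results.items()})

# A lower rank sum is better; the final rank orders submissions by it.
rank_sum = {sid: summary_rank[sid] + label_rank[sid] for sid in results}
final_order = sorted(rank_sum, key=rank_sum.get)

for place, sid in enumerate(final_order, start=1):
    print(f"ID {sid}: final rank {place} (rank sum {rank_sum[sid]})")
```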

Invited Speakers

Ivan Laptev


Bohyung Han

Seoul National University

Program Chairs

Kuk-Jin Yoon


Kwanghoon Sohn

Yonsei University

Ming-Hsuan Yang

University of California at Merced

Karteek Alahari

INRIA Grenoble

Yale Song

Microsoft Research

Technical Program Committee

Hyeran Byun

Yonsei University

Minsu Cho


Sunyoung Cho


Bumsub Ham

Yonsei University

Suha Kwak


Jia-bin Huang

Virginia Tech

Ig-Jae Kim


Jongwoo Lim

Hanyang University

Jiangbo Lu

Shenzhen Cloudream Tech

Tao Mei

Dongbo Min

Ewha Womans University

John See

Multimedia University

Tony Tung


Stefan Winkler


Kuk-Jin Yoon


Antoine Miech


Gül Varol



Contact the workshop organizers at:


The workshop is supported by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT (NRF-2017M3C4A7069369).