Semantic Video Summary

Research background

Today, the consumption of multimedia content is increasing rapidly. Video segmentation is the basic operation in authoring and retrieving video content, so detecting exact shot boundaries and segmenting video into semantically homogeneous units are very important for higher-level video manipulation. Semantic video summarization requires an understanding of the semantics of video content, which makes it very challenging work.

Research area

In this research, we provide fundamental functions for multimedia manipulation. The research objectives are as follows:

1. Robust shot boundary detection algorithm

2. Video shot clustering by homogeneous semantics

3. Genre classification to develop genre-dependent video summarization

4. Highlight event detection for video summary

5. A framework for semantic video summarization

We have successfully achieved robust shot boundary detection and genre classification. For sports content, we have also carried out research on semantic event detection and retrieval. Video shot clustering by homogeneous semantics and the generalization of semantic video summarization are still ongoing work.

Video segmentation by Hidden Markov Model

o General framework

We propose a robust video segmentation algorithm for video summary. Exact shot boundary detection and the segmentation of video into meaningful scenes are important parts of automatic video summary. In this paper, we present a shot boundary detection method using audio and visual features defined in MPEG-7, which provides a standard for multimedia content description. By using a Hidden Markov Model classifier based on the statistics of the audio and visual features, exact shot boundaries are detected, and over-segmentation, a common problem in automatic video segmentation, can be reduced.


Fig. 1 Video segmentation process

o Detection of camera shot change by HMM

A Hidden Markov Model (HMM) is adopted as the framework for video segmentation. MPEG-7 descriptors are used to extract audio and visual features from the video content. Using these features, the HMM classifies each part of the video as a non-transition region, an abrupt transition region, or a gradual transition region. We treat shot boundary detection as the detection of a transition region, which carries time-domain information. In the shot transition verification process, falsely detected boundaries are rejected, and specific camera effects such as dissolve, fade in/out, and wipe are classified; these classifications are also used in the verification.
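To illustrate how an HMM can label frames with the three region types, the following is a minimal sketch of Viterbi decoding in plain Python. It is not the implementation used in this research: the state names, the transition probabilities, and the per-frame emission likelihoods in the usage lines are all assumptions made for illustration, and in practice the emission scores would come from models trained on the MPEG-7 audio and visual features.

```python
import math

# Hypothetical state names for the three regions described in the text.
STATES = ("non_transition", "abrupt", "gradual")


def viterbi(log_emissions, log_start, log_trans):
    """Return the most likely state sequence for a run of frames.

    log_emissions: list of dicts, one per frame, mapping state -> log-likelihood
                   of that frame's feature vector under the state.
    log_start:     dict mapping state -> log prior of starting in that state.
    log_trans:     dict mapping (prev_state, state) -> log transition probability.
    """
    # best[s] holds the best log score of any path that ends in state s.
    best = {s: log_start[s] + log_emissions[0][s] for s in STATES}
    backpointers = []
    for frame in log_emissions[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: best[p] + log_trans[(p, s)])
            new_best[s] = best[prev] + log_trans[(prev, s)] + frame[s]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


# Toy usage with made-up numbers: states tend to persist from frame to frame.
log_start = {s: math.log(1.0 / 3.0) for s in STATES}
log_trans = {(p, s): math.log(0.8 if p == s else 0.1)
             for p in STATES for s in STATES}
favoured = ["non_transition", "non_transition", "gradual", "gradual",
            "gradual", "non_transition"]
log_emissions = [{s: math.log(0.9 if s == f else 0.05) for s in STATES}
                 for f in favoured]
print(viterbi(log_emissions, log_start, log_trans))
```

Consecutive frames that receive the same transition label form one candidate transition region, which can then be passed to the verification step.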


Fig. 2 HMM classifier for shot boundary detection

After the shot boundary regions are detected, a verification process rejects erroneously detected shots. The detected shot transitions are verified by checking the characteristics of each region. Abrupt shot transitions are rarely missed, but false detections of gradual shot transitions occur in two typical cases. The first case is when global motion exists, which generates the same waveform as a gradual shot transition. Because a true gradual transition does not show global motion, detecting camera motion within the candidate gradual transition region is a good way to reject false transition regions. The second case is a short-term disturbance such as the movement of a dominant object or a change in brightness. In this case, the transition region detected by the HMM is shorter than an abrupt transition region, whose length equals the frame difference K.
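The two rejection cases above can be sketched as a simple filter over candidate regions. This is only an illustration under stated assumptions: the `Candidate` record, the `has_global_motion` flag (standing in for the output of some camera-motion detector), and the choice to apply the length test to every candidate are hypothetical and are not taken from the original implementation.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    start: int               # first frame index of the detected region
    end: int                 # last frame index of the region (inclusive)
    kind: str                # "abrupt" or "gradual", as labelled by the HMM
    has_global_motion: bool  # hypothetical flag from a camera-motion detector


def verify(candidates, k):
    """Return the candidates that survive both rejection rules.

    k is the frame difference K, i.e. the expected length of an abrupt
    transition region.
    """
    accepted = []
    for c in candidates:
        length = c.end - c.start + 1
        if c.kind == "gradual" and c.has_global_motion:
            continue  # case 1: global motion mimics a gradual transition
        if length < k:
            continue  # case 2: short-term disturbance, shorter than K frames
        accepted.append(c)
    return accepted


candidates = [
    Candidate(10, 14, "abrupt", False),    # length 5, kept when K = 5
    Candidate(30, 60, "gradual", True),    # rejected: global motion
    Candidate(80, 82, "gradual", False),   # rejected: only 3 frames long
    Candidate(100, 130, "gradual", False), # kept
]
print([(c.start, c.end) for c in verify(candidates, k=5)])
```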

o Conclusions

In this research, we proposed a video segmentation algorithm based on multimodal MPEG-7 descriptions. By using both audio and visual features, it avoids erroneous shot boundary detections and over-segmentation better than conventional algorithms based on unimodal features. Editing types such as cut, fade-in/out, and dissolve are detected from gradual shot changes using a mathematical model of the frame editing effect, which is also used to reject falsely detected shot boundaries. The experimental results show effective detection of shots and editing types, and they show that the proposed method merges shots into more meaningful scenes. The proposed method uses only simple clues to detect video shot semantics, so more sophisticated approaches should be considered to increase merging efficiency. Video segmentation using shot-level features is left as future work.

Video Genre Classification Using Multimodal Features

We propose a video genre classification method using multimodal features. The proposed method can be applied as a preprocessing step for automatic video summarization or for the retrieval and classification of broadcast video content. Through a statistical analysis of low-level and middle-level audio-visual features in video, the proposed method achieves good performance in classifying several broadcast genres such as cartoon, drama, music video, news, and sports. In this paper, we adopt MPEG-7 audio-visual descriptors as the multimodal features of video content and evaluate classification performance by feeding the features into a decision tree-based classifier trained with CART. The experimental results show that the proposed method recognizes several broadcast video genres with high accuracy and that classification with multimodal features is superior to classification with unimodal features.
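As background on the classifier, CART grows a binary decision tree by repeatedly choosing the feature and threshold that minimize the weighted Gini impurity of the two resulting subsets. The sketch below shows only that single split-selection step, in plain Python, on made-up two-dimensional feature vectors. The feature values and labels are hypothetical, and this is not the classifier or the feature set used in the experiments.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best split.

    rows is a list of equal-length numeric feature tuples. Candidate
    thresholds are midpoints between consecutive distinct feature values.
    Returns None if no feature has two distinct values.
    """
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted(set(r[f] for r in rows))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


# Hypothetical clips described by two made-up feature values each.
rows = [(0.1, 5.0), (0.2, 1.0), (0.8, 4.0), (0.9, 2.0)]
labels = ["news", "news", "sports", "sports"]
print(best_split(rows, labels))
```

A full tree would apply `best_split` recursively to the left and right subsets until a stopping rule is met, and would normally be pruned afterwards.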


o Genre classification procedure

The flowchart of video genre classification is shown in the following figure.


Flowchart of the proposed video genre classification.

o The multimodal features used in this research are as follows:

   Audio features - AudioPower, FundamentalFrequency, HarmonicSpectralCentroid, HarmonicSpectralDeviation, and AudioSpectrumFlatness of MPEG-7.

   Visual features - Motion Activity, Camera Motion, and HSV color of MPEG-7.


o Experimental environment

   - The experiment was conducted to verify the usefulness of the proposed method, which was implemented with the MPEG-7 XM (eXperimentation Model). In the experiment, we use the MPEG-7 video database, which consists of several genres, together with a sufficient number of video clips gathered from commercial broadcast videos. Five video genres are used: cartoon, drama, music video, news, and sports. Each genre has 72 video clips, each 60 seconds long at a size of 352 x 240, for a total of 360 clips. Of these, 235 (5 x 47) clips are used to train the classifier, and the remaining 125 are used to test classification performance. The cartoon clips are trimmed from 9 different videos from Korea, Japan, the U.S., and Europe; the drama clips are from 10 different broadcast videos; the music video clips are from 30 video sources; the news clips, including 3 sports news clips, are extracted from 6 different videos; and the sports clips are from 12 different videos covering soccer, basketball, tennis, swimming, judo, golf, track and field (running), and volleyball.
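The split described above can be reproduced as a per-genre (stratified) partition. The sketch below is a minimal illustration, assuming that the 47 training clips of each genre are chosen at random; the text does not say how they were selected, and the clip names and the seed are hypothetical.

```python
import random

GENRES = ("cartoon", "drama", "music video", "news", "sports")
CLIPS_PER_GENRE = 72   # from the text
TRAIN_PER_GENRE = 47   # from the text (5 x 47 = 235 training clips)


def stratified_split(clips_by_genre, train_per_genre, seed=0):
    """Split each genre's clips into train and test lists of (clip, genre)."""
    rng = random.Random(seed)  # fixed seed so the split is repeatable
    train, test = [], []
    for genre, clips in clips_by_genre.items():
        shuffled = list(clips)
        rng.shuffle(shuffled)
        train += [(clip, genre) for clip in shuffled[:train_per_genre]]
        test += [(clip, genre) for clip in shuffled[train_per_genre:]]
    return train, test


# Hypothetical clip identifiers standing in for the real video files.
clips_by_genre = {g: [f"{g}_{i:02d}" for i in range(CLIPS_PER_GENRE)]
                  for g in GENRES}
train, test = stratified_split(clips_by_genre, TRAIN_PER_GENRE)
print(len(train), len(test))  # 235 125
```

Splitting per genre keeps the class balance identical in the training and test sets, which makes the per-genre accuracy figures directly comparable.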


o Conclusions

- This work proposes a method of video genre classification using multimodal features. Our goal is to recognize the genres of broadcast video content in order to support automatic video summarization, retrieval, and classification. In the proposed method, the multimodal features are adopted from MPEG-7 audio-visual descriptors. To show the usefulness of the proposed method, we analyze the characteristics of the genres through the statistical distribution of audio-visual features, organize the features, and apply them to a decision tree-based classifier trained with CART. The experiment uses multimodal as well as unimodal MPEG-7 features for classification and evaluates the classification performance on a video database with several broadcast genres. Because the proposed method employs both audio and visual features, it classifies the genres of broadcast content accurately and efficiently. The experimental results show that the proposed method classifies several broadcast video genres with high accuracy. As future work, we will extend the number of genres, reduce the processing time, and improve the performance at the same time.


1. "Video Segmentation by Hidden Markov Model Using Multimodal MPEG-7 Descriptors," Tea Meon Bae, Sung ho Jin, Jin Ho Choo, Mansoo Park, Yong Man Ro, Hoi-Rin Kim, and Kyeongok Kang - Internet Imaging V, 2004, IS&T/SPIE's 16th Annual Symposium, Electronic Imaging Science and Technology, Jan. 2004. (Scheduled Presentation)

2. "Video genre classification using multimodal Features," Sung ho Jin, Tea Meon Bae, Jin Ho Choo, and Yong Man Ro - Storage and Retrieval Methods and Applications for Multimedia 2004, IS&T/SPIE's 16th Annual Symposium, Electronic Imaging Science and Technology, Jan. 2004. (Scheduled Presentation)