A subjective and objective study of scrambling-based privacy protection in video surveillance systems


2009 . 01 ~ current

1. Introduction

Nowadays, video surveillance systems are omnipresent in public places. These systems are often characterized by high-speed network connections, plenty of storage capacity, and high computational power. Moreover, thanks to continuously improving computer vision algorithms, video surveillance systems are increasingly able to analyze and understand events of interest. In video surveillance systems, spatial resolution and visual quality are critical factors for the performance of computer vision algorithms: high-resolution, high-quality video content improves the overall performance of algorithms targeting object detection, identification, and tracking. A high spatial resolution and a high visual quality are also important for legal reasons. For example, in the UK, the minimum resolution of traffic Closed-Circuit Television (CCTV) cameras, used for the detection of unlawful drivers, was recently the topic of a legal debate. At the same time, privacy concerns are on the rise. People are being monitored without having given their consent or without being aware of it. The increasing use of high-resolution surveillance cameras will pose even more threats to the privacy of individuals.

In this research, we propose a video surveillance system using the state-of-the-art JPEG Extended Range (JPEG XR) format for scalable image coding. This coding format comes with a low computational complexity, while offering a high image quality and a high flexibility of use in diverse usage environments. In particular, JPEG XR can be seen as a low-complexity alternative to JPEG 2000, which is frequently used in current video surveillance systems. JPEG XR is expected to be ratified as an international standard in the course of this year (formally denoted as ISO/IEC 29199-2). Further, for the purpose of privacy protection, our system detects and protects face regions, which are considered privacy sensitive. Protection is realized using different scrambling techniques operating in the transform domain, taking into account the scalability provisions of JPEG XR.

2. Image coding using JPEG XR

2.1 Scalable Intra coding
In our surveillance system, each video frame is intra-coded using JPEG XR. The main technical benefit of using JPEG XR as a video codec can be found in its low computational complexity, while offering image quality and scalability provisions that are, from a practical point of view, similar to those of Motion JPEG 2000 and the Scalable High Intra Profile of H.264/AVC Scalable Video Coding (SVC). The complexity advantage holds especially true with respect to the JPEG 2000 standard. For example, in JPEG 2000, the wavelet transform is used as a global transform, operating at the level of a tile, and typically requiring more memory bandwidth than the 4×4 block-based transforms of SVC and JPEG XR.

The transform used by JPEG XR is denoted as a two-stage hierarchical Lapped Biorthogonal Transform (LBT). In frequency mode, a JPEG XR bit stream offers support for both spatial and quality scalability, thanks to a partitioning of the transform coefficients of a particular tile into four subbands: the DC subband (containing a single second-stage DC coefficient for each macroblock or MB in the tile), the low-pass (LP) subband (containing 15 second-stage transform coefficients for each MB), the high-pass (HP) subband (containing the significant part of the 240 first-stage transform coefficients of each MB), and the Flexbits subband (containing the refinement bits of the 240 first-stage transform coefficients of each MB). Significant bits and refinement bits are also computed for the DC and LP coefficients. However, in contrast to the 240 first-stage transform coefficients, the significant bits and refinement bits of DC and LP coefficients are not stored in a separate subband.
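The coefficient counts quoted above follow from applying a 4×4 transform twice to a 16×16 macroblock. The following Python sketch reproduces that bookkeeping for a single luma macroblock; the function and constant names are illustrative and are not taken from any JPEG XR implementation.

```python
MB_SIZE = 16     # a macroblock covers 16x16 samples
BLOCK_SIZE = 4   # each transform stage works on 4x4 blocks


def coefficients_per_macroblock():
    """Count the coefficients each subband receives from one macroblock."""
    blocks_per_mb = (MB_SIZE // BLOCK_SIZE) ** 2   # 16 blocks of 4x4 samples
    coeffs_per_block = BLOCK_SIZE ** 2             # 16 coefficients per block

    # First stage: every 4x4 block yields 1 DC and 15 AC coefficients.
    # The AC coefficients of all blocks form the HP (and Flexbits) data.
    hp = blocks_per_mb * (coeffs_per_block - 1)

    # Second stage: the 16 first-stage DC coefficients are transformed
    # again as one 4x4 block, giving 1 DC and 15 LP coefficients.
    lp = blocks_per_mb - 1
    dc = 1
    return {"DC": dc, "LP": lp, "HP": hp}


counts = coefficients_per_macroblock()
print(counts, "total:", sum(counts.values()))
```

The three counts add up to the 256 samples of the macroblock, which is why the DC subband alone is so much smaller than the HP subband.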

A trivial form of quality scalability can be realized by removing the Flexbits subbands, or part thereof, from a JPEG XR bit stream. Spatial scalability is supported by additionally removing the HP and LP subbands, each time resulting in a reduction of the spatial resolution by a factor of four along the horizontal and vertical axes. Intermediate resolutions can be achieved by relying on client-side downsampling techniques, enabling a complexity trade-off between the server, the bit stream extractor, and the client. Depending on the application targeted, spatial scalability in JPEG XR can also be seen as coarse-grained quality scalability when still displaying the adapted image at its original resolution.
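As a rough illustration of the spatial scalability described above, the sketch below computes the resolution obtained when only some subbands are decoded. The function name and the representation of subbands as a set of strings are assumptions made for this example.

```python
def decoded_resolution(width, height, subbands):
    """Return the (width, height) obtained when decoding only `subbands`.

    Dropping the HP subband divides each dimension by four; dropping the
    LP subband as well divides each dimension by four once more.
    """
    if "DC" not in subbands:
        raise ValueError("the DC subband is always required")
    if "HP" in subbands:
        factor = 1
    elif "LP" in subbands:
        factor = 4
    else:
        factor = 16
    # Ceiling division, so that partial blocks at the border still count.
    return -(-width // factor), -(-height // factor)


for kept in ({"DC"}, {"DC", "LP"}, {"DC", "LP", "HP"}):
    print(sorted(kept), decoded_resolution(640, 480, kept))
```

For a VGA frame, the DC-only version has one sample per macroblock, which matches the 40x30 macroblock grid of a 640x480 image.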

2.2 ROI Representation

In JPEG XR, an image may consist of several spatial tiles. Each spatial tile represents a group of spatially adjacent macroblocks. Since there is no coding dependency between spatially adjacent tiles (except when the overlap transform is enabled), tiles can be used to represent an ROI. As such, ROI extraction in JPEG XR can be realized by extracting spatial tiles in the compressed domain, in both spatial and frequency mode. This feature of JPEG XR is also known as fast tile extraction.

Figure 1. ROI representation in JPEG XR.

Two types of tile layouts are possible: a uniform and a non-uniform tile grid. In the uniform tile layout, each tile has the same width and height, while the non-uniform layout permits the use of tiles with different widths and heights (tiles on the same row still need to have the same height, while tiles on the same column still need to have the same width). The non-uniform tile layout is illustrated in Figure 1. Note that the use of a fine-grained tile grid may significantly decrease the coding efficiency. This will also be discussed in Section 4 for our use case.
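A non-uniform grid that isolates a single ROI can be derived directly from the ROI bounding box. The sketch below assumes a layout of at most 3x3 tiles with the ROI as the center tile, all measured in macroblock units; the function name and the ROI position used in the example are hypothetical.

```python
def roi_tile_grid(mb_cols, mb_rows, roi):
    """Build a non-uniform tile grid that isolates one ROI.

    roi is (left, top, width, height) in macroblock units.
    Returns (column_widths, row_heights). Tiles in the same column share
    a width and tiles in the same row share a height, as required by the
    non-uniform layout.
    """
    left, top, width, height = roi
    if left < 0 or top < 0 or left + width > mb_cols or top + height > mb_rows:
        raise ValueError("ROI lies outside the frame")
    # Empty strips (ROI touching a frame border) are dropped.
    col_widths = [c for c in (left, width, mb_cols - left - width) if c > 0]
    row_heights = [r for r in (top, height, mb_rows - top - height) if r > 0]
    return col_widths, row_heights


# VGA frame (40x30 macroblocks) with a 6x6 macroblock face region.
cols, rows = roi_tile_grid(40, 30, (17, 12, 6, 6))
print(cols, rows, "tiles:", len(cols) * len(rows))
```

With the ROI away from the frame borders this yields nine tiles, of which only the center one needs to be extracted or scrambled.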

3. Scrambling

3.1 Proposed Encoder Architecture

Figure 2 illustrates the architecture of our modified JPEG XR encoder. In particular, before entropy coding, scrambling is applied to the transform coefficients in the DC and LP subbands. The information stored in the HP and Flexbits subbands is not altered due to its limited impact on the visual quality. This will be explained in more detail in Section 3.2.

Figure 2. Architecture of our modified JPEG XR encoder.

As for the decoder, we assume that only authorized clients have full access to the original surveillance video content. A detailed description of the actual bit stream extraction, decoding, and descrambling processes, as well as of key management [6], is omitted due to space limitations.

3.2. Subband-Adaptive Scrambling

As shown in Figure 2, a subband-adaptive approach is followed in order to scramble privacy-sensitive face regions. This approach is motivated by the following observation: when scrambling a particular subband, a trade-off exists between the visual importance of the subband (the information in the DC subbands is for instance visually more important than the information in the LP subbands), the available amount of coded data in the subband (the number of coefficients increases when going from the DC subband to the HP subband, and hence the amount of compressed data), the level of security offered by the scrambling technique used, the effect on the coding efficiency, and the computational complexity of the scrambling technique used.

3.2.1. Scrambling for DC subbands

In a DC subband, a limited amount of data is available for the purpose of scrambling. Indeed, each macroblock in a tile only contributes a single DC coefficient to the DC subband of that particular tile. Therefore, we propose to apply scrambling at the level of individual bits in order to ensure a sufficient level of protection. Specifically, we propose to apply both Random Sign Inversion (RSI) and Random Bit Flipping (RBF) to DC subbands. RSI pseudo-randomly flips the sign of DC coefficients as follows:


$$D_e = \begin{cases} -D, & \text{if } r = 1 \\ D, & \text{otherwise} \end{cases} \qquad (1)$$

where $D$ denotes the data to be scrambled, $D_e$ denotes the pseudo-randomly sign-flipped data, and $r$ denotes a pseudo-randomly generated bit. As the sign of DC coefficients is signaled using a simple Boolean flag in JPEG XR, the use of RSI does not affect the coding efficiency. RBF flips bits by applying an XOR operation between input bits and bits belonging to a pseudo-random data stream:


$$B_e = \{\, b_i \oplus r_i \mid b_i \in B,\ r_i \in R \,\} \qquad (2)$$

In Equation 2, $B$ denotes the data to be encrypted while $B_e$ denotes the encrypted data. Further, $b_i$ denotes the $i$-th bit of $B$, $R$ denotes the set of pseudo-random bits, and $r_i$ denotes the $i$-th bit of $R$. In JPEG XR, each DC coefficient is partitioned into a significant part and a remainder part (i.e., DC refinement bits). The significant part is again partitioned into a level value and level refinement bits. The level value is signaled using variable length codes, while both DC refinement bits and level refinement bits are signaled using fixed length codes. RBF is only applied to the DC refinement bits and the level refinement bits. By combining the low-complexity RSI and RBF scrambling techniques, the coefficients in a DC subband can be significantly altered without affecting the coding efficiency, which is an important characteristic for mobile devices.
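The two DC scrambling operations can be sketched in a few lines of Python. This is an illustration only: Python's `random.Random` stands in for the keyed pseudo-random generator (it is not cryptographically secure), and the function names, the key value, and the flat lists of coefficients and fixed-length fields are assumptions made for the example rather than part of the encoder described here.

```python
import random


def rsi(coeffs, rng):
    """Random Sign Inversion: negate a coefficient when the next
    pseudo-random bit equals 1."""
    return [-c if rng.getrandbits(1) else c for c in coeffs]


def rbf(fields, nbits, rng):
    """Random Bit Flipping: XOR each nbits-wide fixed-length field
    with nbits pseudo-random bits."""
    if nbits == 0:
        return list(fields)  # nothing to flip
    return [f ^ rng.getrandbits(nbits) for f in fields]


def scramble_dc(coeffs, fields, nbits, key):
    """Apply RSI to the DC coefficients and RBF to their fixed-length
    fields. Both operations are their own inverse, so calling this
    function again with the same key descrambles the data."""
    rng = random.Random(key)  # stand-in for a keyed, secure generator
    return rsi(coeffs, rng), rbf(fields, nbits, rng)


dc = [12, -7, 0, 33]
refinement = [0b101, 0b000, 0b111, 0b010]
scrambled = scramble_dc(dc, refinement, 3, key=2009)
restored = scramble_dc(scrambled[0], scrambled[1], 3, key=2009)
print(scrambled, restored == (dc, refinement))
```

Since only signs and fixed-length fields change, the scrambled data occupies exactly the same number of bits as the original, in line with the observation that RSI and RBF do not affect the coding efficiency.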

3.2.2. Scrambling for LP subbands

An LP subband is visually less important than a DC subband, but visually more important than an HP subband. Also, an LP subband contains more transform coefficients than a DC subband, but fewer transform coefficients than an HP subband. Therefore, we propose to apply Random Permutation (RP) to the transform coefficients in an LP subband. RP offers a higher level of protection than RSI or RBF as RP allows for a higher number of possible combinations. However, RP comes with a decrease in coding efficiency since this scrambling technique reduces the effectiveness of entropy coding. In our experiments, we observed that the decrease in coding efficiency was limited (less than 6.6% for a worst case scenario). This will be discussed in more detail in Section 4.2.
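A minimal sketch of RP for the 15 LP coefficients of one macroblock is given below. As an assumption for illustration, the permutation is drawn from Python's `random.Random` seeded with a key; a real system would use a cryptographically secure keyed generator, and the function names are hypothetical.

```python
import math
import random


def _permutation(n, key):
    """Derive a pseudo-random ordering of n positions from the key."""
    order = list(range(n))
    random.Random(key).shuffle(order)
    return order


def permute_lp(lp_coeffs, key):
    """Random Permutation: reorder the LP coefficients of a macroblock."""
    order = _permutation(len(lp_coeffs), key)
    return [lp_coeffs[src] for src in order]


def unpermute_lp(scrambled, key):
    """Invert permute_lp by regenerating the same ordering from the key."""
    order = _permutation(len(scrambled), key)
    original = [None] * len(scrambled)
    for dst, src in enumerate(order):
        original[src] = scrambled[dst]
    return original


lp = list(range(1, 16))          # 15 LP coefficients of one macroblock
scrambled = permute_lp(lp, key=2009)
print(scrambled)
print(unpermute_lp(scrambled, key=2009) == lp)
print("possible orderings:", math.factorial(15))
```

The values themselves are untouched; only their positions change, which is why the number of candidate orderings per macroblock is 15!.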

3.2.3. HP and Flexbits Subbands

In our experiments, we have observed that the visual impact of a scrambled HP subband decreases as the resolution increases (as the content of the HP subband represents high-frequency information). Figure 3 shows two images taken from Foreman. The HP subband is scrambled using RP, once at QCIF and once at 4CIF resolution (Flexbits not shown).


Figure 3. Visual impact of a scrambled HP subband: (a) QCIF resolution and (b) 4CIF resolution.

As shown in Figure 3(b), the visual effect of a scrambled HP subband can hardly be seen at 4CIF resolution. The QCIF image in Figure 3(a) even shows that a face region with a sufficiently high resolution cannot be concealed adequately. Further, we have also observed that the application of RP to an HP subband significantly lowers the coding efficiency (in the order of 24% to 52%). When also taking into account that scrambled DC and LP subbands already alter the visual quality significantly, and the fact that the application of RP to HP subbands also requires a significant number of additional computations (as HP subbands contain significantly more compressed data than the DC and LP subbands), we propose not to scramble HP subbands. Following a similar reasoning, we also propose not to scramble Flexbits subbands.

4. Experimental Results

4.1. Test setup

We have implemented the proposed scrambling approach in the JPEG XR encoder available in the HD Photo Device Porting Kit (DPK) 1.0 provided by Microsoft. The video sequence used in our experiment is montinas_toni. This video sequence, part of the Surveillance Performance EValuation Initiative (SPEVI) dataset, has VGA resolution and a frame rate of 25 frames per second. The first eight seconds of the video sequence were used in our experiment. The average size of the face region in the video sequence is 6x6 macroblocks.



Figure 4. Privacy-protected surveillance video: (a) DC, (b) DC + LP, (c) DC + LP + HP, and (d) DC + LP + HP + Flexbits.

Figure 4 shows the visual effect of our scrambling approach for montinas_toni, varying the number of decoded subbands (cropped for visualization purposes).

4.2. Bit stream overhead analysis

Table 1 shows the bit stream overhead according to the tile size, varying the bit rate. For each bit rate, overhead is computed using as reference an image coded in spatial mode with no tiles. The second column of Table 1 represents the overhead when using the non-uniform tile layout as shown in Figure 1, while the other columns represent the overhead when using a uniform tile layout. For example, the label 1x1 MB refers to a uniform tile layout consisting of 40x30 tiles in VGA resolution. As shown in Table 1, the combined use of a small tile size and a uniform tile layout may significantly decrease the coding efficiency. This can be attributed to broken entropy coding, an increasing number of tile headers, and an increasing number of entries in the index table. Also, the overhead becomes higher as the bit rate decreases.

Table 1. Bit stream overhead according to the tile size

Bit rate (Kbps) / Tile grid | 9 tiles (%) | 1x1 MB (%) | 5x5 MB (%) | 10x10 MB (%)

Figure 5 shows the average bit stream overhead caused by the proposed scrambling approach. The overhead is shown for two cases: scrambling of the whole image and scrambling of the ROI (using the non-uniform tile layout).

Figure 5. Bit stream overhead introduced by scrambling.

As shown in Figure 5, in the worst case (i.e., at a bit rate of 629 Kbit/s), the overhead is approximately 6.6% when scrambling the whole image, while the overhead is about 0.34% when only scrambling the ROI.

4.3. Security Considerations

This section analyzes the level of protection offered by the proposed scrambling technique against a brute force attack. For one macroblock, the combined application of RSI and RBF at the level of the DC coefficient results in $2^{N+1}$ possible combinations ($N$ denotes the number of bits used to represent the fixed length part of the DC coefficient), while the application of RP to the LP coefficients results in a total of $15!$ possible combinations. Figure 6 shows the average number of bits assigned to the fixed length part of the DC coefficient. As such, the total number of combinations required to break the protection of a macroblock is equal to $(2^{N+1} + 15!)$.

Figure 6. Average number of bits used to represent the fixed length part of a DC coefficient.

The compressed video bit stream at 629 Kbit/s has the lowest amount of protection. A brute force attack at the level of a single macroblock requires evaluating $(2^{1.72} + 15!)$ combinations. Since the size of the ROI is equal to 6×6 macroblocks, a brute force attack at the level of the ROI requires evaluating $(2^{1.72})^{36} + (15!)^{36}$ combinations: $(2^{1.72})^{36}$ evaluations are required for the DC subband and $(15!)^{36}$ evaluations for the LP subband. As decoding and descrambling of the DC subband requires about 1.9 ms on a quad-core 2.0 GHz processor, the time needed to generate all possible face regions, considering the DC subband alone, is approximately equal to $2.3 \times 10^{12}$ hours. This number shows that the proposed scrambling approach provides a practical level of protection against a brute force attack (on the condition that the size of the ROI is sufficiently large).
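The brute force estimate can be checked with a short calculation. The sketch below reads the figures in the text as 2^1.72 DC combinations per macroblock, 36 macroblocks in the ROI, and 1.9 ms per decoding and descrambling attempt; the function names are illustrative.

```python
import math


def dc_brute_force_hours(bits_per_mb, roi_macroblocks, seconds_per_trial):
    """Time in hours needed to try every DC combination of the ROI."""
    trials = 2.0 ** (bits_per_mb * roi_macroblocks)
    return trials * seconds_per_trial / 3600.0


def lp_combinations_log10(roi_macroblocks, lp_coeffs_per_mb=15):
    """log10 of the number of LP orderings over the whole ROI."""
    return roi_macroblocks * math.log10(math.factorial(lp_coeffs_per_mb))


hours = dc_brute_force_hours(1.72, 36, 1.9e-3)
print(f"DC subband: about {hours:.2e} hours")
print(f"LP subband: about 10^{lp_combinations_log10(36):.0f} orderings")
```

The DC term alone lands on the order of 10^12 hours, and the LP term is many orders of magnitude larger still, so the DC subband is the weaker of the two protections.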

* Contact Person: Prof. Yong Man Ro (ymro@kaist.ac.kr)


Hosik Sohn, Dohyoung Lee, Wesley De Neve, Konstantinos N. Plataniotis, and Yong Man Ro, An objective and subjective evaluation of content-based privacy protection of face images in video surveillance systems using JPEG XR, Accepted for publication in Effective Surveillance for Homeland Security: Balancing Technology and Social Issues, Book Chapter, March 2013.

Hosik Sohn, Dohyoung Lee, Wesley De Neve, Konstantinos N. Plataniotis, Yong Man Ro, Contribution of Non-Scrambled Chroma Information in Privacy-Protected Face Images to Privacy Leakage, Proceedings of Digital Forensics and Watermarking 2011, USA, October 2011, pp. 453-467.

Hosik Sohn, Esla T. AnzaKu, Wesley De Neve, Yong Man Ro, and Konstantinos N. Plataniotis, Privacy Protection in Video Surveillance Systems using Scalable Video Coding, AVSS09

Hosik Sohn, Wesley De Neve, Yong Man Ro, Region-of-Interest Scrambling for Scalable Surveillance Video using JPEG XR, ACM Multimedia 2009