Video surveillance systems are omnipresent in public places. These
systems are often characterized by high-speed network connections, plenty
of storage capacity, and high computational power. Moreover, thanks to
continuously improving computer vision algorithms, video surveillance
systems are increasingly able to analyze and understand events of
interest. In video surveillance
systems, spatial resolution and visual quality are critical factors for
the performance of computer vision algorithms. Indeed, the use of
high-resolution and high-quality video content improves the overall
performance of computer vision algorithms targeting object detection,
identification, and tracking. A high spatial resolution and a high visual
quality are also important for legal reasons. For example, in the UK, the
minimum resolution of traffic Closed-Circuit Television (CCTV) cameras,
used for the detection of unlawful drivers, was recently the topic of a
legal debate. In addition, privacy concerns are also on the rise. People
are being monitored without having given their consent or without having
knowledge about these activities. The increasing use of high-resolution
surveillance cameras will pose even more threats to privacy. In
this research, we propose a video surveillance system using the
state-of-the-art JPEG Extended Range (JPEG XR) format for scalable image
coding. This coding format comes with a low computational complexity,
while offering a high image quality and a high flexibility of use in
diverse usage environments. In particular, JPEG XR can be seen as a
low-complexity alternative to JPEG 2000, which is frequently used in
current video surveillance systems. JPEG XR is expected to be ratified as
an international standard in the course of this year (formally denoted as
ISO/IEC 29199-2). Further, for the purpose of privacy protection, our
system detects and protects face regions, which are considered privacy
sensitive. Protection is realized using different scrambling techniques
operating in the transform domain, taking into account the scalability
provisions of JPEG XR.
2. Image coding using JPEG XR
2.1 Scalable Intra coding
In our surveillance system, each video frame is intra-coded using
JPEG XR. The main technical benefit of using JPEG XR as a video codec can
be found in its low computational complexity, while offering image
quality and scalability provisions that are, from a practical point of
view, similar to those of Motion JPEG 2000 and the Scalable High Intra
Profile of H.264/AVC Scalable Video Coding (SVC). The complexity advantage
is most pronounced with respect to JPEG 2000. For example, in JPEG 2000,
the wavelet transform is used as a global transform, operating at the
level of a tile, and typically requiring more memory bandwidth than the
4×4 block-based transforms of SVC and JPEG XR.
The transform used by JPEG XR is denoted as a two-stage hierarchical Lapped
Biorthogonal Transform (LBT). In frequency mode, a JPEG XR bit stream
offers support for both spatial and quality scalability, thanks to a
partitioning of the transform coefficients of a particular tile into four
subbands: the DC subband
(containing a single second-stage DC coefficient for each macroblock or
MB in the tile), the low-pass (LP) subband
(containing 15 second-stage transform coefficients for each MB), the
high-pass (HP) subband (containing the
significant part of the 240 first-stage transform coefficients of each
MB), and the Flexbits subband
(containing the refinement bits of the 240 first-stage transform
coefficients of each MB). Significant bits and refinement bits are also
computed for the DC and LP coefficients. However, in contrast to the 240
first-stage transform coefficients, the significant bits and refinement
bits of DC and LP coefficients are not stored in a separate subband.
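The per-macroblock coefficient counts above can be tallied for a whole frame. The following is a minimal sketch, assuming 16×16 macroblocks and a single channel; the function and constant names are illustrative and are not part of any JPEG XR implementation.

```python
MB_SIZE = 16  # assumption: one macroblock covers 16x16 samples

# Coefficients per macroblock and per channel, as listed in the text.
# The Flexbits subband carries refinement bits of the same 240 HP
# coefficients, so it adds no further coefficients.
COEFFS_PER_MB = {"DC": 1, "LP": 15, "HP": 240}


def subband_coefficient_counts(width, height):
    """Return the number of coefficients per subband for one channel."""
    mbs_wide = -(-width // MB_SIZE)   # ceiling division
    mbs_high = -(-height // MB_SIZE)
    num_mbs = mbs_wide * mbs_high
    return {name: n * num_mbs for name, n in COEFFS_PER_MB.items()}


# VGA (640x480) corresponds to 40x30 macroblocks.
counts = subband_coefficient_counts(640, 480)
```

For a VGA frame this gives 1,200 DC, 18,000 LP, and 288,000 HP coefficients per channel, which illustrates how little data the DC subband offers for scrambling.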
A trivial form of quality scalability can be realized by removing the Flexbits subbands, or part
thereof, from a JPEG XR bit stream. Spatial scalability is supported by
additionally removing the HP and LP subbands,
each time resulting in a reduction of the spatial resolution by a factor
of four along the horizontal and vertical axis. Intermediate resolutions
can be achieved by relying on client-side down sampling techniques,
enabling a complexity trade-off between the server, the bit stream
extractor, and the client. Dependent on the application targeted, spatial
scalability in JPEG XR can also be seen as coarse-grained quality
scalability when still displaying the adapted image at its original
resolution.
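The relation between the retained subbands and the decoded resolution can be sketched as follows. This is a minimal illustration of the factor-of-four reduction per axis described above; the table and function names are hypothetical.

```python
# Reduction factor per axis for each set of retained subbands.
# Dropping Flexbits only lowers quality; dropping HP, and then LP,
# each reduces the resolution by a factor of four per axis.
REDUCTION_PER_AXIS = {
    ("DC",): 16,
    ("DC", "LP"): 4,
    ("DC", "LP", "HP"): 1,
    ("DC", "LP", "HP", "Flexbits"): 1,
}


def decoded_resolution(width, height, subbands):
    """Return (width, height) obtained when decoding only `subbands`."""
    factor = REDUCTION_PER_AXIS[tuple(subbands)]
    return -(-width // factor), -(-height // factor)  # ceiling division


dc_only = decoded_resolution(640, 480, ["DC"])
dc_lp = decoded_resolution(640, 480, ["DC", "LP"])
```

For a VGA frame, decoding only the DC subband yields a 40×30 image, while decoding DC and LP yields 160×120.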
2.2 ROI Representation
In JPEG XR, an image may consist of several spatial tiles. Each spatial tile
represents a group of spatially adjacent macroblocks. Since there is no
coding dependency between spatially adjacent tiles (except when the
overlap transform is enabled), tiles can be used to represent an ROI. As
such, ROI extraction in JPEG XR can be realized by extracting spatial
tiles in the compressed domain, in both spatial and frequency mode. This
feature of JPEG XR is also known as fast tile extraction.
Figure 1. ROI representation in JPEG XR.
Two types of tile layouts are possible: a uniform and a non-uniform tile
grid. In the uniform tile layout, each tile has the same width and
height, while the non-uniform layout permits the use of tiles with
different widths and heights (tiles on the same row still need to have
the same height, while tiles on the same column still need to have the
same width). The non-uniform tile layout is illustrated in Figure 1. Note
that the use of a fine-grained tile grid may significantly decrease the
coding efficiency. This will also be discussed
in Section 4 for our use case.
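A non-uniform grid that isolates a single ROI can be derived from the ROI position alone. The sketch below assumes macroblock-aligned coordinates; the function name and the ROI position in the example are hypothetical. It returns column widths and row heights, so that tiles in the same column share a width and tiles in the same row share a height.

```python
def roi_tile_grid(mbs_wide, mbs_high, roi_x, roi_y, roi_w, roi_h):
    """Return (column_widths, row_heights) in macroblocks.

    The ROI becomes exactly one tile; the remaining area is covered by
    the tiles to its left/right and above/below.
    """
    if roi_w <= 0 or roi_h <= 0:
        raise ValueError("ROI must have a positive size")
    if roi_x < 0 or roi_y < 0 or roi_x + roi_w > mbs_wide or roi_y + roi_h > mbs_high:
        raise ValueError("ROI must lie inside the image")

    def split(total, start, size):
        # Sizes before, inside, and after the ROI; empty parts are dropped.
        return [n for n in (start, size, total - start - size) if n > 0]

    return split(mbs_wide, roi_x, roi_w), split(mbs_high, roi_y, roi_h)


# VGA frame (40x30 macroblocks) with a 6x6 ROI at a made-up position.
cols, rows = roi_tile_grid(40, 30, roi_x=17, roi_y=5, roi_w=6, roi_h=6)
num_tiles = len(cols) * len(rows)
```

With an ROI that touches no image border, the result is a 3×3 grid of nine tiles, which keeps the number of tile headers and index table entries small.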
3.1 Proposed Encoder Architecture
Figure 2 illustrates the architecture of our modified JPEG XR encoder. In
particular, before entropy coding, scrambling is applied to the transform
coefficients in the DC and LP subbands. The
information stored in the HP and Flexbits subbands is not altered due to its limited impact on
the visual quality. This will be explained in more detail in Section 3.2.
Figure 2. Architecture of our modified JPEG XR encoder.
As for the decoder, we assume that only authorized clients have full access to
the original surveillance video content. A detailed description of the
actual bit stream extraction, decoding, and descrambling processes, and of key
management, is omitted due to space limitations.
3.2. Subband-Adaptive Scrambling
As shown in Figure 2, a subband-adaptive approach
is followed in order to scramble privacy-sensitive face regions. This
approach is motivated by the following observation: when scrambling a
particular subband, a trade-off exists between
the visual importance of the subband (the
information in the DC subbands is for instance
visually more important than the information in the LP subbands), the available amount of coded data in the subband (the number of coefficients increases when
going from the DC subband to the HP subband, and hence the amount of compressed data),
the level of security offered by the scrambling technique used, the
effect on the coding efficiency, and the computational complexity of the
scrambling technique used.
3.2.1. Scrambling for DC subbands
In a DC subband, a limited amount of data is
available for the purpose of scrambling. Indeed, each macroblock in a
tile only contributes a single DC coefficient to the DC subband of that particular tile. Therefore, we
propose to apply scrambling at the level of individual bits in order to
ensure a sufficient level of protection. Specifically, we propose to
apply both Random Sign Inversion (RSI) and Random Bit Flipping (RBF) to
DC subbands. RSI pseudo-randomly flips the sign
of DC coefficients as follows:
where $D$ denotes the data to be scrambled and where $D_e$ denotes the
pseudo-randomly sign-flipped data. As the sign of DC coefficients is
signaled using a simple Boolean flag in JPEG XR, the use of RSI does not
affect the coding efficiency. RBF flips bits by applying an XOR operation
between input bits and bits belonging to a pseudo-random data stream:
In Equation 2, $B$ denotes the data to be encrypted while $B_e$
denotes the encrypted data. Further, $b_i$ denotes the $i$th
bit of $B$ and $R$ denotes the set of pseudo-random
bits. In JPEG XR, each DC coefficient is partitioned into a significant
part and a remainder part (i.e., DC refinement bits). The significant
part is again partitioned into a level value and level refinement bits.
The level value is signaled using variable length codes, while both DC
refinement bits and level refinement bits are signaled using fixed length
codes. RBF is only applied to the DC refinement bits and the level
refinement bits. By combining the low-complexity RSI and RBF scrambling
techniques, the coefficients in a DC subband
can be significantly altered without affecting the coding efficiency,
which is an important characteristic for mobile devices.
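The two DC scrambling operations can be illustrated with a short sketch. It is not the encoder implementation: Python's `random.Random` merely stands in for a keyed pseudo-random generator (it is not cryptographically secure), and the function names and the representation of fixed-length fields as `(value, num_bits)` pairs are assumptions made for illustration.

```python
import random


def random_sign_inversion(coeffs, key):
    """RSI: flip the sign of each coefficient when the next random bit is 1."""
    rng = random.Random(key)  # stand-in for a keyed, secure PRNG
    return [-c if rng.getrandbits(1) else c for c in coeffs]


def random_bit_flipping(fields, key):
    """RBF: XOR each fixed-length field with pseudo-random bits.

    `fields` is a list of (value, num_bits) pairs, e.g. the DC refinement
    bits or level refinement bits of a coefficient.
    """
    rng = random.Random(key)
    scrambled = []
    for value, num_bits in fields:
        mask = rng.getrandbits(num_bits) if num_bits > 0 else 0
        scrambled.append((value ^ mask, num_bits))  # field length is unchanged
    return scrambled


dc = [12, -7, 0, 33]
dc_scrambled = random_sign_inversion(dc, key=1234)

refinement = [(5, 3), (0, 0), (9, 4)]
refinement_scrambled = random_bit_flipping(refinement, key=1234)
```

Both operations are their own inverse when the same key is used, and neither changes the number of bits to be coded, which is consistent with the observation that the coding efficiency is not affected.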
3.2.2. Scrambling for LP subbands
An LP subband is visually less important than a DC
subband, but visually more important than an HP
subband. Also, an LP subband
contains more transform coefficients than a DC subband,
but fewer transform coefficients than an HP subband.
Therefore, we propose to apply Random Permutation (RP) to the transform
coefficients in an LP subband. RP offers a
higher level of protection than RSI or RBF as RP allows for a higher
number of possible combinations. However, RP comes with a decrease in
coding efficiency since this scrambling technique disturbs the coefficient
statistics exploited by entropy coding.
In our experiments, we observed that the decrease in coding efficiency
was limited (less than 6.6% in the worst case). This will be
discussed in more detail in Section 4.2.
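A keyed permutation of the 15 LP coefficients of a macroblock, together with its inverse, might look as follows. This is a sketch under the same assumption as before: `random.Random` is only a placeholder for a keyed pseudo-random generator, and the function names are hypothetical.

```python
import random


def permute_lp(lp_coeffs, key):
    """RP: reorder the LP coefficients of one macroblock."""
    rng = random.Random(key)  # stand-in for a keyed, secure PRNG
    order = list(range(len(lp_coeffs)))
    rng.shuffle(order)
    return [lp_coeffs[i] for i in order]


def unpermute_lp(scrambled, key):
    """Invert permute_lp by regenerating the same order from the key."""
    rng = random.Random(key)
    order = list(range(len(scrambled)))
    rng.shuffle(order)
    restored = [None] * len(scrambled)
    for dst, src in enumerate(order):
        restored[src] = scrambled[dst]
    return restored


lp = list(range(100, 115))  # 15 made-up LP coefficient values
lp_scrambled = permute_lp(lp, key=42)
lp_restored = unpermute_lp(lp_scrambled, key=42)
```

Since 15 coefficients admit 15! = 1,307,674,368,000 orderings, a permutation offers far more combinations per macroblock than sign or bit flipping, at the cost of changing the order in which coefficients reach the entropy coder.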
3.2.3. HP and Flexbits Subbands
In our experiments, we have observed
that the visual impact of a scrambled HP subband
decreases as the resolution increases (as the content of the HP subband represents high-frequency information).
Figure 3 shows two images taken from "Foreman". The HP subband
is scrambled using RP, once at QCIF and once at 4CIF resolution (Flexbits not shown).
Figure 3. Visual impact of a scrambled HP subband:
(a) QCIF resolution and (b) 4CIF resolution.
As shown in Figure 3(b), the visual
effect of a scrambled HP subband can hardly be
seen at 4CIF resolution. The QCIF image in Figure 3(a) even shows that a
face region with a sufficiently high resolution cannot be concealed
adequately. Further, we have also observed that the application of RP to an
HP subband significantly lowers the coding
efficiency (in the order of 24% to 52%). When also taking into account
that scrambled DC and LP subbands already alter
the visual quality significantly, and the fact that the application of RP
to HP subbands also requires a significant
number of additional computations (as HP subbands
contain significantly more compressed data than the DC and LP subbands), we propose not to scramble HP subbands. Following a similar reasoning, we also
propose not to scramble Flexbits subbands.
4. Experimental Results
4.1. Test setup
We have implemented the proposed scrambling approach in the JPEG XR encoder
available in the HD Photo Device Porting Kit (DPK) 1.0 provided by
Microsoft. The video sequence used in our experiment is "montinas_toni". This video sequence, part of the
Surveillance Performance EValuation Initiative
(SPEVI) dataset, has VGA resolution and a frame rate of 25 frames per
second. The first eight seconds of the video sequence were used in our
experiment. The average size of the face region in the video sequence
is 6×6 macroblocks.
Figure 4. Privacy-protected surveillance video: (a) DC,
(b) DC + LP, (c) DC + LP + HP, and (d) DC + LP + HP + Flexbits.
Figure 4 shows the visual effect of our scrambling approach for "montinas_toni", varying the number of decoded subbands (cropped for visualization purposes).
4.2. Bit stream overhead analysis
Table 1 shows the bit stream overhead as a function of the tile size for
different bit rates. For each bit rate, the overhead is computed using as reference an
image coded in spatial mode with no tiles. The second column of Table 1
represents the overhead when using the non-uniform tile layout as shown
in Figure 1, while the other columns represent the overhead when using a
uniform tile layout. For example, the label 1x1 MB refers to a uniform
tile layout consisting of 40x30 tiles in VGA resolution. As shown in
Table 1, the combined use of a small tile size and a uniform tile layout
may significantly decrease the coding efficiency. This can be attributed
to entropy coding being interrupted at tile boundaries, an increasing
number of tile headers, and an
increasing number of entries in the index table. Also, the overhead
becomes higher as the bandwidth decreases.
Table 1. Bit stream overhead according to the tile size
Bit rate (Kbps) | 9 tiles (%) | 1×1 MB (%) | 5×5 MB (%) | 10×10 MB (%)
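The number of tiles behind each uniform layout in Table 1 follows directly from the frame and tile sizes. The sketch below assumes 16×16 macroblocks; the function name is illustrative.

```python
def uniform_tile_count(width, height, tile_mbs, mb_size=16):
    """Return (tiles_wide, tiles_high) for square tiles of tile_mbs macroblocks."""
    mbs_wide = -(-width // mb_size)        # ceiling division
    mbs_high = -(-height // mb_size)
    tiles_wide = -(-mbs_wide // tile_mbs)
    tiles_high = -(-mbs_high // tile_mbs)
    return tiles_wide, tiles_high


# Uniform layouts of Table 1 at VGA resolution.
layouts = {size: uniform_tile_count(640, 480, size) for size in (1, 5, 10)}
```

At VGA resolution this gives 40×30 = 1,200 tiles for the 1×1 MB layout, 8×6 = 48 tiles for 5×5 MB, and 4×3 = 12 tiles for 10×10 MB. Each tile contributes its own header and index table entry.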
Figure 5 shows the average bit stream overhead caused by the proposed
scrambling approach. The overhead is shown for two cases: scrambling of
the whole image and scrambling of the ROI (using the non-uniform tile
layout).
Figure 5. Bit stream overhead introduced by scrambling.
As shown in Figure 5, in the worst case (i.e., at a bit rate of 629 Kbit/s),
the overhead is approximately 6.6% when scrambling the whole image, while
the overhead is about 0.34% when only scrambling the ROI.
4.3. Security Considerations
This section analyzes the level of
protection offered by the proposed scrambling technique against a brute
force attack. For one macroblock, the combined application of RSI and RBF
at the level of the DC coefficient results in $2^{N+1}$ possible combinations ($N$ denotes the number of bits used
to represent the fixed length part of the DC coefficient), while the
application of RP to the LP coefficients results in a total of $15!$ possible combinations. Figure 6 shows the average
number of bits assigned to the fixed length part of the DC coefficient.
As such, the total number of combinations required to break the
protection of a macroblock is equal to $(2^{N+1} + 15!)$.
Figure 6. Average number of bits used to represent
the fixed length part of a DC coefficient.
The compressed video bit stream at
629 Kbit/s has the lowest amount of protection. A brute force attack at
the level of a single macroblock requires evaluating $(2^{1.72} + 15!)$
combinations. Since the size of the ROI is equal to 6×6 macroblocks, a
brute force attack at the level of the ROI requires evaluating
$(2^{1.72})^{36} + (15!)^{36}$ combinations: $(2^{1.72})^{36}$
evaluations are required for the DC subband and
$(15!)^{36}$ evaluations for the LP subband.
As decoding and descrambling of the DC subband
require about 1.9 ms on a quad-core 2.0 GHz
processor, the time needed to generate all possible face regions is
approximately equal to $2.3 \times 10^{12}$ hours. This number shows that
the proposed scrambling approach provides an adequate level of protection
against a brute force attack (on the condition that the size of the ROI
is sufficiently large).
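The brute force estimate can be reproduced with a few lines of arithmetic. The sketch below assumes, as the figures above suggest, that the time estimate covers the DC subband only: 1.72 bits per macroblock, 36 macroblocks, and 1.9 ms per trial. The function name is illustrative.

```python
import math


def brute_force_hours(bits_per_mb, num_mbs, seconds_per_trial):
    """Hours needed to try every DC combination of an ROI."""
    trials = 2.0 ** (bits_per_mb * num_mbs)
    return trials * seconds_per_trial / 3600.0


# DC subband: (2^1.72)^36 trials at 1.9 ms each.
dc_hours = brute_force_hours(1.72, 36, 1.9e-3)

# LP subband: (15!)^36 combinations, expressed as a power of ten.
lp_log10 = 36 * math.log10(math.factorial(15))
```

The DC term alone evaluates to roughly 2.3 × 10^12 hours, and the LP term adds on the order of 10^436 further combinations, so the LP permutation dominates the total search space.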