COMPRESSION TECHNIQUES
VIDEO SYSTEM
The video system is concerned with capturing, compressing, and transmitting video on one thread, and with receiving, decompressing, and displaying it on the other. Because of the system's target bit rate, image compression is the main focus of the video system. To achieve a good compression ratio, image redundancy must be removed.
Digital video :
Digital video is well established in the broadcast industry, and with the emergence of powerful compression techniques it is now appearing in consumer products. However, the digital video used in television studios is quite different from what one might find in a personal computer. This becomes obvious when asking what essential characteristics the various forms of digital video share.
In analog television broadcast, analog video is a sequence of frames encoded as a continuous electrical signal, occurring at some constant rate, the frame rate. With digital video, we have a sequence of frames which are now digital images, perhaps compressed in some manner. The key concept that must be associated with digital video, however, is the notion of timing.
Digital video is not just any sequence of images, but a sequence with accompanying information indicating how long each frame is to be displayed (whereas in analog video, frame duration is constant). Frame duration in digital video is not always constant; many digital video sequences have varying frame rates.
Digital video representation:
Although there are many different formats for digital video, these formats can be compared along certain dimensions. These dimensions include:
a-Color model:
Digital color images require some form of color encoding. A procedure for specifying colors is called a color model (or color space).
i) RGB:
Colors are represented by a numeric triple specifying the red (R), green (G), and blue (B) intensities. These numeric values can be easily mapped to voltages for the red, green, and blue guns of color CRTs.
ii) HSB:
Colors are represented by a triple of hue (H) (the dominant color of a sample), saturation (S) (the intensity of the color), and brightness (B) (the amount of gray in the color; brighter colors have less gray).
CHAPTER 16
iii) YUV:
A color model used within the television industry. Y represents luminance and can be thought of as the black-and-white portion of a video signal. U and V are color difference signals (they involve the differences between B and Y, and between R and Y, respectively). U and V represent the color portion of a video signal and are called chrominance, or simply chroma.
YUV is well suited to video broadcast since it makes efficient use of bandwidth. The human eye is more sensitive to changes in luminance than to changes in color, so the available bandwidth is better utilized by allocating less of it to chroma and more to luminance.
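As a hedged sketch of this color model, the conversion from normalized RGB to YUV can be written with the standard BT.601 luminance weights (the U and V scale factors shown are the classical analog-video values):

```python
def rgb_to_yuv(r, g, b):
    """Convert normalized RGB (components in 0..1) to YUV.

    Uses the BT.601 luminance weights; U and V are scaled
    color-difference (chroma) signals."""
    y = 0.299 * r + 0.587 * g + 0.114 * b   # luminance (black-and-white portion)
    u = 0.492 * (b - y)                     # blue-difference chroma
    v = 0.877 * (r - y)                     # red-difference chroma
    return y, u, v
```

Note that any gray input (r = g = b) yields zero chroma, which is exactly why less bandwidth can be spent on U and V.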
b-Sampled analog formats:
Digital video frames are generally obtained in two ways: by synthesis, usually by computer program, or by sampling an analog video signal. The sampling rate is crucial in specifying a digital format for sampled analog video, since its value determines storage requirements and data transfer rates. Sampling theory gives a lower limit on the frequency at which to sample in order to reconstruct a signal properly. This limit, the Nyquist rate, is twice the highest frequency within the signal. Sampling below the Nyquist rate introduces frequency interference that leads to distortion in the reconstructed image.
c-Sample size & quantization:
Sample size is the number of bits used to represent sample values, while quantization refers to the mapping from the continuous range of the analog signal to discrete sample values.
Generally, digital video representations can be divided into two categories: high data rate formats and low data rate formats. High data rate formats have little or no compression; picture quality and ease of processing are of primary importance. Low data rate formats are intended for real-time applications like video phones, video conferencing, and television broadcasting, and it is here that compression of the transmitted video signal is truly needed.
Image compression, a global view:
Compression is the key to low data rate digital video, and there has been much recent work on compression techniques and algorithms. Methods for video compression can be compared along the following lines:
1) Lossy versus lossless compression:
In a lossy technique, the reconstructed image does not match the original exactly but differs slightly. An example of a lossy technique is any transform technique in which an image is represented by the coefficients of a mathematical transform, with some of those coefficients discarded; this loses some information from the original image, but yields a high compression ratio.
2) Real-time compression:
Real-time compression and decompression techniques are those employed in real-time applications like video conferencing. If digital video is stored in compressed form, then only real-time decompression is needed for playback, as in the case of CD drives.
3) Interframe versus intraframe compression:
Interframe compression exploits the fact that successive video frames tend to be similar: it removes this temporal redundancy by simply differencing successive frames and, moreover, predicts and interpolates intermediate frames from independently coded key frames (a key frame is a frame that differs significantly from the others). In intraframe compression there is no prediction or interpolation, and all frames are coded independently.
Current approaches to compression techniques:
Image compression addresses the problem of reducing the amount of data required to represent a digital image. The main basis of the reduction process is the removal of redundant data. Interest in image compression dates back more than 25 years; the initial focus of research in this field was on the development of analog methods for reducing video transmission bandwidth, a process called bandwidth compression.
Over the years the need for image compression has grown steadily. It is currently recognized as an "enabling technology": video compression makes possible many applications, from faxes to the control of remotely piloted vehicles in military and other large-scale applications.
As we have said, compressing video data efficiently is one of the most important functions in multimedia applications. Most existing video coding techniques, such as H.261 and MPEG, have been designed to compress video data using the discrete cosine transform (DCT), differential pulse code modulation (DPCM), and motion compensation (MC). One of the major problems of MC/DCT-based techniques is their high computational complexity, which inhibits real-time, software-only implementation. As a result, the cost of such video applications is high, and their implementation and maintenance are difficult.
Recently, realizing the video coder in software, to reduce both the cost and complexity of multimedia systems, has been widely discussed. The popular video coder (PVC) was proposed to achieve this goal; it was applied to implement video phone systems and showed good performance. However, the coding speed and compression ratio of the PVC are not high enough to apply it to higher resolution and/or higher speed video applications, such as video conferencing systems. This low compression ratio is the motivation for using another technique, the popular video coder 2 (PVC2).
The PVC2 improves the performance of the Huffman coder of the PVC and applies an adaptive resolution reduction technique for intraframe coding. With the aid of these new techniques, the PVC2 compresses video data more efficiently than the PVC does. More interestingly, the coding speed of the PVC2 is much faster than that of other well-known software video coders, such as Video for Windows and the MPEG coder.
1) Moving Picture Experts Group (MPEG):
The Moving Picture Experts Group (MPEG) is an ISO working group that has developed a video compression standard based on interframe coding and Discrete Cosine Transform (DCT) techniques. The standard also addresses audio encoding. The initial standard for video, known as MPEG-1, leads to data rates of about 1 Mbit/s. A more recent proposal, MPEG-2, is intended for broadcast-quality video at data rates between 2 and 15 Mbit/s. One problem with interframe coding is that reverse playback and random access are difficult, since every frame depends on another.
2) Joint Photographic Experts Group (JPEG):
The Joint Photographic Experts Group (JPEG) standard is a compression technique intended for still images. It can be used for digital video compression by treating each frame as an image to be compressed independently (this use of the technique is called Motion JPEG).
For a given frame rate and resolution, JPEG-compressed video will have a higher data rate than interframe techniques such as MPEG-1 and MPEG-2; but since frames are compressed independently, it is an easier matter to rearrange the order of frames and to play back in reverse or at variable rates.
IMAGE COMPRESSION
What Is Image Compression?
Image compression is the process of removing the redundancy in an image. This redundancy comes from the fact that video images have very low variation in their content (i.e. the colors in an image), so every pixel is related to the surrounding pixels, which means that a respectable amount of information about the neighboring pixels is found in every pixel.
Types Of Redundancy:
For a still image we have one type of redundancy, called spatial redundancy: redundancy along the horizontal and vertical axes, which appears as the repetition of pixel colors over large areas of the image.
For moving pictures we have, in addition to spatial redundancy, another type called temporal redundancy, or interframe redundancy. This type arises from the fact that only very small changes occur from one frame to the next.
1) Temporal redundancy removal:
Temporal redundancy can be removed by simply differencing between frames.
Redundancy Removal By Direct Differencing:
The simplest way is to apply direct differencing between corresponding pixels in the current and previous frames.
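As a minimal sketch (using plain Python lists of rows as frames; a real system would operate on pixel arrays), direct differencing and its reconstruction at the receiver might look like:

```python
def frame_difference(current, previous):
    """Difference frame: pixel-wise subtraction of the previous frame
    from the current one. For similar frames most entries are near
    zero, so the result compresses well."""
    return [[c - p for c, p in zip(crow, prow)]
            for crow, prow in zip(current, previous)]

def reconstruct(previous, diff):
    """Receiver side: add the transmitted difference back onto the
    previously displayed frame."""
    return [[p + d for p, d in zip(prow, drow)]
            for prow, drow in zip(previous, diff)]
```

For example, two nearly identical 2x2 frames produce a difference frame that is mostly zeros, and adding it back recovers the current frame exactly.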
N.B.:
a) The first frame, and subsequent refresh frames, are encoded and transmitted as-is, to overcome the effect of noise.
b) The purpose of removing the redundancy is to minimize the bandwidth; the price paid, of course, is added complexity at the transmitter and receiver. Significant time may also be spent on the process if complex motion estimation techniques are implemented.
Basic Concept:
First, frames are subdivided into blocks whose dimensions are matched with the other block dimension choices in the system. The process is to find the closest match for a block of the current frame in another frame. The difference block is then calculated as the difference between the current block and its closest match, and the motion vector (dx, dy) is produced. This information is used at the receiver to reproduce the image. The mismatch function M(p, q) describes the amount of mismatch between two blocks p and q.
Block Search Algorithm:
Block searching involves trying a set of blocks in a certain region (called the search window) and selecting the best match. An example of the algorithms used:
Full Search:
The algorithm can be summarized as follows:
1) For a given block in the current frame, construct a search window in the previous frame and scan all possible blocks at a 1-pixel step.
2) For each block, calculate the mismatch function.
3) Select the block with minimum mismatch.
4) Calculate the motion vector as (dx, dy) = (x_current - x_prev, y_current - y_prev).
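The four steps above can be sketched as follows. The mismatch function M(p, q) is not specified in the text, so the sum of absolute differences (SAD), a common choice, is assumed; the motion-vector sign convention (offset of the matched block in the previous frame relative to the current block) is also an illustrative assumption:

```python
def sad(block_a, block_b):
    """Assumed mismatch function M(p, q): sum of absolute differences."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
                          for a, b in zip(ra, rb))

def get_block(frame, x, y, n):
    """Extract the n x n block whose top-left corner is (x, y)."""
    return [row[x:x + n] for row in frame[y:y + n]]

def full_search(cur, prev, x, y, n, w):
    """Full search: scan every candidate block in a (2w+1) x (2w+1)
    window of the previous frame at 1-pixel steps, and return the
    offset (dx, dy) of the best match together with its mismatch."""
    target = get_block(cur, x, y, n)
    best = (0, 0, float('inf'))
    for dy in range(-w, w + 1):
        for dx in range(-w, w + 1):
            px, py = x + dx, y + dy
            # keep the candidate block inside the previous frame
            if 0 <= px <= len(prev[0]) - n and 0 <= py <= len(prev) - n:
                m = sad(target, get_block(prev, px, py, n))
                if m < best[2]:
                    best = (dx, dy, m)
    return best
```

The receiver can then reproduce the block by copying it from the previous frame at the signalled offset and adding the transmitted difference block.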
Search Sequence:
Considering a block X in the current frame, the blocks in its search window (in the previous frame) can be labeled as:
UL | U  | UR
L  | X  | R
BL | B  | BR
The eight search blocks around the search point are checked in two cycles:
Basic cycle: X (only in the first stage), U, R, B, L
Secondary cycle: UR, BR, BL, UL
When speed is favored, the search may be limited to the 5 basic-cycle tests only (X plus the 4 neighbors U, R, B, L).
Timing:
Timing information is hardware and platform dependent and is affected significantly by changing any of the parameters described above. It also varies inversely with the frame rate. Finally, processing time changes from one frame to the next depending on the number of search cuts that occur and their location in the search tree.
2)Spatial redundancy removal :
This type of redundancy can be removed by the use of transform techniques.
What Is an Image Transform?
Due to the high repetition of pixel colors, we need a mathematical technique to transform the matrix representing the image's color values into another domain, in which the transformed matrix represents not the color values themselves but the variations in the pixel colors. One of the most common transforms is the Discrete Cosine Transform (DCT); it is widely used by very popular compression techniques, like JPEG for still images and MPEG for moving pictures, as the core of their compression algorithms.
Some other transforms, like the Discrete Fourier Transform (DFT) and the Discrete Sine Transform (DST), are also used, but not as widely as the DCT, owing to the complexity of computation and the large computation time associated with them.
Image compression using transformation:
The transformed image is defined as
[V] = [C] [U] [C]^T
where [V] is the transformed image matrix,
[U] is the original image matrix,
[C] is the transform matrix.
To get the [U] matrix back:
[U] = [C]^-1 [V] ([C]^T)^-1
How to choose the [C] matrix:
From the above equations and the previous discussion, we find that the [C] matrix has to satisfy two essential conditions:
1) It gives a transformed matrix that represents the variations of the pixel colors of the original matrix, not their values.
2) It should be a real and orthogonal matrix, so that [C]^-1 = [C]^T; hence the [U] matrix can be obtained back from the transformed matrix in a short time, from [U] = [C]^T [V] [C], since it is faster to take the transpose of a matrix than to compute its inverse. We find that the DCT satisfies these two conditions.
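A hedged sketch of these ideas: the code below builds the orthonormal DCT-II basis matrix [C] (one standard normalization; others exist), applies [V] = [C][U][C]^T, and inverts with [U] = [C]^T[V][C], relying only on the transpose rather than a matrix inverse:

```python
import math

def dct_matrix(n):
    """n x n orthonormal DCT-II basis matrix [C]; its rows are
    orthonormal, so the inverse of [C] is simply its transpose."""
    c = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        c.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                  for i in range(n)])
    return c

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def transpose(m):
    return [list(r) for r in zip(*m)]

def dct2(u):
    """[V] = [C] [U] [C]^T"""
    c = dct_matrix(len(u))
    return matmul(matmul(c, u), transpose(c))

def idct2(v):
    """[U] = [C]^T [V] [C] -- valid because [C] is orthogonal."""
    c = dct_matrix(len(v))
    return matmul(matmul(transpose(c), v), c)
```

The top-left coefficient of [V] is the "DC" term, proportional to the average intensity; the remaining coefficients describe variations, and discarding the small ones gives the lossy compression discussed earlier.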
Color Quantization Of Images:
Transmitting a true-color image, in which the red, green, and blue components of each pixel are represented by 8 bits each, over a limited bandwidth is inefficient. Hence the need to choose a small number of colors that describe the image and reflect subjective image quality. This set of available colors is called a color palette. Since images usually contain a wide range of colors, these must be quantized by a palette of limited size. The color quantization problem is considered in two parts:
1) the selection of an optimal color palette;
2) the optimal mapping of each pixel of the image to a color from the palette.
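Part 2), the pixel mapping, can be sketched as a nearest-neighbor search in RGB space. This is a simple illustrative criterion (squared Euclidean distance), not the tree-structured method the text goes on to describe, which exists precisely to make this search cheaper:

```python
def nearest_palette_index(pixel, palette):
    """Map one (r, g, b) pixel to the index of the closest palette
    entry, using squared Euclidean distance in RGB space."""
    return min(range(len(palette)),
               key=lambda i: sum((p - q) ** 2
                                 for p, q in zip(pixel, palette[i])))

def map_image(image, palette):
    """Pixel mapping: replace each pixel by its palette index."""
    return [[nearest_palette_index(px, palette) for px in row]
            for row in image]
```

The brute-force cost is proportional to palette size for every pixel, which is what makes tree-structured palettes attractive for large images.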
Algorithm:
There is an algorithm for the design of tree-structured color palettes that incorporates performance criteria reflecting subjective evaluation of image quality. Tree-structured color palettes greatly reduce the computational requirements of the palette design and pixel mapping tasks while allowing colors to be properly allocated to densely populated areas of the color space. The algorithm produces higher quality displayed images and requires fewer computations than previous methods.
What is a palette? And what is the need for it?
Video monitors display color images by modulating the intensities of the three primary colors (red, green, and blue) of each pixel in the image. In a digitized color image, each primary color is usually quantized with 8 bits of resolution in order to eliminate distinguishable quantization steps in luminance, color hue, and color saturation. Thus, full-color display systems use 24 bits to specify the color of each pixel on the screen. An alternative approach, used by many currently available displays, is to provide a limited number of bits, such as 8, for specifying the color of each pixel. Each of these 2^8 = 256 values is then used as an index into a user-defined table of colors; each entry in the table contains a 24-bit value which specifies the color's red, green, and blue components. In this way the user is allowed to select a small subset of colors, called a palette, from the full range of 2^24 = 16777216 colors. The drawback of this scheme is that it restricts the number of colors which may be displayed simultaneously.
Since natural images typically contain a large number of distinguishable colors, displaying such images with a limited palette is difficult.
Palette Design and Pixel Mapping Algorithm:
The approach we take is structured around algorithms performing two tasks. The first task, called color palette design, selects the best possible set of colors for a particular image. The second task, called pixel mapping, associates each pixel of the image with a color from this palette to yield the highest quality image.
A number of approaches have been suggested for the design of color palettes involving iterative refinement of some initial palette. However, these algorithms suffer from the disadvantage of being computationally intensive. In many applications, the efficiency of the algorithms performing color palette design and pixel mapping is very important. This is particularly true of applications involving large image databases, in which images are stored in some common format but must be displayed on monitors with different limitations on color palette size. In such an environment, color palette design and pixel mapping must be performed at the time the image is displayed, which is the case in a video phone. This makes computational efficiency of critical importance.
Even if the color palette is transmitted with each image, standard image coding techniques require that the pixel mapping be performed locally at the time each image is displayed. This is because standard image coding techniques exploit the high correlation between the primary color values of neighboring pixels to reduce the transmission bit rate; the color palette indices of neighboring pixels are not highly correlated and therefore cannot be coded with high efficiency. Thus, even if a color palette of the correct size is provided with each stored image, the pixel mapping function must be performed after the image is retrieved from the database and decoded into a full-color format.
Quantizers:
A quantizer is the means of converting received analog data into digital data. The main idea of a quantizer is to convert data into a fixed number of levels. A quantizer may be implemented in either software or hardware, and may be either uniform or non-uniform. In a non-uniform quantizer the distances between the levels are not equal. An example of a quantizer is the analog-to-digital circuit in which analog signals are converted to a fixed number of levels.
One problem we face with quantizers is noise due to the difference between the true level of the signal and the quantized level; this noise is called granular noise.
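A minimal sketch of a uniform quantizer (mid-tread, levels centered on zero; the step size and level count are illustrative parameters). Values beyond the outermost levels are clipped, which is the amplitude-overload counterpart of the granular noise just described:

```python
def uniform_quantize(x, step, levels):
    """Uniform mid-tread quantizer: map x to the nearest of `levels`
    output levels spaced `step` apart, centered on zero.

    Within range, the error (granular noise) is at most step/2;
    out-of-range inputs are clipped (amplitude overload)."""
    q = round(x / step)
    half = levels // 2
    q = max(-half, min(half - 1, q))   # clip to the available levels
    return q * step
```

For example, with step 0.1 and 8 levels, an input of 0.26 is mapped to 0.3 (error under step/2), while an input of 10.0 is clipped to the top level.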
AUDIO SYSTEM
In many applications we transmit a large amount of data (for good quality speech using PCM we need a data rate of 64 kb/s), and this is exaggerated when we are transmitting more than speech alone (as in video phone applications). But we are limited by the 4 kHz channel bandwidth, which cannot bear this high data transmission rate; we also face limitations concerning the storage medium required to keep this large amount of data for future reuse.
Compression techniques serve as a solution to this problem.
Audio compression techniques :
Prediction and transform operations serve to remove the redundancy in an input waveform, and thus permit bit-rate-efficient digitization.
Speech waveform redundancy is exploited in time domain operations to realize straightforward reductions in bit rate for a specified quality of digitization.
Differential pulse code modulation (DPCM) coders, which are based on quantizing a prediction error signal, are important examples of predictive coding systems.
DPCM systems with one-bit quantizers constitute an important subclass called delta modulation (DM). Transform coding (block quantization) techniques use frequency domain approaches for redundancy removal.
These various compression techniques will now be discussed in more detail.
Differential pulse code modulation (DPCM) :
By representing a correlated waveform in terms of appropriate difference samples, or prediction error samples, one can achieve an increased SNR at a given bit rate or, equivalently, a reduction of bit rate for a given SNR requirement. This can be appreciated, at least qualitatively, for the simple but important case where the correlation between adjacent samples approaches unity.
In this case, assume that the encoder represents the waveform as a succession of adjacent-sample differences x(n) - x(n-1), that these difference samples are quantized for transmission, and that a decoder recovers an approximation to the input essentially by integrating the quantized adjacent-sample differences.
In general, the quantizer input of a DPCM coder is a prediction error, or difference signal, d(n) = x(n) - x`(n), where x`(n) is a prediction of x(n).
The simplest DPCM predictor is the unit delay operator defined by x`(n) = x(n-1). In this case the DPCM decoder is simply a perfect integrator .
With a more general (all-pole) predictor in the DPCM coder,
x`(n) = Σ (j = 1 to N) h_j x(n-j)
where N is the order of the predictor and h_j, for j = 1, ..., N, is the set of predictor coefficients.
The DPCM decoder takes a more general form as well; this more general decoder is sometimes referred to as a leaky integrator.
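A sketch of a DPCM coder with the simplest (unit-delay) predictor x`(n) = x(n-1) and a uniform quantizer on the difference signal; the decoder is the perfect integrator described above. The step size is an illustrative parameter, and the encoder predicts from its own reconstruction so that encoder and decoder stay in step:

```python
def dpcm_encode(samples, step):
    """DPCM with the unit-delay predictor: quantize d(n) = x(n) - x`(n),
    where the prediction x`(n) is the decoder's reconstruction of the
    previous sample (this keeps encoder and decoder synchronized)."""
    codes, pred = [], 0
    for x in samples:
        d = x - pred                 # prediction error d(n)
        q = round(d / step)          # quantize the error
        codes.append(q)
        pred += q * step             # mirror the decoder's integrator
    return codes

def dpcm_decode(codes, step):
    """Decoder: a perfect integrator of the quantized differences."""
    out, acc = [], 0
    for q in codes:
        acc += q * step
        out.append(acc)
    return out
```

With a slowly varying input the transmitted codes stay small, which is where the bit-rate saving over plain PCM comes from.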
Slope overloading & granular noise:
There are two types of reconstruction error in DPCM coders, namely overload distortion and granular noise. We are no longer quantizing x(n), as in the case of PCM, but the difference signal d(n). As a result, overload errors in DPCM are characterized not as amplitude overload but as slope overload. Slope overload occurs whenever the largest step size of the quantizer is too small for local values of d(n), resulting in the staircase output y(n) lagging behind a fast-changing input, for example the rapid increase of speech amplitude at the onset of a pitch period, or the rapid change of image intensity at a high-contrast edge. The opposite situation, granular noise, occurs when the input variations are too small in comparison with the smallest step size of the quantizer. Examples are the coding of silences in speech, or of constant or nearly constant gray areas in images.
The DPCM coders just described are appropriate for certain types of applications at 32 kb/s and 24 kb/s. They operate with a fixed first-order predictor for simplicity. The resulting speech quality is less than conservatively defined toll quality (that of 8-bit PCM). Nevertheless, 32 kb/s DPCM quality can be considered acceptable for certain speech transmission applications, especially those which do not require too many multiple encodings. On a subjective quality scale of 1 to 5, the mean opinion score for this coder is very close to 4 at 32 kb/s.
Voice storage systems, such as those for recorded announcements, have even less stringent demands on quality. Consequently they use a smaller speech bandwidth (say 0 to 2700 Hz) and achieve 24 kb/s coding with 4-bit quantization (Fs = 6 kHz, R = 4), where R is the number of bits per sample. The quality of a conventional-bandwidth 24 kb/s system (Fs = 8 kHz, R = 3) is lower, i.e. the greater quantization noise with R = 3 (instead of 4) is more objectionable than the loss of bandwidth in going from 3.2 to 2.7 kHz.
Adaptive Delta Modulation (ADM):
In ADM, instantaneous variation of the step size d is employed to enhance the quality of the encoded waveform. The several available algorithms differ in the quantizer output memory used and in the recommended step size multipliers or increments. Practical implementations of ADM algorithms presuppose appropriate constraints on the maximum and minimum step sizes: d_min < d(n) < d_max.
Certain types of adaptation processes also require an initial, or starting, step size d_start. In general this can be set to an arbitrary value; in speech coding, assuming that speech activity is preceded by silence, one can put d_start = d_min.
The value of d_max controls the amount of slope overload distortion, while the value of d_min controls the amount of granular noise.
In the most general adaptation logic, with m-bit memory, the step size d(n) will be a function of the present bit b(n), the m preceding bits, and another step size d`, which is either the minimum step size d_min or the previous step size d(n-1):
d(n) = F[ b(n), b(n-1), ..., b(n-m), d` ]
A useful adaptation rule is d(n) = M(n) d`, where M(n) is a step size multiplier that can be chosen so that the step size changes during successive increases or decreases are linear, exponential, or any other arbitrary function.
The non-availability of explicit level information in the one-bit quantizer makes the use of the most recent polarity b(n) very valuable in ADM. Using b(n) for the coding of sample x(n) means that the quantization is carried out in two steps: the first involves the determination of b(n), and the second is the application of this value to determine d(n) for the quantization itself.
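A hedged sketch of one-bit ADM using an exponential rule of the d(n) = M(n) d(n-1) form with one bit of memory (m = 1): the step grows while consecutive bits agree (fighting slope overload) and shrinks when they alternate (fighting granular noise). The multiplier values and d_start = d_min are illustrative choices, not prescribed by the text:

```python
def adm_encode(samples, d_min, d_max, m_up=1.5, m_down=0.66):
    """Adaptive delta modulation: one bit per sample.

    The step size is multiplied by m_up when the current bit equals the
    previous one (the signal is slewing; counters slope overload) and
    by m_down when the bits alternate (the signal is nearly flat;
    counters granular noise). The step is clamped to [d_min, d_max]."""
    bits, recon = [], []
    approx, step, prev_bit = 0.0, d_min, 0   # d_start = d_min
    for x in samples:
        bit = 1 if x >= approx else -1       # b(n): polarity of the error
        if bits:                             # adapt only after the first bit
            step *= m_up if bit == prev_bit else m_down
            step = max(d_min, min(d_max, step))
        approx += bit * step                 # staircase output y(n)
        bits.append(bit)
        prev_bit = bit
        recon.append(approx)
    return bits, recon
```

On a rising ramp the bits agree and the step grows toward d_max, so the staircase catches up; on a silent input the step decays toward d_min, keeping the granular noise small.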
Our Implementation:
In our video phone application we have very special requirements. We need a maximum compression ratio while retaining good sound quality: since we are not transmitting speech alone, we have a huge data rate which must be compressed, as we are aiming to transmit at a bit rate of 14.4 kb/s. We do not need very high sound quality; only an acceptable quality is required. (We have to bear in mind that the ear is more sensitive to noise than the eye, so there is a tradeoff between quality and compression ratio.)
Finally, the algorithm used for compression must have as small a computation time as possible if a real-time system is to be established. From the above requirements, one can conclude that we have to use either Delta Modulation (DM), ADM, or the Discrete Walsh-Hadamard Transform (DWHT).