Sunday, 1 September 2013

Compression Techniques



Data compression squeezes data so that it requires less disk space for storage and less bandwidth on a data transmission channel. Communications equipment such as modems, bridges, and routers use compression schemes to improve throughput over standard phone lines or leased lines. Compression is also applied to voice telephone calls carried over leased lines so that more calls can be placed on those lines. In addition, compression is essential for videoconferencing applications that run over data networks.
Most compression schemes take advantage of the fact that data contains a lot of repetition. For example, alphanumeric characters are normally represented by a 7-bit ASCII code, but a compression scheme can use a 3-bit code to represent the eight most common letters.
In addition, long stretches of "nothing" can be replaced by a value that indicates how much "nothing" there is. For example, silence in a compressed audio recording can be replaced by a value that indicates how long that silence is. White space in a compressed graphic image can be replaced by a value that indicates the amount of white space.
Compression has become critical in the move to combine voice and data networks. Compression techniques have been developed that reduce the data requirements for a voice channel down to 8 Kbits/sec. This is a significant improvement over noncompressed voice (64 Kbits/sec) and older compression techniques yielding 32 Kbits/sec.
Two important compression concepts are lossy and lossless compression:
  • Lossy compression    With lossy compression, it is assumed that some loss of information is acceptable. The best example is a videoconference, where a certain amount of frame loss is acceptable in order to deliver the image in real time. People may appear jerky in their movements, but you still have a grasp of what is happening on the other end of the conference. In the case of graphics files, some resolution may be lost in order to create a smaller file. The loss may be in the form of color depth or graphic detail. For example, high-resolution details can be lost if a picture is going to be displayed on a low-resolution device. Loss is also acceptable in voice and audio compression, depending on the desired quality.
  • Lossless compression    With lossless compression, data is compressed without any loss of information. It assumes you want to get back exactly what you put in. Critical financial data files are an example where lossless compression is required.
The removal of information in the lossy technique is acceptable for images because the loss is usually imperceptible to the human eye. While this works for human viewers, lossy images may not be usable in some situations, such as when automated scanners must locate fine details in images.
Lossy compression can provide compression ratios of 100:1 to 200:1, depending on the type of information being compressed. Lossless compression usually achieves only about a 2:1 ratio. Lossy compression techniques are often "tunable": you can turn the compression up to improve throughput, but at a loss in quality. Compression can also be turned down to the point at which there is little loss of image quality, but throughput will be affected.

Basic Compression Techniques
The most basic compression techniques are described here:
  • Null compression    Replaces a series of blank spaces with a compression code, followed by a value that represents the number of spaces.
  • Run-length compression    Expands on the null compression technique by compressing any series of four or more repeating characters. The characters are replaced with a compression code, one of the characters, and a value that represents the number of characters to repeat. Some synchronous data transmissions can be compressed by as much as 98 percent using this scheme.
  • Keyword encoding    Creates a table of values that represent common sets of characters. Frequently occurring words such as "for" and "the", or character pairs such as "sh" and "th", are replaced with short tokens that are used to store or transmit the characters.
  • Adaptive Huffman coding and Lempel Ziv algorithms    These compression techniques use a symbol dictionary to represent recurring patterns. The dictionary is dynamically updated during compression as new patterns occur. For data transmissions, the dictionary is passed to the receiving system so that it knows how to decode the characters. For file storage, the dictionary is stored with the compressed file.
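To make the dictionary idea concrete, here is a minimal LZW-style sketch in Python. It is illustrative only: real adaptive Huffman and LZW implementations add refinements such as variable code widths and dictionary resets, and LZW in particular can rebuild its dictionary from the code stream itself, so this sketch never transmits it.

    def lzw_compress(data: bytes) -> list[int]:
        """Emit dictionary codes for the longest known prefix, learning
        each newly seen pattern as compression proceeds."""
        dictionary = {bytes([i]): i for i in range(256)}   # all single bytes
        next_code = 256
        current = b""
        output = []
        for value in data:
            candidate = current + bytes([value])
            if candidate in dictionary:
                current = candidate                  # keep extending the pattern
            else:
                output.append(dictionary[current])   # emit code for known prefix
                dictionary[candidate] = next_code    # learn the new pattern
                next_code += 1
                current = bytes([value])
        if current:
            output.append(dictionary[current])
        return output

    def lzw_decompress(codes: list[int]) -> bytes:
        """Rebuild the same dictionary from the code stream while decoding."""
        dictionary = {i: bytes([i]) for i in range(256)}
        next_code = 256
        previous = dictionary[codes[0]]
        result = bytearray(previous)
        for code in codes[1:]:
            entry = dictionary.get(code, previous + previous[:1])
            result += entry
            dictionary[next_code] = previous + entry[:1]
            next_code += 1
            previous = entry
        return bytes(result)

    sample = b"TOBEORNOTTOBEORTOBEORNOT"
    codes = lzw_compress(sample)
    assert lzw_decompress(codes) == sample
    print(len(sample), "input bytes ->", len(codes), "codes")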
Compression and PDF 
Compression is the reduction in size of data in order to save space or transmission time. For data transmission, compression can be performed on just the data content or on the entire transmission unit depending on a number of factors.
Content compression can be as simple as removing all extra space characters, inserting a single repeat character to indicate a string of repeated characters, and substituting smaller bit strings for frequently occurring characters. This kind of compression can reduce a text file to 50% of its original size. Compression is performed by a program that uses a formula or algorithm to determine how to compress or decompress data. The algorithm is one of the critical factors that determine compression quality.
In PDF files, compression mainly refers to image compression. The PDF format is designed to compress information as much as possible (since PDF files can otherwise become very large). Compression can be either lossy (some information is permanently lost) or lossless (all information can be restored).
PDF is a page description language, like PostScript but simplified, with restricted functionality to keep it lightweight. Thanks to a better data structure and very efficient compression algorithms, a PDF file is typically about half the size of an equivalent PostScript file. PDFs use the following compression algorithms:
  • LZW (Lempel-Ziv-Welch)
  • FLATE (ZIP, in PDF 1.2) 
  • JPEG and JPEG2000 (the latter since PDF version 1.5)
  • CCITT (the facsimile standard, Group 3 or 4)
  • JBIG2 compression (PDF version 1.4)
  • RLE (Run Length Encoding)
All of these compression filters produce binary data, which can be further converted to ASCII base-85 encoding if a 7-bit ASCII representation is required.
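As a small illustration of that last step, Python's standard base64 module provides an Ascii85 codec; the zlib call below simply stands in for the binary output of any of the filters above.

    import base64
    import zlib

    # Stand-in for the binary output of one of the compression filters above.
    binary = zlib.compress(b"Hello, PDF world! " * 20)

    # Re-encode as Ascii85 so the stream contains only 7-bit ASCII characters;
    # adobe=True adds the <~ ... ~> framing used by Adobe's Ascii85 variant.
    text = base64.a85encode(binary, adobe=True)
    print(text[:40])

    # The encoding itself is lossless: decoding restores the exact bytes.
    assert base64.a85decode(text, adobe=True) == binary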
The above algorithms fall into two distinct categories: lossless and lossy.
Lossless algorithms do not change the content of a file: if you compress a file and then decompress it, the result is identical to the original. The following algorithms are lossless:
  • CCITT group 3 & 4 compression
  • Flate compression
  • LZW compression
  • RLE compression
  • ZIP
Lossy algorithms achieve better compression ratios by selectively discarding some of the information in the file. Such algorithms can be used for images or sound files but not for text or program data. The following algorithms are lossy:
  • JPEG compression
How well you use these compression techniques, how efficiently the data is described, and the complexity of the document (that is, the number of fonts, forms, images, and multimedia elements) ultimately determine how large the resulting PDF file will be.
Compression algorithm introduction
The compression algorithms are described in more detail below.
  • ZIP
ZIP works well on images with large areas of single colors or repeating patterns, such as screen shots and simple images created with paint programs, and for black-and-white images that contain repeating patterns. Acrobat provides 4-bit and 8-bit ZIP compression options. If you use 4-bit ZIP compression with 4-bit images, or 8-bit ZIP with 4-bit or 8-bit images, the ZIP method is lossless, which means it does not remove data to reduce file size and so does not affect an image's quality. However, using 4-bit ZIP compression with 8-bit data can affect the quality, since data is lost.
Note: Adobe's implementation of the ZIP filter is derived from the zlib package of Jean-loup Gailly and Mark Adler, whose generous assistance Adobe gratefully acknowledges.
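Since the Flate filter is based on the same deflate algorithm as zlib, its lossless behaviour is easy to sketch with Python's standard zlib module (the sample scanline data below is made up):

    import zlib

    # Flate/ZIP-style compression works best on repetitive data, such as
    # large areas of a single colour in a simple image.
    row = bytes([255] * 190 + [0] * 10)   # one mostly-white scanline
    image_data = row * 200                # 200 identical rows

    compressed = zlib.compress(image_data, level=9)
    print(len(image_data), "bytes ->", len(compressed), "bytes")

    # Lossless: decompression restores the data exactly, bit for bit.
    assert zlib.decompress(compressed) == image_data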
  • CCITT
CCITT (International Telegraph and Telephone Consultative Committee) compression is appropriate for black-and-white images made by paint programs and for any images scanned with an image depth of 1 bit. CCITT is a lossless method. Acrobat provides the CCITT Group 3 and Group 4 compression options. CCITT Group 4 is a general-purpose method that produces good compression for most types of monochrome images. CCITT Group 3, used by most fax machines, compresses monochrome images one row at a time.
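The Python standard library has no CCITT codec, but as a rough sketch, the third-party Pillow library (an assumption here, and it must be built with libtiff support) can write a 1-bit image to a TIFF container using the same Group 4 (T.6) codec:

    # Sketch using the third-party Pillow library; saving with Group 4
    # compression requires a Pillow build with libtiff support.
    from PIL import Image

    # A 1-bit black-and-white image: white background with a black square.
    img = Image.new("1", (400, 400), color=1)   # 1 = white in mode "1"
    img.paste(0, (100, 100, 300, 300))          # 0 = black

    # Group 4 (T.6) is the lossless fax codec referred to above.
    img.save("page.tif", compression="group4")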
  • RLE
RLE (Run Length Encoding) is a lossless compression option that produces the best results for images that contain large areas of solid white or black.
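Run-length encoding is simple enough to sketch directly. The (value, count) scheme below is a generic illustration, not the exact byte layout of PDF's RunLengthDecode filter:

    def rle_encode(data: bytes) -> list[tuple[int, int]]:
        """Collapse each run of identical bytes into a (value, run_length) pair."""
        runs = []
        for byte in data:
            if runs and runs[-1][0] == byte:
                runs[-1] = (byte, runs[-1][1] + 1)
            else:
                runs.append((byte, 1))
        return runs

    def rle_decode(runs: list[tuple[int, int]]) -> bytes:
        """Expand the runs back into the original byte string (lossless)."""
        return b"".join(bytes([value]) * count for value, count in runs)

    # Works best on long stretches of solid white (255) or black (0).
    scanline = bytes([255] * 120 + [0] * 8 + [255] * 72)
    runs = rle_encode(scanline)
    print(runs)                      # [(255, 120), (0, 8), (255, 72)]
    assert rle_decode(runs) == scanline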
  • JPEG
JPEG stands for Joint Photographic Experts Group, which is a standardization committee. It also stands for the compression algorithm that was invented by this committee.
There are two JPEG compression algorithms: the older one is simply referred to as "JPEG" on this page; the newer one is the JPEG 2000 algorithm.
JPEG is a lossy compression algorithm that was conceived to reduce the file size of natural, photographic-like true-color images as much as possible without affecting the quality of the image as perceived by the human eye. We perceive small changes in brightness more readily than small changes in color, and it is this aspect of our perception that JPEG compression exploits to reduce the file size.
JPEG is suitable for grayscale or color images, such as continuous-tone photographs that contain more detail than can be reproduced on-screen or in print. JPEG is lossy, which means that it removes image data and may reduce image quality, but it attempts to reduce file size with the minimum loss of information. Because JPEG eliminates data, it can achieve much smaller file sizes than ZIP compression.
Acrobat provides six JPEG options, ranging from Maximum quality (the least compression and the smallest loss of data) to Minimum quality (the most compression and the greatest loss of data). The loss of detail that results from the Maximum and High quality settings is so slight that most people cannot tell an image has been compressed. At Minimum and Low, however, the image may become blocky and acquire a mosaic look. The Medium quality setting usually strikes the best balance in creating a compact file while still maintaining enough information to produce high-quality images.
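As a rough illustration of this quality/size trade-off, the sketch below (again assuming the third-party Pillow library and an existing photo.jpg on disk) re-saves the same photograph at several JPEG quality settings and prints the resulting sizes:

    # Sketch assuming the third-party Pillow library and a local "photo.jpg".
    import io
    from PIL import Image

    photo = Image.open("photo.jpg").convert("RGB")

    for quality in (95, 75, 50, 25, 10):    # roughly Maximum ... Minimum
        buffer = io.BytesIO()
        photo.save(buffer, format="JPEG", quality=quality)
        print(f"quality={quality:3d}: {buffer.tell():8d} bytes")

Higher quality settings keep more of the detail that JPEG would otherwise discard, so the file sizes grow accordingly.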
