Data compression
squeezes data so it requires less disk space for storage and less bandwidth on
a data transmission channel. Communications equipment like modems, bridges, and
routers use compression schemes to improve throughput over standard phone lines
or leased lines. Compression is also applied to voice telephone calls
transmitted over leased lines so that more calls can be placed on those lines.
In addition, compression is essential for videoconferencing applications that
run over data networks.
Most compression schemes
take advantage of the fact that data contains a lot of repetition. For example,
alphanumeric characters are normally represented by a 7-bit ASCII code, but a
compression scheme can use a 3-bit code to represent the eight most common
letters.
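As a rough illustration of the short-code idea, the sketch below estimates the size of a text under a made-up scheme. It assumes each character carries one flag bit saying whether a 3-bit short code or the full 7-bit ASCII code follows; the flag bit, the function name, and the sample sentence are assumptions for illustration and do not come from any particular standard.

```python
from collections import Counter


def estimate_bits(text: str) -> tuple[int, int]:
    """Return (plain_bits, coded_bits) for a hypothetical short-code scheme."""
    counts = Counter(text)
    # The eight most frequent characters get the 3-bit short code.
    common = {ch for ch, _ in counts.most_common(8)}

    plain_bits = 7 * len(text)  # every character as 7-bit ASCII
    coded_bits = 0
    for ch, n in counts.items():
        # Assumption: 1 flag bit, then 3 bits (common) or 7 bits (everything else).
        bits_per_char = 1 + 3 if ch in common else 1 + 7
        coded_bits += n * bits_per_char
    return plain_bits, coded_bits


plain, coded = estimate_bits("the rain in spain stays mainly in the plain")
print(plain, coded)
```

The saving depends entirely on how skewed the character frequencies are; on text where the characters are evenly spread, the flag bit can make the output larger.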
In addition, long
stretches of "nothing" can be replaced by a value that indicates how
much "nothing" there is. For example, silence in a compressed audio
recording can be replaced by a value that indicates how long that silence is.
White space in a compressed graphic image can be replaced by a value that
indicates the amount of white space.
Compression has become
critical in the move to combine voice and data networks. Compression techniques
have been developed that reduce the data requirements for a voice channel down
to 8 Kbits/sec. This is a significant improvement over uncompressed voice (64
Kbits/sec) and over older compression techniques that yield 32 Kbits/sec.
Two important
compression concepts are lossy and lossless compression:
- Lossy compression: With lossy compression, it is assumed that some loss of information is acceptable. The best example is a videoconference, where some frame loss is accepted in order to deliver the image in real time. People may appear jerky in their movements, but you can still follow what is happening on the other end of the conference. In graphics files, some resolution may be lost in order to create a smaller file. The loss may take the form of reduced color depth or graphic detail. For example, high-resolution details can be dropped if a picture is going to be displayed on a low-resolution device. Loss is also acceptable in voice and audio compression, depending on the desired quality.
- Lossless compression: With lossless compression, data is compressed without any loss of information. It assumes you want to get back everything that you put in. Critical financial data files are an example where lossless compression is required.
The removal of
information in the lossy technique is acceptable for images, because the loss
of information is usually imperceptible to the human eye. While this trick
works on humans, you may not be able to use lossy images in some situations,
such as when scanners are used to locate details in images.
Lossy compression can
provide compression ratios of 100:1 to 200:1, depending on the type of
information being compressed. Lossless compression usually achieves only about
a 2:1 ratio. Lossy compression techniques are often
"tunable" in that you can turn the compression up to improve
throughput, but at a loss in quality. Compression can also be turned down to
the point at which there is little loss of image quality, but throughput will
suffer.
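One simple way to picture the "tunable" trade-off is reduced color depth. The sketch below keeps only the top few bits of each 8-bit sample; the function name and the sample values are made up for illustration, and real lossy codecs use far more elaborate methods.

```python
def quantize(samples: list[int], keep_bits: int) -> list[int]:
    """Keep only the top keep_bits of each 8-bit sample (a lossy step)."""
    shift = 8 - keep_bits
    # Dropping the low bits maps nearby values onto the same level.
    return [(s >> shift) << shift for s in samples]


pixels = [12, 13, 15, 200, 201, 203, 90, 91]  # made-up 8-bit gray values

mild = quantize(pixels, 6)   # little compression, small error
harsh = quantize(pixels, 2)  # strong compression, large error

mild_error = max(abs(a - b) for a, b in zip(pixels, mild))
harsh_error = max(abs(a - b) for a, b in zip(pixels, harsh))
print(mild, mild_error)
print(harsh, harsh_error)
```

Fewer distinct levels mean longer runs and more repetition for a later lossless stage to exploit, at the price of a larger error per sample.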
Basic Compression Techniques
The most basic
compression techniques are described here:
- Null compression: Replaces a series of blank spaces with a compression code, followed by a value that represents the number of spaces.
- Run-length compression: Expands on the null compression technique by compressing any series of four or more repeating characters. The characters are replaced with a compression code, one of the characters, and a value that represents the number of characters to repeat. Some synchronous data transmissions can be compressed by as much as 98 percent using this scheme.
- Keyword encoding: Creates a table with values that represent common sets of characters. Frequently occurring words like "for" and "the" or character pairs like "sh" or "th" are represented with tokens used to store or transmit the characters.
- Adaptive Huffman coding and Lempel-Ziv algorithms: These compression techniques use a symbol dictionary to represent recurring patterns. The dictionary is dynamically updated during compression as new patterns occur. Because the decoder can rebuild the same dictionary from the data it has already decoded, an adaptive scheme does not have to send the dictionary separately; static (non-adaptive) schemes instead pass the table to the receiving system, or store it with the compressed file, so the data can be decoded.
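The run-length scheme described in the list above can be sketched as follows. The marker character, the cap of 255 repeats per run, and the function names are assumptions made for this example; the sketch also assumes the marker never occurs in the input.

```python
MARKER = "\x1d"  # hypothetical compression code, assumed absent from the input


def rle_encode(text: str, min_run: int = 4) -> str:
    """Replace runs of min_run or more equal characters with MARKER, char, count."""
    if MARKER in text:
        raise ValueError("input must not contain the marker character")
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        run = 1
        # Measure the run, capped at 255 so the count fits in one character.
        while i + run < len(text) and text[i + run] == ch and run < 255:
            run += 1
        if run >= min_run:
            out.append(MARKER + ch + chr(run))
        else:
            out.append(ch * run)  # short runs are cheaper left as they are
        i += run
    return "".join(out)


def rle_decode(data: str) -> str:
    """Expand each MARKER, char, count triple back into the original run."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == MARKER:
            out.append(data[i + 1] * ord(data[i + 2]))
            i += 3
        else:
            out.append(data[i])
            i += 1
    return "".join(out)


sample = "AAAAAAbbb" + " " * 10 + "xyz"
packed = rle_encode(sample)
print(len(sample), len(packed))
```

Runs shorter than four characters are left alone because the three-character replacement would not save anything. Null compression is the special case where only runs of spaces are replaced.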
Compression and PDF
Compression is the reduction in size
of data in order to save space or transmission time. For data transmission,
compression can be performed on just the data content or on the entire
transmission unit depending on a number of factors.
Content compression can be as simple
as removing all extra space characters, inserting a single repeat
character to indicate a string of repeated characters, and substituting smaller
bit strings for frequently occurring characters. This kind of compression
can reduce a text file to 50% of its original size. Compression is
performed by a program that uses a formula or algorithm to determine how to
compress or decompress data. The algorithm is one of the critical factors in
determining the compression quality.
For PDF files, compression mainly refers to
image compression. The PDF format is designed to compress information
as much as possible, since these files can become very large.
Compression can be either lossy (some information is permanently lost) or
lossless (all information can be restored).
PDF is a page description language,
like PostScript but simplified, with restricted functionality to make it more
lightweight. Thanks to both a better data structure and very
efficient compression algorithms, a PDF file is about half the size
of an equivalent PostScript file. PDF uses the following compression
algorithms:
- LZW (Lempel-Ziv-Welch)
- FLATE (ZIP, in PDF 1.2)
- JPEG and JPEG 2000 (the latter in PDF version 1.5)
- CCITT (the facsimile standard, Group 3 or 4)
- JBIG2 compression (PDF version 1.4)
- RLE (Run Length Encoding)
All of these compression filters
produce binary data, which can be further converted to ASCII base-85 encoding
if a 7-bit ASCII representation is required.
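A minimal sketch of that two-step encoding, using Python's standard zlib and base64 modules as stand-ins for a PDF writer's Flate and ASCII base-85 filters. The stream content is made up, and the end-of-data framing that a real PDF stream needs is left out.

```python
import base64
import zlib

# Made-up content standing in for a page content stream.
raw = b"BT /F1 12 Tf (Hello) Tj ET\n" * 20

binary = zlib.compress(raw)           # Flate step: arbitrary 8-bit bytes
ascii85 = base64.a85encode(binary)    # base-85 step: printable ASCII only

# Decoding reverses the two steps in the opposite order.
restored = zlib.decompress(base64.a85decode(ascii85))

print(len(raw), len(binary), len(ascii85))
```

Base-85 writes every 4 bytes as 5 printable characters, so the ASCII form is roughly 25% larger than the binary form it wraps.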
The above algorithms can be divided into
two distinct categories: lossless or lossy.
Lossless algorithms do not change
the content of a file. If you compress a file and then decompress it, it has
not changed. The following algorithms are lossless:
- CCITT Group 3 and 4 compression
- Flate (ZIP) compression
- LZW compression
- RLE compression
Lossy algorithms achieve better
compression ratios by selectively discarding some of the information in
the file. Such algorithms can be used for images or sound files but not for
text or program data. The following algorithms are lossy:
- JPEG compression
Ultimately, how well you use these
compression techniques, how efficiently the data is described, and the
complexity of the document (that is, the number of fonts, forms, images, and
multimedia elements) determine how large your resulting PDF file will
be.
Compression algorithm introduction
The compression algorithms are
described in more detail below.
- ZIP
ZIP works well on images with large
areas of single colors or repeating patterns, such as screen shots and simple images created with paint
programs, and for black-and-white images that contain repeating patterns.
Acrobat provides 4-bit and 8-bit ZIP compression options. If you use 4-bit
ZIP compression with 4-bit images, or 8-bit ZIP with 4-bit or 8-bit images, the
ZIP method is lossless, which means it does not remove data to reduce file
size and so does not affect an image's quality. However, using 4-bit ZIP
compression with 8-bit data can affect the quality, since data is lost.
Note: Adobe's implementation of the
ZIP filter is derived from the zlib package of Jean-loup Gailly and Mark Adler,
whose generous assistance Adobe gratefully acknowledges.
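To see why ZIP suits flat images, the sketch below compresses two made-up 8-bit grayscale buffers with Python's zlib module: one solid white and one filled with random bytes as a stand-in for noisy photographic detail. The image dimensions and variable names are assumptions for illustration.

```python
import os
import zlib

width, height = 256, 64

flat = bytes([255]) * (width * height)   # solid white, one byte per pixel
noisy = os.urandom(width * height)       # random bytes, no repeating patterns

flat_packed = zlib.compress(flat)
noisy_packed = zlib.compress(noisy)

# Lossless: decompressing returns exactly the bytes that went in.
flat_restored = zlib.decompress(flat_packed)

print(len(flat), len(flat_packed))
print(len(noisy), len(noisy_packed))
```

The flat buffer shrinks to a tiny fraction of its size, while the random buffer barely shrinks at all, which is why a lossy method is usually preferred for photographs.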
- CCITT
CCITT (Comité Consultatif International Téléphonique et Télégraphique, the
International Telegraph and Telephone Consultative Committee) compression is
appropriate for black-and-white
images made by paint programs and any images scanned with an image depth of 1
bit. CCITT is a lossless method. Acrobat provides the CCITT Group 3 and Group 4
compression options. CCITT Group 4 is a general-purpose method that produces
good compression for most types of monochrome images. CCITT Group 3, used by
most fax machines, compresses monochrome images one row at a time.
- RLE
RLE (Run Length Encoding) is a
lossless compression option that produces the best results for images that
contain large areas of solid white or black.
- JPEG
JPEG stands for Joint Photographic
Experts Group, which is a standardization committee. The name also refers to the
compression algorithm that this committee developed.
There are two JPEG compression
algorithms: the older one is simply referred to as "JPEG" within
this page. The newer one is the JPEG 2000 algorithm.
JPEG is a lossy compression algorithm designed to reduce the file size
of natural, photographic true-color images as much as possible without
affecting the quality of the image as perceived by the human visual system.
We perceive small changes in brightness more readily than we do small changes
in color. It is this aspect of our perception that JPEG compression exploits in
an effort to reduce the file size.
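One common way encoders exploit this is chroma subsampling: the image is converted to one brightness channel and two color channels, and the color channels are stored at lower resolution. The sketch below shows only that step, assuming the JFIF-style RGB to YCbCr conversion and a 2x2 averaging of the color channels; it is not the full JPEG pipeline, and the function names and test image are made up.

```python
def rgb_to_ycbcr(r: float, g: float, b: float) -> tuple[float, float, float]:
    """Convert one 8-bit RGB pixel to luma (Y) and chroma (Cb, Cr)."""
    y = 0.299 * r + 0.587 * g + 0.114 * b
    cb = 128 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return y, cb, cr


def subsample(pixels):
    """Keep luma per pixel; average each chroma channel over 2x2 blocks.

    pixels is a list of rows of (r, g, b) tuples; even width and height
    are assumed to keep the sketch short.
    """
    height, width = len(pixels), len(pixels[0])
    ycc = [[rgb_to_ycbcr(*p) for p in row] for row in pixels]
    luma = [[px[0] for px in row] for row in ycc]  # full resolution
    cb, cr = [], []
    for y in range(0, height, 2):
        cb_row, cr_row = [], []
        for x in range(0, width, 2):
            block = [ycc[y + dy][x + dx] for dy in (0, 1) for dx in (0, 1)]
            cb_row.append(sum(p[1] for p in block) / 4)
            cr_row.append(sum(p[2] for p in block) / 4)
        cb.append(cb_row)
        cr.append(cr_row)
    return luma, cb, cr


# Made-up 4x4 test image: a red-to-blue gradient.
image = [[(255 - 60 * x, 40, 60 * x) for x in range(4)] for _ in range(4)]
luma, cb, cr = subsample(image)

samples_before = 3 * 4 * 4
samples_after = sum(len(row) for plane in (luma, cb, cr) for row in plane)
print(samples_before, samples_after)
```

Brightness is kept for every pixel while each color channel keeps one value per four pixels, so the sample count is halved before any further compression is applied.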
JPEG is suitable for grayscale or
color images, such as continuous-tone photographs that contain more detail than
can be reproduced on-screen or in print. JPEG is lossy, which means that it
removes image data and may reduce image quality, but it attempts to reduce
file size with the minimum loss of information. Because JPEG eliminates data,
it can achieve much smaller file sizes than ZIP compression.
Acrobat provides six JPEG options,
ranging from Maximum quality (the least compression and the smallest loss of
data) to Minimum quality (the most compression and the greatest loss of data).
The loss of detail that results from the Maximum and High quality settings is
so slight that most people cannot tell an image has been compressed. At Minimum
and Low, however, the image may become blocky and acquire a mosaic look. The
Medium quality setting usually strikes the best balance in creating a compact
file while still maintaining enough information to produce high-quality images.