Licensing and Adoption of DjVu
An open-source implementation of DjVu named «DjVuLibre» is available under the GNU General Public License. The copyrights to the commercial development of the encoding software, however, have been transferred between several companies over the years, including AT&T Corporation, LizardTech, and Celartem. PDF is used more frequently than DjVu, even though some experts consider DjVu the better format for scanned documents because of its superior compression algorithms. Still, DjVu has reached a considerable level of acceptance thanks to this open-source licensing.
DjVu was developed at the height of the digitization era, when many books were being scanned, so many scanned documents and books across the web still use DjVu. Furthermore, in 2002 the Internet Archive, whose Million Book Project provides millions of scanned public-domain books, decided to support DjVu along with PDF.
The technical file specifications of DjVu
DjVu was originally derived from the Interchange File Format (IFF), which is based on hierarchically organized chunks. The IFF structure is preceded by a 4-byte magic number, AT&T. The magic number is followed by the outer IFF chunk header, which contains a marker indicating whether the file is a single-page (DJVU) or a multi-page (DJVM) document. If you want to create DjVu files yourself, you can use a PDF to DjVu converter. Going into more detail here would go beyond the scope of this article. Another important part of the specification, however, is the internet MIME type for DjVu, which is image/vnd.djvu or image/x-djvu. The current version of DjVu is Version 26, which was released more than 10 years ago.
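The header layout described above can be checked with a few lines of Python. This is a minimal sketch, not a full parser: the function names are made up for illustration, and it assumes that the magic number is followed by an IFF-style "FORM" chunk ID and a 4-byte big-endian length before the DJVU or DJVM marker.

```python
import struct

# Magic number and form types as described in the text. The "FORM" chunk ID
# and the 4-byte big-endian length are assumptions based on the IFF layout.
MAGIC = b"AT&T"
KINDS = {b"DJVU": "single-page", b"DJVM": "multi-page"}


def djvu_kind(header: bytes) -> str:
    """Classify a file from its first 16 bytes."""
    if len(header) < 16 or header[:4] != MAGIC or header[4:8] != b"FORM":
        return "not-djvu"
    # Payload length of the outer chunk (unused here, shown for orientation).
    (_length,) = struct.unpack(">I", header[8:12])
    return KINDS.get(header[12:16], "other-djvu-form")


def djvu_kind_of_file(path: str) -> str:
    """Read the first 16 bytes of a file and classify them."""
    with open(path, "rb") as fh:
        return djvu_kind(fh.read(16))
```

Calling `djvu_kind_of_file("scan.djvu")` on a hypothetical file would return one of the four strings above. Checking the header is more reliable than trusting the file extension alone.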
Why are DjVu files special?
DjVu files use advanced compression technologies that produce files about 5 to 10 times smaller than JPEG and TIFF. A scanned color page (resolution 300 DPI) with a file size of, say, about 25 MB can easily be compressed to only 100 kB (!) using DjVu. Any DjVu file can be equipped with a text layer to make it searchable. Such searchable DjVu files behave very much like PDF documents.
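To see where a figure of about 25 MB can come from, the following back-of-the-envelope calculation may help. It assumes an uncompressed US Letter page (8.5 x 11 inches) in 24-bit color. The page size and color depth are assumptions, since the text only gives the resolution.

```python
# Assumed page: US Letter (8.5 x 11 inches), 24-bit color, uncompressed.
DPI = 300
WIDTH_IN, HEIGHT_IN = 8.5, 11.0
BYTES_PER_PIXEL = 3

width_px = round(WIDTH_IN * DPI)    # 2550 pixels
height_px = round(HEIGHT_IN * DPI)  # 3300 pixels
raw_bytes = width_px * height_px * BYTES_PER_PIXEL

djvu_bytes = 100_000  # the "about 100 kB" figure from the text
ratio = raw_bytes / djvu_bytes

print(f"raw: {raw_bytes / 1e6:.1f} MB, ratio: {ratio:.0f}:1")
# prints "raw: 25.2 MB, ratio: 252:1"
```

Under these assumptions the 25 MB refers to the raw scan, so the ratio of roughly 250:1 is measured against uncompressed data, while the factor of 5 to 10 is measured against files that are already compressed.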
A key to achieving this excellent compression is so-called multi-scale bicolor clustering, which allows a foreground/background mask separation that is far more general than the standard text/image segmentation. Together with a set of soft pattern matching algorithms, the bi-level compression used by DjVu (JB2, a close relative of JBIG2) beats JBIG1, which was the standard for bi-level images for a long time, by a factor of two. The principle behind the encoding is the following: first, the method identifies nearly identical shapes on the page, such as multiple occurrences of a particular character in a given font, style, and size. Then it compresses the bitmap of each unique shape separately and encodes the locations where each shape appears on the page. This way, similar shapes are compressed only once instead of many times, which explains the file-size advantage DjVu files usually show. Further key components of the compression technique used by DjVu are a multi-scale successive projections algorithm and the so-called ZP-coder.
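The shape-dictionary principle described above can be illustrated with a toy sketch. It is not the actual DjVu encoder: the matching rule (a simple pixel-difference threshold), the data layout, and all names are assumptions chosen for readability, and no entropy coding is performed.

```python
from typing import List, Tuple

Bitmap = Tuple[str, ...]          # rows of '#' and '.' characters
Glyph = Tuple[Bitmap, int, int]   # bitmap plus x, y position on the page


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Number of differing pixels; shapes of different size never match."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        return 10**9
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def encode_page(glyphs: List[Glyph], max_diff: int = 1):
    """Return (shape dictionary, placements) for a list of glyphs."""
    shapes: List[Bitmap] = []
    placements: List[Tuple[int, int, int]] = []
    for bitmap, x, y in glyphs:
        for index, known in enumerate(shapes):
            if hamming(bitmap, known) <= max_diff:
                break  # reuse an existing, nearly identical shape
        else:
            shapes.append(bitmap)  # first occurrence: store the bitmap once
            index = len(shapes) - 1
        placements.append((index, x, y))
    return shapes, placements


A = ("###", "#.#", "###")
A_NOISY = ("###", "#.#", "##.")   # one pixel differs from A
B = ("#..", "#..", "###")

shapes, placements = encode_page([(A, 0, 0), (B, 4, 0), (A_NOISY, 8, 0)])
print(len(shapes), placements)
# prints "2 [(0, 0, 0), (1, 4, 0), (0, 8, 0)]"
```

Three glyphs on the page end up as two stored bitmaps plus three small placement records. On a real page with thousands of characters, the savings from storing each shape only once grow accordingly.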
Where did the DjVu file format come from?
The DjVu format was developed as an alternative to the PDF format in 1998 at AT&T Labs, whose Bell Labs heritage includes ground-breaking inventions like the transistor. The main contributors to the development of DjVu were Yann LeCun, Léon Bottou, Patrick Haffner, and Paul G. Howard. The guiding idea was to create a file format optimized for scanned documents that contain both pictures and text. A key requirement was that the new format perform better than PDF for this kind of document. A key advantage of DjVu is the small size of its files, which is why it is frequently used for distributing scanned documents on the web. It is an open file format, which means that it can be used by both open-source and proprietary software without any charge. DjVu files usually use the extension .djvu, or sometimes just .djv.
So far, one can think of a DjVu file as a loose collection of raster images that contain no searchable text and therefore appear difficult to handle. Do we have to accept, then, that PDF is simply the more convenient format? Of course not! The authors of DjVu found a smart workaround: in order to make DjVu files searchable, and therefore behave much like PDFs, they added a hidden OCR layer to the definition of the file format. This is an economical way of providing searchable text while keeping a strict separation between the visual appearance of the document and the content that the reader can search. It even makes it possible to copy and paste text from any DjVu file equipped with such a layer, just as one is used to from PDFs. Most DjVu files circulating on the web contain such a text layer. The main difference between DjVu and PDF remains that DjVu is a raster image format, while PDF is a scalable vector file format.
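The separation between the visible raster image and the hidden text layer can be pictured with a small data-structure sketch. All class and field names here are hypothetical and do not reflect the actual chunk layout of the DjVu format. The sketch only assumes that each recognized word is stored together with its position on the page.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # x, y, width, height in page pixels


@dataclass
class Word:
    text: str
    box: Box


@dataclass
class Page:
    image: bytes  # compressed raster data, treated as opaque here
    words: List[Word] = field(default_factory=list)  # hidden OCR layer

    def search(self, query: str) -> List[Box]:
        """Return the boxes of all words equal to the query (case-insensitive)."""
        q = query.lower()
        return [w.box for w in self.words if w.text.lower() == q]

    def plain_text(self) -> str:
        """Text that a viewer could offer for copy and paste."""
        return " ".join(w.text for w in self.words)


page = Page(
    image=b"...",  # placeholder bytes, not real image data
    words=[Word("Scanned", (10, 10, 80, 20)), Word("page", (95, 10, 45, 20))],
)
print(page.search("PAGE"))   # prints "[(95, 10, 45, 20)]"
print(page.plain_text())     # prints "Scanned page"
```

A viewer built along these lines would draw only the image and use the boxes to highlight search hits or selected text on top of it.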