OmniPage Web
VER1.0 win - rev. c
by Dan Evans

This program, by Caere Corp., is designed to convert large paper documents to HTML pages for web page development while maintaining the logical structure of the document, including hyperlink navigation within it. It does this by combining OmniPage Pro’s OCR (Optical Character Recognition) engine with Caere’s recently developed Logical Structure Recognition (LCR) technology. Remember, OCR programs convert scanned paper-based information into text that can be edited in word processors or other text-based programs. OmniPage Web takes this a step further, using LCR to automatically recognize the outline, hierarchy and structure of scanned documents, and then generate full sets of HTML pages. The program automatically generates a table of contents with hyperlinks to the appropriate sections of the document, as well as a navigation panel (at the top and/or bottom of the page) using text links, your own graphics or icons supplied by OmniPage Web. The levels of HTML supported by this version range from plain text to Dynamic HTML, including Cascading Style Sheets (CSS).

I installed this program on a 500 MHz Pentium III computer with 96 MB SDRAM memory, a 13 GB hard drive and Windows 98 Second Edition. My scanner is an antique HP ScanJet 4P connected via a Jaz SCSI PCI card. A typical installation consumed about 13 MB of hard drive space and about 15 minutes of setup time. OmniPage Web's minimum requirements are a Pentium PC, 45 MB of hard drive space, Windows 95/98/NT 4.0, SVGA or VGA with 256 colors, a CD-ROM drive and 16 MB RAM.

The program interface before outlining provides you with four toolbars (Standard, Zone, Table, and Auto Web) and three view panes (Thumbnail on the left for each page scanned, Original Image and Text).

The first four steps in converting a printed document to an HTML document are similar to those in other OCR packages that I’ve used. These steps are:

1. Scanning the document or loading a series of image files. The Scanning tab can be set to accommodate an Automatic Document Feeder (ADF) and double-sided pages, as well as color, grayscale, and black-and-white scans. These are useful features when scanning large documents.
Steps two and three are combined:
2. Automatic zoning, which identifies page elements such as text, graphics and tables, and establishes a reading order for them; and
3. Automatic OCR to convert printed text into editable text.
At this point you are given the opportunity to manually re-zone each page using the Zone and Table toolbars to correct text, graphics and table zones, as well as the order in which you want them to appear. After adjusting page zones, you will want to save the document as an OmniPage Web (.wmt) file. This saves you from having to scan and re-zone if you need to re-OCR or re-outline your document. You can just reload the .wmt file and continue from there.
4. Now you re-OCR and proofread your document to correct OCR misinterpretations.
OmniPage Web does things a little differently here. Using Logical Structure Recognition (LCR), the program completes steps five and six.
5. Once the text is recognized, LCR automatically creates an outline of your document indicating the various levels of the document structure. These include headlines, headings (up to 6 levels), body text, graphics, tables, headers, footers, captions, URLs, e-mail addresses and cross references. This outline becomes the basis for the table of contents on the Web site. A toolbar at the top of the outline view pane allows you to make manual changes to the outline structure.
6. The final step saves the outline to HTML, complete with a Table of Contents (which we didn’t have in the beginning) linking to elements within the Web site, live links to Web addresses and e-mail addresses, as well as cross references to other sections of the site.
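To make steps five and six more concrete, here is a minimal Python sketch of how an outline of heading levels could be turned into a hyperlinked table of contents. It is only an illustration of the general idea, not how OmniPage Web does it internally; the function names (slugify, build_toc), the anchor naming scheme and the "toc-level-N" class names are all my own assumptions.

```python
from html import escape


def slugify(title, used):
    """Turn a heading title into a unique anchor name (hypothetical scheme)."""
    base = "".join(c.lower() if c.isalnum() else "-" for c in title).strip("-")
    base = base or "section"
    slug = base
    n = 2
    # Append a counter when two headings would otherwise share an anchor.
    while slug in used:
        slug = f"{base}-{n}"
        n += 1
    used.add(slug)
    return slug


def build_toc(outline):
    """Build a flat HTML list of links from (level, title) pairs.

    Returns the HTML string and the anchor names, in outline order.
    """
    used = set()
    anchors = []
    lines = ["<ul>"]
    for level, title in outline:
        anchor = slugify(title, used)
        anchors.append(anchor)
        indent = "  " * level
        lines.append(
            f'{indent}<li class="toc-level-{level}">'
            f'<a href="#{anchor}">{escape(title)}</a></li>'
        )
    lines.append("</ul>")
    return "\n".join(lines), anchors


if __name__ == "__main__":
    # Sample outline: heading level (1 to 6) and heading text.
    sample = [(1, "Introduction"), (2, "Getting Started"), (1, "Appendix")]
    toc_html, _ = build_toc(sample)
    print(toc_html)
```

Each heading in the generated pages would carry the matching anchor name, so that clicking a table of contents entry jumps to that section.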

The program offers a large variety of options to control HTML output. General settings affect the whole document and include generating plain HTML for universal browser support, setting the document title, controlling page breaks and optionally including the original page image on the Web site.

The Components section controls the look and formatting of the Web pages. These settings include the navigation panel, image map or banner, signature or copyright page, the order of the components on the page, headers/footers, table of contents, how graphics are presented and horizontal rules.
The Component Styles section allows text, border and background preferences to be modified. A broad range of style options becomes available with Cascading Style Sheets enabled. The program has 20 predefined themes with unique styles that can be modified, or you can develop your own theme and save it for future use.

Priced at $499, this package is a particularly useful tool for web designers who need to convert very large documents (handbooks, manuals, etc.) for display on web sites. Caere Corporation can be found at

This page was last updated on:
February 2003