XML

by Mark Wonsil, Aug 1998
On February 10, 1998, XML (the eXtensible Markup Language) became a recommended standard of the World Wide Web Consortium. After that, you couldn't open a trade magazine (now including Interact) without seeing a reference to XML. What is XML? Will it replace Hypertext Markup Language (HTML)? Why do we need it? Who owns it? Can MPE/iX users benefit from it? In order to answer these questions, it helps to study some ancient history.

Back in 1986, the International Organization for Standardization (ISO) first published the Standard Generalized Markup Language, or SGML. The purpose of SGML is to provide a standard way to describe documents using markup. Most users are familiar with the concept of markup. It is used in word processing and in Web page development. For example, consider this HTML markup: <B>This is bold</B>. A Web browser knows to bold all text between the start and end tags. The user does not see the tags, only the text between the tags. This kind of markup is specific to the presentation of a document. Generalized markup describes the content of a document. Developers use SGML to describe hundreds of different types of documents, from financial documents to recipes. One of these SGML applications is HTML, the markup language of the Web.

HTML

In the Internet's infancy, its purpose was to enable institutions to share research data. For the most part, this data started out as text files. If you wanted information from another computer, you used FTP (File Transfer Protocol) to retrieve the file. As the number of files grew, people used tools such as Gopher, Veronica, and Archie to help find particular files of interest. HTML appeared on the scene and provided a way to link related documents regardless of type or location. Since the early Internet involved many types of computers, it was important to provide information in a platform-independent manner. Each type of computer had to have a browser that could display an HTML document. In these early days, many people used Lynx, a text-based browser that can be found today on many UNIX boxes and on the HP 3000 (Lynx is available at //jazz.external.hp.com/ in the Freeware section). Later on, a graphical browser called Mosaic, written at the National Center for Supercomputing Applications (NCSA), would change the face of the Web forever.

At this point, the approach was still to let the browser render documents as it saw fit. As time went on, however, the pressure to control presentation began to build. Soon a young man left the NCSA and started a small company called Netscape. The Netscape browser was the first to add display enhancements to HTML, such as frames and tables. The emphasis on separating content from presentation was beginning to wane. As more companies joined the browser wars, presentation took on greater significance. For the sake of sanity, browsers did not validate HTML documents. The browser would simply render what it could and ignore any markup it did not understand.

As the Web grew bigger and bigger, new problems arose. With new sites springing up at every moment, there was a need for search tools to help find information. To use these search tools, you need to be more of an artist than a scientist. For example, a search on the word POPE returns pages on a U.S. Air Force base, a poet, a slew of folks with the last name of Pope, articles written about the current or past leaders of the Roman Catholic Church, pages in foreign languages with the four letters "POPE" in some of their words, and, of course, a few pornographic sites. Today's search engines do not search in context. HTML describes only the presentation of data and not the meaning of the text within a document.

While WWW stands for World Wide Web, the Web is not friendly to non-Roman alphabets. Many search requests return unreadable pages because the browser is not told which type of character encoding to use.

Oddly enough, one of HTML's strengths is also one of its greatest weaknesses. HTML linking is too manual and error-prone. Links are hard-coded and dispersed throughout each document. This makes link maintenance virtually impossible, and users encounter the inevitable ERROR 404 Not Found all too often. Even with all of the enhancements made to HTML, it was clear that no single markup language ever could handle everything that people would want to do on the Web.

XML

If HTML is so terrible and SGML so wonderful, why did HTML become the language of the Web? There are two reasons. First, SGML predates the popular Web. It was not designed for network usage. Second, SGML is not only large, it is complicated. It is so complicated, in fact, that many software programs have a hard time implementing all of its features. Worse yet, the most difficult features are the ones that are rarely used. So XML developers decided to take the most commonly used and most powerful features of SGML and make them efficient to use on the Web. Think of XML as SGML on a diet rather than HTML on steroids.

To get a better feel for XML, let us break down a simple XML document, such as the one shown in Listing 1. In many ways, this document has the look and feel of HTML, except for the unfamiliar tags. Sections of XML documents that start with <? and end with ?> are called processing instructions. This is where an author can pass information to the XML parser. In this case, it tells us which version of XML the document uses.

Unlike HTML, an XML document must be well-formed. An XML document consists of elements composed of a start tag, content, and an end tag. Tags are the tokens surrounded by angle brackets. Among other criteria, a document is well-formed when all elements are properly nested. This means the start and end tags of each element both fall inside the same parent element, so elements never overlap. For example: <B><I>This is well formed</I></B> and <B><I>This is not</B></I>. Note that many HTML browsers will accept both.

Another difference between XML and HTML is that XML is case sensitive. <B>For example, this sentence is not a valid XML element, but it will work in HTML.</b> If the case is not the same for both the start and end tags, the document will not be well-formed. The document in Listing 1 is well-formed. Notice there is one element that does not have an end tag. How can the document be well-formed if the support element is not closed? The support element is a special case because it has no content. Instead of forcing you to add an end tag immediately after the start tag, XML allows you to put the forward slash before the closing angle bracket to indicate the element is empty. So our document is indeed well-formed.
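
The well-formedness rules above are easy to try out with any XML parser. The following is a minimal sketch, assuming Python and its standard xml.etree.ElementTree module (neither is part of the original article); the helper name is_well_formed is made up for illustration.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the parser accepts text as a well-formed document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

print(is_well_formed("<B><I>This is well formed</I></B>"))  # True: properly nested
print(is_well_formed("<B><I>This is not</B></I>"))          # False: tags overlap
print(is_well_formed("<B>Mixed case</b>"))                   # False: B and b differ

# The empty-element form and an explicit start/end pair parse to the same thing.
short_form = ET.fromstring('<support level="PHONE"/>')
long_form = ET.fromstring('<support level="PHONE"></support>')
print(short_form.attrib == long_form.attrib)                 # True
```

Unlike a forgiving HTML browser, the parser rejects the overlapping and mixed-case examples outright rather than guessing what the author meant.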

There is a reason why proper nesting of elements is so important. When a document is properly nested, you can view it as a tree. Elements that share the same parent are siblings, and all elements directly beneath an element are its children. Many of the XML-based technologies rely heavily on this tree structure.

Listing 1: A Sample XML Document
<?xml version="1.0"?>
<!-- A simple XML document -->
<systemConfig>
    <hardware><desc>HP3000</desc><series>918DX</series></hardware>
    <software>
       <opSys><desc>MPE/iX</desc><release>5.5</release><patch>PP4</patch></opSys>
       <application><desc>CA-MANMAN</desc><release>11.0</release></application>
    </software>
    <support level="PHONE"/>
</systemConfig>
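
To make the tree view concrete, here is a small sketch that parses a trimmed copy of the system configuration document and prints one indented line per element. It assumes Python's standard xml.etree.ElementTree module; the function name outline is hypothetical, and the descriptive text has been left out to keep the sample short.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<systemConfig>
    <hardware><series>918DX</series></hardware>
    <software>
        <opSys><release>5.5</release><patch>PP4</patch></opSys>
        <application><release>11.0</release></application>
    </software>
    <support level="PHONE"/>
</systemConfig>"""

def outline(element, depth=0):
    """Return one indented line per element, parents before children."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines

root = ET.fromstring(DOC)
print("\n".join(outline(root)))
```

The hardware, software, and support elements come out at the same indentation because they are siblings under systemConfig.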

DTD

In addition to being well-formed, a document can also be valid. In order for an XML document to be valid, it must be well-formed and it must comply with the rules of a Document Type Definition, or DTD. A DTD describes the entities, elements, and attributes of an XML document. If an XML document has a DTD, the declaration that supplies it must appear at the beginning of the document, before the first element. The DTD may be included in the document or it may exist in an external file. (Since HTML is an SGML application, it has a DTD of its own. Actually, there are three HTML DTDs, one for each version of HTML: 2.0, 3.2, and 4.0. Some Web pages indicate HTML version by specifying the DTD with the DOCTYPE tag.)

Usually some of the first items found in the DTD are entities. XML entities allow you to associate a part of the document with a single name. There are three types of entities: internal, external, and parameter. An internal entity is a simple text replacement. For example, instead of typing "INTEREX, the International Association of HP Computing Professionals" in several places in one document, you could declare the entity INTX once. Now you can use &INTX; throughout the document.

All entity references begin with an ampersand (&) and end with a semicolon. XML predefines several internal entities. Since the angle brackets (&lt;, &gt;) and the ampersand (&amp;) mean something to the XML parser, you must use an entity to include them in a document. For example, to say that X is less than three, you would code that as X &lt; 3. There are also internal entities for the apostrophe (&apos;) and the double quote (&quot;). You also can use character references to insert any Unicode character into an XML document. Simply put the character's Unicode number between &# and the semicolon for decimal, or between &#x and the semicolon for hexadecimal. As a quick aside, if you have a large section of content riddled with reserved characters, like an XML example, you can "escape" the whole chunk using CDATA. See Listing 2. All text between <![CDATA[ and ]]> gets passed to the application unchanged.

Listing 2: Using CDATA to Bypass the Parser
<?xml version="1.0"?>
<example>Here is an example XML document with one element:
<![CDATA[
    <?xml version="1.0"?>
    <COMPANY>Hewlett Packard</COMPANY>
]]>
</example>

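The predefined entities, numeric character references, and CDATA sections can all be seen at work in a few lines. This is a sketch only, assuming Python's standard xml.etree.ElementTree module; the note element is invented for the example.

```python
import xml.etree.ElementTree as ET

# Predefined entities stand in for characters the parser would otherwise treat as markup.
escaped = ET.fromstring("<note>X &lt; 3 &amp; Y &gt; 4</note>")

# The same character (the copyright sign, U+00A9) written in decimal and in hexadecimal.
char_refs = ET.fromstring("<note>&#169; and &#xA9;</note>")

# Inside a CDATA section the tags are plain text, not child elements.
cdata = ET.fromstring("<example><![CDATA[<COMPANY>Hewlett Packard</COMPANY>]]></example>")

print(escaped.text)
print(char_refs.text)
print(cdata.text, len(cdata))
```

The application sees ordinary characters in every case; the escaping exists only in the file.
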
External entities point to data outside the current document. If the external entity is text, the parser inserts it as though it were typed at that point. If the entity is binary, such as a picture or video clip, it is not parsed but held for future reference. There are many good reasons to use external references. You can break a document up into pieces for easier handling or store commonly used text strings such as "Standard Sales Terms" or long legal disclaimers in a single file for insertion into many documents.

The third type of entity is a parameter entity. It is used only in document type declarations. Normally, entities remain unchanged in the DTD and are expanded by the parser. Parameter entities are expanded first, and the text becomes a part of the DTD. Listing 3 shows an internal entity, a binary external entity, a text external entity, and a parameter entity, respectively.

Listing 3: Entity Declarations
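<!ENTITY INTX         "INTEREX, the International Association of HP Computing Professionals">
<!ENTITY INTLOGO      SYSTEM "/usr/local/pics/interex.gif" NDATA GIF>
<!ENTITY CHAPTER2     SYSTEM "/usr/users/mybook/chap2.xml">
<!ENTITY % MYENTITIES SYSTEM "/usr/users/common/entities.xml">
<!-- NDATA marks INTLOGO as unparsed; the notation it names is declared below (identifier is illustrative) -->
<!NOTATION GIF        SYSTEM "image/gif">
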
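As a quick illustration of internal entity replacement, the sketch below declares INTX inside a document and lets the parser expand it. It assumes Python's standard xml.etree.ElementTree module, and the memo element is made up for the example. External entities are deliberately left out, because this parser does not fetch them.

```python
import xml.etree.ElementTree as ET

DOC = """<?xml version="1.0"?>
<!DOCTYPE memo [
<!ENTITY INTX "INTEREX, the International Association of HP Computing Professionals">
]>
<memo>Join &INTX; today.</memo>"""

memo = ET.fromstring(DOC)
print(memo.text)
```

The application receiving the memo never sees &INTX; at all, only the replacement text.
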
After entities, the element declarations come next in a DTD. Actually, an XML document has only one top-level element in it. That is to say, there is only one element that holds all of the other elements. Listing 4 is an element listing for the XML document in Listing 1. Notice that the systemConfig element contains in parentheses the major elements that describe the computer system. Elements separated by a comma must occur in that order. You can also indicate a choice of elements by separating them with a vertical bar (|). After each element or element group, you can indicate how often the element can appear. The DTD can indicate whether an element can exist zero times or one time (? operator), zero or more times (* operator), or one or more times (+ operator). In our example, the software element contains one opSys element and at least one application element. It is important to note that the DTD restricts only the element structure. At this point, the DTD has absolutely no control over the content of an element. Several groups are working on data typing of content. For now, data is represented with the term #PCDATA, which stands for parsed character data.

Listing 4: Element and Attribute List Declarations
<!DOCTYPE systemConfig [
<!ELEMENT systemConfig (hardware, software, support)>
<!ELEMENT hardware (desc, series)>
<!ELEMENT desc (#PCDATA)>
<!ELEMENT series (#PCDATA)>
<!ELEMENT software (opSys, application+)>
<!ELEMENT opSys (desc, release, patch)>
<!ELEMENT application (desc, release)>
<!ELEMENT release (#PCDATA)>
<!ELEMENT patch (#PCDATA)>
<!ELEMENT support EMPTY>
<!ATTLIST support
    level (NONE | PHONE | WEB | FULL | CUSTOM) "NONE">
]>

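The sequence and occurrence operators in a content model behave much like a regular expression over the names of an element's children. The sketch below leans on that similarity to hand-check one element against its declared model. It is an illustration only, not a validating parser: it assumes Python (whose standard XML modules do not validate against a DTD), and the names MODELS, model_to_regex, and children_match are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Content models copied from the element declarations for two of the elements.
MODELS = {
    "systemConfig": "(hardware, software, support)",
    "software": "(opSys, application+)",
}

def model_to_regex(model):
    """Translate a content model into a regex over ';'-terminated child names."""
    # Each element name becomes a group matching "name;". The operators
    # | ? * + and the parentheses mean the same thing in both notations.
    named = re.sub(r"[A-Za-z_][\w.-]*",
                   lambda m: "(?:" + re.escape(m.group(0)) + ";)", model)
    # A comma only expresses sequence, which a regex does by juxtaposition.
    return re.compile(re.sub(r"[,\s]", "", named))

def children_match(element, model):
    """Return True if the element's children satisfy the content model."""
    names = "".join(child.tag + ";" for child in element)
    return model_to_regex(model).fullmatch(names) is not None

good = ET.fromstring("<software><opSys/><application/><application/></software>")
wrong_order = ET.fromstring("<software><application/><opSys/></software>")
no_application = ET.fromstring("<software><opSys/></software>")

for sample in (good, wrong_order, no_application):
    print(children_match(sample, MODELS["software"]))
```

Only the first sample passes: the second breaks the comma-imposed order, and the third omits the application element that the + operator requires.
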
After specifying the elements, you can specify which attributes an element takes. Attributes in XML look like those in HTML. They are name-value pairs that appear within the start tag. In the system configuration example, the support element takes a single attribute, level. Listing 4 shows the attribute declaration for the support element. XML attributes give XML and related technologies a lot of power. You can declare an attribute to further describe the element's data. If you had a date element, an attribute could tell the receiving application what type of calendar the date used (Gregorian, Julian, Atmarian, and so on).

Attribute values have several flavors. The simplest is just a string of characters, or CDATA. This stands for character data, but unfortunately it has nothing to do with the <![CDATA[..]]> described earlier. The next type of attribute is an ID attribute. Every value of an ID attribute in the document must be unique. This essentially puts a key field on any element that has this type of attribute. Other attributes, of type IDREF, can only take ID field values; in other words, this attribute becomes a pointer to another element. Finally, you can enumerate a list of possible values. This last type is the one used by the support element in our example.
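
The key-field and pointer behavior of ID attributes can be sketched as two simple checks: every key is unique, and every pointer lands on an existing key. The example below assumes Python's standard library; the parts document and the attribute names id and item are invented for illustration, and a validating parser would perform these checks itself from the DTD.

```python
import xml.etree.ElementTree as ET
from collections import Counter

DOC = """<parts>
    <part id="p1">Disk drive</part>
    <part id="p2">Tape drive</part>
    <order item="p2"/>
    <order item="p9"/>
</parts>"""

root = ET.fromstring(DOC)

# Count every key value; a key that appears twice breaks the uniqueness rule.
id_counts = Counter(el.get("id") for el in root.iter() if el.get("id") is not None)
duplicates = sorted(name for name, count in id_counts.items() if count > 1)

# A pointer is dangling when no element carries the key it names.
dangling = sorted(el.get("item") for el in root.iter("order")
                  if el.get("item") not in id_counts)

print(duplicates, dangling)
```

Here the keys are unique, but the second order points at p9, which no part declares.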

After you set the type of attribute, you can assign default value instructions. There are four kinds of default values. The first is that the attribute is required (#REQUIRED). It must be specified on every element for the given attribute list. The next is just the opposite (#IMPLIED). It means the attribute is not required and no default value will be set by the parser. Although it does not sound useful, there is also a fixed default (#FIXED). This says if the attribute is on an element, it must have the given fixed value. This type is used when declaring links in XML documents. Finally, there is just a string value. If the attribute is not specified, this attribute will take on the default value. XML requires quotes around every attribute value, unlike HTML.
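
The plain string default is the easiest of the four to see in action. In the sketch below, the attribute list is declared in the document itself, so even a non-validating parser can fill in the missing value. It assumes Python's standard xml.etree.ElementTree module, which is built on the expat parser; the names DTD, explicit, and defaulted are just labels for this example.

```python
import xml.etree.ElementTree as ET

DTD = """<!DOCTYPE systemConfig [
<!ATTLIST support level (NONE | PHONE | WEB | FULL | CUSTOM) "NONE">
]>"""

# One document states the support level, the other leaves the attribute off.
explicit = ET.fromstring(DTD + '<systemConfig><support level="PHONE"/></systemConfig>')
defaulted = ET.fromstring(DTD + '<systemConfig><support/></systemConfig>')

print(explicit.find("support").get("level"))
print(defaulted.find("support").get("level"))
```

Note that this parser supplies the default but does not enforce the enumeration; rejecting a value outside the list is the job of a validating parser.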

XSL

One of the goals of XML is to separate content from presentation. In the previous section, we focused entirely on structuring the data. Now how do we display it? One of the issues that XML addresses is data reuse. In today's Web, after we write the user manual or owner's guide, we start over and write the Web pages. In order to distribute the manual on CD, often we have to start over again. (Yes, you can save word processing documents as HTML, but often that creates one very large page, or else you must do significant reworking to make it efficient for network use.) This is where style sheets come into play. SGML users are already familiar with style sheets. ISO standard 10179, Document Style Semantics and Specification Language, or DSSSL, can be used to format SGML documents. At the time of this writing, XSL, the eXtensible Style Language, is still in the proposal stages at the World Wide Web Consortium. Many expect it to be a subset of DSSSL and more powerful than existing HTML cascading style sheets.

Although there is no XSL standard yet, there is a philosophical goal of XSL: write a style sheet for each type of output. If you want to publish a document on the Web, write a style sheet that creates HTML. To create a printed document, write a style sheet that creates PostScript. Publish on a CD? You guessed it: write a style sheet. From some of the technical previews, this is what we can expect from XSL.

XSL is a data-driven scheme. You write rules that search for certain situations and format the data accordingly. Cascading style sheets instruct the browser to display defined styles in a particular manner. XSL takes this idea further. It is possible to format elements based on their position within the document. Listing 5 is an XSL fragment and, not coincidentally, an XML file. This fragment says, "Look for 'Section' elements that exist inside of 'Chapter' elements and prespace eight points down before displaying and make the font size twelve points." Of course, this is a trivial rule.

Listing 5: Style Sheet Fragment
<rule>
    <!-- Target pattern -->
    <element type="Chapter">
        <target-element type="Section"/>
    </element>
    <!-- Perform Action -->
    <sequence>
        <paragraph space-before="8pt" font-size="12pt"/>
    </sequence>
</rule>

The proposed XSL specification really tries to make the standard extensible. You can filter elements. By selecting only certain elements, it is very easy to create a table of contents or a map for a Web site. Style sheets can also reorder elements. Using filtering and reordering, it is possible to make a user's guide and a reference manual all from the same XML document. If you need more control over the formatting, it is possible to call an ECMAScript function, which returns a formatted string. ECMAScript is a standardized version of Netscape's JavaScript scripting language. Members of the W3C working committee hope to have XSL approved as a recommended standard by the end of 1998.
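
The pattern-then-action idea behind a style rule can be imitated in a few lines of ordinary code. The sketch below is not XSL and makes no claim about how an XSL processor works internally; it assumes Python's standard xml.etree.ElementTree module, treats "inside" as "directly inside", and uses a made-up Book document. It selects only the Section elements within Chapter elements and pairs each with the rule's formatting, which is also the kind of filtering a table of contents needs.

```python
import xml.etree.ElementTree as ET

BOOK = """<Book>
    <Chapter><Title>Basics</Title>
        <Section>Tags</Section>
        <Section>Attributes</Section>
    </Chapter>
    <Appendix><Section>Glossary</Section></Appendix>
</Book>"""

# The action half of the rule: the formatting to attach to each match.
STYLE = {"space-before": "8pt", "font-size": "12pt"}

def apply_rule(root):
    """Pair each Section that is a direct child of a Chapter with the style."""
    return [(section.text, STYLE) for section in root.findall(".//Chapter/Section")]

styled = apply_rule(ET.fromstring(BOOK))
for text, style in styled:
    print(text, style)
```

The Glossary section is skipped because it sits in an Appendix, which shows how a rule can format an element according to where it appears in the document.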

Several other W3C working groups are also working toward standardization. The eXtensible Linking Language group, formerly known as XLL, has split into the XLink and XPointer groups. As with XML and XSL, document linking is based on existing technologies, most notably HyTime and the Text Encoding Initiative (TEI). See the Web Sightings resources for further information on these specifications.

XLink

Like HTML, the XLink specification allows one to include an address of another document as a way to move to a new document. This is called a simple link. It is a one-way path to another target. Unlike HTML, the element does not have to be an anchor tag: <A>. XLink gives you the ability to make any element a link target. This is done by declaring a particular attribute list for the element. In fact, while HTML documents contain the target addresses within the document, XML links can be stored in an external file. The former link type is called in-line, the latter out-of-line. XLink also supports the notion of link groups. Instead of just one target, imagine a small pop-up window giving the user a choice of links. These types of links are called extended links. Unlike a simple link, an extended link group automatically sets up links from each target to every other target in the group. Imagine doing that by hand in HTML!

XLink enhances linking by giving more control to the document author. As in HTML, a link may specify a document fragment with the # operator (//www.hp.com/index.html#support). This operator means to load the entire document and point the browser to the element named support. XLink introduces the | fragment operator. It works like the # operator but does not require that the browser load the entire document. A Web server could retrieve only the requested element, saving time and bandwidth. XLink also gives control over the context of the target. You can specify that the target be displayed in a new window, replace the current window, or be embedded in the current window.

Note that a window makes sense only in a browser context. It may have different meaning in print or on CD. Finally, XLink gives you some control over link activation. You can set a link to traverse when the user clicks on it, or it can be activated automatically when the document loads.

While XLink gives more control in getting to other documents, XPointer provides more control after you have made it to the target. Using the XPointer syntax, you can jump to particular elements, groups of elements, or parts of elements. This technology relies heavily on the tree structure of the document. This syntax provides absolute or relative movement within the tree. XPointer will help make it easier for programs and agents to work with XML documents.
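
The flavor of absolute and relative movement within a tree can be shown with the simple path expressions that many XML libraries already offer. The sketch below is an analogy only and does not use XPointer syntax, which was still a working draft; it assumes Python's standard xml.etree.ElementTree module and a trimmed copy of the system configuration document.

```python
import xml.etree.ElementTree as ET

DOC = """<systemConfig>
    <software>
        <opSys><release>5.5</release><patch>PP4</patch></opSys>
        <application><release>11.0</release></application>
    </software>
    <support level="PHONE"/>
</systemConfig>"""

root = ET.fromstring(DOC)

# Absolute movement: walk down from the root to one particular element.
patch = root.find("software/opSys/patch")

# Relative movement: start from an element that has already been located.
op_sys = root.find("software/opSys")
release = op_sys.find("release")

# A group of elements: every release anywhere beneath software.
all_releases = [r.text for r in root.findall("software//release")]

print(patch.text, release.text, all_releases)
```

Each expression depends on the document being a properly nested tree, which is why well-formedness matters so much to the linking specifications.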

MPE/iX

What does XML mean to the HP 3000 and MPE/iX? A frequent complaint about MPE/iX is the user interface. Users love their graphical interfaces. Now that there is a movement to separate data serving from presentation, can one of the best systems for serving data take advantage? Absolutely. Currently, MPE/iX users can serve up XML files from a Web server. (You need only to add new MIME types such as text/xml.) And there is a popular saying in the XML community: "XML gives Java something to do." Indeed, much of the XML work to date has been done in Java. Many MPE/iX users may wonder what they will do with Java/iX when it arrives in Version 6.0 of the operating system. Let's look at some potential uses for XML on MPE/iX.

XML can represent complex data in a single document. This gives you the ability to make data transfers look more like objects. An XML parser could pick out the "properties" of the data transfer and manipulate them as needed. For example, you could write an MPE/iX server program that "exposed" MPE objects. In April 1998, there was talk on the HP3000-L listgroup about a new format for the :LISTF command. The discussion was about how to display the file sizes for the new larger files in 6.0. The main concern was that many users wrote CI command files that expect data to be in certain places and will break if the format changes too much. If there were a LISTFILE object server that returned an XML file with all of the properties for each file, no one would care about the format of the display.
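
To picture what such a LISTFILE object server might hand back, here is a purely hypothetical sketch. No such server exists in the article; the element names (fileList, file, code, sectors), the sample file, and its values are all invented, and Python's standard xml.etree.ElementTree module is assumed for building and reading the document.

```python
import xml.etree.ElementTree as ET

def listfile_xml(files):
    """Build an XML document with one file element per dictionary of properties."""
    root = ET.Element("fileList")
    for props in files:
        file_el = ET.SubElement(root, "file", name=props["name"])
        ET.SubElement(file_el, "code").text = props["code"]
        ET.SubElement(file_el, "sectors").text = str(props["sectors"])
    return ET.tostring(root, encoding="unicode")

# Made-up sample data standing in for what the server would read from the system.
sample = [{"name": "PAYROLL.DATA.ACCTG", "code": "PRIV", "sectors": 4096}]
xml_text = listfile_xml(sample)
print(xml_text)

# A client asks for properties by name, so column positions never matter.
parsed = ET.fromstring(xml_text)
sectors = int(parsed.find("file/sectors").text)
```

Because the client looks up sectors by name, a larger file size changes only the value, not the place where a command file has to look for it.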

In fact, a brisk business may be created for those who would write style sheets to improve the look and feel of MPE/iX. The old character-format LISTF command would be just a command file that formats and echoes LISTFILE properties. There can be other MPE/iX objects: a SYSGEN object for system parameters, an ACCOUNTS object for managing accounts and users, or a SESSIONS object for managing jobs and sessions. Figure 1 shows an example of how the SHOWJOB command might look with jobs displayed in different colors for different states (green for EXEC, red for SUSP, teal for WAIT, and so on).

Figure 1: SHOWJOB

What about current MPE/iX applications? You could write data servers too. Instead of writing a single VPlus application that maintains data, you would write an XML server and VPlus program that parses the XML file and sends the results back to the server. Want a Web page for the same application? The server is already written! By using XML documents, your presentation programs could pull data from other MPE/iX systems as well as HP-UX, NT, or any other operating system that serves up XML pages. Software developers and third-party shops could extend their current offerings, too. Imagine being able to listen to multiple performance servers running on your HP 3000 network and displaying the load for all of those machines on one screen. Imagine report writers and data-extraction tools creating snapshot XML documents for analysis in spreadsheets and statistical programs. The possibilities are endless.

So what is the downside of XML? First, the standards are not set yet. That makes it difficult for new development. Moving to a client-server environment brings with it all of the issues of client-server computing: security, resource sharing, and updating. XML files are larger than their binary counterparts, but because of the duplication of tags they should compress well. Also, if anyone can create a markup language, will there be so many that we end up exactly where we are today? For example, if half of the book catalogs use a <written-by> tag and the other half uses <author>, will network searches improve? Only time will tell.
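
The point about compression is easy to try. The sketch below builds a tag-heavy document from repeated records and compares its size before and after compression. It assumes Python's standard zlib module; the fileList, file, name, and sectors elements are invented test data, and the exact sizes will depend on the data and the compressor.

```python
import zlib

# Five hundred records that differ only in their numbers, so the tags repeat constantly.
record = "<file><name>REPORT{n}</name><sectors>{n}</sectors></file>"
document = "<fileList>" + "".join(record.format(n=i) for i in range(500)) + "</fileList>"

raw = document.encode("utf-8")
packed = zlib.compress(raw)

print(len(raw), len(packed))
```

The repeated tag names are exactly the kind of redundancy a general-purpose compressor removes, so the packed form comes out much smaller than the raw text.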

Conclusion

There is no doubt that XML is here to stay. It has a lot of momentum and the backing of the World Wide Web Consortium. It will grow first where it is useful for applications that want to download data and manipulate it at the client level. Once some of the standards and markup languages firm up, we will see some of the benefits on the Internet, especially in searching and linking. But the lasting benefit will come in the area of data sharing across multiple platforms. It is fair to say that the system that can serve up data the best will have a bright future. MPE/iX already has an edge in the online transaction processing market. If managed correctly, it will remain the leader for some time to come.

References

Books

Presenting XML, Richard Light; Sams. One of the first books on XML, it provides a fine theoretical foundation.

XML: A Primer, Simon St. Laurent; MIS/IDG Books. Also an early book, but a favorite among many in the XML-L listgroup.

XML Complete, Steven Holzner; McGraw Hill. Because this book has all of its examples in Java, it is not only a good resource for XML but a nice Java primer as well.

News Groups

microsoft.public.xml

comp.text.sgml

XML-L - LISTSERV@listserv.hei.ie subscribe XML-L <Full name>


Mark Wonsil is president of 4M Enterprises, Inc. He has worked with the HP 3000 since 1982. You can send comments to him at wonsil@4m-ent.com.
