2010-11-18

What is StAX? What are its advantages over SAX and DOM?


StAX (the Streaming API for XML) is a standard Java API that streams XML data to and from an application, much like other streaming APIs such as SAX and XNI. It follows the pull-parsing model: the client application asks the parser for the next piece of information, rather than the parser telling the application when the next datum is available. Put simply, in StAX the client application drives the parser instead of the other way round. Like SAX, StAX can read arbitrarily large documents. Unlike SAX, it is a bidirectional API, so applications can both read existing XML documents and create new ones.

One of the features of StAX is efficient XML access through a cursor API, XMLStreamReader. The cursor moves across an XML document from beginning to end, one item at a time and only forwards. The item it points at can be a text node, a start-tag, a comment or the beginning of the document. Information about the current item is retrieved by calling methods such as getName and getText on the XMLStreamReader. The example below shows how an XMLStreamReader is obtained from the XMLInputFactory class in a typical StAX program:

URL u = new URL("http://rom1023.blogspot.com/");
InputStream in = u.openStream();
XMLInputFactory factory = XMLInputFactory.newInstance();
XMLStreamReader parser = factory.createXMLStreamReader(in);

XMLStreamReader has many getter methods for retrieving information about the current item, for example the element name, its text or its attribute count. The sample below iterates through the XML document, printing the name of each element the cursor encounters and the text of each character event:

for (int event = parser.next(); event != XMLStreamConstants.END_DOCUMENT; event = parser.next()) {
    switch (event) {
        case XMLStreamConstants.START_ELEMENT:
            System.out.println(parser.getLocalName());
            break; // without this, control falls through and getText() is called on a start-tag
        case XMLStreamConstants.CHARACTERS:
            System.out.println(parser.getText());
            break;
    }
}
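
For a version that can be run as-is, here is a minimal sketch of the same loop. The class name CursorDemo, the helper describe and the sample document are made up for illustration. The input comes from a string instead of a URL, and the sketch also reads attributes through getAttributeCount, getAttributeLocalName and getAttributeValue.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class; only the javax.xml.stream calls are part of StAX.
class CursorDemo {
    // Walks the document with the cursor API and records what the cursor sees.
    static List<String> describe(String xml) throws XMLStreamException {
        XMLStreamReader parser =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        List<String> seen = new ArrayList<>();
        while (parser.hasNext()) {
            switch (parser.next()) {
                case XMLStreamConstants.START_ELEMENT:
                    seen.add("start " + parser.getLocalName());
                    // Attributes are only available while the cursor is on the start-tag.
                    for (int i = 0; i < parser.getAttributeCount(); i++) {
                        seen.add("attr " + parser.getAttributeLocalName(i)
                                + "=" + parser.getAttributeValue(i));
                    }
                    break;
                case XMLStreamConstants.CHARACTERS:
                    if (!parser.isWhiteSpace()) {
                        seen.add("text " + parser.getText());
                    }
                    break;
                default:
                    break; // comments, end-tags and so on are ignored here
            }
        }
        parser.close();
        return seen;
    }

    public static void main(String[] args) throws XMLStreamException {
        String xml = "<post id=\"42\"><title>What is StAX?</title></post>";
        describe(xml).forEach(System.out::println);
    }
}
```

Using hasNext in the loop condition is one of several ways to stop at the end of the document; comparing the event code with END_DOCUMENT works as well.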

The above loop with a switch statement is a very common pattern in StAX programs, used in place of a stack of if-else statements. It is also one of the major criticisms of StAX: integer type codes for the kind of item the cursor is at, combined with big switch statements, do not fit the object-oriented style, which is based on classes, inheritance and polymorphism. Critics argue that the next method should instead return an XMLEvent object with subclasses such as StartElement, Characters and EndDocument. The usual reason given for integer type codes is speed, since the cursor does not have to create an object for every item or inspect its class at run time, but this does sacrifice the advantages of object-oriented programming.
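
For comparison, javax.xml.stream also ships an event iterator API, XMLEventReader, whose nextEvent method returns XMLEvent objects of the kind described above. The sketch below is only an illustration of that style; the class name EventDemo and the helper describe are made up.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;

// Hypothetical demo class showing the object-based alternative to the cursor.
class EventDemo {
    static List<String> describe(String xml) throws XMLStreamException {
        XMLEventReader events =
                XMLInputFactory.newInstance().createXMLEventReader(new StringReader(xml));
        List<String> seen = new ArrayList<>();
        while (events.hasNext()) {
            XMLEvent event = events.nextEvent(); // one object per item, no integer code
            if (event.isStartElement()) {
                seen.add("start " + event.asStartElement().getName().getLocalPart());
            } else if (event.isCharacters() && !event.asCharacters().isWhiteSpace()) {
                seen.add("text " + event.asCharacters().getData());
            }
        }
        events.close();
        return seen;
    }

    public static void main(String[] args) throws XMLStreamException {
        describe("<greeting><to>World</to></greeting>").forEach(System.out::println);
    }
}
```

The trade-off is the one the criticism points at: the code reads more naturally, but every item now costs an object.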

The simple example above does not show the full power of StAX. As mentioned earlier, StAX is a bidirectional API that can write XML documents as well as read them. For output, instead of XMLStreamReader we use XMLStreamWriter, an interface that provides methods to write elements, attributes, comments, text and so on to an XML document. The example below shows how an XMLStreamWriter is obtained from the XMLOutputFactory class:

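OutputStream out = new FileOutputStream("data.xml");
XMLOutputFactory factory = XMLOutputFactory.newInstance();
// Set the real output encoding here; writeStartDocument only labels the document with one
XMLStreamWriter writer = factory.createXMLStreamWriter(out, "ISO-8859-1");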

Different types of data can be written to the output stream using the various writeFOO methods of XMLStreamWriter, such as writeStartDocument, writeStartElement, writeEndElement, writeCharacters and writeComment. Below is an example of writing a hello world XML document:

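writer.writeStartDocument("ISO-8859-1", "1.0");
writer.writeStartElement("greeting");
writer.writeCharacters("Hello World");
writer.writeEndDocument(); // also closes the still-open greeting element
writer.flush();
writer.close(); // releases the writer; it does not close the underlying stream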

There are several advantages to using XMLStreamWriter to write an XML document. One is that it helps maintain some well-formedness constraints. For example, the writeEndDocument method closes all unclosed start-tags, and writeCharacters performs any necessary escaping of special characters such as < and &. It does not check well-formedness in full, however: the interface is not required to reject a document with multiple roots or an element name that contains whitespace, so avoiding those mistakes remains the application's job. Overall, creating an XML document with XMLStreamWriter is generally faster and lighter on memory than building a DOM tree and serializing it.
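
The two helpful behaviours mentioned above, escaping and automatic closing of open elements, can be seen in a small sketch. The class name WriterDemo and the helper greeting are made up; the output goes to a StringWriter so the result can be inspected.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class; writes one element and returns the document as a string.
class WriterDemo {
    static String greeting(String text) throws XMLStreamException {
        StringWriter out = new StringWriter();
        XMLStreamWriter writer = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        writer.writeStartDocument();
        writer.writeStartElement("greeting");
        writer.writeAttribute("lang", "en");
        writer.writeCharacters(text); // & and < in the text are escaped by the writer
        writer.writeEndDocument();    // no writeEndElement: the open element is closed here
        writer.flush();
        writer.close();
        return out.toString();
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(greeting("Tom & Jerry <3"));
    }
}
```

The exact form of the XML declaration at the start of the output can differ between StAX implementations, so the sketch does not rely on it.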

Having introduced StAX, the big question is: what are its advantages over SAX and DOM? Let's start with SAX. One of the major differences between the two APIs is that StAX is a streaming pull parser whilst SAX is a streaming push parser. Streaming pull parsing is a programming model in which the client application takes the initiative and calls methods on an XML parsing library whenever it needs to interact with an XML document. That is, the client only gets (pulls) XML data when it explicitly asks for it. Streaming push parsing, on the other hand, is a programming model in which the XML parser sends (pushes) XML data to the client application as the parser moves through the document from one element to the next. That is, the parser sends the data whether or not the client is ready to use it at that moment.
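
To make the difference concrete, here is a hedged sketch that collects element names from the same document twice, once with SAX and once with StAX. The class PushVersusPull and its two helper methods are invented for this comparison; the SAX side uses the standard SAXParserFactory and DefaultHandler classes.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamReader;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class contrasting the two programming models.
class PushVersusPull {
    // Push: parse() keeps control until the document ends and calls back into the handler.
    static List<String> namesWithSax(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                names.add(qName);
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return names;
    }

    // Pull: the application's own loop asks the parser for each item.
    static List<String> namesWithStax(String xml) throws Exception {
        List<String> names = new ArrayList<>();
        XMLStreamReader parser =
                XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
        while (parser.hasNext()) {
            if (parser.next() == XMLStreamConstants.START_ELEMENT) {
                names.add(parser.getLocalName());
            }
        }
        parser.close();
        return names;
    }

    public static void main(String[] args) throws Exception {
        String xml = "<feed><entry/><entry/></feed>";
        System.out.println(namesWithSax(xml));
        System.out.println(namesWithStax(xml));
    }
}
```

Both methods produce the same list. What differs is who owns the loop: in the SAX version it is hidden inside parse, in the StAX version it is ordinary application code that could pause or stop at any point.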

The advantages of StAX, a streaming pull parser, over SAX, a streaming push parser, can be summarized as follows:
  • A StAX parser is more flexible because the client keeps control of the application thread and calls methods on the parser when needed. By contrast, a SAX parser takes control of the application thread and the client can only accept invocations from the parser.
  • Pull-parsing libraries can be much smaller than push-parsing ones, and the client code that talks to them is usually simpler to write, even for more complex XML documents.
  • A StAX client can read multiple documents at one time with a single thread. A SAX parser holds the thread until its parse call returns, so doing the same with SAX typically takes one thread per document.
  • A StAX pull parser can filter an XML document so that elements the client does not need are skipped, and it can support XML views of non-XML data. With a SAX push parser the client still receives every event and has to discard the unwanted ones itself.
  • Unlike SAX, StAX is a bidirectional API which allows programs to both read existing XML documents and create new ones. This gives StAX an edge over SAX, since one API covers both tasks.
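
The point about reading several documents with one thread can be sketched as follows. The class InterleaveDemo and its helpers are hypothetical; the idea is simply that two cursors are advanced in turn from the same loop.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: two documents, two cursors, one thread.
class InterleaveDemo {
    // Advances the cursor to the next start-tag and returns its name, or null at the end.
    static String nextElementName(XMLStreamReader parser) throws XMLStreamException {
        while (parser.hasNext()) {
            if (parser.next() == XMLStreamConstants.START_ELEMENT) {
                return parser.getLocalName();
            }
        }
        return null;
    }

    // Takes one element from document A, then one from document B, and so on.
    static List<String> interleave(String xmlA, String xmlB) throws XMLStreamException {
        XMLInputFactory factory = XMLInputFactory.newInstance();
        XMLStreamReader a = factory.createXMLStreamReader(new StringReader(xmlA));
        XMLStreamReader b = factory.createXMLStreamReader(new StringReader(xmlB));
        List<String> order = new ArrayList<>();
        String nameA = nextElementName(a);
        String nameB = nextElementName(b);
        while (nameA != null || nameB != null) {
            if (nameA != null) {
                order.add("A:" + nameA);
                nameA = nextElementName(a);
            }
            if (nameB != null) {
                order.add("B:" + nameB);
                nameB = nextElementName(b);
            }
        }
        a.close();
        b.close();
        return order;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(interleave("<a><b/></a>", "<x><y/><z/></x>"));
    }
}
```

Because each call to next returns as soon as one item is available, neither document has to be finished before the other is touched.
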

Next, let's talk about the advantages of StAX over DOM. Generally speaking, there are two programming models for working with XML documents: document streaming (SAX and StAX) and the document model (DOM). Streaming models are particularly useful when the application has limited memory or has to process several requests simultaneously. In fact, it can be argued that the majority of XML business logic benefits more from the streaming style than from the DOM style, which has to keep the entire tree in memory.

To summarize, here are the advantages of StAX, a document streaming model, over DOM, which uses the document tree model:
  • StAX works better than DOM when processing a large XML document, say larger than a few megabytes, or in memory-constrained environments such as J2ME.
  • StAX is generally faster than DOM because it can start producing output almost as soon as input arrives, whereas DOM must read the whole document and build a tree data structure before the application can use it.
  • StAX suits applications that consume a continuous stream of XML to obtain real-time data, such as Web Services or Instant Messaging. DOM is a poor fit there: it would have to wait for the stream’s closing tag in order to finish building its tree, and in a continuous stream that tag may not arrive for a long time.
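
The last point can be illustrated with a source that never sends its closing tag. In the sketch below, EndlessMessages is a made-up Reader that stands in for a network connection and keeps producing msg elements forever; firstMessages pulls a few of them and then simply stops asking. All class and method names here are assumptions for the sake of the example.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical demo class: reading from an XML stream that never ends.
class EndlessStreamDemo {
    // Stand-in for a live connection: emits <stream><msg>0</msg><msg>1</msg>... without end.
    static final class EndlessMessages extends Reader {
        private String pending = "<stream>";
        private int pos = 0;
        private int counter = 0;

        @Override
        public int read(char[] buf, int off, int len) {
            if (len == 0) {
                return 0;
            }
            if (pos == pending.length()) {
                // Current chunk is used up; generate the next message on demand.
                pending = "<msg>" + (counter++) + "</msg>";
                pos = 0;
            }
            int n = Math.min(len, pending.length() - pos);
            pending.getChars(pos, pos + n, buf, off);
            pos += n;
            return n;
        }

        @Override
        public void close() {
            // Nothing to release in this stand-in.
        }
    }

    // Pulls msg elements until 'limit' of them have been read, then stops.
    static List<String> firstMessages(Reader source, int limit) throws XMLStreamException {
        XMLStreamReader parser = XMLInputFactory.newInstance().createXMLStreamReader(source);
        List<String> messages = new ArrayList<>();
        while (messages.size() < limit && parser.hasNext()) {
            if (parser.next() == XMLStreamConstants.START_ELEMENT
                    && "msg".equals(parser.getLocalName())) {
                messages.add(parser.getElementText());
            }
        }
        parser.close();
        return messages;
    }

    public static void main(String[] args) throws XMLStreamException {
        System.out.println(firstMessages(new EndlessMessages(), 3));
    }
}
```

A DOM parser given the same source would never return, because the tree cannot be completed until the closing stream tag arrives.
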

In conclusion, StAX is a fast, straightforward and memory-thrifty way of loading data from an XML document. It still has its shortcomings: it does not support random access to the document, and it works less well when the structure of the XML document is very complex. Even so, many of the toughest XML processing problems encountered today come from exactly the domain where StAX works well compared to SAX and DOM.

