The Reality of XML

XML was designed as a simplified subset of SGML, to create a general-purpose interchange format that supports Unicode. Because it derives from SGML, you get separation of metadata and raw data, self-description, hierarchical structures, and a well-laid-out, human-readable format. All very nice. It even comes with tools for transformations, queries, and so on. It has several excellent uses, such as standard interchange formats for office documents, configuration files, data exports, and situations where legibility is paramount.

With the above in mind, it’s easy to see why the world adopted XML as a primary form of data exchange. To many it seemed like a panacea: any time I want to exchange information, XML must be the right answer, because that’s what it’s built for. The problem is that the community doesn’t seem as aware of XML’s shortcomings as of its benefits. Therein lies the rub. XML is now being used in situations it is not appropriate for. What if you don’t need data tools, self-described data, hierarchical structures, or a human-readable format? Each of those features comes at a serious cost to performance and resource utilization.

The reality of XML is that it’s a drain on performance and doesn’t scale well as a transport. Every XML message comes with a degree of bloat that is taxing on message encoding, message transmission / storage, and message decoding. Let’s take a normal XML message and compare it to some alternatives. The examples below are of a simple stock quote. I know the first three can be compressed further, but they exist to make a general point.

Standard XML - 98 Bytes w/ Tabs & UTF 8

<?xml version="1.0"?>
<quote>
    <symbol>MSFT</symbol>
    <side>B</side>
    <price>10.00</price>
</quote>

Compressed XML - 84 Bytes w/ Tabs & UTF 8

<?xml version="1.0"?>
<quote>
    <data symbol="MSFT" side="B" price="10.00"/>
</quote>

Tag Based - 30 Bytes

symbol=MSFT&side=B&price=10.00

Binary - 10 Bytes

A = ASCII | F = Float | B = Boolean  (Each Character Represents a Byte)

AAAAABFFFF

Note that “MSFT” takes 5 ASCII bytes: strings are variable length, so the fifth byte is a delimiter that marks where the string ends.
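As a rough sketch, the four encodings above can be reproduced in Python to check their sizes. The variable names, the NUL byte used as the string delimiter, the little-endian layout, and treating side "B" as boolean true are all assumptions made for illustration. The XML totals also depend on indentation and line endings (tabs and bare line feeds are assumed here), so they can differ by a byte or two from a hand count.

```python
import struct

# Quote values taken from the examples above.
SYMBOL, SIDE, PRICE = "MSFT", "B", 10.00

# Element-per-field XML, tab indented, LF line endings (assumed).
standard_xml = (
    '<?xml version="1.0"?>\n'
    "<quote>\n"
    f"\t<symbol>{SYMBOL}</symbol>\n"
    f"\t<side>{SIDE}</side>\n"
    f"\t<price>{PRICE:.2f}</price>\n"
    "</quote>"
).encode("utf-8")

# Attribute-based XML, same whitespace conventions.
compressed_xml = (
    '<?xml version="1.0"?>\n'
    "<quote>\n"
    f'\t<data symbol="{SYMBOL}" side="{SIDE}" price="{PRICE:.2f}"/>\n'
    "</quote>"
).encode("utf-8")

# Tag-based key=value pairs joined with "&".
tag_based = f"symbol={SYMBOL}&side={SIDE}&price={PRICE:.2f}".encode("ascii")

# Binary: NUL-delimited ASCII symbol (assumed delimiter), then a
# 1-byte boolean and a 4-byte float, little-endian with no padding.
binary = SYMBOL.encode("ascii") + b"\x00" + struct.pack("<?f", SIDE == "B", PRICE)

for name, payload in [
    ("Standard XML", standard_xml),
    ("Compressed XML", compressed_xml),
    ("Tag Based", tag_based),
    ("Binary", binary),
]:
    print(f"{name:<15}{len(payload):>4} bytes")
```

The point is the ordering rather than the exact totals: each step down the list drops markup that the receiver could have agreed on in advance.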

So what does the above mean for real-world applications? Let’s use an equities trading platform as our example application. To give you an idea of scale, message rates for market data can exceed 100,000 messages a second. Let’s be conservative and say real-world quotes (more complex than the one above) are 100 bytes each in XML, and that we have an average rate of 20,000 messages a second. Again being conservative, let’s say we can achieve a 50% reduction in message size by using an alternative format. That equates to a savings of 1,000,000 bytes a second, 58,593.75 KB a minute, and 3,433.23 MB an hour.
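The arithmetic behind those figures can be sketched in a few lines of Python. The constant names are made up for illustration, and the conversion assumes 1 KB = 1,024 bytes and 1 MB = 1,024 KB.

```python
# Inputs from the scenario above.
MESSAGE_BYTES = 100            # size of one XML quote
MESSAGES_PER_SECOND = 20_000   # average message rate
REDUCTION_PERCENT = 50         # size reduction from an alternative format

# Bytes saved every second by the smaller format.
saved_per_second = MESSAGE_BYTES * MESSAGES_PER_SECOND * REDUCTION_PERCENT // 100

# Scale up to a minute and an hour, using binary units (assumed).
saved_kb_per_minute = saved_per_second * 60 / 1024
saved_mb_per_hour = saved_per_second * 3600 / 1024**2

print(f"{saved_per_second:,} bytes a second")
print(f"{saved_kb_per_minute:,.2f} KB a minute")
print(f"{saved_mb_per_hour:,.2f} MB an hour")
```

Changing any of the three inputs shows how quickly the savings grow at the higher message rates mentioned above.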

These savings in message size mean reduced message-processing latency, reduced risk of network packet collision and retransmission, better utilization of user and kernel buffers, higher limits for services that queue unprocessed messages, and less garbage collection in Java. If you were recording incoming messages to disk, it would also decrease the amount of disk I/O. Keep in mind, too, that XML messages are hierarchical, so they require more complex and CPU-intensive parsing and validation, which further increases response latency.
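To make the decoding cost concrete, here is a minimal sketch that decodes the same quote from XML and from a fixed binary layout, then times both with `timeit`. The `Quote` type, the function names, the NUL-delimited symbol, and the little-endian boolean-plus-float layout are assumptions for illustration, not a real market data protocol. Timings vary by machine and parser, so none are quoted here.

```python
import struct
import timeit
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Quote(NamedTuple):
    symbol: str
    is_buy: bool
    price: float


XML_MESSAGE = (
    b'<?xml version="1.0"?>\n<quote>\n\t<symbol>MSFT</symbol>\n'
    b"\t<side>B</side>\n\t<price>10.00</price>\n</quote>"
)
# Assumed layout: ASCII symbol, NUL delimiter, 1-byte bool, 4-byte float.
BINARY_MESSAGE = b"MSFT\x00" + struct.pack("<?f", True, 10.0)


def decode_xml(payload: bytes) -> Quote:
    # Builds a full element tree, then converts each text node.
    root = ET.fromstring(payload)
    return Quote(
        symbol=root.findtext("symbol"),
        is_buy=root.findtext("side") == "B",
        price=float(root.findtext("price")),
    )


def decode_binary(payload: bytes) -> Quote:
    # Finds the delimiter, then reads the fixed-width fields after it.
    end = payload.index(b"\x00")
    is_buy, price = struct.unpack_from("<?f", payload, end + 1)
    return Quote(payload[:end].decode("ascii"), is_buy, price)


# Both decoders should recover the same quote.
assert decode_xml(XML_MESSAGE) == decode_binary(BINARY_MESSAGE)

runs = 10_000
xml_seconds = timeit.timeit(lambda: decode_xml(XML_MESSAGE), number=runs)
bin_seconds = timeit.timeit(lambda: decode_binary(BINARY_MESSAGE), number=runs)
print(f"XML:    {xml_seconds:.4f} s for {runs:,} decodes")
print(f"Binary: {bin_seconds:.4f} s for {runs:,} decodes")
```

The XML path has to tokenize markup, build a tree, and convert strings to numbers, while the binary path is one delimiter search and one fixed-width unpack.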

Obviously not all applications are equities trading platforms, but everyone wants their application to be faster, more scalable, and cheaper. XML standards like SOAP and RSS are becoming ubiquitous on the web and on internal networks. The implications are a slower user experience, increased probability of outages, increased hardware budgets, and higher saturation of the internet and internal networks. Some companies will try to solve this problem with band-aids like XML processing appliances, but I think it’s time to recognize XML’s benefits and shortcomings and design our applications accordingly.
