API Best Practices Blog
Tradeoffs in XML data transformations
Daniel Jacobson of NPR posted a fascinating piece about how NPR tackles a common problem – what’s the best way to render content on a variety of devices, from modern web browsers with top-notch CSS implementations that look almost like typesetting (like Safari) to mobile phones using WAP to low-end devices like HD Radio receivers that don’t understand anything but plain ASCII text.
NPR’s clever solution is to strip markup out of the text and store it in a database table, indexed by position in the text document. To re-generate the content for a particular device, their software queries the database and re-applies the markup tags to the content according to what device it is rendering to.
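To make the idea concrete, here is a minimal sketch of position-indexed markup in Python. It is not NPR's actual implementation: the `Span` record, the `DEVICE_TAGS` table, and the device names are all hypothetical stand-ins for whatever the real database tables hold, and the sketch assumes the spans nest cleanly and the text needs no escaping.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One stored markup record (hypothetical layout of a database row)."""
    start: int  # offset of the first character the markup covers
    end: int    # offset one past the last character it covers
    kind: str   # semantic label, e.g. "headline" or "emphasis"


# Hypothetical per-device tag tables. A device with no entry for a label
# simply gets the bare text for that span.
DEVICE_TAGS = {
    "web": {"headline": ("<h1>", "</h1>"), "emphasis": ("<i>", "</i>")},
    "wap": {"headline": ("<b>", "</b>"), "emphasis": ("<em>", "</em>")},
    "hd_radio": {},  # plain text only
}


def render(text, spans, device):
    """Re-apply stored markup to plain text for one device."""
    tags = DEVICE_TAGS[device]
    opens, closes = {}, {}
    # Outer spans first, so tags that share an offset nest correctly.
    for span in sorted(spans, key=lambda s: (s.start, -s.end)):
        if span.kind not in tags:
            continue
        open_tag, close_tag = tags[span.kind]
        opens.setdefault(span.start, []).append(open_tag)
        # Inner spans close before outer ones at the same offset.
        closes.setdefault(span.end, []).insert(0, close_tag)

    out = []
    for i, ch in enumerate(text):
        out.extend(closes.get(i, []))
        out.extend(opens.get(i, []))
        out.append(ch)
    out.extend(closes.get(len(text), []))
    return "".join(out)
```

Calling `render(text, spans, "hd_radio")` returns the text untouched, while `"web"` and `"wap"` wrap the same spans in different tags. The stored text never changes; only the tag table does.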
This takes me back to the original reason SGML was invented and made an ISO standard in 1986. The idea was to describe the semantic meaning of text, and then to let a computer program figure out how to render it for human consumption.
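SGML was a little over-engineered for that purpose, however, so a bunch of smart people got together in 1996 and began work on XML, a simplified subset of SGML that became a W3C Recommendation in 1998. XML then begat technologies like XHTML and XSLT, which are often paired with CSS for presentation.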
So today, instead of writing something like:
<h1 class="headline">This is a headline</h1><p class="byline"><b>By I.M.A. Reporter</b></p><p class="paragraph">And here is my first paragraph with something in <i>italics</i>.</p>
XML lets us write:
<main_headline>This is a headline</main_headline><byline>By I.M.A. Reporter</byline><p>And here is my first paragraph with something in <i>italics</i>.</p>
The difference is that my second example isn’t HTML – it’s part of a document that uses an XML schema that’s up to me, and when writing it I don’t care if I’m coding for an HTML browser or for a car radio – I just have to identify when I’m writing a headline, or a byline, or a caption, and so on. I can now use XSLT or another transformation technology to transform this XML into very simple HTML for a simple browser, or into very complex HTML with links to a CSS stylesheet for a more sophisticated browser, or just into plain text. And if I decide that part of my XML schema should look just like HTML (like I did above with the “p” and “i” tags) then that’s fine too.
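As a rough illustration of that transformation step, the sketch below turns the semantic XML above into either plain text or simple HTML. It uses Python's built-in `xml.etree.ElementTree` as a stand-in for XSLT, since the Python standard library has no XSLT processor. The `<story>` wrapper element, the `HTML_RULES` table, and the function names are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# The article's example, wrapped in a hypothetical <story> root element
# so that it parses as a single well-formed document.
STORY = (
    "<story>"
    "<main_headline>This is a headline</main_headline>"
    "<byline>By I.M.A. Reporter</byline>"
    "<p>And here is my first paragraph with something in <i>italics</i>.</p>"
    "</story>"
)

# Hypothetical mapping: semantic element -> (HTML tag, CSS class or None).
HTML_RULES = {
    "main_headline": ("h1", "headline"),
    "byline": ("p", "byline"),
    "p": ("p", None),
}


def to_plain_text(story_xml):
    """One line per top-level element, with all markup dropped."""
    root = ET.fromstring(story_xml)
    # itertext() yields the text of an element and all of its descendants.
    return "\n".join("".join(child.itertext()) for child in root)


def to_simple_html(story_xml):
    """Rename semantic elements to HTML tags, keeping inline markup like <i>."""
    root = ET.fromstring(story_xml)
    body = ET.Element("body")
    for child in root:
        tag, css_class = HTML_RULES.get(child.tag, ("p", None))
        child.tag = tag
        if css_class:
            child.set("class", css_class)
        body.append(child)
    return ET.tostring(body, encoding="unicode")
```

The same `STORY` string feeds both functions, which is the point: the content is written once, and each target gets its own small set of rules. A real XSLT stylesheet would express `HTML_RULES` as templates instead of a dictionary.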
Other approaches and tradeoffs
NPR’s approach has a lot of benefits. Depending on your business and situation, though, it might mean a lot of database processing, which could be expensive to scale in either licenses or capacity. Caching helps a lot in this case, since once content has been rendered for a device there’s no need to render it again.
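A minimal sketch of that kind of caching, assuming rendered output can be keyed by story and device, might look like this in Python. The `STORIES` dictionary stands in for the database, and `rendered_story` and the device names are hypothetical.

```python
from functools import lru_cache

# Hypothetical content store standing in for the database.
STORIES = {42: "This is a headline"}


@lru_cache(maxsize=1024)
def rendered_story(story_id, device):
    """Render one story for one device; repeat calls come from the cache."""
    text = STORIES[story_id]  # the expensive lookup happens only on a miss
    if device == "web":
        return f"<h1>{text}</h1>"
    return text  # plain-text devices get the bare text
```

The first request for a given story and device pays for the lookup and rendering, and later requests for the same pair are served from memory. If a story is edited, the cached copies are stale, so a real system would need some invalidation step, such as calling `rendered_story.cache_clear()`.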
You could also solve this problem by writing the original content in very simple HTML or XML (in whatever schema you like) and then using something like XSLT to transform the content for each target device. This solution might be CPU-intensive, but it might compare favorably with database operations depending on what you are doing. Plus, XSLT processing can be easily scaled across thousands of parallel nodes if necessary without buying any more database licenses.
If development resources and cycles are the constraint, a dedicated policy layer can help. In the case of our Sonoa ServiceNet technology, you could configure transformation policies that leverage XPath or XSLT from within our proxy. This might also make it easier to add and validate third-party APIs or feeds from outside your own database. You can also handle other types of mediation, such as versioning or protocol transformations, if that is part of your use case, as some of our Sonoa media and consumer web services customers do.




