XML tools update
My Google Summer of Code student, Nhat Minh Le, is working on a suite of simple, efficient, stream-oriented tools for processing XML on UNIX systems. Nhat Minh is making good progress on xmlgrep, a grep-alike program that understands XML syntax.
Read about Nhat Minh's progress on his blog.
Keep reading for my explanation of the niche where Nhat Minh's tools fit.
UNIX's versatile text-processing system consists of simple tools (awk, grep, join, sed, sort) that provide elementary text-processing functions, and tool-building facilities (pipelines and scripts) that let you assemble simple tools into more sophisticated tools. UNIX tools are well-suited to processing tables where there is one record per line and where each field in a record is delimited from the next by a reserved character or characters.
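To make the "one record per line, delimited fields" model concrete, here is a minimal Python sketch of the kind of selection and projection that grep, awk, and sort perform on a table. The sample records, the field layout, and the helper names select and project are all invented for illustration; they do not come from any real system file or tool.

```python
# Hypothetical sample data: one record per line, fields separated by ":".
RECORDS = """\
root:0:/bin/sh
alice:1001:/bin/ksh
bob:1002:/bin/sh
"""

def select(text, field, value, delimiter=":"):
    """Yield records whose field at index `field` equals `value` (a grep/awk-like filter)."""
    for line in text.splitlines():
        fields = line.split(delimiter)
        if fields[field] == value:
            yield fields

def project(records, field):
    """Yield a single field from each record (similar in spirit to cut)."""
    for fields in records:
        yield fields[field]

# Compose the two steps the way a shell pipeline would, then sort the result.
sh_users = sorted(project(select(RECORDS, 2, "/bin/sh"), 0))
print(sh_users)
```

Each step consumes and produces a stream of records, which is what lets small tools be chained. The same approach breaks down on XML, where a record may span many lines and nest inside other records.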
About a decade ago, XML began to show up on UNIX systems in the form of both configuration files and XHTML web pages. Some UNIX administrators grumbled about the introduction of XML, especially XML as a configuration file format. They were accustomed both to reading and writing configuration files, and to automating chores by processing configuration files with standard UNIX tools. Tabular configuration files were better suited than XML files both to processing with standard UNIX tools and to reading and writing by people. Some admins thought that the use of XML as a configuration file format should stop and never be reconsidered. Others felt that while XML configuration files might have promise, introducing them without elementary processing tools was premature; some of those admins waited for XML analogues of awk, sed, et cetera to appear.
And they waited, and waited. Today, there are still no XML tools to equal the economy of implementation, flexibility, and ease of use of the UNIX text-processing tools. Instead, complicated tools with weighty prerequisites, such as a Java virtual machine, are common.
Nhat Minh's XML tools project aims to deliver small, simple programs that can work together to perform sophisticated XML processing tasks, bringing UNIX text processing into the XML era. xmlgrep will extract text from XML documents. A tool called xmlsed will transform them. We are still discussing the merits of xmlsort and xmljoin, and of a tool, with no parallel among the traditional UNIX tools, for "decorating" one XML tree with another.
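To give a feel for what stream-oriented extraction from XML means, here is a rough Python sketch that pulls the text of matching elements out of a document while reading it incrementally. This is not xmlgrep, and it says nothing about xmlgrep's interface or implementation, which this post does not describe. It uses only the Python standard library; the sample document and the function name grep_text are invented for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical XML configuration, invented for this example.
CONFIG = b"""<config>
  <interface name="wm0"><address>10.0.0.1</address></interface>
  <interface name="wm1"><address>10.0.0.2</address></interface>
</config>"""

def grep_text(stream, tag):
    """Yield the text of every element named `tag`, parsing the stream incrementally."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == tag:
            yield (elem.text or "").strip()
        # Discard the element's contents once it has been seen,
        # so the whole document is not retained in memory.
        elem.clear()

addresses = list(grep_text(io.BytesIO(CONFIG), "address"))
print(addresses)
```

Because the function is a generator over a stream, its output could feed another small step in the same way that one UNIX tool feeds the next in a pipeline.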
(A draft of this blog entry first appeared on an internal mailing list of OJC Technologies.)
Posted by David Young on March 17, 2011 at 04:26 AM UTC