June 20, 2009 posted by David Young
My Google Summer of Code student, Nhat Minh Le, is working on
a suite of simple, efficient, stream-oriented tools for processing
XML on UNIX systems. Nhat Minh is making good progress on xmlgrep, a
grep-alike program that understands XML syntax.
Read about Nhat Minh's progress on his blog.
Keep reading for my explanation of the niche where Nhat Minh's tools fit.
UNIX's versatile text-processing system consists of simple tools
(awk, grep, join, sed, sort) that provide elementary text-processing
functions, and tool-building facilities (pipelines and scripts) that let
you assemble simple tools into more sophisticated tools. UNIX tools are
well-suited to processing tables where there is one record per line and
where each field in a record is delimited from the next by a reserved
character or characters.
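To make the record-per-line idea concrete, here is a minimal sketch in
Python of the kind of chore that a short awk-and-sort pipeline handles:
split each line on a reserved character, filter on one field, and sort
the result. The sample data and the function name login_names are
invented for illustration; the layout only imitates the familiar
colon-delimited password-file style.

```python
# Hypothetical sample data: one record per line, with fields separated
# by a reserved character (":"). The accounts are made up.
SAMPLE = """\
root:x:0:0:Superuser:/root:/bin/sh
bob:x:1001:100:Bob:/home/bob:/bin/ksh
alice:x:1000:100:Alice:/home/alice:/bin/ksh
"""


def login_names(text, min_uid=1000, delimiter=":"):
    """Return the sorted login names of records whose UID is >= min_uid."""
    names = []
    for line in text.splitlines():
        fields = line.split(delimiter)   # one record, split into fields
        if int(fields[2]) >= min_uid:    # third field holds the numeric UID
            names.append(fields[0])      # first field holds the login name
    return sorted(names)


print(login_names(SAMPLE))
```

Because each record sits on its own line and the delimiter never
appears inside a field, the whole job reduces to split, test, and sort,
which is exactly the shape of work the traditional tools are built for.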
About a decade ago, XML began to show up on UNIX systems in the
form of both configuration files and XHTML web pages. Some UNIX
administrators grumbled about the introduction of XML, especially XML
as a configuration file format. They were accustomed both to reading
and writing configuration files, and to automating chores by processing
configuration files with standard UNIX tools. Tabular configuration
files were better suited than XML files both to processing with standard
UNIX tools and to reading and writing by people. Some admins thought
that the use of XML as a configuration file format should stop and never
be reconsidered. Others felt that while XML configuration files might
have promise, introducing them without elementary processing tools was
premature; some of those admins waited for XML analogues of awk, sed, et
cetera to appear.
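A small sketch may show why line-oriented tools struggle here. In the
Python below, a grep-style line search is compared with a search that
parses the XML first. The configuration fragment, its element and
attribute names, and both function names are invented for illustration;
this is not a description of how xmlgrep works, only of the problem that
a structure-aware tool has to solve.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration fragment. The same kind of record is laid
# out across several lines in one place and on a single line in another.
CONFIG = """\
<config>
  <interface name="wm0">
    <address>
      192.0.2.1
    </address>
  </interface>
  <interface name="wm1"><address>192.0.2.2</address></interface>
</config>
"""


def grep_lines(text, pattern):
    """Line-oriented search in the spirit of grep: return matching lines."""
    return [line for line in text.splitlines() if pattern in line]


def addresses(text):
    """Structure-aware search: map each interface name to its address."""
    root = ET.fromstring(text)
    return {
        iface.get("name"): iface.findtext("address").strip()
        for iface in root.iter("interface")
    }


print(grep_lines(CONFIG, "<address>"))
print(addresses(CONFIG))
```

The line search returns a bare opening tag for the first interface and
an entire record for the second, because XML does not promise one
record per line. The parsed search gives the same answer for both
layouts.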
And they waited, and waited. Today, there are no XML tools to equal
the economy of implementation, flexibility, and ease of use of the UNIX
text-processing tools. Complicated tools with weighty prerequisites,
such as a Java virtual machine, are common.
Nhat Minh's XML tools project aims to deliver small, simple programs
that can work together to perform sophisticated XML processing tasks,
bringing UNIX text processing into the XML era. xmlgrep will extract
text from XML documents. A tool called xmlsed will transform them. We
are still discussing the merits of xmlsort and xmljoin, and a tool for
"decorating" one XML tree with another that has no parallel in the
traditional UNIX tools.
(A draft of this blog entry first appeared on an internal mailing list
of OJC Technologies.)