October 07, 2009 posted by David Young
This summer I mentored Nhat Minh Lê's project, XML Command-Line
Utilities for NetBSD. Here is my summary of the project goals and
The main idea of the project was to bring the UNIX text-processing idiom
to XML, helping users to employ pipelines, elementary filters, and shell
scripts in XML processing tasks.
Nhat Minh's goal was to produce a set of tools for processing XML
streams on the UNIX command-line. Using his tools, a user should be
able to filter and transform common strains of XML such as XHTML and
XML property lists. Each tool should be small, with few dependencies,
and restrict itself to doing one task very well. Each tool should be
designed to operate efficiently in a pipeline with XML inputs and XML
(or other) outputs. No tool should read its entire input before
writing any output, unless the computation strictly requires it (e.g.,
sorting the input). The novelty and efficiency of Nhat Minh's tools owe
a lot to that last requirement!
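
To illustrate that last requirement, here is a minimal Python sketch of
a filter that emits results while its input is still arriving. It is
not Nhat Minh's code and does not reflect how his tools are implemented;
the function name stream_text and its parameters are made up for this
example, and it relies only on the XMLPullParser class from Python's
standard library.

```python
import xml.etree.ElementTree as ET


def stream_text(chunks, tag):
    """Yield the text of every <tag> element found in an iterable of chunks.

    Hypothetical helper for illustration only. Results are yielded as the
    parser reports closed elements, without waiting for the whole document.
    """
    parser = ET.XMLPullParser(events=("end",))

    def drain():
        # Handle every element whose end tag the parser has reported so far.
        for _event, elem in parser.read_events():
            if elem.tag == tag:
                yield elem.text
            # Drop text and children that are no longer needed. The emptied
            # element stays attached to its parent, so memory use is small
            # but not strictly constant.
            elem.clear()

    for chunk in chunks:
        parser.feed(chunk)
        yield from drain()

    # Finish parsing and report anything the parser was still holding back.
    parser.close()
    yield from drain()
```

A caller can feed it pieces of a document as they are read from a pipe
or socket, for example
stream_text(["<doc><title>A</title>", "<title>B</title></doc>"], "title").
Only the direct text of each matching element is returned; nested markup
inside a match is discarded in this sketch.
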
Nhat Minh and I contemplated six or more tools for processing XML. We
called them xmlgrep, xmlsed, xmlawk, xmlsort, xmljoin, and xmldecorate.
We sketched out and agreed on how the first five tools would operate; as
you can guess, they were roughly analogous to the UNIX tools grep, sed,
awk, sort, and join. We deemed the first two tools the most important,
and difficult enough to occupy Nhat Minh for the entire summer.
Nhat Minh completed xmlgrep this summer. Xmlgrep filters its XML input
using XPath-like expressions. I find it tremendously useful for
finding needles in XML haystacks, for extracting a specific record or
field from an XML dataset, and for filtering superfluous data out of an
XML database. Xmlgrep is fast, and it is memory-efficient: compared
with tools such as xmlstarlet that start by reading their entire input
into a DOM, xmlgrep uses a negligible amount of memory to perform many
useful search and filter functions.
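
The difference from the DOM approach can be sketched in a few lines of
Python. This is only an illustration of streaming path matching under my
own assumptions: the function grep_path and its slash-separated path
argument are hypothetical, and they are neither xmlgrep's pattern syntax
nor its algorithm. The sketch uses iterparse from Python's standard
library and keeps a subtree in memory only while it lies inside a
candidate match.

```python
import xml.etree.ElementTree as ET


def grep_path(source, path):
    """Yield serialized elements whose trailing tag path equals `path`.

    `source` is a filename or binary file object; `path` is a
    slash-separated list of tag names such as "book/title". Both the
    name and the path syntax are hypothetical, for illustration only.
    """
    want = path.split("/")
    stack = []   # tag names of the currently open elements
    inside = 0   # number of open elements that match `want`

    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
            if stack[-len(want):] == want:
                inside += 1
            continue

        matched = stack[-len(want):] == want
        stack.pop()
        if matched:
            inside -= 1
            elem.tail = None  # do not serialize text that follows the element
            yield ET.tostring(elem, encoding="unicode")
        if inside == 0:
            # Outside any match, discard the subtree immediately instead of
            # letting a whole document tree accumulate.
            elem.clear()
```

Given a catalog of book elements, grep_path(f, "book/title") would yield
each title element as soon as it has been parsed, while unrelated
subtrees are cleared as soon as they close.
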
The goal of programming both xmlgrep and xmlsed over the summer was too
ambitious. Nhat Minh tweaked the xmlgrep algorithms and the syntax of
xmlgrep patterns as our knowledge increased, and an agreeable design for
xmlsed's patterns and functions eluded us for a while.
The summer is over, but Nhat Minh continues to work on his XML tools.
Now he has refactored xmlgrep and started a prototype of xmlsed.
Keep an eye out for xmlgrep and xmlsed's integration with the NetBSD