Summer of Code Results: XML command-line utilities
This summer I mentored Nhat Minh Lê's project, XML Command-Line Utilities for NetBSD. Here is my summary of the project goals and results.
The main idea of the project was to bring the UNIX text-processing idiom to XML, helping users to employ pipelines, elementary filters, and shell scripts in XML processing tasks.
Goals
Nhat Minh's goal was to produce a set of tools for processing XML streams on the UNIX command-line. Using his tools, a user should be able to filter and transform common strains of XML such as XHTML and XML property lists. Each tool should be small, with few dependencies, and restrict itself to doing one task very well. Each tool should be designed to operate efficiently in a pipeline with XML inputs and XML (or other) outputs. No tool should read its entire input before writing any output, unless the computation strictly requires it (e.g., sorting the input). The novelty and efficiency of Nhat Minh's tools owe a lot to that last requirement!
Nhat Minh and I contemplated six or more tools for processing XML. We called them xmlgrep, xmlsed, xmlawk, xmlsort, xmljoin, and xmldecorate. We sketched out and agreed on how the first five tools would operate; as you can guess, they were roughly analogous to the UNIX tools grep, sed, awk, sort, and join. We deemed the first two tools the most important, and difficult enough to occupy Nhat Minh for the entire summer.
Results
Nhat Minh completed xmlgrep this summer. Xmlgrep filters its XML input using XPath-like expressions. I'm finding it tremendously useful for finding needles in XML haystacks, for extracting a specific record or field from an XML dataset, and for filtering superfluous data out of an XML database. Xmlgrep is fast, and it is memory-efficient: compared with tools such as xmlstarlet that start by reading their entire input into a DOM, xmlgrep uses a negligible amount of memory to perform many useful search and filter functions.
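To illustrate the difference between streaming and DOM-style processing, here is a minimal Python sketch. It is not xmlgrep's implementation, and it does not use xmlgrep's pattern syntax; the function name stream_matches and the sample property-list fragment are made up for illustration. It assumes a filter that only needs the text of elements with a given tag name, and it emits each match as soon as that element's end tag has been parsed rather than after loading the whole document.

```python
import io
import xml.etree.ElementTree as ET


def stream_matches(source, tag):
    """Yield the text of each <tag> element as soon as its end tag is parsed.

    Hypothetical helper for illustration only; xmlgrep itself matches
    XPath-like expressions, which this sketch does not attempt.
    """
    # iterparse reads the input in chunks and reports elements as they
    # complete, instead of building the full tree before returning anything.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == tag:
            yield (elem.text or "").strip()
        # Discard the element's text and children once it has been handled,
        # which limits how much of the document is held in memory.
        elem.clear()


if __name__ == "__main__":
    # Made-up sample input resembling an XML property list fragment.
    sample = io.BytesIO(
        b"<dict><key>Label</key><string>demo</string>"
        b"<key>Size</key><integer>3</integer></dict>"
    )
    for text in stream_matches(sample, "key"):
        print(text)
```

Because the function is a generator, a caller can write each match to standard output immediately, which is the behaviour a pipeline stage needs. A DOM-based approach would instead parse the entire input first and only then begin searching.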
The goal of programming both xmlgrep and xmlsed over the summer was too ambitious. Nhat Minh tweaked the xmlgrep algorithms and the syntax of xmlgrep patterns as our knowledge increased, and an agreeable design for xmlsed's patterns and functions eluded us for a while.
The summer is over, but Nhat Minh continues to work on his XML tools. He has since refactored xmlgrep and started a prototype of xmlsed.
Keep an eye out for xmlgrep and xmlsed's integration with the NetBSD base system.