Google Summer of Code: Efficient wide character regular expressions

June 29, 2009 posted by Matthias-Christian Ott

During this year’s Google Summer of Code I’m improving the performance of NetBSD’s regular expression library and adding support to it for wide characters.

NetBSD’s current regular expression library was initially written by Henry Spencer in 1986, when ASCII was the de facto standard. Moreover, Spencer described it as slow. So it makes real sense to replace it and seek for alternatives.

I decided to develop a new regular expression library at first. It seemed rather easy in the beginning, but turned out to be more difficult in all its details than I expected (everyone who read the regular expression specification in the POSIX standard will agree). Therefore, I was really happy when Ville Laurikari offered to release his regular expression library tre, which he wrote his master thesis about, under a 2-clause BSD licence, so that tre can become part of libc.

tre is a mature, strictly POSIX-conforming regular expression library with additional features, such as approximate matching, matching on binary data, thread safety, ABI compatibility and wide character support. tre also includes agrep, an approximate version of grep, which will be imported into NetBSD.

At the moment I’m preparing a performance benchmark, which is based on realistic regular expressions and data, to ensure that tre is verifiably efficient. If the benchmark reveals that tre is less efficient than NetBSD’s current regular expression library, I will try to address the performance issues. Afterwards tre will be integrated in libc.

Moreover, Glenn Fowler from AT&T Research developed a regression testing framework for regular expressions, which I will integrate with the atf to test and ensure POSIX-compliance.



Post a Comment:
Comments are closed for this entry.