In my previous post I compared my D implementation of the PorterStemmer to the Python NLTK version. The results were decent: the D stemmer ran almost twice as fast.
However, John Colvin was dissatisfied with the performance of my implementation. With his changes he was able to reduce the runtime to an astounding 0.04s per cycle, down from 0.6s. Quite the speedup. I was excited to try it out against the C implementation provided by Martin Porter himself.
D implementation changes
This was quite impressive, considering the changes were relatively simple. The three major changes were adjustments to the loops, changes to the character replacement, and using static tuples for the adjustment pairs.
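The post doesn't show the static-tuple change itself, so here is a hedged sketch of what it might look like: the suffix/replacement pairs held in a compile-time AliasSeq (a "static tuple"), so the matching loop is unrolled at compile time instead of walking a runtime array. The pair values and the function name are illustrative, not taken from the actual stemmer.

```d
import std.meta : AliasSeq;
import std.algorithm.searching : endsWith;

// A few suffix/replacement pairs as a compile-time sequence,
// laid out as (suffix, replacement, suffix, replacement, ...).
alias pairs = AliasSeq!("ational", "ate", "tional", "tion");

string applyFirstMatch(string s)
{
    // static foreach is unrolled at compile time: each pair becomes
    // a straight-line endsWith check, with no runtime iteration.
    static foreach (i; 0 .. pairs.length / 2)
    {
        if (s.endsWith(pairs[2 * i]))
            return s[0 .. $ - pairs[2 * i].length] ~ pairs[2 * i + 1];
    }
    return s;
}

void main()
{
    assert(applyFirstMatch("relational") == "relate");
    assert(applyFirstMatch("conditional") == "condition");
    assert(applyFirstMatch("cat") == "cat");
}
```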
The loop adjustments involved moving from a traditional for to foreach, even for indexed loops.
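The loops themselves aren't shown in the post, so this is only a sketch of the shape of the change, using a throwaway example string:

```d
void main()
{
    string s = "stemming";

    // Before: a traditional indexed for loop, with manual bounds
    // and increment bookkeeping.
    char[] before;
    for (size_t i = 0; i < s.length; i++)
        before ~= s[i];

    // After: foreach yields the index and element together, so the
    // bounds and increment cannot be gotten wrong.
    char[] after;
    foreach (i, c; s)
        after ~= c;

    assert(before == after);
}
```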
This didn't just improve the runtime; it is also more idiomatic D, and less prone to error than a traditional for loop.
Previously I was removing suffixes by adjusting the length of the string with s.length--. Using slices of the source string instead proved to be a significant improvement: s = s[0 .. $-1].
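A minimal sketch of the slicing idiom (the words here are illustrative, not taken from the stemmer):

```d
void main()
{
    string s = "caresses";

    // Re-slicing narrows the view over the same underlying data,
    // so dropping the last character copies and allocates nothing.
    s = s[0 .. $ - 1];
    assert(s == "caresse");

    // Removing a whole suffix works the same way.
    s = s[0 .. $ - "esse".length];
    assert(s == "car");
}
```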
John's final improvement was to use slice indices to replace characters directly, instead of using Unicode-compatible methods. I was originally replacing substrings like this: replaceInPlace(s, s.length - 1, s.length, "i");. John simply used slices: s[$-1] = 'i';. I was initially concerned about breaking compatibility with the text set I like to work with. Below I go over those concerns.
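The contrast between the two approaches, sketched with an illustrative word (a mutable char[] is used so that direct index assignment is legal):

```d
import std.array : replaceInPlace;

void main()
{
    char[] s = "happy".dup; // .dup gives a mutable copy

    // Original approach: Unicode-aware substring replacement.
    replaceInPlace(s, s.length - 1, s.length, "i");
    assert(s == "happi");

    // John's approach: write the code unit directly. This is only
    // safe when the old and new characters are both single-byte
    // (ASCII) code units.
    s[$ - 1] = 'y';
    assert(s == "happy");
}
```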
I also spent a small amount of time (about 5 minutes) speeding up the Python benchmark. By adjusting the benchmark script to use a list comprehension instead of appending to a list, I was able to reduce the runtime from 0.7s to about 0.5s per iteration. This is still over 10x slower than the updated D implementation.
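The change amounted to something like this (a sketch; stem stands in for the NLTK stemmer call, which isn't shown in the post, and the function names are made up):

```python
def stem_all_append(stem, words):
    # Original version: grow the list one element at a time.
    out = []
    for w in words:
        out.append(stem(w))
    return out


def stem_all_comprehension(stem, words):
    # Updated version: a list comprehension does the same work
    # without the per-element out.append lookup-and-call overhead.
    return [stem(w) for w in words]
```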
Comparison with C implementation
Initially I wanted to test against the Snowball-generated implementation. However, it turned out to be a massive pain to get working, so I opted for the version hand-coded by Dr. Porter.
The C implementation does a bit more work than the Python or D benchmarks: its stemfile function reads strings from the file as well as stemming each one. Originally it was also printing to stdout. To make the benchmarks more comparable, I removed the output from the D and Python implementations as well. This had a dramatic effect on the C benchmark, but not on the D or Python ones (likely because they were already using a more efficient output method).
The C implementation is indeed fast. It completes each cycle in around 0.015s, more than twice as fast as the D implementation. This isn't too surprising.
Still Unicode compatible?
Initially, I thought I needed to handle Unicode compatibility for a historical text set that I like to work with. The text contains mostly English characters, but also some extra Latin, Germanic, French, Greek, and other characters from western and southern European languages. However, when I began to test this assumption against the C and D implementations, I was surprised: both handled a wide variety of characters that I assumed would crash them. The example below passed in both implementations:
"⚷☧Œstraße" -> "⚷☧Œstraß"
This, despite using direct access to the string indices. The only conclusion I can come up with is that I don't really understand what Unicode is…
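One possible explanation (my own guess, not something the post establishes): D strings are arrays of UTF-8 code units, and UTF-8 encodes non-ASCII characters using only bytes of value 0x80 and above, which can never compare equal to the ASCII letters a stemmer inspects. A byte-level suffix check would therefore pass multi-byte characters through untouched:

```d
void main()
{
    string s = "straße"; // 'ß' occupies two UTF-8 code units: 0xC3 0x9F
    assert(s.length == 7); // length counts code units, not characters

    // The trailing byte really is the ASCII 'e'; no byte of the 'ß'
    // sequence could ever match an ASCII letter.
    assert(s[$ - 1] == 'e');

    // Dropping one code unit removes exactly that 'e' and leaves the
    // multi-byte 'ß' intact.
    assert(s[0 .. $ - 1] == "straß");
}
```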
With a couple of changes, the D implementation was able to run in one tenth the time, bringing it out of the realm of Python performance and closer to the C implementation.
My misgivings about John's changes proved to be unfounded. It turns out I don't understand character sets as well as I thought, but the D implementation still handles the character set I need it to work with, and that's all I really care about.