Splitting Words in Perl

I've created a new module, Text::WordDiff, now on its way to CPAN, to show the differences between two documents using words as tokens, rather than lines as Text::Diff does. I plan to use it in Bricolage to give people a change-tracking view (as seen in word processors) comparing two versions of a document. Fortunately, Algorithm::Diff makes this extremely easy to do. My only real problem was figuring out how to tokenize a string into words.

After looking at discussions in The Perl Cookbook and Mastering Regular Expressions, I settled on using Friedl's pattern for identifying the starting boundary of words, which is qr/(?<!\w)(?=\w)/msx. This pattern turns the string "this is O'Reilly's string" into the following tokens:

[
    q{this },
    q{is },
    q{O'},
    q{Reilly'},
    q{s },
    q{string},
];
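Since the pattern is zero-width, it can be fed straight to split; helpfully, a zero-width match at the very start of the string never produces an empty leading field, so the first word comes out intact. A quick sketch:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Split into word-ish tokens at every non-word/word transition.
# Trailing whitespace stays attached to the preceding token.
my @tokens = split /(?<!\w)(?=\w)/msx, q{this is O'Reilly's string};

print map { "<$_>" } @tokens;
# Prints: <this ><is ><O'><Reilly'><s ><string>
```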

So it's imperfect, but it works well enough for me. I'm thinking of using the Unicode character class for words, instead, at least for more recent versions of Perl that understand them (5.8.0 and later?). That would be /(?<!\p{IsWord})(?=\p{IsWord})/msx. The results using that regular expression are the same.
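For plain ASCII text the two patterns are interchangeable, which is easy to check directly (a minimal sanity test, not part of Text::WordDiff itself):

```perl
use strict;
use warnings;

my $text    = q{this is O'Reilly's string};
my @ascii   = split /(?<!\w)(?=\w)/msx,                 $text;
my @unicode = split /(?<!\p{IsWord})(?=\p{IsWord})/msx, $text;

# Both patterns find the same word-start boundaries in ASCII text.
print "identical\n" if @ascii == @unicode
    && !grep { $ascii[$_] ne $unicode[$_] } 0 .. $#ascii;
```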

But otherwise, I'm not sure whether this is the best approach. I think that it's good enough for the general cases I have, and matching words precisely is not that important. What I mean is that, as long as most tokens are words, it's okay with me if some, such as O', Reilly', and s in the above example, are not. What I don't know is how well it'll work for non-Roman scripts, such as Japanese or Korean text. I tried a test on a Korean string I have lying around (borrowed from the Encode.pm test suite), but it didn't split it up at all (with use utf8;).
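My guess as to why (an assumption on my part, not something I've verified against Encode.pm's actual test string): under use utf8, Hangul syllables are themselves word characters, so a solid run of them contains no non-word-to-word transition for the pattern to match:

```perl
use strict;
use warnings;
use utf8;

# Hangul syllables are Unicode word characters, so a solid run of
# them has no non-word/word transition to split on.
my $korean = "\x{D55C}\x{AD6D}\x{C5B4}";    # three Hangul syllables
my @tokens = split /(?<!\w)(?=\w)/msx, $korean;

print scalar @tokens, "\n";    # 1 -- the whole run is a single token
```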

So what do you think? Does Text::WordDiff work for your text? Is there a better and more general solution for tokenizing the words in a string?

Backtalk

Andy wrote:

East Asian Languages

It seems to me that it'll work fine with any language that uses non-word characters to delimit words. (Okay, that's kind of just a restatement of your regexp.) This does mean that it'll fail with Chinese, Japanese, and Korean, as they don't put non-word characters between words. While punctuation (commas, periods, etc.) has become popular (though not universal), word-delimiting spaces have not. Determining word boundaries for these languages is extremely non-trivial and sometimes impossible. Lingua::ZH::Toke seems to take a stab at it for Chinese.

Theory wrote:

Re: East Asian Languages

Thanks for your comments, Andy. It is rather as I thought. I wonder if there's some way that I can detect single-character words in such languages, and split on characters, instead?
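One way it might work (purely a sketch of the idea, not something Text::WordDiff does): extend the boundary pattern so that it also matches before and after any character in the Han, Hiragana, or Katakana scripts, making each such character its own token. (Hangul could be added to the class too, though Korean usually has word-delimiting spaces anyway.) Something like:

```perl
use strict;
use warnings;
use utf8;

# Extend the word-start boundary: also break before and after each
# Han, Hiragana, or Katakana character, so every such character
# becomes a one-character token. A space between a CJK run and the
# next word ends up as its own token -- a wart of this sketch.
my $boundary = qr/
      (?<!\w) (?=\w)                             # original word start
    | (?=  [\p{Han}\p{Hiragana}\p{Katakana}] )   # before a CJK char
    | (?<= [\p{Han}\p{Hiragana}\p{Katakana}] )   # after a CJK char
/msx;

my @tokens = split /$boundary/, "diff \x{6F22}\x{5B57} text";
# yields: 'diff ', U+6F22, U+5B57, ' ', 'text'
```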

—Theory