Home

This site is a collection of code and information that I’ve either written or found useful. Much of it relates to processing Japanese text, which I got into while studying the Japanese language, and some of it is tools that I’ve built to help with my own study. Some of them might be useful to you too: check out my Japanese text parser and my Japanese article reader.

Latest Posts


Browser button to parse selected text

It can be irritating to copy and paste text into the parser. I basically copied this code from WWWJDIC’s buttons.

On Firefox, you can bookmark this link. Then just select text on whatever webpage you’re viewing and click the bookmark.

Japanese article reader

I’ve put together an article reading system to help learn Japanese, sort of like http://readjapanesenews.com or http://www.japaneseclass.jp. The difference is that the articles can be searched based on the words you want to study. Given a list of words, it scans a small database of articles and returns the ones containing as many of those words as possible. When I use it, it automatically pulls the list of kanji to search for from my Anki deck’s marked cards. If you want, you can try it out with your own list of words. You can paste them in pretty much any format, one per line or separated by commas; any words that aren’t in my dictionary will be ignored. You can also find a more compact form here. Here are some example words that you might search for to test:
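To make the matching idea concrete, here is a rough sketch in Python of how a pasted word list could be split up and used to rank articles by the number of matching words. This is not the code behind the reader: the function names, the in-memory `articles` dictionary and the plain substring test are all assumptions for illustration (a real version would more likely match on word-broken text).

```python
import re


def parse_word_list(raw):
    """Split pasted text on whitespace, newlines and ASCII or Japanese commas."""
    return [word for word in re.split(r"[\s,、，]+", raw) if word]


def rank_articles(articles, words, dictionary=None):
    """Return (title, match_count) pairs, best match first.

    `articles` maps a title to its text (a stand-in for the article database).
    If `dictionary` is given, words that are not in it are ignored.
    """
    wanted = set(words)
    if dictionary is not None:
        wanted &= set(dictionary)
    scored = []
    for title, text in articles.items():
        # Assumption: a simple substring test stands in for real word matching.
        count = sum(1 for word in wanted if word in text)
        if count:
            scored.append((title, count))
    # Most matches first; ties are broken by title so the order is stable.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored


# Made-up sample data, only to show the call shape.
sample_articles = {
    "a": "今日は天気がいい",
    "b": "天気予報によると明日は雨",
}
ranking = rank_articles(sample_articles, parse_word_list("天気, 雨\n予報"))
```

Sorting by match count is the simplest possible ranking; weighting rarer words more heavily would be an obvious refinement, but that goes beyond what is described here.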


Parsing the EDICT format in Perl

If you’ve looked at the EDICT file format documentation, you will have noticed that while the format is consistent, it’s not totally obvious how it should be parsed. There’s no escaping of characters, and although “/” is used as a separator and “(tag)” denotes a set of informational tags, you will find plenty of similar strings scattered within the entries. Here’s some code that deals with it fairly nicely. More…
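As a rough illustration of the problem (in Python here, not the Perl from the full post), the sketch below splits one EDICT-style line into headword, reading, tags and glosses. It only treats a leading “(…)” group as tags when every item in it appears in a known-tag list, so parentheses inside the English text are left alone. The `KNOWN_TAGS` set is a tiny hypothetical subset of the real tag list, the sample line is made up in the EDICT layout, and a gloss that itself contains “/” would still be split incorrectly.

```python
import re

# Hypothetical, deliberately small whitelist; the real EDICT tag list is much longer.
KNOWN_TAGS = {"n", "v1", "vt", "vi", "adj-i", "adj-na", "exp", "uk", "P"}

# headword, optional [reading], then everything between the first and last "/".
ENTRY_RE = re.compile(r"^(\S+)\s+(?:\[([^\]]+)\]\s+)?/(.*)/\s*$")
PAREN_RE = re.compile(r"^\(([^()]*)\)\s*")


def parse_edict_line(line):
    """Split one EDICT-style line into headword, reading, tags and glosses."""
    match = ENTRY_RE.match(line.strip())
    if match is None:
        return None
    word, reading, body = match.groups()
    tags, glosses = [], []
    for field in body.split("/"):
        text = field.strip()
        # Peel off leading "(...)" groups only while every item is a known tag.
        while True:
            paren = PAREN_RE.match(text)
            if paren is None:
                break
            items = [item.strip() for item in paren.group(1).split(",")]
            if not all(item in KNOWN_TAGS for item in items):
                break
            tags.extend(items)
            text = text[paren.end():]
        if text:
            glosses.append(text)
    return {"word": word, "reading": reading, "tags": tags, "glosses": glosses}


# Sample line written in the EDICT layout for illustration.
entry = parse_edict_line("食べる [たべる] /(v1,vt) to eat (food)/(P)/")
```

Checking against a whitelist is one way around the lack of escaping: “(food)” is not a known tag, so it stays part of the gloss, while the trailing “(P)” field is recorded as a tag and produces no gloss of its own.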

Japanese programming resources

I’m compiling a list of useful resources for processing Japanese language content. So far, word breaking, dictionaries and regexps have done everything I need. More…

Japanese word breaking using MeCab

If you are not fluent in Japanese, the documentation for MeCab can look intimidating. For the simplest use, this will do the job. Internally it uses EUC-JP, so you may have to switch encodings if (like me) everything else is in UTF-8. MeCab produces a fair bit of information, but for my purposes I only needed a little bit of it. More…
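For a feel of what “a little bit of it” can mean, here is a small Python sketch that converts text to and from EUC-JP and pulls just the surface form, part of speech and base form out of MeCab-style output. It does not call MeCab itself: the sample output string is written by hand in the default IPADIC layout (surface, a tab, then comma-separated features, with `EOS` ending each sentence), and the function names and the choice of fields are my own assumptions.

```python
def to_euc_jp(text):
    """Encode a Python string as EUC-JP bytes before handing it to an EUC-JP tool."""
    return text.encode("euc_jp")


def from_euc_jp(data):
    """Decode EUC-JP bytes coming back from the tool into a Python string."""
    return data.decode("euc_jp")


def parse_mecab_output(output):
    """Keep only surface form, part of speech and base form for each token."""
    tokens = []
    for line in output.splitlines():
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        # Assumption: IPADIC layout, where field 0 is the part of speech and
        # field 6 is the base form ("*" when the word is unknown).
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        tokens.append({"surface": surface, "pos": fields[0], "base": base})
    return tokens


# Hand-written sample in the IPADIC output layout, not captured from a real run.
sample_output = (
    "食べ\t動詞,自立,*,*,一段,連用形,食べる,タベ,タベ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS\n"
)
tokens = parse_mecab_output(sample_output)
round_trip = from_euc_jp(to_euc_jp("食べた"))
```

If your MeCab dictionary was built as UTF-8, the encoding step is unnecessary; the two helper functions only matter when the dictionary really is EUC-JP.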