Hanzi Recognizer is a project that grew out of CS490t at Purdue University, built by Nathan Hobbs and me. The information below is current as of 11/24/2007.
Features
When you draw a Chinese character in the drawing panel and look it up, the program returns the 5 closest matches. For each match, it also shows the definition, pronunciation, main radical, and the main radical’s definition.
High Level Implementation Details
The stroke recognition and scoring algorithm isn’t the greatest, but it is based on JavaDict, a 10-year-old Java 1.2 program. A stroke is everything drawn from pen down until pen up. The algorithm turns each stroke into a number whose digits represent the directions in which the stroke was drawn. A stroke that changes direction (like a capital L) is registered as two directional strokes.
Directional strokes:
7   8   9
  \ | /
4 --5-- 6
  / | \
1   2   3
So an L would be considered a 26 stroke: down (2), then right (6). Currently, the program scores the drawn strokes against every stroke entry in the database, with the lowest score being the best match.
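The direction encoding above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the JavaDict code: the function names, the `min_length` jitter threshold, and the choice to measure each movement from the last accepted point are all made up for this example. It assumes screen coordinates, where y grows downward.

```python
import math

# Numpad digits indexed by octant, counter-clockwise starting from east (6).
OCTANT_TO_DIGIT = [6, 9, 8, 7, 4, 1, 2, 3]


def direction_digit(dx, dy):
    """Map a pen movement to a numpad digit (dy is positive going down)."""
    angle = math.atan2(-dy, dx)  # flip y so that "up" is a positive angle
    octant = round(angle / (math.pi / 4)) % 8
    return OCTANT_TO_DIGIT[octant]


def encode_stroke(points, min_length=5.0):
    """Turn one pen-down..pen-up list of (x, y) points into a digit string.

    min_length is a hypothetical threshold that ignores tiny movements.
    """
    if not points:
        return ""
    digits = []
    anchor = points[0]
    for point in points[1:]:
        dx, dy = point[0] - anchor[0], point[1] - anchor[1]
        if math.hypot(dx, dy) < min_length:
            continue  # too short to carry a reliable direction
        digit = direction_digit(dx, dy)
        if not digits or digits[-1] != digit:
            digits.append(digit)  # record only changes of direction
        anchor = point
    return "".join(str(d) for d in digits)


# A capital L: straight down, then straight right.
l_stroke = [(0, 0), (0, 10), (0, 20), (10, 20), (20, 20)]
print(encode_stroke(l_stroke))
```

Collapsing repeated digits is what makes a long straight line and a short one produce the same code, so only the changes of direction end up in the number.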
We then have a Unistrok file that describes how to draw every character. An entry in this file looks similar to:
6c34 | 61 2 1 3
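The format is [unicode | strokes]: the character’s Unicode code point in hexadecimal, followed by its strokes in the directional encoding described above.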
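As a rough sketch, an entry in that shape could be read like this in Python. It assumes each line has exactly the two fields shown above and nothing else; the real Unistrok file may carry extra annotations that this ignores. The function names are hypothetical.

```python
def parse_unistrok_line(line):
    """Split 'hexcode | stroke stroke ...' into (character, list of strokes)."""
    code_field, stroke_field = line.split("|", 1)
    character = chr(int(code_field.strip(), 16))  # hex code point -> character
    strokes = stroke_field.split()                # one token per stroke
    return character, strokes


def load_stroke_table(lines):
    """Build a dict from character to its stroke list."""
    table = {}
    for line in lines:
        if "|" not in line:
            continue  # skip blank or malformed lines
        character, strokes = parse_unistrok_line(line)
        table[character] = strokes
    return table


table = load_stroke_table(["6c34 | 61 2 1 3"])
print(table)
```

Keying the table by character means a later lookup is a single hash access instead of a scan over every entry.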
To Do
We need a better scoring algorithm. Currently, if too many or too few strokes are drawn, we just arbitrarily give the character a bad score. This can be improved by implementing an edit distance algorithm.
We want to allow characters to be searched by pronunciation or radical.
It takes a long time to load. This is a side effect of storing our characters unsorted, so it takes O(n) time to search for a character. In itself that’s not slow, but doing that 50,000 times is slow. We need to sort the characters as we store them and search with a binary search (or hash them and search in O(1)).
Nate just implemented some basic speech functionality. We need to get some speakers to record the pronunciations and then hook the recordings up.
We are currently creating our own XML file format to store the characters so we don’t have to pull information from 3 different databases. This file can be rebuilt as the others are updated, and by itself it would cut load times by almost a full order of magnitude.
We need to add support for more radical variants and for simplified / traditional characters. Currently, there are many places where we (seemingly) arbitrarily choose to display either a traditional or simplified variant.
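For the scoring item above, here is a minimal sketch of what an edit distance over strokes might look like. It is not what the program does today. It assumes each character is a list of stroke codes, as in the Unistrok example, and it charges a flat cost of 1 for every missing, extra, or mismatched stroke. The function names and that cost model are assumptions for illustration.

```python
def stroke_edit_distance(drawn, stored):
    """Levenshtein distance between two lists of stroke codes."""
    previous = list(range(len(stored) + 1))
    for i, drawn_stroke in enumerate(drawn, 1):
        current = [i]
        for j, stored_stroke in enumerate(stored, 1):
            substitution = 0 if drawn_stroke == stored_stroke else 1
            current.append(min(
                previous[j] + 1,                 # extra stroke was drawn
                current[j - 1] + 1,              # a stroke is missing
                previous[j - 1] + substitution,  # strokes match or differ
            ))
        previous = current
    return previous[-1]


def closest_matches(drawn, table, count=5):
    """Return the `count` characters whose strokes are nearest to `drawn`."""
    ranked = sorted(table, key=lambda ch: stroke_edit_distance(drawn, table[ch]))
    return ranked[:count]


# Toy table with placeholder keys; a real one would map characters to strokes.
toy_table = {"a": ["2", "6"], "b": ["61", "2", "1", "3"], "c": ["2"]}
print(closest_matches(["61", "2", "1"], toy_table, count=2))
```

With this scoring, drawing one stroke too few costs 1 instead of an arbitrary penalty, so a nearly complete character still ranks near the top. A finer version could compare the direction digits inside each stroke rather than treating a mismatch as all-or-nothing.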
References