
Multi-language, Large Character Set Information Processing





The collections of a museum span a wide range of periods and regions. In digital archiving, items need to be entered into the computer not only as graphical images but also as text, so that they can later be searched and reused. Here we face a major problem: the relatively small number of distinct characters that computer systems can handle, and the difficulty of mixing characters from multiple languages in a single document.

For example, the JIS character code, which is the basis of computer text processing in Japan today, contains 6,355 kanji characters. This number is too small: when documents from before the Showa era are digitized, many of their characters turn out to be missing from the character code. The Unicode character code standard contains about 20,000 kanji characters. Unfortunately, because distinct characters used in China, Korea, Taiwan, and Japan were "unified" in order to keep the total to around 20,000, there is a high chance that software will render a document incorrectly when it uses several of these languages at once. Furthermore, the number of characters needed to digitize the world's texts, including those of ancient times, is well above 20,000.
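The missing-character problem can be seen with a small sketch. It assumes Python's built-in `shift_jis` codec as a stand-in for a JIS-based text environment; the helper name `missing_from` and the sample name are illustrative only and do not come from the archive itself.

```python
def missing_from(text, codec="shift_jis"):
    """Return the characters of `text` that `codec` cannot represent."""
    missing = []
    for ch in text:  # iterating a str yields one code point at a time
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


# U+5409 (吉) is an ordinary JIS kanji; U+20BB7 (𠮷) is a variant form
# seen in personal names that lies outside the JIS X 0208 set.
sample = "\u5409\u7530\u3068\U00020bb7\u7530"  # 吉田と𠮷田
print(missing_from(sample))  # reports only the variant U+20BB7
```

A record containing the variant form cannot be stored faithfully in such an environment: it must either be replaced with the common form or left as a gap.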

To solve this character handling problem, the digital archive uses the TRON multi-lingual large character set processing environment. This environment supports an extensible character code system and currently covers about 130,000 characters. The characters are stored in a character-sensitive database, so a search can specify exactly which character is meant. The distinctions about what counts as a different character change over time and across regions.
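What a character-sensitive search means can be sketched as follows. This is not the TRON environment's implementation; the variant table `VARIANT_TO_BASE` and the functions `fold` and `search` are hypothetical names chosen for illustration, and the table holds only two hand-picked entries.

```python
# Hypothetical variant table: maps a variant form to its common form.
VARIANT_TO_BASE = {
    "\u9ad9": "\u9ad8",      # 髙 -> 高
    "\U00020bb7": "\u5409",  # 𠮷 -> 吉
}


def fold(text):
    """Replace each known variant character with its common form."""
    return "".join(VARIANT_TO_BASE.get(ch, ch) for ch in text)


def search(records, query, exact=True):
    """Exact search keeps variants distinct; loose search folds them."""
    if exact:
        return [r for r in records if query in r]
    folded_query = fold(query)
    return [r for r in records if folded_query in fold(r)]


records = ["\u9ad8\u6a4b", "\u9ad9\u6a4b", "\u5409\u7530"]  # 高橋, 髙橋, 吉田

print(search(records, "\u9ad9"))               # only the record written with 髙
print(search(records, "\u9ad9", exact=False))  # both 高橋 and 髙橋
```

Keeping every variant as its own character makes the exact search possible, while the folding step stays an optional layer that can be changed as the notion of "same character" shifts between periods and regions.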

[Image: Multi-language, multi-kanji]