MySQL Database Sorting for Japanese Text with Mixed Writing Systems

I’m working with a MySQL database that contains Japanese text data and need help with character sorting and searching. My current setup uses the utf8_general_ci collation, and I’ve been testing with utf8_unicode_ci as well. I’m stuck on MySQL 5.6, so newer collation options aren’t really available to me right now.

The main problem I’m facing is with search functionality. I have a location field in my database that might contain the same place name written in different Japanese scripts. For example, I want a search to find the same record whether someone types Tokyo (romaji), 東京 (kanji), とうきょう (hiragana), or トウキョウ (katakana).

Is there a way to configure MySQL collation or use some other approach to make searches work across these different writing systems? I need all these variations to match the same database row when querying.

Been there with a legacy MySQL 5.6 setup handling Japanese addresses. Collation won’t work - MySQL collations only compare characters, they have no idea how the different Japanese writing systems relate to each other. I built a preprocessing layer in the app that normalizes text before it hits the database. Used the kakasi library to convert everything to romaji for both inserts and searches, then stored the original text plus the normalized version. The key thing is treating this as a data transformation problem, not a database config issue. You could try FULLTEXT indexes with a custom parser, but that gets messy quickly. Preprocessing gave me reliable cross-script matching without touching the database schema.
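A minimal, standard-library-only sketch of that normalization idea (the function name normalize_ja is mine, not from any library). It only covers the kana-folding half of the problem; converting kanji like 東京 to a reading still needs a dictionary-backed tool such as kakasi (pykakasi in Python):

```python
import unicodedata

def normalize_ja(text: str) -> str:
    """Fold text to a single searchable key: NFKC-normalize,
    convert katakana to hiragana, and lowercase any romaji."""
    # NFKC folds width variants, e.g. half-width ﾄｳｷｮｳ -> トウキョウ
    text = unicodedata.normalize("NFKC", text)
    folded = []
    for ch in text:
        code = ord(ch)
        # The katakana block U+30A1..U+30F6 sits exactly 0x60 above
        # the corresponding hiragana block U+3041..U+3096
        if 0x30A1 <= code <= 0x30F6:
            folded.append(chr(code - 0x60))
        else:
            folded.append(ch)
    return "".join(folded).lower()

# Kana variants now collapse to one key; kanji and romaji spellings
# still need a reading dictionary (e.g. kakasi) on top of this.
print(normalize_ja("トウキョウ") == normalize_ja("とうきょう"))  # True
```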

MySQL collation isn’t going to do the trick for mixed scripts like that. You might want to normalize the data in your app, or add extra search columns with the converted versions. Possibly look into Elasticsearch for better search capabilities?

Honestly, your best bet is probably a middleware solution. I dealt with something similar and ended up using MeCab to tokenize the Japanese text, then storing phonetic representations alongside the original data. Works pretty well for cross-script matching without major database changes.

MySQL 5.6 doesn’t support Japanese script conversion, so you’ll have to handle it in your app. I ran into this same issue and ended up creating a custom function that converts all Japanese text to a standard form before storing and searching. I used JpnForPhp (PHP library) for hiragana/katakana conversion and kept a mapping table for common kanji-to-kana conversions. Another option is setting up a trigger that auto-populates a separate search column with normalized versions of your location names. Your original data stays untouched, but searches work across all writing systems. Performance isn’t terrible if you index the normalized column right.
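A sketch of what that trigger setup might look like on MySQL 5.6 (table and column names here are hypothetical, and since MySQL itself can’t do kana/kanji conversion, the trigger body below only lowercases as a placeholder - the real normalized value would come from your app or a mapping-table lookup):

```sql
-- Hypothetical schema: locations(name), plus a normalized search column
ALTER TABLE locations
    ADD COLUMN name_search VARCHAR(255),
    ADD INDEX idx_name_search (name_search);

DELIMITER //
CREATE TRIGGER locations_before_insert
BEFORE INSERT ON locations
FOR EACH ROW
BEGIN
    -- Placeholder normalization: LOWER() only handles romaji case.
    -- Kana folding / kanji readings must be computed app-side (or via
    -- a mapping table) and written into NEW.name_search instead.
    SET NEW.name_search = LOWER(NEW.name);
END//
DELIMITER ;
```

You’d add a matching BEFORE UPDATE trigger so edits keep the column in sync.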

Had this exact issue on an old MySQL 5.6 system a couple years back. MySQL collation can’t handle script conversions, so I used a dual-column setup that worked great. Left the original location field alone and added a “search_normalized” column with standardized location names. Used a server-side script with ICU library to populate the normalized column - converted hiragana to romaji, katakana to romaji, plus kept a lookup table for common kanji readings. Users search against the normalized column but get back the original formatted data. Got consistent results across all writing systems without touching the search logic. Performance was fine with proper indexing on the normalized column.
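To make the dual-column flow concrete, here’s a rough sketch using sqlite3 purely as a stand-in for MySQL 5.6 so the example stays self-contained (the schema and query pattern are the same). The normalize_ja helper is my simplified placeholder for the ICU pipeline: it folds kana but does not resolve kanji readings, which is what the lookup table in this answer handles:

```python
import sqlite3
import unicodedata

def normalize_ja(text: str) -> str:
    """Kana-fold + lowercase; placeholder for the ICU/kakasi pipeline."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(
        chr(ord(c) - 0x60) if 0x30A1 <= ord(c) <= 0x30F6 else c
        for c in text
    ).lower()

# In-memory sqlite3 standing in for the MySQL table:
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE locations (name TEXT, search_normalized TEXT)")
db.execute("CREATE INDEX idx_search ON locations (search_normalized)")

# Store the original text untouched, plus the normalized search key
for name in ["東京", "とうきょう", "トウキョウ"]:
    db.execute(
        "INSERT INTO locations VALUES (?, ?)", (name, normalize_ja(name))
    )

# Normalize the user's query the same way, search the indexed column,
# return the original formatting
rows = db.execute(
    "SELECT name FROM locations WHERE search_normalized = ?",
    (normalize_ja("トウキョウ"),),
).fetchall()
# Matches the kana rows; the kanji row 東京 is only found once a
# kanji-reading lookup is added to normalize_ja
print([r[0] for r in rows])
```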