One of the many problems I ran into while hacking on muse, the music library management program I’ve been writing, is that some bands include non-ASCII characters in their names, and these are used inconsistently. For example, Blue Öyster Cult is sometimes spelled with an Ö and other times with an O. This makes it hard for the software to recognize that both spellings refer to the same band.
It took me a long time to figure out how to translate the accented characters used in file names to their nearest ASCII equivalents. The problem was that I had assumed Linux file names were encoded using ISO 8859-1 (AKA latin1). When I finally dumped one of the names in hex, I discovered that they were actually encoded in UTF-8.
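To illustrate what such a hex dump shows, here is a minimal sketch (written for Python 3, and not taken from muse) that encodes Ö both ways. The helper name `hex_dump` is made up for this example.

```python
def hex_dump(data):
    """Return the bytes of `data` as space-separated hex pairs."""
    return " ".join("{:02x}".format(byte) for byte in data)

o_umlaut = "\u00d6"  # Ö, Unicode code point U+00D6

# In ISO 8859-1 the character is a single byte equal to its code point.
print(hex_dump(o_umlaut.encode("latin-1")))  # d6

# In UTF-8 the same character takes two bytes.
print(hex_dump(o_umlaut.encode("utf-8")))    # c3 96
```

Seeing two bytes (c3 96) where a single d6 was expected is the sign that the name is UTF-8 rather than latin1.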
The translation table is a simple map of Unicode ordinals (e.g. 0xD6 for Ö) to the unicode string to translate them to (e.g. u'O'). Here’s the Python 2 code I ended up with to simplify strings so that different variants can be compared. It assumes that str objects are encoded in UTF-8:

    import re

    # Optional leading article ("the foo") or trailing article ("foo, the")
    articlePattern = re.compile(r'(?:(?:a|an|the) )?(.+?)'
                                + r'(?:, (?:a|an|the))?$')

    # Unicode ordinal -> nearest ASCII equivalent
    latinToAscii = {ord('\xD6'): u"O",
                    ord('\xF6'): u"o"}

    def simpleString(string):
        '''trim leading and trailing space, replace whitespaces with single
        spaces, lower case and remove leading/trailing article'''
        if string is None:
            return None
        if isinstance(string, unicode):
            return articlePattern.match(u' '.join(string.split())
                                        .translate(latinToAscii)
                                        .lower()).group(1)
        # str objects are assumed to hold UTF-8 bytes: decode, simplify, re-encode
        return articlePattern.match(u' '.join(string.decode('utf8').split())
                                    .translate(latinToAscii)
                                    .lower()).group(1).encode('utf8')
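
For readers on Python 3, where `unicode` and `str.decode` no longer exist, the same idea might be sketched roughly as follows. This is an illustrative port, not the code muse uses: the names `simple_string`, `ARTICLE_PATTERN` and `LATIN_TO_ASCII` are invented here, and unlike the original it always returns a `str` (never bytes) and returns an empty string when there is nothing left to match.

```python
import re

# Same pattern as the original: optional leading article or trailing ", article".
ARTICLE_PATTERN = re.compile(r'(?:(?:a|an|the) )?(.+?)(?:, (?:a|an|the))?$')

# Code point -> nearest ASCII equivalent (only the two entries from the original).
LATIN_TO_ASCII = {0xD6: "O", 0xF6: "o"}

def simple_string(value):
    """Collapse whitespace, map accents to ASCII, lower-case, drop the article."""
    if value is None:
        return None
    if isinstance(value, bytes):
        # Assumption carried over from the original: bytes hold UTF-8 text.
        value = value.decode("utf-8")
    collapsed = " ".join(value.split())
    match = ARTICLE_PATTERN.match(collapsed.translate(LATIN_TO_ASCII).lower())
    # Assumption: an empty result is acceptable when the input is blank.
    return match.group(1) if match else ""

# Different spellings of the same band simplify to the same key.
print(simple_string("The Blue \u00d6yster Cult") ==
      simple_string("Blue Oyster Cult, The"))  # True
```

Both spellings reduce to "blue oyster cult", so the simplified strings can be compared directly.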