Decoding Linux File Names in Python

One of the many problems I ran into while hacking muse, the music library management program I’ve been writing, is that some bands include non-ASCII characters in their names, and these are used inconsistently. For example, Blue Öyster Cult is sometimes spelled with an Ö, other times with an O. This makes it hard for the software to recognize both spellings as the same name.

I spent a lot of time before I finally figured out how to translate accented characters used in file names to their nearest ASCII equivalents. The problem was that I had assumed that Linux file names were encoded using the ISO 8859-1 (AKA latin1) encoding. When I finally dumped the file names in hex, I discovered that they’re actually encoded in UTF-8.
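
To illustrate the difference that the hex dump reveals, here is a minimal sketch. It is written for Python 3 (unlike the code in this post, which targets Python 2), and `hex_dump` is a hypothetical helper, not part of muse. It shows that Ö is the single byte d6 in latin1 but the two bytes c3 96 in UTF-8, and what happens when UTF-8 bytes are read as latin1:

```python
def hex_dump(data):
    """Return the bytes in `data` as space-separated two-digit hex values."""
    return ' '.join('{:02x}'.format(byte) for byte in data)


name = 'Blue \u00d6yster Cult'  # \u00d6 is the letter O with a diaeresis

utf8_bytes = name.encode('utf-8')
latin1_bytes = name.encode('latin-1')

# The two dumps differ only at the accented letter.
print(hex_dump(utf8_bytes))
print(hex_dump(latin1_bytes))

# Reading the UTF-8 bytes as latin1 turns the one accented letter
# into two unrelated characters, so a latin1 lookup table never matches.
misread = utf8_bytes.decode('latin-1')
print(repr(misread))
```

The UTF-8 form is one byte longer than the latin1 form, which is the kind of discrepancy a hex dump makes visible.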

The translation table is a simple map of Unicode ordinals (e.g. 0xD6 for Ö) to the Unicode string to translate them to (e.g. u'O'). Here’s the Python code I ended up with to simplify strings to allow different variants to be compared, which assumes that str objects are encoded with UTF-8:

import re

# Optional leading article, the name itself, optional trailing ", article"
articlePattern = re.compile(r'(?:(?:a|an|the) )?(.+?)'
                          + r'(?:, (?:a|an|the))?$')
# Map Unicode ordinals to their nearest ASCII equivalents
latinToAscii = {0xD6: u"O", 0xF6: u"o"}

def simpleString(string):
    '''trim leading and trailing space, replace whitespaces with
       single spaces, lower case and remove leading/trailing
       article'''

    if string is None:
        return None

    if isinstance(string, unicode):
        return articlePattern.match(u' '.join(string.split())
                                        .translate(latinToAscii)
                                        .lower()).group(1)

    # Byte strings are assumed to be UTF-8: decode, simplify, re-encode
    return articlePattern.match(u' '.join(string.decode('utf8')
                                                .split())
                                    .translate(latinToAscii)
                                    .lower()).group(1).encode('utf8')
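
The table above only covers Ö and ö, so every other accented letter would need its own entry. As a possible alternative, and not what muse does, here is a hedged Python 3 sketch that swaps the hand-written table for Unicode NFKD decomposition from the standard `unicodedata` module. The names `strip_accents` and `simple_string` are hypothetical, and unlike the code above this variant always returns a text string rather than re-encoding to UTF-8 bytes:

```python
import re
import unicodedata

# Same pattern as in the post: optional leading "a/an/the ", the name,
# and an optional trailing ", a/an/the".
ARTICLE_PATTERN = re.compile(r'(?:(?:a|an|the) )?(.+?)(?:, (?:a|an|the))?$')


def strip_accents(text):
    """Decompose accented letters and drop the combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    return ''.join(ch for ch in decomposed if not unicodedata.combining(ch))


def simple_string(value):
    """Python 3 variant of simpleString; accepts str or UTF-8 bytes."""
    if value is None:
        return None
    if isinstance(value, bytes):
        # Assumption carried over from the post: bytes are UTF-8.
        value = value.decode('utf-8')
    normalized = strip_accents(' '.join(value.split())).lower()
    return ARTICLE_PATTERN.match(normalized).group(1)


print(simple_string('Blue \u00d6yster Cult'))
print(simple_string('  The   Beatles '))
print(simple_string('Beatles, The'))
```

Decomposition only helps for letters that are a base letter plus a mark. Characters such as ß or ø have no such decomposition and would still need explicit table entries.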

About jimbelton

I'm a software developer, and a writer of both fiction and non-fiction, and I blog about movies, books, and philosophy. My interest in religious philosophy and the search for the truth inspires much of my writing.