Approximately converting a Unicode string to an ASCII string in Python

I don't know whether this is trivial or not, but I need to convert a Unicode string to an ASCII string, and I wouldn't like to have all those escape chars around. I mean, is it possible to have an "approximate" conversion to some quite similar ASCII character?
For example: Gavin O’Connor gets converted to Gavin O\x92Connor, but I'd really like it to be converted to just Gavin O'Connor. Is this possible? Has anyone written a utility to do it, or do I have to replace all the characters manually?
Thank you very much!
Marco

Use the Unidecode package to transliterate the string.
>>> import unidecode
>>> unidecode.unidecode(u'Gavin O’Connor')
"Gavin O'Connor"

import unicodedata
unicode_string = u"Gavin O’Connor"
print unicodedata.normalize('NFKD', unicode_string).encode('ascii','ignore')
Output:
Gavin OConnor
Note that this is not quite the requested result: the curly apostrophe U+2019 has no NFKD decomposition, so the 'ignore' handler drops it entirely instead of converting it to a straight quote. Normalization handles accented letters well, but punctuation like curly quotes needs a different approach (such as Unidecode above).
Here's the document that describes the normalization forms: http://unicode.org/reports/tr15/
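For characters that do have decompositions, such as accented letters, the approach works as intended; a minimal sketch (the sample string is my own, Python 2 as above):
import unicodedata
# NFKD splits each accented letter into a base letter plus combining
# marks; the marks are non-ASCII, so encode(..., 'ignore') drops them.
s = u"Café déjà vu"
print unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
# Output: Cafe deja vu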

b = a.encode('utf-8').decode('ascii', 'ignore')
This runs, but note that it strips rather than approximates: every byte of a multi-byte UTF-8 sequence is greater than 127, so non-ASCII characters are dropped entirely instead of being replaced with a lookalike.
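A quick demonstration of that dropping behaviour (Python 3 REPL):
>>> u'Gavin O’Connor'.encode('utf-8').decode('ascii', 'ignore')
'Gavin OConnor'
All three UTF-8 bytes of the curly apostrophe are non-ASCII, so all three are discarded.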

There is a technique to strip accents from characters, but other characters need to be directly replaced. Check this article: http://effbot.org/zone/unicode-convert.htm

Try simple character replacement:
str1 = "“I am the greatest”, said Gavin O’Connor"
print(str1)
print(str1.replace("’", "'").replace("“","\"").replace("”","\""))
PS: add # -*- coding: utf-8 -*- to the top of your .py file if you get an encoding error (this applies to Python 2; Python 3 source is UTF-8 by default).
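If the list of replacements grows, a translation table may scale better than chained .replace() calls; a minimal sketch (Python 3; PUNCT_MAP is my own name and the mappings are illustrative):
# str.translate accepts a dict keyed by code point ordinals.
PUNCT_MAP = {
    0x2018: "'", 0x2019: "'",   # curly single quotes
    0x201C: '"', 0x201D: '"',   # curly double quotes
    0x2013: '-', 0x2014: '-',   # en and em dashes
}
print("“I am the greatest”, said Gavin O’Connor".translate(PUNCT_MAP))
# Output: "I am the greatest", said Gavin O'Connor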

Related

How to convert utf-8 fancy quotes to neutral quotes

I'm writing a little Python script that parses word docs and writes to a csv file. However, some of the docs have some utf-8 characters that my script can't process correctly.
Fancy quotes show up quite often (u'\u201c'). Is there a quick and easy (and smart) way of replacing those with the neutral ascii-supported quotes, so I can just write line.encode('ascii') to the csv file?
I have tried to find the left quote and replace it:
val = line.find(u'\u201c')
if val >= 0: line[val] = '"'
But to no avail:
TypeError: 'unicode' object does not support item assignment
Is what I've described a good strategy? Or should I just set up the csv to support utf-8 (though I'm not sure if the application that will be reading the CSV wants utf-8)?
Thank you
You can use the Unidecode package to automatically convert all Unicode characters to their nearest pure ASCII equivalent.
from unidecode import unidecode
line = unidecode(line)
This will handle both directions of double quotes as well as single quotes, em dashes, and other things that you probably haven't discovered yet.
Edit: a comment points out that if your text isn't in English, you may find ASCII too restrictive. Here's an adaptation of the above code that uses a whitelist of characters that shouldn't be converted.
>>> from unidecode import unidecode
>>> whitelist = set('µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ')
>>> line = '\u201cRésumé\u201d'
>>> print(line)
“Résumé”
>>> line = ''.join(c if c in whitelist else unidecode(c) for c in line)
>>> print(line)
"Résumé"
You can't assign into a string: Python strings are immutable.
You can, however, use the re module, which might be the most flexible way to do this:
import re
newline = re.sub(u'\u201c','"',line)
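If you want to handle both directions of the double quote in one pass, a character class does it (a small variation on the line above):
import re
# U+201C and U+201D are the left and right double quotation marks.
newline = re.sub(u'[\u201c\u201d]', '"', line)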

How can I convert a unicode string into string literals in Python 2.7?

Python2.7:
I would like to do something unusual. Most people want to convert string literals to more human-readable strings. I would like to convert the following list of unicode strings into their literal forms:
hallöchen
Straße
Gemüse
freø̯̯nt
to their codepoint forms that look something like this:
\u3023\u2344
You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert especially that word into its literal form, such that I can use REGEX to remove the extra breve.
I am not sure what the terminology is for these things—please correct me if I am mistaken.
You can use the str.encode(encoding[, errors]) method with the unicode_escape codec:
>>> s = u'freø̯̯nt'
>>> print(s.encode('unicode_escape'))
b'fre\\xf8\\u032f\\u032fnt'
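(In Python 2.7, which the asker is using, encode returns a plain str, so the same call prints fre\xf8\u032f\u032fnt without the b prefix.)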
You'll notice that freø̯̯nt has two inverted breves below the ø. I would like to convert especially that word into its literal form, such that I can use REGEX to remove the extra breve.
You don't need codecs.encode(unicode_string, 'unicode-escape') in this case. There are no string literals in memory, only string objects.
A Unicode string is a sequence of Unicode code points in Python. The same user-perceived character can be written using different code points, e.g., 'Ç' could be written as u'\u00c7' or as u'\u0043\u0327'.
You could use the NFKD Unicode normalization form to make sure the breves are separate code points, so that you don't miss them when they are duplicated:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re
import unicodedata
s = u"freø̯̯nt"
# remove consecutive duplicate "breves"
print(re.sub(u'\u032f+', u'\u032f', unicodedata.normalize('NFKD', s)))
Could you explain why your re.sub command does not need a \1+ backreference to ensure that the breves are consecutive characters (like in @Paulo Freitas's answer)?
re.sub('c+', 'c', text) makes sure that there are no 'cc', 'ccc', 'cccc', etc in the text. Sometimes the regex does unnecessary work by replacing 'c' with 'c'. But the result is the same: no consecutive duplicate 'c' in the text.
The regex from #Paulo Freitas's answer should also work:
no_duplicates = re.sub(u'(\u032f)\\1+', r'\1', unicodedata.normalize('NFKD', s))
It performs the replacement only for duplicates. You can measure time performance and see what regex runs faster if it is a bottleneck in your application.
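A quick way to compare the two, as a sketch (the sample string is repeated only to make the timing visible):
import re
import timeit
import unicodedata
s = unicodedata.normalize('NFKD', u"freø̯̯nt") * 1000
print(timeit.timeit(lambda: re.sub(u'\u032f+', u'\u032f', s), number=100))
print(timeit.timeit(lambda: re.sub(u'(\u032f)\\1+', r'\1', s), number=100))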

Python and character normalization

Hello
I retrieve UTF-8 text data from a foreign source which contains special characters such as u"ıöüç", and I want to transliterate them to plain ASCII, e.g. "ıöüç" -> "iouc". What would be the best way to achieve this?
I recommend using Unidecode module:
>>> from unidecode import unidecode
>>> unidecode(u'ıöüç')
'iouc'
Note how you feed it a unicode string and it outputs a byte string on Python 2 (a regular str on Python 3). The output is guaranteed to contain only ASCII characters.
It all depends on how far you want to go in transliterating the result. If you want to convert everything all the way to ASCII (αβγ to abg) then unidecode is the way to go.
If you just want to remove accents from accented letters, then you could try decomposing your string using normalization form NFKD (this converts the accented letter á to a plain letter a followed by U+0301 COMBINING ACUTE ACCENT) and then discarding the accents (which belong to the Unicode character class Mn — "Mark, nonspacing").
import unicodedata

def remove_nonspacing_marks(s):
    """Decompose the unicode string s and remove non-spacing marks."""
    return ''.join(c for c in unicodedata.normalize('NFKD', s)
                   if unicodedata.category(c) != 'Mn')
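For example, a quick sanity check (assuming the function above; the outputs follow from the NFKD decompositions):
print(remove_nonspacing_marks(u'Café crème'))  # Cafe creme
print(remove_nonspacing_marks(u'ıöüç'))        # ıouc -- the dotless ı has no decomposition, so it survives
That last case is exactly where unidecode goes further than plain accent stripping.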
The simplest way I found:
unicodedata.normalize('NFKD', s).encode("ascii", "ignore")
See unicodedata.normalize() in the standard library:
http://docs.python.org/library/unicodedata.html

UTF in Python Regex

I'm aware that Python 3 fixes a lot of UTF issues; however, I'm not able to use Python 3 and am stuck on 2.5.1.
I'm trying to regex a document, but the document has UTF hyphens in it (– rather than -). Python can't match these, and if I put them in the regex it throws a wobbly.
How can I force Python to use a UTF string, or in some way match a character such as that?
Thanks for your help
You have to escape the character in question (–) and put a u in front of the string literal to make it a unicode string.
So, for example, this:
re.compile("–")
becomes this:
re.compile(u"\u2013")
After a quick test and a visit to PEP 0263: Defining Python Source Code Encodings, I see you may need to tell Python the whole file is UTF-8 encoded by adding a comment like this to the first line.
# encoding: utf-8
Here's the test file I created and ran on Python 2.5.1 / OS X 10.5.6
# encoding: utf-8
import re
x = re.compile("–")
print x.search("xxx–x").start()
Don't use UTF-8 in a regular expression. UTF-8 is a multibyte encoding where some unicode code points are encoded by 2 or more bytes. You may match parts of your string that you didn't plan to match. Instead use unicode strings as suggested.
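A small sketch of the pitfall (Python 2, as in the question; the strings are contrived for illustration):
# -*- coding: utf-8 -*-
import re
# In UTF-8 source, "–" is the three bytes '\xe2\x80\x93', so the byte
# pattern '[–-]' is really the class [\xe2\x80\x93-] and can match a
# stray lead byte on its own:
print(re.findall('[–-]', 'a\xe2b'))        # ['\xe2'] -- a partial character
# A unicode pattern treats the dash as a single code point:
print(re.findall(u'[\u2013-]', u'a–b-c'))  # [u'\u2013', u'-']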
