I have been looking for a while now, but I am not able to find a proper solution.
I have a database with Dutch, French, and German words, which all have their special characters, e.g. é, è, ß, ç, etc.
In some cases, such as in a URL, I would like to replace these with alphanumeric characters: respectively e, e, ss, c, etc.
Is there a generic function or Python package that does this?
I could do this with a regex of course, but something generic would be great here.
Thanks.
Try this package: https://pypi.python.org/pypi/Unidecode
>>> import unidecode
>>> unidecode.unidecode(u'çß')
'css'
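For the URL use case in the question, a minimal sketch building on Unidecode might look like this (the slugify helper and the regex are my own illustration, not part of the package):
import re
import unidecode

def slugify(text):
    # transliterate accented characters first, e.g. 'é' -> 'e', 'ß' -> 'ss'
    ascii_text = unidecode.unidecode(text)
    # then keep only alphanumerics and collapse everything else into single hyphens
    return re.sub(r'[^A-Za-z0-9]+', '-', ascii_text).strip('-').lower()

>>> slugify(u'Crème brûlée mit Soße')
'creme-brulee-mit-sosse'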
As you say, this could be done using a regex sub. You would of course need to include upper- and lowercase variants.
import re
data = "é, è, ß, ç, äÄ"
lookup = {'é':'e', 'è':'e', 'ß':'ss', 'ç':'c', 'ä':'a', 'Ä':'A'}
print(re.sub(r'([éèßçäÄ])', lambda x: lookup[x.group(1)], data))
This would display the following:
e, e, ss, c, aA
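To cover the upper- and lowercase variants mentioned above without writing the character class by hand, one sketch is to build the pattern from the lookup keys (this extension is my own, not part of the answer above):
import re

data = u"É, è, ß, Ç, äÄ"
lookup = {u'é': 'e', u'É': 'E', u'è': 'e', u'È': 'E', u'ß': 'ss',
          u'ç': 'c', u'Ç': 'C', u'ä': 'a', u'Ä': 'A'}
# derive the character class from the table so the two always stay in sync
pattern = re.compile(u'([%s])' % ''.join(lookup))
print(pattern.sub(lambda m: lookup[m.group(1)], data))
# E, e, ss, C, aA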
You can almost get away with the built-in unicodedata module (unfortunately a few of your characters break it):
>>> import unicodedata
>>> s=u"é, è, ß, ç"
>>> unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
'e, e, , c'
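The character that breaks it here is ß, which has no decomposition. A sketch of one workaround (the fallback table is my own addition, not part of unicodedata):
import unicodedata

def to_ascii(s):
    # handle characters NFKD cannot decompose before normalizing
    fallback = {u'ß': 'ss', u'æ': 'ae', u'Æ': 'Ae', u'œ': 'oe', u'Œ': 'Oe'}
    s = ''.join(fallback.get(ch, ch) for ch in s)
    return unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')

>>> to_ascii(u'é, è, ß, ç')
'e, e, ss, c'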
Here is a solution that has the code points hardcoded, stolen from http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/:
def latin1_to_ascii(unicrap):
    """This takes a UNICODE string and replaces Latin-1 characters with
    something equivalent in 7-bit ASCII. It returns a plain ASCII string.
    This function makes a best effort to convert Latin-1 characters into
    ASCII equivalents. It does not just strip out the Latin-1 characters.
    All characters in the standard 7-bit ASCII range are preserved.
    In the 8th bit range all the Latin-1 accented letters are converted
    to unaccented equivalents. Most symbol characters are converted to
    something meaningful. Anything not converted is deleted.
    """
    xlate = {0xc0:'A', 0xc1:'A', 0xc2:'A', 0xc3:'A', 0xc4:'A', 0xc5:'A',
             0xc6:'Ae', 0xc7:'C',
             0xc8:'E', 0xc9:'E', 0xca:'E', 0xcb:'E',
             0xcc:'I', 0xcd:'I', 0xce:'I', 0xcf:'I',
             0xd0:'Th', 0xd1:'N',
             0xd2:'O', 0xd3:'O', 0xd4:'O', 0xd5:'O', 0xd6:'O', 0xd8:'O',
             0xd9:'U', 0xda:'U', 0xdb:'U', 0xdc:'U',
             0xdd:'Y', 0xde:'th', 0xdf:'ss',
             0xe0:'a', 0xe1:'a', 0xe2:'a', 0xe3:'a', 0xe4:'a', 0xe5:'a',
             0xe6:'ae', 0xe7:'c',
             0xe8:'e', 0xe9:'e', 0xea:'e', 0xeb:'e',
             0xec:'i', 0xed:'i', 0xee:'i', 0xef:'i',
             0xf0:'th', 0xf1:'n',
             0xf2:'o', 0xf3:'o', 0xf4:'o', 0xf5:'o', 0xf6:'o', 0xf8:'o',
             0xf9:'u', 0xfa:'u', 0xfb:'u', 0xfc:'u',
             0xfd:'y', 0xfe:'th', 0xff:'y',
             0xa1:'!', 0xa2:'{cent}', 0xa3:'{pound}', 0xa4:'{currency}',
             0xa5:'{yen}', 0xa6:'|', 0xa7:'{section}', 0xa8:'{umlaut}',
             0xa9:'{C}', 0xaa:'{^a}', 0xab:'<<', 0xac:'{not}',
             0xad:'-', 0xae:'{R}', 0xaf:'_', 0xb0:'{degrees}',
             0xb1:'{+/-}', 0xb2:'{^2}', 0xb3:'{^3}', 0xb4:"'",
             0xb5:'{micro}', 0xb6:'{paragraph}', 0xb7:'*', 0xb8:'{cedilla}',
             0xb9:'{^1}', 0xba:'{^o}', 0xbb:'>>',
             0xbc:'{1/4}', 0xbd:'{1/2}', 0xbe:'{3/4}', 0xbf:'?',
             0xd7:'*', 0xf7:'/'
             }
    r = ''
    for i in unicrap:
        if ord(i) in xlate:       # translatable Latin-1 character
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:      # other non-ASCII character: drop it
            pass
        else:                     # plain 7-bit ASCII: keep as-is
            r += str(i)
    return r
Of course you could just as easily use a regex, as indicated in the other answers.
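As a sketch of that regex variant (assuming the xlate table above is lifted to module scope; the helper name is my own):
import re

# anything outside the 7-bit ASCII range gets looked up in xlate or dropped
_non_ascii = re.compile(u'[^\x00-\x7f]')

def latin1_to_ascii_re(unicrap):
    return _non_ascii.sub(lambda m: xlate.get(ord(m.group(0)), ''), unicrap)

>>> latin1_to_ascii_re(u'é, è, ß, ç')
'e, e, ss, c'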
I have several strings like this:
s = u'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
s
"awëerwq مرحباмир bròn 1990 23x4 + &23 'we' we's mexicqué"
I couldn't find a way to remove characters like 'مرحباми' while keeping Latin characters like 'óé, ...'. Also, numbers (like '1990') are undesirable in my case. I have used the ASCII flag from re, but I don't know what's wrong with that, because it removes 'óëé, ...'. It is the same problem with string.printable.
I don't know why ord('ë') returns 235, given that the ASCII table assigns it 137. The result I would expect is something like this:
x = some_method(s)
"awëerwq bròn 23x4 we we s mexicqué"
Also, I would like the code not to depend on any particular encoding.
Here's a way that might help (Python 3.4):
import unicodedata
def remove_nonlatin(s):
    s = (ch for ch in s
         if unicodedata.name(ch).startswith(('LATIN', 'DIGIT', 'SPACE')))
    return ''.join(s)
>>> s = 'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
>>> remove_nonlatin(s)
'awëerwqbròn 1990 23x4 23 we wes mexicqué'
This grabs the Unicode names of the characters in the string and matches characters whose names start with LATIN, DIGIT, or SPACE.
For example, this would match:
>>> unicodedata.name('S')
'LATIN CAPITAL LETTER S'
And this would not:
>>> unicodedata.name('م')
'ARABIC LETTER MEEM'
I'm reasonably sure that Latin characters all have Unicode names starting with 'LATIN', so this should filter out other writing scripts, while keeping digits and spaces. There's no convenient one-liner for punctuation, so in this example, exclamation points and such are also filtered out.
You could presumably filter by code point by using something like ord(c) < 0x250, though you may get some things that you aren't expecting. Or, you could try filtering by unicodedata.category. However, the 'letter' category includes letters from a lot of scripts, so you will still end up with some of these: 'م'.
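For comparison, a rough sketch of that category-based filter (the function name is mine, and as noted it still lets non-Latin letters through):
import unicodedata

def remove_by_category(s):
    # keep letters, decimal digits and spaces by Unicode general category;
    # the letter categories still include Arabic, Cyrillic, etc.
    return ''.join(ch for ch in s
                   if unicodedata.category(ch).startswith(('L', 'Nd', 'Zs')))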
I have used ASCII flag from re but I don't know what's wrong with that because it removes 'óëé,...'.
I think you are asking your question wrong. ASCII does not have the characters óëé in it. Take a look here to see the set of all ASCII characters and see how basic it is:
https://en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart
It appears that the string you are using is Unicode, since it can support both "مرحباми" and "óëé" at the same time.
In that case, you can find the character ranges you want using
http://jrgraphix.net/research/unicode_blocks.php
and include only the Latin ones (this will filter out Arabic characters for example).
Here's an example:
import re
s = u"مرحباми123"
# prints "123" by keeping all characters from the following ranges:
# 0020 — 007F Basic Latin
# 00A0 — 00FF Latin-1 Supplement
# 0100 — 017F Latin Extended-A
# 0180 — 024F Latin Extended-B
print(''.join(re.findall(u'[\u0020-\u007F\u00A0-\u00FF\u0100-\u017F\u0180-\u024F]+', s)))
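The same ranges also work in the other direction, deleting everything that is not in them (a small variation on the answer above, not part of the original):
print(re.sub(u'[^\u0020-\u007F\u00A0-\u00FF\u0100-\u017F\u0180-\u024F]+', '', s))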
I have millions of strings scraped from the web like this:
s = 'WHAT\xe2\x80\x99S UP DOC?'
type(s) == str # returns True
Special characters like those in the string above are inevitable when scraping from the web. How should one remove all such special characters to retain just clean text? I am thinking of a regular expression like this, based on my very limited experience with Unicode characters:
\\x.*[0-9]
The special characters are not actually multiple characters long; that is just how they are represented, so your regex isn't going to work. If you print the string you will see the actual Unicode (UTF-8) characters:
>>> s = 'WHAT\xe2\x80\x99S UP DOC?'
>>> print(s)
WHATâS UP DOC?
>>> repr(s)
"'WHATâ\\x80\\x99S UP DOC?'"
If you want to print only the ASCII characters, you can check whether each character is in string.printable:
>>> import string
>>> ''.join(i for i in s if i in string.printable)
'WHATS UP DOC?'
This worked for me, as mentioned by Padriac in the comments:
s.decode('ascii', errors='ignore')
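If s is already a Python 3 str rather than bytes, a roughly equivalent round trip (my own adaptation of the same idea, not from the comments) would be:
>>> s = 'WHAT\xe2\x80\x99S UP DOC?'
>>> s.encode('ascii', errors='ignore').decode('ascii')
'WHATS UP DOC?'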
I understand that unicodedata.normalize converts diacritics to their non-diacritic counterparts:
import unicodedata
''.join(c for c in unicodedata.normalize('NFD', u'B\u0153uf')
        if unicodedata.category(c) != 'Mn')
My question is (and can be seen in this example): does unicodedata have a way to replace combined characters with their counterparts? (u'œ' becomes 'oe')
If not, I assume I will have to put a hit out for these, but then I might as well compile my own dict with all the characters and their counterparts and forget about unicodedata altogether...
There's a bit of confusion about terminology in your question. A diacritic is a mark that can be added to a letter or other character but generally does not stand on its own. (Unicode also uses the more general term combining character.) What normalize('NFD', ...) does is to convert precomposed characters into their components.
Anyway, the answer is that œ is not a precomposed character. It's a typographic ligature:
>>> unicodedata.name(u'\u0153')
'LATIN SMALL LIGATURE OE'
The unicodedata module provides no method for splitting ligatures into their parts. But the data is there in the character names:
import re
import unicodedata

_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})')

def split_ligatures(s):
    """
    Split the ligatures in `s` into their component letters.
    """
    def untie(l):
        m = _ligature_re.match(unicodedata.name(l))
        if not m: return l
        elif m.group(1): return m.group(2)
        else: return m.group(2).lower()
    return ''.join(untie(l) for l in s)
>>> split_ligatures(u'B\u0153uf \u0132sselmeer \uFB00otogra\uFB00')
u'Boeuf IJsselmeer ffotograff'
(Of course you wouldn't do it like this in practice: you'd preprocess the Unicode database to generate a lookup table as you suggest in your question. There aren't all that many ligatures in Unicode.)
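A sketch of that preprocessing step, reusing _ligature_re from above (building the table by scanning every assigned code point is my own choice here):
import sys
import unicodedata

# one-off scan: map each ligature character to its component letters
ligature_table = {}
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    m = _ligature_re.match(unicodedata.name(ch, ''))
    if m:
        parts = m.group(2)
        ligature_table[ch] = parts if m.group(1) else parts.lower()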
Does something exist that can take as input U+0043 and produce as output the letter C, maybe even a small description of the character (like LATIN CAPITAL LETTER C)?
EDIT: U+0043 is just an example. I would like a generic solution, please, one that could work for as many code points as possible.
unicodedata.name looks promising. You need a bit of (trivial) parsing, of course, if you have a string input like U+0043.
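The parsing itself can be as small as this (a sketch, assuming the input is always of the exact form 'U+XXXX'):
>>> import unicodedata
>>> codepoint = 'U+0043'
>>> char = chr(int(codepoint[2:], 16))   # strip 'U+', parse the rest as hex
>>> char, unicodedata.name(char)
('C', 'LATIN CAPITAL LETTER C')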
The hackish way:
import unicodedata
codepoint = "U+0043"
# turn "U+0043" into the escape sequence "\u0043", then decode it
char = codepoint.replace('U+', '\\u').encode('ascii').decode('unicode-escape')
# or: char = chr(int(codepoint.replace('U+', ''), 16))
print(char)
print(unicodedata.name(char))
import unicodedata
print(unicodedata.name(u'C'))  # or unicodedata.name(u'\u0043')
# LATIN CAPITAL LETTER C
You could do chr(0x43) to get C.