I have been looking for a while now, but I am not able to find a proper solution.
I have a database with Dutch, French and German words which all have their special characters, e.g. é, è, ß, ç, etc.
In some cases, like in a URL, I would like to replace these with alphanumeric equivalents: respectively e, e, ss, c, etc.
Is there a generic function or Python package that does this?
I could do this with a regex of course, but something generic would be great here.
Thanks.
Try this package: https://pypi.python.org/pypi/Unidecode
>>> import unidecode
>>> unidecode.unidecode(u'çß')
'css'
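For the URL use case you can run whole phrases through it; a quick sketch (the sample words are only an illustration):
>>> unidecode.unidecode(u'crème brûlée')
'creme brulee'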
As you say, this could be done using a Regex sub. You would of course need to include upper and lowercase variants.
import re
data = "é, è, ß, ç, äÄ"
lookup = {'é':'e', 'è':'e', 'ß':'ss', 'ç':'c', 'ä':'a', 'Ä':'A'}
print(re.sub(r'([éèßçäÄ])', lambda x: lookup[x.group(1)], data))
This would display the following:
e, e, ss, c, aA
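If you'd rather avoid re, the same lookup table also works with str.translate; a minimal sketch of that alternative (Python 3, not part of the original answer):
lookup = {'é': 'e', 'è': 'e', 'ß': 'ss', 'ç': 'c', 'ä': 'a', 'Ä': 'A'}
table = str.maketrans(lookup)  # maps each single character to its replacement string
print("é, è, ß, ç, äÄ".translate(table))  # prints: e, e, ss, c, aA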
You can almost get away with the built-in Unicode data (unfortunately a few of your characters break it):
>>> import unicodedata
>>> s=u"é, è, ß, ç"
>>> unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
'e, e, , c'
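The character that breaks it here is ß, which has no decomposition in NFKD, so the ASCII encode simply drops it. One way to patch over that, sketched here, is to replace the known troublemakers first:
>>> unicodedata.normalize('NFKD', s.replace(u'ß', 'ss')).encode('ascii', 'ignore')
'e, e, ss, c'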
Here is a solution that has the code points hardcoded, stolen from http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/:
def latin1_to_ascii(unicrap):
    """This takes a UNICODE string and replaces Latin-1 characters with
    something equivalent in 7-bit ASCII. It returns a plain ASCII string.
    This function makes a best effort to convert Latin-1 characters into
    ASCII equivalents. It does not just strip out the Latin-1 characters.
    All characters in the standard 7-bit ASCII range are preserved.
    In the 8th bit range all the Latin-1 accented letters are converted
    to unaccented equivalents. Most symbol characters are converted to
    something meaningful. Anything not converted is deleted.
    """
    xlate = {
        0xc0: 'A', 0xc1: 'A', 0xc2: 'A', 0xc3: 'A', 0xc4: 'A', 0xc5: 'A',
        0xc6: 'Ae', 0xc7: 'C',
        0xc8: 'E', 0xc9: 'E', 0xca: 'E', 0xcb: 'E',
        0xcc: 'I', 0xcd: 'I', 0xce: 'I', 0xcf: 'I',
        0xd0: 'Th', 0xd1: 'N',
        0xd2: 'O', 0xd3: 'O', 0xd4: 'O', 0xd5: 'O', 0xd6: 'O', 0xd8: 'O',
        0xd9: 'U', 0xda: 'U', 0xdb: 'U', 0xdc: 'U',
        0xdd: 'Y', 0xde: 'th', 0xdf: 'ss',
        0xe0: 'a', 0xe1: 'a', 0xe2: 'a', 0xe3: 'a', 0xe4: 'a', 0xe5: 'a',
        0xe6: 'ae', 0xe7: 'c',
        0xe8: 'e', 0xe9: 'e', 0xea: 'e', 0xeb: 'e',
        0xec: 'i', 0xed: 'i', 0xee: 'i', 0xef: 'i',
        0xf0: 'th', 0xf1: 'n',
        0xf2: 'o', 0xf3: 'o', 0xf4: 'o', 0xf5: 'o', 0xf6: 'o', 0xf8: 'o',
        0xf9: 'u', 0xfa: 'u', 0xfb: 'u', 0xfc: 'u',
        0xfd: 'y', 0xfe: 'th', 0xff: 'y',
        0xa1: '!', 0xa2: '{cent}', 0xa3: '{pound}', 0xa4: '{currency}',
        0xa5: '{yen}', 0xa6: '|', 0xa7: '{section}', 0xa8: '{umlaut}',
        0xa9: '{C}', 0xaa: '{^a}', 0xab: '<<', 0xac: '{not}',
        0xad: '-', 0xae: '{R}', 0xaf: '_', 0xb0: '{degrees}',
        0xb1: '{+/-}', 0xb2: '{^2}', 0xb3: '{^3}', 0xb4: "'",
        0xb5: '{micro}', 0xb6: '{paragraph}', 0xb7: '*', 0xb8: '{cedilla}',
        0xb9: '{^1}', 0xba: '{^o}', 0xbb: '>>',
        0xbc: '{1/4}', 0xbd: '{1/2}', 0xbe: '{3/4}', 0xbf: '?',
        0xd7: '*', 0xf7: '/',
    }
    r = ''
    for i in unicrap:
        if ord(i) in xlate:       # mapped Latin-1 character
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:      # unmapped non-ASCII: drop it
            pass
        else:                     # plain 7-bit ASCII: keep as-is
            r += str(i)
    return r
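For example, on the sample characters from the question:
>>> latin1_to_ascii(u'é, è, ß, ç')
'e, e, ss, c'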
Of course you could just as easily use a regex as indicated in the other answers.
I have several strings like this:
s = u'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
s
"awëerwq مرحباмир bròn 1990 23x4 + &23 'we' we's mexicqué"
I couldn't find a way to remove non-Latin characters like 'مرحباми' while keeping Latin characters like 'óé', etc. Numbers (like '1990') are also undesirable in my case. I have used the ASCII flag from re, but I don't know what's wrong with that, because it removes 'óëé' as well. It is the same problem when using string.printable.
I don't know why
ord('ë')
235
gives that, given that in the extended ASCII table it is assigned 137. The result I would expect is something like this:
x = some_method(s)
"awëerwq bròn 23x4 we we s mexicqué"
Also, I would like the code not to depend on any fixed encoding.
Here's a way that might help (Python 3.4):
import unicodedata
def remove_nonlatin(s):
    s = (ch for ch in s
         if unicodedata.name(ch).startswith(('LATIN', 'DIGIT', 'SPACE')))
    return ''.join(s)
>>> s = 'awëerwq\u0645\u0631\u062d\u0628\u0627\u043c\u0438\u0440bròn 1990 23x4 + &23 \'we\' we\'s mexicqué'
>>> remove_nonlatin(s)
'awëerwqbròn 1990 23x4 23 we wes mexicqué'
This grabs the Unicode names of the characters in the string, and matches characters whose names start with LATIN, DIGIT, or SPACE.
For example, this would match:
>>> unicodedata.name('S')
'LATIN CAPITAL LETTER S'
And this would not:
>>> unicodedata.name('م')
'ARABIC LETTER MEEM'
I'm reasonably sure that Latin characters all have Unicode names starting with 'LATIN', so this should filter out other writing scripts, while keeping digits and spaces. There's no convenient one-liner for punctuation, so in this example, exclamation points and such are also filtered out.
You could presumably filter by code point by using something like ord(c) < 0x250, though you may get some things that you aren't expecting. Or, you could try filtering by unicodedata.category. However, the 'letter' category includes letters from a lot of scripts, so you will still end up with some of these: 'م'.
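For example, a sketch of the unicodedata.category variant (note that the 'L' categories keep letters from every script, which is why it is not a full substitute for the name-based filter above):
import unicodedata

def remove_nonword(s):
    # keep letters (any script!), digits, and separators by Unicode category
    return ''.join(ch for ch in s
                   if unicodedata.category(ch)[0] in ('L', 'N', 'Z'))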
I have used the ASCII flag from re, but I don't know what's wrong with that, because it removes 'óëé'.
I think you are asking your question wrong. ASCII does not have the characters óëé in it. Take a look here to see the set of all ASCII characters and see how basic it is:
https://en.wikipedia.org/wiki/ASCII#ASCII_printable_code_chart
It appears that the string you are using is in Unicode, since it can support both "مرحباми" and "óëé" at the same time.
In that case, you can find the character ranges you want using
http://jrgraphix.net/research/unicode_blocks.php
and include only the Latin ones (this will filter out Arabic characters for example).
Here's an example:
import re
s = u"مرحباми123"
# prints "123" by keeping all characters from the following ranges:
# 0020 — 007F Basic Latin
# 00A0 — 00FF Latin-1 Supplement
# 0100 — 017F Latin Extended-A
# 0180 — 024F Latin Extended-B
print ''.join(re.findall(ur'[\u0020-\u007F\u00A0-\u00FF\u0100-\u017F\u0180-\u024F]+', s))
I currently use re.findall to find and isolate words after the '#' character for hash tags in a string:
hashtags = re.findall(r'#([A-Za-z0-9_]+)', str1)
It searches str1 and finds all the hashtags. This works; however, it doesn't account for accented characters such as: áéíóúñü¿.
If one of these letters is in str1, it will save the hashtag only up to the letter before it. So, for example, #yogenfrüz would be saved as #yogenfr.
I need to be able to account for all accented letters, ranging over German, Dutch, French and Spanish, so that I can save hashtags like #yogenfrüz.
How can I go about doing this?
Try the following:
hashtags = re.findall(r'#(\w+)', str1, re.UNICODE)
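For example (Python 2):
>>> re.findall(r'#(\w+)', u'I like #yogenfrüz', re.UNICODE)
[u'yogenfr\xfcz']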
EDIT
Check the useful comment below from Martijn Pieters.
I know this question is a little outdated, but you may also consider adding the range of accented characters from À (code point 192) to ÿ (code point 255) to your original regex. Note that this range also includes the × (0xD7) and ÷ (0xF7) signs, so exclude those if they matter to you.
hashtags = re.findall(r'#([A-Za-z0-9_À-ÿ]+)', str1)
which will return ['yogenfrüz']
Hope this'll help anyone else.
You may also want to use
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
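Note that in Python 3 the .encode() call returns a bytes object; if you need a str back, decode it again, e.g.:
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore').decode('ascii')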
How do I convert all those escape characters into their respective characters? For example, if there is a Unicode à, how do I convert that into a standard a?
Assume you have loaded your Unicode into a variable called my_unicode... normalizing à into a is this simple...
import unicodedata
output = unicodedata.normalize('NFD', my_unicode).encode('ascii', 'ignore')
Explicit example...
>>> myfoo = u'àà'
>>> myfoo
u'\xe0\xe0'
>>> unicodedata.normalize('NFD', myfoo).encode('ascii', 'ignore')
'aa'
(NFD decomposes à into a plus a combining accent mark; encoding to ASCII with 'ignore' then drops the accent.)
Check this answer, it helped me a lot: How to convert unicode accented characters to pure ascii without accents?
Here's an update to Ibrahim Najjar's original answer based on the comment Martijn Pieters made to the answer and another answer Martijn Pieters gave in https://stackoverflow.com/a/16467505/5302861:
import re
import unicodedata
s = "#ábá123"
n = unicodedata.normalize('NFC', s)
print(n)
c = ''.join(re.findall(r'#\w+', n, re.UNICODE))
print(s, len(s), c, len(c))
The NFC normalization first composes any decomposed sequences (e.g. 'a' followed by a combining acute accent) into single code points, so that \w can then match the accented characters as ordinary word characters.
I understand that unicodedata.normalize converts diacritics to their non-diacritic counterparts:
import unicodedata
''.join(c for c in unicodedata.normalize('NFD', u'B\u0153uf')
        if unicodedata.category(c) != 'Mn')
My question is (and can be seen in this example): does unicodedata have a way to replace combined char diacritics with their counterparts? (u'œ' becomes 'oe')
If not, I assume I will have to put a hit out for these, but then I might as well compile my own dict with all such characters and their counterparts and forget about unicodedata altogether...
There's a bit of confusion about terminology in your question. A diacritic is a mark that can be added to a letter or other character but generally does not stand on its own. (Unicode also uses the more general term combining character.) What normalize('NFD', ...) does is to convert precomposed characters into their components.
Anyway, the answer is that œ is not a precomposed character. It's a typographic ligature:
>>> unicodedata.name(u'\u0153')
'LATIN SMALL LIGATURE OE'
The unicodedata module provides no method for splitting ligatures into their parts. But the data is there in the character names:
import re
import unicodedata
_ligature_re = re.compile(r'LATIN (?:(CAPITAL)|SMALL) LIGATURE ([A-Z]{2,})')

def split_ligatures(s):
    """
    Split the ligatures in `s` into their component letters.
    """
    def untie(l):
        m = _ligature_re.match(unicodedata.name(l))
        if not m: return l
        elif m.group(1): return m.group(2)
        else: return m.group(2).lower()
    return ''.join(untie(l) for l in s)
>>> split_ligatures(u'B\u0153uf \u0132sselmeer \uFB00otogra\uFB00')
u'Boeuf IJsselmeer ffotograff'
(Of course you wouldn't do it like this in practice: you'd preprocess the Unicode database to generate a lookup table as you suggest in your question. There aren't all that many ligatures in Unicode.)
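A minimal sketch of that preprocessing step (assuming Python 3 and reusing _ligature_re from above; the names here are mine, not from the answer):
import sys

# scan every assigned code point once and record the ligature expansions
LIGATURES = {}
for cp in range(sys.maxunicode + 1):
    ch = chr(cp)
    m = _ligature_re.match(unicodedata.name(ch, ''))
    if m:
        LIGATURES[ch] = m.group(2) if m.group(1) else m.group(2).lower()

def split_ligatures_fast(s):
    # plain dict lookup per character instead of a name() call each time
    return ''.join(LIGATURES.get(ch, ch) for ch in s)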
Using re in Python, I would like to return all of the characters in a string that precede the first appearance of an underscore. In addition, I would like the string that is being returned to be in all uppercase and without any non-alphanumeric characters.
For example:
AG.av08_binloop_v6 = AGAV08
TL.av1_binloopv2 = TLAV1
I am pretty sure I know how to return a string in all uppercase using string.upper() but I'm sure there are several ways to remove the . efficiently. Any help would be greatly appreciated. I am still learning regular expressions slowly but surely. Each tip gets added to my notes for future use.
To further clarify, my above examples aren't the actual strings. The actual string would look like:
AG.av08_binloop_v6
With my desired output looking like:
AGAV08
And the next example would be the same. String:
TL.av1_binloopv2
Desired output:
TLAV1
Again, thanks all for the help!
Even without re:
text.split('_', 1)[0].replace('.', '').upper()
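For example:
>>> 'AG.av08_binloop_v6'.split('_', 1)[0].replace('.', '').upper()
'AGAV08'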
Try this:
re.sub("[^A-Z\d]", "", re.search("^[^_]*", str).group(0).upper())
Since everyone is giving their favorite implementation, here's mine that doesn't use re:
>>> for s in ('AG.av08_binloop_v6', 'TL.av1_binloopv2'):
... print ''.join(c for c in s.split('_',1)[0] if c.isalnum()).upper()
...
AGAV08
TLAV1
I put .upper() on the outside of the join so it is only called once.
You don't have to use re for this. Simple string operations would be enough based on your requirements:
tests = """
AG.av08_binloop_v6 = AGAV08
TL.av1_binloopv2 = TLAV1
"""
for t in tests.splitlines():
    if not t:  # skip the blank lines around the test block
        continue
    print t[:t.find('_')].replace('.', '').upper()
# Returns:
# AGAV08
# TLAV1
Or if you absolutely must use re:
import re
pat = r'([a-zA-Z0-9.]+)_.*'
pat_re = re.compile(pat)
for t in tests.splitlines():
    if not t:  # pat_re.findall('') returns [], so [0] would raise IndexError on blanks
        continue
    print re.sub(r'\.', '', pat_re.findall(t)[0]).upper()
# Returns:
# AGAV08
# TLAV1
Heh, just for fun, another option to get the text before the first underscore is:
before_underscore, sep, after_underscore = s.partition('_')
So all in one line it could be:
re.sub(r"[^A-Z\d]", "", s.partition('_')[0].upper())
import re
re.sub("[^A-Z\d]", "", yourstr.split('_',1)[0].upper())