Is there any way in Python 3 to replace language-specific characters with English letters?
For example, I've got a function get_city(IP) that returns the city name associated with a given IP. It connects to an external database, so I can't change the way it encodes values; I am just getting them from the database.
I would like to do something like:
city = "České Budějovice"
city = clear_name(city)
print(city) #should return "Ceske Budejovice"
Here I used Czech, but in general it should work for any non-Asian language.
Try unidecode:
# coding=utf-8
from unidecode import unidecode
city = "České Budějovice"
print(unidecode(city))
Prints Ceske Budejovice as desired (assuming your post has a typo).
Note: if you're using Python 2.x, you'll need to decode the string before passing it to unidecode, e.g. unidecode(city.decode('utf-8'))
Use the unicodedata module for such cases.
To get the desired result, normalize the given string with unicodedata.normalize() and then filter out combining characters using unicodedata.combining():
import unicodedata
city = "České Budějovice"
normalized = unicodedata.normalize('NFD', city)
new_city = u"".join([c for c in normalized if not unicodedata.combining(c)])
print(new_city) # Ceske Budejovice
NFD is one of the four Unicode Normalization Forms: http://www.unicode.org/reports/tr15/
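To see what NFD actually does, you can inspect the decomposed characters (a small illustrative sketch):
import unicodedata

# 'Č' decomposes under NFD into a base letter plus a combining mark
for ch in unicodedata.normalize('NFD', 'Č'):
    print(hex(ord(ch)), unicodedata.name(ch), unicodedata.combining(ch))
# 0x43 LATIN CAPITAL LETTER C 0
# 0x30c COMBINING CARON 230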
Asongtoring's answer above is almost correct, but in Python 3 it is a bit simpler, as Pavlo Fesenko mentions in a comment on that solution. Here is the solution in Python 3:
from unidecode import unidecode
city = "České Budějovice"
print(unidecode(city))
Related
I am in search of the best way to "slugify" a string (a "slug" being a URL-friendly identifier), and my current solution is based on this recipe.
I have changed it a little bit to:
import re
import unicodedata

s = 'String to slugify'
slug = unicodedata.normalize('NFKD', s)
slug = slug.encode('ascii', 'ignore').decode('ascii').lower()
slug = re.sub(r'[^a-z0-9]+', '-', slug).strip('-')
slug = re.sub(r'[-]+', '-', slug)
Anyone see any problems with this code? It is working fine, but maybe I am missing something or you know a better way?
There is a Python package named python-slugify, which does a pretty good job of slugifying:
pip install python-slugify
Works like this:
from slugify import slugify

txt = "This is a test ---"
assert slugify(txt) == "this-is-a-test"

txt = "This -- is a ## test ---"
assert slugify(txt) == "this-is-a-test"

txt = "C'est déjà l'été."
assert slugify(txt) == "cest-deja-lete"

txt = 'Nín hǎo. Wǒ shì zhōng guó rén'
assert slugify(txt) == "nin-hao-wo-shi-zhong-guo-ren"

txt = 'Компьютер'
assert slugify(txt) == "kompiuter"

txt = 'jaja---lol-méméméoo--a'
assert slugify(txt) == "jaja-lol-mememeoo-a"
See more examples in the project's README.
This package does a bit more than what you posted (take a look at the source; it's just one file). The project is still active: it was updated two days before I originally answered, and over nine years later (last checked 2022-03-30) it still gets updates.
Careful: there is a second package around named slugify. If you have both of them installed, you might get a problem, as they share the same import name. The one just named slugify didn't do all I quick-checked: "Ich heiße" became "ich-heie" (it should be "ich-heisse"), so be sure to pick the right one when using pip or easy_install.
Install unidecode from PyPI for Unicode support:
pip install unidecode
# -*- coding: utf-8 -*-
import re

import unidecode

def slugify(text):
    text = unidecode.unidecode(text).lower()
    return re.sub(r'[\W_]+', '-', text)

text = u"My custom хелло ворлд"
print(slugify(text))
# my-custom-khello-vorld
There is a Python package named awesome-slugify:
pip install awesome-slugify
Works like this:
from slugify import slugify
slugify('one kožušček') # one-kozuscek
awesome-slugify github page
It works well in Django, so I don't see why it wouldn't be a good general-purpose slugify function.
Are you having any problems with it?
The problem is with the ascii normalization line:
slug = unicodedata.normalize('NFKD', s)
That line performs Unicode normalization, which does not decompose many characters to ASCII equivalents. For example, it would strip the non-ASCII characters from the following strings:
Mørdag -> mrdag
Æther -> ther
A better way to do it is to use the unidecode module, which tries to transliterate strings to ASCII. So if you replace the above line with:
import unidecode
slug = unidecode.unidecode(s)
You get better results for the above strings and for many Greek and Russian characters too:
Mørdag -> mordag
Æther -> aether
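A quick side-by-side comparison of the two approaches (a sketch; outputs shown as comments):
import unicodedata
import unidecode

for s in ['Mørdag', 'Æther']:
    # NFKD + ASCII-encode drops undecomposable letters; unidecode transliterates them
    nfkd = unicodedata.normalize('NFKD', s).encode('ascii', 'ignore').decode('ascii')
    print(nfkd, unidecode.unidecode(s))
# Mrdag Mordag
# ther AEther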
import re
import unicodedata

from django.utils import six
from django.utils.functional import allow_lazy
from django.utils.safestring import mark_safe

def slugify(value):
    """
    Converts to lowercase, removes non-word characters (alphanumerics and
    underscores) and converts spaces to hyphens. Also strips leading and
    trailing whitespace.
    """
    value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value).strip().lower()
    return mark_safe(re.sub(r'[-\s]+', '-', value))
slugify = allow_lazy(slugify, six.text_type)
This is the slugify function present in django.utils.text in older Django versions (shown here with its supporting imports). It should satisfy your requirement.
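If you already have Django on the import path, you can simply use it; for example:
from django.utils.text import slugify

print(slugify(u"České Budějovice"))  # ceske-budejovice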
Unidecode is good; however, be careful: Unidecode is GPL-licensed. If this license doesn't fit, then use an alternative such as text-unidecode.
A couple of options on GitHub:
https://github.com/dimka665/awesome-slugify
https://github.com/un33k/python-slugify
https://github.com/mozilla/unicode-slugify
Each supports slightly different parameters for its API, so you'll need to look through to figure out what you prefer.
In particular, pay attention to the different options they provide for dealing with non-ASCII characters. Pydanny wrote a very helpful blog post illustrating some of the Unicode handling differences in these slugify libraries: http://www.pydanny.com/awesome-slugify-human-readable-url-slugs-from-any-string.html This blog post is slightly outdated because Mozilla's unicode-slugify is no longer Django-specific.
Also note that currently awesome-slugify is GPLv3, though there's an open issue where the author says they'd prefer to release as MIT/BSD, just not sure of the legality: https://github.com/dimka665/awesome-slugify/issues/24
You might consider changing the last line to
slug = re.sub(r'--+', r'-', slug)
since the pattern [-]+ is no different from -+, and you don't really care about matching just one hyphen, only two or more.
But, of course, this is quite minor.
Another option is boltons.strutils.slugify. Boltons has quite a few other useful functions as well, and is distributed under a BSD license.
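A minimal usage sketch (note that boltons defaults to an underscore delimiter, configurable via the delim parameter):
from boltons.strutils import slugify

print(slugify('First post! Hi!!!!~1    '))              # first_post_hi_1
print(slugify('First post! Hi!!!!~1    ', delim='-'))   # first-post-hi-1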
Going by your example, a quick way to do that could be:
s = 'String to slugify'
slug = s.replace(" ", "-").lower()
Another nice way to create it could be:
import re

s = 'String to slugify'
slug = re.sub(r'\W+', '-', s).strip('-').lower()
I am trying to get a string to use in the Google Geocoding API. I've checked a lot of threads, but I am still facing a problem and I don't understand how to solve it.
I need addresse1 to be a string without any special characters. addresse1 is, for example: "32 rue d'Athènes Paris France".
addresse1 = collect.replace(' ', '+').replace('\n', '')
addresse1 = unicodedata.normalize('NFKD', addresse1).encode('utf-8', 'ignore')
Here I get a string without any accents... Oh no, it is not a string but bytes. So I did what was suggested and decoded:
addresse1=addresse1.decode('utf-8')
But then addresse1 is exactly the same as at the beginning... What do I have to do? What am I doing wrong? What don't I understand about Unicode? Or is there a better solution?
Thanks,
Stéphane.
With the 3rd-party package unidecode:
>>> import unidecode
>>> unidecode.unidecode("32 rue d'Athènes Paris France")
"32 rue d'Athenes Paris France"
addresse1=unicodedata.normalize('NFKD', addresse1).encode('utf-8','ignore')
You probably meant .encode('ascii', 'ignore'), to remove non-ASCII characters. UTF-8 contains all characters, so encoding to it doesn't get rid of any, and an encode-decode cycle with it is a no-op.
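A quick illustration of the difference (a sketch):
s = u"32 rue d'Athènes"
s.encode('utf-8').decode('utf-8') == s  # True: the UTF-8 round trip is a no-op
s.encode('ascii', 'ignore')             # "32 rue d'Athnes": the 'è' is dropped entirely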
is there a better solution?
It depends what you are trying to do.
If you only want to remove diacritical marks and not lose all other non-ASCII characters, you could read unicodedata.category for each character after NFKD-normalising and remove those in category M (see the sketch after these options).
If you want to transliterate to ASCII that becomes a language-specific question that requires custom replacements (for example in German ö becomes oe, but not in Swedish).
If you just want to fudge a string into ASCII because having non-ASCII characters in it causes some code to break, it is of course much better to fix that code to work properly with all Unicode characters than to mangle good data. The letter è is not encodable in ASCII, but neither are 99.9989% of all characters so that hardly makes it “special”. Code that only supports ASCII is lame.
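Here is a minimal sketch of the category-M filtering mentioned in the first option:
import unicodedata

def strip_marks(s):
    # NFKD-decompose, then drop characters in the combining-mark categories (Mn, Mc, Me)
    return u''.join(c for c in unicodedata.normalize('NFKD', s)
                    if not unicodedata.category(c).startswith('M'))

print(strip_marks(u"32 rue d'Athènes Paris France"))  # 32 rue d'Athenes Paris France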
The Google Geocoding API can work with Unicode perfectly well so there is no obvious reason you should need to do any of this.
ETA:
url2= 'maps.googleapis.com/maps/api/geocode/json?address=' + addresse1 ...
Ah, you need to URL-encode any data you inject into a URL. That's not just for Unicode; the above will break for many ASCII punctuation symbols too. Use urllib.quote to encode a single string, or urllib.urlencode to convert multiple parameters:
params = dict(
    address=addresse1.encode('utf-8'),
    key=googlekey
)
url2 = '...?' + urllib.urlencode(params)
(In Python 3 it's urllib.parse.quote and urllib.parse.urlencode, and they default to UTF-8, so you don't have to encode manually there.)
data2 = urllib.request.urlopen(url2).read().decode('utf-8')
data3=json.loads(data2)
json.loads reads byte strings, so you should be safe to omit the UTF-8 decode. Anyway, json.load will read directly from a file-like object, so you shouldn't have to load the data into a string at all:
data3 = json.load(urllib.request.urlopen(url2))
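Putting the whole flow together in Python 3 (a minimal sketch; the key value is a placeholder):
import json
import urllib.parse
import urllib.request

params = urllib.parse.urlencode({
    'address': u"32 rue d'Athènes Paris France",  # Unicode is fine; urlencode percent-encodes it as UTF-8
    'key': 'YOUR_API_KEY',  # placeholder
})
url2 = 'https://maps.googleapis.com/maps/api/geocode/json?' + params
data3 = json.load(urllib.request.urlopen(url2))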
Generally, there are two approaches: (1) regular expressions and (2) str.translate.
1) regular expressions
Decompose the string and remove characters from the Unicode block \u0300-\u036f (combining diacritical marks):
import re
import unicodedata

word = "pingüino"
word = unicodedata.normalize("NFD", word)
word = re.sub("[\u0300-\u036f]", "", word)
It removes accents, circumflexes, diaereses, and so on:
pingüino > pinguino
εἴκοσι εἶσι > εικοσι εισι
For some languages, it could be another block, such as [\u0559-\u055f] for Armenian script.
2) str.translate
First, create a replacement table (case-sensitive) and then apply it:
repl = str.maketrans(
    "áéúíó",
    "aeuio"
)
word = "adiós"
print(word.translate(repl))  # adios
Multi-character replacements are made as follows:
repl = {
ord("æ"): "ae",
ord("œ"): "oe",
}
word.translate(repl)
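For example, with the table above:
print("Cæsar's œuvre".translate(repl))  # Caesar's oeuvre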
I had a similar problem where I was generating tags that users might have to type on their phone.
Without using third-party packages, you can simplify bobince's answer above:
collect = "32 rue d'Athènes Paris France"
unicode_collect = unicodedata.normalize('NFD', collect)
address1 = unicode_collect.encode('ascii', 'ignore').decode('utf-8')
address1:
"32 rue d'Athenes Paris France"
You can use Python's str.translate() method.
Here's an example adapted from tutorialspoint.com (updated for Python 3):
intab = "aeiou"
outtab = "12345"
trantab = str.maketrans(intab, outtab)

s = "this is string example....wow!!!"
print(s.translate(trantab))
This outputs:
th3s 3s str3ng 2x1mpl2....w4w!!!
So you can define which characters you wish to replace more easily than with replace().
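Applying the same idea to accent stripping (a small sketch; extend the table for your own data):
accents = str.maketrans("áàâäéèêëíìîïóòôöúùûü", "aaaaeeeeiiiioooouuuu")
print("déjà vu".translate(accents))  # deja vu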
My website supports a number of Indian languages. The user can change the language dynamically. When a user inputs some string value, I have to split the string value into its individual characters. So, I'm looking for a way to write a common function that will work for English and a select set of Indian languages. I have searched across sites; however, there appears to be no common way to handle this requirement. There are language-specific implementations (for example, the Open-Tamil package implements get_letters for Tamil), but I could not find a common way to split or iterate through the characters of a Unicode string while taking graphemes into consideration.
One of the many methods that I've tried:
name = u'தமிழ்'
print name
for i in list(name):
print i
#expected output
தமிழ்
த
மி
ழ்
#actual output
தமிழ்
த
ம
ி
ழ
்
#Here is an example using another Indian language
name = u'हिंदी'
print name
for i in list(name):
print i
#expected output
हिंदी
हिं
दी
#actual output
हिंदी
ह
ि
ं
द
ी
The way to solve this is to group all "L" category characters with their subsequent "M" category characters:
>>> regex.findall(ur'\p{L}\p{M}*', name)
[u'\u0ba4', u'\u0bae\u0bbf', u'\u0bb4\u0bcd']
>>> for c in regex.findall(ur'\p{L}\p{M}*', name):
... print c
...
த
மி
ழ்
This uses the third-party regex module rather than the built-in re.
To get "user-perceived" characters whatever the language, use \X (eXtended grapheme cluster) regular expression:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import regex # $ pip install regex
for text in [u'தமிழ்', u'हिंदी']:
print("\n".join(regex.findall(r'\X', text, regex.U)))
Output
த
மி
ழ்
हिं
दी
uniseg works really well for this, and the docs are OK. The other answers to this question work for international Unicode characters but fall flat if users enter emoji. The solution below will work:
>>> emoji = u'😀😃😄😁'
>>> from uniseg.graphemecluster import grapheme_clusters
>>> for c in list(grapheme_clusters(emoji)):
... print c
...
😀
😃
😄
😁
This is from pip install uniseg==0.7.1.
I have the following two functions that work perfectly fine with ASCII strings, using the re module:
import re
def findWord(w):
return re.compile(r'\b{0}.*?\b'.format(w), flags=re.IGNORECASE).findall
def replace_keyword(w, c, x):
return re.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=re.I)
However, they fail on UTF-8 encoded strings with accented characters. On searching further, I found that the regex module is better suited to Unicode strings, so I have been trying to port this to regex for the last couple of hours, but nothing seems to work. This is what I have as of now:
import regex
def findWord(w):
return regex.compile(r'\b{0}.*?\b'.format(w), flags=regex.IGNORECASE|regex.UNICODE).findall
def replace_keyword(w, c, x):
return regex.sub(r"\b({0}\S*)".format(w), r'<mark style="background-color:{0}">\1</mark>'.format(c), x, flags=regex.IGNORECASE|regex.UNICODE)
However, when using an accented (not normalized) UTF-8 encoded string, I keep getting an "ordinal not in range" error.
EDIT: The suggested possible duplicate question, Regular expression to match non-English characters?, doesn't solve my problem. I want to use the Python re/regex module. Secondly, I want to get the find and replace functions working in Python.
EDIT: I am using python 2
EDIT: If you feel you can help me get these two functions working using Python 3, please let me know. I hope I will be able to invoke Python 3 for just these two functions from my Python 2 script.
I think I am headed somewhere. I am trying to get this working without using the re or regex modules, just plain Python:
found_keywords = []
for word in keyword_list:
if word.lower() in article_text.lower():
found_keywords.append(word)
for word in found_keywords: # highlight the found keyword in the text
article_text = article_text.lower().replace(word.lower(), '<mark style="background-color:%s">%s</mark>' % (yellow_color, word))
Now, I just have to somehow replace found keywords in a case-insensitive manner and I will be good to go.
Just help me with this last step of replacing keywords in a case-insensitive manner without using re or regex so that it works for accented strings.
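Here is a minimal sketch of that last step without re or regex (the highlight helper is hypothetical; it assumes lowercasing doesn't change the string length, which holds for most scripts):
def highlight(text, word, color):
    # Wrap each case-insensitive occurrence of `word` in a <mark> tag,
    # preserving the original casing of the matched text.
    out = []
    low_text, low_word = text.lower(), word.lower()
    i = 0
    while True:
        j = low_text.find(low_word, i)
        if j == -1:
            out.append(text[i:])
            return ''.join(out)
        out.append(text[i:j])
        out.append('<mark style="background-color:%s">%s</mark>'
                   % (color, text[j:j + len(word)]))
        i = j + len(word)

article_text = highlight(article_text, word, yellow_color)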
I use this regex on some input:
[^a-zA-Z0-9&#]
However this ends up removing lots of HTML character references within the input, such as &#227;, &#1606;, and &#1588;.
Is there a way that I can convert them to their character values so that they will satisfy the regex?
Given that your text appears to have numeric character references rather than named entities, you can first convert your byte string containing the XML entity syntax (ampersand, hash, digits, semicolon) to Unicode:
import re
xed_re = re.compile(r'&#(\d+);')
def usub(m): return unichr(int(m.group(1)))
s = '&#227;, &#1606;, &#1588;'
u = xed_re.sub(usub, s)
If your terminal emulator can display arbitrary Unicode glyphs, print u will then show:
ã, ن, ش
In any case, you can now, if you wish, use your original RE and you won't accidentally "catch" the entities, only ascii letters, digits, and the couple of punctuation characters you listed. (I'm not sure that's what you really want -- why not accented letters but just ascii ones, for example? -- but, if it is what you want, it will work).
If you do have named entities in addition to the numeric-coded ones, you can also apply the htmlentitydefs standard library module recommended in another answer (it only deals with named entities which map to Latin-1 code points, however).
You can adapt the following script:
import htmlentitydefs
import re

def substitute_entity(match):
    name = match.group(1)
    if name in htmlentitydefs.name2codepoint:
        return unichr(htmlentitydefs.name2codepoint[name])
    elif name.startswith('#'):
        try:
            return unichr(int(name[1:]))
        except:
            pass
    return '?'

print re.sub('&(#?\\w+);', substitute_entity, 'x &laquo; y &wat; z &#123;')
Produces the following answer here:
x « y ? z {
EDIT: I understood the question as "how to get rid of HTML entities before further processing", hope I haven't wasted time on answering a wrong question ;)
Without knowing what the expression is being used for I can't tell exactly what you need.
This will match special characters or strings of characters excluding letters, digits, &, and #:
[^a-zA-Z0-9&#]*|&#[0-9A-Za-z]+;