Python Polish character encoding issues

I'm having some issues with character encoding, and in this specific case with Polish characters.
I need to replace all non-windows-1252 characters with a windows-1252 equivalent. I had this working until I needed to handle Polish characters. How can I replace these characters?
The é, for example, is a windows-1252 character and must stay as it is. But ł is not a windows-1252 character and must be replaced with its equivalent (or stripped if it has no equivalent).
I tried this:
import unicodedata
text = "Racławicka Rógé"
tmp = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(tmp.decode("utf-8"))
This prints:
Racawicka Roge
But now ó and é have both been reduced to o and e.
How can I get this right?

If you want to move to 1252, that's what you should tell encode and decode:
>>> text = "Racławicka Rógé"
>>> text.encode('1252', 'ignore').decode('1252')
'Racawicka Rógé'

If you are not handling big texts (as in your example), you can combine the Unidecode library with the solution provided by jonrsharpe.
from unidecode import unidecode

text = u'Racławicka Rógé'
result = ''
for i in text:
    try:
        # keep the character if it survives a 1252 round trip
        result += i.encode('1252').decode('1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # otherwise fall back to unidecode's ASCII approximation
        result += unidecode(i)

print(result)  # 'Raclawicka Rógé'
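A more compact variant of the same fallback idea (a sketch, assuming the unidecode package is available) registers a custom error handler with the codecs module, so a single encode() call handles the whole string:
import codecs
from unidecode import unidecode

def unidecode_fallback(e):
    # e.object[e.start:e.end] is the substring that failed to encode
    part = e.object[e.start:e.end]
    return unidecode(part), e.end

codecs.register_error('unidecode_fallback', unidecode_fallback)

text = u'Racławicka Rógé'
print(text.encode('cp1252', errors='unidecode_fallback').decode('cp1252'))
# Raclawicka Rógé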

Related

Reverse Python's encoding of umlaut to normalize text or normalize in current form

Python automatically reads German umlauts and punctuation as
Gefrier- und TiefkÃ¼hlmÃ¶bel
How do I normalize this output to remove punctuation?
You could "fix" the encoding issue by doing:
the_string = 'Gefrier- und TiefkÃ¼hlmÃ¶bel'.encode('latin-1').decode('utf-8')
And then apply a solution like this one: https://stackoverflow.com/a/518232/2452074
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')

strip_accents(the_string)
# 'Gefrier- und Tiefkuhlmobel'
But first, I would try to understand why your input looks broken; Python itself shouldn't do that automatically.
Some background docs on unicode and encodings: https://docs.python.org/3/howto/unicode.html

Python generic function to replace special characters

I have been looking for a while now but I am not able to find a proper solution.
I have a database with Dutch, French and German words which all have their special characters, e.g. é, è, ß, ç, etc.
For some cases, like in a URL, I would like to replace these with alphanumeric characters: respectively e, e, ss, c, etc.
Is there a generic function or Python package that does this?
I could do this with a regex of course, but something generic would be great here.
Thanks.
try this package: https://pypi.python.org/pypi/Unidecode
>>> import unidecode
>>> unidecode.unidecode(u'çß')
'css'
As you say, this could be done using a regex sub. You would of course need to include upper and lowercase variants.
import re

data = "é, è, ß, ç, äÄ"
lookup = {'é': 'e', 'è': 'e', 'ß': 'ss', 'ç': 'c', 'ä': 'a', 'Ä': 'A'}
print(re.sub(r'([éèßçäÄ])', lambda x: lookup[x.group(1)], data))
This would display the following:
e, e, ss, c, aA
You can almost get away with the built-in unicodedata module (unfortunately a few of your characters break it):
>>> import unicodedata
>>> s=u"é, è, ß, ç"
>>> unicodedata.normalize('NFKD', s).encode('ascii', 'ignore')
'e, e, , c'
Here is a solution with the codepoints hardcoded, taken from http://code.activestate.com/recipes/251871-latin1-to-ascii-the-unicode-hammer/:
def latin1_to_ascii(unicrap):
    """This takes a UNICODE string and replaces Latin-1 characters with
    something equivalent in 7-bit ASCII. It returns a plain ASCII string.
    This function makes a best effort to convert Latin-1 characters into
    ASCII equivalents. It does not just strip out the Latin-1 characters.
    All characters in the standard 7-bit ASCII range are preserved.
    In the 8th bit range all the Latin-1 accented letters are converted
    to unaccented equivalents. Most symbol characters are converted to
    something meaningful. Anything not converted is deleted.
    """
    xlate = {
        0xc0: 'A', 0xc1: 'A', 0xc2: 'A', 0xc3: 'A', 0xc4: 'A', 0xc5: 'A',
        0xc6: 'Ae', 0xc7: 'C',
        0xc8: 'E', 0xc9: 'E', 0xca: 'E', 0xcb: 'E',
        0xcc: 'I', 0xcd: 'I', 0xce: 'I', 0xcf: 'I',
        0xd0: 'Th', 0xd1: 'N',
        0xd2: 'O', 0xd3: 'O', 0xd4: 'O', 0xd5: 'O', 0xd6: 'O', 0xd8: 'O',
        0xd9: 'U', 0xda: 'U', 0xdb: 'U', 0xdc: 'U',
        0xdd: 'Y', 0xde: 'th', 0xdf: 'ss',
        0xe0: 'a', 0xe1: 'a', 0xe2: 'a', 0xe3: 'a', 0xe4: 'a', 0xe5: 'a',
        0xe6: 'ae', 0xe7: 'c',
        0xe8: 'e', 0xe9: 'e', 0xea: 'e', 0xeb: 'e',
        0xec: 'i', 0xed: 'i', 0xee: 'i', 0xef: 'i',
        0xf0: 'th', 0xf1: 'n',
        0xf2: 'o', 0xf3: 'o', 0xf4: 'o', 0xf5: 'o', 0xf6: 'o', 0xf8: 'o',
        0xf9: 'u', 0xfa: 'u', 0xfb: 'u', 0xfc: 'u',
        0xfd: 'y', 0xfe: 'th', 0xff: 'y',
        0xa1: '!', 0xa2: '{cent}', 0xa3: '{pound}', 0xa4: '{currency}',
        0xa5: '{yen}', 0xa6: '|', 0xa7: '{section}', 0xa8: '{umlaut}',
        0xa9: '{C}', 0xaa: '{^a}', 0xab: '<<', 0xac: '{not}',
        0xad: '-', 0xae: '{R}', 0xaf: '_', 0xb0: '{degrees}',
        0xb1: '{+/-}', 0xb2: '{^2}', 0xb3: '{^3}', 0xb4: "'",
        0xb5: '{micro}', 0xb6: '{paragraph}', 0xb7: '*', 0xb8: '{cedilla}',
        0xb9: '{^1}', 0xba: '{^o}', 0xbb: '>>',
        0xbc: '{1/4}', 0xbd: '{1/2}', 0xbe: '{3/4}', 0xbf: '?',
        0xd7: '*', 0xf7: '/',
    }
    r = ''
    for i in unicrap:
        if ord(i) in xlate:       # dict.has_key() is Python 2 only
            r += xlate[ord(i)]
        elif ord(i) >= 0x80:
            pass                  # no mapping: drop the character
        else:
            r += str(i)
    return r
Of course, you could just as easily use a regex, as indicated in the other answers.
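In Python 3, str.translate gives you the same hardcoded-table idea without the manual loop. A minimal sketch (the mapping below is an illustrative subset, not the full table above):
# str.translate takes a dict from code points to replacement strings (None deletes)
table = {0xe9: 'e', 0xe8: 'e', 0xdf: 'ss', 0xe7: 'c', 0xe4: 'a', 0xc4: 'A'}
print("é, è, ß, ç, äÄ".translate(table))
# e, e, ss, c, aA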

Remove accented characters from string - Python

I get some data from a webpage and read it like this in Python:
origional_doc = urllib2.urlopen(url).read()
Sometimes this URL has characters such as é and ä etc. How can I remove these characters from the string? Right now this is what I am trying:
import unicodedata
origional_doc = ''.join((c for c in unicodedata.normalize('NFD', origional_doc) if unicodedata.category(c) != 'Mn'))
But I get an error
TypeError: must be unicode, not str
This should work. It will eliminate all characters that are not ASCII.
original_doc = (original_doc.decode('unicode_escape').encode('ascii','ignore'))
Using re you can sub out all characters that are in a certain hexadecimal range:
>>> re.sub('[\x80-\xFF]','','é and ä and ect')
' and and ect'
You can also do the inverse and sub anything that's NOT in the basic 128 characters:
>>> re.sub('[^\x00-\x7F]','','é and ä and ect')
' and and ect'
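For Python 3, here is a sketch of the same idea with the decode step done up front, which avoids the TypeError above (urllib.request replaces urllib2; the URL is a placeholder):
import unicodedata
import urllib.request

raw = urllib.request.urlopen('http://example.com').read()  # bytes
text = raw.decode('utf-8', errors='replace')               # now a unicode str
stripped = ''.join(c for c in unicodedata.normalize('NFD', text)
                   if unicodedata.category(c) != 'Mn')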

Unicode issues when using NLTK

I have a text scraped from the internet (I think it was a Spanish text encoded in "latin-1" and decoded to unicode when scraped). The text is something like this:
730\u20ac.\r\n\nropa nueva 2012 ... 5,10 muy buen estado..... 170 \u20ac\r\n\nPack 850\u20ac,
After that I do some replacements on the text to normalize some words, i.e. replacing the € symbol (\u20ac) with "euros" using the regex (r'\u20ac', r' euros').
Here my problem seems to start... If I do not encode each string to "UTF-8" before applying the regex, the regex won't find any occurrences (despite the fact that many occurrences do exist)...
Anyway, after encoding to UTF-8, the regex (r'\u20ac', r' euros') works.
After that I tokenize and tag all the strings. When I try to use the regex parser I then get:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
My question is: if I have already encoded it to UTF-8, how come I have a problem now? And what would you suggest to try to avoid it?
Is there a way to do the encoding process once and for all, like below? If so, what should I do for the second part (encode/decode it anyway)?
Get text -> encode/ decode it anyway... -> Work on the text without any issue
Thanks in advance for any help!! I am new to programming and it is killing me...
Code detail:
The regex function:
replacement_patterns = [(ur' \\u20ac', ur' euros'),(ur' \xe2\x82\xac', r' euros'),(ur' \b[eE]?[uU]?[rR]\b', r' euros'), (ur' \b([0-9]+)[eE][uU]?[rR]?[oO]?[sS]?\b',ur' \1 euros')]
class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex, re.IGNORECASE), repl)
                         for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s
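As an aside on the "once and for all" question above: the usual pattern is to decode bytes exactly once at the boundary and keep everything unicode afterwards. A minimal sketch (the byte string here stands in for the scraped data):
# -*- coding: utf-8 -*-
import re

raw = b'Pack 850\xe2\x82\xac'              # pretend these bytes came from the scraper
text = raw.decode('utf-8')                 # decode once, at the boundary
text = re.sub(u'\u20ac', u' euros', text)  # all later work stays in unicode
print(text)  # Pack 850 euros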
You seem to be misunderstanding the meaning of r'\u20ac'.
The r indicates a raw string. Not a unicode string, a standard one. So using a unicode escape in a pattern only gets you a literal backslash:
>>> p = re.compile(r'\u20ac')
>>> p.pattern
'\\u20ac'
>>> print p.pattern
\u20ac
If you want to use raw strings and unicode escapes, you'll have to use raw unicode strings, indicated by ur instead of just r:
>>> p = re.compile(ur'\u20ac')
>>> p.pattern
u'\u20ac'
>>> print p.pattern
€
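(For what it's worth, on Python 3.3+ the re module itself understands \u escapes inside patterns, so a plain raw string works there:)
import re
p = re.compile(r'\u20ac')      # Python 3: the re parser expands \u20ac itself
print(p.findall('Pack 850€'))  # ['€']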
Did you use the decode & encode functions correctly?
from nltk import ne_chunk, pos_tag
from nltk.tokenize.punkt import PunktSentenceTokenizer
from nltk.tokenize.treebank import TreebankWordTokenizer

text = "€"
text = text.decode('utf-8')  # Python 2: work with unicode internally
sentences = PunktSentenceTokenizer().tokenize(text)
tokens = [TreebankWordTokenizer().tokenize(sentence) for sentence in sentences]
tagged = [pos_tag(token) for token in tokens]
When needed, try to use:
print your_string.encode("utf-8")
I have no problems currently. The only issue is that for $50, it says:
word: $    meaning: dollar
word: 50   meaning: numeral, cardinal
This is correct. For €50, it says:
word: €50  meaning: -NONE-
This is incorrect. With a space between the € sign and the number, it says:
word: €    meaning: noun, common, singular or mass
word: 50   meaning: numeral, cardinal
which is more correct.

Python 3: unescaping non-ASCII characters

(Python 3.3.2) I have to unescape some non-ASCII escaped characters returned by a call to re.escape(). I see here and here methods that don't work. I'm working in a 100% UTF-8 environment.
# pure ASCII string : ok
mystring = "a\n" # expected unescaped string : "a\n"
cod = codecs.getencoder('unicode_escape')
print( cod(mystring) )
# non ASCII string : method #1
mystring = "€\n"
# equivalent to : mystring = codecs.unicode_escape_decode(mystring)
cod = codecs.getdecoder('unicode_escape')
print(cod(mystring))
# RESULT = ('â\x82¬\n', 5) INSTEAD OF ("€\n", 2)
# non ASCII string : method #2
mystring = "€\n"
mystring = bytes(mystring, 'utf-8').decode('unicode_escape')
print(mystring)
# RESULT = â\202¬ INSTEAD OF "€\n"
Is this a bug? Have I misunderstood something?
Any help would be appreciated!
PS: I edited my post thanks to Michael Foukarakis' remark.
I guess the actual string you need to process is mystring = "€\\n"?
mystring = "€\n"   # that's 2 chars: "€" and a newline
mystring = "€\\n"  # that's 3 chars: "€", "\" and "n"
I don't really understand what's going wrong inside Python 3's encode() and decode(), but my friend solved this problem while we were writing some tools.
What we did is bypass the encode("utf_8") step after the escape procedure is done.
>>> "€\\n".encode("utf_8")
b'\xe2\x82\xac\\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape")
'â\x82¬\n'
>>> "€\\n".encode("utf_8").decode("unicode_escape").encode("utf_8")
b'\xc3\xa2\xc2\x82\xc2\xac\n' # we don't want this
>>> bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")])
b'\xe2\x82\xac\n' # what we really need
>>> str(bytes([ord(char) for char in "€\\n".encode("utf_8").decode("unicode_escape")]), "utf_8")
'€\n'
We can see that although the result of decode("unicode_escape") looks weird, the bytes object actually contains the correct bytes of your string (in UTF-8 encoding): in this case, b'\xe2\x82\xac\n'.
So we do not print the str object directly, nor do we call encode("utf_8") on it; we use ord() to build the bytes object b'\xe2\x82\xac\n'.
And you can get the correct str from this bytes object by passing it to str() together with the "utf_8" encoding.
BTW, the tool my friend and I wanted to make is a wrapper that lets the user input C-like string literals and converts the escape sequences automatically.
User input: \n\x61\x62\n\x20\x21 # 20 characters, which represent 6 chars semantically
output: # \n
ab # \x61\x62\n
! # \x20\x21
That's a handy tool for letting users input non-printable characters in a terminal.
Our final tool is:
#!/usr/bin/env python3
import sys

for line in sys.stdin:
    sys.stdout.buffer.write(bytes([ord(char) for char in line[:-1].encode().decode('unicode_escape')]))
    sys.stdout.flush()
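Incidentally, the bytes([ord(char) for char in ...]) step is equivalent to encoding with latin-1, since each character's code point equals the byte value you want back. A shorter sketch of the same round trip:
s = "€\\n"
fixed = s.encode('utf_8').decode('unicode_escape').encode('latin_1').decode('utf_8')
print(fixed)  # prints € followed by a newline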
You seem to misunderstand encodings. To be protected against common errors, we usually encode a string when it leaves our application, and decode it when it comes in.
Firstly, let's look at the documentation for unicode_escape, which states:
Produce[s] a string that is suitable as Unicode literal in Python source code.
Here is what you would get from the network or a file that claims its contents are Unicode escaped:
b'\\u20ac\\n'
Now, you have to decode this to use it in your app:
>>> s = b'\\u20ac\\n'.decode('unicode_escape')
>>> s
'€\n'
and if you wanted to write it back to, say, a Python source file:
with open('/tmp/foo', 'wb') as fh:  # binary mode
    fh.write(b'print("' + s.encode('unicode_escape') + b'")')
import string

printable = string.printable + '€'

def cod(c):
    return c.encode('unicode_escape').decode('ascii')

def unescape(s):
    return ''.join(c if ord(c) >= 32 and c in printable else cod(c) for c in s)

mystring = "€\n"
print(unescape(mystring))
Unfortunately string.printable only includes ASCII characters. You can make a copy as I did here and extend it with any Unicode characters that you'd like, such as €.
