Replacing special characters in a string in Python

I read the artist of a song from its MP3 tag, then create a folder based on that name. The problem I have is when the name contains a special character, like 'AC\DC'. So I wrote this code to deal with that:
def replace_all(text):
    print "replace_all"
    dictionary = {'\\': "", '?': "", '/': "", '...': "", ':': "", chr(148): "o"}
    for i, j in dictionary.iteritems():
        text = text.replace(i, j)
    return text
What I am running into now is how to deal with non-English characters like the umlaut o in Motörhead or Blue Öyster Cult.
As you can see, I tried adding the ASCII-string version of the umlaut o at the end of the dictionary, but that failed with
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 4: ordinal not in range(128)

I found this code, though I don't understand it.
import unicodedata

def strip_accents(s):
    return ''.join(c for c in unicodedata.normalize('NFD', s)
                   if unicodedata.category(c) != 'Mn')
It enabled me to remove the accent marks from the path of proposed dir/filenames.
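For example, applied to unicode input it strips the combining marks that NFD splits off:
>>> strip_accents(u'Motörhead')
u'Motorhead'
>>> strip_accents(u'Blue Öyster Cult')
u'Blue Oyster Cult'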

I suggest using unicode for both the input text and the characters being replaced. In your example, chr(148) is clearly not a unicode symbol.
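A minimal sketch of what that could look like (assuming the tags arrive as UTF-8 byte strings; the replacement table is illustrative, not complete):
def replace_all(text):
    # decode byte strings up front so the table below can stay pure unicode
    if not isinstance(text, unicode):
        text = text.decode('utf-8')
    dictionary = {u'\\': u'', u'?': u'', u'/': u'', u'...': u'', u':': u'',
                  u'\u00f6': u'o'}  # u'\u00f6' is the umlaut o
    for i, j in dictionary.iteritems():
        text = text.replace(i, j)
    return text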

Related

Python replace unicode characters with spaces of the same length in utf-8

I have some text like
Print\n* Share\n\n\nNUMBER:\u00a00958\n\nPOLICY\n\n1. CRITERIA FOR INITIAL APPROVAL\n\n\n\n Aetna considers\u00a0gemcitabine (Gemzar)
This is causing issues when importing to a data labelling tool, as the labels are offset by the length of the unicode character in UTF-8.
For example, \u00a0 is a two-byte character in UTF-8: 0xC2 0xA0 (c2a0).
In the example above, I have a label at gemcitabine, but it shows up in the labelling tool as mcitabine (, because there are two \u00a0 characters before it.
If I replace both \u00a0 characters with two spaces, the labels show up correctly.
I was just wondering, how could I detect and replace unicode symbols that occupy more than one byte in UTF-8 with the same number of spaces?
I ended up solving it like this:
# keep single-byte characters; replace anything wider with that many spaces
convert_to_utf8 = lambda char: char if len(char.encode('utf-8')) <= 1 else ''.ljust(len(char.encode('utf-8')))
string_array = [convert_to_utf8(char) for char in obj['data']]
obj['data'] = ''.join(string_array)
where obj['data'] is the string that contains multi-character symbols.
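A quick illustrative check (Python 3 here, with a made-up sample string): the padded string keeps the same UTF-8 byte length, so the label offsets line up.
>>> obj = {'data': 'NUMBER:\u00a00958'}
>>> padded = ''.join(convert_to_utf8(char) for char in obj['data'])
>>> padded
'NUMBER:  0958'
>>> len(padded.encode('utf-8')) == len(obj['data'].encode('utf-8'))
True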

Problem with .decode('utf-8').upper() and special characters (but only inside the string)

I would like to capitalise letters at given positions in a string. I have a problem with special letters, Polish letters to be specific: for example "ą". Ideally the solution would also work for French, Spanish, etc. (ç, è, etc.).
dobry="costąm"
print(dobry[4].decode('utf-8').upper())
I obtain:
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc4 in position 0: unexpected end of data
while for this:
print("ą".decode('utf-8').upper())
I obtain Ą as desired.
What is more curious, for letters at positions 0-3 it works fine, while for:
print(dobry[5].decode('utf-8').upper())
I obtain the same problem.
The string actually looks like this:
>>> list(dobry)
['c', 'o', 's', 't', '\xc4', '\x85', 'm']
So, dobry[5] == '\x85' because the letter ą is represented by two bytes. To solve this, simply use Python 3 instead of Python 2.
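In Python 3 a str is a sequence of code points rather than bytes, so the original indexing just works (a minimal sketch):
# Python 3: dobry[4] is the whole letter 'ą', not half of it
dobry = "costąm"
print(dobry[4].upper())  # prints Ą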
UTF-8 may use more than one byte to encode a character, so iterating over a bytestring and manipulating individual bytes won't always work. It's better to decode to Python 2's unicode type. Perform your manipulations, then re-encode to UTF-8.
>>> dobry="costąm"
>>> udobry = unicode(dobry, 'utf-8')
>>> changed = udobry[:4] + udobry[4].upper() + udobry[5]
>>> new_dobry = changed.encode('utf-8')
>>> print new_dobry
costĄm
As @tripleee commented, non-ASCII characters may not map to a single unicode codepoint: "ą" could be the single codepoint U+0105 LATIN SMALL LETTER A WITH OGONEK, or it could be the decomposed sequence "a" followed by U+0328 COMBINING OGONEK.
In the decomposed string the "a" character can be capitalised on its own, and "A" followed by COMBINING OGONEK will render as "Ą" (though it may look like two separate characters in the Python REPL, or the terminal, depending on the terminal settings).
Note that you need to take the extra combining character into account when indexing.
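For instance, upper-casing position 4 in the decomposed form, keeping the combining mark that follows it (an illustrative snippet):
>>> u = u'costa\u0328m'                      # decomposed: 'a' + COMBINING OGONEK
>>> changed = u[:4] + u[4].upper() + u[5:]   # the mark at index 5 is preserved
>>> changed
u'costA\u0328m'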
It's also possible to normalise the decomposed string to the single-codepoint (composed) form using the tools in the unicodedata module:
>>> import unicodedata
>>> unicodedata.normalize('NFC', u'costa\u0328m') == u"costąm"
True
but this may cause problems if, for example, you are returning the changed string to a system that expects the combining character to be preserved.
What about this instead:
print(dobry.decode('utf-8')[5].upper())

Python Polish character encoding issues

I'm having some issues with character encoding, in this case specifically with Polish characters.
I need to replace all non-windows-1252 characters with a windows-1252 equivalent. I had this working until I needed to handle Polish characters. How can I replace these characters?
The é, for example, is a windows-1252 character and must stay that way. But the ł is not a windows-1252 character and must be replaced with its equivalent (or stripped if it has no equivalent).
I tried this:
import unicodedata
text = "Racławicka Rógé"
tmp = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
print(tmp.decode("utf-8"))
This prints:
Racawicka Roge
But now the ó and é are both encoded to o and e.
How can I get this right?
If you want to move to 1252, that's what you should tell encode and decode:
>>> text = "Racławicka Rógé"
>>> text.encode('1252', 'ignore').decode('1252')
'Racawicka Rógé'
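The 'ignore' handler silently drops anything without a windows-1252 equivalent; a quick illustrative check (Python 3):
>>> 'ł'.encode('1252', 'ignore')   # not in windows-1252: silently dropped
b''
>>> 'é'.encode('1252', 'ignore')   # in windows-1252: kept
b'\xe9'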
If you are not handling big texts, as in your example, you can make use of the Unidecode library combined with the solution provided by jonrsharpe:
from unidecode import unidecode

text = u'Racławicka Rógé'
result = ''
for i in text:
    try:
        result += i.encode('1252').decode('1252')
    except (UnicodeEncodeError, UnicodeDecodeError):
        result += unidecode(i)
print result  # 'Raclawicka Rógé'

Convert GBK to utf8 string in python

I have a string.
s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
How can I translate s into a UTF-8 string? I have tried s.decode('gbk').encode('utf-8'), but Python reports an error: UnicodeEncodeError: 'ascii' codec can't encode characters in position 35-50: ordinal not in range(128)
In Python 2, try this to convert your unicode string:
>>> s.encode('latin-1').decode('gbk')
u"<script language=javascript>alert('\u8bf7\u8f93\u5165\u6b63\u786e\u9a8c\u8bc1\u7801,\u8c22\u8c22!');location='index.asp';</script></script>"
then you can encode to utf-8 as you wish.
>>> s.encode('latin-1').decode('gbk').encode('utf-8')
"<script language=javascript>alert('\xe8\xaf\xb7\xe8\xbe\x93\xe5\x85\xa5\xe6\xad\xa3\xe7\xa1\xae\xe9\xaa\x8c\xe8\xaf\x81\xe7\xa0\x81,\xe8\xb0\xa2\xe8\xb0\xa2!');location='index.asp';</script></script>"
You are mixing apples and oranges. The GBK-encoded string is not a Unicode string and should hence not end up in a u'...' string.
This is the correct way to do it in Python 2.
g = ('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,'
     '\xd0\xbb\xd0\xbb!').decode('gbk')
s = (u"<script language=javascript>alert('" + g +
     u"');location='index.asp';</script></script>")
Notice how the initializer for g which is passed to .decode('gbk') is not represented as a Unicode string, but as a plain byte string.
See also http://nedbatchelder.com/text/unipain.html
If you can keep the alert in a separate string "a":
a = '\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!'.decode("gbk")
s = u"<script language=javascript>alert('"+a+"');location='index.asp';</script></script>"
print s
Then it will print:
<script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>
If you want to automatically extract the substring in one go:
s = "<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
s = unicode("'".join((s.decode("gbk").split("'",2))))
print s
will print:
<script language=javascript>alert('请输入正确验证码,谢谢!');location='index.asp';</script></script>
Take a look at unicodedata, but I think one way to do this is:
import unicodedata
s = u"<script language=javascript>alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';</script></script>"
unicodedata.normalize('NFKD', s).encode('utf-8','ignore')
I ran into the same problem, with a string like this:
name = u'\xb9\xc5\xbd\xa3\xc6\xe6\xcc\xb7'
which I want to convert to
u'\u53e4\u5251\u5947\u8c2d'
Here is my solution:
new_name = name.encode('iso-8859-1').decode('gbk')
And I tried it on your string:
s = u"alert('\xc7\xeb\xca\xe4\xc8\xeb\xd5\xfd\xc8\xb7\xd1\xe9\xd6\xa4\xc2\xeb,\xd0\xbb\xd0\xbb!');location='index.asp';"
print s
alert('ÇëÊäÈëÕýÈ·ÑéÖ¤Âë,лл!');location='index.asp';
Then:
_s = s.encode('iso-8859-1').decode('gbk')
print _s
alert('请输入正确验证码,谢谢!');location='index.asp';
Hope this helps.
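The reason this round trip works: iso-8859-1 (latin-1) maps code points U+0000-U+00FF one-to-one onto bytes 0x00-0xFF, so encoding the mis-decoded unicode string with it recovers the original GBK bytes unchanged (illustrated with the string from above):
>>> name = u'\xb9\xc5\xbd\xa3\xc6\xe6\xcc\xb7'   # unicode that really holds GBK bytes
>>> name.encode('iso-8859-1')                    # the original byte string comes back
'\xb9\xc5\xbd\xa3\xc6\xe6\xcc\xb7'
>>> name.encode('iso-8859-1').decode('gbk')
u'\u53e4\u5251\u5947\u8c2d'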

Unicode issues when using NLTK

I have a text scraped from the internet (I think it was a Spanish text encoded in "latin-1" and decoded to unicode when scraped). The text is something like this:
730\u20ac.\r\n\nropa nueva 2012 ... 5,10 muy buen estado..... 170 \u20ac\r\n\nPack 850\u20ac,
After that I do some replacements on the text to normalize some words, e.g. replacing the € symbol (\u20ac) with "euros" using the regex (r'\u20ac', r' euros').
Here my problem seems to start... If I do not encode each string to UTF-8 before applying the regex, the regex won't find any occurrences (despite plenty of occurrences existing)...
Anyway, after encoding to UTF-8, the regex (r'\u20ac', r' euros') works.
After that I tokenize and tag all the strings. When I try to use the RegexpParser I then get the
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 1: ordinal not in range(128)
My question is, if I have already encoded it to UTF-8, how come I have a problem now? And what would be your suggestion to try to avoid it?
Is there a way to do the encoding process once and for all, like below? If so, what should I do for the second part (encode/decode it anyway)?
Get text -> encode/decode it anyway... -> Work on the text without any issue
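Something like this is what I'm hoping for (a rough sketch with made-up sample bytes, decoding once at the boundary and working purely on unicode):
import re

raw = '730\xe2\x82\xac ... 170 \xe2\x82\xac'  # scraped bytes (UTF-8)
text = raw.decode('utf-8')                    # decode once at the boundary
text = re.sub(u'\u20ac', u' euros', text)     # all work happens on unicode
output = text.encode('utf-8')                 # encode once on the way out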
Thanks in advance for any help!! I am new to programming and it is killing me...
Code detail:
regex function
import re

replacement_patterns = [(ur' \\u20ac', ur' euros'),
                        (ur' \xe2\x82\xac', r' euros'),
                        (ur' \b[eE]?[uU]?[rR]\b', r' euros'),
                        (ur' \b([0-9]+)[eE][uU]?[rR]?[oO]?[sS]?\b', ur' \1 euros')]

class RegexpReplacer(object):
    def __init__(self, patterns=replacement_patterns):
        self.patterns = [(re.compile(regex, re.IGNORECASE), repl)
                         for (regex, repl) in patterns]

    def replace(self, text):
        s = text
        for (pattern, repl) in self.patterns:
            (s, count) = re.subn(pattern, repl, s)
        return s
You seem to be misunderstanding the meaning of r'\u20ac'.
The r indicates a raw string; not a unicode string, but a standard byte string. So using a unicode escape in the pattern only gets you a literal backslash:
>>> p = re.compile(r'\u20ac')
>>> p.pattern
'\\u20ac'
>>> print p.pattern
\u20ac
If you want to use raw strings and unicode escapes, you'll have to use raw unicode strings, indicated by ur instead of just r:
>>> p = re.compile(ur'\u20ac')
>>> p.pattern
u'\u20ac'
>>> print p.pattern
€
Did you use the decode & encode functions correctly?
from nltk import pos_tag
from nltk.tokenize.punkt import PunktSentenceTokenizer
from nltk.tokenize.treebank import TreebankWordTokenizer

sent_tokenizer = PunktSentenceTokenizer()
word_tokenizer = TreebankWordTokenizer()

text = "€"
text = text.decode('utf-8')
sentences = sent_tokenizer.tokenize(text)
tokens = [word_tokenizer.tokenize(sentence) for sentence in sentences]
tagged = [pos_tag(token) for token in tokens]
When needed, try to use:
print your_string.encode("utf-8")
I have no problems currently. The only issue is that for $50, it says:
word: $   meaning: dollar
word: 50  meaning: numeral, cardinal
This is correct.
And for €50, it says:
word: €50  meaning: -NONE-
This is incorrect.
With a space between the € sign and the number, it says:
word: €   meaning: noun, common, singular or mass
word: 50  meaning: numeral, cardinal
which is more correct.
