I'm writing a little Python script that parses Word docs and writes to a CSV file. However, some of the docs contain Unicode characters that my script can't process correctly.
Fancy quotes show up quite often (u'\u201c'). Is there a quick and easy (and smart) way of replacing those with the neutral ASCII quotes, so I can just write line.encode('ascii') to the CSV file?
I have tried to find the left quote and replace it:
val = line.find(u'\u201c')
if val >= 0: line[val] = '"'
But to no avail:
TypeError: 'unicode' object does not support item assignment
Is what I've described a good strategy? Or should I just set up the csv to support utf-8 (though I'm not sure if the application that will be reading the CSV wants utf-8)?
Thank you
You can use the Unidecode package to automatically convert all Unicode characters to their nearest pure ASCII equivalent.
from unidecode import unidecode
line = unidecode(line)
This will handle both directions of double quotes as well as single quotes, em dashes, and other things that you probably haven't discovered yet.
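For instance, here's a quick sketch of what unidecode does with a few common typographic characters (the exact output can vary slightly between Unidecode versions):
from unidecode import unidecode

# Curly quotes, an apostrophe, an em dash and an ellipsis all map to plain ASCII.
samples = [u'\u201cquoted\u201d', u'it\u2019s', u'a\u2014b', u'wait\u2026']
for s in samples:
    print(unidecode(s))  # "quoted"   it's   a--b   wait...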
Edit: a comment points out that if your language isn't English, you may find ASCII too restrictive. Here's an adaptation of the above code that uses a whitelist of characters that shouldn't be converted.
>>> from unidecode import unidecode
>>> whitelist = set('µÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ')
>>> line = '\u201cRésumé\u201d'
>>> print(line)
“Résumé”
>>> line = ''.join(c if c in whitelist else unidecode(c) for c in line)
>>> print(line)
"Résumé"
You can't assign to a string's characters, because strings are immutable.
You can, however, just use the re module, which might be the most flexible way to do this:
import re
newline = re.sub(u'\u201c', '"', line)
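If you'd rather not add a dependency, you can extend the same idea with a small mapping of the typographic characters you care about (a minimal sketch; the table below is illustrative, not exhaustive):
# Map common "fancy" punctuation to ASCII equivalents.
PUNCT_MAP = {
    u'\u201c': u'"',    # left double quote
    u'\u201d': u'"',    # right double quote
    u'\u2018': u"'",    # left single quote
    u'\u2019': u"'",    # right single quote
    u'\u2013': u'-',    # en dash
    u'\u2014': u'--',   # em dash
    u'\u2026': u'...',  # ellipsis
}

def asciify_punct(line):
    for fancy, plain in PUNCT_MAP.items():
        line = line.replace(fancy, plain)
    return line

line = asciify_punct(line)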
Related
In Python 2.7, I can embed Unicode strings directly in Python source code.
In the following code,
#!/usr/bin/python
#coding:utf-8
a = u'我很好,你呢?'
with open('test.txt', 'wb') as f:
f.write(repr(a))
What I expect is giving me back a txt with following wording
u'\u6211\u5f88\u597d\u002C\u4f60\u5462\u003F'
but turns out it is
u'\u6211\u5f88\u597d,\u4f60\u5462?'
Why isn't the punctuation escaped? Is there any way to escape the punctuation too?
Updated:
I'll take @Blckknght's advice from the comments, since using another encoding is fine for my purpose, but I'm still open to answers that show how to save the punctuation as escaped Python strings too. Thanks.
repr() only escapes characters outside the ASCII range, so ASCII punctuation like , and ? is left as-is. You have to write your own function if you want every character displayed as an escape:
#coding:utf8
def my_repr(s):
    return "u'" + ''.join(r'\u{:04x}'.format(ord(c)) for c in s) + "'"

s = u'我很好,你呢?'
print my_repr(s)
Output:
u'\u6211\u5f88\u597d\u002c\u4f60\u5462\u003f'
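If the goal is to write that escaped form back to a file, as in the question, something along these lines should work (a sketch that reuses the question's file name):
s = u'我很好,你呢?'
with open('test.txt', 'wb') as f:
    f.write(my_repr(s))  # the escaped form is pure ASCII, so writing it as bytes is fine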
(But this feels like an XY problem.)
I am trying to search for emoticons in Python strings.
So I have, for example,
em_test = ['\U0001f680']
print(em_test)
['🚀']
test = 'This is a test string 💰💰🚀'
if any(x in test for x in em_test):
    print("yes, the emoticon is there")
else:
    print("no, the emoticon is not there")
yes, the emoticon is there
and if I search for em_test in
'This is a test string 💰💰🚀'
I can actually find it.
So I have made a CSV file with all the emoticons I want, defined by their Unicode escape sequences.
The CSV looks like this:
\U0001F600
\U0001F601
\U0001F602
\U0001F923
and when I import it and print it I actually do not get the emoticons, but rather just the text representation:
['\\U0001F600',
'\\U0001F601',
'\\U0001F602',
'\\U0001F923',
...
]
and hence I cannot use this to search for these emoticons in another string...
I know that the double backslash \\ is just the representation of a single backslash, but somehow the Unicode escape is not being interpreted... I don't know what I'm missing.
Any suggestions?
You can decode those Unicode escape sequences with .decode('unicode-escape'). However, .decode is a bytes method, so if those sequences are text rather than bytes you first need to encode them into bytes. Alternatively, you can (probably) open your CSV file in binary mode in order to read those sequences as bytes rather than as text strings.
Just for fun, I'll also use unicodedata to get the names of those emojis.
import unicodedata as ud
emojis = [
    '\\U0001F600',
    '\\U0001F601',
    '\\U0001F602',
    '\\U0001F923',
]

for u in emojis:
    s = u.encode('ASCII').decode('unicode-escape')
    print(u, ud.name(s), s)
Output:
\U0001F600 GRINNING FACE 😀
\U0001F601 GRINNING FACE WITH SMILING EYES 😁
\U0001F602 FACE WITH TEARS OF JOY 😂
\U0001F923 ROLLING ON THE FLOOR LAUGHING 🤣
This should be much faster than using ast.literal_eval. And if you read the data in binary mode it will be even faster since it avoids the initial decoding step while reading the file, as well as allowing you to eliminate the .encode('ASCII') call.
You can make the decoding a little more robust by using
u.encode('Latin1').decode('unicode-escape')
but that shouldn't be necessary for your emoji data. And as I said earlier, it would be even better if you open the file in binary mode to avoid the need to encode it.
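For example, a sketch of the binary-mode approach (assuming the file is named emojis.csv and holds one escape sequence per row; both of those are assumptions about your setup):
# Read the file as bytes, one sequence per line, and decode the \U.... escapes.
emojis = []
with open('emojis.csv', 'rb') as f:
    for row in f:
        emojis.append(row.strip().decode('unicode-escape'))

test = 'This is a test string 💰💰🚀'
print(any(x in test for x in emojis))  # True if any of the emojis occur in test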
1. Keeping your CSV as-is:
It's a bloated solution, but using ast.literal_eval works:
import ast
s = '\\U0001F600'
x = ast.literal_eval('"{}"'.format(s))
print(hex(ord(x)))
print(x)
I get 0x1f600 (which is the correct character code) and the emoticon character (😀). (I had to copy/paste a strange character from my console into this answer text field, but that's a console issue on my end; otherwise it works.)
Just surround the value with quotes so that ast parses the input as a string literal.
2. Using character codes directly:
Maybe you'd be better off storing the character codes themselves instead of the \U format:
print(chr(0x1F600))
does exactly the same thing (so ast is slightly overkill).
Your CSV could contain:
0x1F600
0x1F601
0x1F602
0x1F923
Then chr(int(row[0], 16)) would do the trick when reading it. For example, with one code per row in the CSV (taken from the first column):
with open("codes.csv") as f:
cr = csv.reader(f)
codes = [int(row[0],16) for row in cr]
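From there, converting the codes back to characters gives you exactly what the search in the question needs (a small usage sketch building on the codes list above):
em_test = [chr(code) for code in codes]

test = 'This is a test string 💰💰🚀'
if any(x in test for x in em_test):
    print("yes, the emoticon is there")
else:
    print("no, the emoticon is not there")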
I'm trying to find the index (or indices) of a certain character in a UTF-8 encoded string in a foreign language (for example the character: ش).
I have tried unicode.find('ش'), word.find(u'ش'), word.find(u'\uش'), and also regular expressions: re.compile(u'\uش'), all to no avail. The funny thing is that in Visual Studio (my IDE, using IronPython), in debug mode, word.find(u'\uش') returns the correct index in the variable watch window, but it doesn't in the actual code (it returns index = -1).
I'm reading the strings from a file using the following command:
file = codecs.open(file, 'r', 'utf-8')
Is there something I'm missing? Or is there another way to approach this?
Once you use codecs to read the file, it's no longer UTF-8 bytes; it's Python's internal Unicode string representation. This should be completely compatible with Unicode literals in your program.
>>> line=u'abcش'
>>> line.find(u'ش')
3
Edit: My previous test may have been misleading because both strings were entered through the IDE. Here's a better example:
>>> f = codecs.open(r'c:\temp\temp.txt', 'r', 'utf-8-sig')
>>> line = f.readline()
>>> print line
This is a test.ش
>>> line.find(u'\u0634')
15
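If you need every occurrence rather than just the first, a simple sketch over the decoded line (enumerate walks the Unicode string character by character):
line = u'This is a test.\u0634 and another \u0634'
indices = [i for i, ch in enumerate(line) if ch == u'\u0634']
print(indices)  # [15, 29]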
I'm trying to store a string and then tokenize it with NLTK in Python, but I can't understand why, after tokenizing it (which creates a list), I can't see the strings in the list.
Can anyone help me, please?
Here is the code:
#a="Γεια σου"
#b=nltk.word_tokenize(a)
#b
['\xc3\xe5\xe9\xe1', '\xf3\xef\xf5']
I just want to be able to see the contents of the list normally.
Thanks in advance.
You are using Python 2, where unprefixed quotes denote a byte string as opposed to a character string (if you're not sure about the difference, read this). Either switch to Python 3, where this has been fixed, or prefix all character strings with u and print the strings (as opposed to showing their repr, which differs in Python 2.x):
>>> import nltk
>>> a = u'Γεια σου'
>>> b = nltk.word_tokenize(a)
>>> print(u'\n'.join(b))
Γεια
σου
You can see the strings. The characters are represented by escape sequences because of your terminal encoding settings. Configure your terminal to accept input, and present output, in UTF-8.
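For completeness, here's roughly how the same session looks in Python 3, where word_tokenize returns ordinary str objects and the list prints readably (this assumes NLTK and its punkt tokenizer data are installed):
import nltk

a = 'Γεια σου'
b = nltk.word_tokenize(a)
print(b)  # ['Γεια', 'σου']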