I'm working on Arabic light stemmer package in python
I want to convert the result from any operation from Unicode to Arabic letters.
My code:
import tashaphyne
form tashaphyne import *
>>> text = u"الْعَرَبِيّةُ"
>>> strip_tashkeel(text)
I want it to display "العربية" not it's Unicode
You can convert unicode strings to any other encoding using the encode() function like so:
text.encode('utf8')
Here is a list of possible encodings in Python 2.7.
You see u'\u0627\u0644\u0639\u0631\u0628\u064a\u0629' instead of "العربية" because repr() representation for unicode strings is meant to be displayable even on 7-bit terminals.
To see actual scripts instead of unicode, do either print _ after your call to strip_tashkeel(), or do print strip_tashkeel(text) directly.
Related
I'm getting back from a library what looks to be an incorrect unicode string:
>>> title
u'Sopet\xc3\xb3n'
Now, those two hex escapes there are the UTF-8 encoding for U+00F3 LATIN SMALL LETTER O WITH ACUTE. So far as I understand, a unicode string in Python should have the actual character, not the the UTF-8 encoding for the character, so I think this is incorrect and presumably a bug either in the library or in my input, right?
The question is, how do I (a) recognize that I have UTF-8 encoded text in my unicode string, and (b) convert this to a proper unicode string?
I'm stumped on (a), as there's nothing wrong, encoding-wise, about that original string (i.e, both are valid characters in their own right, u'\xc3\xb3' == ó, but they're not what's supposed to be there)
It looks like I can achieve (b) by eval()ing that repr() output minus the "u" in front to get a str and then decoding the str with UTF-8:
>>> eval(repr(title)[1:]).decode("utf-8")
u'Sopet\xf3n'
>>> print eval(repr(title)[1:]).decode("utf-8")
Sopetón
But that seems a bit kludgy. Is there an officially-sanctioned way to get the raw data out of a unicode string and treat that as a regular string?
a) Try to put it through the method below.
b)
>>> u'Sopet\xc3\xb3n'.encode('latin-1').decode('utf-8')
u'Sopet\xf3n'
You should use:
>>> title.encode('raw_unicode_escape')
Python2:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape'))
Python3:
print(u'\xd0\xbf\xd1\x80\xd0\xb8'.encode('raw_unicode_escape').decode('utf8'))
I want to run a Python source file that contains unicode (utf-8) characters in the source. I am aware of the fact that this can be done by adding the comment # -*- coding: utf-8 -*- in the beginning. However, I wish to do it without using this method.
One way I could think of was writing the unicode strings in escaped form. For example,
Edit: Updated Source. Added Unicode comments.
# Printing naïve and 男孩
def fxn():
print 'naïve'
print '男孩'
fxn()
becomes
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
print 'na\xc3\xafve'
print '\xe7\x94\xb7\xe5\xad\xa9'
fxn()
I have two questions regarding the above method.
How do I convert the first code snippet, using Python, into its equivalent that
follows it? That is, only unicode sequences should be written in
escaped form.
Is the method foolproof considering only unicode (utf-8) characters are used? Is there something that can go wrong?
Your idea is generally sound but will break in Python 3 and will cause a headache when you manipulating and writing your strings in Python 2.
It's a good idea to use Unicode strings, not regular strings when dealing with non-ASCII.
Instead, you can encode your characters as Unicode (not UTF-8) escape sequences in Unicode strings.
u'na\xefve'
u'\u7537\u5b69'
note the u prefix
Your code is now encoding agnostic.
If you only use byte strings, and save your source file encoded as UTF-8, your byte strings will contain UTF-8-encoded data. No need for the coding statement (although REALLY strange that you don't want to use it...it's just a comment). The coding statement let's Python know the encoding of the source file, so it can decode Unicode strings correctly (u'xxxxx'). If you have no Unicode strings, it doesn't matter.
For your questions, no need to convert to escape codes. If you encode the file as UTF-8, you can use the more readable characters in your byte strings.
FYI, that won't work for Python 3, because byte strings cannot contain non-ASCII in that version.
That said, here's some code that will convert your example as requested. It reads the source assuming it is encoded in UTF-8, then uses a regular expression to locate all non-ASCII characters. It passes them through a conversion function to generate the replacement. This should be safe, since non-ASCII can only be used in string literals and constants in Python 2. Python 3, however, allows non-ASCII in variable names so this wouldn't work there.
import io
import re
def escape(m):
char = m.group(0).encode('utf8')
return ''.join(r'\x{:02x}'.format(ord(b)) for b in char)
with io.open('sample.py',encoding='utf8') as f:
content = f.read()
new_content = re.sub(r'[^\x00-\x7f]',escape,content)
with io.open('sample_new.py','w',encoding='utf8') as f:
f.write(new_content)
Result:
# Printing na\xc3\xafve and \xe7\x94\xb7\xe5\xad\xa9
def fxn():
print 'na\xc3\xafve'
print '\xe7\x94\xb7\xe5\xad\xa9'
fxn()
question 1:
try to use:
print u'naïve'
print u'长者'
question 2:
If you type the sentences by keyboard and Chinese input software, everything should be OK. But if you copy and paste sentence from some web pages, you should consider other encode format such as GBK,GB2312 and GB18030
This snippet of Python 3 should convert your program correctly to work in Python 2.
def convertchar(char): #converts individual characters
if 32<=ord(char)<=126 or char=="\n": return char #if normal character, return it
h=hex(ord(char))[2:]
if ord(char)<256: #if unprintable ASCII
h=" "*(2-len(h))+h
return "\\x"+h
elif ord(char)<65536: #if short unicode
h=" "*(4-len(h))+h
return "\\u"+h
else: #if long unicode
h=" "*(8-len(h))+h
return "\\U"+h
def converttext(text): #converts a chunk of text
newtext=""
for char in text:
newtext+=convertchar(char)
return newtext
def convertfile(oldfilename,newfilename): #converts a file
oldfile=open(oldfilename,"r")
oldtext=oldfile.read()
oldfile.close()
newtext=converttext(oldtext)
newfile=open(newfilename,"w")
newfile.write(newtext)
newfile.close()
convertfile("FILE_TO_BE_CONVERTED","FILE_TO_STORE_OUTPUT")
First a simple remarl: as you are using byte strings in a Python2 script, the # -*- coding: utf-8 -*- has simply no effect. It only helps to convert the source byte string to an unicode string if you had written:
# -*- coding: utf-8 -*-
...
utxt = u'naïve' # source code is the bytestring `na\xc3\xafve'
# but utxt must become the unicode string u'na\xefve'
Simply it might be interpreted by clever editors to automatically use a utf8 charset.
Now for the actual question. Unfortunately, what you are asking for is not really trivial: idenfying in a source file what is in a comment and in a string simply requires a Python parser... And AFAIK, if you use the parser of ast modules you will lose your comments except for docstrings.
But in Python 2, non ASCII characters are only allowed in comments and litteral strings! So you can safely assume that if the source file is a correct Python 2 script containing no litteral unicode string(*), you can safely transform any non ascii character in its Python representation.
A possible Python function reading a raw source file from a file object and writing it after encoding in another file object could be:
def src_encode(infile, outfile):
while True:
c = infile.read(1)
if len(c) < 1: break # stop on end of file
if ord(c) > 127: # transform high characters
c = "\\x{:2x}".format(ord(c))
outfile.write(c)
An nice property is that it works whatever encoding you use, provided the source file is acceptable by a Python interpreter and does not contain high characters in unicode litterals(*), and the converted file will behave exactly the same as the original one...
(*) A problem will arise if you use unicode litterals in an encoding other that Latin1, because the above function will behave as if the file contained the declaration # -*- coding: Latin1 -*-: u'é' will be translated correctly as u'\xe9' if original encoding is latin1 but as u'\xc3\xc9' (not what is expected...) if original encoding is utf8, and I cannot imagine a way to process correctly both litteral byte strings and unicode byte strings without fully parsing the source file...
I want to transfer unicode into asci characters, transfer them through a channel that only accepts asci characters and then transform them back into proper unicode.
I'm dealing with the unicode characters like ɑ in Python 3.5.
ord("ɑ") gives me 63 with is the same as what ord("?") also gives me 63. This means simply using ord() and chr() doesn't work. How do I get the right conversion?
I want to transfer unicode into ascii characters, transfer them through a channel that only accepts ascii characters and then transform them back into proper unicode.
>>> import json
>>> json.dumps('ɑ')
'"\\u0251"'
>>> json.loads(_)
'ɑ'
You can convert a number to a hex string with "0x%x" %255 where 255 would be the number you want to convert to hex.
To do this with ord, you could do "0x%x" %ord("a") or whatever character you want.
You can remove the 0x part of the string if you don't need it. If you want to hex to be capitalized (A-F) use "0x%X" %ord("a")
I found my error. I used Python via the Windows console and the Windows console mishandeled the unicode.
Hello i am experimenting with Python and LXML, and I am stuck with the problem of extracting data from the webpage which contains windows-1250 characters like ž and ć.
tree = html.fromstring(new.text,parser=hparser)
title = tree.xpath('//strong[text()="Title"]')
opis[g] = opis[g].tail.encode('utf-8')[2:]
I get text responses containing something like this :
\xc2\x9ea
instead of characters. Then I have the problem with storing into database
So how can I accomplish this? I tried put 'windows-1250' instead utf8 without success. Can I convert this code to original characters somehow?
Try:
text = "\xc2\x9ea"
print text.decode('windows-1250').encode('utf-8')
Output:
ža
And save nice chars in your DB.
If encoding to UTF-8 results in b'\xc2\x9ea', then that means the original string was '\x9ea'. Whether lxml didn't do things correctly, or something happened on your end (perhaps a parser configuration issue), the fact is that you get the equivalent of this (Python 3.x syntax):
>>> '\x9ea'.encode('utf-8')
b'\xc2\x9ea'
How do you fix it? One error-prone way would be to encode as something other than UTF-8 that can properly handle the characters. It's error-prone because while something might work in one case, it might not in another. You could instead extract the character ordinals by mapping the character ordinals and work with the character ordinals instead:
>>> list(map((lambda n: hex(n)[2:]), map(ord, '\x9ea')))
['9e', '61']
That gets us somewhere because the bytes type has a fromhex method that can decode a string containing hexadecimal values to the equivalent byte values:
>>> bytes.fromhex(''.join(map((lambda n: hex(n)[2:]), map(ord, '\x9ea'))))
b'\x9ea'
You can use decode('cp1250') on the result of that to get ža, which I believe is the string you wanted. If you are using Python 2.x, the equivalent would be
from binascii import unhexlify
unhexlify(u''.join(map((lambda n: hex(n)[2:]), map(ord, u'\x9ea'))))
Note that this is highly destructive as it forces all characters in a Unicode string to be interpreted as bytes. For this reason, it should only be used on strings containing Unicode characters that fit in a single byte. If you had something like '\x9e\u724b\x61', that code would result in joining ['9e', '724b', '61'] as '9e724b61', and interpreting that using a single-byte character set such as CP1250 would result in something like 'žrKa'.
For that reason, more reliable code would replace ord with a function that throws an exception if 0 <= ord(ch) < 0x100 is false, but I'll leave that for you to code.
I have a list of strings with various different characters that are similar to latin ones, I get these from a website that I download from using urllib2. The website is encoded in utf-8. However, after trying quite a few variations, I can't figure out how to convert this to simple ASCII equivalent. So for example, one of the strings I have is:
u'Atl\xc3\xa9tico Madrid'
In plain text it's "Atlético Madrid", what I want, is to change it to just "Atletico Madrid".
If I use simple unidecode on this, I get "AtlA(c)tico Madrid". What am I doing wrong?
You have UTF-8 bytes in a Unicode string. That's not a proper Unicode string, that's a Mojibake:
>>> print u'Atl\xc3\xa9tico Madrid'
Atlético Madrid
Repair your string first:
>>> u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
u'Atl\xe9tico Madrid'
>>> print u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8')
Atlético Madrid
and Unidecode will give you what you expected:
>>> import unidecode
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid')
'AtlA(c)tico Madrid'
>>> unidecode.unidecode(u'Atl\xc3\xa9tico Madrid'.encode('latin1').decode('utf8'))
'Atletico Madrid'
Better still would be to read your data correctly in the first place; you appear to have decoded the data as Latin-1 (or perhaps the Windows CP-1252 codepage) rather than as UTF-8.