I'm trying to store a string and then tokenize it with NLTK in Python, but I can't understand why, after tokenizing it (which creates a list), I can't see the strings in the list.
Can anyone help me, please?
Here is the code:
#a="Γεια σου"
#b=nltk.word_tokenize(a)
#b
['\xc3\xe5\xe9\xe1', '\xf3\xef\xf5']
I just want to be able to see the contents of the list normally.
Thanks in advance.
You are using Python 2, where unprefixed quotes denote a byte string as opposed to a character string (if you're not sure about the difference, read this). Either switch to Python 3, where this has been fixed, or prefix all character string literals with u and print the strings (as opposed to showing their repr, which differs in Python 2.x):
>>> import nltk
>>> a = u'Γεια σου'
>>> b = nltk.word_tokenize(a)
>>> print(u'\n'.join(b))
Γεια
σου
You can see the strings. The characters are represented by escape sequences because of your terminal encoding settings. Configure your terminal to accept input, and present output, in UTF-8.
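To see the difference concretely, here is a quick Python 2 sketch (assuming a UTF-8 terminal):
>>> a = 'Γεια'     # byte string: the raw UTF-8 bytes
>>> u = u'Γεια'    # unicode string: the actual characters
>>> len(a), len(u)
(8, 4)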
So, I have a single byte in bytes format, looking something like this:
b'\xFF'
It is quite easy to understand that a single byte is represented by the two hex symbols (0-F) after '\x'.
But sometimes the pattern doesn't match, and there are more than two symbols after '\x'.
So, for example, if I use secrets.token_bytes() I can get something like this:
>>> import secrets
>>> secrets.token_bytes(4)
b't\xbcJ\xf0'
Or, using hashlib module:
>>> import hashlib
>>> hashlib.sha256('abc'.encode()).digest()
b'\xbax\x16\xbf\x8f\x01\xcf\xeaAA#\xde]\xae"#\xb0\x03a\xa3\x96\x17z\x9c\xb4\x10\xffa\xf2\x00\x15\xad'
So, can someone please explain what those additional symbols are and how they are generated?
Thanks!
It's a quirk of the way Python prints byte strings. If the byte value is one of the printable ASCII characters it will print that character; otherwise it prints the hex escape.
Show bytes(range(0x100)) to see it visually.
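For instance, a couple of REPL lines that illustrate the rule:
>>> b'\x41\x42\x43'    # all three bytes are printable ASCII, so the characters are shown
b'ABC'
>>> b'\x00\x41\xff'    # non-printable bytes keep their hex escapes
b'\x00A\xff'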
To get a string that consistently uses hex escapes, you need to build it yourself.
print(''.join(f'\\x{i:02x}' for i in bytes(range(0x100))))
Hello, I am experimenting with Python and lxml, and I am stuck on the problem of extracting data from a webpage that contains windows-1250 characters like ž and ć.
tree = html.fromstring(new.text, parser=hparser)
title = tree.xpath('//strong[text()="Title"]')
opis[g] = opis[g].tail.encode('utf-8')[2:]
I get text responses containing something like this:
\xc2\x9ea
instead of the characters. Then I have problems storing it in the database.
So how can I accomplish this? I tried putting 'windows-1250' instead of utf-8, without success. Can I somehow convert these codes back to the original characters?
Try:
text = "\xc2\x9ea"
print text.decode('windows-1250').encode('utf-8')
Output:
ža
And save nice chars in your DB.
If encoding to UTF-8 results in b'\xc2\x9ea', that means the original string was '\x9ea'. Whether lxml didn't do things correctly or something happened on your end (perhaps a parser configuration issue), the fact is that you get the equivalent of this (Python 3.x syntax):
>>> '\x9ea'.encode('utf-8')
b'\xc2\x9ea'
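Decoding those bytes takes you right back, confirming the round trip:
>>> b'\xc2\x9ea'.decode('utf-8')
'\x9ea'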
How do you fix it? One error-prone way would be to encode as something other than UTF-8 that can properly handle the characters. It's error-prone because while something might work in one case, it might not in another. You could instead extract the character ordinals and work with those:
>>> list(map((lambda n: hex(n)[2:]), map(ord, '\x9ea')))
['9e', '61']
That gets us somewhere because the bytes type has a fromhex method that can decode a string containing hexadecimal values to the equivalent byte values:
>>> bytes.fromhex(''.join(map((lambda n: hex(n)[2:]), map(ord, '\x9ea'))))
b'\x9ea'
You can use decode('cp1250') on the result of that to get ža, which I believe is the string you wanted. If you are using Python 2.x, the equivalent would be
from binascii import unhexlify
unhexlify(u''.join(map((lambda n: hex(n)[2:]), map(ord, u'\x9ea'))))
Note that this is highly destructive as it forces all characters in a Unicode string to be interpreted as bytes. For this reason, it should only be used on strings containing Unicode characters that fit in a single byte. If you had something like '\x9e\u724b\x61', that code would result in joining ['9e', '724b', '61'] as '9e724b61', and interpreting that using a single-byte character set such as CP1250 would result in something like 'žrKa'.
For that reason, more reliable code would zero-pad the hex digits (format(n, '02x') instead of hex(n)[2:], so that ordinals below 0x10 don't produce an odd-length string) and replace ord with a function that throws an exception if 0 <= ord(ch) < 0x100 is false, but I'll leave that for you to code.
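A minimal sketch of such a guard (my own illustration; the helper name byte_hex is made up):
def byte_hex(ch):
    # Reject characters that do not fit in a single byte; zero-pad the rest
    n = ord(ch)
    if not 0 <= n < 0x100:
        raise ValueError('character %r does not fit in one byte' % ch)
    return format(n, '02x')

print(bytes.fromhex(''.join(byte_hex(c) for c in '\x9ea')).decode('cp1250'))  # prints: ža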
I want to encode a string in UTF-8 and view the corresponding UTF-8 bytes individually. In the Python REPL the following seems to work fine:
>>> unicode('©', 'utf-8').encode('utf-8')
'\xc2\xa9'
Note that I’m using U+00A9 COPYRIGHT SIGN as an example here. The '\xC2\xA9' looks close to what I want — a string consisting of two separate code points: U+00C2 and U+00A9. (When UTF-8-decoded, it gives back the original string, '\xA9'.)
Then, I want the UTF-8-encoded string to be converted to a JSON-compatible string. However, the following doesn’t seem to do what I want:
>>> import json; json.dumps('\xc2\xa9')
'"\\u00a9"'
Note that it generates a string containing U+00A9 (the original symbol). Instead, I need the UTF-8-encoded string, which would look like "\u00C2\u00A9" in valid JSON.
TL;DR How can I turn '©' into "\u00C2\u00A9" in Python? I feel like I’m missing something obvious — is there no built-in way to do this?
If you really want "\u00c2\u00a9" as the output, give json a Unicode string as input.
>>> print json.dumps(u'\xc2\xa9')
"\u00c2\u00a9"
You can generate this Unicode string from the raw bytes:
s = unicode('©', 'utf-8').encode('utf-8')
s2 = u''.join(unichr(ord(c)) for c in s)
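Equivalently, a shorter route (my own suggestion, not part of the original answer): decoding the byte string as latin-1 maps each byte to the code point of the same value:
s2 = s.decode('latin-1')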
I think what you really want is "\xc2\xa9" as the output, but I'm not sure how to generate that yet.
I'm trying to find the index (or indices) of a certain character in a UTF-8 encoded string in a foreign language (for example the character: ش).
I have tried unicode.find('ش'), word.find(u'ش'), word.find(u'\uش'), and also regular expressions: re.compile(u'\uش'), to no avail. The funny thing is that in Visual Studio (my IDE, using IronPython) in debug mode, word.find(u'\uش') returns the correct index in the variable watch window, but it doesn't in the actual code (it returns index = -1).
I'm reading the strings from a file using the following command:
file = codecs.open(file, 'r', 'utf-8')
Is there something I'm missing? Or is there another way to approach this?
Once you use codecs to read the file, it's no longer UTF-8, it's an internal Unicode string representation. This should be completely compatible with Unicode literals in your program.
>>> line=u'abcش'
>>> line.find(u'ش')
3
Edit: My previous test may have been misleading because both strings were entered through the IDE. Here's a better example:
>>> f = codecs.open(r'c:\temp\temp.txt', 'r', 'utf-8-sig')
>>> line = f.readline()
>>> print line
This is a test.ش
>>> line.find(u'\u0634')
15
I'm trying to split this kind of line in Python:
aiburenshi 爱不忍释 "לא מסוגל להינתק, לא יכול להיפרד מדבר מרוב חיבתו אליו"
This line contains Hebrew, simplified Chinese and English.
If I have a tuple T for example, I would like to get the tuple to be T= (Hebrew string, English string, Chinese string).
The problem is that I can't figure out how to get the Unicode values of the Chinese or the Hebrew letters. Neither of these lines works:
print ((unicode("释","utf-8")).encode("utf-8"))
print ((unicode("א","utf-8")).encode("utf-8"))
And I get this error:
SyntaxError: Non-ASCII character '\xe9' in file split_or.py on line 9, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details
In Python 2, you need to open the file specifying an encoding like this:
import codecs
f = codecs.open("myfile.txt","r",encoding="utf-8")
In Python 3, you can just add the encoding option to any open() calls.
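For example, a minimal sketch (the filename is made up):
f = open("myfile.txt", "r", encoding="utf-8")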
This will guarantee that the file is correctly decoded. Note that this doesn't mean your print calls will work properly; that depends on many things (see for example http://www.pycs.net/users/0000323/stories/14.html, and that's just a start). It's better to either use a proper debugger or output to a file (which will again be opened with codecs.open()).
To get the actual codepoint (i.e. integer "value"), you can use the built-in ord():
>>> ord(u"£")
163
If you know the ranges for different languages, that's all you need. See this page or this page for the ranges.
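For instance, a minimal range-based sketch (it covers only the basic Hebrew and CJK blocks, and the helper name script_of is made up):
def script_of(ch):
    # Classify a character by its code-point range; everything else is 'other'
    n = ord(ch)
    if 0x0590 <= n <= 0x05FF:   # Hebrew block
        return 'hebrew'
    if 0x4E00 <= n <= 0x9FFF:   # CJK Unified Ideographs block
        return 'chinese'
    return 'other'

print(script_of(u'א'))   # hebrew
print(script_of(u'释'))  # chinese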
Otherwise, you might want to use unicodedata to look up stuff, like the bidirectional category:
>>> import unicodedata
>>> unicodedata.bidirectional(u"£")
'ET'   # 'E'uropean 'T'erminator
In Python 2, Unicode string constants need to be prefaced with the "u" character. (Note that unicode(..., "utf-8") must not be applied to a string that is already Unicode; just encode it directly. Also, the SyntaxError you quoted goes away once you declare the source encoding with a # -*- coding: utf-8 -*- comment at the top of the file, per PEP 263.)
print(u"释".encode("utf-8"))
print(u"א".encode("utf-8"))
In Python 3, string constants are Unicode by default.
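For comparison, the same thing in Python 3, where no prefix is needed and encoding yields a bytes object:
>>> "释".encode("utf-8")
b'\xe9\x87\x8a'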