Unicode object to a list - python

I have a UTF-8 text corpus that I can read easily in Python 2.7:
sentence = codecs.open("D:\\Documents\\files\\sentence.txt", "r", encoding="utf8")
sentence = sentence.read()
> This is my sentence in the right format
However, when I pass this text corpus to a list (for example, for tokenizing):
tokens = sentence.tokenize()
and print it in the notebook, I get escape sequences instead of the characters, like:
(u'\ufeff\ufeffFaux,', u'Tunisie')
(u'Tunisie', u"l'\xc9gypte,")
Whereas I would like normal characters, just like in my original import.
So my question is: how can I put unicode objects in a list without getting strange bit/ASCII characters?

It's all in how you print. Python 2 displays lists using ASCII-only characters and substituting backslash escape codes for non-ASCII characters. This is to make it easy to see hidden characters that normal printing would make invisible, like the double byte-order-mark (BOM) \ufeff you see in your strings. Printing individual string items will display them correctly.
Many examples
Original strings:
>>> s = (u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t = (u'Tunisie', u"l'\xc9gypte,")
Displaying at the interactive prompt:
>>> s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t
(u'Tunisie', u"l'\xc9gypte,")
>>> print s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> print t
(u'Tunisie', u"l'\xc9gypte,")
Printing individual strings from the tuples:
>>> print s[0]
Faux,
>>> print s[1]
Tunisie
>>> print t[0]
Tunisie
>>> print t[1]
l'Égypte,
>>> print ' '.join(s)
Faux, Tunisie
>>> print ' '.join(t)
Tunisie l'Égypte,
A way to print tuples without escape codes:
>>> print "('"+"', '".join(s)+"')"
('Faux,', 'Tunisie')
>>> print "('"+"', '".join(t)+"')"
('Tunisie', 'l'Égypte,')
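The same idea can be wrapped in a small helper that also drops the byte-order marks before display. This is only a sketch: `format_items` is a made-up name, and the `u""` literals are there so the snippet should run under both Python 2.7 and Python 3.3+.

```python
def format_items(items):
    """Build a tuple-like display string without backslash escapes."""
    # Drop byte-order marks (U+FEFF), which are invisible when printed.
    cleaned = [item.replace(u"\ufeff", u"") for item in items]
    # Quote each item by hand instead of relying on repr().
    return u"(" + u", ".join(u"'" + c + u"'" for c in cleaned) + u")"


s = (u"\ufeff\ufeffFaux,", u"Tunisie")
print(format_items(s))
```

Note that the tuple itself is left unchanged; only the displayed string is cleaned up.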

Hm, codecs.open(...) returns a "wrapped version of the underlying file object" then you overwrite this variable with the result from executing the read method on that object. Brave, irritating - but ok ;-)
When you type say an äöüß into your "notebook", does it show like "this" or do you see some \uxxxxx instead?
The default value for codecs.open(...) is errors=strict so if this is the same environment for all samples, this should work.
I understand that when you write "print it" you print the list, which is different from printing the content of the list.
Sample (taking a tab typed as \t into a normal "byte" string - this is python 2.7.11):
>>> a="\t"
>>> print a # below is an expanded tab
>>> a
'\t'
>>> [a]
['\t']
>>> print [a]
['\t']
>>> for element in [a]:
... print element
...
>>> # above is an expanded tab

Related

Python prevent decoding HEX to ASCII while removing backslashes from my Var

I want to strip some unwanted symbols from my variable, in this case backslashes. I am using a hex number; some short example code is shown below. I don't want Python to convert my hex to ASCII. How can I prevent this from happening? I have some long shellcode for asm to work with later, and removing \ by hand is a long process. I know there are other ways, such as echo -e "x\x\x\x" > output, but my whole script will be written in Python.
Thanks
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> b = a.strip("\\")
>>> print b
1�Phtv
>>> a = "\x31\x32\x33\x34\x35\x36"
>>> b = a.strip("\\")
>>> print b
123456
At the end I would like it to print my var:
>>> print b
x31x32x33x34x35x36
There are no backslashes in your variable:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(a)
1ÀPhtv
Take newline as an example: writing "\n" in Python gives you a string with one character (a newline) and no backslashes. See the string literals docs for the full syntax of these escapes.
Now, if you really want to write a string containing such backslashes, you can do it with the r prefix:
>>> a = r"\x31\xC0\x50\x68\x74\x76"
>>> print(a)
\x31\xC0\x50\x68\x74\x76
>>> print(a.replace('\\', ''))
x31xC0x50x68x74x76
But if you want to convert a regular string to hex-coded symbols, you can do it character by character, converting it to number ("\x31" == "1" --> 49), then to hex ("0x31"), and finally stripping the first character:
>>> a = "\x31\xC0\x50\x68\x74\x76"
>>> print(''.join([hex(ord(x))[1:] for x in a]))
x31xc0x50x68x74x76
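One caveat with hex(ord(x))[1:] is that values below 0x10 lose their leading zero ("\x0a" becomes xa). A possible variant that always emits two hex digits is sketched below; `to_hex_text` is a hypothetical name, and the input is assumed to be a byte string (a plain str in Python 2, bytes in Python 3).

```python
def to_hex_text(data):
    """Return each byte of `data` as 'x' followed by two lowercase hex digits."""
    # bytearray yields integers in both Python 2 and Python 3.
    return "".join("x%02x" % byte for byte in bytearray(data))


shellcode = b"\x31\xC0\x50\x68\x74\x76"
print(to_hex_text(shellcode))
```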
There are two problems in your code.
First, the simple one:
strip() only removes characters from the beginning and end of a string, not from the middle. So you should use replace("\\", ""). This replaces every backslash with "", which is the same as removing it.
The second problem is Python's handling of backslashes:
To get your example working, you need to put an 'r' in front of your string to indicate that it is a raw string: a = r"\x31\xC0\x50\x68\x74\x76". In a raw string, a backslash doesn't escape a character but simply stays a backslash.
>>> r"\x31\xC0\x50\x68\x74\x76"
'\\x31\\xC0\\x50\\x68\\x74\\x76'
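To make the difference between the two methods concrete, here is a short sketch on a shortened raw string (the variable names are only illustrative):

```python
raw = r"\x31\xC0\x50"

# strip() only trims matching characters at the two ends of the string.
stripped = raw.strip("\\")
# replace() removes every occurrence, wherever it appears.
replaced = raw.replace("\\", "")

print(stripped)
print(replaced)
```

Only the leading backslash disappears in `stripped`, because the string ends in 0 rather than a backslash; `replaced` contains no backslashes at all.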

decoding a chinese stopwords file and appending to a list

I am trying to read a Chinese stopwords file and append the characters to a list. This is my code:
word_list=[]
with open("stop-words_chinese_1_zh.txt", "r") as f:
    for row in f:
        decoded=row.decode("utf-8")
        print decoded
        word_list.append(decoded)
print word_list[:10]
This is my output. Decoded looks fine, but after I append decoded to a list, it reverts back to the undecoded characters.
着
诸
自
[u'\u7684\r\n', u'\u4e00\r\n', u'\u4e0d\r\n', u'\u5728\r\n', u'\u4eba\r\n', u'\u6709\r\n', u'\u662f\r\n', u'\u4e3a\r\n', u'\u4ee5\r\n', u'\u4e8e\r\n']
The list hasn't reverted to the undecoded characters. If you print the type of the element in the list:
>>> print type(word_list[0])
You'd get:
<type 'unicode'>
So there isn't anything wrong with your list. Now we turn our attention to print. When you print an object, Python 2 prints whatever that object's __str__ method returns. In the case of a list, however, __str__ calls repr on each element, which returns the Python representation string of that element instead.
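This container behavior can be checked with a toy class whose two methods return visibly different text. `Word` is a hypothetical class invented for this illustration; it is not part of the question's code.

```python
class Word(object):
    """Toy object whose str() and repr() are easy to tell apart."""

    def __init__(self, text):
        self.text = text

    def __str__(self):
        return "str:" + self.text

    def __repr__(self):
        return "repr:" + self.text


w = Word("abc")
single = str(w)      # the object's own __str__ is used
in_list = str([w])   # the list falls back to repr() for its elements

print(single)
print(in_list)
```

The first line printed uses the str form, while the list display uses the repr form, which is the same mechanism that produces the \uXXXX escapes for unicode elements in Python 2.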
The behavior that you want here is to have str invoked instead of repr on each element in the list. There is one caveat: in Python 2, str will attempt to encode a unicode object using the 'ascii' codec, which fails for any element containing non-ASCII characters. For the purpose of displaying on screen, you likely want whatever sys.stdout.encoding is, and it's usually 'UTF-8'.
Thus, to print a unicode list on screen:
>>> import sys
>>> print '[' + ','.join(w.encode(sys.stdout.encoding) for w in word_list) + ']'
Alternatively, we can pass in a unicode string and let print deal with the on-screen encoding:
>>> print u'[' + u','.join(word_list) + u']'
And one last thing: it appears that the elements in your word_list contain newline characters as well. You may want to strip them since you're building a list of stop words. Your final solution would be:
>>> print u'[' + u','.join(w.strip() for w in word_list) + u']'
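Another option is to strip the line endings while reading, so the list never contains them. The sketch below assumes one stop word per line and a UTF-8 file; `load_stop_words` is a made-up helper, and a temporary file stands in for the real stop-word list so the snippet is self-contained. io.open is used because it accepts an encoding argument in both Python 2.7 and Python 3.

```python
import io
import os
import tempfile


def load_stop_words(path):
    """Read one stop word per line, dropping line endings and blank lines."""
    with io.open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


# Demo with a throwaway file standing in for the real stop-word list.
handle, demo_path = tempfile.mkstemp(suffix=".txt")
with os.fdopen(handle, "wb") as f:
    f.write(u"\u7684\r\n\u4e00\r\n\u4e0d\r\n".encode("utf-8"))

word_list = load_stop_words(demo_path)
os.remove(demo_path)

print(len(word_list))  # 3
```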

How to make Python Interactive Shell print cyrillic symbols?

I'm using Pymorphy2 in my project as a Cyrillic morphological analyzer.
But when I try to print out the list of words, I get this:
>>> for t in terms:
...     p = morph.parse(t)
...     if 'VERB' in p[0].tag:
...         t = p[0].normal_form
...     elif 'NOUN' in p[0].tag:
...         t = p[0].lexeme[0][0]
...
>>> terms
[u'\u041f\u0430\u0432\u0435\u043b', u'\u0445\u043e\u0434\u0438\u0442', u'\u0434\u043e\u043c\u043e\u0439']
How can I print Russian characters in the Python shell?
You are seeing the repr of the unicode strings. If you loop over the list, or index it, and print each string, you will see the output you want.
In [4]: terms
Out[4]:
[u'\u041f\u0430\u0432\u0435\u043b',
u'\u0445\u043e\u0434\u0438\u0442',
u'\u0434\u043e\u043c\u043e\u0439'] # repr
In [5]: print terms[0] # str
Павел
In [6]: print terms[1]
ходит
If you want them all printed and to look like a list, use str.format and str.join:
terms = [u'\u041f\u0430\u0432\u0435\u043b',
u'\u0445\u043e\u0434\u0438\u0442',
u'\u0434\u043e\u043c\u043e\u0439']
print(u"[{}]".format(",".join(terms)))
Output:
[Павел,ходит,домой]

How to display the first few characters of a string in Python?

I just started learning Python but I'm sort of stuck right now.
I have a hash.txt file containing thousands of malware hashes in MD5, Sha1 and Sha5 respectively, separated by delimiters on each line. Below are two example lines I extracted from the .txt file.
416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f
56a99a4205a4d6cab2dcae414a5670fd|612aeeeaa8aa432a7b96202847169ecae56b07ee|d17de7ca4c8f24ff49314f0f342dbe9243b10e9f3558c6193e2fd6bccb1be6d2
My intention is to display the first 32 characters (MD5 hash) so the output will look something like this:
416d76b8811b0ddae2fdad8f4721ddbe 56a99a4205a4d6cab2dcae414a5670fd
Any ideas?
You can 'slice' a string very easily, just like you'd pull items from a list:
a_string = 'This is a string'
To get the first 4 letters:
first_four_letters = a_string[:4]
>>> 'This'
Or the last 6:
last_six_letters = a_string[-6:]
>>> 'string'
So applying that logic to your problem:
the_string = '416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f '
first_32_chars = the_string[:32]
>>> 416d76b8811b0ddae2fdad8f4721ddbe
Since there is a delimiter, you should use that instead of worrying about how long the md5 is.
>>> s = "416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f"
>>> md5sum, delim, rest = s.partition('|')
>>> md5sum
'416d76b8811b0ddae2fdad8f4721ddbe'
Alternatively
>>> md5sum, sha1sum, sha5sum = s.split('|')
>>> md5sum
'416d76b8811b0ddae2fdad8f4721ddbe'
>>> sha1sum
'd4f656ee006e248f2f3a8a93a8aec5868788b927'
>>> sha5sum
'12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f'
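Applied to many lines at once, the delimiter approach might look like the sketch below. `md5_column` is a hypothetical helper name, and the two sample lines are the ones from the question; it assumes the MD5 is always the first '|'-separated field.

```python
def md5_column(lines):
    """Return the first '|'-separated field of every non-blank line."""
    return [line.split("|", 1)[0].strip() for line in lines if line.strip()]


sample = [
    "416d76b8811b0ddae2fdad8f4721ddbe|d4f656ee006e248f2f3a8a93a8aec5868788b927|12a5f648928f8e0b5376d2cc07de8e4cbf9f7ccbadb97d898373f85f0a75c47f\n",
    "56a99a4205a4d6cab2dcae414a5670fd|612aeeeaa8aa432a7b96202847169ecae56b07ee|d17de7ca4c8f24ff49314f0f342dbe9243b10e9f3558c6193e2fd6bccb1be6d2\n",
]

print(" ".join(md5_column(sample)))
```

Since an open file yields its lines when iterated, the same function should also accept the file object for hash.txt directly in place of the list.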
If you want the first 2 letters and the last 2 letters of a string, you can use the following code:
>>> name = "India"
>>> name[0:2]
'In'
>>> name[-2:]
'ia'

Retain the "\n"

>>> t = "first%s\n"
>>> t = t %("second")
>>> print t
firstsecond
Is there any way I could retain the "\n" at the end and get "firstsecond\n" as the output?
You need to escape the backslash:
>>> t = "first%s\\n"
>>> t = t %("second")
>>> print t
firstsecond\n
or use raw strings:
>>> t = r"first%s\n"
>>> t = t %("second")
>>> print t
firstsecond\n
print "firstsecond\n" displays "firstsecond" and moves the cursor to the next line. So you don't see a backslash followed by n, because displaying a string interprets special characters such as \n.
repr() prevents the interpretation, so print repr("firstsecond\n") displays 'firstsecond\n'
So, which do you want?
t being "firstsecond\n", displaying repr(t) to verify that the character \n is in it?
Or t being "firstsecond\\n", so that print t will display firstsecond\n?
See:
t = "first%s\n"
print repr(t),len(t)
t = t %("second")
print repr(t),len(t)
print '-------------------'
t = "first%s\\n" # the same as r"first%s\n"
print repr(t),len(t)
t = t %("second")
print repr(t),len(t)
result
'first%s\n' 8
'firstsecond\n' 12
-------------------
'first%s\\n' 9
'firstsecond\\n' 13
But don't misinterpret a display like this:
'first%s\\n'
The two backslashes \\ stand for ONE backslash in the value. The doubled \\ appears only on screen, to show the backslash in escaped form. Otherwise, it would be impossible to tell the two characters \ followed by n apart from the single character \n.
Depending on what exactly you need, you may also want to look at repr().
>>> s = "firstsecond\n"
>>> print repr(s)
'firstsecond\n'
