How to make the Python interactive shell print Cyrillic symbols?

I'm using Pymorphy2 in my project as a Cyrillic morphological analyzer.
But when I try to print out the list of words, I get this:
>>> for t in terms:
...     p = morph.parse(t)
...     if 'VERB' in p[0].tag:
...         t = p[0].normal_form
...     elif 'NOUN' in p[0].tag:
...         t = p[0].lexeme[0][0]
...
>>> terms
[u'\u041f\u0430\u0432\u0435\u043b', u'\u0445\u043e\u0434\u0438\u0442', u'\u0434\u043e\u043c\u043e\u0439']
How can I make the Python shell print Russian characters?

You are seeing the repr of the unicode strings. If you loop over the list, or index into it, and print each string, you will see the output you want.
In [4]: terms
Out[4]:
[u'\u041f\u0430\u0432\u0435\u043b',
u'\u0445\u043e\u0434\u0438\u0442',
u'\u0434\u043e\u043c\u043e\u0439'] # repr
In [5]: print terms[0] # str
Павел
In [6]: print terms[1]
ходит
If you want them all printed and to look like a list, use str.format and str.join:
terms = [u'\u041f\u0430\u0432\u0435\u043b',
         u'\u0445\u043e\u0434\u0438\u0442',
         u'\u0434\u043e\u043c\u043e\u0439']
print(u"[{}]".format(",".join(terms)))
Output:
[Павел,ходит,домой]
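For reference, here is a minimal sketch of the same idea wrapped in a small helper. The name format_terms is hypothetical and is not part of Pymorphy2 or the standard library. The code is written to run on both Python 2.7 and Python 3. On Python 3 the repr of a list already shows Cyrillic text directly, so a helper like this mainly matters on Python 2.

```python
terms = [u'\u041f\u0430\u0432\u0435\u043b',
         u'\u0445\u043e\u0434\u0438\u0442',
         u'\u0434\u043e\u043c\u043e\u0439']


def format_terms(items):
    """Join the strings themselves, so no repr() escapes are involved."""
    return u"[{}]".format(u", ".join(items))


# Printing the joined string shows the characters, not \uXXXX escapes.
print(format_terms(terms))
```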


python convert "unicode" as list

I have a question about handling a return type in Python.
I have a database function that returns this as value:
(1,13616,,"My string, that can have comma",170.90)
I put this into a variable and checked its type:
print(type(var))
I got the result:
<type 'unicode'>
I want to convert this to a list and get the values separated by commas.
Ex.:
var[0] = 1
var[1] = 13616
var[2] = None
var[3] = "My string, that can have comma"
var[4] = 170.90
Is it possible?
Using standard library csv readers:
>>> import csv
>>> s = u'(1,13616,,"My string, that can have comma",170.90)'
>>> [var] = csv.reader([s[1:-1]])
>>> var[3]
'My string, that can have comma'
Some caveats:
var[2] will be an empty string, not None, but you can post-process that.
Numbers will be strings and also need post-processing, since csv cannot tell the difference between 0 and '0'.
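The post-processing mentioned in these caveats could look roughly like the sketch below. The helper names convert_field and parse_row are hypothetical. The conversion rules are assumptions chosen for illustration: an empty field becomes None, and anything that parses as a number becomes int or float. Because csv drops the quotes, a quoted field such as "42" would also be converted to a number.

```python
import csv


def convert_field(field):
    """Turn '' into None and numeric-looking text into int or float."""
    if field == "":
        return None
    for cast in (int, float):
        try:
            return cast(field)
        except ValueError:
            continue
    return field  # leave everything else as text


def parse_row(text):
    """Parse one '(a,b,,"c, d",1.5)' style row into a list of values."""
    inner = text.strip()[1:-1]      # drop the surrounding parentheses
    [fields] = csv.reader([inner])  # csv copes with the quoted comma
    return [convert_field(f) for f in fields]


row = parse_row('(1,13616,,"My string, that can have comma",170.90)')
print(row)
```

With the sample value from the question, row[2] is None and row[4] is the float 170.9.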
You can try the following:
b = []
for i in a:
    if i is not None:
        b.append(i)
    else:
        b.append(None)
print(type(b))
The issue is not with the comma inside the string.
This works fine:
a = (1,13616,"My string, that can have comma",170.90)
and this also works:
a = (1,13616,None,"My string, that can have comma",170.90)
but two consecutive commas (",,") are a syntax error in a tuple literal, because every position needs a value.
Unicode strings are (basically) just strings in Python 2 (in Python 3, remove the word "basically" from that sentence). They're written as literals by prefixing the string with u (compare raw strings r"something", or Python 3.6+ formatted strings f"{some_var}thing").
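Just strip off your parens and split by comma. You'll have to do some post-parsing if you want 170.90 instead of u'170.90' or None instead of u'', and note that a plain split also cuts the quoted string at its internal comma, as the output below shows. I'll leave that for you to decide.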
>>> var.strip(u'()').split(u',')
[u'1', u'13616', u'', u'"My string', u' that can have comma"', u'170.90']

Python: How to remove [' and ']?

I want to remove [' from start and '] characters from the end of a string.
This is my text:
"['45453656565']"
I need to have this text:
"45453656565"
I've tried to use str.replace
text = text.replace("['","");
but it does not work.
You need to strip your text by passing the unwanted characters to the str.strip() method:
>>> s = "['45453656565']"
>>>
>>> s.strip("[']")
'45453656565'
Or, if you want to convert it to an integer, simply pass the stripped result to int:
>>> try:
...     val = int(s.strip("[']"))
... except ValueError:
...     print("Invalid string")
...
>>> val
45453656565
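One detail about the str.strip() approach above is that its argument is treated as a set of characters to trim from both ends, not as a literal prefix or suffix. The sketch below illustrates this and contrasts it with str.replace, which removes exact substrings wherever they occur. The second sample string is made up for illustration.

```python
s = "['45453656565']"

# The argument is a set of characters, so their order does not matter.
assert s.strip("[']") == s.strip("'][") == "45453656565"

# Only the ends are trimmed; a quote in the middle is left alone.
assert "['12'34']".strip("[']") == "12'34"

# str.replace works on exact substrings, so both ends need their own call.
cleaned = s.replace("['", "").replace("']", "")
print(cleaned)
```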
Using re.sub:
>>> my_str = "['45453656565']"
>>> import re
>>> re.sub(r"['\]\[]", "", my_str)
'45453656565'
You could loop over the characters and keep only the digits:
>>> number_array = "['34325235235']"
>>> int(''.join(c for c in number_array if c.isdigit()))
34325235235
This solution works for both "['34325235235']" and '["34325235235"]', and for any other mix of digits and non-digit characters.
You can also import re and use a regular expression to get it:
>>> import re
>>> theString = "['34325235235']"
>>> int(re.sub(r'\D', '', theString)) # Optionally parse to int
Instead of hacking your data by stripping brackets, you should edit the script that created it to print out just the numbers. E.g., instead of lazily doing
output.write(str(mylist))
you can write
for elt in mylist:
    output.write(elt + "\n")
Then when you read your data back in, it'll contain the numbers (as strings) without any quotes, commas or brackets.
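A minimal round-trip sketch of that suggestion follows. The file name, the temporary directory and the sample values are assumptions made up for the example.

```python
import os
import tempfile

mylist = ["45453656565", "34325235235"]  # sample data for the sketch

# A throwaway file; a real script would use its own output path.
path = os.path.join(tempfile.mkdtemp(), "numbers.txt")

# Write one element per line instead of str(mylist).
with open(path, "w") as output:
    for elt in mylist:
        output.write(elt + "\n")

# Reading back needs no bracket or quote stripping, only the newline.
with open(path) as source:
    loaded = [line.rstrip("\n") for line in source]

print(loaded)
```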

Unicode object to a list

I have a UTF-8 text corpus that I can read easily in Python 2.7:
sentence = codecs.open("D:\\Documents\\files\\sentence.txt", "r", encoding="utf8")
sentence = sentence.read()
> This is my sentence in the right format
However, when I pass this text corpus to a list (for example, for tokenizing):
tokens = sentence.tokenize()
and print it in the notebook, I get escaped characters like:
(u'\ufeff\ufeffFaux,', u'Tunisie')
(u'Tunisie', u"l'\xc9gypte,")
whereas I would like normal characters, just like in my original import.
So my question is: how can I put unicode objects into a list without getting these strange escaped characters?
It's all in how you print. Python 2 displays lists using ASCII-only characters and substituting backslash escape codes for non-ASCII characters. This is to make it easy to see hidden characters that normal printing would make invisible, like the double byte-order-mark (BOM) \ufeff you see in your strings. Printing individual string items will display them correctly.
Many examples
Original strings:
>>> s = (u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t = (u'Tunisie', u"l'\xc9gypte,")
Displaying at the interactive prompt:
>>> s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> t
(u'Tunisie', u"l'\xc9gypte,")
>>> print s
(u'\ufeff\ufeffFaux,', u'Tunisie')
>>> print t
(u'Tunisie', u"l'\xc9gypte,")
Printing individual strings from the tuples:
>>> print s[0]
Faux,
>>> print s[1]
Tunisie
>>> print t[0]
Tunisie
>>> print t[1]
l'Égypte,
>>> print ' '.join(s)
Faux, Tunisie
>>> print ' '.join(t)
Tunisie l'Égypte,
A way to print tuples without escape codes:
>>> print "('"+"', '".join(s)+"')"
('Faux,', 'Tunisie')
>>> print "('"+"', '".join(t)+"')"
('Tunisie', 'l'Égypte,')
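The escaped display also exposes the stray \ufeff characters, which can be removed before further processing. The sketch below uses a hypothetical helper named strip_bom. Reading the file with the utf-8-sig codec instead of utf8 skips a single leading BOM, but with the doubled BOM shown here one would still remain, so an explicit strip is used.

```python
BOM = u"\ufeff"


def strip_bom(text):
    """Remove any byte-order marks from the start of the text."""
    return text.lstrip(BOM)


tokens = (u"\ufeff\ufeffFaux,", u"Tunisie")
cleaned = tuple(strip_bom(token) for token in tokens)

print(cleaned[0])
```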
Hm, codecs.open(...) returns a "wrapped version of the underlying file object" then you overwrite this variable with the result from executing the read method on that object. Brave, irritating - but ok ;-)
When you type say an äöüß into your "notebook", does it show like "this" or do you see some \uxxxxx instead?
The default value for codecs.open(...) is errors=strict so if this is the same environment for all samples, this should work.
As I understand it, when you write "print it" you are printing the list itself, which is different from printing the items in the list.
Sample (taking a tab typed as \t into a normal "byte" string - this is python 2.7.11):
>>> a="\t"
>>> print a # below is an expanded tab
>>> a
'\t'
>>> [a]
['\t']
>>> print [a]
['\t']
>>> for element in [a]:
...     print element
...
>>> # above is an expanded tab
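The same behaviour can be checked with assertions rather than by eye. This is a small sketch that should run on both Python 2.7 and Python 3. It relies only on the fact that a list formats its items with repr(), whether the list is printed or echoed at the prompt.

```python
a = "\t"

assert len(a) == 1            # one real tab character
assert repr(a) == "'\\t'"     # repr spells it as backslash + t
assert str([a]) == "['\\t']"  # a list formats its items with repr()
assert " ".join([a]) == a     # joining uses the items themselves

print(repr(a))
print([a])
```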

String with Backslash and Quotes in Python is problems galore

I have a normal string that I want to send to a program which only accepts it in the form "\"text\"", exactly like that. In Python I can print it like that, but I can't assign it like that. See the following:
My text:
In [12]:
i = fieldList[0]
print str(i.name)
Y03M01D01
Which I can print as "\"text\""
In [13]:
field_new = '"\\"'+str(i.name)+'\\""'
print field_new
"\"Y03M01D01\""
But this is how it is eaten by the program
In [14]:
field_new
Out[14]:
'"\\"Y03M01D01\\""'
Which is not equal to "\"text\"" and so my code fails.
Any suggestions how to resolve this?
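The value echoed by Out[14] is the repr of the string, which doubles each backslash. The string itself contains single backslashes, exactly as print shows. You can check this by comparing against a raw string: with the r prefix, Python takes backslashes literally instead of treating them as escape characters.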
>>> i = "text"
>>> field_new = '"\\"'+str(i)+'\\""'
>>> field_new
'"\\"text\\""'
>>> field_new == r'"\"text\""'
True

How to convert hex string "\x89PNG" to plain text in python

I have a string "\x89PNG" which I want to convert to plain text.
I referred to http://love-python.blogspot.in/2008/05/convert-hext-to-ascii-string-in-python.html
But I found it a little complicated. Can this be done in a simpler way?
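'\x89PNG' is already plain text; \x89 is just the escape for a single non-printable byte. Just try to print it (how that first byte is rendered depends on your terminal's encoding):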
>>> s = '\x89PNG'
>>> print s
┴PNG
The recipe in the link does nothing:
>>> hex_string = '\x70f=l\x26hl=en\x26geocode=\x26q\x3c'
>>> ascii_string = reformat_content(hex_string)
>>> hex_string == ascii_string
True
In Python 2, real hex <-> plain-text encoding/decoding is a piece of cake:
>>> s.encode('hex')
'89504e47'
>>> '89504e47'.decode('hex')
'\x89PNG'
However, you may have problems with strings like '\x70f=l\x26hl=en\x26geocode=\x26q\x3c', where '\' and 'x' are separate characters:
>>> s = '\\x70f=l\\x26hl=en\\x26geocode=\\x26q\\x3c'
>>> print s
\x70f=l\x26hl=en\x26geocode=\x26q\x3c
In this case string_escape encoding is really helpful:
>>> print s.decode('string_escape')
pf=l&hl=en&geocode=&q<
More about encodings - http://docs.python.org/library/codecs.html#standard-encodings
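The calls above are Python 2 idioms: on Python 3, str has no 'hex' text encoding and the string_escape codec does not exist. A rough Python 3 counterpart, offered as a sketch and not as the answerer's method, uses binascii.hexlify, bytes.fromhex and the unicode_escape codec:

```python
import binascii

data = b"\x89PNG"

# bytes -> hex text and back again
hex_text = binascii.hexlify(data).decode("ascii")
assert hex_text == "89504e47"
assert bytes.fromhex(hex_text) == data

# Text containing a literal backslash followed by x, as in the example above
escaped = "\\x70f=l\\x26hl=en\\x26geocode=\\x26q\\x3c"
decoded = escaped.encode("ascii").decode("unicode_escape")
print(decoded)
```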
