Convert Python's internal str to print equivalent - python

Currently I have:
>> class_name = 'AEROSPC\xc2\xa01A'
>> print(class_name)
>> AEROSPC 1A
>> 'AEROSPC 1A' == class_name
>> False
How can I convert class_name into 'AEROSPC 1A'? Thanks!

Convert to Unicode
You get interesting errors if you try to convert that directly, so I first decoded it from UTF-8 into a unicode object:
my_utf8 = 'AEROSPC\xc2\xa01A'.decode('utf8', 'ignore')
my_utf8
returns:
u'AEROSPC\xa01A'
Then I normalize the string. The \xa0 is a non-breaking space, which NFKC normalization replaces with an ordinary space:
import unicodedata
my_normed_utf8 = unicodedata.normalize('NFKC', my_utf8)
print my_normed_utf8
prints:
AEROSPC 1A
Convert back to String
The normalized result contains only ASCII characters, so I can then convert it back to a plain str:
my_str = str(my_normed_utf8)
print my_str
prints:
AEROSPC 1A
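For readers on Python 3, where str is already Unicode and the raw value would arrive as bytes, the same two steps might look like the sketch below. It assumes the input is UTF-8 encoded; the function name clean_class_name is made up for illustration and is not part of the original answer.

```python
import unicodedata


def clean_class_name(raw_bytes):
    """Decode UTF-8 bytes, then fold compatibility characters such as U+00A0."""
    text = raw_bytes.decode('utf-8')            # b'\xc2\xa0' becomes '\xa0'
    return unicodedata.normalize('NFKC', text)  # '\xa0' becomes a plain space


class_name = clean_class_name(b'AEROSPC\xc2\xa01A')
print(class_name == 'AEROSPC 1A')
```

NFKC also folds other compatibility characters (for example full-width letters), so it may change more than just the non-breaking space.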

Related

Decode unicode string in python

I'd like to decode the following string:
t\u028c\u02c8m\u0251\u0279o\u028a\u032f
It should be the IPA of 'tomorrow' as given in a JSON string from http://rhymebrain.com/talk?function=getWordInfo&word=tomorrow
My understanding is that it should be something like:
x = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f'
print x.decode()
I have tried the solutions from here, here, here, and here (and several others that more or less apply), and several permutations of their parts, but I can't get it to work.
Thank you
You need a u before your string (in Python 2.x, which you appear to be using) to indicate that this is a unicode string:
>>> x = u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # note the u
>>> print x
tʌˈmɑɹoʊ̯
If you have already stored the string in a variable, you can use the following constructor to convert the string into unicode:
>>> s = 't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # your string has a unicode-escape encoding but is not unicode
>>> x = unicode(s, encoding='unicode-escape')
>>> print x
tʌˈmɑɹoʊ̯
>>> x
u't\u028c\u02c8m\u0251\u0279o\u028a\u032f' # a unicode string
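In Python 3 there is no unicode constructor, so a rough equivalent is sketched below. It assumes the text really contains literal backslash-u sequences (as it would inside raw JSON) and only ASCII characters otherwise; the variable names are illustrative. Since the question mentions that the data comes from a JSON response, letting the json module do the unescaping is probably the more natural route.

```python
import json

# Raw string: the backslashes are literal characters, as in unparsed JSON text.
escaped = r't\u028c\u02c8m\u0251\u0279o\u028a\u032f'

# Option 1: go through bytes and use the unicode_escape codec.
via_codec = escaped.encode('ascii').decode('unicode_escape')

# Option 2: wrap the text in quotes and parse it as a JSON string.
via_json = json.loads('"' + escaped + '"')

print(via_codec == via_json)
print(len(via_codec))
```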

Regex sub function not working with unicode string

I'm trying to use Python's sub function and I'm having a problem getting it to work. From the troubleshooting I've been doing I believe it has something to do with the unicode characters in the string.
# -*- coding: utf-8 -*-
import re
import sys
reload(sys)
sys.setdefaultencoding('utf-8')

def someFunction(string):
    string = string.decode('utf-8')
    match = re.search(ur'éé', string)
    if match:
        print >> sys.stderr, "It was found"
    else:
        print >> sys.stderr, "It was NOT found"
    if isinstance(string, str):
        print >> sys.stderr, 'string is a string object'
    elif isinstance(string, unicode):
        print >> sys.stderr, 'string is a unicode object'
    new_string = re.sub(ur'éé', ur'é:', string)
    return new_string

stringNew = 'éégktha'
returnedString = someFunction(stringNew)
print >> sys.stderr, "After printing it: " + returnedString
#At this point in the code string = 'éégktha'
returnString = someFunction(string)
print >> sys.stderr, "After printing it: " + returnedString
So I would like 'é:gktha'. Below is what is printed to the error log when I run this code.
It was found
string is a unicode object
é:gktha
It was NOT found
string is a unicode object
éégktha
So I'm thinking it must be something with the string that is passed into my function. When I declare it as a unicode string, or as a string literal and then decode it, the pattern is found. But the pattern is not being found in the string being passed in. I was thinking my string = string.decode('utf-8') statement would convert any string passed into the function and then it would work.
I tried to do this in the python interpreter to work through this and when I declare string as a unicode string it works.
string = u'éégktha'
So to simulate the function I declared the string, decoded it, and then tried my regex statement, and it worked.
string = 'éégktha'
newString = string.decode('utf8')
string = re.sub(ur'éé', ur'é:', newString)
print string #é:gktha
This is a web app that works with a lot of unicode characters. This is Python 2.5 and I've always had a hard time when working with unicode characters. Any help and knowledge is greatly appreciated.
You should print what is returned by someFunction.
>>> string = 'éégktha'
>>> def someFunction(string):
...     #string = 'éégktha'
...     string = string.decode('utf8')
...     new_string = re.sub(ur'éé', ur'é:', string)
...     return new_string
...
>>> import re
>>> someFunction(string)
u'\xe9:gktha'
>>> print someFunction(string)
é:gktha
Your function is fine. In the simulation you use print, which shows what __str__ returns, whereas evaluating the call at the interpreter prompt shows what __repr__ returns for new_string/newString.
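The same distinction can be reproduced in Python 3, with one caveat: Python 3's repr keeps printable non-ASCII characters, so the built-in ascii() is used here as a stand-in for the escaped form that Python 2's repr produced. This is only a sketch; replace_double_e is a hypothetical name, and the characters are written as \u00e9 escapes so the snippet does not depend on the source file's encoding.

```python
import re


def replace_double_e(text):
    # '\u00e9' is the precomposed character e-acute.
    return re.sub('\u00e9\u00e9', '\u00e9:', text)


result = replace_double_e('\u00e9\u00e9gktha')

readable_form = str(result)   # what print would display
escaped_form = ascii(result)  # escaped view, close to Python 2's repr of a unicode object

print(escaped_form)
print(readable_form == result)
```

Both forms describe the same string; only the way it is displayed differs.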

How do you decode an ascii string in python?

For example, in your python shell(IDLE):
>>> a = "\x3cdiv\x3e"
>>> print a
The result you get is:
<div>
but if a is an ascii encoded string:
>>> a = "\\x3cdiv\\x3e" ## it's the actual \x3cdiv\x3e string if you read it from a file
>>> print a
The result you get is:
\x3cdiv\x3e
Now what I really want from a is <div>, so I did this:
>>> b = a.decode("ascii")
>>> print b
BUT surprisingly I did NOT get the result I want, it's still:
\x3cdiv\x3e
So basically what do I do to convert a, which is \x3cdiv\x3e to b, which should be <div>?
Thanks
>>> a = rb"\x3cdiv\x3e"
>>> a.decode('unicode_escape')
'<div>'
Also check out some interesting codecs.
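With Python 3.x, you would adapt Kabie's answer to
a = b"\\x3cdiv\\x3e"
a.decode('unicode_escape')
which gives
'<div>'
By contrast, with
a = b"\x3cdiv\x3e"
a.decode('ascii')
you also get '<div>', but only because Python already interpreted the escapes when the literal was created:
>>> a
b'<div>'
What is the b prefix for?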
Bytes literals are always prefixed with 'b' or 'B'; they produce an
instance of the bytes type instead of the str type. They may only
contain ASCII characters; bytes with a numeric value of 128 or greater
must be expressed with escapes.
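The difference between escapes that Python interprets in a literal and escapes that are stored as literal backslash characters (as when the text is read from a file) can be checked directly. The following Python 3 sketch uses illustrative variable names and assumes the stored text is pure ASCII.

```python
# Escapes stored as real characters: backslash, 'x', '3', 'c', ...
escaped = b"\\x3cdiv\\x3e"

# Escapes interpreted by Python when the literal is compiled.
literal = b"\x3cdiv\x3e"

as_ascii = escaped.decode('ascii')               # backslashes are kept
as_unescaped = escaped.decode('unicode_escape')  # escapes are interpreted

print(len(escaped), len(literal))
print(as_ascii)
print(as_unescaped)
```

Decoding with 'ascii' only changes the type from bytes to str, which is why it did not help in the question.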

Python: convert a dot separated hex values to string?

In this post: Print a string as hex bytes? I learned how to print a string as an "array" of hex bytes. Now I need to go the other way around:
So for example the input would be: 73.69.67.6e.61.74.75.72.65 and the output would be a string.
You can use the built-in binascii module. Do note, however, that this function will only work on ASCII-encoded characters.
binascii.unhexlify(hexstr)
Your input string will need to be dotless, however, which is quite easy with a simple
string = string.replace('.','')
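Put together, the two steps above might look like this in Python 3, where unhexlify returns bytes that still need decoding. The variable names are illustrative, and the sketch assumes the hex values represent ASCII text.

```python
import binascii

dotted = "73.69.67.6e.61.74.75.72.65"

hexstr = dotted.replace('.', '')  # strip the separators first
raw = binascii.unhexlify(hexstr)  # bytes object
text = raw.decode('ascii')        # assumes the bytes are ASCII text

print(text)
```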
Another (arguably safer) method would be to use base64 in the following way:
import base64
encoded = base64.b16encode(b'data to be encoded')
print (encoded)
data = base64.b16decode(encoded)
print (data)
or in your example:
data = base64.b16decode(b"7369676e6174757265", True)
print (data.decode("utf-8"))
The string can be sanitised before input into the b16decode method.
Note that I am using Python 3.2; you may not necessarily need the b in front of the string to denote bytes.
Example was found here
Without binascii:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(chr(int(e, 16)) for e in a.split('.'))
'signature'
>>>
or better:
>>> a="73.69.67.6e.61.74.75.72.65"
>>> "".join(e.decode('hex') for e in a.split('.'))
PS: this also works with non-ASCII characters (the bytes shown depend on the console encoding):
>>> a='.'.join(x.encode('hex') for x in 'Hellö Wörld!')
>>> a
'48.65.6c.6c.94.20.57.94.72.6c.64.21'
>>> print "".join(e.decode('hex') for e in a.split('.'))
Hellö Wörld!
>>>
EDIT:
No need for a generator expression here (thx to thg435):
a.replace('.', '').decode('hex')
Use string split to get a list of strings, then base 16 for decoding the bytes.
>>> inp="73.69.67.6e.61.74.75.72.65"
>>> ''.join((chr(int(i,16)) for i in inp.split('.')))
'signature'
>>>
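The 'hex' codec used in some of the answers above exists only in Python 2. A possible Python 3 round trip, going both ways, is sketched below; the helper names to_dotted_hex and from_dotted_hex are made up, and UTF-8 is an assumed default encoding rather than something the question specifies.

```python
def to_dotted_hex(text, encoding='utf-8'):
    """Encode text to bytes and join the two-digit hex values with dots."""
    return '.'.join(format(byte, '02x') for byte in text.encode(encoding))


def from_dotted_hex(dotted, encoding='utf-8'):
    """Reverse of to_dotted_hex: drop the dots, parse the hex, decode the bytes."""
    return bytes.fromhex(dotted.replace('.', '')).decode(encoding)


print(from_dotted_hex("73.69.67.6e.61.74.75.72.65"))
print(to_dotted_hex("signature"))
```

Passing the encoding explicitly keeps non-ASCII text working as long as both directions agree on it.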

How to convert hex string "\x89PNG" to plain text in python

I have a string "\x89PNG" which I want to convert to plain text.
I referred to http://love-python.blogspot.in/2008/05/convert-hext-to-ascii-string-in-python.html
But I found it a little complicated. Can this be done in a simpler way?
\x89PNG is plain text. Just try to print it:
>>> s = '\x89PNG'
>>> print s
┴PNG
The recipe in the link does nothing:
>>> hex_string = '\x70f=l\x26hl=en\x26geocode=\x26q\x3c'
>>> ascii_string = reformat_content(hex_string)
>>> hex_string == ascii_string
True
The real hex<->plaintext encoding/decoding is a piece of cake:
>>> s.encode('hex')
'89504e47'
>>> '89504e47'.decode('hex')
'\x89PNG'
However, you may have problems with strings like '\x70f=l\x26hl=en\x26geocode=\x26q\x3c', where '\' and 'x' are separate characters:
>>> s = '\\x70f=l\\x26hl=en\\x26geocode=\\x26q\\x3c'
>>> print s
\x70f=l\x26hl=en\x26geocode=\x26q\x3c
In this case string_escape encoding is really helpful:
>>> print s.decode('string_escape')
pf=l&hl=en&geocode=&q<
More about encodings - http://docs.python.org/library/codecs.html#standard-encodings
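The 'hex' and 'string_escape' codecs shown above are Python 2 only. As a hedged sketch of how the same operations could be written in Python 3: bytes.hex() and bytes.fromhex() cover the hex round trip, and the unicode_escape codec stands in for string_escape, assuming the escaped text is pure ASCII. Variable names are illustrative.

```python
data = b'\x89PNG'

hex_text = data.hex()               # bytes to hex digits
restored = bytes.fromhex(hex_text)  # hex digits back to bytes

# Here the backslash and 'x' are separate, literal characters.
escaped = '\\x70f=l\\x26hl=en\\x26geocode=\\x26q\\x3c'
unescaped = escaped.encode('ascii').decode('unicode_escape')

print(hex_text)
print(restored == data)
print(unescaped)
```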
