Get rid of unicode error

I have the following code, which attempts to print the edge list of a graph. It looks like the edges are cycled through, but my intention is to check that all edges are present as they pass through the function for further processing.
def mapper_network(self, _, info):
    info[0] = info[0].encode('utf-8')
    for i in range(len(info[1])):
        info[1][i] = str(info[1][i])
    l_lst = len(info[1])
    packed = [(info[0], l) for l in info[1]] #each pair of nodes (edge)
    weight = [1 /float(l_lst)] #each edge weight
    G = nx.Graph()
    for i in range(len(packed)):
        edge_from = packed[i][0]
        edge_to = packed[i][1]
        #edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
        edge_to = edge_to.encode("utf-8")
        weight = weight
        G.add_edge(edge_from, edge_to, weight=weight)
    #print G.size() #yes, this works :)
    G_edgelist = []
    G_edgelist = G_edgelist.append(nx.generate_edgelist(G).next())
    print G_edgelist
With this code, I obtain the error
Traceback (most recent call last):
File "MRQ7_trevor_2.py", line 160, in <module>
MRMostUsedWord2.run()
File "/tmp/MRQ7_trevor_2.vagrant.20160814.201259.655269/job_local_dir/1/mapper/27/mrjob.tar.gz/mrjob/job.py", line 433, in run
mr_job.execute()
File "/tmp/MRQ7_trevor_2.vagrant.20160814.201259.655269/job_local_dir/1/mapper/27/mrjob.tar.gz/mrjob/job.py", line 442, in execute
self.run_mapper(self.options.step_num)
File "/tmp/MRQ7_trevor_2.vagrant.20160814.201259.655269/job_local_dir/1/mapper/27/mrjob.tar.gz/mrjob/job.py", line 507, in run_mapper
for out_key, out_value in mapper(key, value) or ():
File "MRQ7_trevor_2.py", line 91, in mapper_network
G_edgelist = G_edgelist.append(nx.generate_edgelist(G).next())
File "/home/vagrant/anaconda/lib/python2.7/site-packages/networkx/readwrite/edgelist.py", line 114, in generate_edgelist
yield delimiter.join(map(make_str,e))
File "/home/vagrant/anaconda/lib/python2.7/site-packages/networkx/utils/misc.py", line 82, in make_str
return unicode(str(x), 'unicode-escape')
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: \ at end of string
With the modification below
edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
I obtained
edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
TypeError: must be unicode, not str
How can I get rid of this Unicode error? It seems very troublesome, and I would highly appreciate your assistance. Thank you!!

I highly recommend reading this article on unicode. It gives a nice explanation of unicode vs. strings in Python 2.
For your problem specifically, when you call unicodedata.normalize("NFKD", edge_to), edge_to must be a unicode string. However, it is not unicode since you set it in this line: info[1][i] = str(info[1][i]). Here's a quick test:
import unicodedata
edge_to = u'edge' # this is unicode
edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
print edge_to # prints 'edge' as expected
edge_to = 'edge' # this is not unicode
edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
print edge_to # TypeError: must be unicode, not str
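You can get rid of the problem by making sure edge_to is a unicode object before you normalize it, for example with unicode(edge_to) for plain ASCII values or edge_to.decode('utf-8') for UTF-8 encoded byte strings.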
As an aside, it seems like the encoding/decoding of the whole code chunk is a little confusing. Think out exactly where you want strings to be unicode vs. bytes. You may not need to be doing so much encoding/decoding/normalization.
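One way to keep that decision in a single place is to convert every incoming value to text once, and only then normalize it. A minimal sketch, written for Python 3 (the question runs on Python 2, where the same idea applies with unicode in place of str). The helper names to_text and normalized_label are made up for illustration and are not part of networkx or mrjob:

```python
import unicodedata


def to_text(value, encoding="utf-8"):
    """Return value as text, decoding it first if it arrives as bytes."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return str(value)


def normalized_label(value):
    """Hypothetical helper: text in, NFKD-normalized text out."""
    return unicodedata.normalize("NFKD", to_text(value))


# Mixed input: UTF-8 bytes, text, and a number all end up as text.
labels = [normalized_label(v) for v in (b"caf\xc3\xa9", "node", 42)]
print([ascii(label) for label in labels])
```

With all node labels held as text, encoding to bytes is only needed at the point where the data leaves the program.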

Related

TypeError: cannot use a string pattern on a bytes-like object python3

I have updated my project to Python 3.7 and Django 3.0
Here is code of models.py
def get_fields(self):
    fields = []
    html_text = self.html_file.read()
    self.html_file.seek(0)
    # for now just find singleline, multiline, img editable
    # may put repeater in there later (!!)
    for m in re.findall("(<(singleline|multiline|img editable)[^>]*>)", html_text):
        # m is ('<img editable="true" label="Image" class="w300" width="300" border="0">', 'img editable')
        # or similar
        # first is full tag, second is tag type
        # append as a list
        # MUST also save value in here
        data = {'tag':m[0], 'type':m[1], 'label':'', 'value':None}
        title_list = re.findall("label\s*=\s*\"([^\"]*)", m[0])
        if(len(title_list) == 1):
            data['label'] = title_list[0]
        # store the data
        fields.append(data)
    return fields
Here is my error traceback
File "/home/harika/krishna test/dev-1.8/mcam/server/mcam/emails/models.py", line 91, in get_fields
for m in re.findall("(<(singleline|multiline|img editable)[^>]*>)", html_text):
File "/usr/lib/python3.7/re.py", line 225, in findall
return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
How can I solve my issue?

The thing is that when a file is opened in binary mode (as it apparently is here), Python 3's read returns bytes (i.e. the "raw" representation) and not a string. You can convert between bytes and strings if you specify an encoding, i.e. how characters are converted to bytes:
>>> '☺'.encode('utf8')
b'\xe2\x98\xba'
>>> '☺'.encode('utf16')
b'\xff\xfe:&'
The b before the string signifies that the value is not a string but rather bytes. You can also write a bytes literal directly by using that prefix:
>>> bytes_x = b'x'
>>> string_x = 'x'
>>> bytes_x == string_x
False
>>> bytes_x.decode('ascii') == string_x
True
>>> bytes_x == string_x.encode('ascii')
True
Note you can only use basic (ASCII) characters if you are using b prefix:
>>> b'☺'
File "<stdin>", line 1
SyntaxError: bytes can only contain ASCII literal characters.
So to fix your problem you need to either convert the input to a string with the appropriate encoding:
html_text = self.html_file.read().decode('utf-8') # or 'ascii' or something else
Or use bytes patterns in the findall calls instead of string patterns. Note that the matches, and therefore the tag, type and label values you store, will then be bytes as well:
for m in re.findall(b"(<(singleline|multiline|img editable)[^>]*>)", html_text):
    ...
    title_list = re.findall(rb"label\s*=\s*\"([^\"]*)", m[0])
(note the b in front of each pattern; the r on the second one keeps \s from being read as a string escape)
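For reference, here is a small self-contained sketch of the decode-first variant. It assumes the template is UTF-8 and uses io.BytesIO as a stand-in for the model's html_file; the module-level get_fields function and the sample markup are illustrative, not the asker's actual model code:

```python
import io
import re

TAG_PATTERN = re.compile(r"(<(singleline|multiline|img editable)[^>]*>)")
LABEL_PATTERN = re.compile(r'label\s*=\s*"([^"]*)')


def get_fields(binary_file, encoding="utf-8"):
    """Return one dict per editable tag in a file opened in binary mode."""
    html_text = binary_file.read().decode(encoding)  # bytes -> str
    binary_file.seek(0)
    fields = []
    for tag, tag_type in TAG_PATTERN.findall(html_text):
        labels = LABEL_PATTERN.findall(tag)
        # Keep the label only when exactly one is present, as in the question.
        label = labels[0] if len(labels) == 1 else ""
        fields.append({"tag": tag, "type": tag_type, "label": label, "value": None})
    return fields


# io.BytesIO stands in for a file object opened with mode "rb".
sample = io.BytesIO(
    b'<p><img editable="true" label="Image" width="300">'
    b'<singleline label="Title"></p>'
)
fields = get_fields(sample)
print(fields)
```

Because the text is decoded once up front, every value stored in the dicts is a regular str.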

Reconstruct the source file from string output

I use stepic3 to hide some data. Multiple files are compressed into a zip file, which will be the hidden message. However, when I use the following code
from PIL import Image
import stepic
def enc_():
    im = Image.open("secret.png")
    text = str(open("source.zip", "rb").read())
    im = stepic.encode(im, text)
    im.save('stegolena.png','PNG')
def dec_():
    im1=Image.open('stegolena.png')
    out = stepic.decode(im1)
    plaintext = open("out.zip", "w")
    plaintext.write(out)
    plaintext.close()
I get the error
Complete Trace back
Traceback (most recent call last):
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\simple.py", line 28, in <module>
enc_()
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\simple.py", line 8, in enc_
im = stepic.encode(im, text)
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\stepic.py", line 89, in encode
encode_inplace(image, data)
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\stepic.py", line 75, in encode_inplace
for pixel in encode_imdata(image.getdata(), data):
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\stepic.py", line 58, in encode_imdata
byte = ord(data[i])
TypeError: ord() expected string of length 1, but int found
There are two ways to convert to a string.
text = open("source.zip", "r", encoding='utf-8', errors='ignore').read()
with output
PKn!K\Z
sec.txt13 byte 1.10mPKn!K\Z
sec.txtPK52
or
text = str(open("source.zip", "rb").read())
with output
b'PK\x03\x04\x14\x00\x00\x00\x00\x00n\x8f!K\\\xac\xdaZ\r\x00\x00\x00\r\x00\x00\x00\x07\x00\x00\x00sec.txt13 byte 1.10mPK\x01\x02\x14\x00\x14\x00\x00\x00\x00\x00n\x8f!K\\\xac\xdaZ\r\x00\x00\x00\r\x00\x00\x00\x07\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xb6\x81\x00\x00\x00\x00sec.txtPK\x05\x06\x00\x00\x00\x00\x01\x00\x01\x005\x00\x00\x002\x00\x00\x00\x00\x00'
I used the second and I got the same string back from the retrieval.
In order to reconstruct the zip file (output is string), I use the code
plaintext = open("out.zip", "w")
plaintext.write(output)
plaintext.close()
but the written file is reported as corrupted when I try to open it. When I try to read what was written to it, with either
output = output.encode(encoding='utf_8', errors='strict')
or
output = bytes(output, 'utf_8')
the output is
b"b'PK\\x03\\x04\\x14\\x00\\x00\\x00\\x00\\x00n\\x8f!K\\\\\\xac\\xdaZ\\r\\x00\\x00\\x00\\r\\x00\\x00\\x00\\x07\\x00\\x00\\x00sec.txt13 byte 1.10mPK\\x01\\x02\\x14\\x00\\x14\\x00\\x00\\x00\\x00\\x00n\\x8f!K\\\\\\xac\\xdaZ\\r\\x00\\x00\\x00\\r\\x00\\x00\\x00\\x07\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xb6\\x81\\x00\\x00\\x00\\x00sec.txtPK\\x05\\x06\\x00\\x00\\x00\\x00\\x01\\x00\\x01\\x005\\x00\\x00\\x002\\x00\\x00\\x00\\x00\\x00'"
which is different from the source file.
What do I have to do to reconstruct the embedded file faithfully?

When you read a file in rb mode, you get a bytes object. If you print it, it may look like a string, but each individual element is actually an integer.
>>> my_bytes = b'hello'
>>> my_bytes
b'hello'
>>> my_bytes[0]
104
This explains the error
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\stepic.py", line 58, in encode_imdata
byte = ord(data[i])
TypeError: ord() expected string of length 1, but int found
ord() expects a string, so you have to convert all the bytes to strings. Unfortunately, str(some_byte_array) doesn't do what you think it does. It creates a literal string representation of your byte array, including the preceding "b" and the surrounding quotes.
>>> string = str(my_bytes)
>>> string[0]
'b'
>>> string[1]
"'"
>>> string[2]
'h'
What you want instead is to convert each byte (integer) to a string individually. map(chr, some_byte_array) will do this for you. We have to do this simply because stepic expects a string. When it embeds a character, it does ord(data[i]), which converts a string of length one to its Unicode code (integer).
Furthermore, we can't leave our string as a map object, because the code needs to calculate the length of the whole string before embedding it. Therefore, ''.join(map(chr, some_bytearray)) is what we have to use for our input secret.
For extraction stepic does the opposite. It extracts the secret byte by byte and turns them into strings with chr(byte). In order to reverse that, we need to get the ordinal value of each character individually. map(ord, out) should do the trick. And since we want to write our file in binary, further feeding that into bytearray() will take care of everything.
Overall, these are the changes you should make to your code.
def enc_():
    im = Image.open("secret.png")
    # Read the zip as bytes, then map each byte to a one-character string
    with open("source.zip", "rb") as source:
        text = ''.join(map(chr, source.read()))
    im = stepic.encode(im, text)
    im.save('stegolena.png','PNG')
def dec_():
    im1=Image.open('stegolena.png')
    out = stepic.decode(im1)
    # Turn each character back into its byte value and write in binary mode
    with open("out.zip", "wb") as plaintext:
        plaintext.write(bytearray(map(ord, out)))
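The conversion itself can be checked without stepic or an image. The sketch below round-trips every possible byte value through the chr/ord mapping described above; the function names bytes_to_text and text_to_bytes are made up for this example:

```python
def bytes_to_text(data):
    """Map each byte (0-255) to the character with the same code point."""
    return "".join(map(chr, data))


def text_to_bytes(text):
    """Inverse mapping: each character's code point becomes one byte."""
    return bytes(map(ord, text))


payload = bytes(range(256))      # every possible byte value
text = bytes_to_text(payload)
restored = text_to_bytes(text)

print(restored == payload)       # the round trip is lossless
print(str(payload) == text)      # str() on bytes gives the b'...' repr instead
```

Decoding with payload.decode("latin-1") and encoding with text.encode("latin-1") give the same result as the two helpers, since Latin-1 maps each byte to the code point of the same value.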

'UCS-2' codec can't encode characters in position 61-61

When I run my Python code and print(item), I get the following errors:
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 61-61: Non-BMP character not supported in Tk
Here is my code:
def getUserFollowers(self, usernameId, maxid = ''):
    if maxid == '':
        return self.SendRequest('friendships/'+ str(usernameId) +'/followers/?rank_token='+ self.rank_token,l=2)
    else:
        return self.SendRequest('friendships/'+ str(usernameId) +'/followers/?rank_token='+ self.rank_token + '&max_id='+ str(maxid))
def getTotalFollowers(self,usernameId):
    followers = []
    next_max_id = ''
    while 1:
        self.getUserFollowers(usernameId,next_max_id)
        temp = self.LastJson
        for item in temp["users"]:
            print(item)
            followers.append(item)
        if temp["big_list"] == False:
            return followers
        next_max_id = temp["next_max_id"]
How can I fix this?

Hard to guess without knowing the content of temp["users"], but the error indicates that it contains non-BMP Unicode characters, such as emoji.
If you try to display that in IDLE, you immediately get that kind of error. Simple example to reproduce (on IDLE for Python 3.5):
>>> t = "ab \U0001F600 cd"
>>> print(t)
Traceback (most recent call last):
File "<pyshell#5>", line 1, in <module>
print(t)
UnicodeEncodeError: 'UCS-2' codec can't encode characters in position 3-3: Non-BMP character not supported in Tk
(\U0001F600 represents the unicode character U+1F600 grinning face)
The error is indeed caused by Tk not supporting Unicode characters with a code point greater than U+FFFF. A simple workaround is to filter them out of your string:
def BMP(s):
    # 0x10000 (hexadecimal) is the first code point outside the BMP
    return "".join((i if ord(i) < 0x10000 else '\ufffd' for i in s))
'\ufffd' is the Python representation for the unicode U+FFFD REPLACEMENT CHARACTER.
My example becomes:
>>> t = "ab \U0001F600 cd"
>>> print(BMP(t))
ab � cd
So your code would become:
for item in temp["users"]:
    print(BMP(str(item)))  # str() in case item is a dict rather than a string
    followers.append(item)
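As a quick check that only characters above U+FFFF are touched, here is a standalone version of the same filter. The name replace_non_bmp and the sample string are illustrative only:

```python
def replace_non_bmp(text, replacement="\ufffd"):
    """Replace every character above U+FFFF, which Tk cannot display."""
    return "".join(ch if ord(ch) <= 0xFFFF else replacement for ch in text)


# U+1F600 (an emoji) lies outside the BMP; U+4E2D (a CJK character) is inside.
sample = "ab \U0001F600 \u4e2d cd"
cleaned = replace_non_bmp(sample)
print(ascii(cleaned))
```

The CJK character survives because its code point is below 0x10000, even though it is above decimal 10000, which is why the threshold has to be written in hexadecimal.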

Python: How do I print an unicode string from a .txt file

I'm using Python 3.2.3 and IDLE to program a text game.
I'm using a .txt file to store the map schemes, which the program will later open and draw in the terminal (IDLE for the moment).
The .txt file contains this:
╔════Π═╗
Π      ║
║w bb c□
║w bb c║
╚═□══□═╝
Π: door; □: window; b: bed; c: computer; w: wardrobe
As I'm new to programming, I'm having a hard time doing this.
Here is the code I made so far for this:
doc = codecs.open("D:\Escritório\Codes\maps.txt")
map = doc.read().decode('utf8')
whereIsmap = map.find('bedroom')
if buldIntel == 1 and localIntel == 1:
    whereIsmap = text.find('map1:')
    itsGlobal = 1
if espLocation == "localIntel" == 1:
    whereIsmap = text.find('map0:')
if buldIntel == 0 and localIntel == 0:
    doc.close()
for line in whereIsmap:
    (map) = line
    mapa.append(str(map))
doc.close()
if itsGlobal == 1:
    print(mapa[0])
    print(mapa[1])
    print(mapa[2])
    print(mapa[3])
    print(mapa[4])
    print(mapa[5])
    print(mapa[6])
    print(mapa[7])
if itsLocal == 1 and itsGlobal == 0:
    print(mapa[0])
    print(mapa[1])
    print(mapa[2])
    print(mapa[3])
    print(mapa[4])
There are two maps, and each one has a title; the smaller one is map1 (the one shown above).
Python gives this error message when I try to run the program:
Traceback (most recent call last):
File "C:\Python32\projetoo", line 154, in <module>
gamePlay(ask1, type, selfIntel1, localIntel, buildIntel, whereAmI, HP, time, itsLocal, itsBuild)
File "C:\Python32\projetoo", line 72, in gamePlay
map = doc.read().decode('utf8')
File "C:\Python32\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte
What do I do to print the maps to the IDLE terminal exactly as shown above?

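The issue is that you are using codecs.open without specifying an encoding, so doc.read() hands back raw bytes, and decoding them as UTF-8 fails on the very first byte, 0xff. A file that begins with 0xff 0xfe usually carries a UTF-16 byte order mark, which is what Windows Notepad writes when you save as "Unicode", so the file is most likely UTF-16 rather than UTF-8.
To fix this, either re-save the file as UTF-8, or pass the file's real encoding to codecs.open: codecs.open("...", encoding="utf-16"). Either way you won't need the call to .decode('utf8') later, because read() then returns an already decoded string.
Also, since you're using Python 3, you can just use open:
doc = open("...", encoding="utf-16").read()
Finally, print the string as it is. Don't re-encode it first: in Python 3, printing an encoded bytes object shows its b'...' representation instead of the map:
print("\n".join(mapa[0:5]))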
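If you are not sure which encoding a text file was saved in, checking for a byte order mark before decoding is one option. A rough sketch, assuming the file is either UTF-8 (with or without a BOM) or UTF-16 with a BOM; the function name read_text_guessing_bom and the temporary maps.txt are made up for the demonstration, and UTF-32 is deliberately not handled:

```python
import codecs
import os
import tempfile


def read_text_guessing_bom(path, default="utf-8"):
    """Pick an encoding from the file's byte order mark, then read it as text."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head.startswith(codecs.BOM_UTF8):
        encoding = "utf-8-sig"   # strips the UTF-8 BOM while decoding
    elif head.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        encoding = "utf-16"      # the codec reads the BOM to pick the byte order
    else:
        encoding = default
    with open(path, encoding=encoding) as handle:
        return handle.read()


# Demonstration: write a small map as UTF-16, then read it back.
map_text = "\u2554\u2550\u2550\u2550\u2550\u03a0\u2550\u2557\n\u255a\u2550\u25a1\u2550\u2550\u25a1\u2550\u255d\n"
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "maps.txt")
    with open(path, "w", encoding="utf-16") as handle:
        handle.write(map_text)
    loaded = read_text_guessing_bom(path)

print(loaded == map_text)
```

The escaped characters in map_text are the first and last rows of the map from the question, written as code points so the example does not depend on the source file's own encoding.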

how to show the right word in my code, my code is : os.urandom(64)

My code is:
print os.urandom(64)
which outputs:
> "D:\Python25\pythonw.exe" "D:\zjm_code\a.py"
\xd0\xc8=<\xdbD'
\xdf\xf0\xb3>\xfc\xf2\x99\x93
=S\xb2\xcd'\xdbD\x8d\xd0\\xbc{&YkD[\xdd\x8b\xbd\x82\x9e\xad\xd5\x90\x90\xdcD9\xbf9.\xeb\x9b>\xef#n\x84
which isn't readable, so I tried this:
print os.urandom(64).decode("utf-8")
but then I get:
> "D:\Python25\pythonw.exe" "D:\zjm_code\a.py"
Traceback (most recent call last):
File "D:\zjm_code\a.py", line 17, in <module>
print os.urandom(64).decode("utf-8")
File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-3: invalid data
What should I do to get human-readable output?

No shortage of choices. Here's a couple:
>>> os.urandom(64).encode('hex')
'0bf760072ea10140d57261d2cd16bf7af1747e964c2e117700bd84b7acee331ee39fae5cff6f3f3fc3ee3f9501c9fa38ecda4385d40f10faeb75eb3a8f557909'
>>> os.urandom(64).encode('base64')
'ZuYDN1BiB0ln73+9P8eoQ3qn3Q74QzCXSViu8lqueKAOUYchMXYgmz6WDmgJm1DyTX598zE2lClX\n4iEXXYZfRA==\n'

os.urandom is giving you a 64-byte string. Encoding it in hex is probably the best way to make it "human readable" to some extent. E.g.:
>>> s = os.urandom(64)
>>> s.encode('hex')
'4c28351a834d80674df3b6eb5f59a2fd0df2ed2a708d14548e4a88c7139e91ef4445a8b88db28ceb3727851c02ce1822b3c7b55a977fa4f4c4f2a0e278ca569e'
Of course this gives you 128 characters in the result, which may be too long a line to read comfortably; it's easy to split it up, though -- e.g.:
>>> print s[:32].encode('hex')
4c28351a834d80674df3b6eb5f59a2fd0df2ed2a708d14548e4a88c7139e91ef
>>> print s[32:].encode('hex')
4445a8b88db28ceb3727851c02ce1822b3c7b55a977fa4f4c4f2a0e278ca569e
two chunks of 64 characters each shown on separate lines may be easier on the eye.

Random bytes are not likely to form valid UTF-8 text, so I'm not surprised that you get decoding errors. Instead you need to convert them somehow. If all you're trying to do is see what they are, then something like:
print [ord(o) for o in os.urandom(64)]
Or, if you'd prefer to have it as hex 0-9a-f (zero-padded, so that every byte takes exactly two digits):
print ''.join( ['%02x' % ord(o) for o in os.urandom(64)] )
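The answers above target Python 2, where str.encode('hex') exists. On Python 3 that codec is gone, and a rough equivalent, using only standard library calls, might look like this:

```python
import base64
import os

raw = os.urandom(64)                     # 64 random bytes

hex_text = raw.hex()                     # two hex digits per byte
b64_text = base64.b64encode(raw).decode("ascii")

# Two lines of 64 hex digits each are easier to read than one line of 128.
print(hex_text[:64])
print(hex_text[64:])
print(b64_text)
```

Both forms are reversible: bytes.fromhex(hex_text) and base64.b64decode(b64_text) return the original 64 bytes.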
