Reconstruct the source file from string output - python

I use stepic3 to hide some data. Multiple files are compressed into a zip file, which will be the hidden message. However, when I use the following code
from PIL import Image
import stepic
def enc_():
    im = Image.open("secret.png")
    text = str(open("source.zip", "rb").read())
    im = stepic.encode(im, text)
    im.save('stegolena.png','PNG')
def dec_():
    im1=Image.open('stegolena.png')
    out = stepic.decode(im1)
    plaintext = open("out.zip", "w")
    plaintext.write(out)
    plaintext.close()
I get the error
Complete Trace back
Traceback (most recent call last):
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\simple.py", line 28, in enc_()
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\simple.py", line 8, in enc_
im = stepic.encode(im, text)
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\stepic.py", line 89, in encode
encode_inplace(image, data)
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\stepic.py", line 75, in encode_inplace
for pixel in encode_imdata(image.getdata(), data):
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\stepic.py", line 58, in encode_imdata
byte = ord(data[i])
TypeError: ord() expected string of length 1, but int found
There are two ways to convert to a string.
text = open("source.zip", "r", encoding='utf-8', errors='ignore').read()
with output
PKn!K\Z
sec.txt13 byte 1.10mPKn!K\Z
sec.txtPK52
or
text = str(open("source.zip", "rb").read())
with output
b'PK\x03\x04\x14\x00\x00\x00\x00\x00n\x8f!K\\\xac\xdaZ\r\x00\x00\x00\r\x00\x00\x00\x07\x00\x00\x00sec.txt13 byte 1.10mPK\x01\x02\x14\x00\x14\x00\x00\x00\x00\x00n\x8f!K\\\xac\xdaZ\r\x00\x00\x00\r\x00\x00\x00\x07\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\xb6\x81\x00\x00\x00\x00sec.txtPK\x05\x06\x00\x00\x00\x00\x01\x00\x01\x005\x00\x00\x002\x00\x00\x00\x00\x00'
I used the second and I got the same string back from the retrieval.
In order to reconstruct the zip file (output is string), I use the code
plaintext = open("out.zip", "w")
plaintext.write(output)
plaintext.close()
but the written file is reported as corrupted when I try to open it. When I try to read what was written to it, with either
output = output.encode(encoding='utf_8', errors='strict')
or
output = bytes(output, 'utf_8')
the output is
b"b'PK\\x03\\x04\\x14\\x00\\x00\\x00\\x00\\x00n\\x8f!K\\\\\\xac\\xdaZ\\r\\x00\\x00\\x00\\r\\x00\\x00\\x00\\x07\\x00\\x00\\x00sec.txt13 byte 1.10mPK\\x01\\x02\\x14\\x00\\x14\\x00\\x00\\x00\\x00\\x00n\\x8f!K\\\\\\xac\\xdaZ\\r\\x00\\x00\\x00\\r\\x00\\x00\\x00\\x07\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\x00\\xb6\\x81\\x00\\x00\\x00\\x00sec.txtPK\\x05\\x06\\x00\\x00\\x00\\x00\\x01\\x00\\x01\\x005\\x00\\x00\\x002\\x00\\x00\\x00\\x00\\x00'"
which is different from the source file.
What do I have to do to reconstruct the embedded file faithfully?

When you read a file in rb mode, you get a bytes object. If you print it, it may look like a string, but each individual element is actually an integer.
>>> my_bytes = b'hello'
>>> my_bytes
b'hello'
>>> my_bytes[0]
104
This explains the error:
File "C:\Users\Sherif\OneDrive\Pyhton Projects\Kivy Tests\stepic.py", line 58, in encode_imdata
byte = ord(data[i])
TypeError: ord() expected string of length 1, but int found
ord() expects a string, so you have to convert all the bytes to strings. Unfortunately, str(some_byte_array) doesn't do what you think it does. It creates a literal string representation of your byte array, including the preceding "b" and the surrounding quotes.
>>> string = str(my_bytes)
>>> string[0]
'b'
>>> string[1]
"'"
>>> string[2]
'h'
What you want instead is to convert each byte (integer) to a string individually. map(chr, some_byte_array) will do this for you. We have to do this simply because stepic expects a string. When it embeds a character, it does ord(data[i]), which converts a string of length one to its Unicode code (integer).
Furthermore, we can't leave our string as a map object, because the code needs to calculate the length of the whole string before embedding it. Therefore, ''.join(map(chr, some_byte_array)) is what we have to use for our input secret.
For extraction stepic does the opposite. It extracts the secret byte by byte and turns them into strings with chr(byte). In order to reverse that, we need to get the ordinal value of each character individually. map(ord, out) should do the trick. And since we want to write our file in binary, further feeding that into bytearray() will take care of everything.
Overall, these are the changes you should make to your code.
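from PIL import Image
import stepic

def enc_():
    im = Image.open("secret.png")
    # Map each byte to the character with the same code point, since stepic expects a str here.
    with open("source.zip", "rb") as f:
        text = ''.join(map(chr, f.read()))
    im = stepic.encode(im, text)
    im.save('stegolena.png', 'PNG')

def dec_():
    im1 = Image.open('stegolena.png')
    out = stepic.decode(im1)
    # Turn each character back into its byte value and write the file in binary mode.
    with open("out.zip", "wb") as plaintext:
        plaintext.write(bytearray(map(ord, out)))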
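As a side note, in Python 3 the chr/ord mapping described in this answer appears to be equivalent to a Latin-1 decode and encode, because Latin-1 maps byte values 0..255 to the code points with the same numbers. The sketch below checks that without needing stepic or an image; the helper names bytes_to_text and text_to_bytes are made up for illustration.

```python
def bytes_to_text(data: bytes) -> str:
    """Map every byte 0..255 to the character with the same code point."""
    return data.decode("latin-1")


def text_to_bytes(text: str) -> bytes:
    """Reverse of bytes_to_text; fails if a character is above U+00FF."""
    return text.encode("latin-1")


# Every possible byte value, standing in for the contents of a zip file.
sample = bytes(range(256))

as_text = bytes_to_text(sample)
round_trip = text_to_bytes(as_text)

# Same result as the map(chr, ...) / map(ord, ...) approach from the answer.
same_as_chr = as_text == "".join(map(chr, sample))
same_as_ord = round_trip == bytes(bytearray(map(ord, as_text)))

# str() on bytes gives the printable repr, which is why the zip got corrupted.
repr_text = str(b"PK\x03\x04")
```

Either form should give back the original bytes exactly, which is what the zip file needs.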

Related

How to decode a string representation of a bytes object?

I have a string which includes encoded bytes inside it:
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
I want to decode it, but I can't since it has become a string. Therefore I want to ask whether there is any way I can convert it into
str2 = b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
Here str2 is a bytes object which I can decode easily using
str2.decode('utf-8')
to get the final result:
'Output file 문항분석.xlsx Created'
You could use ast.literal_eval:
>>> print(str1)
b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'
>>> type(str1)
<class 'str'>
>>> from ast import literal_eval
>>> literal_eval(str1).decode('utf-8')
'Output file 문항분석.xlsx Created'
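A small sketch of when this works: ast.literal_eval parses the text as a Python literal, so it fits the case where the string holds backslash escapes, as produced by calling str() on a bytes object. The variable names are only illustrative.

```python
from ast import literal_eval

# The UTF-8 bytes from the question.
raw = b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'

# str() yields the repr: the characters b, ', then backslash escapes, then '.
as_text = str(raw)

# literal_eval evaluates that repr safely and returns the bytes object again.
recovered = literal_eval(as_text)
message = recovered.decode('utf-8')
```

If the string instead holds the raw characters U+00EB, U+00AC and so on (no backslashes), literal_eval is likely to raise a SyntaxError, because a bytes literal may only contain ASCII characters.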
Based on the SyntaxError mentioned in your comments, you may be having a testing issue when attempting to print due to the fact that stdout is set to ascii in your console (and you may also find that your console does not support some of the characters you may be trying to print). You can try something like the following to set sys.stdout to utf-8 and see what your console will print (just using string slice and encode below to get bytes rather than the ast.literal_eval approach that has already been suggested):
import codecs
import sys
sys.stdout = codecs.getwriter('utf-8')(sys.stdout.buffer)
s = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
b = s[2:-1].encode().decode('utf-8')
A simple way is to assume that all the characters of the initial strings are in the [0,256) range and map to the same Unicode value, which means that it is a Latin1 encoded string.
The conversion is then trivial:
str1[2:-1].encode('Latin1').decode('utf8')
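A minimal sketch of that conversion, assuming the string really holds one character per original byte (code points below 256) wrapped in b'...'. The mangled value is built here from known bytes only so the result can be checked.

```python
# The UTF-8 bytes from the question.
raw = b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'

# Simulate the broken string: each byte became the character with the same
# code point, and the whole thing was wrapped in b'...'.
mangled = "b'" + raw.decode('latin-1') + "'"

# Strip the b' prefix and the closing quote, turn the characters back into
# bytes with Latin-1, then decode those bytes as UTF-8.
fixed = mangled[2:-1].encode('latin-1').decode('utf-8')
```

The encode step raises UnicodeEncodeError if any character is above U+00FF, which is a useful sign that the assumption does not hold for a given input.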
Finally, I have found an answer where I use a function to cast a string to bytes without encoding. Given the string
str1 = "b'Output file \xeb\xac\xb8\xed\x95\xad\xeb\xb6\x84\xec\x84\x9d.xlsx Created'"
I take only the actual encoded text inside of it
str1[2:-1]
and pass this to the function, which converts the string to bytes without encoding its values
import struct
def rawbytes(s):
    """Convert a string to raw bytes without encoding"""
    outlist = []
    for cp in s:
        num = ord(cp)
        if num <= 255:
            # Fits in one byte (this includes 0xFF itself).
            outlist.append(struct.pack('B', num))
        elif num <= 65535:
            # Fits in two bytes, big-endian.
            outlist.append(struct.pack('>H', num))
        else:
            # Three bytes: the high byte, then the low 16 bits.
            b = (num & 0xFF0000) >> 16
            H = num & 0xFFFF
            outlist.append(struct.pack('>bH', b, H))
    return b''.join(outlist)
So, calling the function would convert it to bytes which then is decoded
rawbytes(str1[2:-1]).decode('utf-8')
will give the correct output
'Output file 문항분석.xlsx Created'

binascii.Error: Incorrect padding, even when string length is multiple of 4

I am trying to convert a base64 string to an image with Python code, but I am getting binascii.Error: Incorrect padding. I have gone through the existing solutions, but they only suggest checking that the string length is divisible by 4 and, if not, making it divisible by 4 by adding '=' characters at the end of the base64-encoded string.
Please help in this.
PYTHON CODE: (please check code from drive for more visibility)
import base64
strOne= '...string has 200000 character thats why I couldn t paste'
print 'strOne Length',len(strOne)
print 'StrOne Length is completely divisible by 4 (len%4),(len/4):', len(strOne)%4,len(strOne)/4
with open("imageToSave.png", "wb") as fh:
    fh.write(strOne.strip().decode('base64'))
output:
strOne Length 200000
StrOne Length is completely divisible by 4 (len%4),(len/4): 0 50000
Traceback (most recent call last):
File "/tests.py", line 13, in <module>
fh.write(strOne.strip().decode('base64'))
File "/usr/lib/python2.7/encodings/base64_codec.py", line 42, in base64_decode
output = base64.decodestring(input)
File "/usr/lib/python2.7/base64.py", line 328, in decodestring
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
Checking your link, your string does have 200000 bytes, but it contains a header:
strOne = b"...
This is part of a MIME message or something similar. You have to strip it first.
strOne = strOne.partition(b",")[2]
then pad (if needed) up to the next multiple of 4
pad = -len(strOne) % 4
strOne += b"=" * pad
then decode using codecs (Python 3 compliant)
import codecs
codecs.decode(strOne.strip(), 'base64')
=> "we believe in team work" :)
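The steps in this answer can be sketched as a single Python 3 helper. The function name decode_base64_payload and the "data:image/png;base64," header used in the usage lines are assumptions for illustration; the actual header in the question's string is not shown in the post.

```python
import base64


def decode_base64_payload(text: str) -> bytes:
    """Drop an optional header ending in a comma, fix padding, then decode."""
    head, sep, tail = text.partition(",")
    # The base64 alphabet has no comma, so a comma can only end a header.
    payload = tail if sep else head
    payload = payload.strip()
    # Add as many '=' as needed to reach the next multiple of 4.
    payload += "=" * (-len(payload) % 4)
    return base64.b64decode(payload)


# Usage with made-up data: encode, strip the padding, prepend a header.
original = b"we believe in team work"
encoded = base64.b64encode(original).decode("ascii").rstrip("=")
with_header = "data:image/png;base64," + encoded
decoded = decode_base64_payload(with_header)
```

A length that is a multiple of 4 says nothing about whether the content is valid, which is why the header can slip through the length check in the question.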

Get rid of unicode error

I have the following code attempting to print the edge lists of graphs. It looks like the edges are cycled but it's my intention to test whether all edges are contained while going through the function for further processing.
def mapper_network(self, _, info):
    info[0] = info[0].encode('utf-8')
    for i in range(len(info[1])):
        info[1][i] = str(info[1][i])
    l_lst = len(info[1])
    packed = [(info[0], l) for l in info[1]] #each pair of nodes (edge)
    weight = [1 /float(l_lst)] #each edge weight
    G = nx.Graph()
    for i in range(len(packed)):
        edge_from = packed[i][0]
        edge_to = packed[i][1]
        #edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
        edge_to = edge_to.encode("utf-8")
        weight = weight
        G.add_edge(edge_from, edge_to, weight=weight)
    #print G.size() #yes, this works :)
    G_edgelist = []
    G_edgelist = G_edgelist.append(nx.generate_edgelist(G).next())
    print G_edgelist
With this code, I obtain the error
Traceback (most recent call last):
File "MRQ7_trevor_2.py", line 160, in <module>
MRMostUsedWord2.run()
File "/tmp/MRQ7_trevor_2.vagrant.20160814.201259.655269/job_local_dir/1/mapper/27/mrjob.tar.gz/mrjob/job.py", line 433, in run
mr_job.execute()
File "/tmp/MRQ7_trevor_2.vagrant.20160814.201259.655269/job_local_dir/1/mapper/27/mrjob.tar.gz/mrjob/job.py", line 442, in execute
self.run_mapper(self.options.step_num)
File "/tmp/MRQ7_trevor_2.vagrant.20160814.201259.655269/job_local_dir/1/mapper/27/mrjob.tar.gz/mrjob/job.py", line 507, in run_mapper
for out_key, out_value in mapper(key, value) or ():
File "MRQ7_trevor_2.py", line 91, in mapper_network
G_edgelist = G_edgelist.append(nx.generate_edgelist(G).next())
File "/home/vagrant/anaconda/lib/python2.7/site-packages/networkx/readwrite/edgelist.py", line 114, in generate_edgelist
yield delimiter.join(map(make_str,e))
File "/home/vagrant/anaconda/lib/python2.7/site-packages/networkx/utils/misc.py", line 82, in make_str
return unicode(str(x), 'unicode-escape')
UnicodeDecodeError: 'unicodeescape' codec can't decode byte 0x5c in position 0: \ at end of string
With the modification below
edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
I obtained
edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
TypeError: must be unicode, not str
How can I get rid of this unicode error? It seems very troublesome, and I would highly appreciate your assistance. Thank you!!
I highly recommend reading this article on unicode. It gives a nice explanation of unicode vs. strings in Python 2.
For your problem specifically, when you call unicodedata.normalize("NFKD", edge_to), edge_to must be a unicode string. However, it is not unicode since you set it in this line: info[1][i] = str(info[1][i]). Here's a quick test:
import unicodedata
edge_to = u'edge' # this is unicode
edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
print edge_to # prints 'edge' as expected
edge_to = 'edge' # this is not unicode
edge_to = unicodedata.normalize("NFKD", edge_to).encode('utf-8', 'ignore')
print edge_to # TypeError: must be unicode, not str
You can get rid of the problem by converting edge_to to unicode before normalizing it, for example with edge_to.decode('utf-8').
As an aside, it seems like the encoding/decoding of the whole code chunk is a little confusing. Think out exactly where you want strings to be unicode vs. bytes. You may not need to be doing so much encoding/decoding/normalization.
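The same rule carries over to Python 3, where str is the unicode type and bytes is the byte type: unicodedata.normalize only accepts text. Below is a small Python 3 sketch (not the Python 2 code from the question) that decodes first when it is handed bytes; the helper name normalize_label is hypothetical.

```python
import unicodedata


def normalize_label(value, encoding="utf-8"):
    """Return NFKD-normalized text, decoding bytes input first."""
    if isinstance(value, bytes):
        value = value.decode(encoding)
    return unicodedata.normalize("NFKD", value)


from_bytes = normalize_label(b"edge")
from_text = normalize_label("\u00e9")  # e with acute accent, precomposed

# Passing bytes straight to normalize fails, much like the error in the question.
try:
    unicodedata.normalize("NFKD", b"edge")
    raised = False
except TypeError:
    raised = True
```

Keeping values as text inside the function and encoding only at the output boundary tends to avoid this whole class of error.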

how to show the right word in my code, my code is : os.urandom(64)

My code is:
print os.urandom(64)
which outputs:
> "D:\Python25\pythonw.exe" "D:\zjm_code\a.py"
\xd0\xc8=<\xdbD'
\xdf\xf0\xb3>\xfc\xf2\x99\x93
=S\xb2\xcd'\xdbD\x8d\xd0\\xbc{&YkD[\xdd\x8b\xbd\x82\x9e\xad\xd5\x90\x90\xdcD9\xbf9.\xeb\x9b>\xef#n\x84
which isn't readable, so I tried this:
print os.urandom(64).decode("utf-8")
but then I get:
> "D:\Python25\pythonw.exe" "D:\zjm_code\a.py"
Traceback (most recent call last):
File "D:\zjm_code\a.py", line 17, in <module>
print os.urandom(64).decode("utf-8")
File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 0-3: invalid data
What should I do to get human-readable output?
No shortage of choices. Here's a couple:
>>> os.urandom(64).encode('hex')
'0bf760072ea10140d57261d2cd16bf7af1747e964c2e117700bd84b7acee331ee39fae5cff6f3f3fc3ee3f9501c9fa38ecda4385d40f10faeb75eb3a8f557909'
>>> os.urandom(64).encode('base64')
'ZuYDN1BiB0ln73+9P8eoQ3qn3Q74QzCXSViu8lqueKAOUYchMXYgmz6WDmgJm1DyTX598zE2lClX\n4iEXXYZfRA==\n'
os.urandom is giving you a 64-byte string. Encoding it in hex is probably the best way to make it "human readable" to some extent. E.g.:
>>> s = os.urandom(64)
>>> s.encode('hex')
'4c28351a834d80674df3b6eb5f59a2fd0df2ed2a708d14548e4a88c7139e91ef4445a8b88db28ceb3727851c02ce1822b3c7b55a977fa4f4c4f2a0e278ca569e'
Of course this gives you 128 characters in the result, which may be too long a line to read comfortably; it's easy to split it up, though -- e.g.:
>>> print s[:32].encode('hex')
4c28351a834d80674df3b6eb5f59a2fd0df2ed2a708d14548e4a88c7139e91ef
>>> print s[32:].encode('hex')
4445a8b88db28ceb3727851c02ce1822b3c7b55a977fa4f4c4f2a0e278ca569e
two chunks of 64 characters each shown on separate lines may be easier on the eye.
Random bytes are not likely to be valid unicode characters, so I'm not surprised that you get encoding errors. Instead you need to convert them somehow. If all you're trying to do is see what they are, then something like:
print [ord(o) for o in os.urandom(64)]
Or, if you'd prefer to have it as hex 0-9a-f:
print ''.join( ['%02x' % ord(o) for o in os.urandom(64)] )
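The answers above are written for Python 2, where str has an encode('hex') method. In Python 3 that codec is not available on bytes, so a rough equivalent uses bytes.hex() and the base64 module, as in this sketch (variable names are illustrative).

```python
import base64
import os

token = os.urandom(64)

# Two hex digits per byte, so 64 bytes become 128 characters.
hex_text = token.hex()

# Base64 is shorter; decode to ASCII to get a printable str.
b64_text = base64.b64encode(token).decode("ascii")

# Split the hex text into two 64-character lines for easier reading.
chunks = [hex_text[i:i + 64] for i in range(0, len(hex_text), 64)]
for chunk in chunks:
    print(chunk)
```

Both forms are reversible with bytes.fromhex and base64.b64decode, so nothing is lost by printing them this way.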

purpose of '"sss".decode("base64").decode("zlib")'

ACTIVATE_THIS = """
eJx1UsGOnDAMvecrIlYriDRlKvU20h5aaY+teuilGo1QALO4CwlKAjP8fe1QGGalRoLEefbzs+Mk
Sb7NcvRo3iTcoGqwgyy06As+HWSNVciKaBTFywYoJWc7yit2ndBVwEkHkIzKCV0YdQdmkvShs6YH
E3IhfjFaaSNLoHxQy2sLJrL0ow98JQmEG/rAYn7OobVGogngBgf0P0hjgwgt7HOUaI5DdBVJkggR
3HwSktaqWcCtgiHIH7qHV+esW2CnkRJ+9R5cQGsikkWEV/J7leVGs9TV4TvcO5QOOrTHYI+xeCjY
JR/m9GPDHv2oSZunUokS2A/WBelnvx6tF6LUJO2FjjlH5zU6Q+Kz/9m69LxvSZVSwiOlGnT1rt/A
77j+WDQZ8x9k2mFJetOle88+lc8sJJ/AeerI+fTlQigTfVqJUiXoKaaC3AqmI+KOnivjMLbvBVFU
1JDruuadNGcPmkgiBTnQXUGUDd6IK9JEQ9yPdM96xZP8bieeMRqTuqbxIbbey2DjVUNzRs1rosFS
TsLAdS/0fBGNdTGKhuqD7mUmsFlgGjN2eSj1tM3GnjfXwwCmzjhMbR4rLZXXk+Z/6Hp7Pn2+kJ49
jfgLHgI4Jg==
""".decode("base64").decode("zlib")
my code:
import zlib
print 'dsss'.decode('base64').decode('zlib')#error
Traceback (most recent call last):
File "D:\zjm_code\b.py", line 4, in <module>
print 'dsss'.decode('base64').decode('zlib')
File "D:\Python25\lib\encodings\zlib_codec.py", line 43, in zlib_decode
output = zlib.decompress(input)
zlib.error: Error -3 while decompressing data: unknown compression method
a='dsss'.encode('zlib')
print a
a.encode('base64')
print a
a.decode('base64')#error
print a
a.decode('zlib')
print a
x\x9cK)..Traceback (most recent call last):
File "D:\zjm_code\b.py", line 7, in <module>
a.decode('base64')
File "D:\Python25\lib\encodings\base64_codec.py", line 42, in base64_decode
output = base64.decodestring(input)
File "D:\Python25\lib\base64.py", line 321, in decodestring
return binascii.a2b_base64(s)
binascii.Error: Incorrect padding
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('zlib')
print a#why can't print 'dsss'
x\x9cK)..
a='dsss'
a=a.encode('zlib')
#print a
a=a.decode('zlib')
print a#its ok
i think the 'print a' encode the a with 'uhf-8'.
so:
#encoding:utf-8
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('utf-8')#but error.
a=a.decode('zlib')
print a#
x\x9cK)..Traceback (most recent call last):
File "D:\zjm_code\b.py", line 5, in <module>
a=a.decode('utf-8')
File "D:\Python25\lib\encodings\utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c in position 1: unexpected code byte
The data in the string is compressed binary data that has been encoded as base64. The .decode("base64").decode("zlib") decodes and decompresses it.
The error you got was because 'dsss' decoded from base64 is not valid zlib compressed data.
What is the purpose of x.decode("base64").decode("zlib") for x in ("sss", "dsss", random_garbage)? Excuse me, you should know; you are the one who is doing it!
Edit after OP's addition of various puzzles
Puzzle 1
a='dsss'.encode('zlib')
print a
a.encode('base64')
print a
a.decode('base64')#error
print a
a.decode('zlib')
print a
Resolution: all 3 statements of the form
a.XXcode('encoding')
should be
a = a.XXcode('encoding')
Puzzle 2
a='dsss'
a=a.encode('zlib')
print a
a=a.decode('zlib')
print a#why can't print 'dsss'
x\x9cK)..
But it does print 'dsss':
>>> a='dsss'
>>> a=a.encode('zlib')
>>> print a
x£K)..♠ ♦F☺¥
>>> a=a.decode('zlib')
>>> print a#why can't print 'dsss'
dsss
>>>
Puzzle 3
"""i think the 'print a' encode the a with 'uhf-8'."""
Resolution: You think extremely incorrectly. What follows the print is an expression. There are no such side effects. What do you imagine happens when you do this:
print 'start text ' + a + 'end text'
?
What do you imagine happens if you do print a twice? Encoding the already-encoded text again? Why don't you stop imagining and try it out?
In any case, note that the output of str.encode('zlib') is an str object, not a unicode object:
>>> print repr('dsss'.encode('zlib'))
'x\x9cK)..\x06\x00\x04F\x01\xbe'
Getting from that to UTF-8 is going to be somewhat difficult ... it would have to be decoded into unicode first -- with what codec? ascii and utf8 are going to have trouble with the '\x9c' and the '\xbe' ...
It is the reverse of:
original_message.encode('zlib').encode('base64')
zlib is a binary compression algorithm. base64 is a text encoding of binary data, which is useful to send binary message through text protocols like SMTP.
After 'dsss' was decoded from base64 (the three bytes 76h, CBh, 2Ch), the result was not valid zlib compressed data so it couldn't be decoded.
Try printing ACTIVATE_THIS to see the result of the decoding. It turns out to be some Python code.
.decode('base64') can be called only on a string that's encoded as base-64, in order to retrieve the byte sequence that was encoded there. Presumably that byte sequence, in the example you bring, was zlib-compressed, and so the .decode('zlib') part decompresses it.
Now, for your case:
>>> 'dsss'.decode('base64')
'v\xcb,'
But 'v\xcb,' is not a zlib-compressed string! And so of course you cannot ask zlib to "decompress" it. Fortunately zlib recognizes the fact (that 'v\xcb,' could not possibly have been produced by applying any of the compression algorithms zlib knows about to any input whatsoever) and so gives you a helpful error message (instead of a random-ish string of bytes, which you might well have gotten if you had randomly supplied a different but equally crazy input string!-)
Edit: the error in
a.encode('base64')
print a
a.decode('base64')#error
is obviously due to the fact that strings are immutable: just calling a.encode (or any other method) does not alter a, it produces a new string object (and here you're just printing it).
In the next snippet, the error is only in the OP's mind:
>>> a='dsss'
>>> a=a.encode('zlib')
>>> print a
x?K)..F?
>>> a=a.decode('zlib')
>>> print a#why can't print 'dsss'
dsss
>>>
that "why can't print" question is truly peculiar, applied to code that does print 'dsss'. Finally,
i think the 'print a' encode the a
with 'uhf-8'.
You think wrongly: there's no such thing as "uhf-8" (you mean "utf-8" maybe?), and anyway print a does not alter a, any more than just calling a.encode does.
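For readers on Python 3, where the 'base64' and 'zlib' string codecs used in this thread are not available as str methods, the same round trip can be sketched with the zlib and base64 modules. The function names pack and unpack are made up for illustration.

```python
import base64
import zlib


def pack(message: bytes) -> str:
    """Compress with zlib, then encode the binary result as base64 text."""
    return base64.b64encode(zlib.compress(message)).decode("ascii")


def unpack(text: str) -> bytes:
    """Reverse of pack: base64-decode first, then decompress."""
    return zlib.decompress(base64.b64decode(text))


packed = pack(b"dsss")
restored = unpack(packed)

# 'dsss' happens to be valid base64, but the three bytes it decodes to
# are not zlib data, so decompressing them fails.
garbage = base64.b64decode("dsss")
failure = None
try:
    zlib.decompress(garbage)
except zlib.error as exc:
    failure = str(exc)

# Methods return new objects; the original value is left untouched.
untouched = b"dsss"
zlib.compress(untouched)  # result discarded, untouched is still b"dsss"
```

The order matters: whatever was applied last when packing has to be undone first when unpacking.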
