Uncompress zlib string using ByteArrays - python

I have a web application developed in Adobe Flex 3 and Python 2.5 (deployed on Google App Engine). A RESTful web service has been created in Python and its results are currently in an XML format which is being read by Flex using the HttpService object.
Now the main objective is to compress the XML so that there is as little time as possible between the HttpService send() call and the result event. I looked up the Python docs and managed to use zlib.compress() to compress the XML result.
Then I set the HttpService result type from "xml" to "text" and tried using ByteArrays to uncompress the string back to XML. Here's where I failed. I am doing something like this:
var byteArray:ByteArray = new ByteArray();
byteArray.writeUTF( event.result.toString() );
byteArray.uncompress();
var xmlResult:XML = byteArray.readUTF();
It's throwing an exception at byteArray.uncompress(), saying it is unable to uncompress the byteArray. Also, when I trace the length of the byteArray, it is 0.
Unable to figure out what I'm doing wrong. All help is appreciated.
-- Edit --
The code:
# compressing the xml result in Python
print zlib.compress(xmlResult)
// decompressing it in AS3
var byteArray:ByteArray = new ByteArray();
byteArray.writeUTF( event.result.toString() );
byteArray.uncompress()
Event is of type ResultEvent.
The error:
Error: Error #2058: There was an error decompressing the data.
The error could be because the value of byteArray.bytesAvailable = 0, which means the raw bytes Python generated haven't been written into the byteArray properly.
-- Sri

What is byteArray.writeUTF( event.result.toString() ); supposed to do? The result of zlib.compress() is neither unicode nor "UTF" (meaningless without a number after it!?); it is binary, a.k.a. raw bytes; you should neither decode it nor encode it nor apply any other transformation to it. The receiver should immediately decompress the raw bytes that it receives, in order to recover the data that was passed to zlib.compress().
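To see this from the Python side, a quick Python 2 check helps (the sample document is made up):
import zlib

compressed = zlib.compress('<result>eve</result>')
print repr(compressed)  # raw bytes; default compression starts with '\x78\x9c'
print type(compressed)  # <type 'str'>: a Python 2 byte string, not unicode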
Update: What documentation do you have to support the notion that byteArray.uncompress() is expecting a true zlib stream and not a deflate stream (i.e. a zlib stream after you've snipped the first 2 bytes and the last 4)?
The Flex 3 documentation of ByteArray gives this example:
bytes.uncompress(CompressionAlgorithm.DEFLATE);
but unhelpfully doesn't say what the default (if any) is. If there is a default, it's not documented anywhere obvious, so it would be a very good idea for you to use
bytes.uncompress(CompressionAlgorithm.ZLIB);
to make it obvious what you intend.
AND the docs talk about a writeUTFBytes method, not a writeUTF method. Are you sure that you copy/pasted the exact receiver code in your question?
Update 2
Thanks for the URL. Looks like I got hold of the "help", not the real docs :=(. A couple of points:
(1) Yes, there is an explicit inflate() method. However, uncompress DOES have an algorithm arg; it can be either CompressionAlgorithm.ZLIB (the default) or CompressionAlgorithm.DEFLATE ... interestingly, the latter is only available in Adobe AIR, not in the Flash Player. At least we know the uncompress() call appears OK, and we can get back to the problem of getting the raw bytes onto the wire and off again into a ByteArray instance.
(2) More importantly, there are both writeUTF (Writes a UTF-8 string to the byte stream. The length of the UTF-8 string in bytes is written first, as a 16-bit integer, followed by the bytes representing the characters of the string) and writeUTFBytes (Writes a UTF-8 string to the byte stream. Similar to the writeUTF() method, but writeUTFBytes() does not prefix the string with a 16-bit length word).
Whatever the merits of supplying UTF8-encoded bytes (nil, IMHO), you don't want a 2-byte length prefix there; using writeUTF() is guaranteed to cause uncompress() to bork.
Getting it onto the wire: using Python's print on binary data doesn't seem like a good idea (unless sys.stdout has been nobbled to run in raw mode, which you didn't show in your code); print also appends a newline, which would corrupt the stream.
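For what it's worth, on App Engine's Python 2 webapp framework the usual way to put raw bytes on the wire is to write them to the response body with a binary content type; a sketch (the handler and build_xml_result() are made-up names, not from the question):
import zlib
from google.appengine.ext import webapp

class CompressedXmlHandler(webapp.RequestHandler):  # hypothetical handler name
    def get(self):
        xml_result = build_xml_result()  # hypothetical stand-in for the existing XML builder
        self.response.headers['Content-Type'] = 'application/octet-stream'
        self.response.out.write(zlib.compress(xml_result))  # raw bytes, no print, no newline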
Likewise, doing event.result.toString() to get a string (similar to a Python unicode object, yes/no? decoded from what?) and then encoding it in UTF-8 seems rather unlikely to work.
Given I didn't know that flex existed until today, I really can't help you effectively. Here are some further suggestions towards self-sufficiency in case nobody who knows more flex comes along soon:
(1) Do some debugging. Start off with a minimal XML document. Show repr(xml_doc). Show repr(zlib_compress_output). In (a cut-down version of) your Flex script, use the closest function/method to repr() that you can find to show: event.result, event.result.toString() and the result of writeUTF*(). Make sure you understand the effects of everything that can happen after zlib.compress(). Reading the docs carefully may help. (A Python-side sketch of this follows the list.)
(2) Look at how you can get raw bytes out of event.result.
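On the Python side, the debugging in point (1) might look like this (xml_doc is a stand-in name):
import zlib

xml_doc = '<result><name>eve</name></result>'  # minimal test document
compressed = zlib.compress(xml_doc)

print repr(xml_doc)     # what you think you are sending
print repr(compressed)  # the raw bytes that must reach the ByteArray untouched
assert zlib.decompress(compressed) == xml_doc  # round-trips locally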
HTH,
John

Related

How to get raw hex values from pcap file?

I've been playing around with scapy and want to read through and analyse every hex byte. So far I've been using scapy simply because I don't know another way currently. Before just writing tools myself to go through the pcap files I was wondering if there was an easy way to do it. Here's what I've done so far.
from scapy.all import rdpcap, TCP

packets = rdpcap('file.pcap')
tcpPackets = []
for packet in packets:
    if packet.haslayer(TCP):
        tcpPackets.append(packet)
When I run type(tcpPackets[0]) the type I get is:
<class 'scapy.layers.l2.Ether'>
Then when I try to convert the Ether object into a string it gives me a mix of hex and ascii (as noted by the random parentheses and brackets).
str(tcpPackets[0])
"b'$\\xa2\\xe1\\xe6\\xee\\x9b(\\xcf\\xe9!\\x14\\x8f\\x08\\x00E\\x00\\x00[:\\xc6#\\x00#\\x06\\x0f\\xb9\\n\\x00\\x01\\x04\\xc6)\\x1e\\xf1\\xc0\\xaf\\x07[\\xc1\\xe1\\xff0y<\\x11\\xe3\\x80\\x18 1(\\xb8\\x00\\x00\\x01\\x01\\x08\\n8!\\xd1\\x888\\xac\\xc2\\x9c\\x10%\\x00\\x06MQIsdp\\x03\\x02\\x00\\x05\\x00\\x17paho/34AAE54A75D839566E'"
I have also tried using hexdump but I can't find a way to parse through it.
I can't find the proper dupe now, but this is just a misuse/misunderstanding of str(). The original data is in a bytes format, for instance x = b'moo'.
When str() is applied to your bytes string, it works by calling the __str__ function of the bytes class/object. That returns a representation of itself. The representation keeps the b at the beginning because it makes it easier for humans to see that it's a bytes object, and probably also to avoid encoding issues (although that's speculation).
Same as if you accessed tcpPackets[0] from a terminal: it would call __repr__ and most likely show you something like <class 'scapy.layers.l2.Ether'>.
As an example code you can experiment with, try this out:
class YourEther(bytes):
    def __str__(self):
        return '<Made Up Representation>'

print(YourEther())
Obviously scapy's returns another representation, not just a static string that says "made up representation". But you probably get the idea.
So in the case of <class 'scapy.layers.l2.Ether'>, its __repr__ or __str__ function probably returns b'$\\xa2\\....... instead of just its default class representation (some correction might be in order here, as I don't remember/know all the technical names for these behaviors).
As a workaround, this might fix your issue (hexlify lives in binascii):
from binascii import hexlify
hexlify(bytes(tcpPackets[0]))
Passing bytes(...) rather than str(...) hands hexlify the raw packet bytes directly, so there is no prepended b' or trailing ' to strip off. (Note: the surrounding " in the printout above isn't part of the data either; it's just how your console represents the string when printing.)
Scapy is probably more intended for uses like tcpPackets[0].dst rather than grabbing the raw data. I've got very little experience with Scapy, but it's an abstraction layer for a reason, and it's probably hiding the raw data, or it's in the core docs somewhere which I can't find right now.
More info on the __str__ description: Does python `str()` function call `__str__()` function of a class?
Last note: if you actually want to access the raw data, it seems like you can access it with the Raw class: Raw load found, how to access?
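For example, something along these lines should get you the raw bytes (a sketch based on the scapy docs rather than deep experience with it):
from scapy.all import rdpcap, TCP, Raw

packets = rdpcap('file.pcap')
for pkt in packets:
    if pkt.haslayer(TCP):
        whole_frame = bytes(pkt)     # every byte of the frame, no b'...' text involved
        if pkt.haslayer(Raw):
            payload = pkt[Raw].load  # just the application-layer payload bytes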
You can put all the bytes of a packet into a numpy array as follows:
import numpy as np

for p in tcpPackets:
    raw_pack_data = np.frombuffer(p.load, dtype=np.uint8)
    # Manipulate the bytes stored in raw_pack_data as you like.
This is fast. In my case, rdpcap takes ~20 times longer than putting all the packets into a big array in a similar for loop for a 1.5GB file.

How to get the printout from the results of a subprocess.run() function to appear the same as seen in a terminal?

import subprocess
result = subprocess.run(['pkexec', 'apt', 'update'], stdout=subprocess.PIPE)
print(result.stdout)
print(result.stdout) returned a very long string. See below.
pprint.pprint(result.stdout) returned the same content as a block of sentences. See below.
I would like the print out of result.stdout to be similar to the terminal print out when executing sudo apt update. How can I achieve it with python 3.6 found in Ubuntu 18.04?
The reason you get a block of "text" is that the output is not an actual string (Python 3 strings are Unicode) but a bytes object. This can be seen from the b written in front of the text. In order to turn a bytes object into a string it needs to be "decoded."
To decode the bytes object, its .decode() method is used; for the particular output in this question that turns into
print(result.stdout.decode())
The bytes object can be in any encoding, therefore the .decode() call accepts a parameter naming the encoding to decode from. The most common one is utf-8, so if no parameter is given, this is assumed. However, specifically on Windows systems, other encodings are common (e.g. "latin1"). To decode a "latin1" bytes object the call would thus look like
print(text.decode("latin1"))
The opposite operation, which encodes a string to a bytes object, is also available. Logically enough it is called .encode() and is typically used in protocols that stream data to another destination (e.g. over the Internet or to disk). This call also accepts an encoding argument, which allows the text to be encoded as e.g. "latin1" even though the default is "utf-8".
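As an aside, on Python 3.6 (the version in question) you can let subprocess do the decoding for you by passing an encoding argument, so result.stdout is already a str:
import subprocess

# The encoding parameter was added to subprocess.run() in Python 3.6;
# with it set, result.stdout is a decoded str instead of bytes.
result = subprocess.run(['pkexec', 'apt', 'update'],
                        stdout=subprocess.PIPE, encoding='utf-8')
print(result.stdout)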

Serialization via pickle/eval and zlib

I wish to compress a string using zlib and append it to a text message as a string. I have a couple of issues with it:
a. Is there a problem with combining a "binary" string and a normal string? For example, is there any problem sending via a socket a string that looks like this:
MSG 10=12 20=x\x9c+(\xc0\x00\x00S3\x08Q 33=hansz
I ask since, when opening files, one usually declares whether one intends to read in binary mode or not, and I never fully understood that.
b. Can I be sure that some characters will not appear in the compressed string? For example, if the compressed string includes some char sequence like x\x9c 33=eve, I'll have trouble parsing the message properly. If I knew that whitespace never appears in a zlib-compressed string, I could do a string split; if I knew that quotes and apostrophes do not appear, I might use a shlex split.
c. My intention is to use either zlib.compress(str(obj)) or zlib.compress(pickle.dumps(obj)) as a kind of pickling, and use either eval(zlib.decompress(s)) or pickle.loads(zlib.decompress(s)) for unpickling. Do you think it makes sense? The first idea is less safe (as eval is never that safe), but it's an internal system, so I'm OK with it; on the other hand, the compressed form turns out to be shorter in most cases, and just as quick. Do you think it's good practice?
d. The reason I wish to keep these messages short is that I wish to send them later via socket. I am not proficient with sockets; however, I know they tend to read small (4k?) buffers, so I try to make my messages not much longer than that.
a. The problem with combining bytes and a unicode string is the following: there are more than 255 letters across the world's alphabets. So, historically, hundreds of encodings were created to squeeze different alphabets into one byte each.
>>> print b'\xE4'.decode('cp1251') # russian d
д
>>> print b'\xE4'.decode('cp1252') # german ae
ä
The letters have different meanings. To avoid losing the meaning of these letters, you use unicode.
>>> print u'\u00e4\u0434'
äд
However, when you see bytes you may not know the encoding. So you cannot combine unicode and bytes straight away, because the same byte may be a different letter in each encoding.
Use 'UTF-8' as the encoding from here on. It uses more than one byte where necessary and can store all letters.
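To make the multi-byte point concrete (same Python 2 style as the examples above):
>>> u'\u00e4\u0434'.encode('utf-8')  # the german ae and russian d from above
'\xc3\xa4\xd0\xb4'
>>> '\xc3\xa4\xd0\xb4'.decode('utf-8')
u'\u00e4\u0434'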
b. zlib takes bytes and outputs bytes, and the output can contain any byte value, so you cannot count on any character (whitespace, quotes, or otherwise) being absent from it.
c. zlib.compress(pickle.dumps(obj)) and pickle.loads(zlib.decompress(s)) are totally fine. pickle takes objects and returns bytes, and you can save and restore more kinds of objects than with zlib.compress(repr(obj)) and eval(zlib.decompress(s)). Note that pickle is no safer than eval against malicious input. If you need safe evaluation, have a look at ast.literal_eval, or use json instead of pickle.
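A minimal sketch of that round trip (the object is illustrative):
import pickle
import zlib

obj = {'id': 10, 'name': 'hansz'}

blob = zlib.compress(pickle.dumps(obj))         # object -> pickle bytes -> compressed bytes
restored = pickle.loads(zlib.decompress(blob))  # and back again
assert restored == obj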
d. Make sure you know when one message ends and another message starts; otherwise zlib can get confused. I think you can use zlib.decompressobj for this. Sockets can send much more than 4k bytes. The buffer means that a socket saves up to 4k bytes and does not want to receive more until you take bytes out of the buffer. If you use TCP you can send endless streams of bytes and nothing is lost.
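A common way to mark where one message ends on a TCP socket is a fixed-size length prefix; here is a sketch (the 4-byte big-endian prefix is my choice, not anything mandated by zlib or sockets):
import struct

def send_msg(sock, payload):
    # Prefix each message with its length as a 4-byte big-endian integer.
    sock.sendall(struct.pack('>I', len(payload)) + payload)

def recv_exactly(sock, n):
    # Keep calling recv() until exactly n bytes have arrived.
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise EOFError('socket closed mid-message')
        chunks.append(chunk)
        n -= len(chunk)
    return b''.join(chunks)

def recv_msg(sock):
    # Read the length prefix, then exactly that many payload bytes.
    (length,) = struct.unpack('>I', recv_exactly(sock, 4))
    return recv_exactly(sock, length)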

Downloading YouTube Captions with API in Python with utf-8 characters

I'm using Jeff's demo code for using the YouTube API and Python to interact with captions for my videos. And I have it working great for my videos in English. Unfortunately, when I try to use it with my videos that have automatic transcripts in Spanish, which contain characters such as á, ¡, etc., I get an encoding error:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 25: ordinal not in range(128)
My Python script has # -*- coding: utf-8 -*- at the top and I've changed the CAPTIONS_LANGUAGE_CODE to 'es', but it seems like the script is still interpreting the .srt file it downloads as ascii rather than utf-8. The line where it downloads the .srt file is:
if response_headers["status"] == "200":
    self.srt_captions = SubRipFile.from_string(body)
How can I get Python to consider the srt file as utf-8 so that it doesn't throw an encoding error?
Thanks!
It looks like this isn't really a YouTube API issue at all, but a Python one. Note that your error isn't an encoding error, but a decoding error; you've stumbled upon the way that Python is designed to work (for better or for worse).
Many, many functions in Python will cast unicode data as 8-bit strings rather than native unicode objects, using \x with a hex number to represent characters greater than 127. (One such method is the from_string method of the SubRipFile object you're using.) Thus the data is still unicode, but the object is a string. Because of this, when you then force a cast to a unicode object (triggered by using the join method of a unicode object in the sample code you provided), Python assumes the ascii codec (the default for 8-bit strings, regardless of the data's actual encoding) to deal with the data, and that throws an error on those hex characters.
There are several solutions.
1) You could explicitly tell Python not to assume the ascii codec when you run your join method, but I always struggle with getting that right (and doing it in every case), so I won't attempt some sample code.
2) You could forgo native unicode objects and just use 8-bit strings to work with your unicode data; this would only require changing this line:
body = u'\n'.join(lines[2:])
To this:
body = '\n'.join(lines[2:])
There are potential drawbacks to this approach, however: again, you'd have to make sure you're doing it in every case, and you wouldn't be leveraging Python-native unicode objects (which may or may not be an issue later in your code).
3) You could use the lower-level codecs module to ensure that the data is cast as a native unicode object from the get-go, rather than messing around with 8-bit strings. Normally, you accomplish such a task in this manner:
import codecs
f = codecs.open('captions.srt', encoding='utf-8')
l = f.readlines()
f.close()
type(l[0])  # will be unicode object rather than string object
Of course, you have the complication of using a SubRipFile object which returns a string, but you could get around that by sending the data through a StringIO object (so the codecs module can treat the ripped data as a file), using the codecs module's decoding machinery directly, etc. The Python docs have pretty good sections on all of this.
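For instance, a Python 2 sketch of the StringIO route, where body stands in for the raw srt bytes you downloaded:
import codecs
from StringIO import StringIO

raw = StringIO(body)                     # treat the downloaded bytes as a file
reader = codecs.getreader('utf-8')(raw)  # wrap it so every read yields unicode
lines = reader.readlines()
# type(lines[0]) is now unicode, not str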
Best of luck.

How do I figure out the format for struct.unpack (because I didn't pack in Python)?

I have a C pipe client (pulled directly from the CallNamedPipe example found here) that, if given the string "first", sends the following message to my Python pipeserver:
b'f\x00i\x00r\x00s\x00t\x00\x00\x00'
The struct documentation gives examples where I did both the packing and unpacking in Python. That means I know the format because I explicitly specified it when I called struct.pack.
Is there some way for me to either a) infer the format from the above output or b) set the format in C the same way I do in Python?
Here's the relevant client code:
LPTSTR lpszPipename = TEXT("\\\\.\\pipe\\testpipe");
LPTSTR lpszWrite = TEXT("first");

fSuccess = CallNamedPipe(
    lpszPipename,  // pipe name
    lpszWrite,     // message to server
    ...
>>> b'f\x00i\x00r\x00s\x00t\x00\x00\x00'.decode('utf-16le')
u'first\x00'
"The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)"
Your C code is not writing a struct to the pipe; it is writing a null-terminated string encoded as little-endian UTF-16 text, which is what the TEXT() macro produces when you compile your Windows program in Unicode mode for an Intel CPU. Python knows how to decode these strings without using the struct module. Try this:
null_terminated_unicode_string = data.decode('utf-16le')
unicode_string = null_terminated_unicode_string[:-1]
You can use decode('utf-16') if your Python code is running on the same CPU architecture as the C program that writes the data. You might want to read up on Python's unicode codecs.
EDIT: You can infer the type of that data by knowing how UTF-16 and Windows string macros work, but Python cannot infer it. You could set the string encoding in C the same way you would in Python if you wanted to write some code to do so, but it's probably not worth your time.
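For completeness: if you really wanted to go through struct, the example payload happens to be six little-endian 16-bit code units, but decode() remains the idiomatic route (Python 2, matching the examples above):
import struct

data = b'f\x00i\x00r\x00s\x00t\x00\x00\x00'
code_units = struct.unpack('<6H', data)  # (102, 105, 114, 115, 116, 0)
text = u''.join(unichr(u) for u in code_units).rstrip(u'\x00')
assert text == u'first'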
