Analyzing Python Code: Modulus Operator

I was looking at some code in Python (I know nothing about Python) and I came across this portion:
def do_req(body):
    global host, req
    data = ""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.connect((host, 80))
    s.sendall(req % (len(body), body))
    tmpdata = s.recv(8192)
    while len(tmpdata) > 0:
        data += tmpdata
        tmpdata = s.recv(8192)
    s.close()
    return data
This is then called later on with a body of huge size, over 500,000 bytes. It is sent to an Apache server whose maximum request size is left at the default of 8190 bytes.
My question is what is happening at the "s.sendall()" part? Obviously the entire body cannot be sent at once and I'm guessing it is reduced by way of the modulus operator. I don't know how it works in Python, though. Can anyone explain? Thanks.

It is not really the modulus operator (technically it is, since strings simply implement __mod__) but the Python 2-style string formatting operator.
Given format % values (where format is a string or Unicode object), % conversion specifications in format are replaced with zero or more elements of values. The effect is similar to using sprintf() in the C language.
Obviously the entire body cannot be sent at once
While the body indeed doesn't fit into a single packet, that is a low-level detail handled internally (most likely not even by Python but by the underlying system calls that write to the socket).
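For illustration, here is a minimal sketch of how such a template might be filled in. The actual req template is not shown in the question, so the one below is hypothetical:
# Hypothetical request template; the real `req` from the question is not shown.
req = ("POST /target HTTP/1.1\r\n"
       "Host: example.com\r\n"
       "Content-Length: %d\r\n"
       "\r\n"
       "%s")

body = "A" * 500000
request = req % (len(body), body)   # %d gets len(body), %s gets body, sprintf-style
print(request[:60])                 # the start of the finished request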

No, body is not reduced here, because % is the string formatting operator when its left operand is a string.
http://docs.python.org/release/2.5.2/lib/typesseq-strings.html
All the data is sent by the sendall method in parts. Internally, socket.sendall works roughly like this:
do {
    n = sendsegmented(s->sock_fd, buf, len, flags);
    buf += n;
    len -= n;
} while (len > 0);
where sendsegmented sends one chunk and returns the number of bytes actually sent (at most SEGMENT_SIZE).
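A rough Python-level sketch of the same loop (not the actual CPython implementation, just the idea):
def sendall_sketch(sock, data):
    # Keep calling send() until every byte has been handed to the kernel;
    # send() may accept only part of the buffer on each call.
    while data:
        sent = sock.send(data)
        data = data[sent:]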

Python socket: cannot receive int array

I am trying to set up inter-process communication between a Python program and a C program via Winsock sockets. Sending a string works, but now I am trying to send an int array from the C socket to the Python socket.
I have already found out that I have to use htonl() to convert the int array into a byte stream, because the Winsock2 send function cannot send int arrays directly.
Now I want to use ntohl() on the Python side, but the receive function returns bytes, whereas ntohl() needs an integer value as input.
Here is my code
C-Side (just relevant parts):
uint32_t a[1] = {1231};
uint32_t a_converted[1]={0};
a_converted[0] = htonl(a[0]);
iResult = send( ConnectSocket, ( char *) a_converted, sizeof( a_converted), 0 );
Python Side (just relevant parts):
data = connection.recv(16)
data_i = socket.ntohl(data)
What you received is a string of bytes; didn't ntohl raise an exception?
You may use the struct module to unpack it. For 16 bytes:
struct.unpack('!4I', data)
Meaning: unpack 4 unsigned 32-bit integers in network byte order.
RTM
(I cannot test it - try it on your own)
EDIT:
Oops, did not read your comment through. According to the socket docs, recv should return an object of type bytes. If it returns an object of type str, you should convert it to bytes; in Python 3 that would be data.encode().
PS Which Python are you on?
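To make that concrete, here is a small sketch of the receiving side, assuming the sender transmits whole 32-bit integers in network order via htonl() as in the question; note that recv() may return fewer bytes than requested:
import struct

data = connection.recv(16)                      # may return fewer than 16 bytes
count = len(data) // 4                          # number of complete 32-bit words received
values = struct.unpack('!%dI' % count, data[:count * 4])
print(values)                                   # tuple of unsigned ints, e.g. (1231,)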
You said you have managed to send strings over the connection. I assume you sent a char* and received it in Python as a string. What you have done is send a stream of bytes.
Now you want to send an array of integers. In memory, the integers are again stored as bytes.
Each integer could occupy 4 or 8 bytes. You can check this beforehand by printing
printf("Size of integer is %zu", sizeof(int));
Okay, great, now we know how many bytes we need to send. Say it is 4 for now.
We also need to know the endianness of the integers; since you used htonl(), assume big endian (network order) for now.
This means the most significant byte comes first and the least significant byte last.
So now you can send the integer array exactly like you did, by casting the array to char* and sending sizeof(array).
On the receiving side, though, you just have a stream of bytes. To convert it back to an array of integers you need to take 4 bytes at a time and combine them into an integer.
We can do that as follows.
Say there are 10 integers in total. You have to pass this information along separately somehow.
bytes = connection.recv(10*4)
array = []
for i in range(10):
    x = ord(bytes[i*4+0]) << 24
    x += ord(bytes[i*4+1]) << 16
    x += ord(bytes[i*4+2]) << 8
    x += ord(bytes[i*4+3])
    array += [x]
    print x
And you will be able to see your array of integers.
Here the function ord converts a character to its integer byte value.
Side notes:
Now, if your system has integers of size 8 instead of 4, you need to extend the body of the loop in Python: the shifts will go up to 56, and each index into bytes becomes i*8+... .
Similarly, if the endianness is different (little endian), the byte order is reversed: the shifts above apply to indices i*4+3 down to i*4+0 instead.
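Since the C side already calls htonl(), the bytes on the wire are in network (big-endian) order, so the whole loop above can also be replaced by one struct call. A sketch, assuming all 10*4 bytes arrive in a single recv():
import struct

num_ints = 10                          # must be agreed on separately, as noted above
raw = connection.recv(num_ints * 4)    # assumes everything arrives in one call
array = list(struct.unpack('!%dI' % num_ints, raw))   # '!' = network (big-endian) order
print array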

Reading LabVIEW TCP data (Flattened String / Data Cluster) in Python

I have a LabVIEW application that is flattening a cluster (array) of Doubles to a string, before transmitting over TCP/IP to my python application. It does this because TCP/IP will only transmit strings.
The problem is that python reads the string as a load of nonsense ASCII characters, and I can't seem to unscramble them back to the original array of doubles.
How do I interpret the string data that LabVIEW sends after flattening a data cluster? My only hint of useful information after hours of googling was a PyPI entry called pyLFDS, but it has since been taken down.
The LabVIEW flattened data format is described in some detail here. That document doesn't explicitly describe how double-precision floats (DBL type) are represented, but a little more searching found this which clarifies that they are stored in IEEE 754 format.
However it would probably be simpler and more future proof to send your data in a standard text format such as XML or JSON, both of which are supported by built-in functions in LabVIEW and standard library modules in Python.
A further reason not to use LabVIEW flattened data for exchange with other programs, if you have the choice, is that the flattened string doesn't include the type descriptor you need to convert it back into the original data type - you need to know what type the data was in order to decode it.
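For example, if the LabVIEW side were changed to send each cluster as a JSON array terminated by a newline, the Python side could be as simple as the sketch below (conn is assumed to be the connected socket, and the newline terminator is an assumption of this example):
import json

buf = b""
while not buf.endswith(b"\n"):        # read until the agreed terminator arrives
    chunk = conn.recv(4096)
    if not chunk:
        break
    buf += chunk

values = json.loads(buf.decode("utf-8"))   # e.g. [1.5, 2.25, 4.8]
print(values)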
I wanted to document the problem and solution so others can hopefully avoid the hours I have wasted looking for a solution on google.
When LabVIEW flattens data, in this case a cluster of doubles, it sends them simply as a concatenated string, with each double represented by 8 bytes. This is interpreted by Python as 8 ASCII characters per double, which appears as nonsense in your console.
To get back to the transmitted doubles, you need to take each 8-byte section in turn and convert the characters to their byte values, in Python's case using ord().
This will give you 8 bytes of integer codes (e.g. 4.8 = [64 19 51 51 51 51 51 51]).
It turns out that LabVIEW does most things, including TCP/IP transmission, big-endian. Unless you are working big-endian, you will probably need to reverse the byte order. For example, the example above becomes [51 51 51 51 51 51 19 64]. I put each of my doubles into a list, so I was able to use list(reversed(...)) to change the endianness.
You can then convert this back to a double. Example python code:
import struct
b = bytearray([51,51,51,51,51,51,19,64])  # this is the number 4.8, bytes already reversed to little-endian
value = struct.unpack('<d', b)[0]         # unpack returns a tuple; take its first element
print(value)  # 4.8
This is probably obvious to more experienced programmers, however it had me flummoxed for days. I apologise for using stackoverflow as the platform to share this by answering my own question, but hopefully this post helps the next person who is struggling.
EDIT: Note that if you are using a version earlier than Python 2.7.5 you might find that struct.unpack() fails on the bytearray. Using the example code above, substituting the following line worked for me:
b = bytes(bytearray([51,51,51,51,51,51,19,64]))
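Alternatively, you can skip the manual reversal entirely and tell struct that the incoming bytes are big-endian, which is how LabVIEW sent them in the first place. A small sketch using the same example value:
import struct

raw = bytearray([64, 19, 51, 51, 51, 51, 51, 51])   # 4.8 exactly as it arrives (big-endian)
value = struct.unpack('>d', bytes(raw))[0]           # '>' = big-endian, 'd' = 8-byte double
print(value)                                          # 4.8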
This code works for me. The UDP server accepts a flattened DBL array x and returns x+1 to port 6503. Modify the LabVIEW UDP client to your needs.
import struct
import socket
import numpy as np

def get_ip():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        # doesn't even have to be reachable
        s.connect(('10.255.255.255', 1))
        IP = s.getsockname()[0]
    except:
        IP = '127.0.0.1'
    finally:
        s.close()
    return IP

#bind_ip = get_ip()
print("\n\n[*] Current ip is %s" % (get_ip()))
bind_ip = ''
bind_port = 6502

server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind((bind_ip, bind_port))
print("[*] Ready to receive UDP on %s:%d" % (bind_ip, bind_port))

while True:
    data, address = server.recvfrom(1024)
    print('[*] Received %s bytes from %s:' % (len(data), address))
    arrLen = struct.unpack('>i', data[:4])[0]
    print('[*] Received array of %d doubles:' % (arrLen,))
    x = []
    elt = struct.iter_unpack('>d', data[4:])
    while True:
        try:
            x.append(next(elt)[0])
            print(x[-1])
        except StopIteration:
            break
    x = np.array(x)
    y = x + 1  # np.sin(x)
    msg = data[:4]
    for item in y:
        msg += struct.pack('>d', item)
    print(msg)
    A = (address[0], 6503)
    server.sendto(msg, A)
    break

server.close()
print('[*] Server closed')
print('[*] Done')
LabView UDP client:
I understand that this does not solve your problem as you mentioned you didn't have the ability to modify the LabVIEW code. But, I was hoping to add some clarity on common ways string data is transmitted over TCP in LabVIEW.
The Endianness of the data string sent through the Write TCP can be controlled. I recommend using the Flatten To String Function as it gives you the ability to select which byte order you want to use when you flatten your data; big-endian (default if unwired), native (use the byte-order of the host machine), or little-endian.
Another common technique I've seen is using the Type Cast Function. Doing this will convert the numeric to a big-endian string. This of course can be confusing when you read it on the other end of the network as most everything else is little-endian, so you'll need to do some byte-swapping.
In general, if you're not sure what the code looks like, assume that it will be big-endian if its coming from LabVIEW code.
The answer from nekomatic is a good one. Using a standard text format when available is always a good option.

Raspberrypi Python bus.read_byte

Is there a Python function that will respond like the Wire.available function in Arduino to get all the data on the wire, rather than having to specify how many bytes to grab?
This is what I have now, and it works fine, but I have to know how much data is coming down the wire, or it will provide unexpected results.
for i in range(0, 13):
    data += chr(bus.read_byte(address))
Thanks!
Not a perfect solution, but I found a way around knowing exactly how many bytes are on the way.
On the Arduino, I specify the max size of the buffer (128), add my data, zero out the rest, and then send the whole thing. On the Pi, I receive the whole buffer, and the first thing that happens is filtering out the \x00 characters. It's not perfect, but it works for now.
data = ""
for i in range(0, 128):
    data += chr(bus.read_byte(address))

print repr(data)
# prints the whole string as it is received

data = filter(lambda a: a != '\x00', data)

print repr(data)
# prints the string without any '\x00' characters
I use the pigpio library for I2C commands on the Raspberry Pi; it has functions much closer to Wire.
http://abyz.co.uk/rpi/pigpio/python.html#i2c_read_device
I think that's the function you're looking for.
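For reference, a minimal sketch of reading a block of bytes with pigpio; the bus number, slave address, and byte count below are placeholders for your setup:
import pigpio

pi = pigpio.pi()                       # connect to the local pigpio daemon
handle = pi.i2c_open(1, 0x04)          # bus 1, hypothetical slave address 0x04

count, data = pi.i2c_read_device(handle, 32)   # read 32 raw bytes from the device
if count > 0:
    print(data)                        # data is returned as a bytearray

pi.i2c_close(handle)
pi.stop()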

Where are python bytearrays used?

I recently came across the data type called bytearray in Python. Could someone provide scenarios where bytearrays are required?
This answer has been shamelessly ripped off from here
Example 1: Assembling a message from fragments
Suppose you're writing some network code that is receiving a large message on a socket connection. If you know about sockets, you know that the recv() operation doesn't wait for all of the data to arrive. Instead, it merely returns what's currently available in the system buffers. Therefore, to get all of the data, you might write code that looks like this:
# remaining = number of bytes being received (determined already)
msg = b""
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msg += chunk                 # Add it to the message
    remaining -= len(chunk)
The only problem with this code is that concatenation (+=) has horrible performance. Therefore, a common performance optimization in Python 2 is to collect all of the chunks in a list and perform a join when you're done. Like this:
# remaining = number of bytes being received (determined already)
msgparts = []
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msgparts.append(chunk)       # Add it to list of chunks
    remaining -= len(chunk)
msg = b"".join(msgparts)         # Make the final message
Now, here's a third solution using a bytearray:
# remaining = number of bytes being received (determined already)
msg = bytearray()
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msg.extend(chunk)            # Add to message
    remaining -= len(chunk)
Notice how the bytearray version is really clean. You don't collect parts in a list and you don't perform that cryptic join at the end. Nice.
Of course, the big question is whether or not it performs. To test this out, I first made a list of small byte fragments like this:
chunks = [b"x"*16]*512
I then used the timeit module to compare the following two code fragments:
# Version 1
msgparts = []
for chunk in chunks:
    msgparts.append(chunk)
msg = b"".join(msgparts)

# Version 2
msg = bytearray()
for chunk in chunks:
    msg.extend(chunk)
When tested, version 1 of the code ran in 99.8s whereas version 2 ran in 116.6s (a version using += concatenation takes 230.3s by comparison). So while performing a join operation is still faster, it's only faster by about 16%. Personally, I think the cleaner programming of the bytearray version might make up for it.
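For reference, the comparison can be reproduced along these lines with timeit (absolute numbers will differ by machine and Python version):
import timeit

setup = 'chunks = [b"x"*16]*512'

join_version = '''
msgparts = []
for chunk in chunks:
    msgparts.append(chunk)
msg = b"".join(msgparts)
'''

bytearray_version = '''
msg = bytearray()
for chunk in chunks:
    msg.extend(chunk)
'''

print(timeit.timeit(join_version, setup, number=10000))
print(timeit.timeit(bytearray_version, setup, number=10000))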
Example 2: Binary record packing
This example is a slight twist on the last one. Suppose you had a large Python list of integer (x,y) coordinates. Something like this:
points = [(1,2),(3,4),(9,10),(23,14),(50,90),...]
Now, suppose you need to write that data out as a binary encoded file consisting of a 32-bit integer length followed by each point packed into a pair of 32-bit integers. One way to do it would be to use the struct module like this:
import struct
f = open("points.bin","wb")
f.write(struct.pack("I",len(points)))
for x,y in points:
    f.write(struct.pack("II",x,y))
f.close()
The only problem with this code is that it performs a large number of small write() operations. An alternative approach is to pack everything into a bytearray and only perform one write at the end. For example:
import struct
f = open("points.bin","wb")
msg = bytearray()
msg.extend(struct.pack("I",len(points)))
for x,y in points:
    msg.extend(struct.pack("II",x,y))
f.write(msg)
f.close()
Sure enough, the version that uses bytearray runs much faster. In a simple timing test involving a list of 100000 points, it runs in about half the time as the version that makes a lot of small writes.
Example 3: Mathematical processing of byte values
The fact that bytearrays present themselves as arrays of integers makes it easier to perform certain kinds of calculations. In a recent embedded systems project, I was using Python to communicate with a device over a serial port. As part of the communications protocol, all messages had to be signed with a Longitudinal Redundancy Check (LRC) byte. An LRC is computed by taking an XOR across all of the byte values.
Bytearrays make such calculations easy. Here's one version:
message = bytearray(...)      # Message already created
lrc = 0
for b in message:
    lrc ^= b
message.append(lrc)           # Add to the end of the message
Here's a version that increases your job security:
message.append(functools.reduce(lambda x,y:x^y,message))
And here's the same calculation in Python 2 without bytearrays:
message = "..."               # Message already created
lrc = 0
for b in message:
    lrc ^= ord(b)
message += chr(lrc)           # Add the LRC byte
Personally, I like the bytearray version. There's no need to use ord() and you can just append the result at the end of the message instead of using concatenation.
Here's another cute example. Suppose you wanted to run a bytearray through a simple XOR-cipher. Here's a one-liner to do it:
>>> key = 37
>>> message = bytearray(b"Hello World")
>>> s = bytearray(x ^ key for x in message)
>>> s
bytearray(b'm#IIJ\x05rJWIA')
>>> bytearray(x ^ key for x in s)
bytearray(b"Hello World")
>>>
Here is a link to the presentation
A bytearray is very similar to a regular Python string (str in Python 2.x, bytes in Python 3), but with an important difference: whereas strings are immutable, bytearrays are mutable, a bit like a list of single-character strings.
This is useful because some applications use byte sequences in ways that perform poorly with immutable strings. When you are making lots of little changes in the middle of large chunks of memory, as in a database engine or an image library, strings perform quite poorly, since every change requires copying the whole (possibly large) string. Bytearrays make that kind of change possible without copying the memory first.
But this particular case is actually the exception rather than the rule. Most uses involve comparing strings or string formatting. For the latter there is usually a copy anyway, so a mutable type would offer no advantage; and for the former, since immutable strings cannot change, you can compute a hash of the string and compare that as a shortcut to comparing each byte in order, which is almost always a big win. That is why the immutable type (str or bytes) is the default, and bytearray is the exception for when you need its special features.
If you look at the documentation for bytearray, it says:
Return a new array of bytes. The bytearray type is a mutable sequence of integers in the range 0 <= x < 256.
In contrast, the documentation for bytes says:
Return a new “bytes” object, which is an immutable sequence of integers in the range 0 <= x < 256. bytes is an immutable version of bytearray – it has the same non-mutating methods and the same indexing and slicing behaviors.
As you can see, the primary distinction is mutability. str methods that "change" the string actually return a new string with the desired modification, whereas bytearray methods that change the sequence really do modify it in place.
You would prefer bytearray if you are editing a large object (e.g. an image's pixel buffer) through its binary representation and you want the modifications to be done in place for efficiency.
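A tiny sketch of the difference (the buffer size and byte values here are arbitrary):
buf = bytearray(b"\x00" * (1024 * 1024))   # e.g. a 1 MiB pixel buffer
buf[0] = 0xFF                              # change one byte in place, no copy made
buf[10:14] = b"\xde\xad\xbe\xef"           # splice a small region in place

img = bytes(buf)                           # immutable bytes: the same edit copies everything
img = img[:10] + b"\xde\xad\xbe\xef" + img[14:]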
Wikipedia provides an example of XOR cipher using Python's bytearrays (docstrings reduced):
#!/usr/bin/python2.7
from os import urandom

def vernam_genkey(length):
    """Generating a key"""
    return bytearray(urandom(length))

def vernam_encrypt(plaintext, key):
    """Encrypting the message."""
    return bytearray([ord(plaintext[i]) ^ key[i] for i in xrange(len(plaintext))])

def vernam_decrypt(ciphertext, key):
    """Decrypting the message"""
    return bytearray([ciphertext[i] ^ key[i] for i in xrange(len(ciphertext))])

def main():
    myMessage = """This is a topsecret message..."""
    print 'message:', myMessage
    key = vernam_genkey(len(myMessage))
    print 'key:', str(key)
    cipherText = vernam_encrypt(myMessage, key)
    print 'cipherText:', str(cipherText)
    print 'decrypted:', vernam_decrypt(cipherText, key)
    if vernam_decrypt(vernam_encrypt(myMessage, key), key) == myMessage:
        print('Unit Test Passed')
    else:
        print('Unit Test Failed - Check Your Python Distribution')

if __name__ == '__main__':
    main()

AS3 server socket, progress event, readutfbytes

having a bit of an issue here.
I am working on a multiplayer game with Flash, and using Python for the server side of things. I have the socket connection working... sorta, and quite a bit of the Python work done, but I am running into a weird problem.
Let's say, upon logging in, I send some data to the client containing some of their info.
After which, I send some data to assign them to a room.
This data doesn't seem to be read in AS3 as two different things; instead, after readUTFBytes, it all ends up in the same string.
var str:String = event.currentTarget.readUTFBytes(event.currentTarget.bytesAvailable);
In Python, I have defined methods for sending data, which just write via transport.write (Twisted), and I am receiving via a socket-data progress event in ActionScript. Any idea what could be wrong here? Here's a bit of code...
if ( ! event.currentTarget.bytesAvailable > 0) {
    return;
}

var str:String = event.currentTarget.readUTFBytes(event.currentTarget.bytesAvailable);
var Char1:String = str.charAt(0);
var Char2:String = str.charAt(1);
str = str.replace(Char1, "");
str = str.replace(Char2, "");

// Various messages
if (Char1 == "\x03") {
    if (Char2 == "\x03") {
        trace("Got ping thread");
    }
    else {
        trace("x03 but no secondary prefix handled");
    }
    return;
}
Quite sloppy I know, but I'm trying to just determine an issue.
All data comes with two prefixes, something like \x02 and \x09, which tell me what to do; then most data in the string is split on \x01 to get the values.
The problem essentially is: where I should get \x08 \x08 data, I instead get \x08 \x08 data \x05 \x03 data, when it should be two separate things.
TCP connections are reliable, ordered, stream oriented transports. A stream is a sequence of bytes with no inherent message boundaries. If you want to split up your bytes into separate messages, the bytes themselves must tell you how to do this splitting (or you need some external rule that always applies, like "a message is 5 bytes long").
This applies to all TCP connections, regardless of what language you use them from, or what weird library-specific API gets dropped on top of them (like readUTFBytes).
There are many options for protocols which can help you frame your messages. For example, you could use a length prefix. Then your messages would look like:
\x07 \x08 \x08 h e l l o \x05 \x05 \x03 m a n
\x07 gives the length of the first message, 7 bytes: \x08 \x08 h e l l o. The next byte after that message, \x05, gives the length of the second message: \x05 \x03 m a n.
You can use multibyte length prefixes if your messages need to be longer, or netstrings which use a decimal representation and a : delimiter to support arbitrary sized prefixes. There are also more sophisticated protocols which offer more features than just separating your bytes into messages. For example, there's AMP which gives you a form of RPC with arguments and responses.
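Since the server side here is Python, a minimal sketch of such a length-prefix scheme on that end might look like the following (plain sockets shown for clarity; with Twisted you would typically let something like Int32StringReceiver handle the framing for you):
import struct

def send_message(sock, payload):
    # 4-byte big-endian length prefix, then the payload itself
    sock.sendall(struct.pack('!I', len(payload)) + payload)

def recv_exact(sock, n):
    # keep calling recv() until exactly n bytes have been collected
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise EOFError("socket closed mid-message")
        buf += chunk
    return buf

def recv_message(sock):
    (length,) = struct.unpack('!I', recv_exact(sock, 4))
    return recv_exact(sock, length)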
