I need to transfer an array of varying length in which each element is a tuple of two integers. As an example:
path = [(1,1),(1,2)]
path = [(1,1),(1,2),(2,2)]
I am trying to use pack and unpack; however, since the array is of varying length, I don't know how to create a format string that both sides know in advance. I was trying to turn it into a single string with delimiters, such as:
msg = "1&1~1&2~"
sendMsg = pack("s",msg) or sendMsg = pack("s",str(msg))
on the receiving side:
path = unpack("s",msg)
but that just prints 1 in this case. I was also trying to send 4 integers as well, which send and receive fine, so long as I don't include the extra string representing the path.
sendMsg = pack("hhhh",p.direction[0],p.direction[1],p.id,p.health)
on the receive side:
x,y,id,health = unpack("hhhh",msg)
The code above was just for illustration; I was actually trying to send the format "hhhhs", but either way the path doesn't come through properly.
I will also be looking at sending a 2D array of ints, but I can't seem to figure out how to send these more 'complex' structures across the network.
Thank you for your help.
While you can use pack and unpack, I'd recommend using something like YAML or JSON to transfer your data.
Pack and unpack can lead to difficult-to-debug errors and incompatibilities if you change your interface and have different versions trying to communicate with each other.
Pickle can give security problems and the pickle format might change between Python versions.
JSON has been included in the standard Python distribution since 2.6. For YAML there is PyYAML.
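For example, here's a minimal sketch of round-tripping the path with the json module (note that JSON has no tuple type, so tuples come back as lists):

import json

path = [(1, 1), (1, 2), (2, 2)]

# serialize: tuples are encoded as JSON arrays
msg = json.dumps(path)

# deserialize: JSON arrays decode to lists, so convert back to tuples if needed
decoded = [tuple(point) for point in json.loads(msg)]
assert decoded == path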
You want some sort of serialization protocol. twisted.spread provides one such protocol (see the Banana specification or the Perspective Broker documentation). JSON or protocol buffers would be more verbose examples.
See also Comparison of data serialization formats.
If you include the message length as part of the message, then you will know how much data to read, and the entire string can be read across the network.
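A minimal sketch of that framing over TCP, assuming sock is an already-connected socket.socket:

from struct import pack, unpack

def send_msg(sock, payload):
    # prefix the payload with its length as a 4-byte network-order integer
    sock.sendall(pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    # keep reading until exactly n bytes have arrived
    data = b""
    while len(data) < n:
        chunk = sock.recv(n - len(data))
        if not chunk:
            raise ConnectionError("socket closed before full message arrived")
        data += chunk
    return data

def recv_msg(sock):
    # read the 4-byte length header, then exactly that many payload bytes
    (length,) = unpack("!I", recv_exact(sock, 4))
    return recv_exact(sock, length)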
In any case, perhaps it would help if you posted some of the code you are using to send data across the network, or at least provided more of a description.
Take a look at xdrlib, it might help. It's part of the standard library, and:
The xdrlib module supports the External Data Representation Standard as described in RFC 1014, written by Sun Microsystems, Inc. June 1987. It supports most of the data types described in the RFC.
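As a rough sketch, XDR handles the varying length by writing the element count before the pairs (note that xdrlib has since been deprecated in recent Python versions):

import xdrlib

path = [(1, 1), (1, 2), (2, 2)]

packer = xdrlib.Packer()
packer.pack_uint(len(path))  # element count first, so the reader knows how many pairs follow
for x, y in path:
    packer.pack_int(x)
    packer.pack_int(y)
payload = packer.get_buffer()

unpacker = xdrlib.Unpacker(payload)
count = unpacker.unpack_uint()
decoded = [(unpacker.unpack_int(), unpacker.unpack_int()) for _ in range(count)]
assert decoded == path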
Are pack and unpack mandatory?
If not, you could use JSON or YAML.
Don't use pickle, because it is not secure.
Related
I'm new to protobuf. I need to serialize a complex graph-like structure and share it between C++ and Python clients.
I'm trying to apply protobuf because:
It is language agnostic, has generators both for C++ and Python
It is binary. I can't afford text formats because my data structure is quite large
But Protobuf user guide says:
Protocol Buffers are not designed to handle large messages. As a general rule of thumb, if you are dealing in messages larger than a megabyte each, it may be time to consider an alternate strategy.
https://developers.google.com/protocol-buffers/docs/techniques#large-data
I have graph-like structures that are sometimes up to 1 GB in size, way above 1 MB.
Why is protobuf bad for serializing large datasets? What should I use instead?
It is just general guidance, so it doesn't apply to every case. For example, the OpenStreetMap project uses a protocol buffers based file format for its maps, and the files are often 10-100 GB in size. Another example is Google's own TensorFlow, which uses protobuf and the graphs it stores are often up to 1 GB in size.
However, OpenStreetMap does not store the entire file as a single message. Instead it consists of thousands of individual messages, each encoding a part of the map. You can apply a similar approach, so that each message only encodes e.g. one node.
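A sketch of that approach in Python, framing each message with a hand-written length prefix (unlike the Java bindings' writeDelimitedTo, the Python bindings leave the framing to you; node_pb2 and Node are hypothetical names for your generated module and message type):

import struct
import node_pb2  # hypothetical module generated by protoc from your .proto

def write_delimited(f, msg):
    data = msg.SerializeToString()
    f.write(struct.pack("<I", len(data)))  # 4-byte length prefix before each message
    f.write(data)

def read_delimited(f, msg_type):
    header = f.read(4)
    if len(header) < 4:
        return None  # end of file
    (length,) = struct.unpack("<I", header)
    msg = msg_type()
    msg.ParseFromString(f.read(length))
    return msg

# e.g. write one Node message per graph node, then read them back one at a time:
# with open("graph.bin", "rb") as f:
#     while (node := read_delimited(f, node_pb2.Node)) is not None:
#         process(node)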
The main problem with protobuf for large files is that it doesn't support random access. You'll have to read the whole file, even if you only want to access a specific item. If your application will be reading the whole file to memory anyway, this is not an issue. This is what TensorFlow does, and it appears to store everything in a single message.
If you need a random access format that is compatible across many languages, I would suggest HDF5 or sqlite.
It should be fine to use protocol buffers that are much larger than 1MB. We do it all the time at Google, and I wasn't even aware of the recommendation you're quoting.
The main problem is that you'll need to deserialize the whole protocol buffer into memory at once, so it's worth thinking about whether your data is better off broken up into smaller items so that you only need part of the data in memory at a time.
If you can't break it up, then no worries. Go ahead and use a massive protocol buffer.
R has its own format that is significantly more expressive than csv (knows about factors, for example). The extension is usually .Rdata, and it is manipulated from R using the load and save functions.
I was wondering whether the Python pandas library knows about this format? If not, is there another format (better than csv) for exchange between pandas and R?
I used to think for the longest time that you needed an R instance to deserialize R objects -- that loading a saved R object, or set of objects, amounts to reading a (binary, likely compressed) data stream and de-serializing it.
But Davor proved me wrong. An existence proof is provided by his CPAN module Statistics-R-IO, which does this in Perl. Presumably someone with enough motivation could abstract this into a C library which many other projects, including Python, could load. Or use it to save pandas data for R.
Having a better data exchange would be nice. Otherwise, you can of course use language-agnostic interchange formats such as Protocol Buffers.
(Note: CPAN.org seems to be down/slow right now. Use Google Cache if need be.)
I am fairly new to R, but the more I use it, the more I see how powerful it really is over SAS or SPSS. Just one of the major benefits, as I see them, is the ability to get and analyze data from the web. I imagine this is possible (and maybe even straightforward), but I am looking to parse JSON data that is publicly available on the web. I am not a programmer by any stretch, so any help and instruction you can provide will be greatly appreciated. Even if you just point me to a basic working example, I can probably work through it.
RJSONIO from Omegahat is another package which provides facilities for reading and writing data in JSON format.
rjson does not use S4/S3 methods and so is not readily extensible, but it is still useful. Unfortunately, it does not use vectorized operations and so is too slow for non-trivial data. Similarly, for reading JSON data into R, it is somewhat slow and so does not scale to large data, should this be an issue.
Update (new package, 2013-12-03):
jsonlite: This package is a fork of the RJSONIO package. It builds on the parser from RJSONIO but implements a different mapping between R objects and JSON strings. The C code in this package is mostly from the RJSONIO Package, the R code has been rewritten from scratch. In addition to drop-in replacements for fromJSON and toJSON, the package has functions to serialize objects. Furthermore, the package contains a lot of unit tests to make sure that all edge cases are encoded and decoded consistently for use with dynamic data in systems and applications.
The jsonlite package is easy to use and tries to convert json into data frames.
Example:
library(jsonlite)
# url returning badge data from the Stack Exchange API
url <- 'https://api.stackexchange.com/2.2/badges?order=desc&sort=rank&site=stackoverflow'
# read url and convert to data.frame
document <- fromJSON(txt=url)
Here is the missing rjson example:
library(rjson)
url <- 'http://someurl/data.json'
document <- fromJSON(file=url, method='C')
The fromJSON() functions in RJSONIO, rjson and jsonlite don't return a simple 2D data.frame for complex nested JSON objects.
To overcome this you can use tidyjson. It takes in a JSON string and always returns a data.frame. It is currently not available on CRAN; you can get it here: https://github.com/sailthru/tidyjson
Update: tidyjson is now available on CRAN; you can install it directly using install.packages("tidyjson")
For the record, rjson and RJSONIO do change the file type, but they don't really parse per se. For instance, I receive ugly MongoDB data in JSON format, convert it with rjson or RJSONIO, then use unlist and tons of manual correction to actually parse it into a usable matrix.
Try the code below, using RJSONIO, in the console:
library(RJSONIO)
library(RCurl)
json_file = getURL("https://raw.githubusercontent.com/isrini/SI_IS607/master/books.json")
json_file2 = RJSONIO::fromJSON(json_file)
head(json_file2)
To preface: I'm very new to Python (about 7 days), but I'm an experienced software engineering undergrad.
I would like to send data between machines running python scripts. The idea I had (in order to simplify things) was to concatenate the data (strings & ints) into a string and do the parsing client-side.
The UDP packets send beautifully with simple strings, but when I try to send useful data Python always complains about the data I send; specifically, Python won't let me concatenate tuples.
In order to parse the data on the client I need to separate the data with a dash character: '-'.
nodeList is of type dictionary, where the key is a string and the value is a double.
randKey = random.choice( nodeList.keys() )
data = str(randKey) +'-'+ str(nodeList[randKey])
mySocket.sendto ( data , address )
The code above produces the following error:
TypeError: coercing to Unicode: need string or buffer, tuple found
I don't understand why it thinks it is a tuple I am trying to concatenate...
So my question is: how can I correct this to keep Python happy, or can someone suggest a better way of sending the data?
Thank you in advance.
I highly suggest using Google Protocol Buffers, as implemented in Python by the protobuf package, since it will handle the serialization on both ends of the line. It has Python bindings that will allow you to easily use it with your existing Python program.
Using your example code you would create a .proto file like so:
message SomeCoolMessage {
required string key = 1;
required double value = 2;
}
Then, after generating the Python bindings with protoc, you can use it like so:
randKey = random.choice(nodeList.keys())
data = SomeCoolMessage()
data.key = randKey
data.value = nodeList[randKey]
mySocket.sendto(data.SerializeToString(), address)
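On the receiving side you would parse it back out (a sketch; SomeCoolMessage comes from the module protoc generates, whose name depends on your .proto filename):

raw, addr = mySocket.recvfrom(4096)  # same UDP socket setup as in the question
received = SomeCoolMessage()
received.ParseFromString(raw)
print(received.key, received.value)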
I'd probably use the json module to serialize the data.
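A minimal sketch over UDP (the host, port, and sample dict here are made up for illustration):

import json
import socket

nodeList = {"node-a": 0.75}  # sample data: string keys, double values
randKey = "node-a"

payload = json.dumps({"key": randKey, "value": nodeList[randKey]})

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(payload.encode("utf-8"), ("127.0.0.1", 9999))

# on the receiving side:
# raw, addr = sock.recvfrom(4096)
# msg = json.loads(raw.decode("utf-8"))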
You need to serialize the data. Pickle does this for you built in, and you can ask pickle for an ASCII representation of the data instead of binary data (see the docs), or you could use json (it also serializes the data for you); both are in the standard library. But really there are a hundred thousand different libraries that handle ALL the work for you in getting data from one machine to another. I'd suggest using a library.
Depending on speed, etc. there are different trade-offs for the various libraries. In the standard library you get HTTP, and that's about it (well, that and raw sockets). But there are others.
If super-fast speed is more important than other things, ZeroMQ or Google's protocol buffers might be valid options.
For me, I use rpyc usually, it lets me be totally lazy, and just call over to the other process across the network. It's fast enough usually.
You know that UDP has no guarantee that the data will ever show up on the other side, or that it will show up IN ORDER. For your application you may not care; I don't know, but I just thought I'd bring it up.
I need to be able to send encrypted data between a Ruby client and a Python server (and vice versa) and have been having trouble with the ruby-aes gem/library. The library is very easy to use but we've been having trouble passing data between it and the pyCrypto AES library for Python. These libraries seem to be fine when they're the only one being used, but they don't seem to play well across language boundaries. Any ideas?
Edit: We're doing the communication over SOAP and have also tried converting the binary data to base64, to no avail. Also, it's more that the encryption/decryption is almost but not exactly the same between the two (e.g., the lengths differ by one, or there are extra garbage characters on the end of the decrypted string).
(e.g., the lengths differ by one, or there are extra garbage characters on the end of the decrypted string)
I missed that bit. There's nothing wrong with your encryption/decryption. It sounds like a padding problem. AES always encodes data in blocks of 128 bits. If the length of your data isn't a multiple of 128 bits, the data should be padded before encryption and the padding needs to be removed/ignored after decryption.
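For illustration, PKCS#7 is one common padding convention (a sketch only; as the next answer notes, ruby-aes actually uses null-byte padding, so both sides must agree on the scheme):

BLOCK_SIZE = 16  # AES block size: 128 bits

def pkcs7_pad(data):
    # append n bytes, each of value n, so the padding can be stripped unambiguously
    n = BLOCK_SIZE - (len(data) % BLOCK_SIZE)
    return data + bytes([n]) * n

def pkcs7_unpad(data):
    # the last byte says how many padding bytes to remove
    return data[:-data[-1]]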
Turns out what happened was that ruby-aes automatically pads data to fill up 16 chars and sticks a null character on the end of the final string as a delimiter. PyCrypto requires you to pad to multiples of 16 chars yourself, which is how we figured out what ruby-aes was doing.
It's hard to even guess at what's happening without more information ...
If I were you, I'd check that in your Python and Ruby programs:
The keys are the same (obviously). Dump them as hex and compare each byte.
The initialization vectors are the same. This is the parameter IV in AES.new() in pyCrypto. Dump them as hex too.
The modes are the same. The parameter mode in AES.new() in pyCrypto.
There are defaults for IV and mode in pyCrypto, but don't trust that they are the same as in the Ruby implementation. Use one of the simpler modes, like CBC. I've found that different libraries have different interpretations of how the more complex modes, such as CTR, work.
Wikipedia has a great article about block cipher modes of operation.
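A sketch of pinning all three down explicitly in pyCrypto (the key and IV values here are placeholders; in practice both sides must share the same real values):

from Crypto.Cipher import AES

key = b"0123456789abcdef"  # 16-byte placeholder key; must match the Ruby side byte for byte
iv = b"fedcba9876543210"   # 16-byte placeholder IV; must also match

plaintext = b"sixteen byte msg"  # already a 16-byte multiple; otherwise pad it first

cipher = AES.new(key, AES.MODE_CBC, iv)  # spell out mode and IV instead of trusting defaults
ciphertext = cipher.encrypt(plaintext)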
Kind of depends on how you are transferring the encrypted data. It is possible that you are writing a file in one language and then trying to read it in from the other. Python (especially on Windows) requires that you specify binary mode for binary files. So in Python, assuming you want to decrypt there, you should open the file like this:
f = open('/path/to/file', 'rb')
The "b" indicates binary. And if you are writing the encrypted data to file from Python:
f = open('/path/to/file', 'wb')
f.write(encrypted_data)
Basically what Hugh said above: check the IVs, key sizes and the chaining modes to make sure everything is identical.
Test both sides independently: encode some information and check that Ruby and Python encoded it identically. You're assuming that the problem has to do with encryption, but it may just be something as simple as sending the encrypted data with puts, which throws random newlines into the data. Once you're sure they encrypt the data correctly, check that you receive exactly what you think you sent. Keep going step by step until you find the stage that corrupts the data.
Also, I'd suggest using the openssl library that's included in Ruby's standard library instead of using an external gem.