Pack array of namedtuples in Python

I need to send an array of namedtuples over a socket.
To create the array of namedtuples I use the following:
listaPeers=[]
for i in range(200):
    ipPuerto=collections.namedtuple('ipPuerto', 'ip, puerto')
    ipPuerto.ip="121.231.334.22"
    ipPuerto.puerto="8988"
    listaPeers.append(ipPuerto)
Now that it is filled, I need to pack "listaPeers[200]".
How can I do it?
Something like?:
packedData = struct.pack('XXXX',listaPeers)

First of all you are using namedtuple incorrectly. It should look something like this:
# ipPuerto is a type
ipPuerto=collections.namedtuple('ipPuerto', 'ip, puerto')
# theTuple is a tuple object
theTuple = ipPuerto("121.231.334.22", "8988")
As for packing, it depends what you want to use on the other end. If the data will be read by Python, you can just use the pickle module.
import cPickle as Pickle
pickledTuple = Pickle.dumps(theTuple)
You can pickle whole array of them at once.
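For example, a minimal sketch (assuming Python 3, where cPickle has been folded into the standard pickle module) that pickles the whole list in one call and restores it on the other end:
import collections
import pickle

ipPuerto = collections.namedtuple('ipPuerto', 'ip, puerto')
listaPeers = [ipPuerto("121.231.334.22", "8988") for _ in range(200)]

packedData = pickle.dumps(listaPeers)   # bytes, ready to send over the socket
restored = pickle.loads(packedData)     # back to a list of namedtuples
Note that unpickling on the receiving side only works if that process also has the same ipPuerto namedtuple defined under the same module path.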

It is not that simple - yes, for integers and simple numbers, it's possible to pack named tuples straight into binary data with the struct package.
However, you are holding your data as strings, not as numbers - the port is easy to convert since it is a plain integer, but the IP address requires some juggling.
def ipv4_from_str(ip_str):
    parts = ip_str.split(".")
    result = 0
    for part in parts:
        result <<= 8
        result += int(part)
    return result

def ip_puerto_gen(list_of_ips):
    for ip_puerto in list_of_ips:
        yield(ipv4_from_str(ip_puerto.ip))
        yield(int(ip_puerto.puerto))

def pack(list_of_ips):
    return struct.pack(">" + "II" * len(list_of_ips),
                       *ip_puerto_gen(list_of_ips)
                       )
And you then use the "pack" function from here to pack your structure as you seem to want.
But first, note that you are creating your "listaPeers" incorrectly (indexing it as "listaPeers[200]" will simply fail with an IndexError) - use an empty list, and call its append method to add a new named tuple with an ip/port pair for each element:
listaPeers = []
ipPuerto = collections.namedtuple('ipPuerto', 'ip, puerto')
for x in range(200):
    new_element = ipPuerto("123.123.123.123", "8192")
    listaPeers.append(new_element)

data = pack(listaPeers)
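On the receiving side the same layout can be unpacked again; here is a minimal sketch (the unpack helper and the use of socket.inet_ntoa are my own additions, not part of the answer above):
import collections
import socket
import struct

def unpack_peers(packed_data):
    count = len(packed_data) // 8            # each record is two unsigned 32-bit ints
    values = struct.unpack(">" + "II" * count, packed_data)
    ipPuerto = collections.namedtuple('ipPuerto', 'ip, puerto')
    peers = []
    for ip_int, port in zip(values[0::2], values[1::2]):
        ip_str = socket.inet_ntoa(struct.pack(">I", ip_int))   # back to dotted-quad form
        peers.append(ipPuerto(ip_str, str(port)))
    return peers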

I seem to recall that pickle is considered insecure in server processes, if the server process is receiving pickled data from untrusted clients.
You might want to come up with some sort of separator character(s) for the records and fields (perhaps \0 and \001 or \376 and \377). Then putting together a message is kind of like a text file broken up into records and fields separated by spaces and newlines. Or for that matter, you could use spaces and newlines, if your normal data doesn't include these.
I find this module very valuable for framing data in socket-based protocols:
http://stromberg.dnsalias.org/~strombrg/bufsock.html
It lets you do things like "read up until the next null byte" or "read the next 10 characters" - without needing to worry about the complexities of IP aggregating or splitting packets.
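A minimal sketch of the separator idea mentioned above, assuming \0 between fields and \001 between records (both separators are just example choices):
import collections

FIELD_SEP = b"\x00"
RECORD_SEP = b"\x01"

def encode_peers(list_of_peers):
    # One record per peer, fields separated by FIELD_SEP
    return RECORD_SEP.join(
        FIELD_SEP.join([p.ip.encode(), p.puerto.encode()]) for p in list_of_peers
    )

def decode_peers(data):
    ipPuerto = collections.namedtuple('ipPuerto', 'ip, puerto')
    return [
        ipPuerto(*(field.decode() for field in record.split(FIELD_SEP)))
        for record in data.split(RECORD_SEP) if record
    ]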

Related

Python - Import formatted lines as indexed list objects

I am writing a minor OP5 plugin in Python 2.7 (version is out of my hands) that iterates over a multidimensional list that verifies fallback zip downloads have gone as they should.
Up until now I have put each host with their IP address in a multidimensional list looking like (cut short for brevity):
fallback = [
    ["host1", "192.168.1.3"],
    ["host2", "192.168.15.59"]
]
...and so on.
This lets me iterate through fallback[i] and use that along with fallback[i][1] for the IP address; the rest of the script uses both pieces of information for various tasks and string manipulations. The script as it is now is mechanically sound but relies on the availability of these indexes.
There is however a hidden file (.fallbackinfo) containing the same information for another script but it is written for perl, same as the script that uses that file as a source.
The file looks like this:
#hosts = (
    ["host1", "192.168.1.3", "type of firmware", "subfolder"],
    ["host2", "192.168.15.59", "type of firmware", "subfolder"],
);
I wish to import this into an iterable multidimensional list in my Python script, but am getting incredibly stuck.
My current attempt is the closest I have gotten:
with open("/home/runninguser/.fallbackinfo") as f:
    lines = []
    for line in f:
        lines.append(line.rstrip().strip())
    fallback = lines[1:len(lines)-1]
This has successfully made the list look as I want it, but all lines get imported as str objects. I have attempted to use list() to force the object to become a list, but most of the time that makes each character in the line become a separate list element instead. The network in question is cut off from internet access, so I have to rely on built-in modules. My interpretation is that since it is formatted as a list, it should somehow be able to be interpreted as a list.
Can this be done at all, and if so, how?
You can use the json package (built-in) to achieve this:
import json

with open("/home/runninguser/.fallbackinfo") as f:
    # For each line
    for line in f:
        # If the line starts with a bracket
        if line.strip()[0] == "[":
            # Print the line after removing spaces in front and the comma in the back
            # and converting it into a list
            print(json.loads(line.strip().rstrip(",")))
If you now use the type() function, you will see the list-formatted strings are now <class 'list'>
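To collect the rows into the multidimensional list the question asks for (rather than printing them), the same idea can be extended; a small sketch:
import json

fallback = []
with open("/home/runninguser/.fallbackinfo") as f:
    for line in f:
        stripped = line.strip()
        if stripped.startswith("["):
            # Parse the bracketed row after dropping the trailing comma
            fallback.append(json.loads(stripped.rstrip(",")))

# fallback[0][0] -> "host1", fallback[0][1] -> "192.168.1.3"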

Building/Dissecting complicated multi-layered scapy packet

I am working on writing scapy packets to handle many of the DMTF manageability packets, and have been successful with most of them thus far. However, I have now come to Redfish over PLDM in this spec: https://www.dmtf.org/sites/default/files/standards/documents/DSP0218_1.0.0.pdf
Specifically I am trying to figure out how to properly model the Dictionary Binary Format table in section 7.2.3.2. The part that is troubling me is that this table basically contains an array of Packets that describe, among other things, offsets and lengths of strings that sit at the end of the packet. So there will be a number of these Packets (which I would use a PacketListField for), followed by a list of strings that they describe. In other words, there are 'N' of these Packets/Headers followed by 'N' variable-length strings, and the headers contain the length and the offset of where each string begins.
I can dissect all this manually, but was hoping to use FieldLenField and StrLenField to do this; however, I am unable to figure out a slick way of doing it. I've only been playing with Scapy for a few weeks, so I am hoping somebody with more experience could help me out and provide some direction.
If the name string were part of the Dictionary Entry, I would do something like the following, however since it is not, I'm not sure how to link the FieldLenField to the StrLenField.
class beJtupleF(Packet):
    name = "beJtupleF"
    fields_desc = [
        BitEnumField("PrincipleDataType",0,4,BEJFormatCodes),
        BitField("reserved_flag",0,1),
        BitField("nullable_property",0,1),
        BitField("read_only",0,1),
        BitField("deferred_binding",0,1)
    ]

class DictionaryEntry(Packet):
    name = "Dictionary Entry"
    fields_desc = [
        PacketField("Format",beJtupleF(),beJtupleF),
        ShortField("SequenceNumber",0),
        ShortField("ChildPointerOffset",0),
        ShortField("ChildCount",0),
        FieldLenField("NameLength",None,"Name","B"),  # length of 'Name'
        ShortField("NameOffset",0),  # Not sure how to fill this one in automatically either :-)
        StrLenField("Name","",length_from=lambda pkt: pkt.NameLength),  # Variable len Name
    ]

python: list element in CSV file

I have a csv file with the following structure:
Id,Country,Cities
1,Canada,"['Toronto','Ottawa','Montreal']"
2,Italy,"['Rome','Milan','Naples', 'Palermo']"
3,France,"['Paris','Cannes','Lyon']"
4,Spain,"['Seville','Alicante','Barcelona']"
The last column contains a list, but it is represented as a string so that it is treated as a single element. When parsing the file, I need to have this element as a list, not a string. So far I've found the way to convert it:
L = "['Toronto','Ottawa','Montreal']"
seq = ast.literal_eval(L)
Since I'm a newbie in Python, my question is: is this the normal way of doing it, is there a better way to represent lists in CSV so that I don't have to do conversions, or is there a simpler way to convert?
Thanks!
Using ast.literal_eval(...) will work, but it requires special syntax that other CSV-reading software won't recognize, and uses an eval statement which is a red flag.
Using eval can be dangerous, even though in this case you're using the safer literal_eval option which is more restrained than the raw eval function.
Usually what you'll see in CSV files that have many values in a single column is that they'll use a simple delimiter and quote the field.
For instance:
ID,Country,Cities
1,Canada,"Toronto;Ottawa;Montreal"
Then in python, or any other language, it becomes trivial to read without having to resort to eval:
import csv

with open("data.csv") as fobj:
    reader = csv.reader(fobj)
    field_names = next(reader)
    rows = []
    for row in reader:
        row[-1] = row[-1].split(";")
        rows.append(row)
Issues with ast.literal_eval
Even though the ast.literal_eval function is much safer than using a regular eval on user input, it still might be exploitable. The documentation for literal_eval has this warning:
Warning: It is possible to crash the Python interpreter with a sufficiently large/complex string due to stack depth limitations in Python’s AST compiler.
A demonstration of this can be found here:
>>> import ast
>>> ast.literal_eval("()" * 10 ** 6)
[1] 48513 segmentation fault python
I'm definitely not an expert, but giving a user the ability to crash a program and potentially exploit some obscure memory vulnerability is bad, and in this use-case can be avoided.
If the reason you want to use literal_eval is to get proper typing, and you're positive that the input data is 100% trusted, then I suppose it's fine to use. But, you could always wrap the function to perform some sanity checks:
def sanely_eval(value: str, max_size: int = 100_000) -> object:
    if len(value) > max_size:
        raise ValueError(f"len(value) is greater than the max_size={max_size!r}")
    return ast.literal_eval(value)
But, depending on how you're creating and using the CSV files, this may make the data less portable, since it's a python-specific format.
If you can control the CSV, you could separate the items with some other known character that isn't going to be in a city and isn't a comma. Say colon (:).
Then row one, for example, would look like this:
1,Canada,Toronto:Ottawa:Montreal
When it comes to processing the data, you'll have that whole element, and you can just do
cities.split(':')
If you want to go the other way (you have the cities in a Python list, and you want to create this string) you can use join()
':'.join(['Toronto', 'Ottawa', 'Montreal'])
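For instance, a small sketch of writing the colon-separated format with the csv module (the file name and rows here are just examples):
import csv

rows = [
    [1, "Canada", ["Toronto", "Ottawa", "Montreal"]],
    [2, "Italy", ["Rome", "Milan", "Naples", "Palermo"]],
]

with open("data.csv", "w", newline="") as fobj:
    writer = csv.writer(fobj)
    writer.writerow(["Id", "Country", "Cities"])
    for row_id, country, cities in rows:
        # Join the list of cities into a single colon-separated field
        writer.writerow([row_id, country, ":".join(cities)])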
For the specific structure of the csv, you could convert cities to list like this:
cities = '''"['Rome','Milan','Naples', 'Palermo']"'''
cities = cities[2:-2] # remove "[ and ]"
print(cities) # 'Rome','Milan','Naples', 'Palermo'
cities = cities.split(',') # convert to list
print(cities) # ["'Rome'", "'Milan'", "'Naples'", " 'Palermo'"]
cities = [x.strip() for x in cities] # remove leading or following spaces (if exists)
print(cities) # ["'Rome'", "'Milan'", "'Naples'", "'Palermo'"]
cities = [x[1:-1] for x in cities] # remove quotes '' from each city
print(cities) # ['Rome', 'Milan', 'Naples', 'Palermo']

Saving dictionaries to file (numpy and Python 2/3 friendly)

I want to do hierarchical key-value storage in Python, which basically boils down to storing dictionaries to files. By that I mean any type of dictionary structure, that may contain other dictionaries, numpy arrays, serializable Python objects, and so forth. Not only that, I want it to store numpy arrays space-optimized and play nice between Python 2 and 3.
Below are methods I know are out there. My question is what is missing from this list and is there an alternative that dodges all my deal-breakers?
Python's pickle module (deal-breaker: inflates the size of numpy arrays a lot)
Numpy's save/savez/load (deal-breaker: Incompatible format across Python 2/3)
PyTables replacement for numpy.savez (deal-breaker: only handles numpy arrays)
Using PyTables manually (deal-breaker: I want this for constantly changing research code, so it's really convenient to be able to dump dictionaries to files by calling a single function)
The PyTables replacement of numpy.savez is promising, since I like the idea of using hdf5 and it compresses the numpy arrays really efficiently, which is a big plus. However, it does not take any type of dictionary structure.
Lately, what I've been doing is to use something similar to the PyTables replacement, but enhancing it to be able to store any type of entries. This actually works pretty well, but I find myself storing primitive data types in length-1 CArrays, which is a bit awkward (and ambiguous to actual length-1 arrays), even though I set chunksize to 1 so it doesn't take up that much space.
Is there something like that already out there?
Thanks!
After asking this two years ago, I started coding my own HDF5-based replacement of pickle/np.save. Ever since, it has matured into a stable package, so I thought I would finally answer and accept my own question, because it is by design exactly what I was looking for:
https://github.com/uchicago-cs/deepdish
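A minimal usage sketch (assuming the deepdish package is installed; the file name and contents are just examples):
import numpy as np
import deepdish as dd

data = {
    'params': {'alpha': 0.1, 'label': 'run-1'},
    'weights': np.random.randn(100, 10),   # numpy arrays go into HDF5, compressed
}

dd.io.save('results.h5', data)             # write the nested dict to an HDF5 file
restored = dd.io.load('results.h5')        # read it back as a dict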
I recently found myself with a similar problem, for which I wrote a couple of functions for saving the contents of dicts to a group in a PyTables file, and loading them back into dicts.
They process nested dictionary and group structures recursively, and handle objects with types that are not natively supported by PyTables by pickling them and storing them as string arrays. It's not perfect, but at least things like numpy arrays will be stored efficiently. There's also a check included to avoid inadvertently loading enormous structures into memory when reading the group contents back into a dict.
import tables
import cPickle
import warnings

def dict2group(f, parent, groupname, dictin, force=False, recursive=True):
    """
    Take a dict, shove it into a PyTables HDF5 file as a group. Each item in
    the dict must have a type and shape compatible with PyTables Array.

    If 'force == True', any existing child group of the parent node with the
    same name as the new group will be overwritten.

    If 'recursive == True' (default), new groups will be created recursively
    for any items in the dict that are also dicts.
    """
    try:
        g = f.create_group(parent, groupname)
    except tables.NodeError as ne:
        if force:
            pathstr = parent._v_pathname + '/' + groupname
            f.removeNode(pathstr, recursive=True)
            g = f.create_group(parent, groupname)
        else:
            raise ne
    for key, item in dictin.iteritems():
        if isinstance(item, dict):
            if recursive:
                dict2group(f, g, key, item, recursive=True)
        else:
            if item is None:
                item = '_None'
            f.create_array(g, key, item)
    return g


def group2dict(f, g, recursive=True, warn=True, warn_if_bigger_than_nbytes=100E6):
    """
    Traverse a group, pull the contents of its children and return them as
    a Python dictionary, with the node names as the dictionary keys.

    If 'recursive == True' (default), we will recursively traverse child
    groups and put their children into sub-dictionaries, otherwise sub-
    groups will be skipped.

    Since this might potentially result in huge arrays being loaded into
    system memory, the 'warn' option will prompt the user to confirm before
    loading any individual array that is bigger than some threshold (default
    is 100MB)
    """
    def memtest(child, threshold=warn_if_bigger_than_nbytes):
        mem = child.size_in_memory
        if mem > threshold:
            print '[!] "%s" is %iMB in size [!]' % (child._v_pathname, mem / 1E6)
            confirm = raw_input('Load it anyway? [y/N] >>')
            if confirm.lower() == 'y':
                return True
            else:
                print "Skipping item \"%s\"..." % g._v_pathname
        else:
            return True

    outdict = {}
    for child in g:
        try:
            if isinstance(child, tables.group.Group):
                if recursive:
                    item = group2dict(f, child)
                else:
                    continue
            else:
                if memtest(child):
                    item = child.read()
                    if isinstance(item, str):
                        if item == '_None':
                            item = None
                else:
                    continue
            outdict.update({child._v_name: item})
        except tables.NoSuchNodeError:
            warnings.warn('No such node: "%s", skipping...' % repr(child))
            pass
    return outdict
It's also worth mentioning joblib.dump and joblib.load, which tick all of your boxes apart from Python 2/3 cross-compatibility. Under the hood they use np.save for numpy arrays and cPickle for everything else.
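A small sketch of that approach (the file name is just an example):
import numpy as np
import joblib

data = {'weights': np.random.randn(1000, 50), 'meta': {'run': 3}}

joblib.dump(data, 'data.joblib')    # numpy arrays are written efficiently under the hood
restored = joblib.load('data.joblib')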
I tried playing with np.memmap for saving an array of dictionaries. Say we have the dictionary:
a = np.array([str({'a': 1, 'b': 2, 'c': [1, 2, 3, {'d': 4}]})])
first I tried to directly save it to a memmap:
f = np.memmap('stack.array', dtype=dict, mode='w+', shape=(100,))
f[0] = a
# CRASHES when reopening since it loses the memory pointer

f = np.memmap('stack.array', dtype=object, mode='w+', shape=(100,))
f[0] = a
# CRASHES when reopening for the same reason
the way it worked is converting the dictionary to a string:
f = np.memmap('stack.array', dtype='|S1000', mode='w+', shape=(100,))
f[0] = str(a)
this works and afterwards you can eval(f[0]) to get the value back.
I do not know the advantage of this approach over the others, but it deserves a closer look.
I absolutely recommend a python object database like ZODB. It seems pretty well suited for your situation, considering you store objects (literally whatever you like) to a dictionary - this means you can store dictionaries inside dictionaries. I've used it in a wide range of problems, and the nice thing is that you can just hand somebody the database file (the one with a .fs extension). With this, they'll be able to read it in, and perform any queries they wish, and modify their own local copies. If you wish to have multiple programs simultaneously accessing the same database, I'd make sure to look at ZEO.
Just a silly example of how to get started:
from ZODB import DB
from ZODB.FileStorage import FileStorage
from ZODB.PersistentMapping import PersistentMapping
import transaction
from persistent import Persistent
from persistent.dict import PersistentDict
from persistent.list import PersistentList
# Defining database type and creating connection.
storage = FileStorage('/path/to/database/zodbname.fs')
db = DB(storage)
connection = db.open()
root = connection.root()
# Define and populate the structure.
root['Vehicle'] = PersistentDict() # Upper-most dictionary
root['Vehicle']['Tesla Model S'] = PersistentDict() # Object 1 - also a dictionary
root['Vehicle']['Tesla Model S']['range'] = "208 miles"
root['Vehicle']['Tesla Model S']['acceleration'] = 5.9
root['Vehicle']['Tesla Model S']['base_price'] = "$71,070"
root['Vehicle']['Tesla Model S']['battery_options'] = ["60kWh","85kWh","85kWh Performance"]
# more attributes here
root['Vehicle']['Mercedes-Benz SLS AMG E-Cell'] = PersistentDict() # Object 2 - also a dictionary
# more attributes here
# add as many objects with as many characteristics as you like.
# committing changes; up until this point things can be rolled back
transaction.get().commit()
transaction.get().abort()
connection.close()
db.close()
storage.close()
Once the database is created it's very easy use. Since it's an object database (a dictionary), you can access objects very easily:
#after it's opened (lines from the very beginning, up to and including root = connection.root() )
>> root['Vehicle']['Tesla Model S']['range']
'208 miles'
You can also display all of the keys (and do all other standard dictionary things you might want to do):
>> root['Vehicle']['Tesla Model S'].keys()
['acceleration', 'range', 'battery_options', 'base_price']
Last thing I want to mention is that keys can be changed: Changing the key value in python dictionary. Values can also be changed - so if your research results change because you change your method or something you don't have to start the entire database from scratch (especially if everything else is still okay). Be careful with doing both of these. I put in safety measures in my database code to make sure I'm aware of my attempts to overwrite keys or values.
** ADDED **
# added imports
import numpy as np
from tempfile import TemporaryFile
outfile = TemporaryFile()
# insert into definition/population section
np.save(outfile,np.linspace(-1,1,10000))
root['Vehicle']['Tesla Model S']['arraydata'] = outfile
# check to see if it worked
>>> root['Vehicle']['Tesla Model S']['arraydata']
<open file '<fdopen>', mode 'w+b' at 0x2693db0>
outfile.seek(0)# simulate closing and re-opening
A = np.load(root['Vehicle']['Tesla Model S']['arraydata'])
>>> print A
array([-1. , -0.99979998, -0.99959996, ..., 0.99959996,
0.99979998, 1. ])
You could also use numpy.savez() for compressed saving of multiple numpy arrays in this exact same way.
This is not a direct answer. Anyway, you may also be interested in JSON. Have a look at 13.10. Serializing Datatypes Unsupported by JSON. It shows how to extend the format for unsupported types.
The whole chapter from "Dive into Python 3" by Mark Pilgrim is definitely a good read, if only to know what is out there...
Update: Possibly an unrelated idea, but... I have read somewhere that one of the reasons why XML was finally adopted for data exchange in heterogeneous environments was a study that compared a specialized binary format with zipped XML. The conclusion for you could be to use a possibly less space-efficient solution and compress it via zip or another well-known algorithm. Using a known algorithm helps when you need to debug (unzip and then look at the text file by eye).

How to read JSON from socket in python? (Incremental parsing of JSON)

I have a socket opened and I'd like to read some json data from it. The problem is that the json module from the standard library can only parse from strings (load only reads the whole file and calls loads inside). It even looks like everything inside the module depends on the parameter being a string.
This is a real problem with sockets, since you can never read it all into a string, and you don't know how many bytes to read before you actually parse it.
So my questions are: Is there a (simple and elegant) workaround? Is there another json library that can parse data incrementally? Is it worth writing it myself?
Edit: It is XBMC jsonrpc api. There are no message envelopes, and I have no control over the format. Each message may be on a single line or on several lines.
I could write some simple parser that needs only a getc function in some form and feed it using s.recv(1), but this doesn't seem like a very pythonic solution and I'm a little lazy to do that :-)
Edit: given that you aren't defining the protocol, this isn't useful, but it might be useful in other contexts.
Assuming it's a stream (TCP) socket, you need to implement your own message framing mechanism (or use an existing higher level protocol that does so). One straightforward way is to define each message as a 32-bit integer length field, followed by that many bytes of data.
Sender: take the length of the JSON packet, pack it into 4 bytes with the struct module, send it on the socket, then send the JSON packet.
Receiver: Repeatedly read from the socket until you have at least 4 bytes of data, use struct.unpack to unpack the length. Read from the socket until you have at least that much data and that's your JSON packet; anything left over is the length for the next message.
If at some point you're going to want to send messages that consist of something other than JSON over the same socket, you may want to send a message type code between the length and the data payload; congratulations, you've invented yet another protocol.
Another, slightly more standard, method is DJB's Netstrings protocol; it's very similar to the system proposed above, but with text-encoded lengths instead of binary; it's directly supported by frameworks such as Twisted.
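A minimal sketch of the length-prefix framing described above, assuming a connected TCP socket sock on each side (this only works if you control both ends of the protocol, which is not the case for the XBMC API in the question):
import json
import struct

def send_json(sock, obj):
    payload = json.dumps(obj).encode('utf-8')
    # 4-byte big-endian length header, followed by the JSON payload
    sock.sendall(struct.pack('>I', len(payload)) + payload)

def recv_exactly(sock, n):
    buf = b''
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError('socket closed mid-message')
        buf += chunk
    return buf

def recv_json(sock):
    (length,) = struct.unpack('>I', recv_exactly(sock, 4))
    return json.loads(recv_exactly(sock, length).decode('utf-8'))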
If you're getting the JSON from an HTTP stream, use the Content-Length header to get the length of the JSON data. For example:
import httplib
import json

h = httplib.HTTPConnection('graph.facebook.com')
h.request('GET', '/19292868552')
response = h.getresponse()
content_length = int(response.getheader('Content-Length','0'))

# Read data until we've read Content-Length bytes or the socket is closed
data = ''
while len(data) < content_length or content_length == 0:
    s = response.read(content_length - len(data))
    if not s:
        break
    data += s

# We now have the full data -- decode it
j = json.loads(data)
print j
What you want(ed) is ijson, an incremental json parser.
It is available here: https://pypi.python.org/pypi/ijson/ . The usage should be as simple as (copying from that page):
import ijson.backends.python as ijson
for item in ijson.items(file_obj):
    # ...
(for those who prefer something self-contained - in the sense that it relies only on the standard library: I wrote yesterday a small wrapper around json - but just because I didn't know about ijson. It is probably much less efficient.)
EDIT: since I found out that in fact (a cythonized version of) my approach was much more efficient than ijson, I have packaged it as an independent library - see here also for some rough benchmarks: http://pietrobattiston.it/jsaone
Do you have control over the json? Try writing each object as a single line. Then do a readline call on the socket as described here.
infile = sock.makefile()
while True:
    line = infile.readline()
    if not line:
        break
    # ...
    result = json.loads(line)
Skimming the XBMC JSON RPC docs, I think you want an existing JSON-RPC library - you could take a look at:
http://www.freenet.org.nz/dojo/pyjson/
If that's not suitable for whatever reason, it looks to me like each request and response is contained in a JSON object (rather than a loose JSON primitive that might be a string, array, or number), so the envelope you're looking for is the '{ ... }' that defines a JSON object.
I would, therefore, try something like (pseudocode):
while not dead:
    read from the socket and append it to a string buffer
    set a depth counter to zero
    walk each character in the string buffer:
        if you encounter a '{':
            increment depth
        if you encounter a '}':
            decrement depth
            if depth is zero:
                remove what you have read so far from the buffer
                pass that to json.loads()
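A rough Python translation of that pseudocode (it naively ignores braces that appear inside JSON string values, so treat it as a starting point rather than a robust parser):
import json

def iter_json_objects(sock):
    buf = ''
    while True:
        chunk = sock.recv(4096)
        if not chunk:
            return
        buf += chunk.decode('utf-8')
        # Repeatedly cut complete {...} objects off the front of the buffer
        while True:
            depth = 0
            end = None
            for i, ch in enumerate(buf):
                if ch == '{':
                    depth += 1
                elif ch == '}':
                    depth -= 1
                    if depth == 0:
                        end = i + 1
                        break
            if end is None:
                break                      # no complete object yet, read more
            start = buf.index('{')         # skip any leading whitespace/newlines
            yield json.loads(buf[start:end])
            buf = buf[end:]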
You may find JSON-RPC useful for this situation. It is a remote procedure call protocol that should allow you to call the methods exposed by the XBMC JSON-RPC. You can find the specification on Trac.
res = str(s.recv(4096), 'utf-8') # Getting a response as string
res_lines = res.splitlines() # Split the string to an array
last_line = res_lines[-1] # Normally, the last one is the json data
pair = json.loads(last_line)
https://github.com/A1vinSmith/arbitrary-python/blob/master/sockets/loopHost.py
