The official Python documentation mentions that the 'struct' module is used to convert between Python values and binary data structures.
What do binary data structures refer to here? As I understand it, does the data structure refer to the packet structure as defined in network-related C functions?
Does struct.pack(fmt, v1, v2) build a C-equivalent structure of the fields v1, v2 in format fmt? For example, if I am building an IP packet, is my fmt the IP header layout and are the values the IP header fields?
I am referring to this example while understanding how network packets can be built.
"Binary data structures" here refers to the raw layout of the data in memory. Python's objects are far more complicated than a simple C struct: there is a significant amount of header data in Python objects that makes common tasks simpler for the Python interpreter.
Your interpretation is largely correct. The other important thing to note is that we specify a particular byte order, which may or may not be the same byte order used by a standard C structure (it depends on your machine architecture).
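For example, here is a small sketch (my own illustration, not from the original answer) of packing a few IPv4-header-style fields in network byte order with struct.pack; the field values are made up:

    # Pack the first fields of an IPv4 header in network (big-endian) byte order.
    import struct

    version_ihl = (4 << 4) | 5      # IPv4, header length = 5 * 32-bit words
    tos = 0                         # type of service
    total_length = 20               # header only, no payload
    identification = 0x1C46

    header_start = struct.pack('!BBHH', version_ihl, tos, total_length, identification)
    print(header_start.hex())        # '450000141c46'
    print(struct.calcsize('!BBHH'))  # 6 bytes, laid out exactly as the format string says

    # struct.unpack reverses the operation
    print(struct.unpack('!BBHH', header_start))  # (69, 0, 20, 7238)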
Say I have a huge immutable dataset represented as say a tuple.
Let's say this dataset consumes much of the working memory, so it is impossible to copy it.
Is there a way in Python to share that tuple with other Python processes on the same machine, such that:
the data does not need to be copied, neither wholly nor in small parts
access to the data is fast and does not rely on IPC like sockets and pipes
I don't have to represent the data as raw shared memory, i.e. I can keep using it as a tuple
the representation maintains immutability semantics - i.e. I can't easily overwrite the memory and ruin computations
ideally it would be cross-platform, or at least Windows + Linux.
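(For context only, a rough sketch of the closest standard-library mechanism I'm aware of, multiprocessing.shared_memory from Python 3.8+: it shares raw bytes between processes without sockets or pipes, but the tuple still has to be unpickled on access, so it only partially satisfies the requirements above.)

    # Sketch: share the pickled bytes via shared memory; reconstructing the
    # tuple still copies it, so the "no copies" requirement is not fully met.
    import pickle
    from multiprocessing import shared_memory

    data = tuple(range(1000))            # stand-in for the large dataset
    payload = pickle.dumps(data)

    shm = shared_memory.SharedMemory(create=True, size=len(payload))
    shm.buf[:len(payload)] = payload     # one copy into shared memory

    # In another process: attach by name and unpickle (another copy)
    other = shared_memory.SharedMemory(name=shm.name)
    restored = pickle.loads(bytes(other.buf[:len(payload)]))

    other.close()
    shm.close()
    shm.unlink()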
I was reading an article on the benefits of cache oblivious data structures and found myself wondering if the Python implementations (CPython) use this approach? If not, is there a technical limitation preventing it?
I would say this is mostly irrelevant for built-in (standard library) Python data structures.
Creating a new data type in Python means creating a class, which is not a bare-bones wrapper of underlying primitive types or method pointers, but rather a particular kind of struct that carries lots of additional metadata coming from the Python object data model.
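A rough illustration of that overhead (exact numbers vary by CPython build):

    # Even a small int is a full Python object with header overhead, unlike a
    # bare machine word; array.array packs raw values without per-item objects.
    import array
    import sys

    print(sys.getsizeof(1))                       # ~28 bytes on 64-bit CPython
    print(sys.getsizeof([0] * 1000))              # a list stores object pointers
    print(array.array('q', [0] * 1000).itemsize)  # 8 bytes per packed int64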
There is no native tree data structure in Python. There are lists, arrays, and array-based hash tables (dict, set), along with some extensions to these in the collections module. Third-party tree / trie / etc. implementations are free to offer a cache-oblivious implementation if it suits the intended usage. This would include CPython C-level implementations such as custom extension modules or code built with a tool like Cython.
NumPy ndarray is a contiguous array data structure for which the user may choose the data type (i.e. the user could, in theory, choose a weird data type that is not easily made into a multiple of the machine architecture's cache size). Perhaps some customization could be improved there, for fixed data type (and maybe the same is true for array.array), but I am wondering how many array / linear algebra algorithms benefit from some sort of customized cache obliviousness -- normally these sorts of libraries are written to assume use of a particular data type, like int32 or float64, specifically based on the cache size, and employ dynamic memory reallocation, like doubling, to amortize cost of certain operations. For example, your linked article mentions that finding the max over an array is "intrinsically" cache oblivious ... because it's contiguous, you make the maximum possible use of each cache line you read, and you only read the minimal number of cache lines. Perhaps for treating an array like a heap or something, you could be clever about rearranging the memory layout to be optimal regardless of cache size, but it wouldn't be the role of a general purpose array to have its implementation customized like that based on a very specialized use case (an array having the heap property).
In short, I would turn the question around on you and say, given the data structures that are standard in Python, do you see particular trade-offs between dynamic resizing, dynamic typing and (perhaps most importantly) general random access pattern assumptions vs. having a cache oblivious implementation backing them?
I have extracted 6 months of email metadata and saved it as a CSV file. The CSV now contains only two columns (from and to email addresses). I want to build a graph where the vertices are the people I communicated with and who communicated with me, and the edges are communication links labeled by how many emails were exchanged. What is the best approach for going about this?
One approach is to use Linked Data principles (although not advisable if you are short on time and don't have a background in Linked Data). Here's a possible approach:
Depict each entity as a URI
Use an existing ontology (such as foaf) to describe the data
Transform the data into Resource Description Framework (RDF)
Use an RDF visualization tool.
Since RDF is inherently a graph, you will be able to visualize your data as well as extend it.
If you are unfamiliar with Linked Data, another way to view the graphs is to use Pajek (http://vlado.fmf.uni-lj.si/pub/networks/pajek/). This approach is much simpler but lacks the benefits of semantic interoperability, if you care about them in the first place.
Cytoscape might be able to import your data in that format and build a network from it.
http://www.cytoscape.org/
Your question (while mentioning Python) does not say what part or how much you want to do with Python. I will assume Python is a tool you know but that the main goal is to get the data visualized. In that case:
1) Use the Gephi network analysis tool - there are tools that can use your CSV file as-is, and Gephi is one of them. In your case edge weights need to be preserved (= the number of emails exchanged between two email addresses), which can be done using the "mixed" variation of Gephi's CSV format.
2) Another option is to pre-process your CSV file (e.g. using Python), calculate edge weights (the number of emails between every two email addresses), and save the result in any format you like; a minimal sketch is shown below. The result can then be visualized in network analysis tools (such as Gephi) or directly in Python (e.g. using https://graph-tool.skewed.de).
Here's an example of an email network analysis project (though their graph does not show weights).
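Here is the sketch of option 2 (the file names and column order are assumptions, not from the original question):

    # Count emails between each pair of addresses and write a weighted edge
    # list that Gephi (or graph-tool) can import. Assumes the input CSV has
    # exactly two columns: sender, recipient.
    import csv
    from collections import Counter

    weights = Counter()
    with open("emails.csv", newline="") as f:
        for sender, recipient in csv.reader(f):
            weights[(sender, recipient)] += 1

    with open("edges.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Source", "Target", "Weight"])
        for (sender, recipient), count in weights.items():
            writer.writerow([sender, recipient, count])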
I'm developing a command responder with pysnmp, based on
http://pysnmp.sourceforge.net/examples/current/v3arch/agent/cmdrsp/v2c-custom-scalar-mib-objects.html
My intention is to answer the GET messages for my managed objects by reading the SNMP data from a text file (updated over time).
I'm polling the responder using snmpB and drawing a graph of the polled object's value over time.
I've successfully modified the example to export my first managed object, adding it with mibBuilder.exportSymbols() and retrieving the values from the text file in the modified getValue method. I'm able to poll this object successfully; it's a Counter32 type object.
The next step is to handle other objects whose value types differ from the "supported" classes like Integer32, Counter32, and OctetString.
I need to handle floating-point values or other specific data formats defined within MIB files, because snmpB expects these specific formats to plot the graph correctly.
Unfortunately I can't figure out a way to do this.
Hope someone can help,
Mark
EDIT 1
The textual-convention I need to implement is the Float32TC defined in FLOAT-TC-MIB from RFC6340:
Float32TC ::= TEXTUAL-CONVENTION
    STATUS      current
    DESCRIPTION "This type represents a 32-bit (4-octet) IEEE
                 floating-point number in binary interchange format."
    REFERENCE   "IEEE Standard for Floating-Point Arithmetic,
                 Standard 754-2008"
    SYNTAX      OCTET STRING (SIZE(4))
There is no native floating point type in SNMP, and you can't add radically new types to the protocol. But you can put additional constraints on existing types or modify value representation via TEXTUAL-CONVENTION.
To represent floating point numbers you have two options:
encode the floating-point number into an octet stream and pass it as the OCTET STRING type (RFC 6340)
use the INTEGER type along with some TEXTUAL-CONVENTION to represent the integer as a float
Whatever values are defined in a MIB, they are always based on some built-in SNMP type.
You could automatically generate pysnmp MibScalar classes from your ASN.1 MIB with the pysmi tool, and then manually add MibScalarInstance classes with some system-specific code, thus linking pysnmp to your data sources (like text files).
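As a sketch of the first option (my own illustration, not part of the original answer), the struct module can produce the 4-octet IEEE 754 encoding that Float32TC calls for; wrapping the bytes in pysnmp's OctetString type is left out:

    # Pack a Python float into the 4-octet, big-endian IEEE 754 representation
    # described by Float32TC, and unpack it again.
    import struct

    def float32_to_octets(value):
        # '>f' = big-endian 32-bit IEEE 754 float, i.e. exactly 4 octets
        return struct.pack('>f', value)

    def octets_to_float32(octets):
        return struct.unpack('>f', octets)[0]

    encoded = float32_to_octets(3.14)
    print(len(encoded))                 # 4
    print(octets_to_float32(encoded))   # ~3.14 (32-bit precision)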
I am new to Python, but what I have come to know is that both are used for serialization and deserialization. So I just want to know: what are the basic differences between them?
YAML is a language-neutral format that can represent primitive types (int, string, etc.) well, and is highly portable between languages. Kind of analogous to JSON, XML or a plain-text file; just with some useful formatting conventions mixed in -- in fact, YAML is a superset of JSON.
Pickle format is specific to Python and can represent a wide variety of data structures and objects, e.g. Python lists, sets and dictionaries; instances of Python classes; and combinations of these like lists of objects; objects containing dicts containing lists; etc.
So basically:
YAML represents simple data types & structures in a language-portable manner
pickle can represent complex structures, but in a non-language-portable manner
There's more to it than that, but you asked for the "basic" difference.
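A quick sketch of that difference (requires PyYAML; the sample dict is made up):

    # Round-trip the same dict through YAML (readable text, language-neutral)
    # and pickle (opaque bytes, Python-specific).
    import pickle
    import yaml

    data = {"name": "alice", "scores": [1, 2, 3]}

    yaml_text = yaml.safe_dump(data)      # human-readable text
    pickle_bytes = pickle.dumps(data)     # binary byte stream

    print(yaml_text)
    print(yaml.safe_load(yaml_text) == data)    # True
    print(pickle.loads(pickle_bytes) == data)   # True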
pickle is a Python-specific serialization format in which a Python object is converted into a byte stream and back:
“Pickling” is the process whereby a Python object hierarchy is
converted into a byte stream, and “unpickling” is the inverse
operation, whereby a byte stream is converted back into an object
hierarchy.
The main point is that it is python specific.
On the other hand, YAML is a language-agnostic, human-readable serialization format.
FYI, if you are choosing between these formats, think about:
serialization/deserialization speed (see the cPickle module)
do you need to store serialized files in a human-readable form?
what are you going to serialize? If it's a python-specific complex data structure, for example, then you should go with pickle.
See also:
Python serialization - Why pickle?
Lightweight pickle for basic types in python?
If the files don't need to be read by a person, and you just need to save the file and then read it back, use pickle. It is much faster and the binary files are smaller.
YAML files are more readable as mentioned above, but also slower and larger in size.
I have tested this for my application: I measured the time to write an object to a file and load it back, as well as the file size.
Serialization/deserialization method | Average time, s | Size of file, kB
PyYAML                               | 1.73            | 1149.358
pickle                               | 0.004           | 690.658
As you can see, the YAML file is roughly 1.66 times larger, and YAML is 432.5 times slower.
P. S. This is for my data. In your case, it may be different. But that's enough for comparison.
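For reference, a rough sketch of how such a comparison could be reproduced (not the poster's actual benchmark; the test object is just a placeholder):

    # Time a dump/load round trip to disk for pickle and PyYAML and report the
    # elapsed seconds and resulting file size in kB.
    import os
    import pickle
    import time
    import yaml

    obj = {"items": list(range(50000))}   # placeholder for the real data

    def bench_pickle(path="data.pkl"):
        start = time.perf_counter()
        with open(path, "wb") as f:
            pickle.dump(obj, f)
        with open(path, "rb") as f:
            pickle.load(f)
        return time.perf_counter() - start, os.path.getsize(path) / 1024

    def bench_yaml(path="data.yaml"):
        start = time.perf_counter()
        with open(path, "w") as f:
            yaml.safe_dump(obj, f)
        with open(path) as f:
            yaml.safe_load(f)
        return time.perf_counter() - start, os.path.getsize(path) / 1024

    print("pickle (s, kB):", bench_pickle())
    print("yaml   (s, kB):", bench_yaml())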