I'm generating time series data in OCaml, which is basically long lists of floats ranging from a few kB to hundreds of MB. I would like to read, analyze and plot them using the Python numpy and pandas libraries. Right now, I'm thinking of writing them to CSV files.
A binary format would probably be more efficient. I'd use HDF5 in a heartbeat, but OCaml does not have a binding. Is there a good binary exchange format that is easily usable from both sides? Is writing a file the best option, or is there a better protocol for data exchange, potentially even something that can be updated on-line?
First of all I would like to mention that there actually are HDF5 bindings for OCaml. But when I faced the same problem, I didn't find one that suited my purposes and was mature enough, so I wouldn't suggest using them; who knows, maybe today there is something more decent.
In my experience, the best way to store numeric data in OCaml is Bigarrays. They are essentially wrappers around a C pointer that can be allocated outside the OCaml runtime, and they can also be memory-mapped regions. For me this is the most efficient way to share data between different processes (potentially written in different languages). You can share data using memory mapping between OCaml, Python, Matlab or whatever with very little pain, especially if you're not trying to modify it from different processes simultaneously.
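For illustration, here is a minimal sketch of the Python side, assuming the OCaml process has written or memory-mapped a flat file of float64 values in C layout; the filename and dtype are assumptions, not something fixed by OCaml:

    import numpy as np

    # Map the file instead of copying it into memory; mode="r" gives read-only access.
    series = np.memmap("series.bin", dtype=np.float64, mode="r")

    print(series.shape, series[:10])   # behaves like a regular ndarray
    print(series.mean())               # numpy reductions work directly on the mapped data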
Other approaches are MPI, ZMQ or bare sockets. I would prefer the latter, if only because the former doesn't support Bigarrays. Also, I would suggest looking at Cap'n Proto; it is also very efficient, has bindings for OCaml and Python, and could work very well for your particular use case.
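If you go the socket/ZMQ route, a hedged sketch of the Python consumer could look like the following; the address and dtype are assumptions, and the OCaml side would simply push raw float64 buffers:

    import numpy as np
    import zmq

    ctx = zmq.Context()
    sock = ctx.socket(zmq.PULL)
    sock.bind("tcp://127.0.0.1:5555")

    while True:
        msg = sock.recv()                              # one raw chunk of the series
        chunk = np.frombuffer(msg, dtype=np.float64)   # zero-copy view of the received bytes
        print(len(chunk), chunk.mean())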
TL;DR: Is there a Python library that allows me to grab an application window's frame as an image and write a modified frame back to the said application?
So the whole story is that I want to write an application using Python that does something similar to Lossless Scaling and Magpie. I want to grab an application window (a videogame window, for example), get the current frame as an image, then use some Machine Learning/Deep Learning algorithm (like FSR or DLSS) to upscale said image, then rewrite the current frame from the application with said upscaled image.
So far, I have been playing around with some upscaling algorithms like the one from Real-ESRGAN, but now my main problem is how to upscale the video game images in real time. The only thing I found that does something related to what I need is PyAutoGUI, but that package only lets you take screenshots of an application, not rewrite its graphics.
I hope I have clarified my problem; feel free to comment if you still have any questions.
Thank you for reading this post, and have a good day.
Doing this with Python is going to be very difficult. A lot of the performance involved in this sort of thing is in avoiding as many memory copies as possible, and Python's idiom for string and bytes processing unfortunately makes quite a few additional copies in the course of any idiomatic program. I say this as a die-hard Python fan who is constantly trying to cram Python in everywhere it doesn't belong: you'd be better off doing this in Rust.
Update: After receiving some feedback from folks with more direct experience in this sort of thing, I may have overstated the difficulty here. Many ML tools in Python provide zero-copy access, you can easily access and manipulate memory-mapped data from numpy, and there is even a CUDA protocol for doing this with data in GPU memory. So while it's not exactly easy, as long as your operations are implemented as numpy operations rather than pure-Python pixel-by-pixel logic, it shouldn't be much harder than other Python machine learning applications that require access to native APIs for their source data.
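To make the zero-copy point concrete, here is a small illustration (not a real capture call; the buffer, width and height are stand-ins): once a capture API hands you a raw byte buffer, numpy can view and process it without copying.

    import numpy as np

    width, height = 1920, 1080
    raw = bytearray(width * height * 4)   # pretend this BGRA buffer came from the capture API

    frame = np.frombuffer(raw, dtype=np.uint8).reshape(height, width, 4)  # a view, no copy
    gray = frame[..., :3].mean(axis=2)    # whole-array operations stay in C, not per-pixel Python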
However, there's no way to access framebuffer data directly from Python, so step 1 is going to be writing your own bindings over the relevant DirectX APIs. Since Magpie is open source, you can see which APIs it's using in its various C++ "Frame Source" backends; this one, for instance, looks relevant: https://github.com/Blinue/Magpie/blob/42cfcba1222b07e4cec282eaff639aead229f123/Runtime/GraphicsCaptureFrameSource.cpp#L87
You can then look those APIs up on MSDN; that one, for example, is here: https://learn.microsoft.com/en-us/uwp/api/windows.graphics.capture.direct3d11captureframepool.createfreethreaded?view=winrt-22621
CFFI is a good choice for writing native wrappers: https://cffi.readthedocs.io/en/latest/
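As a taste of the wrapping style (this is not the DirectX capture path itself, which goes through COM/WinRT and is considerably more involved), here is a minimal CFFI sketch calling one plain Win32 function:

    from cffi import FFI

    ffi = FFI()
    ffi.cdef("int GetSystemMetrics(int nIndex);")   # declarations pasted from the Windows headers
    user32 = ffi.dlopen("user32.dll")

    SM_CXSCREEN = 0                                 # constant from the Windows headers
    print(user32.GetSystemMetrics(SM_CXSCREEN))     # width of the primary display in pixels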
Gluing these together appropriately is left as an exercise for the reader :).
General Python question:
I have built a script using the numpy and pandas libraries. I have now been told that I cannot use any libraries, only base Python, because apparently open source libraries are not approved.
Does this restriction make sense? Isn't base Python just as open source as the pandas/numpy libraries?
Is it possible to convert pandas/numpy code to base Python? Does this sound like a simple exercise, or does it require learning a lot of new functions? The majority of the code is reading tables and then using if/then type statements and looking up values from other tables to generate and populate new tables.
I'm only going to address the second point. Reimplementing all of numpy/pandas would certainly be a very large and useless task. But you're not reimplementing all of it, you only need some parts, and if it's only a few functions, then it's certainly possible.
I'd start from a working script, replace arrays with Python lists, and implement the needed functions one by one. For SO specifically, I suspect you're better off asking specific questions, e.g. how to implement an analog of function X in pure Python, etc.
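As a hedged sketch of that pattern with only the standard library (the file names and column names below are made up for illustration), a pandas-style lookup/merge can become a dict keyed on the join column:

    import csv

    # build a lookup table: {region -> rate}
    with open("rates.csv", newline="") as f:
        rates = {row["region"]: float(row["rate"]) for row in csv.DictReader(f)}

    out_rows = []
    with open("orders.csv", newline="") as f:
        for row in csv.DictReader(f):
            rate = rates.get(row["region"], 0.0)          # "left join" with a default value
            row["total"] = float(row["amount"]) * rate    # if/then-style derived column
            out_rows.append(row)

    with open("orders_with_totals.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=out_rows[0].keys())
        writer.writeheader()
        writer.writerows(out_rows)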
On Wikipedia one can read the following criticism of HDF5:
Criticism of HDF5 follows from its monolithic design and lengthy specification. Though a 150-page open standard, there is only a single C implementation of HDF5, meaning all bindings share its bugs and performance issues. Compounded with the lack of journaling, documented bugs in the current stable release are capable of corrupting entire HDF5 databases. Although 1.10-alpha adds journaling, it is backwards-incompatible with previous versions. HDF5 also does not support UTF-8 well, necessitating ASCII in most places. Furthermore, even in the latest draft, array data can never be deleted.
I am wondering whether this applies only to the C implementation of HDF5 or whether it is a general flaw of HDF5.
I am doing scientific experiments which sometimes generate Gigabytes of data and in all cases at least several hundred Megabytes of data. Obviously data loss and especially corruption would be a huge disadvantage for me.
My scripts always have a Python API, hence I am using h5py (version 2.5.0).
So, is this criticism relevant to me and should I be concerned about corrupted data?
Declaration up front: I help maintain h5py, so I probably have a bias etc.
The wikipedia page has changed since the question was posted, here's what I see:
Criticism
Criticism of HDF5 follows from its monolithic design and lengthy specification.
Though a 150-page open standard, the only other C implementation of HDF5 is just an HDF5 reader.
HDF5 does not enforce the use of UTF-8, so client applications may be expecting ASCII in most places.
Dataset data cannot be freed in a file without generating a file copy using an external tool (h5repack).
I'd say that pretty much sums up the problems with HDF5: it's complex (but people need this complexity; see the virtual dataset support), it has a long history with backwards compatibility as its focus, and it's not really designed to allow for massive changes in files. It's also not the best on Windows (due to how it deals with filenames).
I picked HDF5 for my research because, of the available options, it had decent metadata support (HDF5 at least allows UTF-8; formats like FITS don't even have that), support for multidimensional arrays (which formats like Protocol Buffers don't really support), and support for more than just 64-bit floats (which is very rare).
I can't comment about known bugs, but I have seen corruption (this happened when I was writing to a file and Linux OOM'd my script). However, this shouldn't be a concern as long as you have proper data hygiene practices (as mentioned in the hackernews link): in your case, don't continuously write to the same file, but create a new file for each run. You should also not modify the file; any data reduction should produce new files, and you should always back up the originals.
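A small sketch of that habit with h5py (the dataset name and array are placeholders): create a fresh, timestamped file per run and never reopen old ones for writing.

    import time
    import h5py
    import numpy as np

    data = np.random.rand(1_000_000)                  # stand-in for one run's results

    fname = time.strftime("run_%Y%m%d_%H%M%S.h5")     # a fresh file for every run
    with h5py.File(fname, "w") as f:                  # "w" always creates a new file
        f.create_dataset("signal", data=data, compression="gzip")
    # originals stay untouched; later reduction steps write new files instead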
Finally, it is worth pointing out there are alternatives to HDF5, depending on what exactly your requirements are: SQL databases may fit your needs better (and sqlite comes with Python by default, so it's easy to experiment with), as could a simple CSV file. I would recommend against custom/non-portable formats (e.g. pickle and similar), as they're no more robust than HDF5 and are more complex than a CSV file.
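Since sqlite3 ships with CPython, experimenting costs nothing; a minimal sketch (the table and column names are just illustrative):

    import sqlite3

    con = sqlite3.connect("experiment.db")
    con.execute("CREATE TABLE IF NOT EXISTS samples (run_id TEXT, t REAL, value REAL)")
    con.executemany(
        "INSERT INTO samples VALUES (?, ?, ?)",
        [("run1", 0.0, 1.23), ("run1", 0.1, 1.31)],
    )
    con.commit()

    for row in con.execute("SELECT t, value FROM samples WHERE run_id = ?", ("run1",)):
        print(row)
    con.close()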
I want to write some code to do acoustic analysis and I'm trying to determine the proper tool(s) for the job. I would normally write something like this in Python using numpy and scipy and possibly Cython for the analysis part. I've discovered that the world of Python audio libraries is a bit chaotic, with scads of very limited packages in various states of development.
I've also come across a bunch of audio/acoustic specific languages like SuperCollider, Faust, etc. that seem to make the audio processing easy but may be limited in terms of IO and analysis capability.
I'm currently working on Linux with ALSA and PulseAudio installed by default. I would prefer not to involve any of the various and sundry other audio packages like JACK if possible, though that is not a hard requirement.
My primary interest in this question is to determine whether there is a domain specific language that will provide for quicker prototyping and testing or whether a general language like Python is more appropriate. Thanks.
I've got a lot of experience with SuperCollider and Python (with and without Numpy). I do a lot of audio analysis, and I'm afraid the answer depends on what you want to do.
If you want to create systems that will input OR output audio in real time, then Python is not a good choice. The audio I/O libraries (as you say) are a bit sketchy. There's also a fundamental issue that Python's garbage collector is not really designed for realtime stuff. You should use a system that is designed from the ground up for realtime. SuperCollider is nice for this, and as caseyanderson notes, some of the standard building-blocks for audio analysis are right there. There are other environments too.
If you want to do hardcore work such as applying various machine learning algorithms, not necessarily in real time (i.e. if you can get away with reading/writing WAV files rather than live audio), then you should use a general-purpose programming language with wide support and an ecosystem of good libraries for the extra things you want. Using Python with libs such as numpy and scikit-learn works great for this. It's good for quick prototyping, but not only does it lack solid realtime audio, it also has far fewer of the standard audio building-blocks. Those are two important things that hold you back when prototyping audio pipelines.
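For the offline route, a minimal sketch of what that looks like (the filename is an assumption): read a WAV from disk and analyse it with numpy/scipy, no real-time I/O involved.

    import numpy as np
    from scipy.io import wavfile

    rate, samples = wavfile.read("recording.wav")   # sample rate and PCM data
    samples = samples.astype(np.float64)
    if samples.ndim > 1:                            # mix down to mono if needed
        samples = samples.mean(axis=1)

    spectrum = np.abs(np.fft.rfft(samples))                     # magnitude spectrum
    freqs = np.fft.rfftfreq(len(samples), d=1.0 / rate)
    print("dominant frequency:", freqs[spectrum.argmax()], "Hz")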
So, then, you're caught between these two options. Depending on your application you may be able to combine the two by manipulating the audio I/O in a realtime environment, and using OSC messaging or shell scripts to communicate with an external Python process. The limitation there is that you can't really throw masses of data around between the two (you can't sensibly pipe all your audio across to some other process, that'd be silly).
SuperCollider has lots of support for things along these lines, both as externals/plugins or Quarks. That said, it depends exactly what you want to do. If you are simply looking to detect events, Onsets.kr would be fine. If you are looking for frequency/pitch information, Pitch or Tartini would work (I find Tartini to be more accurate). If you are trying to track amplitude, a combination of Amplitude.ar and some simple math would also work.
Similarly, there is SpecCentroid.kr (for a kind of brightness analysis), Loudness.kr, SpecFlatness.kr, etc.
The above are all pretty general, and there are lots more (the JoshUGens externals package has some interesting FFT-related acoustics stuff). So I would recommend downloading the program, joining the mailing list (if you have further questions), which lives here, and poking around in the Externals, Quarks, and Standard UGens.
Nonetheless, since I am not sure what you are trying to do, I cannot make more concrete recommendations than the above combined with my feeling that it makes the most sense to go to SC for this, rather than writing all of your own tools in Python from scratch.
I'm not 100% sure what you want to do, but as an additional suggestion I would put forth: Spear with scripting in Common Lisp. If what you are doing involves a great deal of spectral analysis, you can do the heavy lifting in Spear and script all of it using Common Lisp with Common Music. Spear has some great tools for editing out very specific partials.
My Python application currently uses the python-memcached API to set and get objects in memcached. This API uses Python's native pickle module to serialize and de-serialize Python objects. This API makes it simple and fast to store nested Python lists, dictionaries and tuples in memcached, and reading these objects back into the application is completely transparent -- it just works. But I don't want to be limited to using Python exclusively, and if all the memcached objects are serialized with pickle, then clients written in other languages won't work.

Here are the cross-platform serialization options I've considered:
XML - the main benefit is that it's human-readable, but that's not important in this application. XML also takes a lot of space, and it's expensive to parse.
JSON - seems like a good cross-platform standard, but I'm not sure it retains the character of object types when read back from memcached. For example, according to this post, tuples are transformed into lists when using simplejson (see the short round-trip example after this list); also, it seems like adding elements to the JSON structure could break code written against the old structure.
Google Protocol Buffers - I'm really interested in this because it seems very fast and compact -- at least 10 times smaller and faster than XML; it's not human-readable, but that's not important for this app; and it seems designed to support growing the structure without breaking old code
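The tuple point from the JSON item above is easy to see with the standard json module (simplejson behaves the same way):

    import json

    print(json.loads(json.dumps((1, 2, 3))))   # tuples come back as lists: [1, 2, 3]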
Considering the priorities for this app, what's the ideal object serialization method for memcached?
Cross-platform support (Python, Java, C#, C++, Ruby, Perl)
Handling nested data structures
Fast serialization/de-serialization
Minimum memory footprint
Flexibility to change structure without breaking old code
One major consideration is "do you want to have to specify each structure definition"?
If you are OK with that, then you could take a look at:
Protocol Buffers - http://code.google.com/apis/protocolbuffers/docs/overview.html
Thrift - http://developers.facebook.com/thrift/ (more geared toward services)
Both of these solutions require supporting files to define each data structure.
If you would prefer not to incur the developer overhead of pre-defining each structure, then take a look at:
JSON (via python cjson, and native PHP json). Both are really really fast if you don't need to transmit binary content (such as images, etc...).
YAML (Yet Another Markup Language) - http://www.yaml.org/. Also really fast if you get the right library.
However, I believe that both of these have had issues with transporting binary content, which is why they were ruled out for our usage. Note: YAML may have good binary support, you will have to check the client libraries -- see here: http://yaml.org/type/binary.html
At our company, we rolled our own library (Extruct) for cross-language serialization with binary support. We currently have (decently) fast implementations in Python and PHP, although it isn't very human-readable due to using base64 on all strings (for binary support). Eventually we will port them to C and use a more standard encoding.
Dynamic languages like PHP and Python get really slow if you have too many iterations in a loop or have to look at each character. C on the other hand shines at such operations.
If you'd like to see the implementation of Extruct, please let me know. (contact info at http://blog.gahooa.com/ under "About Me")
I tried several methods and settled on compressed JSON as the best balance between speed and memory footprint. Python's native pickle module is slightly faster, but the resulting objects can't be used with non-Python clients.
I'm seeing 3:1 compression so all the data fits in memcache and the app gets sub-10ms response times including page rendering.
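A hedged sketch of the compressed-JSON approach using only the standard library (the actual memcached set/get calls are omitted):

    import json
    import zlib

    obj = {"user": 42, "tags": ["a", "b"], "scores": list(range(1000))}

    blob = zlib.compress(json.dumps(obj).encode("utf-8"))        # what gets stored in memcache
    restored = json.loads(zlib.decompress(blob).decode("utf-8"))

    print(len(json.dumps(obj)), "->", len(blob), "bytes")        # rough view of the compression ratio
    assert restored == obj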
Here's a comparison of JSON, Thrift, Protocol Buffers and YAML, with and without compression:
http://bouncybouncy.net/ramblings/posts/more_on_json_vs_thrift_and_protocol_buffers/
Looks like this test got the same results I did with compressed JSON. Since I don't need to pre-define each structure, this seems like the fastest and smallest cross-platform answer.
"Cross-platform support (Python, Java, C#, C++, Ruby, Perl)"
Too bad this criterion is first. The intent behind most languages is to express fundamental data structures and processing differently. That's what makes multiple languages a "problem": they're all different.
A single representation that's good across many languages is generally impossible. There are compromises in richness of the representation, performance or ambiguity.
JSON meets the remaining criteria nicely. Messages are compact and parse quickly (unlike XML). Nesting is handled nicely. Changing structure without breaking code is always iffy -- if you remove something, old code will break. If you change something that was required, old code will break. If you're adding things, however, JSON handles this also.
I like human-readable. It helps with a lot of debugging and trouble-shooting.
The subtlety of having Python tuples turn into lists isn't an interesting problem. The receiving application already knows the structure being received and can tweak it (if it matters).
Edit on performance.
Parsing the XML and JSON documents from http://developers.de/blogs/damir_dobric/archive/2008/12/27/performance-comparison-soap-vs-json-wcf-implementation.aspx
xmlParse 0.326
jsonParse 0.255
JSON appears to be significantly faster for the same content. I used the SimpleJSON and ElementTree modules with Python 2.5.2.
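For anyone who wants to reproduce that kind of measurement today, a rough sketch with the standard library (the payload below is synthetic, not the document from the linked post):

    import json
    import timeit
    import xml.etree.ElementTree as ET

    records = [{"id": i, "name": "item%d" % i, "value": i * 0.5} for i in range(1000)]
    json_doc = json.dumps(records)
    xml_doc = "<root>" + "".join(
        '<item id="%d" name="%s" value="%f"/>' % (r["id"], r["name"], r["value"]) for r in records
    ) + "</root>"

    print("json:", timeit.timeit(lambda: json.loads(json_doc), number=100))
    print("xml :", timeit.timeit(lambda: ET.fromstring(xml_doc), number=100))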
You might be interested in this link:
http://kbyanc.blogspot.com/2007/07/python-serializer-benchmarks.html
An alternative: MessagePack seems to be the fastest serializer out there. Maybe you can give it a try.
Hessian meets all of your requirements. There is a Python library here:
https://github.com/bgilmore/mustaine
The official documentation for the protocol can be found here:
http://hessian.caucho.com/
I regularly use it in both Java and Python. It works and doesn't require writing protocol definition files. I couldn't tell you how the Python serializer performs, but the Java version is reasonably efficient:
https://github.com/eishay/jvm-serializers/wiki/