How to read a file with mpi4py - Python

I am having trouble reading a file with mpi4py. I have opened the file and read it, but then I want to manipulate the data in Python as a list. I get an error that the object is an mpi4py datatype that Python's list operations do not support. How do I read the data into an object Python supports, or convert it?
I use MPI.File.Seek(block_start), where block_start is nprocs/size, then MPI.File.Read (after opening the file with MPI). Read takes a buffer argument; I'm not certain what that is, but I pass a bytearray of size block_end - block_start. I have figured out how to use Python to turn the bytearray into a string, do the manipulation I need, then turn it back into a bytearray and print it. I am wondering if there is a more efficient way.
The task is to read financial tick data (date, price, volume), where the date arrives with microsecond precision in the form yyyymmdd:hh:mm:ss.ssssss, and to identify malformed lines. I succeeded with a small file using plain sequential Python. We are required to use Python with mpi4py. The task seems simple; however, most of us are very inexperienced programmers (in fact, I am taking my first programming course simultaneously), and we are not learning programming in the class.
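For what it's worth, the bytearray-to-string conversion is just a decode call, which is about as direct as it gets in pure Python. Below is a minimal sketch of the whole read under the setup described above, assuming a plain ASCII file; the filename and the validation body are placeholders. Note that splitting at fixed byte offsets can cut a line in half at block boundaries, so neighboring ranks would need to reconcile partial first and last lines.

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
nprocs = comm.Get_size()

fh = MPI.File.Open(comm, "ticks.txt", MPI.MODE_RDONLY)
file_size = fh.Get_size()
block_size = file_size // nprocs
block_start = rank * block_size
# the last rank picks up the remainder bytes
block_end = file_size if rank == nprocs - 1 else block_start + block_size

buf = bytearray(block_end - block_start)
fh.Read_at(block_start, buf)  # read this rank's block at an explicit offset
fh.Close()

text = buf.decode("ascii")  # an ordinary Python str from here on
for line in text.splitlines():
    pass  # check the yyyymmdd:hh:mm:ss.ssssss,price,volume fields here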

How to write a line or block in the middle of bgzf

I'd like to reference the following post as well, and mention that I'm familiar with Biopython.
How to obtain random access of a gzip compressed file
I'm familiar with Bio.bgzf's support for indexing and random reads. I'm building a library that uses the module to build an index of the blocks containing data relevant to my interests. The technology is very interesting, but I'm struggling to understand the pace of development and the limitations of what Bio.bgzf, or even the bgzf standard itself, is capable of.
Can Bio.bgzf overwrite a specific line in the file, just as it can read from a virtual offset to the end of a line? If it could, would the new data necessarily need to be exactly the same number of bits?
After using make_virtual_offset() to acquire a position in the .bgzf file for the line I'd like to overwrite, I'm looking for a method like filehandle.writeline() to replace that line in the block with some new text. If that's not possible, is it possible to get the coordinates of the entire block and rewrite that? And if not, can it be said that bgzf index files are suitable for reading only? Is this correct?
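For context, here is the read side the question describes, as a minimal sketch; the filename and the two offset variables are placeholders for values from your own index. As far as I know, Bio.bgzf exposes no in-place overwrite: each BGZF block is deflate-compressed, so changing a line generally means re-compressing that block and rewriting everything from it to the end of the file.

from Bio import bgzf

# placeholders: the coordinates recorded when you built your index
block_start_offset = 0    # byte offset of the BGZF block within the file
within_block_offset = 0   # offset of the line inside the decompressed block

voffset = bgzf.make_virtual_offset(block_start_offset, within_block_offset)
handle = bgzf.BgzfReader("data.bgzf")
handle.seek(voffset)       # jump straight to the indexed line
line = handle.readline()   # reads from the virtual offset to the end of the line
handle.close()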

How to pass multiple images as input to a Python script

I use Node.js to call a Python script that runs object detection on some JPG images read from the hard disk. These images are written to disk by Node.js prior to calling the script.
To make it dynamic and faster, I now want to send multiple images as a multidimensional array from Node.js to the Python script, saving the writes and reads to disk. Is this the best way to do it? If so, how do I pass images as a multidimensional array to the Python script? Or is there a better solution?
Your question leaves out a lot of specifics, but if you literally mean "pass input to a Python script," you have two options: command line arguments or standard input.
While it is technically possible to read a series of images from standard input, it is certainly more complicated, with no benefit to make it worthwhile.
Command line arguments can only be strings, read from sys.argv. So, while you could try to use multiple command line arguments, it would be more trouble than it's worth to translate an array of strings into a multidimensional array.
So, tl;dr: create a data file format that lets you represent a set of images as a multidimensional array of file paths (or URLs, if you want). The easiest by far is a CSV file, read in with Python's csv module:
import csv
# each CSV row becomes a list of image-path strings
with open('filename.csv', 'r') as f:
    image_path_array = list(csv.reader(f))

Create PDF from Python

I'm looking to generate PDFs from a Python application.
They start relatively simple, but some may become more complex (essentially letter-like documents, but they will include watermarks, for example, later on).
I've worked in raw PostScript before, and provided I can generate the correct headers and the file that goes with them, I want to avoid complex libraries that may not do entirely what I want. Some seem to have suffered bitrot and are no longer supported (pypdf and pypdf2), and I know PDF/PostScript can do exactly what I need. PDF content really isn't that complex.
I can generate EPS (Encapsulated PostScript) fine by just writing the appropriate text headers and my PostScript code to a file. But inspecting PDFs, there is a small binary header I'm not sure how to generate.
I could generate an EPS and convert it. I'm not overly happy with this, as the production environment is a Windows 2008 server (dev is Ubuntu 12.04), and generating something only to convert it seems very silly.
Has anyone done this before?
Am I being pedantic by not wanting to use a library?
Borrowed from Ask Yahoo:
A PDF file starts with "%PDF-1.1" if it is a version 1.1 PDF file. You can read PDF files fine when they don't have binary data objects stored in them, and you could even make one with Notepad if you didn't need to store a binary object like a Paint bitmap in it.
But after seeing the "%PDF-1.1" you ignore what's after that (Adobe Reader does, too) and go straight to the end of the file to where there is a line that says "%%EOF". That's always the last thing in the file; and if that's there you know that just a few characters before that place in the file there's the word "startxref" followed by a number. This number tells a reader program where to look in the file to find the start of the list of items describing the structure of the file. These items in the list can be page objects, dictionary objects, or stream objects (like the binary data of a bitmap), and each one has "obj" and "endobj" marking out where its description starts and ends.
For fairly simple PDF files, you might be able to type the text in just like you did with Notepad to make a working PDF file that Adobe Reader and other PDF viewer programs could read and display correctly.
Doing something like this is a challenge, even for a simple file, and you'd really have to know what you're doing to get any binary data into the file where it's supposed to go; but for character data, you'd just be able to type it in. And all of the commands used in the PDF are in the form of strings that you could type in. The hardest part is calculating those numbers that give the file offsets for items in the file (such as the number following "startxref").
If the way the file format is laid out intrigues you, go ahead and read the PDF manual, which tells the whole story.
http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/PDF32000_2008.pdf
But really, you should probably just use a library.
Thanks to @LukasGraf for providing this link, http://www.gnupdf.org/Introduction_to_PDF, which shows how to create a simple hello world PDF from scratch.
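To make the offset bookkeeping concrete, here is a minimal sketch that writes a one-page "Hello World" PDF by hand, tracking the byte offset of each object so that the xref table and the number after "startxref" come out right. The filename, page size, and text are arbitrary; this is an illustration, not a production writer.

# objects 1..5: catalog, page tree, page, content stream, font
objects = [
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
    b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 612 792] "
    b"/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
    b"<< /Length 42 >>\nstream\n"
    b"BT /F1 24 Tf 72 720 Td (Hello World) Tj ET\n"
    b"endstream",
    b"<< /Type /Font /Subtype /Type1 /BaseFont /Helvetica >>",
]

out = bytearray(b"%PDF-1.1\n")
offsets = []
for num, body in enumerate(objects, 1):
    offsets.append(len(out))              # byte offset of "num 0 obj"
    out += b"%d 0 obj\n%s\nendobj\n" % (num, body)

xref_pos = len(out)                       # the number that follows startxref
out += b"xref\n0 %d\n0000000000 65535 f \n" % (len(objects) + 1)
for off in offsets:
    out += b"%010d 00000 n \n" % off      # each xref entry is exactly 20 bytes
out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % (len(objects) + 1)
out += b"startxref\n%d\n%%%%EOF\n" % xref_pos

with open("hello.pdf", "wb") as f:
    f.write(out)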
As long as you're working in Python 2.7, ReportLab seems to be the best solution out there at the moment. It's quite full-featured and can be a little complex to work with, depending on exactly what you're doing with it, but since you seem to be familiar with PDF internals in general, hopefully the learning curve won't be too steep.
I recommend using a library. I spent a lot of time creating pdfme and learned a lot of things along the way, but it's not something you would do for a single project. If you want to use my library, check the docs here.

Hadoop: Process image files in Python code

I'm working on a side project where we want to process images in a Hadoop MapReduce program (for eventual deployment on Amazon's Elastic MapReduce). The input to the process will be a list of all the files, each with a little extra data attached (the lat/long position of the bottom-left corner; these are aerial photos).
The actual processing needs to take place in Python code so we can leverage the Python Imaging Library. All the Python streaming examples I can find use stdin and process text input. Can I send image data to Python through stdin? If so, how?
I wrote a Mapper class in Java that takes the list of files and saves the names, the extra data, and the binary contents to a SequenceFile. I was thinking maybe I need to write a custom Java mapper that reads that sequence file and pipes it to Python. Is that the right approach? If so, what should the Java look like to pipe the images out, and the Python to read them in?
In case it's not obvious, I'm not terribly familiar with Java OR Python, so it's also possible I'm just biting off way more than I can chew with this as my introduction to both languages...
There are a few possible approaches that I can see:
1. Use both the extra data and the file contents as input to your Python program. The tricky part here will be the encoding. I frankly have no idea how streaming works with raw binary content, and my assumption is that the basic answer is "not well." The main issue is that the stdin/stdout communication between processes is very text-based, relying on delimiting input with tabs and newlines and the like. You would need to worry about the encoding of the image data, and probably have some sort of pre-processing step, or a custom InputFormat, so that you could represent the image as text (see the sketch after this list).
2. Use only the extra data and the file location as input to your Python program. The program can then independently read the actual image data from the file. The hiccup here is making sure the file is available to the Python script. Remember this is a distributed environment, so the files would have to be in HDFS or somewhere similar, and I don't know if there are good Python libraries for reading files from HDFS.
3. Do the Java-Python interaction yourself. Write a Java mapper that uses the Runtime class to start the Python process itself. This way you get full control over exactly how the two worlds communicate, but obviously it's more code and a bit more involved.
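For approach 1, the usual workaround is to base64-encode the image bytes in the pre-processing step so each record survives the text-oriented streaming pipe. Here is a minimal sketch of the Python mapper side, assuming a hypothetical record layout of name, lat, long, and base64 data separated by tabs (the layout and output format are illustrative, not prescribed by Hadoop):

import base64
import sys
from io import BytesIO
from PIL import Image

for line in sys.stdin:
    name, lat, lon, b64 = line.rstrip("\n").split("\t")
    img = Image.open(BytesIO(base64.b64decode(b64)))  # bytes back to an image
    # ... run the actual PIL processing here ...
    print("%s\t%dx%d" % (name, img.size[0], img.size[1]))  # emit key\tvalue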

Best way to send string data using python UDP packets?

To preface, I'm very new to Python (about 7 days), but I'm an experienced software engineering undergrad.
I would like to send data between machines running Python scripts. The idea I had (in order to simplify things) was to concatenate the data (strings and ints) into a string and do the parsing client-side.
The UDP packets send beautifully with simple strings, but when I try to send useful data, Python always complains about the data I send; specifically, Python won't let me concatenate tuples.
In order to parse the data on the client, I separate the fields with a dash character: '-'.
nodeList is a dictionary whose keys are strings and whose values are doubles.
randKey = random.choice( nodeList.keys() )
data = str(randKey) +'-'+ str(nodeList[randKey])
mySocket.sendto ( data , address )
The code above produces the following error:
TypeError: coercing to Unicode: need string or buffer, tuple found
I don't understand why it thinks I am trying to concatenate a tuple...
So my question is: how can I correct this to keep Python happy, or can someone suggest a better way of sending the data?
Thank you in advance.
I highly suggest Google Protocol Buffers, as implemented in Python by the protobuf package, since it will handle the serialization on both ends of the line. Its Python bindings let you use it easily with your existing Python program.
Using your example code, you would create a .proto file like so:
message SomeCoolMessage {
required string key = 1;
required double value = 2;
}
Then, after generating the Python bindings with protoc, you can use it like so:
# import SomeCoolMessage from the module protoc generates,
# e.g. from some_cool_pb2 import SomeCoolMessage (name depends on the .proto file)
randKey = random.choice( nodeList.keys() )
data = SomeCoolMessage()
data.key = randKey
data.value = nodeList[randKey]
mySocket.sendto ( data.SerializeToString() , address )
I'd probably use the json module to serialize the data.
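For instance, a minimal sketch reusing the question's variables (nodeList, mySocket, address):

import json
import random

randKey = random.choice(list(nodeList.keys()))
payload = json.dumps({"key": randKey, "value": nodeList[randKey]})
mySocket.sendto(payload.encode("utf-8"), address)  # bytes on the wire
# receiving side: record = json.loads(packet.decode("utf-8"))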
You need to serialize the data. pickle does this for you out of the box, and you can ask pickle for an ASCII representation of the data instead of binary (see the docs); or you could use json, which also serializes the data for you. Both are in the standard library. But really, there are a hundred thousand different libraries that handle ALL the work of getting data from one machine to another, so I'd suggest using a library.
Depending on speed requirements and so on, the various libraries make different trade-offs. In the standard library you get HTTP, and that's about it (well, and raw sockets). But there are others.
If raw speed matters more than anything else, ZeroMQ or Google's protocol buffers might be valid options.
Personally, I usually use rpyc; it lets me be totally lazy and just call over to the other process across the network, and it's usually fast enough.
Be aware that UDP gives no guarantee that the data will ever show up on the other side, or that it will show up in order. For your application you may not care, but I thought I'd bring it up.
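As a sketch of the pickle route, again reusing the question's variables (protocol 0 is the ASCII representation mentioned above):

import pickle
import random

randKey = random.choice(list(nodeList.keys()))
data = pickle.dumps((randKey, nodeList[randKey]), 0)  # 0 = ASCII protocol
mySocket.sendto(data, address)
# receiving side: key, value = pickle.loads(packet)
# caution: never unpickle data from senders you don't trust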
