Build manager for GIS data processing

My organization spends a lot of time processing GIS data. I have built a number of Python scripts that perform different steps of the data processing. Other than the first script, each script relies on another script finishing before it can start. Many of the scripts take 5+ minutes to execute (one takes over an hour), so I do not want to repeat already-executed steps. I want this to work similarly to Make, so that if an error occurs in "script3", I don't have to re-execute "script1" and "script2". I can just re-run "script3".
Is SCons the right tool for this? I looked at it, and it seems to be focused on compiling code rather than running scripts. I'm open to other suitable tools.

I am not sure a build system is what you want. Unless I am missing something, what you want is some kind of controlled automation to execute your processing tasks, and handle runtime errors.
Of course, 'make' and 'SCons' can do that, but it would be like using a bazooka to hammer a nail. And you're actually overlooking something that might be easier and more rewarding to invest time in learning in the long run, which is Python itself. Python is a full-fledged, multi-paradigm programming language, with a lot of features for robust exception handling and interaction with the operating system (and it is heavily used in system administration on Unix-like platforms).
A first simple step would be to have a master script call each of your other scripts, each inside a try ... except block, and handle the exceptions according to your requirements. And you might improve on that as you go along, by refactoring your scripts into a consistent Python application.
Here are some links to start with: link1, link2.
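The master-script idea above can be sketched roughly as follows. This is only a minimal illustration, not a full replacement for Make: the function name run_pipeline, the marker directory, and the ".done" marker files are assumptions made up for this example. Each script is run in order, a marker file is written when it succeeds, and steps that already have a marker are skipped on the next run.

```python
import subprocess
import sys
from pathlib import Path


def run_pipeline(steps, marker_dir):
    """Run each script in order, skipping steps that already finished.

    `steps` is a list of script paths; `marker_dir` holds one hypothetical
    "<script>.done" file per completed step. Returns the scripts that were
    actually executed during this call.
    """
    marker_dir = Path(marker_dir)
    marker_dir.mkdir(parents=True, exist_ok=True)
    executed = []
    for script in steps:
        marker = marker_dir / (Path(script).name + ".done")
        if marker.exists():
            continue  # this step succeeded in an earlier run
        try:
            # check=True raises CalledProcessError on a non-zero exit code
            subprocess.run([sys.executable, str(script)], check=True)
        except subprocess.CalledProcessError as exc:
            print(f"{script} failed with exit code {exc.returncode}")
            break  # later steps depend on this one, so stop here
        marker.touch()
        executed.append(str(script))
    return executed
```

Calling run_pipeline(["script1.py", "script2.py", "script3.py"], "markers") again after fixing a failed "script3" would then re-run only "script3". Deleting a marker file forces that step to run again; unlike Make, this sketch does not check whether inputs changed.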

Related

How to sandbox students' Python 3 code submissions for automatic assignment evaluation on Windows 10?

I am the TA for a coding class in which students will have to write Python 3 scripts to solve programming problems. An assignment consists of several problems, and for each problem the student is supposed to write a Python program that reads input from standard input and writes its output to standard output. For each problem there will be hidden test cases that we will use to evaluate their code and grade it accordingly. So the idea is to automate this process as much as possible. The problem is how to implement the whole framework to run students' assignments without compromising the safety of the system the assignments will be running on, which will probably be my laptop (which has Windows 10). I need to set up some kind of sandbox for Python 3, establishing limits for execution time and memory usage, disallowing access to the file system and networking, limiting imports to safe modules from Python's standard library, etc.
Conceptually speaking, I would like some kind of sandboxed service that can receive a Python script plus some test cases, run the script against the test cases in a safe environment (detecting compilation errors, time limit exceeded errors, memory limit exceeded errors, attempts to use forbidden libraries, etc.) and report the results back. Then from Windows I can write a simple script that iterates over all students' submissions and uses this service as a black box to evaluate them.
Is anything like that possible on Windows 10? If so, how? My educated guess is that something like Docker or a Virtual Machine might be useful, but to be honest I'm not really sure because I lack enough expertise in these technologies, so I'm open to any suggestions.
Any advice on how to set up a secure system for automatic evaluation of untrusted Python 3 code submissions would be very much appreciated.

What you are looking for is a system that automatically evaluates code using test cases.
You can use CMS (Contest Management System) to satisfy your use case. It is mainly a system for managing programming contests, but it should work well for what you are trying to accomplish in your class.
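To make the "evaluate code using test cases" part concrete, here is a rough sketch of just the grading loop, using only the standard library. It is NOT a sandbox: it enforces a time limit and compares standard output, but it does nothing to restrict file system or network access, so it would still have to run inside a VM, a container, or a system such as CMS. The function name grade and the verdict strings are assumptions made up for this example.

```python
import subprocess
import sys


def grade(script_path, cases, time_limit=2.0):
    """Run a submission once per test case and return one verdict per case.

    `cases` is a list of (stdin_text, expected_stdout) pairs.
    WARNING: this only limits wall-clock time; it provides no isolation.
    """
    verdicts = []
    for stdin_text, expected in cases:
        try:
            proc = subprocess.run(
                # -I runs Python in isolated mode (ignores user site-packages
                # and PYTHON* environment variables); it is not a sandbox
                [sys.executable, "-I", script_path],
                input=stdin_text,
                capture_output=True,
                text=True,
                timeout=time_limit,
            )
        except subprocess.TimeoutExpired:
            verdicts.append("time limit exceeded")
            continue
        if proc.returncode != 0:
            verdicts.append("runtime error")
        elif proc.stdout.strip() == expected.strip():
            verdicts.append("passed")
        else:
            verdicts.append("wrong answer")
    return verdicts
```

For example, grade("submission.py", [("1 2\n", "3")]) would return ["passed"] for a script that prints the sum of the two numbers it reads. Memory limits and import restrictions are deliberately left out, since they depend on the isolation mechanism chosen.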

IPython parallel computing vs pyzmq for cluster computing

I am currently working on some simulation code written in C, which runs on different remote machines. While the C part is finished I want to simplify my work by extending it with a python simulation api and some kind of a job-queue system, which should do the following:
1. specify a set of parameters on which simulations should be performed and put them into a queue on a host computer
2. perform the simulations on remote machines by workers
3. return the results to the host computer
I had a look at different frameworks for accomplishing this task, and my first choice is IPython.parallel. I had a look at the documentation, and from what I tested it seems pretty easy to use. My approach would be to use a load-balanced view as explained at
http://ipython.org/ipython-doc/dev/parallel/parallel_task.html#creating-a-loadbalancedview-instance
But what I don't see is:
What happens if, for example, the ipcontroller crashes? Is my job queue gone?
What happens if a remote machine crashes? Is there some kind of error handling?
Since I run relatively long simulations (1-2 weeks), I don't want my simulations to fail if some part of the system crashes. So is there maybe some way to handle this in IPython.parallel?
My second approach would be to use pyzmq and implement the job system from scratch.
In this case, what would be the best zmq pattern for this situation?
And last but not least, is there maybe a better framework for this scenario?

What lies behind the curtain is a somewhat more complex view of how to arrange the work-package flow alongside the (parallelised) number-crunching pipeline(s).
Whether a single work-package takes many CPU-core-weeks, or the total volume of the job exceeds a few hundred thousand CPU-core-hours, the principles are similar and follow common sense.
Key Features
scalability of the computing performance of all resources involved (ideally linear)
ease of task submission role
fault-resilience of submitted task(s) ( ideally with an automated self-healing )
feasible TCO cost of access to / use of a sufficient pool of resources ( upfront co$ts, recurring co$ts, adaptation$ co$ts, co$ts of $peed )
Approaches to Solution
a home-brew architecture for a distributed, massively parallel, scheduler-based, self-healing computation engine
re-use of available grid-based computing resources
Based on my own experience of solving a need for repeated runs of a numerically intensive optimisation problem over a vast parameterSetVectorSPACE (which could not be decomposed into any trivial GPU parallelisation scheme), the second approach proved more productive than burning dozens of man-years on yet another attempt to re-invent the wheel.
In an academic environment, one may find it much easier to get acceptable access to resource pool(s) for processing the work-packages, while commercial entities may acquire the same within their acceptable budget thresholds.

My gut instinct is to suggest rolling your own solution for this because, as you said, otherwise you're depending on IPython not crashing.
I would run a simple Python service on each node which listens for run commands. When it receives one, it launches your C program. However, I suggest you ensure the C program is a true Unix daemon, so that when it runs it completely disconnects itself from Python. That way, if the Python instance on your node crashes, you can still get the data as long as the C program executes successfully. Have the C program write the output data to a file or database, and when the task is finished, write "finished" to a "status" file or something similar. The Python service should monitor that file, and when "finished" is indicated it should retrieve the data and send it back to the server.
The central idea of this design is to have as few points of failure as possible. As long as the C program doesn't crash, you can still get the data one way or another. As for handling system crashes, network disconnects, etc., that's up to you.
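The node-side part of this design could look roughly like the sketch below. The file names "status" and "output.dat", the function names, and the polling interval are assumptions for illustration only. On POSIX, start_new_session=True puts the launched program in its own session, which is a lighter-weight stand-in for the C program daemonizing itself properly.

```python
import subprocess
import time
from pathlib import Path


def launch_detached(cmd, workdir):
    """Start the simulation so it keeps running if this Python process dies.

    start_new_session=True (POSIX only) detaches the child from our session;
    a real daemon would do this itself inside the C program.
    """
    return subprocess.Popen(
        cmd,
        cwd=workdir,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,
    )


def wait_for_result(workdir, poll_seconds=5.0, timeout=None):
    """Poll the hypothetical "status" file until it says "finished",
    then return the bytes of the hypothetical "output.dat" file."""
    workdir = Path(workdir)
    status_file = workdir / "status"
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        if status_file.exists() and status_file.read_text().strip() == "finished":
            return (workdir / "output.dat").read_bytes()
        if deadline is not None and time.monotonic() > deadline:
            raise TimeoutError(f"no 'finished' status in {workdir}")
        time.sleep(poll_seconds)
```

Because the result lives in files rather than in the Python process, a restarted service can simply call wait_for_result on the same directory and pick up where the crashed one left off. The program must write the output before the status file, otherwise the service could read incomplete data.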

Saving the stack?

I'm just curious, is it possible to dump all the variables and current state of the program to a file, and then restore it on a different computer?!
Let's say that I have a little program in Python or Ruby that, given a certain condition, would dump all the current variables and the current state to a file.
Later, I could load it again on a different machine and resume it.
Something like a VM snapshot function.
I've seen a question like this here, but Java-related: saving the current JVM and running it again in a different JVM. Most people said that there was nothing like that; only Terracotta had something, and it was still not perfect.
Thank you.
To clarify what I am trying to achieve:
Given 2 or more Raspberry Pis, I'm trying to run my software on Pi nº1, but then, when I need to do something different with it, I need to move the software to Pi nº2 without data loss and with only a brief interruption.
And so on, to an unlimited number of machines.

It seems that I was trying to reinvent the wheel.
Check these links:
http://en.wikipedia.org/wiki/Application_checkpointing#DMTCP
http://www.linuxscrew.com/2007/10/17/cryopid-freeze-and-unfreeze-processes-in-linux/

Good question.
In Smalltalk, yes.
Actually, in Smalltalk, dumping the whole program and restarting is the only way to store and share programs. There are no source files and there is no way of starting a program from square zero. So in Smalltalk you would get your feature for free.
The Smalltalk VM offers a hook where each object can register to restore its external resources after a restart, like reopening files and internet connections. But also, for example, integer arrays are registered to that hook to change the endianness of their values in case the dump has been moved to a machine with a different endianness.
This might give a hint of how difficult (or not) it might turn out to be to achieve this in a language which does not support resumable dumps by design.
All other languages are, alas, much less live. Except for some Lisp implementations, I would not know of any language which supports resuming from a memory dump.
Which is a missed opportunity.
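In Python, the closest everyday equivalent is much more limited: rather than dumping the whole process, the program saves the state it cares about explicitly and reacquires external resources when it is restored. The sketch below illustrates that idea with pickle; the Job class, its log file, and the checkpoint/restore method names are invented for this example and do not capture the call stack or any state not listed explicitly.

```python
import pickle


class Job:
    """A toy long-running job whose state can be moved to another machine."""

    def __init__(self, log_path, count=0):
        self.log_path = log_path
        self.count = count
        # External resource: an open file handle cannot be dumped, so it is
        # (re)opened here every time a Job is created or restored.
        self.log = open(self.log_path, "a")

    def step(self):
        self.count += 1
        self.log.write(f"{self.count}\n")
        self.log.flush()

    def checkpoint(self, path):
        """Dump only the plain data needed to rebuild this job."""
        state = {"log_path": self.log_path, "count": self.count}
        with open(path, "wb") as f:
            pickle.dump(state, f)

    @classmethod
    def restore(cls, path):
        """Rebuild the job from a checkpoint; __init__ reopens the log."""
        with open(path, "rb") as f:
            state = pickle.load(f)
        return cls(**state)
```

Copying the checkpoint file to another machine and calling Job.restore there continues counting from the saved value. Only load checkpoints you created yourself, since unpickling untrusted data can execute arbitrary code.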

I've seen Mariano demonstrate that using Fuel (object serialization) in Pharo Smalltalk at a recent ESUG conference. You could continue debugging and running as long as you don't hit objects that were not serialized. Squeak Smalltalk runs on the Pi, and if saving an image is good enough for you, this is trivial. We're still waiting for the faster JITting VM for ARM, though (part of the Google Summer of Code programme).

Managing the running of on-server scripts

The title is a bit fuzzy because I don't know the right vocabulary.
Here's the thing I am trying to do: I have a script/program on the server for running checks. Now my co-workers want to be able to start this script from a website and view the logs there. The checks can run for quite a long time, usually more than a few hours.
For that, I gather, I'd have to monitor the processes with the website script and show their logs. The chosen language would be either PHP or Python.
I'd very much like a hint or view on how such a thing is generally done and what the best practices are, as I'm unsure how to start with this one. A reliable way to start and monitor the processes would be especially welcome.

If you choose Python, check out Celery (although it may be a little bit overkill if you want to keep things simple). It allows you to run asynchronous tasks, and you can easily monitor them. There is also a Django integration for Celery (django-celery) that includes a web monitor for the tasks.
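If Celery turns out to be more than you need, the basic start/monitor/log-view cycle can be sketched with the standard library alone. This is only an illustration of the idea, not what Celery does internally; the function names start_check, status, and tail are made up for this example.

```python
import subprocess


def start_check(cmd, log_path):
    """Start the check in the background, sending all its output to a log file."""
    log_file = open(log_path, "wb")
    proc = subprocess.Popen(cmd, stdout=log_file, stderr=subprocess.STDOUT)
    log_file.close()  # the child keeps its own copy of the file handle
    return proc


def status(proc):
    """Report whether the process is still running or how it ended."""
    code = proc.poll()  # None while the process is still running
    if code is None:
        return "running"
    return "finished" if code == 0 else f"failed (exit code {code})"


def tail(log_path, max_bytes=4096):
    """Return the last part of the log, e.g. for display on a web page."""
    with open(log_path, "rb") as f:
        f.seek(0, 2)  # jump to the end to find the file size
        size = f.tell()
        f.seek(max(0, size - max_bytes))
        return f.read().decode(errors="replace")
```

In a real web application the Popen object does not survive between requests, so the process ID and log path would have to be stored somewhere (a database or a file), which is one of the problems a task queue such as Celery already solves.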

Code interpreter in a web service

I'd like to build a website with a sandboxed interpreter (or compiler), either on the client side or on the server side, that can take short blocks of code (Python/Java/C/C++, any common language would do) as input and execute them.
What I want to build is a place where given a programming question, the user can type in the solution and we can run it through some test cases, to either approve the solution or provide a test case where it breaks.
I'm looking for pointers to libraries, existing implementations, or a general idea.
Any help much appreciated.

There are many contest websites that do something like this-- TopCoder and Timus Online Judge are two examples. They don't have much information on the technology, however.
codepad.org is the closest to what you want to do. They run programs on heavily sandboxed and firewalled EC2 servers that are periodically wiped, to prevent exploits.
Codepad is at least partially based on geordi, an IRC bot designed to run arbitrary C++ programs. It uses Haskell and traps system calls to prevent harmful activity.
Of slightly less interest, one of Google App Engine's example projects is a Python shell. It relies on GAE's server-side sandboxing to prevent malicious activity.
In terms of interface, the simplest would be to do something like the International Olympiad in Informatics. Have people write a function with a certain name in the target language, then invoke that from your testing framework. Have simple functions that will let them request information from the framework, if necessary.
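For Python submissions, the "function with a certain name" interface could be sketched as below. The function name first_failing_case and the (args, expected) case format are assumptions for this example. Note that exec runs the submitted code with full privileges, so this shows only the interface and must itself run inside a sandbox of the kind discussed above.

```python
def first_failing_case(source, func_name, cases):
    """Load a submission, call its entry function on each test case, and
    return the first failing (args, expected, actual) triple, or None if
    every case passes.

    WARNING: exec() offers no isolation; run this only inside a sandbox.
    """
    namespace = {}
    exec(compile(source, "<submission>", "exec"), namespace)
    func = namespace.get(func_name)
    if not callable(func):
        raise ValueError(f"submission does not define {func_name}()")
    for args, expected in cases:
        try:
            actual = func(*args)
        except Exception as exc:  # report the crash as the failing result
            return args, expected, repr(exc)
        if actual != expected:
            return args, expected, actual
    return None
```

A return value of None means the solution can be approved; otherwise the returned triple is the test case where it breaks, which can be shown to the user.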

For Python you can compile PyPy in sandboxed mode which gives you a complete interpreter and full standard library but without the ability to execute arbitrary system calls. You can also limit the runtime and heap size of executed scripts.
Here's some code I wrote a while back to execute an arbitrary string containing a Python script in the pypy-sandbox binary and return the output. You can call this code from regular CPython.

Take a look at the paper An Enticing Environment for Programming which discusses building just such an environment.
