I have a Python script that does something along the lines of:
def MyScript(input_filename1, input_filename2):
    # ... very intensive computation ...
    return val
i.e. for every pair of inputs, I calculate some float value. Note that val is a simple double/float.
Since this computation is very intensive, I will be running it across different processes (they might be on the same computer, they might be on multiple computers).
What I did before was write each value to its own text file, input1_input2.txt. That leaves me with 1,000,000 files that I then need to reduce into one file. This process is not very fast, since the OS doesn't like folders that contain too many files.
How do I efficiently get all this data onto one single computer? Perhaps by having MongoDB running on one machine and having all the processes send their results to it?
I want something easy. I know I could do this with MPI, but I think it is overkill for such a simple task.
If the inputs have a natural order to them, and each worker can find out "which" input it's working on, you can get away with one file per machine. Since Python floats are 8 bytes long, each worker would write the result to its own 8-byte slot in the file.
import struct

RESULT_FORMAT = 'd'  # Double-precision float.
RESULT_SIZE = struct.calcsize(RESULT_FORMAT)
RESULT_FILE = '/tmp/results'

def worker(position, input_filename1, input_filename2):
    val = MyScript(input_filename1, input_filename2)
    with open(RESULT_FILE, 'rb+') as f:
        f.seek(RESULT_SIZE * position)
        f.write(struct.pack(RESULT_FORMAT, val))
Compared to writing a bunch of small files, this approach should also be a lot less I/O intensive, since many workers will be writing to the same pages in the OS cache.
(Note that on Windows, you may need some additional setup to allow sharing the file between processes.)
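Here is a sketch of the bookkeeping around that worker, assuming you know the total number of results in advance; the create_result_file and read_results helpers are illustrative names, not part of the approach above. The file has to be created at its final size once before the workers open it with 'rb+', and afterwards all the values can be unpacked in one go.

import struct

RESULT_FORMAT = 'd'
RESULT_SIZE = struct.calcsize(RESULT_FORMAT)
RESULT_FILE = '/tmp/results'

def create_result_file(n_results):
    # Pre-allocate one 8-byte slot per result so every worker can seek into it.
    with open(RESULT_FILE, 'wb') as f:
        f.write(b'\x00' * RESULT_SIZE * n_results)

def read_results(n_results):
    # Read the whole file back and unpack it into a list of floats.
    with open(RESULT_FILE, 'rb') as f:
        data = f.read(RESULT_SIZE * n_results)
    return list(struct.unpack('%d%s' % (n_results, RESULT_FORMAT), data))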
You can use Python's parallel processing support.
http://wiki.python.org/moin/ParallelProcessing
Specifically, I would mention NetWorkSpaces.
http://www.drdobbs.com/web-development/200001971
You can generate a folder structure of generated subfolders that themselves contain generated subfolders.
For example, a main folder that contains 256 subfolders, each of which contains 256 subfolders; three levels deep will be enough. You can use substrings of GUIDs to generate unique folder names.
So the GUID AB67E4534678E4E53436E becomes folder AB, which contains subfolder 67, which in turn contains folder E4534678E4E53436E.
Using two substrings of 2 characters each makes it possible to generate 256 * 256 folders, more than enough to store 1 million files.
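A small sketch of that scheme; the base folder, the depth, and the .txt suffix are placeholders of my own, not part of the answer above:

import os

def nested_path(guid, base='output', levels=2, width=2):
    # e.g. 'AB67E4534678E4E53436E' -> 'output/AB/67/AB67E4534678E4E53436E.txt'
    parts = [guid[i * width:(i + 1) * width] for i in range(levels)]
    folder = os.path.join(base, *parts)
    os.makedirs(folder, exist_ok=True)  # create the subfolders on first use
    return os.path.join(folder, guid + '.txt')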
You could run one program that collects the outputs, for example over XML-RPC.
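A minimal sketch of such a collector using the standard library's xmlrpc modules; the host name, port, and the report() function are made up for illustration, and the single-threaded SimpleXMLRPCServer is enough here because each call only delivers one float.

from xmlrpc.server import SimpleXMLRPCServer

results = {}

def report(input_filename1, input_filename2, val):
    # Called remotely by each worker; store one (inputs -> value) result.
    results[(input_filename1, input_filename2)] = val
    return True

server = SimpleXMLRPCServer(('0.0.0.0', 8000))
server.register_function(report)
server.serve_forever()

Each worker then pushes its result with something like:

import xmlrpc.client

proxy = xmlrpc.client.ServerProxy('http://collector-host:8000/')
proxy.report(input_filename1, input_filename2, MyScript(input_filename1, input_filename2))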
I have a .txt with hundreds of thousands of paths and I simply have to check whether each line is a folder or a file. The hard drive is not with me, so I can't use the os module with os.path.isdir(). I've tried the code below, but it is just not reliable, since some folder names contain a '.' near the end.
for row in files:
    if row[-6:].find(".") < 0:
        folders_count += 1
It is just not worth testing whether the end of the string contains any known file format (.zip, .pdf, .doc, ...), since there are dozens of different file formats on this HD. When my code reads the .txt, it stores each line as a string in a list, so my code has to work with the strings alone.
An example of a folder path:
'path1/path2/truckMV.34'
An example of a file path:
'path1/path2/certificates.pdf'
It's impossible for us to judge whether it's a file or a folder from the string alone, since an extension is just an arbitrary, agreed-upon suffix that programs choose to interpret in a certain way.
Having said that, if I had the same problem I would do my best to estimate with the following pseudo code:
Create a hash map (or a dictionary, as you are in Python).
For every line of the file, look at the last path component and check whether it contains a '.'.
If it does, use the text after the '.' as a key in the hash map and count how many times you encounter each "possible extension".
After you go through the whole list you will have a collection of possible extensions and how many times each occurred. Assume the ones with only 1 occurrence (or any other arbitrarily low number) to be folders rather than real extensions.
The basis of this heuristic is that it's unlikely for a person to have a lot of unique extensions on their desktop - but that's just an assumption I came up with.
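A rough sketch of that heuristic using collections.Counter; the threshold value and the lower-casing of extensions are my own choices, not part of the description above:

from collections import Counter

def guess_folders(paths, threshold=1):
    # Count every "possible extension": the text after the last '.' in the
    # final path component.
    ext_counts = Counter()
    for p in paths:
        name = p.rstrip('/').rsplit('/', 1)[-1]
        if '.' in name:
            ext_counts[name.rsplit('.', 1)[-1].lower()] += 1
    # Treat paths whose "extension" is missing or rare as folders.
    folders = []
    for p in paths:
        name = p.rstrip('/').rsplit('/', 1)[-1]
        if '.' not in name or ext_counts[name.rsplit('.', 1)[-1].lower()] <= threshold:
            folders.append(p)
    return folders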
I am trying to reorganize a large number of PDF files (3 million files, average file size 300 KB). Currently, the files are stored in randomly named folders, but I want to organize them by their file names. File names are 8-digit integers such as 12345678.pdf
Currently, the files are stored like this
/old/a/12345678.pdf
/old/a/12345679.pdf
/old/b/22345679.pdf
I want them to be stored like this
/new/12/345/12345678.pdf
/new/12/345/12345679.pdf
/new/22/345/22345679.pdf
I thought this was an easy task using shutil:
from pathlib import Path
import shutil

for path_old in Path('old').rglob('*.pdf'):
    r = int(path_old.stem)
    path_new = '/new/' + str(r // 1000**2) + '/' + str(r // 1000 % 1000) + '/' + path_old.name
    shutil.move(str(path_old), path_new)
Unfortunately, this takes forever. My script is only moving ~15 files per second, which means it will take days to complete.
I am not exactly sure whether this is a Python/shutil problem or a more general IO problem - sorry if I misplaced the question. I am open to any type of solution that makes this process faster.
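For reference, the same loop can be written with pathlib alone, creating each destination directory on demand. This sketch assumes old/ and /new sit on the same filesystem, so that rename() stays a pure metadata operation; whether it is noticeably faster than shutil.move depends mostly on the filesystem, not on Python.

from pathlib import Path

new_root = Path('/new')

for path_old in Path('old').rglob('*.pdf'):
    r = int(path_old.stem)
    dest_dir = new_root / str(r // 1000**2) / str(r // 1000 % 1000)
    dest_dir.mkdir(parents=True, exist_ok=True)  # cheap no-op once the folder exists
    path_old.rename(dest_dir / path_old.name)    # same-filesystem move, no data copy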
I am using HTCondor to generate some data (txt, png). When my program runs, it creates a directory named datasets next to the .sub file, and the datasets are stored in it. Unfortunately, Condor does not give this created data back to me when it finishes. In other words, my goal is to get the created data into a "Datasets" subfolder next to the .sub file.
I tried:
1) Not putting the data under the datasets subfolder, and I got the files back as expected. However, this is not a clean solution, since I generate around 100 files which then end up mixed in with the .sub file and everything else.
2) I also tried to set this up in the .sub file, leading to this:
notification = Always
should_transfer_files = YES
RunAsOwner = True
When_To_Transfer_Output = ON_EXIT_OR_EVICT
getenv = True
transfer_input_files = main.py
transfer_output_files = Datasets
universe = vanilla
log = log/test-$(Cluster).log
error = log/test-$(Cluster)-$(Process).err
output = log/test-$(Cluster)-$(Process).log
executable = Simulation.bat
queue
This time I get an error that Datasets was not found. The spelling was checked already.
3) Another option would be to pack everything into a zip, but since I have to run hundreds of jobs, I do not want to unpack all these files afterwards.
I hope somebody comes up with a good idea on how to solve this.
Just for the record here: HTCondor does not transfer directories created during the run, or their contents, back at the end of the job. The best way to get the content back is to write a wrapper script that runs your executable and then compresses the created directory at the root of the working directory. This file will be transferred with all the other files. For example, create run.exe:
#!/bin/bash
./Simulation.bat
tar zcf Datasets.tar.gz Datasets
and in your condor submission script put:
executable = run.exe
However, if you do not want to do this, and if HTCondor is using a common shared space such as AFS, you can simply copy the whole directory out:
./Simulation.bat
cp -r Datasets <AFS location>
The other alternative is to define an initialdir as described at the end of: https://research.cs.wisc.edu/htcondor/manual/quickstart.html
But one must create the directory structure by hand.
Also, look around page 65 of: https://indico.cern.ch/event/611296/contributions/2604376/attachments/1471164/2276521/TannenbaumT_UserTutorial.pdf
This document is, in general, a very useful one for beginners.
I have a common problem: I have some data and I want to search in it. My issue is that I don't know the proper data structures and algorithms for this situation.
There are two kinds of objects - Process and Package. Both have some properties, but they are pure data structures (they don't have any methods). Next, there are a PackageManager and something that can be called a ProcessManager, which both have a function returning the list of files that belong to a given Package or that are used by a given Process.
So semantically, we can imagine the data as:
Packages:
  Package_1
    file_1
    file_2
    file_3
  Package_2
    file_4
    file_5
    file_6
Note that a file that belongs to Package_k cannot belong to Package_l for k != l :-)
Processes:
  Process_1
    file_2
    file_3
  Process_2
    file_1
The files used by processes correspond to the files owned by packages. However, the exclusivity rule above does not apply here as it does for packages - that is, n processes can use the same file at the same time.
Now, the task: I need to find matches between processes and packages - for a given list of packages, I need to find the list of processes that use any of the files owned by those packages.
My temporary solution was to make a list of [package_name, package_files] and a list of [process_name, process_files], and for every file of every package to search through every file of every process looking for a match. Of course this can only be a temporary solution, given its horrible time complexity (even when I sort the files and use binary search).
What can you recommend for this kind of searching?
(I am coding this in Python.)
Doing the matching with sets should be faster:
watched_packages = [Package_1, Package_3]  # Packages to consider

# Set of every file owned by any watched package (set comprehension).
watched_files = {
    file_
    for package in watched_packages
    for file_ in package.list_of_files
}

# Processes that use at least one of the watched files.
watched_processes = [
    process
    for process in all_processes
    if any(
        file_ in watched_files
        for file_ in process.list_of_files
    )
]
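To make that concrete, here is a self-contained version of the same idea, under the assumption that Package and Process are plain data holders with a list_of_files attribute; the classes and the sample data are illustrative, mirroring the trees in the question.

from dataclasses import dataclass

@dataclass
class Package:
    name: str
    list_of_files: list

@dataclass
class Process:
    name: str
    list_of_files: list

Package_1 = Package('Package_1', ['file_1', 'file_2', 'file_3'])
Package_2 = Package('Package_2', ['file_4', 'file_5', 'file_6'])
all_processes = [
    Process('Process_1', ['file_2', 'file_3']),
    Process('Process_2', ['file_1']),
]

watched_packages = [Package_1]
watched_files = {f for p in watched_packages for f in p.list_of_files}
matching = [pr for pr in all_processes
            if any(f in watched_files for f in pr.list_of_files)]
print([pr.name for pr in matching])  # -> ['Process_1', 'Process_2']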
Based on my understanding of what you are trying to do (given a file name, you want to find all the processes that use that file), this snippet of code should help:
from collections import defaultdict

# First build a dictionary that maps each file to all the processes that use it.
file_process_map = defaultdict(list)
for p in processes:
    for fn in p.file_list:
        file_process_map[fn].append(p)
Basically, we're converting your existing structure (where a process has one or more files) into a structure where we have a filename, and a list of processes that use it.
Now when you have a file you need to search for (in the processes) just look it up in the "file_process_map" dictionary and you'll have a list of all the processes that use the given file.
It is assumed here that processes is a list of objects, each of which has a file_list attribute containing the list of associated files. Obviously, depending on your data structures, you might need to alter the code.
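For example, the lookup itself is then a single dictionary access; the 'file_2' key and the .name attribute are placeholders for whatever your objects actually carry.

# Which processes use 'file_2'?
users = file_process_map.get('file_2', [])  # .get avoids inserting an empty entry
print([p.name for p in users])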
I'm using glob to feed file names to a loop like so:
import glob

inputcsvfiles = glob.iglob('NCCCSM*.csv')
for x in inputcsvfiles:
    csvfilename = x
    # do stuff here
The toy example that I used to prototype this script works fine with 2, 10, or even 100 input CSV files, but I actually need it to loop through 10,959 files. When using that many files, the script stops working after the first iteration and fails to find the second input file.
Given that the script works absolutely fine with a "reasonable" number of entries (2-100), but not with what I need (10,959), is there a better way to handle this situation, or some sort of parameter that I can set to allow for a high number of iterations?
PS - initially I was using glob.glob, but glob.iglob fares no better.
Edit:
An expansion of the above for more context...
# typical input files look like this: "NCCCSM20110101.csv", "NCCCSM20110102.csv", etc.
inputcsvfiles = glob.iglob('NCCCSM*.csv')

# loop over individual input files
for x in inputcsvfiles:
    csvfile = x
    modelname = x[0:5]
    # ArcPy
    arcpy.AddJoin_management(inputshape, "CLIMATEID", csvfile, "CLIMATEID", "KEEP_COMMON")
    # do more stuff after
The script fails at the ArcPy line, where the "csvfile" variable gets passed into the command. The error reported is that it can't find a specified csv file (e.g., "NCCSM20110101.csv"), when in fact, the csv is definitely in the directory. Could it be that you can't reuse a declared variable (x) multiple times as I have above? Again, this will work fine if the directory being glob'd only has 100 or so files, but if there's a whole lot (e.g., 10,959), it fails seemingly arbitrarily somewhere down the list.
Try doing an ls * in a shell on those 10,000 entries and the shell would fail too. How about walking the directory and yielding those files one by one instead?
# credit - @dabeaz - generators tutorial
import os
import fnmatch

def gen_find(filepat, top):
    # Walk the tree under 'top' and yield every path matching 'filepat'.
    for path, dirlist, filelist in os.walk(top):
        for name in fnmatch.filter(filelist, filepat):
            yield os.path.join(path, name)

# Example use
if __name__ == '__main__':
    lognames = gen_find("NCCCSM*.csv", ".")
    for name in lognames:
        print(name)
One issue that arose was not with Python per se, but rather with ArcPy and/or MS handling of CSV files (more the latter, I think). As the loop iterates, it creates a schema.ini file whereby information on each CSV file processed in the loop gets added and stored. Over time, the schema.ini gets rather large and I believe that's when the performance issues arise.
My solution, although perhaps inelegant, was to delete the schema.ini file during each loop iteration to avoid the issue. Doing so allowed me to process the 10k+ CSV files, although rather slowly. Truth be told, we wound up using GRASS and BASH scripting in the end.
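A sketch of that workaround, assuming schema.ini ends up in the same directory as the CSV files (the exact location is an assumption on my part):

import os
import glob

for csvfile in glob.iglob('NCCCSM*.csv'):
    # ... ArcPy work on csvfile goes here ...

    # Delete the schema.ini that the CSV driver keeps appending to,
    # so it cannot grow with every file processed in the loop.
    schema = os.path.join(os.path.dirname(os.path.abspath(csvfile)), 'schema.ini')
    if os.path.exists(schema):
        os.remove(schema)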
If it works for 100 files but fails for 10000, then check that arcpy.AddJoin_management closes csvfile after it is done with it.
There is a limit on the number of open files that a process may have at any one time (which you can check by running ulimit -n).
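If you want to inspect (or raise) that limit from inside Python rather than from the shell, the standard library's resource module exposes it on Unix-like systems; this is only a diagnostic sketch, it does not replace closing the files properly.

import resource

# Current soft and hard limits on open file descriptors (Unix only).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('open-file limit: soft=%d, hard=%d' % (soft, hard))

# An unprivileged process may raise its soft limit up to the hard limit.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))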