My task is:
I have a program, written in Python which takes a set of variables (A,B,C) as input, and outputs two numbers (X, Y).
I want to run this program over a large range of inputs (A,B,C)
I'll refer to running the program with a given set of variables as 'running an experiment'
I am using a cluster to do this (I submit my jobs in SLURM)
I then want to save the results of the experiments into a single file, e.g. a data frame with columns [A|B|C|X|Y], where each row is the output of a different experiment.
My current situation is:
I have written my program in the following form:
import io
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-f", "--file", action="store", type="string", dest="filename")
parser.add_option("-a", "--input_a", type="int", dest="a")
parser.add_option("-b", "--input_b", type="int", dest="b")
parser.add_option("-c", "--input_c", type="int", dest="c")
(options, args) = parser.parse_args()

def F(a, b, c):
    return (a + b, b + c)

# optparse stores each value under its dest name, so read options.a, not options.input_a
Alice = options.a
Bob = options.b
Carol = options.c

with io.open("results_a_{0}_b_{1}_c_{2}.txt".format(Alice, Bob, Carol), "a") as f:
    (x, y) = F(Alice, Bob, Carol)
    f.write(u"first output = {0}, second output = {1}".format(x, y))
This allows me to run the program once, and save the results in a single file.
In principle, I could then submit this job for a range of (A,B,C), obtain a large number of text files with the results, and then try to aggregate them into a single file.
However, I assume this isn't the best way of going about things.
What I'd like to know is:
Is there a more natural way for me to run these experiments and save the results all in one file, and if so, what is it?
How should I submit my collection of jobs on SLURM to do so?
I currently am submitting (effectively) the script below, which is not really working:
...
for a in 1 2 3; do
    for b in 10 20 30; do
        for c in 100 200 300; do
            python toy_experiment.py -a $a -b $b -c $c &
        done
    done
done
wait
[I'm wary that there are possibly many other places I'm going wrong here - I'm open to using something other than optparse to pass arguments to my program, to saving the results differently, etc. - my main goal is having a working solution.]
TL;DR
You should research how file locking works on your system and use it to write to a single output file safely.
You should also submit your script using a job array, letting each job in the array run a single experiment.
Many things going on here.
One important question: how long does F() usually take to compute? I assume that you just wrote an example, but this is needed to decide the best approach to the problem.
If each calculation is short, maybe you can run a few batches, aggregating several computations in a single script: in the example, 3 batches (for a==1, a==2 and a==3), each of them computing all the possible experiments for b and c and generating 3 files that have to be aggregated at the end.
If each calculation is long, then the overhead of creating some thousands of small files is not a big deal, and concatenating them afterwards will be easy.
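As a rough sketch of that concatenation step, assuming the per-experiment files keep the exact naming and content format used in the question (results_a_{A}_b_{B}_c_{C}.txt containing "first output = X, second output = Y"), something like the following could collect them into one CSV. The function name aggregate and the output file name are made up for illustration.

```python
import csv
import glob
import os
import re

# Patterns matching the file name and file body written by the question's script.
NAME_RE = re.compile(r"results_a_(-?\d+)_b_(-?\d+)_c_(-?\d+)\.txt$")
BODY_RE = re.compile(r"first output = (-?\d+), second output = (-?\d+)")

def aggregate(results_dir, out_path):
    """Collect every per-experiment file into one CSV with columns A,B,C,X,Y."""
    rows = []
    pattern = os.path.join(results_dir, "results_a_*_b_*_c_*.txt")
    for path in sorted(glob.glob(pattern)):
        name = NAME_RE.search(os.path.basename(path))
        with open(path) as f:
            body = BODY_RE.search(f.read())
        if name and body:  # skip files that do not match the expected format
            rows.append([int(v) for v in name.groups() + body.groups()])
    with open(out_path, "w", newline="") as out:
        writer = csv.writer(out)
        writer.writerow(["A", "B", "C", "X", "Y"])
        writer.writerows(rows)
    return len(rows)
```

Calling aggregate("results", "all_results.csv") once after all jobs have finished would then produce the single table; files that do not match the expected format are silently skipped in this sketch.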
Another thing: by launching all your jobs simultaneously in the background you are probably overloading a single machine. I don't know how you ask SLURM for resources, but you will almost certainly be using only one node, and overusing it. If there are other users there, they will probably be pissed off. You must limit the number of simultaneous jobs running on a node to the number of processors granted on that node. Starting your calculations with srun will probably help.
Another way would be to create a single job per calculation. You can encapsulate them in a job array. In that case, you will only run one experiment per job and you won't have to start and wait for anything in the background.
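A minimal sketch of the job-array idea, assuming the grid of values from the question's shell loop: SLURM sets the environment variable SLURM_ARRAY_TASK_ID for each job in an array, and the script can turn that index into one (a, b, c) combination. The helper name params_for_task is hypothetical.

```python
import itertools
import os

# Parameter grid taken from the loops in the question.
A_VALUES = [1, 2, 3]
B_VALUES = [10, 20, 30]
C_VALUES = [100, 200, 300]

def params_for_task(task_id):
    """Map a job-array index to one (a, b, c) combination of the grid."""
    grid = list(itertools.product(A_VALUES, B_VALUES, C_VALUES))
    return grid[task_id]

if __name__ == "__main__":
    # Falls back to index 0 when run outside SLURM, which is handy for local testing.
    task_id = int(os.environ.get("SLURM_ARRAY_TASK_ID", "0"))
    a, b, c = params_for_task(task_id)
    print(a, b, c)
```

With 27 combinations, the batch script would be submitted with something like sbatch --array=0-26, and each array job would run exactly one experiment with no background processes or wait.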
Finally, getting to your main question: what is the best way to save all this information to disk efficiently?
Creating thousands of files is easy and possible, but not the best way for the file system. Maybe you have access to some RAMdisk in a common node. Writing a small file in compute node local storage and copying that file to the common in-memory disk would be quite a lot more efficient. And when all the experiments have been done you can aggregate the results easily. The drawback is that the space is usually very limited (I don't really know the real size of your data) and it will be an in-memory disk: it can be lost in case of power failure.
Using a single file would be a better approach, but as Dmitri Chubarov pointed out, you have to make use of the file locking mechanisms. Otherwise you risk getting interleaved results.
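One possible shape for the locked write, as a sketch only: it uses the standard library's fcntl.flock, which is available on Unix-like systems, and it assumes the shared file system actually honours such locks (some network file systems do not, depending on how they are mounted, so this is worth checking with the cluster admins). The function name append_result is made up.

```python
import csv
import fcntl

def append_result(path, row):
    """Append one CSV row to a shared file while holding an exclusive lock."""
    with open(path, "a", newline="") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other process holds the lock
        try:
            csv.writer(f).writerow(row)
            f.flush()  # push the row to the OS before releasing the lock
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)
```

Each experiment would call append_result("results.csv", [a, b, c, x, y]) once at the end, so the lock is held only for the duration of a single short write.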
Finally, the approach that I feel is best suited to your problem is to use some kind of database-like solution. If you have access to a relational DB that supports transactions, just create a table with the needed fields and let your code connect and insert the results. Extracting them at the end will be a breeze. The DB can be a client/server one (MySQL, PostgreSQL, Oracle...) or an embedded one (HSQLDB). Another option would be to use a file format like NetCDF, which is intended precisely for this kind of scientific data and has some support for parallel access.
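To illustrate the database route, here is a sketch using SQLite through Python's built-in sqlite3 module. SQLite is swapped in here purely because it ships with Python; it is not one of the databases named above, and its locking can be unreliable on network file systems, so a client/server database may be the safer choice on a cluster. The table layout and the function name save_result are assumptions.

```python
import sqlite3

def save_result(db_path, a, b, c, x, y):
    """Insert one experiment's result into a results table, inside a transaction."""
    conn = sqlite3.connect(db_path, timeout=30)  # wait up to 30 s if the file is locked
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "CREATE TABLE IF NOT EXISTS results "
                "(a INTEGER, b INTEGER, c INTEGER, x INTEGER, y INTEGER)"
            )
            conn.execute(
                "INSERT INTO results VALUES (?, ?, ?, ?, ?)", (a, b, c, x, y)
            )
    finally:
        conn.close()
```

Extracting everything at the end is then a single SELECT a, b, c, x, y FROM results query.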
Related
I am reading many (say 1k) CERN ROOT files in a loop and storing some data in a nested NumPy array. The loop makes this a serial task, and each file takes quite some time to process. Since I am working on a deep learning model, I must create a large enough dataset, but the reading itself takes a very long time (reading 835 events takes about 21 minutes). Can anyone please suggest whether it is possible to use multiple GPUs to read the data, so that less time is required for the reading? If so, how?
Adding some more details: I pushed the program to GitHub so that it can be seen (please let me know if posting a GitHub link is not allowed; in that case, I will post the relevant portion here):
https://github.com/Kolahal/SupervisedCounting/blob/master/read_n_train.py
I run the program as:
python read_n_train.py <input-file-list>
where the argument is a text file containing the list of the files with addresses. I was opening the ROOT files in a loop in the read_data_into_list() function. But as I mentioned, this serial task is consuming a lot of time. Not only that, I notice that the reading speed is getting worse as we read more and more data.
Meanwhile I tried to use the slurmpy package https://github.com/brentp/slurmpy
With this, I can distribute the job into, say, N worker nodes, for example. In this case, an individual reading program will read the file assigned to it and will return a corresponding list. It is just that in the end, I need to add the lists. I couldn't figure out a way to do this.
Any help is highly appreciated.
Regards,
Kolahal
You're looping over all the events sequentially from python, that's probably the bottleneck.
You can look into root_numpy to load the data you need from the root file into numpy arrays:
root_numpy is a Python extension module that provides an efficient interface between ROOT and NumPy. root_numpy’s internals are compiled C++ and can therefore handle large amounts of data much faster than equivalent pure Python implementations.
I'm also currently looking at root_pandas which seems similar.
While this solution does not precisely answer the request for parallelization, it may make the parallelization unnecessary. And if it is still too slow, then it can still be run in parallel using slurm or something else.
I have two files. File A contains 1 million records. File B contains approximately 2,000 strings, each on a separate line.
I have a Python script that takes each string in File B in turn and searches for a match in File A. The Logic is as follows:
For string in File B:
    For record in File A:
        if record contains string:  # I use regex for this
            write record to a separate file
This is currently running as a single thread of execution and takes a few hours to complete.
I’d like to implement concurrency to speed up this script. What is the best way to approach it? I have looked into multi-threading but my scenario doesn’t seem to represent the producer-consumer problem as my machine has an SSD and I/O is not an issue. Would multiprocessing help with this?
Running such a problem with multi-threads poses a couple of challenges:
We have to run over all of the records in file A in order to get the algorithm done.
We have to synchronize the writing to the separate file, so we won't overwrite the printed records.
I'd suggest:
Assign a single thread just for printing - so your external file won't get messed up.
Open as many threads as you can support (n), and give each of them different 1000000/n records to work on.
The processing you want to do requires checking whether any of the 2_000 strings is in each of the 1_000_000 records, which amounts to 2_000_000_000 such "checks" in total. There's no way around that. Your current logic with the nested for loops just iterates over all the possible combinations of things in the two files, one by one, and does the checking (and output file writing).
You need to determine the way (if any) that this could be accomplished concurrently. For example you could have "N" tasks each checking for one string in each of the million records. The outputs from all these tasks represent the desired output and would likely need to be aggregated together into a single file. Since the results will be in relatively random order, you may also want to sort them.
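A sketch of the split-the-records variant suggested above, under several assumptions that are not in the question: the search strings are treated as literal text (escaped with re.escape), they are combined into one alternation pattern so each record is scanned once, and a record that matches several strings is written only once. Processes are used rather than threads because CPU-bound regex matching in standard CPython does not run in parallel across threads. All function names are made up.

```python
import re
from multiprocessing import Pool

def chunked(items, n_chunks):
    """Split a list into at most n_chunks contiguous slices of similar size."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division, at least 1
    return [items[i:i + size] for i in range(0, len(items), size)]

def find_matches(task):
    """Return the records of one chunk that contain any of the search strings."""
    records, strings = task
    pattern = re.compile("|".join(re.escape(s) for s in strings))
    return [record for record in records if pattern.search(record)]

def parallel_search(records, strings, n_workers=4):
    """Search all chunks in separate processes and merge results in input order."""
    tasks = [(chunk, strings) for chunk in chunked(records, n_workers)]
    with Pool(n_workers) as pool:
        parts = pool.map(find_matches, tasks)
    return [record for part in parts for record in part]
```

Because pool.map returns results in task order, the merged list keeps the order of File A and only the parent process writes the output file, so no write synchronization is needed. parallel_search should be called from under an if __name__ == "__main__": guard so that it also works on platforms that spawn rather than fork worker processes.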
We are running logfile parsing jobs in google dataflow using the Python SDK. Data is spread over several 100s of daily logs, which we read via file-pattern from Cloud Storage. Data volume for all files is about 5-8 GB (gz files) with 50-80 million lines in total.
loglines = p | ReadFromText('gs://logfile-location/logs*-20180101')
In addition, we have a simple (small) mapping csv, that maps logfile-entries to human readable text. Has about 400 lines, 5 kb size.
For Example a logfile entry with [param=testing2] should be mapped to "Customer requested 14day free product trial" in the final output.
We do this in a simple beam.Map with sideinput, like so:
customerActions = loglines | beam.Map(map_logentries,mappingTable)
where map_logentries is the mapping function and mappingTable is said mapping table.
However, this only works if we read the mapping table in native python via open() / read(). If we do the same utilising the beam pipeline via ReadFromText() and pass the resulting PCollection as side-input to the Map, like so:
mappingTable = p | ReadFromText('gs://side-inputs/category-mapping.csv')
customerActions = loglines | beam.Map(map_logentries,beam.pvalue.AsIter(mappingTable))
performance breaks down completely to about 2-3 items per Second.
Now, my questions:
Why would performance break so badly, what is wrong with passing a PCollection as side-input?
If it is maybe not recommended to use PCollections as side-input, how is one supposed to build such a pipeline that needs mappings that can/should not be hard coded into the mapping function?
For us, the mapping does change frequently and I need to find a way to have "normal" users provide it. The idea was to have the mapping csv available in Cloud Storage, and simply incorporate it into the Pipeline via ReadFromText(). Reading it locally involves providing the mapping to the workers, so only the tech-team can do this.
I am aware that there are caching issues with side-input, but surely this should not apply to a 5kb input.
All code above is pseudo code to explain the problem. Any ideas and thoughts on this would be highly appreciated!
For more efficient side inputs (with small to medium size) you can utilize
beam.pvalue.AsList(mappingTable)
since AsList causes Beam to materialize the data, so you can be sure that you will get an in-memory list for that PCollection.
Intended for use in side-argument specification---the same places
where AsSingleton and AsIter are used, but forces materialization of
this PCollection as a list.
Source: https://beam.apache.org/documentation/sdks/pydoc/2.2.0/apache_beam.pvalue.html?highlight=aslist#apache_beam.pvalue.AsList
The code looks fine. However, since mappingTable is a mapping, wouldn't beam.pvalue.AsDict be more appropriate for your use case?
Your mappingTable is small enough so side input is a good use case here.
Given that mappingTable is also static, you can load it from GCS in start_bundle function of your DoFn. See the answer to this post for more details. If mappingTable becomes very large in future, you can also consider converting your map_logentries and mappingTable into PCollection of key-value pairs and join them using CoGroupByKey.
I create an rdflib graph by parsing records from a database using rdflib-jsonld. However, the subject of triples has a missing / from the url. To add it, I use the following code:
for s, p, o in graph1:
    print('parsing to graph2. next step - run query on graph2')
    pprint.pprint((s, p, o))
    s = str(s)
    s1 = s.replace('https:/w', 'https://w')
    s = URIRef(s1)
    graph2.add((s, p, o))
This step takes a very long time (couple of hours) to run because of the high number of triples in the graph. How can I reduce the time taken? Instead of looping through every element, how do I alter the subject in bulk?
First of all, to make proper time measurements, remove anything not related to the replacement itself, in particular both the ordinary print and the pretty print; you don't need them. If you need a progress indicator, write a short message (e.g. a single dot) to a logfile every N steps.
Avoid excessive memory consumption. I don't know what your graph looks like internally, but it would be better to make the replacement in place, without creating a parallel graph structure. Check memory usage during the process: if the program runs out of free RAM, you're in trouble, and all processes will slow to a crawl. If you can't modify the existing graph and you do run out of memory, then for measurement purposes simply skip creating the second graph, even though the replacement is then lost and thus useless.
If nothing helps, take one step back. You could perform the replacements at a stage where you haven't parsed the file(s) yet, either with Python's re module or with a text tool like sed that is dedicated to batch text processing.
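A small sketch of that last suggestion using Python's re module, assuming the serialized records are available as plain text before they are handed to the parser. Unlike the loop in the question, it fixes every occurrence of "https:/" that is not already followed by a second slash, not only those in subject position and not only those followed by "w". The function name fix_urls is made up.

```python
import re

# "https:/" followed by anything other than a second slash.
BROKEN_SCHEME = re.compile(r"https:/(?!/)")

def fix_urls(text):
    """Insert the missing slash into every malformed https:/ URL in the text."""
    return BROKEN_SCHEME.sub("https://", text)
```

The corrected text can then be parsed once into a single graph, which avoids both the per-triple Python loop and the second graph.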
I would like suggestions on using a NoSQL datastore for my particular requirements.
Let me explain:
I have to process five CSV files. Each contains 5 million rows, and a common id field is present in each of them. So I need to merge all the CSVs by iterating over the 5 million rows. I went with a Python dictionary to merge the files on the common id field, but the bottleneck is that I can't store 5 million keys in memory (< 1 GB) with a Python dictionary.
So I decided to use NoSQL. I think it might help with storing and processing 5 million key-value pairs, but I don't have a clear idea of how yet.
In any case, we can't reduce the number of iterations, since each of the five CSVs has to be iterated over to update the values.
Are there simple steps to follow for this?
If this is the way to go, could you suggest a NoSQL datastore for processing the key-value pairs?
Note: some of the values are lists.
If the CSVs are already sorted by id you can use the merge-join algorithm. It lets you iterate over the files one line at a time, so you don't have to keep everything in memory.
Extending the algorithm to multiple tables/CSV files will be a greater challenge, though. (But probably faster than learning something new like Hadoop)
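One way the multi-file case might be sketched with the standard library: heapq.merge lazily interleaves several already-sorted inputs, and itertools.groupby then collects the rows that share an id. This assumes the files have no header row and are sorted by id in the same order Python uses to compare the id strings (plain lexicographic order here); the function name merge_sorted_csvs is made up.

```python
import csv
import heapq
from itertools import groupby
from operator import itemgetter

def merge_sorted_csvs(files, id_column=0):
    """Yield (id, rows) pairs from open CSV files that are pre-sorted by id.

    Only one row per file is held in memory at a time, plus the rows of the
    id currently being grouped.
    """
    key = itemgetter(id_column)
    readers = [csv.reader(f) for f in files]
    merged = heapq.merge(*readers, key=key)  # lazy k-way merge of sorted inputs
    for record_id, rows in groupby(merged, key=key):
        # Drop the id column from each row and keep the remaining values.
        yield record_id, [row[:id_column] + row[id_column + 1:] for row in rows]
```

Each yielded pair can be written straight to the output file, so memory use stays small regardless of the number of rows.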
If this is just a one-time process, you might want to just setup an EC2 node with more than 1G of memory and run the python scripts there. 5 million items isn't that much, and a Python dictionary should be fairly capable of handling it. I don't think you need Hadoop in this case.
You could also try to optimize your scripts by reordering the items in several runs, then running over the 5 files in sync using iterators, so that you don't have to keep everything in memory at the same time.
As I understand it, you want to merge about 5 million items from each of 5 input files. If you do this on one machine, it might take a long time to process that much data. So I suggest checking the possibility of using Hadoop. Hadoop is a batch processing tool. Hadoop programs are usually written in Java, but you can write them in Python as well.
I recommend checking the feasibility of using Hadoop to process your data in a cluster. You may use HBase (a column datastore) to store your data. It's an idea; check whether it's applicable to your problem.
If this does not help, give some more details about the problem you are trying to solve. Technically you can use any language or datastore to solve this problem, but you need to find which one solves it best (in terms of time or resources), taking into account your willingness to use/learn a new tool/db.
Excellent tutorial to get started: http://developer.yahoo.com/hadoop/tutorial/