I have two files. File A contains 1 million records. File B contains approximately 2,000 strings, each on a separate line.
I have a Python script that takes each string in File B in turn and searches for matching records in File A. The logic is as follows:
for string in File B:
    for record in File A:
        if record contains string:  # I use regex for this
            write record to a separate file
This is currently running as a single thread of execution and takes a few hours to complete.
I’d like to implement concurrency to speed up this script. What is the best way to approach it? I have looked into multi-threading but my scenario doesn’t seem to represent the producer-consumer problem as my machine has an SSD and I/O is not an issue. Would multiprocessing help with this?
Running such a problem with multiple threads poses a couple of challenges:
We have to cover all of the records in file A for the algorithm to be complete.
We have to synchronize the writes to the output file, so the threads don't overwrite each other's records.
I'd suggest:
Assign a single thread just for writing output, so your external file won't get messed up.
Open as many workers as you can support (n), and give each of them a different 1,000,000/n slice of records to work on (a sketch follows below).
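Here is a minimal sketch of that layout, with placeholder file names file_a.txt, file_b.txt and matches.txt. Because the matching is CPU-bound under CPython, the sketch uses process workers rather than threads, but the structure is the one described above: split the records, and let only one place write.

import re
from concurrent.futures import ProcessPoolExecutor

def find_matches(records_chunk, patterns):
    # Each worker scans only its own slice of file A against every pattern.
    compiled = [re.compile(p) for p in patterns]
    return [rec for rec in records_chunk if any(rx.search(rec) for rx in compiled)]

if __name__ == "__main__":
    with open("file_b.txt") as fb:
        patterns = [line.strip() for line in fb if line.strip()]
    with open("file_a.txt") as fa:
        records = fa.readlines()

    n = 8  # however many workers your machine supports
    chunks = [records[i::n] for i in range(n)]

    # Only the parent process writes, so the output file cannot get interleaved.
    with ProcessPoolExecutor(max_workers=n) as pool, open("matches.txt", "w") as out:
        for matched in pool.map(find_matches, chunks, [patterns] * n):
            out.writelines(matched)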
The processing you want to do requires checking whether any of the 2_000 strings is in each of the 1_000_000 records, which amounts to 2_000_000_000 such "checks" in total. There's no way around that. Your current logic with the nested for loops simply iterates over all the possible combinations of items in the two files, one by one, and does the checking (and output-file writing).
You need to determine the way (if any) that this could be done concurrently. For example, you could have "N" tasks, each checking for one string in each of the million records. The outputs from all these tasks together represent the desired output and would likely need to be aggregated into a single file. Since the results will come back in a relatively random order, you may also want to sort them.
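As a rough sketch of the one-task-per-string variant (file names are placeholders): each worker scans all the records for its own string, and the parent aggregates the per-string results at the end.

import re
from multiprocessing import Pool

RECORDS = []  # filled in each worker process by the initializer

def init_worker(records):
    global RECORDS
    RECORDS = records

def matches_for(pattern):
    rx = re.compile(pattern)
    return [rec for rec in RECORDS if rx.search(rec)]

if __name__ == "__main__":
    with open("file_b.txt") as fb:
        patterns = [line.strip() for line in fb if line.strip()]
    with open("file_a.txt") as fa:
        records = fa.readlines()

    with Pool(initializer=init_worker, initargs=(records,)) as pool:
        results = pool.map(matches_for, patterns)

    # The per-string outputs overlap and come back in no particular order,
    # so deduplicate and sort before writing the single aggregated file.
    with open("matches.txt", "w") as out:
        out.writelines(sorted(set(rec for chunk in results for rec in chunk)))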
As I have written a few times before here, I am writing a programme to archive user-specified files at certain time intervals. The user also specifies when these files shall be deleted, so each file has a different archive and delete interval associated with it.
I have written pretty much everything, including extracting the timings for each file in the list and working out when the next archive/delete time would be (relevant to the current time).
I am struggling with putting it all together, i.e. with actually scheduling these two processes (archive and delete archive) for each file with its individual time intervals. I guess these two functions have to be running in the background, but only execute when the clock strikes the required time.
I have looked into scheduler, timeloop, threading.Timer, but I don't see how I can set a different time interval for each file in the list, and make it run for both archive and delete processes without interfering. I came across the concept of 'cron jobs' - can anyone let me know if this might be on the right track? I'm just looking for some ideas from more experienced programmers for what I might be missing/what I should look into to get me on the right track.
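For what it's worth, one way to picture the threading.Timer route mentioned above is a re-arming timer per file per action. Everything below (file names, intervals in seconds, and the archive/delete bodies) is a placeholder.

import threading

def archive(path):
    print("archiving", path)            # stand-in for the real archive step

def delete_archive(path):
    print("deleting archive of", path)  # stand-in for the real delete step

def schedule_repeating(interval, func, path):
    def run():
        func(path)
        schedule_repeating(interval, func, path)  # re-arm for the next tick
    timer = threading.Timer(interval, run)
    timer.daemon = True
    timer.start()

# path: (archive every N seconds, delete archive every M seconds)
files = {"report.txt": (60, 3600), "log.txt": (120, 7200)}
for path, (archive_every, delete_every) in files.items():
    schedule_repeating(archive_every, archive, path)
    schedule_repeating(delete_every, delete_archive, path)

threading.Event().wait()  # keep the main thread alive so the timers can fire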
I'm setting up an automatic job that needs to parse CSV files from an FTP site; each file contains several tens of thousands of lines. I want to pre-process the directory to eliminate duplicate files, with minimal processing, before parsing the remaining files. The problem is that duplicate files are being pushed to the FTP with different row ordering (i.e. same data, different order), so the "duplicate files" end up with different hashes and fail byte-by-byte comparisons.
I want to keep file manipulation to the minimum so I tried sorting the CSVs using the csvsort module but this is giving me an index error: IndexError: list index out of range. Here is the relevant code:
from csvsort import csvsort
csvsort(input_filename=file_path,columns=[1,2])
I tried finding and eliminating empty rows but this didn't seem to be the problem and, like I said, I want to keep file manipulation to a minimum to retain file integrity. Moreover, I have no control over the creation of the files or over pushing them to the FTP.
I can think of a number of ways to work around this issue, but they would all involve opening the CSVs, reading the contents, manipulating them, etc. Is there any way I can do a lightweight file comparison that ignores row ordering, or will I have to go for heavier processing?
You do not specify how much data you have, and my take on this would differ depending on size. Are we talking hundreds of lines, or multi-million-line files?
If you have "few" lines you can easily sort them. But if the data gets longer, you can use other strategies.
I have previously solved the problem of "weeding out lines from file A that appear in file B" using AWK, since AWK is able to do this with just one pass through the long file (A), making the process extremely fast. However, you may need to call an external program; not sure if this is ideal for you.
If your lines are not completely identical - let's say you need to compare just one of several fields - AWK can do that as well. Just extract fields into variables.
If you choose to go this way, the script is something along these lines:
# First file (SMALL_LIST): remember every line and move on
FNR==NR {
    a[$0]++; cnt[1] += 1; next
}
# Second file (FULL_LIST): print only lines not seen in the first file
!a[$0]
Use with
c:\path\to\awk.exe -f awkscript.awk SMALL_LIST FULL_LIST > DIFF_LIST
DIFF_LIST is items from FULL that are NOT in SMALL.
So, it turns out that pandas has a built-in hash function with an option to ignore the index. Since the hash is calculated per row, you need to run an additional sum() to reduce it to a single, order-independent value per file. In terms of code, it is about as lightweight as I could wish; in terms of runtime, it parses ~15 files in ~5 seconds (~30k rows, 17 columns in each file).
from pandas import read_csv
from pandas.util import hash_pandas_object
from collections import defaultdict
duplicate_check = defaultdict(list)
# "files" is the list of CSV paths to check; files whose row-wise hashes
# sum to the same value are grouped together.
for f in files:
    duplicate_check[hash_pandas_object(read_csv(f), index=False).sum()].append(f)
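From there, any hash key that collected more than one path is a group of files whose contents match up to row order (barring the unlikely hash collision), for example:

# Groups of files that can be treated as duplicates of each other.
duplicates = [paths for paths in duplicate_check.values() if len(paths) > 1]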
My task is:
I have a program, written in Python which takes a set of variables (A,B,C) as input, and outputs two numbers (X, Y).
I want to run this program over a large range of inputs (A,B,C)
I'll refer to running the program with a given set of variables as 'running an experiment'
I am using a cluster to do this (I submit my jobs in SLURM)
I then want to save the results of the experiments into a single file, e.g. a data frame with columns [A|B|C|X|Y], where each row is the output of a different experiment.
My current situation is:
I have written my program in the following form:
import io
from optparse import OptionParser
parser = OptionParser()
parser.add_option("-f", "--file",action="store", type="string", dest="filename")
parser.add_option("-a", "--input_a",type="int", dest="a")
parser.add_option("-b", "--input_b",type="int", dest="b")
parser.add_option("-c", "--input_c",type="int", dest="c")
(options, args) = parser.parse_args()
def F(a, b, c):
    return (a + b, b + c)

Alice = options.input_a
Bob = options.input_b
Carol = options.input_c

with io.open("results_a_{0}_b_{1}_c_{2}.txt".format(Alice, Bob, Carol), "a") as f:
    (x, y) = F(Alice, Bob, Carol)
    f.write(u"first output = {0}, second output = {1}".format(x, y))
This allows me to run the program once, and save the results in a single file.
In principle, I could then submit this job for a range of (A,B,C), obtain a large number of text files with the results, and then try to aggregate them into a single file.
However, I assume this isn't the best way of going about things.
What I'd like to know is:
Is there a more natural way for me to run these experiments and save the results all in one file, and if so, what is it?
How should I submit my collection of jobs on SLURM to do so?
I am currently submitting (effectively) the script below, which is not really working:
...
for a in 1 2 3; do
    for b in 10 20 30; do
        for c in 100 200 300; do
            python toy_experiment.py -a $a -b $b -c $c &
        done
    done
done
wait
[I'm wary that there are possibly many other places I'm going wrong here - I'm open to using something other than optparse to pass arguments to my program, to saving the results differently, etc. - my main goal is having a working solution.]
TL;DR
You should research how to use file locking in your system and use that to access a single output file safely.
You should also submit your script using a job array, letting each job in the array run a single experiment.
Many things going on here.
One important question: how long does F() usually take to compute? I assume that you just wrote an example, but this is needed to decide the best approach to the problem.
If the time span for every calculation is short, maybe you can run a few batches, each aggregating several computations in a single script: in the example, 3 batches (for a==1, a==2 and a==3), each of them computing all the possible experiments for b and c and generating 3 files that have to be aggregated at the end.
If the time span is long, then the overhead of creating some thousands of small files is not a big deal, and concatenating them afterwards will be easy.
Another thing: by putting all your jobs running simultaneously in the background you are probably overloading a single machine. I don't know how you ask SLURM for resources, but you will certainly be using only one node, and overusing it. If there are other users there, they will probably be annoyed. You must keep the number of simultaneous jobs running on a node down to the number of processors granted on that node. Probably, starting your calculations with srun will help.
Another way would be to create a single job per calculation. You can encapsulate them in a job array. In that case, you will only run one experiment per job and you won't have to start and wait for anything in the background.
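As a small illustration of the job-array route, the script itself can derive its own (a, b, c) from the array index SLURM exports; the parameter grids below are placeholders, and the array would be submitted with something like --array=0-26.

import itertools
import os

A_VALUES = [1, 2, 3]
B_VALUES = [10, 20, 30]
C_VALUES = [100, 200, 300]

# SLURM exports the index of the current array task; use it to pick one combination.
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
a, b, c = list(itertools.product(A_VALUES, B_VALUES, C_VALUES))[task_id]
print("running experiment with", a, b, c)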
Finally, getting to your main question: what is the best way to save all this information to disk efficiently?
Creating thousands of files is easy and possible, but not kind to the file system. Maybe you have access to some RAM disk on a common node; writing a small file to the compute node's local storage and then copying it to the common in-memory disk would be quite a lot more efficient. When all the experiments have been done you can aggregate the results easily. The drawback is that the space is usually very limited (I don't know the real size of your data) and it is an in-memory disk, so it can be lost in case of power failure.
Using a single file would be a better approach, but as Dmitri Chubarov pointed, you have to make use of the file locking mechanisms. Otherwise you risk getting mixed results.
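A minimal sketch of that locking, assuming a POSIX filesystem where fcntl advisory locks behave (some shared cluster filesystems need a different mechanism) and a hypothetical results.csv:

import fcntl

def append_result(a, b, c, x, y, path="results.csv"):
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # block until we hold the exclusive lock
        try:
            f.write("{0},{1},{2},{3},{4}\n".format(a, b, c, x, y))
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)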
Finally, the approach that I feel is best suited to your problem is to use some kind of database-like solution. If you have access to a relational DB that supports transactions, just create a table with the needed fields and let your code connect and insert the results. Extracting them at the end will be a breeze. The DB can be a client/server one (MySQL, PostgreSQL, Oracle...) or an embedded one (HSQLDB). Another option would be to use a file format like NetCDF, which is intended precisely for this kind of scientific data and has some support for parallel access.
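As one concrete illustration of the database route, here is a sketch using Python's built-in sqlite3 purely as a stand-in (the same idea applies to MySQL/PostgreSQL; SQLite itself can misbehave on some networked filesystems):

import sqlite3

conn = sqlite3.connect("results.db", timeout=30)  # timeout lets writers wait for the lock
conn.execute("CREATE TABLE IF NOT EXISTS results (a INT, b INT, c INT, x INT, y INT)")

def save_result(a, b, c, x, y):
    with conn:  # the context manager wraps the insert in a transaction
        conn.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?)", (a, b, c, x, y))

save_result(1, 10, 100, 11, 110)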
I have a python file that is taking websocket data and constantly updating a giant list. It updates somewhere between 2 to 10 times a second. This file runs constantly.
I want to be able to call that list from a different file so this file can process that data and do something else with it.
Basically file 1 is a worker that keeps the current state in a list, I need to be able to get this state from file 2.
I have 2 questions:
Is there any way of doing this easily? I guess the most obvious answers are storing the list in a file or a DB, which leads me to my second question:
Given that the list is updating somewhere between 2 and 10 times a second, which would be better, a file or a DB? Can these I/O mechanisms handle these update speeds?
A DB is the best bet for your use case:
It gives you the flexibility to know which part of the data you have already processed, by keeping a status flag.
Data is persisted (and you can also set up replication).
You can scale easily if your application pulls more and more data in the future.
Writing 2-10 times a second is a write-heavy workload that a DB handles well, as you will gather tons of data in a short duration.
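The answer above doesn't prescribe a particular datastore, but as one example of how little code the handoff needs, here is a sketch using Redis via the redis-py package (it assumes a Redis server on localhost and a made-up key name market_state):

import json
import redis

r = redis.Redis()  # assumes a Redis server on localhost:6379

# File 1 (the websocket worker) overwrites the shared snapshot on every update:
def publish_state(current_list):
    r.set("market_state", json.dumps(current_list))

# File 2 reads the latest snapshot whenever it needs it:
def read_state():
    raw = r.get("market_state")
    return json.loads(raw) if raw else []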
I would like to get suggestions on using a NoSQL datastore for my particular requirements.
Let me explain:
I have to process five CSV files. Each CSV contains 5 million rows, and a common id field is present in each one. I need to merge all the CSVs by iterating over those 5 million rows, so I went with a Python dictionary to merge the files on the common id field. The bottleneck is that you can't hold 5 million keys in memory (< 1 GiB) with a Python dictionary.
So I decided to use NoSQL. I think it might help with handling the 5 million key-value pairs, but I still don't have clear thoughts on this.
In any case we can't avoid the iteration, since each of the five CSVs has to be iterated to update the values.
Are there simple steps to go about this?
If this is the way to go, could you suggest a NoSQL datastore for processing the key-value pairs?
Note: some of the values are lists as well.
If the CSVs are already sorted by id you can use the merge-join algorithm. It lets you iterate over the files line by line, so you don't have to keep everything in memory.
Extending the algorithm to multiple tables/CSV files will be a greater challenge, though (but probably faster than learning something new like Hadoop).
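A minimal sketch of a two-file merge join, assuming both CSVs are sorted by an integer id in their first column and ids are unique within each file (file names and the behaviour for unmatched ids are placeholders); extending it to five files means advancing five readers in the same way.

import csv

def merge_join(path_a, path_b, out_path):
    with open(path_a) as fa, open(path_b) as fb, open(out_path, "w", newline="") as fo:
        reader_a, reader_b = csv.reader(fa), csv.reader(fb)
        writer = csv.writer(fo)
        row_a, row_b = next(reader_a, None), next(reader_b, None)
        while row_a and row_b:
            key_a, key_b = int(row_a[0]), int(row_b[0])
            if key_a == key_b:
                writer.writerow(row_a + row_b[1:])   # join the two rows on the common id
                row_a, row_b = next(reader_a, None), next(reader_b, None)
            elif key_a < key_b:
                row_a = next(reader_a, None)  # id only in file A: skip it here
            else:
                row_b = next(reader_b, None)  # id only in file B: skip it here

merge_join("a_sorted.csv", "b_sorted.csv", "merged.csv")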
If this is just a one-time process, you might want to just set up an EC2 node with more than 1 GB of memory and run the Python scripts there. 5 million items isn't that much, and a Python dictionary should be fairly capable of handling it. I don't think you need Hadoop in this case.
You could also try to optimize your scripts by reordering the items in several runs, then running over the 5 files in sync using iterators so that you don't have to keep everything in memory at the same time.
As I understand it, you want to merge about 5 million items from 5 input files. If you do this on one machine it might take a long time to process 1 GB of data, so I suggest checking the possibility of using Hadoop. Hadoop is a batch-processing tool. Hadoop programs are usually written in Java, but you can write them in Python as well.
I recommend checking the feasibility of using Hadoop to process your data in a cluster. You may use HBase (a column datastore) to store your data. It's an idea; check whether it's applicable to your problem.
If this does not help, give some more details about the problem you are trying to solve. Technically you can use any language or datastore to solve this problem, but you need to find which one solves it best (in terms of time or resources), weighed against your willingness to use/learn a new tool/DB.
Excellent tutorial to get started: http://developer.yahoo.com/hadoop/tutorial/