Bulk edit subject of triples in rdflib - python

I create an rdflib graph by parsing records from a database using rdflib-jsonld. However, the subject of each triple is missing a / in its URL. To add it, I use the following code:
import pprint
from rdflib import URIRef

for s, p, o in graph1:
    print 'parsing to graph2. next step - run query on graph2'
    pprint.pprint((s, p, o))
    s = str(s)
    s1 = s.replace('https:/w', 'https://w')
    s = URIRef(s1)
    graph2.add((s, p, o))
This step takes a very long time (couple of hours) to run because of the high number of triples in the graph. How can I reduce the time taken? Instead of looping through every element, how do I alter the subject in bulk?

First of all, to get proper time measurements, remove anything not related to the replacement itself; in particular, drop both the ordinary and the pretty print, you don't need them. If you need some progress indicator, write a short message (e.g. a single dot) into a logfile every N steps.
Avoid memory overconsumption. I don't know what your graph looks like internally, but it would be better to make the replacement in place, without creating a parallel graph structure. Check memory usage during the process: if the program runs out of free RAM, you're in trouble, because all processes will slow to a crawl. If you can't modify the existing graph and you do run out of memory, then for measurement purposes simply skip creating the second graph, even though such a replacement is lost and thus useless.
If nothing helps, take a step back. You could perform the replacements at a stage before you have parsed the file(s), either with Python's re module or with a text tool like sed that is dedicated to batch text processing.
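A minimal sketch of the in-place idea (assuming graph1 and the broken 'https:/w' prefix from the question): collect the triples that need fixing first, then swap each one inside the same graph, so no second graph is ever built.

from rdflib import URIRef

to_fix = [(s, p, o) for s, p, o in graph1
          if str(s).startswith('https:/w')]   # correct 'https://w...' subjects do not match this prefix

for s, p, o in to_fix:
    graph1.remove((s, p, o))
    graph1.add((URIRef(str(s).replace('https:/w', 'https://w', 1)), p, o))

If the JSON-LD is available as a text file before parsing, a single pass such as sed 's|https:/w|https://w|g' input.jsonld > fixed.jsonld achieves the same result without touching the graph at all; 'https://w' never contains the substring 'https:/w', so already-correct URLs are left alone.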

Related

Building and accessing a database at runtime

My code acquires data from a sensor and performs some operations based on the last N minutes of data.
At the moment I just initialize a list at the beginning of my code as
x = []
while running:
    new_value = acquire_new_point()
    x.append(new_value)
    do_something_with_x(x)
Despite working, this has some intrinsic flaws that make it sub-optimal. The most important are:
If the code crashes or restarts, the whole time-history is reset
There is no record or log of the past time-history
The memory consumption could grow out of control and exceed the available memory
Some obvious solutions exist as:
log each new element to a csv file and read it when the code starts
divide the data in N-minutes chunks and drop from the memory chunks that are more than N-minutes old
I have the feeling, however, that this could be a problem for which a more specific solution has already been created. My first thought went to HDF5, but I'm not sure it's the best candidate for this problem.
So, what is the best strategy/format for a database that needs to be written (appended to) at runtime and needs to be accessed partially (only the last N minutes)? Is there any tool that is specific to a task like this one?
I'd simply use SQLite with a two-column table (timestamp, value) and an index on the timestamp.
This has the additional bonus that you can use SQL for data analysis, which will likely be faster than doing it by hand in Python.
If you don't need the historical data, you can periodically DELETE FROM data WHERE TIMESTAMP < ....
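A minimal sketch of that setup with the standard-library sqlite3 module (the table and column names are just placeholders):

import sqlite3
import time

conn = sqlite3.connect("sensor.db")
conn.execute("CREATE TABLE IF NOT EXISTS data (timestamp REAL, value REAL)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_ts ON data (timestamp)")

def append(value):
    # Each new point is persisted immediately, so a crash or restart loses nothing.
    conn.execute("INSERT INTO data VALUES (?, ?)", (time.time(), value))
    conn.commit()

def last_n_minutes(n):
    cutoff = time.time() - 60 * n
    return conn.execute(
        "SELECT timestamp, value FROM data WHERE timestamp >= ? ORDER BY timestamp",
        (cutoff,)).fetchall()

The periodic cleanup is then just conn.execute("DELETE FROM data WHERE timestamp < ?", (cutoff,)) followed by a commit.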

How to create a dependency tree for python functions

I'm writing some code using Python/numpy that I will be using for data analysis of some experimental data sets. Certain steps of these analysis routines can take a while. It is not practical to rerun every step of the analysis every time (i.e. when debugging), so it makes sense to save the output from these steps to a file and just reuse them if they're already available.
The data I ultimately want to obtain can be derived from various steps along this analysis process. E.g., A can be used to calculate B and C, D can be calculated from B, E can then be calculated using C and D, and so on.
The catch here is that it's not uncommon to make it through a few (or many) datasets only to find that there's some tiny little gotcha in the code somewhere that requires some part of the tree to be recalculated. E.g., I discover a bug in B, so now anything that depends on B also needs to be recalculated because it was derived from incorrect data.
The end goal here is to basically protect myself from having sets of data that I forget to reprocess when bugs are found. In other words, I want to be confident that all of my data is calculated using the newest code.
Is there a way to implement this in Python? I have no specific form this solution needs to take so long as it's extensible as I add new steps. I also am okay with the "recalculation step" only being performed when the dependent quantity is recalculated (Rather than at the time one of the parents are changed).
My first thought of how this might be done is to embed information in a header of each saved file (A, B, C, etc) indicating what version of each module it was created with. Then, when loading the saved data the code can check if the version in the file matches the current version of the parent module. (Some sort of parent.getData() which checks if the data has been calculated for that dataset and if it's up to date)
The thing is, at least at first glance, I could see that this could have problems when the change happens several steps up in the dependency chain because the derived file may still be up to date with its module even though its parents are out of date. I suppose I could add some sort of parent.checkIfUpToDate() that checks its own files and then asks each of its parents if they're up to date (which then ask their parents, etc), and updates it if not. The version number can just be a static string stored in each module.
My concern with that approach is that it might mean reading potentially large files from disk just to get a version number. If I went with the "file header" approach, does Python actually load the whole file into memory when I do open(myFile), or can I do that, just read the header lines, and close the file without loading the whole thing into memory?
Last - is there a good way to embed this type of information beyond just having the first line of the file be some variation of # MyFile made with MyModule V x.y.z and writing some piece of code to parse that line?
I'm kind of curious if this approach makes sense, or if I'm reinventing the wheel and there's already something out there to do this.
edit: And something else that occurred to me after I submitted this - does Python have any mechanism to define templates that modules must follow? Just as a means to make sure I keep the format for the data reading steps consistent from module to module.
I cannot answer all of your questions, but you can read only a small part of the data from a large file, as you can see here:
How to read specific part of large file in Python
I do not see why you would need a parent.checkIfUpToDate() function. You could just store the version numbers of the parent functions in the file itself as well.
To me your approach sounds reasonable, though I have never done anything similar. Alternatively, you could create an additional file that holds this information, but I think storing the information in the actual data file should prevent version mismatches between your "data file" and a separate "function version file".
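For the "read just the header" part of the question: opening a file does not load it into memory, and readline() pulls only the first line from disk. A minimal sketch (the header format and MODULE_VERSION constant are illustrative, not any standard):

MODULE_VERSION = "1.2.0"
HEADER = "# MyFile made with MyModule v{0}\n".format(MODULE_VERSION)

def save_with_header(path, lines):
    # Write the version header first, then the data lines.
    with open(path, "w") as f:
        f.write(HEADER)
        f.writelines(lines)

def is_up_to_date(path):
    # readline() reads only the first line, so the check stays cheap for huge files.
    with open(path) as f:
        return f.readline() == HEADER

A module would bump MODULE_VERSION whenever its calculation changes, and anything loading the file falls back to recalculating when the check fails.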

Running an experiment with different parameters, saving outputs efficiently

My task is:
I have a program, written in Python which takes a set of variables (A,B,C) as input, and outputs two numbers (X, Y).
I want to run this program over a large range of inputs (A,B,C)
I'll refer to running the program with a given set of variables as 'running an experiment'
I am using a cluster to do this (I submit my jobs in SLURM)
I then want to save the results of the experiments into a single file, e.g. a data frame with columns [A|B|C|X|Y], where each row is the output of a different experiment.
My current situation is:
I have written my program in the following form:
import io
from optparse import OptionParser

parser = OptionParser()
parser.add_option("-f", "--file", action="store", type="string", dest="filename")
parser.add_option("-a", "--input_a", type="int", dest="a")
parser.add_option("-b", "--input_b", type="int", dest="b")
parser.add_option("-c", "--input_c", type="int", dest="c")
(options, args) = parser.parse_args()

def F(a, b, c):
    return (a + b, b + c)

Alice = options.a   # dest="a" above, so the value is exposed as options.a
Bob = options.b
Carol = options.c

with io.open("results_a_{0}_b_{1}_c_{2}.txt".format(Alice, Bob, Carol), "a") as f:
    (x, y) = F(Alice, Bob, Carol)
    f.write(u"first output = {0}, second output = {1}".format(x, y))
This allows me to run the program once, and save the results in a single file.
In principle, I could then submit this job for a range of (A,B,C), obtain a large number of text files with the results, and then try to aggregate them into a single file.
However, I assume this isn't the best way of going about things.
What I'd like to know is:
Is there a more natural way for me to run these experiments and save the results all in one file, and if so, what is it?
How should I submit my collection of jobs on SLURM to do so?
I currently am submitting (effectively) the script below, which is not really working:
...
for a in 1 2 3; do
    for b in 10 20 30; do
        for c in 100 200 300; do
            python toy_experiment.py -a $a -b $b -c $c &
        done
    done
done
wait
[I'm wary that there are possibly many other places I'm going wrong here - I'm open to using something other than optparse to pass arguments to my program, to saving the results differently, etc. - my main goal is having a working solution.]
TL;DR
You should research how to use file locking in your system and use that to access a single output file safely.
You should also submit your script using a job array, letting each job in the array run a single experiment.
Many things going on here.
One important question: how long does F() usually take to compute? I assume that you just wrote an example, but this is needed to decide the best approach to the problem.
If the time span for every calculation is short, maybe you can run a few batches, each aggregating several computations in a single script: in the example, 3 batches (for a==1, a==2 and a==3), each of them computing all the possible experiments for b and c and generating 3 files that have to be aggregated at the end.
If the time span is long, then the overhead of creating some thousands of small files is not a big deal, and concatenating them afterwards will be easy.
Another thing: by putting all your jobs running simultaneously in the background you are probably overloading a single machine. I don't know how you ask SLURM for resources, but you will certainly be using only one node, and overusing it. If there are other users there, they will probably be annoyed. You must keep the number of simultaneous jobs running on a node equal to the number of processors granted on that node. Starting your calculations with srun will probably help.
Another way would be to create a single job per calculation. You can encapsulate them in a job array. In that case, you will only run one experiment per job and you won't have to start and wait for anything in the background.
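On the Python side, mapping an array task to one experiment could look like the sketch below (the parameter grid matches the question's loops; SLURM_ARRAY_TASK_ID is the index SLURM gives each element of an array submitted with something like sbatch --array=0-26 run_one.sh):

import os
from itertools import product

# All 27 (a, b, c) combinations from the question's nested loops.
grid = list(product((1, 2, 3), (10, 20, 30), (100, 200, 300)))

# SLURM exports SLURM_ARRAY_TASK_ID to every element of a job array.
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])
a, b, c = grid[task_id]

x, y = F(a, b, c)   # F as defined in the question's script

Each array element then writes its single result, either to its own file or to the shared file/database discussed below.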
Finally, getting to your main question: what is the best way to save all this information to disk efficiently?
Creating thousands of files is easy and possible, but not the best thing for the file system. Maybe you have access to some RAM disk on a common node. Writing a small file in the compute node's local storage and copying it to the common in-memory disk would be quite a lot more efficient, and when all the experiments have been done you can aggregate the results easily. The drawback is that the space is usually very limited (I don't really know the real size of your data) and it is an in-memory disk: it can be lost in case of a power failure.
Using a single file would be a better approach, but as Dmitri Chubarov pointed out, you have to make use of the file locking mechanisms. Otherwise you risk getting mixed results.
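A hedged sketch of that single-file approach using POSIX advisory locking from the standard library (fcntl is Unix-only, and whether the lock is honoured across nodes depends on the cluster's shared filesystem, so treat this as an illustration rather than a guarantee):

import fcntl

def append_result(a, b, c, x, y, path="results.csv"):
    with open(path, "a") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # block until we hold the exclusive lock
        try:
            f.write("{0},{1},{2},{3},{4}\n".format(a, b, c, x, y))
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)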
Finally, the approach that I feel is best suited to your problem is to use some kind of database-like solution. If you have access to a relational DB that supports transactions, just create a table with the needed fields and let your code connect and insert the results. Extracting them at the end will be a breeze. The DB can be a client/server one (MySQL, PostgreSQL, Oracle...) or an embedded one (HSQLDB). Another option would be to use some file format like NetCDF, which is precisely intended for this kind of scientific data and has some support for parallel access.
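As a minimal stand-in for the database route, the standard-library sqlite3 module gives you a transactional table with almost no setup (for many simultaneous writers on a shared filesystem a client/server DB is still the safer choice; the table layout here is just an example):

import sqlite3

def record_result(a, b, c, x, y, path="results.db"):
    conn = sqlite3.connect(path, timeout=60)   # wait for the write lock if another job holds it
    with conn:                                 # the with-block wraps the insert in one transaction
        conn.execute(
            "CREATE TABLE IF NOT EXISTS results (a INT, b INT, c INT, x INT, y INT)")
        conn.execute("INSERT INTO results VALUES (?, ?, ?, ?, ?)", (a, b, c, x, y))
    conn.close()

Collecting everything afterwards is then a single SELECT a, b, c, x, y FROM results.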

Concurrency in processing multiple files

I have two files. File A contains 1 million records. File B contains approximately 2,000 strings, each on a separate line.
I have a Python script that takes each string in File B in turn and searches for a match in File A. The logic is as follows:
for string in File B:
    for record in File A:
        if record contains string:  # I use regex for this
            write record to a separate file
This is currently running as a single thread of execution and takes a few hours to complete.
I’d like to implement concurrency to speed up this script. What is the best way to approach it? I have looked into multi-threading but my scenario doesn’t seem to represent the producer-consumer problem as my machine has an SSD and I/O is not an issue. Would multiprocessing help with this?
Running such a problem with multiple threads poses a couple of challenges:
We have to run over all of the records in file A in order to get the algorithm done.
We have to synchronize the writing to the separate file, so we won't overwrite the printed records.
I'd suggest:
Assign a single thread just for printing - so your external file won't get messed up.
Open as many threads as you can support (n), and give each of them a different slice of 1,000,000/n records to work on.
The processing you want to do requires checking whether any of the 2,000 strings is in each of the 1,000,000 records, which amounts to 2,000,000,000 such "checks" in total. There's no way around that. Your current logic with the nested for loops simply iterates over all the possible combinations of things in the two files, one by one, and does the checking (and output file writing).
You need to determine the way (if any) that this could be accomplished concurrently. For example, you could have "N" tasks, each checking for one string in each of the million records. The outputs from all these tasks represent the desired output and would likely need to be aggregated together into a single file. Since the results will be in a relatively random order, you may also want to sort them.
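A sketch of that idea with multiprocessing, flipped so that each worker process handles a chunk of records rather than a single string (the file names are placeholders; the 2,000 strings are combined into one alternation and escaped as literals, so drop re.escape if they really are regular expressions):

import re
from multiprocessing import Pool

PATTERN = None   # compiled once per worker by the initializer

def init_worker(strings_path):
    global PATTERN
    with open(strings_path) as f:
        # One big alternation of all the File B strings.
        PATTERN = re.compile("|".join(re.escape(line.strip()) for line in f if line.strip()))

def check(record):
    return record if PATTERN.search(record) else None

if __name__ == "__main__":
    with open("file_A.txt") as fa:
        records = fa.readlines()   # the ~1 million records
    with open("matches.txt", "w") as out:
        with Pool(initializer=init_worker, initargs=("file_B.txt",)) as pool:
            for hit in pool.imap_unordered(check, records, chunksize=10000):
                if hit is not None:
                    out.write(hit)

Only the parent process writes to matches.txt, which keeps the output file consistent; because imap_unordered returns results in completion order, sort the file afterwards if the original ordering matters.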

What is the best way to loop through and process a large (10GB+) text file?

I need to loop through a very large text file, several gigabytes in size (a zone file to be exact). I need to run a few queries for each entry in the zone file, and then store the results in a searchable database.
My weapons of choice at the moment, mainly because I know them, are Python and MySQL. I'm not sure how well either will deal with files of this size, however.
Does anyone with experience in this area have any suggestions on the best way to open and loop through the file without overloading my system? How about the most efficient way to process the file once I can open it (threading?) and store the processed data?
You shouldn't have any real trouble storing that amount of data in MySQL, although you will probably not be able to store the entire database in memory, so expect some IO performance issues. As always, make sure you have the appropriate indices before running your queries.
The most important thing is to not try to load the entire file into memory. Loop through the file, don't try to use a method like readlines which will load the whole file at once.
Make sure to batch the requests. Load up a few thousand lines at a time and send them all in one big SQL request.
This approach should work:
def push_batch(batch):
    # Send a big INSERT request to MySQL
    pass

current_batch = []

with open('filename') as f:
    for line in f:
        current_batch.append(line)
        if len(current_batch) > 1000:
            push_batch(current_batch)
            current_batch = []

push_batch(current_batch)
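One way push_batch could be filled in, sketched with mysql-connector-python (the connection settings and table layout are placeholders, and in practice you would open the connection once rather than per batch); executemany lets the driver send the whole batch as a single multi-row INSERT:

import mysql.connector   # any DB-API driver would look much the same

def push_batch(batch):
    conn = mysql.connector.connect(user="user", password="secret", database="zonedata")
    cur = conn.cursor()
    cur.executemany("INSERT INTO records (line) VALUES (%s)",
                    [(line.rstrip("\n"),) for line in batch])
    conn.commit()
    cur.close()
    conn.close()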
Zone files have a fairly regular format; consider whether you can get away with just using LOAD DATA INFILE. You might also consider creating a named pipe, pushing partially formatted data into it from Python, and using LOAD DATA INFILE to read it into MySQL.
MySQL has some great tips on optimizing inserts, some highlights:
Use multiple value lists in each insert statement.
Use INSERT DELAYED, particularly if you are pushing from multiple clients at once (e.g. using threading).
Lock your tables before inserting.
Tweak the key_buffer_size and bulk_insert_buffer_size.
The fastest processing will be done in MySQL, so consider whether you can get away with doing the queries you need after the data is in the db, not before. If you do need to do operations in Python, threading is not going to help you. Only one thread of Python code can execute at a time (the GIL), so unless you're doing something which spends a considerable amount of time in C or interfaces with external resources, you're effectively only ever going to be running one thread anyway.
The most important optimization question is what is bounding the speed: there's no point spinning up a bunch of threads to read the file if the database is the bounding factor. The only way to really know is to try it and make tweaks until it is fast enough for your purpose.
Zack Bloom's answer is excellent and I upvoted it. Just a couple of thoughts:
As he showed, just using with open(filename) as f: / for line in f is all you need to do. open() returns a file object you can iterate over, yielding one line at a time from the file.
If you want to slurp every line into your database, do it in the loop. If you only want certain lines that match a certain regular expression, that's easy.
import re

pat = re.compile(some_pattern)

with open(filename) as f:
    for line in f:
        if not pat.search(line):
            continue
        # do the work to insert the line here
With a file that is multiple gigabytes, you are likely to be I/O bound. So there is likely no reason to worry about multithreading or whatever. Even running several regular expressions is likely to crunch through the data faster than the file can be read or the database updated.
Personally, I'm not much of a database guy and I like using an ORM. The last project I did database work on, I was using Autumn with SQLite. I found that the default for the ORM was to do one commit per insert, and it took forever to insert a bunch of records, so I extended Autumn to let you explicitly bracket a bunch of inserts with a single commit; it was much faster that way. (Hmm, I should extend Autumn to work with a Python with statement, so that you could wrap a bunch of inserts into a with block and Autumn would automatically commit.)
http://autumn-orm.org/
Anyway, my point was just that with database stuff, doing things the wrong way can be very slow. If you are finding that the database inserting is your bottleneck, there might be something you can do to fix it, and Zack Bloom's answer contains several ideas to start you out.
