My code acquires data from a sensor and performs some operations based on the last N minutes of data.
At the moment I just initialize a list at the beginning of my code as
x = []
while running:
    new_value = acquire_new_point()
    x.append(new_value)
    do_something_with_x(x)
Despite working, this has some intrinsic flaws that make it sub-optimal. The most important are:
If the code crashes or restarts, the whole time-history is reset
There is no record or log of the past time-history
The memory consumption could grow out of control and exceed the available memory
Some obvious solutions exist, such as:
log each new element to a csv file and read it when the code starts
divide the data into N-minute chunks and drop from memory the chunks that are more than N minutes old
I have the feeling, however, that this is a problem for which a more specific solution has already been created. My first thought went to HDF5, but I'm not sure it's the best candidate for this problem.
So, what is the best strategy/format for a database that needs to be written (appended to) at run-time and accessed partially (only the last N minutes)? Is there any tool specific to a task like this one?
I'd simply use SQLite with a two-column table (timestamp, value) with an index on the timestamp.
This has the additional bonus that you can use SQL for data analysis, which will likely be faster than doing it by hand in Python.
If you don't need the historical data, you can periodically DELETE FROM data WHERE TIMESTAMP < ....
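A minimal sketch of that approach with Python's built-in sqlite3 module, reusing running, acquire_new_point() and do_something_with_x() from the question's pseudocode; the file, table and column names here are my own:

import sqlite3
import time

N_MINUTES = 10  # keep the last N minutes of data

conn = sqlite3.connect('sensor.db')
conn.execute('CREATE TABLE IF NOT EXISTS data (timestamp REAL, value REAL)')
conn.execute('CREATE INDEX IF NOT EXISTS idx_timestamp ON data (timestamp)')

while running:
    new_value = acquire_new_point()
    now = time.time()
    conn.execute('INSERT INTO data (timestamp, value) VALUES (?, ?)', (now, new_value))
    conn.commit()

    # Read back only the last N minutes for processing
    cutoff = now - N_MINUTES * 60
    x = [v for (v,) in conn.execute(
        'SELECT value FROM data WHERE timestamp >= ? ORDER BY timestamp', (cutoff,))]
    do_something_with_x(x)

    # Optional: drop history you no longer need
    conn.execute('DELETE FROM data WHERE timestamp < ?', (cutoff,))

Because the data lives on disk, a crash or restart no longer wipes the history, and memory stays bounded by the N-minute window you read back.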
I'm writing some code using Python/NumPy that I will be using for data analysis of some experimental data sets. Certain steps of these analysis routines can take a while. It is not practical to rerun every step of the analysis every time (e.g. when debugging), so it makes sense to save the output from these steps to a file and just reuse them if they're already available.
The data I ultimately want to obtain can be derived from various steps along this analysis process. For example, A can be used to calculate B and C, D can be calculated from B, E can then be calculated using C and D, and so on.
The catch here is that it's not uncommon to make it through a few (or many) datasets only to find that there's some tiny little gotcha in the code somewhere that requires part of the tree to be recalculated. For example, I discover a bug in B, so now anything that depends on B also needs to be recalculated because it was derived from incorrect data.
The end goal here is to basically protect myself from having sets of data that I forget to reprocess when bugs are found. In other words, I want to be confident that all of my data is calculated using the newest code.
Is there a way to implement this in Python? I have no specific form this solution needs to take, so long as it's extensible as I add new steps. I'm also okay with the "recalculation step" only being performed when the dependent quantity is recalculated (rather than at the time one of the parents is changed).
My first thought of how this might be done is to embed information in a header of each saved file (A, B, C, etc.) indicating which version of each module it was created with. Then, when loading the saved data, the code can check whether the version in the file matches the current version of the parent module (some sort of parent.getData() which checks if the data has been calculated for that dataset and if it's up to date).
The thing is, at least at first glance, I can see that this could have problems when the change happens several steps up the dependency chain, because the derived file may still be up to date with its own module even though its parents are out of date. I suppose I could add some sort of parent.checkIfUpToDate() that checks its own files and then asks each of its parents if they're up to date (which then ask their parents, etc.), and updates if not. The version number can just be a static string stored in each module.
My concern with that approach is that it might mean reading potentially large files from disk just to get a version number. If I went with the "file header" approach, does Python actually load the whole file into memory when I do an open(myFile), or can I do that, just read the header lines, and close the file without loading the whole thing into memory?
Last - is there a good way to embed this type of information beyond just having the first line of the file be some variation of # MyFile made with MyModule V x.y.z and writing some piece of code to parse that line?
I'm kind of curious if this approach makes sense, or if I'm reinventing the wheel and there's already something out there to do this.
edit: And something else that occurred to me after I submitted this - does Python have any mechanism to define templates that modules must follow? Just as a means to make sure I keep the format for the data reading steps consistent from module to module.
I cannot answer all of your questions, but you can read in only a small part of the data from a large file, as you can see here:
How to read specific part of large file in Python
I do not see why you would need a parent.checkIfUpToDate() function; you could just as well store the version numbers of the parent modules in the file itself.
To me your approach sounds reasonable; however, I have never done anything similar. Alternatively, you could create an additional file that holds this information, but I think storing the information in the actual data file should prevent version mismatches between your "data file" and a separate "version file".
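As a rough illustration of both points (reading only the header line and storing the parent versions in the data file itself), here is a sketch; the file name, the version values and recalculate_b() are hypothetical:

def write_with_header(path, lines, versions):
    # First line records the versions of every module the data was built from
    with open(path, 'w') as f:
        f.write('# versions: ' + ','.join(versions) + '\n')
        f.writelines(lines)

def read_header_versions(path):
    # open() does not load the file into memory; readline() pulls in just one line
    with open(path) as f:
        first_line = f.readline()
    return first_line.split(':', 1)[1].strip().split(',')

# a_version / b_version are hypothetical: the current version strings of modules A and B.
# B is stale if it was written with a different version of its own module or of A.
if read_header_versions('b_output.dat') != [a_version, b_version]:
    recalculate_b()  # hypothetical: rebuild B and everything derived from it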
I have a large amount of data, around 50 GB worth, in a CSV which I want to analyse for ML purposes. It is, however, way too large to fit in memory in Python. I ideally want to use MySQL because querying is easier. Can anyone offer a host of tips for me to look into? This can be anything from:
How to store it in the first place: I realise I probably can't load it all in at once, so would I do it iteratively? If so, what should I look into for this? In addition, I've heard about indexing; would that really speed up queries on such a massive data set?
Are there better technologies out there to handle this amount of data while still being able to query and do feature engineering quickly? Whatever I eventually feed into my algorithm should be usable from Python, but I need to query and do some feature engineering before I get a data set that is ready to be analysed.
I'd really appreciate any advice; this all needs to be done on a personal computer! Thanks!
Can anyone offer a host of tips for me to look into
Gladly!
Look at the CSV file's first line to see if there is a header. You'll need to create a table with the same fields (and appropriate data types).
One of the fields might be unique per line and can be used later to find the line; that's your candidate for PRIMARY KEY. Otherwise, add an AUTO_INCREMENT field as PRIMARY KEY.
INDEXes are used to later search for data. Whatever fields you feel you will be searching/filtering on later should have some sort of INDEX. You can always add them later.
INDEXes can combine multiple fields if they are often searched together
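As a rough sketch of the table setup described in the points above (the column names and types are hypothetical and should mirror your CSV header; connection parameters are placeholders):

import MySQLdb  # provided by the mysqlclient / MySQL-python package

conn = MySQLdb.connect(host='localhost', user='user', passwd='password', db='mydata')
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS measurements (
        id INT AUTO_INCREMENT PRIMARY KEY,          -- surrogate key if no field is unique
        recorded_at DATETIME,
        sensor VARCHAR(64),
        value DOUBLE,
        INDEX idx_recorded_at (recorded_at),        -- index the fields you filter on
        INDEX idx_sensor_time (sensor, recorded_at) -- combined index for fields searched together
    )
""")
conn.commit()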
In order to read in the data, you have 2 ways:
Use LOAD DATA INFILE (see the LOAD DATA INFILE documentation).
Write your own script: the best technique is to create a prepared statement for the INSERT command, then read your CSV line by line (in a loop), split the fields into variables, and execute the prepared statement with that line's values. A sketch of this option is shown below.
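A minimal sketch of that second option; the file name, table and columns are hypothetical and should match whatever table you created:

import csv
import MySQLdb

conn = MySQLdb.connect(host='localhost', user='user', passwd='password', db='mydata')  # placeholders
cur = conn.cursor()

insert_sql = 'INSERT INTO measurements (recorded_at, sensor, value) VALUES (%s, %s, %s)'

with open('data.csv') as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row, if the file has one
    for row in reader:                # one line at a time, so memory use stays flat
        cur.execute(insert_sql, row)  # parameters are bound per row
conn.commit()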
You might also benefit from a web page designed to search the data, depending on who needs to use it.
Hope this gives you some ideas
That depends on what you have. You can use Apache Spark and its SQL feature: Spark SQL gives you the possibility to write SQL queries over your dataset. For best performance you need a distributed mode (you can use it on a local machine, but the results are limited) and a powerful machine. You can use Python, Scala, or Java to write your code.
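A minimal PySpark sketch of that idea, assuming pyspark is installed, the CSV has a header row, and the file name is hypothetical:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('csv-exploration').getOrCreate()

# Spark reads the 50 GB file lazily instead of loading it all into memory at once
df = spark.read.csv('data.csv', header=True, inferSchema=True)
df.createOrReplaceTempView('data')

# Plain SQL over the dataset
spark.sql('SELECT COUNT(*) FROM data').show()

spark.stop()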
I create an rdflib graph by parsing records from a database using rdflib-jsonld. However, the subject of the triples is missing a / in the URL. To add it, I use the following code:
for s, p, o in graph1:
    print 'parsing to graph2. next step - run query on graph2'
    pprint.pprint((s, p, o))
    s = str(s)
    s1 = s.replace('https:/w', 'https://w')
    s = URIRef(s1)
    graph2.add((s, p, o))
This step takes a very long time (a couple of hours) to run because of the high number of triples in the graph. How can I reduce the time taken? Instead of looping through every element, how do I alter the subjects in bulk?
First of all, to make proper time measurements, remove anything not related to the replacement itself; in particular, both the ordinary and the pretty print, you don't need them. If you need some progress indicator, write a short message (e.g. a single dot) into a logfile every N steps.
Avoid memory overconsumption. I don't know what your graph looks like internally, but it would be better to make the replacement in place, without creating a parallel graph structure. Check memory usage during the process: if the program runs out of free RAM, you're in trouble, as all processes will slow to a crawl. If you can't modify the existing graph and you do run out of memory, then for measurement purposes simply skip creating the second graph, even if the replacement result is then lost and thus useless.
If nothing helps, take a step back. You could perform the replacements at a stage before you have parsed the file(s), with either Python's re module or a text tool like sed that is dedicated to batch text processing.
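For example, a sketch of fixing the text before rdflib ever sees it; fetch_records_as_jsonld() is hypothetical and stands for however you currently obtain the raw JSON-LD:

import rdflib

jsonld_text = fetch_records_as_jsonld()                     # hypothetical source of the raw JSON-LD
jsonld_text = jsonld_text.replace('https:/w', 'https://w')  # one pass over plain text, no triple loop

graph = rdflib.Graph()
graph.parse(data=jsonld_text, format='json-ld')             # requires the rdflib-jsonld plugin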
I need to read data from a huge table (>1 million rows, 16 cols of raw text) and do some processing on it. Reading it row by row (Python, MySQLdb) seems very slow indeed, and I would like to be able to read multiple rows at a time (and possibly parallelize it).
Just FYI, my code currently looks something like this:
cursor.execute('select * from big_table')
rows = int(cursor.rowcount)
for i in range(rows):
    row = cursor.fetchone()
    # ... DO Processing ...
I tried to run multiple instances of the program to iterate over different sections of the table (for example, the 1st instance would iterate over the first 200k rows, the 2nd instance over rows 200k-400k, ...), but the problem is that the 2nd instance (and 3rd instance, and so on) takes FOREVER to get to a stage where it starts looking at row 200k onwards. It almost seems like it is still doing the processing of the first 200k rows instead of skipping over them. The code I use (for the 2nd instance) in this case is something like:
for i in range(rows):
    # Fetch the row but do nothing (need to skip over 1st 200k rows)
    row = cur.fetchone()
    if not i in range(200000, 400000):
        continue
    # ... DO Processing ...
How can I speed up this process? Is there a clean way to do faster/parallel reads from MySQL database through python?
EDIT 1: I tried the "LIMIT" thing based on the suggestions below. For some reason, though, when I start 2 processes on my quad-core server, it seems like only 1 single process is running at a time (the CPU seems to be time-sharing between these processes, as opposed to each core running a separate process). The 2 Python processes are using respectively 14% and 9% of the CPUs. Any thoughts on what might be wrong?
The LIMIT clause can take two parameters, where the first is the start row and the second is the row count.
SELECT ...
...
LIMIT 200000,200000
You may also run into I/O contention on the DB server (even though you are getting the data in chunks, the disks need to serialize the reads at some level). So, rather than reading from MySQL in parallel, a single read may work better for you.
Rather than reading 200K rows at a time, you could dump the whole of the data in one hit and process the data (possibly in parallel) in memory, in python.
Potentially, you can use something like psycopg.copy_expert(). Alternatively, do a MySQL dump into a single file and use csv.reader to iterate over it (or sections of it if you're processing it in parallel).
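A rough sketch of the dump-and-iterate idea; the export path is hypothetical (e.g. produced with SELECT ... INTO OUTFILE on the server), and process() stands for your own processing:

import csv

with open('/tmp/big_table.csv') as f:
    reader = csv.reader(f)
    for row in reader:     # one row at a time, no need to hold all 1M rows in memory
        process(row)       # hypothetical processing function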
You're exactly right that your attempt to parallelize the second chunk is requesting the first 200k records before it begins processing. You need to use the LIMIT keyword to ask the server to return different results:
select * from big_table LIMIT 0,200000
...
select * from big_table LIMIT 200000,200000
...
select * from big_table LIMIT 400000,200000
...
And so on. Pick the numbers however you wish -- but be aware that memory, network, and disk bandwidths might not give you perfect scaling. In fact, I'd be wary of starting more than two or three of these simultaneously.
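A minimal sketch of what each instance could run, using MySQLdb with a parameterized LIMIT; the connection parameters are placeholders, and note that without an ORDER BY the chunk boundaries are not guaranteed to be stable between queries:

import MySQLdb

def process_chunk(offset, count):
    conn = MySQLdb.connect(host='localhost', user='user', passwd='password', db='mydb')  # placeholders
    cur = conn.cursor()
    cur.execute('SELECT * FROM big_table LIMIT %s, %s', (offset, count))
    for row in cur:
        pass               # ... DO Processing ...
    conn.close()

# e.g. instance 1 runs process_chunk(0, 200000), instance 2 runs process_chunk(200000, 200000), etc.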
I need to loop through a very large text file, several gigabytes in size (a zone file to be exact). I need to run a few queries for each entry in the zone file, and then store the results in a searchable database.
My weapons of choice at the moment, mainly because I know them, are Python and MySQL. I'm not sure how well either will deal with files of this size, however.
Does anyone with experience in this area have any suggestions on the best way to open and loop through the file without overloading my system? How about the most efficient way to process the file once I can open it (threading?) and store the processed data?
You shouldn't have any real trouble storing that amount of data in MySQL, although you will probably not be able to store the entire database in memory, so expect some IO performance issues. As always, make sure you have the appropriate indices before running your queries.
The most important thing is to not try to load the entire file into memory. Loop through the file; don't use a method like readlines, which will load the whole file at once.
Make sure to batch the requests. Load up a few thousand lines at a time and send them all in one big SQL request.
This approach should work:
def push_batch(batch):
    # Send a big INSERT request to MySQL (one possible body is sketched below)
    pass

current_batch = []
with open('filename') as f:
    for line in f:
        current_batch.append(line)
        if len(current_batch) > 1000:
            push_batch(current_batch)
            current_batch = []
push_batch(current_batch)
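The push_batch function above is left as a stub. A minimal sketch of what it might contain, using MySQLdb's executemany to send the whole batch in one request; the connection parameters, table name and column are hypothetical:

import MySQLdb

conn = MySQLdb.connect(host='localhost', user='user', passwd='password', db='zones')  # placeholders

def push_batch(batch):
    # Insert all lines of the batch in one multi-row statement
    cur = conn.cursor()
    cur.executemany('INSERT INTO zone_lines (raw_line) VALUES (%s)',
                    [(line.rstrip('\n'),) for line in batch])
    conn.commit()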
Zone files are pretty regularly formatted; consider whether you can get away with just using LOAD DATA INFILE. You might also consider creating a named pipe, pushing partially formatted data into it from Python, and using LOAD DATA INFILE to read it into MySQL.
MySQL has some great tips on optimizing inserts, some highlights:
Use multiple value lists in each insert statement.
Use INSERT DELAYED, particularly if you are pushing from multiple clients at once (e.g. using threading).
Lock your tables before inserting.
Tweak the key_buffer_size and bulk_insert_buffer_size.
The fastest processing will be done in MySQL, so consider if you can get away with doing the queries you need after the data is in the db, not before. If you do need to do operations in Python, threading is not going to help you. Only one thread of Python code can execute at a time (GIL), so unless you're doing something which spends a considerable amount of time in C, or interfaces with external resources, you're only going to ever be running in one thread anyway.
The most important optimization question is what is bounding the speed; there's no point spinning up a bunch of threads to read the file if the database is the bounding factor. The only way to really know is to try it and make tweaks until it is fast enough for your purpose.
Zack Bloom's answer is excellent and I upvoted it. Just a couple of thoughts:
As he showed, just using with open(filename) as f: / for line in f is all you need to do. open() returns a file object that acts as an iterator, giving you one line at a time from the file.
If you want to slurp every line into your database, do it in the loop. If you only want certain lines that match a certain regular expression, that's easy.
import re

pat = re.compile(some_pattern)

with open(filename) as f:
    for line in f:
        if not pat.search(line):
            continue
        # do the work to insert the line here
With a file that is multiple gigabytes, you are likely to be I/O bound. So there is likely no reason to worry about multithreading or whatever. Even running several regular expressions is likely to crunch through the data faster than the file can be read or the database updated.
Personally, I'm not much of a database guy and I like using an ORM. The last project I did database work on, I was using Autumn with SQLite. I found that the default for the ORM was to do one commit per insert, and it took forever to insert a bunch of records, so I extended Autumn to let you explicitly bracket a bunch of inserts with a single commit; it was much faster that way. (Hmm, I should extend Autumn to work with a Python with statement, so that you could wrap a bunch of inserts into a with block and Autumn would automatically commit.)
http://autumn-orm.org/
Anyway, my point was just that with database stuff, doing things the wrong way can be very slow. If you are finding that the database inserting is your bottleneck, there might be something you can do to fix it, and Zack Bloom's answer contains several ideas to start you out.