time series array database for python

We have a fairly big DB about our traffic.
We have over 1000 trains daily, and we gather data from different sources on either a minute or an event basis (a train has arrived at a station, a car or loc composition has changed, etc.).
The DB currently holds over 100M rows in MSSQL.
It turns out that 90% of the data is redundant; only the time, the station the train passed, and the distance change.
Insert/update and reasonably simple queries are fine.
But when it comes to making statistical queries (like how many km a specific train/car/loc was running in a given period of time), the response time and also the query complexity become important (a response is needed within the 1-2 second range).
What kind of DB/storage solution could I use for this kind of queries?
We have a Python (Flask) front end for reporting, so a DB solution with a Python interface is pretty much a must.
I've considered PyTables/pyhdf5, but I have some concerns about reliability (I cannot guarantee that only one process will write the file, so as per the documentation the risk of data corruption is high), and I cannot afford to lose the data.
Side note: I'm quite OK with DB optimization, so I know the relational limits fairly well.
Any ideas?

Related

Query execution time with small batches vs entire input set

I'm using ArangoDB 3.9.2 for a search task. The dataset contains 100,000 items. When I pass the entire dataset as one input list to the engine, the execution time is around ~10 sec, which is pretty quick. But if I pass the dataset in small batches of 100 items each, the execution time grows rapidly; processing the full dataset that way takes about ~2 min. Could you please explain why this is happening? The dataset is the same.
I'm using the Python driver "ArangoClient" from the python-arango lib, ver 0.2.1.
PS: I had a similar problem with Neo4j, but it was solved by committing transactions via the HTTP API. Does ArangoDB have something similar?
Every time you make a call to a remote system (Neo4J or ArangoDB or any database) there is overhead in making the connection, sending the data, and then after executing your command, tearing down the connection.
What you're doing is trying to find the 'sweet spot' for your implementation as to the most efficient batch size for the type of data you are sending, the complexity of your query, the performance of your hardware, etc.
What I recommend doing is writing a test script that sends the data in varying batch sizes to help you determine the optimal settings for your use case.
I have taken this approach with many systems that I've designed and the optimal batch sizes are unique to each implementation. It totally depends on what you are doing.
See what results you get for the overall load time if you use batch sizes of 100, 1000, 2000, 5000, and 10000.
This way you'll work out the best answer for you.
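Something like the following is a rough benchmarking sketch of that idea. The AQL query, database name, credentials, and collection contents below are placeholders, and it assumes a current python-arango release (the 0.2.1 driver mentioned above is older, so the exact calls may differ):

import time

from arango import ArangoClient

client = ArangoClient(hosts="http://localhost:8529")
db = client.db("mydb", username="root", password="secret")

# Stand-in for the real search query; it just passes each batch through AQL.
AQL = "FOR item IN @items RETURN item"

def run_in_batches(items, batch_size):
    """Send `items` to the server in chunks and return total elapsed seconds."""
    start = time.perf_counter()
    for i in range(0, len(items), batch_size):
        batch = items[i:i + batch_size]
        # One round trip per batch: the per-call overhead is paid every time.
        list(db.aql.execute(AQL, bind_vars={"items": batch}))
    return time.perf_counter() - start

items = [{"value": n} for n in range(100_000)]
for size in (100, 1000, 2000, 5000, 10000):
    print(size, round(run_in_batches(items, size), 2), "s")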

Is it faster and more memory efficient to manipulate data in Python or PostgreSQL?

Say I had a PostgreSQL table with 5-6 columns and a few hundred rows. Would it be more effective to use psycopg2 to load the entire table into my Python program and use Python to select and order the rows I want? Or would it be more effective to use SQL to select the required rows, order them, and load only those specific rows into my Python program?
By 'effective' I mean in terms of:
Memory Usage.
Speed.
Additionally, how would these factors start to vary as the size of the table increases? Say, the table now has a few million rows?
It's almost always going to be faster to perform all of these operations in PostgreSQL. These database systems have been designed to scale well for huge amounts of data, and are highly optimised for their typical use cases. For example, they don't have to load all of the data from disk to perform most basic filters.
Even if this were not the case, the network latency / usage alone would be enough to balance this out, especially if you were running the query often.
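To make the comparison concrete, here is a minimal sketch of the two approaches with psycopg2. The table name, columns, and connection string are made up for illustration:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")  # placeholder connection string

# Approach 1: let PostgreSQL filter and order, and ship only the rows you need.
with conn.cursor() as cur:
    cur.execute(
        "SELECT id, category, created_at FROM items "
        "WHERE category = %s ORDER BY created_at DESC LIMIT 100",
        ("sensor",),
    )
    rows = cur.fetchall()

# Approach 2: ship the whole table and do the filtering/sorting in Python.
with conn.cursor() as cur:
    cur.execute("SELECT id, category, created_at FROM items")
    everything = cur.fetchall()

filtered = sorted(
    (r for r in everything if r[1] == "sensor"),
    key=lambda r: r[2],
    reverse=True,
)[:100]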
Actually, if you are comparing data that is already loaded into memory to data being retrieved from a database, then the in-memory operations are often going to be faster. Databases have overhead:
They are in separate processes on the same server or on a different server, so data and commands need to move between them.
Queries need to be parsed and optimized.
Databases support multiple users, so other work may be going on using up resources.
Databases maintain ACID properties and data integrity, which can add overhead.
The first two of these in particular add overhead compared to equivalent in-memory operations for every query.
That doesn't mean that databases do not have advantages, particularly for complex queries:
They implement multiple different algorithms and have an optimizer to choose the best one.
They can take advantage of more resources -- particularly by running in parallel.
They can (sometimes) cache results saving lots of time.
The advantage of databases is not that they provide the best performance all the time. The advantage is that they provide good performance across a very wide range of requests with a simple interface (even if you don't like SQL, I think you need to admit that it is simpler, more concise, and more flexible than writing code in a 3rd-generation language).
In addition, databases protect data, via ACID properties and other mechanisms to support data integrity.

What is a viable approach for loading a large quantity of data originating in csv files?

Background
Ultimately I'm trying to load a large quantity of data into GBQ. Records exist for the majority of seconds across many dates.
I want to have a table for each day's data so I may query the set using the syntax:
_TABLE_SUFFIX BETWEEN '20170403' AND '20170408'...
Some of the columns will be nested, and the sets of columns currently originate from a high number of csv files, each between 30 and 150 MB in size.
Approach
My approach to processing the files is to:
1) Create tables with correct schema
2) Fill each table with a record for each second in preparation for cycling through data and running UPDATE queries
3) Cycle through csv files and load data
All this is intended to be driven by the GBQ client library for python.
Step (2) may seem odd, but I think it's needed in order to simplify the loading, as I will be able to guarantee that the UPDATE DML will work in all cases.
I do realise that (3) has little chance of working given the issues with (2), but I have run a similar operation at a slightly smaller scale, leading me to believe that my approach might not be entirely hopeless at this larger scale.
I'm also left wondering whether this is the most appropriate method for loading data of this nature, but some of the options presented in the GBQ documentation seem like they could be overkill for a one-time load of data, i.e. I would not want to learn Apache Beam unless I know I have to, given that my chief concern is analysing the dataset. But I'm not beyond learning something like Apache Beam if I know it's the only way of completing such a load, as it could come in handy in future.
Problem that I've run into
In order to run (2) I have three nested for loops cycling through each hour, minute, and second, constructing a long string to go into the INSERT query:
(TIMESTAMP '{timestamp}', TIMESTAMP '2017-04-03 00:00:00', ... + ~86k times for each day)
GBQ doesn't allow you to run such a long query; there is a limit, from memory around 250k characters (I could be wrong). For that reason the length of the query is tested at each iteration, and if it exceeds a number slightly lower than 250k characters the insert query is performed.
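Roughly, the loop looks like this. It is only a sketch of what I described above; run_query() stands in for however the INSERT job is actually submitted (see the code block further down), and the limit constant is a guess at a safe margin:

LIMIT = 240_000  # stay somewhat below the ~250k character query limit

TABLE = "`project.test_dataset.table_20170403`"
values = []

def flush(values):
    """Build the INSERT for the accumulated rows and submit it."""
    sql = f"INSERT INTO {TABLE} (timestamp) VALUES " + ", ".join(values)
    run_query(sql)  # hypothetical helper wrapping the client call shown below

for hour in range(24):
    for minute in range(60):
        for second in range(60):
            ts = f"2017-04-03 {hour:02d}:{minute:02d}:{second:02d}"
            values.append(f"(TIMESTAMP '{ts}')")
            # Test the length at each iteration, as described above.
            pending = f"INSERT INTO {TABLE} (timestamp) VALUES " + ", ".join(values)
            if len(pending) > LIMIT:
                flush(values)
                values = []

if values:
    flush(values)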
Although this worked for a similar type of operation involving more columns, when I run it this time the process terminates when GBQ returns:
"google.cloud.exceptions.BadRequest: 400 Resources exceeded during query execution: Not enough resources for query planning - too many subqueries or query is too complex.."
When I reduce the size of the strings and run smaller queries this error is returned less frequently, but it does not disappear altogether. Unfortunately, at this reduced query size the script would take too long to complete to be a feasible option for step (2): it takes >600 seconds, and the entire data set would take in excess of 20 days to run.
I had thought that changing the query mode to 'BATCH' might overcome this, but it's been pointed out to me that this won't help. I'm not yet clear why, but I'm happy to take someone's word for it. Hence I've detailed the problem further, as it seems likely that my approach may be wrong.
This is what I see as the relevant code block:
import uuid

from google.cloud import bigquery
from google.cloud.bigquery import Client
... ...
bqc = Client()
dataset = bqc.dataset('test_dataset')
# run_async_query() takes a job id and the SQL string (older client API)
job = bqc.run_async_query(str(uuid.uuid4()), sql)
job.use_legacy_sql = False
job.begin()
What would be the best approach to loading this dataset? Is there a way of getting my current approach to work?
Query
INSERT INTO `project.test_dataset.table_20170403` (timestamp) VALUES (TIMESTAMP '2017-04-03 00:00:00'), ... for ~7000 distinct values
Update
It seems as if the GBQ client library has had a not insignificant upgrade since I last downloaded it: Client.run_async_query() appears to have been replaced with Client.query(). I'll experiment with the new code and report back.
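For reference, a minimal sketch of the newer call mentioned above (the sql string is a placeholder for the INSERT built earlier); I haven't verified it against my dataset yet:

from google.cloud import bigquery

sql = "SELECT 1"  # placeholder for the INSERT statement built above

bqc = bigquery.Client()
job = bqc.query(sql)  # replaces run_async_query(); returns a QueryJob
job.result()          # block until the job finishes, raising on error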

Django Rest Framework large number of queries for nested relationship

My problem
I have a large number of nested model serializers (5 levels deep) via one-to-many relationships. The total query time is not that high according to the Django Debug Toolbar - maybe 100 ms, but it takes about 6 s of CPU time to run due to the many hundreds of times the database is being hit per query.
The problem is, 3 of those tables are huge, with millions of rows. As such, whenever I run 'prefetch_related' for anything it takes far longer to execute, sometimes minutes. select_related isn't an option, as they are all one-to-many.
Imagine something like a pedometer that takes steps, and also ties multiple environment readings and accelerometer readings to a step.
Example Schema
Person: id, name
Pedometer: id, name, owner (Person)
Step: id, info, timestamp, pedometer
Environment_Readings: id, pressure, step
Accelerometer_Readings: id, force, step
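Or, as a minimal Django models sketch of the schema above (the field types, CamelCase class names, and on_delete choices are assumptions; Django adds the id fields automatically):

from django.db import models

class Person(models.Model):
    name = models.CharField(max_length=100)

class Pedometer(models.Model):
    name = models.CharField(max_length=100)
    owner = models.ForeignKey(Person, on_delete=models.CASCADE)

class Step(models.Model):
    info = models.CharField(max_length=100)
    timestamp = models.DateTimeField()
    pedometer = models.ForeignKey(Pedometer, on_delete=models.CASCADE)

class EnvironmentReading(models.Model):
    pressure = models.FloatField()
    step = models.ForeignKey(Step, on_delete=models.CASCADE)

class AccelerometerReading(models.Model):
    force = models.FloatField()
    step = models.ForeignKey(Step, on_delete=models.CASCADE)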
What I want to achieve
So I want to select all people > an array of their pedometers (often multiple) > an array of all steps they have taken in the last minute > array of environment readings and array of accelerometer readings.
I am not seeing a way to prefetch environment_readings or accelerometer_readings, as neither has a timestamp. The only ones that can be prefetched appear to be pedometer and step, and for some reason prefetching pedometer seems to slow it down. Roughly, what I have been trying looks like the sketch below.
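This is only an illustration of the prefetch I mean, assuming the models sketched above and Django's default reverse names (pedometer_set, step_set); the time filter can only go on Step:

from datetime import timedelta

from django.db.models import Prefetch
from django.utils import timezone

one_minute_ago = timezone.now() - timedelta(minutes=1)

people = Person.objects.prefetch_related(
    "pedometer_set",
    # Limit the prefetched steps to the last minute; the readings tables
    # have no timestamp of their own, which is the part I am stuck on.
    Prefetch(
        "pedometer_set__step_set",
        queryset=Step.objects.filter(timestamp__gte=one_minute_ago),
    ),
)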
Any ideas?

what changes when your input is giga/terabyte sized?

I just took my first baby step into real scientific computing today when I was shown a data set where the smallest file is 48000 fields by 1600 rows (haplotypes for several people, for chromosome 22). And this is considered tiny.
I write Python, so I've spent the last few hours reading about HDF5, NumPy, and PyTables, but I still feel like I'm not really grokking what a terabyte-sized data set actually means for me as a programmer.
For example, someone pointed out that with larger data sets, it becomes impossible to read the whole thing into memory, not because the machine has insufficient RAM, but because the architecture has insufficient address space! It blew my mind.
What other assumptions have I been relying in the classroom that just don't work with input this big? What kinds of things do I need to start doing or thinking about differently? (This doesn't have to be Python specific.)
I'm currently engaged in high-performance computing in a small corner of the oil industry and regularly work with datasets of the orders of magnitude you are concerned about. Here are some points to consider:
Databases don't have a lot of traction in this domain. Almost all our data is kept in files; some of those files are based on tape file formats designed in the 70s. I think that part of the reason for the non-use of databases is historic: 10, even 5, years ago I think that Oracle and its kin just weren't up to the task of managing single datasets of O(TB), let alone a database of 1000s of such datasets.
Another reason is a conceptual mismatch between the normalisation rules for effective database analysis and design and the nature of scientific data sets.
I think (though I'm not sure) that the performance reason(s) are much less persuasive today. And the concept-mismatch reason is probably also less pressing now that most of the major databases available can cope with spatial data sets which are generally a much closer conceptual fit to other scientific datasets. I have seen an increasing use of databases for storing meta-data, with some sort of reference, then, to the file(s) containing the sensor data.
However, I'd still be looking at, in fact am looking at, HDF5. It has a couple of attractions for me (a) it's just another file format so I don't have to install a DBMS and wrestle with its complexities, and (b) with the right hardware I can read/write an HDF5 file in parallel. (Yes, I know that I can read and write databases in parallel too).
Which takes me to the second point: when dealing with very large datasets you really need to be thinking of using parallel computation. I work mostly in Fortran, one of its strengths is its array syntax which fits very well onto a lot of scientific computing; another is the good support for parallelisation available. I believe that Python has all sorts of parallelisation support too so it's probably not a bad choice for you.
Sure you can add parallelism on to sequential systems, but it's much better to start out designing for parallelism. To take just one example: the best sequential algorithm for a problem is very often not the best candidate for parallelisation. You might be better off using a different algorithm, one which scales better on multiple processors. Which leads neatly to the next point.
I think also that you may have to come to terms with surrendering any attachments you have (if you have them) to lots of clever algorithms and data structures which work well when all your data is resident in memory. Very often trying to adapt them to the situation where you can't get the data into memory all at once, is much harder (and less performant) than brute-force and regarding the entire file as one large array.
Performance starts to matter in a serious way, both the execution performance of programs, and developer performance. It's not that a 1TB dataset requires 10 times as much code as a 1GB dataset so you have to work faster, it's that some of the ideas that you will need to implement will be crazily complex, and probably have to be written by domain specialists, ie the scientists you are working with. Here the domain specialists write in Matlab.
But this is going on too long; I'd better get back to work.
In a nutshell, the main differences IMO:
You should know beforehand what your likely bottleneck will be (I/O or CPU) and focus on the best algorithm and infrastructure to address this. I/O quite frequently is the bottleneck.
Choice and fine-tuning of an algorithm often dominates any other choice made. Even modest changes to algorithms and access patterns can impact performance by orders of magnitude. You will be micro-optimizing a lot. The "best" solution will be system-dependent.
Talk to your colleagues and other scientists to profit from their experiences with these data sets. A lot of tricks cannot be found in textbooks.
Pre-computing and storing can be extremely successful.
Bandwidth and I/O
Initially, bandwidth and I/O often is the bottleneck. To give you a perspective: at the theoretical limit for SATA 3, it takes about 30 minutes to read 1 TB. If you need random access, read several times or write, you want to do this in memory most of the time or need something substantially faster (e.g. iSCSI with InfiniBand). Your system should ideally be able to do parallel I/O to get as close as possible to the theoretical limit of whichever interface you are using. For example, simply accessing different files in parallel in different processes, or HDF5 on top of MPI-2 I/O is pretty common. Ideally, you also do computation and I/O in parallel so that one of the two is "for free".
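As a quick back-of-the-envelope check of that figure (assuming SATA 3's 6 Gbit/s line rate works out to roughly 600 MB/s of usable payload):

bytes_per_second = 600e6   # ~600 MB/s, the practical ceiling for SATA 3
terabyte = 1e12            # 1 TB in bytes

minutes = terabyte / bytes_per_second / 60
print(f"{minutes:.0f} minutes to read 1 TB sequentially")  # ~28 minutes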
Clusters
Depending on your case, either I/O or CPU might then be the bottleneck. No matter which one it is, huge performance increases can be achieved with clusters if you can effectively distribute your tasks (for example with MapReduce). This might require totally different algorithms than the typical textbook examples. Spending development time here is often the best time spent.
Algorithms
In choosing between algorithms, big O of an algorithm is very important, but algorithms with similar big O can be dramatically different in performance depending on locality. The less local an algorithm is (i.e. the more cache misses and main memory misses), the worse the performance will be - access to storage is usually an order of magnitude slower than main memory. Classical examples for improvements would be tiling for matrix multiplications or loop interchange.
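As a toy illustration of the locality point (loop interchange rather than tiling; in pure Python/NumPy the effect is muted compared with C or Fortran): traversing a row-major array row by row touches memory contiguously, while going column by column does not.

import numpy as np

a = np.random.rand(4096, 4096)  # row-major (C order) by default

# Column-by-column: strided access, poor locality for a row-major array.
total_cols = 0.0
for j in range(a.shape[1]):
    total_cols += a[:, j].sum()

# Row-by-row: contiguous access, good locality; same result, fewer cache misses.
total_rows = 0.0
for i in range(a.shape[0]):
    total_rows += a[i, :].sum()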
Computer, Language, Specialized Tools
If your bottleneck is I/O, this means that algorithms for large data sets can benefit from more main memory (e.g. 64 bit) or programming languages / data structures with less memory consumption (e.g., in Python __slots__ might be useful), because more memory might mean less I/O per CPU time. BTW, systems with TBs of main memory are not unheard of (e.g. HP Superdomes).
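For instance, a small sketch of the __slots__ point: slotted instances skip the per-object __dict__, which adds up when you hold millions of small records in memory.

import sys

class PlainRecord:
    def __init__(self, pos, value):
        self.pos = pos
        self.value = value

class SlottedRecord:
    __slots__ = ("pos", "value")

    def __init__(self, pos, value):
        self.pos = pos
        self.value = value

p = PlainRecord(1, 2.0)
s = SlottedRecord(1, 2.0)
print(sys.getsizeof(p) + sys.getsizeof(p.__dict__))  # instance plus its dict
print(sys.getsizeof(s))                              # slotted instance only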
Similarly, if your bottleneck is the CPU, faster machines, languages and compilers that allow you to use special features of an architecture (e.g. SIMD like SSE) might increase performance by an order of magnitude.
The way you find and access data, and store meta information can be very important for performance. You will often use flat files or domain-specific non-standard packages to store data (e.g. not a relational db directly) that enable you to access data more efficiently. For example, kdb+ is a specialized database for large time series, and ROOT uses a TTree object to access data efficiently. The pyTables you mention would be another example.
While some languages have naturally lower memory overhead in their types than others, that really doesn't matter for data this size - you're not holding your entire data set in memory regardless of the language you're using, so the "expense" of Python is irrelevant here. As you pointed out, there simply isn't enough address space to even reference all this data, let alone hold onto it.
What this normally means is either a) storing your data in a database, or b) adding resources in the form of additional computers, thus adding to your available address space and memory. Realistically you're going to end up doing both of these things. One key thing to keep in mind when using a database is that a database isn't just a place to put your data while you're not using it - you can do WORK in the database, and you should try to do so. The database technology you use has a large impact on the kind of work you can do, but an SQL database, for example, is well suited to do a lot of set math and do it efficiently (of course, this means that schema design becomes a very important part of your overall architecture). Don't just suck data out and manipulate it only in memory - try to leverage the computational query capabilities of your database to do as much work as possible before you ever put the data in memory in your process.
The main assumptions are about the amount of CPU/cache/RAM/storage/bandwidth you can have in a single machine at an acceptable price. There are lots of answers here at Stack Overflow still based on the old assumptions of a 32-bit machine with 4 GB of RAM, about a terabyte of storage, and a 1 Gb network. With 16 GB DDR3 RAM modules at 220 Eur, machines with 512 GB of RAM and 48 cores can be built at reasonable prices. The switch from hard disks to SSDs is another important change.
