I am currently facing the problem of having to frequently access a large but simple data set on a smallish (700 MHz) device in real time. The data set contains around 400,000 mappings from abbreviations to abbreviated words, e.g. "frgm" to "fragment". Reading will happen frequently while the device is in use and should not take more than 15-20 ms.
My first attempt was to use SQLite to create a simple database containing a single table in which two strings constitute a record:
CREATE TABLE WordMappings (key text, word text)
This table is created once and although alterations are possible, only read-access is time critical.
Following this guide, my SELECT statement looks as follows:
def databaseQuery(self, query_string):
    # parameterized query instead of string concatenation
    self.cursor.execute("SELECT word FROM WordMappings WHERE key = ? LIMIT 1;", (query_string,))
    result = self.cursor.fetchone()
    return result[0]
However, using this code on a test database with 20,000 abbreviations, I am unable to fetch data in less than ~60 ms, which is far too slow.
Any suggestions on how to improve performance using SQLite or would another approach yield more promising results?
You can speed up lookups on the key column by creating an index for it:
CREATE INDEX key_index ON WordMappings(key);
To check whether a query uses an index or scans the entire table, use EXPLAIN QUERY PLAN.
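For example, a quick check from Python's sqlite3 module might look like this (a rough sketch, assuming the table from the question lives in a file called words.db):

import sqlite3

conn = sqlite3.connect("words.db")   # hypothetical database file
cursor = conn.cursor()
cursor.execute("CREATE INDEX IF NOT EXISTS key_index ON WordMappings(key);")
# EXPLAIN QUERY PLAN reports whether the lookup uses the index or scans the whole table
cursor.execute("EXPLAIN QUERY PLAN SELECT word FROM WordMappings WHERE key = ?;", ("frgm",))
print(cursor.fetchall())   # the plan should mention something like SEARCH ... USING INDEX key_index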
A long time ago I tried to use SQLite for sequential data and it was not fast enough for my needs. At the time, I was comparing it against an existing in-house binary format, which I ended up using.
I have not used it personally, but a friend uses PyTables for large time-series data; it may be worth looking into.
It turns out that defining a primary key speeds up individual queries by an order of magnitude.
Individual queries on a test table with 400,000 randomly created entries (10/20 characters long) took no longer than 5 ms, which satisfies the requirements.
The table is now created as follows:
CREATE TABLE WordMappings (key text PRIMARY KEY, word text)
A primary key is used because:
It is implicitly unique, which is a property of the abbreviations stored.
It cannot be NULL, so every row must contain a key; in our case, a NULL key would mean the database is corrupt.
Other users have suggested using an index; however, indexes are not necessarily unique, and according to the accepted answer to this question, they unnecessarily slow down update/insert/delete performance. Nevertheless, using an index may well improve read performance too, although this has not been tested by the original author.
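For reference, a minimal sketch of how such a lookup can be timed from Python (the database file name and sample key are made up):

import sqlite3, time

conn = sqlite3.connect("words.db")   # hypothetical file containing the table above
cursor = conn.cursor()

start = time.perf_counter()
cursor.execute("SELECT word FROM WordMappings WHERE key = ? LIMIT 1;", ("frgm",))
result = cursor.fetchone()
print(result, (time.perf_counter() - start) * 1000, "ms")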
I tried the cx_Oracle package with fetchall(), but this ended up consuming a lot of RAM.
I tried Pandas as well, but it also does not seem to be efficient when we have billions of records.
My use case: fetch each row from an Oracle table into Python, do some processing, and load it into another table.
PS: I was expecting something like fetchmany(); I tried it but was not able to get it to work.
With your large data set, since your machine's memory is "too small", you will have to process batches of rows and reinsert each batch before fetching the next one.
Tuning Cursor.arraysize and Cursor.prefetchrows (new in cx_Oracle 8) will be important for your fetch performance. The value is used for internal buffer sizes, regardless of whether you use fetchone() or fetchmany(). See Tuning cx_Oracle.
Use executemany() when reinserting the values, see Batch Statement Execution and Bulk Loading.
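A rough sketch of that batch loop (connection details, table and column names are placeholders, and cx_Oracle 8+ is assumed for prefetchrows):

import cx_Oracle

conn = cx_Oracle.connect(user="user", password="pw", dsn="dbhost/service")  # placeholder credentials
read_cur = conn.cursor()
write_cur = conn.cursor()

read_cur.arraysize = 10000        # rows fetched per round trip; tune to your memory budget
read_cur.prefetchrows = 10001

read_cur.execute("SELECT id, payload FROM source_table")   # hypothetical source table
while True:
    rows = read_cur.fetchmany()   # fetches arraysize rows at a time
    if not rows:
        break
    processed = [(r[0], r[1].upper()) for r in rows]        # stand-in for the real processing
    write_cur.executemany("INSERT INTO target_table (id, payload) VALUES (:1, :2)", processed)
    conn.commit()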
The big question is whether you even need to fetch into Python, or whether you can do the manipulation in PL/SQL or SQL - doing so removes the need to transfer data across the network and will be much, much more efficient.
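For example, if the processing can be expressed in SQL, something along these lines keeps all rows inside the database (reusing the connection from the sketch above; table, column names and the UPPER() call are placeholders):

cur = conn.cursor()
cur.execute("""
    INSERT INTO target_table (id, payload)
    SELECT id, UPPER(payload)      -- stand-in for the real transformation
    FROM source_table
""")
conn.commit()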
Generally, your strategy should be the following:
1. Write an SQL query that gets the data you want.
2. Execute the query.
3. Loop: read a tuple from the result.
4. Process the tuple.
5. Write the processed data to the target relation.
There are lots of details to deal with, however.
Generally, when executing a query of this kind, you'll want to read a consistent version of the database. If there is simultaneous writing to the same data, you'll get a slightly stale version, but you won't hold up the writers. If you want the most recent version, you have a hard problem to solve.
When executing step 4, don't open a new connection for every tuple; reuse one connection. Opening a new connection is akin to launching a battleship.
It is best if you don't have any indexes on the target relation, since you'll pay the cost of updating them each time you write a tuple. It will still work, but it will be slower. You can build the index after finishing the processing.
If this implementation is too slow for you, an easy way to speed things up is to process, say, 1000 tuples at once and write them in batches. If that's too slow, you've entered into the murky world of database performance tuning.
You can use offset/fetch pagination. If the data can be ordered by a column (a primary key, or a combination of columns, so that the ordering is the same across executions), here is what I tried.
I take the total COUNT(*) of the table.
Based on a chunk size (e.g. 100,000), I create a list of offsets [0, chunk, 2*chunk, ..., count(*)].
For example, if count(*) = 2222 and chunk = 100, the list will be:
[0, 100, 200, ..., 2222]
Then, using the values from that list, I fetch each partition with 'OFFSET i ROWS FETCH NEXT j ROWS ONLY'.
This selects the data in partitions without overlapping any rows.
I change the values of i and j on each iteration so that I cover the complete data set.
At runtime, with the given chunk of 100,000, only that much data is loaded into memory at a time. You can change it based on your system spec.
At each iteration, save the data wherever you want.
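A sketch of that pagination loop with cx_Oracle (Oracle 12c+ syntax; the connection details, table, ordering column and save_somewhere() helper are placeholders):

import cx_Oracle

conn = cx_Oracle.connect(user="user", password="pw", dsn="dbhost/service")  # placeholder credentials
cur = conn.cursor()

chunk = 100000
cur.execute("SELECT COUNT(*) FROM source_table")    # hypothetical table
total = cur.fetchone()[0]

for offset in range(0, total, chunk):
    cur.execute(
        "SELECT id, payload FROM source_table ORDER BY id "
        "OFFSET :off ROWS FETCH NEXT :cnt ROWS ONLY",
        off=offset, cnt=chunk)
    rows = cur.fetchall()       # at most `chunk` rows are held in memory per iteration
    save_somewhere(rows)        # hypothetical placeholder for your processing/saving step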
I'm looking to maintain a (Postgres) SQL database collecting data from third parties. As most of the data is static, and I get a full dump every day, I want to store only the data that is new. I.e., every day I get 100K records with, say, 300 columns, and 95K rows will be the same as the day before. In order to do this efficiently, I was thinking of inserting a hash of each record (coming from a Pandas dataframe or a Python dict) alongside the data. Some other data is stored as well, like when it was loaded into the database. Then, prior to inserting data into the database, I could hash the incoming data and easily verify that the record is not yet in the database, instead of having to check all 300 columns.
My question: which hash function should I pick (given that I'm in Python and prefer a very fast and solid solution that requires little coding on my side, while being able to handle all kinds of data like ints, floats, strings, datetimes, etc.)?
Python's hash is unsuitable, as it changes between sessions (as the approach in Create hash value for each row of data with selected columns in dataframe in python pandas does)
md5 and sha1 are cryptographic hashes. I don't need the crypto part, as this is not for security. They might be a bit slow as well, and I had some trouble with strings, as these require encoding.
Is a solution like CRC good enough?
For options two and three, if you recommend one, how can I implement it for arbitrary dicts and pandas rows? I have had little success in keeping this simple. For instance, for strings I needed to explicitly define the encoding, and the order of the fields in a record should not change the hash.
Edit: I just realized that it might be tricky to depend on Python for this: if I change programming language, I might end up with different hashes. Tying the hash to the database seems the more sensible choice.
Have you tried pandas.util.hash_pandas_object?
Not sure how efficient this is, but maybe you could use it like this:
row_hashes = pd.util.hash_pandas_object(df, index=False)
This will get you a pandas Series with one hash per row of the df; index=False keeps the index itself out of the hash.
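As a rough sketch of how that could feed the dedup idea from the question (the dataframe and the set of already-stored hashes below are made-up stand-ins):

import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})    # stand-in for the daily dump
row_hashes = pd.util.hash_pandas_object(df, index=False)     # one uint64 hash per row

hashes_in_db = set()                            # placeholder: hashes stored alongside earlier loads
new_rows = df[~row_hashes.isin(hashes_in_db)]   # only rows not seen before get inserted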
I am trying to compare two tables in an sqlite3 database in python. One of the answers to this question:
Comparing two sqlite3 tables using python
gives a solution:
Alternatively, import them into SQLite tables. Then you can use queries like the following:
SELECT * FROM a INTERSECT SELECT * FROM b;
SELECT * FROM a EXCEPT SELECT * FROM b;
to get rows that exist in both tables, or only in one table.
This works great for tables with fewer than a million rows, but it is far too slow for my program, which requires comparing tables with more than ten billion rows. (The script took over ten minutes for just 100 million rows.)
Is there a faster way to compare two sqlite3 tables in python?
I thought about trying to compare the hashes of the two database files, but an overview of a program called dbhash on sqlite.org claims that even if the contents of two database files are the same, certain operations "can potentially cause vast changes to the raw database file, and hence cause very different SHA1 hashes at the file level". That makes me think this would not work unless I ran some sort of script to query all the data in an ordered fashion and then hashed that (like the dbhash program does), but would that even be faster?
Or should I be using another database entirely that can perform this comparison faster than sqlite3?
Any ideas or suggestions would be greatly appreciated.
Edit: There have been some good ideas put forward so far, but to clarify: the order of the tables doesn't matter, just the contents.
You could resort to the following workaround:
Add a column to each table where you store a hash over the content of all other columns.
Add an index to the new column.
Compute and store the hash with the record.
Compare the hash columns of your tables instead of using intersect/except.
If altering the tables isn't an option you can perhaps create new tables that relate a hash to the primary key or rowid of the hashed record.
With that, you shift part of the processing time needed for the comparison to the time you insert or update the records. I would expect this to be significantly faster at the time you execute the comparison than comparing all columns of all rows on the spot.
Of course your hash must be aware of the order of values and produce unique values for every permutation; a simple checksum won't suffice. Suggestion:
Convert every column value to a string.
Concatenate the strings with a separator that's guaranteed not to occur in the values themselves.
Use SHA1 or a similarly sophisticated hashing algorithm over the concatenated string.
You can test whether storing the hash as string, blob or integer (provided it fits into 64 bit) makes a difference in speed.
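A sketch of such a hash function in Python (the separator choice and NULL handling are assumptions you would adapt to your schema):

import hashlib

SEP = "\x1f"    # unit separator; assumed never to occur in the column values themselves

def row_hash(row):
    # row is a tuple of column values in a fixed order
    joined = SEP.join("" if v is None else str(v) for v in row)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

# e.g. to populate the extra column once per row (hash column name and rowid usage assumed):
# cur.execute("UPDATE a SET row_hash = ? WHERE rowid = ?", (row_hash(row), rid))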
Yes, it will take a lot of time for a single thread (or even several) on a single hard drive to crawl billions of rows.
It can obviously be better with stronger DB engines, but indexing all your columns would not really help in the end.
You have to resort to precalculation or to distributing your dataset across multiple systems...
If you have a LOT of RAM, you can try copying the SQLite files into /dev/shm first, allowing you to read your data straight from memory and get a performance boost.
There have been several similar questions online about processing large csv files into multiple PostgreSQL tables with Python. However, none seem to address a couple of concerns around optimizing database reads/writes and system memory/processing.
Say I have a row of product data that looks like this:
name,sku,datetime,decimal,decimal,decimal,decimal,decimal,decimal
The name and sku are stored in one table (the parent), and each decimal field is stored in a child EAV table that essentially contains the decimal value, parent_id, and datetime.
Let's say I have 20,000 of these rows in a csv file, so I end up chunking them. Right now, I take chunks of 2,000 rows and loop line by line. Each iteration checks whether the product exists and creates it if not, retrieving the parent_id. Then I build a large list of insert statements for the child table with the decimal values. If the user has selected to only overwrite non-modified decimal values, this also checks each individual decimal value to see whether it has been modified before adding it to the insert list.
In this example, in the worst case, I'd end up doing 160,000 database reads and anywhere from 10 to 20,010 writes. I'd also be storing up to 12,000 insert statements in a list in memory for each chunk (however, this would only be one list, so that part isn't as bad).
My main question is:
How can I optimize this to be faster, use fewer database operations (since this also affects network traffic), and use less processing and memory? I'd rather have the processing speed be slower if it could save on the other two, as those cost more money when translated to server/database pricing on something like AWS.
Some sub questions are:
Is there a way I can combine all the product read/writes and replace them in the file before doing the decimals?
Should I be doing a smaller chunk size to help with memory?
Should I be utilizing threads or keeping it linear?
Could I have it build a more efficient sql query that does the product create if not exists and referencing inline, thus moving some of the processing into sql rather than python?
Could I optimize the child insert statements to do something better than thousands of INSERT INTO statements?
A fun question, but one that's hard to answer precisely, since there are many
variables defining the best solution that may or may not apply.
Below is one approach, based on the following assumptions -
You don't need the database code to be portable.
The csv is structured with a header, or at the least the attribute names are
known and fixed.
The sku (or name/sku combo) in the product table has a unique constraint.
Likewise, the EAV table has a unique constraint on (product_id, attr_name).
Corollary - you didn't specify, but I also assume that the EAV table has a field
for the attribute name.
The process boils down to -
Load the data into the database by the fastest path possible
Unpivot the csv from a tabular structure to EAV structure during or after the load
"Upsert" the resulting records - update if present, insert otherwise.
Approach -
With all that background, and given a similar problem, here is the approach I would take.
Create temp tables mirroring the final destination, but without pks, types, or constraints
The temp tables will get deleted when the database session ends
Load the .csv straight into the temp tables in a single pass; two SQL executions per row
One for product
One for the EAV, using the 'multi-value' insert - insert into tmp_eav (sku, attr_name, attr_value) values (%s, %s, %s), (%s, %s, %s)....
psycopg2 has a custom method to do this for you: http://initd.org/psycopg/docs/extras.html#psycopg2.extras.execute_values
Select from the tmp tables to upsert into the final tables, using a statement like insert into product (name, sku) select name, sku from tmp_product on conflict (sku) do nothing (see the sketch after this list)
This requires PostgreSQL 9.5+.
For the user-selectable requirement to optionally update fields based on the csv, you can change do nothing to do update set col = excluded.col. excluded is the input row that conflicted
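A minimal sketch of the load-and-upsert steps above, assuming psycopg2, a placeholder connection string, and the unique constraint on product.sku from the assumptions:

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=me")     # placeholder connection string
cur = conn.cursor()

# temp tables mirroring the destination; dropped automatically when the session ends
cur.execute("CREATE TEMP TABLE tmp_product (name text, sku text)")
cur.execute("CREATE TEMP TABLE tmp_eav (sku text, attr_name text, attr_value text)")

# rows parsed from the csv; stand-ins for the real data
product_rows = [("Widget", "SKU-1")]
eav_rows = [("SKU-1", "foo", "1.23"), ("SKU-1", "bar", "4.56")]

execute_values(cur, "INSERT INTO tmp_product (name, sku) VALUES %s", product_rows)
execute_values(cur, "INSERT INTO tmp_eav (sku, attr_name, attr_value) VALUES %s", eav_rows)

# upsert the parent rows; the EAV upsert would join tmp_eav to product to resolve parent_id
cur.execute("""
    INSERT INTO product (name, sku)
    SELECT name, sku FROM tmp_product
    ON CONFLICT (sku) DO NOTHING
""")
conn.commit()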
Alternative approach -
Create the temp table based on the structure of the csv (this assumes you have enough metadata to do this on each run, or that the csv structure is fixed and can be consistently translated to a table)
Load the csv into the database using the COPY command (supported in psycopg2 via the cursor.copy_from method, passing in the csv as a file object); see the sketch after this list. This will be faster than anything you write in Python.
Caveat: this works if the csv is very dependable (same number of cols on
every row) and the temp table is very lax w/ nulls, all strings w/ no
type coercion.
You can 'unpivot' the csv rows with a union all query that combines a
select for each column to row transpose. The 6 decimals in your example
should be manageable.
For example:
select sku, 'foo' as attr_name, foo as attr_value from tmp_csv union all
select sku, 'bar' as attr_name, bar as attr_value from tmp_csv union all
...
order by sku;
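A sketch of that COPY load, using psycopg2's copy_expert instead of copy_from so that PostgreSQL handles the csv header and quoting (connection string, file path and column names are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")   # placeholder connection string
cur = conn.cursor()

# lax temp table: all columns as text, no type coercion
cur.execute("CREATE TEMP TABLE tmp_csv (name text, sku text, dt text, foo text, bar text)")

with open("products.csv") as f:                  # placeholder csv path
    cur.copy_expert("COPY tmp_csv FROM STDIN WITH (FORMAT csv, HEADER true)", f)
conn.commit()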
This solution hits a couple of the things you were interested in:
Python application memory remains flat
Network I/O is limited to what it takes to get the .csv into the db and issue
the right follow up sql statements
A little general advice to close out -
Optimal and "good enough" are almost never the same thing
Optimal is only required under very specific situations
So, aim for "good enough", but be precise about what "good enough" means -
i.e., pick one or two measures
Iterate, solving for one variable at a time. In my experience, the first hurdle (say, "end to end processing time less than
X seconds") is often sufficient.
I have large text files upon which all kinds of operations need to be performed, mostly involving row by row validations. The data are generally of a sales / transaction nature, and thus tend to contain a huge amount of redundant information across rows, such as customer names. Iterating and manipulating this data has become such a common task that I'm writing a library in C that I hope to make available as a Python module.
In one test, I found that out of 1.3 million column values, only ~300,000 were unique. Memory overhead is a concern, as our Python based web application could be handling simultaneous requests for large data sets.
My first attempt was to read in the file and insert each column value into a binary search tree. If the value has never been seen before, memory is allocated to store the string, otherwise a pointer to the existing storage for that value is returned. This works well for data sets of ~100,000 rows. Much larger and everything grinds to a halt, and memory consumption skyrockets. I assume the overhead of all those node pointers in the tree isn't helping, and using strcmp for the binary search becomes very painful.
This unsatisfactory performance leads me to believe I should invest in using a hash table instead. This, however, raises another point -- I have no idea ahead of time how many records there are. It could be 10, or ten million. How do I strike the right balance of time / space to prevent resizing my hash table again and again?
What are the best data structure candidates in a situation like this?
Thank you for your time.
Hash table resizing isn't a concern unless you have a requirement that each insert into the table should take the same amount of time. As long as you always expand the hash table size by a constant factor (e.g. always increasing the size by 50%), the computational cost of adding an extra element is amortized O(1). This means that n insertion operations (when n is large) will take an amount of time that is proportionate to n - however, the actual time per insertion may vary wildly (in practice, one of the insertions will be very slow while the others will be very fast, but the average of all operations is small). The reason for this is that when you insert an extra element that forces the table to expand from e.g. 1000000 to 1500000 elements, that insert will take a lot of time, but now you've bought yourself 500000 extremely fast future inserts before you need to resize again. In short, I'd definitely go for a hash table.
You need to use incremental resizing of your hash table. In my current project, I keep track of the hash key size used in every bucket, and if that size is below the current key size of the table, then I rehash that bucket on an insert or lookup. On a resize of the hash table, the key gains an extra bit (doubling the number of buckets), and in all the new buckets I just add a pointer back to the appropriate bucket in the existing table. So if n is the number of hash buckets, the hash expand code looks like this:
n = n * 2;
bucket = realloc(bucket, sizeof(bucket[0]) * n);   /* size of an element, not of the pointer */
for (i = 0, j = n / 2; j < n; i++, j++) {
    /* each new bucket initially points back at the old bucket it will split from */
    bucket[j] = bucket[i];
}
"library in C that I hope to make available as a Python module"
Python already has very efficient finely-tuned hash tables built in. I'd strongly suggest that you get your library/module working in Python first. Then check the speed. If that's not fast enough, profile it and remove any speed-humps that you find, perhaps by using Cython.
setup code:
shared_table = {}
string_sharer = shared_table.setdefault
scrunching each input row:
for i, field in enumerate(fields):
    fields[i] = string_sharer(field, field)
You may of course find after examining each column that some columns don't compress well and should be excluded from "scrunching".
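Putting the two snippets together, a self-contained sketch (the file name and the use of the csv module are assumptions about how the rows are read):

import csv

shared_table = {}
string_sharer = shared_table.setdefault

scrunched_rows = []
with open("sales.csv", newline="") as f:    # hypothetical input file
    for fields in csv.reader(f):
        for i, field in enumerate(fields):
            # identical values across rows end up sharing a single string object
            fields[i] = string_sharer(field, field)
        scrunched_rows.append(fields)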