How can I quickly search for specific data in MS SQL from Python when I have three dependent variables and about 10 million rows? Should I use the pyodbc library or something else?
Starting from the fact that 10 million rows in an RDBMS table is not a huge number: I personally used TurboODBC with MSSQL to migrate a table with ~2 billion rows and was very satisfied.
You should carefully review and optimize your query/schema (adding indexes, exploiting them, etc.) regardless of the technology you use to run it. Most likely the database query execution will be the bottleneck of your process (assuming that you will not process this data in Python).
Using a columnstore index is appropriate for a high volume of rows in a table (many millions at least). If the criteria in the WHERE clause of your SELECT are the three variables, create such an index on those three columns. As a hypothetical example:
CREATE COLUMNSTORE INDEX XC_MyTable ON MyTable (Col1, Col2, Col3);
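On the Python side, a plain pyodbc parameterized query against such an index is then straightforward. A minimal sketch, assuming hypothetical connection details, table and column names:

import pyodbc

# Connection details are placeholders; adjust driver/server/credentials to your setup.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;DATABASE=mydb;UID=myuser;PWD=mypassword"
)
cursor = conn.cursor()

# Example filter values for the three columns (placeholders).
val1, val2, val3 = 42, 'A', '2020-01-01'

# A parameterized WHERE clause on the three indexed columns lets SQL Server
# reuse the execution plan and the (Col1, Col2, Col3) index.
cursor.execute(
    "SELECT Col1, Col2, Col3 FROM MyTable "
    "WHERE Col1 = ? AND Col2 = ? AND Col3 = ?",
    (val1, val2, val3),
)
rows = cursor.fetchall()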
Related
Using a JDBC connection from Python, I have about 50 million rows to insert into DB2.
I initially tried to insert one row at a time, which resulted in a huge processing lag.
Therefore, I changed it to insert 1,000 rows per attempt, which clearly sped up processing. However, I wonder if there is a clear guideline for this type of insertion.
How does one dynamically determine an optimal batch size for bulk insertion without holding the connection open for too long?
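There is no universal guideline; a common approach is to wrap executemany() in a helper with a configurable batch size and simply time a few candidate sizes (1,000 / 10,000 / 50,000) against your own data. A rough, driver-agnostic DB-API sketch, with placeholder table and column names:

from itertools import islice

def insert_in_batches(connection, rows, batch_size=1000):
    # Insert rows in fixed-size batches, committing after each batch so that
    # no single transaction grows too large and the connection is never tied
    # up by one huge statement.
    sql = "INSERT INTO my_table (col_a, col_b, col_c) VALUES (?, ?, ?)"
    cursor = connection.cursor()
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            break
        cursor.executemany(sql, batch)
        connection.commit()

The sweet spot is usually the point where per-round-trip overhead stops dominating; beyond that, larger batches mostly increase memory use and how long locks and the connection are held.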
I have a database with a single indexed UNIQUE text column, currently at 100m rows but it could go up to a billion. This is for doing millions of "contains" queries on the DB, so the index is necessary for fast lookup.
The problem I'm having is that imports have slowed down a lot as the DB has grown. I've tried a lot of ideas, including something like this (it reads roughly 100k lines from a file and does a batch insert):
lines = f.readlines(3355629)  # size hint in bytes, roughly 100k lines' worth
while lines:
    c.execute('BEGIN TRANSACTION')
    rows_modified += c.executemany(
        'INSERT OR IGNORE INTO mydb (name) values (?)',
        map(lambda name: (name.strip().lower(),), lines)
    ).rowcount
    c.execute('COMMIT')
    lines = f.readlines(3355629)
The above takes many minutes to insert 100k lines once the DB is at 100m rows.
I've imported into Postgres with a Python script that can do 100k inserts on an indexed column in 2 seconds at 100m rows (using psycopg2.extras.execute_values).
I keep seeing claims that SQLite can easily handle terabytes of data, yet I can't figure out how to get the data in. By the way, I can't drop the index and recreate it, because then I'd need additional code to make sure the data is unique, and you can't change a column to UNIQUE after creation. Multiple tables are an option for a speed increase, though they make things more complex.
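For reference, the psycopg2.extras.execute_values approach mentioned above looks roughly like this (a sketch; the connection string is a placeholder, and ON CONFLICT DO NOTHING stands in for SQLite's INSERT OR IGNORE):

import psycopg2
from psycopg2.extras import execute_values

conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()

# 'lines' is a batch of lines read from the file, as in the SQLite version above.
rows = [(line.strip().lower(),) for line in lines]
execute_values(
    cur,
    "INSERT INTO mydb (name) VALUES %s ON CONFLICT DO NOTHING",
    rows,
    page_size=1000,
)
conn.commit()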
I tried the cx_Oracle package with fetchall(), but this ended up consuming a lot of RAM.
I tried Pandas as well, but it doesn't seem to be that efficient either when we have billions of records.
My use case: fetch each row from an Oracle table into Python, do some processing, and load it into another table.
PS: I was expecting something like fetchmany(); I tried it but couldn't get it to work.
Since your machine's memory is "too small" for your large data set, you will have to process batches of rows, reinserting each batch before fetching the next one.
Tuning Cursor.arraysize and Cursor.prefetchrows (the latter new in cx_Oracle 8) will be important for your fetch performance. These values are used for internal buffer sizes, regardless of whether you use fetchone() or fetchmany(). See Tuning cx_Oracle.
Use executemany() when reinserting the values, see Batch Statement Execution and Bulk Loading.
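A rough sketch of that batched fetch-and-reinsert pattern with cx_Oracle; the table names, column names, transform() step and buffer sizes are placeholder assumptions:

import cx_Oracle

def transform(value):
    # Stand-in for whatever per-row processing you need.
    return value

conn = cx_Oracle.connect("user/password@dsn")
read_cur = conn.cursor()
write_cur = conn.cursor()

read_cur.arraysize = 5000       # rows fetched per round trip
read_cur.prefetchrows = 5001    # cx_Oracle 8+; often set to arraysize + 1

read_cur.execute("SELECT id, payload FROM source_table")
while True:
    rows = read_cur.fetchmany()  # defaults to arraysize rows
    if not rows:
        break
    processed = [(r[0], transform(r[1])) for r in rows]
    write_cur.executemany(
        "INSERT INTO target_table (id, payload) VALUES (:1, :2)",
        processed,
    )
    conn.commit()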
The big question is whether you even need to fetch into Python, or whether you can do the manipulation in PL/SQL or SQL - using these two will remove the need to transfer data across the network so will be much, much more efficient.
Generally, your strategy should be the following:
Write an SQL query that gets the data you want.
Execute the query.
Loop: read a tuple from the result.
Process the tuple.
Write the processed result to the target relation.
There are lots of details to deal with, however.
Generally, when executing a query of this kind, you'll want to read from a consistent snapshot of the database. If there are simultaneous writes to the same data, you'll get a slightly stale version, but you won't hold up the writers. If you need the most recent version, you have a hard problem to solve.
When writing to the target relation, don't open a new connection each time; reuse one connection. Opening a new connection is akin to launching a battleship.
It's best if you don't have any indexes on the target relation, since you'll pay the cost of updating them each time you write a tuple. It will still work, just more slowly. You can build the indexes after the processing is finished.
If this implementation is too slow for you, an easy way to speed things up is to process, say, 1000 tuples at once and write them in batches. If that's too slow, you've entered into the murky world of database performance tuning.
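A generic DB-API sketch of that batched loop; the SQL, the process() callable and both connections are placeholders you would supply:

def copy_and_process(source_conn, target_conn, process, batch_size=1000):
    # Stream rows from the source query, transform each one, and write the
    # results to the target relation in batches on a reused connection.
    src = source_conn.cursor()
    dst = target_conn.cursor()
    src.execute("SELECT id, value FROM source_table")
    buffer = []
    for row in src:                       # most DB-API drivers allow iterating the cursor
        buffer.append(process(row))
        if len(buffer) >= batch_size:
            dst.executemany("INSERT INTO target_table (id, value) VALUES (?, ?)", buffer)
            target_conn.commit()
            buffer.clear()
    if buffer:                            # flush the final partial batch
        dst.executemany("INSERT INTO target_table (id, value) VALUES (?, ?)", buffer)
        target_conn.commit()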
You can use offset-limit pagination. If the data can be ordered by a column (a primary key, or a combination of columns, so that the ordering is the same across executions), here's what I tried.
I take the total COUNT(*) of the table.
Based on a chunk size (e.g. 100,000), create a list of offsets [0, 0+chunk, 0+chunk*2, ..., count(*)].
For example, if count(*) = 2222 and chunk = 100, the list will be:
[0, 100, 200, ..., 2222]
Then, using the values of the list above, I fetch each partition with 'OFFSET i ROWS FETCH NEXT j ROWS ONLY'.
This selects the data in partitions without overlapping any rows.
I change the values of i and j in each iteration so that I cover the complete data set.
At runtime, with the given chunk of 100,000, only that many rows are loaded into memory at a time. You can change it to any value based on your system's specs.
At each iteration, save the data wherever you want.
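As a rough sketch, assuming Oracle-style OFFSET ... FETCH syntax, an existing DB-API connection conn, and placeholder table, ordering column and save step:

CHUNK = 100_000

cur = conn.cursor()
cur.execute("SELECT COUNT(*) FROM my_table")
total = cur.fetchone()[0]

for offset in range(0, total, CHUNK):
    cur.execute(
        "SELECT id, value FROM my_table "
        "ORDER BY id "
        f"OFFSET {offset} ROWS FETCH NEXT {CHUNK} ROWS ONLY"
    )
    rows = cur.fetchall()       # at most CHUNK rows in memory at a time
    save_chunk(rows)            # placeholder for whatever you do with each partition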
I have a set of large tables with many records each. I'm writing a Python program that SELECTs a large number of records from these tables, based on the value of multiple columns on those records.
Essentially, these are going to be lots of queries of the form:
SELECT <some columns> FROM <some table> WHERE <column1=val1 AND column2=val2...>
Each table has a different set of columns, but otherwise the SELECT formula above holds.
By default, I was going to just run all these queries through the psycopg2 PostgreSQL database driver, each as a separate query. But I'm wondering if there's a more efficient way of going about it, given that there will be a very large number of such queries - thousands or more.
If the SELECT list entries are the same for all queries (the same number of entries and the same data types), you can use UNION ALL to combine several such queries. This won't reduce the amount of work for the database, but it will reduce the number of client-server round trips. This can be a huge improvement, because for short queries network latency is often the dominant cost.
If all your queries have different SELECT lists, there is nothing you can do.
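As a rough sketch with psycopg2 (table and column names are placeholders), the combined query can be built like this:

import psycopg2

conn = psycopg2.connect("dbname=mydb user=me")
cur = conn.cursor()

# Each tuple holds the values for (column1, column2) in one of the original queries.
filters = [(1, 'a'), (2, 'b'), (3, 'c')]

single = "SELECT col_x, col_y FROM my_table WHERE column1 = %s AND column2 = %s"
query = " UNION ALL ".join([single] * len(filters))
params = [value for pair in filters for value in pair]

cur.execute(query, params)
rows = cur.fetchall()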
I am trying to compare two tables in an sqlite3 database in Python. One of the answers to this question:
Comparing two sqlite3 tables using python
gives a solution:
Alternatively, import them into SQLite tables. Then you can use queries like the following:
SELECT * FROM a INTERSECT SELECT * FROM b;
SELECT * FROM a EXCEPT SELECT * FROM b;
to get rows that exist in both tables, or only in one table.
This works great for tables with less than a million rows, but is far too slow for my program which requires comparing tables with more than ten billion rows. (Script took over ten minutes for just 100 million rows.)
Is there a faster way to compare two sqlite3 tables in python?
I thought about trying to compare the hashes of the two database files, but an overview of a program called dbhash on sqlite.org claims that even if the contents of two database files are the same, certain operations "can potentially cause vast changes to the raw database file, and hence cause very different SHA1 hashes at the file level." That makes me think this would not work unless I ran some sort of script to query all the data in an ordered fashion and then hashed that (like the dbhash program does), but would that even be faster?
Or should I be using another database entirely, one that can perform this comparison faster than sqlite3?
Any ideas or suggestions would be greatly appreciated.
Edit: There have been some good ideas put forward so far, but to clarify: the order of the tables doesn't matter, just the contents.
You could resort to the following workaround:
Add a column to each table where you store a hash over the content of all other columns.
Add an index to the new column.
Compute and store the hash with the record.
Compare the hash columns of your tables instead of using intersect/except.
If altering the tables isn't an option you can perhaps create new tables that relate a hash to the primary key or rowid of the hashed record.
With that, you shift part of the processing time needed for the comparison to the time you insert/update the records. I would expect this to be significantly faster at comparison time than comparing all columns of all rows on the spot.
Of course your hash must be aware of the order of values and produce unique values for every permutation; a simple checksum won't suffice. Suggestion:
Convert every column value to a string.
Concatenate the strings with a separator that's guaranteed not to occur in the values themselves.
Use SHA1 or a similarly sophisticated hashing algorithm over the concatenated string.
You can test whether storing the hash as string, blob or integer (provided it fits into 64 bit) makes a difference in speed.
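A rough sketch of the whole idea with Python's sqlite3 module; the column names, separator and population loop are illustrative only and not tuned for billions of rows:

import hashlib
import sqlite3

SEP = "\x1f"   # unit separator; assumed never to occur in the column values

def row_hash(values):
    # Order-sensitive hash over all column values, as described above.
    joined = SEP.join(str(v) for v in values)
    return hashlib.sha1(joined.encode("utf-8")).hexdigest()

conn = sqlite3.connect("mydb.sqlite")
conn.execute("ALTER TABLE a ADD COLUMN content_hash TEXT")

for rowid, col1, col2, col3 in conn.execute("SELECT rowid, col1, col2, col3 FROM a").fetchall():
    conn.execute(
        "UPDATE a SET content_hash = ? WHERE rowid = ?",
        (row_hash((col1, col2, col3)), rowid),
    )

conn.execute("CREATE INDEX idx_a_hash ON a (content_hash)")   # index after populating
conn.commit()

# After doing the same for table b, the comparison touches only the hash columns:
#   SELECT content_hash FROM a EXCEPT SELECT content_hash FROM b;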
Yes, it will take a lot of time for a single thread (or even several) on a single hard drive to crawl billions of rows.
It can obviously be better with stronger DB engines, but indexing all your columns would not really help in the end.
You have to resort to precalculation or distributing your dataset amongst multiple systems...
If you have a LOT of RAM, you can try copying the SQLite files into /dev/shm first, allowing you to read your data straight from memory and get a performance boost.
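A minimal sketch of that trick, assuming a Linux box with enough free space in /dev/shm and a placeholder file path:

import shutil
import sqlite3

shutil.copy("/data/mydb.sqlite", "/dev/shm/mydb.sqlite")   # /dev/shm is a tmpfs held in RAM
conn = sqlite3.connect("file:/dev/shm/mydb.sqlite?mode=ro", uri=True)  # open the in-RAM copy read-only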