Fast sqlite data retrieval with pandas - python

For one batch process I query my database ~30,000 times. Every query returns between 0 and 20,000 rows. Profiling the code shows that most of the time is spent getting this data out.
I tested 2 different methods with similar results. Assuming that the database schema is not the bottleneck, what can I do to speed up the data retrieval (besides going parallel)?
I was thinking of:
using another library/wrapper for sqlite
getting rid of pandas, but a lot of computation is done later on with the df; would dask eventually be faster?
wrapping it in Cython/Julia?
import sqlite3
import pandas as pd

db = sqlite3.connect(r'D:\data.db')
cur = db.cursor()

# 1. just pandas, 231 ms
t = pd.read_sql("SELECT * FROM daily d WHERE d.id == 1", db)

# 2. DB-API & pandas, 242 ms
# the query itself takes about half the time of 1., but building the df costs time
cur.execute("SELECT * FROM daily d WHERE d.id == 1")
rows = cur.fetchall()
col = [c[0] for c in cur.description]  # column names for the DataFrame
t = pd.DataFrame(rows, columns=col)

The question is very general given that we know little about the processing you do with Pandas.
The one suggestion, though, would be to move as much processing as possible (particularly anything that limits the size of the dataframe) into SQL, for SQLite to process.
So, if there is any filtering done in pandas, I would strive to move it to SQL, even if the condition for the SELECT or GROUP BY is somewhat cumbersome. There is a cost to copying the data into the Python realm, and pandas eats up memory and time.
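For illustration, here is a minimal sketch of that idea, assuming a hypothetical value column that would otherwise be filtered in pandas:

import sqlite3
import pandas as pd

db = sqlite3.connect(r'D:\data.db')

# instead of pulling everything and filtering in pandas ...
t = pd.read_sql("SELECT * FROM daily d WHERE d.id = 1", db)
t = t[t["value"] > 0]  # hypothetical pandas filter

# ... push the filter into the query so SQLite only returns the rows you need
t = pd.read_sql(
    "SELECT * FROM daily d WHERE d.id = ? AND d.value > 0",
    db,
    params=(1,),
)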

Related

Sql Select statement Optimization

I have made a test table in SQL with the following information schema as shown:
Now I extract this information using a Python script, the code of which is shown below:
import pandas as pd
import mysql.connector
db = mysql.connector.connect(host="localhost", user="root", passwd="abcdef")
pointer = db.cursor()
pointer.execute("use holdings")
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
pointer.execute(x)
rows = pointer.fetchall()
rows = pd.DataFrame(rows)
stock = rows[1]
The production table contains 200 unique trading symbols and has a schema similar to the test table.
My concern is that, for the following statement,
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
I will have to replace the value of tradingsymbol 200 times, which is inefficient.
Is there an effective way to do this?
If I understand you correctly, your problem is that you want to avoid sending a separate query for each trading symbol, correct? In that case MySQL's IN operator might be of help: you can send one query to the database containing all the trading symbols you want. If you then want to do different things with the various trading symbols, you can select the subsets within pandas.
Another performance improvement could be pandas.read_sql, since this speeds up the creation of the dataframe somewhat.
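As a rough sketch combining both ideas (the symbol list beyond TATACHEM is made up, and newer pandas versions prefer an SQLAlchemy engine over a raw mysql.connector connection):

import pandas as pd
import mysql.connector

db = mysql.connector.connect(host="localhost", user="root",
                             passwd="abcdef", database="holdings")

symbols = ["TATACHEM", "INFY", "TCS"]  # hypothetical list of your 200 symbols
placeholders = ", ".join(["%s"] * len(symbols))
query = "SELECT * FROM orders WHERE tradingsymbol IN (%s)" % placeholders

# one round trip for all symbols; split into per-symbol frames in pandas
df = pd.read_sql(query, db, params=symbols)
per_symbol = {sym: grp for sym, grp in df.groupby("tradingsymbol")}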
Two more things to add for efficiency:
Ensure that tradingsymbol is indexed in MySQL for faster lookups.
Make tradingsymbol an ENUM to ensure that no typos or the like are accepted. Otherwise the above-mentioned IN method also does not work, since it does a full-text comparison.

Querying large data set from Postgresql and processing

My question is about memory and performance when querying a large data set and then processing it.
Long story short: because of a bug, I am querying a table and getting all results between two timestamps. My Python script crashed because it ran out of memory (this table is very wide and holds a massive JSON object), so I changed the query to only return the primary key of each row.
select id from *table_name*
where updated_on between %(date_one)s and %(date_two)s
order by updated_on asc
From here I loop through and query each row one by one, by its primary key, to get the row data.
for primary_key in query_results:  # the ids returned by the query above
    row_data = data_helper.get_by_id(primary_key)
    # from here I do some formatting and throw a message on a queue processor,
    # this is not heavy processing
Example:
queue_helper.put_message_on_queue('update_related_tables', message_dict)
My question is: is this a "good" way of doing this? Do I need to help Python with GC, or will Python clean up the memory after each iteration of the loop?
It must be a very wide table, then? That doesn't seem like too crazy a number of rows. Anyway, you can make a lazy function to yield the data x rows at a time. It's not stated how you're executing your query, but this is a SQLAlchemy/psycopg2 implementation:
with engine.connect() as conn:
    result = conn.execute(query)  # `query` is your SELECT statement
    while True:
        chunk = result.fetchmany(x)
        if not chunk:
            break
        for row in chunk:
            heavy_processing(row)
This is pretty similar to it's-yer-boy-chet's answer, except it just uses the lower-level psycopg2 library instead of SQLAlchemy. Iterating over the cursor implicitly calls fetchone(), returning one row at a time, which keeps the memory footprint relatively small provided there aren't thousands and thousands of columns returned by the query. Not sure if it necessarily provides any performance benefit over SQLAlchemy; it might be doing basically the same thing under the hood.
If you still need more performance after that, I'd look into a different database connection library like asyncpg.
import psycopg2

conn = psycopg2.connect(user='user', password='password', host='host', database='database')
cursor = conn.cursor()
cursor.execute(query)  # execute() returns None in psycopg2, so iterate over the cursor itself
for row in cursor:
    message_dict = format_message(row)
    queue_helper.put_message_on_queue('update_related_tables', message_dict)
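And a rough sketch of the asyncpg route mentioned above, reusing the same placeholder names; the async plumbing here is an assumption, not something tested against the original setup:

import asyncio
import asyncpg

async def process(query):
    conn = await asyncpg.connect(user='user', password='password',
                                 host='host', database='database')
    try:
        # asyncpg cursors stream rows and must be used inside a transaction
        async with conn.transaction():
            async for row in conn.cursor(query):
                message_dict = format_message(row)
                queue_helper.put_message_on_queue('update_related_tables', message_dict)
    finally:
        await conn.close()

asyncio.run(process(query))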

Speeding up GROUP BY clause in SQL (Python/Pandas)

I have searched this website thoroughly and have not been able to find a solution that works for me. I code in Python and have very little SQL knowledge. I currently need to write code that pulls data from a SQL database and organizes/summarizes it. My code is below (it has been scrubbed for data security purposes):
import psycopg2 as pc          # presumed import behind `pc`
import pandas.io.sql as pdsql  # presumed import behind `pdsql`

conn = pc.connect(host=myhost, dbname=mydb, port=myport, user=myuser, password=mypassword)
cur = conn.cursor()
query = ("""CREATE INDEX index ON myTable3 USING btree (name);
CREATE INDEX index2 ON myTable USING btree (date, state);
CREATE INDEX index3 ON myTable4 USING btree (currency, type);
SELECT tp.name AS trading_party_a,
tp2.name AS trading_party_b,
('1970-01-01 00:00:00'::timestamp without time zone + ((mc.date)::double precision * '00:00:00.001'::interval)) AS val_date,
mco.currency,
mco.type AS type,
mc.state,
COUNT(*) as call_count,
SUM(mco.call_amount) as total_call_sum,
SUM(mco.agreed_amount) as agreed_sum,
SUM(disputed_amount) as disputed_sum
FROM myTable mc
INNER JOIN myTable2 cp ON mc.a_amp_id = cp.amp_id
INNER JOIN myTable3 tp ON cp.amp_id = tp.amp_id
INNER JOIN myTable2 cp2 ON mc.b_amp_id = cp2.amp_id
INNER JOIN myTable3 tp2 ON cp2.amp_id = tp2.amp_id,
myTable4 mco
WHERE (((mc.amp_id)::text = (mco.call_amp_id)::text))
GROUP BY tp.name, tp2.name,
mc.date, mco.currency, mco.type, mc.state
LIMIT 1000""")
frame = pdsql.read_sql_query(query,conn)
The query takes over 15 minutes to run, even when my limit is set to 5. Before the GROUP BY clause was added, it would run with LIMIT 5000 in under 10 seconds. I was wondering, as I'm aware my SQL is not great, whether anybody has any insight into what might be causing the delay, as well as any improvements to be made.
EDIT: I do not know how to view the performance of a SQL query, but if someone could show me how to do that as well, I could post the performance of the script.
In regards to speeding up your workflow, you might be interested in checking out the 3rd part of my answer on this post : https://stackoverflow.com/a/50457922/5922920
If you want to keep an SQL-like interface while using a distributed file system, you might want to have a look at Hive, Pig and Sqoop, in addition to Hadoop and Spark.
Also, to trace the performance of your SQL query, you can always track the execution time of your code on the client side, if appropriate.
For example :
import timeit

start_time = timeit.default_timer()
# Your code here
end_time = timeit.default_timer()
print(end_time - start_time)
Or use tools like those to have a deeper look at what is going on: https://stackify.com/performance-tuning-in-sql-server-find-slow-queries/
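If you want to see what the database itself is doing (as asked in the EDIT), a minimal sketch, assuming the same psycopg2-style connection as in the question, is to run the SELECT part of the query through PostgreSQL's EXPLAIN ANALYZE and print the plan:

# hedged sketch: `conn` is the connection from the question, and `select_query`
# is a hypothetical variable holding the SELECT alone (without the CREATE INDEX statements)
cur = conn.cursor()
cur.execute("EXPLAIN ANALYZE " + select_query)
for line in cur.fetchall():
    print(line[0])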
I think the delay is because SQL runs the GROUP BY first and then everything else: it has to go through your entire large dataset to group everything, and then go through it again to pull values and do the counts and summations, before the LIMIT can be applied.
Without the GROUP BY, it does not have to process the entire dataset before it can start generating results; it jumps right into pulling the values you want and can stop as soon as the LIMIT is reached.

Python/Psycopg2/PostgreSQL Copy_From loop gets slower as it progresses

I have written a Python script that takes a 1.5 GB XML file, parses out data and feeds it to a database using copy_from. It invokes the following function every 1000 parsed nodes. There are about 170k nodes in all, which update about 300k rows or more. It starts out quite fast and then gets progressively slower as time goes on. Any ideas on why this is happening and what I can do to fix it?
Here is the function where I feed the data to the db.
import cStringIO
import psycopg2

def db_update(val_str, tbl, cols):
    conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
    cur = conn.cursor()
    output = cStringIO.StringIO()
    output.write(val_str)
    output.seek(0)
    cur.copy_from(output, tbl, sep='\t', columns=cols)
    conn.commit()
I haven't included the xml parsing as I don't think that's an issue. Without the db the parser executes in under 2 minutes.
There are several things that can slow inserts as tables grow:
Triggers that have to do more work as the DB grows
Indexes, which get more expensive to update as they grow
Disable any non-critical triggers, or if that isn't possible re-design them to run in constant time.
Drop indexes, then create them after the data has been loaded. If you need any indexes for the actual INSERTs or UPDATEs, you'll need to keep them and wear the cost.
If you're doing lots of UPDATEs, consider VACUUMing the table periodically, or setting autovacuum to run very aggressively. That'll help Pg re-use space rather than more expensively allocating new space from the file system, and will help avoid table bloat.
You'll also save time by not re-connecting for each block of work. Maintain a connection.
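As a concrete, hedged illustration of the index and autovacuum points (the index, table and column names below are placeholders rather than taken from the question, and conn/cur are assumed to be a psycopg2 connection and cursor like the ones in the question):

# drop a non-critical index before the bulk load, recreate it afterwards
cur.execute("DROP INDEX IF EXISTS my_table_some_col_idx")  # hypothetical index name
# ... run the copy_from loop ...
cur.execute("CREATE INDEX my_table_some_col_idx ON my_table (some_col)")

# make autovacuum much more aggressive for a heavily UPDATEd table
cur.execute("ALTER TABLE my_table SET (autovacuum_vacuum_scale_factor = 0.01)")
conn.commit()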
From personal experience, copy_from doesn't update any indexes after you commit anything, so you will have to do it later. I would move your conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>") and cur = conn.cursor() outside of the function and do a commit() when you've finished inserting everything (I suggest committing every ~100k rows or it will start getting slow).
Also, it may seem stupid, but it has happened to me a lot of times: make sure you reset your val_str after you call db_update. For me, when the copy_from/inserts start to go slower it's because I'm inserting the same rows plus more rows.
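Putting those suggestions together, a minimal sketch of the reworked function might look like this (shown with Python 3's io.StringIO, whereas the original used Python 2's cStringIO; the connection string is the placeholder one from the question):

import io
import psycopg2

# connect once, outside the per-batch function, and reuse the connection
conn = psycopg2.connect("dbname=<mydb> user=postgres password=<mypw>")
cur = conn.cursor()

def db_update(val_str, tbl, cols):
    buf = io.StringIO(val_str)
    cur.copy_from(buf, tbl, sep='\t', columns=cols)

# in the parsing loop: call db_update(...) per batch, reset val_str afterwards,
# and commit only every ~100k rows (and once at the very end) with conn.commit()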
I am using the following and, as far as I have seen, I don't get any hit on performance:
import psycopg2
import psycopg2.extras
local_conn_string = """
host='localhost'
port='5432'
dbname='backupdata'
user='postgres'
password='123'"""
local_conn = psycopg2.connect(local_conn_string)
local_cursor = local_conn.cursor(
    'cursor_unique_name',
    cursor_factory=psycopg2.extras.DictCursor)
I have added the following output to my code to test the run-time (and I am parsing a LOT of rows, more than 30,000,000):
Parsed 2600000 rows in 00:25:21
Parsed 2700000 rows in 00:26:19
Parsed 2800000 rows in 00:27:16
Parsed 2900000 rows in 00:28:15
Parsed 3000000 rows in 00:29:13
Parsed 3100000 rows in 00:30:11
I have to mention I don't "copy" anything; I am moving my rows from a remote PostgreSQL to a local one, and in the process creating a few more tables to index my data better than before, as 30,000,000+ rows is a bit too much to handle with regular queries.
NB: The time is counting upwards and is not for each query.
I believe it has to do with the way my cursor is created.
EDIT1:
I am using the following to run my query:
local_cursor.execute("""SELECT * FROM data;""")

row_count = 0
for row in local_cursor:
    if row_count % 100000 == 0 and row_count != 0:
        print("Parsed %s rows in %s" % (row_count,
                                        my_timer.get_time_hhmmss()))
    parse_row(row)
    row_count += 1

print("Finished running script!")
print("Parsed %s rows" % row_count)
my_timer is a timer class I've made, and the parse_row(row) function formats my data, transfers it to my local DB and eventually deletes it from the remote DB once the data is verified as having been moved to the local DB.
EDIT2:
It takes roughly 1 minute to parse every 100,000 rows in my DB, even after parsing around 4,000,000 rows:
Parsed 3800000 rows in 00:36:56
Parsed 3900000 rows in 00:37:54
Parsed 4000000 rows in 00:38:52
Parsed 4100000 rows in 00:39:50

Querying relational data in a reasonable amount of time

I have a spreadsheet with about 1.7m lines, totalling 1 GB, and need to perform queries on it.
Being most comfortable with Python, my first approach was to hack together a bunch of dictionaries keyed in a way that would facilitate the queries I was trying to make. E.g. if I needed to access everyone with a particular area code and age I would make an areacode_age 2-dimensional dictionary. I ended up needing quite a few which multiplied memory footprint to ~10GB, and even though I had enough RAM the process was slow.
I imported sqlite3 and imported my data into an in-memory database. Turns out doing a query like SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f takes 0.05 seconds. Doing this for my 1.7m rows would take 24 hours. My hacky approach with dictionaries was about 3 orders of magnitude faster for this particular task (and, in this example, I couldn't key on date1 and date2, so I was getting every row that matched name and then filtering by date).
Why is this so slow, and how can I make it fast? And what is the Pythonic approach? Possibilities I've been considering:
sqlite3 is too slow and I need something more heavyweight.
I need to change my schema or my queries to be more optimized.
I need a new tool of some kind.
I read somewhere that in sqlite 3, doing repeated calls to cursor.execute is much slower than using cursor.executemany. It turns out executemany isn't compatible with select statements though, so I think this was a red herring.
sqlite3 is too slow, and I need something more heavyweight
First, sqlite3 is fast, sometimes faster than MySQL.
Second, you have to use an index; putting a compound index on (date1, date2, name) will speed things up significantly.
It turns out though, that doing a query like "SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f" takes 0.05 seconds. Doing this for my 1.7m rows would take 24 hours of compute time. My hacky approach with dictionaries was about 3 orders of magnitude faster for this particular task (and, in this example, I couldn't key on date1 and date2 obviously, so I was getting every row that matched name and then filtering by date).
Did you actually try this and observe that it was taking 24 hours? Processing time is not necessarily directly proportional to data size.
And are you suggesting that you might need to run SELECT (a, b, c) FROM foo WHERE date1<=d AND date2>e AND name=f 1.7 million times? You only need to run it once, and it will return the entire subset of rows matching your query.
1.7 million rows is not small, but certainly not an issue for a database entirely in memory on your local computer. (No slow disk access; no slow network access.)
Proof is in the pudding. This is pretty fast for me (most of the time is spent in generating ~ 10 million random floats.)
import sqlite3, random

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE numbers (a FLOAT, b FLOAT, c FLOAT, d FLOAT, e FLOAT, f FLOAT)")
for _ in range(1700000):
    data = [random.random() for _ in range(6)]
    conn.execute("INSERT INTO numbers VALUES (?,?,?,?,?,?)", data)
conn.commit()

print("done generating random numbers")

results = conn.execute("SELECT * FROM numbers WHERE a > 0.5 AND b < 0.5")
accumulator = 0
for row in results:
    accumulator += row[0]
print("Sum of column `a` where a > 0.5 and b < 0.5 is %f" % accumulator)
Edit: Okay, so you really do need to run this 1.7 million times.
In that case, what you probably want is an index. To quote Wikipedia:Database Index:
A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of slower writes and increased storage space. Indexes can be created using one or more columns of a database table, providing the basis for both rapid random lookups and efficient access of ordered records.
You would do something like CREATE INDEX dates_and_name ON foo(date1,date2,name) and then (I believe) execute the rest of your SELECT statements as usual. Try this and see if it speeds things up.
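A minimal sketch of that, reusing the conn from the snippet above and assuming a table foo with the columns from the question (the bound values d, e and f stand in for your actual parameters):

# build the compound index once, then run the per-lookup SELECT as usual
conn.execute("CREATE INDEX IF NOT EXISTS dates_and_name ON foo (date1, date2, name)")
conn.commit()

rows = conn.execute(
    "SELECT a, b, c FROM foo WHERE date1 <= ? AND date2 > ? AND name = ?",
    (d, e, f),
).fetchall()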
Since you are already talking SQL, the easiest approach would be:
Put all your data into a MySQL table. It should perform well for 1.7 million rows.
Add the indexes you need, check the settings, and make sure it will run fast on a big table.
Access it from Python.
...
Profit!
