Speeding up GROUP BY clause in SQL (Python/Pandas) - python

I have searched this website thoroughly and have not been able to find a solution that works for me. I code in Python and have very little SQL knowledge. I currently need to write code to pull data from a SQL database and organize/summarize it. My code is below (it has been scrubbed for data security purposes):
conn = pc.connect(host=myhost, dbname=mydb, port=myport, user=myuser, password=mypassword)
cur = conn.cursor()
query = ("""CREATE INDEX index ON myTable3 USING btree (name);
CREATE INDEX index2 ON myTable USING btree (date, state);
CREATE INDEX index3 ON myTable4 USING btree (currency, type);

SELECT tp.name AS trading_party_a,
       tp2.name AS trading_party_b,
       ('1970-01-01 00:00:00'::timestamp without time zone + ((mc.date)::double precision * '00:00:00.001'::interval)) AS val_date,
       mco.currency,
       mco.type AS type,
       mc.state,
       COUNT(*) AS call_count,
       SUM(mco.call_amount) AS total_call_sum,
       SUM(mco.agreed_amount) AS agreed_sum,
       SUM(disputed_amount) AS disputed_sum
FROM myTable mc
INNER JOIN myTable2 cp ON mc.a_amp_id = cp.amp_id
INNER JOIN myTable3 tp ON cp.amp_id = tp.amp_id
INNER JOIN myTable2 cp2 ON mc.b_amp_id = cp2.amp_id
INNER JOIN myTable3 tp2 ON cp2.amp_id = tp2.amp_id,
     myTable4 mco
WHERE (((mc.amp_id)::text = (mco.call_amp_id)::text))
GROUP BY tp.name, tp2.name,
         mc.date, mco.currency, mco.type, mc.state
LIMIT 1000""")
frame = pdsql.read_sql_query(query, conn)
The query takes over 15 minutes to run, even when the limit is set to 5. Before the GROUP BY clause was added, it would run with LIMIT 5000 in under 10 seconds. Since I'm aware my SQL is not great, I was wondering if anybody has any insight into what might be causing the delay, as well as any improvements that could be made.
EDIT: I do not know how to view the performance of a SQL query, but if someone could inform me on this as well, I could post the performance of the script.

In regards to speeding up your workflow, you might be interested in checking out the third part of my answer on this post: https://stackoverflow.com/a/50457922/5922920
If you want to keep a SQL-like interface while using a distributed file system, you might want to have a look at Hive, Pig and Sqoop in addition to Hadoop and Spark.
Besides, to trace the performance of your SQL query, you can always track the execution time of your code on the client side, if appropriate.
For example:
import timeit

start_time = timeit.default_timer()
# Your code here
end_time = timeit.default_timer()
print(end_time - start_time)
Or use tools like these to take a deeper look at what is going on: https://stackify.com/performance-tuning-in-sql-server-find-slow-queries/
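Since the question also asks how to view a query's performance: on the database side, PostgreSQL's EXPLAIN ANALYZE prints the plan with per-node timings, and it can be run straight from Python. A minimal sketch, assuming the conn from the question and a hypothetical select_query variable that holds only the SELECT part (EXPLAIN cannot wrap the CREATE INDEX statements):

cur = conn.cursor()
cur.execute("EXPLAIN ANALYZE " + select_query)  # select_query is assumed to hold just the SELECT
for line in cur.fetchall():
    print(line[0])  # each row is one line of the query plan, including actual timings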

I think the delay is because SQL evaluates the GROUP BY before everything else. So it goes through your entire large dataset to group everything, and then goes through it again to pull values and do the counts and summations.
Without the GROUP BY, it does not have to scan the entire dataset before it can start generating results - it jumps right into summing and counting the values that you want.

Related

Sql Select statement Optimization

I have made a test table in SQL with the following information schema:
Now I extract this information using a Python script, the code of which is shown below:
import pandas as pd
import mysql.connector
db = mysql.connector.connect(host="localhost", user="root", passwd="abcdef")
pointer = db.cursor()
pointer.execute("use holdings")
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
pointer.execute(x)
rows = pointer.fetchall()
rows = pd.DataFrame(rows)
stock = rows[1]
The production table contains 200 unique trading symbols and has a schema similar to the test table.
My problem is that, for the following statement:
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
I will have to replace the value of tradingsymbol 200 times, which is inefficient.
Is there an effective way to do this?
If I understand you correctly, your problem is that you want to avoid sending a separate query for each trading symbol, correct? In this case, MySQL's IN operator might be of help. You could then simply send one query to the database containing all the trading symbols you want. If you want to do different things with the various trading symbols, you could select the subsets within pandas.
Another performance improvement could be pandas.read_sql, since this speeds up the creation of the DataFrame somewhat.
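A minimal sketch combining both ideas, with a hypothetical list of symbols and the connection details from the question:

import pandas as pd
import mysql.connector

db = mysql.connector.connect(host="localhost", user="root", passwd="abcdef", database="holdings")

# Hypothetical list; in practice this would hold all 200 trading symbols.
symbols = ["TATACHEM", "INFY", "TCS"]

# One parameterized query with IN instead of 200 separate queries.
placeholders = ", ".join(["%s"] * len(symbols))
query = "SELECT * FROM orders WHERE tradingsymbol IN ({})".format(placeholders)

all_orders = pd.read_sql(query, con=db, params=symbols)

# Per-symbol subsets can then be taken within pandas.
tatachem = all_orders[all_orders["tradingsymbol"] == "TATACHEM"]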
Two more things to add for efficiency:
Ensure that tradingsymbol is indexed in MySQL for faster lookups (see the sketch after this list).
Make tradingsymbol an ENUM to ensure that no typos or the like are accepted. Otherwise the above-mentioned IN method also does not work well, since it has to do a full string comparison.
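A minimal sketch of the index suggestion, reusing the cursor from the question (the index name is hypothetical):

# One-time setup: index the lookup column so the IN / LIKE filters avoid full table scans.
pointer.execute("CREATE INDEX idx_orders_tradingsymbol ON orders (tradingsymbol)")
db.commit()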

How to decrease the time SQLAlchemy takes to connect to the database and select data

I'm a beginner in data science, and my recent job is to select data from the company's database with some conditions using Python. I tried to achieve this using SQLAlchemy and an engine, but it takes too long to get all the rows I need. I can't see what I can do to reduce the time it takes.
For example, I use the following code to get the total orders of a store during a time period by its store_id in the database:
import pandas as pd
from sqlalchemy import create_engine, MetaData, select, Table, func, and_, or_, cast, Float
import pymysql
import datetime

# create engine and connect it to the database
engine = create_engine('mysql+pymysql://root:*******@127.0.0.1:3306/db')
connection = engine.connect()
metadata = MetaData()
order = Table('order', metadata, autoload=True, autoload_with=engine)

# use the store_id to get all the data in two months from the table
def order_df_func(store_id):
    df = pd.DataFrame()
    stmt = select([order.columns.gmt_create, order.columns.delete_status, order.columns.payment_time])
    stmt = stmt.where(
        and_(order.columns.store_id == store_id,
             order.columns.gmt_create <= datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
             order.columns.gmt_create >= get_day_zero(last_month_start.date())  # func defined to get 00:00 for a day
             )
    )
    results = connection.execute(stmt).fetchall()
    df = pd.DataFrame(results)
    df.columns = results[0].keys()
    return df

# get the data in a specific time period
def time_bounded_order_df(store_id, date_required_type, time_period):
    order_df = order_df_func(store_id)
    get_date(date_required_type)  # func defined to get the start time and end time, e.g. this week or this month
    if time_period == 't':
        order_df = order_df[(order_df['gmt_create'].astype(str) >= start_time) & (order_df['gmt_create'].astype(str) <= end_time)]
    elif time_period == 'l':
        order_df = order_df[(order_df['gmt_create'].astype(str) >= last_period_start_time) & (order_df['gmt_create'].astype(str) <= last_period_end_time)]
    return order_df

# get the number of orders
def num_of_orders(df):
    return len(df.index)
It takes around 8 seconds to get 0.4 million results, which is too long. Is there any way I can adjust my code to make it faster?
Update
I tried selecting the data directly in MySQL Workbench, and it takes around 0.02 s to get 1000 results. I believe the problem comes from the following code:
results = connection.execute(stmt).fetchall()
But I don't know any other way to store the data in a pd.DataFrame. Any thoughts on that?
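(For what it's worth, pandas can also build the DataFrame directly from a SQLAlchemy selectable, and can stream it in chunks; a minimal sketch, assuming the engine and stmt defined above:)

df = pd.read_sql(stmt, con=engine)  # builds the DataFrame without manual fetchall()/column assignment

# If memory is the concern, read in chunks instead of all at once:
for chunk in pd.read_sql(stmt, con=engine, chunksize=50000):
    process(chunk)  # `process` is a hypothetical placeholder for whatever aggregation is needed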
Update2
I just learned that there is something called 'indexes' on a table that can decrease the processing time. My database is provided by the company and I can't edit it. I'm not sure whether the problem lies with the table in the database or whether I still need to fix something in my code. Is there a way I can 'use' the indexes in my code? Or should they already be there? Or can I create indexes through Python?
Update3
I figured out that my database stops using indexes when I select several columns, which significantly increases the processing time. I believe this is a MySQL question rather than a Python question. I'm still searching for how to fix this, since I barely know SQL.
Update4
I downgraded my MySQL server from version 8.0 to 5.7 and the indexes on my table started to work. But it still takes a long time for Python to process. I'll keep trying to figure out what I can do about this.
I found out that if I used
results = connection.execute(stmt).fetchall()
df = pd.DataFrame(results)
df.columns = results[0].keys()
then I'm re-saving all the data from the database into Python, and since there are no indexes on the Python side, the re-saving and searching time is very long. However, in my case I don't need to re-save the data in Python; I just need the total count of several variables. So, instead of selecting several columns, I just use:
stmt = select([func.count(yc_order.columns.id)])
#where something something
results = connection.execute(stmt).scalar()
return results
And it runs just as fast as it does inside MySQL, so the question is solved.
P.S. I also need some variables that count the total orders in each hour. I decided to create a new table in my database and use the schedule module to run the script every hour and insert the data into the new table.
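A minimal sketch of that hourly job with the schedule module, where insert_hourly_counts is a hypothetical helper that runs the count query above and writes the result into the new table:

import time
import schedule

def insert_hourly_counts():
    # Hypothetical helper: run the COUNT query from above and
    # insert the result into the new hourly-counts table.
    pass

schedule.every().hour.do(insert_hourly_counts)  # register the hourly job

while True:
    schedule.run_pending()
    time.sleep(60)  # check once a minute whether the job is due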

SQL Count Optimisations

I have been using the Django Rest Framework, and part of the ORM does the following query as part of a generic object list endpoint:
SELECT COUNT(*) AS `__count`
FROM `album`
INNER JOIN `tracks`
    ON (`album`.`id` = `tracks`.`album_id`)
WHERE `tracks`.`viewable` = 1
The API is supposed to only display albums with tracks that are set to viewable, but with a tracks table containing 50 million rows this query never seems to complete and hangs the endpoint's execution.
All columns referenced are indexed, so I do not know why this is taking so long to execute. If there are any potential optimisations that I might not have considered, please let me know.
For this query:
SELECT COUNT(*) AS `__count`
FROM `album` INNER JOIN
     `tracks`
     ON (`album`.`id` = `tracks`.`album_id`)
WHERE `tracks`.`viewable` = 1;
An index on tracks(viewable, album_id) and album(id) would help.
But, in all likelihood a join is not needed, so you can do:
select count(*)
from tracks
where viewable = 1;
For this, an index on tracks(viewable) will be a big help.
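If the count is needed from Django itself, the same join-free query can be expressed through the ORM. A minimal sketch, assuming a hypothetical Track model with a viewable boolean field:

from myapp.models import Track  # hypothetical app and model names

# Equivalent of: SELECT COUNT(*) FROM tracks WHERE viewable = 1;
viewable_count = Track.objects.filter(viewable=True).count()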

Memory-efficient way of fetching PostgreSQL unique dates?

I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with that many database entries.
But using py-postgresql and its .prepare() statement, I would hope I could fetch entries on a "yield" basis and thus avoid filling up my memory with the results from the database, which apparently I can't?
This is what I've got so far:
import postgresql

user = 'test'
passwd = 'test'
db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')
results = db.prepare("SELECT time time FROM mytable")
uniqueue_days = []
with db.xact():
    for row in results():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])
print(uniqueue_days)
Before even getting to the if not row['time'] in uniqueue_days: check, I run out of memory, which isn't so strange considering results() probably fetches all the results before looping through them?
Is there a way to get the postgresql library to "page" or batch the results, say 60k per round, or perhaps even rework the query to make the database do more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format before adding them to the uniqueue_days list.
If you were using the better-supported psycopg2 extension, you could use a loop over the client cursor, or fetchone, to get just one row at a time, as psycopg2 uses a server-side portal to back its cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so testing it will take a few more changes. I still recommend it.
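A minimal sketch of the psycopg2 route, reusing the connection details from the question; a named cursor is backed by a server-side portal, so rows arrive in batches instead of all at once:

import psycopg2

conn = psycopg2.connect(host="192.168.1.1", dbname="mydb", user="test", password="test")

cur = conn.cursor(name="stream_times")  # naming the cursor makes it server-side
cur.itersize = 60000                    # rows fetched per network round trip

cur.execute("SELECT time FROM mytable")

uniqueue_days = set()
for (time_value,) in cur:  # iterates without loading the whole result set into memory
    uniqueue_days.add(time_value)

cur.close()
conn.close()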
You could let the database do all the heavy lifting.
Ex: Instead of reading all the data into Python and then calculating the unique dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce a sort order on the unique dates returned, then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read the data in chunks, you could use the dates you get from the above query to subset your results further down the line:
Ex:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j]
Where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
I will leave it to you to figure out how to convert the dates back into Unix timestamps.
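A minimal sketch of running that query with py-postgresql, reusing the db handle from the question and formatting the results as %Y-%m-%d as the asker intends (assuming the timestamps are in seconds, as the to_timestamp call above implies):

# Reuses the `db` handle opened in the question.
get_unique_dates = db.prepare(
    "SELECT DISTINCT DATE(to_timestamp(time)) AS unique_date FROM mytable ORDER BY 1"
)

# DATE() yields datetime.date objects, so formatting them is straightforward.
uniqueue_days = [row['unique_date'].strftime('%Y-%m-%d') for row in get_unique_dates()]
print(uniqueue_days)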

Efficient way to run select query for millions of data

I want to run various SELECT queries 100 million times, and I have approximately 1 million rows in a table. Therefore, I am looking for the fastest method to run all these SELECT queries.
So far I have tried three different methods, and the results were similar.
The following three methods are, of course, not doing anything useful, but are purely for comparing performance.
First method:
for i in range(100000000):
    cur.execute("select id from testTable where name = 'aaa';")
Second method:
cur.execute("""PREPARE selectPlan AS
    SELECT id FROM testTable WHERE name = 'aaa' ;""")
for i in range(10000000):
    cur.execute("""EXECUTE selectPlan ;""")
Third method:
def _data(n):
    cur = conn.cursor()
    for i in range(n):
        yield (i, 'test')
sql = """SELECT id FROM testTable WHERE name = 'aaa' ;"""
cur.executemany(sql, _data(10000000))
And the table is created like this:
cur.execute("""CREATE TABLE testTable ( id int, name varchar(1000) );""")
cur.execute("""CREATE INDEX indx_testTable ON testTable(name)""")
I thought that using the prepared-statement functionality would really speed up the queries, but since that does not seem to happen, I hoped you could give me a hint on other ways of doing this.
This sort of benchmark is unlikely to produce any useful data, but the second method should be fastest, as once the statement is prepared it is stored in memory by the database server. Further calls to repeat the query do not require the text of the query to be transmitted, saving a small amount of time.
This is likely to be moot, as the query is very small (likely the same quantity of packets over the wire as repeatedly sending the query text), and the query cache will serve the same data for every request.
What's the purpose of retrieving such an amount of data at once? I don't know your situation, but I'd definitely page the results using LIMIT and OFFSET. Take a look at:
7.6. LIMIT and OFFSET
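A minimal sketch of paging that way, assuming the cur cursor from the question and a %s-style DB-API driver such as psycopg2:

# Pages through testTable in batches of 10,000 rows using LIMIT/OFFSET.
page_size = 10000
offset = 0
while True:
    cur.execute(
        "SELECT id FROM testTable WHERE name = 'aaa' ORDER BY id LIMIT %s OFFSET %s;",
        (page_size, offset),
    )
    rows = cur.fetchall()
    if not rows:
        break
    # ... process the batch of rows here ...
    offset += page_size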
If you just want to benchmark SQL on its own and not mix Python into the equation, try pgbench.
http://developer.postgresql.org/pgdocs/postgres/pgbench.html
Also what is your goal here?
