For business purposes, I need to retrieve data using PandaSQL. I use around four queries in my code, and my base data has about 2,000,000 rows.
I'm using the following type of query in my code. Note that the variables are dummies, but the syntax is the same.
import pandasql as pdsql

str1 = """select distinct class, year, section, student_name
          from student_data where class=%d and year='%s'"""
str2 = str1 % (class_num, year)  # "class" is a reserved word in Python, so class_num is used here

pysql = lambda q: pdsql.sqldf(q, globals())
df1 = pysql(str2)
Currently, the code takes five minutes and 30 seconds to execute. How could I make this run faster using PandaSQL in Python 3.x?
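For context, and not from the original post: pandasql copies each referenced DataFrame into an in-memory SQLite database on every sqldf call, so a common suggestion is to express the same filter with native pandas indexing, which avoids that copy. A minimal sketch, assuming student_data is a pandas DataFrame already in scope and class_num/year are the same dummy variables as above:

# equivalent filter without the SQLite round trip
df1 = (student_data
       .loc[(student_data['class'] == class_num) & (student_data['year'] == year),
            ['class', 'year', 'section', 'student_name']]
       .drop_duplicates())  # mirrors SELECT DISTINCT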
Related
I am using MySQL with pandas and SQLAlchemy. However, it is extremely slow: a simple query like the one below takes more than 11 minutes to complete on a table with 11 million rows. What actions could improve this performance? The table mentioned does not have a primary key and is indexed on only one column.
from sqlalchemy import create_engine
import pandas as pd
sql_engine_access = 'mysql+pymysql://root:[password]@localhost'
sql_engine = create_engine(sql_engine_access, echo=False)
script = 'select * from my_database.my_table'
df = pd.read_sql(script, con=sql_engine)
You can try out our tool connectorx (pip install -U connectorx). It is implemented in Rust and targets improving the performance of pandas.read_sql. The API is basically the same as pandas. For example, in your case the code would look like:
import connectorx as cx
conn_url = "mysql://root:[password]#localhost:port/my_database"
query = "select * from my_table"
df = cx.read_sql(conn_url, query)
If there is a numerical column in your query result that is evenly distributed, like ID, you can further speed up the process by leveraging multiple cores like this:
df = cx.read_sql(conn_url, query, partition_on="ID", partition_num=4)
This would split the entire query into four smaller ones by filtering on the ID column, and connectorx will run them in parallel. See the connectorx documentation for more usage and examples.
Here is the benchmark result for loading 60M rows x 16 columns from MySQL into a pandas DataFrame using 4 cores (benchmark chart not reproduced here).
While perhaps not the entire cause of the slow performance, one contributing factor would be that PyMySQL (mysql+pymysql://) can be significantly slower than mysqlclient (mysql+mysqldb://) under heavy loads. In a very informal test (no multiple runs, no averaging, no server restarts) I saw the following results using pd.read_sql_query() against a local MySQL database:
rows retrieved    mysql+mysqldb (seconds)    mysql+pymysql (seconds)
1_000_000         13.6                       54.0
2_000_000         25.9                       114.1
3_000_000         38.9                       171.5
4_000_000         62.8                       217.0
5_000_000         78.3                       277.4
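If switching drivers is an option, the only change on the pandas side is the dialect name in the SQLAlchemy URL. A minimal sketch, assuming mysqlclient is installed (pip install mysqlclient) and the same placeholder credentials as above:

from sqlalchemy import create_engine
import pandas as pd

# mysql+mysqldb selects the mysqlclient driver instead of PyMySQL
sql_engine = create_engine('mysql+mysqldb://root:[password]@localhost', echo=False)
df = pd.read_sql_query('select * from my_database.my_table', con=sql_engine)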
I have made a test table in SQL; its information schema was shown in a screenshot that is not reproduced here.
Now I extract this information using the following Python script:
import pandas as pd
import mysql.connector
db = mysql.connector.connect(host="localhost", user="root", passwd="abcdef")
pointer = db.cursor()
pointer.execute("use holdings")
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
pointer.execute(x)
rows = pointer.fetchall()
rows = pd.DataFrame(rows)  # columns are unnamed, so they are addressed by position
stock = rows[1]            # second column of the result set
The production table contains 200 unique trading symbols and has a schema similar to the test table.
My concern is that, for the following statement:
x = "Select * FROM orders where tradingsymbol like 'TATACHEM'"
I will have to replace the value of tradingsymbol 200 times, which is inefficient. Is there a more effective way to do this?
If I understand you correctly, your problem is that you want to avoid sending a separate query for each trading symbol, correct? In that case MySQL's IN operator might help: you could simply send one query to the database containing all the trading symbols you want. If you then want to do different things with the various trading symbols, you could select the subsets within pandas.
Another performance improvement could come from pandas.read_sql, since it speeds up the creation of the DataFrame somewhat.
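A minimal sketch of that combination (one IN query plus pandas.read_sql), reusing the connection style from the question; the symbols list is hypothetical, and pandas may warn that non-SQLAlchemy DBAPI connections are not officially supported:

import pandas as pd
import mysql.connector

symbols = ['TATACHEM', 'INFY', 'TCS']  # hypothetical subset of your 200 symbols
db = mysql.connector.connect(host='localhost', user='root', passwd='abcdef', database='holdings')

# one query with IN instead of 200 separate queries
placeholders = ', '.join(['%s'] * len(symbols))
query = 'SELECT * FROM orders WHERE tradingsymbol IN ({})'.format(placeholders)
orders = pd.read_sql(query, con=db, params=symbols)

# work with one symbol's subset inside pandas
tatachem = orders[orders['tradingsymbol'] == 'TATACHEM']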
Two more things to add for efficiency:
Ensure that tradingsymbol is indexed in MySQL for faster lookups (a sketch follows this list).
Make tradingsymbol an ENUM so that typos and the like are rejected; otherwise the above-mentioned IN lookup has to fall back to full string comparisons.
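And a one-off DDL sketch for the index suggestion, run once with the same cursor as in the question (the index name is hypothetical):

# speeds up WHERE / IN lookups on tradingsymbol
pointer.execute('CREATE INDEX idx_orders_tradingsymbol ON orders (tradingsymbol)')
db.commit()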
I'm a beginner in data science, and my recent job is to select data from my company's database, with some conditions, using Python. I tried to achieve this with SQLAlchemy and an engine, but it takes too long to get all the rows I need, and I can't see what I can do to reduce the time it takes.
For example, I use the following code to get the total orders of a store during a time period, by its store_id in the database:
import datetime

import pandas as pd
from sqlalchemy import create_engine, MetaData, select, Table, func, and_, or_, cast, Float
import pymysql

# create engine and connect it to the database
engine = create_engine('mysql+pymysql://root:*******@127.0.0.1:3306/db')
connection = engine.connect()
metadata = MetaData()
order = Table('order', metadata, autoload=True, autoload_with=engine)

# use the store_id to get all the data in two months from the table
def order_df_func(store_id):
    stmt = select([order.columns.gmt_create, order.columns.delete_status, order.columns.payment_time])
    stmt = stmt.where(
        and_(order.columns.store_id == store_id,
             order.columns.gmt_create <= datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'),
             order.columns.gmt_create >= get_day_zero(last_month_start.date())  # func defined to get 00:00 for a day
             )
    )
    results = connection.execute(stmt).fetchall()
    df = pd.DataFrame(results)
    df.columns = results[0].keys()
    return df

# get the data in a specific time period
def time_bounded_order_df(store_id, date_required_type, time_period):
    order_df = order_df_func(store_id)
    get_date(date_required_type)  # func defined to get the start time and end time, e.g. this week or this month
    if time_period == 't':
        order_df = order_df[(order_df['gmt_create'].astype(str) >= start_time) & (order_df['gmt_create'].astype(str) <= end_time)]
    elif time_period == 'l':
        order_df = order_df[(order_df['gmt_create'].astype(str) >= last_period_start_time) & (order_df['gmt_create'].astype(str) <= last_period_end_time)]
    return order_df

# get the number of orders
def num_of_orders(df):
    return len(df.index)
It takes around 8s to get 0.4 million rows, which is too long. Is there any way I can adjust my code to make it faster?
Update
I tried selecting data directly in MySQL Workbench and it takes around 0.02s to get 1,000 results. I believe the problem comes from the following line:
results = connection.execute(stmt).fetchall()
But I don't know any other way to store the data in a pandas DataFrame. Any thoughts on that?
Update2
I just learned that there is something called 'indexes' on a table that can decrease the processing time. My database is provided by the company and I can't edit it. I'm not sure whether the problem lies with the table in the database or whether I still need to fix something in my code. Is there a way I can 'use' the indexes in my code? Or should they already be there? Or can I create indexes through Python?
Update3
I figured out that my database stopped using indexes when I select several columns, which significantly increased the processing time. I believe this is a MySQL question rather than a Python question. I'm still searching for how to fix this, since I barely know SQL.
Update4
I downgraded my MySQL server from version 8.0 to 5.7 and the indexes on my table started to be used. But it still takes a long time for Python to process the results. I'll keep trying to figure out what I can do about this.
I found out that if I used
results = connection.execute(stmt).fetchall()
df = pd.DataFrame(results)
df.columns = results[0].keys()
then all the data is copied from the database into Python, and since there are no indexes on the Python side, the copying and searching take a very long time. However, in my case I don't need to keep the data in Python; I just need the total count of several variables. So, instead of selecting several columns, I just use
stmt = select([func.count(yc_order.columns.id)])
#where something something
results = connection.execute(stmt).scalar()
return results
And it runs just as fast as it does inside MySQL, so the question is solved.
P.S. I also need some variables that count the total orders in each hour. I decided to create a new table in my database and use the schedule module to run the script every hour and insert the data into the new table.
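A minimal sketch of that hourly job with the schedule module; record_hourly_counts is a hypothetical helper that would run the count query above and insert the result into the new table:

import time
import schedule

def record_hourly_counts():
    # hypothetical: run the COUNT query above and insert the result into the new table
    pass

schedule.every().hour.do(record_hourly_counts)

while True:
    schedule.run_pending()
    time.sleep(60)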
I am using:
Python 3
Pandas
SQLite3 locally (it will be PostgreSQL in production, although this should not matter for this purpose)
I have a project where I am attempting to remove some joins, counts, groups and other aggregate functions from queries - they all need to be moved to code.
I am new to Python and have skimmed the Pandas manual and other resources on StackOverflow.
I am attempting to recreate the following query:
SELECT D.ID, D.Name, COUNT(W.ID)
FROM Departments D
LEFT JOIN Widgets W ON D.ID=W.department
GROUP BY D.ID, D.Name
HAVING COUNT(W.ID)>0
On the Python side, I am just using two queries:
SELECT * FROM departments
SELECT * FROM widgets
I could be wrong, but I believe this is what needs to happen:
Import the Python module and create a connection
Import Pandas (from what I understand, this is an arguably efficient tool for this type of work)
Assign my queries to variables
Have Pandas read the queries
(Merge?) the query results to construct a dataframe
Perform the count and aggregation using methods on the dataframe
I am struggling with the syntax and am having trouble determining if I am even going about this the right way. Both tables passed into the queries have multiple columns above and beyond what I am working with, which could be contributing to the difficulty.
The result should have the Department ID, the Department Name, and a count of widgets belonging to the department. Here is the Python code I have been experimenting with:
import sqlite3
import pandas as pd

...  # functions and connection info removed

with conn:
    sql1 = "SELECT * FROM departments"
    sql2 = "SELECT * FROM widgets"

    # print("Read Queries Into Dataframes")
    df = pd.read_sql(sql1, conn)
    lf = pd.read_sql(sql2, conn)

    # print("Connected and read - print the dataframe")
    merged_df = pd.merge(df, lf, left_on='id', right_on='department', how='inner')  # .groupby(['id'])
    # result = merged_df.groupby(['id'])
    # result = pd.merge(df, lf, on='key')
    # print(result)
Notes:
It appears to (mostly) work until I introduce the group by. I am getting a key error on id; perhaps this is a syntax error, or I do not have something aliased properly.
Changing the join type (how) from left to inner yields some NaN results.
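For what it is worth, here is a minimal sketch of the merge-and-count approach described above; the column names id, name and department are assumptions taken from the SQL at the top of the question, and the real tables may use different casing:

import sqlite3
import pandas as pd

conn = sqlite3.connect('example.db')  # hypothetical connection

departments = pd.read_sql('SELECT * FROM departments', conn)
widgets = pd.read_sql('SELECT * FROM widgets', conn)

# an inner join keeps only departments with at least one widget,
# which mirrors HAVING COUNT(W.ID) > 0 in the original query
merged = departments.merge(widgets, left_on='id', right_on='department',
                           how='inner', suffixes=('_dept', '_widget'))

# both tables have an "id" column, so after the merge they become
# id_dept and id_widget; grouping on the bare "id" raises a KeyError
counts = (merged.groupby(['id_dept', 'name'])
                .size()
                .reset_index(name='widget_count'))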
I have a database with roughly 30 million entries, which is a lot, and I don't expect anything but trouble working with a database that large.
But using py-postgresql and the .prepare() statement, I would hope I could fetch entries on a "yield" basis and thus avoid filling up my memory with the results from the database, which apparently I can't?
This is what I've got so far:
import postgresql

user = 'test'
passwd = 'test'
db = postgresql.open('pq://' + user + ':' + passwd + '@192.168.1.1/mydb')

result = db.prepare("SELECT time FROM mytable")

uniqueue_days = []
with db.xact():
    for row in result():
        if not row['time'] in uniqueue_days:
            uniqueue_days.append(row['time'])

print(uniqueue_days)
Before even getting to if not row['time'] in uniqueue_days: I run out of memory, which isn't so strange considering result() probably fetches all the rows before looping through them?
Is there a way to get the py-postgresql library to "page" or batch the results, say 60k per round, or perhaps even rework the query to do more of the work?
Thanks in advance!
Edit: I should mention that the dates in the database are Unix timestamps, and I intend to convert them into %Y-%m-%d format before adding them to the uniqueue_days list.
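A one-line sketch of that conversion, assuming ts is a single Unix timestamp in integer seconds:

from datetime import datetime, timezone

day = datetime.fromtimestamp(ts, tz=timezone.utc).strftime('%Y-%m-%d')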
If you were using the better-supported psycopg2 extension, you could use a loop over the client cursor, or fetchone, to get just one row at a time, as psycopg2 uses a server-side portal to back its cursor.
If py-postgresql doesn't support something similar, you could always explicitly DECLARE a cursor on the database side and FETCH rows from it progressively. I don't see anything in the documentation that suggests py-postgresql can do this for you automatically at the protocol level like psycopg2 does.
Usually you can switch between database drivers pretty easily, but py-postgresql doesn't seem to follow the Python DB-API, so testing the switch will take a few more changes. I still recommend it.
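A minimal sketch of the psycopg2 approach with a named (server-side) cursor, reusing the host and credentials from the question; the batch size is the 60k the question mentions:

import psycopg2

conn = psycopg2.connect(host='192.168.1.1', dbname='mydb', user='test', password='test')

# a named cursor is backed by a server-side portal, so rows are
# streamed in batches instead of being loaded into memory at once
with conn, conn.cursor(name='time_stream') as cur:
    cur.itersize = 60000  # rows fetched per network round trip
    cur.execute('SELECT time FROM mytable')
    uniqueue_days = set()
    for (ts,) in cur:
        uniqueue_days.add(ts)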
You could let the database do all the heavy lifting.
For example, instead of reading all the data into Python and then calculating the unique dates, why not try something like this:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES FROM mytable;
If you want to strictly enforce the sort order of the unique dates returned, then do the following:
SELECT DISTINCT DATE(to_timestamp(time)) AS UNIQUE_DATES
FROM mytable
order by 1;
Useful references for the functions used above:
Date/Time Functions and Operators
Data Type Formatting Functions
If you would like to read the data in chunks, you could use the dates you get from the above query to subset your results further down the line, e.g.:
'SELECT * FROM mytable WHERE time BETWEEN ' + UNIQUE_DATES[i] + ' AND ' + UNIQUE_DATES[j]
where UNIQUE_DATES[i] and UNIQUE_DATES[j] are parameters you would pass from Python.
I will leave it to you to figure out how to convert a date into a Unix timestamp (a parameterized variant is sketched below).
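A hedged sketch of that chunked read with bound query parameters, using psycopg2 as recommended in the earlier answer; unique_dates is assumed to hold the DATE values returned by the DISTINCT query, and the to_timestamp conversion is one way to avoid converting the dates back to Unix timestamps:

import psycopg2

conn = psycopg2.connect(host='192.168.1.1', dbname='mydb', user='test', password='test')

with conn, conn.cursor() as cur:
    # bind the two boundary dates as parameters instead of
    # concatenating them into the SQL string
    cur.execute(
        'SELECT * FROM mytable WHERE to_timestamp(time) BETWEEN %s AND %s',
        (unique_dates[0], unique_dates[1]),
    )
    chunk = cur.fetchall()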