Convert SAS proc sql to Python (pandas)

I'm rewriting some code from SAS to Python using the pandas library.
I've got the following code and I have no idea what to do with it.
Can you help me? It's too complicated for me to translate correctly. I've changed the column names to protect sensitive data.
This is SAS code:
proc sql;
    create table &work_lib..opk_do_inf_4 as
    select distinct
        *,
        min(kat_opk) as opk_do_inf,
        count(nr_ks) as ilsc_opk_do_kosztu_infr
    from &work_lib..opk_do_inf_3
    group by kod_ow, kod_sw, nr_ks, nr_ks_pr, nazwa_zabiegu_icd_9, nazwa_zabiegu
    having kat_opk = opk_do_inf
;
quit;
This is my try in Pandas:
df = self.opk_do_inf_3()  # creates the DF using another function
df['opk_do_inf'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['kat_opk'].min()
df['ilsc_opk_do_kosztu_infr'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['nr_ks'].count()
df_groupby = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu']).filter(lambda x: x['kat_opk'] == x['opk_do_inf'])
df = df_groupby.reset_index()
df = df.drop_duplicates()
return df

First, calling SELECT * in an aggregate GROUP BY query is not valid ANSI SQL. SAS allows it and remerges the aggregate values back onto every row of each group, but other engines will either reject the query or return indeterminate results. Usually the SELECT columns should be limited to the columns in the GROUP BY clause.
With that said, aggregate SQL queries can generally be translated to pandas with groupby.agg() operations, with WHERE (filter before aggregation) and HAVING (filter after aggregation) conditions handled using either .loc or .query.
SQL
SELECT col1, col2, col3,
       MIN(col1) AS min_col1,
       AVG(col2) AS mean_col2,
       MAX(col3) AS max_col3,
       COUNT(*) AS count_obs
FROM mydata
GROUP BY col1, col2, col3
HAVING col1 = MIN(col1)
Pandas
General
agg_data = (mydata.groupby(["col1", "col2", "col3"], as_index=False)
                  .agg(min_col1 = ("col1", "min"),
                       mean_col2 = ("col2", "mean"),
                       max_col3 = ("col3", "max"),
                       count_obs = ("col1", "count"))
                  .query("col1 == min_col1")
           )
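The same pattern handles a WHERE clause; the only difference is that the filter runs before the groupby instead of after it. A minimal sketch on the same hypothetical mydata columns, using .loc up front (a .query call before the groupby would work equally well):
agg_data = (mydata.loc[mydata["col2"] > 0]    # WHERE col2 > 0, applied before aggregation
                  .groupby(["col1", "col2", "col3"], as_index=False)
                  .agg(min_col1 = ("col1", "min"),
                       count_obs = ("col1", "count"))
           )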
Specific
opk_do_inf_4 = (mydata.groupby(["kat_opk", "kod_ow", "kod_sw", "nr_ks", "nr_ks_pr",
                                "nazwa_zabiegu_icd_9", "nazwa_zabiegu"],
                               as_index=False)
                      .agg(opk_do_inf = ("kat_opk", "min"),
                           ilsc_opk_do_kosztu_infr = ("nr_ks", "count"))
                      .query("kat_opk == opk_do_inf")
               )
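One nuance worth flagging: SAS remerges the aggregates onto every row of each group, and the Specific version above keeps kat_opk available by adding it to the grouping keys, which changes which groups the minimum is taken over. A groupby().transform() translation tracks the SAS remerge semantics more directly, since it broadcasts each group statistic back to the rows and turns HAVING into an ordinary row filter. A minimal sketch, assuming df is the opk_do_inf_3 frame from the question:
keys = ["kod_ow", "kod_sw", "nr_ks", "nr_ks_pr", "nazwa_zabiegu_icd_9", "nazwa_zabiegu"]
grp = df.groupby(keys)
df["opk_do_inf"] = grp["kat_opk"].transform("min")                 # MIN(kat_opk) remerged onto every row
df["ilsc_opk_do_kosztu_infr"] = grp["nr_ks"].transform("count")    # COUNT(nr_ks) per group
opk_do_inf_4 = df.query("kat_opk == opk_do_inf").drop_duplicates() # HAVING, then SELECT DISTINCT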

You can use the sqldf function from the pandasql package to run a SQL query on a dataframe. Note that pandasql executes the query through SQLite, so use LIMIT rather than T-SQL's TOP. Example below:
from pandasql import sqldf
query = "select * from df limit 10"
newdf = sqldf(query, locals())

Related

Selecting rows from sql if they are also in a dataframe

I have an MS SQL Server with a lot of rows (around 4 million) covering all our customers and their information.
I can also get a list of phone numbers of all visitors of my website in a given timeframe as a CSV file, which I then convert to a DataFrame in Python. What I want to do is select two columns from my server (one is the phone number and the other is a property of that person), but only for the people who are in both my DataFrame and my server.
What I currently do is select all customers from SQL Server and then merge them with my DataFrame, but obviously this is not very fast. Is there any way to do this faster?
query2 = """
SELECT encrypt_phone, col2
FROM DatabaseTable
"""
cursor.execute(query2)
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
df1.merge(df2, how='inner', indicator=True)
If your DataFrame does not have many rows, I would do it the simple way, as here:
V = df["colx"].unique()
Q = 'SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN ({})'.format(','.join(['?']*len(V)))
cursor.execute(Q, tuple(V))
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
NB: colx and coly are the columns that refer to the customers (id, or name, ...) in the pandas DataFrame and in the SQL table, respectively.
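If the list is long, note that many drivers cap the number of bind parameters per statement (SQL Server allows roughly 2100), so a chunked variant of the same idea may be needed. A sketch, assuming the same df, cursor, and table/column names as above:
import pandas as pd

V = df["colx"].unique()
frames = []
CHUNK = 1000  # stay well under the driver's bind-parameter limit
for i in range(0, len(V), CHUNK):
    chunk = tuple(V[i:i + CHUNK])
    Q = 'SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN ({})'.format(','.join(['?'] * len(chunk)))
    cursor.execute(Q, chunk)
    frames.append(pd.DataFrame.from_records(cursor.fetchall(),
                                            columns=[x[0] for x in cursor.description]))
df2 = pd.concat(frames, ignore_index=True)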
Otherwise, you may need to store df1 as a table in your DB and then perform a sub-query :
df1.to_sql('DataFrameTable', conn, index=False) #this will store df1 in the DB
Q = "SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN (SELECT colx FROM DataFrameTable)"
df2 = pd.read_sql_query(Q, conn)

In Python, using pandasql: query return "Empty DataFrame"

import pandas as pd
import sqlite3 as db
import pandasql
dataSet = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",header=None)
type(dataSet)
dataSet.columns = ['age', 'workclass','fnlwgt','education','education_num','marital_status','occupation','relationship'
,'race','sex','capital_gain','capital_loss','hours_per_week','native_country','salary']
dataSet.head()
from pandasql import sqldf
q1 = "select distinct sex from dataSet where sex='Male';"
pysqldf = lambda q: sqldf(q, globals())
print(pysqldf(q1))
For this data set I checked the actual data and found whitespace in the values.
So first we need to cleanse the data; then we can run the query against it.
For cleansing we need to trim the whitespace. For this purpose I have written a function trim_all_the_columns that strips leading and trailing whitespace from every string value.
Code for above mentioned data set
#!/usr/bin/env python
# coding: utf-8
#import required packages
import pandas as pd
import sqlite3 as db
import pandasql as ps
from pandasql import sqldf
#give the path location from where data is loaded, in your case give "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
inputpath=r'C:\Users\Z0040B9K\Desktop\SIGShowcase\adult.data.txt'
#Trim whitespace from ends of each value across all series in dataframe
def trim_all_the_columns(df):
    trim_strings = lambda x: x.strip() if type(x) is str else x
    return df.applymap(trim_strings)
#creating dataframe
dataSet = pd.read_csv(inputpath,header=None)
#calling trim function over the dataframe to remove all whitespaces
dataSet = trim_all_the_columns(dataSet)
type(dataSet)
dataSet.columns = ['age', 'workclass','fnlwgt','education','education_num','marital_status','occupation','relationship' ,'race','sex','capital_gain','capital_loss','hours_per_week','native_country','salary']
#sql query
q1 = "select distinct sex from dataSet where sex='Male';"
# this returns the distinct sex values matching 'Male' (a single row once the whitespace is gone)
# if you add another column such as age, you get the distinct (age, sex) combinations instead
#q1 = "select distinct age,sex from dataSet where sex='Male';"
pysqldf = lambda q: sqldf(q, globals())
#print result
print(pysqldf(q1))
# this can also be used to print the result
print(ps.sqldf(q1, locals()))
With the whitespace trimmed, both queries now return rows instead of an empty DataFrame:
q1 = "select distinct sex from dataSet where sex='Male';" returns the single row Male
q1 = "select distinct age, sex from dataSet where sex='Male';" returns one row per distinct age
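As an aside, for this particular CSV the stray whitespace sits right after each comma, so pandas' read_csv can strip it at load time via skipinitialspace, avoiding the manual trim pass entirely. A sketch:
import pandas as pd
from pandasql import sqldf

# skipinitialspace=True drops the space that follows each delimiter
dataSet = pd.read_csv("https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
                      header=None, skipinitialspace=True)
dataSet.columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'marital_status',
                   'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss',
                   'hours_per_week', 'native_country', 'salary']
print(sqldf("select distinct sex from dataSet where sex='Male';", locals()))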

Python pd.read_sql WHERE clause parameters

Use case:
We have nested queries and our tables have 10 to 20 million rows. Our intention here is to reduce the query CPU time by filtering smartly.
I'd like to filter my columns in pd.read_sql by another DataFrame's column values. Is that possible?
Step 1: in the df1 DataFrame below, age1 and age3 are my future filter columns for pd.read_sql:
raw_data1 = {'age1': [23,45,21],'age2': [10,20,50], 'age3':['forty','fortyone','fortyfour']}
df1 = pd.DataFrame(raw_data1, columns = ['age1','age2','age3'])
df1
Step 2: I'd like to take age1 from the df1 above and use it in the pd.read_sql below to get an item1 DataFrame:
item1 = pd.read_sql("""
SELECT * from [dbo].[ITEM]
where item_age1 = df1.age1
""", conn)
Step 3: likewise, I'd like to take age3 from df1 and use it in the pd.read_sql below to get an item2 DataFrame:
item2 = pd.read_sql("""
SELECT * from [dbo].[ITEM]
where item_age3 = df1.age3
""", conn)
Use a parameterized query:
item2 = pd.read_sql("""
    SELECT * from [dbo].[ITEM]
    where item_age3 IN ({})
    """.format(','.join('?' * len(df1.age3))), conn,
    params=list(df1.age3))
Depending on the database backend, the placeholder syntax may be '%s' or '%(name)s' instead of '?'. See the PEP 249 paramstyle documentation for more information.
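For instance, with a driver that uses the format paramstyle (psycopg2 and pymysql both do), the same pattern would use '%s' placeholders; a sketch under that assumption, keeping the question's table name for continuity:
item2 = pd.read_sql("""
    SELECT * from [dbo].[ITEM]
    where item_age3 IN ({})
    """.format(','.join(['%s'] * len(df1.age3))), conn,
    params=list(df1.age3))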

How do I make an inner join between SQL data and an external DataFrame in Python in an efficient way

I want to merge an Excel file with SQL data in pandas. Here's my code:
import pandas as pd
import pymysql
from sqlalchemy import create_engine
data1 = pd.read_excel('data.xlsx')
engine = create_engine('...cloudprovider.com/...')
data2 = pd.read_sql_query("select id, column3, column4 from customer", engine)
data = data1.merge(data2, on='id', how='left')
It works. Just to make it clearer:
data1.columns gives Index(['id', 'column1', 'column2'], dtype='object')
data2.columns gives Index(['id', 'column3', 'column4'], dtype='object')
data.columns gives Index(['id', 'column1', 'column2', 'column3', 'column4'], dtype='object')
Since data2 is getting bigger, I can't query it in its entirety, so I want to query data2 only for the ids that exist in data1. How am I supposed to do this?
You could leverage the fact that SQLAlchemy is a great query builder. Either reflect the customer table, or build the metadata by hand:
from sqlalchemy import MetaData, select
metadata = MetaData()
metadata.reflect(engine, only=['customer'])
customer = metadata.tables['customer']
and build your query, letting SQLAlchemy worry about proper usage of placeholders, data conversion etc. You're looking for customer rows where id is in the set of ids from data1, achieved in SQL with the IN operator:
query = select([customer.c.id,
                customer.c.column3,
                customer.c.column4]).\
    where(customer.c.id.in_(data1['id']))
data2 = pd.read_sql_query(query, engine)
If you wish to keep on using SQL strings manually, you could build a parameterized query as such:
placeholders = ','.join(['%s'] * data1['id'].count())
# Note that you're not formatting the actual values here, but placeholders
query = f"SELECT id, column3, column4 FROM customer WHERE id IN ({placeholders})"
data2 = pd.read_sql_query(query, engine, params=list(data1['id']))
In general it is beneficial to learn to use placeholders instead of mixing SQL and values by formatting or concatenating strings, as the latter may expose you to SQL injection when handling user-generated data. Usually you'd write the required placeholders in the query string directly, but some string building is required if you have a variable number of parameters [1].
[1]: Some DB-API drivers, such as psycopg2, allow passing tuples and lists as scalar values and know how to construct suitable SQL.
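A middle ground between the two approaches: SQLAlchemy's text() construct can expand one named parameter into an IN list via an expanding bindparam, so you keep a plain SQL string without counting placeholders yourself. A sketch, assuming the same engine and data1 as above:
from sqlalchemy import text, bindparam

stmt = text("SELECT id, column3, column4 FROM customer WHERE id IN :ids")
stmt = stmt.bindparams(bindparam("ids", expanding=True))  # :ids expands to the right number of placeholders
data2 = pd.read_sql_query(stmt, engine, params={"ids": data1["id"].tolist()})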
Since you are looking at a condition like WHERE id IN (some_list), this should work for you:
id_list = data1['id'].tolist()
your_query = "select id, column3, column4 from customer where id in {}".format(tuple(id_list))
data2 = pd.read_sql_query(your_query, engine)
Hope it works. (Note this interpolates the values into the SQL string, with the injection caveat from the answer above, and the tuple formatting breaks for a single-element list because of the trailing comma.)

SQL values to update pandas dataframe

I am doing a lot of SQL-to-pandas work and I have run into the following challenge.
I have a DataFrame that looks like:
UserID, AccountNo, AccountName
123, 12345, 'Some name'
...
What I would like to do is, for each account number, add a column called TotalRevenue that is fetched from a MySQL database, so I am thinking of something like:
for accountno in df['AccountNo']:
    df1 = pd.read_sql(('select sum(VBRK_NETWR) as sum from sapdata2016.orders where VBAK_BSARK="ZEDI" and VBRK_KUNAG = %s;') % accountno, conn)
And I need to expand the DataFrame such that:
UserID, AccountNo, AccountName, TotalRevenue
123, 12345, 'Some name', df1
...
The code that I have so far (which is not working; it raises a getitem error):
sets3 = []
i = 0
for accountno in df5['kna1_kunnr']:
    df1 = pd.read_sql(('select sum(VBRK_NETWR) as sum from sapdata2016.orders where VBAK_BSARK="ZEDI" and VBRK_KUNAG = %s;') % accountno, conn)
    df2 = pd.DataFrame([(df5['userid'][i], df5['kna1_kunnr'][i], accountno, df5['kna1_name1'][i], df1['sum'][0])], columns=['User ID', 'AccountNo', 'tjeck', 'AccountName', 'Revenue'])
    sets3.append(df2)
    i += 1
df6 = pd.concat(sets3)
This idea/code is not pretty, and I wonder if there is a better/nicer way to do it. Any ideas?
Consider exporting the pandas data to MySQL as a temp table, then running an SQL query that joins your pandas data to an aggregate query for TotalRevenue, and reading the result set back into a pandas DataFrame. This approach avoids any looping.
from sqlalchemy import create_engine
...
# SQL ALCHEMY CONNECTION (PREFERRED OVER RAW CONNECTION)
engine = create_engine('mysql://user:pwd@localhost/database')
# engine = create_engine("mysql+pymysql://user:pwd@hostname:port/database") # load pymysql
df1.to_sql("mypandastemptable", con=engine, if_exists='replace')
sql = """SELECT t.UserID, t.AccountNo, t.AccountName, agg.TotalRevenue
         FROM mypandastemptable t
         LEFT JOIN
            (SELECT VBRK_KUNAG as AccountNo,
                    SUM(VBRK_NETWR) as TotalRevenue
             FROM sapdata2016.orders
             WHERE VBAK_BSARK='ZEDI'
             GROUP BY VBRK_KUNAG) agg
         ON t.AccountNo = agg.AccountNo
      """
newdf = pd.read_sql(sql, con=engine)
Of course the converse works as well: merge the existing DataFrame in pandas with the grouped aggregate query result set:
sql = """SELECT VBRK_KUNAG as AccountNo,
                SUM(VBRK_NETWR) as TotalRevenue
         FROM sapdata2016.orders
         WHERE VBAK_BSARK='ZEDI'
         GROUP BY VBRK_KUNAG
      """
df2 = pd.read_sql(sql, con=engine)
newdf = df1.merge(df2, on='AccountNo', how='left')
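If pulling the whole aggregate table is still too heavy, the merge approach combines naturally with the parameterized IN filters shown in the earlier answers, restricting the aggregate to accounts actually present in df1. A sketch, assuming a pymysql connection with hypothetical credentials and the same table and column names:
import pymysql
import pandas as pd

conn = pymysql.connect(host='localhost', user='user', password='pwd', database='database')  # hypothetical credentials
accounts = df1['AccountNo'].tolist()
placeholders = ','.join(['%s'] * len(accounts))  # pymysql uses the format paramstyle
sql = """SELECT VBRK_KUNAG as AccountNo,
                SUM(VBRK_NETWR) as TotalRevenue
         FROM sapdata2016.orders
         WHERE VBAK_BSARK='ZEDI' AND VBRK_KUNAG IN ({})
         GROUP BY VBRK_KUNAG
      """.format(placeholders)
df2 = pd.read_sql(sql, con=conn, params=accounts)
newdf = df1.merge(df2, on='AccountNo', how='left')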
