Selecting rows from sql if they are also in a dataframe - python

I have a MS sql server with a lot of rows( around 4 million) from all the customers and their information.
I can also get a list of phone numbers of all visitors of my website in a given timeframe that I can get in a csv file and then covert to a dataframe in python. What I want to do is to select two columns from my server(one is the phone number and the other one is a property of that person) but I only want to select this records from people who are in both my dataframe and my server.
What I currently do is selecting all customers from sql server and then merge them with my dataframe. But obviously this is not very fast. Is there any way to do this faster?
query2 = """
SELECT encrypt_phone, col2
FROM DatabaseTable
"""
cursor.execute(query2)
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
df1.merge(df2, how='inner', indicator=True)

If your DataFrame have not many rows, I would do it the simple way as here :
V = df["colx"].unique()
Q = 'SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN ({})'.format(','.join(['?']*len(V)))
cursor.execute(Q, tuple(V))
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
NB : colx and coly are the columns that refer to the customers (id, or name, ..) in the pandas DataFrame and in the SQL table, respectively.
Otherwise, you may need to store df1 as a table in your DB and then perform a sub-query :
df1.to_sql('DataFrameTable', conn, index=False) #this will store df1 in the DB
Q = "SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN (SELECT colx FROM DataFrameTable)"
df2 = pd.read_sql_query(Q, conn)

Related

Insert data from pandas into sql db - keys doesn't fit columns

I have a database with around 10 columns. Sometimes I need to insert a row which has only 3 of the required columns, the rest are not in the dic.
The data to be inserted is a dictionary named row :
(this insert is to avoid duplicates)
row = {'keyword':'abc','name':'bds'.....}
df = pd.DataFrame([row]) # df looks good, I see columns and 1 row.
engine = getEngine()
connection = engine.connect()
df.to_sql('temp_insert_data_index', connection, if_exists ='replace',index=False)
result = connection.execute(('''
INSERT INTO {t} SELECT * FROM temp_insert_data_index
ON CONFLICT DO NOTHING''').format(t=table_name))
connection.close()
Problem : when I don't have all columns in the row(dic), it will insert dic fields by order (a 3 keys dic will be inserted to the first 3 columns) and not to the right columns. ( I expect the keys in dic to fit the db columns)
Why ?
Consider explicitly naming the columns to be inserted in INSERT INTO and SELECT clauses which is best practice for SQL append queries. Doing so, the dynamic query should work for all or subset of columns. Below uses F-string (available Python 3.6+) for all interpolation to larger SQL query:
# APPEND TO STAGING TEMP TABLE
df.to_sql('temp_insert_data_index', connection, if_exists='replace', index=False)
# STRING OF COMMA SEPARATED COLUMNS
cols = ", ".join(df.columns)
sql = (
f"INSERT INTO {table_name} ({cols}) "
f"SELECT {cols} FROM temp_insert_data_index "
"ON CONFLICT DO NOTHING"
)
result = connection.execute(sql)
connection.close()

Convert SAS proc sql to Python(pandas)

I rewrite some code from SAS to Python using Pandas library.
I've got such code, and I have no idea what should I do with it?
Can you help me, beacase its too complicated for me to do it correct. I've changed the name of columns (for encrypt sensitive data)
This is SAS code:
proc sql;
create table &work_lib..opk_do_inf_4 as
select distinct
*,
min(kat_opk) as opk_do_inf,
count(nr_ks) as ilsc_opk_do_kosztu_infr
from &work_lib..opk_do_inf_3
group by kod_ow, kod_sw, nr_ks, nr_ks_pr, nazwa_zabiegu_icd_9, nazwa_zabiegu
having kat_opk = opk_do_inf
;
quit;
This is my try in Pandas:
df = self.opk_do_inf_3() -> create DF using other function
df['opk_do_inf'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['kat_opk'].min()
df['ilsc_opk_do_kosztu_infr'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['nr_ks'].count()
df_groupby = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu']).filter(lambda x: x['kat_opk']==x['opk_do_inf'])
df = df_groupby.reset_index()
df = df.drop_duplcates()
return df
First, calling SELECT * in an aggregate GROUP BY query is not valid SQL. SAS may allow it but can yield unknown results. Usually SELECT columns should be limited to columns in GROUP BY clause.
With that said, aggregate SQL queries can generally be translated in Pandas with groupby.agg() operations with WHERE (filter before aggregation) or HAVING (filter after aggregation) conditions handled using either .loc or query.
SQL
SELECT col1, col2, col3,
MIN(col1) AS min_col1,
AVG(col2) AS mean_col2,
MAX(col3) AS max_col3,
COUNT(*) AS count_obs
FROM mydata
GROUP BY col1, col2, col3
HAVING col1 = min(col1)
Pandas
General
agg_data = (mydata.groupby(["col1", "col2", "col3"], as_index=False)
.agg(min_col1 = ("col1", "min"),
mean_col2 = ("col2", "mean"),
max_col3 = ("col3", "max"),
count_obs = ("col1", "count"))
.query("col1 == min_col1")
)
Specific
opk_do_inf_4 = (mydata.groupby(["kat_opk", "kod_ow", "kod_sw", "nr_ks", "nr_ks_pr",
"nazwa_zabiegu_icd_9", "nazwa_zabiegu"],
as_index=False)
.agg(opk_do_inf = ("kat_opk", "min"),
ilsc_opk_do_kosztu_infr = ("nr_ks", "count"))
.query("kat_opk == opk_do_inf")
)
You can use the sqldf function from pandasql package to run the sql query on dataframe. example below
''' from pandasql import sqldf
query = "select top 10 * from df "
newdf = sqldf(query, locals())
'''

Python pd.read_sql where cause parameters

Use case:
We have nested queries and our tables have 10 to 20 million rows. Here our intention is to reduce the query CPU time by smart filter
I like to filter my columns in pd.read_sql by other data frame column name. Is that possible?
Step 1: df1 data frame age1 and age3 are my future filter columns for pd.read_sql
raw_data1 = {'age1': [23,45,21],'age2': [10,20,50], 'age3':['forty','fortyone','fortyfour']}
df1 = pd.DataFrame(raw_data1, columns = ['age1','age2','age3'])
df1
Step2: I like take age1 from above df1 dataframe want to use in below pd.read_sql like below to get item1 dataframe
item1 = pd.read_sql("""
SELECT * from [dbo].[ITEM]
where item_age1 = df1.age1
""", conn)
Step3: I like to take age3 from above df1 dataframe want to use in below pd.read_sql like below to get item2 dataframe
item2 = pd.read_sql("""
SELECT * from [dbo].[ITEM]
where item_age3 = df1.age3
""", conn)
Use a parameterized query:
item2 = pd.read_sql("""
SELECT * from [dbo].[ITEM]
where item_age3 IN ({})
""".format(','.join('?'*len(df1.age3))), conn,
params=list(df1.age3))
depending on the database backend this syntax may use '%s' or %(name)s instead of '?'. See PEP249 paramstyle for more information.

How to select data from SQL Server based on data available in pandas data frame?

I have a list of data in one of the pandas dataframe column for which I want to query SQL Server database. Is there any way I can query a SQL Server DB based on data I have in pandas dataframe.
select * from table_name where customerid in pd.dataframe.customerid
In SAP, there is something called "For all entries in" where the SQL can query the DB based on the data available in the array, I was trying to find something similar.
Thanks.
If you are working with tiny DataFrame, then the easiest way would be to generate a corresponding SQL:
In [8]: df
Out[8]:
id val
0 1 21
1 3 111
2 5 34
3 12 76
In [9]: q = 'select * from tab where id in ({})'.format(','.join(['?']*len(df['id'])))
In [10]: q
Out[10]: 'select * from tab where id in (?,?,?,?)'
now you can read data from SQL Server:
from sqlalchemy import create_engine
conn = create_engine(...)
new = pd.read_sql(q, conn, params=tuple(df['id']))
NOTE: this approach will not work for bigger DF's as the generated query (and/or list of bind variables) might bee too long either for Pandas to_sql() function or for SQL Server or even for both.
For bigger DFs I would recommend to write your pandas DF to SQL Server table and then use SQL subquery to filter needed data:
df[list_of_columns_to_save].to_sql('tmp_tab_name', conn, index=False)
q = "select * from tab where id in (select id from tmp_tab_name)"
new = pd.read_sql(q, conn, if_exists='replace')
This is a very familiar scenario and one can use the below code to Query SQL using a very large pandas dataframe. The parameter n needs to be manipulated based on your SQL server memory. For me n=25000 worked.
n = 25000 #chunk row size
## Big_frame dataframe divided into smaller chunks of n into a list
list_df = [big_frame[i:i+n] for i in range(0,big_frame.shape[0],n)]
## Create another dataframe with columns names as expected from SQL
big_frame_2 = pd.DataFrame(columns=[<Mention all column names from SQL>])
## Print total no. of iterations
print("Total Iterations:", len(list_df))
for i in range(0,len(list_df)):
print("Iteration :",i)
temp_frame = list_df[i]
testList = temp_frame['customer_no']
## Pass smaller chunk of data to SQL(here I am passing a list of customers)
temp_DF = SQL_Query(tuple(testList))
print(temp_DF.shape[0])
## Append all the data retrieved from SQL to big_frame_2
big_frame_2=big_frame_2.append(temp_DF, ignore_index=True)

SQL values to update pandas dataframe

i am doing a lot of sql to pandas and i have run in to the following challenge.
I have a dataframe, that looks like
UserID, AccountNo, AccountName
123, 12345, 'Some name'
...
What i would like to do is for each account number, i would like to add a column called total revenue which is gotten from a mysql database, som i am thinking of something like,
for accountno in df['AccountNo']:
df1 = pd.read_sql(('select sum(VBRK_NETWR) as sum from sapdata2016.orders where VBAK_BSARK="ZEDI" and VBRK_KUNAG = %s;') % accountno, conn)
And i need to expand the the dataframe such that
UserID, AccountNo, AccountName, TotalRevenue
123, 12345, 'Some name', df1
...
The code that i have so far (and is not working casts a getitem error)
sets3 = []
i=0
for accountno in df5['kna1_kunnr']:
df1 = pd.read_sql(('select sum(VBRK_NETWR) as sum from sapdata2016.orders where VBAK_BSARK="ZEDI" and VBRK_KUNAG = %s;') % accountno, conn)
df2 = pd.DataFrame([(df5['userid'][i], df5['kna1_kunnr'][i], accountno, df5['kna1_name1'][i], df1['sum'][0])], columns=['User ID', 'AccountNo', 'tjeck', 'AccountName', 'Revenue'])
sets3.append(df2)
i += 1
df6 = pd.concat(sets3)
This idea/code is not pretty, and i wonder if there is a better/nicer way to do it, any ideas?
Consider exporting pandas data to MySQL as a temp table then run an SQL query that joins your pandas data and an aggregate query for TotalRevenue. Then, read resultset into pandas dataframe. This approach avoids any looping.
from sqlalchemy import create_engine
...
# SQL ALCHEMY CONNECTION (PREFERRED OVER RAW CONNECTION)
engine = create_engine('mysql://user:pwd#localhost/database')
# engine = create_engine("mysql+pymysql://user:pwd#hostname:port/database") # load pymysql
df1.to_sql("mypandastemptable", con=engine, if_exists='replace')
sql = """SELECT t.UserID, t.AccountNo, t.AccountName, agg.TotalRevenue
FROM mypandastemptable t
LEFT JOIN
(SELECT VBRK_KUNAG as AccountNo
SUM(VBRK_NETWR) as TotalRevenue
FROM sapdata2016.orders
WHERE VBAK_BSARK='ZEDI'
GROUP BY VBRK_KUNAG) agg
ON t.AccountNo = agg.AccountNo)
"""
newdf = pd.read_sql(sql, con=engine)
Of course the converse is true as well, merging on two pandas dataframes of existing dataframe and the grouped aggregate query resultset:
sql = """SELECT VBRK_KUNAG as AccountNo
SUM(VBRK_NETWR) as TotalRevenue
FROM sapdata2016.orders
WHERE VBAK_BSARK='ZEDI'
GROUP BY VBRK_KUNAG
"""
df2 = pd.read_sql(sql, con=engine)
newdf = df1.merge(df2, on='AccountNo', how='left')

Categories