I am doing a lot of SQL-to-pandas work and I have run into the following challenge.
I have a dataframe that looks like:
UserID, AccountNo, AccountName
123, 12345, 'Some name'
...
What I would like to do is, for each account number, add a column called TotalRevenue that is fetched from a MySQL database, so I am thinking of something like:
for accountno in df['AccountNo']:
    df1 = pd.read_sql(('select sum(VBRK_NETWR) as sum from sapdata2016.orders where VBAK_BSARK="ZEDI" and VBRK_KUNAG = %s;') % accountno, conn)
And I need to expand the dataframe such that:
UserID, AccountNo, AccountName, TotalRevenue
123, 12345, 'Some name', df1
...
The code that I have so far (which is not working; it raises a getitem error):
sets3 = []
i = 0
for accountno in df5['kna1_kunnr']:
    df1 = pd.read_sql(('select sum(VBRK_NETWR) as sum from sapdata2016.orders where VBAK_BSARK="ZEDI" and VBRK_KUNAG = %s;') % accountno, conn)
    df2 = pd.DataFrame([(df5['userid'][i], df5['kna1_kunnr'][i], accountno, df5['kna1_name1'][i], df1['sum'][0])], columns=['User ID', 'AccountNo', 'tjeck', 'AccountName', 'Revenue'])
    sets3.append(df2)
    i += 1
df6 = pd.concat(sets3)
This idea/code is not pretty, and I wonder if there is a better/nicer way to do it. Any ideas?
Consider exporting your pandas data to MySQL as a temporary table, then running a single SQL query that joins your pandas data to an aggregate query for TotalRevenue, and reading the result set back into a pandas dataframe. This approach avoids any looping.
from sqlalchemy import create_engine
...
# SQL ALCHEMY CONNECTION (PREFERRED OVER RAW CONNECTION)
engine = create_engine('mysql://user:pwd@localhost/database')
# engine = create_engine("mysql+pymysql://user:pwd@hostname:port/database") # load pymysql
df1.to_sql("mypandastemptable", con=engine, if_exists='replace')
sql = """SELECT t.UserID, t.AccountNo, t.AccountName, agg.TotalRevenue
FROM mypandastemptable t
LEFT JOIN
(SELECT VBRK_KUNAG as AccountNo,
        SUM(VBRK_NETWR) as TotalRevenue
FROM sapdata2016.orders
WHERE VBAK_BSARK='ZEDI'
GROUP BY VBRK_KUNAG) agg
ON t.AccountNo = agg.AccountNo
"""
newdf = pd.read_sql(sql, con=engine)
Of course the converse works as well: merge your existing dataframe with the grouped aggregate query result set in pandas:
sql = """SELECT VBRK_KUNAG as AccountNo
SUM(VBRK_NETWR) as TotalRevenue
FROM sapdata2016.orders
WHERE VBAK_BSARK='ZEDI'
GROUP BY VBRK_KUNAG
"""
df2 = pd.read_sql(sql, con=engine)
newdf = df1.merge(df2, on='AccountNo', how='left')
Related
I have a MS SQL Server with a lot of rows (around 4 million) covering all the customers and their information.
I can also get a list of phone numbers of all visitors of my website in a given timeframe as a CSV file, which I then convert to a dataframe in Python. What I want to do is select two columns from my server (one is the phone number and the other is a property of that person), but I only want to select these records for people who are in both my dataframe and my server.
What I currently do is select all customers from SQL Server and then merge them with my dataframe. But obviously this is not very fast. Is there any way to do this faster?
query2 = """
SELECT encrypt_phone, col2
FROM DatabaseTable
"""
cursor.execute(query2)
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
df1.merge(df2, how='inner', indicator=True)
If your DataFrame does not have many rows, I would do it the simple way, as here:
V = df["colx"].unique()
Q = 'SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN ({})'.format(','.join(['?']*len(V)))
cursor.execute(Q, tuple(V))
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
NB: colx and coly are the columns that refer to the customers (id, name, etc.) in the pandas DataFrame and in the SQL table, respectively.
Otherwise, you may need to store df1 as a table in your DB and then perform a sub-query:
df1.to_sql('DataFrameTable', conn, index=False) #this will store df1 in the DB
Q = "SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN (SELECT colx FROM DataFrameTable)"
df2 = pd.read_sql_query(Q, conn)
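One detail worth noting: to_sql raises an error if DataFrameTable already exists from a previous run (pandas defaults to if_exists='fail'); a small tweak, shown here as a sketch, recreates it on every run:
df1.to_sql('DataFrameTable', conn, index=False, if_exists='replace')  # drop and recreate the temp table each time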
I have an SQL table (table_1) that contains data, and I have a Python script that reads a CSV and creates a dataframe.
I want to compare the dataframe with the SQL table data and then insert the missing data from the dataframe into the SQL table.
I went and read the "comparing pandas dataframe with sqlite table via sqlquery" post and "Compare pandas dataframe columns to sql table dataframe columns", but was not able to do it.
The table and the dataframe have the exact same columns.
The dataframe is:
import pandas as pd
df = pd.DataFrame({'userid': [1, 2, 3],
                   'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})
and the SQL table (using SQLAlchemy):
userid user income
1 Bob 40000
2 Jane 42000
I'd like to compare the df to the SQL table and insert userid 3, Alice, with all her details; it's the only row missing between them.
Since you are only interested in inserting new records, and you are loading from a CSV, you will already have the data in local memory:
# read current userids
sql = pd.read_sql('SELECT userid FROM table_name', conn)
# keep only userids not in the sql table
df = df[~df['userid'].isin(sql['userid'])]
# insert new records
df.to_sql('table_name', conn, if_exists='append')
Other options would require first loading more data into SQL than needed.
There is still some information missing to provide a full answer. For example, which database layer do you use (SQLAlchemy, sqlite3)? I assume the id is unique and that all new ids should be added.
If you are using SQLAlchemy, you might take a look at pangres, which can insert into and update SQL databases from a pandas dataframe. It does, however, require a column with the UNIQUE property in the database (meaning that every entry in it is unique; you could set the id column UNIQUE here). This method scales better than loading all the data from the database and doing the comparison in Python, because only the CSV data is in memory and the database does the comparison.
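For illustration, a minimal sketch of the pangres route. It assumes a SQLAlchemy engine, that userid is the unique key, and pangres's upsert helper with the keyword names of recent versions (check the pangres docs for the exact signature of the version you have installed; the connection string is hypothetical):
import pandas as pd
from sqlalchemy import create_engine
from pangres import upsert  # assumption: pangres is installed

engine = create_engine('sqlite:///mydb.sqlite')  # hypothetical connection string

df = pd.DataFrame({'userid': [1, 2, 3],
                   'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})

# pangres expects the unique key to be the dataframe index
upsert(con=engine,
       df=df.set_index('userid'),
       table_name='table_1',
       if_row_exists='ignore')  # 'ignore' inserts only new ids; 'update' would also overwrite existing rows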
If you want to do it all in Python, an option is loading the SQL table into pandas and merging the data based on the userid columns:
import pandas as pd
df = pd.DataFrame({'userid': [0, 1, 2],'user': ['Bob', 'Jane', 'Alice'], 'income': [40000, 50000, 42000]})
sqldf = pd.read_sql_query("SELECT * FROM table_1",connection)
df = df.merge(sqldf, how='left', left_on='userid', right_on='userid')
Then you can replace the old table with the new table.
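For example, a one-line sketch of that last step, assuming the connection object and the table_1 name from the question:
df.to_sql('table_1', connection, if_exists='replace', index=False)  # drops and recreates table_1 with the merged data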
EDIT:
I saw another answer that uses merge but keeps only the new rows and sends just those to the database. That is cleaner than the code above.
Why not just left join the tables?
conn = ...  # your connection
df = pd.DataFrame({'userid': [1, 2, 3],
                   'user': ['Bob', 'Jane', 'Alice'],
                   'income': [40000, 50000, 42000]})
sql = pd.read_sql("SELECT * FROM table", con = conn)
joined = pd.merge(df, sql, how = "left", on = "userid")
joined = joined[pd.isna(joined["user_y"])]
index = joined["userid"].tolist()
The variable index now contains all userids that are in df but not in sql.
To insert them into the database:
columns = ("userid", "user", "income")
for i in index:
data = tuple(df[df["userid"] == i].values.tolist()[0])
data = [str(x) for x in data]
sql = f"""INSERT INTO table {columns}
VALUES {data}"""
conn.execute(sql)
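Alternatively, if conn is a SQLAlchemy engine or connection, the per-row loop can be collapsed into a single to_sql call on the filtered rows (a sketch, reusing the table name 'table' from above):
missing = df[df["userid"].isin(index)]  # only the rows whose userid was identified as missing
missing.to_sql("table", conn, if_exists="append", index=False)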
I am rewriting some code from SAS to Python using the pandas library.
I've got the following code, and I have no idea what I should do with it.
Can you help me, because it's too complicated for me to get right? I've changed the column names (to obscure sensitive data).
This is SAS code:
proc sql;
create table &work_lib..opk_do_inf_4 as
select distinct
*,
min(kat_opk) as opk_do_inf,
count(nr_ks) as ilsc_opk_do_kosztu_infr
from &work_lib..opk_do_inf_3
group by kod_ow, kod_sw, nr_ks, nr_ks_pr, nazwa_zabiegu_icd_9, nazwa_zabiegu
having kat_opk = opk_do_inf
;
quit;
This is my attempt in pandas:
df = self.opk_do_inf_3()  # create DF using another function
df['opk_do_inf'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['kat_opk'].min()
df['ilsc_opk_do_kosztu_infr'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['nr_ks'].count()
df_groupby = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu']).filter(lambda x: x['kat_opk']==x['opk_do_inf'])
df = df_groupby.reset_index()
df = df.drop_duplicates()
return df
First, calling SELECT * in an aggregate GROUP BY query is not valid SQL. SAS may allow it, but it can yield indeterminate results. Usually the SELECT columns should be limited to the columns in the GROUP BY clause.
With that said, aggregate SQL queries can generally be translated to pandas with groupby().agg() operations, with WHERE (filter before aggregation) or HAVING (filter after aggregation) conditions handled using either .loc or .query().
SQL
SELECT col1, col2, col3,
MIN(col1) AS min_col1,
AVG(col2) AS mean_col2,
MAX(col3) AS max_col3,
COUNT(*) AS count_obs
FROM mydata
GROUP BY col1, col2, col3
HAVING col1 = MIN(col1)
Pandas
General
agg_data = (mydata.groupby(["col1", "col2", "col3"], as_index=False)
.agg(min_col1 = ("col1", "min"),
mean_col2 = ("col2", "mean"),
max_col3 = ("col3", "max"),
count_obs = ("col1", "count"))
.query("col1 == min_col1")
)
Specific
opk_do_inf_4 = (mydata.groupby(["kat_opk", "kod_ow", "kod_sw", "nr_ks", "nr_ks_pr",
"nazwa_zabiegu_icd_9", "nazwa_zabiegu"],
as_index=False)
.agg(opk_do_inf = ("kat_opk", "min"),
ilsc_opk_do_kosztu_infr = ("nr_ks", "count"))
.query("kat_opk == opk_do_inf")
)
You can use the sqldf function from the pandasql package to run an SQL query on a dataframe (pandasql uses SQLite syntax, so use LIMIT rather than TOP). Example below:
from pandasql import sqldf
query = "select * from df limit 10"
newdf = sqldf(query, locals())
I want to merge an Excel file with a SQL table in pandas. Here's my code:
import pandas as pd
import pymysql
from sqlalchemy import create_engine
data1 = pd.read_excel('data.xlsx')
engine = create_engine('...cloudprovider.com/...')
data2 = pd.read_sql_query("select id, column3, column4 from customer", engine)
data = data1.merge(data2, on='id', how='left')
It works. Just to make it clearer:
If I input data1.columns, the output is Index(['id', 'column1', 'column2'], dtype='object')
If I input data2.columns, the output is Index(['id', 'column3', 'column4'], dtype='object')
If I input data.columns, the output is Index(['id', 'column1', 'column2', 'column3', 'column4'], dtype='object')
Since data2 is getting bigger, I can't query it entirely, so I want to query data2 only for the ids that exist in data1. How am I supposed to do this?
You could leverage the fact that SQLAlchemy is a great query builder. Either reflect the customer table, or build the metadata by hand:
from sqlalchemy import MetaData, select
metadata = MetaData()
metadata.reflect(engine, only=['customer'])
customer = metadata.tables['customer']
and build your query, letting SQLAlchemy worry about proper usage of placeholders, data conversion etc. You're looking for customer rows where id is in the set of ids from data1, achieved in SQL with the IN operator:
query = select([customer.c.id,
customer.c.column3,
customer.c.column4]).\
where(customer.c.id.in_(data1['id']))
data2 = pd.read_sql_query(query, engine)
If you wish to keep on using SQL strings manually, you could build a parameterized query as such:
placeholders = ','.join(['%s'] * data1['id'].count())
# Note that you're not formatting the actual values here, but placeholders
query = f"SELECT id, column3, column4 FROM customer WHERE id IN ({placeholders})"
data2 = pd.read_sql_query(query, engine, params=list(data1['id']))
In general it is beneficial to learn to use placeholders instead of mixing SQL and values by formatting/concatenating strings, as the latter may expose you to SQL injection if you are handling user-generated data. Usually you would write the required placeholders in the query string directly, but some string building is required if you have a variable number of parameters¹.
¹ Some DB-API drivers, such as psycopg2, allow passing tuples and lists as scalar values and know how to construct suitable SQL.
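As a sketch of that footnote (psycopg2 only, i.e. PostgreSQL rather than the MySQL engine used above; the connection details are hypothetical):
import psycopg2

conn = psycopg2.connect('dbname=mydb user=me')  # hypothetical DSN
cur = conn.cursor()
ids = tuple(data1['id'].tolist())  # .tolist() gives plain Python ints, which psycopg2 can adapt
# psycopg2 adapts a Python tuple to a parenthesised value list, so a single %s works with IN
cur.execute('SELECT id, column3, column4 FROM customer WHERE id IN %s', (ids,))
rows = cur.fetchall()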
Since you are looking for a condition like WHERE id IN (some_list), this should work for you:
id_list = data1['id'].tolist()
your_query = "select id, column3, column4 from customer where id in " + str(tuple(id_list))  # str() is needed for concatenation; note a one-element list renders as (x,), which is invalid SQL
data2 = pd.read_sql_query(your_query , engine)
Hope it works.
I have a list of data in one of my pandas dataframe columns, for which I want to query a SQL Server database. Is there any way I can query a SQL Server DB based on data I have in a pandas dataframe?
select * from table_name where customerid in pd.dataframe.customerid
In SAP, there is something called "For all entries in", where the SQL can query the DB based on the data available in an array; I was trying to find something similar.
Thanks.
If you are working with a tiny DataFrame, then the easiest way would be to generate a corresponding SQL query:
In [8]: df
Out[8]:
id val
0 1 21
1 3 111
2 5 34
3 12 76
In [9]: q = 'select * from tab where id in ({})'.format(','.join(['?']*len(df['id'])))
In [10]: q
Out[10]: 'select * from tab where id in (?,?,?,?)'
now you can read data from SQL Server:
from sqlalchemy import create_engine
conn = create_engine(...)
new = pd.read_sql(q, conn, params=tuple(df['id']))
NOTE: this approach will not work for bigger DFs, as the generated query (and/or the list of bind variables) might be too long either for pandas' read_sql() function or for SQL Server, or even for both.
For bigger DFs I would recommend writing your pandas DF to a SQL Server table and then using a SQL subquery to filter the needed data:
df[list_of_columns_to_save].to_sql('tmp_tab_name', conn, index=False, if_exists='replace')
q = "select * from tab where id in (select id from tmp_tab_name)"
new = pd.read_sql(q, conn)
This is a very familiar scenario, and one can use the code below to query SQL with a very large pandas dataframe. The parameter n needs to be tuned based on your SQL Server's memory; for me n = 25000 worked.
n = 25000  # chunk row size
## big_frame dataframe divided into smaller chunks of n rows each, stored in a list
list_df = [big_frame[i:i+n] for i in range(0, big_frame.shape[0], n)]
## Create another dataframe with column names as expected from SQL
big_frame_2 = pd.DataFrame(columns=[<Mention all column names from SQL>])
## Print total no. of iterations
print("Total Iterations:", len(list_df))
for i in range(0, len(list_df)):
    print("Iteration :", i)
    temp_frame = list_df[i]
    testList = temp_frame['customer_no']
    ## Pass the smaller chunk of data to SQL (here I am passing a list of customers)
    temp_DF = SQL_Query(tuple(testList))
    print(temp_DF.shape[0])
    ## Append all the data retrieved from SQL to big_frame_2
    ## (DataFrame.append was removed in pandas 2.0; use pd.concat there instead)
    big_frame_2 = big_frame_2.append(temp_DF, ignore_index=True)
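The SQL_Query helper is not defined in the snippet above; here is a minimal sketch of what it might look like, assuming pyodbc against SQL Server and the table_name/customerid names from the question (the connection string is hypothetical):
import pandas as pd
import pyodbc

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes')  # hypothetical connection string

def SQL_Query(customer_tuple):
    # one '?' placeholder per customer number; the driver binds the values safely
    placeholders = ','.join(['?'] * len(customer_tuple))
    query = f'SELECT * FROM table_name WHERE customerid IN ({placeholders})'
    return pd.read_sql(query, conn, params=list(customer_tuple))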