Use case:
We have nested queries and our tables have 10 to 20 million rows. Here our intention is to reduce the query CPU time by smart filter
I like to filter my columns in pd.read_sql by other data frame column name. Is that possible?
Step 1: df1 data frame age1 and age3 are my future filter columns for pd.read_sql
raw_data1 = {'age1': [23,45,21],'age2': [10,20,50], 'age3':['forty','fortyone','fortyfour']}
df1 = pd.DataFrame(raw_data1, columns = ['age1','age2','age3'])
df1
Step2: I like take age1 from above df1 dataframe want to use in below pd.read_sql like below to get item1 dataframe
item1 = pd.read_sql("""
SELECT * from [dbo].[ITEM]
where item_age1 = df1.age1
""", conn)
Step3: I like to take age3 from above df1 dataframe want to use in below pd.read_sql like below to get item2 dataframe
item2 = pd.read_sql("""
SELECT * from [dbo].[ITEM]
where item_age3 = df1.age3
""", conn)
Use a parameterized query:
item2 = pd.read_sql("""
SELECT * from [dbo].[ITEM]
where item_age3 IN ({})
""".format(','.join('?'*len(df1.age3))), conn,
params=list(df1.age3))
depending on the database backend this syntax may use '%s' or %(name)s instead of '?'. See PEP249 paramstyle for more information.
Related
I have a MS sql server with a lot of rows( around 4 million) from all the customers and their information.
I can also get a list of phone numbers of all visitors of my website in a given timeframe that I can get in a csv file and then covert to a dataframe in python. What I want to do is to select two columns from my server(one is the phone number and the other one is a property of that person) but I only want to select this records from people who are in both my dataframe and my server.
What I currently do is selecting all customers from sql server and then merge them with my dataframe. But obviously this is not very fast. Is there any way to do this faster?
query2 = """
SELECT encrypt_phone, col2
FROM DatabaseTable
"""
cursor.execute(query2)
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
df1.merge(df2, how='inner', indicator=True)
If your DataFrame have not many rows, I would do it the simple way as here :
V = df["colx"].unique()
Q = 'SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN ({})'.format(','.join(['?']*len(V)))
cursor.execute(Q, tuple(V))
tables = cursor.fetchall()
df2 = pd.DataFrame.from_records(tables, columns=[x[0] for x in cursor.description])
NB : colx and coly are the columns that refer to the customers (id, or name, ..) in the pandas DataFrame and in the SQL table, respectively.
Otherwise, you may need to store df1 as a table in your DB and then perform a sub-query :
df1.to_sql('DataFrameTable', conn, index=False) #this will store df1 in the DB
Q = "SELECT encrypt_phone, col2 FROM DatabaseTable WHERE coly IN (SELECT colx FROM DataFrameTable)"
df2 = pd.read_sql_query(Q, conn)
I rewrite some code from SAS to Python using Pandas library.
I've got such code, and I have no idea what should I do with it?
Can you help me, beacase its too complicated for me to do it correct. I've changed the name of columns (for encrypt sensitive data)
This is SAS code:
proc sql;
create table &work_lib..opk_do_inf_4 as
select distinct
*,
min(kat_opk) as opk_do_inf,
count(nr_ks) as ilsc_opk_do_kosztu_infr
from &work_lib..opk_do_inf_3
group by kod_ow, kod_sw, nr_ks, nr_ks_pr, nazwa_zabiegu_icd_9, nazwa_zabiegu
having kat_opk = opk_do_inf
;
quit;
This is my try in Pandas:
df = self.opk_do_inf_3() -> create DF using other function
df['opk_do_inf'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['kat_opk'].min()
df['ilsc_opk_do_kosztu_infr'] = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu'])['nr_ks'].count()
df_groupby = df.groupby(by=['kod_ow', 'kod_sw', 'nr_ks', 'nr_ks_pr', 'nazwa_zabiegu_icd_9', 'nazwa_zabiegu']).filter(lambda x: x['kat_opk']==x['opk_do_inf'])
df = df_groupby.reset_index()
df = df.drop_duplcates()
return df
First, calling SELECT * in an aggregate GROUP BY query is not valid SQL. SAS may allow it but can yield unknown results. Usually SELECT columns should be limited to columns in GROUP BY clause.
With that said, aggregate SQL queries can generally be translated in Pandas with groupby.agg() operations with WHERE (filter before aggregation) or HAVING (filter after aggregation) conditions handled using either .loc or query.
SQL
SELECT col1, col2, col3,
MIN(col1) AS min_col1,
AVG(col2) AS mean_col2,
MAX(col3) AS max_col3,
COUNT(*) AS count_obs
FROM mydata
GROUP BY col1, col2, col3
HAVING col1 = min(col1)
Pandas
General
agg_data = (mydata.groupby(["col1", "col2", "col3"], as_index=False)
.agg(min_col1 = ("col1", "min"),
mean_col2 = ("col2", "mean"),
max_col3 = ("col3", "max"),
count_obs = ("col1", "count"))
.query("col1 == min_col1")
)
Specific
opk_do_inf_4 = (mydata.groupby(["kat_opk", "kod_ow", "kod_sw", "nr_ks", "nr_ks_pr",
"nazwa_zabiegu_icd_9", "nazwa_zabiegu"],
as_index=False)
.agg(opk_do_inf = ("kat_opk", "min"),
ilsc_opk_do_kosztu_infr = ("nr_ks", "count"))
.query("kat_opk == opk_do_inf")
)
You can use the sqldf function from pandasql package to run the sql query on dataframe. example below
''' from pandasql import sqldf
query = "select top 10 * from df "
newdf = sqldf(query, locals())
'''
I have a database in sqlite with c.300 tables. Currently i am iterating through a list and appending the data.
Is there a faster way / more pythonic way of doing this?
df = []
for i in Ave.columns:
try:
df2 = get_mcap(i)
df.append(df2)
#print (i)
except:
pass
df = pd.concat(df, axis=0
Ave is a dataframe where the column in the list i want to iterate through.
def get_mcap(Ticker):
cnx = sqlite3.connect('Market_Cap.db')
df = pd.read_sql_query("SELECT * FROM '%s'"%(Ticker), cnx)
df.columns = ['Date', 'Mcap-Ave', 'Mcap-High', 'Mcap-Low']
df = df.set_index('Date')
df.index = pd.to_datetime(df.index)
cnx.close
return df
Before I post my solution, I should include a quick warning that you should never use string manipulation to generate SQL queries unless it's absolutely unavoidable, and in such cases you need to be certain that you are in control of the data which is being used to format the strings and it won't contain anything that will cause the query to do something unintended.
With that said, this seems like one of those situations where you do need to use string formatting, since you cannot pass table names as parameters. Just make sure there's no way that users can alter what is contained within your list of tables.
Onto the solution. It looks like you can get your list of tables using:
tables = Ave.columns.tolist()
For my simple example, I'm going to use:
tables = ['table1', 'table2', 'table3']
Then use the following code to generate a single query:
query_template = 'select * from {}'
query_parts = []
for table in tables:
query = query_template.format(table)
query_parts.append(query)
full_query = ' union all '.join(query_parts)
Giving:
'select * from table1 union all select * from table2 union all select * from table3'
You can then simply execute this one query to get your results:
cnx = sqlite3.connect('Market_Cap.db')
df = pd.read_sql_query(full_query, cnx)
Then from here you should be able to set the index, convert to datetime etc, but now you only need to do these operations once rather than 300 times. I imagine the overall runtime of this should now be much faster.
I have a list of data in one of the pandas dataframe column for which I want to query SQL Server database. Is there any way I can query a SQL Server DB based on data I have in pandas dataframe.
select * from table_name where customerid in pd.dataframe.customerid
In SAP, there is something called "For all entries in" where the SQL can query the DB based on the data available in the array, I was trying to find something similar.
Thanks.
If you are working with tiny DataFrame, then the easiest way would be to generate a corresponding SQL:
In [8]: df
Out[8]:
id val
0 1 21
1 3 111
2 5 34
3 12 76
In [9]: q = 'select * from tab where id in ({})'.format(','.join(['?']*len(df['id'])))
In [10]: q
Out[10]: 'select * from tab where id in (?,?,?,?)'
now you can read data from SQL Server:
from sqlalchemy import create_engine
conn = create_engine(...)
new = pd.read_sql(q, conn, params=tuple(df['id']))
NOTE: this approach will not work for bigger DF's as the generated query (and/or list of bind variables) might bee too long either for Pandas to_sql() function or for SQL Server or even for both.
For bigger DFs I would recommend to write your pandas DF to SQL Server table and then use SQL subquery to filter needed data:
df[list_of_columns_to_save].to_sql('tmp_tab_name', conn, index=False)
q = "select * from tab where id in (select id from tmp_tab_name)"
new = pd.read_sql(q, conn, if_exists='replace')
This is a very familiar scenario and one can use the below code to Query SQL using a very large pandas dataframe. The parameter n needs to be manipulated based on your SQL server memory. For me n=25000 worked.
n = 25000 #chunk row size
## Big_frame dataframe divided into smaller chunks of n into a list
list_df = [big_frame[i:i+n] for i in range(0,big_frame.shape[0],n)]
## Create another dataframe with columns names as expected from SQL
big_frame_2 = pd.DataFrame(columns=[<Mention all column names from SQL>])
## Print total no. of iterations
print("Total Iterations:", len(list_df))
for i in range(0,len(list_df)):
print("Iteration :",i)
temp_frame = list_df[i]
testList = temp_frame['customer_no']
## Pass smaller chunk of data to SQL(here I am passing a list of customers)
temp_DF = SQL_Query(tuple(testList))
print(temp_DF.shape[0])
## Append all the data retrieved from SQL to big_frame_2
big_frame_2=big_frame_2.append(temp_DF, ignore_index=True)
I am writing spark code in python.
How do I pass a variable in a spark.sql query?
q25 = 500
Q1 = spark.sql("SELECT col1 from table where col2>500 limit $q25 , 1")
Currently the above code does not work? How do we pass variables?
I have also tried,
Q1 = spark.sql("SELECT col1 from table where col2>500 limit q25='{}' , 1".format(q25))
You need to remove single quote and q25 in string formatting like this:
Q1 = spark.sql("SELECT col1 from table where col2>500 limit {}, 1".format(q25))
Update:
Based on your new queries:
spark.sql("SELECT col1 from table where col2>500 order by col1 desc limit {}, 1".format(q25))
Note that the SparkSQL does not support OFFSET, so the query cannot work.
If you need add multiple variables you can try this way:
q25 = 500
var2 = 50
Q1 = spark.sql("SELECT col1 from table where col2>{0} limit {1}".format(var2,q25))
Another option if you're doing this sort of thing often or want to make your code easier to re-use is to use a map of configuration variables and the format option:
configs = {"q25":10,
"TABLE_NAME":"my_table",
"SCHEMA":"my_schema"}
Q1 = spark.sql("""SELECT col1 from {SCHEMA}.{TABLE_NAME}
where col2>500
limit {q25}
""".format(**configs))
Using f-Strings approach (PySpark):
table = 'my_schema.my_table'
df = spark.sql(f'select * from {table}')
A really easy solution is to store the query as a string (using the usual python formatting), and then pass it to the spark.sql() function:
q25 = 500
query = "SELECT col1 from table where col2>500 limit {}".format(q25)
Q1 = spark.sql(query)
All you need to do is add s (String interpolator) to the string. This allows the usage of variable directly into the string.
val q25 = 10
Q1 = spark.sql(s"SELECT col1 from table where col2>500 limit $q25)