Read from SQL Server with Python using a few parameters from a DataFrame

I need to read from a SQL Server database using these parameters:
a period of time taken from an uploaded DataFrame (the date of the order and the date one month later)
client IDs from the same DataFrame
So I have something like this:
sql_sales = """
SELECT
dt,
clientID,
cost
WHERE
dt between %(date1)s AND %(date2)s
AND kod in %(client)s
"""
And I have a DataFrame df with the columns:
clientsID
date of order
date after month
I can pass a list of clients, but the code needs to query the database with a few lists of parameters (two of which make up the period).
sales = sales.append(pd.read_sql(sql_sales, conn, params={'client': df['clientsID'].tolist()}))

The way I got something similar to work in the past was to put {} placeholders in the query and then call .format with the parameters listed in order; that way you don't need the params argument at all. Also, if you are using IN in SQL, you need to build a Python tuple from the client list. For the line dt between {} AND {}, you may also be able to write dt between ? AND ?.
client = tuple(df['clientsID'].tolist())
sql_sales = """
SELECT
    dt,
    clientID,
    cost
WHERE
    dt between '{}' AND '{}'
    AND kod in {}
""".format(date1, date2, client)
sales = sales.append(pd.read_sql(sql_sales, conn))
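If you'd rather keep real parameter binding (avoiding quoting problems and SQL injection), here is a minimal sketch that builds one ? placeholder per client id; your_table is a hypothetical name, since the original query omits its FROM clause:
import pandas as pd

clients = df['clientsID'].tolist()
placeholders = ', '.join('?' * len(clients))  # e.g. '?, ?, ?' for three ids
sql_sales = f"""
SELECT dt, clientID, cost
FROM your_table  -- hypothetical table name; the original query omits FROM
WHERE dt BETWEEN ? AND ?
  AND kod IN ({placeholders})
"""
sales = pd.read_sql(sql_sales, conn, params=[date1, date2, *clients])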


Iterate a SQL query via PYODBC and collect the results into a Pandas DF

I am trying to run a query over and over again for all dates in a date range and collect the results into a Pandas DF for each iteration.
I established a connection (PYODBC) and created a list of dates I would like to run through the SQL query to aggregate into a DF. I confirmed that the dates are a list.
link = pyodbc.connect( Connection Details )
date = [d.strftime('%Y-%m-%d') for d in pd.date_range('2020-10-01','2020-10-02')]
type(date)
I created an empty DF to collect the results for each iteration of the SQL query and checked the structure.
empty = pd.DataFrame(columns = ['Date', 'Balance'])
empty
I have the query set up as so:
sql = """
Select dt as "Date", sum(BAL)/1000 as "Balance"
From sales as bal
where bal.item IN (1,2,3,4)
AND bal.dt = '{}'
group by "Date";
""".format(day)
I tried the following for loop in the hopes of aggregating the results of each query execution into the empty df, but I get a blank df.
for day in date:
    a = pd.read_sql_query(sql, link)
    empty.append(a)
Any ideas if the issue is related to the SQL setup and/or for loop? A better more efficient way to tackle the issue?
Avoid the loop and run a single SQL query by adding dt to the GROUP BY clause and passing the start and end dates as parameters for filtering. Also, use the preferred parameterization method instead of string formatting, which pandas.read_sql does support:
# PREPARED STATEMENT WITH ? PLACEHOLDERS
sql = """SELECT dt AS "Date"
              , SUM(BAL)/1000 AS "Balance"
         FROM sales
         WHERE item IN (1,2,3,4)
           AND dt BETWEEN ? AND ?
         GROUP BY dt;
      """

# BIND PARAMS TO QUERY; RETURN A SINGLE DATA FRAME
df = pd.read_sql(sql, link, params=['2020-10-01', '2020-10-02'])
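Since pyodbc uses qmark placeholders, the bound parameters don't have to be strings; a small sketch (assuming the same sql and link as above) passing Python date objects instead:
from datetime import date
import pandas as pd

df = pd.read_sql(sql, link, params=[date(2020, 10, 1), date(2020, 10, 2)])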
Looks like you didn't define the day variable when you generated sql.
This may help:
def sql_gen(day):
    sql = """
    Select dt as "Date", sum(BAL)/1000 as "Balance"
    From sales as bal
    where bal.item IN (1,2,3,4)
    AND bal.dt = '{}'
    group by "Date";
    """.format(day)
    return sql

for day in date:
    a = pd.read_sql_query(sql_gen(day), link)
    empty = empty.append(a)
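Note that DataFrame.append returns a new DataFrame rather than modifying empty in place, which is why the original loop produced a blank result. If the loop is kept, a minimal sketch that collects the per-day frames in a list and concatenates once at the end (cheaper than appending repeatedly):
import pandas as pd

frames = []
for day in date:
    frames.append(pd.read_sql_query(sql_gen(day), link))
result = pd.concat(frames, ignore_index=True)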

Bigquery Partition Date Python

I would like to write the results to another table in BigQuery, partitioned by date, but I couldn't find how to do it. I use Python and the Google Cloud client library, and I want to create the table using standard SQL, but I get an error.
Error : google.api_core.exceptions.BadRequest: 400 GET https://bigquery.googleapis.com/bigquery/v2/projects/astute-baton-272707/queries/f4b9dadb-1390-4260-bb0e-fb525aff662c?maxResults=0&location=US: The number of columns in the column definition list does not match the number of columns produced by the query at [2:72]
Please let me know if there is another solution. Inserting into the table day by day is the next stage of the project. I may have been doing it wrong from the beginning; I am not sure.
Thank you.
client = bigquery.Client()
sql = """
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy (visitStartTime_ts INT64,date TIMESTAMP,hitsTime_ts INT64,appId STRING,fullVisitorId STRING,cUserId STRING,eventCategory STRING,eventLabel STRING,player_type STRING,PLAY_SESSION_ID STRING,CHANNEL_ID STRING,CONTENT_EPG_ID STRING,OFF_SET STRING)
PARTITION BY date
OPTIONS (
description="weather stations with precipitation, partitioned by day"
) AS
select
FORMAT_TIMESTAMP("%Y-%m-%d %H:%M:%S", TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime AS INT64)), "Turkey") AS visitStartTime_ts,
date
,FORMAT_TIMESTAMP("%Y-%m-%d %H:%M:%S", TIMESTAMP_SECONDS(SAFE_CAST(visitStartTime+(h.time/1000) AS INT64)), "Turkey") AS hitsTime_ts
,h.appInfo.appId as appId
,fullVisitorId
,(SELECT value FROM h.customDimensions where index=1) as cUserId
,h.eventInfo.eventCategory as eventCategory
,h.eventInfo.eventAction as eventAction
,h.eventInfo.eventLabel as eventLabel
,REPLACE(SPLIT(h.eventInfo.eventCategory,'/{')[OFFSET(1)],'}','') as player_type
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(0)] as PLAY_SESSION_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(1)] as CHANNEL_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(2)] as CONTENT_EPG_ID
,SPLIT(h.eventInfo.eventLabel,'|')[OFFSET(3)] as OFF_SET
FROM `zzzzz.yyyyyy.xxxxxx*` a,
UNNEST(hits) AS h
where
1=1
and SPLIT(SPLIT(h.eventInfo.eventCategory,'/{')[OFFSET(0)],'/')[OFFSET(0)] like 'player'
and _TABLE_SUFFIX BETWEEN FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND FORMAT_DATE('%Y%m%d',DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY))
AND (BYTE_LENGTH(h.eventInfo.eventCategory) - BYTE_LENGTH(REPLACE(h.eventInfo.eventCategory,'/{','')))/2 + 1 = 2
AND h.eventInfo.eventAction='heartBeat'
"""
job = client.query(sql)  # API request.
job.result()  # Waits for the query to finish.
A quick solution for the problem presented here: when creating a table from a query, you don't need to declare its schema, because the data (and therefore the schema) comes from the query itself. Right now there's a conflict between the declared schema and the query output: the column definition list declares 13 columns, but the SELECT produces 14 (it also selects eventAction, which is missing from the list). So remove one of the two.
Instead of starting the query with:
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy (visitStartTime_ts INT64,date TIMESTAMP,hitsTime_ts INT64,appId STRING,fullVisitorId STRING,cUserId STRING,eventCategory STRING,eventLabel STRING,player_type STRING,PLAY_SESSION_ID STRING,CHANNEL_ID STRING,CONTENT_EPG_ID STRING,OFF_SET STRING)
PARTITION BY date
Start the query with:
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy
PARTITION BY date
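For completeness, a minimal sketch of running the corrected statement with the Python client; the SELECT body is the unchanged one from the question and is elided here:
from google.cloud import bigquery

client = bigquery.Client()
sql = """
CREATE OR REPLACE TABLE zzzzz.xxxxx.yyyyy
PARTITION BY date
AS
SELECT ...  -- the same SELECT as in the question
"""
job = client.query(sql)  # starts the DDL statement
job.result()             # blocks until it finishes (raises on error)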

Incorrect date being returned from SQL DB with Python script

I have an SQL DB which I am trying to extract data from. When I extract date/time values my script adds three zeros to the date/time value, like so: 2011-05-03 15:25:26.170000
Below is my code in question:
value_Time = ('SELECT TOP (4) [TimeCol] FROM [database1].[dbo].[table1]')
cursor.execute(value_Time)
for Timerow in cursor:
    print(Timerow)
Time_list = [elem for elem in Timerow]
The desired result is that there are no additional three zeros at the end of the date/time value, so that I can insert it into a different database.
Values within Time_list contain the incorrect date/time values, as does the Timerow value.
Any help with this would be much appreciated!
from datetime import datetime

value_Time = ('SELECT TOP (4) [TimeCol] FROM [database1].[dbo].[table1]')
cursor.execute(value_Time)
row = cursor.fetchone()
for i in range(len(row)):
    var = datetime.strftime(row[i], '%Y-%m-%d %H:%M:%S')
    print(var)
I think you need a wrapper around your date control, for example "yyyy/mm/dd/hh/mm/ss" or "yyyymmddhhmmss":
Format((Datecontrol), "yyyy/mm/dd/hh/mm/ss")
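Another option is to format at query time: a sketch (assuming the same cursor as above) using T-SQL's CONVERT with style 120, which yields yyyy-mm-dd hh:mi:ss with no fractional seconds:
value_Time = ("SELECT TOP (4) CONVERT(varchar(19), [TimeCol], 120) AS TimeCol "
              "FROM [database1].[dbo].[table1]")
cursor.execute(value_Time)
for Timerow in cursor:
    print(Timerow.TimeCol)  # e.g. 2011-05-03 15:25:26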

Looping Python Parameters Through SQL Code

I need to make the following report scalable:
query = """
(SELECT
'02/11/2019' as Week_of,
media_type,
campaign,
count(ad_start_ts) as frequency
FROM usotomayor.digital
WHERE ds between 20190211 and 20190217
GROUP BY 1,2,3)
UNION ALL
(SELECT
'02/18/2019' as Week_of,
media_type,
campaign,
count(ad_start_ts) as frequency
FROM usotomayor.digital
WHERE ds between 20190211 and 20190224
GROUP BY 1,2,3)
"""
#Converting to dataframe
query2 = spark.sql(query).toPandas()
query2
However, as you can see I cannot make this report scalable if I have a long list of dates for each SQL query that I need to union.
My first attempt at looping in a list of date variables into the SQL script is as follows:
dfys = ['20190217','20190224']
df2 = ['02/11/2019','02/18/2019']

for i in df2:
    date = i
    for j in dfys:
        date2 = j
        query = f"""
        SELECT
        '{date}' as Week_of,
        raw.media_type,
        raw.campaign,
        count(raw.ad_start_ts) as frequency
        FROM usotomayor.digital raw
        WHERE raw.ds between 20190211 and {date2}
        GROUP BY 1,2,3
        """

#Converting to dataframe
query2 = spark.sql(query).toPandas()
query2
However, this is not working for me. I think I need to loop through the sql query itself, but I don't know how to do this. Can someone help me?
As a commenter said, "this is not working for me" is not very specific, so let's start by specifying the problem: you need to execute a query for each pair of dates, run those queries in a loop, and save each result (or actually union them, but then you'd need to change your query logic).
You could do it like this:
dfys = ['20190217', '20190224']
df2 = ['02/11/2019', '02/18/2019']

query_results = list()
for week_of, end_date in zip(df2, dfys):
    query = f"""
    SELECT
        '{week_of}' as Week_of,
        raw.media_type,
        raw.campaign,
        count(raw.ad_start_ts) as frequency
    FROM usotomayor.digital raw
    WHERE raw.ds between 20190211 and {end_date}
    GROUP BY 1,2,3
    """
    query_results.append(spark.sql(query).toPandas())

query_results[0]
query_results[1]
Now you get a list of your results (query_results).
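If you want one combined report rather than a list, the pieces can be concatenated afterwards; a small sketch, assuming pandas is imported as pd:
import pandas as pd

report = pd.concat(query_results, ignore_index=True)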

join & f.write behaviour not as expected

I have a query running in SQL which returns the results into a variable via a loop, then puts that into an HTML file. When I test this by printing to the console in Jupyter Notebook, it prints as expected: the next 30 days of the calendar, in date order.
However, when I tell it to join the data using
dates = ''.join(variable)
it seems to not only reorder the dates, so that the 13th of August sits oddly before the 13th of July, but also repeats the date divs 4 times in the page. See below for the full code:
from os import getenv
import pyodbc
import os

cnxn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER=MYVM\SQLEXPRESS;DATABASE=MyTables;UID=test;PWD=t')
cursor = cnxn.cursor()  # makes connection
cursor.execute('DECLARE @today as date SET @today = GetDate() SELECT style112, day, month, year, dayofweek, showroom_name, isbusy from ShowroomCal where Date Between @today and dateadd(month,1,@today) order by style112')  # runs statement

inset = []
row = cursor.fetchone()
while row is not None:
    inset = inset + ['<div class="' + str(row.isbusy) + '">' + str(row.day) + '</div>']
    row = cursor.fetchone()

dates = ''.join(inset)
f = open("C:\\tes.html", 'r')  # open file with read permissions
filedata = f.read()  # read contents
f.close()  # closes file
filedata = filedata.replace("{inset}", dates)
#os.remove("c:\\inetpub\\wwwroot\\cal\\tes.html")
f = open("c:\\inetpub\\wwwroot\\cal\\tes.html", 'w')
f.write(filedata)  # update it, replacing the previous strings
f.close()  # closes the file
cnxn.close()
''.join() does not alter the order in any way. If you get a different order then the database query produced rows in a different order.
I don't think you are telling the database to order your results by date. You order by style112, and the database is free to order values with the same style112 column value in any order it pleases. If style112 doesn't include date information (as a year, month, day sequence of fixed length) and date order is important, tell the database to use a correct order! Here that'd include year, month, day at the very least.
I'd also refactor the code to avoid quadratic performance behaviour; the inset = inset + [....] expression has to create a new list object each time, copying across all elements from inset and the new list into that. When adding N elements to a list this way, Python has to execute N * N steps. For 1000 elements, that's 1 million steps to execute! Use list.append() to add single elements, which will reduce the workload to roughly N steps.
You can loop directly over a cursor; this is more efficient as it can buffer rows, whereas cursor.fetchone() can't assume you'll fetch more data. A for row in cursor: loop is also more readable.
You can also use string formatting rather than string concatenation, it'll help avoid all those str() calls and redundancy, as well as further reduce performance issues; all those string concatenations also create and recreate a lot of intermediate string objects that you don't need to create at all.
So use this:
cnxn = pyodbc.connect(
    'DRIVER={ODBC Driver 13 for SQL Server};SERVER=MYVM\SQLEXPRESS;'
    'DATABASE=MyTables;UID=test;PWD=t')
cursor = cnxn.cursor()
cursor.execute('''
    DECLARE @today as date
    SET @today = GetDate()
    SELECT
        style112, day, month, year, dayofweek, showroom_name, isbusy
    from ShowroomCal
    where Date Between @today and dateadd(month,1,@today)
    order by year, month, day, style112
''')

inset = []
for row in cursor:
    inset.append(
        '<div class="{r.isbusy}">'
        '<a href="#" id="{r.style112}"'
        ' onclick="parent.updateField(field38, {r.style112});">'
        '{r.day}</a></div>'.format(r=row))

with open(r"C:\tes.html") as f:
    template = f.read()

html = template.format(inset=''.join(inset))

with open(r"C:\inetpub\wwwroot\cal\tes.html", 'w') as output:
    output.write(html)
Note: if any of your database data was entered by your users, you must ensure that the data is properly escaped for inclusion in HTML first, or you'll leave yourself open to XSS (cross-site scripting) attacks. Personally, I'd use an HTML templating engine with default escaping support, such as Jinja.
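For illustration, a minimal jinja2 sketch with autoescaping enabled (the template string and values here are illustrative only):
from jinja2 import Environment

env = Environment(autoescape=True)  # HTML-escapes substituted values by default
tmpl = env.from_string('<div class="{{ isbusy }}">{{ day }}</div>')
print(tmpl.render(isbusy='busy', day='<b>13</b>'))
# -> <div class="busy">&lt;b&gt;13&lt;/b&gt;</div>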
