I was loading my data from individual CSV files into a DataFrame using
col_names = ['created_date', 'latitude', 'longitude']
df = pd.read_csv('data.csv', names=col_names, sep=',', skiprows=1)
This separates the data nicely into named columns and skips the first row, which holds the column labels.
However, I wanted to automate the process with a for loop that runs the same query for every user. My function goes:
sql = "select distinct mobile_user_id from score where speed_range_id > 1"
distance_query = """SELECT created_date, latitude, longitude FROM score s where s.mobile_user_id = %(mobile_user_id)s and speed_range_id > 1 group by latitude, longitude order by id asc"""
cursor1.execute(sql)
result = cursor1.fetchall()
for rowdict in result:
    distance = cursor3.execute(distance_query, rowdict)
    distance_result = cursor3.fetchall()
    df = pd.read_sql_query(distance_query, rdsConn, params={rowdict})
As you can see, the result variable holds the list of users, and I want to iterate through all of them to generate a dataset for every user.
I've been trying to use pd.read_sql_query, but I've been unable to pass the mobile user parameter, which is rowdict, to the query.
How can I pass that variable using pandas? And how can I organize my data the way I had it before?
sample of the data.csv:
created_date, latitude, longitude
"2018-05-24 17:46:25", 20.61844841, -100.40813424
"2018-05-24 21:03:02", 20.58469452, -100.39204018
"2018-05-25 10:29:57", 20.61180308, -100.40826959
"2018-05-25 21:02:43", 20.59868518, -100.37825344
Any help is appreciated.
Consider running pure SQL combining both queries by adding a WHERE clause to your aggregate query.
Currently, you are attempting a WHERE clause that compares one value per row to many values: where mobile_user_id = %(mobile_user_id)s, which will never be equal. Plus, your prepared statement does not have the same number of placeholders as parameter values. Possibly you meant where mobile_user_id IN (?, ?, ?, ?, ?, ...), which involves dynamically setting the placeholders.
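For illustration, dynamically building that placeholder list could look like this (a sketch only, not your existing code; it assumes the dictionary cursor and %s paramstyle implied by the question):
user_ids = [row['mobile_user_id'] for row in result]   # assumes cursor1 returns dict rows
placeholders = ', '.join(['%s'] * len(user_ids))        # %s for MySQL-style drivers, ? for others
in_query = f"SELECT created_date, latitude, longitude FROM score WHERE mobile_user_id IN ({placeholders})"
cursor3.execute(in_query, user_ids)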
Nonetheless, you can simply run one aggregate query and import the result set into pandas. Specifically, add mobile_user_id as a grouping column in the query:
sql = """select mobile_user_id, created_date, latitude, longitude
from score
where speed_range_id > 1
group by mobile_user_id, created_date, latitude, longitude
order by id asc
"""
df = pd.read_sql_query(sql, rdsConn)
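And if you still want one dataset per user, as you had with the individual CSV files, you can split that single DataFrame afterwards (a minimal sketch using pandas groupby):
# one DataFrame per user, keyed by mobile_user_id
per_user = {uid: grp[['created_date', 'latitude', 'longitude']].reset_index(drop=True)
            for uid, grp in df.groupby('mobile_user_id')}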
I am trying to run a query over and over again for all dates in a date range and collect the results into a Pandas DF for each iteration.
I established a connection (PYODBC) and created a list of dates I would like to run through the SQL query to aggregate into a DF. I confirmed that the dates are a list.
link = pyodbc.connect( Connection Details )
date = [d.strftime('%Y-%m-%d') for d in pd.date_range('2020-10-01','2020-10-02')]
type(date)
I created an empty DF to collect the results for each iteration of the SQL query and checked the structure.
empty = pd.DataFrame(columns = ['Date', 'Balance'])
empty
I have the query set up as so:
sql = """
Select dt as "Date", sum(BAL)/1000 as "Balance"
From sales as bal
where bal.item IN (1,2,3,4)
AND bal.dt = '{}'
group by "Date";
""".format(day)
I tried the following for loop in the hopes of aggregating the results of each query execution into the empty df, but I get a blank df.
for day in date:
    a = (pd.read_sql_query(sql, link))
    empty.append(a)
Any ideas whether the issue is related to the SQL setup and/or the for loop? Is there a better, more efficient way to tackle this?
Avoid the loop and run a single SQL query, adding the date as a GROUP BY column and passing the start and end dates as parameters for filtering. Also, use the preferred parameterization method, which pandas.read_sql does support, instead of string formatting:
# PREPARED STATEMENT WITH ? PLACEHOLDERS
sql = """SELECT dt AS "Date"
, SUM(BAL)/1000 AS "Balance"
FROM sales
WHERE item IN (1,2,3,4)
AND dt BETWEEN ? AND ?
GROUP BY dt;
"""
# BIND PARAMS TO QUERY; RETURN IN A SINGLE DATA FRAME
df = pd.read_sql(sql, link, params=['2020-10-01', '2020-10-02'])
Looks like you didn't define the day variable when you generated sql.
This may help:
def sql_gen(day):
    sql = """
    Select dt as "Date", sum(BAL)/1000 as "Balance"
    From sales as bal
    where bal.item IN (1,2,3,4)
    AND bal.dt = '{}'
    group by "Date";
    """.format(day)
    return sql
for day in date:
    a = pd.read_sql_query(sql_gen(day), link)
    empty = empty.append(a)  # DataFrame.append returns a new frame, so reassign it (or collect the frames and use pd.concat)
I am trying to optimize the performance of a simple query to a SQLite database by using indexing. As an example, the table has 5M rows, 5 columns; the SELECT statement is to pick up all columns and the WHERE statement checks for only 2 columns. However, unless I have all columns in the multi-column index, the performance of the query is worse than without any index.
Did I index the column incorrectly, or when selecting all columns, am I supposed to include all of them in the index in order to improve performance?
Below each case # is the result I got when creating the SQLite database on hard disk. However, for some reason, using the ':memory:' mode made all of the indexed cases faster than the no-index case.
import sqlite3
import datetime
import pandas as pd
import numpy as np
import os
import time
# Simulate the data
size = 5000000
apps = [f'{i:010}' for i in range(size)]
dates = np.random.choice(pd.date_range('2016-01-01', '2019-01-01').to_pydatetime().tolist(), size)
prod_cd = np.random.choice([f'PROD_{i}' for i in range(30)], size)
models = np.random.choice([f'MODEL{i}' for i in range(15)], size)
categories = np.random.choice([f'GROUP{i}' for i in range(10)], size)
# create a db in memory
conn = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
c = conn.cursor()
# Create table and insert data
c.execute("DROP TABLE IF EXISTS experiment")
c.execute("CREATE TABLE experiment (appId TEXT, dtenter TIMESTAMP, prod_cd TEXT, model TEXT, category TEXT)")
c.executemany("INSERT INTO experiment VALUES (?, ?, ?, ?, ?)", zip(apps, dates, prod_cd, models, categories))
# helper functions
def time_it(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print("time for {} function is {}".format(func.__name__, time.time() - start))
        return result
    return wrapper

@time_it
def read_db(query):
    df = pd.read_sql_query(query, conn)
    return df

@time_it
def run_query(query):
    output = c.execute(query).fetchall()
    print(output)
# The main query
query = "SELECT * FROM experiment WHERE prod_cd IN ('PROD_1', 'PROD_5', 'PROD_10') AND dtenter >= '2018-01-01'"
# CASE #1: WITHOUT ANY INDEX
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 2.4783718585968018
# CASE #2: WITH INDEX FOR COLUMNS IN WHERE STATEMENT
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 3.221407890319824
# CASE #3: WITH INDEX FOR MORE THAN WHAT'S IN THE WHERE STATEMENT, BUT NOT ALL COLUMNS
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter, appId, category)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 3.176532745361328
# CASE #4: WITH INDEX FOR ALL COLUMNS
run_query("DROP INDEX IF EXISTs idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter, appId, category, model)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 0.8257918357849121
The SQLite Query Optimizer Overview says:
When doing an indexed lookup of a row, the usual procedure is to do a binary search on the index to find the index entry, then extract the rowid from the index and use that rowid to do a binary search on the original table. Thus a typical indexed lookup involves two binary searches.
Index entries are not in the same order as the table entries, so if a query returns data from most of the table's pages, all those random-access lookups are slower than just scanning all table rows.
Index lookups are more efficient than a table scan only if your WHERE condition filters out many more rows than are returned.
SQLite assumes that lookups on indexed columns have a high selectivity. You can get better estimates by running ANALYZE after filling the table.
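For example (a short sketch reusing the cursor from the question):
# gather statistics on the table and its indexes so the planner can estimate selectivity
c.execute("ANALYZE")
conn.commit()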
But if all your queries are in a form where an index does not help, it would be a better idea to not use an index at all.
When you create an index over all columns used in the query, the additional table accesses are no longer necessary:
If, however, all columns that were to be fetched from the table are already available in the index itself, SQLite will use the values contained in the index and will never look up the original table row. This saves one binary search for each row and can make many queries run twice as fast.
When an index contains all of the data needed for a query and when the original table never needs to be consulted, we call that index a "covering index".
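You can see this directly in the query plan (a sketch reusing the question's connection and query; the index name here is only illustrative):
# with an index containing every selected column, the plan should report a covering index
c.execute("CREATE INDEX IF NOT EXISTS idx_cover ON experiment(prod_cd, dtenter, appId, category, model)")
for row in c.execute("EXPLAIN QUERY PLAN " + query):
    print(row)  # expect something like: SEARCH experiment USING COVERING INDEX idx_cover (prod_cd=?)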
I need to create the following report scalable:
query = """
(SELECT
'02/11/2019' as Week_of,
media_type,
campaign,
count(ad_start_ts) as frequency
FROM usotomayor.digital
WHERE ds between 20190211 and 20190217
GROUP BY 1,2,3)
UNION ALL
(SELECT
'02/18/2019' as Week_of,
media_type,
campaign,
count(ad_start_ts) as frequency
FROM usotomayor.digital
WHERE ds between 20190211 and 20190224
GROUP BY 1,2,3)
"""
#Converting to dataframe
query2 = spark.sql(query).toPandas()
query2
However, as you can see, this report does not scale if I have a long list of dates, each needing its own SQL query to be unioned.
My first attempt at looping in a list of date variables into the SQL script is as follows:
dfys = ['20190217','20190224']
df2 = ['02/11/2019','02/18/2019']

for i in df2:
    date = i
    for j in dfys:
        date2 = j
        query = f"""
        SELECT
          '{date}' as Week_of,
          raw.media_type,
          raw.campaign,
          count(raw.ad_start_ts) as frequency
        FROM usotomayor.digital raw
        WHERE raw.ds between 20190211 and {date2}
        GROUP BY 1,2,3
        """
        #Converting to dataframe
        query2 = spark.sql(query).toPandas()
        query2
However, this is not working for me. I think I need to loop through the sql query itself, but I don't know how to do this. Can someone help me?
As a commenter said, "this is not working for me" is not very specific, so let's start by specifying the problem: you need to execute a query for each pair of dates, which means running the queries in a loop and saving the results (or actually unioning them, but then you would need to change your query logic).
You could do it like this:
dfys = ['20190217', '20190224']
df2 = ['02/11/2019', '02/18/2019']

query_results = list()

for week_label, end_date in zip(df2, dfys):
    query = f"""
    SELECT
      '{week_label}' as Week_of,
      raw.media_type,
      raw.campaign,
      count(raw.ad_start_ts) as frequency
    FROM usotomayor.digital raw
    WHERE raw.ds between 20190211 and {end_date}
    GROUP BY 1,2,3
    """
    query_results.append(spark.sql(query).toPandas())
query_results[0]
query_results[1]
Now you get a list of your results (query_results).
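If you would rather end up with one combined report than a list of per-week frames, you can concatenate them (a short sketch):
import pandas as pd

# stack the per-week results into a single report DataFrame
report = pd.concat(query_results, ignore_index=True)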
I am working on a project for one of my Python classes, and I am trying to grab an average monthly snowfall for a given year. In my data set, data collected spans from 2016 to 2017 for many different weather outposts.
This is simply for cleaning up some weather report .csv files with SQLite. I have managed to get the data, traditionally in CSV format, into an in-memory SQLite database, but my SQL is rusty and I can't get the data to come out the way I want. I have tried separating the data with a WHERE DATE < '20170101' before grouping by date, but I can't even get the data to filter by date (possibly an issue with how SQL compares dates versus how my dates are entered, which look like 12/24/2017).
Here's what I'm trying to run
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE t (STATION, NAME, DATE, AWND, SNOW);")
with open('filteredData.csv','r') as fin:
    # csv.DictReader uses first line in file for column headings by default
    dr = csv.DictReader(fin)  # comma is default delimiter
    to_db = [(i['STATION'], i['NAME'], i['DATE'], i['AWND'], i['SNOW']) for i in dr]
cur.executemany("INSERT INTO t (STATION, NAME, DATE, AWND, SNOW) VALUES (?, ?, ?, ?, ?)", to_db)
con.commit()
data = cur.execute("SELECT STATION, NAME, DATE, AWND, AVG(SNOW) FROM t GROUP BY STATION")
I have been trying to add a clause to either the execute or executemany statements to filter out entries before a given year, like so:
cur.executemany("INSERT INTO t (STATION, NAME, DATE, AWND, SNOW) VALUES (?, ?, ?, ?, ?) WHERE DATE < '20170101'", to_db)
I expected the output to show (for now) every location's average snowfall for 2016 (I'm still working on further breaking it down into average monthly snowfall for every location), but when I add the line above, I get an error. When I run the code without the WHERE clause, it processes fine (and outputs back to a CSV as I wanted), but it only shows averages for every location with no regard to the time period those averages were taken over.
For those curious, the date format in the csv that I'm importing from looks something like this: 12/24/2017
EDIT: I have modified the execute statement in the data variable to look like
Jan = cur.execute("SELECT STATION, NAME, DATE, AWND, AVG(SNOW) FROM t WHERE (DATE > '2016-01-01' AND DATE < '2016-02-01') GROUP BY STATION")
Jan now reflects the average for dates 2016-01-01 to 2016-02-01, which, for the record, does appear to take the January average snowfall and output it to the CSV. Now I am trying to get February to print after it without overwriting it; simply calling another writerows with another variable seems to just overwrite it.
SELECT
STATION
, NAME
, MIN(DATE)
, AVG(AWND)
, AVG(SNOW)
FROM
t
WHERE
DATE < '1/1/17'
GROUP BY
STATION
That SQL statement is invalid by SQL 92+ standards.
In general, when using GROUP BY, all non-aggregated columns that are used in the SELECT clause should also be in the GROUP BY clause. So the NAME column should also be in the GROUP BY clause, but that would give you invalid results for your question.
I believe you are looking for this query instead.
SELECT
t.*
FROM (
SELECT
STATION
, MIN(DATE) AS min_date
, AVG(AWND) AS avg_awnd
, AVG(SNOW) AS avg_snow
FROM
t
WHERE
DATE < '1/1/17'
GROUP BY
STATION
) AS t_aggregated
INNER JOIN
t
ON
t_aggregated.STATION = t.STATION
AND
t_aggregated.min_date = t.date
Is this what you want?
select station, name, strftime('%Y-%m', date) as yyyymm,
       avg(snow)
from t
group by station, name, strftime('%Y-%m', date);
You can add a where clause to limit the data to a particular period of time. For instance, for 2016:
select station, name, strftime('%Y-%m', date) as yyyymm,
avg(snow)
from t
where date >= '2016-01-01' and
date < '2017-01-01'
group by station, name, strftime('%Y-%m', date);
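Note that these comparisons assume ISO-formatted date strings, while the question's CSV holds dates like 12/24/2017. A minimal sketch of normalizing the dates during the import step shown in the question (assuming the %m/%d/%Y format stated there):
from datetime import datetime

# convert 'MM/DD/YYYY' text to ISO 'YYYY-MM-DD' so string comparisons and strftime() behave
to_db = [(i['STATION'], i['NAME'],
          datetime.strptime(i['DATE'], '%m/%d/%Y').strftime('%Y-%m-%d'),
          i['AWND'], i['SNOW'])
         for i in dr]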
Alright, so after working with a friend on the program for a bit, we both figured out that we had to put the query inside a loop and execute it before writing each result to the file. Here's what we wrote:
with open("Average2016.csv", 'w') as f:
    writer = csv.writer(f)
    writer.writerow(['STATION', 'NAME', 'DATE', 'AWND', 'SNOW'])
    '''
    Fun for loop for generating dates. This uses zfill to pad the months to 2 digits
    and checks whether we are on December or not. If we are, skip to next January.
    Then we use an f-string to create a SQL command, execute it, and write
    the return value into the CSV.
    '''
    for x in range(1, 13):
        date1 = '2016-' + str(x).zfill(2) + '-01'
        date2 = '2016-' + str(x + 1).zfill(2) + '-01'
        if x == 12:
            date2 = '2017-01-01'
        sqlCmd = f"SELECT STATION, NAME, DATE, AWND, AVG(SNOW) FROM t WHERE (DATE >= '{date1}' AND DATE < '{date2}') GROUP BY STATION"
        db_val = cur.execute(sqlCmd)
        writer.writerows(db_val)
I wanted to say that this is how I was writing it in the first place, but I think it's called slightly differently than the (frankly messy) way I was calling it before. Thanks everyone else for the help though!
I have one database with two tables, both of which have a column called barcode. The aim is to retrieve barcodes from one table and look up the entries in the other table where extra information for each barcode is stored, and to save both sets of retrieved data in a DataFrame. The problem is that when I insert the retrieved data from the second query into a DataFrame, it stores only the last entry:
import mysql.connector
import pandas as pd
cnx = mysql.connector.connect(user=user, password=password, host=host, database=database)
query_barcode = ("SELECT barcode FROM barcode_store")
cursor = cnx.cursor()
cursor.execute(query_barcode)
data_barcode = cursor.fetchall()
Up to this point everything works smoothly, and here is the part with problem:
query_info = ("SELECT product_code FROM product_info WHERE barcode=%s")
for each_barcode in data_barcode:
    cursor.execute(query_info % each_barcode)
    pro_info = pd.DataFrame(cursor.fetchall())
pro_info contains only the last matching barcode's information, while I want to retrieve the information for every barcode in data_barcode.
That's because you are overwriting pro_info with new data in each loop iteration. You should rather do something like:
query_info = ("SELECT product_code FROM product_info")
cursor.execute(query_info)
pro_info = pd.DataFrame(cursor.fetchall())
Making so many SELECTs is redundant, since you can get all the records in one SELECT and insert them into your DataFrame straight away.
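For example, a single JOIN can replace the per-barcode loop entirely (a sketch; the table and column names follow the ones given in the question):
# one round trip: join the two tables on barcode and load everything at once
query_joined = """
    SELECT s.barcode, p.product_code
    FROM barcode_store s
    JOIN product_info p ON p.barcode = s.barcode
"""
cursor.execute(query_joined)
pro_info = pd.DataFrame(cursor.fetchall(), columns=['barcode', 'product_code'])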
#edit: However, if you need the WHERE clause to fetch only specific products, store the records in a list until you insert them into the DataFrame. Your code would eventually look like:
pro_list = []
query_info = ("SELECT product_code FROM product_info WHERE barcode=%s")
for each_barcode in data_barcode:
    cursor.execute(query_info, each_barcode)  # let the driver bind the parameter instead of string formatting
    pro_list.append(cursor.fetchone())
pro_info = pd.DataFrame(pro_list)
Cheers!