I am using table merging in order to select items from my db against a list of parameter tuples. The query works fine, but cur.fetchall() does not return the full result set that I expect.
For example:
data = (
(1, '2020-11-19'),
(1, '2020-11-20'),
(1, '2020-11-21'),
(2, '2020-11-19'),
(2, '2020-11-20'),
(2, '2020-11-21')
)
query = """
with data(song_id, date) as (
values %s
)
select t.*
from my_table t
join data d
on t.song_id = d.song_id and t.date = d.date::date
"""
execute_values(cursor, query, data)
results = cursor.fetchall()
In practice, my list of tuples to check against is thousands of rows long, and I expect the response to also be thousands of rows long.
But I am only getting 5 rows back if I call cur.fetchall() at the end of this request.
I know that this is because execute_values batches the requests, but there is some strange behavior.
If I pass page_size=10 then I get 2 items back, and if I set fetch=True then I get no results at all (even though the rowcount says otherwise).
My thought was to batch these requests myself, but the page_size of a batch does not match the number of rows I get back per batch.
How should I change this request so that I can get all the results I'm expecting?
Edit: (years later after much experience with this)
What you really want to do here is use the COPY command to bulk-insert your data into a temporary table, then merge on both your columns against that temporary table as you would against a normal one. With psycopg2 you can use the copy_expert method to perform the COPY. Here's how you would do that...
Also... trust me when I say this... if SPEED is an issue for you, this is by far the fastest method out there; nothing else comes close.
code in this example is not tested
import pandas as pd
from psycopg2 import sql

df = pd.DataFrame('<whatever your dataframe is>')
# Start by creating the temporary table
string = '''
create temp table mydata (
item_id int,
date date
);
'''
cur.execute(string)
# Now you need to generate an sql string that will copy
# your data into the db
string = sql.SQL("""
copy {} ({})
from stdin (
format csv,
null "NaN",
delimiter ',',
header
)
""").format(sql.Identifier('mydata'), sql.SQL(',').join([sql.Identifier(i) for i in df.columns]))
# Write your dataframe to the disk as a csv
df.to_csv('./temp_dataframe.csv', index=False, na_rep='NaN')
# Copy into the database
with open('./temp_dataframe.csv') as csv_file:
    cur.copy_expert(string, csv_file)
# Now your data should be in your temporary table, so we can
# perform our select like normal
string = '''
select t.*
from my_table t
join mydata d
on t.item_id = d.item_id and t.date = d.date
'''
cur.execute(string)
data = cur.fetchall()
Related
I have a database table with around 10 columns. Sometimes I need to insert a row that has values for only 3 of the columns; the rest are not in the dict.
The data to be inserted is a dictionary named row :
(this insert is to avoid duplicates)
row = {'keyword':'abc','name':'bds'.....}
df = pd.DataFrame([row]) # df looks good, I see columns and 1 row.
engine = getEngine()
connection = engine.connect()
df.to_sql('temp_insert_data_index', connection, if_exists ='replace',index=False)
result = connection.execute(('''
INSERT INTO {t} SELECT * FROM temp_insert_data_index
ON CONFLICT DO NOTHING''').format(t=table_name))
connection.close()
Problem: when the row dict doesn't contain all the columns, the values get inserted by position (a 3-key dict is inserted into the first 3 columns) rather than into the matching columns. (I expect the dict keys to map to the db columns.)
Why ?
Consider explicitly naming the columns to be inserted in both the INSERT INTO and SELECT clauses, which is best practice for SQL append queries. That way the dynamic query works for all columns or any subset of them. Below, f-strings (available Python 3.6+) handle all interpolation into the larger SQL query:
# APPEND TO STAGING TEMP TABLE
df.to_sql('temp_insert_data_index', connection, if_exists='replace', index=False)
# STRING OF COMMA SEPARATED COLUMNS
cols = ", ".join(df.columns)
sql = (
f"INSERT INTO {table_name} ({cols}) "
f"SELECT {cols} FROM temp_insert_data_index "
"ON CONFLICT DO NOTHING"
)
result = connection.execute(sql)
connection.close()
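The pattern above can be seen end-to-end in a self-contained sketch. Here sqlite3 stands in for the real engine (it also supports ON CONFLICT DO NOTHING, since version 3.24), and the table and column names are illustrative, not from the original question:

```python
import sqlite3

# sqlite3 stands in for the real engine so the sketch is runnable end-to-end;
# table and column names here are illustrative.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE target (keyword TEXT PRIMARY KEY, name TEXT, extra TEXT)")
cur.execute("CREATE TABLE temp_insert_data_index (keyword TEXT, name TEXT)")

# The staged row carries only a subset of target's columns
cur.execute("INSERT INTO temp_insert_data_index VALUES ('abc', 'bds')")

# Naming the columns maps staged values onto the right target columns,
# and ON CONFLICT DO NOTHING skips duplicates on re-runs.
# (SQLite needs the "WHERE true" to disambiguate SELECT ... ON CONFLICT;
# Postgres does not.)
cols = ", ".join(["keyword", "name"])
insert = (
    f"INSERT INTO target ({cols}) "
    f"SELECT {cols} FROM temp_insert_data_index WHERE true "
    "ON CONFLICT DO NOTHING"
)
cur.execute(insert)
cur.execute(insert)  # duplicate run is a no-op
rows = cur.execute("SELECT keyword, name, extra FROM target").fetchall()
print(rows)  # [('abc', 'bds', None)]
```

Note how the unnamed column (extra) simply stays NULL instead of receiving a misaligned value.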
I am trying to optimize the performance of a simple query to a SQLite database by using indexing. As an example, the table has 5M rows and 5 columns; the SELECT statement picks up all columns, and the WHERE clause filters on only 2 of them. However, unless I include all columns in the multi-column index, the query performs worse than with no index at all.
Did I index the column incorrectly, or when selecting all columns, am I supposed to include all of them in the index in order to improve performance?
Below, each case # shows the result I got when creating the SQLite database on disk. However, for some reason the ':memory:' mode made all of the indexed cases faster than the no-index case.
import sqlite3
import datetime
import pandas as pd
import numpy as np
import os
import time
# Simulate the data
size = 5000000
apps = [f'{i:010}' for i in range(size)]
dates = np.random.choice(pd.date_range('2016-01-01', '2019-01-01').to_pydatetime().tolist(), size)
prod_cd = np.random.choice([f'PROD_{i}' for i in range(30)], size)
models = np.random.choice([f'MODEL{i}' for i in range(15)], size)
categories = np.random.choice([f'GROUP{i}' for i in range(10)], size)
# create a db in memory
conn = sqlite3.connect(':memory:', detect_types=sqlite3.PARSE_DECLTYPES)
c = conn.cursor()
# Create table and insert data
c.execute("DROP TABLE IF EXISTS experiment")
c.execute("CREATE TABLE experiment (appId TEXT, dtenter TIMESTAMP, prod_cd TEXT, model TEXT, category TEXT)")
c.executemany("INSERT INTO experiment VALUES (?, ?, ?, ?, ?)", zip(apps, dates, prod_cd, models, categories))
# helper functions
def time_it(func):
    def wrapper(*args, **kwargs):
        start = time.time()
        result = func(*args, **kwargs)
        print("time for {} function is {}".format(func.__name__, time.time() - start))
        return result
    return wrapper
@time_it
def read_db(query):
    df = pd.read_sql_query(query, conn)
    return df
@time_it
def run_query(query):
    output = c.execute(query).fetchall()
    print(output)
# The main query
query = "SELECT * FROM experiment WHERE prod_cd IN ('PROD_1', 'PROD_5', 'PROD_10') AND dtenter >= '2018-01-01'"
# CASE #1: WITHOUT ANY INDEX
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 2.4783718585968018
# CASE #2: WITH INDEX FOR COLUMNS IN WHERE STATEMENT
run_query("DROP INDEX IF EXISTS idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 3.221407890319824
# CASE #3: WITH INDEX ON MORE THAN THE WHERE COLUMNS, BUT NOT ALL COLUMNS
run_query("DROP INDEX IF EXISTS idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter, appId, category)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 3.176532745361328
# CASE #4: WITH INDEX FOR ALL COLUMNS
run_query("DROP INDEX IF EXISTS idx")
run_query("CREATE INDEX idx ON experiment(prod_cd, dtenter, appId, category, model)")
run_query("EXPLAIN QUERY PLAN " + query)
df = read_db(query)
>>> time for read_db function is 0.8257918357849121
The SQLite Query Optimizer Overview says:
When doing an indexed lookup of a row, the usual procedure is to do a binary search on the index to find the index entry, then extract the rowid from the index and use that rowid to do a binary search on the original table. Thus a typical indexed lookup involves two binary searches.
Index entries are not in the same order as the table entries, so if a query returns data from most of the table's pages, all those random-access lookups are slower than just scanning all table rows.
Index lookups are more efficient than a table scan only if your WHERE condition filters out far more rows than it returns.
SQLite assumes that lookups on indexed columns have a high selectivity. You can get better estimates by running ANALYZE after filling the table.
But if all your queries are in a form where an index does not help, it would be a better idea to not use an index at all.
When you create an index over all columns used in the query, the additional table accesses are no longer necessary:
If, however, all columns that were to be fetched from the table are already available in the index itself, SQLite will use the values contained in the index and will never look up the original table row. This saves one binary search for each row and can make many queries run twice as fast.
When an index contains all of the data needed for a query and when the original table never needs to be consulted, we call that index a "covering index".
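The covering-index effect can be checked directly with EXPLAIN QUERY PLAN. A minimal sketch on a toy table (the schema is a cut-down stand-in for the experiment above, not the original 5M-row setup):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
c = conn.cursor()
c.execute("CREATE TABLE experiment (appId TEXT, dtenter TEXT, prod_cd TEXT)")
c.executemany("INSERT INTO experiment VALUES (?, ?, ?)",
              [(str(i), '2018-06-01', 'PROD_%d' % (i % 3)) for i in range(100)])

query = ("SELECT * FROM experiment "
         "WHERE prod_cd = 'PROD_1' AND dtenter >= '2018-01-01'")

# Index on just the WHERE columns: each index hit still needs a table lookup
c.execute("CREATE INDEX idx ON experiment(prod_cd, dtenter)")
plan_partial = c.execute("EXPLAIN QUERY PLAN " + query).fetchall()

# Index holding every selected column: the table is never consulted
c.execute("DROP INDEX idx")
c.execute("CREATE INDEX idx ON experiment(prod_cd, dtenter, appId)")
plan_covering = c.execute("EXPLAIN QUERY PLAN " + query).fetchall()

print(plan_partial[0][-1])   # SEARCH ... USING INDEX idx ...
print(plan_covering[0][-1])  # SEARCH ... USING COVERING INDEX idx ...
```

The phrase "COVERING INDEX" in the second plan is SQLite's confirmation that the extra binary search per row has been eliminated.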
I need to make the following report scalable:
query = """
(SELECT
'02/11/2019' as Week_of,
media_type,
campaign,
count(ad_start_ts) as frequency
FROM usotomayor.digital
WHERE ds between 20190211 and 20190217
GROUP BY 1,2,3)
UNION ALL
(SELECT
'02/18/2019' as Week_of,
media_type,
campaign,
count(ad_start_ts) as frequency
FROM usotomayor.digital
WHERE ds between 20190211 and 20190224
GROUP BY 1,2,3)
"""
#Converting to dataframe
query2 = spark.sql(query).toPandas()
query2
However, as you can see I cannot make this report scalable if I have a long list of dates for each SQL query that I need to union.
My first attempt at looping in a list of date variables into the SQL script is as follows:
dfys = ['20190217','20190224']
df2 = ['02/11/2019','02/18/2019']
for i in df2:
    date = i
    for j in dfys:
        date2 = j
        query = f"""
        SELECT
            '{date}' as Week_of,
            raw.media_type,
            raw.campaign,
            count(raw.ad_start_ts) as frequency
        FROM usotomayor.digital raw
        WHERE raw.ds between 20190211 and {date2}
        GROUP BY 1,2,3
        """
#Converting to dataframe
query2 = spark.sql(query).toPandas()
query2
However, this is not working for me. I think I need to loop through the sql query itself, but I don't know how to do this. Can someone help me?
As a commenter said, "this is not working for me" is not very specific, so let's start by specifying the problem: you need to execute a query for each pair of dates. That means running these queries in a loop and saving each result (or actually unioning them, but then you would need to change your query logic).
You could do it like this:
dfys = ['20190217', '20190224']
df2 = ['02/11/2019', '02/18/2019']
query_results = list()
for start_date, end_date in zip(df2, dfys):  # pair each Week_of label with its end date
    query = f"""
    SELECT
        '{start_date}' as Week_of,
        raw.media_type,
        raw.campaign,
        count(raw.ad_start_ts) as frequency
    FROM usotomayor.digital raw
    WHERE raw.ds between 20190211 and {end_date}
    GROUP BY 1,2,3
    """
    query_results.append(spark.sql(query).toPandas())
query_results[0]
query_results[1]
Now you get a list of your results (query_results).
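If you do want one combined report rather than a list, the per-week frames can be stacked afterwards with pandas. A sketch with dummy frames standing in for the spark.sql(...).toPandas() results (the column names are illustrative):

```python
import pandas as pd

# Dummy frames standing in for the spark.sql(...).toPandas() results above
query_results = [
    pd.DataFrame({'Week_of': ['02/11/2019'], 'frequency': [10]}),
    pd.DataFrame({'Week_of': ['02/18/2019'], 'frequency': [12]}),
]

# One frame holding every week's rows, like the original UNION ALL report
report = pd.concat(query_results, ignore_index=True)
print(len(report))  # 2
```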
I have a database in sqlite with c.300 tables. Currently I am iterating through a list and appending the data.
Is there a faster way / more pythonic way of doing this?
df = []
for i in Ave.columns:
    try:
        df2 = get_mcap(i)
        df.append(df2)
        # print(i)
    except Exception:
        pass
df = pd.concat(df, axis=0)
Ave is a dataframe whose columns are the list I want to iterate through.
def get_mcap(Ticker):
    cnx = sqlite3.connect('Market_Cap.db')
    df = pd.read_sql_query("SELECT * FROM '%s'" % (Ticker), cnx)
    df.columns = ['Date', 'Mcap-Ave', 'Mcap-High', 'Mcap-Low']
    df = df.set_index('Date')
    df.index = pd.to_datetime(df.index)
    cnx.close()
    return df
Before I post my solution, I should include a quick warning: you should never use string manipulation to generate SQL queries unless it's absolutely unavoidable. In such cases you need to be certain that you control the data used to format the strings, and that it can't contain anything that would make the query do something unintended.
With that said, this seems like one of those situations where you do need to use string formatting, since you cannot pass table names as parameters. Just make sure there's no way that users can alter what is contained within your list of tables.
Onto the solution. It looks like you can get your list of tables using:
tables = Ave.columns.tolist()
For my simple example, I'm going to use:
tables = ['table1', 'table2', 'table3']
Then use the following code to generate a single query:
query_template = 'select * from {}'
query_parts = []
for table in tables:
    query = query_template.format(table)
    query_parts.append(query)
full_query = ' union all '.join(query_parts)
Giving:
'select * from table1 union all select * from table2 union all select * from table3'
You can then simply execute this one query to get your results:
cnx = sqlite3.connect('Market_Cap.db')
df = pd.read_sql_query(full_query, cnx)
Then from here you should be able to set the index, convert to datetime etc, but now you only need to do these operations once rather than 300 times. I imagine the overall runtime of this should now be much faster.
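Here is the whole pattern as a self-contained sketch against an in-memory sqlite database with a few stand-in tables (the table names and schema are illustrative). Double-quoting each identifier also tolerates odd characters in table names, though the list itself must still come from a trusted source:

```python
import sqlite3
import pandas as pd

# In-memory database with a few stand-in ticker tables
cnx = sqlite3.connect(":memory:")
tables = ['table1', 'table2', 'table3']
for name in tables:
    cnx.execute(f'CREATE TABLE "{name}" (Date TEXT, mcap REAL)')
    cnx.execute(f'INSERT INTO "{name}" VALUES (?, ?)', ('2020-01-01', 1.0))

# One union-all query instead of 300 separate reads
full_query = ' union all '.join(f'select * from "{t}"' for t in tables)
df = pd.read_sql_query(full_query, cnx)
print(len(df))  # 3
```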
I have one database with two tables; both have a column called barcode. The aim is to retrieve a barcode from one table and search the other table, where extra information for that barcode is stored. I would like both retrieved datasets to be saved in a DataFrame. The problem is that when I insert the data retrieved by the second query into a DataFrame, it stores only the last entry:
import mysql.connector
import pandas as pd
cnx = mysql.connector.connect(user=user, password=password, host=host, database=database)
query_barcode = ("SELECT barcode FROM barcode_store")
cursor = cnx.cursor()
cursor.execute(query_barcode)
data_barcode = cursor.fetchall()
Up to this point everything works smoothly, and here is the part with problem:
query_info = ("SELECT product_code FROM product_info WHERE barcode=%s")
for each_barcode in data_barcode:
    cursor.execute(query_info % each_barcode)
    pro_info = pd.DataFrame(cursor.fetchall())
pro_info contains only the last matching barcode's information, while I want to retrieve the information for every data_barcode match.
That's because you are overwriting pro_info with new data on each loop iteration. You should rather do something like:
query_info = ("SELECT product_code FROM product_info")
cursor.execute(query_info)
pro_info = pd.DataFrame(cursor.fetchall())
Making so many SELECTs is redundant, since you can get all the records in one SELECT and load them into your DataFrame in one go.
Edit: However, if you need the WHERE clause to fetch only specific products, you need to store the records in a list until you insert them into the DataFrame. Your code will then eventually look like:
pro_list = []
query_info = ("SELECT product_code FROM product_info WHERE barcode=%s")
for each_barcode in data_barcode:
    # each_barcode is already a 1-tuple from fetchall(), so it can be passed
    # straight in as the parameter sequence instead of %-formatting the string
    cursor.execute(query_info, each_barcode)
    pro_list.append(cursor.fetchone())
pro_info = pd.DataFrame(pro_list)
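Alternatively, the per-barcode loop can be collapsed into a single round trip by binding the whole barcode list into an IN clause. Sketched here with sqlite3 ('?' placeholders) to stay self-contained; with mysql.connector the placeholder is %s, and the table contents are made up for the demo:

```python
import sqlite3

# sqlite3 keeps the sketch self-contained; mysql.connector uses %s placeholders
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE product_info (barcode TEXT, product_code TEXT)")
cur.executemany("INSERT INTO product_info VALUES (?, ?)",
                [('b1', 'p1'), ('b2', 'p2'), ('b3', 'p3')])

data_barcode = [('b1',), ('b3',)]  # as returned by the first fetchall()
flat = [b[0] for b in data_barcode]

# One placeholder per barcode, then a single parameterized query
placeholders = ", ".join("?" for _ in flat)
query = f"SELECT product_code FROM product_info WHERE barcode IN ({placeholders})"
cur.execute(query, flat)
rows = cur.fetchall()
print(rows)  # [('p1',), ('p3',)]
```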
Cheers!