What is the most efficient way to query my SQL (T-SQL) database when I want to inner join the queried data onto a pandas dataframe afterwards?
I don't know how to pass information into SQL from Python via a PYODBC query so my current best idea is to form the query in a way that I know aligns with my Python dataframe (i.e. I know all the information has STARTDATE > 2016, so it's easy for me to request that, and I know that PRODUCT = Private_Car). However if I use:
SELECT *
FROM rmrClaim
WHERE (PRODUCT = 'Private_Car') AND (YEAR >= 2016)
I am still going to bring in far more data than necessary. What I would rather be able to do is select only data which contains my merge key (ID) from the SQL DB.
Is there a more efficient way to query the DB so that given a pandas dataframe I can only bring the data which I will need for inner joining afterwards?
Can I pass a list from python into a sql query using PYODBC?
Edit - Trying to phrase differently:
I have a dataframe from CSV (dataframe A), and I want to take data from my SQL DB to produce a dataframe (dataframe B). The data in my SQL DB is much larger than the data in dataframe A, so I want to send a SQL query which only requests data that is within dataframe A, so that I don't end up with a dataframe B which is 10x larger than dataframe A. My current idea is to use knowledge I have of dataframe A (i.e. that all of the data in dataframe A is after 2016); however, if there is a way to pass a list into my SQL query, I can more efficiently query a subset of data.
Use pyodbc and write your query before passing it to a pandas dataframe. Here is an example:
import pandas as pd
import pyodbc
connstr = "Driver={SQL Server};Server=MSSQLSERVER;Database=Claims;Trusted_Connection=yes;"
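ids = dfA["ID"].dropna().unique().tolist()  # merge keys from dataframe A as plain Python values
placeholders = ", ".join("?" for _ in ids)  # one "?" parameter marker per key
query = f"SELECT * FROM rmrClaim WHERE (PRODUCT = 'Private_Car') AND (YEAR >= 2016) AND ID IN ({placeholders})"
# Pass the keys as query parameters instead of formatting the Series into the SQL string
df = pd.read_sql(query, pyodbc.connect(connstr), params=ids)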
df
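If dataframe A holds more keys than a single statement can take (SQL Server allows roughly 2100 parameters per request), one option is to send the keys in chunks and combine the results. The following is a minimal sketch of that idea, not a tested pyodbc recipe: it uses an in-memory SQLite table as a stand-in for rmrClaim so that it runs anywhere, and the function name fetch_rows_for_ids and the sample rows are made up for illustration.

```python
import sqlite3


def fetch_rows_for_ids(conn, ids, chunk_size=1000):
    """Fetch rmrClaim rows whose ID is in `ids`, one chunk of keys per query."""
    unique_ids = list(dict.fromkeys(ids))  # drop duplicate keys, keep order
    rows = []
    for start in range(0, len(unique_ids), chunk_size):
        chunk = unique_ids[start:start + chunk_size]
        placeholders = ", ".join("?" for _ in chunk)  # one marker per key
        query = (
            "SELECT ID, PRODUCT, YEAR FROM rmrClaim "
            "WHERE PRODUCT = 'Private_Car' AND YEAR >= 2016 "
            f"AND ID IN ({placeholders})"
        )
        rows.extend(conn.execute(query, chunk).fetchall())
    return rows


# Stand-in database (assumption): an in-memory SQLite table instead of SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rmrClaim (ID INTEGER, PRODUCT TEXT, YEAR INTEGER)")
conn.executemany(
    "INSERT INTO rmrClaim VALUES (?, ?, ?)",
    [
        (1, "Private_Car", 2017),
        (2, "Private_Car", 2015),
        (3, "Van", 2018),
        (4, "Private_Car", 2019),
    ],
)

# Keys as they might come from dfA["ID"].tolist(); 99 has no match in the table.
matching = fetch_rows_for_ids(conn, [1, 2, 3, 4, 4, 99], chunk_size=2)
print(matching)
```

With pyodbc the same pattern should apply through a cursor, since pyodbc also uses "?" as its parameter marker; the resulting rows can then be loaded into dataframe B and merged with dataframe A.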
Related
I am currently doing an Association Rules project and this is my first time working with SQL Server. I have a Pandas Dataframe with all the results and want to transfer them to an SQL table.
However, the dataframe has shape (1788020, 4), and when I run the code it takes too long and stops at around 500 rows.
Just in case, this is the code I am using:
cursor2 = conn2.cursor()
cursor2.execute("truncate table APriori_test")
for index, row in dataset.iterrows():
    cursor2.execute("INSERT INTO APriori_test(antecedents,consequents,support,confidence) values (?,?,?,?)",row.antecedents,row.consequents,row.support,row.confidence)
conn2.commit()
However, when I insert only 1000 rows at a time, for example, it runs smoothly with no problems.
How can I automatically insert the data in batches of, for example, 10000 rows at a time?
I am open to other suggestions.
Thank you!
If you are using pandas, you might find sqlalchemy + pandas.DataFrame.to_sql useful. I've never used it with SQL Server, but your code should be something like:
import pandas as pd
# you have to import your driver, e.g. import pyodbc
from sqlalchemy import create_engine
# replace with your connection string
engine = create_engine("dialect+driver://username:password@host:port/database")
df = pd.DataFrame({'A': [1,2,3], 'B':[4,5,6]})
df.to_sql('MyTable', con=engine, if_exists='append', index=False)
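For the batching part of the question, a plain DB-API alternative is to slice the rows into fixed-size batches and call executemany once per batch. This is only a sketch under assumptions: it runs against an in-memory SQLite database so that it is self-contained, the helper name insert_in_batches is invented here, and the sample rows are dummy data rather than real association rules.

```python
import sqlite3
from itertools import islice


def insert_in_batches(conn, rows, batch_size=10000):
    """Insert rows into APriori_test batch by batch, committing after each batch."""
    sql = (
        "INSERT INTO APriori_test (antecedents, consequents, support, confidence) "
        "VALUES (?, ?, ?, ?)"
    )
    iterator = iter(rows)
    total = 0
    while True:
        batch = list(islice(iterator, batch_size))  # next batch_size rows, or fewer
        if not batch:
            break
        conn.executemany(sql, batch)
        conn.commit()
        total += len(batch)
    return total


# Stand-in database (assumption): in-memory SQLite instead of SQL Server.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE APriori_test "
    "(antecedents TEXT, consequents TEXT, support REAL, confidence REAL)"
)

# Dummy rows; with a dataframe they could come from
# dataset.itertuples(index=False, name=None).
rows = [(f"item{i}", f"item{i + 1}", 0.01, 0.5) for i in range(25)]
inserted = insert_in_batches(conn, rows, batch_size=10)
count = conn.execute("SELECT COUNT(*) FROM APriori_test").fetchone()[0]
print(inserted, count)
```

With pyodbc the executemany call would go through a cursor, and setting cursor.fast_executemany = True may speed it up further. If you stay with to_sql, its chunksize parameter (for example chunksize=10000) is the built-in way to write in batches.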
I'm trying to figure out how to treat a pandas data frame as a SQL table when querying a database in Python.
I'm coming from a SAS background where work tables can easily be incorporated into direct database queries.
For example:
Select a.first_col,
b.second_col
from database.table1 a
left join work.table1 b on a.id = b.id;
Here work.table1 is not in the database, but is a dataset held in the local SAS server.
In my research I have found ways to write a data frame to a database and then include that in the query. I do not have write access to the database, so that is not an option for me.
I also know that I can use sqlalchemy with DataFrame.to_sql() to put a data frame into a SQL engine, but I can't figure out if there is a way to connect that engine with the pyodbc connection I have with the database.
I also tried this though I didn't think it would work (names of tables and columns altered).
df = pd.DataFrame(['A342','B432','W345'],columns=['id'])
query = '''
select a.id, b.id
from df a
left join database.base_table b on a.id= b.id
'''
query_results = pd.read_sql_query(query,connection)
As I expected, it didn't work.
I'm connecting to a Netezza database; I'm not sure if that matters.
I don't think it is possible; the dataframe would have to be written to its own table in order to be queried, although I'm not familiar with Netezza.
Why not perform the join (a "merge" in pandas) purely in pandas? I understand this isn't feasible if the SQL table is massive, but otherwise:
query = """
SELECT id, x
FROM a
"""
a = pd.read_sql(query, conn)
df = pd.DataFrame({'id': ['A342', 'B432', 'W345']})  # placeholder for some dataframe already in memory
pd.merge(df, a, on='id', how='left')
See https://pandas.pydata.org/docs/getting_started/comparison/comparison_with_sql.html for good documentation on the similarities between SQL and pandas.
I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it to my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is loading in the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing pandas.DataFrame.merge, deleting the said table from the database, and then adding in the updated table to the database using pd.DataFrame.to_sql (but this would be inefficient).
Instead of using the UPSERT command, why not write your own algorithm that updates the values if the date and time are found, and otherwise inserts a new row? Check out the code I wrote for you, and let me know if you are still confused. You can even do this for hundreds of tables by replacing the table name in the algorithm with a variable and looping over the list of your table names.
import sqlite3
import pandas as pd
csv_data = pd.read_csv("my_CSV_file.csv")  # path to your CSV data
connection_str = "my_database.db"  # placeholder: path to your SQLite database file
def manual_upsert():
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    # Collect all dates already in the database table.
    cur.execute("SELECT my_date_column FROM my_CSV_data")
    old_dates = {row[0] for row in cur.fetchall()}
    # Iterate over rows (iterating the dataframe itself would yield column names).
    # The date column is assumed to be at index 0.
    for new_data in csv_data.itertuples(index=False):
        if new_data[0] in old_dates:
            # Update the existing row for this date.
            cur.execute("UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",
                        (new_data[1], new_data[2], new_data[3], new_data[0]))
        else:
            # Insert a new row if the date is not found.
            cur.execute("INSERT INTO my_CSV_data VALUES(?,?,?,?)",
                        (new_data[0], new_data[1], new_data[2], new_data[3]))
    con.commit()
    con.close()
manual_upsert()
First, even though the questions are related, ask them separately in the future.
There is documentation on UPSERT handling in SQLite that explains how to use it, but it is a bit abstract. You can check examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Use a transaction, and the statements will be executed in bulk.
As the presence of this library suggests, to_sql does not create UPSERT commands (only INSERT).
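To make the first two points concrete, here is a small sketch of SQLite's UPSERT syntax (INSERT ... ON CONFLICT ... DO UPDATE) run for a whole batch inside one transaction. It assumes the SQLite library bundled with Python is version 3.24.0 or newer, which is when this syntax was added; the table name events and the columns event_time, value1 and value2 are invented for illustration and stand in for the ten real columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (event_time TEXT PRIMARY KEY, value1 REAL, value2 REAL)"
)
conn.execute("INSERT INTO events VALUES ('2020-01-01 00:00:00', 1.0, 2.0)")

# New data, e.g. from df.itertuples(index=False, name=None) for a dataframe df.
new_rows = [
    ("2020-01-01 00:00:00", 10.0, 20.0),  # key already exists: row is updated
    ("2020-01-02 00:00:00", 3.0, 4.0),    # key is new: row is inserted
]

# "excluded" refers to the row that the INSERT tried to add.
upsert_sql = """
INSERT INTO events (event_time, value1, value2)
VALUES (?, ?, ?)
ON CONFLICT(event_time) DO UPDATE SET
    value1 = excluded.value1,
    value2 = excluded.value2
"""

with conn:  # a single transaction for the whole batch, committed on success
    conn.executemany(upsert_sql, new_rows)

result = conn.execute("SELECT * FROM events ORDER BY event_time").fetchall()
print(result)
```

For thousands of tables, the same statement could be built per table name in a loop; only the values should be passed as parameters, since table names cannot be bound with "?".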
I currently have a Pandas dataframe and am trying to see if I can reference it when querying an Oracle table in Python. Ideally, I would like to reference the pandas dataframe in the join statement of an oracle query.
In Python, I'm able to pass a string into the where clause of an Oracle query, like this:
select *
from a
where
var in (""" + df + """)
But I hit Oracle's limit of 1000 elements that can be passed in through the where statement.
Ideally, I would like to use join on the dataframe directly in order to avoid the Oracle limit. Eventually, I'd like to store this new data into a separate dataframe.
select *
from df pd
left join a oracle
on pd.var = oracle.var
Is there any other way to reference an existing pandas dataframe when running an Oracle query from Python?
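One commonly used workaround for the 1000-element limit, which applies to each individual IN list, is to split the values into several IN lists joined with OR. The sketch below only builds the SQL text and the parameter list; the function name build_in_clause is hypothetical, and the ":1, :2, ..." bind style is an assumption about how the Oracle driver in use (for example cx_Oracle or python-oracledb) accepts positional binds.

```python
def build_in_clause(column, values, max_items=1000):
    """Return (sql_fragment, params) with values spread over OR'ed IN lists."""
    values = list(values)
    if not values:
        return "1 = 0", []  # an empty list should match nothing
    groups = []
    for start in range(0, len(values), max_items):
        stop = min(start + max_items, len(values))
        # Numbered bind placeholders, continuing across groups.
        binds = ", ".join(f":{i + 1}" for i in range(start, stop))
        groups.append(f"{column} IN ({binds})")
    return "(" + " OR ".join(groups) + ")", values


# Small demonstration with max_items=2 so the grouping is visible.
clause, params = build_in_clause("var", ["a", "b", "c", "d", "e"], max_items=2)
query = f"select * from a where {clause}"
print(query)
print(params)
```

In practice the values would come from something like df["var"].dropna().unique().tolist(), and the query and params would be handed to the driver's cursor.execute or to pd.read_sql(query, connection, params=params). The dataframe itself still cannot be joined inside the Oracle query; the join with the returned rows happens afterwards in pandas.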
I am running a sql notebook on databricks. I would like to analyze a table with half a billion records in it. I can run simple sql queries on the data. However, I need to change the date column type from str to date.
Unfortunately, update/alter statements do not seem to be supported by sparkSQL so it seems I cannot modify the data in the table.
What one line of code would allow me to convert the SQL table to a python data structure (in pyspark) in the next cell?
Then I could modify the file and return it to SQL.
dataFrame = sqlContext.sql('select * from myTable')
df=sqlContext.sql("select * from table")
To convert the dataframe back to a SQL view:
df.createOrReplaceTempView("myview")