I have been looking since yesterday about the way I could convert the output of an SQL Query into a Pandas dataframe.
For example a code that does this :
data = select * from table
I've tried so many codes I've found on the internet but nothing seems to work.
Note that my database is stored in Azure DataBricks and I can only access the table using its URL.
Thank you so much !
Hope this would help you out. Both insertion & selection are in this code for reference.
def db_insert_user_level_info(table_name):
#Call Your DF Here , as an argument in the function or pass directly
df=df_parameter
params = urllib.parse.quote_plus("DRIVER={SQL Server};SERVER=DESKTOP-ITAJUJ2;DATABASE=githubAnalytics")
engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
engine.connect()
table_row_count=select_row_count(table_name)
df_row_count=df.shape[0]
if table_row_count == df_row_count:
print("Data Cannot Be Inserted Because The Row Count is Same")
else:
df.to_sql(name=table_name,con=engine, index=False, if_exists='append')
print("********************************** DONE EXECTUTED SUCCESSFULLY ***************************************************")
def select_row_count(table_name):
cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
"Server=DESKTOP-ITAJUJ2;"
"Database=githubAnalytics;"
"Trusted_Connection=yes;")
cur = cnxn.cursor()
try:
db_cmd = "SELECT count(*) FROM "+table_name
res = cur.execute(db_cmd)
# Do something with your result set, for example print out all the results:
for x in res:
return x[0]
except:
print("Table is not Available , Please Wait...")
Using sqlalchemy to connect to the database, and the built-in method read_sql_query from pandas to go straight to a DataFrame:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine(url)
connection = engine.connect()
query = "SELECT * FROM table"
df = pd.read_sql_query(query,connection)
Related
I have a database connection:
import sqlalchemy as sa
engine = sa.create_engine('my_info')
connection = engine.connect()
Subsequently I have a function:
import pandas as pd
def load_data(connection):
sql = 'select * from tablename'
df = pd.read_sql(sql, con = connection)
return df
This is part of an app in Streamlit that I'm working on, I need streamlit to cache the output of my load_data function. I thought it worked like this:
import pandas as pd
import streamlit as st
#st.cache()
def load_data(connection):
sql = 'select * from tablename'
df = pd.read_sql(sql, con = connection)
return df
But this gives me the following error:
UnhashableTypeError: Cannot hash object of type builtins.weakref, found in the arguments of load_data().
The error is much longer, and if it helps I will post it. The error also contains a link to the streamlit documentation. I read it and reformulated my code to look like this:
#st.cache()
def DBConnection():
engine = sa.create_engine("my_info")
conn = engine.connect()
return conn
conn = DBConnection()
#st.cache(hash_funcs={DBConnection: id})
def load_data(connection):
sql = 'select * from tablename'
df = pd.read_sql(sql, con = connection)
return df
But this gives me a NameError:
NameError: name 'DBConnection' is not defined
I've run out of idea's to try, any help would be highly appreciated. It is very possible that I misunderstood the documentation as it assumes a lot of prior knowledge about the process of hashing and caching.
Combine the two methods and use:
#st.cache(allow_output_mutation=true)
Code:
#st.cache(allow_output_mutation=true)
def load_data():
engine = sa.create_engine("my_info")
conn = engine.connect()
sql = 'select * from tablename'
df = pd.read_sql(sql, con = conn )
return df
For more you can read in documentation
I am looking for a way to insert a big set of data into a SQL Server table in Python. The problem is that my dataframe in Python has over 200 columns, currently I am using this code:
import pyodbc
import pandas as pd
server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+'UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
for index, row in df.iterrows():
cursor.execute("INSERT INTO dbo.mytable (A,B,C)values(?,?,?)", row.A, row.B, row.C)
cnxn.commit()
cursor.close()
The problem is in INSERT INTO dbo.mytable (A, B, C) VALUES (?,?,?)", row.A, row.B, row.C as I need to insert a data with over 200 columns and specifying each of these columns is not really time efficient :(
I would appreciate any help!
Create connection in SqlAlchemy
Use df.to_sql() with chunksize param. Link to doc
ps. in my cases connection not in sqlalchemy not working in to_sql - function
Ok, I finally found a way:
serverName = 'xxx'
dataBase = 'zzz'
conn_str = urllib.parse.quote_plus(r'DRIVER={SQL Server};SERVER=' + serverName + r';DATABASE=' + dataBase + r';TRUSTED_CONNECTION=yes')
conn = 'mssql+pyodbc:///?odbc_connect={}'.format(conn_str)
engine = sqlalchemy.create_engine(conn,poolclass=NullPool)
connection = engine.connect()
df.to_sql("TableName", engine, schema='SchemaName', if_exists='append', index= True, chunksize=200)
connection.close()
I'm trying to connect to SQL server 2019 via sqlalchemy. I'm using both mssql+pyodbc and msql+pyodbc_mssql, but on both cases it cannot connect, always returns default_schema_name not defined.
Already checked database, user schema defined and everything.
Example:
from sqlalchemy import create_engine
import urllib
from sqlalchemy import create_engine
server = 'server'
database = 'db'
username = 'user'
password = 'pass'
#cnxn = 'DRIVER={ODBC Driver 17 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password+';Trusted_Connection=yes'
cnxn = 'DSN=SQL Server;SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password+';Trusted_Connection=yes'
params = urllib.parse.quote_plus(cnxn)
engine = create_engine('mssql+pyodbc:///?odbc_connect=%s' % params)
cnxn = engine.connect()
return None, dialect.default_schema_name
AttributeError: 'MSDialect_pyodbc' object has no attribute 'default_schema_name'
TIA.....
Hopefully the following provides enough for a minimum viable sample. I'm using it in a larger script to move 12m rows 3x a day, and for that reason I've included an example of chunking that I pinched from elsewhere.
#Set up enterprise DB connection
# Enterprise DB to be used
DRIVER = "ODBC Driver 17 for SQL Server"
USERNAME = "SQLUsername"
PSSWD = "SQLPassword"
SERVERNAME = "SERVERNAME01"
INSTANCENAME = "\SQL_01"
DB = "DATABASE_Name"
TABLE = "Table_Name"
#Set up SQL database connection variable / path
#I have included this as an example that can be used to chunk data up
conn_executemany = sql.create_engine(
f"mssql+pyodbc://{USERNAME}:{PSSWD}#{SERVERNAME}{INSTANCENAME}/{DB}?driver={DRIVER}", fast_executemany=True
)
#Used for SQL Loading from Pandas DF
def chunker(seq, size):
return (seq[pos : pos + size] for pos in range(0, len(seq), size))
#Used for SQL Loading from Pandas DF
def insert_with_progress(df, engine, table="", schema="dbo"):
con = engine.connect()
# Replace table
#engine.execute(f"DROP TABLE IF EXISTS {schema}.{table};") #This only works for SQL Server 2016 or greater
try:
engine.execute(f"DROP TABLE Temp_WeatherGrids;")
except:
print("Unable to drop temp table")
try:
engine.execute(f"CREATE TABLE [dbo].[Temp_WeatherGrids]([col_01] [int] NULL,[Location] [int] NULL,[DateTime] [datetime] NULL,[Counts] [real] NULL) ON [PRIMARY] TEXTIMAGE_ON [PRIMARY];")
except:
print("Unable to create temp table")
# Insert with progress
SQL_SERVER_CHUNK_LIMIT = 250000
chunksize = math.floor(SQL_SERVER_CHUNK_LIMIT / len(df.columns))
for chunk in chunker(df, chunksize):
chunk.to_sql(
name=table,
con=con,
if_exists="append",
index=False
)
if __name__ == '__main__':
# intialise data. Example - make your own dataframe. DateTime should be pandas datetime objects.
data = {'Col_01':[0, 1, 2, 3],
'Location':['Bar', 'Pub', 'Brewery', 'Bottleshop'],
'DateTime':["1/1/2018", "1/1/2019", "1/1/2020", "1/1/2021"],
'Counts':[1, 2, 3, 4}
# Create DataFrame
df = pd.DataFrame(data)
insert_with_progress(df, conn_executemany, table=TABLE)
del [df]
I have got a DataFrame which has got around 30,000+ rows and 150+ columns. So, currently I am using the following code to insert the data into MySQL. But since it is reading the rows one at a time, it is taking too much time to insert all the rows into MySql.
Is there any way in which I can insert the rows all at once or in batches? The constraint here is that I need to use only PyMySQL, I cannot install any other library.
import pymysql
import pandas as pd
# Create dataframe
data = pd.DataFrame({
'book_id':[12345, 12346, 12347],
'title':['Python Programming', 'Learn MySQL', 'Data Science Cookbook'],
'price':[29, 23, 27]
})
# Connect to the database
connection = pymysql.connect(host='localhost',
user='root',
password='12345',
db='book')
# create cursor
cursor=connection.cursor()
# creating column list for insertion
cols = "`,`".join([str(i) for i in data.columns.tolist()])
# Insert DataFrame recrds one by one.
for i,row in data.iterrows():
sql = "INSERT INTO `book_details` (`" +cols + "`) VALUES (" + "%s,"*(len(row)-1) + "%s)"
cursor.execute(sql, tuple(row))
# the connection is not autocommitted by default, so we must commit to save our changes
connection.commit()
# Execute query
sql = "SELECT * FROM `book_details`"
cursor.execute(sql)
# Fetch all the records
result = cursor.fetchall()
for i in result:
print(i)
connection.close()
Thank You.
Try using SQLALCHEMY to create an Engine than you can use later with pandas df.to_sql function. This function writes rows from pandas dataframe to SQL database and it is much faster than iterating your DataFrame and using the MySql cursor.
Your code would look something like this:
import pymysql
import pandas as pd
from sqlalchemy import create_engine
# Create dataframe
data = pd.DataFrame({
'book_id':[12345, 12346, 12347],
'title':['Python Programming', 'Learn MySQL', 'Data Science Cookbook'],
'price':[29, 23, 27]
})
db_data = 'mysql+mysqldb://' + 'root' + ':' + '12345' + '#' + 'localhost' + ':3306/' \
+ 'book' + '?charset=utf8mb4'
engine = create_engine(db_data)
# Connect to the database
connection = pymysql.connect(host='localhost',
user='root',
password='12345',
db='book')
# create cursor
cursor=connection.cursor()
# Execute the to_sql for writting DF into SQL
data.to_sql('book_details', engine, if_exists='append', index=False)
# Execute query
sql = "SELECT * FROM `book_details`"
cursor.execute(sql)
# Fetch all the records
result = cursor.fetchall()
for i in result:
print(i)
engine.dispose()
connection.close()
You can take a look to all the options this function has in pandas doc
It is faster to push a file to the SQL server and let the server manage the input.
So first push the data to a CSV file.
data.to_csv("import-data.csv", header=False, index=False, quoting=2, na_rep="\\N")
And then load it at once into the SQL table.
sql = "LOAD DATA LOCAL INFILE \'import-data.csv\' \
INTO TABLE book_details FIELDS TERMINATED BY \',\' ENCLOSED BY \'\"\' \
(`" +cols + "`)"
cursor.execute(sql)
Possible improvements.
remove or disable indexes on the table(s)
Take the commit out of the loop
Now try and load the data.
Generate a CSV file and load using ** LOAD DATA INFILE ** - this would be issued from within mysql.
I'm trying to drop an existing table, do a query and then recreate the table using the pandas to_sql function. This query works in pgadmin, but not here. Any ideas of if this is a pandas bug or if my code is wrong?
Specific error is ValueError: Table 'a' already exists.
import pandas.io.sql as psql
from sqlalchemy import create_engine
engine = create_engine(r'postgresql://user#localhost:port/dbname')
c = engine.connect()
conn = c.connection
sql = """
drop table a;
select * from some_table limit 1;
"""
df = psql.read_sql(sql, con=conn)
print df.head()
df.to_sql('a', engine)
conn.close()
Why are you doing this like that? There is a shorter way: the if_exists kwag in to_sql. Try this:
import pandas.io.sql as psql
from sqlalchemy import create_engine
engine = create_engine(r'postgresql://user#localhost:port/dbname')
c = engine.connect()
conn = c.connection
sql = """
select * from some_table limit 1;
"""
df = psql.read_sql(sql, con=conn)
print df.head()
# Notice how below line is different. You forgot the schema argument
df.to_sql('a', con=conn, schema=schema_name, if_exists='replace')
conn.close()
According to docs:
replace: If table exists, drop it, recreate it, and insert data.
Ps. Additional tip:
This is better way to handle the connection:
with engine.connect() as conn, conn.begin():
sql = """select * from some_table limit 1"""
df = psql.read_sql(sql, con=conn)
print df.head()
df.to_sql('a', con=conn, schema=schema_name, if_exists='replace')
Because it ensures that your connection is always closed, even if your program exits with an error. This is important to prevent data corruption. Further, I would just use this:
import pandas as pd
...
pd.read_sql(sql, conn)
instead of the way you are doing it.
So, if I was in your place writing that code, it would look like this:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine(r'postgresql://user#localhost:port/dbname')
with engine.connect() as conn, conn.begin():
df = pd.read_sql('select * from some_table limit 1', con=conn)
print df.head()
df.to_sql('a', con=conn, schema=schema_name, if_exists='replace')