I have a DataFrame with 30,000+ rows and 150+ columns. I am currently using the following code to insert the data into MySQL, but because it inserts the rows one at a time, it takes too long to load all of them.
Is there any way in which I can insert the rows all at once or in batches? The constraint here is that I need to use only PyMySQL, I cannot install any other library.
import pymysql
import pandas as pd
# Create dataframe
data = pd.DataFrame({
'book_id':[12345, 12346, 12347],
'title':['Python Programming', 'Learn MySQL', 'Data Science Cookbook'],
'price':[29, 23, 27]
})
# Connect to the database
connection = pymysql.connect(host='localhost',
user='root',
password='12345',
db='book')
# create cursor
cursor=connection.cursor()
# creating column list for insertion
cols = "`,`".join([str(i) for i in data.columns.tolist()])
# Insert DataFrame records one by one.
for i, row in data.iterrows():
    sql = "INSERT INTO `book_details` (`" + cols + "`) VALUES (" + "%s,"*(len(row)-1) + "%s)"
    cursor.execute(sql, tuple(row))
    # the connection is not autocommitted by default, so we must commit to save our changes
    connection.commit()
# Execute query
sql = "SELECT * FROM `book_details`"
cursor.execute(sql)
# Fetch all the records
result = cursor.fetchall()
for i in result:
    print(i)
connection.close()
Thank You.
Try using SQLAlchemy to create an engine that you can later pass to the pandas df.to_sql function. This function writes the rows of a pandas DataFrame to a SQL database, and it is much faster than iterating over your DataFrame and using the MySQL cursor.
Your code would look something like this:
import pymysql
import pandas as pd
from sqlalchemy import create_engine
# Create dataframe
data = pd.DataFrame({
'book_id':[12345, 12346, 12347],
'title':['Python Programming', 'Learn MySQL', 'Data Science Cookbook'],
'price':[29, 23, 27]
})
db_data = 'mysql+pymysql://' + 'root' + ':' + '12345' + '@' + 'localhost' + ':3306/' \
    + 'book' + '?charset=utf8mb4'
engine = create_engine(db_data)
# Connect to the database
connection = pymysql.connect(host='localhost',
user='root',
password='12345',
db='book')
# create cursor
cursor=connection.cursor()
# Execute to_sql to write the DataFrame into SQL
data.to_sql('book_details', engine, if_exists='append', index=False)
# Execute query
sql = "SELECT * FROM `book_details`"
cursor.execute(sql)
# Fetch all the records
result = cursor.fetchall()
for i in result:
    print(i)
engine.dispose()
connection.close()
You can look at all the options this function has in the pandas documentation.
It is faster to push a file to the SQL server and let the server manage the input.
So first push the data to a CSV file.
data.to_csv("import-data.csv", header=False, index=False, quoting=2, na_rep="\\N")
And then load it at once into the SQL table.
sql = "LOAD DATA LOCAL INFILE \'import-data.csv\' \
INTO TABLE book_details FIELDS TERMINATED BY \',\' ENCLOSED BY \'\"\' \
(`" +cols + "`)"
cursor.execute(sql)
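A slightly fuller sketch of building that statement, in case it helps. The helper name build_load_data_sql is made up for illustration, and the file is assumed to be a comma-separated, double-quoted CSV without a header row, as written by the to_csv call above. With PyMySQL the connection generally has to be opened with local_infile=True, and the server also has to allow LOCAL INFILE; both are assumptions about your setup, so that part is left as comments.

```python
def build_load_data_sql(csv_path, table, columns):
    """Return a LOAD DATA LOCAL INFILE statement for a comma-separated,
    double-quoted CSV file that has no header row."""
    # Backtick-quote each column name, as the original snippet does.
    column_list = ", ".join(f"`{c}`" for c in columns)
    return (
        f"LOAD DATA LOCAL INFILE '{csv_path}' "
        f"INTO TABLE `{table}` "
        "FIELDS TERMINATED BY ',' ENCLOSED BY '\"' "
        f"({column_list})"
    )


sql = build_load_data_sql("import-data.csv", "book_details",
                          ["book_id", "title", "price"])
print(sql)

# With a real server (not run here; assumes LOCAL INFILE is enabled on it):
# connection = pymysql.connect(host="localhost", user="root", password="12345",
#                              database="book", local_infile=True)
# with connection.cursor() as cursor:
#     cursor.execute(sql)
# connection.commit()
```

The column list is passed explicitly so the CSV column order and the table column order do not have to match.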
Possible improvements:
Remove or disable indexes on the table(s).
Take the commit out of the loop.
Now try loading the data again.
Alternatively, generate a CSV file and load it using LOAD DATA INFILE; this would be issued from within MySQL.
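A minimal sketch of the "commit once, insert in batches" idea using only the DB-API executemany call. The function name insert_in_batches and the batch size are my own choices, not something from the thread. The demo runs against an in-memory SQLite database so it works without a server; SQLite uses "?" placeholders, whereas with PyMySQL the same statement would use "%s".

```python
import sqlite3
from itertools import islice


def insert_in_batches(cursor, sql, rows, batch_size=1000):
    """Send rows to cursor.executemany() in chunks; return the number of rows sent."""
    iterator = iter(rows)
    total = 0
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            break
        cursor.executemany(sql, batch)
        total += len(batch)
    return total


# Demo against in-memory SQLite (stand-in for the MySQL table in the question).
connection = sqlite3.connect(":memory:")
cursor = connection.cursor()
cursor.execute(
    "CREATE TABLE book_details (book_id INTEGER, title TEXT, price INTEGER)"
)
rows = [
    (12345, "Python Programming", 29),
    (12346, "Learn MySQL", 23),
    (12347, "Data Science Cookbook", 27),
]
inserted = insert_in_batches(
    cursor,
    "INSERT INTO book_details (book_id, title, price) VALUES (?, ?, ?)",
    rows,
    batch_size=2,
)
connection.commit()  # a single commit after all batches
row_count = cursor.execute("SELECT COUNT(*) FROM book_details").fetchone()[0]
print(inserted, row_count)
```

With a DataFrame, the rows could come from data.itertuples(index=False, name=None), so the whole frame never has to be copied into a second list.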
I have a database that contains multiple tables, and I am trying to import each table as a pandas dataframe. I can do this for a single table as follows:
import pandas as pd
import pandas.io.sql as psql
import pypyodbc
conn = pypyodbc.connect("DRIVER={SQL Server};\
SERVER=serveraddress;\
UID=uid;\
PWD=pwd;\
DATABASE=db")
df1 = psql.read_frame('SELECT * FROM dbo.table1', conn)
The number of tables in the database will change, and at any time I would like to be able to import each table into its own dataframe. How can I get all of these tables into pandas?
Depending on your SQL server, you can inspect the tables in a database.
For example:
tables_df = pd.read_sql('SELECT table_name FROM database_name', conn)
Your table names are now accessible as a pandas DataFrame; you just need to loop over them:
table_name_list = tables_df.table_name
select_template = 'SELECT * FROM {table_name}'
frames_dict = {}
for tname in table_name_list:
    query = select_template.format(table_name = tname)
    frames_dict[tname] = pd.read_sql(query, conn)
Your dictionary frames_dict contains all the dataframes with the table_name as the key
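Here is a runnable sketch of the same pattern. It assumes an in-memory SQLite database with two made-up tables so that it works without a server; the catalog query is the part that changes per database. SQLite exposes table names through sqlite_master, while on SQL Server the equivalent would be a query against INFORMATION_SCHEMA.TABLES.

```python
import sqlite3

import pandas as pd

# Stand-in database with two small tables (hypothetical names and data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (a INTEGER)")
conn.execute("CREATE TABLE table2 (b TEXT)")
conn.executemany("INSERT INTO table1 VALUES (?)", [(1,), (2,)])
conn.execute("INSERT INTO table2 VALUES ('x')")
conn.commit()

# Ask the catalog which tables exist right now.
tables_df = pd.read_sql(
    "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name", conn
)

# One DataFrame per table, keyed by table name.
frames_dict = {
    name: pd.read_sql(f'SELECT * FROM "{name}"', conn)
    for name in tables_df["name"]
}
print({name: frame.shape for name, frame in frames_dict.items()})
```

Because the table list is read at run time, the dictionary picks up tables that are added or dropped later without any code change.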
I am looking for a way to insert a large set of data into a SQL Server table from Python. The problem is that my DataFrame has over 200 columns. Currently I am using this code:
import pyodbc
import pandas as pd
server = 'yourservername'
database = 'AdventureWorks'
username = 'username'
password = 'yourpassword'
cnxn = pyodbc.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
cursor = cnxn.cursor()
for index, row in df.iterrows():
    cursor.execute("INSERT INTO dbo.mytable (A,B,C)values(?,?,?)", row.A, row.B, row.C)
cnxn.commit()
cursor.close()
The problem is the call with "INSERT INTO dbo.mytable (A,B,C)values(?,?,?)", row.A, row.B, row.C: I need to insert data with over 200 columns, and specifying each of these columns by hand is not really time efficient :(
I would appreciate any help!
Create the connection with SQLAlchemy.
Use df.to_sql() with the chunksize parameter (see the pandas documentation).
P.S. In my case, to_sql did not work with a connection that was not created through SQLAlchemy.
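If SQLAlchemy is not an option, the column list and the placeholders in the question's INSERT can be generated from the column names instead of typed out. This is only a sketch: build_insert_sql is a made-up helper, the identifier check is a simple assumption about what your column names look like, and the demo uses in-memory SQLite, which shares the "?" placeholder style with pyodbc.

```python
import re
import sqlite3

# Accept only plain identifiers, since names cannot be bound as parameters.
IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def build_insert_sql(table, columns, placeholder="?"):
    """Build an INSERT statement with one placeholder per column."""
    for name in [*table.split("."), *columns]:
        if not IDENTIFIER.match(name):
            raise ValueError(f"unexpected identifier: {name!r}")
    column_list = ", ".join(columns)
    placeholders = ", ".join([placeholder] * len(columns))
    return f"INSERT INTO {table} ({column_list}) VALUES ({placeholders})"


columns = ["A", "B", "C"]  # in practice: list(df.columns)
sql = build_insert_sql("mytable", columns)
print(sql)

# Demo: the generated statement works with executemany on any "?"-style driver.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (A INTEGER, B INTEGER, C INTEGER)")
conn.executemany(sql, [(1, 2, 3), (4, 5, 6)])
conn.commit()
stored = conn.execute("SELECT A, B, C FROM mytable ORDER BY A").fetchall()
print(stored)
```

With pyodbc, setting cursor.fast_executemany = True before calling executemany may speed up the bulk insert considerably, depending on the ODBC driver.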
Ok, I finally found a way:
import urllib.parse
import sqlalchemy
from sqlalchemy.pool import NullPool
serverName = 'xxx'
dataBase = 'zzz'
conn_str = urllib.parse.quote_plus(r'DRIVER={SQL Server};SERVER=' + serverName + r';DATABASE=' + dataBase + r';TRUSTED_CONNECTION=yes')
conn = 'mssql+pyodbc:///?odbc_connect={}'.format(conn_str)
engine = sqlalchemy.create_engine(conn, poolclass=NullPool)
connection = engine.connect()
df.to_sql("TableName", engine, schema='SchemaName', if_exists='append', index=True, chunksize=200)
connection.close()
I have been trying since yesterday to find a way to convert the output of a SQL query into a pandas DataFrame.
For example, code that does this:
data = select * from table
I've tried so many codes I've found on the internet but nothing seems to work.
Note that my database is stored in Azure DataBricks and I can only access the table using its URL.
Thank you so much !
Hope this would help you out. Both insertion & selection are in this code for reference.
import urllib.parse
import pyodbc
from sqlalchemy import create_engine
def db_insert_user_level_info(table_name):
    # Get your DataFrame here, either as an argument to the function or passed in directly
    df = df_parameter
    params = urllib.parse.quote_plus("DRIVER={SQL Server};SERVER=DESKTOP-ITAJUJ2;DATABASE=githubAnalytics")
    engine = create_engine("mssql+pyodbc:///?odbc_connect=%s" % params)
    engine.connect()
    table_row_count = select_row_count(table_name)
    df_row_count = df.shape[0]
    if table_row_count == df_row_count:
        print("Data Cannot Be Inserted Because The Row Count is Same")
    else:
        df.to_sql(name=table_name, con=engine, index=False, if_exists='append')
        print("********************************** DONE EXECUTED SUCCESSFULLY ***************************************************")
def select_row_count(table_name):
    cnxn = pyodbc.connect("Driver={SQL Server Native Client 11.0};"
                          "Server=DESKTOP-ITAJUJ2;"
                          "Database=githubAnalytics;"
                          "Trusted_Connection=yes;")
    cur = cnxn.cursor()
    try:
        db_cmd = "SELECT count(*) FROM " + table_name
        res = cur.execute(db_cmd)
        # Return the count from the first (and only) row of the result set
        for x in res:
            return x[0]
    except pyodbc.Error:
        print("Table is not Available , Please Wait...")
Using sqlalchemy to connect to the database, and the built-in method read_sql_query from pandas to go straight to a DataFrame:
import pandas as pd
from sqlalchemy import create_engine
engine = create_engine(url)
connection = engine.connect()
query = "SELECT * FROM table"
df = pd.read_sql_query(query,connection)
I'm getting an unexpected error when using sqlite3 in python with pandas. I'm using a sqlite database for an analysis I'm doing, so it's single-user, single-computer. I'm in Python 3.9.1, with sqlite 3.33.0 and pandas 1.2.1.
The short description is that I'm trying to loop over rows of Table1, and for each row, insert data into Table2 based on an API request using an ID stored in Table1. The API gets me a lot more columns than I need for Table2, so I do the following to insert it into a new temporary table, then copy over the columns I need into Table1:
my_dataframe.to_sql("tmp", conn, if_exists="replace", index=False)
cur.execute("INSERT INTO Table1 (col1, col2) SELECT col1, col2 FROM Table2")
The problem is, on the second iteration of the loop, I get an error when pandas tries to drop the tmp table. Here is the full code:
def get_data(api_id, conn):
    my_dataframe = call_to_api(api_id)
    my_dataframe.to_sql("tmp", conn, if_exists="replace", index=False)
    cur.execute("INSERT INTO Table1 (col1, col2) SELECT col1, col2 FROM Table2")
for chunk in pd.read_sql_query("SELECT id_for_api FROM Table1", conn, chunksize=10):
    ids = chunk["id_for_api"].values
    for api_id in ids:
        get_data(api_id, conn)
The error I get is:
DatabaseError: Execution failed on sql 'DROP TABLE "tmp"': database table is locked
which is raised by this line:
pd.DataFrame(data).to_sql("tmp", conn, if_exists="replace", index=False)
I've tried everything I could think of to fix this:
changing the connection to be isolation_level=None (autocommit)
adding conn.commit() after the INSERT statement
creating a new cursor within the get_data function (cur = conn.cursor())
creating a new connection for use in the outer loop with read_sql_query (conn2 = sqlite3.connect('mydb.db'))
What am I missing? Is there something about sqlite isolation levels or locking that I don't understand?
When you make your connection, set autocommit=True
@contextlib.contextmanager
def database_connect():
    db_conn = pyodbc.connect(
        autocommit=True,  # needed to prevent locks in DB with SPs
    )
    try:
        yield db_conn
    finally:
        db_conn.close()
...
with database_connect() as db_conn:
    df = pd.read_sql_query(
        f"EXEC {sp_table}.{sp_name} " + ",".join(f"@{a}=?" for a in kwargs.keys()),
        db_conn,
        params=list(kwargs.values())
    )
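For the sqlite3 question above, one plausible cause (my assumption, not something confirmed in the thread) is that read_sql_query with chunksize keeps a SELECT open on the connection while the loop body tries to drop the tmp table through that same connection. A minimal sketch that reads all the IDs before touching tmp, using only the standard sqlite3 module. The function fake_api is a hypothetical stand-in for the real API call, and the copy goes from tmp into Table2, as the description in the question suggests.

```python
import sqlite3


def fake_api(api_id):
    """Hypothetical stand-in for the API call; returns rows with an extra column."""
    return [(api_id, f"name-{api_id}", "unused")]


def load_one(conn, api_id):
    """Rebuild the tmp table for one ID and copy the needed columns to Table2."""
    cur = conn.cursor()
    cur.execute("DROP TABLE IF EXISTS tmp")
    cur.execute("CREATE TABLE tmp (col1 INTEGER, col2 TEXT, extra TEXT)")
    cur.executemany("INSERT INTO tmp VALUES (?, ?, ?)", fake_api(api_id))
    cur.execute("INSERT INTO Table2 (col1, col2) SELECT col1, col2 FROM tmp")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Table1 (id_for_api INTEGER)")
conn.execute("CREATE TABLE Table2 (col1 INTEGER, col2 TEXT)")
conn.executemany("INSERT INTO Table1 VALUES (?)", [(1,), (2,), (3,)])
conn.commit()

# fetchall() exhausts the SELECT, so no read is pending when tmp is dropped.
ids = [row[0] for row in conn.execute("SELECT id_for_api FROM Table1").fetchall()]
for api_id in ids:
    load_one(conn, api_id)

copied = conn.execute("SELECT col1, col2 FROM Table2 ORDER BY col1").fetchall()
print(copied)
```

In the pandas version, the equivalent change would be to call read_sql_query without chunksize and loop over the resulting column.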
I am looking to work in python with a table that I have in SQL. I want to store the entire table in a matrix called 'mat' and then get the output after the python code so I can read the table with SQL again. This is how I started:
import pyodbc
import pandas as pd
server = 'myserver'
database = 'mydatabase'
username = 'myuser'
password = 'mypassword'
cnxn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER='+server+';DATABASE='+database+';UID='+username+';PWD='+ password)
******Python code*******
mat=pd.read_sql('select * from mytable order by time' , con = cnxn)
How should I read the table to store it in mat and then how do I send it back to SQL?
You have already read the data into a DataFrame. If you want to convert a dataframe to a matrix, do mat.values. If you want to write the data to a sql table, you will have to create a cursor and use it to insert the data.
cursor = cnxn.cursor()
cursor.execute(''' INSERT INTO myTable (FirstName, LastName) VALUES ('Wilsamson', 'Shiphrah') ''')
If you have multiple values, you should use the executemany method:
values = list(zip(mat['FirstName'].values.tolist(), mat['LastName'].values.tolist()))
cursor.executemany('''INSERT INTO myTable (FirstName, LastName) VALUES (?, ?)''', values)
After the INSERT statements, you will need to commit the inserts before closing your cursor and connection.
cursor.commit()
cursor.close()
cnxn.close()
If you want to convert
This is how I do it.
import mysql.connector
import pandas as pd
import numpy as np
# use this to display ALL columns...useful, but definitely not required
pd.set_option('display.max_columns', None)
mydb = mysql.connector.connect(
host="localhost",
user="duser_name",
passwd="pswd",
database="db_naem"
)
mycursor = mydb.cursor()
mycursor.execute("SELECT * FROM YourTable")
myresult = mycursor.fetchall()
df = pd.DataFrame(myresult)
df.to_csv('C:\\path_here\\test.csv', sep=',')
You can easily convert a dataframe to a matrix.
np.array(df.to_records().view(type=np.matrix))
But I'm not sure why you want to do that. I think DataFrames are a lot more practical for most people's needs.
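If a plain array is really what you need, a simpler route than going through records is DataFrame.to_numpy(). A small sketch with made-up data (the column names time and value are only for illustration):

```python
import numpy as np
import pandas as pd

# Made-up stand-in for the table read from SQL.
df = pd.DataFrame({"time": [1, 2, 3], "value": [0.5, 1.5, 2.5]})

# 2-D NumPy array holding the same values, one row per table row.
mat = df.to_numpy()
print(mat.shape)

# Going back: wrap the array in a DataFrame again before writing it to SQL.
df_back = pd.DataFrame(mat, columns=df.columns)
print(df_back.shape)
```

Note that the array has a single dtype, so mixed integer and float columns all come back as floats.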