I'm getting an unexpected error when using sqlite3 in Python with pandas. I'm using an SQLite database for an analysis I'm doing, so it's single-user, single-computer. I'm on Python 3.9.1, with SQLite 3.33.0 and pandas 1.2.1.
The short description is that I'm looping over the rows of Table1 and, for each row, inserting data into Table2 based on an API request that uses an ID stored in Table1. The API returns many more columns than I need for Table2, so I insert the response into a new temporary table and then copy the columns I need into Table2:
my_dataframe.to_sql("tmp", conn, if_exists="replace", index=False)
cur.execute("INSERT INTO Table1 (col1, col2) SELECT col1, col2 FROM Table2")
The problem is, on the second iteration of the loop, I get an error when pandas tries to drop the tmp table. Here is the full code:
def get_data(api_id, conn):
    my_dataframe = call_to_api(api_id)
    my_dataframe.to_sql("tmp", conn, if_exists="replace", index=False)
    cur.execute("INSERT INTO Table2 (col1, col2) SELECT col1, col2 FROM tmp")

for chunk in pd.read_sql_query("SELECT id_for_api FROM Table1", conn, chunksize=10):
    ids = chunk["id_for_api"].values
    for api_id in ids:
        get_data(api_id, conn)
The error I get is:
DatabaseError: Execution failed on sql 'DROP TABLE "tmp"': database table is locked
which is raised by this line:
my_dataframe.to_sql("tmp", conn, if_exists="replace", index=False)
I've tried everything I could think of to fix this:
changing the connection to be isolation_level=None (autocommit)
adding conn.commit() after the INSERT statement
creating a new cursor within the get_data function (cur = conn.cursor())
creating a new connection for use in the outer loop with read_sql_query (conn2 = sqlite3.connect('mydb.db'))
What am I missing? Is there something about sqlite isolation levels or locking that I don't understand?
When you make your connection, set autocommit=True:
@contextlib.contextmanager
def database_connect():
    db_conn = pyodbc.connect(
        # (connection string arguments omitted here)
        autocommit=True,  # needed to prevent locks in DB with SPs
    )
    try:
        yield db_conn
    finally:
        db_conn.close()

...

with database_connect() as db_conn:
    df = pd.read_sql_query(
        f"EXEC {sp_table}.{sp_name} " + ",".join(f"@{a}=?" for a in kwargs.keys()),
        db_conn,
        params=list(kwargs.values()),
    )
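This answer uses pyodbc against SQL Server; for the sqlite3 setup in the question, the closest equivalent of autocommit=True is isolation_level=None (which the asker reports already trying), so the bigger factor may be reading all the ids up front rather than in chunks, which avoids holding a read cursor open on Table1 while tmp is dropped and recreated. A rough sketch, reusing get_data from the question:

import contextlib
import sqlite3

import pandas as pd

@contextlib.contextmanager
def sqlite_connect(path="mydb.db"):
    # isolation_level=None puts the sqlite3 connection in autocommit mode,
    # the closest equivalent of pyodbc's autocommit=True
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        yield conn
    finally:
        conn.close()

with sqlite_connect() as conn:
    # read all ids at once (no chunksize) so no cursor is left open on Table1
    ids = pd.read_sql_query("SELECT id_for_api FROM Table1", conn)["id_for_api"].tolist()
    for api_id in ids:
        # get_data from the question; it would also need a cursor created from this conn
        get_data(api_id, conn)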
Related
I am trying to create a few tables in Postgres from a pandas DataFrame, but I keep getting this error:
psycopg2.errors.InvalidForeignKey: there is no unique constraint matching given keys for referenced table "titles"
After looking into this problem for hours, I finally found that when I insert the data into the parent table from a pandas DataFrame, the primary key constraint gets removed for some reason, and because of that I get this error when trying to reference it from another table.
But I don't have this problem when I use pgAdmin 4 to create the table and insert a few rows of data manually.
You can see that when I create the tables using pgAdmin, the primary key and foreign keys are created as expected and I have no problem with them.
But when I insert the data from a pandas DataFrame using the psycopg2 library, the primary key is not created.
I can't understand why this is happening.
The code I am using to create the tables:
# imports needed by the code below
import csv
import os
from io import StringIO

import numpy as np
import pandas as pd
import psycopg2
from sqlalchemy import create_engine


# function for faster data insertion
def psql_insert_copy(table, conn, keys, data_iter):
    """
    Execute SQL statement inserting data

    Parameters
    ----------
    table : pandas.io.sql.SQLTable
    conn : sqlalchemy.engine.Engine or sqlalchemy.engine.Connection
    keys : list of str
        Column names
    data_iter : Iterable that iterates the values to be inserted
    """
    # gets a DBAPI connection that can provide a cursor
    dbapi_conn = conn.connection
    with dbapi_conn.cursor() as cur:
        s_buf = StringIO()
        writer = csv.writer(s_buf)
        writer.writerows(data_iter)
        s_buf.seek(0)

        columns = ", ".join('"{}"'.format(k) for k in keys)
        if table.schema:
            table_name = "{}.{}".format(table.schema, table.name)
        else:
            table_name = table.name

        sql = "COPY {} ({}) FROM STDIN WITH CSV".format(table_name, columns)
        cur.copy_expert(sql=sql, file=s_buf)
def create_titles_table():
    # connect to the database
    conn = psycopg2.connect(
        dbname="imdb",
        user="postgres",
        password=os.environ.get("DB_PASSWORD"),
        host="localhost",
    )
    # create a cursor
    c = conn.cursor()

    print()
    print("Creating titles table...")
    c.execute(
        """CREATE TABLE IF NOT EXISTS titles(
               title_id TEXT PRIMARY KEY,
               title_type TEXT,
               primary_title TEXT,
               original_title TEXT,
               is_adult INT,
               start_year REAL,
               end_year REAL,
               runtime_minutes REAL
           )
        """
    )
    # commit changes
    conn.commit()

    # read the title data
    df = load_data("title.basics.tsv")
    # replace \N with NaN
    df.replace("\\N", np.nan, inplace=True)
    # rename columns
    df.rename(
        columns={
            "tconst": "title_id",
            "titleType": "title_type",
            "primaryTitle": "primary_title",
            "originalTitle": "original_title",
            "isAdult": "is_adult",
            "startYear": "start_year",
            "endYear": "end_year",
            "runtimeMinutes": "runtime_minutes",
        },
        inplace=True,
    )
    # drop the genres column
    title_df = df.drop("genres", axis=1)
    # convert the data types from str to numeric
    title_df["start_year"] = pd.to_numeric(title_df["start_year"], errors="coerce")
    title_df["end_year"] = pd.to_numeric(title_df["end_year"], errors="coerce")
    title_df["runtime_minutes"] = pd.to_numeric(
        title_df["runtime_minutes"], errors="coerce"
    )

    # create SQLAlchemy engine
    engine = create_engine(
        "postgresql://postgres:" + os.environ["DB_PASSWORD"] + "@localhost:5432/imdb"
    )
    # insert the data into the titles table
    title_df.to_sql(
        "titles", engine, if_exists="replace", index=False, method=psql_insert_copy
    )
    # commit changes
    conn.commit()
    # close cursor
    c.close()
    # close the connection
    conn.close()
    print("Completed!")
    print()
def create_genres_table():
    # connect to the database
    conn = psycopg2.connect(
        dbname="imdb",
        user="postgres",
        password=os.environ.get("DB_PASSWORD"),
        host="localhost",
    )
    # create a cursor
    c = conn.cursor()

    print()
    print("Creating genres table...")
    c.execute(
        """CREATE TABLE IF NOT EXISTS genres(
               title_id TEXT NOT NULL,
               genre TEXT,
               FOREIGN KEY (title_id) REFERENCES titles(title_id)
           )
        """
    )
    # commit changes
    conn.commit()

    # read the data
    df = load_data("title.basics.tsv")
    # replace \N with NaN
    df.replace("\\N", np.nan, inplace=True)
    # rename columns
    df.rename(columns={"tconst": "title_id", "genres": "genre"}, inplace=True)
    # select only the relevant columns
    genres_df = df[["title_id", "genre"]].copy()
    # split the comma-separated genre list into one row per genre
    genres_df = genres_df.assign(genre=genres_df["genre"].str.split(",")).explode(
        "genre"
    )

    # create engine
    engine = create_engine(
        "postgresql://postgres:" + os.environ["DB_PASSWORD"] + "@localhost:5432/imdb"
    )
    # insert the data into the genres table
    genres_df.to_sql(
        "genres", engine, if_exists="replace", index=False, method=psql_insert_copy
    )
    # commit changes
    conn.commit()
    # close cursor
    c.close()
    # close the connection
    conn.close()
    print("Completed!")
    print()
if __name__ == "__main__":
print()
print("Creating IMDB Database...")
# connect to the database
conn = psycopg2.connect(
dbname="imdb",
user="postgres",
password=os.environ.get("DB_PASSWORD"),
host="localhost",
)
# create the titles table
create_titles_table()
# create genres table
create_genres_table()
# close the connection
conn.close()
print("Done with Everything!")
print()
I think the problem is to_sql(if_exists="replace"). Try to_sql(if_exists="append") instead: "replace" drops the whole table and creates a new one without any constraints, which is why your primary key disappears.
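Applied to the code above, that means keeping the CREATE TABLE statements as they are and only changing the two to_sql calls (a sketch):

# append into the pre-created tables so the PRIMARY KEY / FOREIGN KEY constraints survive
title_df.to_sql(
    "titles", engine, if_exists="append", index=False, method=psql_insert_copy
)
genres_df.to_sql(
    "genres", engine, if_exists="append", index=False, method=psql_insert_copy
)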
I have a DataFrame with around 30,000+ rows and 150+ columns. Currently I am using the following code to insert the data into MySQL, but since it inserts the rows one at a time, it is taking too long to insert them all.
Is there any way to insert the rows all at once, or in batches? The constraint here is that I can only use PyMySQL; I cannot install any other library.
import pymysql
import pandas as pd

# Create dataframe
data = pd.DataFrame({
    'book_id': [12345, 12346, 12347],
    'title': ['Python Programming', 'Learn MySQL', 'Data Science Cookbook'],
    'price': [29, 23, 27]
})

# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='12345',
                             db='book')

# create cursor
cursor = connection.cursor()

# creating column list for insertion
cols = "`,`".join([str(i) for i in data.columns.tolist()])

# Insert DataFrame records one by one
for i, row in data.iterrows():
    sql = "INSERT INTO `book_details` (`" + cols + "`) VALUES (" + "%s," * (len(row) - 1) + "%s)"
    cursor.execute(sql, tuple(row))
    # the connection is not autocommitted by default, so we must commit to save our changes
    connection.commit()

# Execute query
sql = "SELECT * FROM `book_details`"
cursor.execute(sql)

# Fetch all the records
result = cursor.fetchall()
for i in result:
    print(i)

connection.close()
Thank You.
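A minimal sketch of a batched variant that needs only PyMySQL itself (cursor.executemany), reusing data, cols, cursor and connection from the code above:

# one parameterized INSERT, executed for all rows in a single batch
sql = "INSERT INTO `book_details` (`" + cols + "`) VALUES (" + "%s," * (len(data.columns) - 1) + "%s)"
rows = [tuple(row) for _, row in data.iterrows()]
cursor.executemany(sql, rows)
connection.commit()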
Try using SQLAlchemy to create an engine that you can then use with pandas' df.to_sql function. This function writes rows from a pandas DataFrame to a SQL database and is much faster than iterating over the DataFrame and using the MySQL cursor.
Your code would look something like this:
import pymysql
import pandas as pd
from sqlalchemy import create_engine

# Create dataframe
data = pd.DataFrame({
    'book_id': [12345, 12346, 12347],
    'title': ['Python Programming', 'Learn MySQL', 'Data Science Cookbook'],
    'price': [29, 23, 27]
})

# create SQLAlchemy engine (pymysql driver, since that is the library available here)
db_data = ('mysql+pymysql://' + 'root' + ':' + '12345' + '@' + 'localhost' + ':3306/'
           + 'book' + '?charset=utf8mb4')
engine = create_engine(db_data)

# Connect to the database
connection = pymysql.connect(host='localhost',
                             user='root',
                             password='12345',
                             db='book')

# create cursor
cursor = connection.cursor()

# Execute to_sql to write the DataFrame into SQL
data.to_sql('book_details', engine, if_exists='append', index=False)

# Execute query
sql = "SELECT * FROM `book_details`"
cursor.execute(sql)

# Fetch all the records
result = cursor.fetchall()
for i in result:
    print(i)

engine.dispose()
connection.close()
You can take a look at all the options this function has in the pandas documentation.
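Two of those options are worth noting for large frames: chunksize and method="multi" (a sketch; how much they help depends on the driver and the table):

# send rows in chunks of 1,000, using multi-row INSERT statements
data.to_sql('book_details', engine, if_exists='append', index=False,
            chunksize=1000, method='multi')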
It is faster to push a file to the SQL server and let the server manage the input.
So first push the data to a CSV file.
data.to_csv("import-data.csv", header=False, index=False, quoting=2, na_rep="\\N")
And then load it at once into the SQL table.
sql = "LOAD DATA LOCAL INFILE \'import-data.csv\' \
INTO TABLE book_details FIELDS TERMINATED BY \',\' ENCLOSED BY \'\"\' \
(`" +cols + "`)"
cursor.execute(sql)
Possible improvements:
Remove or disable indexes on the table(s).
Take the commit out of the loop.
Now try to load the data again.
Alternatively, generate a CSV file and load it using LOAD DATA INFILE - this would be issued from within MySQL.
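If you go the LOAD DATA LOCAL INFILE route through PyMySQL, the client connection generally has to opt in (and the server-side local_infile setting must allow it); a minimal sketch, assuming the same credentials as in the question:

import pymysql

# local_infile=True lets this client issue LOAD DATA LOCAL INFILE;
# the MySQL server must also have local_infile enabled
connection = pymysql.connect(host='localhost', user='root', password='12345',
                             db='book', local_infile=True)
cursor = connection.cursor()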
Here's a rundown of what I'd like to do: I have a list of table names, and I want to run SQL against an Oracle database and pull back the table name and row count for every table in my list. However, not every table name in my list is necessarily in the database, and that causes my code to throw a database error. What I would like instead is, whenever I come to a table name that is not in the database, to create a DataFrame that contains the table name and, instead of count(*), some text that says 'table not found' or similar. At the end of the loop I concatenate all of the DataFrames into one DataFrame. The overall goal is to validate that certain tables exist and that they have the expected row counts.
query_list = []
df_List = []

connstr = '%s/%s@%s' % (username, password, server)
conn = cx_Oracle.connect(connstr)

with conn:
    query_list = ["SELECT '%s' as tbl, count(*) FROM %s." % (elm, database) + elm for elm in table_list]
    df_List = [pd.read_sql(elm, conn) for elm in query_list]
    df = pd.concat(df_List)
Consider try/except handling to return either the query output or a 'table not found' row:
def get_table_count(sql, conn, elm):
    try:
        return pd.read_sql(sql, conn)
    except:
        return pd.DataFrame({'tbl': elm, 'note': 'table not found'}, index=[0])

with conn:
    sql = "SELECT '{t}' as tbl, count(*) as table_count FROM {d}.{t}"
    df_List = [get_table_count(sql.format(t=elm, d=database), conn, elm)
               for elm in table_list]
    df = pd.concat(df_List, ignore_index=True)
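The bare except above also hides unrelated failures (bad credentials, SQL typos). If you prefer to catch only the error pandas raises when a query fails (an optional refinement, assuming pandas wraps the cx_Oracle error as it does in the tracebacks shown elsewhere on this page), something like:

from pandas.io.sql import DatabaseError

def get_table_count(sql, conn, elm):
    try:
        return pd.read_sql(sql, conn)
    except DatabaseError:
        # only query execution failures (e.g. ORA-00942: table or view does not exist)
        # fall back to the placeholder row; other bugs still surface
        return pd.DataFrame({'tbl': elm, 'note': 'table not found'}, index=[0])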
Get a list of all the table names that are in the DB, then loop over it to query each table for its row count.
Here is a SQL statement to get a list of all Tables in an Oracle DB:
SQL:
SELECT DISTINCT TABLE_NAME FROM ALL_TAB_COLUMNS ORDER BY TABLE_NAME ASC;
Python (to make list of tables you want row counts for and which exist in the DB):
list(set(tables_that_exist_in_DB) - (set(tables_that_exist_in_DB) - set(list_of_tables_you_want)))
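Putting the two steps together (the double set difference above is just an intersection; table_list and database are the variables from the question), a rough sketch:

cur = conn.cursor()
cur.execute("SELECT DISTINCT TABLE_NAME FROM ALL_TAB_COLUMNS")
tables_in_db = {row[0] for row in cur.fetchall()}

# only count tables that were requested AND actually exist
tables_to_count = set(table_list) & tables_in_db
df_List = [pd.read_sql("SELECT '%s' as tbl, count(*) as table_count FROM %s.%s" % (t, database, t), conn)
           for t in tables_to_count]
df = pd.concat(df_List, ignore_index=True)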
I am looking to work in Python with a table that I have in SQL. I want to store the entire table in a matrix called 'mat' and then, after the Python code runs, send the output back so I can read the table with SQL again. This is how I started:
import pyodbc
import pandas as pd

server = 'myserver'
database = 'mydatabase'
username = 'myuser'
password = 'mypassword'
cnxn = pyodbc.connect('DRIVER={ODBC Driver 13 for SQL Server};SERVER=' + server +
                      ';DATABASE=' + database + ';UID=' + username + ';PWD=' + password)

# ****** Python code ******
mat = pd.read_sql('select * from mytable order by time', con=cnxn)
How should I read the table to store it in mat and then how do I send it back to SQL?
You have already read the data into a DataFrame. If you want to convert the DataFrame to a matrix, use mat.values. If you want to write the data back to a SQL table, you will have to create a cursor and use it to insert the data.
cursor = cnxn.cursor()
cursor.execute(''' INSERT INTO myTable (FirstName, LastName) VALUES ('Wilsamson', 'Shiphrah') ''')
If you have multiple rows to insert, use the executemany command:
values = list(zip(mat['FirstName'].values.tolist(), mat['LastName'].values.tolist()))
cursor.executemany('''INSERT INTO myTable (FirstName, LastName) VALUES (?, ?)''', values);
After the INSERT statements, you will need to commit the inserts before closing your cursor and connection.
cursor.commit()
cursor.close()
cnxn.close()
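If the executemany call above is slow for large frames, pyodbc's fast_executemany flag is worth a try (a sketch; support depends on the ODBC driver, and it works best with the Microsoft SQL Server drivers):

cursor = cnxn.cursor()
cursor.fast_executemany = True  # send parameters in bulk batches instead of row by row
cursor.executemany('''INSERT INTO myTable (FirstName, LastName) VALUES (?, ?)''', values)
cnxn.commit()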
This is how I do it.
import mysql.connector
import pandas as pd
import numpy as np

# use this to display ALL columns...useful, but definitely not required
pd.set_option('display.max_columns', None)

mydb = mysql.connector.connect(
    host="localhost",
    user="user_name",
    passwd="pswd",
    database="db_name"
)

mycursor = mydb.cursor()
mycursor.execute("SELECT * FROM YourTable")
myresult = mycursor.fetchall()

df = pd.DataFrame(myresult)
df.to_csv('C:\\path_here\\test.csv', sep=',')
You can easily convert a dataframe to a matrix.
np.array(df.to_records().view(type=np.matrix))
But I'm not sure why you would want to do that; I think DataFrames are a lot more practical for most people's needs.
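One small addition to the fetchall() approach above: the result set carries no column names, so the DataFrame ends up with numeric headers. They can be taken from the cursor metadata (a sketch):

# take column names from the cursor description so the DataFrame keeps its headers
columns = [col[0] for col in mycursor.description]
df = pd.DataFrame(myresult, columns=columns)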
I am using the Python library ibm_db, with which I am able to establish a connection and read tables into DataFrames.
The problem comes when writing into a DB2 table (INSERT query) from a DataFrame source in Python.
Below is sample code for the connection, but can someone help me insert all records from a DataFrame into a target table in DB2?
import pandas as pd
import ibm_db
import ibm_db_dbi

ibm_db_conn = ibm_db.connect("DATABASE=" + "database_name" + ";HOSTNAME=" + "localhost" +
                             ";PORT=" + "50000" + ";PROTOCOL=TCPIP;UID=" + "db2user" +
                             ";PWD=" + "password#123" + ";", "", "")
conn = ibm_db_dbi.Connection(ibm_db_conn)

df = pd.read_sql("SELECT * FROM SCHEMA1.TEST_TABLE", conn)
print(df)
I am also able to insert a record manually using an SQL statement with hard-coded values:
query = "INSERT INTO SCHEMA1.TEST_TABLE (Col1, Col2, Col3) VALUES('A', 'B', 0)"
print query
stmt = ibm_db.exec_immediate(ibm_db_conn, query)
print stmt
What I am unable to do is insert rows from a DataFrame, appending them to the table. I've tried DataFrame.to_sql() as well:
df.to_sql(name='TEST_TABLE', con=conn, flavor=None, schema='SCHEMA1', if_exists='append', index=True, index_label=None, chunksize=None, dtype=None)
but it fails with:
pandas.io.sql.DatabaseError: Execution failed on sql 'SELECT name FROM sqlite_master WHERE type='table' AND name=?;': ibm_db_dbi::ProgrammingError: SQLNumResultCols failed: [IBM][CLI Driver][DB2/LINUXX8664] SQL0204N "SCHEMA1.SQLITE_MASTER" is an undefined name. SQLSTATE=42704 SQLCODE=-204
You can write a pandas DataFrame into IBM Db2 using ibm_db.execute_many():
subset = df[['col1','col2', 'col3']]
tuple_of_tuples = tuple([tuple(x) for x in subset.values])
sql = "INSERT INTO Schema.Table VALUES(?,?,?)"
cnn = ibm_db.connect("DATABASE=database;HOSTNAME=127.0.0.1;PORT=50000;PROTOCOL=TCPIP;UID=username;PWD=password;", "", "")
stmt = ibm_db.prepare(cnn, sql)
ibm_db.execute_many(stmt, tuple_of_tuples)
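Applied to the table from the question, the same pattern with the target columns named explicitly (an illustrative sketch; call ibm_db.commit(ibm_db_conn) afterwards if autocommit is off on your connection):

# select the DataFrame columns in the same order as the target columns
# (DB2 usually returns unquoted identifiers in upper case, e.g. COL1)
subset = df[['COL1', 'COL2', 'COL3']]
tuple_of_tuples = tuple(tuple(x) for x in subset.values)

# naming the target columns explicitly keeps the VALUES order unambiguous
sql = "INSERT INTO SCHEMA1.TEST_TABLE (Col1, Col2, Col3) VALUES (?, ?, ?)"
stmt = ibm_db.prepare(ibm_db_conn, sql)
ibm_db.execute_many(stmt, tuple_of_tuples)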