I currently have a Python dataframe that is 23 columns and 20,000 rows.
Using Python code, I want to write my data frame to an MSSQL server that I have the credentials for.
As a test I am able to successfully write some values into the table using the code below:
import pypyodbc

connection = pypyodbc.connect('Driver={SQL Server};'
                              'Server=XXX;'
                              'Database=XXX;'
                              'uid=XXX;'
                              'pwd=XXX')
cursor = connection.cursor()
for index, row in df_EVENT5_15.iterrows():
    # Unfinished statement, kept commented out so the test insert below runs:
    # cursor.execute("INSERT INTO MODREPORT(rowid, OPCODE, LOCATION, TRACKNAME)
    cursor.execute("INSERT INTO MODREPORT(rowid, location) VALUES (?,?)", (5, 'test'))
connection.commit()
But how do I write all the rows in my data frame table to the MSSQL server? In order to do so, I need to code up the following steps in my Python environment:
Delete all the rows in the MSSQL server table
Write my dataframe to the server
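When you say Python data frame, I'm assuming you mean a pandas DataFrame. If so, you can use its to_sql method. Note that for SQL Server the connection passed to to_sql should be a SQLAlchemy engine or connection; a raw pypyodbc connection is not a supported argument.
df.to_sql("MODREPORT", connection, if_exists="replace")
Setting if_exists to "replace" drops the existing table and recreates it before writing the records, so the old rows are removed together with the table's original definition. To keep the table definition, empty the table first and write with if_exists="append" instead.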
I realise it's been a while since you asked but the easiest way to delete ALL the rows in the SQL server table (point 1 of the question) would be to send the command
TRUNCATE TABLE Tablename
This removes all the data in the table but leaves the table and its indexes in place (empty), so neither you nor the DBA needs to recreate them. It also uses less of the transaction log when it runs.
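The two steps from the question (clear the table, then write every row) can also be sketched with plain DB-API calls. This is only a sketch under assumptions, not the answerers' code: replace_table_rows is a hypothetical helper, the sample rows are made up, and an in-memory SQLite database stands in for the SQL Server connection so the example runs anywhere. With the pypyodbc connection from the question, the clearing statement could be passed as clear_sql="TRUNCATE TABLE MODREPORT".

```python
import sqlite3


def replace_table_rows(connection, table, columns, rows, clear_sql=None):
    """Delete every row in `table`, then insert `rows`, in one transaction.

    Hypothetical helper. `table` and `columns` are interpolated into the SQL
    text, so they must come from trusted code, never from user input.
    """
    placeholders = ", ".join("?" for _ in columns)
    insert_sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    cursor = connection.cursor()
    try:
        # Step 1: empty the table (plain DELETE unless the caller overrides it).
        cursor.execute(clear_sql or f"DELETE FROM {table}")
        # Step 2: send all rows through one parameterized statement.
        cursor.executemany(insert_sql, rows)
        connection.commit()
    except Exception:
        connection.rollback()
        raise
    finally:
        cursor.close()


# Stand-in database with one stale row (SQLite instead of SQL Server).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE MODREPORT (OPCODE TEXT, LOCATION TEXT, TRACKNAME TEXT)")
connection.execute("INSERT INTO MODREPORT VALUES ('old', 'old', 'old')")

new_rows = [("A1", "north", "track 1"), ("B2", "south", "track 2")]
replace_table_rows(connection, "MODREPORT", ["OPCODE", "LOCATION", "TRACKNAME"], new_rows)
stored = connection.execute(
    "SELECT OPCODE, LOCATION, TRACKNAME FROM MODREPORT ORDER BY OPCODE"
).fetchall()
```

For a pandas DataFrame, the rows argument could be built with list(df.itertuples(index=False, name=None)), assuming the DataFrame columns are in the same order as the column list.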
Related
I have a relatively small sqlite3 database (~2.6GB) with 820k rows and 26 columns (single table). I run an iterative process, and every time new data is generated, the data is placed in a pandas dataframe, and inserted into the sqlite database with the function insert_values_to_table. This process operates fine and is very fast.
After every data insert, the database is sanitized from its duplicate row listings (all 26 columns need to be duplicate) with the function sanitize_database. This operation connects to the database in similar fashion, creates a cursor, and executes the following logic: Create new temporary_table with only unique values from original table --> Delete all rows from original table --> Insert all rows from temporary table into empty original table --> Drop the temporary table.
It works, but the sanitize_database function is extremely slow, and can easily take up to an hour for even this small dataset. I tried to set a certain column as primary key, or to unique value, however, pandas.DataFrame.to_sql does not allow for this operation as it can either insert the whole dataframe at once, or none at all. That functionality can be reviewed here (append_skipdupes).
Is there a method to make this process more efficient?
#Function to insert pandas dataframes into SQLITE3 database
def insert_values_to_table(table_name, output):
    conn = connect_to_db("/mnt/wwn-0x5002538e00000000-part1/DATABASE/table_name.db")
    #if connection exists perform data insertion
    if conn is not None:
        c = conn.cursor()
        #Add pandas data (output) into sql database
        output.to_sql(name=table_name, con=conn, if_exists='append', index=False)
        #Close connection
        conn.close()
        print('SQL insert process finished')

#To keep only unique rows in SQLITE3 database
def sanitize_database():
    conn = connect_to_db("/mnt/wwn-0x5002538e00000000-part1/DATABASE/table_name.db")
    c = conn.cursor()
    c.executescript("""
        CREATE TABLE temp_table as SELECT DISTINCT * FROM table_name;
        DELETE FROM table_name;
        INSERT INTO table_name SELECT * FROM temp_table;
        DROP TABLE temp_table
    """)
    conn.close()
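One alternative that is sometimes used for this kind of cleanup is to delete the duplicates in place, keeping the row with the lowest rowid in each group of identical rows, instead of copying the whole table out and back. The sketch below is an assumption-laden illustration, not a measured fix: delete_duplicate_rows is a hypothetical helper, the three-column table is a made-up stand-in for the 26-column one, and whether it is actually faster on the real database would need to be measured.

```python
import sqlite3


def delete_duplicate_rows(conn, table, columns):
    """Keep the first copy (lowest rowid) of each distinct row; delete the rest.

    Hypothetical helper. `table` and `columns` are interpolated into the SQL
    text, so they must come from trusted code. Returns the number of rows deleted.
    """
    column_list = ", ".join(columns)
    sql = (
        f"DELETE FROM {table} WHERE rowid NOT IN "
        f"(SELECT MIN(rowid) FROM {table} GROUP BY {column_list})"
    )
    with conn:  # commits on success, rolls back on error
        deleted = conn.execute(sql).rowcount
    return deleted


# Small stand-in table with two extra copies of the first row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table_name (a INTEGER, b TEXT, c REAL)")
conn.executemany(
    "INSERT INTO table_name VALUES (?, ?, ?)",
    [(1, "x", 0.5), (1, "x", 0.5), (2, "y", 1.5), (1, "x", 0.5)],
)
removed = delete_duplicate_rows(conn, "table_name", ["a", "b", "c"])
remaining = conn.execute("SELECT a, b, c FROM table_name ORDER BY a").fetchall()
```

All columns that define a duplicate must be listed in columns; for the table in the question that would be all 26.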
I want to import the data in the file "save.csv" into my Actian PSQL database table "new_table", but I get this error:
ProgrammingError: ('42000', "[42000] [PSQL][ODBC Client Interface][LNA][PSQL][SQL Engine]Syntax Error: INSERT INTO 'new_table'<< ??? >> ('name','address','city') VALUES (%s,%s,%s) (0) (SQLPrepare)")
Below is my code:
connection = 'Driver={Pervasive ODBC Interface};server=localhost;DBQ=DEMODATA'
db = pyodbc.connect(connection)
c=db.cursor()
#create table i.e new_table
csv = pd.read_csv(r"C:\Users\user\Desktop\save.csv")
for row in csv.iterrows():
    insert_command = """INSERT INTO new_table(name,address,city) VALUES (row['name'],row['address'],row['city'])"""
    c.execute(insert_command)
c.commit()
pandas has a built-in method, DataFrame.to_sql(), that writes a DataFrame into a SQL database. This might be what you are looking for: instead of manually inserting one row at a time, you can insert the entire DataFrame at once.
If you want to keep using your method, the issue might be that the table "new_table" hasn't been created yet in the database. And thus you first need something like this:
CREATE TABLE new_table
(
Name [nvarchar](100) NULL,
Address [nvarchar](100) NULL,
City [nvarchar](100) NULL
)
EDIT:
You can use to_sql() like this on tables that already exist in the database:
df.to_sql(
    "new_table",
    schema="name_of_the_schema",
    con=c.session.connection(),
    if_exists="append",  # <--- This will append to an already existing table
    chunksize=10000,
    index=False,
)
I have tried the same. In my case the table is already created; I just want to insert each row from the pandas dataframe into the database using Actian PSQL.
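For the row-by-row route, note that the statement in the question puts row['name'] inside the SQL string itself, so the database receives that literal text rather than the values. A parameterized statement passes the values separately; pyodbc uses ? as the placeholder. The following is a minimal sketch under assumptions: insert_rows is a hypothetical helper, the CSV text is made-up sample data standing in for save.csv, and SQLite (which also uses ? placeholders) stands in for the Actian PSQL connection so the example can run without an ODBC driver.

```python
import csv
import io
import sqlite3


def insert_rows(connection, rows):
    """Insert (name, address, city) tuples with a parameterized statement."""
    sql = "INSERT INTO new_table (name, address, city) VALUES (?, ?, ?)"
    cursor = connection.cursor()
    # The driver binds each tuple to the three ? placeholders.
    cursor.executemany(sql, rows)
    connection.commit()
    cursor.close()


# Made-up sample data standing in for the contents of save.csv.
csv_text = "name,address,city\nAnn,1 Main St,Austin\nBob,2 Oak Ave,Boston\n"
reader = csv.DictReader(io.StringIO(csv_text))
rows = [(r["name"], r["address"], r["city"]) for r in reader]

# Stand-in connection; with pyodbc this would be pyodbc.connect(...).
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE new_table (name TEXT, address TEXT, city TEXT)")
insert_rows(connection, rows)
inserted = connection.execute(
    "SELECT name, address, city FROM new_table ORDER BY name"
).fetchall()
```

With a DataFrame read by pd.read_csv, the same rows list could be produced with list(df[["name", "address", "city"]].itertuples(index=False, name=None)).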
I have created a database using sqlite3 in python that has thousands of tables. Each of these tables contains thousands of rows and ten columns. One of the columns is the date and time of an event: it is a string that is formatted as YYYY-mm-dd HH:MM:SS, which I have defined to be the primary key for each table. Every so often, I collect some new data (hundreds of rows) for each of these tables. Each new dataset is pulled from a server and loaded in directly as a pandas data frame or is stored as a CSV file. The new data contains the same ten columns as my original data. I need to update the tables in my database using this new data in the following way:
Given a table in my database, for each row in the new dataset, if the date and time of the row matches the date and time of an existing row in my database, update the remaining columns of that row using the values in the new dataset.
If the date and time does not yet exist, create a new row and insert it to my database.
Below are my questions:
I've done some searching on Google and it looks like I should be using the UPSERT (merge) functionality of sqlite but I can't seem to find any examples showing how to use it. Is there an actual UPSERT command, and if so, could someone please provide an example (preferably with sqlite3 in Python) or point me to a helpful resource?
Also, is there a way to do this in bulk so that I can UPSERT each new dataset into my database without having to go row by row? (I found this link, which suggests that it is possible, but I'm new to using databases and am not sure how to actually run the UPSERT command.)
Can UPSERT also be performed directly using pandas.DataFrame.to_sql?
My backup solution is loading in the table to be UPSERTed using pd.read_sql_query("SELECT * from table", con), performing pandas.DataFrame.merge, deleting the said table from the database, and then adding in the updated table to the database using pd.DataFrame.to_sql (but this would be inefficient).
Instead of using the UPSERT command, why not write your own algorithm that looks for each date and time, replaces the values if it is found, and inserts a new row otherwise? Check out the code I wrote for you below, and let me know if you are still confused. You can apply it to hundreds of tables by replacing the table name in the algorithm with a variable and looping over the list of your table names.
import sqlite3
import pandas as pd

connection_str = "my_database.db"  # Placeholder: path to your SQLite database file
csv_data = pd.read_csv("my_CSV_file.csv")  # Your CSV data path

def manual_upsert():
    con = sqlite3.connect(connection_str)
    cur = con.cursor()
    cur.execute("SELECT my_date_column FROM my_CSV_data")  # Fetch the dates already stored
    # Set of all dates already in the database table (fast membership tests).
    old_dates = {line[0] for line in cur.fetchall()}
    # itertuples yields one tuple per row; iterating the DataFrame directly
    # would yield only the column names.
    for new_data in csv_data.itertuples(index=False, name=None):
        if new_data[0] in old_dates:  # Assumes the date column is at index 0
            # Update the remaining columns of the row with the matching date.
            cur.execute("UPDATE my_CSV_data SET column1=?, column2=?, column3=? WHERE my_date_column=?",
                        (new_data[1], new_data[2], new_data[3], new_data[0]))
        else:
            # Insert a new row if the date is not found.
            cur.execute("INSERT INTO my_CSV_data VALUES(?,?,?,?)",
                        (new_data[0], new_data[1], new_data[2], new_data[3]))
    con.commit()
    con.close()

manual_upsert()
First, even though the questions are related, ask them separately in the future.
The SQLite documentation covers UPSERT handling, but it is a bit abstract. You can find examples and discussion here: SQLite - UPSERT *not* INSERT or REPLACE
Wrap the statements in a single transaction and they will be executed in bulk.
As the existence of this library suggests, to_sql does not generate UPSERT commands (only INSERT).
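As a concrete illustration of the first two points, SQLite (version 3.24.0 and later) accepts an ON CONFLICT ... DO UPDATE clause on INSERT, and executemany inside one transaction applies it to a whole batch. This is a minimal sketch under assumptions: the events table, its column names, and the sample rows are invented for the example and stand in for one of the ten-column tables keyed on the date-time string.

```python
import sqlite3

# Hypothetical table and column names; event_time is the primary key.
UPSERT_SQL = """
INSERT INTO events (event_time, value_a, value_b)
VALUES (?, ?, ?)
ON CONFLICT(event_time) DO UPDATE SET
    value_a = excluded.value_a,
    value_b = excluded.value_b
"""


def upsert_rows(con, rows):
    """Insert new rows and update existing ones, all in one transaction."""
    with con:  # commits on success, rolls back on error
        con.executemany(UPSERT_SQL, rows)


con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (event_time TEXT PRIMARY KEY, value_a REAL, value_b REAL)")

# Existing data.
upsert_rows(con, [
    ("2024-01-01 00:00:00", 1.0, 2.0),
    ("2024-01-01 01:00:00", 3.0, 4.0),
])
# New batch: the first row matches an existing key, the second is new.
upsert_rows(con, [
    ("2024-01-01 01:00:00", 30.0, 40.0),
    ("2024-01-01 02:00:00", 5.0, 6.0),
])
result = con.execute("SELECT * FROM events ORDER BY event_time").fetchall()
```

The excluded prefix refers to the row that the INSERT tried to add. A DataFrame can be passed as rows via list(df.itertuples(index=False, name=None)), assuming its columns are in the same order as the placeholders.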
I am trying to upload data from a CSV file (it's on my local desktop) to my remote SQL database. This is my code:
dsn = "dsnname";pwd="password"
import pyodbc
csv_data =open(r'C:\Users\folder\Desktop\filename.csv')
def func(dsn):
    cnnctn=pyodbc.connect(dsn)
    cnnctn.autocommit =True
    cur=cnnctn.cursor()
    for rows in csv_data:
        cur.execute("insert into database.tablename (colname) value(?)", rows)
        cur.commit()
    cnnctn.commit()
    cur.close()
    cnnctn.close()
    return()
c=func(dsn)
The problem is that all of my data gets uploaded in one col- that I specified. If I don't specify a col name it won't run. I have 9 cols in my database table and I want to upload this data into separate cols.
When you insert with SQL, you need to specify which columns you are inserting into. For example, when you execute:
INSERT INTO table (column_name) VALUES (val);
You are letting SQL know that you want to map column_name to val for that specific row. So, you need to make sure that the number of columns in the first parentheses matches the number of values in the second set of parentheses.
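Applied to the CSV upload, that means splitting each line into its fields and supplying one placeholder per column, rather than passing the whole line as a single value. Here is a minimal sketch under assumptions: load_csv is a hypothetical helper, the column names col1 to col9 and the sample lines are invented, and SQLite stands in for the remote database so the example can run locally.

```python
import csv
import io
import sqlite3


def load_csv(connection, table, columns, csv_file):
    """Insert each CSV line as one row, one field per column.

    Hypothetical helper. `table` and `columns` are interpolated into the SQL
    text, so they must come from trusted code. Returns the number of rows.
    """
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    rows = []
    for line_number, fields in enumerate(csv.reader(csv_file), start=1):
        # The number of values must match the number of listed columns.
        if len(fields) != len(columns):
            raise ValueError(
                f"line {line_number}: expected {len(columns)} fields, got {len(fields)}"
            )
        rows.append(fields)
    cursor = connection.cursor()
    cursor.executemany(sql, rows)
    connection.commit()
    cursor.close()
    return len(rows)


# Nine invented column names, matching the nine columns in the question.
columns = [f"col{i}" for i in range(1, 10)]
connection = sqlite3.connect(":memory:")
connection.execute(f"CREATE TABLE tablename ({', '.join(c + ' TEXT' for c in columns)})")

# Made-up file contents: two lines with nine fields each.
csv_file = io.StringIO("a,b,c,d,e,f,g,h,i\nj,k,l,m,n,o,p,q,r\n")
loaded = load_csv(connection, "tablename", columns, csv_file)
first_row = connection.execute("SELECT * FROM tablename ORDER BY col1").fetchone()
```

If the real file has a header line, it would need to be skipped (for example with next(csv_file)) before calling the helper.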
I am using the Python code below to update a Postgres DB column value based on id. The loop has to run for thousands of records and it takes a long time.
Is there a way where I can pass array of dataframe values instead of looping each row?
for i in range(0,len(df)):
    QUERY=""" UPDATE "Table" SET "value"='%s' WHERE "Table"."id"='%s'
    """ % (df['value'][i], df['id'][i])
    cur.execute(QUERY)
conn.commit()
Depends on a library you use to communicate with PostgreSQL, but usually bulk inserts are much faster via COPY FROM command.
If you use psycopg2 it is as simple as following:
cursor.copy_from(io.StringIO(string_variable), "destination_table", columns=('id', 'value'))
Where string_variable is tab and new line delimited dataset like 1\tvalue1\n2\tvalue2\n.
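To achieve a performant bulk update I would:
Create a temporary table holding just the columns to load: CREATE TEMPORARY TABLE tmp_table AS SELECT id, value FROM destination_table WITH NO DATA;
Insert the records into it with copy_from;
Update the destination table in a single statement: UPDATE destination_table d SET value = t.value FROM tmp_table t WHERE d.id = t.id (or any other preferred syntax). The table alias is needed because a bare id would be ambiguous between the two tables.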
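The steps above can be sketched as follows. This is an illustration under assumptions rather than a tested recipe: rows_to_copy_buffer and bulk_update are hypothetical helpers, cursor is assumed to be a psycopg2 cursor on an open connection, and the table and column names are the placeholders used in the answer. Only the buffer-building part runs without a PostgreSQL server.

```python
import io


def rows_to_copy_buffer(rows):
    """Serialize (id, value) pairs as tab-separated, newline-terminated text.

    Assumes the values contain no tabs, newlines, or backslashes; such
    characters would need escaping for the COPY text format.
    """
    lines = [f"{row_id}\t{value}\n" for row_id, value in rows]
    return io.StringIO("".join(lines))


def bulk_update(cursor, rows):
    """Update destination_table.value for many ids via a temporary table.

    `cursor` is assumed to be a psycopg2 cursor; the caller commits.
    """
    # Step 1: empty temporary table with the same id/value column types.
    cursor.execute(
        "CREATE TEMPORARY TABLE tmp_table AS "
        "SELECT id, value FROM destination_table WITH NO DATA"
    )
    # Step 2: stream all rows into the temporary table with COPY.
    cursor.copy_from(rows_to_copy_buffer(rows), "tmp_table", columns=("id", "value"))
    # Step 3: one set-based UPDATE instead of one statement per row.
    cursor.execute(
        "UPDATE destination_table d SET value = t.value "
        "FROM tmp_table t WHERE d.id = t.id"
    )
    # Clean up so the helper can be called again in the same session.
    cursor.execute("DROP TABLE tmp_table")


# The buffer for two rows, matching the format described in the answer.
buffer_text = rows_to_copy_buffer([(1, "value1"), (2, "value2")]).getvalue()
```

With the DataFrame from the question, the call might look like bulk_update(cur, zip(df['id'], df['value'])) followed by conn.commit().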