How can I insert pandas dataframe to database without data duplication? - python

I used the .to_sql function to insert the data, but it can't check for duplicate rows on insert (it only checks whether the table itself already exists).
Source code (I run the line below twice):
userData.to_sql(con=engine, name='test_quest_complete', schema='test', if_exists='append')
Result: the same data was inserted into the table twice.
0 2016-11-14 00:00:10 AAAA
1 2016-11-14 00:00:20 BBBB
0 2016-11-14 00:00:10 AAAA
1 2016-11-14 00:00:20 BBBB
How can I insert pandas dataframe to database without data duplication?
(I also tried LOAD DATA LOCAL INFILE, but I can't use it because of security restrictions.)

If you have administration rights on your database, I would suggest you put a unique constraint on the table itself. The Python insert will then raise an exception for duplicate rows (and you can intercept it).
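For example, a minimal sketch of that approach, assuming a MySQL backend and illustrative column names (complete_time, user_id) for the unique constraint; adapt both to your actual schema:

import pandas as pd
from sqlalchemy import create_engine, text
from sqlalchemy.exc import IntegrityError

engine = create_engine('mysql+pymysql://user:password@localhost/test')  # assumed connection string

# One-time setup: a unique constraint so the database itself rejects duplicate rows.
with engine.begin() as conn:
    conn.execute(text(
        'ALTER TABLE test.test_quest_complete '
        'ADD CONSTRAINT uq_quest UNIQUE (complete_time, user_id)'))

# Insert row by row so a duplicate only skips that row instead of failing the whole frame.
for row in userData.itertuples(index=False):
    try:
        pd.DataFrame([row._asdict()]).to_sql(
            'test_quest_complete', engine, schema='test',
            if_exists='append', index=False)
    except IntegrityError:
        pass  # duplicate row: intercept the exception and move on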
Otherwise, you can retrieve the existing data from the table first and merge it in pandas: keep only the rows that do not already exist as a new dataframe and insert that.

import pandas as pd
import pypyodbc
from sqlalchemy import create_engine
##Data of Excel File - ExcelData(Sheet1)
##id name
##1 11
##2 22
##3 33
##4 44
##5 55
##CREATE TABLE [test].[test_quest_complete](
## [id] [int] NULL,
## [name] [int] NULL
##)
TblName="test_quest_complete"
cnxn = pypyodbc.connect("dsn=mydsn;Trusted_Connection=Yes")
engine = create_engine("mssql+pyodbc://mydsn")
file_name="C:\Users\poonamr\Desktop\ExcelData.xlsx"
xl = pd.ExcelFile(file_name)
userData = xl.parse("Sheet1")
print(userData)
sql="Select * From test." + TblName
tblData=pd.read_sql(sql,cnxn)
print(tblData)
Finalresult=pd.concat([userData, tblData]).drop_duplicates(keep=False)
print(Finalresult)
Finalresult.to_sql(TblName, engine, if_exists='append',schema='test', index=False)
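Note that concat(...).drop_duplicates(keep=False) also keeps rows that exist only in the database table, and those would then be re-inserted. If you only want the rows that are new on the Excel side, a left anti-join via merge(indicator=True) is closer to the idea; a sketch, assuming userData and tblData share the same column names and dtypes:

# Keep only the rows of userData that are not already present in tblData.
merged = userData.merge(tblData, how='left', indicator=True)
new_rows = merged[merged['_merge'] == 'left_only'].drop(columns='_merge')
new_rows.to_sql(TblName, engine, if_exists='append', schema='test', index=False)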

Related

Is there a way to store both a pandas dataframe and separate string var in the same SQLite table?

Disclaimer: This is my first time posting a question here, so I apologize if I didn't post this properly.
I recently started learning how to use SQLite in Python. As the title suggests, I have a Python object with a string attribute and a pandas dataframe attribute, and I want to know if/how I can add both of these to the same SQLite table. Below is the code I have thus far. The mydb.db file gets created successfully, but on insert I get the following error message:
sqlite3.InterfaceError: Error binding parameter :df- probably unsupported type.
I know you can use df.to_sql('mydbs', conn) to store a pandas dataframe in an SQL table, but this wouldn't seem to allow for an additional string to be added to the same table and then retrieved separately from the dataframe. Any solutions or alternative suggestions are appreciated.
Python Code:
# Python 3.7
import sqlite3
import pandas as pd
import myclass
conn = sqlite3.connect("mydb.db")
c = conn.cursor()
c.execute("""CREATE TABLE mydbs (
name text,
df blob
)""")
conn.commit()
c.execute("INSERT INTO mydbs VALUES (:name, :df)", {'name': myclass.name, 'df': myclass.df})
conn.commit()
conn.close()
It looks like you are trying to store a dataframe in a single SQL table 'cell'. This is a bit odd, since SQL is meant for storing tables of data, and a dataframe is arguably something that should be stored as a table of its own (hence the built-in pandas function). To accomplish what you want specifically, you could pickle the dataframe and store it:
import codecs
import pickle
import pandas as pd
import sqlite3
df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
pickled = codecs.encode(pickle.dumps(df), "base64").decode()
df
foo bar
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
Store & Retrieve:
conn = sqlite3.connect("mydb.db")
c = conn.cursor()
c.execute("""CREATE TABLE mydbs (
name text,
df text
)""")
c.execute("INSERT INTO mydbs VALUES (:name, :df)", {'name': 'name', 'df': pickled})
conn.commit()
c.execute('SELECT * FROM mydbs')
result = c.fetchall()
unpickled = pickle.loads(codecs.decode(result[0][1].encode(), "base64"))
conn.close()
unpickled
foo bar
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
If you wanted to store the dataframe as an SQL table (which imo makes more sense and is simpler) and you needed to keep a name with it, you could just add a 'name' column to the df:
import pandas as pd
from sqlalchemy import create_engine
df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})
df
foo bar
0 0 5
1 1 6
2 2 7
3 3 8
4 4 9
Add name column, then save to db and retrieve:
df['name'] = 'the df name'
engine = create_engine('sqlite://', echo=False)
df.to_sql('users', con=engine)
r = engine.execute("SELECT * FROM users").fetchall()
r = pd.read_sql('users', con=engine)
r
index foo bar name
0 0 0 5 the df name
1 1 1 6 the df name
2 2 2 7 the df name
3 3 3 8 the df name
4 4 4 9 the df name
But even that method may not be ideal, since you are effectively adding an extra column of data for each df, which could get costly in a large project where database size (and maybe even speed, although SQL is quite fast) is a factor. In that case it may be best to use relational tables. For this I refer you here, since there is no point re-writing that code in this answer. Using a relational model would be the most 'proper' solution imo, since it fully embodies the purpose of SQL.
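A rough illustration of that relational idea (a sketch only; the df_names/df_rows table names and the df_id key are made up for this example): keep one small table of names and one table of dataframe rows, linked by a key, and join them when you need the name back.

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://', echo=False)
df = pd.DataFrame({"foo": range(5), "bar": range(5, 10)})

# One row per stored dataframe: its key and its name.
pd.DataFrame({"df_id": [1], "name": ["the df name"]}).to_sql("df_names", con=engine, index=False)

# The dataframe rows themselves, tagged with the key instead of repeating the full name.
df["df_id"] = 1
df.to_sql("df_rows", con=engine, index=False)

# Re-assemble a named dataframe by joining the two tables on the key.
restored = pd.read_sql(
    "SELECT r.foo, r.bar, n.name "
    "FROM df_rows r JOIN df_names n ON r.df_id = n.df_id "
    "WHERE n.name = 'the df name'",
    con=engine)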

Using pandas to write df to sqlite

I'm trying to create a sqlite db from a csv file. After some searching it seems like this is possible using a pandas df. I've tried following some tutorials and the documentation but I can't figure this error out. Here's my code:
# Import libraries
import pandas, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
# Create the table of pitches
c.execute("""CREATE TABLE IF NOT EXISTS pitches (
pitch_type text,
game_date text,
release_speed real
)""")
conn.commit()
df = pandas.read_csv('test2.csv')
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()
When I run this code, I get the following error:
sqlite3.OperationalError: table pitches has no column named SL
SL is the first value in the first row in my csv file. I can't figure out why it's looking at the csv value as a column name, unless it thinks the first row of the csv should be the headers and is trying to match that to column names in the table? I don't think that was it either though because I tried changing the first value to an actual column name and got the same error.
EDIT:
When I have the headers in the csv, the dataframe looks like this:
pitch_type game_date release_speed
0 SL 8/31/2017 81.9
1 SL 8/31/2017 84.1
2 SL 8/31/2017 81.9
... ... ... ...
2919 SL 8/1/2017 82.3
2920 CU 8/1/2017 78.7
[2921 rows x 3 columns]
and I get the following error:
sqlite3.OperationalError: table pitches has no column named game_date
When I take the headers out of the csv file:
SL 8/31/2017 81.9
0 SL 8/31/2017 84.1
1 SL 8/31/2017 81.9
2 SL 8/31/2017 84.1
... .. ... ...
2918 SL 8/1/2017 82.3
2919 CU 8/1/2017 78.7
[2920 rows x 3 columns]
and I get the following error:
sqlite3.OperationalError: table pitches has no column named SL
EDIT #2:
I tried taking the table creation out of the code entirely, per this answer, with the following code:
# Import libraries
import pandas, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
df = pandas.read_csv('test2.csv')
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()
and still get the
sqlite3.OperationalError: table pitches has no column named SL
error
EDIT #3:
I changed the table creation code to the following:
# Create the table of pitches
dropTable = 'DROP TABLE pitches'
c.execute(dropTable)
createTable = "CREATE TABLE IF NOT EXISTS pitches(pitch_type text, game_date text, release_speed real)"
c.execute(createTable)
and it works now. Not sure what exactly changed, as it looks basically the same to me, but it works.
If you are trying to create a table from a csv file you can just run sqlite3 and do:
sqlite> .mode csv
sqlite> .import c:/path/to/file/myfile.csv myTableName
Check your column names. I am able to replicate your code successfully with no errors. The names variable gets all the column names from the sqlite table, and you can compare them with the dataframe headers via df.columns.
# Import libraries
import pandas as pd, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
# Create the table of pitches
c.execute("""CREATE TABLE IF NOT EXISTS pitches (
pitch_type text,
game_date text,
release_speed real
)""")
conn.commit()
test = conn.execute('SELECT * from pitches')
names = [description[0] for description in test.description]
print(names)
df = pd.DataFrame([['SL','8/31/2017','81.9']],columns = ['pitch_type','game_date','release_speed'])
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.execute('SELECT * from pitches').fetchall()
>> [('SL', '8/31/2017', 81.9), ('SL', '8/31/2017', 81.9)]
I am guessing there might be some whitespace in your column headers; if so, stripping it before calling to_sql should make the names match (see the sketch below).
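A minimal sketch of that fix, assuming the test2.csv file from the question:

import sqlite3
import pandas as pd

conn = sqlite3.connect('test.db')
df = pd.read_csv('test2.csv')
df.columns = df.columns.str.strip()  # remove stray whitespace around header names
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()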
As you can see from pandas read_csv docs:
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the
data. Default behavior is to infer the column names: if no names
are passed the behavior is identical to ``header=0`` and column
names are inferred from the first line of the file, if column
names are passed explicitly then the behavior is identical to
``header=None``. Explicitly pass ``header=0`` to be able to
replace existing names. The header can be a list of integers that
specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be
skipped (e.g. 2 in this example is skipped). Note that this
parameter ignores commented lines and empty lines if
``skip_blank_lines=True``, so header=0 denotes the first line of
data rather than the first line of the file.
That means read_csv is using your first row as the column names.
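So if the csv genuinely has no header row, say so explicitly and supply the names yourself; a sketch, assuming the three columns from the question:

import pandas as pd

# header=None: treat the first line as data, not as column names.
df = pd.read_csv('test2.csv', header=None,
                 names=['pitch_type', 'game_date', 'release_speed'])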

How to insert Pandas DataFrame into Cassandra?

I have a dataframe as below:
df
date time open high low last
01-01-2017 11:00:00 37 45 36 42
01-01-2017 11:23:00 36 43 33 38
01-01-2017 12:00:00 45 55 35 43
....
I want to write it into Cassandra. It's kind of a bulk upload after processing the data in Python.
The schema for cassandra is as below:
CREATE TABLE ks.table1(date text, time text, open float, high float, low
float, last float, PRIMARY KEY(date, time))
To insert a single row into Cassandra we can use cassandra-driver in Python, but I couldn't find any details about uploading an entire dataframe.
from cassandra.cluster import Cluster
session.execute(
"""
INSERT INTO ks.table1 (date,time,open,high,low,last)
VALUES (01-01-2017, 11:00:00, 37, 45, 36, 42)
""")
P.S.: A similar question has been asked earlier, but it doesn't answer my question.
I was facing this problem too, but I figured out that even uploading millions of rows (19 million, to be exact) into Cassandra didn't take much time.
Coming to your problem, you can use the Cassandra bulk loader to get the job done.
EDIT 1:
You can use prepared statements to upload data into the Cassandra table while iterating through the dataframe.
from cassandra.cluster import Cluster
cluster = Cluster(ip_address)
session = cluster.connect(keyspace_name)
query = "INSERT INTO data(date,time,open,high,low,last) VALUES (?,?,?,?,?,?)"
prepared = session.prepare(query)
"?" is used to input variables
for item in dataFrame.itertuples(index=False):
    session.execute(prepared, (item.date, item.time, item.open, item.high, item.low, item.last))
or
for item in dataFrame.itertuples(index=False):
    session.execute(prepared, (item[0], item[1], item[2], item[3], item[4], item[5]))
That is, use a for loop to extract the rows and upload them with session.execute().
for more info on prepared statements
Hope this helps..
A nice option is to use batches. First you can split the df into even partitions (thanks to Python/Pandas - partitioning a pandas DataFrame in 10 disjoint, equally-sized subsets) and then insert each partition as a batch into Cassandra. Batch size is limited by the Cassandra (cassandra.yaml) setting:
batch_size_fail_threshold_in_kb: 50
The code for batch insert of Pandas df:
import numpy as np
from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel
from cassandra.query import BatchStatement

CASSANDRA_PARTITION_NUM = 1500

def write_to_cassandra(df):
    cassandra_cluster = Cluster(['ip'])
    session = cassandra_cluster.connect('keyspace')
    prepared_query = session.prepare('INSERT INTO users(id, name) VALUES (?,?)')
    for partition in split_to_partitions(df, CASSANDRA_PARTITION_NUM):
        batch = BatchStatement(consistency_level=ConsistencyLevel.QUORUM)
        for index, item in partition.iterrows():
            batch.add(prepared_query, (item.id, item.name))
        session.execute(batch)

def split_to_partitions(df, partition_number):
    permuted_indices = np.random.permutation(len(df))
    partitions = []
    for i in range(partition_number):
        partitions.append(df.iloc[permuted_indices[i::partition_number]])
    return partitions
Update:
Do this only when all statements in a batch target the same Cassandra partition, as in the sketch below.
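A sketch of that constraint, assuming the ks.table1 schema from the question (date is the partition key) and a dataframe df whose columns match: group by the partition key so each batch only touches one Cassandra partition.

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement

cluster = Cluster(['127.0.0.1'])   # assumed contact point
session = cluster.connect('ks')
prepared = session.prepare(
    'INSERT INTO table1 (date, time, open, high, low, last) VALUES (?,?,?,?,?,?)')

# One batch per partition key value, so every statement in a batch shares the same partition.
for date_value, group in df.groupby('date'):
    batch = BatchStatement()
    for row in group.itertuples(index=False):
        batch.add(prepared, (row.date, row.time, row.open, row.high, row.low, row.last))
    session.execute(batch)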

Insert geometry point mysql from pandas Dataframe

I am using pandas and a DataFrame to deal with some data. I want to load the data into a MySQL database where one of the fields is a Point.
In the file I am parsing with Python I have the lat and lon of the points.
I have created a dataframe (df) with the point information (id and coords):
id coords
A GeomFromText( ' POINT(40.87 3.80) ' )
In coords I have saved the MySQL expression required to create a Point from text. However, when executing:
from sqlalchemy import create_engine
engine = create_engine(dbconnection)
df.to_sql("point_test",engine, index=False, if_exists="append")
I got the following error:
DataError: (mysql.connector.errors.DataError) 1416 (22003): Cannot get
geometry object from data you send to the GEOMETRY field
This is triggered because df.to_sql sends GeomFromText( ' POINT(40.87 3.80) ' ) as a plain string, when it should be executed as the MySQL function GeomFromText.
Does anyone have a suggestion on how to insert geometry fields into MySQL from text using a pandas dataframe?
A workaround is to create a temporary table with the string version of the geometry that needs to be added, and then update the point_test table with a call to ST_GeomFromText from the temporary table.
Assuming a database with a table point_test with columns id (VARCHAR(5)) and coords (POINT):
a. Create an example dataframe df with points "A" and "B":
import numpy as np
import pandas as pd

dfd = np.array([['id', 'geomText'],
                ["A", "POINT( 50.2 5.6 )"],
                ["B", "POINT( 50.2 50.4 )"]])
df = pd.DataFrame(data=dfd[1:, :], columns=dfd[0, :])
b. Insert points "A" and "B" into point_test (only the id column) and insert the id and geomText string into the table temp_point_test:
df[['id']].to_sql("point_test",engine, index=False, if_exists="append")
df[['id', 'geomText']].to_sql("temp_point_test",engine, index=False, if_exists="append")
c. Update point_test with the point from temp_point_test, applying ST_GeomFromText() to the selected text. Finally, drop temp_point_test:
conn = engine.connect()
conn.execute("update point_test pt set pt.coords=(select ST_GeomFromText(geomText) from temp_point_test tpt "+
"where pt.id=tpt.id)")
conn.execute("drop table temp_point_test")
conn.close()

append the data to already existing table in pandas using to_sql

I have the following data frame
ipdb> csv_data
country sale date trans_factor
0 India 403171 12/01/2012 1
1 Bhutan 394096 12/01/2012 2
2 Nepal super 12/01/2012 3
3 madhya 355883 12/01/2012 4
4 sudan man 12/01/2012 5
As of now I am using the code below to insert the data into the table: if the table already exists, it is dropped and a new one is created.
csv_file_path = data_mapping_record.csv_file_path
original_csv_header = pandas.read_csv(csv_file_path).columns.tolist()
csv_data_frame = pandas.read_csv(csv_file_path, skiprows=[0], names=original_csv_header, infer_datetime_format=True)
table_name = data_mapping_record.csv_file_path.split('/')[-1].split('.')[0]
engine = create_engine(
'postgresql://username:password@localhost:5432/pandas_data')
# Delete table if it already exists
engine.execute("""DROP TABLE IF EXISTS "%s" """ % (table_name))
# Write the pandas dataframe to the database using sqlalchemy and pandas.to_sql
csv_data_frame.to_sql(table_name, engine, chunksize=1000)
But what I need is: without deleting the table, if the table already exists, just append the data to it. Is there any way to do this with the pandas to_sql method?
IIUC you can simply use the if_exists='append' parameter:
csv_data_frame.to_sql(table_name, engine, if_exists='append', chunksize=1000)
from docs:
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.
