I have the following data frame
ipdb> csv_data
  country    sale        date  trans_factor
0   India  403171  12/01/2012             1
1  Bhutan  394096  12/01/2012             2
2   Nepal   super  12/01/2012             3
3  madhya  355883  12/01/2012             4
4   sudan     man  12/01/2012             5
As of now I am using the code below to insert data into the table: if the table already exists, drop it and create a new one.
csv_file_path = data_mapping_record.csv_file_path
original_csv_header = pandas.read_csv(csv_file_path).columns.tolist()
csv_data = pandas.read_csv(csv_file_path, skiprows=[0], names=original_csv_header, infer_datetime_format=True)
table_name = data_mapping_record.csv_file_path.split('/')[-1].split('.')[0]
engine = create_engine(
    'postgresql://username:password@localhost:5432/pandas_data')
# Delete the table if it already exists
engine.execute("""DROP TABLE IF EXISTS "%s" """ % (table_name))
# Write the pandas dataframe to the database using sqlalchemy and pandas.to_sql
csv_data.to_sql(table_name, engine, chunksize=1000)
But what I need is: without deleting the table, if the table already exists, just append the data to the existing one. Is there any way to do that with the pandas to_sql method?
IIUC you can simply use the if_exists='append' parameter:
csv_data.to_sql(table_name, engine, if_exists='append', chunksize=1000)
From the docs:
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
    fail: If table exists, do nothing.
    replace: If table exists, drop it, recreate it, and insert data.
    append: If table exists, insert data. Create if does not exist.
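As a self-contained illustration of those three modes (a minimal sketch using an in-memory SQLite engine; the table and column names here are invented for the example, not taken from the question):
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine('sqlite://')          # throwaway in-memory database
df = pd.DataFrame({'country': ['India', 'Bhutan'], 'sale': [403171, 394096]})

df.to_sql('sales', engine, index=False)                       # default 'fail': creates the table
df.to_sql('sales', engine, index=False, if_exists='append')   # adds the 2 rows again -> 4 rows
df.to_sql('sales', engine, index=False, if_exists='replace')  # drops and recreates -> back to 2 rows

print(pd.read_sql('SELECT COUNT(*) AS n FROM sales', engine))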
I am trying to get the column names from my Postgres SQL table using psycopg2, but it returns an unordered column list that does not match the order in which the columns appear in the table.
This is how the database table looks when saved as a pandas dataframe:
cur.execute("Select * from actor")
tupples = cur.fetchall()
cur.execute("select column_name from information_schema.columns where table_name = 'actor'")
column_name = cur.fetchall()
df = pd.DataFrame(tupples,columns = column_name)
(actor_id,) (last_update,) (first_name,) (last_name,)
1 PENELOPE GUINESS 2006-02-15 04:34:33
2 NICK WAHLBERG 2006-02-15 04:34:33
3 ED CHASE 2006-02-15 04:34:33
4 JENNIFER DAVIS 2006-02-15 04:34:33
5 JOHNNY LOLLOBRIGIDA 2006-02-15 04:34:33
This is how the database table looks when I view it in pgadmin2:
I just want column_name to return the column names of the SQL table in the same order as shown in the image.
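One way to get the names in table order (a sketch, not from the original post; it assumes cur is the same open psycopg2 cursor as above): either take them from cursor.description after the SELECT, or order the information_schema query explicitly by ordinal_position.
import pandas as pd

# Option 1: cursor.description already lists columns in the order SELECT * returns them
cur.execute("Select * from actor")
rows = cur.fetchall()
columns = [desc[0] for desc in cur.description]
df = pd.DataFrame(rows, columns=columns)

# Option 2: keep the information_schema query, but sort it and unpack the 1-tuples
cur.execute("select column_name from information_schema.columns "
            "where table_name = 'actor' order by ordinal_position")
columns = [name for (name,) in cur.fetchall()]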
I have a massive table (over 100B records) that I added an empty column to. I parse strings from another (string) field when the required string is available, extract an integer from it, and want to write that integer to the new column for all rows that have the string.
At the moment, after the data has been parsed and saved locally in a dataframe, I iterate over it to update the Redshift table with the clean data. This takes approximately 1 second per iteration, which is way too long.
My current code example:
import psycopg2

conn = psycopg2.connect(connection_details)
cur = conn.cursor()

clean_df = raw_data.apply(clean_field_to_parse)

# one UPDATE statement per row -- roughly 1 second each
for ind, row in clean_df.iterrows():
    update_query = build_update_query(row.id, row.clean_int_1, row.clean_int_2)
    cur.execute(update_query)
where build_update_query is a function that generates the update query:
def build_update_query(id, int1, int2):
    query = """
        update tab_tab
        set
            clean_int_1 = {}::int,
            clean_int_2 = {}::int,
            updated_date = GETDATE()
        where id = {}
        ;
    """
    return query.format(int1, int2, id)
and where clean_df is structured like:
   id  field_to_parse    clean_int_1  clean_int_2
   1   {'int_1':'2+1'}   3            np.nan
   2   {'int_2':'7-0'}   np.nan       7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows: push the pandas data frame to the database as a staging table and then run one single set-based UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of the data frame, add an index on the join field, id (on plain Postgres; Redshift does not support CREATE INDEX), and drop the very large staging table at the end.
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://myuser:mypwd@myhost/mydatabase")

# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)

# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
    # index the join column (Postgres only; Redshift does not support CREATE INDEX)
    sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
    conn.execute(sql)

    # one set-based UPDATE instead of one statement per row
    # (note: the SET columns must not be alias-qualified)
    sql = """UPDATE tab_tab t
             SET clean_int_1 = c.clean_int_1,
                 clean_int_2 = c.clean_int_2,
                 updated_date = GETDATE()
             FROM clean_df c
             WHERE c.id = t.id
          """
    conn.execute(sql)

    sql = "DROP TABLE IF EXISTS clean_df"
    conn.execute(sql)

engine.dispose()
I'm trying to create a sqlite db from a csv file. After some searching it seems like this is possible using a pandas df. I've tried following some tutorials and the documentation but I can't figure this error out. Here's my code:
# Import libraries
import pandas, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
# Create the table of pitches
c.execute("""CREATE TABLE IF NOT EXISTS pitches (
pitch_type text,
game_date text,
release_speed real
)""")
conn.commit()
df = pandas.read_csv('test2.csv')
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()
When I run this code, I get the following error:
sqlite3.OperationalError: table pitches has no column named SL
SL is the first value in the first row of my csv file. I can't figure out why it's treating a csv value as a column name, unless it thinks the first row of the csv should be the headers and is trying to match those to the column names in the table. I don't think that's it either, though, because I tried changing the first value to an actual column name and got the same error.
EDIT:
When I have the headers in the csv, the dataframe looks like this:
pitch_type game_date release_speed
0 SL 8/31/2017 81.9
1 SL 8/31/2017 84.1
2 SL 8/31/2017 81.9
... ... ... ...
2919 SL 8/1/2017 82.3
2920 CU 8/1/2017 78.7
[2921 rows x 3 columns]
and I get the following error:
sqlite3.OperationalError: table pitches has no column named game_date
When I take the headers out of the csv file:
SL 8/31/2017 81.9
0 SL 8/31/2017 84.1
1 SL 8/31/2017 81.9
2 SL 8/31/2017 84.1
... .. ... ...
2918 SL 8/1/2017 82.3
2919 CU 8/1/2017 78.7
[2920 rows x 3 columns]
and I get the following error:
sqlite3.OperationalError: table pitches has no column named SL
EDIT #2:
I tried taking the table creation out of the code entirely, per this answer, with the following code:
# Import libraries
import pandas, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
df = pandas.read_csv('test2.csv')
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()
and still get the
sqlite3.OperationalError: table pitches has no column named SL
error
EDIT #3:
I changed the table creation code to the following:
# Create the table of pitches
dropTable = 'DROP TABLE pitches'
c.execute(dropTable)
createTable = "CREATE TABLE IF NOT EXISTS pitches(pitch_type text, game_date text, release_speed real)"
c.execute(createTable)
and it works now. Not sure what exactly changed, as it looks basically the same to me, but it works.
If you are trying to create a table from a csv file you can just run sqlite3 and do:
sqlite> .mode csv
sqlite> .import c:/path/to/file/myfile.csv myTableName
Check your column names. I am able to replicate your code successfully with no errors. The names variable gets all the column names from the sqlite table, and you can compare them with the dataframe headers from df.columns.
# Import libraries
import pandas as pd, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
# Create the table of pitches
c.execute("""CREATE TABLE IF NOT EXISTS pitches (
pitch_type text,
game_date text,
release_speed real
)""")
conn.commit()
test = conn.execute('SELECT * from pitches')
names = [description[0] for description in test.description]
print(names)
df = pd.DataFrame([['SL','8/31/2017','81.9']],columns = ['pitch_type','game_date','release_speed'])
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.execute('SELECT * from pitches').fetchall()
>> [('SL', '8/31/2017', 81.9), ('SL', '8/31/2017', 81.9)]
I am guessing there might be some whitespaces in your column headers.
As you can see from pandas read_csv docs:
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the
data. Default behavior is to infer the column names: if no names
are passed the behavior is identical to ``header=0`` and column
names are inferred from the first line of the file, if column
names are passed explicitly then the behavior is identical to
``header=None``. Explicitly pass ``header=0`` to be able to
replace existing names. The header can be a list of integers that
specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be
skipped (e.g. 2 in this example is skipped). Note that this
parameter ignores commented lines and empty lines if
``skip_blank_lines=True``, so header=0 denotes the first line of
data rather than the first line of the file.
That means read_csv is using your first row as the header names.
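So if the file has no header row, it may help to tell read_csv that explicitly instead of letting it promote the first data row (a minimal sketch; the file name and column names are taken from the question above):
import pandas as pd
import sqlite3

conn = sqlite3.connect('test.db')

# header=None stops pandas from using the first data row as the header,
# and names= supplies column names that match the pitches table
df = pd.read_csv('test2.csv', header=None,
                 names=['pitch_type', 'game_date', 'release_speed'])

df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()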
I am using pandas and a DataFrame to deal with some data. I want to load the data into a MySQL database where one of the fields is a Point.
In the file I am parsing with python I have the lat and lon of the points.
I have created a dataframe (df) with the point information (id and coords):
id coords
A GeomFromText( ' POINT(40.87 3.80) ' )
In coords I have saved the MySQL command required to create a Point from the text. However, when executing:
from sqlalchemy import create_engine
engine = create_engine(dbconnection)
df.to_sql("point_test",engine, index=False, if_exists="append")
I got the following error:
DataError: (mysql.connector.errors.DataError) 1416 (22003): Cannot get geometry object from data you send to the GEOMETRY field
This is triggered because df.to_sql sends GeomFromText( ' POINT(40.87 3.80) ' ) as the string "GeomFromText( ' POINT(40.87 3.80) ' )" instead of executing the GeomFromText function in MySQL.
Does anyone have a suggestion on how to insert geometry fields, originally in text form, into MySQL using a pandas dataframe?
A workaround is to create a temporary table with the string form of the geometry information that needs to be added, and then update the point_test table with a call to ST_GeomFromText from the temporary table.
Assuming a database with a table point_test with id (VARCHAR(5)) and coords (POINT):
a. Create a dataframe df as an example with points "A" and "B":
import numpy as np
import pandas as pd

dfd = np.array([['id', 'geomText'],
                ["A", "POINT( 50.2 5.6 )"],
                ["B", "POINT( 50.2 50.4 )"]])
df = pd.DataFrame(data=dfd[1:, :], columns=dfd[0, :])
b. Insert points "A" and "B" into point_test (id column only), and insert the id and geomText strings into the table temp_point_test:
df[['id']].to_sql("point_test",engine, index=False, if_exists="append")
df[['id', 'geomText']].to_sql("temp_point_test",engine, index=False, if_exists="append")
c. Update table point_test with the points from table temp_point_test, applying ST_GeomFromText() to the selected text. Finally, drop temp_point_test:
conn = engine.connect()
conn.execute("update point_test pt set pt.coords=(select ST_GeomFromText(geomText) from temp_point_test tpt "+
"where pt.id=tpt.id)")
conn.execute("drop table temp_point_test")
conn.close()
I used the .to_sql function to insert data, but it cannot check for duplicated data on insert (it can only check whether the table is duplicated).
Source code (I run the code below twice):
userData.to_sql(con=engine, name='test_quest_complete', schema='test', if_exists='append')
The result: the same data was inserted into the table twice.
0 2016-11-14 00:00:10 AAAA
1 2016-11-14 00:00:20 BBBB
0 2016-11-14 00:00:10 AAAA
1 2016-11-14 00:00:20 BBBB
How can I insert a pandas dataframe into the database without duplicating data?
(Also, I tried to use LOAD DATA LOCAL INFILE, but I can't use it because of security issues.)
If you have administration rights on your database, I would suggest putting some constraints on the table itself. The Python insert will then raise an exception, which you can intercept (sketched below).
Otherwise, you can retrieve the existing data from the table first, merge it with the new data in pandas, keep only the rows that do not already exist, and insert those as a new dataframe.
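A minimal sketch of the constraint approach, assuming a unique constraint on the two columns shown in the question's output (the column names completed_at and user_id are invented for the example, and an in-memory SQLite engine stands in for the real database):
import pandas as pd
from sqlalchemy import create_engine, exc, text

engine = create_engine('sqlite://')  # stand-in engine for the example

# a unique constraint on the natural key makes duplicate inserts fail
with engine.begin() as conn:
    conn.execute(text("""
        CREATE TABLE test_quest_complete (
            completed_at TEXT,
            user_id TEXT,
            UNIQUE (completed_at, user_id)
        )
    """))

userData = pd.DataFrame({'completed_at': ['2016-11-14 00:00:10', '2016-11-14 00:00:20'],
                         'user_id': ['AAAA', 'BBBB']})

userData.to_sql('test_quest_complete', engine, if_exists='append', index=False)
try:
    # running the same insert again now raises instead of silently duplicating rows
    userData.to_sql('test_quest_complete', engine, if_exists='append', index=False)
except exc.IntegrityError:
    print('duplicates rejected by the unique constraint')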
import pandas as pd
import pypyodbc
from sqlalchemy import create_engine
##Data of Excel File - ExcelData(Sheet1)
##id name
##1 11
##2 22
##3 33
##4 44
##5 55
##CREATE TABLE [test].[test_quest_complete](
## [id] [int] NULL,
## [name] [int] NULL
##)
TblName="test_quest_complete"
cnxn = pypyodbc.connect("dsn=mydsn;Trusted_Connection=Yes")
engine = create_engine("mssql+pyodbc://mydsn")
file_name="C:\Users\poonamr\Desktop\ExcelData.xlsx"
xl = pd.ExcelFile(file_name)
userData = xl.parse("Sheet1")
print(userData)
sql="Select * From test." + TblName
tblData=pd.read_sql(sql,cnxn)
print(tblData)
Finalresult=pd.concat([userData, tblData]).drop_duplicates(keep=False)
print(Finalresult)
Finalresult.to_sql(TblName, engine, if_exists='append',schema='test', index=False)