Using pandas to write df to sqlite - python

I'm trying to create a sqlite db from a csv file. After some searching it seems like this is possible using a pandas df. I've tried following some tutorials and the documentation but I can't figure this error out. Here's my code:
# Import libraries
import pandas, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
# Create the table of pitches
c.execute("""CREATE TABLE IF NOT EXISTS pitches (
pitch_type text,
game_date text,
release_speed real
)""")
conn.commit()
df = pandas.read_csv('test2.csv')
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()
When I run this code, I get the following error:
sqlite3.OperationalError: table pitches has no column named SL
SL is the first value in the first row in my csv file. I can't figure out why it's looking at the csv value as a column name, unless it thinks the first row of the csv should be the headers and is trying to match that to column names in the table? I don't think that was it either though because I tried changing the first value to an actual column name and got the same error.
EDIT:
When I have the headers in the csv, the dataframe looks like this:
pitch_type game_date release_speed
0 SL 8/31/2017 81.9
1 SL 8/31/2017 84.1
2 SL 8/31/2017 81.9
... ... ... ...
2919 SL 8/1/2017 82.3
2920 CU 8/1/2017 78.7
[2921 rows x 3 columns]
and I get the following error:
sqlite3.OperationalError: table pitches has no column named game_date
When I take the headers out of the csv file:
SL 8/31/2017 81.9
0 SL 8/31/2017 84.1
1 SL 8/31/2017 81.9
2 SL 8/31/2017 84.1
... .. ... ...
2918 SL 8/1/2017 82.3
2919 CU 8/1/2017 78.7
[2920 rows x 3 columns]
and I get the following error:
sqlite3.OperationalError: table pitches has no column named SL
EDIT #2:
I tried taking the table creation out of the code entirely, per this answer, with the following code:
# Import libraries
import pandas, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
df = pandas.read_csv('test2.csv')
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.close()
and still get the
sqlite3.OperationalError: table pitches has no column named SL
error
EDIT #3:
I changed the table creation code to the following:
# Create the table of pitches
dropTable = 'DROP TABLE pitches'
c.execute(dropTable)
createTable = "CREATE TABLE IF NOT EXISTS pitches(pitch_type text, game_date text, release_speed real)"
c.execute(createTable)
and it works now. Not sure what exactly changed, as it looks basically the same to me, but it works.
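One likely explanation, assuming test.db still held a pitches table from an earlier run with different columns: CREATE TABLE IF NOT EXISTS leaves an existing table untouched, so the stale schema survives until the table is dropped. That would also fit EDIT #2, where to_sql with if_exists='append' keeps writing into the old table. A minimal sketch with an in-memory database (the old_col column is hypothetical, for illustration only):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Simulate a table left over from an earlier run with a different schema.
# `old_col` is a made-up column name, not taken from the question.
conn.execute("CREATE TABLE pitches (old_col text)")

create_sql = """CREATE TABLE IF NOT EXISTS pitches (
    pitch_type text,
    game_date text,
    release_speed real
)"""

# This statement does nothing because a table named pitches already exists.
conn.execute(create_sql)


def column_names(connection, table):
    # PRAGMA table_info returns one row per column; index 1 holds the name.
    return [row[1] for row in connection.execute(f"PRAGMA table_info({table})")]


stale_columns = column_names(conn, "pitches")
print(stale_columns)

# Dropping the table first lets the new definition take effect.
conn.execute("DROP TABLE pitches")
conn.execute(create_sql)
fresh_columns = column_names(conn, "pitches")
print(fresh_columns)
conn.close()
```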

If you are trying to create a table from a csv file you can just run sqlite3 and do:
sqlite> .mode csv
sqlite> .import c:/path/to/file/myfile.csv myTableName

Check your column names. I am able to run your code successfully with no errors. The names variable collects all the column names from the sqlite table, and you can compare them with the dataframe headers in df.columns.
# Import libraries
import pandas as pd, csv, sqlite3
# Create sqlite database and cursor
conn = sqlite3.connect('test.db')
c = conn.cursor()
# Create the table of pitches
c.execute("""CREATE TABLE IF NOT EXISTS pitches (
pitch_type text,
game_date text,
release_speed real
)""")
conn.commit()
test = conn.execute('SELECT * from pitches')
names = [description[0] for description in test.description]
print(names)
df = pd.DataFrame([['SL','8/31/2017','81.9']],columns = ['pitch_type','game_date','release_speed'])
df.to_sql('pitches', conn, if_exists='append', index=False)
conn.execute('SELECT * from pitches').fetchall()
>> [('SL', '8/31/2017', 81.9), ('SL', '8/31/2017', 81.9)]
I am guessing there might be some whitespace in your column headers.
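If whitespace in the headers is the cause, a sketch like the following would expose and fix it. The CSV text here is invented sample data with a space after each comma, and the check simply compares the dataframe headers with the table's columns before appending:

```python
import io
import sqlite3

import pandas as pd

# Hypothetical CSV content whose header cells carry a leading space.
csv_text = "pitch_type, game_date, release_speed\nSL,8/31/2017,81.9\n"
df = pd.read_csv(io.StringIO(csv_text))
raw_columns = list(df.columns)
print(raw_columns)  # the second and third names keep their leading space

# Strip surrounding whitespace from every header.
df.columns = df.columns.str.strip()

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE pitches (pitch_type text, game_date text, release_speed real)"
)

# Column names of the existing table, read the same way as in the answer.
cursor = conn.execute("SELECT * FROM pitches")
table_columns = [description[0] for description in cursor.description]

# Any dataframe header that the table does not know about would make to_sql fail.
missing = [name for name in df.columns if name not in table_columns]
print(missing)

df.to_sql("pitches", conn, if_exists="append", index=False)
rows = conn.execute("SELECT * FROM pitches").fetchall()
conn.close()
```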

As you can see from pandas read_csv docs:
header : int or list of ints, default 'infer'
Row number(s) to use as the column names, and the start of the
data. Default behavior is to infer the column names: if no names
are passed the behavior is identical to ``header=0`` and column
names are inferred from the first line of the file, if column
names are passed explicitly then the behavior is identical to
``header=None``. Explicitly pass ``header=0`` to be able to
replace existing names. The header can be a list of integers that
specify row locations for a multi-index on the columns
e.g. [0,1,3]. Intervening rows that are not specified will be
skipped (e.g. 2 in this example is skipped). Note that this
parameter ignores commented lines and empty lines if
``skip_blank_lines=True``, so header=0 denotes the first line of
data rather than the first line of the file.
That means read_csv is using your first row as the header names.
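For a file without a header row, passing header=None together with names keeps the first line as data. A small sketch, using two sample rows from the question as inline text instead of test2.csv:

```python
import io

import pandas as pd

# Two data rows and no header line.
csv_text = "SL,8/31/2017,81.9\nSL,8/31/2017,84.1\n"

# Default behavior: the first line is consumed as the column names.
inferred = pd.read_csv(io.StringIO(csv_text))
print(list(inferred.columns), len(inferred))

# Explicit names: every line is treated as data.
named = pd.read_csv(
    io.StringIO(csv_text),
    header=None,
    names=["pitch_type", "game_date", "release_speed"],
)
print(list(named.columns), len(named))
```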

Related

Insert data from pandas into sql db - keys doesn't fit columns

I have a database table with around 10 columns. Sometimes I need to insert a row that has only 3 of those columns; the rest are not in the dict.
The data to be inserted is a dictionary named row:
(the insert is written this way to avoid duplicates)
row = {'keyword':'abc','name':'bds'.....}
df = pd.DataFrame([row]) # df looks good, I see columns and 1 row.
engine = getEngine()
connection = engine.connect()
df.to_sql('temp_insert_data_index', connection, if_exists ='replace',index=False)
result = connection.execute(('''
INSERT INTO {t} SELECT * FROM temp_insert_data_index
ON CONFLICT DO NOTHING''').format(t=table_name))
connection.close()
Problem: when the row dict does not contain all the columns, its fields are inserted by position (a dict with 3 keys fills the first 3 columns) rather than into the matching columns. I expected the dict keys to be matched to the db column names.
Why?
Consider explicitly naming the columns in both the INSERT INTO and SELECT clauses, which is best practice for SQL append queries. That way the dynamic query works for all columns or for any subset of them. The code below uses f-strings (available in Python 3.6+) to interpolate the names into the larger SQL query:
# APPEND TO STAGING TEMP TABLE
df.to_sql('temp_insert_data_index', connection, if_exists='replace', index=False)
# STRING OF COMMA SEPARATED COLUMNS
cols = ", ".join(df.columns)
sql = (
f"INSERT INTO {table_name} ({cols}) "
f"SELECT {cols} FROM temp_insert_data_index "
"ON CONFLICT DO NOTHING"
)
result = connection.execute(sql)
connection.close()
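As for the "why": INSERT INTO ... SELECT * pairs columns by position, not by name, so the staging table's columns land wherever their order puts them. A minimal sketch of the explicit-column version, using an in-memory SQLite database as a stand-in for the real one (the table layouts are hypothetical; only the keyword and name keys come from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical target table with more columns than the incoming row provides.
conn.execute("CREATE TABLE target (id integer, keyword text, name text)")

# Staging table holding a subset of the columns, in a different order.
conn.execute("CREATE TABLE staging (name text, keyword text)")
conn.execute("INSERT INTO staging (name, keyword) VALUES ('bds', 'abc')")

# Name the columns on both sides so values are matched by name.
cols = ", ".join(["name", "keyword"])
conn.execute(f"INSERT INTO target ({cols}) SELECT {cols} FROM staging")

rows = conn.execute("SELECT id, keyword, name FROM target").fetchall()
print(rows)
conn.close()
```

The column that was not supplied (id here) is left as NULL, and each value ends up under its own name regardless of the order in the staging table.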

Can't fill rows correctly using for loop during CSV import to SQL Server

I'm trying to import a CSV file into SQL Server using Python. However, the last part of the code is not working as I'd like. Every row looks the same after importing into SQL Server, even though the rows in the DataFrame are different.
CODE:
import pandas as pd
import pypyodbc as pdb
import plotly.graph_objects as go
import numpy as np
server = 'LAPTOP-124CPEDE\SQLEXPRESS'
database = 'Datasets'
conn = pdb.connect('DRIVER={SQL Server};SERVER='+server+';DATABASE='+database+';Trusted_Connection=yes;')
cursor = conn.cursor()
data = pd.read_csv(r'C:\Users\aleks\Desktop\zadania python+sql\dataset_1.csv', sep=';')
df = pd.DataFrame(data, columns=['a_timestamp', 'any_date', 'some_money', 'weird_name', 'count_of_something'])
df
cursor.execute('CREATE TABLE dataset_1 (a_timestamp nvarchar(255), any_date date, some_money float, weird_name nvarchar(255), count_of_something float)')
tuple = (row.a_timestamp, row.any_date, row.some_money, row.weird_name, row.count_of_something)
for row in df.itertuples():
    cursor.execute('''
        INSERT INTO Datasets.dbo.dataset_1 (a_timestamp, any_date, some_money, weird_name, count_of_something)
        VALUES (?,?,?,?,?)
        ''',
        tuple)
conn.commit()
Instead of inserting the same tuple, assigned once before the for loop, you can build the values from the current row on each iteration and insert them into the table like this:
for row in df.itertuples():
    values = (row.a_timestamp, row.any_date, row.some_money, row.weird_name, row.count_of_something)
    cursor.execute('''
        INSERT INTO Datasets.dbo.dataset_1 (a_timestamp,
                                            any_date,
                                            some_money,
                                            weird_name,
                                            count_of_something)
        VALUES (?,?,?,?,?)
        ''',
        values)
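An alternative is to hand all rows to the DB-API executemany method in one call. The sketch below uses an in-memory SQLite database as a stand-in for SQL Server, and a two-column sample frame with invented values, so treat it as an illustration of the pattern only:

```python
import sqlite3

import pandas as pd

# Small sample frame; the values are made up for illustration.
df = pd.DataFrame(
    {"a_timestamp": ["t1", "t2"], "some_money": [1.5, 2.5]}
)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dataset_1 (a_timestamp text, some_money real)")

# index=False drops the index, name=None yields plain tuples, one per row.
rows = list(df.itertuples(index=False, name=None))

conn.executemany(
    "INSERT INTO dataset_1 (a_timestamp, some_money) VALUES (?, ?)",
    rows,
)
conn.commit()

stored = conn.execute(
    "SELECT a_timestamp, some_money FROM dataset_1 ORDER BY a_timestamp"
).fetchall()
print(stored)
conn.close()
```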

Insert geometry point mysql from pandas Dataframe

I am using pandas and a DataFrame to deal with some data. I want to load the data into a MySQL database where one of the fields is a Point.
In the file I am parsing with python I have the lat and lon of the points.
I have created a dataframe (df) with the point information (id and coords):
id coords
A GeomFromText( ' POINT(40.87 3.80) ' )
I have saved in coords the command required in mySQL to create a Point from the text. However, when executing:
from sqlalchemy import create_engine
engine = create_engine(dbconnection)
df.to_sql("point_test",engine, index=False, if_exists="append")
I got the following error:
DataError: (mysql.connector.errors.DataError) 1416 (22003): Cannot get geometry object from data you send to the GEOMETRY field
It is triggered because df.to_sql passes GeomFromText( ' POINT(40.87 3.80) ' ) as the string "GeomFromText( ' POINT(40.87 3.80) ' )" when it should instead execute the GeomFromText function in MySQL.
Does anyone have a suggestion for how to insert geometry fields, originally in text form, into MySQL using a pandas DataFrame?
A workaround is to create a temporary table holding the geometry as a string, and then update the point_test table from it with a call to ST_GeomFromText.
Assuming a database with a table point_test with columns id (VARCHAR(5)) and coords (POINT):
a. Create a dataframe df as an example with points "A" and "B":
dfd = np.array([['id','geomText'],
["A","POINT( 50.2 5.6 )"],
["B","POINT( 50.2 50.4 )"]])
df=pd.DataFrame(data=dfd[1:,:], columns=dfd[0,:])
b. Insert only the id of points "A" and "B" into point_test, and write id together with the string column geomText into the table temp_point_test:
df[['id']].to_sql("point_test",engine, index=False, if_exists="append")
df[['id', 'geomText']].to_sql("temp_point_test",engine, index=False, if_exists="append")
c. Update table point_test with the points from temp_point_test, applying ST_GeomFromText() in the select. Finally, drop temp_point_test:
conn = engine.connect()
conn.execute("update point_test pt set pt.coords=(select ST_GeomFromText(geomText) from temp_point_test tpt "+
"where pt.id=tpt.id)")
conn.execute("drop table temp_point_test")
conn.close()

append the data to already existing table in pandas using to_sql

I have the following data frame
ipdb> csv_data
country sale date trans_factor
0 India 403171 12/01/2012 1
1 Bhutan 394096 12/01/2012 2
2 Nepal super 12/01/2012 3
3 madhya 355883 12/01/2012 4
4 sudan man 12/01/2012 5
Right now I am using the code below to insert the data: if the table already exists, it is dropped and a new table is created.
csv_file_path = data_mapping_record.csv_file_path
original_csv_header = pandas.read_csv(csv_file_path).columns.tolist()
csv_data = pandas.read_csv(csv_file_path, skiprows=[0], names=original_csv_header, infer_datetime_format=True)
table_name = data_mapping_record.csv_file_path.split('/')[-1].split('.')[0]
engine = create_engine(
'postgresql://username:password@localhost:5432/pandas_data')
# Delete table if it already exists
engine.execute("""DROP TABLE IF EXISTS "%s" """ % (table_name))
# Write the pandas dataframe to the database using sqlalchemy and pandas.to_sql
csv_data_frame.to_sql(table_name, engine, chunksize=1000)
But what I need is to append the data to the existing table without deleting it. Is there any way to do this with the pandas to_sql method?
IIUC, you can simply use the if_exists='append' parameter:
csv_data_frame.to_sql(table_name, engine, if_exists='append', chunksize=1000)
from docs:
if_exists : {‘fail’, ‘replace’, ‘append’}, default ‘fail’
fail: If table exists, do nothing.
replace: If table exists, drop it, recreate it, and insert data.
append: If table exists, insert data. Create if does not exist.
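A quick way to see the difference between the modes is to write twice to the same table. The sketch below uses an in-memory SQLite connection in place of the PostgreSQL engine, and a table name (sales) chosen only for the example. Note that in the pandas versions I am aware of, the default 'fail' mode raises a ValueError when the table already exists instead of silently doing nothing:

```python
import sqlite3

import pandas as pd

conn = sqlite3.connect(":memory:")

first = pd.DataFrame({"country": ["India", "Bhutan"], "sale": [403171, 394096]})
second = pd.DataFrame({"country": ["madhya"], "sale": [355883]})

# The first call creates the table, the second appends to it.
first.to_sql("sales", conn, if_exists="append", index=False)
second.to_sql("sales", conn, if_exists="append", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
print(row_count)

# With the default if_exists='fail', writing to an existing table is refused.
try:
    second.to_sql("sales", conn, index=False)
    raised = False
except ValueError:
    raised = True
print(raised)

conn.close()
```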

don't get duplicated values when exporting sqlite3 file to csv using python?

I have a sqlite3 database that has multiple (six) tables and I need to export it to csv, but when I try to export it, I get a duplicated value if a column (in one table) is longer than a column in another table.
For example, this is what my sqlite3 database file looks like:
column on table1    column on table2    column on table3
25                  30                  20
30
this is the result on the .csv file (using this script as example)
25,30,20
30,30,20
and this is the result i need it to show:
25,30,20
30
EDIT: Ok, this is how I add the values to each table, based on the python documentation example (executed each time a value is entered):
import sqlite3
conn = sqlite3.connect('database.db')
c = conn.cursor()
# Create table
c.execute('''CREATE TABLE table
(column int)''')
# Insert a row of data
c.execute("INSERT INTO table VALUES (value)")
# Save (commit) the changes
conn.commit()
# We can also close the cursor if we are done with it
c.close()
any help?
-Regards...
This is how you could do this.
import sqlite3
con = sqlite3.connect('database')
cur = con.cursor()
cur.execute('select table1.column, table2.column, table3.column from table1, table2, table3')
# result should look like ((25, 30, 20), (30,)), if I remember correctly
results = cur.fetchall()
output = '\n'.join(','.join(str(i) for i in line) for line in results)
Note: this code is untested, written out of my head, but I hope you get the idea.
UPDATE: apparently I made some mistakes in the script and somehow sql 'magically' pads the result (you might have guessed now that I'm not a sql guru :D). Another way to do it would be:
import sqlite3
conn = sqlite3.connect('database.db')
cur = conn.cursor()
tables = ('table1', 'table2', 'table3')
results = [list(cur.execute('select column from %s' % table)) for table in tables]

def try_list(lst, i):
    # Return the i-th row, or an empty placeholder when the table is shorter
    try:
        return lst[i]
    except IndexError:
        return ('',)

maxlen = max(len(i) for i in results)
results2 = [[try_list(line, i) for line in results] for i in range(maxlen)]
output = '\n'.join(','.join(str(i[0]) for i in line) for line in results2)
print(output)
which produces
25,30,20
30,,
This is probably an overcomplicated way to do it, but it is 0:30 right now for me, so I'm not on my best...
At least it gets the desired result.
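A shorter variant of the same idea, as a sketch: read each table's column into a list and let itertools.zip_longest pad the shorter ones, with the csv module doing the formatting. The in-memory database and the column name value are assumptions made so the example runs on its own; the numbers are the ones from the question:

```python
import csv
import io
import sqlite3
from itertools import zip_longest

conn = sqlite3.connect(":memory:")

# Rebuild the three single-column tables from the question.
data = {"table1": [25, 30], "table2": [30], "table3": [20]}
for table, values in data.items():
    conn.execute(f"CREATE TABLE {table} (value int)")
    conn.executemany(f"INSERT INTO {table} VALUES (?)", [(v,) for v in values])

# One list of values per table, in insertion order.
columns = [
    [row[0] for row in conn.execute(f"SELECT value FROM {table} ORDER BY rowid")]
    for table in data
]

# zip_longest pairs the lists row by row and fills the gaps with empty strings.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerows(zip_longest(*columns, fillvalue=""))

output = buffer.getvalue()
print(output)
conn.close()
```

Writing to a real file instead of the StringIO buffer only requires passing an open file object to csv.writer.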
