I am new to Python, but have a seemingly very simple exercise that I am struggling to figure out. This is a two-part issue.
First: I have a list of JSON objects that I am getting from an API. I want to enter each list item as a row in a DataFrame, preserving that row's JSON object/dict so it can be stored in a database for later editing and reposting. In addition, I want to convert the list into a standard DataFrame (easy enough). In essence, it will be a standard DataFrame with the row's raw JSON contained as an additional column. I've managed to accomplish this by joining a Series to a DataFrame, but I am relying on the index to join. My first question is whether the join can be done not just on the index, but on the 'id' value in the DataFrame matched to the 'id' contained in each element of the JSON object/dict in the list. The rationale is that this would eliminate ordering concerns and I'd be 100% sure the JSON object is associated with the correct DataFrame row.
Second: As mentioned above, I've simply joined the Series to the DataFrame using the index, which works in this case because both use the same 'data' list. The output looks good when I print it, but when I go to insert it into MSSQL, it does not like the JSON dictionary ('Json_Series' column). The insert works fine when I simply eliminate that field. I could see how an inline f-string for the insert concatenation might work if you did some type of cast or convert on the dict, but I will be doing this for many APIs, so I am trying to avoid writing custom insert statements (i.e. I would like to rely on to_sql or an equivalent method/class/library). I have also tried converting the column with .astype('str') prior to the insert, but that doesn't work, as has been documented elsewhere.
The error I get is:
sqlalchemy.exc.ProgrammingError: (pyodbc.ProgrammingError) ('Invalid parameter type. param-index=0 param-type=dict', 'HY105')
[SQL: INSERT INTO dbo.[TestInsert_JsonObject_Python] ([Json_Series], id, value) VALUES (?, ?, ?), (?, ?, ?)]
[parameters: ({'id': 1, 'value': 2}, 1, 2, {'id': 3, 'value': 4}, 3, 4)]
Clearly, the problem is the first parameter, which is being passed as a dictionary. Removing that column resolves the issue.
Here is what I've tried. This shows the Series (df), the DataFrame (df2) and the combined join (inner_merged). The join is vulnerable to ordering issues if I were to join different lists in the future; I need the join to reference the internal 'id' in the JSON dict when matching against the DataFrame's 'id' column. The combined join (inner_merged) is, however, in the form that I'd like to be able to insert straight into SQL.
import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.engine import URL

data = [{"id": 1, "value": 2}, {"id": 3, "value": 4}]

# Series holding the raw dicts, plus a flattened DataFrame built from the same data
df = pd.Series(data, name='Json_Series')
df2 = pd.DataFrame(data)

# Index-based join: works here only because both objects come from the same list
inner_merged = pd.merge(df, df2, left_index=True, right_index=True)

print(df)
print(df2)
# inner_merged['Json_Series'] = inner_merged['Json_Series'].astype('str')
print(inner_merged)
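For illustration, below is a rough sketch of the id-keyed join I have in mind; the helper frame built from the dicts is my own construction, so I'm not sure it's the idiomatic way:

import pandas as pd

data = [{"id": 1, "value": 2}, {"id": 3, "value": 4}]

# Keep each raw dict next to its own 'id' so the merge can key on 'id'
# instead of relying on positional index alignment.
json_df = pd.DataFrame({
    "id": [d["id"] for d in data],
    "Json_Series": data,
})

df2 = pd.DataFrame(data)

# Join on the shared 'id' column rather than on the index.
inner_merged = pd.merge(json_df, df2, on="id", how="inner")
print(inner_merged)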
Here is the MSSQL table creation:
CREATE TABLE [dbo].[TestInsert_JsonObject_Python](
[Json_Series] [varchar](500) NULL,
[id] [int] NULL,
[value] [int] NULL
)
GO
CREATE TABLE [dbo].[TestInsert_NoJsonObject_Python](
[id] [int] NULL,
[value] [int] NULL
)
GO
Here is the insert - the first one without the json object and the second with:
server = 'enteryourserver'
database = 'enteryourdatabase'
username = 'enteryourusername'
password = 'enteryourpassword'
driver = '{ODBC Driver 18 for SQL Server}'
table1 = 'TestInsert_NoJsonObject_Python'
table2 = 'TestInsert_JsonObject_Python'
schema = 'dbo'
connection_string = f"DRIVER={driver};SERVER={server};DATABASE={database};UID={username};PWD={password}"
connection_url = URL.create("mssql+pyodbc", query={"odbc_connect": connection_string})
engine = create_engine(connection_url)
df2.to_sql(table1, con=engine, schema=schema, if_exists='replace', index=False)
inner_merged.to_sql(table2, con=engine, schema=schema, if_exists='replace', index=False)
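For completeness, one direction I've been considering is serializing the dict column to a JSON string right before the insert; this is only a sketch and I don't know if it's the recommended pattern:

import json

# pyodbc cannot bind a Python dict, but it can bind a plain string,
# so serialize the dict column to JSON text before calling to_sql.
insert_df = inner_merged.copy()
insert_df['Json_Series'] = insert_df['Json_Series'].apply(json.dumps)

insert_df.to_sql(table2, con=engine, schema=schema, if_exists='replace', index=False)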
As I mentioned, I am totally new to this so any suggestions are much appreciated.
I've been looking around, so hopefully someone here can assist:
I'm attempting to use cx_Oracle in Python to interface with a database; my task is to insert data from an Excel file into an empty (but existing) table.
The Excel file has almost all of the same column names as the columns in the database's table, so I essentially want to check whether the columns share the same name; if so, I insert that column from the Excel (dataframe, pandas) file into the table in Oracle.
import pandas as pd
import numpy as np
import cx_Oracle
import config  # local module holding username/password/dsn/encoding

df = pd.read_excel("employee_info.xlsx")

con = None
try:
    con = cx_Oracle.connect(
        config.username,
        config.password,
        config.dsn,
        encoding=config.encoding)
except cx_Oracle.Error as error:
    print(error)
finally:
    cursor = con.cursor()
    rows = [tuple(x) for x in df.values]
    # This is the part I can't work out -- how to build the column list and bind variables dynamically
    cursor.executemany('''INSERT INTO ODS.EMPLOYEES ({x}) VALUES ({rows})''', rows)
I'm not sure what SQL I should put, or whether there's a way I can use a for loop to iterate through the columns, but my main issue is how I can add these dynamically as our dataset grows in columns.
I check the columns that match by using:
sql = "SELECT * FROM ODS.EMPLOYEES"
cursor.execute(sql)
data = cursor.fetchall()
col_names = []
for i in range (0, len(cursor.description)):
col_names.append(cursor.description[i][0])
a = np.intersect1d(df.columns, col_names)
print("common columns:", a)
That gives me a list of all the common columns, but I'm not sure where to go from there. I've renamed the columns in my Excel file to match the columns in the database's table, but my issue is how to do this matching in a dynamic/automated way so I can keep adding to my datasets without having to change the code.
Bonus: I am also using SQL with a CASE statement to create a new column that rolls up a few other columns; if there's a way to add this to the first part of my SQL, or if it's advisable to do all manipulations before running the insert statement, that would be helpful to know as well.
Look at https://github.com/oracle/python-oracledb/blob/main/samples/load_csv.py
You would replace the CSV reading bit with parsing your data frame. You need to construct a SQL statement similar to the one used in that example:
sql = "insert into LoadCsvTab (id, name) values (:1, :2)"
For each spreadsheet column that you decide matches a table column, add it to the (id, name) part of the statement and add another bind variable to the (:1, :2) part.
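As a rough sketch (reusing df, col_names, cursor and con from the question's snippets, so treat the table and column names as illustrative), the statement could be assembled from the matched columns like this:

import numpy as np

# Columns present both in the spreadsheet and in the target table.
common_cols = list(np.intersect1d(df.columns, col_names))

# Build "INSERT INTO ODS.EMPLOYEES (COL_A, COL_B, ...) VALUES (:1, :2, ...)".
col_list = ", ".join(common_cols)
bind_list = ", ".join(":{0}".format(i + 1) for i in range(len(common_cols)))
sql = "INSERT INTO ODS.EMPLOYEES ({0}) VALUES ({1})".format(col_list, bind_list)

# Rows in the same column order as the statement; tolist() yields plain Python values.
rows = df[common_cols].values.tolist()
cursor.executemany(sql, rows)
con.commit()

Because the column names come from the table's own metadata, interpolating them into the statement is reasonably safe, while the row values still go through bind variables.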
I have been using sqlite3 with Python for creating databases. Until now I have been successful,
but unfortunately I can't find a way around this one. I have a table with 63 columns but I want to select only 62 of them. I know I can write the names of the columns in the SELECT statement, but writing out 62 of them seems like an illogical approach (to a programmer like me). I am using Python's sqlite3 module. Is there a way around this?
I'm sorry if my grammar is off.
Thanks in advance
With SQLite, you can:
do a PRAGMA table_info(tablename); query to get a result set that describes that table's columns
pluck the column names out of that result set and remove the one you don't want
compose a column list for the select statement using e.g. ', '.join(column_names) (though you might want to consider a higher-level SQL statement builder instead of playing with strings).
Example
A simple example using a simple table and an in-memory SQLite database:
import sqlite3
con = sqlite3.connect(":memory:")
con.executescript("CREATE TABLE kittens (id INTEGER, name TEXT, color TEXT, furriness INTEGER, age INTEGER)")
columns = [row[1] for row in con.execute("PRAGMA table_info(kittens)")]
print(columns)
selected_columns = [column for column in columns if column != 'age']
print(selected_columns)
query = f"SELECT {', '.join(selected_columns)} FROM kittens"
print(query)
This prints out
['id', 'name', 'color', 'furriness', 'age']
['id', 'name', 'color', 'furriness']
SELECT id, name, color, furriness FROM kittens
I have a pandas dataframe containing two columns: ID and MY_DATA. I have an SQL database that contains an ID column and some other data. I want to match the ID column of the SQL database to the ID column of the dataframe and update the table with the dataframe's new MY_DATA column.
So far I used the following:
import sqlite3
import pandas as pd

df = pd.read_csv('my_filename.csv')
con = sqlite3.connect('my_database.sqlite')
cur = con.cursor()

for row in cur.execute('SELECT ID FROM main;'):
    for i in range(len(df)):
        if row[0] == df.ID.iloc[i]:
            update_sqldb(df, i)
However, I think this way of having two nested for-loops is probably ugly and not very pythonic. I thought that maybe I should use the map() function, but is this the right direction to go?
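For context, the kind of single-pass update I was hoping for looks something like the sketch below (it assumes the main table already has a MY_DATA column; the names just mirror my example):

import sqlite3
import pandas as pd

df = pd.read_csv('my_filename.csv')
con = sqlite3.connect('my_database.sqlite')

# One parameterised UPDATE per row, sent in a single executemany call;
# .tolist() turns the numpy scalars into plain Python values for sqlite3.
params = list(zip(df['MY_DATA'].tolist(), df['ID'].tolist()))
con.executemany('UPDATE main SET MY_DATA = ? WHERE ID = ?;', params)
con.commit()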
How can I use df.to_sql(if_exists='append') to append ONLY the unique values between the dataframe and the database? In other words, I would like to evaluate the duplicates between the DF and the DB and drop those duplicates before writing to the database.
Is there a parameter for this?
I understand that the parameters if_exists='append' and if_exists='replace' apply to the entire table - not to the unique entries.
I am using:
sqlalchemy
pandas dataframe with the following datatypes:
index: datetime.datetime <-- Primary Key
float
float
float
float
integer
string <-- Primary Key
string <-- Primary Key
I'm stuck on this, so your help is much appreciated. Thanks.
In pandas, there is no convenient argument in to_sql to append only non-duplicates to a final table. Consider using a staging temp table that pandas always replaces, and then run a final append query that migrates the temp table records to the final table, keeping only unique primary keys by means of a NOT EXISTS clause.
import sqlalchemy

engine = sqlalchemy.create_engine(...)

# Stage the dataframe in a temp table that is always replaced
df.to_sql(name='myTempTable', con=engine, if_exists='replace')

with engine.begin() as cn:
    sql = """INSERT INTO myFinalTable (Col1, Col2, Col3, ...)
             SELECT t.Col1, t.Col2, t.Col3, ...
             FROM myTempTable t
             WHERE NOT EXISTS
                 (SELECT 1 FROM myFinalTable f
                  WHERE t.MatchColumn1 = f.MatchColumn1
                    AND t.MatchColumn2 = f.MatchColumn2)"""
    cn.execute(sqlalchemy.text(sql))
This is an ANSI SQL solution, not restricted to vendor-specific methods like UPSERT, and so it works in practically all SQL-integrated relational databases.
I was wondering if there is a way to update all rows of a MySQL table from a pandas dataframe in one query.
I select a dataframe from MySQL, do some calculations, and then want the rows in the MySQL table to be updated to match the rows in the dataframe. I do not select the complete table, so I cannot just replace it.
The column order/types remain unchanged, so it just needs to replace/update the rows. I have an indexed, auto-increment primary key 'id' column, if that makes any difference.
Thanks
This is the error I get when trying to create the SQL statement from the post Bob commented below:
58 d = {'col1': 'val1', 'col2': 'val2'}
59 sql = 'UPDATE table SET {}'.format(', '.join('{}=%s'.format(k) for k in d))
60 print sql
61 sql undefined, k = 'col2', global d = {'col1': 'val1', 'col2': 'val2'}
<type 'exceptions.ValueError'>: zero length field name in format
args = ('zero length field name in format',)
message = 'zero length field name in format'
I don't think that's possible with Pandas. At least not directly from Pandas. I know that you can use to_sql() to append or replace, but that doesn't help you very much.
You could try converting a dataframe to a dict with to_dict() and then executing an Update statement with values from the dict and a mysql cursor.
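A rough sketch of that idea, assuming a pymysql connection and a table called my_table keyed on an id column (all of these names are illustrative, not from the question):

import pymysql

# Assumed connection details -- adjust to your environment.
conn = pymysql.connect(host='localhost', user='user', password='secret', database='mydb')

# Cast to object first so the row values come out as plain Python types.
records = df.astype(object).to_dict('records')
columns = [c for c in df.columns if c != 'id']   # everything except the key
set_clause = ', '.join('{0}=%s'.format(c) for c in columns)
sql = 'UPDATE my_table SET {0} WHERE id=%s'.format(set_clause)

params = [tuple(r[c] for c in columns) + (r['id'],) for r in records]
with conn.cursor() as cur:
    cur.executemany(sql, params)
conn.commit()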
UPDATE
You might be using a version of Python (like 2.6) that requires positional argument indices inside format():
sql = 'UPDATE table SET {0}'.format(', '.join('{0}=%s'.format(k) for k in d))