I am retrieving data from a Postgres database and storing it in a Pandas dataframe for further processing. While doing that, I want to update the queried table and set a flag saying that these rows are being processed.
engine = create_engine(connection_string, connect_args=credentials)
query = load_query(filename='queries/get_data.sql')
df = pd.read_sql(query, engine)
ids = df['id']
update_query = "update table1 " + \
               "set status = 'processing' " + \
               f"where session_id in ({ids})"
with engine.connect() as con:
    rs = con.execute(update_query)
The dataframe then looks like this:

ID       descr
Cell 1   Cell 2
Cell 3   Cell 4
Now I want to update the column "status". What do I need to do? I know I need a list, separated by commas, with each value in quotes... but I wasn't able to build it.
Help appreciated
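For what it's worth, here is a minimal sketch of one way to do this, assuming the engine and df from above and SQLAlchemy 1.4+: instead of formatting the Series into the SQL string, pass the ids as an expanding bound parameter and let the driver build the quoted, comma-separated list.

from sqlalchemy import text, bindparam

# Sketch only: bind the id list instead of interpolating it into the SQL.
ids = df['id'].tolist()

update_query = text(
    "update table1 "
    "set status = 'processing' "
    "where session_id in :ids"
).bindparams(bindparam("ids", expanding=True))

# engine.begin() opens a transaction and commits it on exit
with engine.begin() as con:
    con.execute(update_query, {"ids": ids})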
I'm trying to load a database with a number of tables with different numbers of columns from pandas into MS SQL Server, and I've reached the last step:
I am dynamically creating the new tables in the database based on the rows in pandas and fetching the data ... but I have an issue with the last step - adding the data into the columns. The names of the rows are gathered dynamically, however I cannot pass them to the system so that it would treat them as names of rows, not as strings.
with conn.cursor() as cursor:
    cursor.execute("IF EXISTS(SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = '"+foldern[0]+"' AND TABLE_SCHEMA = 'dbo') DROP TABLE [dbo].["+foldern[0]+"];")
    skuel = "CREATE TABLE " + foldern[0] + " ( "+all_columns+" ) AS NODE"
    print(foldern[0])
    cursor.execute(skuel)
    conn.commit()
    for index, row in df2.iterrows():
        skuel2 = "INSERT INTO " + foldern[0] + " ( "+all_columns2+" ) values("+all_columns3+")"
        cursor.execute(skuel2, *namesofrows)
Under +all_columns2+ I do get the names of the columns, and under values I get as many "?" as there are columns.
"namesofrows" is a tuple that contains the names of the rows from df2 (in the format row.ROWNAME), but those rows are not "resolved". What I mean by that is that instead of getting the row values coming from pandas, I get a number of rows containing the same string, which is the name of the row (like row.id, row.types).
Any idea how to proceed with that?
That is a good point, I was not aware of the SQLAlchemy possibility, which I will probably use - but, by the way, I did solve the issue:
resolvedvalues = []
for e in namesofrows:
    resolvedvalues.append(eval(e))
cursor.execute(skuel2, (resolvedvalues))
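A note on that workaround: eval() will execute arbitrary strings, so a safer variant is to resolve the attribute names with getattr() while iterating the dataframe. A rough sketch under the same names as above; namesofcolumns is a hypothetical tuple of plain column names such as ("id", "types"), not the "row.id" strings.

# Sketch only: resolve each column name against the row object with getattr()
# instead of eval(). namesofcolumns is assumed to hold plain column names.
for row in df2.itertuples(index=False):
    values = tuple(getattr(row, name) for name in namesofcolumns)
    cursor.execute(skuel2, values)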
I am trying to write a result into an Oracle database using the executemany command. However, I am neither getting any error message nor is the database table getting updated.
Using the same connection object and cursor, I am able to connect to the database, extract data and insert the data into a new table. However, I am NOT able to update the result to an existing table.
I am using Oracle 19c, Python 3.8.8 and cx_Oracle 8.0.0.
I am reading the data from the Oracle database table 'bench_project_d'. To reproduce the error, I have created a csv file and read the data from the csv file.
The data has 7 fields:
ROW_ID, type NUMBER(19,0)
GROUP_ID, type NUMBER(19,0)
PLANNED_UNITS, type NUMBER
IQR_PTU (calculated on the fly)
Q1_PTU (calculated on the fly)
Q2_PTU (calculated on the fly)
ANOMALY, type NUMBER
All the fields have data except the new column "ANOMALY".
In this field all values are null; this is the field where we want to store the results.
While importing the data into Python, we take the first 6 features, calculate the anomaly field and push the anomaly result to the database.
#connecting to Database
username="**********"
password="********"
tsn="*********"
conn = cx_Oracle.connect(username, password, tsn)
cur = conn.cursor()
global_df = pd.read_csv("project_data.csv")
#Filtering the group having more than 3 projects
grouped = global_df.groupby("GROUP_ID")
filtered_df = grouped.filter(lambda x: x["GROUP_ID"].count()>3)
filtered_df
#Handling zero Interquartile Range
x = filtered_df[filtered_df['IQR_PTU'] == 0]['GROUP_ID'].unique()
for i in x:
    filtered_df.loc[filtered_df['GROUP_ID'] == i, 'IQR_PTU'] = (
        filtered_df[filtered_df['GROUP_ID'] == i]['PLANNED_UNITS'].quantile(0.95)
        - filtered_df[filtered_df['GROUP_ID'] == i]['PLANNED_UNITS'].quantile(0.05)
    )
#Creating the output 'Anomaly' field
filtered_df['ANOMALY'] =0
filtered_df.loc[filtered_df['PLANNED_UNITS'] > (filtered_df['Q2_PTU']+(1.5*filtered_df['IQR_PTU'])),'ANOMALY']=1
filtered_df.loc[filtered_df['PLANNED_UNITS'] < (filtered_df['Q1_PTU']-(1.5*filtered_df['IQR_PTU'])),'ANOMALY']=-1
#Formating the Dataframe
result_df = filtered_df.loc[:,['ROW_ID','GROUP_ID', 'ANOMALY']]
result_df = result_df.round(2)
result_df=result_df.astype('object')
value_to_stre_db=result_df.values.tolist()
#Pushing the result to Database table bench_project_d
statement = 'update bench_project_d set GROUP_ID = :2, ANOMALY= :3 where ROW_ID = :1'
cur.executemany(statement, value_to_stre_db)
conn.commit()
EDIT 1:
I have tried to convert the list of arrays to a list of tuples and executed the same code again, but still no luck.
rows = [tuple(x) for x in value_to_stre_db]
#Pushing the result to Database table bench_project_d
statement = 'update bench_project_d set GROUP_ID = :2, ANOMALY= :3 where ROW_ID = :1'
cur.executemany(statement, rows)
conn.commit()
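Not a definitive fix, but one thing worth ruling out is a bind-order mismatch: the statement lists the binds as :2, :3, :1 while each row is ordered [ROW_ID, GROUP_ID, ANOMALY]. A sketch using named binds with executemany (assuming result_df still holds the three columns) makes the pairing explicit:

# Sketch only: bind by name so each value is unambiguously matched to its placeholder.
statement = """
    update bench_project_d
       set GROUP_ID = :group_id,
           ANOMALY  = :anomaly
     where ROW_ID   = :row_id
"""
rows = [
    {"row_id": r.ROW_ID, "group_id": r.GROUP_ID, "anomaly": r.ANOMALY}
    for r in result_df.itertuples(index=False)
]
cur.executemany(statement, rows)
conn.commit()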
I've got a number of dataframes with information associated with different systems.
Now I'm trying to write the information to multiple tables (as many as there are systems) using SQLAlchemy.
(I'm pretty new to Python and SQLAlchemy, though.)
So I'm wondering whether there is a nicer way to write the values of each column of a dataframe to DIFFERENT tables?
E.g. column 1 of dataframes 3 and 4 to table 1, column 2 of dataframes 3 and 4 to table 2, and so on.
Also, I keep getting the integrity error "Duplicate entry" if any value is written twice to the same column in a table.
x = 0
for index in a_id:
    table_sim = Table(
        f'simulated_for_sys_{np.int_(index)}', meta,
        Column('timestamp', DateTime, primary_key = True),
        Column('system__id', Integer),
        Column('simulated_yield_in_kWh', Float),
        Column('global_irradiance_tilted_in_kWh_per_m2', Float)
    )

    # Checking if table already exists
    if not engine.dialect.has_table(engine, f'simulated_for_sys_{np.int_(index)}'):
        print("Tables created", table_sim)
        # Specified table
        meta.create_all(engine)
    else:
        print("Table already exists...")

    conn = engine.connect()

    # Write timestamps from 01.01.xxxx till now
    for timestamp_utc in timestamp_df['timestamp_utc']:
        print(timestamp_utc)
        ins = table_sim.insert().values(timestamp = timestamp_utc.to_pydatetime())
        result = conn.execute(ins)

    # Write id to table (giving duplicate error..)
    for system_id in sys_ids_df['system_id']:
        ins1 = table_sim.insert().values(system__id = system_id)
        conn = engine.connect()
        result = conn.execute(ins1)

    # Write pr information from column x of dataframe to table
    # (also duplicate error in between if same values appear)
    colname = f'pr_{x}'
    for colname in pr_daily_df[f'{colname}']:
        ins2 = table_sim.insert().values(simulated_yield_in_kWh = colname)
        result = conn.execute(ins2)

    # Write rad information from column x of dataframe to table
    # (also duplicate error in between if same values appear)
    colname = f'rad_{x}'
    for colname in rad_daily_df[f'{colname}']:
        ins3 = table_sim.insert().values(global_irradiance_tilted_in_kWh_per_m2 = colname)
        result = conn.execute(ins3)

    x += 1
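One possible restructuring, sketched under the assumption that the four per-system series line up row by row (same length, same order): build one record per timestamp and insert complete rows in a single call, rather than inserting each column as its own row. That avoids rows that carry only one populated column and may also explain the duplicate-entry errors if the primary-key column ends up with the same (or default) value more than once.

# Sketch only: assumes the per-system series have equal length and aligned order.
records = [
    {
        "timestamp": ts.to_pydatetime(),
        "system__id": sys_id,
        "simulated_yield_in_kWh": pr_value,
        "global_irradiance_tilted_in_kWh_per_m2": rad_value,
    }
    for ts, sys_id, pr_value, rad_value in zip(
        timestamp_df["timestamp_utc"],
        sys_ids_df["system_id"],
        pr_daily_df[f"pr_{x}"],
        rad_daily_df[f"rad_{x}"],
    )
]

# Executing an insert with a list of dicts lets SQLAlchemy batch the rows.
with engine.begin() as conn:
    conn.execute(table_sim.insert(), records)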
I am performing an ETL task where I am querying tables in a Data Warehouse to see if it contains IDs in a DataFrame (df) which was created by joining tables from the operational database.
The DataFrame only has ID columns from each joined table in the operational database. I have created a variable for each of these columns, e.g. 'billing_profiles_id' as below:
billing_profiles_dim_id = df['billing_profiles_dim_id']
I am attempting to iterate row by row to see if the ID here is in the 'billing_profiles_dim' table of the Data Warehouse. Where the ID is not present, I want to populate the DWH tables row by row using the matching ID rows in the ODB:
for key in billing_profiles_dim_id:
    sql = "SELECT * FROM billing_profiles_dim WHERE id = '"+str(key)+"'"
    dwh_cursor.execute(sql)
    result = dwh_cursor.fetchone()
    if result == None:
        sqlQuery = "SELECT * from billing_profile where id = '"+str(key)+"'"
        sqlInsert = "INSERT INTO billing_profile_dim VALUES ('"+str(key)+"','"+billing_profile.name"')
        op_cursor = op_connector.execute(sqlInsert)
        billing_profile = op_cursor.fetchone()
So far at least, I am receiving the following error:
SyntaxError: EOL while scanning string literal
The error message points at the closing bracket in
sqlInsert = "INSERT INTO billing_profile_dim VALUES ('"+str(key)+"','"+billing_profile.name"')
which I am currently unable to solve. I'm also aware that this code may run into another problem or two. Could someone please help me solve the current issue and make sure I'm heading down the correct path?
You are missing a closing double quote and a +:
sqlInsert = "INSERT INTO billing_profile_dim VALUES ('"+str(key)+"','"+billing_profile.name+"')"
But you should really switch to prepared statements like
sql = "SELECT * FROM billing_profiles_dim WHERE id = '%s'"
dwh_cursor.execute(sql,(str(key),))
...
sqlInsert = ('INSERT INTO billing_profile_dim VALUES '
'(%s, %s )')
dwh_cursor.execute(sqlInsert , (str(key), billing_profile.name))
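To tie both points together, here is a rough sketch of the loop with parameter binding and with the operational row fetched before it is used. It follows the question's own dwh_cursor / op_connector objects, so treat the exact execute/fetch calls as assumptions about those drivers.

# Sketch only: parameter binding plus fetching the operational row first.
for key in billing_profiles_dim_id:
    dwh_cursor.execute("SELECT * FROM billing_profiles_dim WHERE id = %s", (str(key),))
    if dwh_cursor.fetchone() is None:
        # Look up the matching row in the operational database first ...
        op_cursor = op_connector.execute("SELECT * FROM billing_profile WHERE id = %s", (str(key),))
        billing_profile = op_cursor.fetchone()
        # ... then insert it into the warehouse dimension table.
        dwh_cursor.execute(
            "INSERT INTO billing_profile_dim VALUES (%s, %s)",
            (str(key), billing_profile.name),
        )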
I have a massive table (over 100B records) to which I added an empty column. I parse strings from another (string) field if the required string is available, extract an integer from that field, and want to write it to the new column for all rows that contain that string.
At the moment, after the data has been parsed and saved locally in a dataframe, I iterate over it to update the Redshift table with the clean data. This takes approx. 1 sec/iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()
clean_df = raw_data.apply(clean_field_to_parse)
for ind, row in clean_df.iterrows():
    update_query = build_update_query(row.id, row.clean_integer1, row.clean_integer2)
    cur.execute(update_query)
where build_update_query is a function that generates the update query:
def build_update_query(id, int1, int2):
    query = """
    update tab_tab
    set
        clean_int_1 = {}::int,
        clean_int_2 = {}::int,
        updated_date = GETDATE()
    where id = {}
    ;
    """
    return query.format(int1, int2, id)
and where clean_df is structured like:

id    field_to_parse      clean_int_1   clean_int_2
1     {'int_1':'2+1'}     3             np.nan
2     {'int_2':'7-0'}     np.nan        7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows: push the Pandas data frame to Postgres as a staging table and then run one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of the data frame. You can even add an index on the join field, id, and drop the very large staging table at the end.
from sqlalchemy import create_engine, text

engine = create_engine("postgresql+psycopg2://myuser:mypwd@myhost/mydatabase")

# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)

# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
    sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
    conn.execute(text(sql))

    sql = """UPDATE tab_tab t
             SET clean_int_1 = c.clean_int_1,
                 clean_int_2 = c.clean_int_2,
                 updated_date = GETDATE()
             FROM clean_df c
             WHERE c.id = t.id
          """
    conn.execute(text(sql))

    sql = "DROP TABLE IF EXISTS clean_df"
    conn.execute(text(sql))

engine.dispose()