Inserting a DataFrame into a database in Pandas - python

I need to iterate through a DataFrame using a for loop to insert preexisting data into a database. The data frame is called gelsdf and the database table is Gels. I cannot figure out how to make the VALUES portion of the insert clause work without receiving an error.
Here's my code:
count = 1
for i in range(len(gelsdf)):
    curr = gelsdf.loc[count]
    %sql INSERT INTO Gels (GelID, Brand, GelCode, Size, GelName, GelLocation) VALUES (curr[0], curr[1], curr[2], curr[3], curr[4], curr[5])
    count += 1
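One way around the interpolation problem (this is a sketch, not from the original post) is to skip the %sql magic for the inserts and let pandas push the whole DataFrame in one call, assuming a SQLAlchemy engine pointing at the same database (the connection URL below is a placeholder):

from sqlalchemy import create_engine

engine = create_engine("sqlite:///gels.db")  # hypothetical URL; use your own database
cols = ["GelID", "Brand", "GelCode", "Size", "GelName", "GelLocation"]
# append all DataFrame rows to the existing Gels table in a single call
gelsdf[cols].to_sql("Gels", con=engine, if_exists="append", index=False)

Alternatively, ipython-sql can bind local Python variables with the :name syntax, so assigning each value to a plain variable before the %sql line is another route.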

Related

Why is my second database column being populated with 'none'?

I am trying to create a database with three columns: URL, which is the location of the data I am aiming to scrape; TICKER, which is the ticker symbol of the stock; and STATUS, which will be used to indicate whether the data has been acquired yet or not.
import sqlite3
conn = sqlite3.connect('tickers.db')
conn.execute('''CREATE TABLE TAB(URL, TICKER, STATUS default "Not started");''')
for i in url_list:
    conn.execute("INSERT INTO TAB(URL) VALUES(?)",(i,))
for j in ticklist:
    conn.execute("INSERT INTO TAB(TICKER) VALUES(?)",(j,))
for row in conn.execute("SELECT URL, TICKER, STATUS from TAB"):
    print('URL={i}'.format(i=row[0]))
    print('TICKER={i}'.format(i=row[1]))
    print('STATUS={i}'.format(i=row[2]))
To populate the URL column I have used a list of URLs; similarly, I am trying to do the same thing with TICKER. However, when I run the code, the TICKER column is only populated with 'None' for all rows.
Output
URL=https://api.pushshift.io/reddit/search/submission/?q=$AACG&subreddit=wallstreetbets&metadata=true&size=0&after=1610928000&before=1613088000
TICKER=None
STATUS=Not started
URL=https://api.pushshift.io/reddit/search/submission/?q=$AACIU&subreddit=wallstreetbets&metadata=true&size=0&after=1610928000&before=1613088000
TICKER=None
STATUS=Not started
Instead of trying to populate the columns separately, why not insert the values as rows directly?
Assuming url_list and ticklist are of equal length (and even if not), you can try this:
for i, j in zip(url_list, ticklist):
    conn.execute("INSERT INTO TAB(URL, TICKER) VALUES(?,?)", (i, j))
That way you are adding the values as expected, instead of creating a new row with every insert.
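A hedged follow-up, not part of the original answer: zip() stops at the shorter of the two lists, so if url_list and ticklist can differ in length, itertools.zip_longest keeps the extra entries (filling the missing side with None), and a commit persists the inserts:

from itertools import zip_longest

for i, j in zip_longest(url_list, ticklist):
    conn.execute("INSERT INTO TAB(URL, TICKER) VALUES(?,?)", (i, j))
conn.commit()  # persist the inserts before closing the connection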

Update multiple rows of SQL table from Python script

I have a massive table (over 100B records) to which I added an empty column. I parse strings from another (string) field, extract an integer from it when the required string is present, and want to write that integer into the new column for all rows that contain the string.
At the moment, after the data has been parsed and saved locally in a DataFrame, I iterate over it to update the Redshift table with the clean data. This takes approximately 1 second per iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()
clean_df = raw_data.apply(clean_field_to_parse)

for ind, row in clean_df.iterrows():
    update_query = build_update_query(row.id, row.clean_int_1, row.clean_int_2)
    cur.execute(update_query)
where build_update_query is a function to generate the update query:

def build_update_query(id, int1, int2):
    query = """
        update tab_tab
        set
            clean_int_1 = {}::int,
            clean_int_2 = {}::int,
            updated_date = GETDATE()
        where id = {}
        ;
    """
    return query.format(int1, int2, id)
and where clean_df is structured like:
id    field_to_parse       clean_int_1    clean_int_2
1     {'int_1': '2+1'}     3              np.nan
2     {'int_2': '7-0'}     np.nan         7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows: push the Pandas DataFrame to Postgres as a staging table, then run one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of the DataFrame, add an index on the join field, id, and drop the (very large) staging table at the end.
from sqlalchemy import create_engine

engine = create_engine("postgresql+psycopg2://myuser:mypwd!@myhost/mydatabase")

# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)

# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
    sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
    conn.execute(sql)

    sql = """UPDATE tab_tab t
             SET clean_int_1 = c.clean_int_1,
                 clean_int_2 = c.clean_int_2,
                 updated_date = GETDATE()
             FROM clean_df c
             WHERE c.id = t.id
          """
    conn.execute(sql)

    sql = "DROP TABLE IF EXISTS clean_df"
    conn.execute(sql)

engine.dispose()
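If a staging table is not an option, a hedged alternative is to keep the per-row UPDATE but send it in batches with psycopg2's execute_batch, which cuts the number of network round trips. This sketch reuses the conn, cur, and clean_df names and the Redshift GETDATE() call from the question; rows holding NaN would still need handling before the ::int cast.

from psycopg2.extras import execute_batch

sql = """
    update tab_tab
    set clean_int_1 = %s::int,
        clean_int_2 = %s::int,
        updated_date = GETDATE()
    where id = %s
"""
# one parameter tuple per row, in the same order as the placeholders
params = list(clean_df[["clean_int_1", "clean_int_2", "id"]].itertuples(index=False, name=None))
execute_batch(cur, sql, params, page_size=1000)
conn.commit()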

BigQuery insert/delete to table

I have a table X in BigQuery with 170,000 rows. The values in this table are based on complex calculations done on the values from a table Y. These are done in Python so as to automate the ingestion when Y gets updated.
Every time Y updates, I recompute the values needed for X in my script and insert them via streaming, using the code below:
def stream_data(table, json_data):
    data = json.loads(str(json_data))
    # Reload the table to get the schema.
    table.reload()
    rows = [data]
    errors = table.insert_data(rows)
    if not errors:
        print('Loaded 1 row into {}'.format(table))
    else:
        print('Errors:')
The problem here is that I have to delete all rows in the table before I insert. I know a query to do this, but it fails because BigQuery does not allow DML while there is a streaming buffer on the table, and that apparently lasts for a day.
Is there a workaround where I can delete all rows in X, recompute based on Y, and then insert the new values using the code above?
Possibly by turning the streaming buffer off?
Another option would be to drop the whole table and recreate it. But my table is huge, with 60 columns, so the JSON for the schema would be huge. I couldn't find samples showing how to create a new table with a schema passed from JSON or a file; some samples for this would be great.
A third option is to make the streaming insert smart enough that it does an update instead of an insert if the row has changed. This again is a DML operation and runs into the original problem.
UPDATE:
Another approach I tried is to delete the table and recreate it. Before deleting I copy the schema so I can set it on the new table:
def stream_data(json_data):
    bigquery_client = bigquery.Client("myproject")
    dataset = bigquery_client.dataset("mydataset")
    table = dataset.table("test")
    data = json.loads(json_data)
    schema = table.schema
    table.delete()
    table = dataset.table("test")
    # Set the table schema
    table = dataset.table("test", schema)
    table.create()
    rows = [data]
    errors = table.insert_data(rows)
    if not errors:
        print('Loaded 1 row ')
    else:
        print('Errors:')
This gives me an error:
ValueError: Set either 'view_query' or 'schema'.
UPDATE 2:
The key was to call table.reload() before schema = table.schema, which fixed the error above.
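For clarity, here is a minimal sketch of that fix, using the same legacy google-cloud-bigquery calls as the question: reload the table so its schema is populated before it is copied and the table is recreated.

def stream_data(json_data):
    bigquery_client = bigquery.Client("myproject")
    dataset = bigquery_client.dataset("mydataset")
    table = dataset.table("test")
    table.reload()                          # populate table.schema from the API
    schema = table.schema
    table.delete()
    table = dataset.table("test", schema)   # recreate with the copied schema
    table.create()
    errors = table.insert_data([json.loads(json_data)])
    if errors:
        print('Errors: {}'.format(errors))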

Put retrieved data from MySQL query into DataFrame pandas by a for loop

I have one database with two tables; both have a column called barcode. The aim is to retrieve barcodes from one table and look up the entries in the other table where extra information for each barcode is stored. I would like both sets of retrieved data to be saved in a DataFrame. The problem is that when I insert the data retrieved by the second query into a DataFrame, only the last entry is stored:
import mysql.connector
import pandas as pd

cnx = mysql.connector.connect(user=user, password=password, host=host, database=database)
query_barcode = ("SELECT barcode FROM barcode_store")
cursor = cnx.cursor()
cursor.execute(query_barcode)
data_barcode = cursor.fetchall()
Up to this point everything works smoothly; here is the part with the problem:
query_info = ("SELECT product_code FROM product_info WHERE barcode=%s")
for each_barcode in data_barcode:
cursor.execute(query_info % each_barcode)
pro_info = pd.DataFrame(cursor.fetchall())
pro_info contains only the information for the last matching barcode, while I want to retrieve the information for every data_barcode match.
That's because you are overwriting pro_info with new data in each loop iteration. You should rather do something like:
query_info = ("SELECT product_code FROM product_info")
cursor.execute(query_info)
pro_info = pd.DataFrame(cursor.fetchall())
Making so many SELECTs is redundant, since you can get all the records in one SELECT and insert them into your DataFrame in one go.
Edit: However, if you need the WHERE clause to fetch only specific products, you need to store the records in a list until you insert them into the DataFrame. So your code will eventually look like:
pro_list = []
query_info = ("SELECT product_code FROM product_info WHERE barcode=%s")
for each_barcode in data_barcode:
    cursor.execute(query_info % each_barcode)
    pro_list.append(cursor.fetchone())
pro_info = pd.DataFrame(pro_list)
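A small, hedged variant of the loop above: letting the MySQL connector bind the parameter itself, rather than %-formatting it into the query string, avoids quoting problems and SQL injection:

pro_list = []
query_info = "SELECT product_code FROM product_info WHERE barcode = %s"
for each_barcode in data_barcode:        # each_barcode is a 1-tuple straight from fetchall()
    cursor.execute(query_info, each_barcode)
    pro_list.append(cursor.fetchone())
pro_info = pd.DataFrame(pro_list)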
Cheers!

Import a certain range of cells from an Excel file to MySQL using Python

My project involves:
1) Importing data from an Excel file
path="...\dataexample.xls"
databook=xlrd.open_workbook(path)
mydatasheet=databook.sheet_by_index(0)
2) Connecting to a localhost database
database = MySQLdb.connect (host=myhost, user = myuser, passwd = mypasswd, db = dbname)
3) Importing a certain range of cells into the database
My dataexample.xls has 12 rows and 122 columns, and for my INSERT query I need only cells A3:J12.
After some searching, I am at the point where I prepare the query:
cursor = database.cursor()
query = """INSERT INTO agiosathanasios(record,Stn_Code,Raw_Dist,Snow,Snow_corr,Smp,Raw_Dist_QC,Snow_final,Snow_final_pos) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
and collect the correct cells for the query:
for row in range(3,12):
    values=[]
    for col in range(0,10):
        values.append(mydatasheet.cell(row,col).value)
    print values
I was trying to put cursor.execute(query, values) right after values.append(...) so I can insert the values I want.
But... it does not work.
How can I fix this? How can I pass the current values to my query?
So, according to the tracebacks, you are not providing all the parameters to your query, or there is a problem with converting the values to strings (None in values). Check values before calling cursor.execute!
Try converting the Excel values to strings:
values.append(str(mydatasheet.cell(row,col).value))
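To tie it together, here is a hedged sketch that reuses the question's database, query, and mydatasheet names: execute the query once per row, after the ten cell values for that row have been collected, then commit.

cursor = database.cursor()
for row in range(3, 12):   # xlrd rows are 0-indexed, so A3:J12 may actually need range(2, 12)
    values = [str(mydatasheet.cell(row, col).value) for col in range(0, 10)]
    cursor.execute(query, values)
database.commit()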
