I have a table X in big query with 170,000 rows . The values on this table as based on complex calculations done on the values from a table Y. These are done in python so as to automate the ingestion when Y gets updated.
Every time Y updates, I recompute the values needed for X in my script and insert them using the script below using streaming:
def stream_data(table, json_data):
data = json.loads(str(json_data))
# Reload the table to get the schema.
table.reload()
rows = [data]
errors = table.insert_data(rows)
if not errors:
print('Loaded 1 row into {}'.format( table))
else:
print('Errors:')
The problem here is that I have to delete all rows in the table before I insert . I know a query to do this but it fails because big query does not allow DML when there is a streaming buffer on the table and this is for one day apparently.
IS there a workaround where I can delete all rows in X , recompute based on Y and then insert the new values using the code above ??
Possibly turning the streaming buffer off ??!!
Another option would be to drop the whole table and recreate it . But my table is huge with 60 columns and the JSON for the schema would be huge . I couldn't find samples where I can create a new table with schema passed from json/file ? Some samples in this would be great.
A third option is to make the streaming insert smart that it does an update instead of insert if the row has changed . This again is a DML operation and goes back to original problem.
UPDATE:
another approach I tried is to delete the table and recreate it . Before delete I copy the schema so I can set it in the new table.:
def stream_data( json_data):
bigquery_client = bigquery.Client("myproject")
dataset = bigquery_client.dataset("mydataset")
table = dataset.table("test")
data = json.loads(json_data)
schema=table.schema
table.delete()
table = dataset.table("test")
# Set the table schema
table = dataset.table("test",schema)
table.create()
rows = [data]
errors = table.insert_data(rows)
if not errors:
print('Loaded 1 row ')
else:
print('Errors:')
This gives me an error :
ValueError: Set either 'view_query' or 'schema'.
UPDATE 2:
Key was to do a
table.reload() before
schema=table.schema to fix the above!
Related
I need to iterate through a DataFrame using a for loop to insert preexisting data into a database. The data frame is called gelsdf and the database table is Gels. I cannot figure out how to make the VALUES portion of the insert clause work without receiving an error.
Here's my code:
count = 1
for i in range(len(gelsdf)):
curr = gelsdf.loc[count]
%sql INSERT INTO Gels (GelID, Brand, GelCode, Size, GelName, GelLocation) VALUES (curr[0], curr[1], [curr[2], curr[3], curr[4], curr[5])
count += 1
I have a massive table (over 100B records), that I added an empty column to. I parse strings from another field (string) if the required string is available, extract an integer from that field, and want to update it in the new column for all rows that have that string.
At the moment, after data has been parsed and saved locally in a dataframe, I iterate on it to update the Redshift table with clean data. This takes approx 1sec/iteration, which is way too long.
My current code example:
conn = psycopg2.connect(connection_details)
cur = conn.cursor()
clean_df = raw_data.apply(clean_field_to_parse)
for ind, row in clean_df.iterrows():
update_query = build_update_query(row.id, row.clean_integer1, row.clean_integer2)
cur.execute(update_query)
where update_query is a function to generate the update query:
def update_query(id, int1, int2):
query = """
update tab_tab
set
clean_int_1 = {}::int,
clean_int_2 = {}::int,
updated_date = GETDATE()
where id = {}
;
"""
return query.format(int1, int2, id)
and where clean_df is structured like:
id . field_to_parse . clean_int_1 . clean_int_2
1 . {'int_1':'2+1'}. 3 . np.nan
2 . {'int_2':'7-0'}. np.nan . 7
Is there a way to update specific table fields in bulk, so that there is no need to execute one query at a time?
I'm parsing the strings and running the update statement from Python. The database is stored on Redshift.
As mentioned, consider pure SQL and avoid iterating through billions of rows by pushing the Pandas data frame to Postgres as a staging table and then run one single UPDATE across both tables. With SQLAlchemy you can use DataFrame.to_sql to create a table replica of data frame. Even add an index of the join field, id, and drop the very large staging table at end.
from sqlalchemy import create_engine
engine = create_engine("postgresql+psycopg2://myuser:mypwd!#myhost/mydatabase")
# PUSH TO POSTGRES (SAME NAME AS DF)
clean_df.to_sql(name="clean_df", con=engine, if_exists="replace", index=False)
# SQL UPDATE (USING TRANSACTION)
with engine.begin() as conn:
sql = "CREATE INDEX idx_clean_df_id ON clean_df(id)"
conn.execute(sql)
sql = """UPDATE tab_tab t
SET t.clean_int_1 = c.int1,
t.clean_int_2 = c.int2,
t.updated_date = GETDATE()
FROM clean_df c
WHERE c.id = t.id
"""
conn.execute(sql)
sql = "DROP TABLE IF EXISTS clean_df"
conn.execute(sql)
engine.dispose()
I have inherited a Postgres database, and am currently in the process of cleaning it. I have created an algorithm to find the rows where the data is bad. The algorithm is encoded into the function called checkProblems(). Using this, I am able to select the rows that contains the bad rows, as shown below ...
schema = findTables(dbName)
conn = psycopg2.connect("dbname='%s' user='postgres' host='localhost'"%dbName)
cur = conn.cursor()
results = []
for t in tqdm(sorted(schema.keys())):
n = 0
cur.execute('select * from %s'%t)
for i, cs in enumerate(tqdm(cur)):
if checkProblem(cs):
n += 1
results.append({
'tableName': t,
'totalRows': i+1,
'badRows' : n,
})
cur.close()
conn.close()
print pd.DataFrame(results)[['tableName', 'badRows', 'totalRows']]
Now, I need to delete the rows that are bad. I have two different ways of doing it. First, I can write the clean rows in a temporary table, and rename the table. I think that this option is too memory-intensive. It would be much better if I would be able to just delete the specific record at the cursor. Is this even an option?
Otherwise, what is the best way of deleting a record under such circumstances? I am guessing that this should be a relatively common thing that database administrators do ...
Of course that delete the specific record at the cursor is better. You can do something like:
for i, cs in enumerate(tqdm(cur)):
if checkProblem(cs):
# if cs is a tuple with cs[0] being the record id.
cur.execute('delete from %s where id=%d'%(t, cs[0]))
Or you can store the ids of the bad records and then do something like
DELETE FROM table WHERE id IN (id1,id2,id3,id4)
I have one database with two tables, both have a column called barcode, the aim is to retrieve barcode from one table and search for the entries in the other where extra information of that certain barcode is stored. I would like to have bothe retrieved data to be saved in a DataFrame. The problem is when I want to insert the retrieved data into DataFrame from the second query, it stores only the last entry:
import mysql.connector
import pandas as pd
cnx = mysql.connector(user,password,host,database)
query_barcode = ("SELECT barcode FROM barcode_store")
cursor = cnx.cursor()
cursor.execute(query_barcode)
data_barcode = cursor.fetchall()
Up to this point everything works smoothly, and here is the part with problem:
query_info = ("SELECT product_code FROM product_info WHERE barcode=%s")
for each_barcode in data_barcode:
cursor.execute(query_info % each_barcode)
pro_info = pd.DataFrame(cursor.fetchall())
pro_info contains only the last matching barcode information! While I want to retrieve all the information for each data_barcode match.
That's because you are consistently overriding existing pro_info with new data in each loop iteration. You should rather do something like:
query_info = ("SELECT product_code FROM product_info")
cursor.execute(query_info)
pro_info = pd.DataFrame(cursor.fetchall())
Making so many SELECTs is redundant since you can get all records in one SELECT and instantly insert them to your DataFrame.
#edit: However if you need to use the WHERE statement to fetch only specific products, you need to store records in a list until you insert them to DataFrame. So your code will eventually look like:
pro_list = []
query_info = ("SELECT product_code FROM product_info WHERE barcode=%s")
for each_barcode in data_barcode:
cursor.execute(query_info % each_barcode)
pro_list.append(cursor.fetchone())
pro_info = pd.DataFrame(pro_list)
Cheers!
My Project involves
1) Import data from excel file
path="...\dataexample.xls"
databook=xlrd.open_workbook(path)
mydatasheet=databook.sheet_by_index(0)
2) Connect to a localhost database
database = MySQLdb.connect (host=myhost, user = myuser, passwd = mypasswd, db = dbname)
3) Import a current range of cell of cells to the database
My. dataexample.xls has 12 rows and 122 cols and for my INSERT QUERY I need only A3:J12 cells
After some search I'am in the point where:
Preparation for the query and
cursor = database.cursor()
query = """INSERT INTO agiosathanasios(record,Stn_Code,Raw_Dist,Snow,Snow_corr,Smp,Raw_Dist_QC,Snow_final,Snow_final_pos) VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s)"""
Collect the correct cells for the query
for row in range(3,12):
values=[]
for col in range(0,10):
values.append(mydatasheet.cell(row,col).value)
print values
I was trying to put after values.append the following code database, cursor.execute(query,values)so I can import the value I want.
But... it does work...
how can I fix this? How can I put the current values to my query ?
So, according to tracebacks, you provide not all parameters to your query or there is some problem with converting values to string (None in values). Check values before calling cursor.execute!
Try to convert to string values from excel:
values.append(str(mydatasheet.cell(row,col).value))