How to update a parameter in a query (Python + BigQuery)

I am trying to make multiple calls to export a large dataset from BigQuery into CSV via Python (e.g. rows 0 to 10,000, then rows 10,001 to 20,000, and so on), but I am not sure how to set a dynamic parameter correctly, i.e. how to keep updating a and b.
The reason I need to put the query into a loop is that the dataset is too big for a one-time extraction.
a = 0
b = 10000
while a <= max(counts):  # i.e. counts = 7165920
    query = """
    SELECT *
    FROM `bigquery-public-data.ethereum_blockchain.blocks`
    limit #a, #b
    """
    params = [
        bigquery.ScalarQueryParameter('a', 'INT', a),
        bigquery.ScalarQueryParameter('b', 'INT', b)]
    query_job = client.query(query)
    export_file = open("output.csv", "a")
    output = csv.writer(export_file, lineterminator='\n')
    for rows in query_job:
        output.writerow(rows)
    export_file.close()
    a = b + 1
    b = b + b
For a small dataset I am able to get the output without a loop or any params (I just LIMIT to 10, but that is a single pull).
When I try the above method, I keep getting errors.

Suggestion of another approach
To export a table
Since you want to export the whole content of the table as a CSV, I would advise you to use an ExtractJob. It is meant to send the content of a table to Google Cloud Storage as CSV or JSON. Here's a nice example from the docs:
destination_uri = 'gs://{}/{}'.format(bucket_name, 'shakespeare.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US')  # API request
extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))
For a query
Pandas has a read_gbq function to load the result of a query into a DataFrame. If the result of the query fits in memory, you can use it and then call to_csv() on the resulting DataFrame. Be sure to install the pandas-gbq package to do this.
If the query result is too big, add a destination table to your QueryJobConfig so the result is written to a table, and then export that table to Google Cloud Storage with an ExtractJob as shown above.
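For illustration, here is a minimal, untested sketch of both routes; the project, dataset, and table names are placeholders:
# Route 1: result fits in memory - load into a DataFrame and write a CSV (needs pandas-gbq)
import pandas as pd
df = pd.read_gbq(
    'SELECT * FROM `bigquery-public-data.ethereum_blockchain.blocks` LIMIT 1000',
    project_id='your-project-id')  # placeholder project
df.to_csv('output.csv', index=False)

# Route 2: result is too big - write the query result to a destination table,
# then extract that table to Cloud Storage with an ExtractJob as above
job_config = bigquery.QueryJobConfig()
job_config.destination = client.dataset('your_dataset').table('query_result')  # placeholders
job_config.write_disposition = 'WRITE_TRUNCATE'
client.query(
    'SELECT * FROM `bigquery-public-data.ethereum_blockchain.blocks`',
    job_config=job_config).result()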
Answer to your question
You could simply use string formatting:
query = """
SELECT *
FROM `bigquery-public-data.ethereum_blockchain.blocks`
WHERE some_column = {}
LIMIT {}
"""
query_job = client.query(query.format(desired_value, number_lines))
(This places desired_value in the WHERE and number_lines in the LIMIT)
If you want to use scalar query parameters, you'll have to create a job config:
my_config = bigquery.job.QueryJobConfig()
my_config.query_parameters = params # this is the list of ScalarQueryParameter's
client.query(query, job_config=my_config)
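Tying this back to the loop in the question: BigQuery does not support the MySQL-style `limit a, b` syntax, so paging is done with LIMIT/OFFSET. Below is a rough, untested sketch using the string formatting suggested above (total_rows stands in for max(counts); for a table this size the ExtractJob route is likely the better choice):
import csv

page_size = 10000
offset = 0
total_rows = 7165920  # stand-in for max(counts) from the question

query_template = """
    SELECT *
    FROM `bigquery-public-data.ethereum_blockchain.blocks`
    LIMIT {} OFFSET {}
"""

with open('output.csv', 'a', newline='') as export_file:
    writer = csv.writer(export_file, lineterminator='\n')
    while offset <= total_rows:
        # Note: without an ORDER BY, row order between pages is not guaranteed.
        query_job = client.query(query_template.format(page_size, offset))
        for row in query_job:
            writer.writerow(row)
        offset += page_size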

Related

execute_values with SELECT statement

I am using table merging in order to select items from my db against a list of parameter tuples. The query works fine, but cur.fetchall() does not return the entire table that I want.
For example:
data = (
    (1, '2020-11-19'),
    (1, '2020-11-20'),
    (1, '2020-11-21'),
    (2, '2020-11-19'),
    (2, '2020-11-20'),
    (2, '2020-11-21')
)
query = """
with data(song_id, date) as (
values %s
)
select t.*
from my_table t
join data d
on t.song_id = d.song_id and t.date = d.date::date
"""
execute_values(cursor, query, data)
results = cursor.fetchall()
In practice, my list of tuples to check against is thousands of rows long, and I expect the response to also be thousands of rows long.
But I am only getting 5 rows back if I call cur.fetchall() at the end of this request.
I know that this is because execute_values batches the requests, but there is some strange behavior.
If I pass page_size=10 then I get 2 items back. And if I set fetch=True then I get no results at all (even though the rowcount does not match that).
My thought was to batch these requests, but the page_size for the batch is not matching the number of items that I'm expecting per batch.
How should I change this request so that I can get all the results I'm expecting?
Edit: (years later after much experience with this)
What you really want to do here is use the COPY command to bulk load your data into a temporary table. Then use that temporary table to join on both your columns as you would a normal table. With psycopg2 you can use the copy_expert method to perform the COPY. To reiterate (according to this example), here's how you would do that...
Also, trust me when I say this: if speed is an issue for you, this is by far the fastest method out there, and it's not even close.
(The code in this example is not tested.)
import pandas as pd
from psycopg2 import sql

df = pd.DataFrame('<whatever your dataframe is>')

# Start by creating the temporary table
string = '''
    create temp table mydata (
        item_id int,
        date date
    );
'''
cur.execute(string)

# Now you need to generate an sql string that will copy
# your data into the db
string = sql.SQL("""
    copy {} ({})
    from stdin (
        format csv,
        null 'NaN',
        delimiter ',',
        header
    )
""").format(
    sql.Identifier('mydata'),
    sql.SQL(',').join([sql.Identifier(i) for i in df.columns]))

# Write your dataframe to the disk as a csv
df.to_csv('./temp_dataframe.csv', index=False, na_rep='NaN')

# Copy into the database
with open('./temp_dataframe.csv') as csv_file:
    cur.copy_expert(string, csv_file)

# Now your data should be in your temporary table, so we can
# perform our select like normal
string = '''
    select t.*
    from my_table t
    join mydata d
        on t.item_id = d.item_id and t.date = d.date
'''
cur.execute(string)
data = cur.fetchall()

Update multiple rows in MySQL with Pandas dataframe

I have worked on a dataframe (previously extracted from a table with SQLAlchemy), and now I want to write the changes back by updating that table.
I have done it in this very inefficient way:
engine = sql.create_engine(connect_string)
connection = engine.connect()
metadata = sql.MetaData()
pbp = sql.Table('playbyplay', metadata, autoload=True, autoload_with=engine)
for i in range(1, len(playbyplay_substitutions)):
    query_update = (
        'update playbyplay set Player_1_Visitor = {0}, Player_2_Visitor = {1}, '
        'Player_3_Visitor = {2}, Player_4_Visitor = {3}, Player_5_Visitor = {4} '
        'where id_match = {5} and actionNumber = {6}'.format(
            playbyplay_substitutions.loc[i, 'Player_1_Visitor_y'],
            playbyplay_substitutions.loc[i, 'Player_2_Visitor_y'],
            playbyplay_substitutions.loc[i, 'Player_3_Visitor_y'],
            playbyplay_substitutions.loc[i, 'Player_4_Visitor_y'],
            playbyplay_substitutions.loc[i, 'Player_5_Visitor_y'],
            playbyplay_substitutions.loc[i, 'id_match'],
            playbyplay_substitutions.loc[i, 'actionNumber']))
    connection.execute(query_update)
playbyplay_substitutions is my dataframe, playbyplay is my table, and the rest are the fields that I want to update or the keys in my table. I am looking for a more efficient solution than the one that I currently have for SQLAlchemy integrated with MySQL.
Consider using proper placeholders instead of manually formatting strings:
query_update = sql.text("""
UPDATE playbyplay
SET Player_1_Visitor = :Player_1_Visitor_y
, Player_2_Visitor = :Player_2_Visitor_y
, Player_3_Visitor = :Player_3_Visitor_y
, Player_4_Visitor = :Player_4_Visitor_y
, Player_5_Visitor = :Player_5_Visitor_y
WHERE id_match = :id_match AND actionNumber = :actionNumber
""")
# .iloc[1:] mimics the original for-loop that started from 1
args = playbyplay_substitutions[[
    'Player_1_Visitor_y', 'Player_2_Visitor_y', 'Player_3_Visitor_y',
    'Player_4_Visitor_y', 'Player_5_Visitor_y', 'id_match',
    'actionNumber']].iloc[1:].to_dict('records')
connection.execute(query_update, args)
If your driver is sufficiently clever, this allows it to prepare a statement once and reuse it over the data, instead of emitting queries one by one. This also avoids possible accidental SQL injection problems, where your data resembles SQL constructs when formatted as a string manually.

Pandas Join DataTable to SQL Table to Prevent Memory Errors

So I have about 4-5 million rows of data per table. I have about 10-15 of these tables. I created a table that will join 30,000 rows to some of these million rows based on some ID and snapshot date.
Is there a way to write my existing data table to a SQL query where it will filter the results down for me so that I do not have to load the entire tables into memory?
At the moment I've been loading each table in one at a time, and then releasing the memory. However, it still takes up 100% memory on my computer.
for table in tablesToJoin:
    if df is not None:
        print("DF LENGTH", len(df))
    query = """SET NOCOUNT ON; SELECT * FROM """ + table + """ (nolock) where snapshotdate = '""" + date + """'"""
    query += """ SET NOCOUNT OFF;"""
    start = time.time()
    loadedDf = pd.read_sql_query(query, conn)
    if df is None:
        df = loadedDf
    else:
        loadedDf.info(verbose=True, null_counts=True)
        df.info(verbose=True, null_counts=True)
        df = df.merge(loadedDf, how='left', on=["MemberID", "SnapshotDate"])
        # df = df.fillna(0)
    print("DATA AFTER ALL MERGING", len(df))
    print("Length of data loaded:", len(loadedDf))
    print("Time to load data from sql", (time.time() - start))
I once faced the same problem as you. My solution was to filter as much as possible in the SQL layer. Since I don't have your code and your DB, what I write below is untested code and very possibly contains bugs. You will have to correct them as needed.
The idea is to read as little as possible from the DB. pandas is not designed to analyze frames of millions of rows (at least on a typical computer). To do that, pass the filter criteria from df to your DB call:
from sqlalchemy import MetaData, and_, or_

engine = ...  # construct your SQLAlchemy engine. May correspond to your `conn` object
meta = MetaData()
meta.reflect(bind=engine, only=tablesToJoin)

for table in tablesToJoin:
    t = meta.tables[table]
    # Building the WHERE clause. This is equivalent to:
    #     WHERE ((MemberID = <MemberID 1>) AND (SnapshotDate = date))
    #        OR ((MemberID = <MemberID 2>) AND (SnapshotDate = date))
    #        OR ((MemberID = <MemberID 3>) AND (SnapshotDate = date))
    cond = or_(*[and_(t.c['MemberID'] == member_id, t.c['SnapshotDate'] == date)
                 for member_id in df['MemberID']])
    # Be frugal here: only select the columns that you need, or you will blow your memory.
    # t.select() with no column list is equivalent to a SELECT *.
    statement = t.select().where(cond)
    # Note that it's `read_sql`, not `read_sql_query` here
    loadedDf = pd.read_sql(statement, engine)
    # loadedDf should be much smaller now since you have already filtered it at the DB level
    # Now do your joins...
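As a rough, untested continuation of the loop, the join step could mirror the merge from the question; here result is a hypothetical accumulator, kept separate from the small df used for filtering:
    # (initialise result = None before the for-loop)
    if result is None:
        result = loadedDf
    else:
        result = result.merge(loadedDf, how='left', on=['MemberID', 'SnapshotDate'])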

Create/Replace a Big Query table with a new one generated by a Cloud Function

I am trying to get a table from BQ, transform it into a df, apply modifications to it and then upload it back to BQ replacing the former version.
This function is meant to run as a Google Cloud Function.
The Service Account that GCS uses has all the authorisations necessary to access BQ.
I have been using this so far but it's not working:
### EXTRACT DFP TABLE FROM BQ
import google.cloud.bigquery as bigquery
import pandas_gbq as pd

def execute_function(request):
    # Defines Client access & Project to be accessed
    client = bigquery.Client()
    SQL = """
    SELECT LINE_ITEM_NAME, TOTAL_LINE_ITEM_LEVEL_CLICKS
    FROM `DATASET.table_imported`
    WHERE DATE = DATE_ADD(CURRENT_DATE(), INTERVAL -1 DAY)
    AND LINE_ITEM_TYPE = 'PRICE_PRIORITY'
    ORDER BY TOTAL_LINE_ITEM_LEVEL_CLICKS DESC
    LIMIT 1000"""
    query_job = client.query(SQL)
    dfp = query_job.result()

    ### CREATE WRONG NAMING TABLE SORTED BY CLICKS DESCENDING
    dfp = dfp.to_dataframe()
    dfp.columns = ['li_name', 'clicks_number']
    dfp.set_index('li_name', drop=False, inplace=True)
    wrong_naming = pd.DataFrame()
    li_names = []
    clicks_numbers = []

    # Adds line items that have names with a number of fields not equal to 7, only if not already present
    for row in range(len(dfp)):
        if len(dfp.iloc[row, 0].split('_')) != 7 and dfp.iloc[row, 0] not in li_names:
            li_names.append(dfp.iloc[row, 0])
            clicks_numbers.append(dfp.iloc[row, 1])

    # Adds created lists to wrong_naming dataframe
    wrong_naming['li_name'] = li_names
    wrong_naming['clicks_number'] = clicks_numbers

    ### EXPORT / SAVE TABLE TO GBQ
    wrong_naming.to_gbq('DATA_LAKE_MODELING_US.wrong_naming',
                        if_exists='replace')
Although it worked when I ran the code on my local machine, it doesn't work in the cloud.
Here's the error message that I get: "Error: could not handle the request"
Any idea how I can solve this?

BigQuery insert/delete to table

I have a table X in BigQuery with 170,000 rows. The values in this table are based on complex calculations done on the values from a table Y. These are done in Python so as to automate the ingestion when Y gets updated.
Every time Y updates, I recompute the values needed for X in my script and insert them via streaming, using the script below:
def stream_data(table, json_data):
    data = json.loads(str(json_data))
    # Reload the table to get the schema.
    table.reload()
    rows = [data]
    errors = table.insert_data(rows)
    if not errors:
        print('Loaded 1 row into {}'.format(table))
    else:
        print('Errors:')
The problem here is that I have to delete all rows in the table before I insert. I know a query to do this, but it fails because BigQuery does not allow DML while there is a streaming buffer on the table, and apparently this lasts for a day.
Is there a workaround where I can delete all rows in X, recompute based on Y, and then insert the new values using the code above?
Possibly by turning the streaming buffer off?
Another option would be to drop the whole table and recreate it. But my table is huge, with 60 columns, and the JSON for the schema would be huge. I couldn't find samples showing how to create a new table with a schema passed from JSON or a file; some samples of this would be great (a rough sketch follows the next paragraph).
A third option is to make the streaming insert smart enough that it does an update instead of an insert if the row has changed. This again is a DML operation and goes back to the original problem.
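For the second option, here is a rough, untested sketch of creating a table from a schema stored in a JSON file, using the current google-cloud-bigquery client (whose API differs from the older one used in the snippets here); the file name and table path are placeholders:
from google.cloud import bigquery

client = bigquery.Client("myproject")
# schema.json is a placeholder: a list of {"name": ..., "type": ..., "mode": ...} objects,
# e.g. as produced by `bq show --schema --format=prettyjson mydataset.test`
schema = client.schema_from_json("schema.json")
table = bigquery.Table("myproject.mydataset.test", schema=schema)
table = client.create_table(table)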
UPDATE:
Another approach I tried is to delete the table and recreate it. Before the delete I copy the schema so I can set it on the new table:
def stream_data(json_data):
    bigquery_client = bigquery.Client("myproject")
    dataset = bigquery_client.dataset("mydataset")
    table = dataset.table("test")
    data = json.loads(json_data)
    schema = table.schema
    table.delete()
    table = dataset.table("test")
    # Set the table schema
    table = dataset.table("test", schema)
    table.create()
    rows = [data]
    errors = table.insert_data(rows)
    if not errors:
        print('Loaded 1 row')
    else:
        print('Errors:')
This gives me an error:
ValueError: Set either 'view_query' or 'schema'.
UPDATE 2:
The key was to call table.reload() before schema = table.schema to fix the above!
