BigQuery execution time

BigQuery execution time - python

After I send a query to be executed in BigQuery, how can I find the execution time and other statistics about this job?
The Python API for BigQuery has a suggestive field timeline, but barely mentioned in the documentation. However, this method returns an empty iterable.
So far, I have been running this Python code, with the BigQuery Python API.
from google.cloud import bigquery
bq_client = bigquery.Client()
job = bq_client.query(some_sql_query, location="US")
ts = list(job.timeline)
if len(ts) > 0:
for t in ts:
print(t)
else:
print('No timeline results ???') # <-- no timeline results.

When your job is "DONE", the BigQuery API returns a JobStatistics object (described here) that is accessible in Python through the attributes of your job object.
The available attrs are accessible here.
For accessing the time taken by your job you mainly have the job.created, job.started and job.ended attributes.
To come back to your code snippet, you can try something like this:
from google.cloud import bigquery
bq_client = bigquery.client()
job = bq_client.query(some_sql_query, location="US")
while job.running():
print("Running..")
sleep(0.1)
print("The job duration was {}".format(job.ended - job.started))

In case you need to have job logs available after some time and ready for analysis I'd suggest creating a log sink for your big query jobs. It would create table in big query, where job logs are dumped. After that you can easily run analysis to determine how long they took and how much they cost.
First create a sink in Operations/Logging:
Make sure to set sink as BigQuery Dataset (pick one) and add query_job_completed as logs to include in the sink. After a while you will see new table in that data set similar to: {project}.{dataset}.cloudaudit_googleapis_com_data_access_{date}
Go you BigQuery and create a view. This view for example will show how much data each job consumed (and how much it cost):
SELECT
protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes as totalBilledBytes
,protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.endTime as endTime
,protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.query
,protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.datasetId targetDataSet
,protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobConfiguration.query.destinationTable.tableId targetTable
,protopayload_auditlog.authenticationInfo.principalEmail
,(protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes / 1000000000000.0) * 5.0 as billed
FROM `{project}.{dataset}.cloudaudit_googleapis_com_data_access_*`
where protopayload_auditlog.servicedata_v1_bigquery.jobCompletedEvent.job.jobStatistics.totalBilledBytes > 0

Related

generate dynamic task using hooks without running them in backend

I have a simple dag - that takes argument from mysql db - (like sql, subject)
Then I have a function creating report out and send to particular email.
Here is code snippet.
def s_report(k,**kwargs):
body_sql = list2[k][4]
request1 = "({})".format(body_sql)
dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
df1 = dwh_hook.get_pandas_df(request1)
df2 = df1.to_html()
body_Text = list2[k][3]
html_content = f"""HI Team, Please find report<br><br>
{df2} <br> </br>
<b>Thank you!</b><br>
"""
return EmailOperator(task_id="send_email_snowflake{}".format(k), to=list2[k][1],
subject=f"{list2[k][2]}", html_content=html_content, dag=dag)
for j in range(len(list)):
mysql_list >> [ s_report(j)] >> end_operator
The s_report is getting generated dynamically, But the real problem is hook is continously submitting query in backend, While dag is stopped still its submitting query in backend.
I can use pythonoperator, but its not generating dynamic task.

A couple of things:
By looking at your code, in particular the lines:
for j in range(len(list)):
mysql_list >> [ s_report(j)] >> end_operator
we can determine that if your first task succeeds, namely, mysql_list, then the tasks downstream to it, namely, the s_report calls should begin executing. You have precisely len(list) of them. Within each s_report call there is exactly one dwh_hook.get_pandas_df(request) call, so I believe your DAG should be making len(list) calls of this type provided mysql_list task succeeds.
As for the mismatch you see in your Snowflake logs, I can't advise you here. I'd need more details. Keep in mind that the call get_pandas_df might have a retry mechanism (i.e. if cannot reach snowflake, retry) which might explain why your Snowflake logs show a bunch of requests.
If your DAG finishes successfully, (i.e. end_operator tasks finishes successfully), you are correct. There should be no requests in your Snowflake logs that came post-DAG end.
If you want more insight as to how your DAG interacts with your Snowflake resource, I'd suggest having a single s_report task like so:
mysql_list >> [ s_report(0)] >> end_operator
and see the behaviour in the logs.

How to insert data in redshift using either of boto3 or psycopg2 python libraries

Which library is best to use among "boto3" and "Psycopg2" for redshift operations in python lambda functions:
Lookup for a table in redshift cluster
Create a table in redshift cluster
Insert data in redshift cluster
I would appretiate if i am answered with following:
python code for either of the library that addresses all of the above 3 needs.
Thanks in Advance!!

Connecting directly to Redshift from Lambda with psycopg2 is the simpler, more straight-forward way to go but comes with a significant limitation. Lambda functions have run-time limits and even if your SQL commands don't exceed the max run-time, you will be paying for the Lambda function to wait for Redshift to complete the SQL. For fast-running SQL commands things run quickly and this isn't a problem but inserting data can take some time depending on the amount of data.
If all your Redshift actions are less than a few seconds (and won't grow longer with time) then psycopg2 connecting directly to Redshift is likely the way to go. If the data insert takes a minute or 2 BUT this process doesn't run very often (daily) then psycopg2 may still be the way to go as Lambda isn't very expensive when run in frequently. It is a process simplicity vs. cost calculation.
Using Redshift Data API is more complicated. This process lets you fire the SQL to Redshift and terminate the Lambda. A later running Lambda checks to see if the SQL has completed and the results of the SQL are checked. The SQL not completing means that Lambda needs to be invoke at a later time to see if things are complete. This polling process often is done by a Step Function and a set of different Lambda functions. Not super difficult but a level of complexity above a single Lambda. Since this is a polling process there is a wait time between checks for results which if too long leads to latency and if too short over-polling and additional costs.
If you need to have Data API for time-out reasons then you may want to use both psycopg2 for short running queries to the database - like 'does this table exist?'. Use Data API for long-running steps like 'insert this 1TB set of data into Redshift'.

Sample basic python code for all three operations using boto3.
import json
import boto3
clientdata = boto3.client('redshift-data')
# looks up table and returns true if found
def lookup_table(table_name):
response = clientdata.list_tables(
ClusterIdentifier='redshift-cluster-1',
Database='dev',
DbUser='awsuser',
TablePattern=table_name
)
print(response)
if ( len(response['Tables']) == 0 ):
return False
else:
return True
# creates table with one integer column
def create_table(table_name):
sqlstmt = 'CREATE TABLE '+table_name+' (col1 integer);'
print(sqlstmt)
response = clientdata.execute_statement(
ClusterIdentifier='redshift-cluster-1',
Database='dev',
DbUser='awsuser',
Sql=sqlstmt,
StatementName='CreateTable'
)
print(response)
# inserts one row with integer value for col1
def insert_data(table_name, dval):
print(dval)
sqlstmt = 'INSERT INTO '+table_name+'(col1) VALUES ('+str(dval)+');'
response = clientdata.execute_statement(
ClusterIdentifier='redshift-cluster-1',
Database='dev',
DbUser='awsuser',
Sql=sqlstmt,
StatementName='InsertData'
)
print(response)
result = lookup_table('date')
if ( result ):
print("Table exists.")
else:
print("Table does not exist!")
create_table("testtab")
insert_data("testtab", 11)
I am not using Lambda, instead executing it just from my shell. Hope this helps. Assuming credentials and default region are already set up for the client.

boto3 how to get the logstream form a sagemaker transform job?

i am able to crete the job and it fail, using boto3
import boto3
session = boto3.session.Session()
client = session.client('sagemaker')
descibe = client.describe_transform_job(TransformJobName="my_transform_job_name")
in the ui i can see the button to go to the logs, i can use boto3 to retrive the logs if hardcode the group name and the log-stream.
but how can i get the Log stream from the batch transfrom job? shouldnt be a field with logstream or something like that in the ".describe_transform_job"?

sagemaker doesnt provide a direct way to do it, the way to do it, is to also use the log client.
get the log streams corresponding to your batchtransform_job
client_logs = boto3.client('logs')
log_groups =
client_logs.describe_log_streams(logGroupName="the_log_group_name", logStreamNamePrefix=transform_job_name)
log_streams_names= []
for i in log_groups["logStreams"]:
log_streams_names.append(i["logStreamName"])
and this will give a list of "project_name/virtualMachine_id" that is the machines that your code was run depending on how many instances you set.
After you can run for each of the log_streams
for i_stream_name in log_streams_names:
client_logs.get_log_events("the_stream_log_name", "the_log_group_name")
now you can loop and print the lines of the log stream event =)

I need to scrape logs from cloud watch logs and load it to s3 and from s3 to data warehouse

I have several lambda functions. I need to scrape my logs generated from all of my lambda functions and load to our internal data warehouse. I thought of these solutions.
Have a lambda function subscribed to my lambda function's cloudwatch log groups and polish and log messages and push it to s3.
Pros: Works and simple to implement.
Cons: There is no way for me to
"replay". Say My exporter failed for some reason. I wouldn't be able
to replay this action.
Have a lambda function that runs every 10 min or so and creates export task and scrapes logs from cloudwatch and loads them to s3.
import boto3
client = boto3.client('logs')
response = client.create_export_task(
taskName='export_task',
logGroupName='/aws/lambda/<lambda_function_1>',
fromTime=from_time,
to=to_time,
destination='<application_logs>',
destinationPrefix='<lambda_function_1>'
)
response = client.create_export_task(
taskName='export_task',
logGroupName='/aws/lambda/<lambda_function_2>',
fromTime=from_time,
to=to_time,
destination='<application_logs>',
destinationPrefix='<lambda_function_2>'
)
Second create_export_task fails here
An error occurred (LimitExceededException) when calling the
CreateExportTask operation: Resource limit exceeded."
I cant create multiple export task. Is there a way to address this?

From AWS docs: One active (running or pending) export task at a time, per account. This limit cannot be changed.
U can use the below function to check if the status has been changed to 'COMPLETED'
response = client.create_export_task(
taskName='export_cw_to_s3',
logGroupName='/ecs/',
logStreamNamePrefix=org_id,
fromTime=int((yesterday-unix_start).total_seconds() * 1000),
to=int((today-unix_start).total_seconds() * 1000),
destination='test-bucket',
destinationPrefix=f'random-string/{today.year}/{today.month}/{today.day}/{org_id}')
taskId = (response['taskId'])
status = 'RUNNING'
while status in ['RUNNING','PENDING']:
response_desc = client.describe_export_tasks(
taskId=taskId
)
status = response_desc['exportTasks'][0]['status']['code']

Came across the same error message and the reason is you can only have one running/pending export task per account at a given time hence this task is failing. From AWS docs: One active (running or pending) export task at a time, per account. This limit cannot be changed.
https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/cloudwatch_limits_cwl.html

Sometimes one createExport task stays in pending state for long preventing other lambda functions with the same task to run. You could see this task and cancel it allowing the other functions to run.

Cassandra low performance?

I have to choose Cassandra or MongoDB(or another nosql database, I accept suggestions) for a project with a lot of inserts(1M/day).
So I create a small test to measure the write performance. Here's the code to insert in Cassandra:
import time
import os
import random
import string
import pycassa
def get_random_string(string_length):
return ''.join(random.choice(string.letters) for i in xrange(string_length))
def connect():
"""Connect to a test database"""
connection = pycassa.connect('test_keyspace', ['localhost:9160'])
db = pycassa.ColumnFamily(connection,'foo')
return db
def random_insert(db):
"""Insert a record into the database. The record has the following format
ID timestamp
4 random strings
3 random integers"""
record = {}
record['id'] = str(time.time())
record['str1'] = get_random_string(64)
record['str2'] = get_random_string(64)
record['str3'] = get_random_string(64)
record['str4'] = get_random_string(64)
record['num1'] = str(random.randint(0, 100))
record['num2'] = str(random.randint(0, 1000))
record['num3'] = str(random.randint(0, 10000))
db.insert(str(time.time()), record)
if __name__ == "__main__":
db = connect()
start_time = time.time()
for i in range(1000000):
random_insert(db)
end_time = time.time()
print "Insert time: %lf " %(end_time - start_time)
And the code to insert in Mongo it's the same changing the connection function:
def connect():
"""Connect to a test database"""
connection = pymongo.Connection('localhost', 27017)
db = connection.test_insert
return db.foo2
The results are ~1046 seconds to insert in Cassandra, and ~437 to finish in Mongo.
It's supposed that Cassandra it's much faster than Mongo inserting data. So , What i'm doing wrong?

There is no equivalent to Mongo's unsafe mode in Cassandra. (We used to have one, but we took it out, because it's just a Bad Idea.)
The other main problem is that you're doing single-threaded inserts. Cassandra is designed for high concurrency; you need to use a multithreaded test. See the graph at the bottom of http://spyced.blogspot.com/2010/01/cassandra-05.html (actual numbers are over a year out of date but the principle is still true).
The Cassandra source distribution has such a test included in contrib/stress.

If I am not mistaken, Cassandra allows you to specify whether or not you are doing a MongoDB-equivalent "safe mode" insert. (I dont recall the name of that feature in Cassandra)
In other words, Cassandra may be configured to write to disk and then return as opposed to the default MongoDB configuration which immediately returns after performing an insert without knowing if the insert was successful or not. It just means that your application never waits for a pass\fail from the server.
You can change that behavior by using safe mode in MongoDB but this is known to have a large impact on performance. Enable safe mode and you may see different results.

You will harness true power of Cassandra once you have multiple nodes running. Any node will be able to take a write request. Multithreading a client is only flooding more requests to same instance which is not going to help after a point.
Check cassandra log for the events that happen during your tests. Cassandra will initiate a disk write once the Memtable is full (this is configurable, make it large enough and you will be dealing on in RAM + disk writes of commit log). If disk write for Memtable happen during your test then it will slow it down. I do not know when MongoDB writes to disk.

Might I suggest taking a look at Membase here? It's used in exactly the same way as memcached and is fully distributed so you can continuously scale your write input rate simply by adding more servers and/or more RAM.
For this case, you'll definitely want to go with a client-side Moxi to give you the best performance. Take a look at our wiki: wiki.membase.org for examples and let me know if you need any further instruction...I'm happy to walk you through it and I'm certain that Membase can handle this load easily.

Create batch mutator for doing
multiple insert, update, and remove
operations using as few roundtrips as
possible.
http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.batch
Batch mutator helped me reduce insert time in at least half

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.