I am reading a table (200,000 rows, 40 columns) from BigQuery in a Flask application (using the pandas-gbq library); the goal is to display some summary data as HTML tables in the UI.
An upstream server returns a 504 (gateway timeout) error response when I attempt to read this table from my Flask function, but the same query succeeds when I run it from the Cloud Console.
I have tried reading the data in chunks, but this has not solved the issue.
import pandas_gbq

project_id = "<projectid>"
bq_table = request.form.get('bq_table')
query = "SELECT * FROM `<dataset>.{bq_table}`;".format(bq_table=bq_table)
# read_gbq already returns a DataFrame, so no extra pd.DataFrame() wrapper is needed.
df = pandas_gbq.read_gbq(query, project_id=project_id, dialect='standard')
I am expecting the query to run synchronously (not ideal, I know, but this is for an ad-hoc DQ tool).
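For context, the call sits inside a Flask view along these lines (trimmed sketch; the route name and the summary/rendering step are just illustrative):
import pandas_gbq
from flask import Flask, request

app = Flask(__name__)

@app.route('/summary', methods=['POST'])  # illustrative route
def summary():
    project_id = "<projectid>"
    bq_table = request.form.get('bq_table')
    query = "SELECT * FROM `<dataset>.{bq_table}`".format(bq_table=bq_table)
    # read_gbq blocks until the full result set has been downloaded.
    df = pandas_gbq.read_gbq(query, project_id=project_id, dialect='standard')
    # Summarize and render as an HTML table for the UI.
    return df.describe().to_html()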
I'm trying to load data from a pandas DataFrame into a Redshift cluster using AWS Lambda. I can't use a connector with the Redshift endpoint URL because the current VPC setup doesn't allow that connection.
What I can do is use the Redshift Data API, which works like this:
import time
import boto3

redshift_data_api_client = boto3.client('redshift-data')
redshift_database = 'my_redshift_db'
DbUser = 'my_user'
ClusterIdentifier = 'my_redshift_cluster'

query = '''
select *
from some_schema.some_table t
limit 10
'''

# Submit the statement, give it time to finish, then fetch the result set.
res = redshift_data_api_client.execute_statement(
    Database=redshift_database, DbUser=DbUser, Sql=query,
    ClusterIdentifier=ClusterIdentifier)
time.sleep(10)
query_id = res["Id"]
response = redshift_data_api_client.get_statement_result(Id=query_id)
The problem is that I haven't been able to integrate the Redshift Data API with a pandas dataframe. Ideally, I would like to be able to do something like:
redshift_data_api_client.insert_from_pandas(table, my_dataframe)
If that's not an option, I'd like to generate the INSERT SQL statement as a string from the DataFrame, so I could do:
insert_query = my_dataframe.get_insert_sql_statement()
res = redshift_data_api_client.execute_statement(
    Database=redshift_database,
    DbUser=DbUser,
    Sql=insert_query,
    ClusterIdentifier=ClusterIdentifier)
But I couldn't find a way to do that either. pandas has a to_sql function, but it sends the data directly to a database connection (which I don't have); it doesn't generate the INSERT statement as a string.
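The closest workaround I can think of is building the INSERT string myself, roughly like this (untested sketch; the value quoting is naive, assumes simple column types, and offers no protection against SQL injection):
import pandas as pd

def df_to_insert_sql(table_name, df):
    # Render each value: NULL for missing, bare numbers, single-quoted strings.
    def render(value):
        if pd.isna(value):
            return "NULL"
        if isinstance(value, (int, float)):
            return str(value)
        return "'{}'".format(str(value).replace("'", "''"))

    cols = ", ".join(df.columns)
    rows = ", ".join(
        "({})".format(", ".join(render(value) for value in row))
        for row in df.itertuples(index=False, name=None)
    )
    return "INSERT INTO {} ({}) VALUES {}".format(table_name, cols, rows)

insert_query = df_to_insert_sql('some_schema.some_table', my_dataframe)
res = redshift_data_api_client.execute_statement(
    Database=redshift_database, DbUser=DbUser, Sql=insert_query,
    ClusterIdentifier=ClusterIdentifier)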
Any help would be greatly appreciated :)
I am receiving a data drop into my GCS bucket daily and have a Cloud Function that loads that CSV data into a BigQuery table (see code below).
import datetime

def load_table_uri_csv(table_id):
    # [START bigquery_load_table_gcs_csv]
    from google.cloud import bigquery

    # Construct a BigQuery client object.
    client = bigquery.Client()

    # TODO(developer): Set table_id to the ID of the table to create.
    table_id = "dataSet.dataTable"

    job_config = bigquery.LoadJobConfig(
        write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )

    uri = "gs://client-data/team/looker-client-" + str(datetime.date.today()) + ".csv"

    load_job = client.load_table_from_uri(
        uri, table_id, job_config=job_config
    )  # Make an API request.
    load_job.result()  # Waits for the job to complete.

    destination_table = client.get_table(table_id)  # Make an API request.
    print("Loaded {} rows.".format(destination_table.num_rows))
    # [END bigquery_load_table_gcs_csv]
However, the data comes with a 2-day look-back, resulting in repeated rows in the BigQuery table.
Is there a way for me to update this Cloud Function to only pull in the most recent date from the CSV once it is dropped off? That way I can easily avoid duplicate data in the reporting.
Or maybe there's a way for me to run a scheduled query via BigQuery to resolve this?
For reference, the date column within the CSV comes in with a TIMESTAMP schema.
Any and all help is appreciated!
Unfortunately, there seems to be no way to do this directly in Google Cloud Platform. You will need to filter your information somehow before loading it.
You could filter the rows from the CSV in your code or through another medium.
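For example, one pattern that could work (just a sketch; the staging table name and the event_timestamp column are placeholders for your schema) is to load each daily drop into a staging table and then append only the rows newer than anything already in the destination:
from google.cloud import bigquery

client = bigquery.Client()

# Placeholders: adjust the table names and the timestamp column to your schema.
staging_table = "dataSet.dataTable_staging"
final_table = "dataSet.dataTable"

# Step 1: point the existing load job at staging_table instead, using
# WRITE_TRUNCATE so it only ever holds the latest CSV drop.

# Step 2: append only the rows newer than what has already been loaded.
insert_sql = """
INSERT INTO `{final}`
SELECT s.*
FROM `{staging}` AS s
WHERE s.event_timestamp > (
  SELECT IFNULL(MAX(event_timestamp), TIMESTAMP '1970-01-01')
  FROM `{final}`
)
""".format(final=final_table, staging=staging_table)
client.query(insert_sql).result()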
It's also possible to submit a feature request for Google to consider this functionality.
I have a SQLAlchemy connection set up to Snowflake which works, as I can run some queries and get results back. The query attempts are also logged in my query history.
My connection:
from sqlalchemy import create_engine
from snowflake.sqlalchemy import URL

engine = create_engine(URL(
    user=user, password=password, account=account,
    database=database, warehouse=warehouse, role=role
))
connection = engine.connect()
However, most of the time my queries fail with an OperationalError (i.e. it's a Snowflake error: https://docs.sqlalchemy.org/en/13/errors.html#error-e3q8), but these same queries run fine in the Snowflake web UI.
For example, if I run
test_query = 'SELECT * FROM TABLE DB1.SCHEMA1.TABLE1'
test = pd.read_sql(test_query, connection)
When I look at my query_history, it shows the SQLAlchemy-generated query failing, then a second later the base query itself running successfully. However, I'm not sure where that output goes in the Snowflake setup, and why it's not coming back through my SQLAlchemy connection. What I'm seeing:
Query = 'DESC TABLE /* sqlalchemy:_has_object */ "SELECT * FROM DB1"."SCHEMA1"."TABLE1"
Error code = 2003 Error message = SQL compilation error: Database
'"SELECT * FROM DB1" does not exist.
Then, one second later, the query itself runs successfully, but it's not clear where the result goes, as it doesn't come back over the connection.
Query = SELECT * FROM TABLE DB1.SCHEMA1.TABLE1
Any help much appreciated!
Thanks
You can also try adding the schema here:
engine = create_engine(URL(
    account='',
    user='',
    password='',
    database='',
    schema='',
    warehouse='',
    role='',
))
connection = engine.connect()
It is very unlikely that a query runs in the web UI but fails with a syntax error when connected via the CLI or another client.
I suggest you print the exact query that is sent via the CLI or connector, run the same query in the web UI, and also note which role you're running the query as.
Please share your findings.
The mentioned query (SELECT * FROM TABLE DB1.SCHEMA1.TABLE1) is not valid Snowflake SQL syntax.
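For example, a plain SELECT without the TABLE keyword should go through the same connection (a quick sketch, reusing the connection from the question):
import pandas as pd

# Same fully qualified table name as in the question, minus the TABLE keyword.
test_query = 'SELECT * FROM DB1.SCHEMA1.TABLE1'
test = pd.read_sql(test_query, connection)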
The link here will give you more details.
Hope this helps!
I'm currently trying to build a data pipeline from an AWS Athena database so my team can query information using Python. However, I'm running into an issue with insufficient permissions.
We are able to query the data in Tableau, but we wanted to integrate it into an app we are developing.
Here is the code we followed from PyAthena's documentation.
from pyathena import connect
import pandas as pd

conn = connect(aws_access_key_id='YOUR_ACCESS_KEY_ID',
               aws_secret_access_key='YOUR_SECRET_ACCESS_KEY',
               s3_staging_dir='s3://YOUR_S3_BUCKET/path/to/',
               region_name='us-west-2')
df = pd.read_sql("SELECT * FROM many_rows", conn)
print(df.head())
Here is the resulting error.
OperationalError: Insufficient permissions to execute the query. User: arn:aws:iam::OUR_ADDRESS:user/USER is not authorized to perform: glue:GetTable on resource: arn:aws:glue:us-west-2:OUR_ADDRESS:table/default/OUR_DATABASE
I'm guessing that this is an issue with IAM permissions on the server side with respect to AWS Glue (Athena uses the Glue Data Catalog for table metadata), but I'm not sure how to resolve it.
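If it matters, my rough guess at the fix is attaching Glue read permissions to the user, along these lines (a sketch via boto3 just to show the policy shape; the action list, policy name, and wildcard resource are guesses, and in practice an admin would probably attach this in the IAM console):
import json
import boto3

iam = boto3.client('iam')

# Hypothetical inline policy granting the Glue catalog reads Athena needs.
glue_read_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetDatabases",
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetPartition",
                "glue:GetPartitions",
            ],
            "Resource": "*",
        }
    ],
}

iam.put_user_policy(
    UserName='USER',
    PolicyName='AthenaGlueReadAccess',
    PolicyDocument=json.dumps(glue_read_policy),
)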
This is simple code to export from BigQuery to Google Cloud Storage in CSV format:
from google.cloud import bigquery

def export_data():
    client = bigquery.Client()
    project = 'xxxxx'
    dataset_id = 'xxx'
    table_id = 'xxx'
    bucket_name = 'xxx'
    destination_uri = 'gs://{}/{}'.format(bucket_name, 'EXPORT_FILE.csv')

    dataset_ref = client.dataset(dataset_id, project=project)
    table_ref = dataset_ref.table(table_id)

    extract_job = client.extract_table(
        table_ref,
        destination_uri,
        # Location must match that of the source table.
        location='EU')  # API request
    extract_job.result()  # Waits for job to complete.

    print('Exported {}:{}.{} to {}'.format(
        project, dataset_id, table_id, destination_uri))
It works perfectly for regular tables, BUT when I try to export data from a saved table VIEW, it fails with this error:
BadRequest: 400 Using table xxx:xxx.xxx#123456 is not allowed for this operation because of its type. Try using a different table that is of type TABLE.
Is there any way to export data from a table view?
What I'm trying to achieve is to get the data from BigQuery in CSV format and upload it to Google Analytics Product Data.
BigQuery views are subject to a few limitations:
You cannot run a BigQuery job that exports data from a view.
There are 10+ other limitations which I didn't post in the answer, as they might change over time. Follow the link to read all of them.
You need to query your view and write the results to a destination table, and then issue an export job on the destination table.
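A minimal sketch of that pattern (reusing the question's redacted names; the destination table used to materialize the view is made up):
from google.cloud import bigquery

client = bigquery.Client()

# Placeholders matching the question's redacted names; the destination
# table name for materializing the view is illustrative.
view = 'xxxxx.xxx.my_view'
dest_table = 'xxxxx.xxx.my_view_export'
destination_uri = 'gs://xxx/EXPORT_FILE.csv'

# Step 1: materialize the view into a regular table.
job_config = bigquery.QueryJobConfig(
    destination=dest_table,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)
client.query('SELECT * FROM `{}`'.format(view), job_config=job_config).result()

# Step 2: export the materialized table to Cloud Storage as CSV.
client.extract_table(dest_table, destination_uri, location='EU').result()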