Google Cloud DataLab + BigQuery: how to set region/zone/location - python

I'm using Datalab for a Python notebook that loads data from Cloud Storage into BigQuery, basically following this example.
I then saw that my original data in the Cloud Storage bucket is in the EU (europe-west3), and the VM that executes Datalab is in the same region, but the final data in BigQuery ends up in the US.
According to this post I tried setting the location for the dataset in code, but that did not work: no such option is defined in the datalab.bigquery Python module.
So my question is: how do I set the location (region) for the BigQuery dataset and the tables it contains?
This is my code:
# data: https://www.kaggle.com/benhamner/sf-bay-area-bike-share/data
%%gcs read --object gs://my_bucket/kaggle/station.csv --variable stations
import datalab.bigquery as bq
import pandas as pd
from io import StringIO

# CSV will be read as bytes first
df_stations = pd.read_csv(StringIO(stations))
schema = bq.Schema.from_data(df_stations)

# Create an empty dataset
#bq.Dataset('kaggle_bike_rentals').create(location='europe-west3-a')
bq.Dataset('kaggle_bike_rentals').create()

# Create an empty table within the dataset
table_stations = bq.Table('kaggle_bike_rentals.stations').create(schema=schema, overwrite=True)

# Load data directly from Cloud Storage into the BigQuery table;
# the locally loaded pandas DataFrame is not used here
table_stations.load('gs://my_bucket/kaggle/station.csv', mode='append',
                    source_format='csv', csv_options=bq.CSVOptions(skip_leading_rows=1))
Update: meanwhile I manually created the dataset in the BigQuery web UI and used it from code without creating it there. Now an exception is raised if the dataset does not exist, which prevents accidentally creating one in code that would end up in the default US location.

Have you tried bq.Dataset('[your_dataset]').create(location='EU')?

BigQuery locations are set on a dataset level. Tables take their location based on the dataset they are in.
Setting the location of a dataset at least outside of Datalab:
from google.cloud import bigquery
bigquery_client = bigquery.Client(project='your_project')
dataset_ref = bigquery_client.dataset('your_dataset_name')
dataset = bigquery.Dataset(dataset_ref)
dataset.location = 'EU'
dataset = bigquery_client.create_dataset(dataset)
Based on the code snippet from here: https://cloud.google.com/bigquery/docs/datasets

Related

BigQuery: too many table dml insert operations for this table

I'm trying to import more than 200M records from different computers (n=20) into my BigQuery table via the Python client. Each computer runs a job (with multiple rows) every 10 seconds.
from google.cloud import bigquery
import os

os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.getcwd() + '/api.json'
print('Import job started, total rows:' + str(len(queries)))
client = bigquery.Client()
for q in queries:
    results = client.query(q)
    for err in results:
        print(err)
But I'm getting the following error:
google.api_core.exceptions.Forbidden: 403 Exceeded rate limits: too
many table dml insert operations for this table. For more information,
see https://cloud.google.com/bigquery/troubleshooting-errors
The data are generated at run time, so I have to import them at run time. I'm also not sure BigQuery is the right tool for this; Spanner seems better suited, but it costs too much.
How can I avoid this error?
Thank you very much.
There are 4 major ways to insert data into BigQuery tables.
Batch load a set of data records.
Stream individual records or batches of records.
Use queries to generate new data and append or overwrite the results to a table.
Use a third-party application or service.
I think you are using the 3rd option, DML INSERT. It's not designed for large-scale, high-frequency data loading.
In your use case, it seems the 2nd option, streaming data, could be a good fit.
Example:
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of table to append to.
# table_id = "your-project.your_dataset.your_table"

rows_to_insert = [
    {u"full_name": u"Phred Phlyntstone", u"age": 32},
    {u"full_name": u"Wylma Phlyntstone", u"age": 29},
]

errors = client.insert_rows_json(table_id, rows_to_insert)  # Make an API request.
if errors == []:
    print("New rows have been added.")
else:
    print("Encountered errors while inserting rows: {}".format(errors))
You can find more details here:
https://cloud.google.com/bigquery/streaming-data-into-bigquery
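For completeness, the 1st option (batch loading) also sidesteps the DML rate limit entirely: instead of one INSERT per job, each of the 20 machines could write its rows to files in Cloud Storage, and a single periodic load job could ingest them all. A minimal sketch, assuming the rows are dicts; the bucket path, table ID, and the commented client calls are placeholders:

```python
import json

def to_ndjson(rows):
    # BigQuery load jobs accept newline-delimited JSON: one JSON object per line
    return "\n".join(json.dumps(row, sort_keys=True) for row in rows)

payload = to_ndjson([
    {"full_name": "Phred Phlyntstone", "age": 32},
    {"full_name": "Wylma Phlyntstone", "age": 29},
])

# Upload `payload` as e.g. gs://your-bucket/batch/0001.json, then run one load job
# over all accumulated files (wildcard URI); load jobs are not subject to the DML limit:
# job_config = bigquery.LoadJobConfig(
#     source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON)
# client.load_table_from_uri("gs://your-bucket/batch/*.json",
#                            table_id, job_config=job_config).result()
```

Compared to streaming inserts, batch loads are free but add latency between generation and availability, so the choice depends on how fresh the data needs to be.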

How to increase network efficiency on a Google Cloud VM

I'm currently building an ETL on a Google Cloud based VM (Windows Server 2019 - 4 vCPUs) to execute the following process:
Extract some tables from a MySQL replica db
Adjust data types for Google BigQuery conformities
Upload the data to BigQuery using Python's pandas_gbq library
To illustrate, here are some parts of the actual code (Python, iterating over one table):
while True:
    # Generate a MySQL query based on the columns and their respective
    # types, using a dictionary to convert MySQL d_types to Python d_types
    sql_query = gen_query(cols_dict=col_types, table=table,
                          pr_key=p_key, offset=offset)

    cursor = cnx.cursor(buffered=True)
    cursor.execute(sql_query)
    if cursor.rowcount == 0:
        break

    num_fields = len(cursor.description)
    field_names = [i[0] for i in cursor.description]
    records = cursor.fetchall()
    df = pd.DataFrame(records, columns=columns)
    offset += len(df.index)
    print('Ok, df structured')

    # Check for datetime columns
    col_parse_date = []
    for column in columns:
        if col_types[column] == 'datetime64':
            try:
                df[column] = df[column].astype(col_types[column])
                col_parse_date.append(column)
            except:
                df[column] = df[column].astype(str)
                for i in to_bgq:
                    if i['name'] == column:
                        i['type'] = 'STRING'

    # Upload dataframe to Google BigQuery
    df.to_csv('carga_etl.csv', float_format='%.2f',
              index=False, sep='|')
    print('Ok, csv recorded')
    df = ''
    df = pd.read_csv('carga_etl.csv', sep='|')
    print('Ok, csv read')
    df.to_gbq(destination_table='tr.{}'.format(table),
              project_id='iugu-bi', if_exists='append', table_schema=to_bgq)
The logic is based on a query generator: it reads the MySQL table schema, adjusts it to BigQuery formats (e.g. blob to STRING, int(n) to INTEGER, etc.), queries the full results (paginated with an offset, 500K rows per page) and saves them in a DataFrame, which is then uploaded to my new database.
Well, the ETL does its job, and I'm currently migrating my tables to the cloud. However, I'm worried that I'm underutilizing my resources because of gaps in the network traffic. Here is the network report (bytes/sec) from my VM's reporting section:
VM Network Bytes report
According to that report, my in/out network traffic peaks at 2 to 3 MB/s, which is really low compared to the roughly 1 GB/s available if I use the machine to download something from my browser, for example.
My point is, what am I doing wrong here? Is there any way to increase my MySQL query/fetch speed and my upload speed to BigQuery?
I understand that you are transforming datetime64 into a BigQuery-compatible data type; correct me if I am wrong.
I have a few recommendations:
You can use Dataflow, as it is an ETL product optimized for performance.
Depending on your overall use case, and if you are using Cloud SQL/MySQL, you can use BigQuery federated queries.
Again depending on your use case, you could use a MySQL dump and upload the data to GCS or directly to BigQuery.
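To sketch the federated-query route: once a BigQuery connection resource pointing at the Cloud SQL instance exists (the connection path and table names below are placeholders, not real resources), EXTERNAL_QUERY pushes the SELECT down to MySQL and streams the rows into BigQuery without passing through the VM at all:

```python
# Hypothetical connection resource, created beforehand in BigQuery
connection = "projects/iugu-bi/locations/us/connections/my-mysql-conn"
inner_sql = "SELECT * FROM db1.table1"

# EXTERNAL_QUERY runs inner_sql on the Cloud SQL side; the outer INSERT
# appends the result to the destination table entirely server-side
query = (
    "INSERT INTO tr.my_table\n"
    "SELECT * FROM EXTERNAL_QUERY('{}', '{}')".format(connection, inner_sql)
)

# Submitting it would look like:
# from google.cloud import bigquery
# bigquery.Client(project='iugu-bi').query(query).result()
```

This removes both the MySQL fetch and the pandas_gbq upload from the VM's network path, which is where the 2 to 3 MB/s bottleneck sits.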

Upload of a Pandas DataFrame to Google Cloud Storage is adding unwanted empty lines

I am trying to upload a DataFrame to Cloud Storage and then to BigQuery, but I keep getting errors on the BigQuery import. I believe this is because the file I have written to Cloud Storage has unwanted empty lines, e.g.
I'm expecting this to be written to the file:
product, quantityWarehouse, quantityStore
'12345',1,3
But instead I get the following:
product, quantityWarehouse, quantityStore

'12345',1,3
This is my code:
import pandas as pd

my_data = pd.DataFrame([['12345', 1, 3]],
                       columns=["productId", "quantityWarehouse", "quantityStore"])
my_data.to_csv("gs://my_bucket/my_data.csv", index=False)
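A commonly reported cause of such extra blank lines is newline translation on Windows: when to_csv writes through a file object opened in default text mode, its \r\n row terminators get expanded to \r\r\n, which CSV readers render as an empty row after every record. A hedged sketch of the usual workaround, shown on a local path (the file would then be uploaded to the bucket); opening the target with newline='' disables the extra translation:

```python
import pandas as pd

my_data = pd.DataFrame(
    [['12345', 1, 3]],
    columns=["productId", "quantityWarehouse", "quantityStore"],
)

# newline='' hands newline handling to the CSV writer itself, so no
# doubled terminators (and hence no blank lines) are written
with open("my_data.csv", "w", newline="") as f:
    my_data.to_csv(f, index=False)

with open("my_data.csv", "rb") as f:
    raw = f.read()
```

Whether this applies here depends on the runtime writing the gs:// object; checking the uploaded file's raw bytes for \r\r\n sequences would confirm it.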

Python loads data from Redshift to S3

I am trying to load some Redshift query results to S3. So far I am using pandas_redshift but I got stuck:
import pandas_redshift as pr

pr.connect_to_redshift(dbname='dbname',
                       host='xxx.us-east-1.redshift.amazonaws.com',
                       port=5439,
                       user='xxx',
                       password='xxx')

pr.connect_to_s3(aws_access_key_id='xxx',
                 aws_secret_access_key='xxx',
                 bucket='dxxx',
                 subdirectory='dir')
And here is the data that I want to dump to S3:
sql_statement = '''
select
provider,
provider_code
from db1.table1
group by provider, provider_code;
'''
df = pr.redshift_to_pandas(sql_statement)
The df was created successfully, but how do I do the next step, which is putting this DataFrame into S3?
The method you are looking at is very inefficient.
To do this the right way, you need a way to run SQL on Redshift, e.g. from Python.
The following SQL should be run:
unload ('select provider, provider_code
from db1.table1
group by provider, provider_code;')
to 's3://mybucket/myfolder/unload/'
access_key_id '<access-key-id>'
secret_access_key '<secret-access-key>';
See here for the documentation.
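One way to run that statement from Python is to build it as a string and execute it over any Redshift connection. A minimal sketch; the helper and the commented psycopg2 connection details are placeholders, not part of pandas_redshift:

```python
def build_unload(select_sql, s3_path, access_key, secret_key):
    """Wrap an inner SELECT in a Redshift UNLOAD statement."""
    escaped = select_sql.replace("'", "''")  # escape quotes inside the SQL literal
    return (
        "unload ('{}')\n"
        "to '{}'\n"
        "access_key_id '{}'\n"
        "secret_access_key '{}';".format(escaped, s3_path, access_key, secret_key)
    )

sql = build_unload(
    "select provider, provider_code from db1.table1 group by provider, provider_code",
    "s3://mybucket/myfolder/unload/",
    "<access-key-id>",
    "<secret-access-key>",
)

# Execute it over any Redshift connection, e.g. with psycopg2:
# import psycopg2
# conn = psycopg2.connect(host='xxx.us-east-1.redshift.amazonaws.com',
#                         port=5439, dbname='dbname', user='xxx', password='xxx')
# conn.cursor().execute(sql)
# conn.commit()
```

This way the data flows from Redshift straight to S3, never touching the machine running the Python script.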
As Jon Scott mentions, if your goal is to move data from Redshift to S3, then the pandas_redshift package is not the right method. The package is meant to let you easily move data from Redshift to a pandas DataFrame on your local machine, or from a local pandas DataFrame to Redshift. It is worth noting that the command you already ran:
df = pr.redshift_to_pandas(sql_statement)
pulls the data directly from Redshift to your computer without involving S3 at all. However, this command:
pr.pandas_to_redshift(df, 'schema.your_new_table_name')
copies the DataFrame to a CSV in S3 and then runs a query to copy that CSV into Redshift (this step requires that pr.connect_to_s3 ran successfully). It does not clean up the S3 bucket afterwards, so a side effect is that the data ends up in the bucket you specify.

How to define BigQuery schema using Standard SQL?

I'd like to use BigQuery Standard SQL in a new project, however I am not able to find any examples on how to define the schema, everything points at Legacy SQL. In particular, I want to use ARRAY and STRUCT.
One way to create a table in BigQuery is by using the API calls. There is no CREATE TABLE syntax.
Creating a table
BigQuery offers various ways to create a new table as detailed here:
You can create an empty table by using the command line tool's bq mk command or by using the BigQuery API tables.insert() method.
You can load a table from a CSV or JSON data file (compressed or uncompressed), from an Avro file, or from a Cloud Datastore backup.
You can create a table from a query result.
You can copy a table.
You can define a table over a file in Cloud Storage.
You can use standard SQL types when you define your table schema (see Elliott's answer), and there is a ticket open about updating the docs as well. Vote/star here.
Lots of Python samples are on GitHub, as simple as:
def create_table(dataset_name, table_name, project=None):
    """Creates a simple table in the given dataset.

    If no project is specified, then the currently active project is used.
    """
    bigquery_client = bigquery.Client(project=project)
    dataset = bigquery_client.dataset(dataset_name)

    if not dataset.exists():
        print('Dataset {} does not exist.'.format(dataset_name))
        return

    table = dataset.table(table_name)

    # Set the table schema
    table.schema = (
        bigquery.SchemaField('Name', 'STRING'),
        bigquery.SchemaField('Age', 'INTEGER'),
        bigquery.SchemaField('Weight', 'FLOAT'),
    )

    table.create()
    print('Created table {} in dataset {}.'.format(table_name, dataset_name))
You can create a table with a schema that uses standard SQL types. Here is an example of a valid schema:
{
    "a": "ARRAY<STRUCT<x INT64, y STRING>>",
    "b": "STRUCT<z DATE>",
    "c": "INT64"
}
If you put this in a file such as sample_schema.json, you can create a table from it using bq mk:
bq mk --schema sample_schema.json -t your_dataset.YourTableName
Outside of the bq client, the tables.insert API also supports standard SQL type names.
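For the tables.insert route, standard SQL composite types correspond to RECORD fields: a STRUCT becomes a RECORD whose sub-fields are the struct members, and wrapping it in ARRAY sets the mode to REPEATED (with INT64 spelled INTEGER in the legacy type names the API uses). A small illustrative helper; it is hypothetical, not part of any client library:

```python
def array_of_struct(name, members):
    # ARRAY<STRUCT<...>> maps to a RECORD field in REPEATED mode whose
    # "fields" entry lists the struct members (legacy names: INT64 -> INTEGER)
    return {
        "name": name,
        "type": "RECORD",
        "mode": "REPEATED",
        "fields": [{"name": n, "type": t} for n, t in members],
    }

# The field "a": "ARRAY<STRUCT<x INT64, y STRING>>" from the example above:
field_a = array_of_struct("a", [("x", "INTEGER"), ("y", "STRING")])
```

The resulting dict is the shape a tables.insert schema entry takes for that column.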
