I'm currently building an ETL on a Google Cloud based VM (Windows Server 2019 - 4 vCPUs) to execute the following process:
Extract some tables from a MySQL replica db
Adjust data types for Google BigQuery conformities
Upload the data to BigQuery using Python's pandas_gbq library
To illustrate, here are some parts of the actual code (Python, iterator over one table):
while True:
# GENERATES AN MYSQL QUERY BASED ON THE COLUMNS AND THEIR
# RESPECTIVE TYPES, USING A DICTIONARY TO CONVERT
# MYSQL D_TYPES TO PYTHON D_TYPES
sql_query = gen_query(cols_dict=col_types, table=table,
pr_key=p_key, offset=offset)
cursor = cnx.cursor(buffered=True)
cursor.execute(sql_query)
if cursor.rowcount == 0:
break
num_fields = len(cursor.description)
field_names = [i[0] for i in cursor.description]
records = cursor.fetchall()
df = pd.DataFrame(records, columns=columns)
offset += len(df.index)
print('Ok, df structured')
# CHECK FOR DATETIME COLUMNS
col_parse_date = []
for column in columns:
if col_types[column] == 'datetime64':
try:
df[column] = df[column].astype(col_types[column])
col_parse_date.append(column)
except:
df[column] = df[column].astype(str)
for i in to_bgq:
if i['name'] == column:
i['type'] = 'STRING'
# UPLOAD DATAFRAME TO GOOGLE BIGQUERY
df.to_csv('carga_etl.csv', float_format='%.2f',
index=False, sep='|')
print('Ok, csv recorded')
df = ''
df = pd.read_csv('carga_etl.csv', sep='|')
print('Ok, csv read')
df.to_gbq(destination_table='tr.{}'.format(table),
project_id='iugu-bi', if_exists='append', table_schema=to_bgq)
The logic is based on a query generator; it gets the MySQL table Schema and adjusts it to BigQuery formats (e.g. Blob to STRING, int(n) to INTEGER etc.), querying the full results (paginated with an offset, 500K rows per page) and saving it in a dataframe to then upload it to my new database.
Well, the ETL does its job, and I'm currently migrating my tables to the cloud. However, I'm worried I'm subutilizing my resources, due to network traffic gaps. Here is the network report (bytes/sec) from my VM reporting section:
VM Network Bytes report
According to that report, my in/out network data peaks at 2/3 MBs, which is really low compared to the average 1GBs available if I use the machine to download something from my browser, for example.
My point is, what am I doing wrong here? Is there any way to increase my MySQL query/fetch speed and my upload speed to BigQuery?
I understand that you are transforming datetime64 to a compatible BigQuery Data type, correct me if I am wrong.
I have a few recommendations:
You can use Dataflow as it is a ETL product and it is optimized for performance
Depending on your overall use case and if you are using CloudSQL/MySQL, you can use BigQuery Federated queries.
Again depending on your use case, you caould use a MySQL dump and upload the data in GCS or directly to BigQuery.
Related
I have multiple .csv.gz files (each greater than 10GB) that need to be parsed - multiple rows are read to create one row insertion. The approach I'm taking is as follows:
read .csv.gz file
save soon-to-be-inserted rows into a buffer
if there is enough data in the buffer, perform multirow insertion to database table
Now snowflake limits maximum number of expressions to 16384. I've been running this for about a day but the speed at which it is inserting is very slow. I am using sqlalchemy to do this:
url = "snowflake://<my snowflake url>"
engine = create_engine(url)
savedvalues = []
with pd.read_csv(datapath, header=0, chunksize=10**6) as reader:
for chunk in reader:
for index, row in chunk.iterrows():
"""
<parsing data>
"""
savedvalues.append(<parsed values>)
if(len(savedvalues) > 16384):
stmt = mytable.insert().values(savedvalues)
with engine.connect() as conn:
conn.execute(stmt)
savedvalues = []
Is there a faster way to insert data into snowflake database tables?
I'm looking into COPY INTO <table> operation but not sure if this is truly faster than what I'm doing right now.
Any suggestions would be much appreciated!
Here is an article describing a Python multithreaded approach to bulk loading into Snowflake Zero to Snowflake: Multi-Threaded Bulk Loading with Python. Also note to optimize the number of parallel operations for a load, Snowflake recommends data files roughly 100-250 MB (or larger) in size compressed.
I'm trying to import more than 200M records on different computers (n=20) to my BigQuery table via Python client. Each computer runs every 10. second a job (with multiple rows)
from google.cloud import bigquery
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.getcwd() + '/api.json'
print('Import job started, total rows:' + str(len(queries)))
client = bigquery.Client()
for q in queries:
results = client.query(q)
for err in results:
print(err)
But I'm getting the following error:
google.api_core.exceptions.Forbidden: 403 Exceeded rate limits: too
many table dml insert operations for this table. For more information,
see https://cloud.google.com/bigquery/troubleshooting-errors
The data are being generated on run-time. So I have to import the data on run-time. I'm also not sure if BigQuery is good for that. Spanner seems to be better but it costs me too much.
How can I avoid this error?
Thank you very much.
There are 4 major ways to insert data into BigQuery tables.
Batch load a set of data records.
Stream individual records or batches of records.
Use queries to generate new data and append or overwrite the results to a table.
Use a third-party application or service.
I think you are using the 3rd option, which is DML INSERT. It's not designed for large-scale high-frequency data loading use case.
In your use case, it seems the 2nd option, streaming data, could be a good fit.
Example
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of table to append to.
# table_id = "your-project.your_dataset.your_table"
rows_to_insert = [
{u"full_name": u"Phred Phlyntstone", u"age": 32},
{u"full_name": u"Wylma Phlyntstone", u"age": 29},
]
errors = client.insert_rows_json(table_id, rows_to_insert) # Make an API request.
if errors == []:
print("New rows have been added.")
else:
print("Encountered errors while inserting rows: {}".format(errors))
You could see more details here.
https://cloud.google.com/bigquery/streaming-data-into-bigquery
I'm using AWS Glue to move multiple files to an RDS instance from S3. Each day I get a new file into S3 which may contain new data, but can also contain a record I have already saved with some updates values. If I run the job multiple times I will of course get duplicate records in the database. Instead of multiple records being inserted I want Glue to try and update that record if it notices a field has changed, each record has a unique id. Is this possible?
I followed the similar approach which is suggested as 2nd option by Yuriy. Get existing data as well as new data and then do some processing to merge to of them and write with ovewrite mode. Following code would help you to get an idea about how to solve this problem.
sc = SparkContext()
glueContext = GlueContext(sc)
#get your source data
src_data = create_dynamic_frame.from_catalog(database = src_db, table_name = src_tbl)
src_df = src_data.toDF()
#get your destination data
dst_data = create_dynamic_frame.from_catalog(database = dst_db, table_name = dst_tbl)
dst_df = dst_data.toDF()
#Now merge two data frames to remove duplicates
merged_df = dst_df.union(src_df)
#Finally save data to destination with OVERWRITE mode
merged_df.write.format('jdbc').options( url = dest_jdbc_url,
user = dest_user_name,
password = dest_password,
dbtable = dest_tbl ).mode("overwrite").save()
Unfortunately there is no elegant way to do it with Glue. If you would write to Redshift you could use postactions to implement Redshift merge operation. However, it's not possible for other jdbc sinks (afaik).
Alternatively in your ETL script you can load existing data from a database to filter out existing records before saving. However if your DB table is big then the job may take a while to process it.
Another approach is to write into a staging table with mode 'overwrite' first (replace existing staging data) and then make a call to a DB via API to copy new records only into a final table.
I have used INSERT into table .... ON DUPLICATE KEY.. for UPSERTs into the Aurora RDS running mysql engine. Maybe this would be a reference for your use case. We cannot use a JDBC since we have only APPEND, OVERWRITE, ERROR modes currently supported.
I am not sure of the RDS database engine you are using, and following is an example for mysql UPSERTS.
Please see this reference, where i have posted a solution using INSERT INTO TABLE..ON DUPLICATE KEY for mysql :
Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
I'm using the Datalab for a Python notebook that loads data from Cloud Storage into BigQuery basically following this example.
I then saw that my original data in the Cloud Storage bucket is in the EU (eu-west3-a), the VM that executes the Datalab is in the same region, but the final data in BigQuery is in the US.
According to this post I tried setting the location for the dataset in code, but did not work. This is because there is no such option defined in the Datalab.Bigquery Python module.
So my question is: How do I set the location (zone and region) for the BigQuery dataset and its containing tables?
This is my code:
# data: https://www.kaggle.com/benhamner/sf-bay-area-bike-share/data
%%gcs read --object gs://my_bucket/kaggle/station.csv --variable stations
# CSV will be read as bytes first
df_stations = pd.read_csv(StringIO(stations))
schema = bq.Schema.from_data(df_stations)
# Create an empty dataset
#bq.Dataset('kaggle_bike_rentals').create(location='europe-west3-a')
bq.Dataset('kaggle_bike_rentals').create()
# Create an empty table within the dataset
table_stations = bq.Table('kaggle_bike_rentals.stations').create(schema = schema, overwrite = True)
# load data directly from cloud storage into the bigquery table. the locally loaded Pandas dataframe won't be used here
table_stations.load('gs://my_bucket/kaggle/station.csv', mode='append', source_format = 'csv', csv_options=bq.CSVOptions(skip_leading_rows = 1))
Update: Meanwhile I manually created the dataset in the BigQuery Web-UI and used it in code without creating it there. Now an exception will be raised if the dataset is not existing thus forbidding to create a one in code that will result in default location US.
Have you tried bq.Dataset('[your_dataset]').create(location='EU')?
BigQuery locations are set on a dataset level. Tables take their location based on the dataset they are in.
Setting the location of a dataset at least outside of Datalab:
from google.cloud import bigquery
bigquery_client = bigquery.Client(project='your_project')
dataset_ref = bigquery_client.dataset('your_dataset_name')
dataset = bigquery.Dataset(dataset_ref)
dataset.location = 'EU'
dataset = bigquery_client.create_dataset(dataset)
Based on the code snippet from here: https://cloud.google.com/bigquery/docs/datasets
I am trying to pull a 1.7G file into a pandas dataframe from a Greenplum postgres data source. The psycopg2 driver takes 8 or so minutes to load. Using the pandas "chunksize" parameter does not help as the psycopg2 driver selects all data into memory, then hands it off to pandas, using a lot more than 2G of RAM.
To get around this, I'm trying to use a named cursor, but all the examples I've found then loop through row by row. And that just seems slow. But the main problem appears to that my SQL just stops working in the named query for some unknown reason.
Goals
load the data as quickly as possible without doing any "unnatural
acts"
use SQLAlchemy if possible - used for consistency
have the results in a pandas dataframe for fast in-memory processing (alternatives?)
Have a "pythonic" (elegant) solution. I'd love to do this with a context manager but haven't gotten that far yet.
/// Named Cursor Chunky Access Test
import pandas as pd
import psycopg2
import psycopg2.extras
/// Connect to database - works
conn_chunky = psycopg2.connect(
database=database, user=username, password=password, host=hostname)
/// Open named cursor - appears to work
cursor_chunky = conn_chunky.cursor(
'buffered_fetch', cursor_factory=psycopg2.extras.DictCursor)
cursor_chunky.itersize = 100000
/// This is where the problem occurs - the SQL works just fine in all other tests, returns 3.5M records
result = cursor_chunky.execute(sql_query)
/// result returns None (normal behavior) but result is not iterable
df = pd.DataFrame(result.fetchall())
The pandas call returns AttributeError: 'NoneType' object has no attribute 'fetchall' Failure seems due to named cursor being used. Have tried fetchone, fetchmany, etc. Note the goal here is to let the server chunk and serve up the data in large chunks such that there is a balance of bandwidth and CPU usage. Looping through a df = df.append(row) is just plain fugly.
See related questions (not the same issue):
Streaming data from Postgres into Python
psycopg2 leaking memory after large query
Added standard client side chunking code per request
nrows = 3652504
size = nrows / 1000
idx = 0
first_loop = True
for dfx in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size):
if first_loop:
df = dfx
first_loop = False
else:
df = df.append(dfx,ignore_index=True)
UPDATE:
#Chunked access
start = time.time()
engine = create_engine(conn_str)
size = 10**4
df = pd.concat((x for x in pd.read_sql(iso_cmdb_base, engine, coerce_float=False, chunksize=size)),
ignore_index=True)
print('time:', (time.time() - start)/60, 'minutes or ', time.time() - start, 'seconds')
OLD answer:
I'd try to read data from PostgreSQL using internal Pandas method: read_sql():
from sqlalchemy import create_engine
engine = create_engine('postgresql://user#localhost:5432/dbname')
df = pd.read_sql(sql_query, engine)