I'm trying to import more than 200M records into my BigQuery table via the Python client, from different computers (n=20). Each computer runs a job (with multiple rows) every 10 seconds.
from google.cloud import bigquery
import os
os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = os.getcwd() + '/api.json'
print('Import job started, total rows:' + str(len(queries)))
client = bigquery.Client()
for q in queries:
    results = client.query(q)
    for err in results:
        print(err)
But I'm getting the following error:
google.api_core.exceptions.Forbidden: 403 Exceeded rate limits: too
many table dml insert operations for this table. For more information,
see https://cloud.google.com/bigquery/troubleshooting-errors
The data are generated at run time, so I have to import them at run time as well. I'm also not sure BigQuery is a good fit for this; Spanner seems better suited, but it would cost me too much.
How can I avoid this error?
Thank you very much.
There are 4 major ways to insert data into BigQuery tables.
Batch load a set of data records.
Stream individual records or batches of records.
Use queries to generate new data and append or overwrite the results to a table.
Use a third-party application or service.
I think you are using the 3rd option, DML INSERT. It's not designed for large-scale, high-frequency data loading use cases.
In your use case, it seems the 2nd option, streaming data, could be a good fit.
Example
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of table to append to.
# table_id = "your-project.your_dataset.your_table"
rows_to_insert = [
    {u"full_name": u"Phred Phlyntstone", u"age": 32},
    {u"full_name": u"Wylma Phlyntstone", u"age": 29},
]

errors = client.insert_rows_json(table_id, rows_to_insert)  # Make an API request.
if errors == []:
    print("New rows have been added.")
else:
    print("Encountered errors while inserting rows: {}".format(errors))
You can find more details here:
https://cloud.google.com/bigquery/streaming-data-into-bigquery
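Applied to your setup, a rough sketch could look like the following. This is only an illustration: it assumes each 10-second job already has its rows available as Python dicts (rather than pre-built INSERT statements), and table_id is a placeholder for your actual table.

from google.cloud import bigquery

client = bigquery.Client()
table_id = "your-project.your_dataset.your_table"  # hypothetical placeholder

def stream_rows(rows, batch_size=500):
    # Stream rows in batches instead of issuing one DML INSERT per job
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        errors = client.insert_rows_json(table_id, batch)
        if errors:
            print("Errors while inserting batch:", errors)

# e.g. call stream_rows(rows_for_this_job) inside each 10-second job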
When I tried to save data to a Cloudant database from Watson Studio using a Spark & Python environment, I got the error "HTTPError: 413 Client Error: Request Entity Too Large document_too_large". How can I fix this 413 error?
My code:
from cloudant.client import Cloudant

def db_data_writing(credential: dict, db_name: str, files: dict):
    """
    DB data writing

    This function reaches the online database and writes the input files as a new document.

    Parameters
    ----------
    credential: dict
        credentials to access the online Cloudant database instance
    db_name: str
        name of the database to write data to
    files: dict
        json containing the data to write to the db

    Returns
    -------
    bool
        exit status
    """
    # Feed the credentials and establish a connection
    client = Cloudant.iam(credential['username'], credential['apikey'], connect=True)
    # Select the database instance to write results to
    db = client[db_name]
    # Write and save the document
    db.create_document(files).save()
    # Shut down the connection
    client.disconnect()
credential = {xxxxxxxxxxxxxxxxxx}
db_name = "xxxxxxxxx"

for k in range(len(dataPreparation.index)):  # type of dataPreparation is DataFrame
    print(dataPreparation.loc[k])
    i += 1
    db_data_writing(credential, db_name, dataPreparation.loc[k])
Thanks
Cloudant imposes limits on document sizes and also on the total request body size. A document must not be larger than 1MB, and a request body must be no larger than 11MB. If you exceed either of those, you will get a 413 error back.
These limits are there "for your own protection". Even using documents approaching 1MB is somewhat suboptimal -- you really should try to keep documents within a few kB to hit the performance sweet spot when using Cloudant.
https://cloud.ibm.com/docs/Cloudant?topic=Cloudant-documents
So to work around this you need to revisit your data model and ensure your documents are smaller than 1MB.
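As an illustration only (the helper function and the per-row usage below are assumptions, not part of your code), you could write one small document per row and check the serialized size of each document before writing it:

import json

MAX_DOC_BYTES = 1 * 1024 * 1024  # Cloudant's 1 MB per-document limit

def safe_create_document(db, doc: dict) -> bool:
    # Only write the document if its JSON serialization stays under 1 MB
    size = len(json.dumps(doc).encode("utf-8"))
    if size > MAX_DOC_BYTES:
        print("Skipping document: {} bytes exceeds the 1 MB limit".format(size))
        return False
    db.create_document(doc)
    return True

# Hypothetical usage inside the existing loop, one row per document:
# safe_create_document(db, dataPreparation.loc[k].to_dict())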
I have successfully connected Python to a redshift table with my Jupyter Notebook.
I sampled 1 day of data (176707 rows) and performed a function using Missingno to assess how much data is missing and where. No problem there.
Here's the code so far (redacted for security)...
#IMPORT NECESSARY PACKAGES FOR REDSHIFT CONNECTION AND DATA VIZ
import psycopg2
from getpass import getpass
from pandas import read_sql
import seaborn as sns
import missingno as msno
#PASSWORD INPUT PROMPT
pwd = getpass('password')
#REDSHIFT CREDENTIALS
config = {'dbname': 'abcxyz',
          'user': 'abcxyz',
          'pwd': pwd,
          'host': 'abcxyz.redshift.amazonaws.com',
          'port': 'xxxx'
          }
#CONNECTION UDF USING REDSHIFT CREDS AS DEFINED ABOVE
def create_conn(*args, **kwargs):
    config = kwargs['config']
    try:
        con = psycopg2.connect(dbname=config['dbname'], host=config['host'],
                               port=config['port'], user=config['user'],
                               password=config['pwd'])
        return con
    except Exception as err:
        print(err)
#DEFINE CONNECTION
con = create_conn(config=config)
#SQL TO RETRIEVE DATASET AND STORE IN DATAFRAME
df = read_sql("select * from schema.table where date = '2020-06-07'", con=con)
# MISSINGNO VIZ
msno.bar(df, labels=True, figsize=(50, 20))
This produces the following, which is exactly what I want to see:
However, I need to perform this task on a subset of the entire table, not just one day.
I ran...
SELECT "table", size, tbl_rows FROM SVV_TABLE_INFO
...and I can see that the table has a total size of 9 GB and 32.5M rows, although the sample whose completeness I need to assess is 11M rows.
So far I have identified 2 options for retrieving a larger dataset than the ~18k rows from my initial attempt.
These are:
1) Using chunksize
2) Using Dask
Using Chunksize
I replaced the necessary line of code with this:
#SQL TO RETRIEVE DATASET AND STORE IN DATAFRAME
df = read_sql("select * from derived.page_views where column_name = 'something'", con=con, chunksize=100000)
This still took several hours to run on a MacBook Pro 2.2 GHz Intel Core i7 with 16 GB RAM and gave memory warnings toward the end of the task.
When it was complete I wasn't able to view the chunks anyway and the kernel disconnected, meaning the data held in memory was lost and I'd essentially wasted a morning.
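If I understand the pandas docs correctly, read_sql with chunksize returns an iterator of DataFrames rather than a single frame, so presumably the chunks have to be consumed explicitly, something like this untested sketch:

chunks = read_sql("select * from derived.page_views where column_name = 'something'",
                  con=con, chunksize=100000)

partial_counts = []
for chunk in chunks:
    # Collect per-chunk null counts instead of holding every chunk in memory
    partial_counts.append(chunk.isnull().sum())

missing_per_column = sum(partial_counts)
print(missing_per_column)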
My question is:
Assuming this is not an entirely foolish endeavour, would Dask be a better approach? If so, how could I perform this task using Dask?
The Dask documentation gives this example:
df = dd.read_sql_table('accounts', 'sqlite:///path/to/bank.db',
                       npartitions=10, index_col='id')  # doctest: +SKIP
But I don't understand how I could apply this to my scenario, where I have connected to a Redshift table in order to retrieve the data.
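My best guess so far (completely untested, and the connection string format and index column below are assumptions on my part) is that dd.read_sql_table wants a SQLAlchemy-style URI rather than the psycopg2 connection object, something like:

import dask.dataframe as dd

# Build a SQLAlchemy-style URI from the same credentials used above
uri = "postgresql+psycopg2://{user}:{pwd}@{host}:{port}/{dbname}".format(**config)

ddf = dd.read_sql_table('page_views', uri,
                        schema='derived',   # the table lives in the derived schema
                        index_col='id',     # hypothetical indexed column to partition on
                        npartitions=20)

# Filter to the subset and compute only the per-column null counts
missing = ddf[ddf['column_name'] == 'something'].isnull().sum().compute()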
Any help gratefully received.
I'm currently building an ETL on a Google Cloud based VM (Windows Server 2019 - 4 vCPUs) to execute the following process:
Extract some tables from a MySQL replica db
Adjust data types for Google BigQuery conformities
Upload the data to BigQuery using Python's pandas_gbq library
To illustrate, here are some parts of the actual code (Python, iterator over one table):
while True:
    # GENERATES A MYSQL QUERY BASED ON THE COLUMNS AND THEIR
    # RESPECTIVE TYPES, USING A DICTIONARY TO CONVERT
    # MYSQL D_TYPES TO PYTHON D_TYPES
    sql_query = gen_query(cols_dict=col_types, table=table,
                          pr_key=p_key, offset=offset)

    cursor = cnx.cursor(buffered=True)
    cursor.execute(sql_query)

    if cursor.rowcount == 0:
        break

    num_fields = len(cursor.description)
    field_names = [i[0] for i in cursor.description]
    records = cursor.fetchall()

    df = pd.DataFrame(records, columns=columns)
    offset += len(df.index)
    print('Ok, df structured')

    # CHECK FOR DATETIME COLUMNS
    col_parse_date = []
    for column in columns:
        if col_types[column] == 'datetime64':
            try:
                df[column] = df[column].astype(col_types[column])
                col_parse_date.append(column)
            except:
                df[column] = df[column].astype(str)
                for i in to_bgq:
                    if i['name'] == column:
                        i['type'] = 'STRING'

    # UPLOAD DATAFRAME TO GOOGLE BIGQUERY
    df.to_csv('carga_etl.csv', float_format='%.2f',
              index=False, sep='|')
    print('Ok, csv recorded')

    df = ''
    df = pd.read_csv('carga_etl.csv', sep='|')
    print('Ok, csv read')

    df.to_gbq(destination_table='tr.{}'.format(table),
              project_id='iugu-bi', if_exists='append', table_schema=to_bgq)
The logic is based on a query generator: it gets the MySQL table schema and adjusts it to BigQuery formats (e.g. BLOB to STRING, int(n) to INTEGER, etc.), queries the full results (paginated with an offset, 500K rows per page) and stores them in a DataFrame, which is then uploaded to my new database.
Well, the ETL does its job, and I'm currently migrating my tables to the cloud. However, I'm worried that I'm underutilizing my resources, due to gaps in network traffic. Here is the network report (bytes/sec) from my VM's reporting section:
VM Network Bytes report
According to that report, my inbound/outbound network traffic peaks at 2 to 3 MB/s, which is really low compared to the roughly 1 GB/s available if I use the machine to download something in my browser, for example.
My point is, what am I doing wrong here? Is there any way to increase my MySQL query/fetch speed and my upload speed to BigQuery?
I understand that you are transforming datetime64 values to a compatible BigQuery data type; correct me if I am wrong.
I have a few recommendations:
You can use Dataflow, as it is an ETL product optimized for performance.
Depending on your overall use case, and if you are using Cloud SQL/MySQL, you can use BigQuery federated queries.
Again depending on your use case, you could take a MySQL dump and upload the data to GCS or directly to BigQuery (a rough sketch of the GCS route follows below).
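For the third option, a minimal sketch of loading a CSV dump from GCS into BigQuery with the Python client could look like this (the bucket, object and table names below are placeholders, not taken from your setup):

from google.cloud import bigquery

client = bigquery.Client()

table_id = "your-project.your_dataset.your_table"  # placeholder
uri = "gs://your-bucket/mysql_dump/your_table.csv"  # placeholder

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    field_delimiter="|",  # matches the delimiter used in your ETL
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=True,
)

load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to finish
print("Loaded {} rows".format(client.get_table(table_id).num_rows))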
I am running an Apache Beam Dataflow job which reads from a bucket, performs some transformations and writes to BigQuery.
But the records are inserted into the streaming buffer.
validated_data = (p1
                  | 'Read files from Storage ' + url >> beam.io.ReadFromText(url)
                  | 'Validate records ' + url >> beam.Map(data_ingestion.validate, url)
                        .with_outputs(SUCCESS_TAG_KEY, FAILED_TAG_KEY, main="main")
                  )

all_data, _, _ = validated_data
success_records = validated_data[SUCCESS_TAG_KEY]
failed_records = validated_data[FAILED_TAG_KEY]

(success_records
 | 'Extracting row from tagged row {}'.format(url) >> beam.Map(lambda row: row['row'])
 | 'Write to BigQuery table for {}'.format(url) >> beam.io.WriteToBigQuery(
       table=data_ingestion.get_table(tmp=TEST, run_date=data_ingestion.run_date),
       create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND
   )
)
Actually, I need to delete the partition before running the above, as a way to avoid duplicate records in this ingestion-time partitioned table.
If I run this job more than once for the same file without truncating the table, the table will end up with duplicate records.
And because the last records are still in the streaming buffer, the delete-partition command does not actually remove the partition.
Below is the code I am using to truncate the table; it runs before the pipeline:
client = bigquery.Client()
dataset = TABLE_MAP['dataset']
table = TABLE_MAP[sentiment_pipeline][table_type]['table']
table_id = "{}${}".format(table, format_date(run_date, '%Y%m%d'))
table_ref = client.dataset(dataset).table(table_id)
output = client.delete_table(table_ref)
According to the BigQuery documentation, you may have to wait up to 30 minutes before running a DML statement against a table with data in the streaming buffer, and schema changes like deleting/truncating tables might result in data loss in some scenarios. Here are some workarounds you could try for dealing with duplicates in a streaming scenario.
Additionally, Apache Beam and Dataflow now support batch inserts (file loads) for Python, so this might be a good way to avoid the streaming limitations.
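For example, a sketch of switching the write step in your pipeline from streaming inserts to load jobs might look like the following (it reuses the names from your snippet and assumes a Beam version whose Python SDK supports FILE_LOADS):

(success_records
 | 'Extracting row from tagged row {}'.format(url) >> beam.Map(lambda row: row['row'])
 | 'Write to BigQuery table for {}'.format(url) >> beam.io.WriteToBigQuery(
       table=data_ingestion.get_table(tmp=TEST, run_date=data_ingestion.run_date),
       create_disposition=beam.io.BigQueryDisposition.CREATE_NEVER,
       write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
       # FILE_LOADS stages the data in GCS and issues load jobs,
       # so the rows never go through the streaming buffer
       method=beam.io.WriteToBigQuery.Method.FILE_LOADS
   )
)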
I'm using AWS Glue to move multiple files from S3 to an RDS instance. Each day I get a new file in S3 which may contain new data, but can also contain a record I have already saved, with some updated values. If I run the job multiple times I will of course get duplicate records in the database. Instead of inserting multiple records, I want Glue to try to update a record when it notices that a field has changed; each record has a unique id. Is this possible?
I followed the similar approach that is suggested as the 2nd option by Yuriy: get the existing data as well as the new data, merge the two, and write with overwrite mode. The following code should give you an idea of how to solve this problem.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glueContext = GlueContext(sc)

# Get your source data
src_data = glueContext.create_dynamic_frame.from_catalog(database=src_db, table_name=src_tbl)
src_df = src_data.toDF()

# Get your destination data
dst_data = glueContext.create_dynamic_frame.from_catalog(database=dst_db, table_name=dst_tbl)
dst_df = dst_data.toDF()

# Now merge the two data frames to remove duplicates
merged_df = dst_df.union(src_df)

# Finally save the data to the destination with OVERWRITE mode
merged_df.write.format('jdbc').options(url=dest_jdbc_url,
                                       user=dest_user_name,
                                       password=dest_password,
                                       dbtable=dest_tbl).mode("overwrite").save()
Unfortunately there is no elegant way to do it with Glue. If you were writing to Redshift you could use postactions to implement a Redshift merge operation. However, it's not possible for other JDBC sinks (afaik).
Alternatively, in your ETL script you can load the existing data from the database to filter out existing records before saving. However, if your DB table is big, the job may take a while to process it.
Another approach is to write into a staging table with mode 'overwrite' first (replace existing staging data) and then make a call to a DB via API to copy new records only into a final table.
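As a rough sketch of that last approach (all table names, the id column, the new_df DataFrame and the use of pymysql are assumptions here; adjust to your actual engine and schema):

import pymysql  # assuming a MySQL-compatible RDS engine

# 1) Overwrite a staging table with the newly arrived batch
new_df.write.format('jdbc').options(url=dest_jdbc_url,
                                    user=dest_user_name,
                                    password=dest_password,
                                    dbtable='my_table_staging').mode('overwrite').save()

# 2) Copy only the records that are not already in the final table
conn = pymysql.connect(host=db_host, user=dest_user_name,
                       password=dest_password, database=db_name)
with conn.cursor() as cur:
    cur.execute("""
        INSERT INTO my_table
        SELECT s.* FROM my_table_staging s
        LEFT JOIN my_table t ON t.id = s.id
        WHERE t.id IS NULL
    """)
conn.commit()
conn.close()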
I have used INSERT INTO table ... ON DUPLICATE KEY ... for UPSERTs into an Aurora RDS instance running the MySQL engine. Maybe this can serve as a reference for your use case. We cannot rely on the JDBC writer for this since only the APPEND, OVERWRITE and ERROR modes are currently supported.
I am not sure which RDS database engine you are using; the following is an example of MySQL UPSERTs.
Please see this reference, where I have posted a solution using INSERT INTO TABLE ... ON DUPLICATE KEY for MySQL:
Error while using INSERT INTO table ON DUPLICATE KEY, using a for loop array
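For illustration, a minimal upsert sketch against a MySQL-engine database (the table, columns and connection details below are made up, and pymysql is used instead of the Glue JDBC writer) could be:

import pymysql

# Hypothetical connection details and schema
conn = pymysql.connect(host='my-rds-host', user='admin',
                       password='secret', database='mydb')

rows = [(1, 'alice', 'alice@example.com'),
        (2, 'bob', 'bob@example.com')]

sql = """
    INSERT INTO users (id, name, email)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE name = VALUES(name), email = VALUES(email)
"""

with conn.cursor() as cur:
    cur.executemany(sql, rows)  # inserts new ids, updates existing ones
conn.commit()
conn.close()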