In my Python project, I need to fill a BigQuery table from a relational dataframe. I'm having a lot of trouble creating a new table from scratch and making sure that the first data I upload to it actually ends up in the table.
I've read the page https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency and have seen that applying an insertId to the insert query would solve the problem, but since I use pandas dataframes, the to_gbq function of the pandas-gbq package seems perfect for this task. Yet when using the to_gbq function and a new table is created/replaced, sometimes (apparently randomly) the first data chunk is not written into the table.
Does anybody know how to ensure the complete insertion of a DataFrame into a newly created BigQuery table? Thanks
I believe you are encountering https://github.com/pydata/pandas-gbq/issues/75. Basically, pandas uses the BigQuery streaming API to write data into tables, but the streaming API has a delay after table creation before it starts working.
Edit: Version 0.3.0 of pandas-gbq fixes this issue by using a load job to upload data frames to BigQuery instead of streaming.
In the meantime, I'd recommend using a "load job" to create the tables. For example, using the client.load_table_from_file method in the google-cloud-bigquery package.
from google.cloud.bigquery import Client, LoadJobConfig
from six import StringIO

client = Client()  # assumes application default credentials and project are configured
destination_table = client.dataset(dataset_id).table(table_id)

job_config = LoadJobConfig()
job_config.write_disposition = 'WRITE_APPEND'
job_config.source_format = 'NEWLINE_DELIMITED_JSON'

# Serialize each dataframe row to one JSON line.
rows = []
for _, row in maybe_a_dataframe.iterrows():
    row_json = row.to_json(force_ascii=False, date_unit='s', date_format='iso')
    rows.append(row_json)

body = StringIO('{}\n'.format('\n'.join(rows)))
client.load_table_from_file(
    body,
    destination_table,
    job_config=job_config).result()
Edit: This code sample fails for columns containing non-ASCII characters. See https://github.com/pydata/pandas-gbq/pull/108
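Once you are on pandas-gbq 0.3.0 or later (the fix mentioned in the edit above), a plain to_gbq call should be enough; a minimal sketch with placeholder project, dataset, and table names:
import pandas as pd
import pandas_gbq

df = pd.DataFrame({'name': ['alice', 'bob'], 'score': [1, 2]})
# With pandas-gbq >= 0.3.0, to_gbq uploads via a load job instead of the
# streaming API, so rows land even in a freshly created/replaced table.
pandas_gbq.to_gbq(df, 'my_dataset.my_table', project_id='my-project', if_exists='replace')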
Related
I'm working in a limited Airflow environment in which I don't have access to google-cloud-bigquery but do have access to pandas-gbq. My goal is to load some JSON API data using some schema involving records into a BigQuery table. My strategy is to first read all the data into a pandas dataframe using a dictionary to represent the records: e.g.
uuid | metadata1                                | ...
001  | {u'time_updated': u'', u'name': u'jeff'} | ...
Then I've been trying to use pandas_gbq.to_gbq to load into BQ. The issue is that I get
Error at Row: 0, Reason: invalid, Location: metadata1, Message: This field: metadata1 is not a record.
I realize this is because the Google Cloud website says that pandas-gbq "Converts the DataFrame to CSV format before sending to the API, which does not support nested or array values."
And so I won't be able to upload a dataframe with records to BQ in this way since again I can't use google-cloud-bigquery in my environment.
What would be the best strategy for me to upload my data to BQ (around 30k rows and 6 or so columns with 8ish nested fields each)?
I know this sounds like a very bad strategy, but I could upload a flattened version of all fields in a record as a single string to the BQ table and then run a query from my code to replace these flattened fields with their record-form versions, roughly as in the sketch below. But this seems really bad since, for a time, the table would contain the wrong schema.
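For illustration, a minimal sketch of that flattening step (column, dataset, and table names are placeholders):
import json
import pandas_gbq

# Hypothetical workaround: serialize each record column to a JSON string so
# that pandas-gbq's CSV-based upload accepts it; the schema would then be
# fixed up later with a query.
record_columns = ['metadata1']  # column names assumed
for col in record_columns:
    df[col] = df[col].apply(json.dumps)
pandas_gbq.to_gbq(df, 'my_dataset.my_table', project_id='my-project', if_exists='replace')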
Any thoughts would be much appreciated. Thanks in advance.
I'm using the Python client library to insert data into a BigQuery table. The code is as follows.
from google.cloud import bigquery

client = bigquery.Client(project_id)
errors = client.insert_rows_json(table=tablename, json_rows=data_to_insert)
assert errors == []
There are no errors, but the data is also not getting inserted.
Sample JSON rows:
[{'a': 'b', 'c': 'd'}, {'a': 'f', 'q': 'r'}, ...]
What's the problem? There is no exception either.
The client.insert_rows_json method uses the BigQuery streaming insert API. Data inserted this way can take a while to appear in the table preview on the BigQuery console, so it does not show up there immediately. You need to query the table to confirm the data was inserted, for example with the count query sketched below.
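A minimal sanity check along those lines, with the table path as a placeholder:
# Placeholder table path; streamed rows show up in query results even while
# the console preview still lags behind.
query = "SELECT COUNT(*) AS n FROM `my-project.my_dataset.tablename`"
row_count = list(client.query(query).result())[0].n
print(row_count)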
There are two possible situations:
your data does not match the schema
your table is freshly created, and the update is just not yet available
References:
Related GitHub issue
Data availability
I got the answer to my question. The problem was that I was inserting data for one extra column that wasn't there. I found a hack to figure out why data is not being inserted into a BigQuery table:
Change the data to newline-delimited JSON, with the keys as the column names and the values as the values you want for those columns (see the sketch after these steps).
bq --location=US load --source_format=NEWLINE_DELIMITED_JSON dataset.tablename newline_delimited_json_file.json. Run this command in your terminal and see if it throws any errors. If it throws an error, it's likely that something is wrong with your data or table schema.
Change the data/table schema as per the error and retry inserting the same via python.
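A minimal sketch of producing that newline-delimited JSON file, reusing the sample rows and the file name from the command above:
import json

# Write each row dict (keys = column names) as one JSON object per line so it
# can be test-loaded with the bq command shown above.
data_to_insert = [{'a': 'b', 'c': 'd'}, {'a': 'f', 'q': 'r'}]
with open('newline_delimited_json_file.json', 'w') as f:
    for row in data_to_insert:
        f.write(json.dumps(row) + '\n')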
It would be helpful if the Python API threw an error/exception the way the terminal does.
I'm having some trouble loading a large file from my data lake (currently stored in postgres) into AWS GLUE. It is 4.3 Billion rows.
In testing, I've found that the table is too large to be fully read in.
Here's how I'm loading the data frame:
large_dynamic_frame = glueContext.create_dynamic_frame.from_catalog(
    database="DBNAME",
    table_name="TABLENAME",
    redshift_tmp_dir=args["TempDir"],
    transformation_ctx="TABLECTX")
Important Factors
I don't need the whole data frame! I'll ultimately filter based on a couple of attributes and join with smaller tables.
I've already tried using a push_down_predicate, but that required the data to be stored in S3 using a specific folder organization and unfortunately I don't get to choose the pre-existing format of this table.
I've also tried reading in the table and simply re-organizing it into the S3 folder organization necessary for push_down_predicate to work, but the process ends with "exit code 1" after 5 hours of running.
Primary Question
How can I read in part of a table without using a pushdown predicate?
You can also use pure spark/pyspark code in Glue and take advantage of its read methods.
You can see in the documentation how to read from Redshift, or in general from any SQL database through JDBC, and even how to read data from a query, as in the following example:
# Read data from a query
df = spark.read \
    .format("com.databricks.spark.redshift") \
    .option("url", "jdbc:redshift://redshifthost:5439/database?user=username&password=pass") \
    .option("query", "select x, count(*) from my_table group by x") \
    .option("tempdir", "s3n://path/for/temp/data") \
    .load()
I have found that AWS Glue implements only a small fraction of Spark's functionality, so I recommend going with Spark/PySpark when you have something complex to work on.
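Since your source is Postgres rather than Redshift, a plain JDBC read with a filtered subquery is the closest equivalent; a rough sketch, with host, credentials, and column names assumed, and requiring the Postgres JDBC driver jar to be available to the job:
# Read only a filtered subset of the large table through plain Spark JDBC
# instead of a Glue dynamic frame; the filter runs on the Postgres side.
subset_df = spark.read \
    .format("jdbc") \
    .option("url", "jdbc:postgresql://pghost:5432/database") \
    .option("user", "username") \
    .option("password", "pass") \
    .option("driver", "org.postgresql.Driver") \
    .option("dbtable", "(SELECT id, attr_a, attr_b FROM big_table WHERE attr_a = 'x') AS subset") \
    .load()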
Unfortunately predicate pushdown works only for S3 as you figured out already.
Regarding the "exit code 1": is your data in S3 in raw CSV format? Can you try creating multi-part bzip2 or lz4 files? In that case, the load will be shared by multiple workers.
How many DPUs have you allocated for the task? This article gives a nice overview of DPU capacity planning.
Or you can create a view in Postgres and use that as source.
Please let me know if that helped.
I need to create a table in BigQuery partitioned by a specific field. I have noticed that this is only available via the REST API. Is there a way to do this via the Python API?
Any help?
My guess is that the docs just haven't been updated yet (not that rolling an HTTP request and calling the API would be hard anyway), because if you look at the code for the BigQuery Python client library, it does indeed appear to support specifying the field when creating a partitioned table.
Expanding on Graham Polley's answer: You can set this by setting the time_partitioning property.
Something like this:
import google.cloud.bigquery as bq

bq_client = bq.Client()
dataset = bq_client.dataset('dataset_name')
table_ref = dataset.table('table_name')

# Define the schema, including the field to partition on.
table = bq.Table(table_ref, schema=[
    bq.SchemaField('timestamp', 'TIMESTAMP', 'REQUIRED'),
    bq.SchemaField('col_name', 'STRING', 'REQUIRED')])

# Partition by the 'timestamp' column instead of ingestion time.
table.time_partitioning = bq.TimePartitioning(field='timestamp')

bq_client.create_table(table)
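Once the table exists, filtering on the partitioning field lets BigQuery prune partitions at query time; a hypothetical example (project name assumed) reusing the dataset and table names above:
# Filtering on the 'timestamp' partitioning column limits the scan to the
# matching partitions.
query = """
    SELECT col_name
    FROM `my-project.dataset_name.table_name`
    WHERE timestamp BETWEEN TIMESTAMP('2018-01-01') AND TIMESTAMP('2018-01-02')
"""
results = bq_client.query(query).result()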
I have a table in a BigQuery dataset and I'm trying to find out when the table was last modified via the BigQuery client API.
I have tried (in Python)
from gcloud import bigquery
client = bigquery.Client(project="my_project")
dataset = client.dataset("my_dataset")
tables = dataset.list_tables()
table = tables[0][5] # Extract the table that I want
I can check that I've got the right table by running print(table.name), but I don't know how to get the table metadata. In particular, I want to know how to find out when the table was last modified.
Although I've written the above in Python (I'm more familiar with it than with other programming languages), I don't mind if the answer is in Python or JavaScript (I think I'm going to have to implement it in the latter).
Under the hood, tables = dataset.list_tables() is making an API request to Tables.list. The result of this request does not contain all the table meta information - like last modified for example.
The Tables.get API request is needed for this type of table information. To make this request you need to call reload() on the table. For example:
from gcloud import bigquery

bigquery_service = bigquery.Client()
dataset = bigquery_service.dataset("<your-dataset>")
tables = dataset.list_tables()
for table in tables:
    table.reload()  # issues the Tables.get request to fetch full metadata
    print(table.modified)
In my test/dataset, this prints:
2016-12-30 08:57:15.679000+00:00
2016-12-18 23:57:24.570000+00:00
2016-12-19 05:18:28.371000+00:00
See here (Github) and here (Python docs) for more details.
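For reference, with the newer google-cloud-bigquery client (which replaced gcloud), a rough equivalent, assuming a dataset named "my_dataset", would be:
from google.cloud import bigquery

client = bigquery.Client(project="my_project")
for item in client.list_tables("my_project.my_dataset"):
    # list_tables returns lightweight items; get_table issues the Tables.get
    # request that fills in metadata such as the last-modified timestamp.
    table = client.get_table(item.reference)
    print(table.table_id, table.modified)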