BigQuery Python API: create a table partitioned by a specific field

I need to create a table in BigQuery that is partitioned by a specific field. I've noticed that this appears to be available only via the REST API. Is there a way to do it via the Python API?
Any help?

My guess is that the docs just haven't been updated yet (not that rolling an HTTP request and calling the API would be hard anyway), because if you look at the code for the BigQuery Python client library, it does indeed appear to support specifying the field when creating a partitioned table:
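For reference, here is a rough paraphrase of the relevant constructor in the client library. This is not the library's exact source; the parameter names are taken from a recent google-cloud-bigquery release, so check your installed version:

# Paraphrased sketch of google.cloud.bigquery.table.TimePartitioning
# (not verbatim library source):
class TimePartitioning(object):
    def __init__(self, type_=None, field=None, expiration_ms=None,
                 require_partition_filter=None):
        # 'field' names the TIMESTAMP/DATE column to partition by;
        # when omitted, the table is partitioned by ingestion time.
        ...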

Expanding on Graham Polley's answer: you can configure this through the time_partitioning property.
Something like this:
import google.cloud.bigquery as bq

bq_client = bq.Client()
dataset = bq_client.dataset('dataset_name')
table = dataset.table('table_name')
table = bq.Table(table, schema=[
    bq.SchemaField('timestamp', 'TIMESTAMP', 'REQUIRED'),
    bq.SchemaField('col_name', 'STRING', 'REQUIRED')])
table.time_partitioning = bq.TimePartitioning(field='timestamp')
bq_client.create_table(table)

Related

BigQuery: Create table if it does not exist and load data using Python and Apache Airflow

First, I get all the data with a MySQL query from the production database and store it as NEWLINE DELIMITED JSON in Google Cloud Storage. What I want to do is:
1. check if the table exists
2. if the table doesn't exist, create it using schema auto-detection
3. store the data
All of this will be scheduled in Airflow. What really confuses me is step 2: how can I do this in Python? Or can Airflow do it automatically?
Airflow can do this automatically. The create_disposition parameter creates the table if needed, and the autodetect parameter does exactly what you need. This is for Airflow 1.10.2.
GCS_to_BQ = GoogleCloudStorageToBigQueryOperator(
    task_id='gcs_to_bq',
    bucket='test_bucket',
    source_objects=['folder1/*.csv', 'folder2/*.csv'],
    destination_project_dataset_table='dest_table',
    source_format='CSV',
    create_disposition='CREATE_IF_NEEDED',
    write_disposition='WRITE_TRUNCATE',
    bigquery_conn_id='bq-conn',
    google_cloud_storage_conn_id='gcp-conn',
    autodetect=True,  # this uses schema auto-detection
    dag=dag
)
From the BigQuery command line, if your JSON file is already on GCS, then Loading JSON data with schema auto-detection does 2 + 3 for you in one command.
Looking at the Airflow documentation, GoogleCloudStorageToBigQueryOperator seems to do the same thing. I checked its source: it simply calls the BigQuery load API. I believe it will do what you want.
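For example, here is a minimal sketch of an equivalent load job issued directly with google-cloud-bigquery. The dataset/table names and the GCS URI are placeholders, not values from the question:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
job_config.autodetect = True                        # step 2: infer the schema
job_config.create_disposition = 'CREATE_IF_NEEDED'  # step 2: create the table if missing
job_config.write_disposition = 'WRITE_APPEND'       # step 3: append the data

load_job = client.load_table_from_uri(
    'gs://my_bucket/my_folder/*.json',
    client.dataset('my_dataset').table('my_table'),
    job_config=job_config)
load_job.result()  # wait for the load job to finish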
When it is unclear what an argument means, you can search the BigQuery Jobs API for the argument name.
E.g., to achieve 1 in your task list, you only need to specify:
write_disposition (string) – The write disposition if the table already exists.
But in order to know which string to pass as write_disposition, you have to search the BigQuery documentation.

How to get dataset information via the Python API

I'm trying to get dataset information via the Python client libraries. In the BigQuery UI I can see that the created date, data location, etc. are set, but when I try to get them via the API, it just returns None. The documentation says "(None until set from the server)", but if I can see it in the UI, I assumed (presumably wrongly) that it was set.
Here's my code. What am I doing wrong?
dataset_ref = client.dataset('myDatasetName')
dataset_info = bigquery.dataset.Dataset(dataset_ref)
print(dataset_info.created)
You were almost there.
dataset_ref = client.dataset('myDatasetName')
dataset_info = client.get_dataset(dataset_ref)  # actually fetches the dataset from the server
print(dataset_info.created)
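Once the dataset has been fetched from the server, the other fields you see in the UI should be populated as well, e.g.:

print(dataset_info.location)     # e.g. 'US'
print(dataset_info.description)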

BigQuery (and pandas) - ensure data-insert consistency

In my Python project, I need to fill a BigQuery table with a relational dataframe. I'm having a lot of trouble creating a new table from scratch and being sure that the first data I upload to it actually ends up in the table.
I've read https://cloud.google.com/bigquery/streaming-data-into-bigquery#dataconsistency and have seen that adding an insertId to the insert request would solve the problem, but since I use pandas dataframes, the to_gbq function of the pandas-gbq package seems perfect for this task. Yet, when using to_gbq and a new table is created or replaced, sometimes (apparently at random) the first data chunk is not written to the table.
Does anybody know how to ensure the complete insertion of a DataFrame into a newly created BigQuery table? Thanks.
I believe you are encountering https://github.com/pydata/pandas-gbq/issues/75. Basically, pandas uses the BigQuery streaming API to write data into tables, but the streaming API has a delay after table creation before it starts working.
Edit: version 0.3.0 of pandas-gbq fixes this issue by using a load job to upload data frames to BigQuery instead of streaming.
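If you prefer to stay with pandas-gbq, a minimal sketch after upgrading (assuming pandas-gbq >= 0.3.0; the table and project names are placeholders):

import pandas_gbq

pandas_gbq.to_gbq(
    df,                        # your DataFrame
    'my_dataset.my_table',
    project_id='my-project',
    if_exists='replace')       # or 'append' / 'fail'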
In the meantime, I'd recommend using a "load job" to create the tables, for example via the client.load_table_from_file method in the google-cloud-bigquery package.
from google.cloud.bigquery import LoadJobConfig
from six import StringIO

destination_table = client.dataset(dataset_id).table(table_id)
job_config = LoadJobConfig()
job_config.write_disposition = 'WRITE_APPEND'
job_config.source_format = 'NEWLINE_DELIMITED_JSON'

rows = []
for row in maybe_a_dataframe:
    row_json = row.to_json(force_ascii=False, date_unit='s', date_format='iso')
    rows.append(row_json)

body = StringIO('{}\n'.format('\n'.join(rows)))
client.load_table_from_file(
    body,
    destination_table,
    job_config=job_config).result()
Edit: This code sample fails for columns containing non-ASCII characters. See https://github.com/pydata/pandas-gbq/pull/108
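A possible workaround (a sketch, not the fix that eventually landed in pandas-gbq) is to encode the JSON payload to UTF-8 yourself and upload it as bytes:

from io import BytesIO

payload = '{}\n'.format('\n'.join(rows)).encode('utf-8')
body = BytesIO(payload)  # bytes, so non-ASCII characters survive the upload
client.load_table_from_file(
    body,
    destination_table,
    job_config=job_config).result()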

Extract metadata about table using BigQuery Client API

I have a table in a BigQuery dataset and I'm trying to find out when the table was last modified via the BigQuery client API.
I have tried (in Python)
from gcloud import bigquery
client = bigquery.Client(project="my_project")
dataset = client.dataset("my_dataset")
tables = dataset.list_tables()
table = tables[0][5] # Extract the table that I want
I can check that I've got the right table by running print(table.name); however, I don't know how to get the table metadata. In particular, I want to know when the table was last modified.
Although I've written the above in Python (I'm more familiar with it than with other programming languages), I don't mind if the answer is in Python or JavaScript (I think I'm going to have to implement it in the latter).
Under the hood, tables = dataset.list_tables() makes an API request to Tables.list. The result of this request does not contain all of the table's meta information - last modified, for example.
The Tables.get API request is needed for this type of table information. To make this request, you call reload() on the table. For example:
bigquery_service = bigquery.Client()
dataset = bigquery_service.dataset("<your-dataset>")
tables = dataset.list_tables()
for table in tables:
    table.reload()
    print(table.modified)
In my test/dataset, this prints:
2016-12-30 08:57:15.679000+00:00
2016-12-18 23:57:24.570000+00:00
2016-12-19 05:18:28.371000+00:00
See here (GitHub) and here (Python docs) for more details.
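Note that in newer releases of google-cloud-bigquery, list_tables() lives on the client and returns lightweight items without this metadata; you then call get_table() to fetch the full resource. A rough equivalent (the dataset name is a placeholder):

from google.cloud import bigquery

client = bigquery.Client()
for item in client.list_tables('my_dataset'):    # TableListItem: minimal info only
    table = client.get_table(item.reference)     # full Table resource
    print(table.table_id, table.modified)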

Insert a row into a Google BigQuery table from the values of a Python list

I am a newbie who is exploring Google BigQuery. I would like to insert a row into a BigQuery table from a Python list which contains the row values.
To be more specific, my list looks like this: [100.453, 108.75, 50.7773]
I found a couple of hints in the BigQuery-Python library's insert documentation and also looked into the pandas BigQuery writer, but I'm not sure whether either is a good fit for my use case.
What would be the better solution?
Thanks in advance.
Lots of resources exist, but I usually find code examples to be the most informative when getting started.
Here's an excellent collection of BigQuery Python code samples: https://github.com/googleapis/python-bigquery/tree/master/samples.
One straightforward way to insert rows:
from google.cloud import bigquery

bq_client = bigquery.Client()
table = bq_client.get_table("{}.{}.{}".format(PROJECT, DATASET, TABLE))
rows_to_insert = [
    {u"COL1": 100.453, u"COL2": 108.75, u"COL3": 50.7773},
    {u"COL1": 200.348, u"COL2": 208.29, u"COL3": 60.7773},
]
errors = bq_client.insert_rows_json(table, rows_to_insert)
if errors == []:
    print("success")
Lastly, to verify that the rows were inserted successfully, use:
bq query --nouse_legacy_sql 'SELECT * FROM `PROJECT.DATASET.TABLE`'
Hope that helps everyone!
To work with Google Cloud Platform services from Python, I would recommend the google-cloud package, and for BigQuery specifically the submodule google-cloud-bigquery (this was also recommended by @polleyg). It is an open-source, idiomatic Python client maintained by Google, and it lets you use all the Google Cloud services in a simple and consistent way.
More specifically, the example under "Insert rows into a table's data" in the documentation shows how to insert Python tuples/lists into a BigQuery table.
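Since the list in the question is just row values in column order, a minimal sketch of that approach (the table name is a placeholder; insert_rows matches tuples against the table's schema):

from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table('my_dataset.my_table')   # fetches the table, including its schema

rows = [(100.453, 108.75, 50.7773)]               # values in the same order as the schema
errors = client.insert_rows(table, rows)
assert errors == []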
However, depending on your needs you might want other options; my ordering of options:
If you use code that has a native interface to Google services (e.g. BigQuery) and it suits your needs, use that. In your case, test whether pandas-gbq works for you.
If your current code/modules don't have a native interface, try the Google-maintained idiomatic client google-cloud.
If that doesn't suit your needs, use an external idiomatic client like tylertreat/BigQuery-Python. The downside is that you will end up with different, inconsistent clients for the different services; the benefit can be that it adds some functionality not provided in the google-cloud module.
Finally, if you work with very new alpha/beta features, use the APIs directly via the Google API module. This will always give you access to the latest APIs, but it is a bit harder to work with, so only use it if the previous options don't give you what you need.
The Google BigQuery docs show you how:
https://cloud.google.com/bigquery/streaming-data-into-bigquery#bigquery_stream_data_python
https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/bigquery/cloud-client/stream_data.py
