Upload a table to BigQuery from Colab, specifying the schema in job_config - python

I am trying to write a table to BigQuery from Colab. The best approach I have found is to use the client with a job_config. It is important that I keep control over how the data is written, as I plan to reuse the code below for different tasks. The last step that eludes me is setting up the schema. I do not want someone's query to crash because, say, Year suddenly becomes an integer instead of a string. Should the code below work? Or do I need "job_config.schema_update_options", and if so, what should the schema object look like? I cannot use pandas-gbq because writing to a dataframe first is too slow. The table is overwritten each month, which is why I use WRITE_TRUNCATE. Thanks
schema_1 = [
    {"name": "Region", "type": "STRING", "mode": "NULLABLE"},
    {"name": "Product", "type": "STRING", "mode": "NULLABLE"},
]
schemma2 = [('Region', 'STRING', 'NULLABLE', None, ()),
            ('Product', 'STRING', 'NULLABLE', None, ())]
"""Create a Google BigQuery input table.
In the code below, the following actions are taken:
* A new dataset is created "natality_regression."
* A query is run, the output of which is stored in a new "regression_input" table.
"""
from google.cloud import bigquery
# Create a new Google BigQuery client using Google Cloud Platform project defaults.
project_id = 'nproject'
client = bigquery.Client(project=project_id)
# Prepare a reference to a new dataset for storing the query results.
dataset_id = "natality_regression"
table_id = "regression_input"
table_id_full = f"{project_id}.{dataset_id}.{table_id}"
# Configure the query job.
job_config = bigquery.QueryJobConfig()
# Set the destination table to where you want to store query results.
job_config.destination = table_id_full
job_config.write_disposition = 'WRITE_TRUNCATE' # WRITE_APPEND
job_config.schema = schemma2
#job_config.schema_update_options = ???
#job_config.schema = schema_1
# Set up a query in Standard SQL
query = """
SELECT * FROM `nproject.SalesData1.Sales1` LIMIT 15
"""
# Run the query.
query_job = client.query(query, job_config=job_config)
query_job.result() # Waits for the query to finish
print('danski')

This is indeed handled by the job_config, but the Python syntax looks like this:
job_config = bigquery.LoadJobConfig(
    schema=[
        bigquery.SchemaField("name", "STRING"),
        bigquery.SchemaField("mode", "STRING"),
    ]
)
You can find more details here: https://cloud.google.com/bigquery/docs/schemas?hl=es_419
You can also create the table first, as you mention in your question:
# Construct a BigQuery client object.
client = bigquery.Client()

# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"

schema = [
    bigquery.SchemaField("full_name", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("age", "INTEGER", mode="REQUIRED"),
]

table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table)  # Make an API request.
print(
    "Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
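If you would like to keep the dict-style schema_1 from the question, the client library can convert it for you: SchemaField.from_api_repr turns each dict into the object form that both LoadJobConfig and Table accept. A small sketch under that assumption, reusing the table names from the question (this is not the only way to wire it up, just one option with a reasonably recent google-cloud-bigquery):

from google.cloud import bigquery

schema_1 = [
    {"name": "Region", "type": "STRING", "mode": "NULLABLE"},
    {"name": "Product", "type": "STRING", "mode": "NULLABLE"},
]

# Convert the plain dicts into SchemaField objects.
schema = [bigquery.SchemaField.from_api_repr(field) for field in schema_1]

# The converted schema can be used to create the destination table up front...
table = bigquery.Table("nproject.natality_regression.regression_input", schema=schema)
# table = client.create_table(table)

# ...or passed to a LoadJobConfig when loading files instead of querying.
load_config = bigquery.LoadJobConfig(schema=schema, write_disposition="WRITE_TRUNCATE")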

Related

Load into BigQuery - one column to handle arbitrary json field (empty array, dict with different fields, etc.)

We have the following three JSONs with data that should be loaded into the same table:
{ "name": "tom", "customValues": [] }
{ "name": "joe", "customValues": { "member": "1" } }
{ "name": "joe", "customValues": { "year": "2020", "number": "3" } }
We load the data with the Python BigQuery client, using bigquery.LoadJobConfig:
job_config = bigquery.LoadJobConfig(
    schema=SCHEMA_MAP.get(bq_table) if autodetect == False else None,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE if remove_old == True else bigquery.WriteDisposition.WRITE_APPEND,
    autodetect=autodetect
)
SCHEMA_MAP is a dictionary of arrays, where each array is the schema for one of our tables. We define our BigQuery schemas in Python using bigquery.SchemaField. If each of the 3 JSONs above were going into 3 different tables, I would define their table schemas as:
SCHEMA_T1 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "STRING", mode="REPEATED")
]

SCHEMA_T2 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "RECORD", mode="REPEATED", fields=[
        bigquery.SchemaField("member", "STRING")
    ])
]

SCHEMA_T3 = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "RECORD", mode="REPEATED", fields=[
        bigquery.SchemaField("year", "STRING"),
        bigquery.SchemaField("number", "STRING")
    ])
]
Is it possible to define the customValues column so that it handles all 3 of these different data shapes in one single table? How would the schema be defined for this? Currently, if SCHEMA_T1 is used and data in the form of T2 or T3 is uploaded, the upload fails with: Error while reading data, error message: JSON parsing error in row starting at position 0: JSON object specified for non-record field: customValues. The other schemas fail with similar errors. Is there a generic "any JSON" field type in BigQuery that can be used for this?
The JSON feature is still in preview for BigQuery (see launch stages). As a workaround, you can use load_table_from_dataframe from the BigQuery client, refining the column that varies in shape before pushing the data into your working table.
Let's look at your scenario. Say we have a data.json file with the raw data:
data.json
[
    {
        "name": "tom",
        "customValues": []
    },
    {
        "name": "joe",
        "customValues": {
            "member": "1"
        }
    },
    {
        "name": "joe",
        "customValues": {
            "year": "2020",
            "number": "3"
        }
    }
]
And we have a single table on BigQuery that we need to populate:
create or replace table `my-project.my-dataset.a-table` (
    name STRING,
    customValues STRING
)
load.py
from google.cloud import bigquery
import pandas as pd

client = bigquery.Client()
table_id = "project-id.dataset-id.a-table"

# Load the raw JSON into a dataframe and stringify the variable-shape column.
df = pd.read_json('data.json')
df["customValues"] = df["customValues"].apply(str)
print(df.shape)
print(df.head())

# load_table_from_dataframe serializes the dataframe to Parquet before loading.
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET, autodetect=True)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()

table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(table.num_rows, len(table.schema), table_id))
output
| **name** | **customValues** |
|----------|---------------------------------|
| tom | [] |
| joe | {'member': '1'} |
| joe | {'year': '2020', 'number': '3'} |
As you can see, regardless of the structure of customValues, we are able to insert it into our working table (which only has 2 columns). We load the JSON data into a dataframe and then update the column's datatype to fit the column type by using apply. For more information about using apply, please visit this link.
BigQuery now supports JSON as a data type (from January 2022, you are lucky with the timing!), ref here.
Therefore you should be able to go with:
SCHEMA_T = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "JSON")
]
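As a rough sketch of how that schema might be wired into the existing load, assuming newline-delimited JSON input where the customValues values are written as native JSON objects or arrays (the table ID and data.ndjson file name are placeholders, not from the original post):

from google.cloud import bigquery

client = bigquery.Client()
table_id = "project-id.dataset-id.a-table"  # placeholder destination table

SCHEMA_T = [
    bigquery.SchemaField("name", "STRING"),
    bigquery.SchemaField("customValues", "JSON"),
]

job_config = bigquery.LoadJobConfig(
    schema=SCHEMA_T,
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

# data.ndjson holds one JSON object per line, e.g. {"name": "joe", "customValues": {"member": "1"}}
with open("data.ndjson", "rb") as source_file:
    job = client.load_table_from_file(source_file, table_id, job_config=job_config)
job.result()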

How to call a Python function by getting MongoDB collection values

How do I create a document and collection in MongoDB to hold configuration for my Python code? I want to get the attribute name, datatype, and the function to be called from MongoDB.
A sample MongoDB collection looks like this:
db.attributes.insertMany([
    { attributes_names: "email", attributes_datype: "string", attributes_isNull: "false", attributes_std_function: "email_valid" },
    { attributes_names: "address", attributes_datype: "string", attributes_isNull: "false", attributes_std_function: "address_valid" }
]);
Python script and function:
from pyspark.sql.functions import regexp_replace, lower, expr

def email_valid(df):
    df1 = df.withColumn(df.columns[0], regexp_replace(lower(df.columns[0]), "^a-zA-Z0-9#\._\-| ", ""))
    extract_expr = expr(
        "regexp_extract_all(emails, '(\\\w+([\\\.-]?\\\w+)*#\\[A-Za-z\-\.]+([\\\.-]?\\\w+)*(\\\.\\\w{2,3})+)', 0)")
    df2 = df1.withColumn(df.columns[0], extract_expr) \
        .select(df.columns[0])
    return df2
How do I get all the MongoDB values in the Python script and call the function according to the attributes?
To create a MongoDB collection from a Python script:
import pymongo
from bson import ObjectId
from random_object_id import generate

# connect to your mongodb client
client = pymongo.MongoClient(connection_url)

# connect to the database
db = client[database_name]

# get the collection
mycol = db[collection_name]

# create a sample dictionary for the collection data
mydict = { "_id": ObjectId(generate()),
           "attributes_names": "email",
           "attributes_datype": "string",
           "attributes_isNull": "false",
           "attributes_std_function": "email_valid" }

# insert the dictionary into the collection
mycol.insert_one(mydict)
To insert multiple documents into MongoDB, use insert_many() instead of insert_one() and pass it a list of dictionaries, as shown after the list below. Your list of dictionaries will look like this:
mydict = [{ "_id": ObjectId(generate()),
            "attributes_names": "email",
            "attributes_datype": "string",
            "attributes_isNull": "false",
            "attributes_std_function": "email_valid" },
          { "_id": ObjectId(generate()),
            "attributes_names": "address",
            "attributes_datype": "string",
            "attributes_isNull": "false",
            "attributes_std_function": "address_valid" }]
To get all the data from the MongoDB collection into the Python script:
import pandas as pd

data = list()
for x in mycol.find():
    data.append(x)

data = pd.json_normalize(data)
And then access the values from the normalized DataFrame, for example the first document's attribute name:
value = data.loc[0, "attributes_names"]
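To actually dispatch to a function based on what is stored in the collection, one option is a plain lookup table from the stored function name to the Python callable. A sketch under that assumption: FUNCTION_REGISTRY is a name invented here, email_valid/address_valid are the validators referenced in the question (address_valid is not shown there, so it is assumed to exist), and df is whatever Spark DataFrame you want validated.

# map the names stored in MongoDB to the actual Python callables (hypothetical registry)
FUNCTION_REGISTRY = {
    "email_valid": email_valid,
    "address_valid": address_valid,  # assumed to be defined alongside email_valid
}

for doc in mycol.find():
    func_name = doc["attributes_std_function"]
    func = FUNCTION_REGISTRY.get(func_name)
    if func is None:
        print(f"No function registered for {func_name!r}, skipping")
        continue
    result_df = func(df)  # e.g. calls email_valid(df) for the "email" attribute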

How to copy a non-partitioned table into an ingestion-time partitioned table in BigQuery using Python?

The use case is as follows:
We have a table foo whose data is replaced every day. We want to start keeping the old data in an ingestion-time partitioned history table called foo_HIST.
I have the following code, using google-cloud-bigquery 1.6.1:
bq_client = bigquery.Client(project=env_conf.gcp_project_id)
dataset = bigquery.dataset.DatasetReference(
    env_conf.gcp_project_id, env_conf.bq_dataset
)

full_table_src = table_conf.table_name()
table_src = dataset.table(full_table_src)

table_dst_name = f"{full_table_src}_HIST"
table_dst = dataset.table(table_dst_name)
table_dst.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.HOUR,
)

# Truncate per partition.
job_config = bigquery.CopyJobConfig(
    create_disposition="CREATE_IF_NEEDED",
    write_disposition="WRITE_TRUNCATE",
)
job = bq_client.copy_table(table_src, table_dst, job_config=job_config)
The new table is indeed created, but when I check it with the bq CLI, it does not appear to be a partitioned table. Here is the output:
bq show --format=prettyjson dataset_id.foo_HIST
{
  "creationTime": "1616418131814",
  "etag": "iqfdDzv2ifdsfERfwTiFjQ==",
  "id": "project_id:dataset_id.foo_HIST",
  "kind": "bigquery#table",
  "lastModifiedTime": "1616418131814",
  "location": "EU",
  "numBytes": "32333",
  "numLongTermBytes": "0",
  "numRows": "406",
  "schema": {
    "fields": [
      {
        "mode": "NULLABLE",
        "name": "MPG",
        "type": "FLOAT"
      },
    ]
  },
  "selfLink": "https://bigquery.googleapis.com/bigquery/v2/projects/project_id/datasets/dataset_id/tables/foo_HIST",
  "tableReference": {
    "datasetId": "dataset_id",
    "projectId": "project_id",
    "tableId": "foo_HIST"
  },
  "type": "TABLE"
}
To anyone wondering how to copy a non-partitioned table into a partitioned table (creating it if needed) in Python:
It seems CopyJob does not support this out of the box, unlike QueryJob. Here is the final snippet using QueryJob:
bq_client = bigquery.Client(project=gcp_project_id)
dataset = bigquery.dataset.DatasetReference(
    gcp_project_id, dataset_id
)

table_src = dataset.table(table_name)
table_dst_name = f"{table_name}_HIST"
table_dst = dataset.table(table_dst_name)

query = f"""
    SELECT *
    FROM `{gcp_project_id}.{dataset_id}.{table_name}`
"""
job_config = bigquery.QueryJobConfig(
    create_disposition="CREATE_IF_NEEDED",
    write_disposition="WRITE_APPEND",
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.HOUR,
    ),
    use_legacy_sql=False,
    allow_large_results=True,
    destination=table_dst,
)
job = bq_client.query(query, job_config=job_config)
job.result()  # Wait for the job to finish

Google BigQuery: In Python, column addition makes all the other columns Nullable

I have a table that already exists with the following schema:
{
  "schema": {
    "fields": [
      {
        "mode": "required",
        "name": "full_name",
        "type": "string"
      },
      {
        "mode": "required",
        "name": "age",
        "type": "integer"
      }
    ]
  }
}
It already contains entries like:
{'full_name': 'John Doe',
'age': int(33)}
I want to insert a new record with a new field and have the load job automatically add the new column as it loads. The new format looks like this:
record = {'full_name': 'Karen Walker',
'age': int(48),
'zipcode': '63021'}
My code is as follows:
from google.cloud import bigquery

client = bigquery.Client(project=projectname)
table = client.get_table(table_id)

config = bigquery.LoadJobConfig()
config.autodetect = True
config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
]

job = client.load_table_from_json([record], table, job_config=config)
job.result()
This results in the following error:
400 Provided Schema does not match Table my_project:my_dataset:mytable. Field age has changed mode from REQUIRED to NULLABLE
I can fix this by changing config.schema_update_options as follows:
config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
]
This allows me to insert the new record, with zipcode added to the schema, but it causes both full_name and age to become NULLABLE, which is not the behavior I want. Is there a way to prevent schema auto-detect from changing the existing columns?
If you need to add fields to your schema, you can do the following:
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("your-project.your-dataset.your-table")

original_schema = table.schema   # Get your current table's schema
new_schema = original_schema[:]  # Create a copy of the schema

# Add the new field to the schema
new_schema.append(bigquery.SchemaField("new_field", "STRING"))

# Set the new schema on your table object
table.schema = new_schema

# Call the API to update your table with the new schema
table = client.update_table(table, ["schema"])
After updating your table's schema you can load your new records with the additional field, without needing any schema update options on the load job.
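For example, a load along these lines should then append the new record without relaxing the existing REQUIRED fields. This is only a sketch, reusing client, table, and record from the snippets above and assuming the appended field matches the column you just added:

job_config = bigquery.LoadJobConfig(
    schema=table.schema,  # the updated schema, including the newly added field
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)
job = client.load_table_from_json([record], table, job_config=job_config)
job.result()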

How to protect against SQL Injection with pandas read_gbq

How do I use pandas_gbq.read_gbq safely to protect against SQL injection? I cannot find a way to parameterize it in the docs.
I've looked through the docs for a way to parameterize, as well as Google's website and other sources.
df_valid = read_gbq(QUERY_INFO.format(variable), project_id='project-1622', location='EU')
where the query looks like
SELECT name, date FROM table WHERE id = '{0}'
I can input p' or '1'='1 and it works.
Per the Google BigQuery docs, you have to pass a query configuration with a parameterized statement:
import pandas as pd

sql = "SELECT name, date FROM table WHERE id = @id"
query_config = {
    'query': {
        'parameterMode': 'NAMED',
        'queryParameters': [
            {
                'name': 'id',
                'parameterType': {'type': 'STRING'},
                'parameterValue': {'value': '1'}
            }
        ]
    }
}

df = pd.read_gbq(sql, project_id='project-1622', location='EU', configuration=query_config)
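If you are already using the google-cloud-bigquery client alongside pandas, an alternative is to build the parameters with ScalarQueryParameter and pull the result into a dataframe. A sketch of that approach (the project ID and query mirror the question; user_supplied_id is a hypothetical variable holding the untrusted input):

from google.cloud import bigquery

client = bigquery.Client(project='project-1622')
sql = "SELECT name, date FROM table WHERE id = @id"

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("id", "STRING", user_supplied_id)  # untrusted input
    ]
)
# The parameter value is never interpolated into the SQL text, so input such as
# "p' or '1'='1" is treated as a literal id string rather than SQL.
df = client.query(sql, job_config=job_config, location='EU').to_dataframe()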
