Google BigQuery: In Python, column addition makes all the other columns Nullable

I have a table that already exists with the following schema:
{
"schema": {
"fields": [
{
"mode": "required",
"name": "full_name",
"type": "string"
},
{
"mode": "required",
"name": "age",
"type": "integer"
}]
}
}
It already contains entries like:
{'full_name': 'John Doe',
'age': int(33)}
I want to insert a new record with a new field and have the load job automatically add the new column as it loads. The new format looks like this:
record = {'full_name': 'Karen Walker',
'age': int(48),
'zipcode': '63021'}
My code is as follows:
from google.cloud import bigquery
client = bigquery.Client(project=projectname)
table = client.get_table(table_id)
config = bigquery.LoadJobConfig()
config.autodetect = True
config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
config.schema_update_options = [
bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
]
job = client.load_table_from_json([record], table, job_config=config)
job.result()
This results in the following error:
400 Provided Schema does not match Table my_project:my_dataset:mytable. Field age has changed mode from REQUIRED to NULLABLE
I can fix this by changing config.schema_update_options as follows:
config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION,
]
This allows me to insert the new record, with zipcode added to the schema, but it causes both full_name and age to become NULLABLE, which is not the behavior I want. Is there a way to prevent schema auto-detect from changing the existing columns?

If you need to add fields to your schema, you can do the following:
from google.cloud import bigquery
client = bigquery.Client()
table = client.get_table("your-project.your-dataset.your-table")
original_schema = table.schema # Get your current table's schema
new_schema = original_schema[:] # Creates a copy of the schema.
# Add new field to schema
new_schema.append(bigquery.SchemaField("new_field", "STRING"))
# Set new schema in your table object
table.schema = new_schema
# Call API to update your table with the new schema
table = client.update_table(table, ["schema"])
After updating your table's schema, you can load your new records with this additional field without needing any schema update options.
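A rough sketch of the follow-up load (reusing projectname, table_id and record from the question); once the column exists, no schema_update_options or autodetect are needed:
from google.cloud import bigquery

client = bigquery.Client(project=projectname)
table = client.get_table(table_id)

config = bigquery.LoadJobConfig()
config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND

# zipcode is now part of the table schema, so the record loads as-is
job = client.load_table_from_json([record], table, job_config=config)
job.result()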

Related

Upload table to bigquery using Colab, specifying schema in job_config

I am trying to write a table to BigQuery using Colab. The best way I have found is using the client and job_config. It is important that I maintain control over how data is written, as I plan to use the code below for different tasks. The last step that eludes me is setting up the schema. I do not want someone's query to crash because, say, Year is suddenly an integer instead of a string. Should the code below work? Or perhaps I need to use "job_config.schema_update_options", but I am not sure how the schema object should look. I cannot use pandas gbq as it is too slow to write to a dataframe first. The table would be overwritten each month, that is why write_truncate. Thanks
schema_1 = [
{ "name": "Region", "type": "STRING", "mode": "NULLABLE" },
{ "name": "Product", "type": "STRING", "mode": "NULLABLE" } ]
schemma2 = [('Region', 'STRING', 'NULLABLE', None, ()),
('Product', 'STRING', 'NULLABLE', None, ())]
"""Create a Google BigQuery input table.
In the code below, the following actions are taken:
* A new dataset is created "natality_regression."
* A query is run, the output of which is stored in a new "regression_input" table.
"""
from google.cloud import bigquery
# Create a new Google BigQuery client using Google Cloud Platform project defaults.
project_id = 'nproject'
client = bigquery.Client(project=project_id)
# Prepare a reference to a new dataset for storing the query results.
dataset_id = "natality_regression"
table_id = "regression_input"
table_id_full = f"{project_id}.{dataset_id}.{table_id}"
# Configure the query job.
job_config = bigquery.QueryJobConfig()
# Set the destination table to where you want to store query results.
job_config.destination = table_id_full
job_config.write_disposition = 'WRITE_TRUNCATE' # WRITE_APPEND
job_config.schema = schemma2
#job_config.schema_update_options = ???
#job_config.schema = schema_1
# Set up a query in Standard SQL
query = """
SELECT * FROM `nproject.SalesData1.Sales1` LIMIT 15
"""
# Run the query.
query_job = client.query(query, job_config=job_config)
query_job.result() # Waits for the query to finish
print('danski')
This is effectively done through the job_config, but the Python syntax looks like this:
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("name", "STRING"),
bigquery.SchemaField("mode", "STRING"),
]
)
You can find more details here: https://cloud.google.com/bigquery/docs/schemas?hl=es_419
You can also create the table first, as you mentioned:
# Construct a BigQuery client object.
client = bigquery.Client()
# TODO(developer): Set table_id to the ID of the table to create.
# table_id = "your-project.your_dataset.your_table_name"
schema = [
bigquery.SchemaField("full_name", "STRING", mode="REQUIRED"),
bigquery.SchemaField("age", "INTEGER", mode="REQUIRED"),
]
table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table) # Make an API request.
print(
"Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
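If you prefer to keep the schema written as plain dictionaries (like schema_1 in the question), a minimal sketch is to convert them to SchemaField objects with SchemaField.from_api_repr before handing them to the job config:
from google.cloud import bigquery

schema_1 = [
    {"name": "Region", "type": "STRING", "mode": "NULLABLE"},
    {"name": "Product", "type": "STRING", "mode": "NULLABLE"},
]

# Convert the plain dicts into SchemaField objects the client understands
schema_fields = [bigquery.SchemaField.from_api_repr(f) for f in schema_1]

job_config = bigquery.LoadJobConfig(schema=schema_fields)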

Load into BigQuery - one column to handle arbitrary json field (empty array, dict with different fields, etc.)

We have the following three JSONs with data that should be loaded in the same table:
{ "name": "tom", "customValues": [] }
{ "name": "joe", "customValues": { "member": "1" } }
{ "name": "joe", "customValues": { "year": "2020", "number": "3" } }
We load data with the python bigquery.LoadJobConfig function:
job_config = bigquery.LoadJobConfig(
schema=SCHEMA_MAP.get(bq_table) if autodetect == False else None,
source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE if remove_old == True else bigquery.WriteDisposition.WRITE_APPEND,
autodetect=autodetect
)
SCHEMA_MAP is a dictionary of arrays, where each array is the schema for one of our tables. We define our BigQuery schema in Python using the bigquery.SchemaField class. If each of the 3 JSONs above were going into 3 different tables, I would have their table schemas defined as:
SCHEMA_T1 = [
bigquery.SchemaField("name", "STRING"),
bigquery.SchemaField("customValues", "STRING", mode="REPEATED")
]
SCHEMA_T2 = [
bigquery.SchemaField("name", "STRING"),
bigquery.SchemaField("customValues", "RECORD", mode="REPEATED", fields=[
bigquery.SchemaField("member", "STRING")
])
]
SCHEMA_T3 = [
bigquery.SchemaField("name", "STRING"),
bigquery.SchemaField("customValues", "RECORD", mode="REPEATED", fields=[
bigquery.SchemaField("year", "STRING"),
bigquery.SchemaField("number", "STRING")
])
]
Is it possible to define the customValues column to handle all 3 of these different data types in one single table? How would the schema be defined for this? Currently, if SCHEMA_T1 is used and data in the forms of T2 or T3 is uploaded, the upload fails and it returns an error Error while reading data, error message: JSON parsing error in row starting at position 0: JSON object specified for non-record field: customValues. Similar errors for the other schemas. Is there a generic any json field in BigQuery that can be used for this?
Since the JSON feature is still in preview for BigQuery (see launch stages), as a workaround you can use load_table_from_dataframe from the BigQuery client to load data from columns that might require some refinement before pushing into your working table.
Let's look at your scenario. Say we have a data.json file with the raw data:
data.json
[
{
"name": "tom",
"customValues": []
},
{
"name": "joe",
"customValues": {
"member": "1"
}
},
{
"name": "joe",
"customValues": {
"year": "2020",
"number": "3"
}
}
]
And we have a single table on BigQuery that we need to populate:
create or replace table `my-project.my-dataset.a-table` (
name STRING,
customValues STRING
)
load.py
from google.cloud import bigquery
import pandas as pd
client = bigquery.Client()
table_id = "project-id.dataset-id.a-table"
df = pd.read_json('data.json')
df["customValues"]= df["customValues"].apply(str)
print(df.shape)
print(df.head())
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET, autodetect=True)
job = client.load_table_from_dataframe(df, table_id, job_config=job_config)
job.result()
table = client.get_table(table_id)
print("Loaded {} rows and {} columns to {}".format(table.num_rows, len(table.schema), table_id))
output
| **name** | **customValues** |
|----------|---------------------------------|
| tom | [] |
| joe | {'member': '1'} |
| joe | {'year': '2020', 'number': '3'} |
As you can see, regardless of the structure of customValues we are able to insert it into our working table (which only has 2 columns). We load the JSON data into a dataframe and then adjust the column to fit our column type using apply. For more information, see the pandas documentation for Series.apply.
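One caveat with apply(str): Python's str() renders dicts with single quotes, which is not valid JSON text, so the stored value would not parse with BigQuery's JSON functions later. If you want to keep that option open, a small variation of the sketch above is to serialize with json.dumps instead:
import json
import pandas as pd

df = pd.read_json('data.json')
# json.dumps produces valid JSON text (double quotes), unlike str()
df["customValues"] = df["customValues"].apply(json.dumps)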
BigQuery now supports JSON as a data type (since January 2022, you are lucky on the timing!); see the BigQuery documentation on the JSON type.
Therefore you should be able to go with:
SCHEMA_T = [
bigquery.SchemaField("name", "STRING"),
bigquery.SchemaField("customValues", "JSON")
]
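With a JSON column you can then pull individual values out in SQL with the JSON functions; a minimal sketch (the table name here is just a placeholder) could be:
from google.cloud import bigquery

client = bigquery.Client()
query = """
    SELECT name, JSON_VALUE(customValues, '$.member') AS member
    FROM `my-project.my-dataset.a-table`
"""
# JSON_VALUE returns NULL for rows where the path is absent (e.g. tom's empty array)
for row in client.query(query).result():
    print(row["name"], row["member"])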

How to call a Python function by getting MongoDB collection values

How do I create a document and collection in MongoDB to drive Python code configuration, i.e. get the attribute name, datatype, and function to be called from MongoDB?
Sample MongoDB collection:
db.attributes.insertMany([
    { attributes_names: "email", attributes_datype: "string", attributes_isNull: "false", attributes_std_function: "email_valid" },
    { attributes_names: "address", attributes_datype: "string", attributes_isNull: "false", attributes_std_function: "address_valid" }
]);
Python script and function
def email_valid(df):
    df1 = df.withColumn(df.columns[0], regexp_replace(lower(df.columns[0]), "^a-zA-Z0-9#\._\-| ", ""))
    extract_expr = expr(
        "regexp_extract_all(emails, '(\\\w+([\\\.-]?\\\w+)*#\\[A-Za-z\-\.]+([\\\.-]?\\\w+)*(\\\.\\\w{2,3})+)', 0)")
    df2 = df1.withColumn(df.columns[0], extract_expr) \
        .select(df.columns[0])
    return df2
How do I get all the MongoDB values in the Python script and call the function according to the attributes?
To create a MongoDB collection from a Python script:
import pymongo
# connect to your mongodb client
client = pymongo.MongoClient(connection_url)
# connect to the database
db = client[database_name]
# get the collection
mycol = db[collection_name]
from bson import ObjectId
from random_object_id import generate
# create a sample dictionary for the collection data
mydict = { "_id": ObjectId(generate()),
"attributes_names": "email",
"attributes_datype": "string",
"attributes_isNull":"false",
"attributes_std_function" : "email_valid" }
# insert the dictionary into the collection
mycol.insert_one(mydict)
To insert multiple values into MongoDB, use insert_many() instead of insert_one() and pass the list of dictionaries to it. So your list of dictionaries will look like this:
mydict = [{ "_id": ObjectId(generate()),
"attributes_names": "email",
"attributes_datype": "string",
"attributes_isNull":"false",
"attributes_std_function" : "email_valid" },
{ "_id": ObjectId(generate()),
"attributes_names": "email",
"attributes_datype": "string",
"attributes_isNull":"false",
"attributes_std_function" : "email_valid" }]
To get all the data from the MongoDB collection into the Python script:
data = list()
for x in mycol.find():
    data.append(x)
import pandas as pd
data = pd.json_normalize(data)
And then access a value by column name and row index on the normalized DataFrame:
value = data["attributes_names"][0]
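To then call the right function for each attribute, one sketch (assuming email_valid and address_valid from the question are defined in the same script, and df is the dataframe you want to validate) is to map the stored function names to the actual callables:
# map the names stored in MongoDB to the Python functions
FUNCTIONS = {
    "email_valid": email_valid,
    "address_valid": address_valid,
}

for doc in mycol.find():
    func = FUNCTIONS.get(doc["attributes_std_function"])
    if func is None:
        print("No function registered for", doc["attributes_std_function"])
        continue
    # run the validation function for this attribute
    result = func(df)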

How to insert document with collection.update_many() into Collection (MongoDB) using Pymongo (No Duplicated)

I insert documents into a collection with collection.update() because each piece of data has a different postID. I want that, when I run it again, a post that was already inserted into MongoDB gets updated instead of a new post being inserted with a duplicate postID. This is the structure of my data:
comment1 = [
{
'commentParentId': parent_content.text,
'parentId': parent_ID,
'posted': child_time.text,
'postID':child_ID,
'author':
{
'name': child_name.text
},
'content': child_content.text
},
...............
]
This is the code I use to insert data:
client = MongoClient()
db = client['comment_data2']
db.collection_data = db['comments']
for i in data_comment:
    db.collection_data.update_many(
        {db.collection_data.find({"postID": {"$in": i["postID"]}})},
        {"$set": i},
        {'upsert': True}
    )
But I get an error: TypeError: filter must be an instance of dict, bson.son.SON, or other type that inherits from collections.Mapping at the line {'upsert': True}. And is {db.collection_data.find({"postID": {"$in": i["postID"]}})} right?
You can use this code:
db.collection_data.update_many(
    {"postID": i["postID"]},
    {"$set": i},
    upsert=True
)
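Used inside the loop from the question, a sketch of the whole write looks like this:
from pymongo import MongoClient

client = MongoClient()
db = client['comment_data2']
collection = db['comments']

for i in data_comment:
    # match on the plain postID value; upsert=True inserts the document
    # when no post with that postID exists yet
    collection.update_many(
        {"postID": i["postID"]},
        {"$set": i},
        upsert=True
    )
Since each postID identifies a single post, update_one with the same arguments would work as well.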

How can I query a MongoDB database with different child levels?

I'm new to mongoDB using pymongo. I'm trying to query a collection and also get a specific child from a field. This is what I tried:
import pymongo
import csv
from pymongo import MongoClient
connection = MongoClient()
db = connection.database
collection1 = db.data1
collection2 = db.data2
writer = csv.writer(open("Result_example.csv", "w"))
with open('Data_example.csv') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=';')
    for row in spamreader:
        for rows in collection1.find({"_id": row[0]}, {"childs.first.name": 1}):
            writer.writerow([row[0], rows.get("childs.first.name")])
The database structure is like this:
child
    first
        name
What I want to get is the name...Any ideas?
Thanks!!!
Other than the field being pluralized in the query while the example structure uses the singular child, the following query looks fine:
for rows in collection1.find({"_id": row[0]}, {"child.first.name": 1}):
Note that the child field is singular.
rows is a reference to a dictionary object like below:
{
'child': {
'first': {
'name': 'Vorname'
}
}
}
rows.get("childs.first.name") returns None in writer.writerow([row[0], rows.get("childs.first.name")])
You can retrieve the name using
rows.get('child').get('first').get('name')
Or
rows['child']['first']['name']
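If some documents might be missing the nested fields, a small defensive variant (just a sketch, not required by the question) uses empty dicts as defaults so the chain never raises:
name = rows.get('child', {}).get('first', {}).get('name')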
You can save these nested key accesses by running an aggregation that returns the document id and firstname in place of collection1.find({"_id": row[0]}, { "child.first.name": 1}).
from bson import ObjectId  # required for ObjectId(); use it only if your _id values are ObjectIds

children_names = collection1.aggregate([
{
'$match': {'_id': ObjectId(row[0])}
},
{
'$replaceRoot': {'newRoot': {'_id': '$_id', 'first_name': '$child.first.name' }}
},
])
Key access could then be done once.
for rows in children_names:
    writer.writerow([row[0], rows.get("first_name")])
