I have JSON data whose fields are inconsistent between records.
{
    "Firsthouse": {
        "Doors": "10",
        "windows": "9"
    },
    "Secondhouse": {
        "doors": "1",
        "windows": "10",
        "pools": "2"
    }
}
In "Secondhouse" field "pools" is present while it is absent in "Firsthouse".
If I want to write an insert query, do I need to have 6 different queries for presence/absence of such fields, like below:
#This is a query when 3 fields are present
query = "insert into table (doors,windows,pools) values (%s,%s,%s)"
q_tup = data_list_3Fields
cursor.executemany(query, q_tup)
#This is a query when 4 fields are present
query = "insert into table (doors,windows,pools,floors) values (%s,%s,%s,%s)"
q_tup = data_list_4Fields
cursor.executemany(query, q_tup)
Is there a proper approach to do this?
I have a table that already exists with the following schema:
{
    "schema": {
        "fields": [
            {
                "mode": "required",
                "name": "full_name",
                "type": "string"
            },
            {
                "mode": "required",
                "name": "age",
                "type": "integer"
            }
        ]
    }
}
It already contains entries like:
{'full_name': 'John Doe',
'age': int(33)}
I want to insert a new record with a new field and have the load job automatically add the new column as it loads. The new format looks like this:
record = {'full_name': 'Karen Walker',
'age': int(48),
'zipcode': '63021'}
My code is as follows:
from google.cloud import bigquery
client = bigquery.Client(project=projectname)
table = client.get_table(table_id)
config = bigquery.LoadJobConfig()
config.autodetect = True
config.source_format = bigquery.SourceFormat.NEWLINE_DELIMITED_JSON
config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
]
job = client.load_table_from_json([record], table, job_config=config)
job.result()
This results in the following error:
400 Provided Schema does not match Table my_project:my_dataset:mytable. Field age has changed mode from REQUIRED to NULLABLE
I can fix this by changing config.schema_update_options as follows:
config.schema_update_options = [
    bigquery.SchemaUpdateOption.ALLOW_FIELD_ADDITION,
    bigquery.SchemaUpdateOption.ALLOW_FIELD_RELAXATION
]
This allows me to insert the new record, with zipcode added to the schema, but it causes both full_name and age to become NULLABLE, which is not the behavior I want. Is there a way to prevent schema auto-detect from changing the existing columns?
If you need to add fields to your schema, you can do the following:
from google.cloud import bigquery
client = bigquery.Client()
table = client.get_table("your-project.your-dataset.your-table")
original_schema = table.schema # Get your current table's schema
new_schema = original_schema[:] # Creates a copy of the schema.
# Add new field to schema
new_schema.append(bigquery.SchemaField("new_field", "STRING"))
# Set new schema in your table object
table.schema = new_schema
# Call API to update your table with the new schema
table = client.update_table(table, ["schema"])
After updating your table's schema, you can load your new records containing the additional field without any special schema configuration in the load job.
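As a rough sketch (assuming the record, client, and updated table objects from above are still in scope), the follow-up load could then be as simple as:
# Sketch: append the new record once the schema already contains the extra column.
# No schema_update_options are needed here because the column now exists.
load_config = bigquery.LoadJobConfig()
load_config.write_disposition = bigquery.WriteDisposition.WRITE_APPEND
job = client.load_table_from_json([record], table, job_config=load_config)
job.result()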
I need to save data from a JSON file into a MySQL database. The database mydb currently has two tables, table1 and table2. Some of the JSON data needs to go into table1 and some into table2. Both table1 and table2 have already been created.
Below is a sample of the JSON data file:
{
    "response": {
        "dev_log": {
            "data": [
                {
                    "id": "1",
                    "timestamp": "2020-01-16 10:11:12",
                    "email": "johnd#gmail.com"
                },
                {
                    "id": "2",
                    "timestamp": "2020-02-27 15:33:34",
                    "email": "zack#gmail.com"
                },
                {
                    "id": "3",
                    "timestamp": "2020-02-27 15:34:07",
                    "email": "edy#yahoo.com"
                }
            ],
            "total_dev_log": "1423"
        },
        "client_log": {
            "data": [
                {
                    "customer_city": "LONDON",
                    "customer_login": "AAAAAAAAAAAAAA",
                    "customer_state": "MC",
                    "details": "aaaaaaaaaaa-bbbbbbbbbbbbbbb-cccccccccccccc ",
                    "log_number": "1",
                    "dept": "Sales",
                    "staff_id": "S123",
                    "staff_name": "EricY",
                    "timestamp": "2020-02-27 15:57:24"
                },
                {
                    "customer_city": "SINGAPORE",
                    "customer_login": "BBBBBBBBBBBBB",
                    "customer_state": "XX",
                    "details": "ddddddddd-eeeeeeeeeeee-ffffffffffff ",
                    "log_number": "1",
                    "dept": "Eng",
                    "staff_id": "S456",
                    "staff_name": "YongG",
                    "timestamp": "2020-02-27 15:57:24"
                }
            ],
            "total_hero_query": "13"
        },
        "response_time": "0.723494",
        "transaction_id": "909122",
        "transaction_status": "OK",
        "transaction_time": "Fri Feb 28 15:27:51 2020"
    }
}
Here, in the JSON we have 'dev_log' and 'client_log'.
Thus, all the values of dev_log should be saved into table1 and those of client_log into table2 of the mydb database.
The drafted code is below:
import pymysql
import os
import json
#import ast
#Read Json string file
with open('datfile.json', 'r') as f:
    datDict = json.load(f)
#connect to MySQL
con = pymysql.connect(host = 'localhost',user = 'root',passwd = 'root',db = 'mydb')
cursor = con.cursor()
#Parse data to SQL insert
#for i, item in enumerate(datDict):
#id = ("id", None)
#timestamp = ("timestamp", None)
#email = ("email", None)
#cursor.execute("INSERT INTO mytable (id, timestamp, email) VALUES (%s, %s, %s)", (id, timestamp, email))
cursor.executemany("""INSERT INTO table1 VALUES(id, timestamp, email)""", datDict['response']['dev_log']['data'])
con.commit()
con.close()
I'm not sure how to save the file data to SQL, let alone to two different tables. As of now, I can run the code without any error, but it inserts NULL values, as shown below:
mysql> SELECT * FROM table1;
+------+-----------+-------+
| id | timestamp | email |
+------+-----------+-------+
| NULL | NULL | NULL |
| NULL | NULL | NULL |
| NULL | NULL | NULL |
+------+-----------+-------+
3 rows in set (0.00 sec)
mydb database and table table1 have been created.
I really appreciate your help and any advice on how I can proceed further.
Thank you to all.
JSON is only a text representation, so you should first load it into Python objects (dicts and lists). Once this is done, data['response']['dev_log']['data'] is a nice list of dicts, as is data['response']['client_log']['data'].
So you should be able to use queries close to:
data = json.load(jsonfile)
...
curs = con.cursor() # con is a connection to your database...
curs.executemany("""INSERT INTO dev VALUES(:id, :timestamp, :email)""",
data['response']['dev_log']['data'])
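Since the question uses pymysql, whose placeholder style is %(name)s rather than :name, a closer-to-runnable sketch for both tables (using the names from the question's code) might look like the following; the table2 column names are only an assumption chosen to match the JSON keys, so adjust them to your actual schema:
# Sketch: pymysql accepts a sequence of dicts with %(name)s placeholders in executemany.
dev_rows = datDict['response']['dev_log']['data']
client_rows = datDict['response']['client_log']['data']

cursor.executemany(
    "INSERT INTO table1 (id, timestamp, email) "
    "VALUES (%(id)s, %(timestamp)s, %(email)s)",
    dev_rows)

cursor.executemany(
    "INSERT INTO table2 (log_number, dept, staff_id, staff_name, customer_login, "
    "customer_city, customer_state, details, timestamp) "
    "VALUES (%(log_number)s, %(dept)s, %(staff_id)s, %(staff_name)s, %(customer_login)s, "
    "%(customer_city)s, %(customer_state)s, %(details)s, %(timestamp)s)",
    client_rows)

con.commit()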
Respected People,
I am having a problem handling JSON data sent to the server using requests, as I am unable to frame the MySQL query.
{
    "Firsthouse": {
        "Doors": "10",
        "windows": "9"
    },
    "Secondhouse": {
        "doors": "1",
        "windows": "10",
        "pools": "2"
    }
}
This is how I am processing the data on the server:
request_data = request.get_json()
load_data = json.loads(request_data)
If the JSON were consistent with respect to fields (unlike the example above, where "pools" is missing from Firsthouse), I'd have the following query after some further processing:
for i in load_data:
    rows = i['doors'], i['windows'], i['pools']
    data_list.append(rows)
query = "insert into table (doors,windows,pools) values (%s,%s,%s)"
q_tup = data_list
cursor.executemany(query, q_tup)
But the fields are not fixed in my JSON; there could be at most five fields: doors, windows, pools, floors, chimneys.
Should I write 5 queries based on which fields are present in the JSON data, using an if-else block?
Many thanks for any hints/ideas.
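One common alternative (a sketch only, assuming a table named houses with all five columns nullable, plus the load_data dict and cursor from the code above) is to normalize every record to the full set of possible columns and let dict.get() supply None for missing keys, which the driver sends as NULL, so a single parameterized query covers every combination of fields:
# Sketch: one INSERT handles all houses; missing fields become NULL via dict.get().
# Assumes a table `houses` with these five nullable columns.
FIELDS = ["doors", "windows", "pools", "floors", "chimneys"]
query = "INSERT INTO houses (doors, windows, pools, floors, chimneys) VALUES (%s, %s, %s, %s, %s)"

data_list = []
for house in load_data.values():                         # e.g. {"Firsthouse": {...}, "Secondhouse": {...}}
    house = {k.lower(): v for k, v in house.items()}     # normalize "Doors" vs "doors"
    data_list.append(tuple(house.get(f) for f in FIELDS))  # missing keys -> None -> NULL

cursor.executemany(query, data_list)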
This is a two-part question. If you're checking this out, thanks for your time!
Is there a way to make my query faster?
I previously asked a question here, and was eventually able to solve the problem myself.
However, the query I devised to produce my desired results is VERY slow (25+ minutes) when run against my database, which contains 40,000+ records.
The query is serving its purpose, but I'm hoping one of you brilliant people can point out to me how to make the query perform at a more preferred speed.
My query:
with dupe as (
select
json_document->'Firstname'->0->'Content' as first_name,
json_document->'Lastname'->0->'Content' as last_name,
identifiers->'RecordID' as record_id
from (
select *,
jsonb_array_elements(json_document->'Identifiers') as identifiers
from staging
) sub
group by record_id, json_document
order by last_name
)
select * from dupe da where (
select count(*) from dupe db
where db.record_id = da.record_id
) > 1;
Again, some sample data:
Row 1:
{
    "Firstname": "Bobb",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "123",
            "RecordID": "123",
            "SystemID": "Test",
            "LastUpdated": "2017-09-12T02:23:30.817Z"
        },
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        },
        {
            "Content": "def",
            "RecordID": "def",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        }
    ]
}
Row 2:
{
    "Firstname": "Bob",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:26.020Z"
        }
    ]
}
If I were to bring in my query's results, or a portion of the results, into a Python environment where they could be manipulated using Pandas, how could I iterate over the results of my query (or the sub-query) in order to achieve the same end result as with my original query?
Is there an easier way, using Python, to iterate through my un-nested json array in the same way that Postgres does?
For example, after performing this query:
select
json_document->'Firstname'->0->'Content' as first_name,
json_document->'Lastname'->0->'Content' as last_name,
identifiers->'RecordID' as record_id
from (
select *,
jsonb_array_elements(json_document->'Identifiers') as identifiers
from staging
) sub
order by last_name;
How, using Python/Pandas, can I take that query's results and perform something like:
da = datasets[query_results] # to equal my dupe da query
db = datasets[query_results] # to equal my dupe db query
Then perform the equivalent of
select * from dupe da where (
select count(*) from dupe db
where db.record_id = da.record_id
) > 1;
in Python?
I apologize if I do not provide enough information here. I am a Python novice. Any and all help is greatly appreciated! Thanks!!
Try the following, which eliminates your count(*) and instead uses exists.
with dupe as (
select id,
json_document->'Firstname'->0->'Content' as first_name,
json_document->'Lastname'->0->'Content' as last_name,
identifiers->'RecordID' as record_id
from
(select
*,
jsonb_array_elements(json_document->'Identifiers') as identifiers
from staging ) sub
group by
id,
record_id,
json_document
order by last_name )
select * from dupe da
where exists
(select *
from dupe db
where
db.record_id = da.record_id
and db.id != da.id
)
Consider reading the raw, unqueried values of the Postgres json column and using pandas json_normalize() to bind them into a flat dataframe. From there, use pandas drop_duplicates.
To demonstrate, the code below parses your one JSON document into a three-row dataframe, one row per corresponding Identifiers record:
import json
import pandas as pd
json_str = '''
{
    "Firstname": "Bobb",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "123",
            "RecordID": "123",
            "SystemID": "Test",
            "LastUpdated": "2017-09-12T02:23:30.817Z"
        },
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        },
        {
            "Content": "def",
            "RecordID": "def",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        }
    ]
}
'''
data = json.loads(json_str)
df = pd.io.json.json_normalize(data, 'Identifiers', ['Firstname','Lastname'])
print(df)
# Content LastUpdated RecordID SystemID Lastname Firstname
# 0 123 2017-09-12T02:23:30.817Z 123 Test Smith Bobb
# 1 abc 2017-09-13T10:10:21.598Z abc Test Smith Bobb
# 2 def 2017-09-13T10:10:21.598Z def Test Smith Bobb
For your database, consider connecting with a DB-API such as psycopg2 or SQLAlchemy and parsing each json document as a string accordingly. Admittedly, there may be other ways to handle json, as seen in the psycopg2 docs, but the code below receives the data as text and parses it on the Python side:
import psycopg2
conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()
cur.execute("SELECT json_document::text FROM staging;")
df = pd.io.json.json_normalize([json.loads(row[0]) for row in cur.fetchall()],
'Identifiers', ['Firstname','Lastname'])
df = df.drop_duplicates(['RecordID'])
cur.close()
conn.close()
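If the aim is instead to mirror the original count(*) > 1 filter (keep every row whose RecordID occurs more than once) rather than de-duplicate, a sketch using pandas duplicated() on the normalized frame, before any drop_duplicates is applied, may be closer to the SQL:
# Sketch: rows whose RecordID appears more than once, akin to the SQL count(*) > 1 filter.
# Assumes `df` is the normalized dataframe before drop_duplicates is applied.
dupes = df[df.duplicated('RecordID', keep=False)]
print(dupes.sort_values('Lastname'))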
Is there a tool to convert a SQL statement into Python, if that's possible? For example:
(CASE WHEN var = 2 then 'Yes' else 'No' END) custom_var
==>
customVar = 'Yes' if var == 2 else 'No'
I am trying to provide an API for ETL-like transformations from a JSON input. Here's an example of an input:
{
    "ID": 4,
    "Name": "David",
    "Transformation": "NewField = CONCAT (ID, Name)"
}
And we would translate this into:
{
    "ID": 4,
    "Name": "David",
    "NewField": "4David"
}
Or, is there a better transformation language that could be used here over SQL?
Is SET NewField = CONCAT (ID, Name) actually valid SQL? (If NewField is a variable, do you need to declare it and prefix it with "@"?) If you want to just execute arbitrary SQL, you could hack something together with sqlite:
import sqlite3
import json
query = """
{
    "ID": "4",
    "Name": "David",
    "Transformation": "SELECT ID || Name AS NewField FROM inputdata"
}"""
query_dict = json.loads(query)
db = sqlite3.Connection('mydb')
db.execute('create table inputdata ({} VARCHAR(100));'.format(' VARCHAR(100), '.join(query_dict.keys())))
db.execute('insert into inputdata ({}) values ("{}")'.format(','.join(query_dict.keys()),'","'.join(query_dict.values())))
r = db.execute(query_dict['Transformation'])
response = {}
response[r.description[0][0]] = r.fetchone()[0]
print(response)
#{'NewField': '4David'}
db.execute('drop table inputdata;')
db.close()
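As a side note, the string-formatted INSERT above breaks if a value contains a double quote; a slightly safer variant of that one line, using sqlite3's ? placeholders on the same table and keys, could be:
# Sketch: parameterized insert, so quotes or other special characters in values are handled safely.
columns = ', '.join(query_dict.keys())
placeholders = ', '.join('?' for _ in query_dict)
db.execute('insert into inputdata ({}) values ({})'.format(columns, placeholders),
           list(query_dict.values()))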