Function to Parse Arrays in JSON to Insert Into a SQL Table - python

Trying to take a web API's JSON response and populate a SQL database with the results.
Part of the JSON response has this array:
"MediaLinks": [
{
"MediaType": "Datasheets",
"SmallPhoto": "",
"Thumbnail": "",
"Title": "SN54HC374, SN74HC374",
"Url": "http://www.ti.com/general/docs/suppproductinfo.tsp?distId=10&gotoUrl=http%3A%2F%2Fwww.ti.com%2Flit%2Fgpn%2Fsn74hc374"
},
{
"MediaType": "Product Photos",
"SmallPhoto": "http://media.digikey.com/photos/Texas%20Instr%20Photos/296-20-DIP_sml.jpg",
"Thumbnail": "http://media.digikey.com/photos/Texas%20Instr%20Photos/296-20-DIP_tmb.jpg",
"Title": "20-DIP,R-PDIP-Txx",
"Url": "http://media.digikey.com/photos/Texas%20Instr%20Photos/296-20-DIP.jpg"
},
{
"MediaType": "Featured Product",
"SmallPhoto": "",
"Thumbnail": "",
"Title": "Logic Solutions",
"Url": "https://www.digikey.com/en/product-highlight/t/texas-instruments/logic-solutions "
},
{
"MediaType": "Featured Product",
"SmallPhoto": "",
"Thumbnail": "",
"Title": "Analog Solutions",
"Url": "https://www.digikey.com/en/product-highlight/t/texas-instruments/analog-solutions "
},
{
"MediaType": "PCN Design/Specification",
"SmallPhoto": "",
"Thumbnail": "",
"Title": "Copper Bond Wire Revision A 04/Dec/2013",
"Url": "http://media.digikey.com/pdf/PCNs/Texas%20Instruments/PCN20120223003A_Copper-wire.pdf"
},
{
"MediaType": "PCN Design/Specification",
"SmallPhoto": "",
"Thumbnail": "",
"Title": "Material Set 30/Mar/2017",
"Url": "http://media.digikey.com/pdf/PCNs/Texas%20Instruments/PCN20170310000.pdf"
}
],
For testing I've issued the request, written the response to a file, and I'm experimenting with that file to work out the correct code:
conn.request("POST", "/services/partsearch/v2/partdetails", json.dumps(payload), headers)
res = conn.getresponse()
data = res.read()
data_return = json.loads(data)
print(json.dumps(data_return, indent=4))
with open(y["DigiKeyPartNumber"]+".json", "w") as write_file:
json.dump(data_return, write_file, indent=4, sort_keys=True)
write_file.close()
Then in my test code I've tried this:
import json

with open(r"C:\Users\george\OneDrive\Documents\296-1592-5-ND.json") as json_file:
    data = json.load(json_file)

values = ""
placeholder = '?'
thelist = list(data['PartDetails']['MediaLinks'])
print(type(thelist))
#print(thelist)
placeholders = ', '.join(placeholder for unused in data['PartDetails']['MediaLinks'])
query = 'INSERT INTO thetable VALUES(%s)' % placeholders
print(query)
But this just produces the following output:
<class 'list'>
INSERT INTO thetable VALUES(?, ?, ?, ?, ?, ?)
For reference, this produces what I think will work, except for the trailing comma:
if len(data['PartDetails']['MediaLinks']):
    print('The length is: ' + str(len(data['PartDetails']['MediaLinks'])))
    #print(type(data['PartDetails']['MediaLinks']))
    for mediadata in data['PartDetails']['MediaLinks']:
        #print(mediadata)
        for element in mediadata:
            #print(element + ' is "' + mediadata[element] + '"')
            values += '"' + mediadata[element] + '", '
        #print(list(data['PartDetails']['MediaLinks'][1]))
        print(values + "\n")
        values = ""
else:
    print('It is empty')
Which produces this:
The length is: 6
"Datasheets", "", "", "SN54HC374, SN74HC374", "http://www.ti.com/general/docs/suppproductinfo.tsp?distId=10&gotoUrl=http%3A%2F%2Fwww.ti.com%2Flit%2Fgpn%2Fsn74hc374",
"Product Photos", "http://media.digikey.com/photos/Texas%20Instr%20Photos/296-20-DIP_sml.jpg", "http://media.digikey.com/photos/Texas%20Instr%20Photos/296-20-DIP_tmb.jpg", "20-DIP,R-PDIP-Txx", "http://media.digikey.com/photos/Texas%20Instr%20Photos/296-20-DIP.jpg",
"Featured Product", "", "", "Logic Solutions", "https://www.digikey.com/en/product-highlight/t/texas-instruments/logic-solutions ",
"Featured Product", "", "", "Analog Solutions", "https://www.digikey.com/en/product-highlight/t/texas-instruments/analog-solutions ",
"PCN Design/Specification", "", "", "Copper Bond Wire Revision A 04/Dec/2013", "http://media.digikey.com/pdf/PCNs/Texas%20Instruments/PCN20120223003A_Copper-wire.pdf",
"PCN Design/Specification", "", "", "Material Set 30/Mar/2017", "http://media.digikey.com/pdf/PCNs/Texas%20Instruments/PCN20170310000.pdf",
In the SQL table I've created, the column names match the keys in the JSON array. There are several arrays in the JSON response, so I'm hoping to create a generic function that accepts a JSON array and builds the correct SQL INSERT statements to populate the tables with the JSON data. I'm planning on using pyodbc, and ideally the code would work on both Python 2.7 and 3.x.
Updated Information:
I found the following code snippet which comes very close:
for thedata in data['PartDetails']['MediaLinks']:
    keys, values = zip(*thedata.items())
    print(values)  # This will create the VALUES for the INSERT statement
    print(keys)    # This will create the COLUMNS; need to add the PartDetailsId field
I was trying to find a way to get the keys before running this loop, because I'll have to replace the print statements with the actual SQL INSERT statement.
When I check type(newdata['PartDetails']['MediaLinks']) it returns <class 'list'> in Python 3.7.4, so even though it looks like a dictionary it's treated as a list, and calling .keys() on it fails.
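In other words, the keys live on the list's elements, not on the list itself. A small illustration, with a trimmed-down sample dict standing in for the real response:

```python
data = {"PartDetails": {"MediaLinks": [
    {"MediaType": "Datasheets", "Title": "SN54HC374, SN74HC374"},
    {"MediaType": "Product Photos", "Title": "20-DIP,R-PDIP-Txx"},
]}}

media_links = data["PartDetails"]["MediaLinks"]
# media_links.keys()                  # AttributeError: 'list' object has no attribute 'keys'
columns = list(media_links[0].keys())  # take the keys from the first element instead
print(columns)                         # ['MediaType', 'Title']
```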

Just for completeness I want to post a formatted code snippet that is working for me. This would not have been possible without @Barmar's help, so thanks again.
The end goal is to convert this into a function so that I can pass in the arrays from a JSON response and have it populate the correct SQL tables with the data. This is close to being complete but not quite there yet.
import json
import pyodbc

conn = pyodbc.connect(r'Driver={SQL Server};Server=GEORGE-OFFICE3\SQLEXPRESS01;Database=Components;')
cursor = conn.cursor()

with open(r"C:\Users\george\OneDrive\Documents\296-1592-5-ND.json") as json_file:
    data = json.load(json_file)

x = tuple(data['PartDetails']['MediaLinks'][0])
a = str(x).replace("'", "").replace("(", "")
query = "INSERT INTO MediaLinks (PartDetailsId, " + a + " VALUES(" + str(data['PartDetails']['PartId'])
b = ""
for i in range(len(x)):
    b += ", ?"
b += ")"
query += b
cursor.executemany(query, [tuple(d.values()) for d in data['PartDetails']['MediaLinks']])
cursor.commit()
conn.close()

Use cursor.executemany() to execute the query on all the rows in the MediaLinks list.
You can't pass the dictionaries directly, though, because iterating over a dictionary yields its keys, not its values. You need to convert each dictionary to a tuple of values, using one of the methods in "How to convert a list of dictionaries into a list of lists".
colnames = ", ".join (data['PartDetails']['MediaLinks'][0].keys())
placeholders = ", ".join(["?"] * len(data['PartDetails']['MediaLinks'][0]))
query = "INSERT INTO MediaLInks (" + colnames + ") VALUES (" + placeholders + ")"
cursor.executemany(query, [tuple(d.values()) for d in data['PartDetails']['MediaLinks']])
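The dict-to-tuple step can be seen in isolation (sample rows assumed; on Python 3.7+ dict order is insertion order, so the tuples line up with the column names):

```python
rows = [
    {"MediaType": "Datasheets", "Title": "SN54HC374"},
    {"MediaType": "Product Photos", "Title": "20-DIP"},
]
# Each dict becomes one parameter tuple for executemany().
params = [tuple(d.values()) for d in rows]
print(params)  # [('Datasheets', 'SN54HC374'), ('Product Photos', '20-DIP')]
```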

How to get a value of a dictionary within a list of dictionaries when the dict equals a certain name?

I want to retrieve certain values from a list object that contains strings and dictionaries. If a dictionary's event equals a certain name, I want to add it to the output file.
First I read in a JSON file which looks like this:
{
    "event": "user",
    "timestamp": {
        "$numberDouble": "1671459681.4369426"
    },
    "metadata": {
        "model_id": "125817626"
    },
    "text": "hello",
    "parse_data": {
        "intent": {
            "name": "greet",
            "confidence": {
                "$numberDouble": "1.0"
            }
        },
I have events that equal "user" and those that equal "bot". What I want in the end is an output file that has a user or bot entry on each line. At the beginning of the line I also want the timestamp, formatted as human-readable time, and at the end the text, like:
2020-03-13 12:11:25 user: hello
2020-03-13 12:11:28 bot: Hi
However, I do not know how to access the value of "timestamp", format it and then print it together with "user" and "text" in one line.
What I have done so far is that:
import json
from datetime import datetime

f = open('1234.json')
data = json.load(f)
all_events = [e for e in data.get('events', list()) if e.get('event') == 'user' or e.get('event') == 'bot']
file_object = open('1234.txt', 'a')
for event in all_events:
    file_object.write(str(event.get('timestamp')) + " " + event.get('event') + ": " + event.get('text', '') + "\n")
f.close()
What I get with this is:
{'$numberDouble': '1671459681.4369426'} user: hello
I know that I can use something like this to format the time
ts = int("1584101485")
print(datetime.utcfromtimestamp(ts).strftime('%Y-%m-%d %H:%M:%S'))
But I do not know how to wrap this in my file_object.write command
Instead of getting the timestamp key's value, you need to parse the $numberDouble value inside of the timestamp dictionary:
file_object.write(
    datetime.utcfromtimestamp(float(event['timestamp']['$numberDouble'])).strftime('%Y-%m-%d %H:%M:%S')
    + " " + event.get('event') + ": " + event.get('text', '') + "\n"
)
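Put together, the whole loop might look like the sketch below. The inline sample event is assumed for illustration (it uses `1584101485`, the example timestamp from the question):

```python
from datetime import datetime

# Assumed stand-in for data['events'] after filtering on 'user'/'bot'.
events = [
    {"event": "user", "timestamp": {"$numberDouble": "1584101485"}, "text": "hello"},
]

lines = []
for event in events:
    # Parse the nested $numberDouble string as a float epoch timestamp.
    ts = float(event["timestamp"]["$numberDouble"])
    stamp = datetime.utcfromtimestamp(ts).strftime("%Y-%m-%d %H:%M:%S")
    lines.append(stamp + " " + event.get("event") + ": " + event.get("text", ""))

print(lines[0])  # 2020-03-13 12:11:25 user: hello
```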

How to assign variables using JSON data in python

I wrote a python code using MySQL data, but then I decided to use JSON as a "database" rather than MySQL.
This is MySQL code :
mydb = mysql.connector.connect(host="localhost", user="nn", passwd="passpass")
mycursor = mydb.cursor()

event_fabricant = input('Inscrivez le nom de la compagnie : ')
mycursor.execute("""SELECT name_company, inspecteur1, inspecteur2, inspecteur3, ville, email
                    FROM listedatabase.entreprises_inspecteurs WHERE name_company = %s""", (event_fabricant,))
data = mycursor.fetchall()

if data:
    row = data[0]
    event_location = row[4]
    event_email = row[5]
How do I assign data like I did with MySQL but with JSON?
This is a sample of my JSON data, and below what I did so far.
JSON SAMPLE :
[
    {
        "id": 1,
        "name_company": "Acier Michel",
        "inspecteur1": "Hou, L",
        "inspecteur2": "Caana, C",
        "inspecteur3": "Luc, C",
        "type": "Water",
        "location": "Laval"
    },
    {
        "id": 2,
        "name_company": "Aciers ABC Inc.",
        "inspecteur1": "Vali, M",
        "inspecteur2": "Alemane, K",
        "inspecteur3": "laszik, M",
        "type": "NA",
        "location": "St-Joseph de Sorel"
    }
]
This is what I did so far but it's not exactly what i want :
import json

database = "convertcsv.json"
data = json.loads(open(database).read())

name_company = input("type company name: ")
for item in data:
    if item['nom_entreprise'] == name_company:
        print(item['inspecteur1'])
    else:
        print("Not Found")
What I need instead is to be able to assign to variable1 the inspecteur1 name.
If you want to assign the data, just use: variable1 = item["inspecteur1"].
One issue with your JSON code above is that it will print Not Found for every record that does NOT match, which I don't think is what you want. Try:
found = False
for item in data:
    if item['nom_entreprise'] == name_company:
        print(item['inspecteur1'])
        found = True
if not found:
    print("Not Found")
If you feel like MySQL is too complex for your needs, may I suggest SQLite? It's supported out of the box in Python, there's no server process (just a file) and you get all the database features that JSON does not provide by itself.
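To illustrate the suggestion, a minimal sketch, using an in-memory database and table/column names borrowed from the MySQL code above (a filename instead of ":memory:" gives you a persistent single-file database):

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # or a filename for a persistent, single-file DB
conn.execute("""CREATE TABLE entreprises_inspecteurs
                (id INTEGER, name_company TEXT, inspecteur1 TEXT, location TEXT)""")
conn.execute("INSERT INTO entreprises_inspecteurs VALUES (1, 'Acier Michel', 'Hou, L', 'Laval')")

row = conn.execute(
    "SELECT inspecteur1, location FROM entreprises_inspecteurs WHERE name_company = ?",
    ("Acier Michel",)).fetchone()
if row:
    variable1, event_location = row  # assign variables just like with the MySQL cursor
print(variable1)  # Hou, L
```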

How can my Postgres query perform faster? Can I use Python to provide faster iteration?

This is a two-part question. If you're checking this out, thanks for your time!
Is there a way to make my query faster?
I previously asked a question here, and was eventually able to solve the problem myself.
However, the query I devised to produce my desired results is VERY slow (25+ minutes) when run against my database, which contains 40,000+ records.
The query is serving its purpose, but I'm hoping one of you brilliant people can point out to me how to make the query perform at a more preferred speed.
My query:
with dupe as (
    select
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
        select *,
               jsonb_array_elements(json_document->'Identifiers') as identifiers
        from staging
    ) sub
    group by record_id, json_document
    order by last_name
)
select * from dupe da where (
    select count(*) from dupe db
    where db.record_id = da.record_id
) > 1;
Again, some sample data:
Row 1:
{
    "Firstname": "Bobb",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "123",
            "RecordID": "123",
            "SystemID": "Test",
            "LastUpdated": "2017-09-12T02:23:30.817Z"
        },
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        },
        {
            "Content": "def",
            "RecordID": "def",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        }
    ]
}
Row 2:
{
    "Firstname": "Bob",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:26.020Z"
        }
    ]
}
If I were to bring my query's results, or a portion of them, into a Python environment where they could be manipulated using Pandas, how could I iterate over the results of my query (or the sub-query) to achieve the same end result as my original query?
Is there an easier way, using Python, to iterate through my un-nested json array in the same way that Postgres does?
For example, after performing this query:
select
    json_document->'Firstname'->0->'Content' as first_name,
    json_document->'Lastname'->0->'Content' as last_name,
    identifiers->'RecordID' as record_id
from (
    select *,
           jsonb_array_elements(json_document->'Identifiers') as identifiers
    from staging
) sub
order by last_name;
How, using Python/Pandas, can I take that query's results and perform something like:
da = datasets[query_results] # to equal my dupe da query
db = datasets[query_results] # to equal my dupe db query
Then perform the equivalent of
select * from dupe da where (
    select count(*) from dupe db
    where db.record_id = da.record_id
) > 1;
in Python?
I apologize if I do not provide enough information here. I am a Python novice. Any and all help is greatly appreciated! Thanks!!
Try the following, which eliminates your count(*) and instead uses exists.
with dupe as (
    select id,
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
        select *,
               jsonb_array_elements(json_document->'Identifiers') as identifiers
        from staging
    ) sub
    group by id, record_id, json_document
    order by last_name
)
select * from dupe da
where exists (
    select *
    from dupe db
    where db.record_id = da.record_id
      and db.id != da.id
);
Consider reading the raw, unqueried values of the Postgres json column and using pandas json_normalize() to build a flat dataframe. From there, use pandas drop_duplicates().
To demonstrate, the following parses your first JSON row into a three-row dataframe, one row per corresponding Identifiers record:
import json
import pandas as pd
json_str = '''
{
    "Firstname": "Bobb",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "123",
            "RecordID": "123",
            "SystemID": "Test",
            "LastUpdated": "2017-09-12T02:23:30.817Z"
        },
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        },
        {
            "Content": "def",
            "RecordID": "def",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        }
    ]
}
'''
data = json.loads(json_str)
df = pd.io.json.json_normalize(data, 'Identifiers', ['Firstname','Lastname'])
print(df)
# Content LastUpdated RecordID SystemID Lastname Firstname
# 0 123 2017-09-12T02:23:30.817Z 123 Test Smith Bobb
# 1 abc 2017-09-13T10:10:21.598Z abc Test Smith Bobb
# 2 def 2017-09-13T10:10:21.598Z def Test Smith Bobb
For your database, consider connecting with your DB-API such as psycopg2 or SQLAlchemy and parsing each json as a string accordingly. Admittedly, there may be other ways to handle json, as seen in the psycopg2 docs, but the code below receives the data as text and parses it on the Python side:
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()
cur.execute("SELECT json_document::text FROM staging;")

df = pd.io.json.json_normalize([json.loads(row[0]) for row in cur.fetchall()],
                               'Identifiers', ['Firstname', 'Lastname'])
df = df.drop_duplicates(['RecordID'])

cur.close()
conn.close()
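Note that drop_duplicates() keeps one row per RecordID, while the original query returned only the rows whose record_id appears more than once. If you want that exact behaviour, pandas duplicated(keep=False) mirrors the count(*) > 1 filter; the small frame below is an assumed stand-in for what json_normalize() would produce:

```python
import pandas as pd

# Assumed stand-in for the flattened frame from json_normalize().
df = pd.DataFrame({
    "RecordID":  ["123", "abc", "def", "abc"],
    "Firstname": ["Bobb", "Bobb", "Bobb", "Bob"],
    "Lastname":  ["Smith", "Smith", "Smith", "Smith"],
})

# Keep every row whose RecordID occurs more than once -- the count(*) > 1 equivalent.
dupes = df[df.duplicated("RecordID", keep=False)]
print(dupes["RecordID"].tolist())  # ['abc', 'abc']
```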

Translate SQL to Python, if possible

Is there a tool to convert a SQL statement into Python, if that's possible? For example:
(CASE WHEN var = 2 then 'Yes' else 'No' END) custom_var
==>
customVar = 'Yes' if var == 2 else 'No'
I am trying to provide an API for ETL-like transformations from a JSON input. Here's an example of an input:
{
    "ID": 4,
    "Name": "David",
    "Transformation": "NewField = CONCAT(ID, Name)"
}
And we would translate this into:
{
    "ID": 4,
    "Name": "David",
    "NewField": "4David"
}
Or, is there a better transformation language that could be used here over SQL?
Is SET NewField = CONCAT(ID, Name) actually valid SQL? (If NewField is a variable, don't you need to declare it and prefix it with "@"?) If you want to just execute arbitrary SQL, you could hack something together with sqlite:
import sqlite3
import json

query = """
{
    "ID": "4",
    "Name": "David",
    "Transformation": "SELECT ID || Name AS NewField FROM inputdata"
}"""
query_dict = json.loads(query)

db = sqlite3.Connection('mydb')
db.execute('create table inputdata ({} VARCHAR(100));'.format(' VARCHAR(100), '.join(query_dict.keys())))
db.execute('insert into inputdata ({}) values ("{}")'.format(','.join(query_dict.keys()), '","'.join(query_dict.values())))

r = db.execute(query_dict['Transformation'])
response = {}
response[r.description[0][0]] = r.fetchone()[0]
print(response)
#{'NewField': '4David'}

db.execute('drop table inputdata;')
db.close()

Selecting fields from JSON output

Using Python, how can I extract the field id to a variable? Basically, I want to transform this:
{
    "accountWide": true,
    "criteria": [
        {
            "description": "some description",
            "id": 7553,
            "max": 1,
            "orderIndex": 0
        }
    ]
}
to something like
print "Description is: " + description
print "ID is: " + id
print "Max value is : " + max
Assume you stored that dictionary in a variable called values. To get id into a variable, do:
idValue = values['criteria'][0]['id']
If that json is in a file, do the following to load it:
import json
jsonFile = open('your_filename.json', 'r')
values = json.load(jsonFile)
jsonFile.close()
If that json is from a URL, do the following to load it:
import urllib, json
f = urllib.urlopen("http://domain/path/jsonPage")
values = json.load(f)
f.close()
To print ALL of the criteria, you could:
for criteria in values['criteria']:
    for key, value in criteria.iteritems():
        print key, 'is:', value
    print ''
Assuming you are dealing with a JSON-string in the input, you can parse it using the json package, see the documentation.
In the specific example you posted you would need
x = json.loads("""{
    "accountWide": true,
    "criteria": [
        {
            "description": "some description",
            "id": 7553,
            "max": 1,
            "orderIndex": 0
        }
    ]
}""")
description = x['criteria'][0]['description']
id = x['criteria'][0]['id']
max = x['criteria'][0]['max']
