The result I got from SQLite in Python looks like this:
("John", "Alice"), ("John", "Bob"), ("John", "Cook") ......
I want to convert the result into JSON format like this:
{
    "Teacher": "John",
    "Students": ["Alice", "Bob", "Cook" .....]
}
I used GROUP_CONCAT to concatenate all the students' names, together with the following code:
row_headers = [x[0] for x in cursor.description]  # this extracts the column names
result = []
for res in cursor.fetchall():
    result.append(dict(zip(row_headers, res)))
I was able to get this result:
{
    "Teacher": "John",
    "Students": "Alice, Bob, Cook"
}
How can I make the students into array format?
If your version of SQLite has the JSON1 extension enabled, it's easy to do in pure SQL:
SELECT json_object('Teacher', teacher,
                   'Students', json_group_array(student)) AS result
FROM ex
GROUP BY teacher;
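Each returned row is then a ready-made JSON string you can parse in Python; a minimal sketch, assuming the table and column names from the query above:

import json
import sqlite3

conn = sqlite3.connect("school.db")  # hypothetical database file
rows = conn.execute(
    "SELECT json_object('Teacher', teacher, "
    "'Students', json_group_array(student)) AS result "
    "FROM ex GROUP BY teacher;"
).fetchall()
results = [json.loads(r[0]) for r in rows]  # parse each row's JSON text
print(results)  # e.g. [{'Teacher': 'John', 'Students': ['Alice', 'Bob', 'Cook']}]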
You could just do result["Students"] = result["Students"].split(", ").
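Applied per row to the result list built in the question, that's simply (a small sketch reusing names from the question's code):

for row in result:
    row["Students"] = row["Students"].split(", ")  # GROUP_CONCAT string -> list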
Using Azure Cosmos DB with the Python SDK, I'm trying to select a value inside a JSON file structured like this:
{
    "id": "40",
    "data": [
        {
            "x": "0.0959",
            "y": "-0.1303",
            "z": "0.0202"
        }
    ]
}
My query works for getting all three values x, y, z inside data, but when I try to select a single value with data.x, it returns an empty list. My query looks like this:
Select f.data, f.id from file as f where f.id = "40"
What am I doing wrong?
The data field is an array, so the following query worked for me with your data:
Select f.data[0].x, f.data[0].y, f.data[0].z from file as f where f.id = '40'
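For completeness, a minimal sketch of running that query with the azure-cosmos Python SDK; the endpoint, key, and database name below are placeholders, and I'm assuming "file" is the container name:

from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("file")

items = list(container.query_items(
    query="Select f.data[0].x, f.data[0].y, f.data[0].z from file as f where f.id = '40'",
    enable_cross_partition_query=True,  # in case id is not the partition key
))
print(items)  # e.g. [{'x': '0.0959', 'y': '-0.1303', 'z': '0.0202'}]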
I have the below record
{
    "title": "Kim floral jacquard minidress",
    "designer": "Rotate Birger Christensen"
}
How can I find a record in the collection using an array of values? For example, given the array below, the record should be selected because its "title" field contains the value "floral".
['floral', 'dresses']
The query I am using below doesn't work. :(
queryParam = ['floral', 'dresses']

def get_query(queryParam, gender):
    query = {
        "gender": gender
    }
    if len(queryParam) != 0:
        query["title"] = {"$in": queryParam}
    return query

products_query = get_query(queryParam, gender)
products = mongo.db.products.find(products_query)
To add to the previous answer, there's a little bit more to do to get this to work in pymongo. You have to use re.compile() to get the regex search to work:
import re
queryParam = [re.compile('floral'), re.compile('dresses')]
Alternatively, you could use this approach, which removes the need for the $in operator:
import re
queryParam = [re.compile('floral|dresses')]
And if you use the $regex operator instead, you don't even need re.compile:
queryParam = 'floral|dress'
...
query = {"title": {"$regex": queryParam}}
Take your pick.
You need to do a regex search along with the $in operator:
db.collectionName.find( { title: { $in: [ /floral/, /dresses/ ] } })
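For reference, here is a self-contained pymongo version of that shell query; the connection string and database name are placeholders, and the collection name comes from the question:

import re
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
products = client["mydb"]["products"]              # hypothetical database name

# compiled patterns inside $in are matched as regexes (substring match),
# unlike plain strings, which must match the whole field exactly
matches = list(products.find({"title": {"$in": [re.compile("floral"), re.compile("dresses")]}}))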
I'm currently working on a project that will be analyzing multiple data sources for information. The other data sources are fine, but I'm having a lot of trouble with JSON and its sometimes deeply nested structure. I have tried to turn the JSON into a Python dictionary, but without much luck, as it starts to struggle as the structure gets more complicated. For example, with this sample JSON file:
{
    "Employees": [
        {
            "userId": "rirani",
            "jobTitleName": "Developer",
            "firstName": "Romin",
            "lastName": "Irani",
            "preferredFullName": "Romin Irani",
            "employeeCode": "E1",
            "region": "CA",
            "phoneNumber": "408-1234567",
            "emailAddress": "romin.k.irani@gmail.com"
        },
        {
            "userId": "nirani",
            "jobTitleName": "Developer",
            "firstName": "Neil",
            "lastName": "Irani",
            "preferredFullName": "Neil Irani",
            "employeeCode": "E2",
            "region": "CA",
            "phoneNumber": "408-1111111",
            "emailAddress": "neilrirani@gmail.com"
        }
    ]
}
After converting it to a dictionary, calling dict.keys() only returns "Employees".
I then resorted to a pandas dataframe instead, and I could achieve what I wanted by calling json_normalize(dict['Employees'], sep="_"). But my problem is that it must work for ALL JSONs, and looking at the data beforehand is not an option, so my method of normalizing this way will not always work. Is there some way I could write a function that takes in any JSON and converts it into a nice pandas dataframe? I have searched for about two weeks for answers, but with no luck regarding my specific problem. Thanks
I've had to do that in the past (flatten out a big nested JSON). This blog was really helpful. Would something like this work for you?
(Note: as others have stated, making this work for EVERY JSON is a tall task. I'm merely offering a way to get started if you have a wider range of JSON objects; I'm assuming they will be relatively CLOSE to your posted example, with hopefully similar structures.)
jsonStr = '''{
    "Employees": [
        {
            "userId": "rirani",
            "jobTitleName": "Developer",
            "firstName": "Romin",
            "lastName": "Irani",
            "preferredFullName": "Romin Irani",
            "employeeCode": "E1",
            "region": "CA",
            "phoneNumber": "408-1234567",
            "emailAddress": "romin.k.irani@gmail.com"
        },
        {
            "userId": "nirani",
            "jobTitleName": "Developer",
            "firstName": "Neil",
            "lastName": "Irani",
            "preferredFullName": "Neil Irani",
            "employeeCode": "E2",
            "region": "CA",
            "phoneNumber": "408-1111111",
            "emailAddress": "neilrirani@gmail.com"
        }
    ]
}'''
It flattens the entire JSON out into a single row, which you can then put into a dataframe. In this case it creates 1 row with 18 columns. It then iterates through those columns, using the numeric indices embedded in the column names to reconstruct multiple rows. If you had a different nested JSON, I'm thinking it theoretically should work, but you'll have to test it out.
import json
import pandas as pd
import re

def flatten_json(y):
    out = {}

    def flatten(x, name=''):
        if type(x) is dict:
            for a in x:
                flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(y)
    return out

jsonObj = json.loads(jsonStr)
flat = flatten_json(jsonObj)

results = pd.DataFrame()
columns_list = list(flat.keys())
for item in columns_list:
    row_idx = re.findall(r'_(\d+)_', item)[0]   # numeric index embedded in the flattened key
    column = item.replace('_' + row_idx + '_', '_')
    row_idx = int(row_idx)
    value = flat[item]
    results.loc[row_idx, column] = value

print(results)
Output:
print (results)
  Employees_userId  ...   Employees_emailAddress
0           rirani  ...  romin.k.irani@gmail.com
1           nirani  ...     neilrirani@gmail.com

[2 rows x 9 columns]
d = {
    "Employees": [
        {
            "userId": "rirani",
            "jobTitleName": "Developer",
            "firstName": "Romin",
            "lastName": "Irani",
            "preferredFullName": "Romin Irani",
            "employeeCode": "E1",
            "region": "CA",
            "phoneNumber": "408-1234567",
            "emailAddress": "romin.k.irani@gmail.com"
        },
        {
            "userId": "nirani",
            "jobTitleName": "Developer",
            "firstName": "Neil",
            "lastName": "Irani",
            "preferredFullName": "Neil Irani",
            "employeeCode": "E2",
            "region": "CA",
            "phoneNumber": "408-1111111",
            "emailAddress": "neilrirani@gmail.com"
        }
    ]
}
import pandas as pd

df = pd.DataFrame([list(x.values()) for x in d["Employees"]],
                  columns=d["Employees"][0].keys())
print(df)
Output
   userId jobTitleName firstName  ... region  phoneNumber             emailAddress
0  rirani    Developer     Romin  ...     CA  408-1234567  romin.k.irani@gmail.com
1  nirani    Developer      Neil  ...     CA  408-1111111     neilrirani@gmail.com

[2 rows x 9 columns]
For the particular JSON data given, my approach, which uses only the pandas package, follows:
import pandas as pd

# json as python's dict object
jsn = {
    "Employees": [
        {
            "userId": "rirani",
            "jobTitleName": "Developer",
            "firstName": "Romin",
            "lastName": "Irani",
            "preferredFullName": "Romin Irani",
            "employeeCode": "E1",
            "region": "CA",
            "phoneNumber": "408-1234567",
            "emailAddress": "romin.k.irani@gmail.com"
        },
        {
            "userId": "nirani",
            "jobTitleName": "Developer",
            "firstName": "Neil",
            "lastName": "Irani",
            "preferredFullName": "Neil Irani",
            "employeeCode": "E2",
            "region": "CA",
            "phoneNumber": "408-1111111",
            "emailAddress": "neilrirani@gmail.com"
        }
    ]
}

# get the main key, here 'Employees' with index '0'
emp = list(jsn.keys())[0]
# when you have several keys at this level, e.g. 'Employers',
# .. you need to handle all of them too (your task)

# get all the sub-keys of the main key[0]
all_keys = jsn[emp][0].keys()

# build the dataframe column by column
result_df = pd.DataFrame()  # init an empty dataframe
for key in all_keys:
    col_vals = []
    for ea in jsn[emp]:
        col_vals.append(ea[key])
    # add a new column to the dataframe using the sub-key as its header;
    # it is possible that the values here are nested object(s)
    # .. such as dict, list, json
    result_df[key] = col_vals

print(result_df.to_string())
Output:
   userId lastName jobTitleName  phoneNumber             emailAddress employeeCode preferredFullName firstName region
0  rirani    Irani    Developer  408-1234567  romin.k.irani@gmail.com           E1       Romin Irani     Romin     CA
1  nirani    Irani    Developer  408-1111111     neilrirani@gmail.com           E2        Neil Irani      Neil     CA
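As an aside, pandas can build an equivalent frame in one line from the same dict; a quick sketch using the jsn object above:

df = pd.json_normalize(jsn["Employees"])  # one row per employee, one column per sub-key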
This is a two-part question. If you're checking this out, thanks for your time!
Is there a way to make my query faster?
I previously asked a question here, and was eventually able to solve the problem myself.
However, the query I devised to produce my desired results is VERY slow (25+ minutes) when run against my database, which contains 40,000+ records.
The query is serving its purpose, but I'm hoping one of you brilliant people can point out to me how to make the query perform at a more preferred speed.
My query:
with dupe as (
    select
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
        select *,
               jsonb_array_elements(json_document->'Identifiers') as identifiers
        from staging
    ) sub
    group by record_id, json_document
    order by last_name
)
select * from dupe da where (
    select count(*) from dupe db
    where db.record_id = da.record_id
) > 1;
Again, some sample data:
Row 1:
{
    "Firstname": "Bobb",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "123",
            "RecordID": "123",
            "SystemID": "Test",
            "LastUpdated": "2017-09-12T02:23:30.817Z"
        },
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        },
        {
            "Content": "def",
            "RecordID": "def",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        }
    ]
}
Row 2:
{
    "Firstname": "Bob",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:26.020Z"
        }
    ]
}
If I were to bring in my query's results, or a portion of the results, into a Python environment where they could be manipulated using Pandas, how could I iterate over the results of my query (or the sub-query) in order to achieve the same end result as with my original query?
Is there an easier way, using Python, to iterate through my un-nested json array in the same way that Postgres does?
For example, after performing this query:
select
    json_document->'Firstname'->0->'Content' as first_name,
    json_document->'Lastname'->0->'Content' as last_name,
    identifiers->'RecordID' as record_id
from (
    select *,
           jsonb_array_elements(json_document->'Identifiers') as identifiers
    from staging
) sub
order by last_name;
How, using Python/Pandas, can I take that query's results and perform something like:
da = datasets[query_results] # to equal my dupe da query
db = datasets[query_results] # to equal my dupe db query
Then perform the equivalent of
select * from dupe da where (
    select count(*) from dupe db
    where db.record_id = da.record_id
) > 1;
in Python?
I apologize if I do not provide enough information here. I am a Python novice. Any and all help is greatly appreciated! Thanks!!
Try the following, which eliminates your count(*) and instead uses exists; exists can stop at the first matching row instead of counting them all, which matters against 40,000+ records.
with dupe as (
    select id,
        json_document->'Firstname'->0->'Content' as first_name,
        json_document->'Lastname'->0->'Content' as last_name,
        identifiers->'RecordID' as record_id
    from (
        select *,
               jsonb_array_elements(json_document->'Identifiers') as identifiers
        from staging
    ) sub
    group by id, record_id, json_document
    order by last_name
)
select * from dupe da
where exists (
    select *
    from dupe db
    where db.record_id = da.record_id
      and db.id != da.id
);
Consider reading the raw, unqueried values of the Postgres json column and using pandas json_normalize() to bind them into a flat dataframe. From there, use pandas drop_duplicates.
To demonstrate, the following parses your first JSON document into a three-row dataframe, one row per corresponding Identifiers record:
import json
import pandas as pd
json_str = '''
{
    "Firstname": "Bobb",
    "Lastname": "Smith",
    "Identifiers": [
        {
            "Content": "123",
            "RecordID": "123",
            "SystemID": "Test",
            "LastUpdated": "2017-09-12T02:23:30.817Z"
        },
        {
            "Content": "abc",
            "RecordID": "abc",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        },
        {
            "Content": "def",
            "RecordID": "def",
            "SystemID": "Test",
            "LastUpdated": "2017-09-13T10:10:21.598Z"
        }
    ]
}
'''
data = json.loads(json_str)
df = pd.json_normalize(data, 'Identifiers', ['Firstname', 'Lastname'])
print(df)
#   Content               LastUpdated RecordID SystemID Lastname Firstname
# 0     123  2017-09-12T02:23:30.817Z      123     Test    Smith      Bobb
# 1     abc  2017-09-13T10:10:21.598Z      abc     Test    Smith      Bobb
# 2     def  2017-09-13T10:10:21.598Z      def     Test    Smith      Bobb
For your database, consider connecting with a DB-API driver such as psycopg2 or with sqlAlchemy, and parse each json value as a string accordingly. Admittedly, there may be other ways to handle json, as seen in the psycopg2 docs, but the code below receives the data as text and parses it on the Python side:
import psycopg2

conn = psycopg2.connect("dbname=test user=postgres")
cur = conn.cursor()
cur.execute("SELECT json_document::text FROM staging;")

# parse each row's json text on the python side, then flatten
df = pd.json_normalize([json.loads(row[0]) for row in cur.fetchall()],
                       'Identifiers', ['Firstname', 'Lastname'])
df = df.drop_duplicates(['RecordID'])

cur.close()
conn.close()
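If you actually want the duplicated rows (what the original count(*) > 1 query returns) rather than a de-duplicated frame, a small variation on the above uses a duplicated() mask:

# keep every row whose RecordID occurs more than once,
# mirroring the count(*) > 1 filter of the original SQL
dupes = df[df.duplicated(subset=['RecordID'], keep=False)]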
Is there a tool to convert a SQL statement into Python, if that's possible? For example:
(CASE WHEN var = 2 then 'Yes' else 'No' END) custom_var
==>
customVar = 'Yes' if var == 2 else 'No'
I am trying to provide an API for ETL-like transformations from a JSON input. Here's an example of an input:
{
    "ID": 4,
    "Name": "David",
    "Transformation": "NewField = CONCAT (ID, Name)"
}
And we would translate this into:
{
    "ID": 4,
    "Name": "David",
    "NewField": "4David"
}
Or, is there a better transformation language that could be used here over SQL?
Is SET NewField = CONCAT (ID, Name) actually valid SQL? (If NewField is a variable, do you need to declare it and prefix it with "@"?) If you want to just execute arbitrary SQL, you could hack something together with sqlite:
import sqlite3
import json

query = """
{
    "ID": "4",
    "Name": "David",
    "Transformation": "SELECT ID || Name AS NewField FROM inputdata"
}"""
query_dict = json.loads(query)

db = sqlite3.connect('mydb')
# build a one-row table whose columns are the json keys
db.execute('create table inputdata ({} VARCHAR(100));'.format(' VARCHAR(100), '.join(query_dict.keys())))
db.execute('insert into inputdata ({}) values ("{}")'.format(','.join(query_dict.keys()), '","'.join(query_dict.values())))

# run the embedded transformation against that table
r = db.execute(query_dict['Transformation'])
response = {}
response[r.description[0][0]] = r.fetchone()[0]
print(response)
# {'NewField': '4David'}

db.execute('drop table inputdata;')
db.close()
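One caveat on the hack above: formatting the values straight into the insert statement is fine for a throwaway demo, but anything user-facing should use a parameterized insert (db.execute with ? placeholders and a values tuple) so that quotes inside the values can't break or hijack the SQL.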