How to compare a json with a CSV file - python

I have a JSON payload that is used for a service request. After processing, the payload is stored in S3, and through Athena we can download the data in CSV format. In the actual scenario there are more than 100 fields. I want to verify their values with an automated script instead of manually.
Say my sample payload is similar to the following:
{
    "BOOK": {
        "serialno": "123",
        "author": "xyz",
        "yearofpublish": "2015",
        "price": "16"
    },
    "Author": [
        {"isbn": "xxxxx", "title": "first", "publisher": "xyz", "year": "2020"},
        {"isbn": "yyyy", "title": "second", "publisher": "zmy", "year": "2019"}
    ]
}
The sample CSV will look like the following:
Can anyone please help me with how exactly I can do this in Python? Maybe with a library or a dictionary?

It looks like you just want to flatten out the JSON structure. It'll be easiest to loop over the "Author" list. Since the CSV has renamed the columns, you'll need some way to represent that mapping. Based only on the example, this works:
import json

# some_json_file is the path to the JSON payload.
with open(some_json_file, 'r') as fin:
    j = json.load(fin)

result = []
for author in j['Author']:
    # One flat row per author; the book fields repeat on every row.
    val = {'book_serialno': j['BOOK']['serialno'],
           'book_author': j['BOOK']['author'],
           'book_yearofpublish': j['BOOK']['yearofpublish'],
           'book_price': j['BOOK']['price'],
           'author_isbn': author['isbn'],
           'author_title': author['title'],
           'author_publisher': author['publisher'],
           'author_year': author['year']}
    result.append(val)
This is using a dictionary to show the mapping of data points to the new column names. You might be able to get away with using a list as well. Depends how you want to use it later on. To write to a CSV:
import csv

# newline='' stops the csv module from writing blank lines on Windows.
with open(some_csv_file, 'w', newline='') as fout:
    writer = csv.writer(fout)
    writer.writerow(result[0].keys())
    writer.writerows(r.values() for r in result)
This writes the column names in the first row, then the data. If you don't want the column names, just leave out the writerow(...) line.
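Since the goal in the question is verification rather than export, the flattened rows can also be compared directly against the CSV downloaded from Athena. The following is only a sketch: compare_rows is a hypothetical helper, and it assumes that the CSV header uses the same column names as the dictionary keys and that the rows come back in the same order as the "Author" list. Neither assumption is confirmed by the question.

```python
import csv
import io


def compare_rows(expected_rows, csv_file):
    """Return a list of (row_number, column, expected, actual) mismatches."""
    mismatches = []
    actual_rows = list(csv.DictReader(csv_file))
    if len(actual_rows) != len(expected_rows):
        mismatches.append((None, 'row count', len(expected_rows), len(actual_rows)))
    for i, (expected, actual) in enumerate(zip(expected_rows, actual_rows), start=1):
        for column, value in expected.items():
            # CSV values are always strings, so compare as strings.
            if str(value) != actual.get(column):
                mismatches.append((i, column, str(value), actual.get(column)))
    return mismatches


# Small in-memory demonstration; in practice csv_file would be
# open(path_to_athena_csv, newline='').
expected = [{'book_serialno': '123', 'author_isbn': 'xxxxx'},
            {'book_serialno': '123', 'author_isbn': 'yyyy'}]
athena_csv = io.StringIO("book_serialno,author_isbn\n123,xxxxx\n123,zzzz\n")
print(compare_rows(expected, athena_csv))
# [(2, 'author_isbn', 'yyyy', 'zzzz')]
```

An empty list means every expected value was found in the CSV. If the row order is not guaranteed, sorting both sides by a key column first would be one way to handle it.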

Related

Combining geoJSON/topoJSON with my own JSON generated from tabular data

I have a very large topoJSON file I generated from pulling census shapefiles for ZCTA and converting with mapshaper. https://www.census.gov/cgi-bin/geo/shapefiles/index.php?year=2021&layergroup=ZIP+Code+Tabulation+Areas
I also have a homemade JSON generated from my own tabular data. An example of this homemade JSON looks like:
zips = [
{
"zip": "01001",
"quantile": "high"
},
{
"zip": "01002",
"quantile": "high"
},
{
"zip": "52155",
"quantile": "low"
},
{
"zip": "52156",
"quantile": "low"
}
]
I would like to somehow fuse these JSONs together so that I can eventually fill the ZipCode polygons with specific colors based on the quantile value in my zips. Or is there perhaps a better way to essentially use my homemade JSON source as a lookup JSON so that I can find where the zip from my topojson matches my homemade json and extract the needed value.
In Python, what I have tried so far is to iterate through the geometries and then loop through the lookup (homemade JSON) each time to check whether the zip code values match; if they do, I add that value. This, however, takes a bit of time to process, and I was curious whether there are better methods to do something like this.
# loop through topojson file
for i in data['objects']['tl_2021_us_zcta520']['geometries']:
    # loop through lookup homemade json and if the zip found matches, we will append a new key to our properties in the topojson
    for j in lkp:
        if i['properties']['zip'] == j['zip']:
            i['properties']['quantile'] = j['quantile']
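One common way to speed up this kind of matching is to turn the lookup list into a dictionary keyed by zip code once, so each geometry needs a single dictionary lookup instead of a full scan of the list. The sketch below assumes the same structure as the code above (data and lkp are tiny stand-ins, and quantile_by_zip is a name introduced only for illustration).

```python
# Tiny stand-ins for the real data; the structure mirrors the question.
data = {'objects': {'tl_2021_us_zcta520': {'geometries': [
    {'properties': {'zip': '01001'}},
    {'properties': {'zip': '99999'}},
]}}}
lkp = [{'zip': '01001', 'quantile': 'high'},
       {'zip': '52155', 'quantile': 'low'}]

# Build the lookup once: zip code -> quantile.
quantile_by_zip = {entry['zip']: entry['quantile'] for entry in lkp}

for geometry in data['objects']['tl_2021_us_zcta520']['geometries']:
    props = geometry['properties']
    quantile = quantile_by_zip.get(props['zip'])
    # Geometries without a matching zip are left unchanged.
    if quantile is not None:
        props['quantile'] = quantile
```

This replaces the nested loop with one pass over each collection, which should matter more as the number of geometries and lookup entries grows.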

Converting csv to nested Json using python

I want to convert a CSV file to a JSON file.
I have a large amount of data in the CSV file.
CSV Column Structure
This is my column structure in the CSV file. It has 200+ records.
id.oid libId personalinfo.Name personalinfo.Roll_NO personalinfo.addr personalinfo.marks.maths personalinfo.marks.physic clginfo.clgName clginfo.clgAddr clginfo.haveCert clginfo.certNo clginfo.certificates.cert_name_1 clginfo.certificates.cert_no_1 clginfo.certificates.cert_exp_1 clginfo.certificates.cert_name_2 clginfo.certificates.cert_no_2 clginfo.certificates.cert_exp_2 clginfo.isDept clginfo.NoofDept clginfo.DeptDetails.DeptName_1 clginfo.DeptDetails.location_1 clginfo.DeptDetails.establish_date_1 _v updatedAt.date
Expected Json
[{
"id":
{
"$oid": "00001"
},
"libId":11111,
"personalinfo":
{
"Name":"xyz",
"Roll_NO":101,
"addr":"aa bb cc ddd",
"marks":
[
"maths":80,
"physic":90
.....
]
},
"clginfo"
{
"clgName":"pqr",
"clgAddr":"qwerty",
"haveCert":true, //this is boolean true or false
"certNo":1, //this could be 1-10
"certificates":
[
{
"cert_name_1":"xxx",
"cert_no_1":12345,
"cert_exp.1":"20/2/20202"
},
{
"cert_name_2":"xxx",
"cert_no_2":12345,
"cert_exp_2":"20/2/20202"
},
......//could be up to 10
],
"isDept":true, //this is boolean true or false
"NoofDept":1 , //this could be 1-10
"DeptDetails":
[
{
"DeptName_1":"yyy",
"location_1":"zzz",
"establish_date_1":"1/1/1919"
},
......//up to 10 records
]
},
"__v": 1,
"updatedAt":
{
"$date": "2022-02-02T13:35:59.843Z"
}
}]
I have tried using pandas but I'm getting output as
My output
[{
    "id.$oid": "00001",
    "libId":11111,
    "personalinfo.Name":"xyz",
    "personalinfo.Roll_NO":101,
    "personalinfo.addr":"aa bb cc ddd",
    "personalinfo.marks.maths":80,
    "personalinfo.marks.physic":90,
    "clginfo.clgName":"pqr",
    "clginfo.clgAddr":"qwerty",
    "clginfo.haveCert":true,
    "clginfo.certNo":1,
    "clginfo.certificates.cert_name_1":"xxx",
    "clginfo.certificates.cert_no_1":12345,
    "clginfo.certificates.cert_exp_1":"20/2/20202",
    "clginfo.certificates.cert_name_2":"xxx",
    "clginfo.certificates.cert_no_2":12345,
    "clginfo.certificates.cert_exp_2":"20/2/20202",
    "clginfo.isDept":true,
    "clginfo.NoofDept":1,
    "clginfo.DeptDetails.DeptName_1":"yyy",
    "clginfo.DeptDetails.location_1":"zzz",
    "clginfo.DeptDetails.establish_date_1":"1/1/1919",
    "__v": 1,
    "updatedAt.$date": "2022-02-02T13:35:59.843Z"
}]
I am new to Python and only know the basics. Please help me get this output.
200+ records is really tiny, so even a naive solution is good.
It can't be totally generic, because I don't see how the headers alone can tell you that certificates is a list, unless we rely on all names under certificates having _N at the end.
Proposed solution using only basic Python:
1. Read the header row and split all column names by period. Iterate over the resulting list and create nested dicts with the appropriate keys and dummy values (if you want to handle lists: create an array if the current key ends with _N and use N as an index).
2. For all rows:
   - clone the dictionary with dummy values
   - for each column, use the split keys from above to put the value into the corresponding dict; use the same approach as above for lists
   - append the dictionary to the list of rows
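The steps above can be sketched roughly as follows. This is a simplified variant, not a definitive implementation: it builds each nested dict directly from the row instead of cloning a template, and nest_row, lists_from_indexes and LIST_KEY are hypothetical names. It assumes that every column whose last name part ends in _N belongs to a list, and it leaves all values as strings (no conversion to numbers or booleans).

```python
import csv
import io
import json
import re

LIST_KEY = re.compile(r'_(\d+)$')  # e.g. cert_name_1 -> index 1


def lists_from_indexes(node):
    """Replace dicts keyed only by integers with lists ordered by index."""
    if not isinstance(node, dict):
        return node
    if node and all(isinstance(k, int) for k in node):
        return [lists_from_indexes(node[k]) for k in sorted(node)]
    return {k: lists_from_indexes(v) for k, v in node.items()}


def nest_row(row):
    """Turn one flat CSV row (dotted column names) into nested dicts."""
    nested = {}
    for column, value in row.items():
        *parents, leaf = column.split('.')
        node = nested
        for part in parents:
            node = node.setdefault(part, {})
        match = LIST_KEY.search(leaf)
        if match:
            # Group cert_name_1, cert_no_1, ... under index 1 for now.
            node.setdefault(int(match.group(1)), {})[leaf] = value
        else:
            node[leaf] = value
    return lists_from_indexes(nested)


# In-memory sample using a few of the column names from the question.
sample = io.StringIO(
    "id.oid,personalinfo.Name,"
    "clginfo.certificates.cert_name_1,clginfo.certificates.cert_name_2\n"
    "00001,xyz,xxx,yyy\n"
)
records = [nest_row(row) for row in csv.DictReader(sample)]
print(json.dumps(records, indent=2))
```

One limitation of this sketch: if a parent holds both _N columns and ordinary columns, it stays a dict rather than becoming a list, so such a layout would need extra handling.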

How to convert dynamic nested json into csv?

I have some dynamically generated nested JSON that I want to convert to a CSV file using Python. I am trying to use pandas for this. My question is: is there a way to flatten the JSON data and put it in the CSV without knowing in advance which JSON keys need to be flattened? An example of my data is this:
{
"reports": [
{
"name": "report_1",
"details": {
"id": "123",
"more info": "zyx",
"people": [
"person1",
"person2"
]
}
},
{
"name": "report_2",
"details": {
"id": "123",
"more info": "zyx",
"actions": [
"action1",
"action2"
]
}
}
]
}
More nested json objects can be dynamically generated in the "details" section that I do not know about in advance but need to be represented in their own cell in the csv.
For the above example, I'd want the csv to look something like this:
Name, Id, More Info, People_1, People_2, Actions_1, Actions_2
report_1, 123, zyx, person1, person2, ,
report_2, 123, zyx, , , action1, action2
Here's the code I have:
import json
import pandas as pd

data = json.loads('{"reports": [{"name": "report_1","details": {"id": "123","more info": "zyx","people": ["person1","person2"]}},{"name": "report_2","details": {"id": "123","more info": "zyx","actions": ["action1","action2"]}}]}')
df = pd.json_normalize(data['reports'])
df.to_csv("test.csv")
And here is the outcome currently:
,name,details.id,details.more info,details.people,details.actions
0,report_1,123,zyx,"['person1', 'person2']",
1,report_2,123,zyx,,"['action1', 'action2']"
I think what you are looking for is:
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
If using pandas doesn't work for you, here's the more canonical Python way of doing it.
You're trying to write out a CSV file, and that implicitly means you must write out a header containing all the keys.
The constraint that you don't know the keys in advance means you can't do this in a single pass.
import csv
import io

def convert_record_to_flat_dict(record):
    # You need to figure out exactly how you want to do this; everything
    # should be strings. Here 'details' is merged into a copy of the record.
    flat = dict(record)
    flat.update(flat.pop('details', {}))
    return flat

header = {}  # Maps each key to its column index.
rows = [[]]  # Leave the header row blank for now.
for record in data['reports']:
    record = convert_record_to_flat_dict(record)
    for key in record.keys():
        if key not in header:
            header[key] = len(header)
            rows[0].append(key)
    row = [''] * len(header)
    for key, index in header.items():
        row[index] = record.get(key, '')
    rows.append(row)
# And you can go back to ensure all rows have the same number of columns:
for row in rows:
    row.extend([''] * (len(header) - len(row)))
# buffer can be any writable text file object; StringIO is used as an example.
buffer = io.StringIO()
csv_out = csv.writer(buffer)
csv_out.writerows(rows)
Now you have a list of lists, and csv.writer (or the like) writes it out in one call once the full header is known.
If memory is an issue, another technique is to write out a temporary file and then reprocess it once you know the header.
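To also get one cell per list element (People_1, People_2 and so on, as in the desired CSV), a recursive flattening step can run before the two passes. The sketch below is one possible approach using only the standard library; flatten is a hypothetical helper, and the column naming (dotted paths for nested objects, _1, _2 suffixes for list items) is an assumption chosen to resemble the requested output.

```python
import csv
import io


def flatten(value, prefix=''):
    """Flatten nested dicts and lists into a single {column: value} dict."""
    flat = {}
    if isinstance(value, dict):
        for key, child in value.items():
            flat.update(flatten(child, f'{prefix}.{key}' if prefix else key))
    elif isinstance(value, list):
        for number, child in enumerate(value, start=1):
            flat.update(flatten(child, f'{prefix}_{number}'))
    else:
        flat[prefix] = value
    return flat


reports = [
    {'name': 'report_1', 'details': {'id': '123', 'people': ['person1', 'person2']}},
    {'name': 'report_2', 'details': {'id': '123', 'actions': ['action1', 'action2']}},
]

rows = [flatten(report) for report in reports]

# First pass: collect every column name, in first-seen order.
fieldnames = list(dict.fromkeys(key for row in rows for key in row))

# Second pass: write the rows; restval fills cells a row does not have.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval='')
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

Because the keys are discovered at run time, nothing about the "details" section has to be known in advance. Empty lists and empty objects produce no columns in this sketch.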

How to extract data from JSON file for each entry?

I am using the json package in Python to extract data from a JSON file that was generated from data fetched from a Firebase database.
Within the data set, I want to extract all of the data corresponding to bills in each entry of the JSON file. For that, I created a separate collection to hold all of the elements corresponding to bills in the dataset.
When converted to CSV, the dataset looks like this:
csv for one entry
I have the following code for this operation. Some entries have null values, shown as [] (see the CSV file). I use a list to store only the bills that actually have data in the bills column (essentially skipping all the null entries). But when I create the new list, the required output only ends up in the first index of the new list or array. Please see the code below.
My code is as below:
import json
import numpy as np

with open('requireddataset.json', 'r') as filedata:
    data = json.load(filedata)

listoffields = []  # To produce it into a list with fields for dic
for dic in data:
    try:
        listoffields.append(dic['bills'])  # only non-essential bill categories.
    except KeyError:
        pass
# print(listoffields[3])  # This would return the first payment entry within
#                         # the JSON Array of objects.
for val in listoffields:
    if val != []:
        x = val[0]  # only val[0] would contain data
        # print(x)
        myarray = np.array(val)
        print(myarray[0])  # All of the data stored in only one index, any way to change this?
This is the output : output
This is how the original JSON file looks like : requireddataset.json
Essentially, my question is this: the list listoffields contains all the fields from the JSON file, and bills is one of those fields. Within the bills column, each entry in turn contains id, value, role and many other keys. Is there any way to extract only the values from this and compute their sum?
In the JSON file this is how it looks like for one entry :
[{"goal_savings": 0.0, "social_id": "", "score": 0, "country": "BR", "photo": "http://graph.facebook", "id": "", "plates": 3, "rcu": null, "name": "", "email": ".", "provider": "facebook", "phone": "", "savings": [], "privacyPolicyAccepted": true, "currentRole": "RoleType.PERSONAL", "empty_lives_date": null, "userId": "", "authentication_token": "-------", "onboard_status": "ONBOARDING_WIZARD", "fcmToken": "----------", "level": 1, "dni": "", "social_token": "", "lives": 10, "bills": [{"date": "2020-12-10", "role": "RoleType.PERSONAL", "name": "Supermercado", "category": "feeding", "periodicity": "PeriodicityType.NONE", "value": 100.0}], "payments": [], "goals": [], "goalTransactions": [], "incomes": [], "achievements": [{"created_at": "", "name": ""}]}]
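Given that structure, one possible way to pull out only the value of every bill and add them up is a nested comprehension over the entries and their bills. This is a sketch under assumptions: the sample below is a trimmed stand-in for requireddataset.json, the second bill ("Rent") is invented purely for illustration, and every bill is assumed to carry a numeric "value" key.

```python
import json

# A trimmed stand-in for requireddataset.json (the "Rent" bill is made up).
raw = '''[
  {"name": "a", "bills": [{"name": "Supermercado", "value": 100.0},
                          {"name": "Rent", "value": 50.5}]},
  {"name": "b", "bills": []},
  {"name": "c"}
]'''
data = json.loads(raw)

# One flat list with the value of every bill of every entry.
# entry.get('bills', []) covers entries that have no "bills" key at all,
# and empty lists simply contribute nothing.
values = [bill['value']
          for entry in data
          for bill in entry.get('bills', [])]
total = sum(values)
print(values, total)
# [100.0, 50.5] 150.5
```

Iterating over each bills list, rather than taking only val[0], is what keeps entries with more than one bill from being collapsed into a single element.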

In Python, how can I remove certain data from list to create a new one?

I'm pulling data from a json endpoint, which returns a list.
Some of the elements in this list, I want to throw out. I'm only interested in certain elements.
I'm pulling the data as such:
import json
import requests

# Pull the data
url = "https://my-endpoint.com"
user = 'user1'
pwd = 'password1'
response = requests.get(url, auth=(user, pwd))
data = json.loads(response.text)
The payload looks similar to:
[{
"apples": {
"value": 0.0
},
"oranges": {
"value": 0.0
},
"name": "testing123"
},
{
"apples": {
"value": 0.0
},
"oranges": {
"value": 0.0
},
"name": "foobar"
},
{
"apples": {
"value": 0.0
},
"oranges": {
"value": 0.0
},
"name": "testing456"
}]
Assume that the above continues on with many other elements, but with a different name. How can I pull all of the data, but exclude what I want?
From the example above, I would like to pull all data for names "testing123" and "testing456", but exclude the data from "foobar".
The new list is what I would iterate over to pull the data I need for my purposes.
There's a good deal of mismatched braces in your question, but I think I've figured it out. You have 3 (+ many more) dictionaries in a list, each with its own apples, oranges (or other) keys, and then a name key. You want a list of dictionaries with the same structure as this one, but containing only the dictionaries whose name is in a set of pre-approved names. For the sake of brevity I'll assume you have such a collection of names called OK_NAMES:
new_data = [Dict for Dict in data if Dict["name"] in OK_NAMES]
There you go!
If instead you wanted to eliminate all names with a specific pattern:
new_data = [Dict for Dict in data if not Dict["name"].startswith("foobar")]
That should work
Btw I know it's almost never a good idea to name variables after a type, I was just doing it for clarity here.
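For completeness, here is a small self-contained version of both filters run against the payload from the question. OK_NAMES comes from the answer above; EXCLUDED, kept and remaining are names introduced only for this sketch.

```python
data = [
    {"apples": {"value": 0.0}, "oranges": {"value": 0.0}, "name": "testing123"},
    {"apples": {"value": 0.0}, "oranges": {"value": 0.0}, "name": "foobar"},
    {"apples": {"value": 0.0}, "oranges": {"value": 0.0}, "name": "testing456"},
]

# Keep only approved names (a set gives fast membership tests).
OK_NAMES = {"testing123", "testing456"}
kept = [item for item in data if item["name"] in OK_NAMES]

# Or go the other way and drop a known set of unwanted names.
EXCLUDED = {"foobar"}
remaining = [item for item in data if item.get("name") not in EXCLUDED]

print([item["name"] for item in kept])
# ['testing123', 'testing456']
```

Both comprehensions build a new list and leave data untouched, so the original payload is still available if it is needed later.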
