I have downloaded a sample dataset from here that is a series of JSON objects. According to the website, each JSON object looks like below
{
"id": "4cd223df721b722b1c40689caa52932a41fcc223",
"title": "Knowledge-rich, computer-assisted composition of Chinese couplets",
"paperAbstract": "Recent research effort in poem composition has focused on the use of automatic language generation...",
"entities": [
"Conformance testing",
"Natural language generation",
"Natural language processing",
"Parallel computing",
"Stochastic grammar",
"Web application"
],
"s2Url": "https://semanticscholar.org/paper/4cd223df721b722b1c40689caa52932a41fcc223",
"s2PdfUrl": "",
"pdfUrls": [
"https://doi.org/10.1093/llc/fqu052"
],
"authors": [
{
"name": "John Lee",
"ids": [
"3362353"
]
},
"..."
],
"inCitations": [
"c789e333fdbb963883a0b5c96c648bf36b8cd242"
],
"outCitations": [
"abe213ed63c426a089bdf4329597137751dbb3a0",
"..."
],
"year": 2016,
"venue": "DSH",
"journalName": "DSH",
"journalVolume": "31",
"journalPages": "152-163",
"sources": [
"DBLP"
],
"doi": "10.1093/llc/fqu052",
"doiUrl": "https://doi.org/10.1093/llc/fqu052",
"pmid": ""
}
Eventually I need to work with the paperAbsrtract section only. I am loading this to a pandas dataframe like below
filename = "sample-S2-records"
df = pd.read_json(filename, lines=True)
df.head()
This shows all the doi and doiUrl column empty.
Also if I only select abstract column and check out the head, I see 2 of the 5 rows empty
abstract = df['paperAbstract']
abstract.head()
0
1 The search for new administrators in complex s...
2 The human N-formyl peptide receptor (FPR) is a...
3 Serum CA 19-9 (2-3 sialyl Le(a)) is a marker o...
4
Name: paperAbstract, dtype: object
Looks like the way I have created the dataframe is not the right approach. I am pretty confident they don't have any missing column.
What am I missing? Any suggestion?
I looked into a sample of your data, and I think you're getting correct results. If we were to parse the JSON by hand:
import json
filename = "sample-S2-records"
with open(filename, 'r') as f:
d = [json.loads(x) for x in f]
Then inspect the list of dictionaries, here's what we see:
>>> d[0]['paperAbstract']
''
So it really looks like the first lines paperAbstract field is empty.
P.S.: I think that the question needs to be closed, I doubt it's going to help anyone else
Related
As I'm fairly new to python I've tried various ways based on answers found here but can't manage to normalize my json file.
As I checked in Postman it has 4 levels of nested arrays. For suppliers I want to expand all the levels of data.The problem I have is with the score_card subtree for which I want to pull out all risk_groups , then name,risk_score and all riskstogether with name,risk_score and indicators with the deepest level containing again name and risk_score.
and I'm not sure whether it is possible in one go based on ambiguous names between levels.
Below I'm sharing a reproducible example:
data = {"suppliers": [
{
"id": 1260391,
"name": "2712270 MANITOBA LTD",
"duns": "rm-071-7291",
"erp_number": "15189067;15189067",
"material_group_ids": [
176069
],
"business_unit_ids": [
13728
],
"responsible_user_ids": [
37761,
37760,
37759,
37758,
37757,
37756,
36520,
37587,
36494,
22060,
37742,
36446,
36289
],
"address": {
"address1": "BOX 550 NE 16-20-26",
"address2": None,
"zip_code": "R0J 1W0",
"city": "RUSSELL",
"state": None,
"country_code": "ca",
"latitude": 50.7731176,
"longitude": -101.2862461
},
"score_card": {
"risk_score": 26.13,
"risk_groups": [
{
"name": "Viability",
"risk_score": 43.33,
"risks": [
{
"name": "Financial stability supplier",
"risk_score": None,
"indicators": [
{
"name": "Acquisitions",
"risk_score": None
}
]
}
]
},
]
}
}
]
}
And here is how it should look:
expected = [[1260391,'2712270 MANITOBA LTD','rm-071-7291',176069,13728,[37761,37760,
37759,37758,37757,37756,36520,37587,36494,22060,37742,36446,36289],
'BOX 550 NE 16-20-26','None','R0J 1W0','RUSSELL','None','ca',50.7731176,
-101.2862461,26.13,'Viability',43.33,'Financial stability supplier','None',
'Acquisitions','None']]
df = pd.DataFrame(expected,columns=['id','name','duns','material_groups_ids',
'business_unit_ids','responsible_user_ids',
'address.address1','address.address2','address.zip_code',
'address.city','address.state','address.country_code','address.latitude',
'address.longitude','risk_score','risk_groups.name',
'risk_groups.risk_score','risk_groups.risks.name',
'risk_groups.risks.risk_score','risk_groups.risks.indicators.name',
'risk_groups.risks.indicators.risk_score'])
What I tried since last 3 days is to use json_normalize:
example = pd.json_normalize(data['suppliers'],
record_path=['score_card','risk_groups','risks',
'indicators'],meta = ['id','name','duns','erp_number','material_group_ids',
'business_unit_ids','responsible_user_ids','address',['score_card','risk_score']],record_prefix='_')
but when I specify the fields for record_path which seems to be a vital param here it always return this prompt:
TypeError: {'name': 'Acquisitions', 'risk_score': None} has non list value Acquisitions for path name. Must be list or null.
What I want to get a a table with columns as presented below:
So it seems that python doesn't know how to treat most granular level which contains null values.
I've tried approach from here: Use json_normalize to normalize json with nested arrays
but unsuccessfully.
I would like to get the same view which I was able to 'unpack' in Power BI but this way of having the data is not satisfying me.
All risk indicators are the components of risks which are part of risk groups as You can see on attached pictures.
Going to the most granular level it goes this way: Risk groups -> Risks -> Indicators so I would like to unpack it all.
Any help more than desperately needed. Kindly thank You for any support here.
I am trying to convert an Excel to nested JSON using Python where the repeated values go in as an array of elements.
Ex: structure of CSV
Manufacturer,oilType,viscosity
shell,superOil,1ova
shell,superOil,2ova
shell,normalOil,1ova
bp, power, 10bba
Should be displayed in JSON (expected output) as
elements: [
{
"Manufacturer": "shell",
"details": [
{
"OilType": "superOil",
"Viscosity": [
"1ova",
"2ova"
]
},
{
"OilType": "normalOil",
"Viscosity": [
"1ova"
]
}
]
},
{
"Manufacturer": "bp",
"details": [
{
"OilType": "power",
"Viscosity": [
"10bba"
]
}
]
}
]
I have currently converted the CSV into JSON using openpyxl and the values are displayed for each of the headers in format like (Current output)
[{Manufacturer: "shell", oilType: "superOil", Viscosity:"1ova"},{...},{...},...]
Please help in getting the expected output.
Hi and welcome to StackOverflow.
Your question has actually nothing to do with openpyxl because you don't need to save into an Excel file.
You can do thought:
Load the csv (or Excel) into a pandas DataFrame
Group by Manufacturer and oil type
Dump into the format you want
Transform to JSON (either string or file)
In practice, that gives something like that:
import json
import pandas as pd
df = pd.read_csv("oil.csv") # or read_excel if this is an Excel
oils = df.groupby(["Manufacturer", "oilType"]).aggregate(pd.Series.to_list)
elements = [
{
"Manufacturer": manufacturer,
"Details": [
{"OilType": o, "Viscosity": v}
for o, v in data.droplevel(0).viscosity.items()
],
}
for manufacturer, data in oils.groupby(level="Manufacturer")
]
with open("oil.json", "w") as f:
json.dump({"elements": elements}, f)
For information, oils would look like this:
viscosity
Manufacturer oilType
bp power [10bba]
shell normalOil [1ova]
superOil [1ova, 2ova]
I have some dynamically generated nested json that I want to convert to a CSV file using python. I am trying to use pandas for this. My question is - is there a way to use this and flatten the json data to put in the csv without knowing the json keys that need flattened in advance? An example of my data is this:
{
"reports": [
{
"name": "report_1",
"details": {
"id": "123",
"more info": "zyx",
"people": [
"person1",
"person2"
]
}
},
{
"name": "report_2",
"details": {
"id": "123",
"more info": "zyx",
"actions": [
"action1",
"action2"
]
}
}
]
}
More nested json objects can be dynamically generated in the "details" section that I do not know about in advance but need to be represented in their own cell in the csv.
For the above example, I'd want the csv to look something like this:
Name, Id, More Info, People_1, People_2, Actions_1, Actions_2
report_1, 123, zxy, person1, person2, ,
report_2, 123, zxy , , , action1 , action2
Here's the code I have:
data = json.loads('{"reports": [{"name": "report_1","details": {"id": "123","more info": "zyx","people": ["person1","person2"]}},{"name": "report_2","details": {"id": "123","more info": "zyx","actions": ["action1","action2"]}}]}')
df = pd.json_normalize(data['reports'])
df.to_csv("test.csv")
And here is the outcome currently:
,name,details.id,details.more info,details.people,details.actions
0,report_1,123,zyx,"['person1', 'person2']",
1,report_2,123,zyx,,"['action1', 'action2']"
I think what your are lookig for is:
https://pandas.pydata.org/docs/reference/api/pandas.json_normalize.html
If using pandas doesn't work for you, here's the more canonical Python way of doing it.
You're trying to write out a CSV file, and that implicitly means you must write out a header containing all the keys.
The constraint that you don't know the keys in advance means you can't do this in a single pass.
def convert_record_to_flat_dict(record):
# You need to figure out exactly how you want to do this; everything
# should be strings.
record.update(record.pop('details'))
return record
header = {}
rows = [[]] # Leave the header row blank for now.
csv_out = csv.writer(buffer)
for record in report:
record = convert_record_to_flat_dict(record)
for key in record.keys():
if key not in header:
header[key] = len(header)
rows[0].append(key)
row = [''] * len(header)
for key, index in header.items():
row[index] = record.get(key, '')
rows.append(row)
# And you can go back to ensure all rows have the same number of keys:
for row in rows:
row.extend([''] * (len(row) - len(header)))
Now you have a list of lists that's ready to be sent to csv.csvwriter() or the like.
If memory is an issue, another technique is to write out a temporary file and then reprocess it once you know the header.
I have one json payload which is used for one service request. After processing that payload(JSON) will be stored in S3 and through Athena we can download those data in CSV format. Now in the actual scenario, there are more than 100 fields. I want to verify their value through some automated script instead of manual.
say my sample payload is similar to the following:
{
"BOOK": {
"serialno": "123",
"author": "xyz",
"yearofpublish": "2015",
"price": "16"
}, "Author": [
{
"isbn": "xxxxx", "title": "first", "publisher": "xyz", "year": "2020"
}, {
"isbn": "yyyy", "title": "second", "publisher": "zmy", "year": "2019"
}
]
}
the sample csv will be like following:
Can anyone please help me how exactly I can do it on Python? Maybe the library or dictionary?
it looks like you just want to flatten out the JSON structure. It'll be easiest to loop over the "Author" list. Since the CSV has renamed the columns you'll need some way to represent that mapping. Based only on example, this works:
import json
fin=open(some_json_file, 'r')
j=json.load(fin)
result=[]
for author in j['Author']:
val = {'book_serialno': j['BOOK']['serialno'],
'book_author': j['BOOK']['author'],
'book_yearofpublish': j['BOOK']['yearofpublish'],
'book_price': j['BOOK']['price'],
'author_isbn': author['isbn'],
'author_title': author['title'],
'author_publisher': author['publisher'],
'author_year': author['year']}
result.append(val)
This is using a dictionary to show the mapping of data points to the new column names. You might be able to get away with using a list as well. Depends how you want to use it later on. To write to a CSV:
import csv
fout=open(some_csv_file, 'w')
writer=csv.writer(fout)
writer.writerow(result[0].keys())
writer.writerows(r.values() for r in result)
This writes the column names in the first row, then the data. If you don't want the column names, just leave out the writerow(...) line.
I've got JSON data formatted like this.
{
"website": "http://www.zebrawebworks.com/zebra/bluetavern/day.cfm?&year=2018&month=6&day=29",
"date": "2018-06-29",
"headliner": [
"Delta Ringnecks",
"Flathead String Band"
],
"data": [
"4:00 PM",
"FEE: $0",
"Jug Band Music",
"8:00 PM",
"FEE: $5",
"Old Time Fiddle & Banjoby some young turks!"
]
}
I'm working through a bunch of items that look like this in a for concert in data: loop. On dates where there are two concerts like this, I need to create a new Python dictionary, so that each concert is in its own dictionary like so:
{
"website": "http://www.zebrawebworks.com/zebra/bluetavern/day.cfm?&year=2018&month=6&day=29",
"date": "2018-06-29",
"headliner": "Delta Ringnecks",
"data": [
"4:00 PM",
"FEE: $0",
"Jug Band Music",
]
},
{
"website": "http://www.zebrawebworks.com/zebra/bluetavern/day.cfm?&year=2018&month=6&day=29",
"date": "2018-06-29",
"headliner": "Flathead String Band"
"data": [
"8:00 PM",
"FEE: $5",
"Old Time Fiddle & Banjoby some young turks!"
]
}
Is there a recommended way to do this? I can't change the data in the for loop itself, right?, because then it would screw up my iteration.
Could I append it to the end of data so that the for loop covers the new dictionaries (I would still need to parse some data after everything is separated)?
Or should I maybe create a new dictionary with the split days, delete the two-concerts-in-one-day objects, and then combine the dictionaries I have left?
I hope this is enough info and that I'm not mixing terminology too much. I'm very new to the JSON Python module and have been struggling with how to efficiently approach this problem. Thank you.
I suggest you create copies of the dict and store the specific data in each one. For example:
result = []
for pos in range(0, len(original_dict['headliner'])):
new_dict = original_dict.copy()
new_dict['data'] = original_dict['data'][pos*3:(pos+1)*3]
new_dict['headliner'] = original_dict['headliner'][pos]
result.append(new_dict)
print(result)
You can get a pretty clean version of this using the grouper idiom from the itertools documentation:
In [42]: new_list = [{'website': d['website'], 'date': d['date'], 'headliner': headliner, 'data': list(datarow)}
...: for headliner, datarow in zip(d['headliner'], grouper(d['data'], 3))]
...:
In [43]: new_list
Out[43]:
[{'website': 'http://www.zebrawebworks.com/zebra/bluetavern/day.cfm?&year=2018&month=6&day=29',
'date': '2018-06-29',
'headliner': 'Delta Ringnecks',
'data': ['4:00 PM', 'FEE: $0', 'Jug Band Music']},
{'website': 'http://www.zebrawebworks.com/zebra/bluetavern/day.cfm?&year=2018&month=6&day=29',
'date': '2018-06-29',
'headliner': 'Flathead String Band',
'data': ['8:00 PM',
'FEE: $5',
'Old Time Fiddle & Banjoby some young turks!']}]
Here's the solution I came up with thanks with the help of nosklo above. Hope it helps someone with a similar problem in the future.
new_concerts = []
for concert in blue_data:
if len(concert['headliner']) == 2:
new_concert = concert.copy()
new_concert['headliner'] = str(concert['headliner'][1])
concert['headliner'] = str(concert['headliner'][0])
mid = len(concert['data']) / 2
new_concert['data'] = concert['data'][mid:]
concert['data'] = concert['data'][0:mid]
new_concerts.append(new_concert)
blue_data = blue_data + new_concerts