How to normalize a nested json with json_normalize - python

I am trying to create a pandas dataframe out of a nested json. For some reason, I seem to be unable to address the third level.
My json looks something like this:
"numberOfResults": 376,
"results": [
{
"name": "single",
"docs": [
{
"id": "RAKDI342342",
"type": "Culture",
"category": "Culture",
"media": "unknown",
"label": "exampellabel",
"title": "testtitle and titletest",
"subtitle": "Archive"
]
},
{
"id": "GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER",
"type": "Culture",
"category": "Culture",
"media": "image",
"label": "more label als example",
"title": "test the second title",
"subtitle": "picture"
and so on.
Within the "docs"-part are all the actual results, starting with "id". Once all the information is there, the next block starting with "id" simply follows.
Now I am trying to create a table with the keys id, label and title (for a start) for each of these separate blocks (in this case actual items).
After defining the search_url (where I get the json from), my code for this currently looks like this:
result = requests.get(search_url)
data = result.json()
data.keys()
With this, I'm told the dict keys are the following:
dict_keys(['numberOfResults', 'results', 'facets', 'entities', 'fulltexts', 'correctedQuery', 'highlightedTerms', 'randomSeed', 'nextCursorMark'])
Given the json from above, I know I want to look into "results" and then further into "docs". According to the documentation I found, I should be able to achieve this by addressing the results-part directly and then addressing the nested bit by separating the fields with ".".
I have now tried the following code:
fields = ["docs.id", "docs.label", "docs.title"]
df = pd.json_normalize(data["results"])
df[fields]
This works until df[fields] - at this stage the program tells me:
KeyError: "['docs.id'] not in index"
It does work for the level above though, so if I try the same with "name" and "docs" I get a lovely dataframe. What am I doing wrong? I am still a python and pandas beginner and would appreciate any help very much!
EDIT:
The desired dataframe output would look roughly like this:
id label title
0 RAKDI342342 exampellabel testtitle and titletest

Use pandas.json_normalize()
The following code uses pandas v.1.2.4
If you don't want the other columns, remove the list of keys assigned to meta
Use pandas.DataFrame.drop to remove any other unwanted columns from df.
import pandas as pd

# record_path walks results -> docs; meta pulls in fields from the enclosing levels
df = pd.json_normalize(data, record_path=['results', 'docs'],
                       meta=[['results', 'name'], 'numberOfResults'])
display(df)
id type category media label title subtitle results.name numberOfResults
0 RAKDI342342 Culture Culture unknown exampellabel testtitle and titletest Archive single 376
1 GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER Culture Culture image more label als example test the second title picture single 376
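For example, a minimal sketch of trimming df down to the three columns the question asks about, either by selecting or by dropping (column names taken from the output above):

# keep only the requested columns
df_small = df[['id', 'label', 'title']]

# or equivalently, drop the unwanted ones
df_small = df.drop(columns=['type', 'category', 'media', 'subtitle',
                            'results.name', 'numberOfResults'])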
Data
The posted JSON / dict is not correctly formed (note the stray ] and the missing closers).
Assuming the following corrected form:
data = \
{'numberOfResults': 376,
'results': [{'docs': [{'category': 'Culture',
'id': 'RAKDI342342',
'label': 'exampellabel',
'media': 'unknown',
'subtitle': 'Archive',
'title': 'testtitle and titletest',
'type': 'Culture'},
{'category': 'Culture',
'id': 'GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER',
'label': 'more label als example',
'media': 'image',
'subtitle': 'picture',
'title': 'test the second title',
'type': 'Culture'}],
'name': 'single'}]}

Related

How to efficiently fix JSON file converted from pandas dataframe

I have a JSON file that I read into pandas and converted to a dataframe. I then exported this file as a CSV so I could edit it more easily. Once finished, I read the CSV file back into a dataframe and then wanted to convert it back to a JSON file. However, in that process a whole lot of extra data was automatically added to my original list of dictionaries (the JSON file).
I'm sure I could hack together a fix, but wanted to know if anyone knows an efficient way to handle this process so that NO new data or columns are added to my original JSON data?
Original JSON (snippet):
[
    {
        "tag": "!= (not-equal-to operator)",
        "definition": "",
        "source": [
            {
                "title": "Compare Dictionaries",
                "URL": "https://learning.oreilly.com/library/view/introducing-python-2nd/9781492051374/ch08.html#idm45795007002280"
            }
        ]
    },
    {
        "tag": "\"intelligent\" applications",
        "definition": "",
        "source": [
            {
                "title": "Why Machine Learning?",
                "URL": "https://learning.oreilly.com/library/view/introduction-to-machine/9781449369880/https://learning.oreilly.com/library/view/introduction-to-machine/9781449369880/ch01.html#idm45613685872600"
            }
        ]
    },
    {
        "tag": "# (pound sign)",
        "definition": "",
        "source": [
            {
                "title": "Comment with #",
                "URL": "https://learning.oreilly.com/library/view/introducing-python-2nd/9781492051374/ch04.html#idm45795038172984"
            }
        ]
    },
CSV as a dataframe (an index was automatically added):
tag definition source
0 != (not-equal-to operator) [{'title': 'Compare Dictionaries', 'URL': 'htt...
1 "intelligent" applications [{'title': 'Why Machine Learning?', 'URL': 'ht...
2 # (pound sign) [{'title': 'Comment with #', 'URL': 'https://l...
3 $ (Mac/Linux prompt) [{'title': 'Test Driving Python', 'URL': 'http...
4 $ (anchor) [{'title': 'Patterns: Using Specifiers', 'URL'...
... ... ... ...
11375 { } (curly brackets) []
11376 | (vertical bar) [{'title': 'Combinations and Operators', 'URL'...
11377 || (concatenation) function (DB2/Oracle/Postgr... [{'title': 'Discussion', 'URL': 'https://learn...
11378 || (for Oracle Database) [{'title': 'Including special characters', 'UR...
11379 || (vertical bar, double), concatenation opera... [{'title': 'Including special characters', 'UR...
7009 rows × 3 columns
JSON file after converting from CSV (all sorts of awful):
{
    "0": {
        "Unnamed: 0": 0,
        "tag": "!= (not-equal-to operator)",
        "definition": null,
        "source": "[{'title': 'Compare Dictionaries', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch08.html#idm45795007002280'}]"
    },
    "1": {
        "Unnamed: 0": 1,
        "tag": "\"intelligent\" applications",
        "definition": null,
        "source": "[{'title': 'Why Machine Learning?', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/ch01.html#idm45613685872600'}]"
    },
    "2": {
        "Unnamed: 0": 2,
        "tag": "# (pound sign)",
        "definition": null,
        "source": "[{'title': 'Comment with #', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch04.html#idm45795038172984'}]"
    },
Here is my code:
import pandas as pd
import json
# to dataframe
tags_df = pd.read_json('dsa_tags_flat.json')
# csv file was manually cleaned then reloaded here
cleaned_csv_df = pd.read_csv('dsa-parser-flat.csv')
# write to JSON
cleaned_csv_df.to_json(r'dsa-tags.json', orient='index', indent=2)
EDIT: I added index=False to the code when going from dataframe to CSV, which helped, but I still have the index of keys there that were not in the original JSON. I wonder if a library function out there would prevent this? Or do I have to just write some loops and remove them myself?
Also, as you can see, the URL forward-slashes were escaped. Not what I wanted.
{
    "0": {
        "tag": "!= (not-equal-to operator)",
        "definition": null,
        "source": "[{'title': 'Compare Dictionaries', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch08.html#idm45795007002280'}]"
    },
    "1": {
        "tag": "\"intelligent\" applications",
        "definition": null,
        "source": "[{'title': 'Why Machine Learning?', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/ch01.html#idm45613685872600'}]"
    },
    "2": {
        "tag": "# (pound sign)",
        "definition": null,
        "source": "[{'title': 'Comment with #', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch04.html#idm45795038172984'}]"
    },
    "3": {
        "tag": "$ (Mac\/Linux prompt)",
        "definition": null,
        "source": "[{'title': 'Test Driving Python', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/data-wrangling-with\/9781491948804\/ch01.html#idm140080973230480'}]"
    },
The issue is that you are adding an index in two places.
First, while writing your file to CSV. This adds the "Unnamed: 0" fields in the final JSON file. You can pass index=False to the to_csv method when writing the CSV to disk, or specify the index_col parameter when reading the saved CSV back with read_csv.
Second, you are adding an index while writing the df to JSON with orient="index". This adds the outermost indices such as "0" and "1" in the final JSON file. You should use orient="records" if you intend to save the JSON in a format similar to the one it was loaded in.
To understand how the orient parameter works, refer to pandas.DataFrame.to_json
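Putting both fixes together, a minimal sketch of the corrected round trip (file names taken from the question); the json-module variant at the end is an assumed workaround for the escaped forward slashes, since to_json escapes them by default:

import json
import pandas as pd

tags_df = pd.read_json('dsa_tags_flat.json')

# no index column is written, so no "Unnamed: 0" appears later
tags_df.to_csv('dsa-parser-flat.csv', index=False)

# csv file is manually cleaned, then reloaded here
cleaned_csv_df = pd.read_csv('dsa-parser-flat.csv')

# orient='records' reproduces the original list-of-dicts layout
cleaned_csv_df.to_json(r'dsa-tags.json', orient='records', indent=2)

# assumed workaround: dump through the json module to keep '/' unescaped
with open('dsa-tags.json', 'w') as fh:
    json.dump(cleaned_csv_df.to_dict(orient='records'), fh, indent=2)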

Write multiple items to DynamoDB table

I have the following list of dictionaries:
datalist = [
    {'Business': 'Business A', 'Category': 'IT', 'Title': 'IT Manager'},
    {'Business': 'Business A', 'Category': 'Sourcing', 'Title': 'Sourcing Manager'}
]
I would like to upload to DynamoDB in the following format:
table.put_item(Item={
        'Business': 'Business A would go here',
        'Category': 'IT would go here',
        'Title': 'IT Manager would go here'
    },
    {
        'Business': 'Business B would go here',
        'Category': 'Sourcing would go here',
        'Title': 'Sourcing Manager would go here'
    }
)
I've tried converting the list of dicts to a dict first and accessing the elements that way, and iterating through the Item parameter, but no luck. Any help would be appreciated.
Here's my DynamoDB structure, 3 columns (Business, Category, Title):
{
    "Business": {
        "S": "Business goes here"
    },
    "Category": {
        "S": "Category goes here"
    },
    "Title": {
        "S": "Title goes here"
    }
}
The put_item API call allows you to upload exactly one item; you're trying to upload two at the same time, which doesn't work.
You can batch multiple put-item requests using boto3 to decrease the number of API calls. Here's an example adapted from the documentation:
with table.batch_writer() as batch:
    for item in datalist:
        batch.put_item(Item=item)
This will automatically group the put requests into batch writes under the hood.
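A fuller sketch with the surrounding boto3 setup; the table name 'MyTable' here is hypothetical, and configured credentials are assumed:

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('MyTable')  # hypothetical table name

datalist = [
    {'Business': 'Business A', 'Category': 'IT', 'Title': 'IT Manager'},
    {'Business': 'Business A', 'Category': 'Sourcing', 'Title': 'Sourcing Manager'}
]

# the batch writer buffers the puts and sends them in batches of up to 25
with table.batch_writer() as batch:
    for item in datalist:
        batch.put_item(Item=item)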

Select specific keys inside a json using python

I have the following JSON that I extracted using requests with Python and json.loads. The whole JSON basically repeats itself with changes in the IDs and names. It has a lot of information but I'm just posting a small sample as an example:
"status":"OK",
"statuscode":200,
"message":"success",
"apps":[
{
"id":"675832210",
"title":"AGED",
"desc":"No annoying ads&easy to play",
"urlImg":"https://test.com/pImg.aspx?b=675832&z=1041813&c=495181&tid=API_MP&u=https%3a%2f%2fcdna.test.com%2fbanner%2fwMMUapCtmeXTIxw_square.png&q=",
"urlImgWide":"https://cdna.test.com/banner/sI9MfGhqXKxVHGw_rectangular.jpeg",
"urlApp":"https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q=",
"androidPackage":"com.agedstudio.freecell",
"revenueType":"cpi",
"revenueRate":"0.10",
"categories":"Card",
"idx":"2",
"country":[
"CH"
],
"cityInclude":[
"ALL"
],
"cityExclude":[
],
"targetedOSver":"ALL",
"targetedDevices":"ALL",
"bannerId":"675832210",
"campaignId":"495181210",
"campaignType":"network",
"supportedVersion":"",
"storeRating":"4.3",
"storeDownloads":"10000+",
"appSize":"34603008",
"urlVideo":"",
"urlVideoHigh":"",
"urlVideo30Sec":"https://cdn.test.com/banner/video/video-675832-30.mp4?rnd=1620699136",
"urlVideo30SecHigh":"https://cdn.test.com/banner/video/video-675832-30_o.mp4?rnd=1620699131",
"offerId":"5825774"
},
I don't need all that data, just a few keys like 'title', 'country', 'revenueRate' and 'urlApp', but I don't know if there is a way to extract only those.
My solution so far was to turn the JSON into a dataframe and then drop the columns, however, I wanted to find an easier solution.
My ideal final result would be to have a dataframe with selected keys and arrays
Does anybody know an easy solution for this problem?
Thanks
I assume you have that data as a dictionary; let's call it json_data. You can just iterate over the apps and write them into a list. Alternatively, you could obviously also define a class and initialize objects of that class.
EDIT:
I just found this answer: https://stackoverflow.com/a/20638258/6180150, which tells how you can convert a list of dicts like the one from my sample code into a dataframe. See the adaptations to the code below for a solution.
json_data = {
    "status": "OK",
    "statuscode": 200,
    "message": "success",
    "apps": [
        {
            "id": "675832210",
            "title": "AGED",
            "desc": "No annoying ads&easy to play",
            "urlImg": "https://test.com/pImg.aspx?b=675832&z=1041813&c=495181&tid=API_MP&u=https%3a%2f%2fcdna.test.com%2fbanner%2fwMMUapCtmeXTIxw_square.png&q=",
            "urlImgWide": "https://cdna.test.com/banner/sI9MfGhqXKxVHGw_rectangular.jpeg",
            "urlApp": "https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q=",
            "androidPackage": "com.agedstudio.freecell",
            "revenueType": "cpi",
            "revenueRate": "0.10",
            "categories": "Card",
            "idx": "2",
            "country": [
                "CH"
            ],
            "cityInclude": [
                "ALL"
            ],
            "cityExclude": [
            ],
            "targetedOSver": "ALL",
            "targetedDevices": "ALL",
            "bannerId": "675832210",
            "campaignId": "495181210",
            "campaignType": "network",
            "supportedVersion": "",
            "storeRating": "4.3",
            "storeDownloads": "10000+",
            "appSize": "34603008",
            "urlVideo": "",
            "urlVideoHigh": "",
            "urlVideo30Sec": "https://cdn.test.com/banner/video/video-675832-30.mp4?rnd=1620699136",
            "urlVideo30SecHigh": "https://cdn.test.com/banner/video/video-675832-30_o.mp4?rnd=1620699131",
            "offerId": "5825774"
        },
    ]
}
filtered_data = []
for app in json_data["apps"]:
    app_data = {
        "id": app["id"],
        "title": app["title"],
        "country": app["country"],
        "revenueRate": app["revenueRate"],
        "urlApp": app["urlApp"],
    }
    filtered_data.append(app_data)
print(filtered_data)
# Output
d = [
    {
        'id': '675832210',
        'title': 'AGED',
        'country': ['CH'],
        'revenueRate': '0.10',
        'urlApp': 'https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q='
    }
]
d = pd.DataFrame(filtered_data)
print(d)
# Output
id title country revenueRate urlApp
0 675832210 AGED [CH] 0.10 https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q=
If your endgame is a dataframe, just load the dataframe and take the columns you want.
Setting the JSON to data:
df = pd.json_normalize(data['apps'])
yields
id title desc urlImg ... urlVideoHigh urlVideo30Sec urlVideo30SecHigh offerId
0 675832210 AGED No annoying ads&easy to play https://test.com/pImg.aspx?b=675832&z=1041813&... ... https://cdn.test.com/banner/video/video-675832... https://cdn.test.com/banner/video/video-675832... 5825774
[1 rows x 28 columns]
then if you want certain columns:
df_final = df[['title', 'desc', 'urlImg']]
title desc urlImg
0 AGED No annoying ads&easy to play https://test.com/pImg.aspx?b=675832&z=1041813&...
Use a dictionary comprehension to extract a dictionary of the key/value pairs you want:
import json
json_string="""{
"id":"675832210",
"title":"AGED",
"desc":"No annoying ads&easy to play",
"urlApp":"https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q=",
"revenueRate":"0.10",
"categories":"Card",
"idx":"2",
"country":[
"CH"
],
"cityInclude":[
"ALL"
],
"cityExclude":[
]
}"""
json_dict = json.loads(json_string)
filter_fields=['title','country','revenueRate','urlApp']
dict_result = { key: json_dict[key] for key in json_dict if key in filter_fields}
json_elements = []
for key in dict_result:
    json_elements.append((key, json_dict[key]))
print(json_elements)
output:
[('title', 'AGED'), ('urlApp', 'https://admin.test.com/appLink.aspx?b=675832&e=1041813&tid=API_MP&sid=2c5cee038cd9449da35bc7b0f53cf60f&q='), ('revenueRate', '0.10'), ('country', ['CH'])]
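If the end goal is again a dataframe, the same comprehension can be applied across every element of "apps"; a minimal sketch, reusing json_data from the earlier answer:

import pandas as pd

filter_fields = ['title', 'country', 'revenueRate', 'urlApp']
rows = [{key: app[key] for key in filter_fields if key in app}
        for app in json_data['apps']]
df = pd.DataFrame(rows)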

Python - How to create a JSON nested file from a Pandas dataframe and group by?

so I'm having some trouble creating an appropriate JSON format from a pandas dataframe. My dataframe looks like this (sorry for the csv format):
first_date, second_date, id, type, codename, description, price
201901,201902,05555,111,01111,1,200.00
201901,201902,05555,111,023111,44,120.00
201901,201902,05555,111,14113,23,84.00
As you can see, the first four columns have repeated values, so I would like to group all my columns into two groups to get this JSON file:
[
    {
        "report":
        {
            "first_date":201901,
            "second_date": 201902,
            "id":05555,
            "type": 111
        },
        "features": [
            {
                "codename":01111,
                "description": 1,
                "price":200.00
            },
            {
                "codename":023111,
                "description": 44,
                "price":120.00
            },
            {
                "codename":14113,
                "description": 23,
                "price":84.00
            }
        ]
    }
]
So far I've tried to group by the last three columns, add them to a dictionary and rename them:
cols = ["codename","description","price"]
rep = (df.groupby(["first_date","second_date","id","type"])[cols]
.apply(lambda x:x.to_dict('r')
.reset_index(name="features")
.to_json(orient="records"))
output = json.dumps(json.loads(rep),indent=4)
And I get this as the output:
[
    {
        "first_date":201901,
        "second_date": 201902,
        "id":05555,
        "type": 111,
        "features": [
            {
                "codename":01111,
                "description": 1,
                "price":200.00
            },
            {
                "codename":023111,
                "description": 44,
                "price":120.00
            },
            {
                "codename":14113,
                "description": 23,
                "price":84.00
            }
        ]
    }
]
Can anyone guide me on how to rename and group the first group of columns? Or does anyone know another approach to this problem? I would like to do it this way since I have to repeat the same procedure with more groups of columns, and from my searching this seems simpler than creating the JSON from several for loops.
Any advice will surely be helpful! I've been searching a lot, but this is my first approach to this type of output. Thanks in advance!!!
See if this works for you:
# get rid of whitespace, if any
df.columns = df.columns.str.strip()

# split into two sections
fixed = df.columns[:4]
varying = df.columns[4:]

# create dicts for both fixed and varying
features = df[varying].to_dict('records')
report = df[fixed].drop_duplicates().to_dict('records')[0]

# combine into a dict inside a list
fin = [{"report": report, "features": features}]
print(fin)
[{'report': {'first_date': 201901,
'second_date': 201902,
'id': 5555,
'type': 111},
'features': [{'codename': 1111, 'description': 1, 'price': 200.0},
{'codename': 23111, 'description': 44, 'price': 120.0},
{'codename': 14113, 'description': 23, 'price': 84.0}]}]
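Note that drop_duplicates().to_dict('records')[0] keeps a single report, which matches the posted sample. If the data could contain several distinct report groups, a sketch of a groupby-based generalization (same fixed/varying split as above):

fin = [
    {
        "report": dict(zip(fixed, key)),
        "features": group[varying].to_dict('records')
    }
    for key, group in df.groupby(list(fixed))
]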

pd.read_json() returning dataframe with 1 column

Currently I'm trying to load a JSON file from a webscrape into Python in order to reorder some of the columns, remove some text such as the newlines (\n), etc. I'm having some issues with the JSON file: pd.read_json() works (kinda), but it returns a dataframe with 1 column titled 'Default'. My current code is below and runs without errors.
I tried the native JSON parser, but due to some stylized characters I receive an error.
import pandas as pd
from tkinter import filedialog

def main():
    file_path = filedialog.askopenfilename()
    df = pd.read_json(file_path)
    print(df)
Json file is valid and formatted as so:
{
    "Default": [{
        "ItemID": "11111",
        "Title": "A super captivating title",
        "Date": "July 22, 2019",
        "URL": "www.someurl.com",
        "BodyText": "some text."
    }, {
        "ItemID": "22222",
        "Title": "Even more captivating title",
        "Date": "July 12, 2019",
        "URL": "www.differenturl.com",
        "BodyText": "different text"
    }]
}
Now I understand that the "Default" is being interpreted as the JSON object and why it's using it as the column. I experimented with several different orients of the read_json() but received more or less the same result.
I'm hoping to have ItemID, Title, Date, URL, and BodyText be the columns and their values being appropriately designated into rows. Any help is appreciated, I couldn't find a similar question but if it has been answered before please point me in the right direction.
There is no read_json orient that will do it. What you need is to pass the "Default" content to the DataFrame constructor:
import json
import pandas as pd

# pull out the list under "Default" and let the constructor build the columns
with open('temp.txt') as fh:
    df = pd.DataFrame(json.load(fh)['Default'])
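Alternatively, pandas.json_normalize can unpack the nested list in one step; a minimal sketch against the same file:

import json
import pandas as pd

with open('temp.txt') as fh:
    df = pd.json_normalize(json.load(fh), record_path='Default')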
