How to efficiently fix JSON file converted from pandas dataframe - python

I have a JSON file that I read into pandas and converted to a dataframe. I then exported it as a CSV so I could edit it more easily. Once finished, I read the CSV file back into a dataframe and wanted to convert it back to a JSON file. However, in that process a whole lot of extra data was automatically added to my original list of dictionaries (the JSON file).
I'm sure I could hack together a fix, but I wanted to know if anyone knows an efficient way to handle this round trip so that NO new data or columns are added to my original JSON data.
Original JSON (snippet):
[
  {
    "tag": "!= (not-equal-to operator)",
    "definition": "",
    "source": [
      {
        "title": "Compare Dictionaries",
        "URL": "https://learning.oreilly.com/library/view/introducing-python-2nd/9781492051374/ch08.html#idm45795007002280"
      }
    ]
  },
  {
    "tag": "\"intelligent\" applications",
    "definition": "",
    "source": [
      {
        "title": "Why Machine Learning?",
        "URL": "https://learning.oreilly.com/library/view/introduction-to-machine/9781449369880/https://learning.oreilly.com/library/view/introduction-to-machine/9781449369880/ch01.html#idm45613685872600"
      }
    ]
  },
  {
    "tag": "# (pound sign)",
    "definition": "",
    "source": [
      {
        "title": "Comment with #",
        "URL": "https://learning.oreilly.com/library/view/introducing-python-2nd/9781492051374/ch04.html#idm45795038172984"
      }
    ]
  },
CSV as a dataframe (an index was automatically added):
tag definition source
0 != (not-equal-to operator) [{'title': 'Compare Dictionaries', 'URL': 'htt...
1 "intelligent" applications [{'title': 'Why Machine Learning?', 'URL': 'ht...
2 # (pound sign) [{'title': 'Comment with #', 'URL': 'https://l...
3 $ (Mac/Linux prompt) [{'title': 'Test Driving Python', 'URL': 'http...
4 $ (anchor) [{'title': 'Patterns: Using Specifiers', 'URL'...
... ... ... ...
11375 { } (curly brackets) []
11376 | (vertical bar) [{'title': 'Combinations and Operators', 'URL'...
11377 || (concatenation) function (DB2/Oracle/Postgr... [{'title': 'Discussion', 'URL': 'https://learn...
11378 || (for Oracle Database) [{'title': 'Including special characters', 'UR...
11379 || (vertical bar, double), concatenation opera... [{'title': 'Including special characters', 'UR...
11380 rows × 3 columns
JSON file after converting from CSV (all sorts of awful):
{
  "0":{
    "Unnamed: 0":0,
    "tag":"!= (not-equal-to operator)",
    "definition":null,
    "source":"[{'title': 'Compare Dictionaries', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch08.html#idm45795007002280'}]"
  },
  "1":{
    "Unnamed: 0":1,
    "tag":"\"intelligent\" applications",
    "definition":null,
    "source":"[{'title': 'Why Machine Learning?', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/ch01.html#idm45613685872600'}]"
  },
  "2":{
    "Unnamed: 0":2,
    "tag":"# (pound sign)",
    "definition":null,
    "source":"[{'title': 'Comment with #', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch04.html#idm45795038172984'}]"
  },
Here is my code:
import pandas as pd
import json
# to dataframe
tags_df = pd.read_json('dsa_tags_flat.json')
# csv file was manually cleaned then reloaded here
cleaned_csv_df = pd.read_csv('dsa-parser-flat.csv')
# write to JSON
cleaned_csv_df.to_json(r'dsa-tags.json', orient='index', indent=2)
EDIT: I added index=False when writing the dataframe to CSV, which helped, but I still have the index keys ("0", "1", ...) that were not in the original JSON. Is there a library function out there somewhere that would prevent this, or do I have to write some loops and remove them myself?
Also, as you can see, the URL forward slashes were escaped. Not what I wanted.
{
  "0":{
    "tag":"!= (not-equal-to operator)",
    "definition":null,
    "source":"[{'title': 'Compare Dictionaries', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch08.html#idm45795007002280'}]"
  },
  "1":{
    "tag":"\"intelligent\" applications",
    "definition":null,
    "source":"[{'title': 'Why Machine Learning?', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/https:\/\/learning.oreilly.com\/library\/view\/introduction-to-machine\/9781449369880\/ch01.html#idm45613685872600'}]"
  },
  "2":{
    "tag":"# (pound sign)",
    "definition":null,
    "source":"[{'title': 'Comment with #', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/introducing-python-2nd\/9781492051374\/ch04.html#idm45795038172984'}]"
  },
  "3":{
    "tag":"$ (Mac\/Linux prompt)",
    "definition":null,
    "source":"[{'title': 'Test Driving Python', 'URL': 'https:\/\/learning.oreilly.com\/library\/view\/data-wrangling-with\/9781491948804\/ch01.html#idm140080973230480'}]"
  },

The issue is that you are adding an index in two places.
First, while writing your file to CSV. This adds the "Unnamed: 0" fields in the final JSON file. You can pass index=False to the to_csv method when writing the CSV to disk, or specify the index_col parameter when reading the saved CSV back with read_csv.
Second, you are adding an index while writing the dataframe to JSON with orient="index". This adds the outermost keys such as "0" and "1" in the final JSON file. Use orient="records" if you intend to save the JSON in the same shape it was loaded in.
To understand how the orient parameter works, refer to the pandas.DataFrame.to_json documentation.
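Putting it together, here is a minimal sketch of the fixed round trip (file names taken from the question; it assumes the cleaned CSV was written with index=False and that every source cell holds a Python-literal list, as in the printout above). It also handles the two wrinkles from the EDIT: the source column comes back from the CSV as a string, so it is parsed back into a real list of dicts with ast.literal_eval, and the records are written out with the standard json module, which, unlike to_json, does not escape forward slashes:
import ast
import json
import pandas as pd

# Read the cleaned CSV back in; keep_default_na=False keeps empty
# definitions as "" instead of NaN/null
cleaned_csv_df = pd.read_csv('dsa-parser-flat.csv', keep_default_na=False)

# to_csv stored the 'source' lists as Python-literal strings,
# so parse them back into real lists of dicts
cleaned_csv_df['source'] = cleaned_csv_df['source'].apply(ast.literal_eval)

# Dump with the stdlib json module: no row index, no "\/" escaping
records = cleaned_csv_df.to_dict(orient='records')
with open('dsa-tags.json', 'w') as f:
    json.dump(records, f, indent=2)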

Related

Write multiple items to DynamoDB table

I have the following list of dictionaries:
datalist = [
    {'Business': 'Business A', 'Category': 'IT', 'Title': 'IT Manager'},
    {'Business': 'Business A', 'Category': 'Sourcing', 'Title': 'Sourcing Manager'}
]
I would like to upload to DynamoDB in the following format:
table.put_item(Item={
        'Business': 'Business A would go here',
        'Category': 'IT would go here',
        'Title': 'IT Manager would go here'
    },
    {
        'Business': 'Business B would go here',
        'Category': 'Sourcing would go here',
        'Title': 'Sourcing Manager would go here'
    }
)
I've tried converting the list of dicts to a dict first and accessing the elements that way, and iterating through the Items parameter, but no luck. Any help would be appreciated.
Here's my DynamoDB structure, 3 columns (Business, Category, Title):
{
  "Business": {
    "S": "Business goes here"
  },
  "Category": {
    "S": "Category goes here"
  },
  "Title": {
    "S": "Title goes here"
  }
}
The put_item API call uploads exactly one item; you're trying to upload two at the same time, and that doesn't work.
You can batch multiple put-item requests using boto3 to decrease the number of API calls. Here's an example adapted from the documentation:
with table.batch_writer() as batch:
    for item in datalist:
        batch.put_item(
            Item=item
        )
This will automatically create the batch writes under the hood.
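For reference, a minimal end-to-end sketch (the region and table name are placeholders; it assumes the table already exists):
import boto3

# placeholder region and table name - adjust for your setup
dynamodb = boto3.resource('dynamodb', region_name='us-east-1')
table = dynamodb.Table('my-table')

datalist = [
    {'Business': 'Business A', 'Category': 'IT', 'Title': 'IT Manager'},
    {'Business': 'Business A', 'Category': 'Sourcing', 'Title': 'Sourcing Manager'}
]

# batch_writer buffers the puts and sends them as BatchWriteItem
# requests, retrying any unprocessed items automatically
with table.batch_writer() as batch:
    for item in datalist:
        batch.put_item(Item=item)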

How to normalize a nested json with json_normalize

I am trying to create a pandas dataframe out of a nested json. For some reason, I seem to be unable to address the third level.
My json looks something like this:
"numberOfResults": 376,
"results": [
{
"name": "single",
"docs": [
{
"id": "RAKDI342342",
"type": "Culture",
"category": "Culture",
"media": "unknown",
"label": "exampellabel",
"title": "testtitle and titletest",
"subtitle": "Archive"
]
},
{
"id": "GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER",
"type": "Culture",
"category": "Culture",
"media": "image",
"label": "more label als example",
"title": "test the second title",
"subtitle": "picture"
and so on.
Within the "docs"-part are all the actual results, starting with "id". Once all the information is there, the next block starting with "id" simply follows.
Now I am trying to create a table with the keys id, label and title (for a start) for each of these separate blocks (in this case actual items).
After defining the search_url (where I get the json from), my code for this currently looks like this:
result = requests.get(search_url)
data = result.json()
data.keys()
With this, I can see that the dict_keys are the following:
dict_keys(['numberOfResults', 'results', 'facets', 'entities', 'fulltexts', 'correctedQuery', 'highlightedTerms', 'randomSeed', 'nextCursorMark'])
Given the json from above, I know I want to look into "results" and then further into "docs". According to the documentation I found, I should be able to achieve this by addressing the results-part directly and then addressing the nested bit by separating the fields with ".".
I have now tried the following code:
fields = ["docs.id", "docs.label", "docs.title"]
df = pd.json_normalize(data["results"])
df[fields]
This works until df[fields] - at that stage the program tells me:
KeyError: "['docs.id'] not in index"
It does work for the level above, though: if I try the same with "name" and "docs", I get a lovely dataframe. What am I doing wrong? I am still a Python and pandas beginner and would appreciate any help very much!
EDIT:
The desired dataframe output would look roughly like this:
id label title
0 RAKDI342342 exampellabel testtitle and titletest
Use pandas.json_normalize()
The following code uses pandas v1.2.4.
If you don't want the other columns, remove the list of keys assigned to meta.
Use pandas.DataFrame.drop to remove any other unwanted columns from df.
import pandas as pd
df = pd.json_normalize(data, record_path=['results', 'docs'], meta=[['results', 'name'], 'numberOfResults'])
display(df)
id type category media label title subtitle results.name numberOfResults
0 RAKDI342342 Culture Culture unknown exampellabel testtitle and titletest Archive single 376
1 GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER Culture Culture image more label als example test the second title picture single 376
Data
The posted JSON / dict is not correctly formed; assuming the following corrected form:
data = \
{'numberOfResults': 376,
 'results': [{'docs': [{'category': 'Culture',
                        'id': 'RAKDI342342',
                        'label': 'exampellabel',
                        'media': 'unknown',
                        'subtitle': 'Archive',
                        'title': 'testtitle and titletest',
                        'type': 'Culture'},
                       {'category': 'Culture',
                        'id': 'GUI6N5QHBPTO6GJ66VP5OXB7GKX6J7ER',
                        'label': 'more label als example',
                        'media': 'image',
                        'subtitle': 'picture',
                        'title': 'test the second title',
                        'type': 'Culture'}],
              'name': 'single'}]}
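And if you only want the id, label, and title fields from the question, one option (a small sketch run against the corrected data above) is to normalize just the docs records and subset the columns afterwards, which matches the desired output in the EDIT:
import pandas as pd

# flatten only the nested docs records, then keep the requested columns
df = pd.json_normalize(data, record_path=['results', 'docs'])
df = df[['id', 'label', 'title']]
print(df)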

Python - How to create a JSON nested file from a Pandas dataframe and group by?

So I'm having some trouble creating an appropriate JSON format from a pandas dataframe. My dataframe looks like this (sorry for the CSV format):
first_date, second_date, id, type, codename, description, price
201901,201902,05555,111,01111,1,200.00
201901,201902,05555,111,023111,44,120.00
201901,201902,05555,111,14113,23,84.00
As you can see, the first four columns have repeated values, so I would like to group all my columns into two groups to get this JSON file:
[
  {
    "report": {
      "first_date": 201901,
      "second_date": 201902,
      "id": 05555,
      "type": 111
    },
    "features": [
      {
        "codename": 01111,
        "description": 1,
        "price": 200.00
      },
      {
        "codename": 023111,
        "description": 44,
        "price": 120.00
      },
      {
        "codename": 14113,
        "description": 23,
        "price": 84.00
      }
    ]
  }
]
So far I've tried to group by the last three columns, add them to a dictionary and rename them:
cols = ["codename", "description", "price"]
rep = (df.groupby(["first_date", "second_date", "id", "type"])[cols]
         .apply(lambda x: x.to_dict("records"))
         .reset_index(name="features")
         .to_json(orient="records"))
output = json.dumps(json.loads(rep), indent=4)
And I get this as the output:
[
  {
    "first_date": 201901,
    "second_date": 201902,
    "id": 05555,
    "type": 111,
    "features": [
      {
        "codename": 01111,
        "description": 1,
        "price": 200.00
      },
      {
        "codename": 023111,
        "description": 44,
        "price": 120.00
      },
      {
        "codename": 14113,
        "description": 23,
        "price": 84.00
      }
    ]
  }
]
Can anyone guide me on how to rename and group the first group of columns? Or does anyone know another approach to this problem? I would like to do it this way since I have to repeat the same procedure with more groups of columns, and from my searching this seems simpler than creating the JSON from several for loops.
Any advice will be helpful! I've been searching a lot, but this is my first approach to this type of output. Thanks in advance!
See if this works for you:
# get rid of whitespace in the column names, if any
df.columns = df.columns.str.strip()

# split the columns into two sections
fixed = df.columns[:4]
varying = df.columns[4:]

# create dicts for both the fixed and varying parts
features = df[varying].to_dict('records')
report = df[fixed].drop_duplicates().to_dict('records')[0]

# combine into a dict inside a list
fin = [{"report": report, "features": features}]
print(fin)
[{'report': {'first_date': 201901,
             'second_date': 201902,
             'id': 5555,
             'type': 111},
  'features': [{'codename': 1111, 'description': 1, 'price': 200.0},
               {'codename': 23111, 'description': 44, 'price': 120.0},
               {'codename': 14113, 'description': 23, 'price': 84.0}]}]
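Note that this takes [0] after drop_duplicates(), so it assumes the whole CSV belongs to a single report group. If the file can contain several distinct report groups, a sketch along the lines of the question's own groupby attempt (column names taken from the question) builds one entry per group:
import json

report_cols = ["first_date", "second_date", "id", "type"]
feature_cols = ["codename", "description", "price"]

# one flat record per group, with the feature rows collected into a list
flat = (df.groupby(report_cols)[feature_cols]
          .apply(lambda g: g.to_dict("records"))
          .reset_index(name="features")
          .to_json(orient="records"))

# re-nest each flat record into the report/features shape
result = [
    {"report": {k: row[k] for k in report_cols},
     "features": row["features"]}
    for row in json.loads(flat)
]
print(json.dumps(result, indent=4))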

Splitting a string in json using python

I have a simple Json file
input.json
[
  {
    "title": "Person",
    "type": "object",
    "required": "firstName",
    "min_max": "200/600"
  },
  {
    "title": "Person1",
    "type": "object2",
    "required": "firstName1",
    "min_max": "230/630"
  },
  {
    "title": "Person2",
    "type": "object2",
    "required": "firstName2",
    "min_max": "201/601"
  },
  {
    "title": "Person3",
    "type": "object3",
    "required": "firstName3",
    "min_max": "2000/6000"
  },
  {
    "title": "Person4",
    "type": "object4",
    "required": "firstName4",
    "min_max": "null"
  },
  {
    "title": "Person4",
    "type": "object4",
    "required": "firstName4",
    "min_max": "1024 / 256"
  },
  {
    "title": "Person4",
    "type": "object4",
    "required": "firstName4",
    "min_max": "0"
  }
]
I am trying to create a new JSON file with new data. I would like to split "min_max" into two different fields, i.e., min and max. Below is the code, written in Python.
import json

input = open('input.json', 'r')
output = open('test.json', 'w')
json_decode = json.load(input)

result = []
for item in json_decode:
    my_dict = {}
    my_dict['title'] = item.get('title')
    my_dict['min'] = item.get('min_max')
    my_dict['max'] = item.get('min_max')
    result.append(my_dict)

data = json.dumps(result, output)
output.write(data)
output.close()
How do I split the string into two different values? Also, is there any possibility of printing the JSON output in order?
Your example JSON file seems to be written wrong: it is not a list, just a single associative array (or dictionary, in Python). Additionally, you aren't using json.dumps properly: it serializes to a string and doesn't take a file object (for writing straight to a file, use json.dump). I also figured it would be easier to just create each dictionary inline. And you aren't actually splitting min_max at all.
Here's the correct input:
[{
    "title": "Person",
    "type": "object",
    "required": "firstName",
    "min_max": "20/60"
}]
Here's your new code:
import json

with open('input.json', 'r') as inp, open('test.json', 'w') as outp:
    json_decode = json.load(inp)
    result = []
    for temp in json_decode:
        # note: this assumes every min_max contains a "/";
        # values like "null" or "0" would raise an IndexError below
        minMax = temp["min_max"].split("/")
        result.append({
            "title": temp["title"],
            "min": minMax[0],
            "max": minMax[1]
        })
    data = json.dumps(result)
    outp.write(data)
Table + Python == Pandas
import pandas as pd

# Read old json to a dataframe
df = pd.read_json("input.json")

# Create two new columns based on min_max
# Removes empty spaces with strip()
# Returns [None, None] if length of split is not equal to 2
df['min'], df['max'] = zip(*df['min_max'].apply(
    lambda x: [i.strip() for i in x.split("/")]
    if len(x.split("/")) == 2 else [None, None]))

# 'delete' (drop) min_max column
df.drop('min_max', axis=1, inplace=True)

# output to json again
df.to_json("test.json", orient='records')
Result:
[{'max': '600',
  'min': '200',
  'required': 'firstName',
  'title': 'Person',
  'type': 'object'},
 {'max': '630',
  'min': '230',
  'required': 'firstName1',
  'title': 'Person1',
  'type': 'object2'},
 {'max': '601',
  'min': '201',
  'required': 'firstName2',
  'title': 'Person2',
  'type': 'object2'},
 {'max': '6000',
  'min': '2000',
  'required': 'firstName3',
  'title': 'Person3',
  'type': 'object3'},
 {'max': None,
  'min': None,
  ...
You can do something like this:
import json

# js holds the JSON text, e.g. js = open('input.json').read()
nl = []
for di in json.loads(js):
    min_, sep, max_ = map(lambda s: s.strip(), di['min_max'].partition('/'))
    if sep == '/':
        del di['min_max']
        di['min'] = min_
        di['max'] = max_
    nl.append(di)

print(json.dumps(nl))
This keeps the "min_max" values that cannot be separated into two values unchanged.
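That works because str.partition returns a (head, separator, tail) triple, and the separator comes back empty when "/" isn't found. For example:
# str.partition returns (head, sep, tail); sep is '' when '/' is absent
print("200/600".partition('/'))   # ('200', '/', '600')
print("null".partition('/'))      # ('null', '', '')
print("0".partition('/'))         # ('0', '', '')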

Python Parse JSON array

I'm trying to put together a small Python script that can parse arrays out of a large data set. I'm looking to pull a few key:value pairs from each object so that I can play with them later in the script. Here's my code:
# Load up JSON function
import json

# Open our JSON file and load it into python
input_file = open('stores-small.json')
json_array = json.load(input_file)

# Create a variable that will take JSON and put it into a python dictionary
store_details = [
    ["name"],
    ["city"]
]

# Learn how to loop better =/
for stores in [item["store_details"] for item in json_array]
Here's the sample JSON Data:
[
  {
    "id": 1000,
    "type": "BigBox",
    "name": "Mall of America",
    "address": "340 W Market",
    "address2": "",
    "city": "Bloomington",
    "state": "MN",
    "zip": "55425",
    "location": {
      "lat": 44.85466,
      "lon": -93.24565
    },
    "hours": "Mon: 10-9:30; Tue: 10-9:30; Wed: 10-9:30; Thurs: 10-9:30; Fri: 10-9:30; Sat: 10-9:30; Sun: 11-7",
    "services": [
      "Geek Squad Services",
      "Best Buy Mobile",
      "Best Buy For Business"
    ]
  },
  {
    "id": 1002,
    "type": "BigBox",
    "name": "Tempe Marketplace",
    "address": "1900 E Rio Salado Pkwy",
    "address2": "",
    "city": "Tempe",
    "state": "AZ",
    "zip": "85281",
    "location": {
      "lat": 33.430729,
      "lon": -111.89966
    },
    "hours": "Mon: 10-9; Tue: 10-9; Wed: 10-9; Thurs: 10-9; Fri: 10-10; Sat: 10-10; Sun: 10-8",
    "services": [
      "Windows Store",
      "Geek Squad Services",
      "Best Buy Mobile",
      "Best Buy For Business"
    ]
  }
]
In your for loop statement, each item in json_array is a dictionary, and that dictionary does not have a key store_details. So I modified the program a little bit:
import json

input_file = open('stores-small.json')
json_array = json.load(input_file)
store_list = []

for item in json_array:
    store_details = {"name": None, "city": None}
    store_details['name'] = item['name']
    store_details['city'] = item['city']
    store_list.append(store_details)

print(store_list)
If you arrived at this question simply looking for a way to read a json file into memory, then use the built-in json module.
with open(file_path, 'r') as f:
    data = json.load(f)
If you have a json string in memory that needs to be parsed, use json.loads() instead:
data = json.loads(my_json_string)
Either way, now data is converted into a Python data structure (list/dictionary) that may be (deeply) nested and you'll need Python methods to manipulate it.
If you arrived here looking for ways to get values under several keys as in the OP, then the question is about looping over a Python data structure. For a not-so-deeply-nested data structure, the most readable (and possibly the fastest) way is a list / dict comprehension. For example, for the requirement in the OP, a list comprehension does the job.
store_list = [{'name': item['name'], 'city': item['city']} for item in json_array]
# [{'name': 'Mall of America', 'city': 'Bloomington'}, {'name': 'Tempe Marketplace', 'city': 'Tempe'}]
Other types of common data manipulation:
For a nested list where each sub-list is a list of items in the json_array.
store_list = [[item['name'], item['city']] for item in json_array]
# [['Mall of America', 'Bloomington'], ['Tempe Marketplace', 'Tempe']]
For a dictionary of lists where each key-value pair is a category-values in the json_array.
store_data = {'name': [], 'city': []}
for item in json_array:
    store_data['name'].append(item['name'])
    store_data['city'].append(item['city'])
# {'name': ['Mall of America', 'Tempe Marketplace'], 'city': ['Bloomington', 'Tempe']}
For a "transposed" nested list where each sub-list is a "category" in json_array.
store_list = list(store_data.values())
# [['Mall of America', 'Tempe Marketplace'], ['Bloomington', 'Tempe']]
