Pandas Dataframe conversion to JSON - python

I have a Pandas Dataframe or two rows with data that I'd like to pass as a JSON array.
The JSON needs to be formatted as follow:
[{
"Date": "2017-02-03",
"Text": "Sample Text1"
},
{
"Date": "2015-02-04",
"Text": "Sample Text2"
}]
I tried using df.to_json(orient='index'), but the output is not quite right as it seems to be using the index values as keys
{"0":{"Date":"2017-02-03","Text""Sample Text1"},"1":{"Date":"2017-02-04","Text""Sample Text2"}}

If you want an array of dictionaries, you can use orient='records':
>>> import pandas as pd
>>> df = pd.DataFrame({
... 'Date': ['2017-02-03', '2015-02-04'],
... 'Text': ['Sample Text 1', 'Sample Text 2']
... })
>>> df.to_json(orient='records')
'[{"Date":"2017-02-03","Text":"Sample Text 1"},{"Date":"2015-02-04","Text":"Sample Text 2"}]'

Related

How to convert a dataframe to nested json

I have this DataFrame:
df = pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"})
All the dataframe fields are ASCII strings and is the output from a SQL query (pd.read_sql_query) so the line to create the dataframe above may not be quite right.
And I wish the final JSON output to be in the form
[{
"Survey": "001_220816080015",
"BCD": "001_220816080015.bcd",
"Sections": [
"4700A1/305",
"4700A1/312"
}]
I realize that may not be 'normal' JSON but that is the format expected by a program over which I have no control.
The nearest I have achieved so far is
[{
"Survey": "001_220816080015",
"BCD": "001_220816080015.bcd",
"Sections": "4700A1/305, 4700A1/312"
}]
Problem might be the structure of the dataframe but how to reformat it to produce the requirement is not clear to me.
The JSON line is:
df.to_json(orient='records', indent=2)
Isn't the only thing you need to do to parse the Sections into a list?
import pandas as pd
df= pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}, index=[0])
df['Sections'] = df['Sections'].str.split(', ')
print(df.to_json(orient='records', indent=2))
[
{
"Survey":"001_220816080015",
"BCD":"001_220816080015.bcd",
"Sections":[
"4700A1\/305",
"4700A1\/312"
]
}
]
The DataFrame won't help you here, since it's just giving back the input parameter you gave it.
You should just split the specific column you need into an array:
input_data = {'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}
input_data['Sections'] = input_data['Sections'].split(', ')
nested_json = [input_data]

Normalizing json using pandas with inconsistent nested lists/dictionaries

I've been using pandas' json_normalize for a bit but ran into a problem with specific json file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109
I'm trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one seen in the ID:101 entry) as NaN values in the dataframe. Ignoring errors within the json_normalize call doesn't prevent the TypeError that stems from trying to iterate through a null value.
Any advice or methods to receive a valid dataframe out of data with this structure is greatly appreciated!
import json
import pandas as pd
data = """[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()
TypeError: 'NoneType' object is not iterable
I tried to iterate through the Ats dictionary, which would work normally for the data with ID 100 but not with ID 101. I expected ignoring errors within the function to return a NaN value in a dataframe but instead received a TypeError for trying to iterate through a null value.
The desired output would look like this: Dataframe
This approach can be more efficient when it comes to dealing with large datasets.
data = json.loads(data)
desired_data = list(
map(lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
if x["Ats"] is not None
else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}, data))
df = pd.DataFrame(desired_data)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
You might want to consider using this simple try and except approach when working with small datasets. In this case, whenever an error is found it should append new row to DataFrame with NAN.
Example:
data = json.loads(data)
df = pd.DataFrame()
for item in data:
try:
df = df.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
except TypeError:
df = df.append({"ID" : item["ID"], "Name": np.nan, "Desc": np.nan}, ignore_index=True)
print(df)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it to requested form afterwards:
import json
import pandas as pd
data = """\
[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]
df = df.explode("Ats")
df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)
Prints:
ID Name Desc
0 100 At1 Lazy At
1 101 NaN NaN

Format an f-string for each dataframe object

Requirement
My requirement is to have a Python code extract some records from a database, format and upload a formatted JSON to a sink.
Planned approach
1. Create JSON-like templates for each record. E.g.
json_template_str = '{{
"type": "section",
"fields": [
{{
"type": "mrkdwn",
"text": "Today *{total_val}* customers saved {percent_derived}%."
}}
]
}}'
2. Extract records from DB to a dataframe.
3. Loop over dataframe and replace the {var} variables in bulk using something like .format(**locals()))
Question
I haven't worked with dataframes before.
What would be the best way to accomplish Step 3 ? Currently I am
3.1 Looping over the dataframe objects 1 by 1 for i, df_row in df.iterrows():
3.2 Assigning
total_val= df_row['total_val']
percent_derived= df_row['percent_derived']
3.3 In the loop format and add str to a list block.append(json.loads(json_template_str.format(**locals()))
I was trying to use the assign() method in dataframe but was not able to figure out a way to use like a lambda function to create a new column with my expected value that I can use.
As a novice in pandas, I feel there might be a more efficient way to do this (which may even involve changing the JSON template string - which I can totally do). Will be great to hear thoughts and ideas.
Thanks for your time.
I would not write a JSON string by hand, but rather create a corresponding python object and then use the json library to convert it into a string. With this in mind, you could try the following:
import copy
import pandas as pd
# some sample data
df = pd.DataFrame({
'total_val': [100, 200, 300],
'percent_derived': [12.4, 5.2, 6.5]
})
# template dictionary for a single block
json_template = {
"type": "section",
"fields": [
{"type": "mrkdwn",
"text": "Today *{total_val:.0f}* customers saved {percent_derived:.1f}%."
}
]
}
# a function that will insert data from each row
# of the dataframe into a block
def format_data(row):
json_t = copy.deepcopy(json_template)
text_t = json_t["fields"][0]["text"]
json_t["fields"][0]["text"] = text_t.format(
total_val=row['total_val'], percent_derived=row['percent_derived'])
return json_t
# create a list of blocks
result = df.agg(format_data, axis=1).tolist()
The resulting list looks as follows, and can be converted into a JSON string if needed:
[{
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *100* customers saved 12.4%.'
}]
}, {
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *200* customers saved 5.2%.'
}]
}, {
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *300* customers saved 6.5%.'
}]
}]

Loading JSON data into pandas data frame and creating custom columns

Here is example JSON im working with.
{
":#computed_region_amqz_jbr4": "587",
":#computed_region_d3gw_znnf": "18",
":#computed_region_nmsq_hqvv": "55",
":#computed_region_r6rf_p9et": "36",
":#computed_region_rayf_jjgk": "295",
"arrests": "1",
"county_code": "44",
"county_code_text": "44",
"county_name": "Mifflin",
"fips_county_code": "087",
"fips_state_code": "42",
"incident_count": "1",
"lat_long": {
"type": "Point",
"coordinates": [
-77.620031,
40.612749
]
}
I have been able to pull out select columns I want except I'm having troubles with "lat_long". So far my code looks like:
# PRINTS OUT SPECIFIED COLUMNS
col_titles = ['county_name', 'incident_count', 'lat_long']
df = df.reindex(columns=col_titles)
However 'lat_long' is added to the data frame as such: {'type': 'Point', 'coordinates': [-75.71107, 4...
I thought once I figured out how properly add the coordinates to the data frame I would then create two seperate columns, one for latitude and one for longitude.
Any help with this matter would be appreciated. Thank you.
If I don't misunderstood your requirements then you can try this way with json_normalize. I just added the demo for single json, you can use apply or lambda for multiple datasets.
import pandas as pd
from pandas.io.json import json_normalize
df = {":#computed_region_amqz_jbr4":"587",":#computed_region_d3gw_znnf":"18",":#computed_region_nmsq_hqvv":"55",":#computed_region_r6rf_p9et":"36",":#computed_region_rayf_jjgk":"295","arrests":"1","county_code":"44","county_code_text":"44","county_name":"Mifflin","fips_county_code":"087","fips_state_code":"42","incident_count":"1","lat_long":{"type":"Point","coordinates":[-77.620031,40.612749]}}
df = pd.io.json.json_normalize(df)
df_modified = df[['county_name', 'incident_count', 'lat_long.type']]
df_modified['lat'] = df['lat_long.coordinates'][0][0]
df_modified['lng'] = df['lat_long.coordinates'][0][1]
print(df_modified)
Here is how you can do it as well:
df1 = pd.io.json.json_normalize(df)
pd.concat([df1, df1['lat_long.coordinates'].apply(pd.Series) \
.rename(columns={0: 'lat', 1: 'long'})], axis=1) \
.drop(columns=['lat_long.coordinates', 'lat_long.type'])

Convert CSV to JSON (in specific format) using Python

I would like to convert a csv file into a json file using python 2.7. Down below is the python code I tried but it is not giving me expected result. Also, I would like to know if there is any simplified version than mine. Any help is appreciated.
Here is my csv file (SampleCsvFile.csv):
zipcode,date,state,val1,val2,val3,val4,val5
95110,2015-05-01,CA,50,30.00,5.00,3.00,3
95110,2015-06-01,CA,67,31.00,5.00,3.00,4
95110,2015-07-01,CA,97,32.00,5.00,3.00,6
Here is the expected json file (ExpectedJsonFile.json):
{
"zipcode": "95110",
"state": "CA",
"subset": [
{
"date": "2015-05-01",
"val1": "50",
"val2": "30.00",
"val3": "5.00",
"val4": "3.00",
"val5": "3"
},
{
"date": "2015-06-01",
"val1": "67",
"val2": "31.00",
"val3": "5.00",
"val4": "3.00",
"val5": "4"
},
{
"date": "2015-07-01",
"val1": "97",
"val2": "32.00",
"val3": "5.00",
"val4": "3.00",
"val5": "6"
}
]
}
Here's the python code I tried:
import pandas as pd
from itertools import groupby
import json
df = pd.read_csv('SampleCsvFile.csv')
names = df.columns.values.tolist()
data = df.values
master_list2 = [ (d["zipcode"], d["state"], d) for d in [dict(zip(names, d)) for d in data] ]
intermediate2 = [(k, [x[2] for x in list(v)]) for k,v in groupby(master_list2, lambda t: (t[0],t[1]) )]
nested_json2 = [dict(zip(names,(k[0][0], k[0][1], k[1]))) for k in [(i[0], i[1]) for i in intermediate2]]
#print json.dumps(nested_json2, indent=4)
with open('ExpectedJsonFile.json', 'w') as outfile:
outfile.write(json.dumps(nested_json2, indent=4))
Since you are using pandas already, I tried to get as much mileage as I could out of dataframe methods. I also ended up wandering fairly far afield from your implementation. I think the key here, though, is don't try to get too clever with your list and/or dictionary comprehensions. You can very easily confuse yourself and everyone who reads your code.
import pandas as pd
from itertools import groupby
from collections import OrderedDict
import json
df = pd.read_csv('SampleCsvFile.csv', dtype={
"zipcode" : str,
"date" : str,
"state" : str,
"val1" : str,
"val2" : str,
"val3" : str,
"val4" : str,
"val5" : str
})
results = []
for (zipcode, state), bag in df.groupby(["zipcode", "state"]):
contents_df = bag.drop(["zipcode", "state"], axis=1)
subset = [OrderedDict(row) for i,row in contents_df.iterrows()]
results.append(OrderedDict([("zipcode", zipcode),
("state", state),
("subset", subset)]))
print json.dumps(results[0], indent=4)
#with open('ExpectedJsonFile.json', 'w') as outfile:
# outfile.write(json.dumps(results[0], indent=4))
The simplest way to have all the json datatypes written as strings, and to retain their original formatting, was to force read_csv to parse them as strings. If, however, you need to do any numerical manipulation on the values before writing out the json, you will have to allow read_csv to parse them numerically and coerce them into the proper string format before converting to json.

Categories