Python DataFrames to JSON

I want to store my Python script output in the JSON format below:
[{
    "ids": "1",
    "result1": ["1", "2", "3", "4"],
    "result2": ["4", "5", "6", "1"]
},
{
    "ids": "2",
    "result1": ["3", "4"],
    "result2": ["4", "5", "6", "1"]
}]
My code is as follows:
for i in df.id.unique():
    ids = i
    results1 = someFunction(i)
    results2 = someFunction2(i)
    df_result_int = ["ids : %s" % ids, "results1 : %s" % results1, "results2 : %s" % results2]
    df_results.append(df_result_int)
jsonData = json.dumps(df_results)
with open('JSONData.json', 'w') as f:
    json.dump(jsonData, f)
someFunction() and someFunction2() each return a list.
Thank you in advance.

You should not manually transform your lists to strings; json.dumps does that for you. Use dictionaries instead. Here is an example:
df_results = []
results1 = [1,1]
results2 = [2,2]
df_result_int = {"ids" : 1, "results1" : results1, "results2" : results2}
df_results.append(df_result_int)
json.dumps(df_results)
This will result in:
[{"results2": [2, 2], "ids": 1, "results1": [1, 1]}]

There is a method for pandas dataframes that allows you to dump the dataframe directly to a JSON file. Here is the link to the documentation:
to_json
You could use something like
df.reset_index().to_json(orient='records')
where df is the dataframe on which you have already done some manipulation (for example, applying the functions you mention); the exact call depends on what info you want in the JSON file and how it is arranged.
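to_json also accepts a file path as its first argument, so as a sketch (output file name made up) you could write the records straight to disk:
df.reset_index().to_json('output.json', orient='records')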

Related

Converting CSV to JSON with Pandas

I have the data in CSV format, for example:
First row is the column numbers; let's ignore that.
Second row, starting at Col_4, are number of days
Third row on: Col_1 and Col_2 are coordinates (lon, lat), Col_3 is a statistical value, Col_4 onwards are measurements.
As you can see, this format is a confusing mess. I would like to convert this to JSON the following way, for example:
{"points":{
"dates": ["20190103", "20190205"],
"0":{
"lon": "-8.072557",
"lat": "41.13702",
"measurements": ["-0.191792","-10.543130"],
"val": "-1"
},
"1":{
"lon": "-8.075557",
"lat": "41.15702",
"measurements": ["-1.191792","-2.543130"],
"val": "-9"
}
}
}
To summarise what I've done till now, I read the CSV to a Pandas DataFrame:
df = pandas.read_csv("sample.csv")
I can extract the dates into a Numpy Array with:
dates = df.iloc[0][3:].to_numpy()
I can extract measurements for all points with:
measurements_all = df.iloc[1:,3:].to_numpy()
And the lon and lat and val, respectively, with:
lon_all = df.iloc[1:,0:1].to_numpy()
lat_all = df.iloc[1:,1:2].to_numpy()
val_all = df.iloc[1:,2:3].to_numpy()
Can anyone explain how I can format this info a structure identical to the .json example?
With this dataframe:
df = pd.DataFrame([{"col1": 1, "col2": 2, "col3": 3, "col4": 3, "col5": 5},
                   {"col1": None, "col2": None, "col3": None, "col4": 20190103, "col5": 20190205},
                   {"col1": -8.072557, "col2": 41.13702, "col3": -1, "col4": -0.191792, "col5": -10.543130},
                   {"col1": -8.072557, "col2": 41.15702, "col3": -9, "col4": -0.191792, "col5": -2.543130}])
This code does what you want, although it is not converting anything to strings as in your example; you should be able to add that easily, if necessary.
# generate dict with dates
final_dict = {"dates": [list(df["col4"])[1], list(df["col5"])[1]]}
# iterate over relevant rows and generate dicts
for i in range(2, len(df)):
    final_dict[i-2] = {"lon": df["col1"][i],
                       "lat": df["col2"][i],
                       "measurements": [df[cname][i] for cname in ["col4", "col5"]],
                       "val": df["col3"][i]}
This leads to this output:
{0: {'lat': 41.13702,
     'lon': -8.072557,
     'measurements': [-0.191792, -10.54313],
     'val': -1.0},
 1: {'lat': 41.15702,
     'lon': -8.072557,
     'measurements': [-0.191792, -2.54313],
     'val': -9.0},
 'dates': [20190103.0, 20190205.0]}
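To match the target layout you can then wrap this under a "points" key and dump it to a file; a minimal sketch (output file name made up), noting that json.dump turns the integer keys into strings automatically:
import json

with open('points.json', 'w') as f:
    json.dump({"points": final_dict}, f)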
Extracting dates from data and then eliminating first row from data frame:
dates = list(data.iloc[0][3:])
data = data.iloc[1:]
Inserting dates into dict:
points={"dates":dates}
Iterating through data frame and adding elements to the dictionary:
i = 0
for index, row in data.iterrows():
    element = {"lon": row["Col_1"],
               "lat": row["Col_2"],
               "measurements": [row["Col_3"], row["Col_4"]]}
    points[str(i)] = element
    i += 1
You can convert the dict to a string object using json.dumps():
points_json = json.dumps(points)
The result is a string object, not a JSON (dict) object. More about that in this post: Converting dictionary to JSON
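A quick check, as a sketch:
points_json = json.dumps({"dates": ["20190103", "20190205"]})
type(points_json)        # str, not dict
json.loads(points_json)  # parses the string back into a dict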
I converted the pandas dataframe values to lists, then looped through one of the lists and added the lists to a nested JSON object containing the values.
import pandas
import json
import argparse
import sys
def parseInput():
    parser = argparse.ArgumentParser(description="Convert CSV measurements to JSON")
    parser.add_argument(
        '-i', "--input",
        help="CSV input",
        required=True,
        type=argparse.FileType('r')
    )
    parser.add_argument(
        '-o', "--output",
        help="JSON output",
        type=argparse.FileType('w'),
        default=sys.stdout
    )
    return parser.parse_args()

def main():
    args = parseInput()
    input_file = args.input
    output = args.output
    dataframe = pandas.read_csv(input_file)
    longitudes = dataframe.iloc[1:, 0:1].T.values.tolist()[0]
    latitudes = dataframe.iloc[1:, 1:2].T.values.tolist()[0]
    averages = dataframe.iloc[1:, 2:3].T.values.tolist()[0]
    measurements = dataframe.iloc[1:, 3:].values.tolist()
    dates = dataframe.iloc[0][3:].values.tolist()
    points = {"dates": dates}
    for index, val in enumerate(longitudes):
        entry = {
            "lon": longitudes[index],
            "lat": latitudes[index],
            "measurements": measurements[index],
            "average": averages[index]
        }
        points[str(index)] = entry
    json.dump(points, output)

if __name__ == "__main__":
    main()
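Assuming the script is saved as, say, csv_to_json.py (a made-up name), it could be invoked like:
python csv_to_json.py -i sample.csv -o points.json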

Transpose CSV to JSON

I have a csv similar to the one below and want to transpose it to JSON.
Input:
Output I expect to get:
{
"C1" :[ {"header2" : "name1", "header 3" : "address 1"}, {"header2" : "name3", "header 3" : "address 3"}],
"C2" : [ {"header2" : "name2", "header 3" : "address 2"}]
}
Based on some comments, some people are just pandas haters. But I like to use the tool that allows me to solve the problem in the easiest manner possible, and with the fewest lines of code.
In this case, without a doubt, that's pandas
An added benefit of using pandas is that the data can easily be cleaned, analyzed, and visualized, if needed.
Solutions at How to convert CSV file to multiline JSON? offer some basics, but won't help transform the csv into the required shape.
Because of the expected output of the JSON file, this is a non-trivial question, which requires reshaping/grouping the data in the csv and is easily accomplished with pandas.DataFrame.groupby.
groupby 'h1' since the column values will be the dict outer keys
groupby returns a DataFrameGroupBy object that can be split into i, the value used to create the group ('c1' and 'c2' in this case), and g, the associated dataframe group.
Use pandas.DataFrame.to_dict to convert the dataframe into a list of dictionaries.
import json
import pandas as pd
# read the file
df = pd.read_csv('test.csv')
# display(df)
   h1  h2  h3
0  c1  n1  a1
1  c2  n2  a2
2  c1  n3  a3
# groupby and create dict
data_dict = dict()
for i, g in df.groupby('h1'):
    data_dict[i] = g.drop(columns=['h1']).to_dict(orient='records')
# print(data_dict)
{'c1': [{'h2': 'n1', 'h3': 'a1'}, {'h2': 'n3', 'h3': 'a3'}],
'c2': [{'h2': 'n2', 'h3': 'a2'}]}
# save data_dict to a file as a JSON
with open('result.json', 'w') as fp:
    json.dump(data_dict, fp)
JSON file
{
    "c1": [{
        "h2": "n1",
        "h3": "a1"
    }, {
        "h2": "n3",
        "h3": "a3"
    }],
    "c2": [{
        "h2": "n2",
        "h3": "a2"
    }]
}
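As a side note, the groupby loop can also be written as a single dict comprehension (equivalent output, just more compact):
data_dict = {i: g.drop(columns=['h1']).to_dict(orient='records')
             for i, g in df.groupby('h1')}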

Having difficulty in transforming nested json to flat json using python

I have the below API response. This is a very small subset which I am pasting here for reference; there can be 80+ columns.
[["name","age","children","city", "info"], ["Richard Walter", "35", ["Simon", "Grace"], {"mobile":"yes","house_owner":"no"}],
["Mary", "43", ["Phil", "Marshall", "Emily"], {"mobile":"yes","house_owner":"yes", "own_stocks": "yes"}],
["Drew", "21", [], {"mobile":"yes","house_owner":"no", "investor":"yes"}]]
Initially I thought pandas could help here and searched accordingly but as a newbie to python/coding I was not able to get much out of it. any help or guidance is appreciated.
I am expecting output in a JSON key-value pair format such as below.
{"name":"Mary", "age":"43", "children":["Phil", "Marshall", "Emily"],"info_mobile":"yes","info_house_owner":"yes", "info_own_stocks": "yes"},
{"name":"Drew", "age":"21", "children":[], "info_mobile":"yes","info_house_owner":"no", "info_investor":"yes"}]```
I assume that the first list will always be the headers (column names)?
If that is the case, maybe something like this could work:
import pandas as pd
data = [["name", "age", "children", "info"], ["Ned", 40, ["Arya", "Rob"], {"dead": "yes", "winter is coming": "yes"}]]
headers = data[0]
data = data[1:]
df = pd.DataFrame(data, columns=headers)
df_json = df.to_json()
print(df)
Assuming that the first list always represents the keys ["name", "age", ... etc.] and the subsequent lists represent the actual data/API response, you can construct a dictionary (key-value pairs) like this:
keys = ["name", "age", "children", "info"]
api_response = ["Richard Walter", "35", ["Simon", "Grace"], {"mobile":"yes","house_owner":"no"}]
data_dict = {k: v for k, v in zip(keys, api_response)}
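If you also want the flattened info_* keys from your expected output, here is a minimal sketch building on the same zip (assuming every nested dict should be prefixed with its column name):
record = {}
for k, v in zip(keys, api_response):
    if isinstance(v, dict):
        # flatten nested dicts such as "info" into prefixed keys
        for sub_k, sub_v in v.items():
            record["%s_%s" % (k, sub_k)] = sub_v
    else:
        record[k] = v
# record -> {'name': 'Richard Walter', 'age': '35', 'children': ['Simon', 'Grace'],
#            'info_mobile': 'yes', 'info_house_owner': 'no'}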

Passing multiple dictionary observations to function?

How would I pass multiple dictionary observations (row) into function for model prediction?
This is what I have ... it can accept 1 dictionary row as input and returns the prediction + probabilities, but it fails when I add additional dictionaries.
import json

# func
def preds(dict):
    df = pd.DataFrame([dict])
    result = model.predict(df)
    result = np.where(result == 0, "CLASS_0", "CLASS_1").astype('str')
    probas_c0 = model.predict_proba(df)[0][0]
    probas_c1 = model.predict_proba(df)[0][1]
    data = {"prediction": result[0],
            "CLASS_0_PROB": probas_c0,
            "CLASS_1_PROB": probas_c1}
    data = {"parameters": [data]}
    j = json.dumps(data)
    j = json.loads(j)
    return j

# call func
preds({"feature0": "value",
       "feature1": "value",
       "feature2": "value"})
# result
{'parameters': [{'prediction': 'CLASS_0',
                 'CLASS_0_PROB': 0.9556067383610446,
                 'CLASS_1_PROB': 0.0443932616389555}]}
# Tried with more than 1 row but it fails with arguments error
{'parameters': [{'prediction': 'CLASS_0',
'CLASS_0_PROB': 0.9556067383610446,
'CLASS_1_PROB': 0.0443932616389555},
{'parameters': [{'prediction': 'CLASS_0',
'CLASS_0_PROB': 0.9556067383610446,
'CLASS_1_PROB': 0.0443932616389555}]}
TypeError: preds() takes 1 positional argument but 2 were given
NEW UPDATE
The source data format from end users will most likely be a dataframe, so I want to convert that to the [{...}, {...}] format so it can be plugged into the preds() function at df = pd.DataFrame([rows]).
Tried this so far...
rows = [
    {"c1": "value1",
     "c2": "value2",
     "c3": 0},
    {"c1": "value1",
     "c2": "value2",
     "c3": 0}
]
df = pd.DataFrame(rows)
json_rows = df.to_json(orient='records', lines=True)
l = [json_rows]
preds(l)
KeyError: "None of [['c1', 'c2', 'c3']] are in the [columns]"
UPDATED
OK, based on your comments, what you need is for the DataFrame to get all the rows; then you can use one of the following approaches.
Using *args:
def preds(*args):
    # args is a tuple; cast it to a list
    dict_rows = list(args)
    df = pd.DataFrame(dict_rows)
    result = model.predict(df)
    ...

# when calling the function you need to unpack
preds(*rows)
Checking the element beforehand:
def preds(dict_rows):
    # check whether dict_rows is a list or a dict
    if isinstance(dict_rows, dict):
        dict_rows = [dict_rows]
    df = pd.DataFrame(dict_rows)
    result = model.predict(df)
    ...

# for calling:
preds(rows)
Please note that the DataFrame is built with pd.DataFrame(dict_rows), not pd.DataFrame([dict_rows]); dict_rows is already a list.
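For the dataframe-to-rows conversion mentioned in the update, to_dict gives the [{...}, {...}] shape directly (a sketch):
rows = df.to_dict(orient='records')
preds(rows)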
Old Answer
If preds() can't handle multiple rows you can do:
pred_rows = [
    {"feature0": "value", "feature1": "value", "feature2": "value"},
    {"feature3": "value", "feature4": "value", "feature5": "value"}
]
# list comprehension
result = [preds(row) for row in pred_rows]
PS: also, don't use dict as a variable name; it shadows the built-in mapping type, the constructor/class for dictionaries.
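Putting the pieces together, a minimal sketch (assuming a scikit-learn-style model with predict and predict_proba already in scope) that scores every row instead of only index [0]:
import numpy as np
import pandas as pd

def preds(rows):
    if isinstance(rows, dict):  # accept a single observation too
        rows = [rows]
    df = pd.DataFrame(rows)
    labels = np.where(model.predict(df) == 0, "CLASS_0", "CLASS_1")
    probas = model.predict_proba(df)  # one (p0, p1) pair per row
    params = [{"prediction": label,
               "CLASS_0_PROB": p[0],
               "CLASS_1_PROB": p[1]}
              for label, p in zip(labels, probas)]
    return {"parameters": params}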

Convert python nested JSON-like data to dataframe

My records looks like this and I need to write it to a csv file:
my_data={"data":[{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}]}
which looks like JSON, but the next record starts with "data" and not "data1", which forces me to read each record separately. Then I convert it to a dict using eval() so I can iterate through keys and values along a certain path to get to the values I need. Then I generate a list of keys and values based on the keys I need, and pd.DataFrame() converts that list into a dataframe, which I know how to convert to csv. My code, which works, is below. But I am sure there are better ways to do this; mine scales poorly. Thx.
counter = 1
k = []
v = []
res = []
m = 0
for line in f2:
    jline = eval(line)
    counter += 1
    for items in jline:
        k.append(jline[u'data'][0].keys())
        v.append(jline[u'data'][0].values())
print 'keys are:', k
i = 0
j = 0
while i < 3:
    while j < 3:
        if k[i][j] == u'id':
            res.append(v[i][j])
        j += 1
    i += 1
# res is my result set
del k[:]
del v[:]
Changing my_data to be:
my_data = [{"id":"xyz","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data One
{"id":"xyz2","type":"book","attributes":{"doc_type":"article","action":"cut"}}, # Data Two
{"id":"xyz3","type":"book","attributes":{"doc_type":"article","action":"cut"}}] # Data Three
You can dump this directly into a dataframe like so:
mydf = pd.DataFrame(my_data)
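Note that the nested "attributes" dicts will land in a single object column this way; if you want them flattened into their own columns, json_normalize can do it (a sketch, assuming pandas 1.0+ where it is top-level; older versions have it at pd.io.json.json_normalize):
mydf = pd.json_normalize(my_data)
# columns: id, type, attributes.doc_type, attributes.action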
It's not clear what your data path would be, but if you are looking for specific combinations of id, type, etc., you could search explicitly:
def find_my_way(data, pattern):
    # pattern = {'id': 'someid', 'type': 'sometype', ...}
    res = []
    for row in data:
        if row.get('id') == pattern.get('id'):
            res.append(row)
    return res  # return the collected matches, not the loop variable

mydf = pd.DataFrame(find_my_way(mydata, pattern))
EDIT:
Without going into how the api works, in pseudo-code, you'll want to do something like the following:
my_objects = []
calls = 0
while calls < maximum:
    my_data = call_the_api(params)
    calls += 1  # count every call so the loop terminates
    data = my_data.get('data')
    if not data:
        continue
    # API calls for single objects usually return a dictionary; for groups of objects they return lists. This handles both cases.
    if isinstance(data, list):
        my_objects = [*data, *my_objects]
    elif isinstance(data, dict):
        my_objects = [{**data}, *my_objects]

# This unpacks the data responses into a list that you can then load into a DataFrame, with the attributes from the api as the columns
df = pd.DataFrame(my_objects)
Assuming your data from the api looks like:
"""
{
"links": {},
"meta": {},
"data": {
"type": "FactivaOrganizationsProfile",
"id": "Goog",
"attributes": {
"key_executives": {
"source_provider": [
{
"code": "FACSET",
"descriptor": "FactSet Research Systems Inc.",
"primary": true
}
]
}
},
"relationships": {
"people": {
"data": {
"type": "people",
"id": "39961704"
}
}
}
},
"included": {}
}
"""
per the documentation, which is why I'm using my_data.get('data').
That should get you all of the data (unfiltered) into a DataFrame. Saving the DataFrame construction for the last step is a bit more memory friendly.
