Python - JSON array to DataFrame - python

I have the following JSON array.
[
{
"foo"=1
},
{
"foo"=2
},
...
]
I would like to convert it to a DataFrame object using the pd.read_json() command, as below.
df = pd.read_json(my_json) #my_json is JSON array above
However, this raises an error because my_json is a list of JSON objects. The error is ValueError: Invalid file path or buffer object type: <class 'list'>.
Besides iterating through the list, is there an efficient way to convert the JSON to a DataFrame object?

Use df = pd.DataFrame(YourList)
Ex:
import pandas as pd
d = [
{
"foo":1
},
{
"foo":2
}
]
df = pd.DataFrame(d)
print(df)
Output:
foo
0 1
1 2

There are two problems in your question:
It called read_json on a list.
The JSON was invalid, as it contained = signs instead of :
This works for me:
import json
import pandas as pd
>>> pd.DataFrame(json.loads("""[
{
"foo": 1
},
{
"foo": 2
}
]"""))
foo
0 1
1 2
You can also call read_json directly.
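As a minimal sketch of calling read_json on JSON text (not on a Python list), the string can be wrapped in a file-like object. The variable name json_text is only illustrative.

```python
from io import StringIO

import pandas as pd

# Valid JSON text: keys and values are separated by ":" rather than "=".
json_text = '[{"foo": 1}, {"foo": 2}]'

# read_json accepts a path or a file-like object, so wrap the string in StringIO.
df = pd.read_json(StringIO(json_text))
print(df)
```

If the data is already a Python list of dicts, pd.DataFrame(the_list) is the shorter route and no JSON parsing is involved.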

Related

Normalizing json using pandas with inconsistent nested lists/dictionaries

I've been using pandas' json_normalize for a while but ran into a problem with a specific JSON file, similar to the one seen here: https://github.com/pandas-dev/pandas/issues/37783#issuecomment-1148052109
I'm trying to find a way to retrieve the data within the Ats -> Ats dict and return any null values (like the one in the ID:101 entry) as NaN values in the dataframe. Passing errors='ignore' to json_normalize doesn't prevent the TypeError that stems from trying to iterate through a null value.
Any advice or methods for getting a valid dataframe out of data with this structure would be greatly appreciated!
import json
import pandas as pd
data = """[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.json_normalize(data, ["Ats", "Ats"], "ID", errors='ignore')
df.head()
TypeError: 'NoneType' object is not iterable
I tried to iterate through the Ats dictionary, which would work normally for the data with ID 100 but not with ID 101. I expected ignoring errors within the function to return a NaN value in a dataframe but instead received a TypeError for trying to iterate through a null value.
The desired output would be a DataFrame with one row per ID, with NaN in the Name and Desc columns for ID 101.
This approach can be more efficient when it comes to dealing with large datasets.
import numpy as np
data = json.loads(data)
# Normalize each item on its own; fall back to a NaN row when "Ats" is null.
desired_data = list(
    map(lambda x: pd.json_normalize(x, ["Ats", "Ats"], "ID").to_dict(orient="records")[0]
        if x["Ats"] is not None
        else {"ID": x["ID"], "Name": np.nan, "Desc": np.nan}, data))
df = pd.DataFrame(desired_data)
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
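You might want to consider this simple try/except approach when working with small datasets. Whenever json_normalize raises a TypeError for an item, a row of NaN values is added for that ID instead. The pieces are collected in a list and combined with pd.concat, since DataFrame.append was removed in pandas 2.0.
Example:
import numpy as np
data = json.loads(data)
frames = []
for item in data:
    try:
        frames.append(pd.json_normalize(item, ["Ats", "Ats"], "ID"))
    except TypeError:
        # "Ats" is null for this item, so add a placeholder row
        frames.append(pd.DataFrame([{"ID": item["ID"], "Name": np.nan, "Desc": np.nan}]))
df = pd.concat(frames, ignore_index=True)
print(df)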
Output:
Name Desc ID
0 At1 Lazy At 100
1 NaN NaN 101
Maybe you can create a DataFrame from the data normally (without pd.json_normalize) and then transform it to requested form afterwards:
import json
import pandas as pd
data = """\
[
{
"ID": "100",
"Ats": {
"Ats": [
{
"Name": "At1",
"Desc": "Lazy At"
}
]
}
},
{
"ID": "101",
"Ats": null
}
]"""
data = json.loads(data)
df = pd.DataFrame(data)
df["Ats"] = df["Ats"].str["Ats"]
df = df.explode("Ats")
df = pd.concat([df, df.pop("Ats").apply(pd.Series, dtype=object)], axis=1)
print(df)
Prints:
ID Name Desc
0 100 At1 Lazy At
1 101 NaN NaN
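Another option, sketched here under the assumption that a null "Ats" should become exactly one all-NaN row, is to patch the parsed data before calling json_normalize. The names EMPTY_ATS and cleaned are hypothetical helpers introduced for this illustration.

```python
import json

import pandas as pd

raw = """[
  {"ID": "100", "Ats": {"Ats": [{"Name": "At1", "Desc": "Lazy At"}]}},
  {"ID": "101", "Ats": null}
]"""
data = json.loads(raw)

# Hypothetical placeholder: one empty record, which becomes a row of NaN values.
EMPTY_ATS = {"Ats": [{}]}

# Replace a null (or empty) "Ats" with the placeholder; other items are unchanged.
cleaned = [{**item, "Ats": item["Ats"] or EMPTY_ATS} for item in data]

df = pd.json_normalize(cleaned, ["Ats", "Ats"], "ID")
print(df)
```

This keeps a single json_normalize call over the whole list. An item whose inner "Ats" list is present but empty would still produce no rows, so that case would need its own handling.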

How can I parse a json to a pandas dataframe without losing values? [duplicate]

I'm trying to parse a JSON file into a dataframe, and I only want to focus on the first key of the JSON (validations). The structure of the JSON is pretty standard, as in the example below:
{
"validations": [
{
"id": "1111111-2222-3333-4444-555555555555",
"created_at": "2020-02-19T14:35:58-03:00",
"finished_at": "2020-02-19T14:36:01-03:00",
"processing_status": "concluded",
"receivable_id": "VAL-AAAAAAA-BBBB-CCCC-DDDD-EEEEEEEEEEEE",
"external_reference": "FFFFFFFF-GGGG-HHHH-IIII-JJJJJJJJJJJJ",
"batch_id": "e2fb8d34-8c53-4910-b7a4-602ab6845855",
"portfolio": {
"id": "57a3e56a-347b-449c-8f1a-253baba90e7a",
"nome": "COMPANY_NAME"
}
}],
"pages": {
"per_page": 10,
"page": 1
}
}
I'm using the following code:
import json as json
import pandas as pd
import os
print(os.getcwd()) ## point out the directory you're working on this cell
filename = r"file_path\file_name.json"
f = open(filename)
data1 = json.loads(f.read())
df = pd.json_normalize(data1)
data1.keys()
## => returns: dict_keys(['validacoes', 'paginacao'])
res = dict((k, data1[k]) for k in ['validacoes']
if k in data1)
res.keys()
## => returns dict_keys(['validacoes'])
df = pd.DataFrame(res, columns=['id', 'data_criacao', 'data_finalizacao', 'status_do_processamento', 'recebivel_id','referencia_externa', 'lote_id', 'veiculo'])
df.head()
## returns=> a dataframe with no values on the columns, as if they were empty from the json
| id | created_at | finished_at | processing_status | receivable_id | external_reference | batch_id | external_reference | portfolio |
I have already checked the original file in a text editor and, yes, the JSON is properly populated with values.
The format is also standardized throughout the file.
Any thoughts as to why the data from the JSON is being lost in the process?
You need to flatten your JSON.
This post should help you out:
Python flatten multilevel/nested JSON
Also, pandas has a simple json_normalize method you could use.
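A minimal sketch of the json_normalize route, using a shortened copy of the structure from the question (only a few of the fields are reproduced here). The record_path argument points json_normalize at the list stored under "validations".

```python
import pandas as pd

# Shortened copy of the question's structure, used only for illustration.
data1 = {
    "validations": [
        {
            "id": "1111111-2222-3333-4444-555555555555",
            "processing_status": "concluded",
            "portfolio": {
                "id": "57a3e56a-347b-449c-8f1a-253baba90e7a",
                "nome": "COMPANY_NAME",
            },
        }
    ],
    "pages": {"per_page": 10, "page": 1},
}

# One row per element of the "validations" list; nested dicts are flattened
# into dotted column names such as "portfolio.id".
df = pd.json_normalize(data1, record_path="validations")
print(df.columns.tolist())
```

Passing the whole dict to pd.DataFrame with a columns list does not unpack the inner records, which is one likely reason the columns came back empty.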

Python dataframe to nested json file

I have a python dataframe as below.
python dataframe:-
Emp_No Name Project Task
1 ABC P1 T1
1 ABC P2 T2
2 DEF P3 T3
3 IJH Null Null
I need to convert it to json file and save it to disk as below
Json File
{
"Records": [
{
"Emp_No":"1",
"Project_Details":[
{
"Project":"P1",
"Task":"T1"
},
{
"Project":"P2",
"Task":"T2"
}
],
"Name":"ABC"
},
{
"Emp_No":"2",
"Project_Details":[
{
"Project":"P3",
"Task":"T3"
}
],
"Name":"DEF"
},
{
"Emp_No":"3",
"Project_Details":[
],
"Name":"IJH"
}
]
}
I feel like this post is not a question per se, but a cheeky attempt to avoid formatting the data, hahaha. But since I'm trying to get used to the dataframe structure and the different ways of handling it, here you go!
import pandas as pd
asutosh_data = {'Emp_No':["1","1","2","3"], 'Name':["ABC","ABC","DEF","IJH"], 'Project':["P1","P2","P3","Null"], 'Task':["T1","T2","T3","Null"]}
df = pd.DataFrame(data=asutosh_data)
records = []
dif_emp_no = df['Emp_No'].unique()
for emp_no in dif_emp_no:
    # All rows belonging to this employee
    emp_data = df.loc[df['Emp_No'] == emp_no]
    emp_project_details = []
    for index, data in emp_data.iterrows():
        # Skip placeholder rows that carry no project
        if data["Project"] != "Null":
            emp_project_details.append({"Project": data["Project"], "Task": data["Task"]})
    records.append({"Emp_No": emp_data.iloc[0]["Emp_No"], "Project_Details": emp_project_details, "Name": emp_data.iloc[0]["Name"]})
final_data = {"Records": records}
print(final_data)
If you have any questions about the code above, feel free to ask. I'll also leave below the documentation I used to solve your problem (you may want to check it):
unique : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.unique.html
loc : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html
iloc : https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html
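The question also asks for the result to be saved to disk. The following is a sketch of one way to do that, using groupby instead of the explicit loop over unique values. It assumes missing projects are stored as real missing values (None) rather than the string "Null", and out_path is a hypothetical location.

```python
import json
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "Emp_No": ["1", "1", "2", "3"],
    "Name": ["ABC", "ABC", "DEF", "IJH"],
    "Project": ["P1", "P2", "P3", None],
    "Task": ["T1", "T2", "T3", None],
})

records = []
# sort=False keeps employees in their order of first appearance.
for (emp_no, name), group in df.groupby(["Emp_No", "Name"], sort=False):
    # Rows where both Project and Task are missing contribute no detail entries.
    details = group[["Project", "Task"]].dropna(how="all").to_dict(orient="records")
    records.append({"Emp_No": emp_no, "Project_Details": details, "Name": name})

final_data = {"Records": records}

# Hypothetical output location; replace with the path you actually need.
out_path = os.path.join(tempfile.gettempdir(), "records.json")
with open(out_path, "w") as f:
    json.dump(final_data, f, indent=2)
```

If the data really contains the literal string "Null", it can be converted first with df.replace("Null", None) or filtered as in the loop above.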

Complex json to pandas dataframe

There are lots of questions about converting JSON to a pandas dataframe, but none of them solved my issue. I am practicing on this complex JSON file, which looks like this:
{
"type" : "FeatureCollection",
"features" : [ {
"Id" : 265068000,
"type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ 22.170376666666666, 65.57273333333333 ]
},
"properties" : {
"timestampExternal" : 1529151039629
}
}, {
"Id" : 265745760,
"type" : "Feature",
"geometry" : {
"type" : "Point",
"coordinates" : [ 20.329506666666667, 63.675425000000004 ]
},
"properties" : {
"timestampExternal" : 1529151278287
}
} ]
}
I want to convert this JSON directly to a pandas dataframe using pd.read_json(). My primary goal is to extract Id, Coordinates and timestampExternal. Because the JSON is nested, a plain pd.read_json() call doesn't give the correct output. Can you suggest how to approach this kind of situation? The expected output is something like this:
Id,Coordinates,timestampExternal
265068000,[22.170376666666666, 65.57273333333333],1529151039629
265745760,[20.329506666666667, 63.675425000000004],1529151278287
You can read the JSON file to load it into a dictionary. Then, using a list comprehension, extract the attributes you want as columns:
import json
import pandas as pd
_json = json.load(open('/path/to/json'))
df_dict = [{'id':item['Id'], 'coordinates':item['geometry']['coordinates'],
'timestampExternal':item['properties']['timestampExternal']} for item in _json['features']]
extracted_df = pd.DataFrame(df_dict)
>>>
coordinates id timestampExternal
0 [22.170376666666666, 65.57273333333333] 265068000 1529151039629
1 [20.329506666666667, 63.675425000000004] 265745760 1529151278287
You can read the JSON directly, and then pass the features array to pandas as a list of dicts, like this:
Code:
import json
import pandas as pd
# Plain 'r' mode: the old 'rU' mode is no longer accepted by Python 3.11+
with open('test.json', 'r') as f:
    data = json.load(f)
df = pd.DataFrame([dict(id=datum['Id'],
                        coords=datum['geometry']['coordinates'],
                        ts=datum['properties']['timestampExternal'],
                        )
                   for datum in data['features']])
print(df)
Results:
coords id ts
0 [22.170376666666666, 65.57273333333333] 265068000 1529151039629
1 [20.329506666666667, 63.675425000000004] 265745760 1529151278287
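As an alternative sketch, pd.json_normalize can flatten the features list and the wanted columns can then be selected and renamed. The data below is copied from the question. The dotted column names are what json_normalize produces with its default separator.

```python
import pandas as pd

data = {
    "type": "FeatureCollection",
    "features": [
        {
            "Id": 265068000,
            "type": "Feature",
            "geometry": {"type": "Point",
                         "coordinates": [22.170376666666666, 65.57273333333333]},
            "properties": {"timestampExternal": 1529151039629},
        },
        {
            "Id": 265745760,
            "type": "Feature",
            "geometry": {"type": "Point",
                         "coordinates": [20.329506666666667, 63.675425000000004]},
            "properties": {"timestampExternal": 1529151278287},
        },
    ],
}

# Flatten nested dicts into dotted columns; the coordinate lists stay as lists.
flat = pd.json_normalize(data["features"])

df = flat[["Id", "geometry.coordinates", "properties.timestampExternal"]].rename(
    columns={
        "geometry.coordinates": "Coordinates",
        "properties.timestampExternal": "timestampExternal",
    }
)
print(df)
```

This avoids spelling out each nested lookup by hand, at the cost of creating columns that are then discarded.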

How do I convert this JSON-like format into a valid format that pandas read_json() can read?

This is my first time using Stack Overflow to ask a question. My English is poor, so if I accidentally offend you with my wording, please don't mind.
I have a JSON file (access.json), formatted like this:
[
{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1', ..... },
{u'IP': u'aaaa2', u'Domain': u'bbbb2', u'Time': u'cccc2', ..... },
{u'IP': u'aaaa3', u'Domain': u'bbbb3', u'Time': u'cccc3', ..... },
{u'IP': u'aaaa4', u'Domain': u'bbbb4', u'Time': u'cccc4', ..... },
{ ....... },
{ ....... }
]
When I use:
ipython
import pandas as pd
data = pd.read_json('./access.json')
it returns:
ValueError: Expected object or value
This is the result I want:
[out]
IP Domain Time ...
0 aaaa1 bbbb1 cccc1 ...
1 aaaa2 bbbb2 cccc2 ...
2 aaaa3 bbbb3 cccc3 ...
3 aaaa4 bbbb4 cccc4 ...
...and so on
How can I achieve this? Thank you for your answers!
This isn't valid json which is why read_json won't parse it.
{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1', ..... },
should be
{"IP": "aaaa1", "Domain": "bbbb1", "Time": "cccc1", ..... },
You could smash this (the entire file) with a regular expression to find these, for example:
In [11]: line
Out[11]: "{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1'},"
In [12]: re.sub("(?<=[\{ ,])u'|'(?=[:,\}])", '"', line)
Out[12]: '{"IP": "aaaa1", "Domain": "bbbb1", "Time": "cccc1"},'
Note: this will get tripped up by some strings, so use with caution.
A better "solution" would be to ensure you had valid json in the first place... It looks like this has come from python's str/unicode/repr rather than json.dumps.
Note: json.dumps produces valid json, so can be read by read_json.
In [21]: repr({u'IP': u'aaa'})
Out[21]: "{u'IP': u'aaa'}"
In [22]: json.dumps({u'IP': u'aaa'})
Out[22]: '{"IP": "aaa"}'
If someone else created this "json", then complain! It's not json.
It is not a JSON format. It is a list of dictionaries. You can use ast.literal_eval() to get the actual list from the file and pass it to the DataFrame constructor:
from ast import literal_eval
import pandas as pd
with open('./access.log2.json') as f:
    # Safely evaluate the Python literal (a list of dicts) stored in the file
    data = literal_eval(f.read())
df = pd.DataFrame(data)
print(df)
Output for the example data you've provided:
Domain IP Time
0 bbbb1 aaaa1 cccc1
1 bbbb2 aaaa2 cccc2
2 bbbb3 aaaa3 cccc3
3 bbbb4 aaaa4 cccc4
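If a valid JSON file is wanted afterwards, the parsed list can be re-serialized with json.dumps. The sketch below uses a shortened in-memory copy of the question's data (the elided fields are left out) and reads the result back with read_json. The names repr_text and valid_json are illustrative.

```python
import json
from ast import literal_eval
from io import StringIO

import pandas as pd

# Shortened stand-in for the file contents: Python repr output, not JSON.
repr_text = (
    "[{u'IP': u'aaaa1', u'Domain': u'bbbb1', u'Time': u'cccc1'}, "
    "{u'IP': u'aaaa2', u'Domain': u'bbbb2', u'Time': u'cccc2'}]"
)

# Parse the Python literal, then serialize it as real JSON (double-quoted strings).
records = literal_eval(repr_text)
valid_json = json.dumps(records)

# The re-serialized text is valid JSON, so read_json can parse it.
df = pd.read_json(StringIO(valid_json))
print(df)
```

Writing valid_json back to disk once means later runs can call read_json on the file directly.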
You can also use
pd.read_json("{json_file_name}", orient='records')
assuming that the file holds valid JSON laid out as a list of records, like the structure shown in the question.
