Unable to access dictionary values stored in pandas dataframe - python

I have read in a json file into my pandas dataframe which now looks like this:
document_nr doc_type doc_details.Summary.ID doc_details.Summary.date ....
209 202207220341 A None 07/22/2022
210 202207220217 B None 07/27/2022
211 202207220327 C None 07/29/2022
....
My issue is that I cannot access column values that are originally nested dictionaries. Example I can do print(df['document_nr']) without any problems but I cannot do print(df.doc_details.Summary.ID) as it gives me the error:
AttributeError: 'DataFrame' object has no attribute 'doc_details'
Likewise for print(df.doc_details.Summary.date).
I have also tried below code but got the same error.
df['summary'] = df['doc_details'].str.get("Summary")
I have no idea why it's giving me this issue. My original sample Json file looks like this:
[
{
"document_nr": "202207220914",
"doc_type": "A",
"doc_details": {
"Summary": {
"ID": null,
"date": "07/22/2022",
.....}
}
}
]
Please advise.

as whole doc_details.Summary.ID is actual name of your column, you should access this data like this:
print(df["doc_details.Summary.ID"])
When you are using dot notation df.doc_details.Summary.ID it's at first looking for property doc_details in df object.

Related

How to access nested attribute without passing parent attribute in pyspark json

I am trying to access inner attributes of following json using pyspark
[
{
"432": [
{
"atttr1": null,
"atttr2": "7DG6",
"id":432,
"score": 100
}
]
},
{
"238": [
{
"atttr1": null,
"atttr2": "7SS8",
"id":432,
"score": 100
}
]
}
]
In the output, I am looking for something like below in form of csv
atttr1, atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100
I understand I can get these details like below but I don't want to pass 432 or 238 in lambda expression as in bigger json this(italic one) will vary. I want to iterate over all available values.
print(inputDF.rdd.map(lambda x:(x['*432*'])).first())
print(inputDF.rdd.map(lambda x:(x['*238*'])).first())
I also tried registering a temp table with the name "test" but it gave an error with message element._id doesn't exist.
inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")
Any help will be highly appreciated. I am using spark 2.4
Without using pyspark features, you can do it like this:
data = json.loads(json_str) # or whatever way you're getting the data
columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns)) # headers
for item in data:
for obj in list(item.values())[0]: # since each list has only one object
print(','.join(str(obj[col]) for col in columns))
Output:
atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100
Or
for item in data:
obj = list(item.values())[0][0] # since the object is the one and only item in list
print(','.join(str(obj[col]) for col in columns))
FYI, you can store those in a variable or write it out to csv instead of/and also printing it.
And if you're just looking to dump that to csv, see this answer.

Extract list from dict of lists then append to dataframe

I'm trying to extract a field from a json that contains a list then append that list to a dataframe, but I'm running in to a few different errors.
I think I can write it to a csv then read the csv with Pandas, but I'm trying to avoid writing any files. I know that I can also use StringIO to make a csv, but that has issues with null bytes. Replacing those would be (i think) another line-by-line step that will further extend the time the script takes to complete... i'm running this against a query that returns thens of thousands of results so keeping it fast and simple is a priority
First I tried this:
hit_json = json.loads(hit)
for ln in hit_json.get('hits').get('hits'):
df = df.append(ln['_source'], ignore_index=True)
print(df)
This gives me a result that looks like this:
1 2 3 4
a b d,e,f... x
Then I tried this:
df = df.append(ln['_source']['payload'], ignore_index=True)
But that gives me this error:
TypeError: cannot concatenate object of type "<class 'str'>"; only pd.Series,
pd.DataFrame, and pd.Panel (deprecated) objs are valid
What I'm looking for would be something like this:
0 1 2 3 4
d e f g h
On top of this... I need to figure out a way to handle a specific string in this list that contains a comma... which may be a headache that's best handled in a different question... something like:
# Obviously this is incorrect but I think you get the idea :)
str.replace(',', '^')
except if ',' followed by ' '
Greatly appreciate any help!
EDITING TO ADD JSON AS REQUESTED
{
"_index": "sanitized",
"_type": "sanitized",
"_id": "sanitized".,
"_score": sanitized,
"_source": {
"sanitized": sanitized,
"sanitized": "1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,\"34,35\",36,37,38,39,40",
"sanitized": "sanitized",
"sanitized": ["sanitized"],
"sanitized": "sanitized",
"sanitized": "sanitized",
"sanitized": "sanitized",
"sanitized": "sanitized",
}
}]
}
}
You can maybe write a temporary file with StringIO, like it's done here.
Then for the second part you could do
if ',' in data and ', ' not in data:
data = data.replace(',', '^')
You can try the following
hit_json = json.loads(hit)
for ln in hit_json.get('hits').get('hits'):
data = ln['_source']["payload"].split(",")
df.loc[len(df)] = pd.Series(data, index=range(len(data)))
print(df)
The benefit of the loc is that you will not create a new dataframe each time so it will be fast. You can find the post here.
I would also like to suggest an alternative that can be faster. First create a dictionary with all the data and then dump the dictionary into a dataframe.

Python - JSON array to DataFrame

I have this following JSON array.
[
{
"foo"=1
},
{
"foo"=2
},
...
]
I would like to convert it to DataFrame object using pd.read_json() command like below.
df = pd.read_json(my_json) #my_json is JSON array above
However, I got the error, since my_json is a list/array of json. The error is ValueError: Invalid file path or buffer object type: <class 'list'>.
Besides iterating through the list, is there any efficient way to extract/convert the JSON to DataFrame object?
Use df = pd.DataFrame(YourList)
Ex:
import pandas as pd
d = [
{
"foo":1
},
{
"foo":2
}
]
df = pd.DataFrame(d)
print(df)
Output:
foo
0 1
1 2
There are two problems in your question:
It called to_csv on a list.
The JSON was illegal, as it contained = signs instead of :
This works by me:
import json
import pandas as pd
>>> pd.DataFrame(json.loads("""[
{
"foo": 1
},
{
"foo": 2
}
]"""))
foo
0 1
1 2
You can also call read_json directly.

Extract JSON | API | Pandas DataFrame

I am using the Facebook API (v2.10) to which I've extracted the data I need, 95% of which is perfect. My problem is the 'actions' metric which returns as a dictionary within a list within another dictionary.
At present, all the data is in a DataFrame, however, the 'actions' column is a list of dictionaries that contain each individual action for that day.
{
"actions": [
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "7"
},
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "3"
},
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "144"
},
{
"action_type": "offsite_conversion.custom.xxxxxxxxxxx",
"value": "34"
}]}
All this appears in one cell (row) within the DataFrame.
What is the best way to:
Get the action type, create a new column and use the Use "action_type" as the column name?
List the correct value under this column
It looks like JSON but when I look at the type, it's a panda series (stored as an object).
For those willing to help (thank you, I greatly appreciate it) - can you either point me in the direction of the right material and I will read it and work it out on my own (I'm not entirely sure what to look for) or if you decide this is an easy problem, explain to me how and why you solved it this way. Don't just want the answer
I have tried the following (with help from a friend) and it kind of works, but I have issues with this running in my script. IE: if it runs within a bigger code block, I get the following error:
for i in range(df.shape[0]):
line = df.loc[i, 'Conversions']
L = ast.literal_eval(line)
for l in L:
cid = l['action_type']
value = l['value']
df.loc[i, cid] = value
If I save the DF as a csv, call it using pd.read_csv...it executes properly, but not within the script. No idea why.
Error:
ValueError: malformed node or string: [{'value': '1', 'action_type': 'offsite_conversion.custom.xxxxx}]
Any help would be greatly appreciated.
Thanks,
Adrian
You can use json_normalize:
In [11]: d # e.g. dict from json.load OR instead pass the json path to json_normalize
Out[11]:
{'actions': [{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx',
'value': '7'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '3'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '144'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '34'}]}
In [12]: pd.io.json.json_normalize(d, record_path="actions")
Out[12]:
action_type value
0 offsite_conversion.custom.xxxxxxxxxxx 7
1 offsite_conversion.custom.xxxxxxxxxxx 3
2 offsite_conversion.custom.xxxxxxxxxxx 144
3 offsite_conversion.custom.xxxxxxxxxxx 34
You can use df.join(pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)).
Explanation:
df['Conversions'].tolist() returns a list of dictionaries. This list is then transformed into a DataFrame using pd.DataFrame. Then, you can use the pivot function to pivot the table into the shape that you want.
Lastly, you can join the table with your original DataFrame. Note that this only works if you DataFrame's index is the default (i.e., integers starting from 0). If this is not the case, you can do this instead:
df2 = pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)
for col in df2.columns:
df[col] = df2[col]

Format JSON objects in python

I have json objects in a notepad(C:\data.txt).There are millions of records I just used one record as an example.But I want to see only data on my notepad like:
1 123-567-9876 TEST1 TEST 717-567-9876 Harrisburg null US_PA
I dont want paranthesis,etc
Once I get the clean data,plan is to import the data from notepad(say C:\data2.txt) into SQL database.
This is the format of json object.
{
"status":"ok",
"items":[
{
"1":{
"Work_Phone":"123-567-9876",
"Name_Part":[
"TEST1",
"TEST"
],
"Residence_Phone":"717-567-9876",
"Mailing_City":"Harrisburg",
"Mailing_Street_Address_line_1":"",
"Cell_Phone":null,
"Mailing_Country_AND_Province_OR_State":"US_PA"
}
}
]
}
Can someone pls help with python code to format this json object and export it to notepad.
You can use
import simplejson as json
Then you can open your file and load it into a python-Dictionary:
f = file("C:/data.txt","r")
data = json.loads(f.read())
But this works only, when the json-objects are stored in an array in your file. So this has to look like this:
[{ ... first date ...},
{... second date ...},
...,
{... last date ...}]
Then in data there is an array of dictionaries. Now you can write the dates in another file:
g = file("output.txt","w")
for d in data:
for i in items:
for k in i.keys:
g.write(... some string build from the parameters ...)
If well done the file output.txt contains the lines. In detail it might be difficult becaus each item seems to contain some arrays.

Categories