Extract JSON | API | Pandas DataFrame - python

I am using the Facebook API (v2.10), from which I've extracted the data I need; 95% of it is perfect. My problem is the 'actions' metric, which is returned as a dictionary within a list within another dictionary.
At present, all the data is in a DataFrame; however, the 'actions' column is a list of dictionaries containing each individual action for that day.
{
    "actions": [
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "7"
        },
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "3"
        },
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "144"
        },
        {
            "action_type": "offsite_conversion.custom.xxxxxxxxxxx",
            "value": "34"
        }
    ]
}
All this appears in one cell (row) within the DataFrame.
What is the best way to:
Get the action type, create a new column, and use the "action_type" value as the column name?
List the correct value under this column?
It looks like JSON, but when I check the type, it's a pandas Series (stored as an object).
For those willing to help (thank you, I greatly appreciate it): either point me in the direction of the right material and I will read it and work it out on my own (I'm not entirely sure what to look for), or, if you decide this is an easy problem, explain to me how and why you solved it this way. I don't just want the answer.
I have tried the following (with help from a friend) and it kind of works, but I have issues with it running in my script, i.e. if it runs within a bigger code block, I get the error shown further down:
import ast

for i in range(df.shape[0]):
    line = df.loc[i, 'Conversions']
    L = ast.literal_eval(line)   # parse the string back into a list of dicts
    for l in L:
        cid = l['action_type']
        value = l['value']
        df.loc[i, cid] = value   # one new column per action_type
If I save the DataFrame as a CSV and read it back in with pd.read_csv, it executes properly, but not within the script. No idea why.
Error:
ValueError: malformed node or string: [{'value': '1', 'action_type': 'offsite_conversion.custom.xxxxx}]
Any help would be greatly appreciated.
Thanks,
Adrian

You can use json_normalize:
In [11]: d # e.g. dict from json.load OR instead pass the json path to json_normalize
Out[11]:
{'actions': [{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx',
'value': '7'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '3'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '144'},
{'action_type': 'offsite_conversion.custom.xxxxxxxxxxx', 'value': '34'}]}
In [12]: pd.io.json.json_normalize(d, record_path="actions")
Out[12]:
action_type value
0 offsite_conversion.custom.xxxxxxxxxxx 7
1 offsite_conversion.custom.xxxxxxxxxxx 3
2 offsite_conversion.custom.xxxxxxxxxxx 144
3 offsite_conversion.custom.xxxxxxxxxxx 34
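If the actions sit in a DataFrame column (as in the question) rather than in a single dict, a rough sketch of the same idea applied row by row might look like the following. It assumes the column is called 'Conversions' and already holds parsed lists, uses pd.json_normalize (the successor of pd.io.json.json_normalize since pandas 1.0), and the frame and action-type names are made up:
import pandas as pd

# Made-up frame: one row per day, 'Conversions' holds the parsed list of action dicts
df = pd.DataFrame({
    "date": ["2017-10-01"],
    "Conversions": [[
        {"action_type": "offsite_conversion.custom.aaa", "value": "7"},
        {"action_type": "offsite_conversion.custom.bbb", "value": "3"},
    ]],
})

# one row per (day, action), then flatten each action dict into columns
long_form = df.explode("Conversions").reset_index(drop=True)
long_form = long_form.drop(columns="Conversions").join(
    pd.json_normalize(long_form["Conversions"].tolist())
)

# pivot action_type into one column per action
wide = long_form.pivot(index="date", columns="action_type", values="value")
print(wide)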

You can use df.join(pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)).
Explanation:
df['Conversions'].tolist() returns a list of dictionaries. This list is then transformed into a DataFrame using pd.DataFrame. Then, you can use the pivot function to pivot the table into the shape that you want.
Lastly, you can join the table with your original DataFrame. Note that this only works if your DataFrame's index is the default (i.e., integers starting from 0). If this is not the case, you can do this instead:
df2 = pd.DataFrame(df['Conversions'].tolist()).pivot(columns='action_type', values='value').reset_index(drop=True)
for col in df2.columns:
    df[col] = df2[col]
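A self-contained sketch of that recipe on made-up data may help; note it assumes each row of 'Conversions' holds a single action dict, which is what the tolist()/pivot combination expects:
import pandas as pd

# Made-up data: one action dict per row
df = pd.DataFrame({
    "date": ["2017-10-01", "2017-10-02"],
    "Conversions": [
        {"action_type": "offsite_conversion.custom.aaa", "value": "7"},
        {"action_type": "offsite_conversion.custom.bbb", "value": "3"},
    ],
})

wide = (pd.DataFrame(df["Conversions"].tolist())
          .pivot(columns="action_type", values="value")
          .reset_index(drop=True))

# df gains one column per action_type, NaN where that action is absent in a row
out = df.join(wide)
print(out)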

Related

How to access nested attribute without passing parent attribute in pyspark json

I am trying to access inner attributes of the following JSON using pyspark:
[
    {
        "432": [
            {
                "atttr1": null,
                "atttr2": "7DG6",
                "id": 432,
                "score": 100
            }
        ]
    },
    {
        "238": [
            {
                "atttr1": null,
                "atttr2": "7SS8",
                "id": 432,
                "score": 100
            }
        ]
    }
]
In the output, I am looking for something like the below, in the form of a CSV:
atttr1, atttr2,id,score
null,"7DG6",432,100
null,"7SS8",238,100
I understand I can get these details as below, but I don't want to pass 432 or 238 in the lambda expression, as in a bigger JSON these keys will vary. I want to iterate over all available values.
print(inputDF.rdd.map(lambda x: (x['432'])).first())
print(inputDF.rdd.map(lambda x: (x['238'])).first())
I also tried registering a temp table with the name "test" but it gave an error with message element._id doesn't exist.
inputDF.registerTempTable("test")
srdd2 = spark.sql("select element._id from test limit 1")
Any help will be highly appreciated. I am using spark 2.4
Without using pyspark features, you can do it like this:
import json

data = json.loads(json_str)  # or whatever way you're getting the data
columns = 'atttr1 atttr2 id score'.split()
print(','.join(columns))  # headers

for item in data:
    for obj in list(item.values())[0]:  # since each list has only one object
        print(','.join(str(obj[col]) for col in columns))
Output:
atttr1,atttr2,id,score
None,7DG6,432,100
None,7SS8,432,100
Or
for item in data:
    obj = list(item.values())[0][0]  # since the object is the one and only item in the list
    print(','.join(str(obj[col]) for col in columns))
FYI, you can store those rows in a variable or write them out to a CSV instead of (or as well as) printing them.
And if you're just looking to dump that to csv, see this answer.
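If you do want a CSV file rather than printed lines, a minimal sketch with the standard library's csv module, reusing the data and columns variables from above (the output filename is made up):
import csv

with open("output.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(columns)  # header row
    for item in data:
        for obj in list(item.values())[0]:
            writer.writerow([obj[col] for col in columns])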

What am I doing wrong in the process of vectorizing the test of whether my geolocation fields are valid?

I was calling out to a geolocation API and was converting the results to a DataFrame like so:
results = geolocator.lookup(ip_list)
results:
[{
query: "0.0.0.0",
coordinates: { lat: "0", lon: "0" }
}, ...]
So we queried 0.0.0.0 and the API returned "0"s for the lat/lon, indicating an IP that obviously can't be geolocated. A weird way to handle things as opposed to a False value or something, but we can work with it.
To DataFrame:
df = pd.DataFrame(results)
But wait, this leads to those "coordinates" fields being dictionaries within the DataFrame, and I may be a pandas beginner, but I know I probably want those stored as DataFrames, not dicts, so we can vectorize.
So instead I did:
for result in results:
    result["coordinates"] = pd.DataFrame(result["coordinates"], index=[0])
df = pd.DataFrame(results)
Not sure what index=[0] does there, but without it I get an error, so I did it like that. Stop me here and tell me why I'm wrong if I'm doing this badly so far. I'm new to Python, and DataFrames beyond 2D are confusing to visualize.
Then I wanted to process over df and add a "geolocated" column with True or False based on a vectorized test, and tried to do that like so:
def is_geolocated(coordinate_df):
    # yes, the API returned string coords
    lon_zero = np.equal(coordinate_df["lon"], "0")  # error here
    lat_zero = np.equal(coordinate_df["lat"], "0")
    return lon_zero & lat_zero

df["geolocated"] = is_mappable(df["coordinates"])
But this throws a KeyError "lon".
Am I even on the right track, and if not, how should I set this up?
Generally I would agree with you that a dictionary is a bad way to store latitude/longitude values. This happens due to the way pd.DataFrame() works, as it will pick up on the keys query and coordinates, where the value for the key coordinates is simply a dictionary of the lat/lon values.
You can circumvent the entire problem by, e.g., defining every row as a tuple, and the whole dataframe as a list of these tuples. You can then perform a comparison whether both the lat and lon value are zero, and return this as a new column.
import pandas as pd

# Test dataset
results = [{
    'query': "0.0.0.0",
    'coordinates': {'lat': "0", 'lon': "0"}
},
{
    'query': "0.0.0.0",
    'coordinates': {'lat': "1", 'lon': "1"}
}]

df = pd.DataFrame([(result['query'], result['coordinates']['lat'], result['coordinates']['lon'])
                   for result in results])
df.columns = ['Query', 'Lat', 'Lon']
df['Geolocated'] = (df['Lat'] == '0') & (df['Lon'] == '0')
df.head()
Query Lat Lon Geolocated
0 0.0.0.0 0 0 True
1 0.0.0.0 1 1 False
In this code I used a list comprehension to build the list of tuples and defined the 'Geolocated' column as a series, which comes from the comparison of the row's Lat and Lon values.
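As an aside (a different sketch, not part of the approach above): pd.json_normalize can flatten the nested coordinates dicts straight into dot-named columns, which turns the comparison into a one-liner. This reuses the results list defined above:
import pandas as pd

flat = pd.json_normalize(results)  # columns: query, coordinates.lat, coordinates.lon
flat['Geolocated'] = (flat['coordinates.lat'] == '0') & (flat['coordinates.lon'] == '0')
print(flat)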

How to normalize the below JSON using pandas in Django

Using this view.py query, my output looks like the following. You can see that the choices field contains multiple arrays, which I want to normalize into row-wise (serial) form. Here is my JSON:
{
    "pages": [
        {
            "name": "page1",
            "title": "SurveyWindow Pvt. Ltd. Customer Feedback",
            "description": "Question marked * are compulsory.",
            "elements": [
                {
                    "type": "radiogroup",
                    "name": "question1",
                    "title": "Do you like our product? *",
                    "isRequired": true,
                    "choices": [
                        {"value": "Yes", "text": "Yes"},
                        {"value": "No", "text": "No"}
                    ]
                },
                {
                    "type": "checkbox",
                    "name": "question2",
                    "title": "Please Rate Our PM Skill",
                    "isRequired": false,
                    "choices": [
                        {"value": "High", "text": "High"},
                        {"value": "Low", "text": "Low"},
                        {"value": "Medium", "text": "Medium"}
                    ]
                },
                {
                    "type": "radiogroup",
                    "name": "question3",
                    "title": "Do you like our services? *",
                    "isRequired": true,
                    "choices": [
                        {"value": "Yes", "text": "Yes"},
                        {"value": "No", "text": "No"}
                    ]
                }
            ]
        }
    ]
}
This is my view.py:
jsondata=SurveyMaster.objects.all().filter(survey_id='1H2711202014572740')
q = jsondata.values('survey_json_design')
qs_json = pd.DataFrame.from_records(q)
datatotable = pd.json_normalize(qs_json['survey_json_design'], record_path=['pages','elements'])
qs_json = datatotable.to_html()
Based on your comments and picture here's what I would do to go from the picture to something more SQL-friendly (what you refer to as "normalization"), but keep in mind this might blow up if you don't have sufficient memory.
Create a new list which you'll fill with the new data, then iterate over the pandas table's rows, and then over every item in your list column. For every iteration in the inner loop, use the data from the row (minus the column you're iterating over). For convenience I added it as the last element.
# Example data
df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}],
                               [{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
                   "name": ["kostas", "george"]})

data = []
for i, row in df.iterrows():
    for val in row["choices"]:
        data.append((*row.drop("choices").values, val))

df = pd.DataFrame(data, columns=["names", "choices"])
print(df)
names choices
0 kostas {'text': 'yes', 'value': 'yes'}
1 kostas {'text': 'no', 'value': 'no'}
2 george {'ch1': 1, 'ch2': 2}
3 george {'ch3': 'ch3'}
This is where I guess you want to go. All that's left is to just modify the column / variable names with your own data.
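As a side note, pandas 0.25+ has DataFrame.explode, which performs the row-duplication step of the loop above for you. A small sketch on the same example frame (as defined at the top of the snippet, before it is overwritten):
# Same made-up frame as in the example above
df = pd.DataFrame({"choices": [[{"text": "yes", "value": "yes"},
                                {"text": "no", "value": "no"}],
                               [{"ch1": 1, "ch2": 2}, {"ch3": "ch3"}]],
                   "name": ["kostas", "george"]})

# explode() gives each list element its own row and repeats the other columns
exploded = df.explode("choices").reset_index(drop=True)
print(exploded[["name", "choices"]])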

Pandas json_normalize and JSON flattening error

A pandas newbie here, struggling to understand why I'm unable to completely flatten a JSON I receive from an API. I need a DataFrame with all the data that is returned by the API; however, I need all nested data to be expanded and given its own columns for me to be able to use it.
The JSON I receive is as follows:
[
    {
        "query": {
            "id": "1596487766859-3594dfce3973bc19",
            "name": "test"
        },
        "webPage": {
            "inLanguages": [
                {
                    "code": "en"
                }
            ]
        },
        "product": {
            "name": "Test",
            "description": "Test2",
            "mainImage": "image1.jpg",
            "images": [
                "image2.jpg",
                "image3.jpg"
            ],
            "offers": [
                {
                    "price": "45.0",
                    "currency": "€"
                }
            ],
            "probability": 0.9552192
        }
    }
]
Running pd.json_normalize(data) without any additional parameters shows the nested values price and currency in the product.offers column. When I try to separate these out into their own columns with the following:
pd.json_normalize(data,record_path=['product',meta['product',['offers']]])
I end up with the following error:
f"{js} has non list value {result} for path {spec}. "
Any help would be much appreciated.
I've used this technique a few times:
- do an initial pd.json_normalize() to discover the columns
- build the meta parameter by inspecting this and the original JSON (NB: possible index out of range here)
- you can only request one list to drive the record_path param
- a few tricks: product/images is a list, so its column gets named 0; rename it
- did a Cartesian product to merge the two data frames that come from breaking down the lists; it's not so stable
data = [{'query': {'id': '1596487766859-3594dfce3973bc19', 'name': 'test'},
         'webPage': {'inLanguages': [{'code': 'en'}]},
         'product': {'name': 'Test',
                     'description': 'Test2',
                     'mainImage': 'image1.jpg',
                     'images': ['image2.jpg', 'image3.jpg'],
                     'offers': [{'price': '45.0', 'currency': '€'}],
                     'probability': 0.9552192}}]

# build default to get column names
df = pd.json_normalize(data)

# from column names build the list that gets sent to meta param
mymeta = [[s for s in c.split(".")] for c in df.columns]
# exclude lists from meta - this will fail
mymeta = [l for l in mymeta if not isinstance(data[0][l[0]][l[1]], list)]

# you can build df from either of the product lists NOT both
df1 = pd.json_normalize(data, record_path=[["product", "offers"]], meta=mymeta)
df2 = pd.json_normalize(data, record_path=[["product", "images"]], meta=mymeta).rename(columns={0: "image"})

# want them together - you can merge them. note columns heavily overlap so remove most columns from df2
df1.assign(foo=1).merge(
    df2.assign(foo=1).drop(columns=[c for c in df2.columns if c != "image"]), on="foo"
).drop(columns="foo")
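If only the offers need expanding (as in the question), df1 on its own already carries price and currency as columns. A minimal call along the same lines, assuming the data variable above and hand-picking a few meta paths:
df_offers = pd.json_normalize(
    data,
    record_path=[["product", "offers"]],
    meta=[["query", "id"], ["query", "name"], ["product", "name"], ["product", "probability"]],
)
print(df_offers.columns.tolist())
# ['price', 'currency', 'query.id', 'query.name', 'product.name', 'product.probability']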

Using json_normalize for structured multi level dictionaries with lists

I've successfully transferred the data from a JSON file (structured as per the below example), into a three column ['tag', 'time', 'score'] DataFrame using the following iterative approach:
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        df.loc[len(df)] = [v['tag_id'], v1['time'], v1['value']]
However, while this ultimately achieves the desired result, it takes a huge amount of time to iterate through larger files with the same structure. I'm assuming that an iterative approach is not the ideal way to tackle this sort of problem. Using pandas.io.json.json_normalize instead, I've tried the following:
result = json_normalize(my_request, ['content'], ['data', 'score', ['time', 'value']])
Which returns KeyError: ("Try running with errors='ignore' as key %s is not always present", KeyError('data',)). I believe I've misinterpreted the pandas documentation on json_normalize, and can't quite figure out how I should pass the parameters.
Can anyone point me in the right direction?
(alternatively using errors='ignore' returns ValueError: Conflicting metadata name data, need distinguishing prefix.)
JSON Structure
{
    'content': [
        {
            'data': {
                'score': [
                    {
                        'time': '2015-03-01 00:00:30',
                        'value': 75.0
                    },
                    {
                        'time': '2015-03-01 23:50:30',
                        'value': 58.0
                    }
                ]
            },
            'tag_id': 320676
        },
        {
            'data': {
                'score': [
                    {
                        'time': '2015-03-01 00:00:25',
                        'value': 78.0
                    },
                    {
                        'time': '2015-03-01 00:05:25',
                        'value': 57.0
                    }
                ]
            },
            'tag_id': 320677
        }
    ],
    'meta': None,
    'requested': '2018-04-15 13:00:00'
}
However, while this ultimately achieves the desired result, it takes a huge amount of time to iterate through larger files with the same structure.
I would suggest the following:
Check whether the problem is with your iterated appends. Pandas is not very good at sequentially adding rows. How about this code:
tups = []
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        tups.append((v['tag_id'], v1['time'], v1['value']))

df = pd.DataFrame(tups, columns=['tag_id', 'time', 'value'])
If the preceding is not fast enough, check if it's the JSON-parsing part with
for k, v in enumerate(my_request['content']):
    for k1, v1 in enumerate(v['data']['score']):
        v['tag_id'], v1['time'], v1['value']
It is probable that the first suggestion will be fast enough. If not, however, check whether ujson might be faster for this case.
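For the json_normalize part of the question itself, the following sketch should produce the three desired columns directly, assuming my_request is the parsed structure shown above (in older pandas the function lives at pd.io.json.json_normalize):
from pandas import json_normalize

df = json_normalize(my_request['content'], record_path=['data', 'score'], meta=['tag_id'])
print(df)
#                   time  value  tag_id
# 0  2015-03-01 00:00:30   75.0  320676
# 1  2015-03-01 23:50:30   58.0  320676
# 2  2015-03-01 00:00:25   78.0  320677
# 3  2015-03-01 00:05:25   57.0  320677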
