Is it possible to convert a dictionary into a dataframe by having the keys as columns with the values beneath?
I have this result set from an API as a dictionary:
{
'information': [{
'created': '2020-10-26T00:00:00+00:00',
'title': 'Random1',
'published': 'YES',
}, {
'created': '2020-11-06T00:00:00+00:00',
'title': 'Random2',
'published': 'YES',
}, {
'created': '2020-10-27T00:00:00+00:00',
'title': 'Random3',
'published': 'YES',
}, {
'created': '2020-10-29T00:00:00+00:00',
'title': 'Random4',
'published': 'YES',
}]
}
If I convert this to a dataframe like this:
json_rdd=sc.parallelize([data_dict['information']])
spark_df = spark.createDataFrame(json_rdd)
spark_df.createOrReplaceTempView("data_df");
This gives me columns listed as _1, _2, _3, _4, with the data still showing as objects within them.
Is it possible to have the data_df (converted dataframe) show the columns as created, title, published and have the values within the corresponding columns as flat?
You can use the dictionary directly to create the DataFrame; there is no need to convert it to an RDD first.
arr = your_dict_here
spark.createDataFrame(arr['information']).show()
Output:
+--------------------+---------+-------+
| created|published| title|
+--------------------+---------+-------+
|2020-10-26T00:00:...| YES|Random1|
|2020-11-06T00:00:...| YES|Random2|
|2020-10-27T00:00:...| YES|Random3|
|2020-10-29T00:00:...| YES|Random4|
+--------------------+---------+-------+
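If the temp view from the original snippet is still needed, here is a minimal sketch along the same lines (assuming the spark session and the data_dict from the question are already defined):
# build the DataFrame straight from the list of dicts, then register the view
spark_df = spark.createDataFrame(data_dict['information'])
spark_df.createOrReplaceTempView("data_df")
# the columns are now named after the dict keys, so plain SQL works as expected
spark.sql("SELECT created, title, published FROM data_df").show()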
I have the below JSON string which I converted from a Pandas data frame.
[
{
"ID":"1",
"Salary1":69.43,
"Salary2":513.0,
"Date":"2022-06-09",
"Name":"john",
"employeeId":12,
"DateTime":"2022-09-0710:57:55"
},
{
"ID":"2",
"Salary1":691.43,
"Salary2":5123.0,
"Date":"2022-06-09",
"Name":"john",
"employeeId":12,
"DateTime":"2022-09-0710:57:55"
}
]
I want to change the above JSON to the below format.
[
{
"Date":"2022-06-09",
"Name":"john",
"DateTime":"2022-09-0710:57:55",
"employeeId":12,
"Results":[
{
"ID":1,
"Salary1":69.43,
"Salary2":513
},
{
"ID":"2",
"Salary1":691.43,
"Salary2":5123
}
]
}
]
Kindly let me know how we can achieve this in Python.
Original Dataframe:
ID Salary1 Salary2 Date Name employeeId DateTime
1 69.43 513.0 2022-06-09 john 12 2022-09-0710:57:55
2 691.43 5123.0 2022-06-09 john 12 2022-09-0710:57:55
Thank you.
As #Harsha pointed out, you can adapt one of the answers from another question, with just some minor tweaks to make it work for OP's case:
(
df.groupby(["Date","Name","DateTime","employeeId"])[["ID","Salary1","Salary2"]]
# to_dict(orient="records") - returns list of rows, where each row is a dict,
# "oriented" like [{column -> value}, … , {column -> value}]
.apply(lambda x: x.to_dict(orient="records"))
# groupby makes a Series: with grouping columns as index, and dict as values.
# This structure is no good for the next to_dict() method.
# So here we create a new DataFrame out of the grouped Series,
# with the Series' indexes as columns of the DataFrame,
# and also renaming our Series' values to "Results" while we are at it.
.reset_index(name="Results")
# Finally we can achieve the desired structure with the last call to to_dict():
.to_dict(orient="records")
)
# [{'Date': '2022-06-09', 'Name': 'john', 'DateTime': '2022-09-0710:57:55', 'employeeId': 12,
# 'Results': [
# {'ID': 1, 'Salary1': 69.43, 'Salary2': 513.0},
# {'ID': 2, 'Salary1': 691.43, 'Salary2': 5123.0}
# ]}]
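For completeness, here is a minimal setup sketch that rebuilds OP's dataframe from the posted JSON and serializes the final result back to a JSON string (raw_json and records are illustrative names, not part of the original answer):
import json
import pandas as pd

# rebuild OP's dataframe from the posted JSON string
raw_json = '''[
    {"ID": "1", "Salary1": 69.43, "Salary2": 513.0, "Date": "2022-06-09",
     "Name": "john", "employeeId": 12, "DateTime": "2022-09-0710:57:55"},
    {"ID": "2", "Salary1": 691.43, "Salary2": 5123.0, "Date": "2022-06-09",
     "Name": "john", "employeeId": 12, "DateTime": "2022-09-0710:57:55"}
]'''
df = pd.DataFrame(json.loads(raw_json))

# the groupby/to_dict chain from the answer above
records = (
    df.groupby(["Date", "Name", "DateTime", "employeeId"])[["ID", "Salary1", "Salary2"]]
      .apply(lambda x: x.to_dict(orient="records"))
      .reset_index(name="Results")
      .to_dict(orient="records")
)

# and back to a JSON string, if that is the required output format
print(json.dumps(records, indent=4))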
Requirement
My requirement is to have Python code extract some records from a database, format them, and upload the formatted JSON to a sink.
Planned approach
1. Create JSON-like templates for each record. E.g.
json_template_str = '{{
"type": "section",
"fields": [
{{
"type": "mrkdwn",
"text": "Today *{total_val}* customers saved {percent_derived}%."
}}
]
}}'
2. Extract records from DB to a dataframe.
3. Loop over the dataframe and replace the {var} variables in bulk using something like .format(**locals())
Question
I haven't worked with dataframes before.
What would be the best way to accomplish Step 3? Currently I am:
3.1 Looping over the dataframe rows one by one: for i, df_row in df.iterrows():
3.2 Assigning
total_val = df_row['total_val']
percent_derived = df_row['percent_derived']
3.3 In the loop, formatting and adding the string to a list: block.append(json.loads(json_template_str.format(**locals())))
I was trying to use the assign() method on the dataframe but was not able to figure out how to use something like a lambda function to create a new column with the value I expect.
As a novice in pandas, I feel there might be a more efficient way to do this (which may even involve changing the JSON template string, which I can totally do). It will be great to hear thoughts and ideas.
Thanks for your time.
I would not write a JSON string by hand, but rather create a corresponding Python object and then use the json library to convert it into a string. With this in mind, you could try the following:
import copy
import pandas as pd
# some sample data
df = pd.DataFrame({
'total_val': [100, 200, 300],
'percent_derived': [12.4, 5.2, 6.5]
})
# template dictionary for a single block
json_template = {
"type": "section",
"fields": [
{"type": "mrkdwn",
"text": "Today *{total_val:.0f}* customers saved {percent_derived:.1f}%."
}
]
}
# a function that will insert data from each row
# of the dataframe into a block
def format_data(row):
json_t = copy.deepcopy(json_template)
text_t = json_t["fields"][0]["text"]
json_t["fields"][0]["text"] = text_t.format(
total_val=row['total_val'], percent_derived=row['percent_derived'])
return json_t
# create a list of blocks
result = df.agg(format_data, axis=1).tolist()
The resulting list looks as follows, and can be converted into a JSON string if needed:
[{
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *100* customers saved 12.4%.'
}]
}, {
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *200* customers saved 5.2%.'
}]
}, {
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *300* customers saved 6.5%.'
}]
}]
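For the JSON-string conversion mentioned above, a short sketch using the standard json module (result is the list produced by df.agg above; blocks_json is just an illustrative name):
import json

# serialize the list of blocks to a JSON string
blocks_json = json.dumps(result, indent=2)
print(blocks_json)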
I get an API call response as a list containing nested dictionaries. I want to put all the ticker names into one single list as shown below. What is a way that'll allow me to do that?
Here is the list from the API response:
[ {'ticker': 'VOD-22', 'exchange': 'NYSE', 'assetType': 'Stock', 'priceCurrency': 'USD',
'startDate': '', 'endDate': ''}, {'ticker': 'VOD-23', 'exchange': 'NYSE', 'assetType':
'Stock', 'priceCurrency': 'USD', 'startDate': '', 'endDate': ''}]
I want to turn it into a list with all ticker names, like the following:
[ 'VOD-22', 'VOD-23']
You could do this with a list comprehension:
ticker_names = [stock['ticker'] for stock in data]
where data is the JSON you received from your API call.
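If some entries might be missing the 'ticker' key (an assumption, the question doesn't say), a slightly more defensive variant skips those items:
# keep only entries that actually carry a ticker value
ticker_names = [stock['ticker'] for stock in data if 'ticker' in stock]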
I am receiving the following json from a webservice:
{
"headers":[
{
"seriesId":"18805",
"Name":"Name1",
"assetId":"4"
},
{
"seriesId":"18801",
"Name":"Name2",
"assetId":"209"
}
],
"values":[
{
"Date":"01-Jan-2021",
"18805":"127.93",
"18801":"75.85"
}
]
}
Is there a way to create a MultiIndex dataframe from this data? I would like Date to be the row index and the rest to be column indexes.
The values key is a straightforward data frame.
The columns can be rebuilt from the headers key:
import pandas as pd

js = {'headers': [{'seriesId': '18805', 'Name': 'Name1', 'assetId': '4'},
{'seriesId': '18801', 'Name': 'Name2', 'assetId': '209'}],
'values': [{'Date': '01-Jan-2021', '18805': '127.93', '18801': '75.85'}]}
# get values into dataframe
df = pd.DataFrame(js["values"]).set_index("Date")
# get headers for use in rebuilding column names
dfc = pd.DataFrame(js["headers"])
# rebuild columns
df.columns = pd.MultiIndex.from_tuples(dfc.apply(tuple, axis=1), names=dfc.columns)
print(df)
seriesId 18805 18801
Name Name1 Name2
assetId 4 209
Date
01-Jan-2021 127.93 75.85
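Once the MultiIndex is in place, individual levels can be used for selection, for example with DataFrame.xs (an illustrative follow-up sketch, not part of the original answer):
# cross-section of the columns by the 'Name' level
print(df.xs("Name1", level="Name", axis=1))

# or select a single column by its full (seriesId, Name, assetId) tuple
print(df[("18805", "Name1", "4")])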
I have a list of dictionaries called api_data, where each dictionary has this structure:
{
'location':
{
'indoor': 0,
'exact_location': 0,
'latitude': '45.502',
'altitude': '133.9',
'id': 12780,
'country': 'IT',
'longitude': '9.146'
},
'sampling_rate': None,
'id': 91976363,
'sensordatavalues':
[
{
'value_type': 'P1',
'value': '8.85',
'id': 197572463
},
{
'value_type': 'P2',
'value': '3.95',
'id': 197572466
},
{
'value_type': 'temperature',
'value': '20.80',
'id': 197572625
},
{
'value_type': 'humidity',
'value': '97.70',
'id': 197572626
}
],
'sensor':
{
'id': 24645,
'sensor_type':
{
'name': 'DHT22',
'id': 9,
'manufacturer':
'various'
},
'pin': '7'
},
'timestamp': '2020-04-18 18:37:50'
},
This structure is not complete for each dictionary, meaning that sometimes a dictionary, a list element or a key is missing.
I want to extract the value of a key when the key value of the same dictionary is equal to a certain value.
For example, for dictionary sensordatavalues, I want the value of the key 'value' when 'value_type' is equal to 'P1'.
I have developed this code using for loops and if statements, but I bet it is heavily inefficient.
How can I do it in a quicker and more efficient way?
Please note that sensordatavalues always exists.
for sensor in api_data:
sensordatavalues = sensor['sensordatavalues']
# L_sdv = len(sensordatavalues)
for physical_quantity_recorded in sensordatavalues:
if physical_quantity_recorded['value_type'] == 'P1':
PM10_value = physical_quantity_recorded['value']
If you are confident that the value 'P1' is unique to the key you are searching, you can use the 'in' operator with dict.values().
It should be OK to omit the intermediate assignment sensordatavalues = sensor['sensordatavalues']:
for sensor in api_data:
for physical_quantity_recorded in sensor['sensordatavalues']:
if 'P1' in physical_quantity_recorded.values():
PM10_value = physical_quantity_recorded['value']
You just need one for loop:
for x in api_data["sensordatavalues"]:
if x["value_type"] == "P1":
print(x["value"])
Output:
8.85
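Since api_data in the question is a list of dictionaries rather than a single dictionary, the same idea needs an outer loop; here is a small adapted sketch (relying on OP's note that sensordatavalues always exists):
# collect the P1 reading from every sensor in the list
pm10_values = [entry['value']
               for sensor in api_data
               for entry in sensor['sensordatavalues']
               if entry.get('value_type') == 'P1']
print(pm10_values)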
Use the dictionary .get() method: if the key does not exist, it will return a default value instead of raising a KeyError.
for physical_quantity_recorded in api_data['sensordatavalues']:
if physical_quantity_recorded.get('value_type', 'default_value') == 'P1':
PM10_value = physical_quantity_recorded.get('value', 'default_value')
This is an alternative: jmespath allows you to search and filter a nested dict/JSON.
A summary of jmespath: to access a key, use the . notation; if your values are in a list, you access them via the [] notation.
NB: the dict is wrapped in a data variable.
import jmespath
#sensordatavalues is a key, so we can access it directly
#the values of sensordatavalues are wrapped in a list
#to access them we use the bracket ([]) notation
#we are interested in the dict where value_type is P1
#in jmespath, we identify that using the ? mark to precede the filter object
#pass the filter
#and finally access the key we are interested in ... value
expression = jmespath.compile('sensordatavalues[?value_type==`P1`].value')
expression.search(data)
['8.85']
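To apply the same expression across the whole api_data list from the question, one option is to search each item and flatten the per-sensor results (a short illustrative sketch; pm10_values is an assumed name):
import jmespath

expression = jmespath.compile('sensordatavalues[?value_type==`P1`].value')

# each search returns a list (empty when a sensor has no P1 reading), so flatten them
pm10_values = [value
               for sensor in api_data
               for value in expression.search(sensor)]
print(pm10_values)  # with the sample record above: ['8.85']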