Format an f-string for each dataframe object - python

Requirement
My requirement is to have Python code extract some records from a database, format them, and upload the formatted JSON to a sink.
Planned approach
1. Create JSON-like templates for each record. E.g.
json_template_str = '{{
    "type": "section",
    "fields": [
        {{
            "type": "mrkdwn",
            "text": "Today *{total_val}* customers saved {percent_derived}%."
        }}
    ]
}}'
2. Extract records from DB to a dataframe.
3. Loop over the dataframe and replace the {var} variables in bulk using something like .format(**locals())
Question
I haven't worked with dataframes before.
What would be the best way to accomplish step 3? Currently I am
3.1 Looping over the dataframe rows one by one: for i, df_row in df.iterrows():
3.2 Assigning
total_val = df_row['total_val']
percent_derived = df_row['percent_derived']
3.3 Formatting inside the loop and appending the result to a list: block.append(json.loads(json_template_str.format(**locals()))) (sketch below)
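Putting 3.1 to 3.3 together, the whole of step 3 currently looks roughly like this (it uses json_template_str from step 1; the small df here just stands in for the records extracted in step 2):

import json
import pandas as pd

# stand-in for the dataframe extracted from the DB in step 2
df = pd.DataFrame({'total_val': [100, 200], 'percent_derived': [12.4, 5.2]})

block = []
for i, df_row in df.iterrows():
    total_val = df_row['total_val']
    percent_derived = df_row['percent_derived']
    # fill the template from step 1 with this row's values and parse it back into a dict
    block.append(json.loads(json_template_str.format(**locals())))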
I was trying to use the assign() method on the dataframe but was not able to figure out how to use something like a lambda function to create a new column holding the value I expect.
As a novice in pandas, I feel there might be a more efficient way to do this (which may even involve changing the JSON template string - which I can totally do). Will be great to hear thoughts and ideas.
Thanks for your time.

I would not write a JSON string by hand, but rather create a corresponding python object and then use the json library to convert it into a string. With this in mind, you could try the following:
import copy
import pandas as pd

# some sample data
df = pd.DataFrame({
    'total_val': [100, 200, 300],
    'percent_derived': [12.4, 5.2, 6.5]
})

# template dictionary for a single block
json_template = {
    "type": "section",
    "fields": [
        {"type": "mrkdwn",
         "text": "Today *{total_val:.0f}* customers saved {percent_derived:.1f}%."
        }
    ]
}

# a function that will insert data from each row
# of the dataframe into a block
def format_data(row):
    json_t = copy.deepcopy(json_template)
    text_t = json_t["fields"][0]["text"]
    json_t["fields"][0]["text"] = text_t.format(
        total_val=row['total_val'], percent_derived=row['percent_derived'])
    return json_t

# create a list of blocks
result = df.agg(format_data, axis=1).tolist()
The resulting list looks as follows, and can be converted into a JSON string if needed:
[{
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *100* customers saved 12.4%.'
}]
}, {
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *200* customers saved 5.2%.'
}]
}, {
'type': 'section',
'fields': [{
'type': 'mrkdwn',
'text': 'Today *300* customers saved 6.5%.'
}]
}]
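If the sink expects an actual JSON string rather than a Python list, the standard json module can serialize the result; a minimal follow-up, assuming the result list built above:

import json

json_str = json.dumps(result, indent=2)  # serialize the list of block dicts into one JSON string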

Related

How to convert a dataframe to nested json

I have this DataFrame:
df = pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"})
All the dataframe fields are ASCII strings, and the dataframe is the output of a SQL query (pd.read_sql_query), so the line used to create the dataframe above may not be quite right.
And I wish the final JSON output to be in the form
[{
"Survey": "001_220816080015",
"BCD": "001_220816080015.bcd",
"Sections": [
"4700A1/305",
"4700A1/312"
}]
I realize that may not be 'normal' JSON but that is the format expected by a program over which I have no control.
The nearest I have achieved so far is
[{
"Survey": "001_220816080015",
"BCD": "001_220816080015.bcd",
"Sections": "4700A1/305, 4700A1/312"
}]
The problem might be the structure of the dataframe, but how to reshape it to produce the required output is not clear to me.
The JSON line is:
df.to_json(orient='records', indent=2)
Isn't splitting Sections into a list the only thing you need to do?
import pandas as pd
df= pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}, index=[0])
df['Sections'] = df['Sections'].str.split(', ')
print(df.to_json(orient='records', indent=2))
[
{
"Survey":"001_220816080015",
"BCD":"001_220816080015.bcd",
"Sections":[
"4700A1\/305",
"4700A1\/312"
]
}
]
The DataFrame won't help you here, since it's just giving back the input parameter you gave it.
You should just split the specific column you need into an array:
input_data = {'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}
input_data['Sections'] = input_data['Sections'].split(', ')
nested_json = [input_data]
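If a JSON string is what the program ultimately needs, json.dumps can produce it from that list; a small sketch (note that, unlike pandas' to_json above, json.dumps leaves the forward slashes unescaped):

import json

print(json.dumps(nested_json, indent=2))  # "4700A1/305" stays unescaped here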

How to remove redundant elements from a JSON string in Python

I have the below JSON string which I converted from a Pandas data frame.
[
{
"ID":"1",
"Salary1":69.43,
"Salary2":513.0,
"Date":"2022-06-09",
"Name":"john",
"employeeId":12,
"DateTime":"2022-09-0710:57:55"
},
{
"ID":"2",
"Salary1":691.43,
"Salary2":5123.0,
"Date":"2022-06-09",
"Name":"john",
"employeeId":12,
"DateTime":"2022-09-0710:57:55"
}
]
I want to change the above JSON to the below format.
[
{
"Date":"2022-06-09",
"Name":"john",
"DateTime":"2022-09-0710:57:55",
"employeeId":12,
"Results":[
{
"ID":1,
"Salary1":69.43,
"Salary2":513
},
{
"ID":"2",
"Salary1":691.43,
"Salary2":5123
}
]
}
]
Kindly let me know how we can achieve this in Python.
Original Dataframe:
ID Salary1 Salary2 Date Name employeeId DateTime
1 69.43 513.0 2022-06-09 john 12 2022-09-0710:57:55
2 691.43 5123.0 2022-06-09 john 12 2022-09-0710:57:55
Thank you.
As #Harsha pointed out, you can adapt one of the answers from another question, with just some minor tweaks to make it work for the OP's case:
(
    df.groupby(["Date","Name","DateTime","employeeId"])[["ID","Salary1","Salary2"]]
    # to_dict(orient="records") - returns list of rows, where each row is a dict,
    # "oriented" like [{column -> value}, … , {column -> value}]
    .apply(lambda x: x.to_dict(orient="records"))
    # groupby gives back a Series: the grouping columns form the index,
    # and each value is a list of row dicts.
    # This structure is no good for the next to_dict() method.
    # So here we create a new DataFrame out of the grouped Series,
    # with the Series' index levels as columns of the DataFrame,
    # and also rename the Series' values to "Results" while we are at it.
    .reset_index(name="Results")
    # Finally we can achieve the desired structure with the last call to to_dict():
    .to_dict(orient="records")
)
# [{'Date': '2022-06-09', 'Name': 'john', 'DateTime': '2022-09-0710:57:55', 'employeeId': 12,
# 'Results': [
# {'ID': 1, 'Salary1': 69.43, 'Salary2': 513.0},
# {'ID': 2, 'Salary1': 691.43, 'Salary2': 5123.0}
# ]}]
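If the OP wants the final JSON string rather than the list of dicts, the standard library finishes the job; a short sketch, assuming the expression above was assigned to a variable named result:

import json

json_str = json.dumps(result, indent=4)  # `result` is the list of dicts built above
print(json_str)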

Converting csv to nested Json using python

I want to convert a csv file to a json file.
The csv file contains a lot of data.
CSV Column Structure
This is my column structure in the csv file. It has 200+ records.
id.oid libId personalinfo.Name personalinfo.Roll_NO personalinfo.addr personalinfo.marks.maths personalinfo.marks.physic clginfo.clgName clginfo.clgAddr clginfo.haveCert clginfo.certNo clginfo.certificates.cert_name_1 clginfo.certificates.cert_no_1 clginfo.certificates.cert_exp_1 clginfo.certificates.cert_name_2 clginfo.certificates.cert_no_2 clginfo.certificates.cert_exp_2 clginfo.isDept clginfo.NoofDept clginfo.DeptDetails.DeptName_1 clginfo.DeptDetails.location_1 clginfo.DeptDetails.establish_date_1 _v updatedAt.date
Expected Json
[{
"id":
{
"$oid": "00001"
},
"libId":11111,
"personalinfo":
{
"Name":"xyz",
"Roll_NO":101,
"addr":"aa bb cc ddd",
"marks":
[
"maths":80,
"physic":90
.....
]
},
"clginfo"
{
"clgName":"pqr",
"clgAddr":"qwerty",
"haveCert":true, //this is boolean true or false
"certNo":1, //this could be 1-10
"certificates":
[
{
"cert_name_1":"xxx",
"cert_no_1":12345,
"cert_exp.1":"20/2/20202"
},
{
"cert_name_2":"xxx",
"cert_no_2":12345,
"cert_exp_2":"20/2/20202"
},
......//could be up to 10
],
"isDept":true, //this is boolean true or false
"NoofDept":1 , //this could be 1-10
"DeptDetails":
[
{
"DeptName_1":"yyy",
"location_1":"zzz",
"establish_date_1":"1/1/1919"
},
......//up to 10 records
]
},
"__v": 1,
"updatedAt":
{
"$date": "2022-02-02T13:35:59.843Z"
}
}]
I have tried using pandas but I'm getting output as
My output
[{
"id.$oid": "00001",
"libId":11111,
"personalinfo.Name":"xyz",
"personalinfo.Roll_NO":101,
"personalinfo.addr":"aa bb cc ddd",
"personalinfo.marks.maths":80,
"personalinfo.marks.physic":90,
"clginfo.clgName":"pqr",
"clginfo.clgAddr":"qwerty",
"clginfo.haveCert":true,
"clginfo.certNo":1,
"clginfo.certificates.cert_name_1":"xxx",
"clginfo.certificates.cert_no_1":12345,
"clginfo.certificates.cert_exp.1":"20/2/20202"
"clginfo.certificates.cert_name_2":"xxx",
"clginfo.certificates.cert_no_2":12345,
"clginfo.certificates.cert_exp_2":"20/2/20202"
"clginfo.isDept":true,
"clginfo.NoofDept":1 ,
"clginfo.DeptDetails.DeptName_1":"yyy",
"clginfo.DeptDetails.location_1":"zzz",
"eclginfo.DeptDetails.stablish_date_1":"1/1/1919",
"__v": 1,
"updatedAt.$date": "2022-02-02T13:35:59.843Z",
}]
I am new to Python and only know the basics. Please help me get this output.
200+ records is really tiny, so even a naive solution is fine.
It can't be made fully generic, because the headers alone don't reveal that certificates is a list, unless we rely on all names under certificates ending in _N.
Proposed solution using only basic Python (a sketch follows this outline):
read the header row - split all column names by the period. Iterate over the resulting list and create nested dicts with the appropriate keys and dummy values (if you want to handle lists: create an array when the current key ends with _N and use N as the index)
for all rows:
clone the dictionary with dummy values
for each column, use the split keys from above to put the value into the corresponding dict; same approach as above for lists
append the dictionary to the list of rows
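A minimal sketch of that outline in plain Python. It nests values by the dotted column names and builds each record directly per row, which slightly simplifies the clone-a-template step; the _N list handling described above is left out, and the file name rows.csv is an assumption:

import csv
import json

def set_nested(record, dotted_key, value):
    # walk/create nested dicts along the dotted column name and set the leaf value
    keys = dotted_key.split(".")
    current = record
    for key in keys[:-1]:
        current = current.setdefault(key, {})
    current[keys[-1]] = value

rows = []
with open("rows.csv", newline="") as f:  # hypothetical input file with the dotted headers
    for flat_row in csv.DictReader(f):
        record = {}
        for column, value in flat_row.items():
            set_nested(record, column, value)
        rows.append(record)

print(json.dumps(rows, indent=2))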

Convert simple JSON to pandas dataframe

I am new to Python and I am trying to convert the following JSON into a pandas dataframe.
The format of the JSON is as follows (I have reduced the columns and rows here). There are around 8 columns and each JSON has around 20000 rows.
{
"DataFeed":[
{
"Columns":[
{
"Name":"customerID",
"Category":"Dimension",
"Type":"String"
},
{
"Name":"InvoiceID",
"Category":"Dimension",
"Type":"String"
},
{
"Name":"storeloc",
"Category":"Dimension",
"Type":"String"
}
],
"Rows":[
{
"customerID":"id128404805",
"InvoiceID":"IN3956",
"storeloc":"TX359"
},
{
"customerID":"id128404806",
"InvoiceID":"IN0054",
"storeloc":"CA235"
},
{
"customerID":"id128404807",
"InvoiceID":"IN7439",
"storeloc":"AZ2309"
}
]
}
]
}
I am trying to load it into a pandas dataframe. The number of columns is the same in the json file, and the number of rows is around 10000.
I am trying to get at the rows and insert them into a table after certain calculations.
I am trying to use json_normalize but I am struggling with navigating down to the Rows level and normalizing from there. I know there must be an easy solution, but I am new to working with JSON. Thanks
Try pd.json_normalize() with the record_path argument.
Note: you'll need pandas 1.0 or higher for the top-level pd.json_normalize (in older versions it lives in pandas.io.json).
assuming your json object is j
df = pd.json_normalize(j,record_path=['DataFeed','Rows'])
print(df)
customerID InvoiceID storeloc
0 id128404805 IN3956 TX359
1 id128404806 IN0054 CA235
2 id128404807 IN7439 AZ2309
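If the JSON is still sitting in a file rather than already parsed into j, a hedged sketch of loading it first (the file name datafeed.json is an assumption):

import json
import pandas as pd

with open("datafeed.json") as f:  # hypothetical path to the feed file
    j = json.load(f)              # parse the whole file into a Python dict

df = pd.json_normalize(j, record_path=['DataFeed', 'Rows'])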

Converting json dictionary to spark dataframe by having keys as columns

Is it possible to convert a dictionary into a dataframe by having the keys as columns with the values beneath?
I have this result set from api as a dictionary:
{
'information': [{
'created': '2020-10-26T00:00:00+00:00',
'title': 'Random1',
'published': 'YES',
}, {
'created': '2020-11-06T00:00:00+00:00',
'title': 'Random2',
'published': 'YES',
}, {
'created': '2020-10-27T00:00:00+00:00',
'title': 'Random3',
'published': 'YES',
}, {
'created': '2020-10-29T00:00:00+00:00',
'title': 'Random4',
'published': 'YES',
}]
}
If I convert this to a dataframe like this:
json_rdd=sc.parallelize([data_dict['information']])
spark_df = spark.createDataFrame(json_rdd)
spark_df.createOrReplaceTempView("data_df");
This gives me columns listed as _1, _2, _3,_4 with the data still showing as objects within them.
Is it possible to have the data_df (converted dataframe) show the columns as created, title, published and have the values within the corresponding columns as flat?
You can use the dictionary directly to create the dataframe; there is no need to convert it to an RDD first.
arr = your_dict_here
spark.createDataFrame(arr['information']).show()
Output:
+--------------------+---------+-------+
| created|published| title|
+--------------------+---------+-------+
|2020-10-26T00:00:...| YES|Random1|
|2020-11-06T00:00:...| YES|Random2|
|2020-10-27T00:00:...| YES|Random3|
|2020-10-29T00:00:...| YES|Random4|
+--------------------+---------+-------+
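If relying on schema inference from the dicts is a concern (Spark may warn about inferring a schema from dict rows), createDataFrame also accepts an explicit schema. A small sketch, where the DDL string and the abbreviated sample data are assumptions:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# abbreviated version of the API result from the question
arr = {
    'information': [
        {'created': '2020-10-26T00:00:00+00:00', 'title': 'Random1', 'published': 'YES'},
        {'created': '2020-11-06T00:00:00+00:00', 'title': 'Random2', 'published': 'YES'},
    ]
}

# pin the column types with a DDL-style schema instead of letting Spark infer them
spark_df = spark.createDataFrame(
    arr['information'],
    schema='created string, title string, published string'
)
spark_df.createOrReplaceTempView("data_df")
spark_df.show(truncate=False)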
