Convert the dataframe to JSON Based on Column name

I have a dataframe like the one below (I am showing just one row):
Vessel_id,Org_id,Vessel_name,Good_Laden_Miles_Min,Good_Ballast_Miles_Min,Severe_Laden_Miles_Min,Severe_Ballast_Miles_Min
1,5,"ABC",10,15,25,35
I want to convert the dataframe to JSON in the format below:
{
    "Vessel_id": 1,
    "Vessel_name": "ABC",
    "Org_id": 5,
    "WeatherGood": {
        "Good_Laden_Miles_Min": 10,
        "Good_Ballast_Miles_Min": 15
    },
    "WeatherSevere": {
        "Severe_Laden_Miles_Min": 25,
        "Severe_Ballast_Miles_Min": 35
    }
}
How can I group all the columns starting with Good into WeatherGood (and those starting with Severe into WeatherSevere) and convert the result to JSON?

You can first convert the dataframe to a dictionary of records, then transform each record to your desired format. Finally, convert the list of records to JSON.
import json
records = df.to_dict('records')
for record in records:
    # Move the "Good_*" columns into a nested dict, removing them from the top level.
    record['WeatherGood'] = {
        k: record.pop(k) for k in ('Good_Laden_Miles_Min', 'Good_Ballast_Miles_Min')
    }
    # Do the same for the "Severe_*" columns.
    record['WeatherSevere'] = {
        k: record.pop(k) for k in ('Severe_Laden_Miles_Min', 'Severe_Ballast_Miles_Min')
    }
>>> json.dumps(records)
'[{"Vessel_id": 1, "Org_id": 5, "Vessel_name": "ABC", "WeatherGood": {"Good_Laden_Miles_Min": 10, "Good_Ballast_Miles_Min": 15}, "WeatherSevere": {"Severe_Laden_Miles_Min": 25, "Severe_Ballast_Miles_Min": 35}}]'
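Since the question asks about grouping columns by what their names start with, the same transformation could also be driven by prefixes rather than by listing every column. The following is only a sketch and not part of the answer above: the helper nest_by_prefix and the PREFIX_GROUPS mapping are hypothetical names, and the record is written out by hand in the shape that df.to_dict('records') produces.

```python
import json

# Hypothetical mapping from column-name prefix to the nested key it feeds.
PREFIX_GROUPS = {"Good_": "WeatherGood", "Severe_": "WeatherSevere"}


def nest_by_prefix(record, prefix_groups):
    """Return a new dict with prefixed columns moved into sub-dicts."""
    nested = {}
    for column, value in record.items():
        for prefix, group in prefix_groups.items():
            if column.startswith(prefix):
                nested.setdefault(group, {})[column] = value
                break
        else:
            # No prefix matched: keep the column at the top level.
            nested[column] = value
    return nested


# Same shape as df.to_dict('records') for the single row in the question.
records = [{
    "Vessel_id": 1, "Org_id": 5, "Vessel_name": "ABC",
    "Good_Laden_Miles_Min": 10, "Good_Ballast_Miles_Min": 15,
    "Severe_Laden_Miles_Min": 25, "Severe_Ballast_Miles_Min": 35,
}]

nested_records = [nest_by_prefix(r, PREFIX_GROUPS) for r in records]
print(json.dumps(nested_records, indent=2))
```

With this approach a new Good_ or Severe_ column would be picked up without touching the code, at the cost of depending on a naming convention.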

Related

Nested Json Using pyspark

We have to build nested JSON with the structure below in PySpark. I have also added the data that needs to be fed into it.
Input Data structure
Data
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
a1=["DA_STinf","DA_Stinf_NA","DA_Stinf_city","DA_Stinf_NA_ID","DA_Stinf_NA_ID_GRANT","DA_country"]
a2=["data.studentinfo","data.studentinfo.name","data.studentinfo.city","data.studentinfo.name.id","data.studentinfo.name.id.grant","data.country"]
columns = ["data","action"]
df = spark.createDataFrame(zip(a1, a2), columns)
#Input data for json structure
a1=["Pune"]
a2=["YES"]
a3=["India"]
col=["DA_Stinf_city","DA_Stinf_NA_ID_GRANT","DA_country"]
data=spark.createDataFrame(zip(a1, a2,a3), col)
Expected result based on above data
{
"data": {
"studentinfo": {
"city": "Pune",
"name": {
"id": {
"grant": "YES"
}
}
},
"country": "india"
}
}
We have tried building this manually with F.struct, but we need to find a dynamic way to build the JSON using the df DataFrame, which has the data and action columns.
from pyspark.sql import functions as F
data.select(
    F.struct(
        F.struct(
            F.struct(F.col("DA_Stinf_city")).alias("city"),
            F.struct(
                F.struct(F.col("DA_Stinf_NA_ID_GRANT")).alias("id")
            ).alias("name"),
        ).alias("studentinfo"),
        F.struct(F.col("DA_country")).alias("country")
    ).alias("data")
)

The approach below should give the correct structure (with the wrong key names - if you are happy with the approach, which doesn't use DataFrame operations but rather works in the underlying RDD, then I can flesh it out):
def build_json(pairs, running=None):
    # Create a fresh dict per call instead of sharing a mutable default argument.
    if running is None:
        running = {}
    pending = {}
    for hierarchy, value in pairs:
        key, rest = hierarchy[0], hierarchy[1:]
        if not rest:
            # Last level of the hierarchy: store the value itself.
            running[key] = value
        else:
            # Deeper levels remain: collect them under this key for the recursive step.
            pending.setdefault(key, []).append((rest, value))
    for key, children in pending.items():
        running[key] = build_json(children)
    return running
data.rdd.map(
    lambda x: build_json(
        [(column.split("_"), value) for column, value in x.asDict().items()]
    )
)
The basic idea is to get a set of tuples from the underlying RDD consisting of the column name broken into its json hierarchy and the value to insert into the hierarchy. Then the function build_json inserts the value into its correct place in the json hierarchy, while building out the json object recursively.
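To get the key names from the question rather than the pieces of the column names, the dotted paths in the action column could drive the nesting instead. The sketch below is plain Python with no Spark dependency and is only an illustration under assumptions: nest_by_path is a hypothetical helper, the column-to-path mapping is copied by hand from the question's a1/a2 lists, and the row is the question's single input row as a dict (the shape Row.asDict() returns).

```python
import json


def nest_by_path(path_values):
    """Build a nested dict from {"a.b.c": value} style entries."""
    root = {}
    for path, value in path_values.items():
        *parents, leaf = path.split(".")
        node = root
        for part in parents:
            # Assumes no path is both a leaf and a parent of another path.
            node = node.setdefault(part, {})
        node[leaf] = value
    return root


# Column name -> JSON path, taken from the "data" and "action" columns of df.
column_to_path = {
    "DA_Stinf_city": "data.studentinfo.city",
    "DA_Stinf_NA_ID_GRANT": "data.studentinfo.name.id.grant",
    "DA_country": "data.country",
}

# One input row in the shape returned by Row.asDict().
row = {"DA_Stinf_city": "Pune", "DA_Stinf_NA_ID_GRANT": "YES", "DA_country": "India"}

nested = nest_by_path({column_to_path[col]: value for col, value in row.items()})
print(json.dumps(nested, indent=2))
```

In a Spark job the same function could presumably be applied per row through data.rdd.map, as in the answer above.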

How to convert a dataframe to nested json

I have this DataFrame:
df = pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}, index=[0])
All the DataFrame fields are ASCII strings, and the DataFrame is the output of a SQL query (pd.read_sql_query), so the line above that creates it may not be quite right.
I would like the final JSON output to be in this form:
[{
    "Survey": "001_220816080015",
    "BCD": "001_220816080015.bcd",
    "Sections": [
        "4700A1/305",
        "4700A1/312"
    ]
}]
I realize that may not be 'normal' JSON but that is the format expected by a program over which I have no control.
The nearest I have achieved so far is
[{
"Survey": "001_220816080015",
"BCD": "001_220816080015.bcd",
"Sections": "4700A1/305, 4700A1/312"
}]
The problem might be the structure of the DataFrame, but it is not clear to me how to reformat it to produce the required output.
The line that produces the JSON is:
df.to_json(orient='records', indent=2)

Isn't the only thing you need to do to parse the Sections into a list?
import pandas as pd
df= pd.DataFrame({'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}, index=[0])
df['Sections'] = df['Sections'].str.split(', ')
print(df.to_json(orient='records', indent=2))
[
{
"Survey":"001_220816080015",
"BCD":"001_220816080015.bcd",
"Sections":[
"4700A1\/305",
"4700A1\/312"
]
}
]

The DataFrame won't help you here, since it's just giving back the input parameter you gave it.
You should just split the specific column you need into an array:
input_data = {'Survey': "001_220816080015", 'BCD': "001_220816080015.bcd", 'Sections': "4700A1/305, 4700A1/312"}
input_data['Sections'] = input_data['Sections'].split(', ')
nested_json = [input_data]
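To get from that list to the actual JSON text, the standard json module should be enough. This is a short sketch, assuming the row is already available as a plain dict; unlike the to_json output shown in the other answer, json.dumps leaves the forward slashes unescaped.

```python
import json

input_data = {
    "Survey": "001_220816080015",
    "BCD": "001_220816080015.bcd",
    "Sections": "4700A1/305, 4700A1/312",
}

# Split on commas and strip whitespace so "a, b" and "a,b" both work.
input_data["Sections"] = [s.strip() for s in input_data["Sections"].split(",")]

# Wrap the single record in a list to match the expected top-level array.
nested_json = json.dumps([input_data], indent=2)
print(nested_json)
```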

Converting CSV to JSON with Pandas

I have the data in CSV format, for example:
First row is column number, let's ignore that.
Second row, starting at Col_4, are number of days
Third row on: Col_1 and Col_2 are coordinates (lon, lat), Col_3 is a statistical value, Col_4 onwards are measurements.
As you can see, this format is a confusing mess. I would like to convert this to JSON the following way, for example:
{"points":{
"dates": ["20190103", "20190205"],
"0":{
"lon": "-8.072557",
"lat": "41.13702",
"measurements": ["-0.191792","-10.543130"],
"val": "-1"
},
"1":{
"lon": "-8.075557",
"lat": "41.15702",
"measurements": ["-1.191792","-2.543130"],
"val": "-9"
}
}
}
To summarise what I've done till now, I read the CSV to a Pandas DataFrame:
df = pandas.read_csv("sample.csv")
I can extract the dates into a Numpy Array with:
dates = df.iloc[0][3:].to_numpy()
I can extract measurements for all points with:
measurements_all = df.iloc[1:,3:].to_numpy()
And the lon and lat and val, respectively, with:
lon_all = df.iloc[1:,0:1].to_numpy()
lat_all = df.iloc[1:,1:2].to_numpy()
val_all = df.iloc[1:,2:3].to_numpy()
Can anyone explain how I can format this information into a structure identical to the JSON example?

With this dataframe
df = pd.DataFrame([{"col1": 1, "col2": 2, "col3": 3, "col4":3, "col5":5},
{"col1": None, "col2": None, "col3": None, "col4":20190103, "col5":20190205},
{"col1": -8.072557, "col2": 41.13702, "col3": -1, "col4":-0.191792, "col5":-10.543130},
{"col1": -8.072557, "col2": 41.15702, "col3": -9, "col4":-0.191792, "col5":-2.543130}])
This code does what you want, although it does not convert anything to strings as in your example. You should be able to do that easily if necessary.
# generate dict with dates
final_dict = {"dates": [list(df["col4"])[1], list(df["col5"])[1]]}
# iterate over relevant rows and generate dicts
for i in range(2, len(df)):
    final_dict[i-2] = {"lon": df["col1"][i],
                       "lat": df["col2"][i],
                       "measurements": [df[cname][i] for cname in ["col4", "col5"]],
                       "val": df["col3"][i]
                       }
This leads to the following output:
{0: {'lat': 41.13702,
'lon': -8.072557,
'measurements': [-0.191792, -10.54313],
'val': -1.0},
1: {'lat': 41.15702,
'lon': -8.072557,
'measurements': [-0.191792, -2.54313],
'val': -9.0},
'dates': [20190103.0, 20190205.0]}

Extracting dates from data and then eliminating first row from data frame:
dates =list(data.iloc[0][3:])
data=data.iloc[1:]
Inserting dates into dict:
points={"dates":dates}
Iterating through data frame and adding elements to the dictionary:
for i, (_, row) in enumerate(data.iterrows()):
    points[str(i)] = {
        "lon": row["Col_1"],
        "lat": row["Col_2"],
        # Col_3 is the statistical value; the measurements start at Col_4.
        "measurements": list(row.iloc[3:]),
        "val": row["Col_3"],
    }
You can convert dict to string object using json.dumps():
points_json = json.dumps(points)
The result is a string, not a dict. There is more about that in the post "Converting dictionary to JSON".
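The desired output in the question stores every value as a string and wraps everything in a top-level "points" key, which neither answer above does. The following sketch shows one way that last step might look, assuming the dates and data rows have already been pulled out of the DataFrame as plain lists (for example with .values.tolist()); build_points is a hypothetical helper and the sample numbers are taken from the question.

```python
import json


def build_points(dates, rows):
    """Build the {"points": {...}} structure with every value as a string.

    Each row is expected as [lon, lat, val, measurement_1, measurement_2, ...].
    """
    points = {"dates": [str(d) for d in dates]}
    for i, (lon, lat, val, *measurements) in enumerate(rows):
        points[str(i)] = {
            "lon": str(lon),
            "lat": str(lat),
            "measurements": [str(m) for m in measurements],
            "val": str(val),
        }
    return {"points": points}


dates = [20190103, 20190205]
rows = [
    [-8.072557, 41.13702, -1, -0.191792, -10.54313],
    [-8.075557, 41.15702, -9, -1.191792, -2.54313],
]

result = build_points(dates, rows)
print(json.dumps(result, indent=2))
```

Note that str() on a float drops trailing zeros, so if the exact text from the CSV matters, reading the file with dtype=str may be the safer route.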

I converted the pandas dataframe values to lists, then looped through one of the lists and added the values to a nested JSON object.
import pandas
import json
import argparse
import sys
def parseInput():
    parser = argparse.ArgumentParser(description="Convert CSV measurements to JSON")
    parser.add_argument(
        '-i', "--input",
        help="CSV input",
        required=True,
        type=argparse.FileType('r')
    )
    parser.add_argument(
        '-o', "--output",
        help="JSON output",
        type=argparse.FileType('w'),
        default=sys.stdout
    )
    return parser.parse_args()
def main():
    args = parseInput()
    input_file = args.input
    output = args.output
    dataframe = pandas.read_csv(input_file)
    longitudes = dataframe.iloc[1:,0:1].T.values.tolist()[0]
    latitudes = dataframe.iloc[1:,1:2].T.values.tolist()[0]
    averages = dataframe.iloc[1:,2:3].T.values.tolist()[0]
    measurements = dataframe.iloc[1:,3:].values.tolist()
    dates=dataframe.iloc[0][3:].values.tolist()
    points={"dates":dates}
    for index, val in enumerate(longitudes):
        entry = {
            "lon":longitudes[index],
            "lat":latitudes[index],
            "measurements":measurements[index],
            "average":averages[index]
        }
        points[str(index)] = entry
    json.dump(points, output)
if __name__ == "__main__":
    main()

Merging list of python dictionaries by column value

I have data that is a list of python dictionaries, each representing a row in the data, and want to combine several of these into one dictionary.
I need to combine them by a common value in a single column, note the dictionaries to merge may or may not contain similar columns and values should be concatenated, not clobbered.
Here is an example (combining dicts by value in column 'a'):
data = [{ 'a':0, 'b':10, 'c':20 },
        { 'a':2, 'd':30, 'e':40 },
        { 'a':0, 'b':50, 'c':60 },
        { 'a':1, 'd':70, 'c':80 },
        { 'a':1, 'b':90, 'e':100 }]
Desired output is:
new_data = [{ 'a':0, 'b':[10,50], 'c':[20,60] },
            { 'a':1, 'd':[70], 'c':[80], 'b':[90], 'e':[100] },
            { 'a':2, 'd':[30], 'e':[40] }]
I have a simple function that can accomplish this, but need a faster method (Data has approx 1,000,000 rows and 20 columns). My method of finding the dictionaries I want to merge is very expensive.
Here is where I have an issue with computation time:
unique_idx, locations = [], {}
for i, row in enumerate(data):
    _id = row['a']
    if _id not in unique_idx:
        unique_idx.append(_id)
        locations[_id] = [i]
    else:
        locations[_id].append(i)
grouped_data = [[data[i] for i in locs] for locs in locations.values()]
I need a faster method to collect dictionaries that contain the same value in one column. Ideally I want a quick method with plain python, but if this can be done simply with a pandas DataFrame that is good as well.
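One likely cause of the slowdown is the `_id not in unique_idx` test, which scans a list on every row; keying a dict by the id avoids that and lets the merge happen in the same single pass. The following is a plain-Python sketch of that idea, not a benchmarked solution: merge_by_key is a hypothetical name, and the output is sorted by key only to match the ordering shown in the question.

```python
from collections import defaultdict


def merge_by_key(rows, key):
    """Group rows by rows[i][key], concatenating all other values into lists."""
    merged = {}
    for row in rows:
        # Dict lookup by id is O(1) on average, unlike `in` on a list.
        group = merged.setdefault(row[key], defaultdict(list))
        for column, value in row.items():
            if column != key:
                group[column].append(value)
    # Rebuild plain dicts, one per distinct key value, ordered by key.
    return [{key: k, **merged[k]} for k in sorted(merged)]


data = [
    {"a": 0, "b": 10, "c": 20},
    {"a": 2, "d": 30, "e": 40},
    {"a": 0, "b": 50, "c": 60},
    {"a": 1, "d": 70, "c": 80},
    {"a": 1, "b": 90, "e": 100},
]

new_data = merge_by_key(data, "a")
print(new_data)
```

If the ordering of the groups does not matter, the sorted() call can be dropped so the result keeps first-seen order.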

How do I create a data structure that will be serialized to this JSON format in python?

I have a function that accepts a list of date objects and should output the following dictionary in JSON:
{
"2010":{
"1":{
"id":1,
"title":"foo",
"postContent":"bar"
},
"7":{
"id":2,
"title":"foo again",
"postContent":"bar baz boo"
}
},
"2009":{
"6":{
"id":3,
"title":"foo",
"postContent":"bar"
},
"8":{
"id":4,
"title":"foo again",
"postContent":"bar baz boo"
}
}
}
Basically I would like to access my objects by year and month number.
What code can convert a list to this format in python that can be serialized to the dictionary above in json?

Something along the lines of this should work:
from collections import defaultdict
import json
d = defaultdict(dict)
for date in dates:
    d[date.year][date.month] = info_for_date(date)
json.dumps(d)
Where info_for_date is a function that returns a dict like those in your question.
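For a version that runs end to end, here is a sketch with made-up sample data. The posts dict and this particular info_for_date are placeholders for whatever lookup the real application has; only defaultdict, datetime.date and json come from the standard library.

```python
from collections import defaultdict
from datetime import date
import json

# Placeholder data: maps each date to the info dict for that date.
posts = {
    date(2010, 1, 15): {"id": 1, "title": "foo", "postContent": "bar"},
    date(2010, 7, 3): {"id": 2, "title": "foo again", "postContent": "bar baz boo"},
    date(2009, 6, 21): {"id": 3, "title": "foo", "postContent": "bar"},
}


def info_for_date(d):
    """Placeholder lookup; the real function would come from the application."""
    return posts[d]


by_year = defaultdict(dict)
for d in posts:
    # Two dates in the same year and month would overwrite each other here.
    by_year[d.year][d.month] = info_for_date(d)

# json.dumps turns the integer year and month keys into strings.
encoded = json.dumps(by_year, indent=2)
decoded = json.loads(encoded)
print(encoded)
```

After a round trip through JSON the keys are strings, so lookups on the decoded data use decoded["2010"]["1"] rather than integer keys.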
