Pyspark JSON array of objects into columns - python

I'm ingesting JSON files into Spark, and I have come across an object like the one below in the nested JSON from the file:
"data": {
"key1" :"v1"
"key2" : [
{"nk1" :"nv1"},
{"nk2" :"nv2" },
{"nk3" :"nv3" }
]
}
After reading it in Spark, it changes into the format below:
"data": {
"key1" :"v1"
"key2" : [
{"nk1" :"nv1", "nk2" :null, "nk3" :null},
{"nk1" :null, "nk2" :"nv2", "nk3" :null},
{"nk1" :null, "nk2" :null, "nk3" :"nv3"}
]
}
I need them as columns in the Spark DataFrame:
key1  nk1  nk2  nk3
v1    nv1  nv2  nv3
Please help me with a solution for this. I'm thinking of converting this to a string and using regex. Is there a better solution?

You can explode the array and pivot key2:
import pyspark.sql.functions as F

df2 = df.select(
    F.col('data.key1').alias('key1'),
    F.explode('data.key2').alias('key2')
).select(
    'key1',
    F.map_keys('key2')[0].alias('key'),
    F.map_values('key2')[0].alias('val')
).groupBy('key1').pivot('key').agg(F.first('val'))

df2.show()
+----+---+---+---+
|key1|nk1|nk2|nk3|
+----+---+---+---+
| v1|nv1|nv2|nv3|
+----+---+---+---+
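Note that map_keys/map_values assume key2 was read as an array of maps. If Spark has instead inferred key2 as an array of structs with nullable fields (the second shape shown in the question), a minimal sketch under that assumption, relying on each array element carrying exactly one non-null value:
import pyspark.sql.functions as F

# explode the array, then keep the single non-null value per struct field
df2 = (
    df.select(F.col('data.key1').alias('key1'),
              F.explode('data.key2').alias('k'))
      .groupBy('key1')
      .agg(F.first('k.nk1', ignorenulls=True).alias('nk1'),
           F.first('k.nk2', ignorenulls=True).alias('nk2'),
           F.first('k.nk3', ignorenulls=True).alias('nk3'))
)
df2.show()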

Related

Excel to nested Json including child elements into array

I am trying to convert an Excel file to nested JSON using Python, where the repeated values go in as an array of elements.
Example structure of the CSV:
Manufacturer,oilType,viscosity
shell,superOil,1ova
shell,superOil,2ova
shell,normalOil,1ova
bp, power, 10bba
It should be displayed in JSON (expected output) as:
elements: [
    {
        "Manufacturer": "shell",
        "details": [
            {
                "OilType": "superOil",
                "Viscosity": ["1ova", "2ova"]
            },
            {
                "OilType": "normalOil",
                "Viscosity": ["1ova"]
            }
        ]
    },
    {
        "Manufacturer": "bp",
        "details": [
            {
                "OilType": "power",
                "Viscosity": ["10bba"]
            }
        ]
    }
]
I have currently converted the CSV into JSON using openpyxl, and the values are displayed for each of the headers in a format like this (current output):
[{Manufacturer: "shell", oilType: "superOil", Viscosity: "1ova"}, {...}, {...}, ...]
Please help in getting the expected output.
Hi and welcome to Stack Overflow.
Your question actually has nothing to do with openpyxl, because you don't need to save into an Excel file.
You can, though:
Load the CSV (or Excel) into a pandas DataFrame
Group by Manufacturer and oil type
Reshape into the format you want
Transform to JSON (either a string or a file)
In practice, that gives something like this:
import json
import pandas as pd

df = pd.read_csv("oil.csv")  # or read_excel if this is an Excel file
oils = df.groupby(["Manufacturer", "oilType"]).aggregate(pd.Series.to_list)

elements = [
    {
        "Manufacturer": manufacturer,
        "Details": [
            {"OilType": o, "Viscosity": v}
            for o, v in data.droplevel(0).viscosity.items()
        ],
    }
    for manufacturer, data in oils.groupby(level="Manufacturer")
]

with open("oil.json", "w") as f:
    json.dump({"elements": elements}, f)
For information, oils would look like this:
                           viscosity
Manufacturer oilType
bp           power           [10bba]
shell        normalOil        [1ova]
             superOil    [1ova, 2ova]
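For reference, the same nesting can also be built with two nested groupby calls, skipping the intermediate MultiIndex; a minimal sketch, assuming the column names from the CSV above:
import json
import pandas as pd

df = pd.read_csv("oil.csv")

elements = [
    {
        "Manufacturer": manufacturer,
        "details": [
            # collect all viscosity values for each oil type of this manufacturer
            {"OilType": oil, "Viscosity": group["viscosity"].tolist()}
            for oil, group in mdf.groupby("oilType")
        ],
    }
    for manufacturer, mdf in df.groupby("Manufacturer")
]

print(json.dumps({"elements": elements}, indent=2))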

How to create a single json file from two DataFrames?

I have two DataFrames, and I want to post them as JSON (to a web service), but first I have to combine them into a single JSON document.
#first df
input_df = pd.DataFrame()
input_df['first'] = ['a', 'b']
input_df['second'] = [1, 2]
#second df
customer_df = pd.DataFrame()
customer_df['first'] = ['c']
customer_df['second'] = [3]
To convert to JSON, I used the following code for each DataFrame:
df.to_json(
    path_or_buf='out.json',
    orient='records',  # other options: 'split', 'records', 'index', 'columns', 'values', 'table'
    date_format='iso',
    force_ascii=False,
    default_handler=None,
    lines=False,
    indent=2
)
This code gives me output like this (for example, input_df exported to JSON):
[
    {
        "first": "a",
        "second": 1
    },
    {
        "first": "b",
        "second": 2
    }
]
My desired output is like this:
{
    "input": [
        {
            "first": "a",
            "second": 1
        },
        {
            "first": "b",
            "second": 2
        }
    ],
    "customer": [
        {
            "first": "c",
            "second": 3
        }
    ]
}
How can I get output like this? I couldn't find a way :(
You can concatenate the DataFrames with appropriate key names, then group by the keys and build a dictionary of records for each group; finally build a JSON string from the whole thing:
out = (
    pd.concat([input_df, customer_df], keys=['input', 'customer'])
      .droplevel(1)
      .groupby(level=0).apply(lambda x: x.to_dict('records'))
      .to_json()
)
Output:
'{"customer":[{"first":"c","second":3}],"input":[{"first":"a","second":1},{"first":"b","second":2}]}'
or a dict by replacing the final to_json() with to_dict().
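If the goal is just the combined payload rather than a pandas object, a minimal sketch that builds the dict directly from the two DataFrames (same data as above):
import json
import pandas as pd

input_df = pd.DataFrame({'first': ['a', 'b'], 'second': [1, 2]})
customer_df = pd.DataFrame({'first': ['c'], 'second': [3]})

# each to_dict('records') gives a list of row dicts, ready to nest under a key
payload = {
    'input': input_df.to_dict('records'),
    'customer': customer_df.to_dict('records'),
}
print(json.dumps(payload, indent=2))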

Convert simple JSON to pandas dataframe

I am new to Python and I am trying to convert the following JSON into a pandas DataFrame.
The format of the JSON is as follows (I have reduced the columns and rows); there are around 8 columns and each JSON has around 20,000 rows:
{
    "DataFeed": [
        {
            "Columns": [
                {
                    "Name": "customerID",
                    "Category": "Dimension",
                    "Type": "String"
                },
                {
                    "Name": "InvoiceID",
                    "Category": "Dimension",
                    "Type": "String"
                },
                {
                    "Name": "storeloc",
                    "Category": "Dimension",
                    "Type": "String"
                }
            ],
            "Rows": [
                {
                    "customerID": "id128404805",
                    "InvoiceID": "IN3956",
                    "storeloc": "TX359"
                },
                {
                    "customerID": "id128404806",
                    "InvoiceID": "IN0054",
                    "storeloc": "CA235"
                },
                {
                    "customerID": "id128404807",
                    "InvoiceID": "IN7439",
                    "storeloc": "AZ2309"
                }
            ]
        }
    ]
}
I am trying to load it into a pandas DataFrame. The number of columns is the same across the JSON files, and the number of rows is around 10,000.
I want to get at the rows and insert them into a table after certain calculations.
I am trying to use json_normalize, but I am struggling with navigating to the Rows level and normalizing after that. I know there must be an easy solution, but I am new to working with JSON. Thanks.
Try pd.json_normalize() with the record_path argument.
Note: you'll need pandas 1.0 or higher for the top-level pd.json_normalize (in older versions it lives at pandas.io.json.json_normalize).
Assuming your JSON object is j:
df = pd.json_normalize(j,record_path=['DataFeed','Rows'])
print(df)
    customerID InvoiceID storeloc
0  id128404805    IN3956    TX359
1  id128404806    IN0054    CA235
2  id128404807    IN7439   AZ2309
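For completeness, a minimal end-to-end sketch that loads the JSON from disk first (the file name is an assumption):
import json
import pandas as pd

# load the raw JSON, then walk down DataFeed -> Rows and normalize each row
with open('datafeed.json') as f:
    j = json.load(f)

df = pd.json_normalize(j, record_path=['DataFeed', 'Rows'])
print(df)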

Complex json to pandas dataframe

There are lots of questions about JSON to pandas DataFrame, but none of them solved my issue. I am practicing on this complex JSON file, which looks like this:
{
    "type": "FeatureCollection",
    "features": [ {
        "Id": 265068000,
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [ 22.170376666666666, 65.57273333333333 ]
        },
        "properties": {
            "timestampExternal": 1529151039629
        }
    }, {
        "Id": 265745760,
        "type": "Feature",
        "geometry": {
            "type": "Point",
            "coordinates": [ 20.329506666666667, 63.675425000000004 ]
        },
        "properties": {
            "timestampExternal": 1529151278287
        }
    } ]
}
I want to convert this JSON directly to a pandas DataFrame using pd.read_json(). My primary goal is to extract Id, coordinates and timestampExternal. As this is very complex JSON, the normal pd.read_json() call simply doesn't give correct output. Can you suggest how I can approach this kind of situation? The expected output is something like this:
Id,Coordinates,timestampExternal
265068000,[22.170376666666666, 65.57273333333333],1529151039629
265745760,[20.329506666666667, 63.675425000000004],1529151278287
You can read the JSON to load it into a dictionary. Then, with a list comprehension over the features, extract the attributes you want as columns:
import json
import pandas as pd

_json = json.load(open('/path/to/json'))
df_dict = [{'id': item['Id'],
            'coordinates': item['geometry']['coordinates'],
            'timestampExternal': item['properties']['timestampExternal']}
           for item in _json['features']]
extracted_df = pd.DataFrame(df_dict)
>>> extracted_df
                                 coordinates         id  timestampExternal
0   [22.170376666666666, 65.57273333333333]  265068000      1529151039629
1  [20.329506666666667, 63.675425000000004]  265745760      1529151278287
You can read the JSON directly and then give the features array to pandas as a list of dicts, like:
Code:
import json
import pandas as pd

with open('test.json') as f:
    data = json.load(f)

df = pd.DataFrame([dict(id=datum['Id'],
                        coords=datum['geometry']['coordinates'],
                        ts=datum['properties']['timestampExternal'])
                   for datum in data['features']])
print(df)
Results:
                                      coords         id             ts
0   [22.170376666666666, 65.57273333333333]  265068000  1529151039629
1  [20.329506666666667, 63.675425000000004]  265745760  1529151278287
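Alternatively, on a pandas version that exposes json_normalize, a minimal sketch that flattens the features array and keeps just the three fields (file name assumed, as above):
import json
import pandas as pd

with open('test.json') as f:
    data = json.load(f)

# json_normalize turns nested keys into dotted column names
df = pd.json_normalize(data['features'])[
    ['Id', 'geometry.coordinates', 'properties.timestampExternal']
]
df.columns = ['Id', 'Coordinates', 'timestampExternal']
print(df)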

Convert CSV to JSON (in specific format) using Python

I would like to convert a CSV file into a JSON file using Python 2.7. Below is the Python code I tried, but it is not giving me the expected result. Also, I would like to know whether there is a simpler approach than mine. Any help is appreciated.
Here is my csv file (SampleCsvFile.csv):
zipcode,date,state,val1,val2,val3,val4,val5
95110,2015-05-01,CA,50,30.00,5.00,3.00,3
95110,2015-06-01,CA,67,31.00,5.00,3.00,4
95110,2015-07-01,CA,97,32.00,5.00,3.00,6
Here is the expected json file (ExpectedJsonFile.json):
{
    "zipcode": "95110",
    "state": "CA",
    "subset": [
        {
            "date": "2015-05-01",
            "val1": "50",
            "val2": "30.00",
            "val3": "5.00",
            "val4": "3.00",
            "val5": "3"
        },
        {
            "date": "2015-06-01",
            "val1": "67",
            "val2": "31.00",
            "val3": "5.00",
            "val4": "3.00",
            "val5": "4"
        },
        {
            "date": "2015-07-01",
            "val1": "97",
            "val2": "32.00",
            "val3": "5.00",
            "val4": "3.00",
            "val5": "6"
        }
    ]
}
Here's the python code I tried:
import pandas as pd
from itertools import groupby
import json

df = pd.read_csv('SampleCsvFile.csv')
names = df.columns.values.tolist()
data = df.values
master_list2 = [(d["zipcode"], d["state"], d) for d in [dict(zip(names, d)) for d in data]]
intermediate2 = [(k, [x[2] for x in list(v)]) for k, v in groupby(master_list2, lambda t: (t[0], t[1]))]
nested_json2 = [dict(zip(names, (k[0][0], k[0][1], k[1]))) for k in [(i[0], i[1]) for i in intermediate2]]
#print json.dumps(nested_json2, indent=4)
with open('ExpectedJsonFile.json', 'w') as outfile:
    outfile.write(json.dumps(nested_json2, indent=4))
Since you are using pandas already, I tried to get as much mileage as I could out of dataframe methods. I also ended up wandering fairly far afield from your implementation. I think the key here, though, is don't try to get too clever with your list and/or dictionary comprehensions. You can very easily confuse yourself and everyone who reads your code.
import pandas as pd
from collections import OrderedDict
import json

df = pd.read_csv('SampleCsvFile.csv', dtype={
    "zipcode": str,
    "date": str,
    "state": str,
    "val1": str,
    "val2": str,
    "val3": str,
    "val4": str,
    "val5": str
})

results = []
for (zipcode, state), bag in df.groupby(["zipcode", "state"]):
    contents_df = bag.drop(["zipcode", "state"], axis=1)
    subset = [OrderedDict(row) for i, row in contents_df.iterrows()]
    results.append(OrderedDict([("zipcode", zipcode),
                                ("state", state),
                                ("subset", subset)]))

print json.dumps(results[0], indent=4)

#with open('ExpectedJsonFile.json', 'w') as outfile:
#    outfile.write(json.dumps(results[0], indent=4))
The simplest way to have all the json datatypes written as strings, and to retain their original formatting, was to force read_csv to parse them as strings. If, however, you need to do any numerical manipulation on the values before writing out the json, you will have to allow read_csv to parse them numerically and coerce them into the proper string format before converting to json.
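For example, if val2 did need numeric handling first, a minimal sketch of coercing it back to a two-decimal string before building the JSON (the multiplication is purely hypothetical):
import pandas as pd

# read val2 numerically this time, do the math, then format it back to a string
df = pd.read_csv('SampleCsvFile.csv', dtype={"zipcode": str, "val2": float})
df["val2"] = df["val2"] * 1.05                  # hypothetical numeric manipulation
df["val2"] = df["val2"].map("{:.2f}".format)    # back to "31.50"-style strings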
