I have a nested, complex JSON file with struct types, array types, lists, and dicts nested within each other. I have a function that flattens struct-type columns, but it fails when it encounters any other type. Is there a recursive function that handles all these types properly and flattens down to leaf level using a PySpark DataFrame?
The code I used to flatten struct types is:
from pyspark.sql.functions import col

def flatten_df(nested_df):
    # Depth-first walk that collects leaf columns, descending into structs
    stack = [((), nested_df)]
    columns = []
    while len(stack) > 0:
        parents, df = stack.pop()
        for column_name, column_type in df.dtypes:
            if column_type[:6] == "struct":
                # Push the struct's fields for further flattening
                projected_df = df.select(column_name + ".*")
                stack.append((parents + (column_name,), projected_df))
            else:
                # Leaf: select by dotted path, alias with underscores
                columns.append(col(".".join(parents + (column_name,)))
                               .alias("_".join(parents + (column_name,))))
    return nested_df.select(columns)
I also need to handle empty structs, empty arrays, empty lists, and empty dicts, since the data may contain empty values. How can I achieve this using PySpark?
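There is no single built-in for this, but one common approach is to inspect the schema recursively and explode as you go. Below is a minimal sketch (the function name flatten_to_leaves and the naming scheme are my own, not from the question): it assumes Spark has mapped the JSON lists/dicts to ArrayType/MapType, and it uses explode_outer so empty or null arrays and maps survive as null rows rather than dropping records; empty structs simply disappear.

from pyspark.sql.functions import col, explode_outer
from pyspark.sql.types import ArrayType, MapType, StructType

def flatten_to_leaves(df):
    # Flatten one nested column per pass, then recurse until only leaves remain
    for field in df.schema.fields:
        name, dtype = field.name, field.dataType
        others = [c for c in df.columns if c != name]
        if isinstance(dtype, StructType):
            # Promote struct fields to top-level columns named parent_child;
            # an empty struct simply disappears (no fields to promote)
            expanded = [col(name + "." + f.name).alias(name + "_" + f.name)
                        for f in dtype.fields]
            return flatten_to_leaves(df.select(*others, *expanded))
        if isinstance(dtype, ArrayType):
            # One row per element; explode_outer keeps a null row for
            # empty or null arrays instead of dropping the record
            return flatten_to_leaves(df.withColumn(name, explode_outer(col(name))))
        if isinstance(dtype, MapType):
            # Split each map entry into key/value columns
            return flatten_to_leaves(df.select(
                *others,
                explode_outer(col(name)).alias(name + "_key", name + "_value")))
    return df  # nothing nested left

Keep in mind that each explode multiplies rows, so the flattened output has one row per array element or map entry.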
I have this code:

import json

def replaceJSONFilesList(JSONFilePath, JSONsDataPath, newJSONData):
    JSONFileHandleOpen = open(JSONFilePath, 'r')
    ReadedJSONObjects = json.load(JSONFileHandleOpen)
    JSONFileHandleOpen.close()

    ReadedJSONObjectsModifyingSector = ReadedJSONObjects[JSONsDataPath]
    for newData in newJSONData:
        ReadedJSONObjectsModifyingSector.append(newData)

    JSONFileHandleWrite = open(JSONFilePath, 'w')
    json.dump(ReadedJSONObjects, JSONFileHandleWrite)
    JSONFileHandleWrite.close()

def modifyJSONFile(Path):
    JSONFilePath = '/path/file'
    JSONsDataPath = "['first']['second']"
    newJSONData = 'somedata'
    replaceJSONFilesList(JSONFilePath, JSONsDataPath, newJSONData)
Now I have an error:
KeyError: "['first']['second']"
But if I try:
ReadedJSONObjectsModifyingSector = ReadedJSONObjects['first']['second']
Everything is okay.
How should I pass the path to the list inside the JSON dictionary from one function to the other?
You cannot pass language syntax elements as if they were data strings. Similarly, you could not pass the string "2 > 1 and False" and expect the function to insert it into an if condition.
Instead, extract the data items and pass them as separate strings (which matches their syntax in the calling routine), or as a tuple of strings. For instance:
JSONsDataPath = ('first', 'second')
...
Then, inside the function ...
ReadedJSONObjects[JSONsDataPath[0]][JSONsDataPath[1]]
If you have a variable sequence of indices, then you need to write code to handle that case; research that on Stack Overflow.
The iterative way to handle an unknown number of indices is like this:
obj = ReadedJSONObjects
for index in JSONsDataPath:
    obj = obj[index]
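Putting the pieces together, the original function might be rewritten like this (a sketch; the file path and data values are the placeholders from the question, and newJSONData is passed as a list so the loop appends whole items rather than characters):

import json

def replaceJSONFilesList(JSONFilePath, JSONsDataPath, newJSONData):
    with open(JSONFilePath, 'r') as JSONFileHandle:
        ReadedJSONObjects = json.load(JSONFileHandle)

    # Walk down the nested keys to reach the target list
    ReadedJSONObjectsModifyingSector = ReadedJSONObjects
    for index in JSONsDataPath:
        ReadedJSONObjectsModifyingSector = ReadedJSONObjectsModifyingSector[index]

    for newData in newJSONData:
        ReadedJSONObjectsModifyingSector.append(newData)

    with open(JSONFilePath, 'w') as JSONFileHandle:
        json.dump(ReadedJSONObjects, JSONFileHandle)

replaceJSONFilesList('/path/file', ('first', 'second'), ['somedata'])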
I have a large (>500k rows) pandas DataFrame, like so:

orig_df = pd.DataFrame(columns=['id', 'free_text1', 'something_inert', 'free_text2'])

free_textX is a string field containing user input imported from a CSV. The goal is to have a function func that runs various checks on each row of free_textX and then performs Levenshtein fuzzy text recognition based on the contents of another DataFrame, reference. Something like:
from rapidfuzz import process

LEVENSHTEIN_DIST = 25

def func(s) -> str:
    if s == "25":
        return s
    elif s == "nothing":
        return "something"
    else:
        s2 = process.extractOne(
            query=s,
            choices=reference['col_name'],
            score_cutoff=LEVENSHTEIN_DIST
        )
        return s2
After this, a new column called recog_textX has to be inserted after free_textX, containing the values returned by func.
I tried vectorization (for performance) like so:
orig_df.insert(loc=new_col_index,  # calculated beforehand
               column='recog_textX',
               value=func(orig_df['free_textX']))

def func(series) -> pd.core.series.Series:
    ...
but I don't understand how to structure func so it handles an entire DataFrame column as a Series (as vectorization demands, right?), given that process.extractOne(...) -> str handles single strings rather than a Series. Those interface concepts seem incompatible to me, but I want to avoid a classic iteration here for performance reasons. My grasp of pandas is too shallow here. Help me out?
I may be missing a point, but you can use the apply function to get what I think you want:
orig_df['recog_textX'] = orig_df['free_textX'].apply(func)
This will create a new column 'recog_textX' by applying your function func to each element of the 'free_textX' column.
Let me know if I misunderstood your question.
As an aside, I do not think vectorizing this operation will make any difference speed-wise, given that each application of func() is a complicated string operation. But it does look nicer than looping through the rows.
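If the new column must sit directly after its source column, apply combines with insert like this (a sketch using free_text1; new_col_index mirrors the loc calculation mentioned in the question):

new_col_index = orig_df.columns.get_loc('free_text1') + 1  # position right after the source column
orig_df.insert(loc=new_col_index,
               column='recog_text1',
               value=orig_df['free_text1'].apply(func))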
I have a long text file containing a number of strings. Here is part of the file:
tyh89= 13
kb2= 0
78%= yes
###bb1= 7634.0
iih54= 121
fgddd= no
#aa1= 0
#aa2= 1
#$ac3= 0
yt##hh= 0
#j= 12.1
##hf= no
So basically, all elements share a common structure: header= value. My goal is to search for elements whose headers contain specific substrings and read out those elements' values.
At the moment I do it with a rather straightforward approach: open/read the whole file as a string, split it into a list of elements, and run if/elif conditions over all elements in a for loop. I provide my code below.
Is this the most efficient way to do it? Or is there a more efficient way that avoids the loop?
def main():
    print(list(import_param()))

def import_param():
    fl = open('filename', 'r')
    cn = fl.read()
    cn = cn.split('\n')
    fl.close()
    for st in cn:
        if 'fgddd' in st:
            el = st.split(' ')
            yield float(el[1])
        elif '#j' in st:
            el = st.split(' ')
            yield float(el[1])

if __name__ == '__main__':
    main()
Yes, there is. Avoid testing whether a string contains a substring; focus on string equality instead.
Once you settle for equality, you can create a set of the known keywords, split each line on =, and test whether the set contains your key (an O(1) lookup):
key_set = {"fgddd","#j"}
for st in cn:
if '=' in st:
key,value = st.split("=",1)
if key in key_set:
el = value.strip()
yield float(el)
If you have different types, use a dictionary to convert each value to the proper type according to its key:
key_set = {"fgddd":float ,"#j": float, "whatever":int , "something":str}
for st in cn:
if '=' in st:
key,value = st.split("=",1)
if key in key_set:
el = value.strip()
yield key_set[key](el) # apply type conversion
Note that if you don't want any conversion, str will do the job, since it returns the string itself when passed a string.
Final note: if you have a say in the input format, I suggest using JSON instead of a custom format. Parsing becomes trivial with the json module, and filtering can be done the same way I've shown.
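For illustration, if the same data lived in a JSON file (a hypothetical filename.json holding something like {"fgddd": "no", "#j": 12.1}), the whole parse-and-filter step collapses to a lookup:

import json

with open('filename.json') as fh:   # hypothetical JSON version of the input
    data = json.load(fh)

wanted = {"fgddd", "#j"}
values = {key: data[key] for key in wanted if key in data}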
I need to write a parameterized for loop.
# This works but...
df["ID"] = np_get_defined(df["methodA" + "ID"], df["methodB" + "ID"], df["methodC" + "ID"])

# I need a for loop as follows
df["ID"] = np_get_defined(df[sm + "ID"] for sm in strmethods)
and I get the following error:
ValueError: Length of values does not match length of index
Remaining definitions:
import numpy as np
import pandas as pd  # needed for pd.isnull

# df is a pandas DataFrame
strmethods = ['methodA', 'methodB', 'methodC']

def get_defined(*args):
    strs = [str(arg) for arg in args
            if not pd.isnull(arg) and 'N/A' not in str(arg) and arg != '0']
    return ''.join(strs) if strs else None

np_get_defined = np.vectorize(get_defined)
df["ID"]=np_get_defined(df[sm+"ID"] for sm in strmethods) means you're passing a generator as single argument to the called method.
If you want to expand the generated sequence to a list of arguments use the * operator:
df["ID"] = np_get_defined(*(df[sm + "ID"] for sm in strmethods))
# or:
df["ID"] = np_get_defined(*[df[sm + "ID"] for sm in strmethods])
The first uses a generator and unpacks its elements; the second uses a list comprehension instead. The result is the same in either case.
I think the reason why it doesn't work is that your DataFrame consists of columns with different lengths.
I have a mat-file that I accessed using
from scipy import io
mat = io.loadmat('example.mat')
From MATLAB, example.mat contains the following structs:
>> load example.mat
>> data1
data1 =
LAT: [53x1 double]
LON: [53x1 double]
TIME: [53x1 double]
units: {3x1 cell}
>> data2
data2 =
LAT: [100x1 double]
LON: [100x1 double]
TIME: [100x1 double]
units: {3x1 cell}
In MATLAB, I can access the data as easily as data2.LON, etc. It's not as trivial in Python; tab completion on mat gives me several options, like:
mat.clear mat.get mat.iteritems mat.keys mat.setdefault mat.viewitems
mat.copy mat.has_key mat.iterkeys mat.pop mat.update mat.viewkeys
mat.fromkeys mat.items mat.itervalues mat.popitem mat.values mat.viewvalues
Is it possible to preserve the same structure in Python? If not, how best to access the data? The Python code I am using at present is very difficult to work with.
Thanks
Found this tutorial about MATLAB structs and Python:
http://docs.scipy.org/doc/scipy/reference/tutorial/io.html
When I need to load data into Python from MATLAB that is stored in an array of structs {struct_1, struct_2}, I extract a list of keys and values from the object loaded with scipy.io.loadmat. I can then assemble these into their own variables or, if needed, repackage them into a dictionary. The use of the exec command may not be appropriate in all cases, but if you are just trying to process data it works well.
import numpy as np
import scipy.io as sio

# Load the data into Python
D = sio.loadmat('data.mat')

# Build a list of keys and values for each entry in the structure
vals = D['results'][0, 0]  # <-- set the array you want to access
keys = D['results'][0, 0].dtype.descr

# Assemble the keys and values into variables with the same names as those used in MATLAB
for i in range(len(keys)):
    key = keys[i][0]
    val = np.squeeze(vals[key][0][0])  # squeeze converts MATLAB (1,n) arrays into numpy (n,) arrays
    exec(key + '=val')
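If exec feels too permissive, the same loop can pack the values into a dictionary instead, as mentioned above (a sketch under the same assumptions about data.mat):

results = {}
for i in range(len(keys)):
    key = keys[i][0]
    results[key] = np.squeeze(vals[key][0][0])
# access as results['LAT'], results['TIME'], etc.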
This will return the mat structure as a dictionary:
import scipy.io as sio

def _check_keys(d):
    """
    Checks whether entries in the dictionary are mat-objects. If yes,
    _todict is called to change them to nested dictionaries.
    """
    for key in d:
        if isinstance(d[key], sio.matlab.mio5_params.mat_struct):
            d[key] = _todict(d[key])
    return d

def _todict(matobj):
    """
    A recursive function which constructs nested dictionaries from mat-objects.
    """
    d = {}
    for strg in matobj._fieldnames:
        elem = matobj.__dict__[strg]
        if isinstance(elem, sio.matlab.mio5_params.mat_struct):
            d[strg] = _todict(elem)
        else:
            d[strg] = elem
    return d

def loadmat(filename):
    """
    This function should be called instead of scipy.io.loadmat directly,
    as it cures the problem of not properly recovering Python dictionaries
    from mat files. It calls _check_keys to cure all entries
    which are still mat-objects.
    """
    data = sio.loadmat(filename, struct_as_record=False, squeeze_me=True)
    return _check_keys(data)
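With this wrapper, the nested fields read much like they do in MATLAB (a usage sketch based on the example.mat from the question):

mat = loadmat('example.mat')
lon = mat['data2']['LON']   # the analogue of data2.LON in MATLAB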
(!) In the case of nested structures saved in *.mat files, it is necessary to check whether the items in the dictionary that io.loadmat outputs are MATLAB structs. For example, if in MATLAB:
>> thisStruct
ans =
var1: [1x1 struct]
var2: 3.5
>> thisStruct.var1
ans =
subvar1: [1x100 double]
subvar2: [32x233 double]
Then use the code by mergen in scipy.io.loadmat nested structures (i.e. dictionaries).