Does anyone have a neat way of packing a dataframe including some columns which indicate hierarchy into a nested array?
Say I have the following data frame:
from pandas import DataFrame
df = DataFrame(
    {
        "var1": [1, 2, 3, 4, 9],
        "var2": [5, 6, 7, 8, 9],
        "group_1": [1, 1, 1, 1, 2],
        "group_2": [None, 1, 2, 1, None],
        "group_3": [None, None, None, 1, None],
    }
)
var1 var2 group_1 group_2 group_3
0 1 5 1 NaN NaN
1 2 6 1 1.0 NaN
2 3 7 1 2.0 NaN
3 4 8 1 1.0 1.0
4 9 9 2 NaN NaN
The group_ columns show that the records on the 2nd and 3rd rows are children of the one on the first row. The 4th row is a child of the 2nd row, and the last row of the table has no children. I am looking to derive something like the following:
[
    {
        "var1": 1,
        "var2": 5,
        "children": [
            {
                "var1": 2,
                "var2": 6,
                "children": [{"var1": 4, "var2": 8, "children": []}],
            },
            {"var1": 3, "var2": 7, "children": []},
        ],
    },
    {"var1": 9, "var2": 9, "children": []},
]
You could check whether the following recursive .groupby over the group_n columns works for you:
import pandas as pd

def nest_it(df, level=1):
    record = {"var1": None, "var2": None, "children": []}
    for key, gdf in df.groupby(f"group_{level}", dropna=False):
        if pd.isna(key):  # no deeper group: this row carries the vars
            record["var1"], record["var2"] = map(int, gdf.iloc[0, 0:2])
        elif level == 3:  # deepest group column reached: append a leaf child
            var1, var2 = map(int, gdf.iloc[0, 0:2])
            record["children"].append({"var1": var1, "var2": var2, "children": []})
        else:  # recurse one level deeper
            record["children"].append(nest_it(gdf, level=level + 1))
    return record

result = nest_it(df)["children"]
While going over the (key, group) tuples from a (nested) df.groupby("group_n"), three things can happen:
The key is a NaN, i.e. it's time to record the vars, and there aren't any more children.
The level is 3, i.e. the deepest group column is reached, so it's also time to wrap up, but this time as a child.
Otherwise (recursion): put the recursively retrieved children into the respective children list.
Remark: I've only initialized the record dicts up front to get the item order as in your expected output.
Result for the sample:
[{'var1': 1,
'var2': 5,
'children': [{'var1': 2,
'var2': 6,
'children': [{'var1': 4, 'var2': 8, 'children': []}]},
{'var1': 3, 'var2': 7, 'children': []}]},
{'var1': 9, 'var2': 9, 'children': []}]
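If the hierarchy isn't fixed at three levels, the same idea generalizes by deriving the deepest level from the number of group_<n> columns. Here is a sketch under the assumption that those columns follow the group_1, group_2, ... naming scheme (nest_it_any_depth is a hypothetical variant name, not part of the answer above):
import pandas as pd

def nest_it_any_depth(df, level=1):
    # deepest level = number of group_<n> columns (assumes that naming scheme)
    max_level = sum(c.startswith("group_") for c in df.columns)
    record = {"var1": None, "var2": None, "children": []}
    for key, gdf in df.groupby(f"group_{level}", dropna=False):
        if pd.isna(key):
            record["var1"], record["var2"] = map(int, gdf.iloc[0, 0:2])
        elif level == max_level:
            var1, var2 = map(int, gdf.iloc[0, 0:2])
            record["children"].append({"var1": var1, "var2": var2, "children": []})
        else:
            record["children"].append(nest_it_any_depth(gdf, level=level + 1))
    return record

result = nest_it_any_depth(df)["children"]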
I have a dataframe like this:
import pandas as pd
frame = {'location': {(2, 'eng', 'US'): {"['sc']": 3, "['delhi']": 2, "['sonepat', 'delhi']": 1, "['new delhi']": 1}}}
df = pd.DataFrame(frame)
df.head()
Output
location
2 eng US {"['sc']": 3, "['delhi']": 2, "['sonepat', 'delhi']": 1, "['new delhi']": 1}
And I want to change the keys inside the column df['location'] so that they are not lists, like:
location
2 eng US {'sc': 3, 'delhi': 2, 'sonepat', 'delhi': 1, 'new delhi': 1}
You could try this:
import pandas as pd
df = pd.DataFrame(
    {
        "location": {
            (2, "eng", "US"): {
                "['sc']": 3,
                "['delhi']": 2,
                "['sonepat', 'delhi']": 1,
                "['new delhi']": 1,
            }
        }
    }
)
# note: str(x) turns the cell into a plain string, not a dict
df["location"] = df["location"].apply(
    lambda x: str(x).replace("['", "").replace("']", "")
)
print(df)
Output
location
2 eng US {"sc": 3, "delhi": 2, "sonepat', 'delhi": 1, "...
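If you also want the multi-item keys cleaned up (the plain string replace leaves the inner quotes of "sonepat', 'delhi" behind, as the output above shows), here is a sketch that parses each key with ast.literal_eval and joins the items. It assumes every key is the string repr of a list of strings, and it must run on the original frame, before the string conversion above:
import ast

# parse each "['a', 'b']"-style key back into a list and join its items
df["location"] = df["location"].apply(
    lambda d: {", ".join(ast.literal_eval(k)): v for k, v in d.items()}
)
print(df["location"].iloc[0])
# {'sc': 3, 'delhi': 2, 'sonepat, delhi': 1, 'new delhi': 1}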
How can I take one row, compute a range around its values, and find the other rows falling within that range, building a dictionary with each id as key and the ids falling in its range as values, using multiprocessing?
Suppose I have a data frame:
id val1 val2
1 10 20
2 9.5 19
3 100 200
4 9.3 19.2
5 96 196
6 99 198
7 103 202
8 140 280
For each id i, I will calculate:
upper_val1 = df[df.id==i].val1 * (1+0.1)
lower_val1 = df[df.id==i].val1 * (1-0.1)
upper_val2 = df[df.id==i].val2 * (1+0.1)
lower_val2 = df[df.id==i].val2 * (1-0.1)
Subset df:
sub_df = df[(df.val1 <= upper_val1) & (df.val1 >= lower_val1) &
            (df.val2 <= upper_val2) & (df.val2 >= lower_val2)]
Whichever ids have values lying within this range will be put into the dictionary. For example, the output for this df will be:
{1:[2,4], 2:[1,4], 4:[1,2], 3:[5,6,7], 5:[3,6,7], 6:[3,5,7], 7:[3,5,6]}
I have a data frame with millions of records, and this step should be repeated for each row, so how can it be done using multiprocessing?
To accomplish this, we'll apply a function over the dataframe that, for each row, computes the IDs of the rows whose values lie within a range of that row's values.
import pandas as pd

df = pd.DataFrame.from_records([
    {'id': 1, 'val1': 10.0, 'val2': 20.0},
    {'id': 2, 'val1': 9.5, 'val2': 19.0},
    {'id': 3, 'val1': 100.0, 'val2': 200.0},
    {'id': 4, 'val1': 9.3, 'val2': 19.2},
    {'id': 5, 'val1': 96.0, 'val2': 196.0},
    {'id': 6, 'val1': 99.0, 'val2': 198.0},
    {'id': 7, 'val1': 103.0, 'val2': 202.0},
    {'id': 8, 'val1': 140.0, 'val2': 280.0},
])
def rows_in_range(row):
    # row.name gives the index value for that row
    index = row.name
    val1 = row['val1']
    val2 = row['val2']
    # rows whose val1 and val2 both lie within ±10% of this row's values,
    # excluding the row itself
    return (df['val1'].between(val1 * (1 - .1), val1 * (1 + .1)) &
            df['val2'].between(val2 * (1 - .1), val2 * (1 + .1)) &
            (df.index != index))

indices = df.apply(
    lambda row: df.loc[rows_in_range(row), 'id'].tolist(),
    axis=1,
    result_type='reduce'
)
indices
0 [2, 4]
1 [1, 4]
2 [5, 6, 7]
3 [1, 2]
4 [3, 6, 7]
5 [3, 5, 7]
6 [3, 5, 6]
7 []
dtype: object
Now all we have to do is convert the index of this to IDs as well.
indices.index = df.loc[indices.index, 'id']
indices.to_dict()
{1: [2, 4],
2: [1, 4],
3: [5, 6, 7],
4: [1, 2],
5: [3, 6, 7],
6: [3, 5, 7],
7: [3, 5, 6],
8: []}
I'm curious whether this is performant enough for millions of rows, but at least it's correct.
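As for the multiprocessing part of the question, here is a minimal sketch with the standard library's multiprocessing.Pool, splitting the row labels into chunks. It assumes a fork-based start method (e.g. Linux) so each worker inherits df and rows_in_range, and the chunk/worker count of 4 is arbitrary. Note that the work stays quadratic overall; parallelism only divides the constant.
import numpy as np
from multiprocessing import Pool

def ids_for_chunk(index_chunk):
    # build the {id: [matching ids]} mapping for one chunk of row labels
    return {
        df.loc[i, 'id']: df.loc[rows_in_range(df.loc[i]), 'id'].tolist()
        for i in index_chunk
    }

if __name__ == '__main__':
    chunks = np.array_split(df.index.to_numpy(), 4)  # one chunk per worker
    with Pool(4) as pool:
        partials = pool.map(ids_for_chunk, chunks)
    result = {k: v for part in partials for k, v in part.items()}
    print(result)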
I converted a JSON into a DataFrame and ended up with a column 'Structure_value' holding the values below as lists of dictionaries:
Structure_value
[{'Room': 6, 'Length': 7}, {'Room': 6, 'Length': 7}]
[{'Room': 6, 'Length': 22}]
[{'Room': 6, 'Length': 8}, {'Room': 6, 'Length': 9}]
Since it is an object, I guess it ended up in this format.
I need to split it into below four columns:
Structure_value_room_1
Structure_value_length_1
Structure_value_room_2
Structure_value_length_2
All other solutions on StackOverflow only deal with converting simple JSON into a DataFrame, not this nested structure.
P.S.: I know I can do something by explicitly naming the fields, but I need a generic solution so that any JSON of this format can be handled in the future.
[Edit]: The output should look like this:
Structure_value_room_1 Structure_value_length_1 Structure_value_room_2 \
0 6 7 6.0
1 6 22 NaN
2 6 8 6.0
Structure_value_length_2
0 7.0
1 NaN
2 9.0
Use a list comprehension with a nested dictionary comprehension, using enumerate to deduplicate the keys of the dicts; last, pass the list of dictionaries to the DataFrame constructor:
L = [{f"{k}_{i}": v for i, y in enumerate(x, 1)
                    for k, v in y.items()}
     for x in df["Structure_value"]]
df = pd.DataFrame(L)
print(df)
Room_1 Length_1 Room_2 Length_2
0 6 7 6.0 7.0
1 6 22 NaN NaN
2 6 8 6.0 9.0
For the column names from the question, use:
def json_to_df(df, column):
    L = [{f"{column}_{k.lower()}_{i}": v for i, y in enumerate(x, 1)
                                         for k, v in y.items()}
         for x in df[column]]
    return pd.DataFrame(L)

df1 = json_to_df(df, 'Structure_value')
print(df1)
Structure_value_room_1 Structure_value_length_1 Structure_value_room_2 \
0 6 7 6.0
1 6 22 NaN
2 6 8 6.0
Structure_value_length_2
0 7.0
1 NaN
2 9.0
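If you need the widened columns back alongside the rest of the original frame, they can be concatenated on the shared default index. A short note, assuming df still holds the original frame (with the Structure_value column) and the row order is unchanged:
# drop the list-of-dicts column and attach the widened columns
out = pd.concat([df.drop(columns=["Structure_value"]), df1], axis=1)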
A non-Pandas solution you can probably apply to your original JSON data, here represented by rows:
import pprint
rows = [
    {"Foo": "1", "Structure": [{'Room': 6, 'Length': 7}, {'Room': 6, 'Length': 7}]},
    {"Foo": "2", "Structure": [{'Room': 6, 'Length': 22}]},
    {"Foo": "3", "Structure": [{'Room': 6, 'Length': 8}, {'Room': 6, 'Length': 9}]},
]
for row in rows:  # modifies `rows` in-place
    for index, room_info in enumerate(row.pop("Structure", ()), 1):
        for key, value in room_info.items():
            row[f"Structure_value_{key.lower()}_{index}"] = value
pprint.pprint(rows)
outputs
[{'Foo': '1',
'Structure_value_length_1': 7,
'Structure_value_length_2': 7,
'Structure_value_room_1': 6,
'Structure_value_room_2': 6},
{'Foo': '2', 'Structure_value_length_1': 22, 'Structure_value_room_1': 6},
{'Foo': '3',
'Structure_value_length_1': 8,
'Structure_value_length_2': 9,
'Structure_value_room_1': 6,
'Structure_value_room_2': 6}]
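If a dataframe is the end goal, the flattened records feed straight into the constructor; keys missing from a record (row "2" here) simply become NaN:
import pandas as pd

print(pd.DataFrame(rows))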
I need to shift a grouped data frame by a dynamic number. I can do it with apply, but the performance is not very good.
Any way to do that without apply?
Here is a sample of what I would like to do:
df = pd.DataFrame({
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
    'SHIFT': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})
df['SUM'] = df.groupby('GROUP').VALUE.cumsum()
# THIS DOESN'T WORK:
df['SUM'] = df.groupby('GROUP').SUM.shift(df.SHIFT)
I do it with apply in the following way:
df = pd.DataFrame({
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
    'SHIFT': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})

def func(group):
    s = group.SHIFT.iloc[0]
    group['SUM'] = group.SUM.shift(s)
    return group

df['SUM'] = df.groupby('GROUP').VALUE.cumsum()
df = df.groupby('GROUP').apply(func)
Here is a pure numpy version that works if the data frame is sorted by group (like your example):
import numpy as np

# these rows are not null after shifting
notnull = np.where(df.groupby('GROUP').cumcount() >= df['SHIFT'])[0]
# source rows for the rows above
source = notnull - df['SHIFT'].values[notnull]
shifted = np.empty(df.shape[0])
shifted[:] = np.nan
shifted[notnull] = df.groupby('GROUP')['VALUE'].cumsum().values[source]
df['SUM'] = shifted
It first gets the positions of the rows that are to be updated; subtracting the shifts yields the source rows. For the sample, in group A (SHIFT=2) positions 2-5 receive the cumulative sums from positions 0-3, and in group B (SHIFT=3) positions 9-11 receive those from positions 6-8.
A solution that avoids apply could be the following, if the groups are contiguous:
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'GROUP': ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
    'VALUE': [1, 2, 3, 4, 5, 6, 7, 8, 9, 0, 1, 2],
    'SHIFT': [2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3]
})

# compute values required for the slices
_, start = np.unique(df.GROUP.values, return_index=True)
gp = df.groupby('GROUP')
shifts = gp.SHIFT.first()
sizes = gp.size().values
end = (sizes - shifts.values) + start

# compute slices
source = [i for s, f in zip(start, end) for i in range(s, f)]
target = [i for j, s, f in zip(start, shifts, sizes) for i in range(j + s, j + f)]

# compute the cumulative sum and an array of NaN
s = gp.VALUE.cumsum().values
r = np.empty_like(s, dtype=np.float32)
r[:] = np.nan

# set the sums on the array of NaN
np.put(r, target, s[source])

# set the SUM column
df['SUM'] = r
print(df)
Output
   GROUP  VALUE  SHIFT   SUM
0      A      1      2   NaN
1      A      2      2   NaN
2      A      3      2   1.0
3      A      4      2   3.0
4      A      5      2   6.0
5      A      6      2  10.0
6      B      7      3   NaN
7      B      8      3   NaN
8      B      9      3   NaN
9      B      0      3   7.0
10     B      1      3  15.0
11     B      2      3  24.0
With the exception of building the slices (source and target), all computations are done at the pandas/numpy level, which should be fast. The idea is to manually simulate what would be done in the apply function.
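For the sample, a quick trace of the index arithmetic in the snippet above (run directly after it):
print(start.tolist())   # [0, 6]  first row position of each group
print(shifts.tolist())  # [2, 3]  first SHIFT value per group (A, B)
print(end.tolist())     # [4, 9]  sizes - shifts + start
print(source)           # [0, 1, 2, 3, 6, 7, 8]
print(target)           # [2, 3, 4, 5, 9, 10, 11]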
What is the best way to traverse a dictionary recursively?
Can I do it with a lambda and/or a list comprehension?
I have:
[
    {
        "id": 1,
        "children": [
            {
                "id": 2,
                "children": []
            }
        ]
    },
    {
        "id": 3,
        "children": []
    },
    {
        "id": 4,
        "children": [
            {
                "id": 5,
                "children": [
                    {
                        "id": 6,
                        "children": [
                            {
                                "id": 7,
                                "children": []
                            }
                        ]
                    }
                ]
            }
        ]
    }
]
I want:
[1,2,3,4,5,6,7]
You can recursively traverse your dictionaries with this generic generator function:
def rec(current_object):
    if isinstance(current_object, dict):
        yield current_object["id"]
        for item in rec(current_object["children"]):
            yield item
    elif isinstance(current_object, list):
        for items in current_object:
            for item in rec(items):
                yield item

print(list(rec(data)))  # data is the list from the question
# [1, 2, 3, 4, 5, 6, 7]
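On Python 3.3+ the inner loops can be collapsed with yield from; a shorter equivalent sketch:
def rec(current_object):
    if isinstance(current_object, dict):
        yield current_object["id"]
        yield from rec(current_object["children"])
    elif isinstance(current_object, list):
        for item in current_object:
            yield from rec(item)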
The easiest way to do this will be with a recursive function:
recursive_function = lambda x: [x['id']] + [item for child in x['children'] for item in recursive_function(child)]
result = [item for topnode in whatever_your_list_is_called for item in recursive_function(topnode)]
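For the sample data from the question this yields the flat pre-order list:
print(result)
# [1, 2, 3, 4, 5, 6, 7]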
My solution:
results = []

def function(lst):
    for item in lst:
        results.append(item.get('id'))
        function(item.get('children'))

function(l)  # `l` is the list of dicts from the question
print(results)
[1, 2, 3, 4, 5, 6, 7]
The dicter library can be useful. You can easily flatten or traverse the dictionary paths.
pip install dicter
import dicter as dt
# Example dict:
d = {'level_a': 1, 'level_b': {'a': 'hello world'}, 'level_c': 3, 'level_d': {'a': 1, 'b': 2, 'c': {'e': 10}}, 'level_e': 2}
# Walk through dict to get all paths
paths = dt.path(d)
print(paths)
# [[['level_a'], 1],
# [['level_c'], 3],
# [['level_e'], 2],
# [['level_b', 'a'], 'hello world'],
# [['level_d', 'a'], 1],
# [['level_d', 'b'], 2],
# [['level_d', 'c', 'e'], 10]]
The first column is the key path; the second column holds the values. In your case, you can take the last element of each key path in the first column.
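For example, pulling the last key out of every path, against the example output above:
# take the last key of every [key_path, value] pair
last_keys = [key_path[-1] for key_path, value in paths]
print(last_keys)
# ['level_a', 'level_c', 'level_e', 'a', 'a', 'b', 'e']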