This post covered "Modification of a function to return a dataframe with specified values", and I would like to modify the output further. The current function and its vectorized version get all combinations of columns subtracted from each other and return the relevant data accordingly.
Example and test data:
import pandas as pd
import numpy as np
from itertools import combinations
df2 = pd.DataFrame(
    {'AAA' : [80,5,6],
     'BBB' : [85,20,30],
     'CCC' : [100,50,25],
     'DDD' : [98,50,25],
     'EEE' : [103,50,25],
     'FFF' : [105,50,25],
     'GGG' : [109,50,25]})
df2
   AAA  BBB  CCC  DDD  EEE  FFF  GGG
0   80   85  100   98  103  105  109
1    5   20   50   50   50   50   50
2    6   30   25   25   25   25   25
v = df2.values
df3 = df2.mask((np.abs(v[:, :, None] - v[:, None]) <= 5).sum(-1) <= 1)
df3
    AAA   BBB  CCC  DDD  EEE  FFF  GGG
0  80.0  85.0  100   98  103  105  109
1   NaN   NaN   50   50   50   50   50
2   NaN  30.0   25   25   25   25   25
All values within thresh (5 here) of at least one other value in the same row are kept, via the np.abs(...) <= 5 comparison applied per row.
What needs to change?
On the first row of df3 there are two clusters of values within thresh: (80, 85) and (100, 98, 103, 105, 109). All of them are valid, but they form two separate groups, since the groups themselves are not within thresh of each other. I would like to be able to separate these groups based on another thresh value.
I have attempted to demonstrate what I am looking for with the following (flawed) code; I am only including it to show that I'm attempting to progress this myself.
df3.mask(df3.apply(lambda x: x >= df3.T.max() - (thresh * 3))) \
   .dropna(thresh=2).dropna(axis=1)
AAA BBB
0 80.0 85.0
df3.mask(~df3.apply(lambda x : x >= df3.T.max() - (thresh * 3))).dropna(axis=1)
CCC DDD EEE FFF GGG
0 100 98 103 105 109
1 50 50 50 50 50
2 25 25 25 25 25
So my output is close to the desired result, but the way I got there is not so nice...
Desired output:
I have used multiple rows to demonstrate, but when I use this code it will only be one row that needs to be output and split. So the desired output is to return the separate groups of columns, as in this example for row 0:
CCC DDD EEE FFF GGG
0 100 98 103 105 109
and
AAA BBB
0 80.0 85.0
I felt this was deserving of a separate answer.
I wrote a clustering function that operates on one-dimensional arrays. I know how to vectorize it further to two dimensions, but I haven't gotten to it yet. As it is, I use np.apply_along_axis.
This function is described in this answer to this question. I encourage you to follow the links and see the work that went into getting this seemingly simple function.
What it does is find the clusters within an array defined by margins to the left and right of every point. It sorts, then clusters, then un-sorts.
delta clustering function
def delta_cluster(a, dleft, dright):
    # sort, and remember how to unsort afterwards
    s = a.argsort()
    y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))
    # a point starts a cluster if nothing lies within dleft to its left
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng
    # ... and if the previous point's right margin does not reach it
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng
    # cumulative count of cluster starts gives the label; unsort at the end
    return (starts & ends).cumsum()[y]
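A quick sanity check (my own example, not from the original answer) on row 0 of df2, with a margin of 10 on either side of each point:

row0 = np.array([80, 85, 100, 98, 103, 105, 109])
delta_cluster(row0, 10, 10)
# array([1, 1, 2, 2, 2, 2, 2]) -> (80, 85) form cluster 1,
# the remaining five values form cluster 2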
Onto the problem at hand
Use the cluster function for each row in df2 with np.apply_along_axis and construct a DataFrame named clusters that mirrors the same index and columns as df2. Then stack to get a Series which will make it easier to manipulate later.
clusters = pd.DataFrame(
    np.apply_along_axis(delta_cluster, 1, df2.values, 10, 10),
    df2.index, df2.columns).stack()
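For reference, the labels this produces (reshaped back to the original layout, computed from the function above) are:

clusters.unstack()

   AAA  BBB  CCC  DDD  EEE  FFF  GGG
0    1    1    2    2    2    2    2
1    1    2    3    3    3    3    3
2    1    2    2    2    2    2    2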
This describes the next block of code:
- I need to keep the row information of df2 when I do a groupby.
- Use transform to get the size of the cluster for each row.
- stack the values of df2 and append the cluster values as part of the index. This enables the separation you are looking for.
- mask val where size is equal to 1. These are singleton clusters.
lvl0 = clusters.index.get_level_values(0)
size = clusters.groupby([lvl0, clusters]).transform('size')
val = df2.stack().to_frame('value').set_index(clusters, append=True).value
val.mask(size.values == 1).dropna().unstack(1)
AAA BBB CCC DDD EEE FFF GGG
0 1 80.0 85.0 NaN NaN NaN NaN NaN
2 NaN NaN 100.0 98.0 103.0 105.0 109.0
1 3 NaN NaN 50.0 50.0 50.0 50.0 50.0
2 2 NaN 30.0 25.0 25.0 25.0 25.0 25.0
This matches your results except I split out the first row into two rows.
AAA BBB CCC DDD EEE FFF GGG
0 80.0 85.0 100 98 103 105 109
1 NaN NaN 50 50 50 50 50
2 NaN 30.0 25 25 25 25 25
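If you then need the two groups of row 0 as separate frames, as in the desired output, one possible follow-up (a sketch of mine, not part of the original answer) is to group the masked result by its cluster level:

res = val.mask(size.values == 1).dropna().unstack(1)
groups = [g.dropna(axis=1) for _, g in res.loc[[0]].groupby(level=1)]
# groups[0] has columns AAA, BBB; groups[1] has columns CCC..GGG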
Well, I think you can try to solve your problem differently. The idea is to get 'gaps and islands' within each row and label each group.
So, first, put your columns into rows and sort the values within each initial row index (note: sortlevel has since been removed from pandas; sort_index(level=0, sort_remaining=False) is the modern equivalent):
>>> df = df2.stack().sort_values().sortlevel(0, sort_remaining=False)
>>> df
0 AAA 80
BBB 85
DDD 98
CCC 100
EEE 103
FFF 105
GGG 109
1 AAA 5
BBB 20
GGG 50
FFF 50
DDD 50
CCC 50
EEE 50
2 AAA 6
GGG 25
EEE 25
DDD 25
CCC 25
FFF 25
BBB 30
Next, create a new DataFrame holding the 'prev' values alongside the current values:
>>> df = df2.stack().sort_values().sortlevel(0, sort_remaining=False)
>>> df = pd.concat([df, df.groupby(level=0).shift(1)], axis=1)
>>> df.columns = ['cur', 'prev']
>>> df
cur prev
0 AAA 80 NaN
BBB 85 80.0
DDD 98 85.0
CCC 100 98.0
EEE 103 100.0
FFF 105 103.0
GGG 109 105.0
1 AAA 5 NaN
BBB 20 5.0
GGG 50 20.0
FFF 50 50.0
DDD 50 50.0
CCC 50 50.0
EEE 50 50.0
2 AAA 6 NaN
GGG 25 6.0
EEE 25 25.0
DDD 25 25.0
CCC 25 25.0
FFF 25 25.0
BBB 30 25.0
And now, create the island labels:
>>> df = (df['cur'] - df['prev'] > thresh).astype('int')
>>> df
0 AAA 0
BBB 0
DDD 1
CCC 0
EEE 0
FFF 0
GGG 0
1 AAA 0
BBB 1
GGG 1
FFF 0
DDD 0
CCC 0
EEE 0
2 AAA 0
GGG 1
EEE 0
DDD 0
CCC 0
FFF 0
BBB 0
>>> df.groupby(level=0).cumsum().unstack()
AAA BBB CCC DDD EEE FFF GGG
0 0 0 1 1 1 1 1
1 0 1 2 2 2 2 2
2 0 1 1 1 1 1 1
Now you can filter out groups which have only one member and you're done :)
>>> dfm = df.groupby(level=0).cumsum().unstack()
>>> dfm
AAA BBB CCC DDD EEE FFF GGG
0 0 0 1 1 1 1 1
1 0 1 2 2 2 2 2
2 0 1 1 1 1 1 1
>>> df2[dfm == 0].loc[0:0].dropna(axis=1)
AAA BBB
0 80 85.0
>>> df2[dfm == 1].loc[0:0].dropna(axis=1)
CCC DDD EEE FFF GGG
0 100.0 98.0 103.0 105.0 109.0
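A more generic version of those last two manual steps (my own sketch, assuming, as stated, that only one row needs splitting) iterates over the island labels of that row:

row = 0
labels = dfm.loc[row]
for _, cols in labels.groupby(labels):
    if len(cols) > 1:                      # skip singleton islands
        print(df2.loc[[row], cols.index])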
method 1
I copied and pasted this from the previous question, including the minor change.
I vectorized and embedded your closeCols for some mind-numbing fun. Notice there is no apply.
- numpy broadcasting to get all combinations of columns subtracted from each other.
- np.abs
- <= 5
- sum(-1): I arranged the broadcasting such that the difference of, say, row 0, column AAA with all of row 0 is laid out across the last dimension. The -1 in sum(-1) says to sum across that last dimension.
- <= 1: every value is within 5 of itself, so I want the sum to be greater than 1. Thus, we mask everything whose count is less than or equal to one.
df2 = pd.DataFrame(
    {'AAA' : [80,5,6],
     'BBB' : [85,20,30],
     'CCC' : [100,50,25],
     'DDD' : [98,50,25],
     'EEE' : [103,50,25],
     'FFF' : [105,50,25],
     'GGG' : [109,50,25]})

v = df2.values

# let x be the distance threshold
# let k be the cluster size threshold
x, k = 5, 2  # cluster size must be greater than k

df3 = df2.mask((np.abs(v[:, :, None] - v[:, None]) <= x).sum(-1) <= k)
# note that this is the same as before, but k = 1 was hard coded

print(df3)
AAA BBB CCC DDD EEE FFF GGG
0 NaN NaN 100 98 103 105 NaN
1 NaN NaN 50 50 50 50 50.0
2 NaN 30.0 25 25 25 25 25.0
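To illustrate the knob (my own example, not from the original answer): raising k demands larger clusters.

# with k = 4, a value survives only if at least 5 values in its row
# (including itself) lie within x of it; row 0 is wiped out entirely,
# while the run of 50s and the 25/30 group survive
df2.mask((np.abs(v[:, :, None] - v[:, None]) <= x).sum(-1) <= 4)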
Related
I have a PySpark DataFrame that I want to pivot.
input_dataframe:

mdn    top_protocol_by_vol  top_vol  rank
55555  AAA                  30       1
55555  BBB                  20       2
55555  DDD                  10       3
9898   JJJ                  30       1
9898   CCC                  20       2
9898   FFF                  10       3
2030   PPP                  30       1
2030   KKK                  20       2
2030   FFF                  10       3
and I want to have something like this
output_dataframe:

mdn    top_protocol_by_vol_1  top_protocol_by_vol_2  top_protocol_by_vol_3  top_vol_1  top_vol_2  top_vol_3
2030   PPP                    KKK                    FFF                    30         20         10
9898   JJJ                    CCC                    FFF                    30         20         10
55555  AAA                    BBB                    DDD                    30         20         10
I know for sure that I can do something like this with Pandas using the code:
output_dataframe = input_dataframe.pivot_table(
    index='mdn', columns=['rank'],
    aggfunc=lambda x: ''.join(x) if isinstance(x, str) else x,
    dropna=True).reset_index()
output_dataframe.columns = ['_'.join([str(c) for c in col if c != ""])
                            for col in output_dataframe.columns.values]
How can I achieve the same results with PySpark, without converting to Pandas?
You can use the pivot function with first as the aggregate.
from pyspark.sql import functions as F

df = (df.groupby('mdn')
        .pivot('rank')
        .agg(F.first('top_protocol_by_vol').alias('top_protocol_by_vol'),
             F.first('top_vol').alias('top_vol')))
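Note that pivoting with multiple aliased aggregations names the columns like 1_top_protocol_by_vol. If you want the top_protocol_by_vol_1 style shown in the question, one possible cleanup (a sketch; the exact column-name format is an assumption about Spark's pivot naming) is:

for c in df.columns:
    if '_' in c and c.split('_', 1)[0].isdigit():
        rank, rest = c.split('_', 1)          # e.g. '1', 'top_protocol_by_vol'
        df = df.withColumnRenamed(c, rest + '_' + rank)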
I have a multiindex dataframe with two index levels and roughly 100 columns. I would like to get groups of values (organized in columns) based on the presence of a certain value, but I am still struggling with the indexing mechanics.
Here is some example data:
import pandas as pd
import numpy as np

index_arrays = [np.array(["one"]*5 + ["two"]*5),
                np.array(["aaa","bbb","ccc","ddd","eee"]*2)]
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],
                   [10,11,12],[13,14,15],[16,1,17],
                   [18,19,20],[21,22,23],[24,25,26],
                   [27,28,29]], index=index_arrays)
Gives
0 1 2
one aaa 1 2 3
bbb 4 5 6
ccc 7 8 9
ddd 10 11 12
eee 13 14 15
two aaa 16 1 17
bbb 18 19 20
ccc 21 22 23
ddd 24 25 26
eee 27 28 29
Now, for each level_0 index (one and two), I want to return the entire column in which the value at level_1 index aaa equals a certain value, for example 1.
What I got so far is this:
df[df.loc[(slice(None), "aaa"),:]==1].any(axis=1)
>
one aaa True
bbb False
ccc False
ddd False
eee False
two aaa True
bbb False
ccc False
ddd False
eee False
Instead of the boolean values, I would like to retrieve the actual values. The expected output would be:
expected:
0
one aaa 1
bbb 4
ccc 7
ddd 10
eee 13
two aaa 1
bbb 19
ccc 22
ddd 25
eee 28
I would appreciate your help.
Bonus question: Additionally, it would be great to know which column contains the values in question. For the example above, this would be column 0 (for index one) and column 1 (for index two). Is there a way to do this?
Thanks!
This might be what you're looking for:
df.loc[df.index.get_level_values(0) == 'one', df.loc[('one', 'aaa')] == 1]
This outputs:
0
one aaa 1
bbb 4
ccc 7
ddd 10
eee 13
To combine the results for all of the different values of the first index, generate these DataFrames and concatenate them:
output_df = pd.DataFrame()
for level_0_val in df.index.get_level_values(0).unique():
    _ = df.loc[df.index.get_level_values(0) == level_0_val,
               df.loc[(level_0_val, 'aaa')] == 1]
    output_df = output_df.append(_)  # in modern pandas, collect frames and use pd.concat
Here is output_df:
0 1
one aaa 1.0 NaN
bbb 4.0 NaN
ccc 7.0 NaN
ddd 10.0 NaN
eee 13.0 NaN
two aaa NaN 1.0
bbb NaN 19.0
ccc NaN 22.0
ddd NaN 25.0
eee NaN 28.0
You can then generate your desired output from this.
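For example, since each row here has at most one non-NaN entry, one way to collapse it into the single expected column (my sketch, not part of the original answer) is:

output_df.bfill(axis=1).iloc[:, [0]]
# yields the expected 1, 4, 7, 10, 13, 1, 19, 22, 25, 28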
Let's try with DataFrame.xs:
m = df.xs('aaa', level=1).eq(1).any()
Or with pd.IndexSlice:
m = df.loc[pd.IndexSlice[:, 'aaa'], :].eq(1).any()
Result:
df.loc[:, m]
0 1
one aaa 1 2
bbb 4 5
ccc 7 8
ddd 10 11
eee 13 14
two aaa 16 1
bbb 18 19
ccc 21 22
ddd 24 25
eee 27 28
df.columns[m]
Int64Index([0, 1], dtype='int64')
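For the bonus question (which column holds the match for each level-0 group), a small sketch along the same lines:

df.xs('aaa', level=1).eq(1).idxmax(axis=1)
# one    0
# two    1
# dtype: int64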
EDITED 3/5/19:
I tried different ways to merge and/or join the data below but couldn't wrap my head around how to do it correctly.
Initially I have a data like this:
index unique_id group_name id name
0 100 ABC 20 aaa
1 100 ABC 21 bbb
2 100 DEF 22 ccc
3 100 DEF 23 ddd
4 100 DEF 24 eee
5 100 DEF 25 fff
6 101 ABC 30 ggg
7 101 ABC 31 hhh
8 101 ABC 32 iii
9 101 DEF 33 jjj
The goal is to reshape it by merging on unique_id so that the result looks like this:
index unique_id group_name_x id_x name_x group_name_y id_y name_y
0 100 ABC 20 aaa DEF 22 ccc
1 100 ABC 21 bbb DEF 23 ddd
2 100 NaN NaN NaN DEF 24 eee
3 100 NaN NaN NaN DEF 25 fff
4 101 ABC 30 ggg DEF 33 jjj
5 101 ABC 31 hhh NaN NaN NaN
6 101 ABC 32 iii NaN NaN NaN
How can I do this in pandas? The best I could think of is to split the data into two dataframes by group name (ABC and DEF) and then merge them with how='outer', on='unique_id', but that way it creates every pairing of the records (2 ABC x 4 DEF = 8 records) without any NaNs.
pd.concat with axis=1, mentioned in the answers, doesn't align the data per unique_id and doesn't create any NaNs.
As you said, split the dataframe, then concat the two dataframes column-wise (axis=1) after resetting both indexes.
Working code:
df = pd.read_clipboard()
req_cols = ['group_name', 'id', 'name']

df_1 = df[df['group_name'] == 'ABC'].reset_index(drop=True)
df_2 = df[df['group_name'] == 'DEF'].reset_index(drop=True)

df_1 = df_1.rename(columns=dict(zip(df_1[req_cols].columns.values,
                                    df_1[req_cols].add_suffix('_x'))))
df_2 = df_2.rename(columns=dict(zip(df_2[req_cols].columns.values,
                                    df_2[req_cols].add_suffix('_y'))))

req_cols_x = [val + '_x' for val in req_cols]
print(pd.concat([df_2, df_1[req_cols_x]], axis=1))
Output:
index unique_id group_name_y id_y name_y group_name_x id_x name_x
0 2 100 DEF 22 ccc ABC 20.0 aaa
1 3 100 DEF 23 ddd ABC 21.0 bbb
2 4 100 DEF 24 eee NaN NaN NaN
3 5 100 DEF 25 fff NaN NaN NaN
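An alternative sketch of mine (assuming the same ABC/DEF split per unique_id as above): build a positional key with cumcount and do the outer merge directly, which produces the NaN alignment the question asks for:

df['seq'] = df.groupby(['unique_id', 'group_name']).cumcount()
abc = df[df['group_name'] == 'ABC']
ddf = df[df['group_name'] == 'DEF']
out = abc.merge(ddf, on=['unique_id', 'seq'], how='outer')
# overlapping columns get the _x/_y suffixes, with NaNs where one
# group has more rows than the other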
I am trying to run a loop over a pandas dataframe that takes two arguments from different rows. I tried to use .iloc and shift functions but did not manage to get the result I need.
Here's a simple example to explain better what i want to do:
dataframe1:
a b c
0 101 1 aaa
1 211 2 dcd
2 351 3 yyy
3 401 5 lol
4 631 6 zzz
For the above df I want to make a new column 'd' that gets the difference between the values in column 'a', but only where the difference between the values in column 'b' is equal to 1; otherwise the value should be null, as in the following dataframe2:
a b c d
0 101 1 aaa nan
1 211 2 dcd 110
2 351 3 yyy 140
3 401 5 lol nan
4 631 6 zzz 230
Is there a built-in function that can handle this kind of calculation?
Try like this, using loc and diff():
df.loc[df.b.diff() == 1, 'd'] = df.a.diff()
>>> df
a b c d
0 101 1 aaa NaN
1 211 2 dcd 110.0
2 351 3 yyy 140.0
3 401 5 lol NaN
4 631 6 zzz 230.0
You can create a group key:
df1.groupby(df1.b.diff().ne(1).cumsum()).a.diff()
Out[361]:
0 NaN
1 110.0
2 140.0
3 NaN
4 230.0
Name: a, dtype: float64
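An equivalent one-liner (a sketch combining the two ideas above): compute the diff everywhere, then blank it wherever the step in b is not exactly 1.

df['d'] = df['a'].diff().where(df['b'].diff().eq(1))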
With reference to the test data below and the function I use to identify values within a variable thresh of each other: can anyone please help me modify this to produce the desired output I have shown?
Test data
import pandas as pd
import numpy as np
from itertools import combinations
df2 = pd.DataFrame(
    {'AAA' : [4,5,6,7,9,10],
     'BBB' : [10,20,30,40,11,10],
     'CCC' : [100,50,25,10,10,11],
     'DDD' : [98,50,25,10,10,11],
     'EEE' : [103,50,25,10,10,11]})
Function:
thresh = 5

def closeCols2(df):
    max_value = None
    for k1, k2 in combinations(df.keys(), 2):
        if abs(df[k1] - df[k2]) < thresh:
            if max_value is None:
                max_value = max(df[k1], df[k2])
            else:
                max_value = max(max_value, max(df[k1], df[k2]))
    return max_value
Data Before function applied:
AAA BBB CCC DDD EEE
0 4 10 100 98 103
1 5 20 50 50 50
2 6 30 25 25 25
3 7 40 10 10 10
4 9 11 10 10 10
5 10 10 11 11 11
Current series output after applied:
df2.apply(closeCols2, axis=1)
0 103
1 50
2 25
3 10
4 11
5 11
dtype: int64
Desired output is a dataframe showing all values within thresh, and a NaN for any value not within thresh:
AAA BBB CCC DDD EEE
0 nan nan 100 98 103
1 nan nan 50 50 50
2 nan 30 25 25 25
3 7 nan 10 10 10
4 9 11 10 10 10
5 10 10 11 11 11
Use mask together with sub (the second argument to sub is axis=0, so the per-row result of the apply aligns on the index):
df2.mask(df2.sub(df2.apply(closeCols2, 1), 0).abs() > thresh)
AAA BBB CCC DDD EEE
0 NaN NaN 100 98 103
1 NaN NaN 50 50 50
2 NaN 30.0 25 25 25
3 7.0 NaN 10 10 10
4 9.0 11.0 10 10 10
5 10.0 10.0 11 11 11
note:
I'd redefine closeCols to include thresh as a parameter. Then you could pass it in the apply call.
def closeCols2(df, thresh):
    max_value = None
    for k1, k2 in combinations(df.keys(), 2):
        if abs(df[k1] - df[k2]) < thresh:
            if max_value is None:
                max_value = max(df[k1], df[k2])
            else:
                max_value = max(max_value, max(df[k1], df[k2]))
    return max_value
df2.apply(closeCols2, 1, thresh=5)
extra credit
I vectorized and embedded your closeCols for some mind-numbing fun. Notice there is no apply.
- numpy broadcasting to get all combinations of columns subtracted from each other.
- np.abs
- <= 5
- sum(-1): I arranged the broadcasting such that the difference of, say, row 0, column AAA with all of row 0 is laid out across the last dimension. The -1 in sum(-1) says to sum across that last dimension.
- <= 1: every value is within 5 of itself, so I want the sum to be greater than 1. Thus, we mask everything whose count is less than or equal to one.
v = df2.values
df2.mask((np.abs(v[:, :, None] - v[:, None]) <= 5).sum(-1) <= 1)
AAA BBB CCC DDD EEE
0 NaN NaN 100 98 103
1 NaN NaN 50 50 50
2 NaN 30.0 25 25 25
3 7.0 NaN 10 10 10
4 9.0 11.0 10 10 10
5 10.0 10.0 11 11 11
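To make the broadcasting explicit, here is the same computation with the intermediate shapes spelled out (annotations mine):

v = df2.values                          # shape (6, 5)
pairwise = v[:, :, None] - v[:, None]   # shape (6, 5, 5): within-row pairwise differences
close = np.abs(pairwise) <= 5           # True where a pair is within thresh
counts = close.sum(-1)                  # shape (6, 5): row-mates within thresh (incl. self)
df2.mask(counts <= 1)                   # keep values close to at least one other value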