How can I use the equivalent of pandas pivot_table() in pyspark? - python

I have a PySpark DataFrame that I want to pivot.
input_dataframe:
mdn    top_protocol_by_vol  top_vol  rank
55555  AAA                  30       1
55555  BBB                  20       2
55555  DDD                  10       3
9898   JJJ                  30       1
9898   CCC                  20       2
9898   FFF                  10       3
2030   PPP                  30       1
2030   KKK                  20       2
2030   FFF                  10       3
And I want to have something like this:
output_dataframe:
mdn    top_protocol_by_vol_1  top_protocol_by_vol_2  top_protocol_by_vol_3  top_vol_1  top_vol_2  top_vol_3
2030   PPP                    KKK                    FFF                    30         20         10
9898   JJJ                    CCC                    FFF                    30         20         10
55555  AAA                    BBB                    DDD                    30         20         10
I know for sure that I can do something like this with Pandas using the code:
output_dataframe = input_dataframe.pivot_table(
    index='mdn', columns=['rank'],
    aggfunc=lambda x: ''.join(x) if isinstance(x, str) else x,
    dropna=True).reset_index()
output_dataframe.columns = ['_'.join([str(c) for c in col if c != ""]) for col in output_dataframe.columns.values]
How can I achieve the same results with PySpark, without converting to Pandas?

You can use the pivot function with first as the aggregate:
from pyspark.sql import functions as F
df = (df.groupby('mdn')
        .pivot('rank')
        .agg(F.first('top_protocol_by_vol').alias('top_protocol_by_vol'),
             F.first('top_vol').alias('top_vol')))
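A note on naming (my addition, not part of the original answer): with two aliased aggregations, Spark emits columns like 1_top_protocol_by_vol and 1_top_vol. A small renaming loop can move the rank suffix to the end so the columns match the top_protocol_by_vol_1 style of the desired output:

# Assumed cleanup step: turn "1_top_protocol_by_vol" into "top_protocol_by_vol_1"
for c in df.columns:
    if c != 'mdn':
        rank, name = c.split('_', 1)
        df = df.withColumnRenamed(c, f'{name}_{rank}')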

Related

Match certain column values with other multiple

I have a data frame, df, where I would like to take certain values from multiple columns and append them to other columns given certain criteria.
Data
id   value  type_a       pos  date   stat  type_b       id   date2
aaa  10     aaa_q2.25_1  30   q1.22  aaa
aaa  20     aaa_q3.25_2  30   q1.22  aaa
aaa  500    aaa_q1.22_3  30   q1.22  aaa   aaa_q1.22_3  aaa
bbb  20     bbb_q1.22_1  20   q1.22  bbb   bbb_q1.22_1  bbb
bbb  10     bbb_q3.25_4  20   q2.22  bbb
aaa  5      aaa_q2.22_3  30   q2.22  aaa   aaa_q2.22_3  aaa
ccc  15     ccc_q3.22_1  50   q3.22  ccc   ccc_q3.22_1  ccc
                                     ccc   ccc_q4.26_2  ccc  q1.22
                                     aaa   ccc_q2.22_2  aaa  q1.22
                                     ccc   ccc_q2.22_3  ccc  q1.22
Desired
Logic: for the empty column spaces of 'id', 'type_a', and 'date', take the values from the columns 'stat', 'type_b', and 'date2' and apply them to 'id', 'type_a', and 'date'.
id   value  type_a       pos  date   stat  type_b       id   date2
aaa  10     aaa_q2.25_1  30   q1.22  aaa
aaa  20     aaa_q3.25_2  30   q1.22  aaa
aaa  500    aaa_q1.22_3  30   q1.22  aaa   aaa_q1.22_3  aaa
bbb  20     bbb_q1.22_1  20   q1.22  bbb   bbb_q1.22_1  bbb
bbb  10     bbb_q3.25_4  20   q2.22  bbb
aaa  5      aaa_q2.22_3  30   q2.22  aaa   aaa_q2.22_3  aaa
ccc  15     ccc_q3.22_1  50   q3.22  ccc   ccc_q3.22_1  ccc
ccc         ccc_q4.26_2  50   q1.22  ccc   ccc_q4.26_2  ccc  q1.22
aaa         aaa_q2.22_2  30   q1.22  aaa   aaa_q2.22_2  aaa  q1.22
ccc         ccc_q2.22_3  50   q1.22  ccc   ccc_q2.22_3  ccc  q1.22
Doing
A SO member has pointed me in the right direction and the code works well; however, I want to manipulate more than one column in this particular problem.
df['id'] = df['id'].fillna(df['stat'].str[:3])
df['typea'] = df['typea'].fillna(df['typeb'].str[:11])
df['date'] = df['date'].fillna(df['date2'].str[:5])
Any suggestion is appreciated
Try via fillna():
df.iloc[:,0]=df.iloc[:,0].fillna(df['stat'])
#you have to use iloc since you have 2 columns of name 'id'
df['type_a']=df['type_a'].fillna(df['type_b'])
df['date']=df['date'].fillna(df['date2'])
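For context, here is a tiny self-contained sketch of my own (not from the answer) showing why the positional iloc access is needed when two columns share the label 'id':

import pandas as pd

# toy frame with two columns named 'id', as in the question's data
df = pd.DataFrame([[None, 'ccc', 'ccc_q4.26_2'],
                   ['aaa', 'aaa', 'aaa_q1.22_3']],
                  columns=['id', 'stat', 'id'])
df.iloc[:, 0] = df.iloc[:, 0].fillna(df['stat'])  # df['id'] would be ambiguous here
print(df)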

DataFrame MultiIndex - find column by value

I have a multiindex dataframe with two layers of indices and roughly 100 columns. I would like to get groups of values (organized in columns) based on the presence of a certain value, but I am still struggling with the indexing mechanics.
Here is some example data:
import numpy as np
import pandas as pd

index_arrays = [np.array(["one"]*5 + ["two"]*5),
                np.array(["aaa", "bbb", "ccc", "ddd", "eee"]*2)]
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9],
                   [10, 11, 12], [13, 14, 15], [16, 1, 17],
                   [18, 19, 20], [21, 22, 23], [24, 25, 26],
                   [27, 28, 29]], index=index_arrays)
Gives
0 1 2
one aaa 1 2 3
bbb 4 5 6
ccc 7 8 9
ddd 10 11 12
eee 13 14 15
two aaa 16 1 17
bbb 18 19 20
ccc 21 22 23
ddd 24 25 26
eee 27 28 29
Now, for each level_0 index (one and two), I want to return the entire column in which the level_1 index of aaa equals a certain value, for example 1.
What I got so far is this:
df[df.loc[(slice(None), "aaa"),:]==1].any(axis=1)
>
one aaa True
bbb False
ccc False
ddd False
eee False
two aaa True
bbb False
ccc False
ddd False
eee False
Instead of the boolean values, I would like to retrieve the actual values. The expected output would be:
expected:
0
one aaa 1
bbb 4
ccc 7
ddd 10
eee 13
two aaa 1
bbb 19
ccc 22
ddd 25
eee 28
I would appreciate your help.
Bonus question: Additionally, it would be great to know which column contains the values in question. For the example above, this would be column 0 (for index one) and column 1 (for index two). Is there a way to do this?
Thanks!
This might be what you're looking for:
df.loc[df.index.get_level_values(0) == 'one', df.loc[('one', 'aaa')] == 1]
This outputs:
0
one aaa 1
bbb 4
ccc 7
ddd 10
eee 13
To combine the results for all of the different values of the first index, generate these DataFrames and concatenate them:
pieces = []
for level_0_val in df.index.get_level_values(0).unique():
    _ = df.loc[df.index.get_level_values(0) == level_0_val, df.loc[(level_0_val, 'aaa')] == 1]
    pieces.append(_)
output_df = pd.concat(pieces)  # DataFrame.append was removed in pandas 2.0; concat is equivalent
Here is output_df:
0 1
one aaa 1.0 NaN
bbb 4.0 NaN
ccc 7.0 NaN
ddd 10.0 NaN
eee 13.0 NaN
two aaa NaN 1.0
bbb NaN 19.0
ccc NaN 22.0
ddd NaN 25.0
eee NaN 28.0
You can then generate your desired output from this.
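If you want the single expected column rather than the NaN-padded frame, one way (an assumption on my part, not shown in the answer) is to back-fill across columns and keep the first one, since each row has exactly one non-NaN value:

# collapse the one-value-per-row frame into a single column
desired = output_df.bfill(axis=1).iloc[:, [0]]
print(desired)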
Let's try with DataFrame.xs:
m = df.xs('aaa', level=1).eq(1).any()
Or with pd.IndexSlice:
m = df.loc[pd.IndexSlice[:, 'aaa'], :].eq(1).any()
Result:
df.loc[:, m]
0 1
one aaa 1 2
bbb 4 5
ccc 7 8
ddd 10 11
eee 13 14
two aaa 16 1
bbb 18 19
ccc 21 22
ddd 24 25
eee 27 28
df.columns[m]
Int64Index([0, 1], dtype='int64')
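For the bonus question, if you need the matching column per level-0 group rather than the combined set, a short follow-up sketch (assuming every group has a match) could be:

# first column whose 'aaa' row equals 1, per level-0 group
matches = df.xs('aaa', level=1).eq(1).idxmax(axis=1)
# one    0
# two    1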

Groupby names replace values with their max value in all columns pandas

I have this DataFrame
import pandas as pd

lst = [['AAA', 15, 'BBB', 20], ['BBB', 16, 'AAA', 12], ['BBB', 22, 'CCC', 15],
       ['CCC', 11, 'AAA', 31], ['DDD', 25, 'EEE', 35]]
df = pd.DataFrame(lst, columns=['name1', 'val1', 'name2', 'val2'])
which looks like this
name1 val1 name2 val2
0 AAA 15 BBB 20
1 BBB 16 AAA 12
2 BBB 22 CCC 15
3 CCC 11 AAA 31
4 DDD 25 EEE 35
I want this
name1 val1 name2 val2
0 AAA 31 BBB 22
1 BBB 22 AAA 31
2 BBB 22 CCC 15
3 CCC 15 AAA 31
4 DDD 25 EEE 35
All values are replaced with their maximum value; we choose the maximum from both val1 and val2.
If I do this, I will get the maximum from only val1:
df["val1"] = df.groupby("name1")["val1"].transform("max")
Try using pd.wide_to_long to melt that dataframe into long form, then use groupby with transform to find the max value. Map that max value back onto 'name' and reshape back to a four-column (wide) dataframe:
df_long = pd.wide_to_long(df.reset_index(), ['name', 'val'], 'index', j='num', sep='', suffix=r'\d+')
mapper = df_long.groupby('name')['val'].max()
df_long['val'] = df_long['name'].map(mapper)
df_new = df_long.unstack()
df_new.columns = [f'{i}{j}' for i, j in df_new.columns]
df_new
Output:
name1 name2 val1 val2
index
0 AAA BBB 31 22
1 BBB AAA 22 31
2 BBB CCC 22 15
3 CCC AAA 15 31
4 DDD EEE 25 35
Borrowing Scott's setup:
df_long = pd.wide_to_long(df.reset_index(), ['name', 'val'], 'index', j='num', sep='', suffix=r'\d+')
d = df_long.groupby('name')['val'].max()
df.loc[:, df.columns.str.startswith('val')] = df.loc[:, df.columns.str.startswith('name')].replace(d).values
df
Out[196]:
name1 val1 name2 val2
0 AAA 31 BBB 22
1 BBB 22 AAA 31
2 BBB 22 CCC 15
3 CCC 15 AAA 31
4 DDD 25 EEE 35
You can use lreshape (undocumented, and it is unclear whether it is tested or will continue to exist) to get the long DataFrame, then map each pair of columns using the max.
names = df.columns[df.columns.str.startswith('name')]
vals = df.columns[df.columns.str.startswith('val')]
s = (pd.lreshape(df, groups={'name': names, 'val': vals})
       .groupby('name')['val'].max())

for n in names:
    df[n.replace('name', 'val')] = df[n].map(s)
name1 val1 name2 val2
0 AAA 31 BBB 22
1 BBB 22 AAA 31
2 BBB 22 CCC 15
3 CCC 15 AAA 31
4 DDD 25 EEE 35
This builds off @ScottBoston's answer:
res = pd.wide_to_long(df.reset_index(), ["name", "val"], "index", j="num")
res.update(res.groupby(["name"]).val.transform("max"))
res = res.unstack()
res.columns = [f"{first}{last}" for first, last in res.columns]
res.rename_axis(index=None)
name1 name2 val1 val2
0 AAA BBB 31 22
1 BBB AAA 22 31
2 BBB CCC 22 15
3 CCC AAA 15 31
4 DDD EEE 25 35

pandas merge two dataframes without cross-references and with NaN's for uneven number of rows

EDITED 3/5/19:
Tried different ways to merge and/or join the data below but couldn't wrap my head around how to do that correctly.
Initially I have a data like this:
index unique_id group_name id name
0 100 ABC 20 aaa
1 100 ABC 21 bbb
2 100 DEF 22 ccc
3 100 DEF 23 ddd
4 100 DEF 24 eee
5 100 DEF 25 fff
6 101 ABC 30 ggg
7 101 ABC 31 hhh
8 101 ABC 32 iii
9 101 DEF 33 jjj
The goal is to reshape it by merging on unique_id so that the result looks like this:
index unique_id group_name_x id_x name_x group_name_y id_y name_y
0 100 ABC 20 aaa DEF 22 ccc
1 100 ABC 21 bbb DEF 23 ddd
2 100 NaN NaN NaN DEF 24 eee
3 100 NaN NaN NaN DEF 25 fff
4 101 ABC 30 ggg DEF 33 jjj
5 101 ABC 31 hhh NaN NaN NaN
6 101 ABC 32 iii NaN NaN NaN
How can I do this in pandas? The best I could think of is to split the data into two dataframes by group name (ABC and DEF) and then merge them with how='outer', on='unique_id', but that way it cross-joins every pair of records (2 ABC x 4 DEF = 8 records) without producing any NaN's.
pd.concat with axis=1 mentioned in answers doesn't align the data per unique_id and doesn't create any NaN's.
As you said, split the dataframe, then concat both dataframes row-wise after resetting both indexes.
Working code:
df = pd.read_clipboard()
req_cols = ['group_name', 'id', 'name']
df_1 = df[df['group_name'] == 'ABC'].reset_index(drop=True)
df_2 = df[df['group_name'] == 'DEF'].reset_index(drop=True)
df_1 = df_1.rename(columns=dict(zip(df_1[req_cols].columns.values, df_1[req_cols].add_suffix('_x'))))
df_2 = df_2.rename(columns=dict(zip(df_2[req_cols].columns.values, df_2[req_cols].add_suffix('_y'))))
req_cols_x = [val + '_x' for val in req_cols]
print(pd.concat([df_2, df_1[req_cols_x]], axis=1))
O/P:
index unique_id group_name_y id_y name_y group_name_x id_x name_x
0 2 100 DEF 22 ccc ABC 20.0 aaa
1 3 100 DEF 23 ddd ABC 21.0 bbb
2 4 100 DEF 24 eee NaN NaN NaN
3 5 100 DEF 25 fff NaN NaN NaN
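If the split-and-concat approach feels fragile because it relies on the ABC and DEF rows lining up after reset_index, a hedged alternative sketch is to build an explicit row counter per (unique_id, group_name) and outer-merge on it; the column names below follow the question's data:

# sketch: align rows within each unique_id via a per-group counter,
# then outer-merge so unequal group sizes produce NaNs
df['row'] = df.groupby(['unique_id', 'group_name']).cumcount()
abc = df[df['group_name'] == 'ABC']
other = df[df['group_name'] == 'DEF']
out = (abc.merge(other, on=['unique_id', 'row'], how='outer', suffixes=('_x', '_y'))
          .sort_values(['unique_id', 'row'])
          .drop(columns='row'))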

Split dataframe output based on values

This post covered Modification of a function to return a dataframe with specified values and I would like to further modify the output. The current function and vectorized version will get all combinations of columns subtracted from each other and return relevant data accordingly.
Example and test data:
import pandas as pd
import numpy as np
from itertools import combinations
df2 = pd.DataFrame(
    {'AAA': [80, 5, 6],
     'BBB': [85, 20, 30],
     'CCC': [100, 50, 25],
     'DDD': [98, 50, 25],
     'EEE': [103, 50, 25],
     'FFF': [105, 50, 25],
     'GGG': [109, 50, 25]})
df2
AAA BBB CCC DDD EEE FFF GGG
0 80 85 100 98 103 105 109
1 5 20 50 50 50 50 50
2 6 30 25 25 25 25 25
v = df2.values
df3 = df2.mask((np.abs(v[:, :, None] - v[:, None]) <= 5).sum(-1) <= 1)
df3
AAA BBB CCC DDD EEE FFF GGG
0 80.0 85.0 100 98 103 105 109
1 NaN NaN 50 50 50 50 50
2 NaN 30.0 25 25 25 25 25
All values within thresh (5 here) are returned on a per-row basis via np.abs(...) <= 5.
What needs to change?
On the first row of df3 there are two clusters of values within thresh: (80, 85) and (100, 98, 103, 105, 109). They are all valid but form two separate groups, since the two clusters are not within thresh of each other. I would like to be able to separate these values based on another thresh value.
I have attempted to demonstrate what I am looking to do with the following (flawed) code, and I am only including this to show that I'm attempting to progress this myself.
df3.mask(df3.apply(lambda x: x >= df3.T.max()
                   - (thresh * 3))).dropna(thresh=2).dropna(axis=1)
AAA BBB
0 80.0 85.0
df3.mask(~df3.apply(lambda x : x >= df3.T.max() - (thresh * 3))).dropna(axis=1)
CCC DDD EEE FFF GGG
0 100 98 103 105 109
1 50 50 50 50 50
2 25 25 25 25 25
So my output is nice (and close to the desired output), but the way I got it is not so nice...
---Desired output: ---
I have used multiple rows to demonstrate, but when I use this code there will only be one row that needs to be output and split. So the desired output is to return the separate columns as per this example for row 0.
CCC DDD EEE FFF GGG
0 100 98 103 105 109
and
AAA BBB
0 80.0 85.0
I felt this was deserving of a separate answer.
I wrote a clustering function that operates on one-dimensional arrays. I know how to vectorize it further to two dimensions, but I haven't gotten to it yet. As it is, I use np.apply_along_axis.
This function is described in this answer to this question. I encourage you to follow the links and see the work that went into getting this seemingly simple function.
What it does is find the clusters within an array defined by margins to the left and right of every point. It sorts, then clusters, then unsorts.
delta clustering function
def delta_cluster(a, dleft, dright):
    # sort the values, remembering how to restore the original order
    s = a.argsort()
    y = s.argsort()
    a = a[s]
    rng = np.arange(len(a))
    # True where no earlier (smaller) value lies within dleft of this one
    edge_left = a.searchsorted(a - dleft)
    starts = edge_left == rng
    # True where the previous value does not reach this one within dright
    edge_right = np.append(0, a.searchsorted(a + dright, side='right')[:-1])
    ends = edge_right == rng
    # each position where both hold begins a new cluster; cumsum labels the clusters,
    # and indexing with y puts the labels back in the original order
    return (starts & ends).cumsum()[y]
Onto the problem at hand
Use the cluster function for each row in df2 with np.apply_along_axis and construct a DataFrame named clusters that mirrors the same index and columns as df2. Then stack to get a Series which will make it easier to manipulate later.
clusters = pd.DataFrame(
    np.apply_along_axis(delta_cluster, 1, df2.values, 10, 10),
    df2.index, df2.columns).stack()
This describes the next block of code.
I need to keep the row information of df2 when I do a groupby.
Use transform to get the size of clusters for each row.
stack the values of df2 and append the cluster values as part of the index. This enables the separation you are looking for.
mask val where size is equal to 1. These are singleton clusters.
lvl0 = clusters.index.get_level_values(0)
size = clusters.groupby([lvl0, clusters]).transform('size')
val = df2.stack().to_frame('value').set_index(clusters, append=True).value
val.mask(size.values == 1).dropna().unstack(1)
AAA BBB CCC DDD EEE FFF GGG
0 1 80.0 85.0 NaN NaN NaN NaN NaN
2 NaN NaN 100.0 98.0 103.0 105.0 109.0
1 3 NaN NaN 50.0 50.0 50.0 50.0 50.0
2 2 NaN 30.0 25.0 25.0 25.0 25.0 25.0
This matches your results except I split out the first row into two rows.
AAA BBB CCC DDD EEE FFF GGG
0 80.0 85.0 100 98 103 105 109
1 NaN NaN 50 50 50 50 50
2 NaN 30.0 25 25 25 25 25
Well I think you can try to solve your problem differently. The idea is to get 'gaps and islands' within each row and label each group:
So, first - put your columns to rows and sort values within each initial row index:
>>> df = df2.stack().sort_values().sort_index(level=0, sort_remaining=False)
>>> df
0 AAA 80
BBB 85
DDD 98
CCC 100
EEE 103
FFF 105
GGG 109
1 AAA 5
BBB 20
GGG 50
FFF 50
DDD 50
CCC 50
EEE 50
2 AAA 6
GGG 25
EEE 25
DDD 25
CCC 25
FFF 25
BBB 30
Next, create a new DataFrame with the 'prev' values alongside the current values:
>>> df = df2.stack().sort_values().sort_index(level=0, sort_remaining=False)
>>> df = pd.concat([df, df.groupby(level=0).shift(1)], axis=1)
>>> df.columns = ['cur', 'prev']
>>> df
cur prev
0 AAA 80 NaN
BBB 85 80.0
DDD 98 85.0
CCC 100 98.0
EEE 103 100.0
FFF 105 103.0
GGG 109 105.0
1 AAA 5 NaN
BBB 20 5.0
GGG 50 20.0
FFF 50 50.0
DDD 50 50.0
CCC 50 50.0
EEE 50 50.0
2 AAA 6 NaN
GGG 25 6.0
EEE 25 25.0
DDD 25 25.0
CCC 25 25.0
FFF 25 25.0
BBB 30 25.0
And now, create the island labels:
>>> df = (df['cur'] - df['prev'] > thresh).astype('int')
>>> df
0 AAA 0
BBB 0
DDD 1
CCC 0
EEE 0
FFF 0
GGG 0
1 AAA 0
BBB 1
GGG 1
FFF 0
DDD 0
CCC 0
EEE 0
2 AAA 0
GGG 1
EEE 0
DDD 0
CCC 0
FFF 0
BBB 0
>>> df.groupby(level=0).cumsum().unstack()
AAA BBB CCC DDD EEE FFF GGG
0 0 0 1 1 1 1 1
1 0 1 2 2 2 2 2
2 0 1 1 1 1 1 1
Now you can filter out groups which have only one member and you're done :)
>>> dfm = df.groupby(level=0).cumsum().unstack()
>>> dfm
AAA BBB CCC DDD EEE FFF GGG
0 0 0 1 1 1 1 1
1 0 1 2 2 2 2 2
2 0 1 1 1 1 1 1
>>> df2[dfm == 0].loc[0:0].dropna(axis=1)
AAA BBB
0 80 85.0
>>> df2[dfm == 1].loc[0:0].dropna(axis=1)
CCC DDD EEE FFF GGG
0 100.0 98.0 103.0 105.0 109.0
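To drop singleton groups generically instead of picking labels by hand (an addition of mine, assuming the dfm labels built above), you could mask every cell whose island label appears only once in its row:

>>> # count how many cells in each row share the same island label,
>>> # then mask the cells whose label occurs only once (singleton groups)
>>> counts = dfm.apply(lambda row: row.map(row.value_counts()), axis=1)
>>> df2.mask(counts == 1)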
method 1
I copied and pasted from the previous question, including the minor change.
I vectorized and embedded your closeCols for some mind-numbing fun.
Notice there is no apply.
numpy broadcasting gets all combinations of columns subtracted from each other.
np.abs takes the absolute differences.
<= 5 flags differences within the threshold.
sum(-1): I arranged the broadcasting such that the difference of, say, row 0, column AAA with all of row 0 is laid out across the last dimension; the -1 in sum(-1) says to sum across that last dimension.
<= 1: every value is within 5 of itself, so I want the sum of these to be greater than 1; thus we mask everything less than or equal to one.
df2 = pd.DataFrame(
    {'AAA': [80, 5, 6],
     'BBB': [85, 20, 30],
     'CCC': [100, 50, 25],
     'DDD': [98, 50, 25],
     'EEE': [103, 50, 25],
     'FFF': [105, 50, 25],
     'GGG': [109, 50, 25]})
v = df2.values

# let x be the distance threshold
# let k be the cluster size threshold
x, k = 5, 2  # cluster size must be greater than k
df3 = df2.mask((np.abs(v[:, :, None] - v[:, None]) <= x).sum(-1) <= k)
# note that this is the same as before, but k = 1 was hard-coded
print(df3)
AAA BBB CCC DDD EEE FFF GGG
0 NaN NaN 100 98 103 105 NaN
1 NaN NaN 50 50 50 50 50.0
2 NaN 30.0 25 25 25 25 25.0
