Pandas: sum DataFrame column for max value only - python

I have the following DataFrame:
df = pd.DataFrame({'a': [0.28, 0, 0.25, 0.85, 0.1],
'b': [0.5, 0.5, 0, 0.75, 0.1],
'c': [0.33, 0.7, 0.25, 0.2, 0.5],
'd': [0, 0.25, 0.2, 0.66, 0.1]})
Output:
a b c d
0 0.28 0.50 0.33 0.00
1 0.00 0.50 0.70 0.25
2 0.25 0.00 0.25 0.20
3 0.85 0.75 0.20 0.66
4 0.10 0.10 0.50 0.10
For each column, I want to sum the top n max values of the column, where n is determined by how many row max values that column contains.
For example, column b has a row max only in row 0, so its sum is the sum of the top 1 max values in that column, which is just 0.5. Column c has three row maxes, located in rows 1, 2, and 4, so the top 3 max values of column c should be summed.
Expected output:
a b c d
0 0.28 0.50 0.33 0.00
1 0.00 0.50 0.70 0.25
2 0.25 0.00 0.25 0.20
3 0.85 0.75 0.20 0.66
4 0.10 0.10 0.50 0.10
count 1.10 0.50 1.45 0.00

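Use where to keep only the values that are the max of their row, then sum each column:
# max value of each row
row_max = df.max(axis=1)
# compare every value in a row to that row's max, in case there is more than one
is_row_max = df.eq(row_max, axis=0)
# keep only the row-max values and sum them per column
counts = df.where(is_row_max).sum().rename('count')
# DataFrame.append was removed in pandas 2.0, so attach the row with pd.concat
pd.concat([df, counts.to_frame().T])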
a b c d
0 0.28 0.50 0.33 0.00
1 0.00 0.50 0.70 0.25
2 0.25 0.00 0.25 0.20
3 0.85 0.75 0.20 0.66
4 0.10 0.10 0.50 0.10
count 1.10 0.50 1.45 0.00
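For checking the numbers, here is a self-contained sketch of the same idea on the question's data. It attaches the row with pd.concat, since DataFrame.append is not available in pandas 2.0 and later; the names is_row_max, count and result are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'a': [0.28, 0, 0.25, 0.85, 0.1],
                   'b': [0.5, 0.5, 0, 0.75, 0.1],
                   'c': [0.33, 0.7, 0.25, 0.2, 0.5],
                   'd': [0, 0.25, 0.2, 0.66, 0.1]})

# True wherever a cell equals the maximum of its own row (ties all count).
is_row_max = df.eq(df.max(axis=1), axis=0)

# Keep only the row-max cells; the rest become NaN, which sum() skips.
count = df.where(is_row_max).sum().rename('count')

# Attach the totals as a new row labelled 'count'.
result = pd.concat([df, count.to_frame().T])
print(result)
```

Note that row 2 has a tie between a and c (both 0.25), so that value is counted in both columns.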

The fastest way to do this would be to use the .max() method and pass the axis argument:
df.max(axis=1)
If you're after another column:
df['column_name'] = df.max(axis=1)
I didn't read the question that well!

Related

Add selected interactions as columns to pandas dataframe

I'm fairly new to pandas and python. I'm trying to return few selected interaction terms of all possible interactions in a data frame, and then return them as new features in the df.
My solution was to calculate the interactions of interest using sklearn's PolynomialFeature() and attach them to the df in a for loop. See example:
import numpy as np
import pandas as pd
from sklearn.preprocessing import PolynomialFeatures
np.random.seed(1111)
a1 = np.random.randint(2, size = (5,3))
a2 = np.round(np.random.random((5,3)),2)
df = pd.DataFrame(np.concatenate([a1, a2], axis = 1), columns = ['a','b','c','d','e','f'])
combinations = [['a', 'e'], ['a', 'f'], ['b', 'f']]
for comb in combinations:
    polynomizer = PolynomialFeatures(interaction_only=True, include_bias=False).fit(df[comb])
    # get_feature_names_out() replaces get_feature_names(), which scikit-learn 1.2 removed
    newcol_nam = polynomizer.get_feature_names_out(comb)[2]
    newcol_val = polynomizer.transform(df[comb])[:, 2]
    df[newcol_nam] = newcol_val
df
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
Another solution would be to run
PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(df)
and then drop the interactions I'm not interested in.
However, neither option is ideal in terms of performance and I'm wondering if there is a better solution.
As commented, you can try:
df = df.join(pd.DataFrame({
    f'{x} {y}': df[x]*df[y] for x, y in combinations
}))
Or simply:
for comb in combinations:
    df[' '.join(comb)] = df[comb].prod(axis=1)
Output:
a b c d e f a e a f b f
0 0.0 1.0 1.0 0.51 0.45 0.10 0.00 0.00 0.10
1 1.0 0.0 0.0 0.67 0.36 0.23 0.36 0.23 0.00
2 0.0 0.0 0.0 0.97 0.79 0.02 0.00 0.00 0.00
3 0.0 1.0 0.0 0.44 0.37 0.52 0.00 0.00 0.52
4 0.0 0.0 0.0 0.16 0.02 0.94 0.00 0.00 0.00
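As a variation on the product approach, the sketch below builds all interaction columns first and attaches them in a single pd.concat. The small frame is made up for illustration (only the column names follow the question), and interactions and out are hypothetical names.

```python
import pandas as pd

# Small hypothetical frame; column names mirror the question.
df = pd.DataFrame({'a': [0.0, 1.0, 0.0],
                   'b': [1.0, 0.0, 0.0],
                   'e': [0.45, 0.36, 0.79],
                   'f': [0.10, 0.23, 0.02]})
combinations = [['a', 'e'], ['a', 'f'], ['b', 'f']]

# Build every interaction column first; dict keys become the column names.
interactions = pd.concat(
    {' '.join(comb): df[comb].prod(axis=1) for comb in combinations},
    axis=1,
)

# Attach all new columns in one step instead of one assignment per loop pass.
out = pd.concat([df, interactions], axis=1)
print(out)
```

Because prod(axis=1) multiplies however many columns are selected, the same code should also handle combinations of three or more columns.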

How do you give weights to dataframe columns iteratively for weighted mean average?

I have a dataframe with multiple columns having numerical float values. What I want to do is give fractional weights to each column and calculate its average to store and append it to the same df.
Let's say we have the columns: s1, s2, s3
I want to give the weights: w1, w2, w3 to them respectively
I was able to do this manually while experimenting with all values in hand, but when I switch to a list of weights and iterate over it, I get an error.
My iterative code, which fails, and my manual code, which works but needs every weight written out by hand, are both below.
Code which didn't work:
score_df["weighted_avg"] += weight * score_df[feature]
Manual Code which worked but not with lists:
df["weighted_scores"] = 0.5*df["s1"] + 0.25*df["s2"] + 0.25*df["s3"]
We can use numpy broadcasting for this, since weights has the same shape as your column axis:
# given the following example df
df = pd.DataFrame(np.random.rand(10,3), columns=["s1", "s2", "s3"])
print(df)
s1 s2 s3
0 0.49 1.00 0.50
1 0.65 0.87 0.75
2 0.45 0.85 0.87
3 0.91 0.53 0.30
4 0.96 0.44 0.50
5 0.67 0.87 0.24
6 0.87 0.41 0.29
7 0.06 0.15 0.73
8 0.76 0.92 0.69
9 0.92 0.28 0.29
weights = [0.5, 0.25, 0.25]
df["weighted_scores"] = df.mul(weights).sum(axis=1)
print(df)
s1 s2 s3 weighted_scores
0 0.49 1.00 0.50 0.62
1 0.65 0.87 0.75 0.73
2 0.45 0.85 0.87 0.66
3 0.91 0.53 0.30 0.66
4 0.96 0.44 0.50 0.71
5 0.67 0.87 0.24 0.61
6 0.87 0.41 0.29 0.61
7 0.06 0.15 0.73 0.25
8 0.76 0.92 0.69 0.78
9 0.92 0.28 0.29 0.60
You can also use dot:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(10,3), columns=["s1", "s2", "s3"])
df['weighted_scores'] = df.dot([.5,.25,.25])
df
Out
s1 s2 s3 weighted_scores
0 0.053543 0.659316 0.033540 0.199985
1 0.631627 0.257241 0.494959 0.503863
2 0.220939 0.870247 0.875165 0.546822
3 0.890487 0.519320 0.944459 0.811188
4 0.029416 0.016780 0.987503 0.265779
5 0.843882 0.784933 0.677096 0.787448
6 0.396092 0.297580 0.965454 0.513805
7 0.109894 0.011217 0.443796 0.168700
8 0.202096 0.637105 0.959876 0.500293
9 0.847020 0.949703 0.668615 0.828090
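The iterative line from the question can be made to work as well. One likely cause of the error (an assumption, since the traceback is not shown) is that the weighted_avg column did not exist before the first +=. The sketch below initialises it to zero, loops over parallel lists named features and weights (hypothetical names), and compares the result with dot.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
score_df = pd.DataFrame(rng.random((10, 3)), columns=["s1", "s2", "s3"])

# Hypothetical names: features and weights are parallel lists.
features = ["s1", "s2", "s3"]
weights = [0.5, 0.25, 0.25]

# Start from zero so that += has an existing column to add to.
score_df["weighted_avg"] = 0.0
for feature, weight in zip(features, weights):
    score_df["weighted_avg"] += weight * score_df[feature]

# The same result in one call, restricted to the feature columns.
via_dot = score_df[features].dot(weights)
print(np.allclose(score_df["weighted_avg"], via_dot))
```

Selecting score_df[features] before calling dot keeps the new weighted_avg column, or any other extra column, out of the product.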

Return column names for 3 highest values in rows

I'm trying to come up with a way to return the column names for the 3 highest values in each row of the table below. So far I've been able to return the highest value using idxmax but I haven't been able to figure out how to get the 2nd and 3rd highest.
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6
0 9 0.00 0.15 0.06 0.11 0.23 0.01
1 4 0.00 0.25 0.04 0.10 0.10 0.00
2 11 0.00 0.34 0.00 0.09 0.24 0.00
3 12 0.00 0.16 0.00 0.11 0.00 0.00
4 0 0.00 0.35 0.00 0.04 0.02 0.00
5 17 0.01 0.21 0.02 0.18 0.27 0.01
Expected output:
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6 TopThree
0 9 0.00 0.15 0.06 0.11 0.23 0.01 [Stat5,Stat2,Stat4]
1 4 0.00 0.25 0.04 0.10 0.10 0.00 [Stat2,Stat4,Stat5]
2 11 0.00 0.34 0.00 0.09 0.24 0.00 [Stat2,Stat5,Stat4]
3 12 0.00 0.16 0.00 0.19 0.00 0.01 [Stat4,Stat2,Stat6]
4 0 0.00 0.35 0.00 0.04 0.02 0.00 [Stat2,Stat4,Stat5]
5 17 0.01 0.21 0.02 0.18 0.27 0.01 [Stat5,Stat2,Stat4]
If anyone has ideas on how to do this I'd appreciate it.
Use numpy.argsort to get the positions of the sorted values, taking every column except the first:
a = df.iloc[:, 1:].to_numpy()
df['TopThree'] = df.columns[1:].to_numpy()[np.argsort(-a, axis=1)[:, :3]].tolist()
print(df)
Clust Stat1 Stat2 Stat3 Stat4 Stat5 Stat6 TopThree
0 9 0.00 0.15 0.06 0.11 0.23 0.01 [Stat5, Stat2, Stat4]
1 4 0.00 0.25 0.04 0.10 0.10 0.00 [Stat2, Stat4, Stat5]
2 11 0.00 0.34 0.00 0.09 0.24 0.00 [Stat2, Stat5, Stat4]
3 12 0.00 0.16 0.00 0.11 0.00 0.00 [Stat2, Stat4, Stat1]
4 0 0.00 0.35 0.00 0.04 0.02 0.00 [Stat2, Stat4, Stat5]
5 17 0.01 0.21 0.02 0.18 0.27 0.01 [Stat5, Stat2, Stat4]
If performance is not important:
df['TopThree'] = df.iloc[:, 1:].apply(lambda x: x.nlargest(3).index.tolist(), axis=1)
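Several rows contain tied values (for example Stat4 and Stat5 in row 1), and the default sort used by argsort does not promise any particular order among ties. A minimal sketch, assuming ties should be broken left to right by column position, passes kind='stable'; stat_cols and order are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Clust': [9, 4, 11, 12, 0, 17],
    'Stat1': [0.00, 0.00, 0.00, 0.00, 0.00, 0.01],
    'Stat2': [0.15, 0.25, 0.34, 0.16, 0.35, 0.21],
    'Stat3': [0.06, 0.04, 0.00, 0.00, 0.00, 0.02],
    'Stat4': [0.11, 0.10, 0.09, 0.11, 0.04, 0.18],
    'Stat5': [0.23, 0.10, 0.24, 0.00, 0.02, 0.27],
    'Stat6': [0.01, 0.00, 0.00, 0.00, 0.00, 0.01],
})

stat_cols = df.columns[1:]      # every column except Clust
a = df[stat_cols].to_numpy()

# Negate to sort descending; kind='stable' keeps tied values in column order.
order = np.argsort(-a, axis=1, kind='stable')[:, :3]

# Map the positions back to column names, one list of three per row.
df['TopThree'] = stat_cols.to_numpy()[order].tolist()
print(df['TopThree'].tolist())
```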

Pandas - drop all rows with 0 in at least two columns

I have some DataFrame:
df = pd.DataFrame({'name': ['apple1', 'apple2', 'apple3', 'apple4', 'orange1', 'orange2', 'orange3', 'orange4'],
'A': [0, 0, 0, 0, 0, 0 ,0, 0],
'B': [0.10, -0.15, 0.25, -0.55, 0.50, -0.51, 0.70, 0],
'C': [0, 0, 0.25, -0.55, 0.50, -0.51, 0.70, 0.90],
'D': [0.10, -0.15, 0.25, 0, 0.50, -0.51, 0.70, 0.90]})
df
name A B C D
0 apple1 0 0.10 0.00 0.10
1 apple2 0 -0.15 0.00 -0.15
2 apple3 0 0.25 0.25 0.25
3 apple4 0 -0.55 -0.55 0.00
4 orange1 0 0.50 0.50 0.50
5 orange2 0 -0.51 -0.51 -0.51
6 orange3 0 0.70 0.70 0.70
7 orange4 0 0.00 0.90 0.90
I'd like to drop all rows that have two or more zeros in columns A,B,C,D.
This DataFrame has other columns that have zeros; I only want to check for zeros in columns A,B,C,D.
You can use .eq to check if dataframe is equal to 0 and then take sum on axis=1 and return a boolean series by checking if the sum is greater than or equal to 2 (ge):
df[~df[['A','B','C','D']].eq(0).sum(1).ge(2)]
name A B C D
2 apple3 0 0.25 0.25 0.25
4 orange1 0 0.50 0.50 0.50
5 orange2 0 -0.51 -0.51 -0.51
6 orange3 0 0.70 0.70 0.70
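To see what the one-liner does step by step, this sketch splits it into the per-row zero count and the final filter, using the question's data. The names check_cols, zeros_per_row and kept are only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['apple1', 'apple2', 'apple3', 'apple4',
             'orange1', 'orange2', 'orange3', 'orange4'],
    'A': [0, 0, 0, 0, 0, 0, 0, 0],
    'B': [0.10, -0.15, 0.25, -0.55, 0.50, -0.51, 0.70, 0],
    'C': [0, 0, 0.25, -0.55, 0.50, -0.51, 0.70, 0.90],
    'D': [0.10, -0.15, 0.25, 0, 0.50, -0.51, 0.70, 0.90],
})

check_cols = ['A', 'B', 'C', 'D']   # only these columns are inspected

# Count the zeros in each row across the checked columns.
zeros_per_row = df[check_cols].eq(0).sum(axis=1)

# Keep the rows with fewer than two zeros.
kept = df[zeros_per_row.lt(2)]
print(zeros_per_row.tolist())
print(kept)
```

Filtering with lt(2) is equivalent to negating ge(2), and any other columns in the frame are ignored because only check_cols is compared.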

Pandas: Pairwise concatenation of column vectors

I'm working with a frame like
df = pd.DataFrame({
'G1':[1.00,0.69,0.23,0.22,0.62],
'G2':[0.03,0.41,0.74,0.35,0.62],
'G3':[0.05,0.40,0.15,0.32,0.19],
'G4':[0.30,0.20,0.51,0.70,0.67],
'G5':[0.40,0.36,0.88,0.10,0.19]
})
and I want to manipulate it so that the columns are the pairwise permutations of the current columns. Every new column would be 10 elements long; for example, column 'G1:G2' would be column 'G2' appended to column 'G1'. I have attached a mock-up pic. Note that the pic has named indices, unlike the above example code. I can work with or without the indices.
How could I approach this? I can make a function to act on each column, but I think the function would have to return a data frame made by concatenation with all other columns. Not sure what that would look like.
I'd do it like this:
from itertools import permutations
l1, l2 = map(list, zip(*permutations(range(len(df.columns)), 2)))
v = df.values
pd.DataFrame(
    np.vstack([v[:, l1], v[:, l2]]),
    list(map('S{}'.format, range(1, len(df) + 1))) * 2,
    df.columns.values[l1] + ':' + df.columns.values[l2]
)
Here is one way, although I suspect there might also be a way to do this directly in pandas:
from itertools import permutations
# Get all the column permutations
lst = [x for x in permutations(df.columns, 2)]
# Create a list of column names
names = [x[0]+'_'+x[1] for x in lst]
# Create the new arrays by vertically stacking pairs of column values
cols = [np.vstack((df[x[0]].values, df[x[1]].values)).ravel() for x in lst]
# Create a dictionary with column names as keys and the arrays as values
d = dict(zip(names, cols))
# Create new dataframe from dict
df2 = pd.DataFrame(d)
df2
G1_G2 G1_G3 G1_G4 G1_G5 G2_G1 G2_G3 G2_G4 G2_G5 G3_G1 G3_G2 \
0 1.00 1.00 1.00 1.00 0.03 0.03 0.03 0.03 0.05 0.05
1 0.69 0.69 0.69 0.69 0.41 0.41 0.41 0.41 0.40 0.40
2 0.23 0.23 0.23 0.23 0.74 0.74 0.74 0.74 0.15 0.15
3 0.22 0.22 0.22 0.22 0.35 0.35 0.35 0.35 0.32 0.32
4 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.62 0.19 0.19
5 0.03 0.05 0.30 0.40 1.00 0.05 0.30 0.40 1.00 0.03
6 0.41 0.40 0.20 0.36 0.69 0.40 0.20 0.36 0.69 0.41
7 0.74 0.15 0.51 0.88 0.23 0.15 0.51 0.88 0.23 0.74
8 0.35 0.32 0.70 0.10 0.22 0.32 0.70 0.10 0.22 0.35
9 0.62 0.19 0.67 0.19 0.62 0.19 0.67 0.19 0.62 0.62
This is part of the output
To avoid creating the intermediate lists, take advantage of the fact that itertools.permutations returns a lazy iterator:
d = dict((x[0]+'_'+x[1], np.vstack((df[x[0]].values, df[x[1]].values)).ravel())
         for x in permutations(df.columns, 2))
df2 = pd.DataFrame(d)
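The same result can probably be reached with pandas alone. In this sketch, pd.concat with ignore_index=True stacks the second column under the first for every ordered pair; it is one possible variant, not a claim about which approach is fastest.

```python
from itertools import permutations

import pandas as pd

df = pd.DataFrame({
    'G1': [1.00, 0.69, 0.23, 0.22, 0.62],
    'G2': [0.03, 0.41, 0.74, 0.35, 0.62],
    'G3': [0.05, 0.40, 0.15, 0.32, 0.19],
    'G4': [0.30, 0.20, 0.51, 0.70, 0.67],
    'G5': [0.40, 0.36, 0.88, 0.10, 0.19],
})

# For each ordered pair (x, y), stack column y underneath column x.
# ignore_index=True renumbers the stacked rows 0..9.
df2 = pd.DataFrame({
    f'{x}_{y}': pd.concat([df[x], df[y]], ignore_index=True)
    for x, y in permutations(df.columns, 2)
})
print(df2.shape)
```

With 5 source columns there are 5 * 4 = 20 ordered pairs, so df2 has 20 columns of 10 rows each.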
