Apply function to all rows in pandas dataframe (lambda) - python

I have the following function for getting the column name of last non-zero value of a row
import pandas as pd
def myfunc(X, Y):
    row = X.iloc[Y]
    counter = len(row) - 1
    # Walk backwards from the last column until a non-zero value is found
    while counter >= 0:
        if row.iloc[counter] == 0:
            counter -= 1
        else:
            break
    return X.columns[counter]
Using the following code example
data = {'id': ['1', '2', '3', '4', '5', '6'],
        'name': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'GGG'],
        'A1': [1, 1, 1, 0, 1, 1],
        'B1': [0, 0, 1, 0, 0, 1],
        'C1': [1, 0, 1, 1, 0, 0],
        'A2': [1, 0, 1, 0, 1, 0]}
df = pd.DataFrame(data)
df
myfunc(df, 5) # 'B1'
I would like to know how I can apply this function to all rows of a dataframe and put the results into a new column of df.
I am thinking about looping over all rows (which is probably not a good approach) or using a lambda with the apply function. However, I have not succeeded with the latter approach. Any help?

I've modified your function a little bit to work across rows:
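def myfunc(row):
    counter = len(row) - 1
    # Walk backwards from the last column until a non-zero value is found
    while counter >= 0:
        if row.iloc[counter] == 0:  # positional access; row[counter] would look up a label
            counter -= 1
        else:
            break
    return row.index[counter]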
Now just call df.apply with your function and axis=1 to call the function for each row of the dataframe:
>>> df.apply(myfunc, axis=1)
0 A2
1 A1
2 A2
3 C1
4 A2
5 B1
dtype: object
However, you can ditch your custom function and use this code to do what you're looking for in a much faster and more concise manner:
>>> df[df.columns[2:]].T.cumsum().idxmax()
0 A2
1 A1
2 A2
3 C1
4 A2
5 B1
dtype: object

Here is a simpler and faster solution using DataFrame.idxmax.
>>> res = df.iloc[:, :1:-1].idxmax(axis=1)
>>> res
0 A2
1 A1
2 A2
3 C1
4 A2
5 B1
dtype: object
The idea is to select only the value columns (everything after id and name), reverse their order (df.iloc[:, :1:-1]), and then return the column label of the first occurrence of the maximum (1 in this case) for each row (.idxmax(axis=1)).
Note that this solution (like the other answer) assumes that each row contains at least one entry greater than zero. Because idxmax looks for the largest value, it also relies on all non-zero entries being equal, as they are in this 0/1 data.
Both assumptions can be relaxed to 'each row contains at least one non-zero entry' if we first mask the non-zero entries (using .ne(0)). This works because .ne(0) produces a boolean mask and True > False <=> 1 > 0.
>>> res = df.iloc[:, :1:-1].ne(0).idxmax(axis=1)
>>> res
0 A2
1 A1
2 A2
3 C1
4 A2
5 B1
dtype: object
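Neither snippet above writes the result back to the dataframe, which the question also asks for. Here is a minimal sketch, assuming the value columns are everything after id and name. The column name last_nonzero and the helper variables are made up for illustration. Rows without any non-zero entry get a missing value instead of a misleading label.

```python
import pandas as pd

data = {'id': ['1', '2', '3', '4', '5', '6'],
        'name': ['AAA', 'BBB', 'CCC', 'DDD', 'EEE', 'GGG'],
        'A1': [1, 1, 1, 0, 1, 1],
        'B1': [0, 0, 1, 0, 0, 1],
        'C1': [1, 0, 1, 1, 0, 0],
        'A2': [1, 0, 1, 0, 1, 0]}
df = pd.DataFrame(data)

value_cols = df.columns[2:]          # assumption: everything after 'id' and 'name'
mask = df[value_cols].ne(0)          # True where the entry is non-zero

# Reverse the column order so idxmax returns the last non-zero column.
last = mask.iloc[:, ::-1].idxmax(axis=1)

# Keep the label only for rows that have at least one non-zero entry.
df['last_nonzero'] = last.where(mask.any(axis=1))
print(df)
```

The mask is computed before the new column is added, so the result column never takes part in the search.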

Related

Replace values by result of a function

I have the following dataframe:
df = pd.DataFrame({'A': [0, 1, 0],
                   'B': [1, 1, 1]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
I want every 1 to be replaced by an increasing number. I'm looking for something like:
df.replace(1, value=3)
That works, but instead of the constant 3 I need the number to increase with each replacement (I want to use it as an ID):
number += 1
If I combine the two, it doesn't work (or at least I can't find the correct syntax). I'd like to obtain the following result:
df = pd.DataFrame({'A': [0, 2, 0],
                   'B': [1, 3, 4]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
Note: I can not use any command that relies on specification of column or row name, because table has 2600 columns and 5000 rows.
Element-wise assignment on a copy of df.values can work.
More specifically, a range starting from 1 to the number of 1's (inclusive) is assigned onto the location of 1 elements in the value array. The assigned array is then put back into the original dataframe.
Code
(Data as given)
1. Row-first ordering (what the OP wants)
arr = df.values.copy()  # explicit copy, so the assignment never touches df's own data
mask = (arr > 0)
arr[mask] = range(1, mask.sum() + 1)  # boolean indexing fills in row-major order
for i, col in enumerate(df.columns):
    df[col] = arr[:, i]
# Result
print(df)
            A  B
2020-01-01  0  1
2020-02-01  2  3
2020-03-01  0  4
2. Column-first ordering (another possibility)
arr_tr = df.values.transpose().copy()  # explicit copy of the transposed values
mask_tr = (arr_tr > 0)
arr_tr[mask_tr] = range(1, mask_tr.sum() + 1)
for i, col in enumerate(df.columns):
    df[col] = arr_tr[i, :]
# Result
print(df)
            A  B
2020-01-01  0  2
2020-02-01  1  3
2020-03-01  0  4
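The same row-first numbering can also be written without in-place assignment, by taking a running count over the flattened mask. This is only a sketch of an alternative, not part of the answer above. The names mask, ids and result are illustrative, and it assumes the values to replace are exactly the entries equal to 1.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0],
                   'B': [1, 1, 1]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])

values = df.to_numpy()
mask = values == 1                      # positions that should receive an ID

# Running count of the 1s in row-major order; only meaningful where mask is True.
ids = mask.ravel().cumsum().reshape(mask.shape)

# Take the running count where there was a 1, keep the original value elsewhere.
result = pd.DataFrame(np.where(mask, ids, values),
                      index=df.index, columns=df.columns)
print(result)
```

Passing order='F' to ravel and reshape should give the column-first numbering instead.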

compare column row wise in dataframe

I have a pandas data frame. Sample dataframe:
df =
a1  a2  a3  a4  a5
 0   1   1   1   0   # dict[a3_a4] = 1, dict[a2_a4] = 1, dict[a2_a3] = 1
 1   1   1   0   0   # dict[a1_a2] = 1, dict[a1_a3] = 1, dict[a2_a3] = 1
I need a function that takes the data frame as input, counts for each pair of columns how many rows have both columns set to 1, and stores the counts in a dictionary.
The output dict should look like this: {'a1_a2':1,'a2_a3':2, 'a3_a4':1,'a1_a3':1,'a2_a4':1}
Pseudo code if needed
PS: I am new to stack overflow so forgive me for my mistakes.
You can use itertools combinations to get all the pairs of columns. Then you can multiply up the values and take the sum of them.
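from itertools import combinations
import pandas as pd
cc = list(combinations(df.columns,2))  # all column pairs
# product of two 0/1 columns is 1 only where both are 1
df1 = pd.concat([df[c[1]]*df[c[0]] for c in cc], axis=1, keys=cc)
df1.columns = df1.columns.map('_'.join)
d = df1.sum().to_dict()  # column sums = number of rows where both are 1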
print(d)
Output:
{'a1_a2': 1,
'a1_a3': 1,
'a1_a4': 0,
'a1_a5': 0,
'a2_a3': 2,
'a2_a4': 1,
'a2_a5': 0,
'a3_a4': 1,
'a3_a5': 0,
'a4_a5': 0}
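For 0/1 data, the pairwise products and sums above amount to a single matrix product: df.T.dot(df) holds, for every pair of columns, the number of rows in which both are 1. Here is a minimal sketch of that variant. The sample frame is rebuilt from the question, and the names co, d and d_nonzero are illustrative.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'a1': [0, 1],
                   'a2': [1, 1],
                   'a3': [1, 1],
                   'a4': [1, 0],
                   'a5': [0, 0]})

# Co-occurrence matrix: entry (x, y) counts the rows where x and y are both 1.
co = df.T.dot(df)

# Read off one entry per unordered column pair.
d = {f'{x}_{y}': int(co.loc[x, y]) for x, y in combinations(df.columns, 2)}

# Drop the zero counts if only pairs that actually occur are wanted.
d_nonzero = {k: v for k, v in d.items() if v}
print(d_nonzero)
```

The d_nonzero dict contains only the pairs listed in the question's expected output.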

Select rows based on condition and set values from a vector

I want to set entire rows to a value from a vector if a condition on one column is met.
import pandas as pd
df = pd.DataFrame([['a', 1, 1], ['a', 1, 1], ['b', 1, 1]], columns=('one', 'two', 'three'))
vector = pd.Series([2,3,4])
print(df)
one two three
0 a 1 1
1 a 1 1
2 b 1 1
I want the result to be like this:
df_wanted = pd.DataFrame([['a', 1, 1], ['a', 1, 1], ['b', 4, 4]], columns=('one', 'two', 'three'))
print(df_wanted)
one two three
0 a 1 1
1 a 1 1
2 b 4 4
I tried this, but it gives me an error:
df.loc[df['one']=='b'] = vector[df['one']=='b']
ValueError: Must have equal len keys and value when setting with an iterable
You can specify the columns to set in a list:
df.loc[df['one']=='b', ['two', 'three']] = vector[df['one']=='b']
print(df)
one two three
0 a 1 1
1 a 1 1
2 b 4 4
Or, if you need a more dynamic solution, select all numeric columns:
import numpy as np
df.loc[df['one']=='b', df.select_dtypes(np.number).columns] = vector[df['one']=='b']
Or compare only once and assign the mask to a variable:
m = df['one']=='b'
df.loc[m, df.select_dtypes(np.number).columns] = vector[m]
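The assignments above rely on pandas aligning the Series on the row index and spreading it across the selected columns. If you would rather make that broadcasting explicit, one option is to build a 2D array of the right shape first. This is a sketch under that assumption, and the names cols and block are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([['a', 1, 1], ['a', 1, 1], ['b', 1, 1]],
                  columns=['one', 'two', 'three'])
vector = pd.Series([2, 3, 4])

m = df['one'] == 'b'                         # rows to overwrite
cols = df.select_dtypes('number').columns    # numeric columns only

# Repeat each selected vector entry once per column -> shape (n_rows, n_cols).
block = np.repeat(vector[m].to_numpy()[:, None], len(cols), axis=1)
df.loc[m, cols] = block
print(df)
```

Because block has exactly one row per selected row and one column per selected column, the equal-length check behind the original ValueError is satisfied.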

Python iterate each sub group of rows and apply function

I need to build every combination of subgroups (one subgroup per main group), apply a function to the combined rows of each combination so that it returns a single value, and also return a concatenated string identifying which subgroups went into that combination.
I understand how to use DataFrame.groupby, set level=0 or level=1, and then call agg({'LOOPED_AVG': 'mean'}). However, I need to subset the rows by subgroup, combine all rows belonging to one combination, and then apply the function to them.
Input data table:
MAIN_GROUP SUB_GROUP CONCAT_GRP_NAME X_1
A 1 A1 9
A 1 A1 6
A 1 A1 3
A 2 A2 7
A 3 A3 9
B 1 B1 7
B 1 B1 3
B 2 B2 7
B 2 B2 8
C 1 C1 9
Desired result:
LOOP_ITEMS LOOPED_AVG
A1 B1 C1 6.166666667
A1 B2 C1 7
A2 B1 C1 6.5
A2 B2 C1 7.75
A3 B1 C1 7
A3 B2 C1 8.25
The following assumes that you have three main groups (and therefore three item columns); for more groups, adjust the script accordingly. It may not be the most efficient approach, but it gives you a starting point.
import pandas as pd
import numpy as np
ls = [
['A', 1, 'A1', 9],
['A', 1, 'A1', 6],
['A', 1, 'A1', 3],
['A', 2, 'A2', 7],
['A', 3, 'A3', 9],
['B', 1, 'B1', 7],
['B', 1, 'B1', 3],
['B', 2, 'B2', 7],
['B', 2, 'B2', 8],
['C', 1, 'C1', 9],
]
#convert to dataframe
df = pd.DataFrame(ls, columns = ["Main_Group", "Sub_Group", "Concat_GRP_Name", "X_1"])
#get count and sum of concatenated groups
df_sum = df.groupby('Concat_GRP_Name')['X_1'].agg(['sum','count']).reset_index()
#build the permutations of the group names and keep only the valid ones
import itertools as it
def compute_combinations(df, colname, main_group_series):
    l = []
    perms = it.permutations(df[colname])
    # Provides sorted list of unique values in the Series
    unique_groups = np.unique(main_group_series)
    for perm_pairs in perms:
        #take in only the first three items of each permutation and make sure
        #the first starts with A, the second with B, and the third with C
        if all([main_group in perm_pairs[ind] for ind, main_group in enumerate(unique_groups)]):
            l.append([perm_pairs[ind] for ind in range(unique_groups.shape[0])])
    return l
t = compute_combinations(df_sum, 'Concat_GRP_Name', df['Main_Group'])
#convert to dataframe and drop duplicate pairs
df2 = pd.DataFrame(t, columns = ["Item1", 'Item2', 'Item3']).drop_duplicates()
#do a join between the dataframe that contains the sums and counts for the concat_grp_name to bring in the counts for
#each column from df2, since there are three columns: we must apply this three times
merged = df2.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item1'], right_on=['Concat_GRP_Name'], how='inner')\
    .drop(['Concat_GRP_Name'], axis = 1)\
    .rename({'sum':'item1_sum'}, axis=1)\
    .rename({'count':'item1_count'}, axis=1)
merged2 = merged.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item2'], right_on=['Concat_GRP_Name'], how='inner')\
    .drop(['Concat_GRP_Name'], axis = 1)\
    .rename({'sum':'item2_sum'}, axis=1)\
    .rename({'count':'item2_count'}, axis=1)
merged3 = merged2.merge(df_sum[['sum', 'count', 'Concat_GRP_Name']], left_on=['Item3'], right_on=['Concat_GRP_Name'], how='inner')\
    .drop(['Concat_GRP_Name'], axis = 1)\
    .rename({'sum':'item3_sum'}, axis=1)\
    .rename({'count':'item3_count'}, axis=1)
#get the sum of all of the item_sum cols
merged3['sums']= merged3[['item3_sum', 'item2_sum', 'item1_sum']].sum(axis = 1)
#get sum of all the item_count cols
merged3['counts']= merged3[['item3_count', 'item2_count', 'item1_count']].sum(axis = 1)
#find the average
merged3['LOOPED_AVG'] = merged3['sums'] / merged3['counts']
#remove irrelevant fields
merged3 = merged3.drop(['item3_count', 'item2_count', 'item1_count', 'item3_sum', 'item2_sum', 'item1_sum', 'counts', 'sums' ], axis = 1)
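If the number of main groups is not fixed at three, the same result can be sketched with itertools.product, which picks one subgroup per main group directly instead of filtering permutations. This is a hedged alternative rather than a rewrite of the answer above. The names stats, names_by_main, rows and result are illustrative.

```python
from itertools import product

import pandas as pd

ls = [
    ['A', 1, 'A1', 9], ['A', 1, 'A1', 6], ['A', 1, 'A1', 3],
    ['A', 2, 'A2', 7], ['A', 3, 'A3', 9],
    ['B', 1, 'B1', 7], ['B', 1, 'B1', 3],
    ['B', 2, 'B2', 7], ['B', 2, 'B2', 8],
    ['C', 1, 'C1', 9],
]
df = pd.DataFrame(ls, columns=["Main_Group", "Sub_Group", "Concat_GRP_Name", "X_1"])

# Sum and count of X_1 per concatenated group name.
stats = df.groupby('Concat_GRP_Name')['X_1'].agg(['sum', 'count'])

# For each main group, the sub-group names that belong to it.
names_by_main = df.groupby('Main_Group')['Concat_GRP_Name'].unique()

rows = []
for combo in product(*names_by_main):      # one sub-group from every main group
    picked = stats.loc[list(combo)]
    rows.append({'LOOP_ITEMS': ' '.join(combo),
                 # pooled mean over all rows in the chosen sub-groups
                 'LOOPED_AVG': picked['sum'].sum() / picked['count'].sum()})

result = pd.DataFrame(rows)
print(result)
```

Dividing the pooled sum by the pooled count gives the mean over all combined rows, which is what the desired table shows (for example 37 / 6 for A1 B1 C1).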

Python Pandas: returning more than one field value when applying function to a data frame row

I need to apply several functions to data frame rows. Arguments of these functions take values from two or more fields of a single row. For example:
import pandas as pd
d = {
    'a': [1,1,1,1],
    'b': [2,2,2,2],
    'c': [3,3,3,3],
    'd': [4,4,4,4]
}
df1 = pd.DataFrame(d)
def f1(x,y):
    return x + 2*y
def f2(x,y):
    return y + 2*x
df2 = pd.DataFrame()
df2['val1'] = df1.apply(lambda r: f1(r.a, r.b),1)
df2['val2'] = df1.apply(lambda r: f2(r.c, r.d),1)
When the functions are applied in turn, one after another, Pandas makes a separate pass over all data frame rows for each of them. In this example, Pandas iterates over the data frame twice. As a result I get:
In [10]: df2
Out[10]:
val1 val2
0 5 10
1 5 10
2 5 10
3 5 10
Is there any way to apply two or more functions like this in a single pass over the data frame? The application should then return values for more than one field in a row. This also covers the case of a single function returning values for more than one field. How can this be done?
You could fill them at the same time by combining your functions:
def f3(x,y,z,a):
    return x + 2*y, a + 2*z
df3 = pd.DataFrame()
df3['val1'], df3['val2'] = f3(df1.a, df1.b, df1.c, df1.d)
If your functions are linear or can be vectorized in some way, we can do many cool things.
t = pd.DataFrame(dict(val1=[1, 2, 0, 0], val2=[0, 0, 2, 1]), df1.columns)
df1.dot(t)
Or even quicker with
import numpy as np
pd.DataFrame(
    df1.values.dot(
        np.array([[1, 0], [2, 0], [0, 2], [0, 1]])
    ),
    df1.index,
    ['val1', 'val2']
)
Or you can define a new function to apply
def f3(r):
    return pd.Series(dict(val1=f1(r.a, r.b), val2=f2(r.c, r.d)))
df1.apply(f3, 1)
If you don't want to create new functions, you can use the one-liner below:
>>> df2 = df1.apply(lambda r: pd.Series({'val1': f1(r.a, r.b), 'val2': f2(r.c, r.d)}), axis=1)
>>> df2
val1 val2
0 5 10
1 5 10
2 5 10
3 5 10
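Another option is to return a plain tuple from the row function and let apply split it into columns with result_type='expand'. Below is a minimal sketch using the f1 and f2 from the question. The columns come back labelled 0 and 1, so they are renamed afterwards.

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 1, 1, 1],
                    'b': [2, 2, 2, 2],
                    'c': [3, 3, 3, 3],
                    'd': [4, 4, 4, 4]})

def f1(x, y):
    return x + 2*y

def f2(x, y):
    return y + 2*x

# One pass over the rows; each returned tuple is expanded into two columns.
df2 = df1.apply(lambda r: (f1(r.a, r.b), f2(r.c, r.d)),
                axis=1, result_type='expand')
df2.columns = ['val1', 'val2']
print(df2)
```

This avoids constructing a pd.Series for every row, though it is still a row-by-row apply and not a vectorized operation.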
