Apply function rowwise to pandas dataframe while referencing a column - python

I have a pandas dataframe like this:
df = pd.DataFrame({'A': [2, 3], 'B': [1, 2], 'C': [0, 1], 'D': [1, 0], 'total': [4, 6]})
A B C D total
0 2 1 0 1 4
1 3 2 1 0 6
I'm trying to perform a rowwise calculation and create a new column with the result. The calculation is to divide each column ABCD by the total, square it, and sum it up rowwise. This should be the result (0 if total is 0):
A B C D total result
0 2 1 0 1 4 0.375
1 3 2 1 0 6 0.389
This is what I've tried so far, but it always returns 0:
df['result'] = df[['A', 'B', 'C', 'D']].apply(lambda x: ((x/df['total'])**2).sum(), axis=1)
I guess the problem is df['total'] in the lambda function, because if I replace this by a number it works fine. I don't know how to work around this though. Appreciate any suggestions.

A combination of div, pow and sum can solve this:
df["result"] = df.drop(columns="total").div(df.total, axis=0).pow(2).sum(axis=1)
df
A B C D total result
0 2 1 0 1 4 0.375000
1 3 2 1 0 6 0.388889

You could also do:
df['result'] = (df.loc[:, "A": 'D'].divide(df.total, axis=0) ** 2).sum(axis=1)
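As for why the original attempt always returns 0: with axis=1, each x is a row Series indexed by the column labels A to D, while df['total'] is indexed by the row labels. pandas aligns the two on their labels, finds no match, and produces only NaN, which sum() then skips. Below is a minimal sketch of that, plus one way to make the "0 if total is 0" rule explicit. The extra all-zero row and the names cols, misaligned and safe_total are assumptions added for illustration, not part of the question.

```python
import numpy as np
import pandas as pd

# Third row is an added example with total == 0
df = pd.DataFrame({'A': [2, 3, 0], 'B': [1, 2, 0], 'C': [0, 1, 0],
                   'D': [1, 0, 0], 'total': [4, 6, 0]})
cols = ['A', 'B', 'C', 'D']

# Why the original attempt gives 0: labels A..D never match the row labels,
# so the aligned division contains only NaN.
row = df.loc[0, cols]
misaligned = row / df['total']
print(misaligned.isna().all())  # True

# Replace a zero total by NaN so the division yields NaN rather than inf;
# sum() skips NaN by default, which leaves 0 for that row.
safe_total = df['total'].replace(0, np.nan)
df['result'] = df[cols].div(safe_total, axis=0).pow(2).sum(axis=1)
print(df)
```

Whether a zero total should really map to 0 (rather than NaN) depends on what the result is used for, so treat the replace step as optional.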

Related

Efficient way in Pandas to count occurrences of Series of values by row

I have a large dataframe for which I want to count, by row, the number of occurrences of a series of specific values (given by an external function). For reproducibility let's assume the following simplified dataframe:
data = {'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4], 'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]}
df = pd.DataFrame.from_dict(data)
df
A B C D E
0 3 4 1 1 4
1 2 3 2 1 4
2 1 2 3 2 4
3 0 1 4 2 4
How can I count the number of occurrences of specific values (given by a series with the same size) by row?
Again for simplicity, let's assume this value_series is given by the max of each row.
values_series = df.max(axis=1)
0 4
1 4
2 4
3 4
dtype: int64
The solution I got to seems not very pythonic (e.g. I'm using iterrows(), which is slow):
max_count = []
for index, row in df.iterrows():
    max_count.append(row.value_counts()[values_series.loc[index]])
df_counts = pd.Series(max_count)
Is there any more efficient way to do this?
We can compare the transposed df.T directly to the df.max series, thanks to broadcasting:
(df.T == df.max(axis=1)).sum()
# result
0 2
1 1
2 1
3 2
dtype: int64
(Transposing also has the added benefit that we can use sum without specifying the axis, i.e. with the default axis=0.)
You can try
df.eq(df.max(1),axis=0).sum(1)
Out[361]:
0 2
1 1
2 1
3 2
dtype: int64
The perfect job for numpy broadcasting:
a = df.to_numpy()
b = values_series.to_numpy()[:, None]
(a == b).sum(axis=1)
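Since the question says the values come from an external function, here is a small sketch of the same broadcasting idea with arbitrary per-row targets instead of the row maximum. The targets series is a made-up stand-in for that function's output, and the names mask and counts are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': [3, 2, 1, 0], 'B': [4, 3, 2, 1], 'C': [1, 2, 3, 4],
                   'D': [1, 1, 2, 2], 'E': [4, 4, 4, 4]})

# Hypothetical target values, one per row (stand-in for the external function)
targets = pd.Series([1, 2, 4, 7], index=df.index)

# Shape (n_rows, n_cols) compared with shape (n_rows, 1):
# each target is broadcast across its own row.
mask = df.to_numpy() == targets.to_numpy()[:, None]
counts = pd.Series(mask.sum(axis=1), index=df.index)
print(counts.tolist())  # [2, 2, 1, 0]
```

Wrapping the result in a Series keeps the original row labels, which the plain NumPy result would drop.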

Add another column based on the value of two columns

I am trying to add another column based on the value of two columns. Here is the mini version of my dataframe.
data = {'current_pair': ['"["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]"', '"["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]"', '"["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]"','"["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]"', '"["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]"'],
'B': [1, 0, 1, 1, 0]
}
df = pd.DataFrame(data)
df
current_pair B
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0
I want the result to be:
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
I used the numpy select commands:
conditions=[(df['B']==1 & df['current_pair'].str.contains('Emo/', na=False)),
(df['B']==1 & df['current_pair'].str.contains('Neu/', na=False)),
df['B']==0]
choices = [0, 1, 2]
df['C'] = np.select(conditions, choices, default=np.nan)
Unfortunately, it gives me this dataframe without recognizing anything with "1" in column "C".
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 0
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 0
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
Any help counts! thanks a lot.
The problem is operator precedence: & binds more tightly than ==, so B==1 & mask is parsed as B == (1 & mask). Put the parentheses around the comparison itself:
conditions=[(df['B']==1) & df['current_pair'].str.contains('Emo/', na=False),
(df['B']==1) & df['current_pair'].str.contains('Neu/', na=False),
df['B']==0]
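To see the precedence issue in isolation, here is a minimal runnable sketch. The strings in current_pair are simplified versions of the ones in the question (an assumption to keep the example short), and is_emo / is_neu are illustrative names.

```python
import numpy as np
import pandas as pd

# Plain Python shows the parsing rule: & is evaluated before ==
print(1 == 1 & 0)    # False, parsed as 1 == (1 & 0)
print((1 == 1) & 0)  # 0

# Simplified strings following the same Neu/Emo pattern as the question
df = pd.DataFrame({'current_pair': ['StimusNeu/2357.jpg', 'StimusEmo/6350.jpg',
                                    'StimusEmo/3215.jpg', 'StimusNeu/7020.jpg',
                                    'StimusNeu/7080.jpg'],
                   'B': [1, 0, 1, 1, 0]})

is_emo = df['current_pair'].str.contains('Emo/', na=False)
is_neu = df['current_pair'].str.contains('Neu/', na=False)

# Each comparison gets its own parentheses before being combined with &
conditions = [(df['B'] == 1) & is_emo, (df['B'] == 1) & is_neu, df['B'] == 0]
df['C'] = np.select(conditions, [0, 1, 2], default=np.nan)
print(df['C'].tolist())  # [1.0, 2.0, 0.0, 1.0, 2.0]
```

Naming the masks first also makes each condition short enough that a missing parenthesis is easier to spot.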
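Alternatively, test B == 0 first. np.select uses the first condition that matches, so the two string conditions then only decide rows where B is 1: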
df.assign(C=np.select([df.B==0, df.current_pair.str.contains('Emo/'), df.current_pair.str.contains('Neu/')], [2,0,1]))
Here is a slightly more general suggestion that is easy to adapt to more complex cases. Mind the execution speed, though: apply with axis=1 calls the Python function once per row.
import pandas as pd
df = pd.DataFrame({'col_1': ['Abc', 'Xcd', 'Afs', 'Xtf', 'Aky'], 'col_2': [1, 2, 3, 4, 5]})
def someLogic(col_1, col_2):
    if 'A' in col_1 and col_2 == 1:
        return 111
    elif "X" in col_1 and col_2 == 4:
        return 999
    return 888
df['NewCol'] = df.apply(lambda row: someLogic(row.col_1, row.col_2), axis=1, result_type="expand")
print(df)

Counting and printing zeroes and negative values for each Column in a Dataframe

I'd like to print a statement showing me both zero and negative value counts in each Dataframe column.
My input would be something like:
import pandas as pd
df = pd.DataFrame({'a': [-3, -2, 0], 'b': [-2, 2, 5], 'c': [-1, 0, 7], 'd': [1, 4, 8]})
Which prints:
a b c d
0 -3 -2 -1 1
1 -2 2 0 4
2 0 5 7 8
The outputs I desire are:
Negatives Found:
a 2
b 1
c 1
d 0
Zeros Found:
a 1
b 0
c 1
d 0
I can't find an easy way to get to this without creating a Dataframe from the Dataframe using something like:
df_neg = df < 0
df_zero = df == 0
However, this only gives True or False for each cell rather than a count.
What's the best way of doing a count that is printable and 'easy' to run on bigger data sets?
This is somewhat what you tried:
Negatives:
(df<0).sum()
Zeros:
(df==0).sum()
If this isn't good for you, and you really don't want to generate a boolean mask and count it (though I'm not sure why that would bother you), let me know; you can get the same results with loops.
You could use where and count
df.where(condition).count()
df = pd.DataFrame({'a': [-3, -2, 0], 'b': [-2, 2, 5], 'c': [-1, 0, 7], 'd': [1, 4, 8]})
print('Negatives Found:')
print(df.where(df < 0).count())
print('Zeros Found:')
print(df.where(df == 0).count())
This prints
Negatives Found:
a 2
b 1
c 1
d 0
Zeros Found:
a 1
b 0
c 1
d 0
You can simply:
print(df[df<0].count())
print(df[df==0].count())
a 2
b 1
c 1
d 0
dtype: int64
a 1
b 0
c 1
d 0
dtype: int64
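If a single printable table is preferred over two separate outputs, the two boolean sums can be placed side by side. This is only a sketch; the name summary and the column labels negatives and zeros are my own choices.

```python
import pandas as pd

df = pd.DataFrame({'a': [-3, -2, 0], 'b': [-2, 2, 5], 'c': [-1, 0, 7], 'd': [1, 4, 8]})

# One row per column of df, one column per kind of count.
# Summing a boolean mask counts its True values.
summary = pd.DataFrame({'negatives': (df < 0).sum(), 'zeros': (df == 0).sum()})
print(summary)
```

Each mask is computed in one vectorized pass, so this should scale to larger data sets about as well as the individual (df < 0).sum() calls.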

Including missing combinations of values in a pandas groupby aggregation

Problem
Including all possible values or combinations of values in the output of a pandas groupby aggregation.
Example
Example pandas DataFrame has three columns, User, Code, and Subtotal:
import pandas as pd
example_df = pd.DataFrame([['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1], ['c', 1, 1], ['c', 1, 1]], columns=['User', 'Code', 'Subtotal'])
I'd like to group on User and Code and get a subtotal for each combination of User and Code.
print(example_df.groupby(['User', 'Code']).Subtotal.sum().reset_index())
The output I get is:
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
How can I include the missing combination User=='c' and Code==2 in the table, even though it doesn't exist in example_df?
Preferred output
Below is the preferred output, with a zero line for the User=='c' and Code==2 combination.
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
You can use unstack with stack:
print(example_df.groupby(['User', 'Code']).Subtotal.sum()
.unstack(fill_value=0)
.stack()
.reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
Another solution with reindex by MultiIndex created from_product:
df = example_df.groupby(['User', 'Code']).Subtotal.sum()
mux = pd.MultiIndex.from_product(df.index.levels, names=['User','Code'])
print (mux)
MultiIndex(levels=[['a', 'b', 'c'], [1, 2]],
labels=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]],
names=['User', 'Code'])
print (df.reindex(mux, fill_value=0).reset_index(name='Subtotal'))
User Code Subtotal
0 a 1 1
1 a 2 1
2 b 1 1
3 b 2 1
4 c 1 2
5 c 2 0
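Note that both approaches only fill in combinations of values that occur somewhere in the data. If a value never appears at all, one option is to build the index from explicit lists. In the sketch below, Code 3 is a hypothetical valid code that is absent from example_df, and users, codes, full_index and totals are illustrative names.

```python
import pandas as pd

example_df = pd.DataFrame([['a', 1, 1], ['a', 2, 1], ['b', 1, 1], ['b', 2, 1],
                           ['c', 1, 1], ['c', 1, 1]],
                          columns=['User', 'Code', 'Subtotal'])

# Hypothetical full set of valid values; Code 3 never occurs in the data
users = ['a', 'b', 'c']
codes = [1, 2, 3]
full_index = pd.MultiIndex.from_product([users, codes], names=['User', 'Code'])

# Reindex the grouped sums onto every User/Code pair, filling gaps with 0
totals = (example_df.groupby(['User', 'Code'])['Subtotal'].sum()
          .reindex(full_index, fill_value=0)
          .reset_index())
print(totals)
```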

Elegant way to perform column-wise operations on two dataframe

I need to apply an operation to every pair of columns in a dataframe. I came up with a naive solution but am wondering whether a more elegant way is available.
The following script counts, for each pair of columns, the number of rows that have a one in both columns.
input:
a b c d
0 0 0 1 0
1 1 1 0 1
2 1 1 1 0
Output:
2 2 1 1
2 2 1 1
1 1 2 0
1 1 0 1
Code:
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.randint(0, high=2, size=(3, 4)), columns=['a', 'b', 'c', 'd'])
mycolumns = df.columns
for i in range(df.shape[1]):
    for j in range(df.shape[1]):
        # rows where both column i and column j are 1
        print(sum(df[mycolumns[i]] & df[mycolumns[j]]))
That is basically matrix multiplication of X' and X where X' is transpose of X:
>>> xs = df.values
>>> xs.T.dot(xs)
array([[2, 2, 1, 1],
       [2, 2, 1, 1],
       [1, 1, 2, 0],
       [1, 1, 0, 1]])
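If the column labels should be kept on the result, the same product can be taken on the DataFrame itself. A minimal sketch using the input shown in the question (the name co_counts is illustrative):

```python
import pandas as pd

# The 3x4 input from the question
df = pd.DataFrame({'a': [0, 1, 1], 'b': [0, 1, 1], 'c': [1, 0, 1], 'd': [0, 1, 0]})

# Entry (i, j) is the sum over rows of df[i] * df[j]; for 0/1 data that is
# the number of rows where both columns are 1.
co_counts = df.T.dot(df)
print(co_counts)
```

This relies on the data being 0/1 integers; for other pairwise operations the product no longer gives a count.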
