I am trying to apply the following function to each row in a dataframe. The dataframe looks as follows:
vote_1 vote_2 vote_3 vote_4
a a a b
b b a b
b a a b
I am trying to generate an additional column that sums the 'votes' of the other columns and produces the winner, as follows:
vote_1 vote_2 vote_3 vote_4 winner_columns
a a a b a
b b a b b
b a a b draw
I have currently tried:
def winner(x):
    a = new_df.iloc[x].value_counts()['a']
    b = new_df.iloc[x].value_counts()['b']
    if a > b:
        y = 'a'
    elif a < b:
        y = 'b'
    else:
        y = 'draw'
    return y

df['winner_columns'].apply(winner)
However the whole column gets filled with draws. I assume it is something to do with the way I have built the function, but I can't figure out what.
You can use DataFrame.mode along axis=1 and count the non-missing values per row with DataFrame.count: if a row has only one mode, take the first column, otherwise it is a draw, selected with numpy.where:
import numpy as np

df1 = df.mode(axis=1)
print (df1)
0 1
0 a NaN
1 b NaN
2 a b
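The count of non-missing values per row is what separates a unique winner from a draw; the values below follow directly from the df1 output above:

print (df1.count(axis=1))
0    1
1    1
2    2
dtype: int64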
df['winner_columns'] = np.where(df1.count(axis=1).eq(1), df1[0], 'draw')
print (df)
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
Your solution is possible with a small change: pass each row to the function with apply(..., axis=1) and let it work on the row it receives, instead of indexing new_df:
def winner(x):
    s = x.value_counts()
    a = s['a']
    b = s['b']
    if a > b:
        y = 'a'
    elif a < b:
        y = 'b'
    else:
        y = 'draw'
    return y

df['winner_columns'] = df.apply(winner, axis=1)
print (df)
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
The first problem is that your DataFrame sometimes contains a letter followed by a dot. So to look for solely 'a' or 'b' you have to replace these dots with an empty string, something like:
df.replace(r'\.', '', regex=True)
Another problem, which didn't surface in your case, is that a row can contain only 'a' or only 'b', and your code should be resistant to the absence of a particular result in such a source row. To make your function resistant to such cases, change it to:
def winner(row):
    vc = row.value_counts()
    a = vc.get('a', 0)
    b = vc.get('b', 0)
    if a > b: return 'a'
    elif a < b: return 'b'
    else: return 'draw'
Then you can apply your function, but if you want to apply it to each
row (not column), you should pass axis=1.
So, to sum up, change your code to:
df['winner_columns'] = df.replace(r'\.', '', regex=True).apply(winner, axis=1)
The result, for your sample data, is:
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a. a. a. b a
1 b. b. a b b
2 b. a. a b draw
You can use .sum() to count the votes, save the winners in a list, and finally add the list to the dataframe as a new column.
import numpy as np

numpy_votes = dataframe_votes.to_numpy()
winner_columns = []

for i in numpy_votes:
    if np.sum(i == 'a') < np.sum(i == 'b'):
        winner_columns.append('b')
    elif np.sum(i == 'a') > np.sum(i == 'b'):
        winner_columns.append('a')
    else:
        winner_columns.append('draw')

dataframe_votes['winner_columns'] = winner_columns
Using the .sum() method is the fastest way to count elements inside arrays, according to this answer.
Output:
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
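If you want to skip the Python-level loop, the same tally can be done column-wise on the whole frame. This is a minimal sketch, assuming the cells hold only 'a'/'b' votes; vote_cols is a hypothetical list of the vote column names:

vote_cols = ['vote_1', 'vote_2', 'vote_3', 'vote_4']  # hypothetical column list
counts_a = (dataframe_votes[vote_cols] == 'a').sum(axis=1)  # 'a' votes per row
counts_b = (dataframe_votes[vote_cols] == 'b').sum(axis=1)  # 'b' votes per row
dataframe_votes['winner_columns'] = np.select(
    [counts_a > counts_b, counts_a < counts_b],  # conditions, checked in order
    ['a', 'b'],
    default='draw')  # tie when neither count is larger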
I have a dataframe of floats
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
1 0.433127 0.479051 0.159739 0.734577 0.113672
2 0.391228 0.516740 0.430628 0.586799 0.737838
3 0.956267 0.284201 0.648547 0.696216 0.292721
4 0.001490 0.973460 0.298401 0.313986 0.891711
5 0.585163 0.471310 0.773277 0.030346 0.706965
6 0.374244 0.090853 0.660500 0.931464 0.207191
7 0.630090 0.298163 0.741757 0.722165 0.218715
I can divide it into quantiles for a single column like so:
def groupby_quantiles(df, column, groups: int):
    quantiles = df[column].quantile(np.linspace(0, 1, groups + 1))
    bins = pd.cut(df[column], quantiles, include_lowest=True)
    return df.groupby(bins)
>>> df.pipe(groupby_quantiles, "a", 2).apply(lambda x: print(x))
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
2 0.391228 0.516740 0.430628 0.586799 0.737838
4 0.001490 0.973460 0.298401 0.313986 0.891711
6 0.374244 0.090853 0.660500 0.931464 0.207191
a b c d e
1 0.433127 0.479051 0.159739 0.734577 0.113672
3 0.956267 0.284201 0.648547 0.696216 0.292721
5 0.585163 0.471310 0.773277 0.030346 0.706965
7 0.630090 0.298163 0.741757 0.722165 0.218715
Now, I want to repeat the same operation on each of the groups for the next column. The code becomes ridiculous
>>> (
    df
    .pipe(groupby_quantiles, "a", 2)
    .apply(
        lambda df_group: (
            df_group
            .pipe(groupby_quantiles, "b", 2)
            .apply(lambda x: print(x))
        )
    )
)
a b c d e
0 0.085649 0.236811 0.801274 0.582162 0.094129
6 0.374244 0.090853 0.660500 0.931464 0.207191
a b c d e
2 0.391228 0.51674 0.430628 0.586799 0.737838
4 0.001490 0.97346 0.298401 0.313986 0.891711
a b c d e
3 0.956267 0.284201 0.648547 0.696216 0.292721
7 0.630090 0.298163 0.741757 0.722165 0.218715
a b c d e
1 0.433127 0.479051 0.159739 0.734577 0.113672
5 0.585163 0.471310 0.773277 0.030346 0.706965
My goal is to repeat this operation for as many columns as I want, then aggregate the groups at the end. Here's what the final function could look like, along with the desired result, assuming we aggregate with the mean.
>>> groupby_quantiles(df, columns=["a", "b"], groups=[2, 2], agg="mean")
a b c d e
0 0.229947 0.163832 0.730887 0.756813 0.150660
1 0.196359 0.745100 0.364515 0.450392 0.814774
2 0.793179 0.291182 0.695152 0.709190 0.255718
3 0.509145 0.475180 0.466508 0.382462 0.410319
Any ideas on how to achieve this?
Here is a way. First, the quantile-then-cut step can be rewritten with qcut. Then a recursive operation, similar to this, handles the nesting.
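For reference, a sketch of that first simplification; the two bin assignments below are equivalent up to bin labels:

# quantile + cut, as in the question's helper
bins_manual = pd.cut(df["a"], df["a"].quantile(np.linspace(0, 1, 3)), include_lowest=True)
# the same 2 quantile-based bins in one call
bins_qcut = pd.qcut(df["a"], 2)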
def groupby_quantiles(df, cols, grs, agg_func):
    # to store all the results
    _dfs = []

    # recursive function
    def recurse(_df, depth):
        col = cols[depth]
        gr = grs[depth]
        # iterate over the groups per quantile
        for _, _dfgr in _df.groupby(pd.qcut(_df[col], gr)):
            if depth != -1: recurse(_dfgr, depth+1)  # recurse if not at the last column
            else: _dfs.append(_dfgr.agg(agg_func))  # else perform the aggregation

    # using negative depth makes it easier to access the right column and group count
    depth = -len(cols)
    recurse(df, depth)  # start the recursion
    return pd.concat(_dfs, axis=1).T  # concat the results and transpose

print(groupby_quantiles(df, cols=['a','b'], grs=[2,2], agg_func='mean'))
# a b c d e
# 0 0.229946 0.163832 0.730887 0.756813 0.150660
# 1 0.196359 0.745100 0.364515 0.450392 0.814774
# 2 0.793179 0.291182 0.695152 0.709190 0.255718
# 3 0.509145 0.475181 0.466508 0.382462 0.410318
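The recursion can also be unrolled into a loop that keeps a flat list of sub-frames and re-splits it once per column. A minimal sketch of that variant, under the same assumptions as above (groupby_quantiles_iter is a hypothetical name):

def groupby_quantiles_iter(df, cols, grs, agg_func):
    groups = [df]
    for col, g in zip(cols, grs):
        # split every current group into g quantile bins of the next column
        groups = [sub for grp in groups
                  for _, sub in grp.groupby(pd.qcut(grp[col], g))]
    # aggregate each leaf group, then stack the aggregates as rows
    return pd.concat([grp.agg(agg_func) for grp in groups], axis=1).T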
I have a dataframe like this:
prefix input_text target_text score
X V A 1
X V B 2
X W C 1
X W B 3
I want to group them by some columns and concatenate the target_text column, while also getting the maximum of score in each group and identifying the target_text with the highest score, like this:
prefix input_text target_text score top
X V A, B 2 B
X W C, B 3 B
This is my code, which does the concatenation; however, I just don't know how to do the rest.
df['target_text'] = df[['prefix', 'target_text','input_text']].groupby(['input_text','prefix'])['target_text'].transform(lambda x: '<br />'.join(x))
df = df.drop_duplicates(subset=['prefix','input_text','target_text'])
In the concatenation I use HTML to join them; if I could bold the target with the highest score, that would be nice.
Let us try sorting by score descending so that drop_duplicates keeps the top-scoring row per group, then merging the joined target_text back in:
df.sort_values('score',ascending=False).\
drop_duplicates(['prefix','input_text']).\
rename(columns={'target_text':'top'}).\
merge(df.groupby(['prefix','input_text'],as_index=False)['target_text'].agg(','.join))
Out[259]:
prefix input_text top score target_text
0 X W B 3 C,B
1 X V B 2 A,B
groupby agg would be useful here:
new_df = (
    df.groupby(['prefix', 'input_text'], as_index=False).agg(
        target_text=('target_text', ', '.join),
        score=('score', 'max'),
        top=('score', 'idxmax')
    )
)
new_df['top'] = df.loc[new_df['top'], 'target_text'].values
new_df:
prefix input_text target_text score top
0 X V A, B 2 B
1 X W C, B 3 B
Aggregations are as follows:
target_text is joined together using ', '.join.
score is aggregated to keep only the max value with 'max'.
top is the idxmax of the score column.
new_df = (
    df.groupby(['prefix', 'input_text'], as_index=False).agg(
        target_text=('target_text', ', '.join),
        score=('score', 'max'),
        top=('score', 'idxmax')
    )
)
prefix input_text target_text score top
0 X V A, B 2 1
1 X W C, B 3 3
The values in top are the corresponding indexes from df:
prefix input_text target_text score
0 X V A 1
1 X V B 2 # index 1
2 X W C 1
3 X W B 3 # index 3
These values need to be "looked up" from df:
df.loc[new_df['top'], 'target_text']
1 B
3 B
Name: target_text, dtype: object
And assigned back to new_df; .values is needed to break the index alignment.
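A quick sketch of why .values matters; the two assignments below are alternatives, not meant to run in sequence:

# with alignment (wrong here): new_df has index [0, 1], the lookup has index [1, 3]
new_df['top'] = df.loc[new_df['top'], 'target_text']         # -> NaN, NaN
# with .values: a plain array is assigned positionally
new_df['top'] = df.loc[new_df['top'], 'target_text'].values  # -> 'B', 'B'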
try via sort_values(), groupby() and agg():
out = (df.sort_values('score')
         .groupby(['prefix', 'input_text'], as_index=False)
         .agg(target_text=('target_text', ', '.join),
              score=('score', 'max'),
              top=('target_text', 'last')))
output of out:
input_text prefix score target_text top
0 V X 2 A, B B
1 W X 3 C, B B
Explanation:
We sort the values of 'score', then group by the 'input_text' and 'prefix' columns and aggregate as follows:
we join the values of 'target_text' with ', '
we keep only the max value of the 'score' column, because we aggregate with 'max'
we take the last value of the 'target_text' column; since we sorted beforehand, the last value in each group is the one with the highest score
Update:
If you have many more columns to include, you can aggregate them as well if they are not too many; otherwise:

newdf = df.sort_values('score', ascending=False).drop_duplicates(['prefix', 'input_text'], ignore_index=True)
# Finally join them
out = out.join(newdf[columns_that_you_want])
# For example:
# out = out.join(newdf[['target_first', 'target_last']])
I have a DataFrame df:
A B
a 2 2
b 3 1
c 1 3
I want to create a new column based on the following criteria:
if row A == B: 0
if row A > B: 1
if row A < B: -1
so given the above table, it should be:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
For typical if/else cases I use np.where(df.A > df.B, 1, -1). Does pandas provide a special syntax for solving my problem in one step (without the necessity of creating 3 new columns and then combining the result)?
To formalize some of the approaches laid out above:
Create a function that operates on the rows of your dataframe like so:
def f(row):
    if row['A'] == row['B']:
        val = 0
    elif row['A'] > row['B']:
        val = 1
    else:
        val = -1
    return val
Then apply it to your dataframe passing in the axis=1 option:
In [1]: df['C'] = df.apply(f, axis=1)
In [2]: df
Out[2]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
Of course, this is not vectorized, so performance may not be as good when scaled to a large number of records. Still, I think it is much more readable, especially coming from a SAS background.
Edit
Here is the vectorized version
df['C'] = np.where(
    df['A'] == df['B'], 0, np.where(
        df['A'] > df['B'], 1, -1))
df.loc[df['A'] == df['B'], 'C'] = 0
df.loc[df['A'] > df['B'], 'C'] = 1
df.loc[df['A'] < df['B'], 'C'] = -1
Easy to solve using indexing. The first line of code reads like so: if column A is equal to column B, then create column C and set it equal to 0.
For this particular relationship, you could use np.sign:
>>> df["C"] = np.sign(df.A - df.B)
>>> df
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
When you have multiple if conditions, numpy.select is the way to go:
In [4102]: import numpy as np
In [4098]: conditions = [df.A.eq(df.B), df.A.gt(df.B), df.A.lt(df.B)]
In [4096]: choices = [0, 1, -1]
In [4100]: df['C'] = np.select(conditions, choices)
In [4101]: df
Out[4101]:
A B C
a 2 2 0
b 3 1 1
c 1 3 -1
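np.select also accepts a default for rows that match no condition, so the equality branch can be dropped. A small sketch producing the same column:

df['C'] = np.select([df.A.gt(df.B), df.A.lt(df.B)], [1, -1], default=0)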
Let's say the above is your original dataframe and you want to add a new column 'elderly'.
If age is greater than or equal to 50, we consider the person elderly ('yes'), otherwise not ('no').
step 1: Get the indexes of rows whose age is greater than or equal to 50
row_indexes=df[df['age']>=50].index
step 2:
Using .loc we can assign a new value to column
df.loc[row_indexes,'elderly']="yes"
The same for ages below 50:
row_indexes=df[df['age']<50].index
df.loc[row_indexes,'elderly']="no"
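Put together as a minimal runnable sketch, with a hypothetical age column:

import pandas as pd

df = pd.DataFrame({'age': [34, 67, 50, 12]})  # hypothetical data
row_indexes = df[df['age'] >= 50].index
df.loc[row_indexes, 'elderly'] = "yes"
row_indexes = df[df['age'] < 50].index
df.loc[row_indexes, 'elderly'] = "no"
print(df)
#    age elderly
# 0   34      no
# 1   67     yes
# 2   50     yes
# 3   12      no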
You can use the method mask:
df['C'] = np.nan
df['C'] = df['C'].mask(df.A == df.B, 0).mask(df.A > df.B, 1).mask(df.A < df.B, -1)
I have the following Pandas dataframe:
A B C
A A Test1
A A Test2
A A XYZ
A B BA
A B AB
B A AA
I want to group this dataset twice: first by A and B to concatenate the groups within C, and afterwards only on A to get the groups defined solely by column A. The result looks like this:
A A Test1,Test2,XYZ
A B AB, BA
B A AA
And the final result should be:
A A,A:(Test1,Test2,XYZ), A,B:(AB, BA)
B B,A:(AA)
Concatenating itself works; however, the sorting does not seem to work.
Can anyone help me with this problem?
Kind regards.
Using groupby + join
s1=df.groupby(['A','B']).C.apply(','.join)
s1
Out[421]:
A B
A A Test1,Test2,XYZ
B BA,AB
B A AA
Name: C, dtype: object
s1.reset_index().groupby('A').apply(lambda x : x.set_index(['A','B'])['C'].to_dict())
Out[420]:
A
A {('A', 'A'): 'Test1,Test2,XYZ', ('A', 'B'): 'B...
B {('B', 'A'): 'AA'}
dtype: object
First sort_values by all 3 columns, then groupby with join, then join the A and B columns, and finally groupby to build a dictionary per group:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(','.join).reset_index()
#if only 3 columns DataFrame
#df1 = df.sort_values().groupby(['A','B'])['C'].apply(','.join).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A Test1,Test2,XYZ A,A
1 A B AB,BA A,B
2 B A AA B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': 'Test1,Test2,XYZ', 'A,B': 'AB,BA'}
1 B {'B,A': 'AA'}
If need tuples only change first part of code:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(tuple).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A (Test1, Test2, XYZ) A,A
1 A B (AB, BA) A,B
2 B A (AA,) B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': ('Test1', 'Test2', 'XYZ'), 'A,B': ('AB...
1 B {'B,A': ('AA',)}
I have a pandas dataframe that looks something like the following:
>>> df = pd.DataFrame([["B","X"],["C","Y"],["D","X"]])
>>> df.columns = ["A","B"]
>>> df
A B
0 B X
1 C Y
2 D X
How can I apply a method to change the values of column A only if the value in column B is "X"? The desired outcome for example might be:
>>> df
A B
0 Bx X
1 C Y
2 Dx X
I thought of combining the two columns (df['C'] = df['A'] + df['B']), but probably there's a better way to perform such a simple operation.
One approach is using loc:
df.loc[df.B == 'X', 'A']+='x'
A B
0 Bx X
1 C Y
2 Dx X
EDIT: Based on the question in the comment, is this what you are looking for?
df.loc[df.B == 'X', 'A'] = df.A.str.lower()+'x'
A B
0 bx X
1 C Y
2 dx X
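The same conditional update can also be written without loc, rebuilding the whole column with a single numpy.where; a sketch matching the first desired output, assuming numpy is imported as np:

df['A'] = np.where(df.B == 'X', df.A + 'x', df.A)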